OpenAI

Software Engineer, Collective Communication

San Francisco · Full-time · Mid-level

About this role

Design and implement efficient collective communication operations in C++ and CUDA for OpenAI's supercomputer training infrastructure. Work closely with ML researchers to optimize networking performance for large-scale AI model training while leveraging custom hardware capabilities.

What you'll do

Design and implement collective operations in C++ and CUDA integrated with the training stack
Optimize large training jobs to fully utilize different network transports in supercomputers
Collaborate with ML researchers on efficient communication algorithms
Develop network simulations to inform future supercomputer designs
Write and optimize low-level performance-sensitive CPU and GPU code

What they're looking for

C++ and CUDA programming
Distributed algorithms and RDMA experience
Low-level performance optimization
Network simulation techniques
Collective communication systems
GPU programming
High-performance computing
Supercomputer networking

Benefits

Hybrid work model (3 days/week in office)
San Francisco, CA location
Relocation assistance available
