OpenAI
Software Engineer, Collective Communication
San Francisco | Full-time | Mid-level
About this role
Design and implement efficient collective communication operations in C++ and CUDA for OpenAI's supercomputer training infrastructure. Work closely with ML researchers to optimize networking performance for large-scale AI model training while leveraging custom hardware capabilities.
What you'll do
- Design and implement collective operations in C++ and CUDA integrated with the training stack
- Optimize large training jobs to fully utilize different network transports in supercomputers
- Collaborate with ML researchers on efficient communication algorithms
- Develop network simulations to inform future supercomputer designs
- Write and optimize low-level performance-sensitive CPU and GPU code
What they're looking for
- C++ and CUDA programming
- Distributed algorithms and RDMA experience
- Low-level performance optimization
- Network simulation techniques
- Collective communication systems
- GPU programming
- High-performance computing
- Supercomputer networking
Benefits
- Hybrid work model (3 days/week in office)
- San Francisco, CA location
- Relocation assistance available