OpenAI

Software Engineer, Collective Communication

San Francisco · Full-time · Mid-level

About this role

Design and implement efficient collective communication operations in C++ and CUDA for OpenAI's supercomputer training infrastructure. Work closely with ML researchers to optimize networking performance for large-scale AI model training while leveraging custom hardware capabilities.

What you'll do

  • Design and implement collective operations in C++ and CUDA integrated with the training stack
  • Optimize large training jobs to make full use of the different network transports available in the supercomputers
  • Collaborate with ML researchers on efficient communication algorithms
  • Develop network simulations to inform future supercomputer designs
  • Write and optimize low-level performance-sensitive CPU and GPU code

What they're looking for

  • C++ and CUDA programming
  • Distributed algorithms and RDMA experience
  • Low-level performance optimization
  • Network simulation techniques
  • Collective communication systems
  • GPU programming
  • High-performance computing
  • Supercomputer networking

Benefits

  • Hybrid work model (3 days/week in office)
  • San Francisco, CA location
  • Relocation assistance available