OpenAI
Training Performance Engineer
San Francisco (Remote) · Full-time · Mid-level
About this role
OpenAI seeks a Training Performance Engineer to optimize distributed machine learning training systems. You'll profile large-scale training runs, identify bottlenecks, and implement efficiency improvements across compute, communication, and storage infrastructure to maximize cluster utilization.
What you'll do
- Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage
- Optimize GPU utilization and throughput for large-scale distributed model training
- Collaborate with runtime and systems engineers to improve kernel efficiency and collective communication performance
- Implement model graph transforms to improve end-to-end throughput
- Build tooling to monitor and visualize MFU, throughput, and uptime metrics across clusters
- Partner with researchers to ensure new model architectures scale efficiently during pre-training
What they're looking for
- Python and C++ programming
- Distributed training on multi-GPU systems or HPC clusters
- GPU performance profiling and optimization
- PyTorch, JAX, or TensorFlow frameworks
- Debugging complex distributed systems
- CUDA (preferred)
- NCCL, MPI, or UCX communication libraries (preferred)
- Data loading and checkpointing systems (preferred)
Benefits
- Hybrid work model (3 days in office per week)
- San Francisco location
- Relocation assistance provided