openai

Training Performance Engineer

San Francisco (Remote)fulltimemid

About this role

OpenAI seeks a Training Performance Engineer to optimize distributed machine learning training systems. You'll profile large-scale training runs, identify bottlenecks, and implement efficiency improvements across compute, communication, and storage infrastructure to maximize cluster utilization.

What you'll do

Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage
Optimize GPU utilization and throughput for large-scale distributed model training
Collaborate with runtime and systems engineers to improve kernel efficiency and collective communication performance
Implement model graph transforms to improve end-to-end throughput
Build tooling to monitor and visualize MFU, throughput, and uptime metrics across clusters
Partner with researchers to ensure new model architectures scale efficiently during pre-training

What they're looking for

Python and C++ programming
Distributed training on multi-GPU systems or HPC clusters
GPU performance profiling and optimization
PyTorch, JAX, or TensorFlow frameworks
Debugging complex distributed systems
CUDA (preferred)
NCCL, MPI, or UCX communication libraries (preferred)
Data loading and checkpointing systems (preferred)

Benefits

Hybrid work model (3 days in office per week)
San Francisco location
Relocation assistance provided

Apply on the employer's site →

Opens the official application on the employer’s site. No login required.