Skip to main content

openai

Training Performance Engineer

San Francisco (Remote)fulltimemid

About this role

OpenAI seeks a Training Performance Engineer to optimize distributed machine learning training systems. You'll profile large-scale training runs, identify bottlenecks, and implement efficiency improvements across compute, communication, and storage infrastructure to maximize cluster utilization.

What you'll do

  • Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage
  • Optimize GPU utilization and throughput for large-scale distributed model training
  • Collaborate with runtime and systems engineers to improve kernel efficiency and collective communication performance
  • Implement model graph transforms to improve end-to-end throughput
  • Build tooling to monitor and visualize MFU, throughput, and uptime metrics across clusters
  • Partner with researchers to ensure new model architectures scale efficiently during pre-training

What they're looking for

  • Python and C++ programming
  • Distributed training on multi-GPU systems or HPC clusters
  • GPU performance profiling and optimization
  • PyTorch, JAX, or TensorFlow frameworks
  • Debugging complex distributed systems
  • CUDA (preferred)
  • NCCL, MPI, or UCX communication libraries (preferred)
  • Data loading and checkpointing systems (preferred)

Benefits

  • Hybrid work model (3 days in office per week)
  • San Francisco location
  • Relocation assistance provided
Apply on the employer's site

Opens the official application on the employer’s site. No login required.