OpenAI

Research Infrastructure Engineer, Training Systems

San Francisco (Remote), full-time, mid-level

About this role

Join OpenAI's research infrastructure team to engineer the systems layer powering frontier model training. You'll design APIs and infrastructure that translate novel research ideas into scalable, reliable training workloads while optimizing performance across distributed systems and GPUs.

What you'll do

  • Build and maintain infrastructure for large-scale model training and experimentation
  • Design APIs and interfaces that simplify complex training workflows
  • Improve reliability, debuggability, and performance of training and data pipelines
  • Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage
  • Write tests, benchmarks, and diagnostics to catch meaningful regressions

What they're looking for

  • Systems engineering and performance optimization
  • Python and PyTorch
  • Distributed systems architecture
  • GPU programming and CUDA
  • API and interface design
  • Debugging and profiling tools
  • Data pipeline engineering
  • Testing and benchmarking