openai
Research Infrastructure Engineer, Training Systems
San Francisco (Remote)fulltimemid
About this role
Join OpenAI's research infrastructure team to engineer the systems layer powering frontier model training. You'll design APIs and infrastructure that translate novel research ideas into scalable, reliable training workloads while optimizing performance across distributed systems and GPUs.
What you'll do
- Build and maintain infrastructure for large-scale model training and experimentation
- Design APIs and interfaces that simplify complex training workflows
- Improve reliability, debuggability, and performance of training and data pipelines
- Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage
- Write tests, benchmarks, and diagnostics to catch meaningful regressions
What they're looking for
- Systems engineering and performance optimization
- Python and PyTorch
- Distributed systems architecture
- GPU programming and CUDA
- API and interface design
- Debugging and profiling tools
- Data pipeline engineering
- Testing and benchmarking
Opens the official application on the employer’s site. No login required.