OpenAI

Research Infrastructure Engineer, Training Systems

San Francisco (Remote), full-time, mid-level

About this role

Join OpenAI's research infrastructure team to engineer the systems layer powering frontier model training. You'll design APIs and infrastructure that translate novel research ideas into scalable, reliable training workloads while optimizing performance across distributed systems and GPUs.

What you'll do

Build and maintain infrastructure for large-scale model training and experimentation
Design APIs and interfaces that simplify complex training workflows
Improve reliability, debuggability, and performance of training and data pipelines
Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage
Write tests, benchmarks, and diagnostics to catch meaningful regressions

What they're looking for

Systems engineering and performance optimization
Python and PyTorch
Distributed systems architecture
GPU programming and CUDA
API and interface design
Debugging and profiling tools
Data pipeline engineering
Testing and benchmarking
