openai

Software Engineer, RL Training Infra

San Francisco (Remote)fulltimemid

About this role

Join OpenAI's Post-Training Frontiers team to keep large-scale reinforcement learning training runs operational and efficient. You'll solve critical engineering and infrastructure challenges across training systems, inference, and distributed infrastructure while supporting the development of frontier AI agents shipped in products like ChatGPT and the API.

What you'll do

Debug and resolve urgent engineering and infrastructure issues blocking RL training runs
Troubleshoot failures across training systems, inference, orchestration, scaling, and distributed infrastructure
Improve reliability and efficiency of large-scale model training pipelines
Support researchers developing infrastructure-heavy capabilities like multi-agent systems and memory
Convert recurring operational problems into robust tools, systems, and processes
Collaborate with research and infrastructure teams during tight model training timelines

What they're looking for

ML infrastructure (training systems, RL, or inference)
Distributed systems debugging across GPUs and networking
Scaling and orchestration systems
Performance optimization and production infrastructure
Fast learning and cross-layer problem-solving
Deep debugging and root cause analysis
Strong ownership and communication
Experience with large-scale model training (nice to have)

Apply on the employer's site →

Opens the official application on the employer’s site. No login required.