OpenAI
Software Engineer, RL Training Infra
San Francisco (Remote), full-time, mid-level
About this role
Join OpenAI's Post-Training Frontiers team to keep large-scale reinforcement learning training runs operational and efficient. You'll solve critical engineering and infrastructure challenges across training systems, inference, and distributed infrastructure while supporting the development of frontier AI agents shipped in products like ChatGPT and the API.
What you'll do
- Debug and resolve urgent engineering and infrastructure issues blocking RL training runs
- Troubleshoot failures across training systems, inference, orchestration, scaling, and distributed infrastructure
- Improve reliability and efficiency of large-scale model training pipelines
- Support researchers developing infrastructure-heavy capabilities like multi-agent systems and memory
- Convert recurring operational problems into robust tools, systems, and processes
- Collaborate with research and infrastructure teams during tight model training timelines
What they're looking for
- ML infrastructure (training systems, RL, or inference)
- Distributed systems debugging across GPUs and networking
- Scaling and orchestration systems
- Performance optimization and production infrastructure
- Fast learning and cross-layer problem-solving
- Deep debugging and root cause analysis
- Strong ownership and communication
- Experience with large-scale model training (nice to have)