OpenAI
Software Engineer, Fleet Infrastructure
San Francisco, full-time, mid-level
About this role
Join OpenAI's fleet infrastructure team to design and operate systems managing one of the world's largest GPU clusters for AI model training and deployment. You'll build scheduling systems, automate cluster operations, optimize model startup performance, and collaborate across teams to support research at massive scale.
What you'll do
- Design and implement job scheduling, cluster management, and CI/CD systems for the GPU fleet
- Build user-friendly scheduling and quota systems to maximize GPU utilization
- Develop automation for Kubernetes cluster provisioning and upgrades
- Optimize model startup times through snapshot delivery and hardware caching
- Interface with researchers and product teams to understand workload requirements
- Collaborate with hardware and infrastructure teams on high-reliability service delivery
What they're looking for
- Strong programming abilities
- Hyperscale compute systems experience
- Kubernetes expertise
- Experience with public cloud platforms (especially Azure)
- Cluster management and orchestration
- Job scheduling systems
- Execution-focused mentality with user-centric approach
- AI/ML workload understanding (bonus)
Benefits
- Hybrid work model: 3 days in-office per week in San Francisco
- Relocation assistance for new employees
- Opportunity to shape critical infrastructure for AI advancement
- Work on world-scale systems with high impact
- Collaborative environment across research and engineering teams