OpenAI

Software Engineer, Fleet Infrastructure

San Francisco · Full-time · Mid-level

About this role

Join OpenAI's fleet infrastructure team to design and operate systems managing one of the world's largest GPU clusters for AI model training and deployment. You'll build scheduling systems, automate cluster operations, optimize model startup performance, and collaborate across teams to support research at massive scale.

What you'll do

Design and implement job scheduling, cluster management, and CI/CD systems for the GPU fleet
Build user-friendly scheduling and quota systems to maximize GPU utilization
Develop automation for Kubernetes cluster provisioning and upgrades
Optimize model startup times through snapshot delivery and hardware caching
Interface with researchers and product teams to understand workload requirements
Collaborate with hardware and infrastructure teams on high-reliability service delivery

What they're looking for

Strong programming abilities
Hyperscale compute systems experience
Kubernetes expertise
Experience with public cloud platforms (especially Azure)
Experience with cluster management and orchestration
Experience with job scheduling systems
Execution-focused mentality with user-centric approach
AI/ML workload understanding (bonus)

Benefits

Hybrid work model: 3 days in-office per week in San Francisco
Relocation assistance for new employees
Opportunity to shape critical infrastructure for AI advancement
Work on world-scale systems with high impact
Collaborative environment across research and engineering teams
