openai
Software Engineer, Frontier Clusters Infrastructure
San Franciscofulltimemid
About this role
OpenAI seeks a Software Engineer to build and operate world-class supercomputer clusters supporting frontier AI model training. This role combines distributed systems expertise with hands-on infrastructure work, requiring you to scale Kubernetes at massive scale, automate bare-metal deployments, and ensure reliability across multi-datacenter environments.
What you'll do
- Design and scale large Kubernetes clusters with automation for provisioning, bootstrapping, and lifecycle management
- Build software abstractions that unify multiple clusters and provide seamless interfaces for training workloads
- Manage bare-metal node bring-up including firmware upgrades and repeatable deployment at scale
- Optimize operational metrics like reducing restart times and accelerating upgrade cycles
- Integrate networking and hardware health systems for end-to-end infrastructure reliability
- Develop monitoring and observability systems to detect issues early and maintain stability under extreme load
What they're looking for
- Kubernetes cluster operations and scaling in hyperscale environments
- Python, Go, or similar programming languages
- Infrastructure-as-Code tools (Terraform, CloudFormation)
- Bare-metal Linux and GPU hardware management
- Large-scale networking and data center operations
- Distributed systems engineering
- Cloud infrastructure (compute, networking, storage, security)
- GPU workloads and high-performance computing (bonus)
Opens the official application on the employer’s site. No login required.