openai
Site Reliability Engineer, Frontier Systems Infrastructure
San Franciscofulltimemid
About this role
OpenAI seeks a Site Reliability Engineer to build and operate hyperscale supercomputer clusters that power frontier AI model training. You'll manage massive Kubernetes deployments, automate bare-metal infrastructure, and ensure reliability across distributed data centers.
What you'll do
- Scale and provision large Kubernetes clusters with automation for lifecycle management
- Develop software abstractions that unify multiple clusters for training workloads
- Manage bare-metal node bring-up, firmware upgrades, and deployment automation
- Optimize operational metrics like cluster restart times and upgrade cycles
- Build monitoring and observability systems for early issue detection
- Integrate networking and hardware health systems for end-to-end infrastructure reliability
What they're looking for
- Kubernetes cluster operations and scaling at hyperscale
- Python, Go, or similar programming/scripting languages
- Infrastructure-as-Code tools (Terraform, CloudFormation)
- Linux, bare-metal systems, and GPU hardware
- Large-scale networking and cloud infrastructure
- Distributed systems engineering
- Automation and DevOps practices
- High-availability system design
Opens the official application on the employer’s site. No login required.