openai

Site Reliability Engineer, Frontier Systems Infrastructure

San Franciscofulltimemid

About this role

OpenAI seeks a Site Reliability Engineer to build and operate hyperscale supercomputer clusters that power frontier AI model training. You'll manage massive Kubernetes deployments, automate bare-metal infrastructure, and ensure reliability across distributed data centers.

What you'll do

Scale and provision large Kubernetes clusters with automation for lifecycle management
Develop software abstractions that unify multiple clusters for training workloads
Manage bare-metal node bring-up, firmware upgrades, and deployment automation
Optimize operational metrics like cluster restart times and upgrade cycles
Build monitoring and observability systems for early issue detection
Integrate networking and hardware health systems for end-to-end infrastructure reliability

What they're looking for

Kubernetes cluster operations and scaling at hyperscale
Python, Go, or similar programming/scripting languages
Infrastructure-as-Code tools (Terraform, CloudFormation)
Linux, bare-metal systems, and GPU hardware
Large-scale networking and cloud infrastructure
Distributed systems engineering
Automation and DevOps practices
High-availability system design

Apply on the employer's site →

Opens the official application on the employer’s site. No login required.