Skip to main content

openai

Site Reliability Engineer, Frontier Systems Infrastructure

San Franciscofulltimemid

About this role

OpenAI seeks a Site Reliability Engineer to build and operate hyperscale supercomputer clusters that power frontier AI model training. You'll manage massive Kubernetes deployments, automate bare-metal infrastructure, and ensure reliability across distributed data centers.

What you'll do

  • Scale and provision large Kubernetes clusters with automation for lifecycle management
  • Develop software abstractions that unify multiple clusters for training workloads
  • Manage bare-metal node bring-up, firmware upgrades, and deployment automation
  • Optimize operational metrics like cluster restart times and upgrade cycles
  • Build monitoring and observability systems for early issue detection
  • Integrate networking and hardware health systems for end-to-end infrastructure reliability

What they're looking for

  • Kubernetes cluster operations and scaling at hyperscale
  • Python, Go, or similar programming/scripting languages
  • Infrastructure-as-Code tools (Terraform, CloudFormation)
  • Linux, bare-metal systems, and GPU hardware
  • Large-scale networking and cloud infrastructure
  • Distributed systems engineering
  • Automation and DevOps practices
  • High-availability system design
Apply on the employer's site

Opens the official application on the employer’s site. No login required.