Skip to main content

openai

Software Engineer, Frontier Clusters Infrastructure

San Franciscofulltimemid

About this role

OpenAI seeks a Software Engineer to build and operate world-class supercomputer clusters supporting frontier AI model training. This role combines distributed systems expertise with hands-on infrastructure work, requiring you to scale Kubernetes at massive scale, automate bare-metal deployments, and ensure reliability across multi-datacenter environments.

What you'll do

  • Design and scale large Kubernetes clusters with automation for provisioning, bootstrapping, and lifecycle management
  • Build software abstractions that unify multiple clusters and provide seamless interfaces for training workloads
  • Manage bare-metal node bring-up including firmware upgrades and repeatable deployment at scale
  • Optimize operational metrics like reducing restart times and accelerating upgrade cycles
  • Integrate networking and hardware health systems for end-to-end infrastructure reliability
  • Develop monitoring and observability systems to detect issues early and maintain stability under extreme load

What they're looking for

  • Kubernetes cluster operations and scaling in hyperscale environments
  • Python, Go, or similar programming languages
  • Infrastructure-as-Code tools (Terraform, CloudFormation)
  • Bare-metal Linux and GPU hardware management
  • Large-scale networking and data center operations
  • Distributed systems engineering
  • Cloud infrastructure (compute, networking, storage, security)
  • GPU workloads and high-performance computing (bonus)
Apply on the employer's site

Opens the official application on the employer’s site. No login required.