Skip to main content

openai

Software Engineer, Hardware Health

San Franciscofulltimemid

About this role

Join OpenAI's Hardware Health and Observability team to build infrastructure that monitors and maintains the health of millions of GPUs across global compute clusters. You'll own end-to-end systems for detecting hardware failures, automating remediation, and ensuring reliable operation of frontier model training and inference workloads at hyperscale.

What you'll do

  • Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure
  • Build health checks that detect, remediate, and verify failures at scale with minimal latency
  • Investigate hardware failures and system-level issues in large-scale compute environments
  • Own node lifecycle workflows including drain, quarantine, repair, and return-to-service processes
  • Build automation and tooling for global cluster management with minimal manual intervention
  • Partner with workload and reliability teams to integrate health signals into production systems

What they're looking for

  • Python and shell scripting
  • Large-scale distributed systems design
  • Systems debugging and operational troubleshooting
  • SQL or PromQL for data analysis
  • Linux and low-level hardware systems (PCIe, InfiniBand, networking)
  • GPU or accelerator cluster operations
  • Fleet lifecycle management and automated remediation
  • Network operations and observability
Apply on the employer's site

Opens the official application on the employer’s site. No login required.