OpenAI

Software Engineer, Hardware Health

San Francisco, full-time, mid-level

About this role

Join OpenAI's Hardware Health and Observability team to build infrastructure that monitors and maintains the health of millions of GPUs across global compute clusters. You'll own end-to-end systems for detecting hardware failures, automating remediation, and ensuring reliable operation of frontier model training and inference workloads at hyperscale.

What you'll do

Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure
Build health checks that detect failures, remediate them, and verify the fix at scale with minimal latency
Investigate hardware failures and system-level issues in large-scale compute environments
Own node lifecycle workflows including drain, quarantine, repair, and return-to-service processes
Build automation and tooling for global cluster management with minimal manual intervention
Partner with workload and reliability teams to integrate health signals into production systems

What they're looking for

Python and shell scripting
Large-scale distributed systems design
Systems debugging and operational troubleshooting
SQL or PromQL for data analysis
Linux and low-level hardware systems (PCIe, InfiniBand, networking)
GPU or accelerator cluster operations
Fleet lifecycle management and automated remediation
Network operations and observability
