OpenAI

Hardware Operations Engineer

Remote - US (Remote)$86.4k–$228kfulltimemid

About this role

OpenAI seeks an experienced Hardware Operations Engineer to lead on-site hardware reliability and fleet health at a flagship AI campus. You'll diagnose and resolve complex hardware failures, drive root cause analyses, establish maintenance standards, and mentor teams while collaborating across engineering, operations, and vendor organizations to support world-class AI infrastructure at scale.

What you'll do

Lead technical triage and resolution of complex hardware failures affecting production GPU and server systems
Conduct root cause analyses for critical incidents and develop corrective action plans with cross-functional teams
Establish and improve hardware maintenance procedures, runbooks, and operational standards for scalability
Partner with Fleet Health Engineering to identify failure patterns and drive reliability improvement initiatives
Coordinate with Cloud Service Providers and OEM vendors on repairs, replacements, and hardware lifecycle management
Analyze failure trends and metrics to identify risks; mentor on-site technicians on troubleshooting excellence

What they're looking for

Large-scale datacenter hardware infrastructure management (8+ years)
Server platforms, GPU systems, and storage infrastructure expertise
Complex hardware failure diagnosis and repair in production environments
Root cause analysis and reliability engineering principles
Cross-functional leadership and vendor coordination
Technical documentation and operational process development
Fleet health management and monitoring
Hardware validation and production readiness assessment

Apply on the employer's site →

Opens the official application on the employer’s site. No login required.