OpenAI
Hardware Operations Engineer
Remote - US (Remote)$86.4k–$228kfulltimemid
About this role
OpenAI seeks an experienced Hardware Operations Engineer to lead on-site hardware reliability and fleet health at a flagship AI campus. You'll diagnose and resolve complex hardware failures, drive root cause analyses, establish maintenance standards, and mentor teams while collaborating across engineering, operations, and vendor organizations to support world-class AI infrastructure at scale.
What you'll do
- Lead technical triage and resolution of complex hardware failures affecting production GPU and server systems
- Conduct root cause analyses for critical incidents and develop corrective action plans with cross-functional teams
- Establish and improve hardware maintenance procedures, runbooks, and operational standards for scalability
- Partner with Fleet Health Engineering to identify failure patterns and drive reliability improvement initiatives
- Coordinate with Cloud Service Providers and OEM vendors on repairs, replacements, and hardware lifecycle management
- Analyze failure trends and metrics to identify risks; mentor on-site technicians on troubleshooting excellence
What they're looking for
- Large-scale datacenter hardware infrastructure management (8+ years)
- Server platforms, GPU systems, and storage infrastructure expertise
- Complex hardware failure diagnosis and repair in production environments
- Root cause analysis and reliability engineering principles
- Cross-functional leadership and vendor coordination
- Technical documentation and operational process development
- Fleet health management and monitoring
- Hardware validation and production readiness assessment
Opens the official application on the employer’s site. No login required.