OpenAI
Software Engineer, Fleet Hardware Health
San Francisco, full-time, mid-level
About this role
Join OpenAI's Fleet Hardware team to ensure the reliability and uptime of large-scale compute infrastructure powering AI research and products. You'll build automation systems and monitoring tools for server fleet management, troubleshoot cutting-edge hardware at scale, and collaborate across infrastructure teams to maintain peak performance and minimize costly hardware failures.
What you'll do
- Build and maintain automation for server fleet provisioning and lifecycle management
- Develop monitoring tools to track server health, performance, and lifecycle events
- Identify and resolve performance bottlenecks and hardware inefficiencies
- Collaborate with clusters, networking, and infrastructure teams on system reliability
- Partner with external operators to ensure quality standards
- Automate fault detection and remediation to reduce manual operational work
What they're looking for
- Python, Go, or similar programming languages
- Linux system administration and kernel tuning
- Server hardware and networking knowledge
- Data analysis with SQL, PromQL, Pandas, or similar tools
- Large-scale infrastructure management experience
- Hardware management protocols (IPMI, Redfish)
- Monitoring tools (Prometheus, Grafana)
- High-performance computing or distributed systems experience