OpenAI

Software Engineer, Fleet Hardware Health

San Francisco | Full-time | Mid-level

About this role

Join OpenAI's Fleet Hardware team to ensure the reliability and uptime of large-scale compute infrastructure powering AI research and products. You'll build automation systems and monitoring tools for server fleet management, troubleshoot cutting-edge hardware at scale, and collaborate across infrastructure teams to maintain peak performance and minimize costly hardware failures.

What you'll do

  • Build and maintain automation for server fleet provisioning and lifecycle management
  • Develop monitoring tools to track server health, performance, and lifecycle events
  • Identify and resolve performance bottlenecks and hardware inefficiencies
  • Collaborate with clusters, networking, and infrastructure teams on system reliability
  • Partner with external operators to ensure quality standards
  • Automate detection and remediation to reduce manual operational work

What they're looking for

  • Python, Go, or similar programming languages
  • Linux system administration and kernel tuning
  • Server hardware and networking knowledge
  • Data analysis with SQL, PromQL, Pandas, or similar tools
  • Large-scale infrastructure management experience
  • Hardware management protocols (IPMI, Redfish)
  • Monitoring tools (Prometheus, Grafana)
  • High-performance computing or distributed systems experience