Skip to main content

openai

Software Engineer, GPU Infrastructure - HPC

San Franciscofulltimemid

About this role

OpenAI seeks a Software Engineer for its Fleet HPC team to ensure reliability and uptime of large-scale GPU compute infrastructure supporting AI research and products. You'll build automation systems, monitor server health, and collaborate across infrastructure teams to keep supercomputing systems running efficiently at unprecedented scale.

What you'll do

  • Build and maintain automation for server fleet provisioning and management
  • Develop monitoring tools for server health, performance, and lifecycle events
  • Investigate system-level issues and devise solutions for hardware reliability
  • Collaborate with clusters, networking, and infrastructure teams
  • Identify and fix performance bottlenecks in large-scale environments
  • Partner with external operators to maintain service quality

What they're looking for

  • Python, Go, or similar programming languages
  • Linux administration and server hardware knowledge
  • Networking fundamentals
  • Data analysis with SQL, PromQL, Pandas, or similar tools
  • Large-scale infrastructure management experience
  • Hardware management protocols (IPMI, Redfish preferred)
  • Monitoring tools like Prometheus and Grafana (bonus)
  • HPC or distributed systems experience (bonus)
Apply on the employer's site

Opens the official application on the employer’s site. No login required.