openai
Software Engineer, GPU Infrastructure - HPC
San Franciscofulltimemid
About this role
OpenAI seeks a Software Engineer for its Fleet HPC team to ensure reliability and uptime of large-scale GPU compute infrastructure supporting AI research and products. You'll build automation systems, monitor server health, and collaborate across infrastructure teams to keep supercomputing systems running efficiently at unprecedented scale.
What you'll do
- Build and maintain automation for server fleet provisioning and management
- Develop monitoring tools for server health, performance, and lifecycle events
- Investigate system-level issues and devise solutions for hardware reliability
- Collaborate with clusters, networking, and infrastructure teams
- Identify and fix performance bottlenecks in large-scale environments
- Partner with external operators to maintain service quality
What they're looking for
- Python, Go, or similar programming languages
- Linux administration and server hardware knowledge
- Networking fundamentals
- Data analysis with SQL, PromQL, Pandas, or similar tools
- Large-scale infrastructure management experience
- Hardware management protocols (IPMI, Redfish preferred)
- Monitoring tools like Prometheus and Grafana (bonus)
- HPC or distributed systems experience (bonus)
Opens the official application on the employer’s site. No login required.