openai

Software Engineer, GPU Infrastructure - HPC

San Franciscofulltimemid

About this role

OpenAI seeks a Software Engineer for its Fleet HPC team to ensure reliability and uptime of large-scale GPU compute infrastructure supporting AI research and products. You'll build automation systems, monitor server health, and collaborate across infrastructure teams to keep supercomputing systems running efficiently at unprecedented scale.

What you'll do

Build and maintain automation for server fleet provisioning and management
Develop monitoring tools for server health, performance, and lifecycle events
Investigate system-level issues and devise solutions for hardware reliability
Collaborate with clusters, networking, and infrastructure teams
Identify and fix performance bottlenecks in large-scale environments
Partner with external operators to maintain service quality

What they're looking for

Python, Go, or similar programming languages
Linux administration and server hardware knowledge
Networking fundamentals
Data analysis with SQL, PromQL, Pandas, or similar tools
Large-scale infrastructure management experience
Hardware management protocols (IPMI, Redfish preferred)
Monitoring tools like Prometheus and Grafana (bonus)
HPC or distributed systems experience (bonus)

Apply on the employer's site →

Opens the official application on the employer’s site. No login required.