OpenAI

Software Engineer, Fleet Management

San Francisco, full-time, mid-level

About this role

OpenAI is seeking a Software Engineer to build and manage large-scale computing infrastructure supporting AI research and products. You'll design systems for cloud and bare-metal fleet orchestration, integrate hardware with job scheduling, and automate infrastructure workflows to enhance reliability and efficiency.

What you'll do

Design and build management systems for cloud and bare-metal computing fleets at scale
Develop tools connecting hardware metrics with job scheduling and cluster management
Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows
Automate infrastructure processes to reduce manual work and improve system reliability
Collaborate with hardware, infrastructure, and research teams on cross-stack integration
Improve tools, automation, processes, and documentation for operational efficiency

What they're looking for

Large-scale infrastructure systems design
Cluster management (Kubernetes, CI/CD pipelines, Terraform)
Server-level systems (Linux kernels, containerization, firmware management)
Cloud provider platforms
Systems automation and scripting
Performance optimization and troubleshooting
Hardware and networking knowledge
Configuration management tools (Chef or similar)

Benefits

Hybrid work model (3 days in-office per week)
Relocation assistance for new employees
Work on cutting-edge AI infrastructure
Exposure to large-scale computing challenges
Collaborative environment across multiple teams
