OpenAI
Software Engineer, Fleet Management
San Francisco | Full-time | Mid-level
About this role
OpenAI is seeking a Software Engineer to build and manage large-scale computing infrastructure supporting AI research and products. You'll design systems for cloud and bare-metal fleet orchestration, integrate hardware with job scheduling, and automate infrastructure workflows to enhance reliability and efficiency.
What you'll do
- Design and build management systems for cloud and bare-metal computing fleets at scale
- Develop tools connecting hardware metrics with job scheduling and cluster management
- Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows
- Automate infrastructure processes to reduce manual work and improve system reliability
- Collaborate with hardware, infrastructure, and research teams on cross-stack integration
- Improve tools, automation, processes, and documentation for operational efficiency
What they're looking for
- Large-scale infrastructure systems design
- Cluster management (Kubernetes, CI/CD pipelines, Terraform)
- Server-level systems (Linux kernels, containerization, firmware management)
- Cloud provider platforms
- Systems automation and scripting
- Performance optimization and troubleshooting
- Hardware and networking knowledge
- Configuration management tools (Chef or similar)
Benefits
- Hybrid work model (3 days in-office per week)
- Relocation assistance for new employees
- Work on cutting-edge AI infrastructure
- Exposure to large-scale computing challenges
- Collaborative environment across multiple teams