OpenAI

Software Engineer, Fleet Management

San Francisco · Full-time · Mid-level

About this role

OpenAI is seeking a Software Engineer to build and manage large-scale computing infrastructure supporting AI research and products. You'll design systems for cloud and bare-metal fleet orchestration, integrate hardware with job scheduling, and automate infrastructure workflows to enhance reliability and efficiency.

What you'll do

  • Design and build management systems for cloud and bare-metal computing fleets at scale
  • Develop tools connecting hardware metrics with job scheduling and cluster management
  • Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows
  • Automate infrastructure processes to reduce manual work and improve system reliability
  • Collaborate with hardware, infrastructure, and research teams on cross-stack integration
  • Improve tools, automation, processes, and documentation for operational efficiency

What they're looking for

  • Large-scale infrastructure systems design
  • Cluster management (Kubernetes, CI/CD pipelines, Terraform)
  • Server-level systems (Linux kernels, containerization, firmware management)
  • Cloud provider platforms
  • Systems automation and scripting
  • Performance optimization and troubleshooting
  • Hardware and networking knowledge
  • Configuration management tools (Chef or similar)

Benefits

  • Hybrid work model (3 days in-office per week)
  • Relocation assistance for new employees
  • Work on cutting-edge AI infrastructure
  • Exposure to large-scale computing challenges
  • Collaborative environment across multiple teams