Skip to main content

openai

Software Engineer, Agent Infrastructure

San Franciscofulltimemid

About this role

Join OpenAI's Agent Infrastructure team to build and scale systems that train and deploy AI agents at massive scale. You'll work with researchers and product teams to develop training environments and production platforms powering products like Operator and ChatGPT's tool use, handling some of the world's largest compute clusters.

What you'll do

  • Scale and optimize custom container orchestration platforms beyond traditional Kubernetes capabilities
  • Develop and maintain FastAPI and gRPC APIs for training and production agentic infrastructure
  • Use Terraform to design and evolve complex infrastructure for research and production environments
  • Collaborate with research teams to optimize systems for novel AI training runs and experimental applications
  • Identify and resolve performance bottlenecks in large-scale machine learning systems
  • Build new infrastructure capabilities from concept through production at extreme scale

What they're looking for

  • Large-scale machine learning infrastructure experience
  • FastAPI and gRPC API development
  • Terraform and infrastructure-as-code
  • Containerization and virtualization technologies (Kata, Firecracker, gVisor, Sysbox)
  • Cloud platform expertise
  • Performance optimization and distributed systems
  • 0-1 product development and rapid scaling
  • System-level debugging and bottleneck analysis

Benefits

  • Hybrid work model (3 days in office per week)
  • San Francisco or New York City location options
  • Relocation assistance for new employees
Apply on the employer's site

Opens the official application on the employer’s site. No login required.