OpenAI

Software Engineer, Agent Infrastructure

San Francisco, full-time, mid-level

About this role

Join OpenAI's Agent Infrastructure team to build and scale the systems that train and deploy AI agents. You'll work with researchers and product teams to develop the training environments and production platforms that power products like Operator and ChatGPT's tool use, and you'll operate on some of the world's largest compute clusters.

What you'll do

Scale and optimize custom container orchestration platforms beyond traditional Kubernetes capabilities
Develop and maintain FastAPI and gRPC APIs for training and production agentic infrastructure
Use Terraform to design and evolve complex infrastructure for research and production environments
Collaborate with research teams to optimize systems for novel AI training runs and experimental applications
Identify and resolve performance bottlenecks in large-scale machine learning systems
Build new infrastructure capabilities from concept through production at extreme scale

What they're looking for

Large-scale machine learning infrastructure experience
FastAPI and gRPC API development
Terraform and infrastructure-as-code
Containerization and virtualization technologies (Kata, Firecracker, gVisor, Sysbox)
Cloud platform expertise
Performance optimization and distributed systems
0-1 product development and rapid scaling
System-level debugging and bottleneck analysis

Benefits

Hybrid work model (3 days in office per week)
San Francisco or New York City location options
Relocation assistance for new employees
