OpenAI

Software Engineer, Reliability

San Francisco · Full-time · Mid-level

About this role

OpenAI is seeking a Software Engineer focused on Reliability to ensure its systems scale safely and perform well as the company expands globally. You'll design infrastructure solutions, build testing and automation tools, and collaborate across teams to keep systems stable while serving millions of users.

What you'll do

  • Design and implement scalable infrastructure solutions to meet growing demands
  • Build and maintain load testing, chaos testing, and synthetic testing software for development teams
  • Create automation tools and resource lifecycle management platforms for CPU, storage, GPU, and network
  • Develop service level objectives (SLOs) and indicators (SLIs) to measure system reliability
  • Implement fault-tolerant design patterns to minimize service disruptions
  • Participate in an on-call rotation for incident response to support 24/7 system availability

What they're looking for

  • Cloud infrastructure (proven experience)
  • Kubernetes and container orchestration
  • Infrastructure as Code (Terraform, CloudFormation)
  • Observability tools (Datadog, Prometheus, Grafana, Splunk)
  • Microservices architecture and service mesh technologies
  • Programming languages (specific ones not listed)
  • Problem-solving and troubleshooting
  • Cloud security best practices

Benefits

  • Work on AI systems deployed to millions of users globally
  • Collaborate with cross-functional teams including researchers and product managers
  • Fast-paced, iterative environment with emphasis on learning from deployment
  • Relocation assistance provided for new employees
  • Based at the San Francisco HQ
  • Focus on safety and responsible AI deployment