OpenAI

Software Engineer, Platform Systems

San Francisco, full-time, mid-level

About this role

Join OpenAI's Platform Systems team to build distributed systems that monitor and operate large-scale AI training workloads. You'll develop failure detection, tracing, and observability infrastructure that enables reliable training on massive supercomputers, working at the core of OpenAI's training platform.

What you'll do

  • Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training
  • Develop tooling to identify and diagnose slow, faulty, or misbehaving nodes in distributed systems
  • Improve observability, reliability, and performance across the training platform
  • Debug and resolve issues in complex, high-throughput distributed systems
  • Collaborate with systems, infrastructure, and research teams to evolve platform capabilities
  • Adapt failure detection and tracing systems to support new training paradigms and workloads

What they're looking for

  • Distributed systems design and debugging
  • Low-level systems programming and performance optimization
  • Hardware, operating systems, and networking knowledge
  • Concurrency and multi-threaded programming
  • High-performance computing experience
  • Observability and monitoring systems
  • Failure detection and tracing systems
  • System-level troubleshooting and automation