openai
Software Engineer, Platform Systems
San Franciscofulltimemid
About this role
Join OpenAI's Platform Systems team to build distributed systems that monitor and operate large-scale AI training workloads. You'll develop failure detection, tracing, and observability infrastructure that enables reliable training on massive supercomputers, working at the core of OpenAI's training platform.
What you'll do
- Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training
- Develop tooling to identify and diagnose slow, faulty, or misbehaving nodes in distributed systems
- Improve observability, reliability, and performance across the training platform
- Debug and resolve issues in complex, high-throughput distributed systems
- Collaborate with systems, infrastructure, and research teams to evolve platform capabilities
- Adapt failure detection and tracing systems to support new training paradigms and workloads
What they're looking for
- Distributed systems design and debugging
- Low-level systems programming and performance optimization
- Hardware, operating systems, and networking knowledge
- Concurrency and multi-threaded programming
- High-performance computing experience
- Observability and monitoring systems
- Failure detection and tracing systems
- System-level troubleshooting and automation
Opens the official application on the employer’s site. No login required.