OpenAI

Software Engineer, Workload Enablement

San Francisco (Remote) · Full-time · Mid-level

About this role

OpenAI is seeking a Software Engineer to validate and optimize AI training and inference workloads on new hardware platforms. You'll create benchmarks, port production workloads, analyze performance bottlenecks, and collaborate with systems teams to ensure platforms meet production readiness standards.

What you'll do

Port and validate inference and training workloads on new platforms, ensuring correctness, performance, and stability
Build comprehensive benchmarks and stress tests that exercise compute, memory, networking, storage, and failure modes
Perform deep-dive performance analysis on distributed training, including collective communications and compute-communication overlap
Create repeatable test harnesses for CI/lab environments with clear pass/fail and regression detection outputs
Partner with systems and fleet engineers on platform stability, operability, and Kubernetes integration
Work with vendors and stakeholders by producing bug reports, minimal reproductions, and prioritized issue lists

What they're looking for

PyTorch and modern LLM training/inference stacks
Distributed systems and large-scale training concepts (data/model/pipeline parallelism)
RDMA networking and communications library optimization (NCCL/RCCL)
Performance profiling and debugging tools (Nsight, rocprof, perf, flamegraphs)
Python and performance-critical code (C++/CUDA/HIP)
Kubernetes and container orchestration
Hardware bring-up and early platform validation
ML systems or HPC engineering (5+ years of experience)
