OpenAI

Software Engineer, Workload Enablement

San Francisco (Remote) | Full-time | Mid-level

About this role

OpenAI is seeking a Software Engineer to validate and optimize AI training and inference workloads on new hardware platforms. You'll create benchmarks, port production workloads, analyze performance bottlenecks, and collaborate with systems teams to ensure platforms meet production readiness standards.

What you'll do

  • Port and validate inference and training workloads on new platforms, ensuring correctness, performance, and stability
  • Build comprehensive benchmarks and stress tests that exercise compute, memory, networking, storage, and failure modes
  • Perform deep-dive performance analysis on distributed training, including collective communications and compute-communication overlap
  • Create repeatable test harnesses for CI/lab environments with clear pass/fail and regression detection outputs
  • Partner with systems and fleet engineers on platform stability, operability, and Kubernetes integration
  • Work with vendors and stakeholders by producing bug reports, minimal reproductions, and prioritized issue lists

What they're looking for

  • PyTorch and modern LLM training/inference stacks
  • Distributed systems and large-scale training concepts (data/model/pipeline parallelism)
  • RDMA networking and communications library optimization (NCCL/RCCL)
  • Performance profiling and debugging tools (Nsight, rocprof, perf, flamegraphs)
  • Python and performance-critical code (C++/CUDA/HIP)
  • Kubernetes and container orchestration
  • Hardware bring-up and early platform validation
  • ML systems or HPC engineering (5+ years experience)