Adaption Labs
Distributed Systems Engineer, Data & Inference Platform
San Francisco (Remote) | Full-time | Mid-level | Added 2 days ago
About this role
Build and operate distributed systems that serve LLMs at scale and power large-scale data pipelines. You'll optimize inference services for throughput and cost, debug complex production failures, and partner with researchers to turn experimental workloads into reliable systems.
What you'll do
- Design and operate distributed inference systems for LLMs, managing batching, scheduling, KV cache, and autoscaling across GPU fleets
- Build large-scale data pipelines using frameworks like Ray Data or Spark for training and evaluation datasets
- Identify and resolve production failure modes including stragglers, memory fragmentation, and data corruption
- Define SLOs, build observability infrastructure, and own the on-call rotation for production systems
- Partner directly with ML engineers and researchers to scale experimental workloads to production
- Write postmortems and implement durable fixes to prevent recurring incidents
What they're looking for
- Distributed systems design and production operations (5+ years)
- Large-scale data/compute frameworks (Ray, Spark, Flink, Beam, or Dask)
- Python and at least one systems language (Go, Rust, or C++)
- GPU/accelerator stack knowledge (CUDA, NCCL, mixed precision, memory layout)
- Kubernetes infrastructure and custom operators/schedulers
- Production incident diagnosis and resolution
- LLM inference engines (vLLM, SGLang, TensorRT-LLM, TGI) — bonus
- Modern lakehouse formats (Iceberg, Delta, Hudi) — bonus
Benefits
- Flexible work with Bay Area collaboration and global team options
- Annual travel stipend (Adaption Passport) to explore new countries
- Weekly meal allowance for take-out or grocery delivery
- Comprehensive medical benefits
- Generous paid time off