Adaption Labs

Distributed Systems Engineer, Data & Inference Platform

San Francisco (Remote), full-time, mid-level. Added 2 days ago.

About this role

Build and operate distributed systems that serve LLMs at scale and power large-scale data pipelines. You'll optimize inference services for throughput and cost, debug complex production failures, and partner with researchers to turn experimental workloads into reliable systems.

What you'll do

Design and operate distributed inference systems for LLMs, managing batching, scheduling, KV cache, and autoscaling across GPU fleets
Build large-scale data pipelines using frameworks like Ray Data or Spark for training and evaluation datasets
Identify and resolve production failure modes including stragglers, memory fragmentation, and data corruption
Define SLOs, build observability infrastructure, and own the on-call rotation for production systems
Partner directly with ML engineers and researchers to scale experimental workloads to production
Write postmortems and implement durable fixes to prevent recurring incidents

What they're looking for

Distributed systems design and production operations (5+ years)
Large-scale data/compute frameworks (Ray, Spark, Flink, Beam, or Dask)
Python and at least one systems language (Go, Rust, C++)
GPU/accelerator stack knowledge (CUDA, NCCL, mixed precision, memory layout)
Kubernetes infrastructure and custom operators/schedulers
Production incident diagnosis and resolution
LLM inference engines (vLLM, SGLang, TensorRT-LLM, TGI) — bonus
Modern lakehouse formats (Iceberg, Delta, Hudi) — bonus

Benefits

Flexible work with Bay Area collaboration and global team options
Annual travel stipend (Adaption Passport) to explore new countries
Weekly meal allowance for take-out or grocery delivery
Comprehensive medical benefits
Generous paid time off
