Adaption Labs

Distributed Systems Engineer, Data & Inference Platform

San Francisco (Remote) · Full-time · Mid-level · Added 2 days ago

About this role

Build and operate distributed systems that serve LLMs at scale and power large-scale data pipelines. You'll optimize inference services for throughput and cost, debug complex production failures, and partner with researchers to turn experimental workloads into reliable systems.

What you'll do

  • Design and operate distributed inference systems for LLMs, managing batching, scheduling, KV cache, and autoscaling across GPU fleets
  • Build large-scale data pipelines using frameworks like Ray Data or Spark for training and evaluation datasets
  • Identify and resolve production failure modes including stragglers, memory fragmentation, and data corruption
  • Define SLOs, build observability infrastructure, and own the on-call rotation for production systems
  • Partner directly with ML engineers and researchers to scale experimental workloads to production
  • Write postmortems and implement durable fixes to prevent recurring incidents

What they're looking for

  • Distributed systems design and production operations (5+ years)
  • Large-scale data/compute frameworks (Ray, Spark, Flink, Beam, or Dask)
  • Python and at least one systems language (Go, Rust, C++)
  • GPU/accelerator stack knowledge (CUDA, NCCL, mixed precision, memory layout)
  • Kubernetes infrastructure and custom operators/schedulers
  • Production incident diagnosis and resolution
  • LLM inference engines (vLLM, SGLang, TensorRT-LLM, TGI) — bonus
  • Modern lakehouse formats (Iceberg, Delta, Hudi) — bonus

Benefits

  • Flexible work with Bay Area collaboration and global team options
  • Annual travel stipend (Adaption Passport) to explore new countries
  • Weekly meal allowance for take-out or grocery delivery
  • Comprehensive medical benefits
  • Generous paid time off