
FluidStack

Production Engineer, Compute (GPU)

San Francisco, CA (Remote) | $175k–$300k | Full-time | Mid-level | Added 2 days ago

About this role

Fluidstack seeks a Production Engineer to build automation and observability systems for managing one of the world's largest GPU compute fleets. You'll own end-to-end fleet health, design repair pipelines, qualify new hardware generations, and scale infrastructure that grows by entire sites every few months.

What you'll do

  • Build metrics pipelines and alerting to track GPU fleet health across Kubernetes and bare metal at scale
  • Develop automation to manage GPU failures from detection through triage, parts management, and return to service
  • Design and expand the GPU qualification platform, including burn-in testing and performance baselining
  • Own Redfish and BMC tooling for firmware-level telemetry and fleet-scale log collection
  • Migrate live compute across production sites and bring new sites online sustainably
  • Ensure reliability and operational discipline for a rapidly expanding GPU fleet

What they're looking for

  • Hardware troubleshooting and an understanding of firmware- and silicon-level failure modes
  • Kubernetes orchestration and bare metal infrastructure management
  • Automation and scripting to eliminate manual operational toil
  • Observability and metrics pipeline development
  • Redfish/BMC and low-level hardware access protocols
  • Rapid learning in unfamiliar technical domains
  • Systems thinking at hyperscale
  • Incident response and operational discipline

Benefits

  • Work on civilization-scale AI infrastructure
  • Extreme ownership with full autonomy and end-to-end scope
  • High-velocity environment pushing technological frontiers
  • Based in San Francisco, CA