FluidStack

Production Engineer, Compute (GPU)

San Francisco, CA (Remote) | $175k–$300k | Full-time | Mid-level | Added 2 days ago

About this role

Fluidstack seeks a Production Engineer to build automation and observability systems for managing one of the world's largest GPU compute fleets. You'll own end-to-end fleet health, design repair pipelines, qualify new hardware generations, and scale infrastructure that grows by entire sites every few months.

What you'll do

Build metrics pipelines and alerting to track GPU fleet health across Kubernetes and bare metal at scale
Develop automation to manage GPU failures from detection through triage, parts management, and return to service
Design and expand the GPU qualification platform, including burn-in testing and performance baselining
Own Redfish and BMC tooling for firmware-level telemetry and fleet-scale log collection
Migrate live compute across production sites and bring new sites online sustainably
Ensure reliability and operational discipline for a rapidly expanding GPU fleet

What they're looking for

Hardware troubleshooting and understanding of firmware/silicon-level failure modes
Kubernetes orchestration and bare metal infrastructure management
Automation and scripting to eliminate manual operational toil
Observability and metrics pipeline development
Redfish/BMC and low-level hardware access protocols
Rapid learning in unfamiliar technical domains
Systems thinking at hyperscale
Incident response and operational discipline

Benefits

Work on civilization-scale AI infrastructure
Extreme ownership with full autonomy and end-to-end scope
High-velocity environment pushing technological frontiers
Based in San Francisco, CA
