FluidStack
Production Engineer, Compute (GPU)
San Francisco, CA (Remote) | $175k–$300k | Full-time | Mid-level | Added 2 days ago
About this role
Fluidstack seeks a Production Engineer to build automation and observability systems for managing one of the world's largest GPU compute fleets. You'll own end-to-end fleet health, design repair pipelines, qualify new hardware generations, and scale infrastructure that grows by entire sites every few months.
What you'll do
- Build metrics pipelines and alerting to track GPU fleet health across Kubernetes and bare metal at scale
- Develop automation to manage GPU failures from detection through triage, parts management, and return to service
- Design and expand the GPU qualification platform, including burn-in testing and performance baselining
- Own Redfish and BMC tooling for firmware-level telemetry and fleet-scale log collection
- Migrate live compute across production sites and bring new sites online sustainably
- Ensure reliability and operational discipline across a rapidly expanding GPU fleet
What they're looking for
- Hardware troubleshooting and understanding of firmware/silicon-level failure modes
- Kubernetes orchestration and bare metal infrastructure management
- Automation and scripting to eliminate manual operational toil
- Observability and metrics pipeline development
- Redfish/BMC and low-level hardware access protocols
- Rapid learning in unfamiliar technical domains
- Systems thinking at hyperscale
- Incident response and operational discipline
Benefits
- Work on civilization-scale AI infrastructure
- Extreme ownership with full autonomy and end-to-end scope
- High-velocity environment pushing technological frontiers
- Based in San Francisco, CA