Fluidstack
Site Reliability Engineer, Compute
San Francisco, CA | $175k–$300k | Full-time | Mid-level | Added today
About this role
Fluidstack seeks a Site Reliability Engineer to own the health, reliability, and automation of a massive GPU compute fleet. You'll build metrics pipelines, repair automation, GPU qualification platforms, and low-level hardware tooling to keep pace with rapid datacenter expansion and AI infrastructure demands.
What you'll do
- Build metrics pipelines, alerting, and unified health monitoring across Kubernetes and bare metal GPU fleets at scale
- Design and automate repair workflows from GPU failure detection through triage, parts management, and return to service
- Develop and expand the GPU qualification platform, including burn-in, performance baselining, and new hardware certification
- Own Redfish and BMC tooling for firmware telemetry, log collection, and fleet-level hardware access
- Establish incident discipline and operational excellence for one of the world's largest GPU fleets
- Eliminate manual toil by converting ad-hoc procedures into reliable, scalable automation
What they're looking for
- Kubernetes orchestration and bare metal cluster management
- Hardware observability (Redfish, BMC, firmware-level telemetry)
- Metrics and alerting systems design (Prometheus, observability platforms)
- Automation and infrastructure-as-code development
- GPU hardware knowledge and failure mode analysis
- Systems thinking and troubleshooting at scale
- Python or similar systems programming languages
- Production incident response and postmortem discipline
Benefits
- Work on civilization-scale AI infrastructure at the frontier of the industry
- Extreme ownership and autonomy with end-to-end scope
- High-velocity, first-principles engineering culture
- Opportunity to define industry standards for hyperscale GPU operations
Likely interview questions
- Describe your experience managing observability at scale. How would you design a unified health view across thousands of GPUs in production?
- Tell us about a time you transformed a manual, toil-heavy process into reliable automation. What was the biggest challenge?