Fluidstack

Site Reliability Engineer, Compute

San Francisco, CA | $175k–$300k | Full-time | Mid-level | Added today

About this role

Fluidstack seeks a Site Reliability Engineer to own the health, reliability, and automation of a massive GPU compute fleet. You'll build metrics pipelines, repair automation, GPU qualification platforms, and low-level hardware tooling to keep pace with rapid datacenter expansion and AI infrastructure demands.

What you'll do

Build metrics pipelines, alerting, and unified health monitoring across Kubernetes and bare-metal GPU fleets at scale
Design and automate repair workflows from GPU failure detection through triage, parts management, and return to service
Develop and expand the GPU qualification platform, including burn-in, performance baselining, and new hardware certification
Own Redfish and BMC tooling for firmware telemetry, log collection, and fleet-level hardware access
Establish incident discipline and operational excellence for one of the world's largest GPU fleets
Eliminate manual toil by converting ad-hoc procedures into reliable, scalable automation

What they're looking for

Kubernetes orchestration and bare-metal cluster management
Hardware observability (Redfish, BMC, firmware-level telemetry)
Metrics and alerting systems design (Prometheus, observability platforms)
Automation and infrastructure-as-code development
GPU hardware knowledge and failure mode analysis
Systems thinking and troubleshooting at scale
Python or similar systems programming languages
Production incident response and postmortem discipline

Benefits

Work on civilization-scale AI infrastructure at the frontier of the industry
Extreme ownership and autonomy with end-to-end scope
High-velocity, first-principles engineering culture
Opportunity to define industry standards for hyperscale GPU operations

Likely interview questions

Describe your experience managing observability at scale. How would you design a unified health view across thousands of GPUs in production?
Tell us about a time you transformed a manual, toil-heavy process into reliable automation. What was the biggest challenge?
What's your experience with low-level hardware tooling like Redfish or BMC? How comfortable are you reasoning about firmware-level failures?
How do you approach learning an unfamiliar domain quickly? Give an example of a steep learning curve you've navigated in your career.
Walk us through how you'd design a GPU qualification and certification pipeline for new hardware generations at speed.
Describe your experience with Kubernetes in production. How would you handle cluster-wide compute migrations without downtime?
Tell us about your most complex production incident. How did you approach diagnosis and what systems thinking did you apply?

