Fluidstack

Site Reliability Engineer, Compute

San Francisco, CA · $175k–$300k · Full-time · Mid-level · Added today

About this role

Fluidstack seeks a Site Reliability Engineer to own the health, reliability, and automation of a massive GPU compute fleet. You'll build metrics pipelines, repair automation, GPU qualification platforms, and low-level hardware tooling to keep pace with rapid datacenter expansion and AI infrastructure demands.

What you'll do

  • Build metrics pipelines, alerting, and unified health monitoring across Kubernetes and bare metal GPU fleets at scale
  • Design and automate repair workflows from GPU failure detection through triage, parts management, and return to service
  • Develop and expand the GPU qualification platform, including burn-in, performance baselining, and new hardware certification
  • Own Redfish and BMC tooling for firmware telemetry, log collection, and fleet-level hardware access
  • Establish incident discipline and operational excellence for one of the world's largest GPU fleets
  • Eliminate manual toil by converting ad-hoc procedures into reliable, scalable automation

What they're looking for

  • Kubernetes orchestration and bare metal cluster management
  • Hardware observability (Redfish, BMC, firmware-level telemetry)
  • Metrics and alerting systems design (Prometheus, observability platforms)
  • Automation and infrastructure-as-code development
  • GPU hardware knowledge and failure mode analysis
  • Systems thinking and troubleshooting at scale
  • Python or a similar language for systems programming
  • Production incident response and postmortem discipline

Benefits

  • Work on civilization-scale AI infrastructure at the frontier of the industry
  • Extreme ownership and autonomy with end-to-end scope
  • High-velocity, first-principles engineering culture
  • Opportunity to define industry standards for hyperscale GPU operations

Likely interview questions

  • Describe your experience managing observability at scale. How would you design a unified health view across thousands of GPUs in production?
  • Tell us about a time you transformed a manual, toil-heavy process into reliable automation. What was the biggest challenge?