FluidStack

Software Engineer, Compute (GPU)

San Francisco, CA | $175k–$300k | Full-time | Mid-level | Added today

About this role

Fluidstack seeks a Software Engineer to own GPU fleet health and reliability across massive AI compute infrastructure. You'll build automation pipelines for GPU repair, qualification, and fleet observability while working on civilization-scale problems in AI compute.

What you'll do

  • Design and maintain metrics pipelines, alerting systems, and unified health dashboards for GPU fleet visibility across Kubernetes and bare metal
  • Develop end-to-end automation for GPU failure detection, triage, parts management, and return to service
  • Build and expand the GPU qualification platform, covering burn-in testing, performance baselining, and new product introduction (NPI) execution for new hardware
  • Own Redfish and BMC firmware tooling for telemetry, logging, and low-level hardware access at fleet scale
  • Debug and optimize infrastructure performance across multiple production sites that are scaling by gigawatts annually
  • Drive incident response discipline and operational reliability for one of the world's largest GPU fleets

What they're looking for

  • Systems software engineering and distributed systems design
  • Hardware firmware knowledge (Redfish, BMC, IPMI, or equivalent)
  • Kubernetes orchestration and container infrastructure
  • Python or Go for infrastructure automation
  • Observability and monitoring systems (metrics, logging, alerting)
  • GPU hardware fundamentals and failure mode analysis
  • Database design for high-scale telemetry pipelines
  • Linux kernel and systems-level debugging

Benefits

  • Work on civilization-scale infrastructure for AI alignment
  • Extreme autonomy and end-to-end ownership of major systems
  • High-velocity environment pushing the frontier of compute infrastructure
  • Opportunity to shape how frontier AI compute is deployed
  • Steep learning curve with unfamiliar domains and cutting-edge problems
  • San Francisco-based team with focus on impact over process

Likely interview questions

  • Walk us through a time you've debugged a production incident involving hardware or firmware—what was your process and what did you learn?
  • How would you approach designing a metrics pipeline to track GPU health across thousands of devices with minimal latency?