FluidStack

Software Engineer, Compute (GPU)

San Francisco, CA · $175k–$300k · Full-time · Mid-level · Added today

About this role

Fluidstack seeks a Software Engineer to own GPU fleet health and reliability across massive AI compute infrastructure. You'll build automation pipelines for GPU repair, qualification, and fleet observability while working on civilization-scale problems in AI compute.

What you'll do

Design and maintain metrics pipelines, alerting systems, and unified health dashboards for GPU fleet visibility across Kubernetes and bare metal
Develop end-to-end automation for GPU failure detection, triage, parts management, and return to service
Build and expand the GPU qualification platform, covering burn-in testing, performance baselining, and new product introduction (NPI) execution for new hardware
Own Redfish and BMC firmware tooling for telemetry, logging, and low-level hardware access at fleet scale
Debug and optimize infrastructure performance across multiple production sites that are scaling by gigawatts (GW) annually
Drive incident response discipline and operational reliability for one of the world's largest GPU fleets

What they're looking for

Systems software engineering and distributed systems design
Hardware firmware knowledge (Redfish, BMC, IPMI, or equivalent)
Kubernetes orchestration and container infrastructure
Python or Go for infrastructure automation
Observability and monitoring systems (metrics, logging, alerting)
GPU hardware fundamentals and failure mode analysis
Database design for high-scale telemetry pipelines
Linux kernel and systems-level debugging

Benefits

Work on civilization-scale infrastructure for AI alignment
Extreme autonomy and end-to-end ownership of major systems
High-velocity environment pushing the frontier of compute infrastructure
Opportunity to shape how frontier AI compute is deployed
Steep learning curve with unfamiliar domains and cutting-edge problems
San Francisco-based team with focus on impact over process

Likely interview questions

Walk us through a time you've debugged a production incident involving hardware or firmware—what was your process and what did you learn?
How would you approach designing a metrics pipeline to track GPU health across thousands of devices with minimal latency?
Describe your experience with hardware-level tooling like Redfish, BMC, or IPMI. What have you built with them?
Tell us about a manual, toil-heavy workflow you've automated—what made it worth the effort and how did you prioritize it?
How do you reason about failure modes at the silicon and firmware level, not just the software stack?
This role requires rapid onboarding in unfamiliar domains like GPU qualification or fleet repair automation. Give an example of when you learned a complex technical domain quickly.
What's your experience operating infrastructure at scale (thousands of machines), and how have you approached observability challenges?

