FluidStack
Software Engineer, Compute (GPU)
San Francisco, CA · $175k–$300k · Full-time · Mid-level · Added today
About this role
Fluidstack seeks a Software Engineer to own GPU fleet health and reliability across massive AI compute infrastructure. You'll build automation pipelines for GPU repair, qualification, and fleet observability while working on civilization-scale problems in AI compute.
What you'll do
- Design and maintain metrics pipelines, alerting systems, and unified health dashboards for GPU fleet visibility across Kubernetes and bare metal
- Develop end-to-end automation for GPU failure detection, triage, parts management, and return to service
- Build and expand the GPU qualification platform, covering burn-in testing, performance baselining, and NPI execution for new hardware
- Own Redfish and BMC firmware tooling for telemetry, logging, and low-level hardware access at fleet scale
- Debug and optimize infrastructure performance across multiple production sites growing by gigawatts of capacity annually
- Drive incident response discipline and operational reliability for one of the world's largest GPU fleets
What they're looking for
- Systems software engineering and distributed systems design
- Hardware firmware knowledge (Redfish, BMC, IPMI, or equivalent)
- Kubernetes orchestration and container infrastructure
- Python or Go for infrastructure automation
- Observability and monitoring systems (metrics, logging, alerting)
- GPU hardware fundamentals and failure mode analysis
- Database design for high-scale telemetry pipelines
- Linux kernel and systems-level debugging
Benefits
- Work on civilization-scale infrastructure for AI alignment
- Extreme autonomy and end-to-end ownership of major systems
- High-velocity environment pushing the frontier of compute infrastructure
- Opportunity to shape how frontier AI compute is deployed
- Steep learning curve across unfamiliar domains and cutting-edge problems
- San Francisco-based team with focus on impact over process
Likely interview questions
- Walk us through a time you've debugged a production incident involving hardware or firmware—what was your process and what did you learn?
- How would you approach designing a metrics pipeline to track GPU health across thousands of devices with minimal latency?