FluidStack

Software Engineer, GPU Infrastructure

San Francisco, CA · $175k–$300k · Full-time · Mid-level · Added today

About this role

Fluidstack seeks a Software Engineer to build automation and observability systems for managing a massive GPU compute fleet. You'll own health monitoring pipelines, repair automation, GPU qualification platforms, and BMC/Redfish tooling across Kubernetes and bare metal infrastructure scaling to multiple gigawatts (GW) annually.

What you'll do

Design and implement fleet-wide GPU health metrics, alerting, and unified monitoring across Kubernetes and bare metal at scale
Build end-to-end repair automation pipelines from failure detection through triage, RMA, and return to service
Develop and expand the GPU qualification platform for burn-in testing, performance baselining, and new product introduction (NPI) execution for new hardware
Own Redfish and BMC tooling for firmware telemetry and low-level fleet access
Drive production cluster migrations and Kubernetes orchestration during rapid site expansions
Ensure reliability and operability of one of the world's largest GPU fleets

What they're looking for

Kubernetes and container orchestration
Systems automation and infrastructure-as-code
Hardware fundamentals (firmware, BMC, silicon-level failure modes)
Observability and monitoring systems (metrics, alerting, logging)
Redfish and IPMI/BMC APIs
Go, Python, or similar systems programming languages
Distributed systems and fleet management at scale
Debugging and performance tuning under ambiguity

Benefits

Work on civilization-scale AI infrastructure
Extreme ownership and autonomy over end-to-end systems
Fast-paced, high-intensity environment with clear impact
Opportunity to set production standards for new hardware generations
Exposure to frontier compute challenges across hardware and software
Team culture emphasizing first-principles thinking and velocity

Likely interview questions

Tell us about a time you automated a manual, repetitive process. What made it successful?
How would you approach building observability for a fleet you've never seen before, with hardware you don't fully understand?
Describe your experience with BMC, Redfish, or similar low-level hardware management tools.
Walk us through how you'd design a GPU failure detection and triage pipeline that scales to thousands of devices.
Tell us about the most ambiguous problem you've solved without clear requirements or precedent.
How do you balance building robust automation versus moving fast when time-to-production is critical?
What's your experience with Kubernetes in production, and have you managed bare metal alongside orchestrated workloads?
