FluidStack

Software Engineer, GPU Infrastructure

San Francisco, CA · $175k–$300k · Full-time · Mid-level · Added today

About this role

Fluidstack seeks a Software Engineer to build automation and observability systems for managing a massive GPU compute fleet. You'll own health monitoring pipelines, repair automation, GPU qualification platforms, and BMC/Redfish tooling across Kubernetes and bare-metal infrastructure scaling to multiple gigawatts (GW) annually.

What you'll do

  • Design and implement fleet-wide GPU health metrics, alerting, and unified monitoring across Kubernetes and bare metal
  • Build end-to-end repair automation pipelines from failure detection through triage, RMA, and return to service
  • Develop and expand the GPU qualification platform for burn-in testing, performance baselining, and new product introduction (NPI) of new hardware
  • Own Redfish and BMC tooling for firmware telemetry and low-level fleet access
  • Drive production cluster migrations and Kubernetes orchestration during rapid site expansions
  • Ensure reliability and operability of one of the world's largest GPU fleets

What they're looking for

  • Kubernetes and container orchestration
  • Systems automation and infrastructure-as-code
  • Hardware fundamentals (firmware, BMC, silicon-level failure modes)
  • Observability and monitoring systems (metrics, alerting, logging)
  • Redfish and IPMI/BMC APIs
  • Go, Python, or similar systems programming languages
  • Distributed systems and fleet management at scale
  • Debugging and performance tuning under ambiguity

Benefits

  • Work on civilization-scale AI infrastructure
  • Extreme ownership and autonomy over end-to-end systems
  • Fast-paced, high-intensity environment with clear impact
  • Opportunity to set production standards for new hardware generations
  • Exposure to frontier compute challenges across hardware and software
  • Team culture emphasizing first-principles thinking and velocity

Likely interview questions

  • Tell us about a time you automated a manual, repetitive process. What made it successful?
  • How would you approach building observability for a fleet you've never seen before, with hardware you don't fully understand?