Andromeda

Site Reliability Engineer - AI Infrastructure

Global Remote / San Francisco, CA (Remote) | Full-time | Mid-level | Added 2 days ago

About this role

Andromeda is seeking a Site Reliability Engineer to provision, operate, and optimize Kubernetes-based AI infrastructure clusters across multiple cloud providers. You'll build automation tooling, debug complex system issues, and lead incident response for a platform that routes training and inference jobs globally.

What you'll do

Provision and configure Kubernetes clusters for customers across multiple providers
Build automation and tooling to streamline cluster deployments and integrations
Debug customer issues spanning networking, storage, scheduling, and system layers
Design and implement monitoring, alerting, and observability systems for critical infrastructure
Lead on-call rotations, incident response, and reliability postmortems
Collaborate with engineering and product teams on infrastructure planning for new services

What they're looking for

Kubernetes and container orchestration at scale
Linux systems and networking fundamentals
Infrastructure-as-Code tools (Terraform, Helm, Ansible)
Scripting and automation (Python, Go, or Bash)
Observability stacks (Prometheus, Grafana, Loki, Datadog)
Production systems operations and incident response
ML/AI infrastructure experience (CUDA, Slurm, Triton) - preferred
High-performance networking or distributed storage - preferred

Benefits

Ownership and autonomy in shaping system reliability and scalability
Direct collaboration with customers and infrastructure providers
Work on foundational AI compute infrastructure
Global remote or San Francisco office options
Full-time position
