Andromeda

Site Reliability Engineer - AI Infrastructure

Global Remote / San Francisco, CA (Remote) · Full-time · Mid-level · Added 2 days ago

About this role

Andromeda is seeking a Site Reliability Engineer to provision, operate, and optimize Kubernetes-based AI infrastructure clusters across multiple cloud providers. You'll build automation tooling, debug complex system issues, and lead incident response for a platform that routes training and inference jobs globally.

What you'll do

  • Provision and configure Kubernetes clusters for customers across multiple providers
  • Build automation and tooling to streamline cluster deployments and integrations
  • Debug customer issues spanning networking, storage, scheduling, and system layers
  • Design and implement monitoring, alerting, and observability systems for critical infrastructure
  • Lead on-call rotations, incident response, and reliability postmortems
  • Collaborate with engineering and product teams on infrastructure planning for new services

What they're looking for

  • Kubernetes and container orchestration at scale
  • Linux systems and networking fundamentals
  • Infrastructure-as-Code tools (Terraform, Helm, Ansible)
  • Scripting and automation (Python, Go, or Bash)
  • Observability stacks (Prometheus, Grafana, Loki, Datadog)
  • Production systems operations and incident response
  • ML/AI infrastructure experience (CUDA, Slurm, Triton) - preferred
  • High-performance networking or distributed storage - preferred

Benefits

  • Ownership and autonomy in shaping system reliability and scalability
  • Direct collaboration with customers and infrastructure providers
  • Work on foundational AI compute infrastructure
  • Global remote or San Francisco office options
  • Full-time position