Andromeda
Site Reliability Engineer - AI Infrastructure
Global Remote / San Francisco, CA (Remote) | Full-time | Mid-level | Added 2 days ago
About this role
Andromeda is seeking a Site Reliability Engineer to provision, operate, and optimize Kubernetes-based AI infrastructure clusters across multiple cloud providers. You'll build automation tooling, debug complex system issues, and lead incident response for a platform that routes training and inference jobs globally.
What you'll do
- Provision and configure Kubernetes clusters for customers across multiple providers
- Build automation and tooling to streamline cluster deployments and integrations
- Debug customer issues spanning networking, storage, scheduling, and system layers
- Design and implement monitoring, alerting, and observability systems for critical infrastructure
- Lead on-call rotations, incident response, and reliability postmortems
- Collaborate with engineering and product teams on infrastructure planning for new services
What they're looking for
- Kubernetes and container orchestration at scale
- Linux systems and networking fundamentals
- Infrastructure-as-Code tools (Terraform, Helm, Ansible)
- Scripting and automation (Python, Go, or Bash)
- Observability stacks (Prometheus, Grafana, Loki, Datadog)
- Production systems operations and incident response
- ML/AI infrastructure experience (CUDA, Slurm, Triton) - preferred
- High-performance networking or distributed storage - preferred
Benefits
- Ownership and autonomy in shaping system reliability and scalability
- Direct collaboration with customers and infrastructure providers
- Work on foundational AI compute infrastructure
- Global remote or San Francisco office options
- Full-time position