Baseten
SRE
San Francisco (Remote) · $165k–$330k · Full-time · Mid-level · Added 2 days ago
About this role
Baseten is seeking a Site Reliability Engineer to build and maintain robust infrastructure, processes, and observability systems for its ML inference platform, which serves major AI companies. You'll own multi-cloud Kubernetes reliability, establish SLOs, automate incident response, and empower the organization through operational excellence.
What you'll do
- Own reliability of multi-cloud Kubernetes infrastructure including incident response, post-mortems, and remediation
- Build and maintain observability infrastructure including metrics, logging, dashboards, and alerting as code
- Author and improve runbooks for recurring failure patterns with structured, low-context execution
- Identify high-frequency failures and convert them into automated mitigations or self-healing systems
- Diagnose runtime issues related to latency, memory, GPU utilization, concurrency, and model lifecycle
- Define and instrument SLOs and SLIs across customer workloads and internal services
What they're looking for
- Kubernetes (multi-cloud: EKS, GKE)
- Observability tooling (Prometheus, VictoriaMetrics, Loki, Grafana)
- Infrastructure-as-code (Terraform, Helm)
- GitOps workflows (Flux CD, ArgoCD)
- Incident response and post-mortem analysis
- Scalable infrastructure design and maintenance
- Writing code with an operational mindset
- Incident management platforms
Benefits
- Competitive compensation with meaningful equity
- 100% medical, dental, and vision insurance coverage for employees and dependents
- Flexible PTO with company-wide Winter Break closure
- Paid parental leave
- Fertility and family-building stipend through Carrot
- Company-facilitated 401(k)