Baseten

Site Reliability Engineer (SRE)

San Francisco (Remote) · $165k–$330k · Full-time · Mid-level · Added 2 days ago

About this role

Baseten is seeking a Site Reliability Engineer to build and maintain robust infrastructure, processes, and observability systems for its ML inference platform, which serves major AI companies. You'll own multi-cloud Kubernetes reliability, establish SLOs, automate incident response, and raise the organization's operational standards.

What you'll do

  • Own reliability of multi-cloud Kubernetes infrastructure including incident response, post-mortems, and remediation
  • Build and maintain observability infrastructure including metrics, logging, dashboards, and alerting as code
  • Author and improve runbooks for recurring failure patterns, structured so they can be executed with minimal context
  • Identify high-frequency failures and convert them into automated mitigations or self-healing systems
  • Diagnose runtime issues related to latency, memory, GPU utilization, concurrency, and model lifecycle
  • Define and instrument SLOs and SLIs across customer workloads and internal services

What they're looking for

  • Kubernetes (multi-cloud: EKS, GKE)
  • Observability tooling (Prometheus, VictoriaMetrics, Loki, Grafana)
  • Infrastructure-as-code (Terraform, Helm)
  • GitOps workflows (Flux CD, ArgoCD)
  • Incident response and post-mortem analysis
  • Scalable infrastructure design and maintenance
  • Writing code with an operational mindset
  • Incident management platforms

Benefits

  • Competitive compensation with meaningful equity
  • 100% medical, dental, and vision insurance coverage for employee and dependents
  • Flexible PTO with company-wide Winter Break closure
  • Paid parental leave
  • Fertility and family-building stipend through Carrot
  • Company-facilitated 401(k)