Deepgram

Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

USA | Remote | $150k–$220k | Full-time | Mid-level | Added 2 days ago

About this role

Deepgram seeks an experienced Site Reliability Engineer to design and operate hybrid cloud and on-premise infrastructure supporting AI/ML research and product development. You'll build a scalable, self-service platform using Kubernetes, AWS, and Terraform while orchestrating GPU workloads across distributed environments.

What you'll do

  • Architect and maintain Kubernetes-based computing platforms across AWS and on-premise data centers
  • Develop and manage infrastructure-as-code using Terraform for reproducible, automated environments
  • Design AI/ML job scheduling systems integrating Slurm with Kubernetes to optimize GPU resource management
  • Provision and maintain on-premise bare metal servers for high-performance GPU computing
  • Implement observability, monitoring, and logging solutions to ensure platform health and performance
  • Collaborate with AI researchers to understand infrastructure needs and accelerate development workflows

What they're looking for

  • Kubernetes architecture and operations at scale
  • Terraform and Infrastructure-as-Code practices
  • AWS cloud infrastructure
  • Slurm or HPC job scheduling systems
  • Hybrid cloud and on-premise infrastructure management
  • Platform engineering and DevOps
  • Networking and storage solutions (CNI, CSI, S3)
  • Observability and incident response automation