Deepgram

Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

USA | Remote | $150k–$220k | Full-time | Mid-level | Added 2 days ago

About this role

Deepgram seeks an experienced Site Reliability Engineer to design and operate hybrid cloud and on-premise infrastructure supporting AI/ML research and product development. You'll build a scalable, self-service platform using Kubernetes, AWS, and Terraform while orchestrating GPU workloads across distributed environments.

What you'll do

Architect and maintain Kubernetes-based computing platforms across AWS and on-premise data centers
Develop and manage infrastructure-as-code using Terraform for reproducible, automated environments
Design AI/ML job scheduling systems integrating Slurm with Kubernetes to optimize GPU resource management
Provision and maintain on-premise bare metal servers for high-performance GPU computing
Implement observability, monitoring, and logging solutions to ensure platform health and performance
Collaborate with AI researchers to understand infrastructure needs and accelerate development workflows

What they're looking for

Kubernetes architecture and operations at scale
Terraform and Infrastructure-as-Code practices
AWS cloud infrastructure
Slurm or HPC job scheduling systems
Hybrid cloud and on-premise infrastructure management
Platform engineering and DevOps
Networking and storage solutions (CNI, CSI, S3)
Observability and incident response automation
