Deepgram
Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)
USA | Remote | $150k–$220k | Full-time | Mid-level | Added 2 days ago
About this role
Deepgram seeks an experienced Site Reliability Engineer to design and operate hybrid cloud and on-premise infrastructure supporting AI/ML research and product development. You'll build a scalable, self-service platform using Kubernetes, AWS, and Terraform while orchestrating GPU workloads across distributed environments.
What you'll do
- Architect and maintain Kubernetes-based computing platforms across AWS and on-premise data centers
- Develop and manage infrastructure-as-code using Terraform for reproducible, automated environments
- Design AI/ML job scheduling systems integrating Slurm with Kubernetes to optimize GPU resource management
- Provision and maintain on-premise bare metal servers for high-performance GPU computing
- Implement observability, monitoring, and logging solutions to ensure platform health and performance
- Collaborate with AI researchers to understand infrastructure needs and accelerate development workflows
What they're looking for
- Kubernetes architecture and operations at scale
- Terraform and Infrastructure-as-Code practices
- AWS cloud infrastructure
- Slurm or HPC job scheduling systems
- Hybrid cloud and on-premise infrastructure management
- Platform engineering and DevOps
- Networking and storage solutions (CNI, CSI, S3)
- Observability and incident response automation