Avride

Software Engineer – ML Platform

Austin, TX · Mid-level · Added 2 days ago

About this role

Join Avride's ML Platform team to build infrastructure that powers large-scale ML training for autonomous driving. You'll design and optimize the orchestration, distributed compute, and resource governance systems that enable ML teams to train models efficiently at scale on Kubernetes.

What you'll do

Build and scale the ML compute platform on Kubernetes, using Argo Workflows to orchestrate training and data processing
Design resource governance systems including scheduling, quotas, and policy enforcement across GPU, CPU, memory, and IO
Optimize end-to-end training throughput by improving data access patterns, caching, and removing infrastructure bottlenecks
Partner with ML teams to debug complex workload issues and implement platform-level solutions
Evaluate and integrate open-source tools such as Argo Workflows, Ray, and other Kubernetes ecosystem components

What they're looking for

Python or Go (C++ a plus)
Kubernetes architecture and scheduling
Distributed systems design and implementation
Linux systems debugging and performance optimization
Production service operation and observability
Networking and storage/IO troubleshooting
Argo Workflows, Ray, or similar ML tooling
GPU scheduling and distributed training experience
