Avride

Software Engineer – ML Platform

Austin, TX · mid · Added 2 days ago

About this role

Join Avride's ML Platform team to build infrastructure that powers large-scale ML training for autonomous driving. You'll design and optimize the orchestration, distributed compute, and resource governance systems that enable ML teams to train models efficiently at scale on Kubernetes.

What you'll do

  • Build and scale the ML compute platform on Kubernetes, using Argo Workflows to orchestrate training and data processing
  • Design resource governance systems including scheduling, quotas, and policy enforcement across GPU, CPU, memory, and IO
  • Optimize end-to-end training throughput by improving data access patterns, caching, and removing infrastructure bottlenecks
  • Partner with ML teams to debug complex workload issues and implement platform-level solutions
  • Evaluate and integrate open-source tools like Argo Workflows, Ray, and Kubernetes ecosystem components

What they're looking for

  • Python or Go (C++ a plus)
  • Kubernetes architecture and scheduling
  • Distributed systems design and implementation
  • Linux systems debugging and performance optimization
  • Production service operation and observability
  • Networking and storage/IO troubleshooting
  • Argo Workflows, Ray, or similar ML tooling
  • GPU scheduling and distributed training experience