Dyna Robotics
ML Infrastructure Engineer, Training
Redwood City, CA | $220k–$320k | Full-time | Mid-level | Added 2 days ago
About this role
Dyna Robotics seeks an ML Infrastructure Engineer to design and operate the training systems powering its embodied AI foundation models. You'll architect GPU cluster infrastructure, optimize researcher workflows, handle massive multimodal datasets, and deploy low-latency inference pipelines for real-time robot control.
What you'll do
- Architect and scale distributed training infrastructure for large GPU clusters, applying memory optimization techniques
- Build job scheduling and research codebase systems to enable fast iteration and failure recovery
- Design high-throughput data pipelines for multimodal robot data (video, proprioception, 3D signals)
- Develop production inference pipelines with quantization, distillation, and model compilation for real-time robot control
- Profile and optimize GPU utilization, I/O bottlenecks, and memory fragmentation across the compute fleet
- Own training infrastructure end-to-end to maximize GPU efficiency and reproducibility
What they're looking for
- PyTorch and distributed training frameworks (DeepSpeed, Accelerate)
- Cloud GPU environments (GCP/AWS) and Kubernetes orchestration
- Distributed systems and inter-node communication (NCCL)
- Memory management and mixed precision training
- High-performance computing (HPC) systems design
- Low-latency inference optimization (TensorRT, Triton)
- Systems profiling and performance optimization
- Multimodal model architecture (bonus)