Dyna Robotics

ML Infrastructure Engineer, Training

Redwood City, CA · $220k–$320k · Full-time · Mid-level · Added 2 days ago

About this role

Dyna Robotics seeks an ML Infrastructure Engineer to design and operate the training systems powering its embodied AI foundation models. You'll architect GPU cluster infrastructure, optimize researcher workflows, handle massive multimodal datasets, and deploy low-latency inference pipelines for real-time robot control.

What you'll do

Architect and scale distributed training infrastructure for large GPU clusters, applying memory optimization techniques
Build job scheduling and research codebase systems to enable fast iteration and failure recovery
Design high-throughput data pipelines for multimodal robot data (video, proprioception, 3D signals)
Develop production inference pipelines with quantization, distillation, and model compilation for real-time robot control
Profile and optimize GPU utilization, I/O bottlenecks, and memory fragmentation across the compute fleet
Own training infrastructure end-to-end to maximize GPU efficiency and reproducibility

What they're looking for

PyTorch and distributed training frameworks (DeepSpeed, Accelerate)
Cloud GPU environments (GCP/AWS) and Kubernetes orchestration
Distributed systems and inter-node communication (NCCL)
Memory management and mixed precision training
High-performance computing (HPC) systems design
Low-latency inference optimization (TensorRT, Triton)
Systems profiling and performance optimization
Multimodal model architecture (bonus)
