
Institute of Foundation Models

Machine Learning Infrastructure Engineer

Sunnyvale, CA · $150k–$450k · Full-time · Mid-level · Added 2 days ago

About this role

Join a research institute focused on foundation models to build and scale distributed ML training infrastructure. You'll extend frameworks like DeepSpeed and FSDP, implement distributed optimizers, and develop robust systems for multi-GPU cluster training alongside world-class researchers.

What you'll do

  • Extend and modify distributed training frameworks to support new architectures and use cases
  • Implement distributed optimizers from mathematical specifications
  • Design and debug multi-node launch configurations with flexible parallelism strategies
  • Build experiment tracking, metrics logging, and job monitoring systems
  • Write production-quality infrastructure code with comprehensive testing
  • Improve training system reliability, performance, and maintainability at scale

What they're looking for

  • Distributed ML frameworks (DeepSpeed, FSDP, FairScale, Horovod)
  • Python and strong software engineering fundamentals
  • Multi-node cluster orchestration (Slurm, Kubernetes, Ray)
  • Distributed debugging (NCCL, Gloo)
  • PyTorch or JAX
  • GPU/systems performance optimization
  • ML systems and infrastructure design
  • Large-scale distributed training experience