Institute of Foundation Models
Machine Learning Infrastructure Engineer
Sunnyvale, CA · $150k–$450k · full-time · mid-level · Added 2 days ago
About this role
Join a research institute focused on foundation models to build and scale distributed ML training infrastructure. You'll extend frameworks like DeepSpeed and FSDP, implement distributed optimizers, and develop robust systems for multi-GPU cluster training alongside world-class researchers.
What you'll do
- Extend and modify distributed training frameworks to support new architectures and use cases
- Implement distributed optimizers from mathematical specifications
- Design and debug multi-node launch configurations with flexible parallelism strategies
- Build experiment tracking, metrics logging, and job monitoring systems
- Write production-quality infrastructure code with comprehensive testing
- Improve training system reliability, performance, and maintainability at scale
What they're looking for
- Distributed ML frameworks (DeepSpeed, FSDP, FairScale, Horovod)
- Python and strong software engineering fundamentals
- Multi-node cluster orchestration (Slurm, Kubernetes, Ray)
- Distributed debugging (NCCL, Gloo)
- PyTorch or JAX
- GPU/systems performance optimization
- ML systems and infrastructure design
- Large-scale distributed training experience