Basis Research Institute
ML Systems Engineer, Infrastructure & Cloud
New York Office | Full-time | Mid-level | Added 2 days ago
About this role
Basis seeks an ML Systems Engineer to build and maintain scalable training infrastructure for a nonprofit AI research organization. You'll manage distributed GPU clusters, optimize cloud resources, and ensure reliable, reproducible ML experiments from development through production.
What you'll do
- Own distributed training infrastructure including job launchers, checkpointing, and recovery mechanisms
- Debug and resolve training failures across GPUs, networking, numerics, and data pipelines
- Profile and optimize training performance and resource utilization
- Manage GPU clusters and cloud infrastructure with cost optimization and security best practices
- Build reproducible experiment infrastructure and monitoring systems
- Maintain comprehensive documentation of issues, solutions, and operational lessons learned
What they're looking for
- Distributed training frameworks (PyTorch DDP/FSDP, JAX)
- Cloud administration (AWS/GCP/Azure, Kubernetes, Terraform)
- GPU cluster management and distributed systems debugging
- Mixed precision training, gradient accumulation, and checkpoint/recovery systems
- Understanding of the full ML stack, from hardware to training loops
- Infrastructure as code and CI/CD practices
- Debugging numerical instabilities and convergence problems
- Knowledge of optimization theory and numerical methods