Basis Research Institute

ML Systems Engineer, Infrastructure & Cloud

New York Office | Full-time | Mid-level | Added 2 days ago

About this role

Basis seeks an ML Systems Engineer to build and maintain scalable training infrastructure for a nonprofit AI research organization. You'll manage distributed GPU clusters, optimize cloud resources, and ensure reliable, reproducible ML experiments from development through production.

What you'll do

Own distributed training infrastructure including job launchers, checkpointing, and recovery mechanisms
Debug and resolve training failures across GPUs, networking, numerics, and data pipelines
Profile and optimize training performance and resource utilization
Manage GPU clusters and cloud infrastructure with cost optimization and security best practices
Build reproducible experiment infrastructure and monitoring systems
Maintain comprehensive documentation of issues, solutions, and operational lessons learned

What they're looking for

Distributed training frameworks (PyTorch DDP/FSDP, JAX)
Cloud administration (AWS/GCP/Azure, Kubernetes, Terraform)
GPU cluster management and distributed systems debugging
Mixed precision training, gradient accumulation, and checkpoint/recovery systems
Understanding of the full ML stack, from hardware to training loops
Infrastructure as code and CI/CD practices
Debugging numerical instabilities and convergence problems
Knowledge of optimization theory and numerical methods
