Basis Research Institute

ML Systems Engineer, Infrastructure & Cloud

New York Office · Full-time · Mid-level · Added 2 days ago

About this role

Basis, a nonprofit AI research organization, seeks an ML Systems Engineer to build and maintain scalable training infrastructure. You'll manage distributed GPU clusters, optimize cloud resources, and ensure reliable, reproducible ML experiments from development through production.

What you'll do

  • Own distributed training infrastructure including job launchers, checkpointing, and recovery mechanisms
  • Debug and resolve training failures across GPUs, networking, numerics, and data pipelines
  • Profile and optimize training performance and resource utilization
  • Manage GPU clusters and cloud infrastructure with cost optimization and security best practices
  • Build reproducible experiment infrastructure and monitoring systems
  • Maintain comprehensive documentation of issues, solutions, and operational lessons learned

What they're looking for

  • Distributed training frameworks (PyTorch DDP/FSDP, JAX)
  • Cloud administration (AWS/GCP/Azure, Kubernetes, Terraform)
  • GPU cluster management and distributed systems debugging
  • Mixed precision training, gradient accumulation, and checkpoint/recovery systems
  • Full ML stack understanding (hardware to training loops)
  • Infrastructure as code and CI/CD practices
  • Debugging numerical instabilities and convergence problems
  • Knowledge of optimization theory and numerical methods