Anthropic

Research Engineer, Pretraining Scaling

San Francisco, CA | From $850k | mid | Added 2 days ago

About this role

Anthropic seeks a Research Engineer to own critical aspects of production pretraining pipelines, balancing deep technical work on model training systems with operational responsibilities during launches. You'll debug complex full-stack issues, optimize training efficiency, and collaborate across teams to ensure frontier models train reliably at scale.

What you'll do

Own the production pretraining pipeline, including model operations, performance optimization, observability, and reliability
Debug and resolve issues across hardware, networking, training dynamics, and evaluation infrastructure
Design and run experiments to improve training efficiency, reduce step time, and enhance model performance
Respond to on-call incidents during model launches with rapid diagnosis and cross-team coordination
Build and maintain production logging, monitoring dashboards, and evaluation infrastructure
Add new capabilities to the training codebase, such as long-context support or novel architectures

What they're looking for

Large-scale machine learning systems and distributed training
JAX, PyTorch, or equivalent ML frameworks at scale, including on TPUs
Full-stack debugging across hardware, networking, and software layers
Production ML systems and observability tools
Experimental design and systems optimization
Clear communication and cross-team collaboration
LLM pretraining experience
Systems engineering or operational excellence background

Benefits

Hands-on experience with some of the largest training runs in the industry
Work alongside world-class researchers and engineers at a mission-driven company
Unique learning opportunities and institutional knowledge building
5 days per week in-office at San Francisco headquarters
Involvement in work directly shaping safe and beneficial AI systems
