Anthropic
Research Engineer, Pretraining Scaling
San Francisco, CA | From $850k | mid | Added 2 days ago
About this role
Anthropic seeks a Research Engineer to own critical aspects of production pretraining pipelines, balancing deep technical work on model training systems with operational responsibilities during launches. You'll debug complex full-stack issues, optimize training efficiency, and collaborate across teams to ensure frontier models train reliably at scale.
What you'll do
- Own the production pretraining pipeline, including model operations, performance optimization, observability, and reliability
- Debug and resolve issues across hardware, networking, training dynamics, and evaluation infrastructure
- Design and run experiments to improve training efficiency, reduce step time, and enhance model performance
- Respond to on-call incidents during model launches with rapid diagnosis and cross-team coordination
- Build and maintain production logging, monitoring dashboards, and evaluation infrastructure
- Add new capabilities to the training codebase, such as long-context support or novel architectures
What they're looking for
- Large-scale machine learning systems and distributed training
- JAX, PyTorch, or equivalent ML frameworks at scale, and accelerators such as TPUs
- Full-stack debugging across hardware, networking, and software layers
- Production ML systems and observability tools
- Experimental design and systems optimization
- Clear communication and cross-team collaboration
- LLM pretraining experience
- Systems engineering or operational excellence background
Benefits
- Hands-on experience with some of the largest training runs in the industry
- Work alongside world-class researchers and engineers at a mission-driven company
- Unique learning opportunities and institutional knowledge building
- 5 days per week in-office at San Francisco headquarters
- Involvement in work directly shaping safe and beneficial AI systems