Periodic Labs
Distributed Training Engineer
Menlo Park, Remote | Full-time | Mid-level | Added 2 days ago
About this role
Periodic Labs is seeking a Distributed Training Engineer to enhance and manage large-scale distributed LLM training systems for scientific research. The ideal candidate will collaborate with researchers and contribute to open-source frameworks while optimizing mid-training workflows.
What you'll do
- Optimize and operate large-scale distributed LLM training systems
- Collaborate with researchers on debugging and maintaining workflows
- Support frontier-scale experiments in AI and science
- Contribute to open-source large-scale LLM training frameworks
- Develop tools for distributed training operations
What they're looking for
- Experience with clusters of ≥5,000 GPUs
- Experience with 5D parallel LLM training
- Familiarity with distributed training frameworks
- Proficiency in optimizing training throughput
- Knowledge of Mixture-of-Experts models
Benefits
- Well-funded and rapidly growing company
- Opportunity for ownership and problem-solving
- Exposure to new tools and scientific advancements