xAI
Network Engineer - AI/HPC
Memphis, TN; Palo Alto, CA | mid | Added 2 days ago
About this role
xAI is seeking a Network Engineer experienced in AI and HPC to optimize network performance for large-scale GPU infrastructure. The ideal candidate will work on NCCL, develop performance metrics, and play a key role in enhancing network capabilities.
What you'll do
- Optimize network performance and availability for AI training and inference workloads
- Develop metric dashboards to analyze network performance
- Design backend and frontend networks for new GPU infrastructure
- Participate in team on-call rotations
- Support network scaling and maintenance efforts
- Collaborate closely with team members on projects
What they're looking for
- 10+ years in large-scale network design and operation
- 5+ years in Ethernet AI/HPC
- Expertise in congestion control on Ethernet
- Proficient in AI training and inference workload management
- Experience with NCCL debugging
- Strong Python programming for automation
- Ability to create performance metrics portfolios
- Effective communication skills
Benefits
- Dynamic and challenging work environment
- Opportunity for professional growth
- Flat organizational structure promotes initiative
- Collaborative team culture
- Significant travel for capacity building