xAI

Network Engineer - AI/HPC

Memphis, TN; Palo Alto, CA | mid | Added 2 days ago

About this role

xAI is seeking a Network Engineer experienced in AI and HPC to optimize network performance for large-scale GPU infrastructure. The ideal candidate will work on NCCL, develop performance metrics, and play a key role in enhancing network capabilities.

What you'll do

  • Optimize network performance and availability for AI training and inference workloads
  • Develop metric dashboards to analyze network performance
  • Design backend and frontend networks for new GPU infrastructure
  • Participate in team on-call rotations
  • Support network scaling and maintenance efforts
  • Collaborate closely with team members on projects

What they're looking for

  • 10+ years in large-scale network design and operation
  • 5+ years with Ethernet-based AI/HPC networks
  • Expertise in congestion control on Ethernet
  • Proficient in AI training and inference workload management
  • Experience with NCCL debugging
  • Strong Python programming for automation
  • Ability to create performance metrics portfolios
  • Effective communication skills

Benefits

  • Dynamic and challenging work environment
  • Opportunity for professional growth
  • Flat organizational structure promotes initiative
  • Collaborative team culture
  • Significant travel for capacity building