42dot

AI Infrastructure Engineer

Pangyo (Software Dream Center), South Korea (Remote) | Full-time | Mid-level | Added 2 days ago

About this role

42dot seeks an AI Infrastructure Engineer to manage and orchestrate thousands of GPUs across multiple data centers, ensuring high-performance computing for autonomous driving model training. You'll handle cluster operations, troubleshooting, automation, and resource optimization while supporting ML researchers.

What you'll do

  • Operate and maintain large-scale clusters of thousands of GPUs across data centers using Kubernetes and Slurm
  • Monitor and diagnose hardware and software failures to ensure high availability and rapid recovery
  • Develop automation scripts and tools in Python or Shell to streamline infrastructure management tasks
  • Manage GPU resource quotas and provide technical support to ML researchers
  • Participate in designing and tuning distributed training environments for autonomous driving models

What they're looking for

  • Linux operating systems and kernel operations
  • Docker and Kubernetes containerization and orchestration
  • Network fundamentals (TCP/IP, HTTP/S) and troubleshooting
  • Python or Shell scripting for automation
  • Logical problem-solving in complex systems
  • GPU cluster management
  • Slurm workload management
  • Communication and cross-functional collaboration