42dot
AI Infrastructure Engineer
Pangyo (Software Dream Center), South Korea (Remote) | Full-time | Mid-level | Added 2 days ago
About this role
42dot seeks an AI Infrastructure Engineer to manage and orchestrate thousands of GPUs across multiple data centers, ensuring high-performance computing for autonomous driving model training. You'll handle cluster operations, troubleshooting, automation, and resource optimization while supporting ML researchers.
What you'll do
- Operate and maintain large-scale GPU clusters with thousands of GPUs across data centers using Kubernetes and Slurm
- Monitor and diagnose hardware and software failures to ensure high availability and rapid recovery
- Develop automation scripts and tools in Python or Shell to streamline infrastructure management tasks
- Manage GPU resource quotas and provide technical support to ML researchers
- Participate in designing and tuning distributed training environments for autonomous driving models
What they're looking for
- Linux operating systems and kernel operations
- Docker and Kubernetes containerization and orchestration
- Network fundamentals (TCP/IP, HTTP/S) and troubleshooting
- Python or Shell scripting for automation
- Logical problem-solving in complex systems
- GPU cluster management
- Slurm workload management
- Communication and cross-functional collaboration