Cognition

Research Engineer, Infrastructure

San Francisco | Full-time | Mid-level | Added 2 days ago

About this role

Join an applied AI lab as a Research Engineer to build and maintain the infrastructure that powers large-scale AI model training and agent development. You'll own distributed training systems, experiment orchestration, and data pipelines that enable researchers to push the boundaries of AI reasoning and autonomy.

What you'll do

Design and operate fault-tolerant distributed training infrastructure, including checkpointing, for large-scale GPU clusters
Build systems for running hundreds of thousands of concurrent agent rollouts in virtualized sandbox environments
Profile and optimize training throughput across data loading, communication, memory, and compute efficiency
Develop experiment orchestration tools and dashboards to streamline researcher workflows
Engineer high-throughput, reliable data pipelines for training and evaluation at scale
Implement and optimize parallelism strategies (data, tensor, pipeline, sequence) for maximum hardware utilization

What they're looking for

Distributed systems and large-scale training infrastructure
Python and C++ programming
PyTorch or equivalent deep learning framework internals
GPU performance profiling and optimization
Networking, storage, and hardware-software stack reasoning
Debugging complex, non-deterministic distributed systems
Machine learning fundamentals and model architecture understanding
Parallelism strategies and computational efficiency
