Cognition

Research Engineer, Infrastructure

San Francisco · Full-time · Mid-level · Added 2 days ago

About this role

Join an applied AI lab as a Research Engineer to build and maintain the infrastructure that powers large-scale AI model training and agent development. You'll own distributed training systems, experiment orchestration, and data pipelines that enable researchers to push the boundaries of AI reasoning and autonomy.

What you'll do

  • Design and operate distributed training infrastructure for large-scale GPU clusters with fault tolerance and checkpointing
  • Build systems for running hundreds of thousands of concurrent agent rollouts in virtualized sandbox environments
  • Profile and optimize training throughput across data loading, communication, memory, and compute efficiency
  • Develop experiment orchestration tools and dashboards to streamline researcher workflows
  • Engineer high-throughput, reliable data pipelines for training and evaluation at scale
  • Implement and optimize parallelism strategies (data, tensor, pipeline, sequence) for maximum hardware utilization

What they're looking for

  • Distributed systems and large-scale training infrastructure
  • Python and C++ programming
  • PyTorch or equivalent deep learning framework internals
  • GPU performance profiling and optimization
  • Networking, storage, and hardware-software stack reasoning
  • Debugging complex, non-deterministic distributed systems
  • Machine learning fundamentals and model architecture understanding
  • Parallelism strategies and computational efficiency