OpenAI

Software Engineer, Data Infrastructure - Research

San Francisco · Full-time · Mid-level

About this role

OpenAI seeks a Software Engineer to build dataset infrastructure for large-scale LLM training. You'll design standardized dataset APIs, optimize data loading pipelines across thousands of GPUs, and collaborate with research teams to ensure efficient, reproducible data handling for multimodal and traditional datasets.

What you'll do

  • Design and maintain standardized dataset APIs that support multimodal data too large to fit in memory
  • Develop testing and validation pipelines for dataset loading at GPU scale
  • Integrate datasets into training and inference pipelines with a seamless user experience
  • Debug and resolve performance bottlenecks in distributed dataset loading systems
  • Create visualization and inspection tools to identify dataset errors and issues
  • Establish safeguards to ensure dataset reproducibility and consistency

What they're looking for

  • Distributed systems design and implementation
  • Data pipeline architecture and optimization
  • API design and scalable abstractions
  • Large-scale fleet debugging and troubleshooting
  • Infrastructure reliability and performance optimization
  • Multimodal data handling
  • Probability or distributed data theory (bonus)
  • GPU-scale system experience (bonus)