OpenAI
Software Engineer, Data Infrastructure - Research
San Francisco | Full-time | Mid-level
About this role
OpenAI seeks a Software Engineer to build dataset infrastructure for large-scale LLM training. You'll design standardized dataset APIs, optimize data loading pipelines across thousands of GPUs, and collaborate with research teams to ensure efficient, reproducible data handling for multimodal and traditional datasets.
What you'll do
- Design and maintain standardized dataset APIs that support multimodal data too large to fit in memory
- Develop testing and validation pipelines for dataset loading at GPU scale
- Integrate datasets into training and inference pipelines with a seamless user experience
- Debug and resolve performance bottlenecks in distributed dataset loading systems
- Create visualization and inspection tools to identify dataset errors and issues
- Establish safeguards to ensure dataset reproducibility and consistency
What they're looking for
- Distributed systems design and implementation
- Data pipeline architecture and optimization
- API design and scalable abstractions
- Large-scale fleet debugging and troubleshooting
- Infrastructure reliability and performance optimization
- Multimodal data handling
- Probability or distributed data theory (bonus)
- GPU-scale system experience (bonus)