Exa Labs
Software Engineer, Distributed Data Systems
San Francisco, California | $180k–$350k | Full-time | Mid-level | Added 2 days ago
About this role
Exa, a venture-backed AI search company, is seeking a Software Engineer to design and build large-scale distributed data infrastructure supporting web crawling, embedding model training, and vector database retrieval. You'll architect data systems handling hundreds of petabytes while maintaining high reliability and performance.
What you'll do
- Design lakehouse architectures to handle massive web crawl datasets (100+ PB scale)
- Build and operate distributed data processing pipelines for billions of daily documents
- Architect data layers for embedding training infrastructure using Ray and similar frameworks
- Develop streaming pipelines for real-time indexing and search
- Scale analytical query systems (ClickHouse) across petabyte-scale search logs
- Ensure system reliability and operational excellence
What they're looking for
- Lakehouse architecture (Delta Lake, Iceberg, Hudi)
- Distributed data processing (Spark, Ray, ClickHouse)
- Streaming systems (Kafka, Flink)
- Large-scale infrastructure design and operations
- Vector database and storage formats (bonus: Lance)
- GPU-accelerated data processing (bonus: RAPIDS, cuDF)
- Systems reliability and observability
- Web-scale data engineering
Benefits
- Premium healthcare (medical, dental, vision)
- Fertility benefits
- 16 weeks fully paid parental leave
- Monthly wellness stipend
- Visa sponsorship available (STEM OPT, H1B, O1, E3)
- In-person role in San Francisco