Exa Labs

Software Engineer, Distributed Data Systems

San Francisco, California · $180k–$350k · Full-time · Mid-level · Added 2 days ago

About this role

Exa, a venture-backed AI search company, is seeking a Software Engineer to design and build large-scale distributed data infrastructure supporting web crawling, embedding model training, and vector database retrieval. You'll architect data systems handling hundreds of petabytes while maintaining high reliability and performance.

What you'll do

  • Design lakehouse architectures to handle massive web crawl datasets (100+ PB scale)
  • Build and operate distributed data processing pipelines for billions of daily documents
  • Architect data layers for embedding training infrastructure using Ray and similar frameworks
  • Develop streaming pipelines for real-time indexing and search
  • Scale analytical query systems (ClickHouse) across petabyte-scale search logs
  • Ensure system reliability and operational excellence

What they're looking for

  • Lakehouse architecture (Delta Lake, Iceberg, Hudi)
  • Distributed data processing (Spark, Ray, ClickHouse)
  • Streaming systems (Kafka, Flink)
  • Large-scale infrastructure design and operations
  • Vector databases and storage formats (bonus: Lance)
  • GPU-accelerated data processing (bonus: RAPIDS, cuDF)
  • Systems reliability and observability
  • Web-scale data engineering

Benefits

  • Premium healthcare (medical, dental, vision)
  • Fertility benefits
  • 16 weeks fully paid parental leave
  • Monthly wellness stipend
  • Visa sponsorship available (STEM OPT, H1B, O1, E3)
  • In-person role in San Francisco