Basis Research Institute

Data Engineer, Platform

New York Office | Full-time | Mid-level | Added 2 days ago

About this role

Basis, a nonprofit AI research organization, seeks a Data Engineer to build reliable data pipelines with strong provenance tracking and quality controls. This platform role involves curating datasets for ML training, managing cross-project data infrastructure, and supporting both commercial products and internal research with trustworthy data systems.

What you'll do

  • Design and build data pipelines with comprehensive provenance, lineage tracking, and quality gates
  • Curate documented datasets for model training, evaluation, and experimentation
  • Develop data quality frameworks and governance systems for internal and external use
  • Support scaling of data infrastructure to handle medium-scale models and multiple teams
  • Coordinate shared datasets across Platform and Research teams to prevent duplication
  • Ensure data systems enable reproducible ML experiments and research outcomes

What they're looking for

  • Expert SQL and Python for data processing
  • Distributed computing frameworks (Spark, Dask)
  • Workflow orchestration tools (Airflow, Dagster, Prefect)
  • Cloud data platforms (Snowflake, BigQuery, Redshift, S3)
  • ML data requirements and feature engineering
  • Data quality, validation, and governance implementation
  • Data modeling and schema design optimization
  • Batch and streaming systems (Kafka, Kinesis, Flink)