Basis Research Institute
Data Engineer, Platform
New York Office | Full-time | Mid-level | Added 2 days ago
About this role
Basis, a nonprofit AI research organization, seeks a Data Engineer to build reliable data pipelines with strong provenance tracking and quality controls. This platform role involves curating datasets for ML training, managing cross-project data infrastructure, and enabling both commercial products and internal research with trustworthy data systems.
What you'll do
- Design and build data pipelines with comprehensive provenance, lineage tracking, and quality gates
- Curate documented datasets for model training, evaluation, and experimentation
- Develop data quality frameworks and governance systems for internal and external use
- Support scaling of data infrastructure to handle medium-scale models and multiple teams
- Coordinate shared datasets across Platform and Research teams to prevent duplication
- Ensure data systems enable reproducible ML experiments and research outcomes
What they're looking for
- Expert SQL and Python for data processing
- Distributed computing frameworks (Spark, Dask)
- Workflow orchestration tools (Airflow, Dagster, Prefect)
- Cloud data platforms (Snowflake, BigQuery, Redshift, S3)
- ML data requirements and feature engineering
- Data quality, validation, and governance implementation
- Data modeling and schema design optimization
- Batch and streaming systems (Kafka, Kinesis, Flink)