Benchmarking AI Performance on End-to-End Data Science Projects

Evelyn Hughes; Rohan Alexander

arXiv:2602.14284·stat.OT·February 17, 2026

Open Access

TL;DR

This paper introduces a benchmark to evaluate AI models on their ability to generate complete end-to-end data science projects, revealing strengths in structured tasks and limitations in judgment-based tasks.

Contribution

The paper builds a benchmark of 40 end-to-end data science projects with associated rubric evaluations, and an automated grading pipeline that uses them to assess AI performance on full data science workflows rather than on isolated component tasks.

Findings

01 Recent models perform well on structured tasks

02 Significant variation exists in models' ability to handle judgment tasks

03 AI can approximate entry-level data scientists on routine tasks

Abstract

Data science is an integrated workflow of technical, analytical, communication, and ethical skills, but current AI benchmarks focus mostly on constituent parts. We test whether AI models can generate end-to-end data science projects. To do this, we create a benchmark of 40 end-to-end data science projects with associated rubric evaluations. We use these to build an automated grading pipeline that systematically evaluates the data science projects produced by generative AI models. We find that the extent to which generative AI models can complete end-to-end data science projects varies considerably by model. Most recent models did well on structured tasks, but there were considerable differences on tasks that needed judgment. These findings suggest that while AI models could approximate entry-level data scientists on routine tasks, their output requires verification.
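The abstract does not describe the rubric format or how grades are aggregated, so the following is only a minimal sketch of how rubric-based scoring of this kind could be organized. Everything in it is an assumption made for illustration: the `Criterion` class, the criterion names, the point values, and the "structured" versus "judgment" tag, which mirrors the distinction drawn in the findings and is not the authors' schema.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not the paper's schema)."""
    name: str
    max_points: int
    kind: str  # assumed tag: "structured" or "judgment"


def score_project(rubric, awarded):
    """Return the fraction of available points earned, per criterion kind.

    `awarded` maps criterion names to the points a grader gave;
    criteria missing from `awarded` count as zero.
    """
    totals = {}
    for criterion in rubric:
        points = awarded.get(criterion.name, 0)
        if not 0 <= points <= criterion.max_points:
            raise ValueError(f"invalid score for {criterion.name}: {points}")
        earned, possible = totals.get(criterion.kind, (0, 0))
        totals[criterion.kind] = (earned + points, possible + criterion.max_points)
    return {kind: earned / possible for kind, (earned, possible) in totals.items()}


def summarize_model(project_scores):
    """Average per-kind fractions across all of one model's projects."""
    kinds = {kind for scores in project_scores for kind in scores}
    return {
        kind: mean(scores[kind] for scores in project_scores if kind in scores)
        for kind in kinds
    }


# Illustrative rubric and grades; all names and values are made up.
rubric = [
    Criterion("code_runs", 4, "structured"),
    Criterion("citations_present", 2, "structured"),
    Criterion("model_choice_justified", 4, "judgment"),
]
awarded = {"code_runs": 4, "citations_present": 1, "model_choice_justified": 2}
project_result = score_project(rubric, awarded)

model_summary = summarize_model([
    {"structured": 1.0, "judgment": 0.5},
    {"structured": 0.5, "judgment": 0.0},
])
```

Reporting the two kinds separately, rather than as a single total, is one way to surface the pattern the abstract describes, where a model scores well on structured criteria while varying on judgment criteria.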

Taxonomy

Topics: Scientific Computing and Data Management · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)