Benchmarking AI Performance on End-to-End Data Science Projects
Evelyn Hughes, Rohan Alexander

TL;DR
This paper introduces a benchmark to evaluate AI models on their ability to generate complete end-to-end data science projects, revealing strengths in structured tasks and limitations in judgment-based tasks.
Contribution
It builds a benchmark of 40 end-to-end data science projects with associated rubric evaluations, plus an automated grading pipeline that assesses AI-generated projects across the full data science workflow rather than its constituent parts.
Findings
Most recent models perform well on structured tasks
Models differ considerably on tasks that require judgment
AI models could approximate entry-level data scientists on routine tasks, but their output requires verification
Abstract
Data science is an integrated workflow of technical, analytical, communication, and ethical skills, but current AI benchmarks focus mostly on constituent parts. We test whether AI models can generate end-to-end data science projects. To do this we create a benchmark of 40 end-to-end data science projects with associated rubric evaluations. We use these to build an automated grading pipeline that systematically evaluates the data science projects produced by generative AI models. We find the extent to which generative AI models can complete end-to-end data science projects varies considerably by model. Most recent models did well on structured tasks, but there were considerable differences on tasks that needed judgment. These findings suggest that while AI models could approximate entry-level data scientists on routine tasks, they require verification.
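The abstract describes rubric evaluations feeding an automated grading pipeline, but gives no implementation details. The following is only a minimal sketch of how rubric-based scoring of this kind might be aggregated. All names here (RubricItem, project_score, score_by_kind, the "structured" and "judgment" labels, and the example rubric items and points) are hypothetical illustrations, not the authors' rubric, data, or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One rubric criterion. Fields are assumptions made for illustration."""
    name: str
    max_points: int
    kind: str  # hypothetical label: "structured" or "judgment"


def project_score(rubric, awarded):
    """Return the fraction of available rubric points earned for one project.

    `awarded` maps a rubric item name to the points a grader assigned.
    Missing items count as zero; values are clamped to [0, max_points].
    """
    total = sum(item.max_points for item in rubric)
    earned = sum(
        max(0, min(awarded.get(item.name, 0), item.max_points))
        for item in rubric
    )
    return earned / total


def score_by_kind(rubric, awarded):
    """Split the score by item kind, to compare structured and judgment tasks."""
    kinds = sorted({item.kind for item in rubric})
    return {
        kind: project_score([i for i in rubric if i.kind == kind], awarded)
        for kind in kinds
    }


# Invented example rubric and grades, used only to exercise the functions.
rubric = [
    RubricItem("reproducible_code", 4, "structured"),
    RubricItem("citations", 2, "structured"),
    RubricItem("limitations_discussed", 4, "judgment"),
]
awarded = {"reproducible_code": 4, "citations": 2, "limitations_discussed": 1}

overall = project_score(rubric, awarded)
by_kind = score_by_kind(rubric, awarded)
```

Splitting the score by item kind is one way a pipeline could surface the pattern the abstract reports, where a model does well on structured criteria but less well on criteria that need judgment. How the paper itself groups or weights rubric items is not stated here.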
Taxonomy
Topics: Scientific Computing and Data Management · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
