DataSciBench: An LLM Agent Benchmark for Data Science
Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue

TL;DR
DataSciBench is a new comprehensive benchmark designed to evaluate LLMs in data science tasks using challenging prompts, a semi-automated ground truth generation pipeline, and a novel Task-Function-Code framework, revealing model strengths and weaknesses.
Contribution
The paper introduces DataSciBench, a comprehensive data science benchmark with a semi-automated ground truth (GT) pipeline and a Task-Function-Code (TFC) framework, enabling rigorous evaluation of diverse LLMs.
Findings
API-based models outperform open-source models on all metrics
Deepseek-Coder-33B-Instruct achieves the top score among open-source models
The benchmark reveals specific strengths and weaknesses of different LLMs in data science tasks
Abstract
This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task-Function-Code (TFC)…
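The self-consistency step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `self_consistent_answer`, the `min_agreement` threshold, and the rule of routing low-agreement cases to human review are all assumptions made for illustration. The idea is to sample several candidate outputs for one prompt, take the most common one as a provisional ground truth, and flag the prompt for human verification when the candidates disagree too much.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def self_consistent_answer(
    candidates: Sequence[Hashable],
    min_agreement: float = 0.6,  # hypothetical threshold, not from the paper
) -> Tuple[Optional[Hashable], float, bool]:
    """Majority vote over sampled outputs.

    Returns (answer, agreement, needs_human_review), where agreement is
    the fraction of candidates that match the winning answer.
    """
    if not candidates:
        # Nothing to vote on: always defer to a human.
        return None, 0.0, True
    counts = Counter(candidates)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(candidates)
    # Assumed policy: low agreement means the provisional GT is unreliable.
    return answer, agreement, agreement < min_agreement


if __name__ == "__main__":
    # Five hypothetical model outputs for the same prompt.
    samples = ["42", "42", "41", "42", "42"]
    answer, agreement, needs_review = self_consistent_answer(samples)
    print(answer, agreement, needs_review)
```

Majority voting only works when outputs can be compared for equality, so a real pipeline would likely need to normalize outputs (for example, rounding numbers or canonicalizing tables) before counting them.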
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. This work presents a new benchmark dataset for evaluating LLMs on data science tasks, which is a meaningful contribution to the community. 2. The benchmark covers representative task types in data science, from data processing to data mining and report generation. 3. The evaluation setup includes popular open-source and proprietary LLMs.
While this work has the potential to contribute a valuable benchmark to the community, several key issues need to be addressed: 1. The semi-automated pipeline uses a self-consistency strategy to generate ground truth for a portion of the tasks. However, detail on further quality control is lacking. Also, I think the difficulty and authenticity of model-generated tasks are questionable. 2. DataSciBench employs instance-specific evaluation scripts that are both generated and verified by LLMs. The
• This paper is timely, as there have been considerable discussions about current evaluations becoming overly simplistic for modern LLMs. • The study is fairly comprehensive, featuring a large evaluation body over various data science tasks, testing across six APIs, eight open generation models, and nine open-source code generation models. • A new benchmark is appreciated, especially when well-motivated. Some readers may find the new insights from Section 5.1/5.4 valuable. It is indeed rather
• The primary motivation behind this paper is the observation that existing research often relies on easily obtainable ground truths and straightforward evaluation metrics for LLMs' data science capabilities. The authors surmise that existing benchmarks are lacking as they focus on “narrower tasks” and “with easy to obtain ground truth and straightforward evaluation metrics” (line 045-051). But the examples given, e.g., MLAgentBench and SWE-Bench, do not seem to be particularly “narrow”. Also, eas
1. This paper presents DataSciBench, a comprehensive benchmark for assessing large language models (LLMs) in data science applications. I looked at several questions in the attached zip file. The questions are indeed complex enough. Figure 5 / Table 3 provide evidence for data contamination risks and correlation with LiveCodeBench and BigCodeBench. 2. The authors propose a semi-automated Task-Function-Code (TFC) framework to generate ground truth and obtain evaluation metrics for each subtask
1. It's good to see such a comprehensive benchmark for data science released, but collecting existing prompts from BigCodeBench or LLM-synthesized instructions seems somewhat trivial to me. Essentially, what's the biggest difference between DataSciBench and previous code benchmarks for data science? 2. The ground truths were generated by LLMs via self-consistency, which might contain false positive ground truths. 3. The experimental analysis part concludes the overall performance (closed-
1. **Comprehensive Experiments**: The design of DataSciBench is comprehensive, encompassing multiple facets of data science tasks with varied complexity levels and multiple open- and closed-source models. 2. **Empirical Evaluation**: The semi-automated evaluation approach provides a unified and granular evaluation.
1. **Limited Significance**: While DataSciBench claims to assess data science abilities, the paper does not provide enough evidence that the chosen tasks reflect realistic data science challenges. Real-world data science often requires domain knowledge, iterative hypothesis testing, and adaptability to complex, often messy datasets. In contrast, the tasks presented here appear to lack such depth, instead focusing on simpler, predefined tasks that may not mirror the complexity of real data scienc
None.
1. The writing of this manuscript is not clear and extremely hard to follow. For example, it is unclear to me what the tasks are, how many samples there are in the benchmark, how the TFC works, etc. The authors may consider rewriting the manuscript and adding some examples of the samples for better comprehension. 2. The benchmark does not seem novel. There exist many "data science" or coding-related benchmarks for LLMs. The authors claim that previous studies "focusing on single tasks, simplist
Taxonomy
Topics: Data Mining Algorithms and Applications
Methods: Sparse Evolutionary Training
