DataSciBench: An LLM Agent Benchmark for Data Science

Dan Zhang; Sining Zhoubian; Min Cai; Fengzu Li; Lekang Yang; Wei Wang; Tianjiao Dong; Ziniu Hu; Jie Tang; Yisong Yue

arXiv:2502.13897·cs.CL·February 20, 2025

TL;DR

DataSciBench is a new comprehensive benchmark designed to evaluate LLMs in data science tasks using challenging prompts, a semi-automated ground truth generation pipeline, and a novel Task-Function-Code framework, revealing model strengths and weaknesses.

Contribution

The paper introduces DataSciBench, a comprehensive data science benchmark with a semi-automated GT pipeline and TFC framework, enabling rigorous evaluation of diverse LLMs.

Findings

1. API-based models outperform open-source models on all metrics
2. Deepseek-Coder-33B-Instruct achieves top score among open-source models
3. The benchmark reveals specific strengths and weaknesses of different LLMs in data science tasks

Abstract

This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task-Function-Code (TFC)…
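The self-consistency step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `self_consistent_answer`, the `min_agreement` threshold, and the sample outputs are all hypothetical, chosen only to show how majority agreement among sampled outputs might yield a candidate ground truth, with non-consensus cases left for human verification.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def self_consistent_answer(
    candidates: Sequence[Hashable], min_agreement: float = 0.6
) -> Optional[Hashable]:
    """Return the majority output if enough samples agree, else None.

    Hypothetical helper: the threshold and the exact-match comparison
    are assumptions, not details taken from the paper.
    """
    if not candidates:
        return None
    # Most frequent output and the number of samples that produced it.
    answer, votes = Counter(candidates).most_common(1)[0]
    if votes / len(candidates) >= min_agreement:
        return answer
    # No consensus: in a semi-automated setup this case would go to a human.
    return None


# Made-up outputs from five sampled programs for one prompt.
outputs = ["0.83", "0.83", "0.83", "0.79", "0.83"]
consensus = self_consistent_answer(outputs)
```

Here four of the five samples agree, so the shared value is accepted as a candidate ground truth. Returning `None` rather than the plurality answer keeps low-agreement prompts out of the automatic path, which is one plausible way to limit false-positive ground truths.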

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01·Rating 3·Confidence 4

Strengths

1. This work presents a new benchmark dataset for evaluating LLMs on data science tasks, which is a meaningful contribution to the community.
2. The benchmark covers representative task types in data science, from data processing to data mining and report generation.
3. The evaluation setup includes popular open-source and proprietary LLMs.

Weaknesses

While this work has the potential to contribute a valuable benchmark to the community, several key issues need to be addressed:
1. The semi-automated pipeline uses a self-consistency strategy to generate ground truth for a portion of the tasks. However, detail on further quality control is lacking. Also, I think the difficulty and authenticity of model-generated tasks are questionable.
2. DataSciBench employs instance-specific evaluation scripts that are both generated and verified by LLMs. The

Reviewer 02·Rating 3·Confidence 4

Strengths

• This paper is timely, as there have been considerable discussions about current evaluations becoming overly simplistic for modern LLMs.
• The study is fairly comprehensive, featuring a large evaluation body over various data science tasks, testing across six APIs, eight open generation models, and nine open-source code generation models.
• A new benchmark is appreciated, especially when well-motivated. Some readers may find the new insights from Section 5.1/5.4 valuable. It is indeed rather

Weaknesses

• The primary motivation behind this paper is the observation that existing research often relies on easily obtainable ground truths and straightforward evaluation metrics for LLMs’ data science capabilities. The authors surmise that existing benchmarks are lacking as they focus on “narrower tasks” and “with easy to obtain ground truth and straightforward evaluation metrics” (lines 045-051). But the examples given, e.g., MLAgentBench and SWE-Bench, do not seem to be particularly “narrow”. Also, eas

Reviewer 03·Rating 6·Confidence 4

Strengths

1. This paper presents DataSciBench, a comprehensive benchmark for assessing large language models (LLMs) in data science applications. I looked at several questions in the attached zip file. The questions are indeed complex enough. Figure 5 / Table 3 provides evidence for data contamination risks and correlation with LiveCodeBench and BigCodeBench.
2. The authors propose a semi-automated Task-Function-Code (TFC) framework to generate ground truth and obtain evaluation metrics for each subtask

Weaknesses

1. It's good to see such a comprehensive benchmark for data science released, but collecting existing prompts in BigCodeBench or LLM-synthesized instructions seems somewhat trivial to me. Essentially, what's the biggest difference between DataSciBench and previous code benchmarks for data science?
2. The ground truths were generated by LLMs via self-consistency, which might contain false positive ground truths.
3. The experimental analysis part concludes the overall performance (closed-

Reviewer 04·Rating 3·Confidence 4

Strengths

1. **Comprehensive Experiments**: The design of DataSciBench is comprehensive, encompassing multiple facets of data science tasks with varied complexity levels and multiple open- and closed-source models.
2. **Empirical Evaluation**: The semi-automated evaluation approach provides a unified and granular evaluation.

Weaknesses

1. **Limited Significance**: While DataSciBench claims to assess data science abilities, the paper does not provide enough evidence that the chosen tasks reflect realistic data science challenges. Real-world data science often requires domain knowledge, iterative hypothesis testing, and adaptability to complex, often messy datasets. In contrast, the tasks presented here appear to lack such depth, instead focusing on simpler, predefined tasks that may not mirror the complexity of real data scienc

Reviewer 05·Rating 1·Confidence 5

Strengths

None.

Weaknesses

1. The writing of this manuscript is not clear and extremely hard to follow. For example, it is unclear to me what the tasks are, how many samples there are in the benchmark, and how the TFC works, etc. The authors may consider rewriting the manuscript and adding some examples of the samples for better comprehension.
2. The benchmark does not seem novel. There exist many "data science" or coding-related benchmarks for LLMs. The authors claim that previous studies "focusing on single tasks, simplist

Code & Models

Repositories

thudm/datascibench·Official

Taxonomy

Topics: Data Mining Algorithms and Applications

Methods: Sparse Evolutionary Training