Benchmarking Data Science Agents

Yuge Zhang; Qiyang Jiang; Xingyu Han; Nan Chen; Yuqing Yang; Kan Ren

arXiv:2402.17168 · cs.AI · February 28, 2024 · 1 citation

TL;DR

This paper introduces DSEval, a comprehensive benchmarking framework for evaluating data science agents, particularly LLMs, across the entire data science lifecycle, addressing practical challenges and improving assessment methods.

Contribution

The paper presents DSEval, a new evaluation paradigm with innovative benchmarks and a bootstrapped annotation method for assessing data science agents' performance.

Findings

1. Identifies key obstacles faced by data science agents.
2. Provides insights to guide future improvements in data science automation.
3. Enhances benchmarking coverage and evaluation accuracy.

Abstract

In the era of data-driven decision-making, the complexity of data analysis necessitates advanced data science expertise and tools, presenting significant challenges even for specialists. Large Language Models (LLMs) have emerged as promising data science agents, assisting humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and complicated analytical processes. In this paper, we introduce DSEval, a novel evaluation paradigm, as well as a series of innovative benchmarks tailored for assessing the performance of these agents throughout the entire data science lifecycle. Incorporating a novel bootstrapped annotation method, we streamline dataset preparation, improve the evaluation coverage, and expand benchmarking comprehensiveness. Our findings uncover prevalent obstacles and provide critical…

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

metacopilot/dseval (official)

Taxonomy

Topics: Big Data and Business Intelligence