DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu

TL;DR
DA-Code is a challenging new benchmark for evaluating large language models on complex, real-world data science tasks that require advanced coding, grounding, and planning skills.
Contribution
The paper introduces DA-Code, a novel benchmark for agent-based data science code generation, with diverse real data tasks and a new baseline model.
Findings
Even the best current LLMs achieve only 30.5% accuracy on DA-Code
DA-Code covers complex data wrangling and analytics tasks
Benchmark is scalable and aligned with real-world scenarios
Abstract
We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We…
Taxonomy
Topics: Multi-Agent Systems and Negotiation
Methods: Sparse Evolutionary Training
