DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Maojun Sun; Yifei Xie; Yue Wu; Ruijian Han; Binyan Jiang; Defeng Sun; Yancheng Yuan; Jian Huang

arXiv:2601.13591 · cs.AI · January 21, 2026

Open Access

TL;DR

DSAEval is a comprehensive benchmark for evaluating data science agents on diverse real-world problems, emphasizing multimodal perception, iterative interactions, and holistic assessment, revealing current strengths and challenges in AI-driven data science automation.

Contribution

Introduces DSAEval, a large-scale benchmark with multimodal, multi-query, and multi-dimensional evaluation features for assessing data science agents on real-world tasks.

Findings

01. Multimodal perception improves vision task performance by up to 11.30%.

02. Claude-Sonnet-4.5 achieves the best overall performance.

03. Current agents excel in structured data but struggle with unstructured domains.

Abstract

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and…
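The Multi-Query Interactions feature described in the abstract can be pictured with a minimal sketch, assuming that each query in a session is answered with access to all earlier queries and answers. Every name here (`Session`, `Agent`, `echo_agent`, `run_session`) is hypothetical and chosen for illustration; none is taken from the DSAEval paper or its code, and the stand-in agent does no real data science.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type: an agent maps (query, prior turns) to an answer.
# This is an illustrative assumption, not the DSAEval interface.
Agent = Callable[[str, List[Tuple[str, str]]], str]


@dataclass
class Session:
    """Holds the cumulative history for one sequence of related queries."""
    history: List[Tuple[str, str]] = field(default_factory=list)

    def ask(self, agent: Agent, query: str) -> str:
        # The agent receives a copy of every earlier (query, answer) pair,
        # so later queries can build on earlier results.
        answer = agent(query, list(self.history))
        self.history.append((query, answer))
        return answer


def echo_agent(query: str, history: List[Tuple[str, str]]) -> str:
    """Stand-in agent: reports how many earlier turns it could see."""
    return f"{query} (seen {len(history)} earlier turns)"


def run_session(agent: Agent, queries: List[str]) -> List[str]:
    """Run the queries in order within a single shared session."""
    session = Session()
    return [session.ask(agent, q) for q in queries]


if __name__ == "__main__":
    for line in run_session(echo_agent, ["load data", "fit model", "plot residuals"]):
        print(line)
```

The point of the sketch is only the shared history: a real agent would replace `echo_agent`, and the history could carry code outputs or figures rather than plain strings.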

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science · Multimodal Machine Learning Applications