DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems
Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang

TL;DR
DSAEval is a benchmark for evaluating data science agents on diverse real-world problems. It emphasizes multimodal perception, iterative interaction, and holistic assessment, and its results reveal current strengths and weaknesses in AI-driven data science automation.
Contribution
Introduces DSAEval, a large-scale benchmark with multimodal, multi-query, and multi-dimensional evaluation features for assessing data science agents on real-world tasks.
Findings
Multimodal perception improves vision task performance by up to 11.30%.
Claude-Sonnet-4.5 achieves the best overall performance.
Current agents excel on structured data but struggle in unstructured domains.
Abstract
Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and…
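The Multi-Query Interactions feature described above can be illustrated with a minimal sketch in which each query to an agent receives the accumulated history of earlier queries and answers. This is only an illustration of the cumulative-context idea, not DSAEval's actual harness: `Session`, `Agent`, `echo_agent`, and `run_problem` are hypothetical names introduced here, and the stand-in agent simply reports how much prior context it was given.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types for illustration; none of these names come from DSAEval.
Turn = Tuple[str, str]                      # (query, answer)
Agent = Callable[[str, List[Turn]], str]    # agent(query, history) -> answer


@dataclass
class Session:
    """Holds the cumulative history of one multi-query problem."""
    history: List[Turn] = field(default_factory=list)

    def ask(self, agent: Agent, query: str) -> str:
        # The agent sees a copy of all earlier (query, answer) pairs.
        answer = agent(query, list(self.history))
        self.history.append((query, answer))
        return answer


def echo_agent(query: str, history: List[Turn]) -> str:
    # Stand-in agent: reports how many earlier turns it received.
    return f"{query} (after {len(history)} earlier turns)"


def run_problem(agent: Agent, queries: List[str]) -> List[str]:
    # Run every query of one problem in order within a single session.
    session = Session()
    return [session.ask(agent, q) for q in queries]


if __name__ == "__main__":
    for line in run_problem(echo_agent, ["load data", "fit model", "plot residuals"]):
        print(line)
```

Keeping the history in one session object, rather than resetting it per query, is what makes later queries depend on earlier work, which is the property the abstract attributes to real-world data science projects.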
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science · Multimodal Machine Learning Applications
