DABstep: Data Agent Benchmark for Multi-step Reasoning
Alex Egg, Martin Iglesias Goyanes, Friso Kingma, Andreu Mora, Leandro von Werra, Thomas Wolf

TL;DR
DABstep is a new benchmark for testing AI agents on complex, multi-step data analysis tasks using real-world financial data, highlighting current limitations of large language models in this domain.
Contribution
We introduce DABstep, a comprehensive benchmark with over 450 real-world challenges for evaluating multi-step data analysis capabilities of AI agents.
Findings
Leading LLM-based agents achieve only 14.55% accuracy on the hardest tasks.
DABstep's design enables scalable, automatic scoring of complex data analysis tasks.
The benchmark release includes a leaderboard and toolkit to foster further research.
Abstract
We introduce DABstep, a novel benchmark for evaluating AI agents on realistic multi-step data analysis tasks. DABstep comprises over 450 real-world challenges derived from a financial analytics platform, requiring models to combine code-based data processing with contextual reasoning over heterogeneous documentation. Each task demands an iterative, multi-step problem-solving approach, testing capabilities in data manipulation, cross-referencing multiple sources, and precise result reporting. The benchmark provides a factoid-style answer format with automatic correctness checks for objective scoring at scale. We evaluate leading LLM-based agents, revealing a substantial performance gap: even the best agent achieves only 14.55% accuracy on the hardest tasks. We detail our benchmark's design, dataset composition, task formulation, evaluation protocol, report baseline results and analyze…
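The factoid-style answer format mentioned in the abstract is what makes automatic scoring possible: each task has a short reference answer that can be compared programmatically. The following is a minimal sketch of such a check, not the benchmark's actual scorer. The function names (`normalize`, `is_correct`, `accuracy`), the whitespace and case normalization, and the numeric tolerance are all assumptions made for illustration.

```python
import math


def normalize(answer: str) -> str:
    """Lowercase the answer and collapse whitespace (assumed normalization)."""
    return " ".join(answer.strip().lower().split())


def is_correct(predicted: str, expected: str, rel_tol: float = 1e-4) -> bool:
    """Compare a predicted factoid answer against the reference answer.

    Identical normalized strings match. Otherwise, if both parse as numbers,
    they are compared within a relative tolerance (an assumed value).
    """
    pred, gold = normalize(predicted), normalize(expected)
    if pred == gold:
        return True
    try:
        return math.isclose(float(pred), float(gold), rel_tol=rel_tol)
    except ValueError:
        # At least one side is not numeric, and the strings already differ.
        return False


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of tasks answered correctly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Because every answer reduces to a short string or number, a check of this kind can run over hundreds of tasks without human grading, which is the property the abstract describes as objective scoring at scale.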
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications
