DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis
Qiaohong Zhang, Weihao Ye, Jialong Chen, Yi Luo, BoYuan Li, Bowen Deng, Zibin Zheng, Jianhao Lin, Wei-Shi Zheng, Chuan Chen

TL;DR
DataClawBench is a benchmark for evaluating autonomous financial data analysis agents on real-world, noisy data with limited prior guidance. It highlights the challenges these agents face in exploratory reasoning.
Contribution
Introduces DataClawBench, a benchmark of approximately 2.06 million real-world financial records and 492 cross-domain tasks, with failure diagnostics for assessing exploratory data analysis agents.
Findings
Eight advanced LLMs struggle with exploration in financial data analysis.
More exploration does not always lead to better task outcomes.
Exploratory analysis can undermine agent reliability.
Abstract
Autonomous data analysis agents are increasingly expected to conduct exploratory analysis over underexplored data environments. This burden is especially salient in complex financial analytics, where relevant evidence is rarely pre-specified. However, existing benchmarks typically evaluate such agents in prior-guided settings, providing selected data sources, explicit data schemas, or cleaned data, thereby understating the exploratory burden. We introduce DataClawBench, a benchmark for exploratory real-world financial data analysis under limited prior guidance. DataClawBench contains approximately 2.06 million real-world records across enterprise, industry, and policy domains, with native data noise preserved. It further includes 492 cross-domain tasks derived from think-tank consulting scenarios, each annotated with intermediate milestones that diagnose exploration and reasoning…
