DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis

Qiaohong Zhang; Weihao Ye; Jialong Chen; Yi Luo; BoYuan Li; Bowen Deng; Zibin Zheng; Jianhao Lin; Wei-Shi Zheng; Chuan Chen

arXiv:2605.02503 · cs.AI · May 19, 2026

TL;DR

DataClawBench is a new benchmark for evaluating autonomous financial data analysis agents in real-world, noisy environments with limited prior guidance, highlighting challenges in exploratory reasoning.

Contribution

Introduces DataClawBench, a large-scale, real-world financial data benchmark with diverse tasks and failure diagnostics for assessing exploratory data analysis agents.

Findings

01. Eight advanced LLMs struggle with exploration in financial data analysis.

02. More exploration does not always lead to better task outcomes.

03. Exploratory analysis can undermine agent reliability.

Abstract

Autonomous data analysis agents are increasingly expected to conduct exploratory analysis over underexplored data environments. This burden is especially salient in complex financial analytics, where relevant evidence is rarely pre-specified. However, existing benchmarks typically evaluate such agents in prior-guided settings, providing selected data sources, explicit data schemas, or cleaned data, thereby understating the exploratory burden. We introduce DataClawBench, a benchmark for exploratory real-world financial data analysis under limited prior guidance. DataClawBench contains approximately 2.06 million real-world records across enterprise, industry, and policy domains, with native data noise preserved. It further includes 492 cross-domain tasks derived from think-tank consulting scenarios, each annotated with intermediate milestones that diagnose exploration and reasoning…

Code & Models

Datasets

GTML-LAB/DataClaw
dataset · 1.9k downloads
