Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
Wei Liu, Peijie Yu, Michele Orini, Yali Du, Yulan He

TL;DR
This paper introduces Deep Data Research (DDR), an open-ended task in which large language models autonomously investigate databases, together with DDR-Bench, a checklist-based benchmark for evaluating it. Results show that long-horizon exploration remains challenging.
Contribution
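The paper presents DDR, an open-ended data-analysis task, and DDR-Bench, a large-scale, checklist-based benchmark for assessing investigatory intelligence in LLMs. Its analysis indicates that this capability depends on intrinsic strategies, not only on scaling or agent scaffolding.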
Findings
Frontier models show emerging agency but struggle with long-horizon exploration.
Effective investigatory intelligence relies on intrinsic strategies, not just scaling.
DDR-Bench enables verifiable evaluation of autonomous data analysis.
Abstract
The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on…
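As a rough illustration of what "checklist-based" verifiable evaluation could look like, the sketch below scores an agent's reported insights by the fraction of checklist items they cover. This is a toy under stated assumptions, not DDR-Bench's method: the `ChecklistItem` structure, the function names, the sample data, and the keyword matcher are all hypothetical, and the keyword matcher merely stands in for whatever judging procedure the benchmark actually uses, which this excerpt does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One expected fact. Hypothetical structure, not DDR-Bench's schema."""
    item_id: str
    keywords: tuple[str, ...]  # all keywords must appear for the item to count


def is_covered(item: ChecklistItem, insights: list[str]) -> bool:
    """Toy judge: covered if any single insight contains every keyword."""
    return any(
        all(kw.lower() in insight.lower() for kw in item.keywords)
        for insight in insights
    )


def checklist_score(checklist: list[ChecklistItem], insights: list[str]) -> float:
    """Fraction of checklist items covered by the agent's reported insights."""
    if not checklist:
        return 0.0
    covered = sum(is_covered(item, insights) for item in checklist)
    return covered / len(checklist)


# Invented example data, for illustration only.
checklist = [
    ChecklistItem("c1", ("revenue", "declined")),
    ChecklistItem("c2", ("churn", "increased")),
]
insights = [
    "Quarterly revenue declined after the pricing change.",
    "Support tickets were stable.",
]
score = checklist_score(checklist, insights)  # one of two items is covered
```

Because the checklist is fixed ahead of time, the same set of insights always yields the same score, which is the sense in which such an evaluation is verifiable even though the task itself is open-ended.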
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Big Data and Digital Economy
