Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

Wei Liu; Peijie Yu; Michele Orini; Yali Du; Yulan He

arXiv:2602.02039·cs.AI·May 19, 2026

TL;DR

This paper introduces Deep Data Research (DDR), an open-ended task in which large language models autonomously extract key insights from databases, and DDR-Bench, a checklist-based benchmark for evaluating it. Results show that long-horizon exploration remains challenging even for frontier models.

Contribution

It presents the DDR task and the DDR-Bench benchmark for assessing investigatory intelligence in LLMs, and argues that this capability depends on a model's intrinsic strategies, not only on agent scaffolding or scaling.

Findings

01. Frontier models show emerging agency but struggle with long-horizon exploration.

02. Effective investigatory intelligence relies on intrinsic strategies, not just scaling.

03. DDR-Bench enables verifiable evaluation of autonomous data analysis.

Abstract

The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on…
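To make the idea of checklist-based, verifiable evaluation concrete, the following is a minimal sketch, not the DDR-Bench implementation. It assumes a hypothetical schema in which each checklist item lists keywords that a reported insight must mention, and it scores an agent by the fraction of items covered. The names `ChecklistItem`, `is_covered`, and `checklist_score`, the keyword-matching rule, and the sample data are all illustrative assumptions; the actual benchmark may judge coverage quite differently.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One fact an investigation is expected to surface (hypothetical schema)."""
    description: str
    keywords: tuple[str, ...]  # every keyword must appear for the item to count


def is_covered(item: ChecklistItem, insights: list[str]) -> bool:
    """Return True if any single insight mentions every keyword of the item."""
    for insight in insights:
        text = insight.lower()
        if all(keyword.lower() in text for keyword in item.keywords):
            return True
    return False


def checklist_score(checklist: list[ChecklistItem], insights: list[str]) -> float:
    """Fraction of checklist items covered by the agent's reported insights."""
    if not checklist:
        return 0.0
    covered = sum(is_covered(item, insights) for item in checklist)
    return covered / len(checklist)


# Illustrative data only; not drawn from the paper or its datasets.
checklist = [
    ChecklistItem("Revenue fell in Q3", ("revenue", "q3")),
    ChecklistItem("Churn is highest among trial users", ("churn", "trial")),
]
insights = [
    "Revenue dropped sharply in Q3 compared with Q2.",
    "Most orders ship within two days.",
]
score = checklist_score(checklist, insights)
print(score)
```

Keyword matching is used here only because it keeps the sketch self-contained and deterministic; a real evaluation would likely need a more robust way to decide whether an insight satisfies a checklist item.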

Code & Models

Repositories

thinkwee/DDR_Bench (GitHub)

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Big Data and Digital Economy