DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery

Keyu Li; Mohan Jiang; Dayuan Fu; Yunze Wu; Xiangkun Hu; Dequan Wang; Pengfei Liu

arXiv:2508.06960 · cs.AI · August 12, 2025

TL;DR

This paper introduces DatasetResearch, a benchmark for AI agents to autonomously discover datasets based on user demands, revealing current limitations and guiding future improvements in dataset discovery capabilities.

Contribution

It presents the first comprehensive benchmark for evaluating AI agents' ability to discover datasets, along with a detailed analysis of their strengths and weaknesses.

Findings

01. AI agents achieve only 22% on the benchmark
02. Knowledge tasks are better handled by retrieval-based agents
03. Reasoning tasks are dominated by structured generation agents

Abstract

The rapid advancement of large language models has fundamentally shifted the bottleneck in AI development from computational power to data availability, with countless valuable datasets remaining hidden across specialized repositories, research appendices, and domain platforms. As reasoning capabilities and deep research methodologies continue to evolve, a critical question emerges: can AI agents transcend conventional search to systematically discover any dataset that meets specific user requirements, enabling truly autonomous demand-driven data curation? We introduce DatasetResearch, the first comprehensive benchmark evaluating AI agents' ability to discover and synthesize datasets from 208 real-world demands across knowledge-intensive and reasoning-intensive tasks. Our tri-dimensional evaluation framework reveals a stark reality: even advanced deep research systems achieve only 22%…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. First comprehensive framework targeting automated data discovery, a growing but under-studied problem.
2. Uses gated datasets and reference metadata, preventing leakage and reflecting real research workflows.
3. Combines metadata scoring, few-shot results, and fine-tuning, which is much richer than single-metric evaluation.

Weaknesses

1. OpenAI o3 is used both to generate the reference and discovered metadata and to judge metadata similarity. This creates a closed loop that may reward agreement with o3 rather than true task fit.
2. Starting from gated datasets prevents agents from downloading the ground-truth data. This structurally disadvantages search agents (vs. synthesis) and conflates "access policy" with "discovery ability."
3. Data scope is narrow: NLP-only and text-only.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

If well-justified, dataset discovery would be an interesting direction for LLM agents to explore.

Weaknesses

1. The paper still needs in-depth justification of the motivation for dataset discovery demands. Applying LLM-based agents to new applications is always intriguing, but the paper lacks examples of practical use cases in which human users would rely on dataset discovery agents.
2. The paper claims the dataset discovery agent shows interesting demands related to knowledge-intensive tasks or reasoning-intensive tasks. However, both tasks have mature strategies regarding data

Reviewer 03 · Rating 2 · Confidence 5

Strengths

[**significance**] The authors correctly identify data and data discovery as an important challenge to improving AI models. Efforts aimed at creating benchmarks designed to isolate capabilities useful for automating such challenges are an important endeavor.

Weaknesses

[**clarity**]
- The paper (rightfully) emphasizes the importance of data to further advance AI models. Unfortunately, the problem specification is overly vague, entangling different challenges and data use cases. For example, the abstract mentions “countless valuable datasets [...] and domain platforms” (l13-14), but does not specify whether these are hidden due to access constraints or limitations attributable to search algorithms.
- The word “synthesis” is used several times in the introduction w

Code & Models

Datasets

GAIR/DatasetResearch
dataset· 123 dl

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Big Data and Digital Economy