DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

Tongzhou Wu; Yuhao Wang; Xinyu Ma; Xiuqiang He; Shuaiqiang Wang; Dawei Yin; Xiangyu Zhao

arXiv:2603.01152·cs.AI·March 3, 2026

TL;DR

This paper introduces DeepResearch-9K, a large-scale challenging dataset for deep-research agents, along with an open-source training framework DeepResearch-R1, to advance multi-step web exploration and question answering capabilities.

Contribution

The paper presents a novel large-scale dataset and an open-source training framework specifically designed for deep-research agents, addressing key bottlenecks in data and tools.

Findings

01 Agents trained on DeepResearch-9K achieve state-of-the-art results.

02 DeepResearch-R1 supports multi-turn web interactions and various reinforcement learning approaches.

03 The dataset includes high-quality search trajectories and verifiable answers.

Abstract

Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale, challenging dataset specifically designed for deep-research scenarios and built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9000 questions spanning three difficulty levels from L1 to L3, (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers.…
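To make the three components named in the abstract concrete, here is a minimal sketch of how one dataset record could be represented and how a "verifiable answer" could be checked. Everything in it is an assumption for illustration: the names `Step`, `Record`, `normalize`, `is_correct`, and `count_by_level`, the field layout, and the normalized exact-match check are hypothetical and are not taken from the paper or from the released data format.

```python
import re
import string
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    """One hypothetical turn of a search trajectory."""
    thought: str      # reasoning text produced before acting
    query: str        # search query issued to the web tool
    observation: str  # text returned by the tool


@dataclass
class Record:
    """Hypothetical layout of one dataset entry (not the released schema)."""
    question: str
    level: str        # difficulty label: "L1", "L2", or "L3"
    answer: str       # reference answer used for verification
    trajectory: list[Step] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_correct(prediction: str, record: Record) -> bool:
    """Assumed verification rule: normalized exact match against the reference."""
    return normalize(prediction) == normalize(record.answer)


def count_by_level(records: list[Record]) -> Counter:
    """Count how many records fall under each difficulty label."""
    return Counter(r.level for r in records)
```

A record built as `Record("Which landmark ...?", "L2", "The Eiffel Tower")` would accept the prediction `"eiffel tower."` under this rule. Exact match is only one possible notion of verifiability; a model-based judge or a numeric tolerance check would fit the same interface.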

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems