Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Ming Zhang; Jiabao Zhuang; Wenqing Jing; Kexin Tan; Ziyu Kong; Jingyi Deng; Yujiong Shen; Yuhui Wang; Zhenghao Xiang; Qiyuan Peng; Yuhang Zhao; Ning Luo; Renzhe Zheng; Jiahui Lin; Mingqi Wu; Long Ma; Shihan Dou; Maxm Pan; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2601.12369·cs.CL·May 20, 2026

TL;DR

This paper introduces TaxoBench, a benchmark of 72 highly cited LLM surveys for evaluating how well deep research agents retrieve papers and organize them into expert-like taxonomies. It finds that current agents and LLMs fall well short of experts on both tasks.

Contribution

The paper presents TaxoBench, a benchmark with two new metrics for hierarchical organization (US-TED/US-NTED and Sem-Path), and evaluates current agents and LLMs in two modes, Deep Research and Bottom-Up, highlighting key limitations in retrieval and organization.

Findings

01 Best agent retrieves only 20.92% of papers

02 Model taxonomies show 75.9% sibling overlap

03 LLMs achieve 28-29% Sem-Path, below human levels

Abstract

Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via two new metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement…
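To illustrate the retrieval side of the evaluation described in the abstract, here is a minimal sketch of set-based Precision/Recall/F1. It assumes papers are compared by exact identifier match. The function name `retrieval_scores` and the example identifiers are hypothetical, and TaxoBench's actual matching procedure may differ.

```python
from typing import Iterable, Tuple


def retrieval_scores(retrieved: Iterable[str], expert: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for a retrieved paper set against an expert set.

    Assumption: papers are matched by exact identifier equality.
    """
    retrieved_set, expert_set = set(retrieved), set(expert)
    hits = len(retrieved_set & expert_set)  # papers found in both sets

    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(expert_set) if expert_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical identifiers: the agent retrieves 4 papers, 2 of which
# appear among the 5 papers cited by the expert survey.
precision, recall, f1 = retrieval_scores(
    ["p1", "p2", "p3", "p4"],
    ["p1", "p2", "p7", "p8", "p9"],
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this reading, a recall of 20.92% would mean that roughly one in five of the papers cited by the expert survey appears in the agent's retrieved set.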

Code & Models

Repositories

KongLongGeFDU/TaxoBench
github

Datasets

konglongge/TaxoBench
dataset· 58 dl
