Beyond Retrieval: A Modular Benchmark for Academic Deep Research Agents

Zhihan Guo; Feiyang Xu; Yifan Li; Muzhi Li; Shuai Zou; Jiele Wu; Han Shi; Haoli Bai; Ho-fung Leung; Irwin King

arXiv:2512.00986 · cs.CL · February 2, 2026

Open Access

TL;DR

This paper introduces ADRA-Bank, a modular benchmark and evaluation paradigm for academic deep research agents, addressing gaps in existing benchmarks by focusing on high-level planning, reasoning, and academic domain specificity.

Contribution

The paper presents a new human-annotated dataset and a modular evaluation framework tailored to academic research agents, emphasizing planning, retrieval, and reasoning capabilities.

Findings

01 Agents excel in specialized tasks but struggle with multi-source retrieval.

02 High-level planning is key to unlocking reasoning in foundational LLMs.

03 The benchmark exposes actionable failure modes for future improvements.

Abstract

A surge in academic publications calls for automated deep research (DR) systems, but accurately evaluating them is still an open problem. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the academic domains that are the core application for DR agents. To address these gaps, we introduce ADRA-Bank, a modular benchmark for Academic DR Agents. Grounded in academic literature, our benchmark is a human-annotated dataset of 200 instances across 10 academic domains, including both research and review papers. Furthermore, we propose a modular Evaluation Paradigm for Academic DR Agents (ADRA-Eval), which leverages the rich structure of academic papers to assess the core capabilities of planning, retrieval, and reasoning. It employs two complementary modes: an end-to-end…

Taxonomy

Topics: AI-based Problem Solving and Planning · Multi-Agent Systems and Negotiation · Topic Modeling