SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Tiansheng Hu; Yilun Zhao; Canyu Zhang; Arman Cohan; Chen Zhao

arXiv:2602.05975 · cs.IR · February 9, 2026

Open Access

TL;DR

This paper introduces SAGE, a benchmark for scientific literature retrieval. It finds that BM25 outperforms LLM-based retrievers when used as the search tool of a deep research agent, and it proposes a test-time scaling framework to improve retrieval performance.

Contribution

The paper presents SAGE, a benchmark for scientific literature retrieval with 1,200 queries across four scientific domains and a 200,000-paper retrieval corpus, and introduces a test-time scaling framework to improve retrieval accuracy for deep research agents.

Findings

01. With DR Tulu as the agent backbone, BM25 outperforms LLM-based retrievers (ReasonIR and gte-Qwen2-7B-instruct) by approximately 30%.

02. The proposed scaling framework improves retrieval performance by 8% and 2%.

03. All six evaluated deep research agents struggle with reasoning-intensive retrieval.

Abstract

Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000 paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To…
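The abstract attributes BM25's advantage to agents issuing keyword-oriented sub-queries. As a rough illustration of why lexical scoring suits such queries, here is a minimal sketch of Okapi BM25 in plain Python. It is not the paper's implementation: the whitespace tokenizer, the Lucene-style IDF variant, the `k1` and `b` defaults, and the toy documents and query are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, far simpler than a real analyzer.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    k1 and b are commonly used defaults, not values taken from the paper.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    query_terms = tokenize(query)
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Lucene-style IDF, which stays positive for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical toy corpus and a keyword-style sub-query.
docs = [
    "dense retrieval with large language models",
    "bm25 lexical retrieval baseline for scientific papers",
    "protein folding with graph neural networks",
]
query = "bm25 scientific retrieval"
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])
```

A query made of bare keywords rewards exact term overlap, which is what this scoring function measures directly. A document sharing no query term scores zero regardless of how related it is semantically, which is the gap that reasoning-oriented retrievers aim to close.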

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Machine Learning in Materials Science · Topic Modeling