SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, Chen Zhao

TL;DR
This paper introduces SAGE, a benchmark for scientific literature retrieval, shows that BM25 outperforms LLM-based retrievers by approximately 30% when used as the search tool of a deep research agent, and proposes a test-time scaling framework to improve retrieval performance.
Contribution
The paper presents SAGE, a scientific literature retrieval benchmark with 1,200 queries across four scientific domains and a 200,000-paper retrieval corpus, and introduces a test-time scaling framework to improve retrieval accuracy for deep research agents.
Findings
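With DR Tulu as the backbone, BM25 outperforms LLM-based retrievers (ReasonIR and gte-Qwen2-7B-instruct) by approximately 30%, because existing agents generate keyword-oriented sub-queries.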
The proposed scaling framework improves retrieval performance by 8% and 2%.
All six evaluated deep research agents struggle with reasoning-intensive retrieval tasks.
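To make the BM25 finding above more concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the paper's implementation: the toy corpus, the whitespace tokenizer, the parameter values k1=1.5 and b=0.75, and the smoothed IDF variant are all assumptions chosen for illustration. It shows only that a keyword-style query is scored purely by term overlap, which is the kind of query the abstract says existing agents tend to generate.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


class BM25:
    """Toy Okapi BM25 scorer; k1 and b are common defaults, not values from the paper."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avgdl = sum(self.doc_len) / len(self.doc_len)
        self.tf = [Counter(t) for t in self.doc_tokens]

        # Document frequency: number of documents containing each term.
        df = Counter()
        for tokens in self.doc_tokens:
            df.update(set(tokens))

        # Smoothed IDF (the "+1 inside the log" variant, which stays positive).
        n = len(docs)
        self.idf = {
            term: math.log((n - f + 0.5) / (f + 0.5) + 1.0)
            for term, f in df.items()
        }

    def score(self, query, idx):
        # Sum the per-term BM25 contributions for document `idx`.
        tf = self.tf[idx]
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[idx] / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue  # terms absent from the document contribute nothing
            total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def rank(self, query):
        # Return document indices ordered from highest to lowest score.
        scores = [self.score(query, i) for i in range(len(self.doc_tokens))]
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical three-document corpus, for illustration only.
docs = [
    "graph neural networks for molecular property prediction",
    "dense retrieval with contrastive learning for open-domain question answering",
    "protein structure prediction with deep learning",
]
bm25 = BM25(docs)

# A keyword-oriented sub-query of the kind an agent might issue.
ranking = bm25.rank("molecular property prediction graph")
print(ranking)
```

In this toy setup the document sharing the most query terms ranks first, and a document with no overlapping terms scores exactly zero. That behavior suits keyword-oriented sub-queries and offers nothing for queries that need reasoning beyond lexical overlap.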
Abstract
Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000 paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Machine Learning in Materials Science · Topic Modeling
