AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
Lei Xiong, Kun Luo, Ziyi Xia, Wenbo Zhang, Jin-Ge Yao, Zheng Liu, Jingying Shao, Jianlyu Chen, Hongjin Qian, Xi Yang, Qian Yu, Hao Li, Chen Yue, Xiaan Du, Yuyang Wang, Yesheng Liu, Haiyu Xu, Zhicheng Dou

TL;DR
AutoResearchBench is a new benchmark designed to evaluate AI agents' ability to autonomously discover and collect scientific literature, emphasizing in-depth understanding and open-ended search tasks.
Contribution
The paper introduces AutoResearchBench, a research-oriented, literature-focused benchmark with two complex tasks, and provides the dataset, evaluation pipeline, and code for future research.
Findings
State-of-the-art LLMs achieve only around 9.4% accuracy on Deep Research.
Most baseline models perform below 5% on the benchmark.
The benchmark is considerably harder than previous agentic web-browsing benchmarks.
Abstract
Autonomous scientific research has advanced significantly thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge on a research problem or to acquire evidence for verifying assumptions and supporting claims. To assess AI agents' capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through a progressive, multi-step probing process, and (2) Wide Research, which requires comprehensively collecting a set of papers satisfying given conditions. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented,…
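To make the two task types concrete, here is a minimal scoring sketch. It is not the benchmark's evaluation pipeline: the function names, the ID normalization, and the choice of intersection-over-union as the set-overlap score for Wide Research are all assumptions made for illustration. Only the use of accuracy for Deep Research comes from the findings listed above.

```python
from typing import Optional, Sequence, Set


def normalize(paper_id: str) -> str:
    """Hypothetical ID normalization: trim whitespace and lowercase."""
    return paper_id.strip().lower()


def deep_research_accuracy(
    predicted: Sequence[Optional[str]], targets: Sequence[str]
) -> float:
    """Fraction of questions where the agent returned the single target paper.

    A prediction of None (agent gave up) counts as a miss.
    """
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1
        for p, t in zip(predicted, targets)
        if p is not None and normalize(p) == normalize(t)
    )
    return hits / len(targets)


def wide_research_iou(predicted: Set[str], relevant: Set[str]) -> float:
    """Assumed set-overlap score: |predicted & relevant| / |predicted | relevant|.

    Penalizes both missing papers and spurious ones.
    """
    pred = {normalize(p) for p in predicted}
    gold = {normalize(g) for g in relevant}
    union = pred | gold
    if not union:
        return 1.0  # both sets empty: treat as perfect agreement
    return len(pred & gold) / len(union)
```

Under these assumptions, a Deep Research run is scored per question as hit or miss, while a Wide Research run is scored per question as a value between 0 and 1 that drops when the agent either omits qualifying papers or returns papers that do not satisfy the conditions.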
