AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

Lei Xiong; Kun Luo; Ziyi Xia; Wenbo Zhang; Jin-Ge Yao; Zheng Liu; Jingying Shao; Jianlyu Chen; Hongjin Qian; Xi Yang; Qian Yu; Hao Li; Chen Yue; Xiaan Du; Yuyang Wang; Yesheng Liu; Haiyu Xu; Zhicheng Dou

arXiv:2604.25256 · cs.AI · April 29, 2026

TL;DR

AutoResearchBench is a new benchmark designed to evaluate AI agents' ability to autonomously discover and collect scientific literature, emphasizing in-depth understanding and open-ended search tasks.

Contribution

The paper introduces AutoResearchBench, a research-oriented, literature-focused benchmark with two complex tasks, and provides the dataset, evaluation pipeline, and code for future research.

Findings

01 State-of-the-art LLMs achieve only around 9.4% accuracy on Deep Research.

02 Most baseline models score below 5% on the benchmark.

03 The benchmark is highly challenging compared to previous web-browsing benchmarks.

Abstract

The development of AI agents has significantly advanced autonomous scientific research. One key step in this process is finding the right scientific literature, whether to explore existing knowledge about a research problem or to acquire evidence for verifying assumptions and supporting claims. To assess AI agents' capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through a progressive, multi-step probing process, and (2) Wide Research, which requires comprehensively collecting a set of papers satisfying given conditions. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented,…
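
The two task types imply different scoring shapes: Deep Research has a single target paper per query, while Wide Research has a set of qualifying papers. The sketch below illustrates one way such outputs could be scored. It is not the benchmark's evaluation pipeline. The function names, the identifier normalization, and the use of intersection-over-union for Wide Research are all assumptions made for illustration; this listing only states that Deep Research is reported as accuracy.

```python
from typing import Iterable, Optional


def normalize(paper_id: str) -> str:
    """Hypothetical normalization: trim whitespace and lowercase an identifier."""
    return paper_id.strip().lower()


def deep_research_accuracy(
    predictions: list[Optional[str]], targets: list[str]
) -> float:
    """Fraction of queries whose single predicted paper matches the target.

    A prediction of None (the agent returned nothing) counts as a miss.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1
        for pred, gold in zip(predictions, targets)
        if pred is not None and normalize(pred) == normalize(gold)
    )
    return hits / len(targets)


def wide_research_iou(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Intersection-over-union between predicted and gold paper sets.

    ASSUMPTION: set overlap is used here only as a plausible metric for a
    "collect every matching paper" task; it penalizes both missed papers
    and spurious ones.
    """
    pred_set = {normalize(p) for p in predicted}
    gold_set = {normalize(g) for g in gold}
    union = pred_set | gold_set
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(pred_set & gold_set) / len(union)
```

With these definitions, an agent that finds one of three Deep Research targets scores 1/3, and a Wide Research answer of {a, b} against a gold set of {b, c} also scores 1/3, since one paper is shared out of three distinct papers.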

Code & Models

Repositories

CherYou/AutoResearchBench
github

Datasets

Lk123/AutoResearchBench
dataset · 170 dl