BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Zijian Chen; Xueguang Ma; Shengyao Zhuang; Ping Nie; Kai Zou; Andrew Liu; Joshua Green; Kshama Patel; Ruoxi Meng; Mingyi Su; Sahel Sharifymoghaddam; Yanxi Li; Haoran Hong; Xinyu Shi; Xuye Liu; Nandan Thakur; Crystina Zhang; Luyu Gao; Wenhu Chen; Jimmy Lin

arXiv:2508.06600·cs.CL·August 12, 2025

TL;DR

BrowseComp-Plus introduces a controlled, fair, and transparent benchmark for evaluating deep research agents by using a fixed corpus and human-verified documents, enabling better comparison and analysis of system components.

Contribution

The paper presents BrowseComp-Plus, a new benchmark with curated data for fair, reproducible evaluation of deep research agents, addressing the limitations of existing benchmarks that depend on live web search APIs.

Findings

01

BrowseComp-Plus effectively distinguishes the performance of different deep research systems.

02

Open-source Search-R1 with BM25 achieves 3.86% accuracy.

03

GPT-5 with Qwen3-Embedding-8B retriever reaches 70.1% accuracy.

Abstract

Deep-Research agents, which integrate large language models (LLMs) with search tools, have proven effective at handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp, which rely on black-box live web search APIs, have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; and (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, current evaluations may compare complete deep research systems at a given point in time, but they do not support well-controlled experiments that provide insight into the capability of the underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from…
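The idea of holding the corpus fixed so that retriever contributions can be isolated can be illustrated with a small sketch. This is not the BrowseComp-Plus harness: the corpus, queries, both retrievers, the stub agent, and the substring-based grading rule below are all hypothetical stand-ins chosen to keep the example self-contained. The point it shows is only that, when every system searches the same documents, a difference in accuracy can be attributed to the retriever being swapped.

```python
import re
from typing import Callable, Dict, List

# Hypothetical fixed corpus: every system under test searches these same documents.
CORPUS: Dict[str, str] = {
    "d1": "The Eiffel Tower is located in Paris and was completed in 1889.",
    "d2": "Mount Everest is the highest mountain above sea level.",
    "d3": "Paris is the capital city of France.",
}

# Hypothetical queries with gold answers.
QUERIES = [
    {"query": "When was the Eiffel Tower completed?", "answer": "1889"},
    {"query": "What is the capital of France?", "answer": "Paris"},
]

# A retriever maps (query, k) to a ranked list of document ids.
Retriever = Callable[[str, int], List[str]]


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_retriever(query: str, k: int) -> List[str]:
    """Rank documents by the number of distinct terms shared with the query."""
    q_terms = set(tokenize(query))
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: (-len(q_terms & set(tokenize(CORPUS[doc_id]))), doc_id),
    )
    return ranked[:k]


def fixed_order_retriever(query: str, k: int) -> List[str]:
    """Weak baseline that ignores the query and returns ids in reverse order."""
    return sorted(CORPUS, reverse=True)[:k]


def stub_agent(query: str, retrieve: Retriever) -> str:
    """Stand-in for an LLM agent: it simply returns the top-1 document text."""
    top_ids = retrieve(query, 1)
    return CORPUS[top_ids[0]]


def evaluate(retrieve: Retriever) -> float:
    """Accuracy of the stub agent, graded by case-insensitive substring match."""
    correct = 0
    for item in QUERIES:
        response = stub_agent(item["query"], retrieve)
        if item["answer"].lower() in response.lower():
            correct += 1
    return correct / len(QUERIES)


# Same agent, same corpus, same queries: only the retriever changes.
results = {
    "overlap": evaluate(overlap_retriever),
    "fixed_order": evaluate(fixed_order_retriever),
}
for name, accuracy in results.items():
    print(f"{name}: accuracy = {accuracy:.2f}")
```

In a realistic setting the stub agent would be replaced by an LLM that issues multiple search calls, and the substring check by a proper answer judge; the controlled comparison itself keeps the same shape.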

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Machine Learning in Materials Science