MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou

TL;DR
MM-BrowseComp introduces a new benchmark with 224 multimodal questions to evaluate AI agents' retrieval and reasoning abilities involving images and videos, exposing current models' limitations.
Contribution
The paper presents MM-BrowseComp, a comprehensive multimodal browsing benchmark with detailed analysis tools, addressing the gap in existing text-focused benchmarks.
Findings
Top models achieve only 29.02% accuracy on the benchmark.
Current models lack robust multimodal reasoning capabilities.
The benchmark reveals significant room for improvement in multimodal AI.
Abstract
AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The proposed benchmark is constructed through multiple rigorous verification phases.
* The experimental section systematically compares a wide range of state-of-the-art closed- and open-source models, offering a clear view of current limitations and performance gaps.

* The tasks in this benchmark are often intentionally complex and involve multi-hop reasoning, which may not accurately reflect the typical multimodal search behaviors encountered in real-world web browsing scenarios.
* The heavily hand-crafted nature of the benchmark may limit real-world generalizability.

- The data construction process ensures questions require multimodal browsing, effectively eliminating text shortcuts.
- Queries in the dataset go through rigorous difficulty-based filtering.
- The human-verified checklist of minimal fine-grained reasoning steps provides a valuable signal, letting evaluation go beyond just right/wrong final answers.

- A human baseline is missing to calibrate what model accuracy means. It would provide an estimate for the performance ceiling of this task.
- In 3.1.1, the authors assert that essential information to solve a task should not appear in any text source. However, there is no mention of how this verification is done.
- Although the authors repeatedly refer to “video-dependent” tasks, the paper never specifies how models are expected to engage with videos. Are agents intended to interact with vide

1. Clarity: The paper is readable and well-structured, with intuitive examples and a comprehensive task taxonomy/mixture. Construction principles and validation steps are communicated with sufficient detail.
2. Significance: Addresses a timely need: deep web browsing with native multimodality, which is central for real-world assistants. The results and analyses (e.g., modality-specific performance, test-time scaling, error taxonomy) are likely to shape evaluation practices and agent design.

1. Scale: 224 instances is on the small side for a general-purpose benchmark spanning 22 subtasks; per-subtask sample sizes are too thin for robust statistics. Consider releasing a larger dev/test split or staged expansions, and report confidence intervals (e.g., bootstrap over items) in the main text. The dataset probably won't be very meaningful if the data size is too small.
2. Potential construction bias and leakage checks. During dataset construction, there could be several stages with ris

- The benchmark successfully bridges the gap left by previous textual benchmarks (like the original BrowseComp).
- The checklists provide fine-grained evaluation, moving beyond simple correctness to assess the path taken.
- Evaluates 18 models across multiple dimensions with detailed error taxonomy and modality-specific performance breakdown.

- While the authors convincingly justify the size through the rigor of construction and high filtering rate, a total of 224 instances across 22 distinct subtasks may be insufficient for reporting reasonable scores at this granularity.
- Heavy reliance on GPT-4o-2024-11-20 as the sole evaluator for the checklist, which I believe might introduce evaluation bias.
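One reviewer suggests reporting confidence intervals via a bootstrap over items. The following is a minimal sketch of that idea, not the authors' evaluation code. It assumes per-item binary outcomes, which this page does not provide; the list below is hypothetical and only uses the fact that 65 correct answers out of 224 is the count consistent with the reported 29.02%. The function name `bootstrap_accuracy_ci` and its parameters are illustrative.

```python
import random


def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean accuracy, resampling items with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(outcomes, k=n)  # draw n items with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Hypothetical per-item outcomes (assumption): 65 correct out of 224 items.
outcomes = [1] * 65 + [0] * (224 - 65)
acc, lo, hi = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={acc:.4f}, 95% CI=({lo:.4f}, {hi:.4f})")
```

The same function could be applied to each subtask's items separately. With 224 items spread over 22 subtasks, the average is only about 10 items per subtask, so per-subtask intervals would be far wider than the overall one, which is the reviewers' concern about granularity.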
