VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

Zhengbo Zhang; Jinbo Su; Zhaowen Zhou; Changtao Miao; Yuhan Hong; Qimeng Wu; Yumeng Liu; Feier Wu; Yihe Tian; Yuhao Liang; Zitong Shan; Wanke Xia; Yi-Fan Zhang; Bo Zhang; Zhe Li; Shiming Xiang; Ying Yan

arXiv:2603.16289 · cs.CV · March 19, 2026

TL;DR

VisBrowse-Bench introduces a comprehensive benchmark for evaluating visual reasoning in multimodal browsing agents, addressing limitations of prior benchmarks by including visual-native web page information and multimodal evidence validation.

Contribution

The paper presents a new benchmark dataset and evaluation framework for visual-native search, along with an agent workflow for active visual information reasoning during web browsing.

Findings

01. Best model achieves 47.6% accuracy

02. Proprietary model achieves 41.1% accuracy

03. Benchmark reveals gaps in current multimodal reasoning capabilities

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability, and neglect of the native visual information of web pages in reasoning chains. To address these limitations, we introduce VisBrowse-Bench, a new benchmark for visual-native search. It contains 169 VQA instances covering multiple domains and evaluates models' visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. The data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively…

Code & Models

Datasets

Zhengbo-Zhang/VisBrowse-Bench
dataset · 26 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks