Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Yu Zeng; Wenxuan Huang; Zhen Fang; Shuang Chen; Yufan Shen; Yishuo Cai; Xiaoman Wang; Zhenfei Yin; Lin Chen; Zehui Chen; Shiting Huang; Yiming Zhao; Xu Tang; Yao Hu; Philip Torr; Wanli Ouyang; Shaosheng Cao

arXiv:2602.02185 · cs.CV · March 3, 2026

Open Access

TL;DR

This paper introduces VDR-Bench, a new benchmark for evaluating multimodal large language models' visual and textual search abilities under realistic conditions, and proposes a multi-round search workflow to enhance retrieval performance.

Contribution

The paper presents VDR-Bench, a carefully curated benchmark for realistic visual-textual search evaluation, and introduces a multi-round cropped-search method to improve model retrieval capabilities.
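As a rough illustration of what a multi-round cropped search could look like, the sketch below searches the full image first and then progressively finer grid crops until a result is confident enough. This page does not describe the paper's actual workflow, so the grid-cropping strategy, the `search` callback, the function names, and the `threshold` and `max_rounds` parameters are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
SearchFn = Callable[[Box], Tuple[Optional[str], float]]  # hypothetical search backend


def grid_crops(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height image into rows x cols crop boxes."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((
                c * width // cols,
                r * height // rows,
                (c + 1) * width // cols,
                (r + 1) * height // rows,
            ))
    return boxes


def multi_round_cropped_search(
    width: int,
    height: int,
    search: SearchFn,
    max_rounds: int = 3,      # assumed default, not from the paper
    threshold: float = 0.8,   # assumed default, not from the paper
) -> Tuple[Optional[str], float, Optional[Box]]:
    """Return the best (label, score, box) found across rounds.

    Round 0 queries the full image (1x1 grid); each later round uses a
    finer grid (2x2, 3x3, ...). Stops early once the best score so far
    reaches the threshold.
    """
    best: Tuple[Optional[str], float, Optional[Box]] = (None, 0.0, None)
    for round_idx in range(max_rounds):
        n = round_idx + 1
        for box in grid_crops(width, height, n, n):
            label, score = search(box)
            if score > best[1]:
                best = (label, score, box)
        if best[1] >= threshold:
            break
    return best
```

In this sketch, `search` stands in for whatever image-search tool the system calls on a cropped region; a real system would likely let the model choose the crops rather than use a fixed grid.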

Findings

01. VDR-Bench effectively challenges current MLLMs in realistic scenarios.

02. Multi-round cropped-search improves visual retrieval performance.

03. Benchmark and method guide future multimodal system development.

Abstract

Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. First, they are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances. All…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning