Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Wenxuan Huang; Yu Zeng; Qiuchen Wang; Zhen Fang; Shaosheng Cao; Zheng Chu; Qingyu Yin; Shuang Chen; Zhenfei Yin; Lin Chen; Zehui Chen; Xu Tang; Yao Hu; Shaohui Lin; Philip Torr; Feng Zhao; Wanli Ouyang

arXiv:2601.22060·cs.CV·March 25, 2026

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Xu Tang, Yao Hu, Shaohui Lin, Philip Torr, Feng Zhao, Wanli Ouyang

PDF

Open Access

TL;DR

This paper introduces Vision-DeepResearch, a multimodal large language model capable of multi-turn, multi-entity, multi-scale visual and textual search, significantly improving deep-research capabilities in noisy real-world scenarios.

Contribution

It presents a novel multimodal deep-research paradigm with extensive reasoning steps, training methods, and robust search strategies, outperforming existing models and workflows.

Findings

01

Supports dozens of reasoning steps and hundreds of engine interactions.

02

Outperforms existing multimodal deep-research models and workflows.

03

Achieves robust performance under heavy visual noise.

Abstract

Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs by ``reasoning-then-tool-call'' for visual and textual search engines to obtain substantial gains on tasks requiring extensive factual information. However, these approaches typically define multimodal search in a naive setting, assuming that a single full-level or entity-level image query and few text query suffices to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in the reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. Building on this, we…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning