V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Dongyang Chen; Chaoyang Wang; Dezhao Su; Xi Xiao; Zeyu Zhang; Jing Xiong; Qing Li; Yuzhang Shang; Shichao Kan

arXiv:2602.06034 · cs.CV · February 26, 2026

TL;DR

V-Retrver introduces an evidence-driven agentic reasoning framework for multimodal retrieval, enabling models to actively verify visual evidence and significantly improve retrieval accuracy and reasoning reliability.

Contribution

It presents a novel framework that allows multimodal models to actively gather visual evidence during reasoning, enhancing retrieval performance and interpretability.

Findings

1. 23.0% average improvement in retrieval accuracy
2. Enhanced reasoning reliability and generalization
3. Effective evidence-gathering through curriculum-based training

Abstract

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based…
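The alternation between hypothesis generation and targeted visual verification described in the abstract can be illustrated with a toy reranking loop. This is a minimal sketch, not the authors' implementation: the names `Candidate`, `inspect_region`, and `rerank_with_evidence`, the `margin` threshold, and the +1/-1 score update are all assumptions made for illustration, and a dictionary lookup stands in for a real visual tool such as cropping or zooming.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    cand_id: str
    coarse_score: float        # hypothetical score from a static visual encoding
    details: Dict[str, str]    # stand-in for fine-grained visual content


@dataclass
class Step:
    kind: str                  # "hypothesis" or "verification"
    text: str


def inspect_region(candidate: Candidate, attribute: str) -> str:
    """Hypothetical visual tool: a dict lookup standing in for crop/zoom."""
    return candidate.details.get(attribute, "unknown")


def rerank_with_evidence(
    query_attrs: Dict[str, str],
    candidates: List[Candidate],
    tool: Callable[[Candidate, str], str] = inspect_region,
    margin: float = 0.1,       # assumed ambiguity threshold, not from the paper
) -> Tuple[List[str], List[Step]]:
    """Rerank candidates, calling the tool only on visually ambiguous ones."""
    trace: List[Step] = []
    scores = {c.cand_id: c.coarse_score for c in candidates}
    best = max(scores.values())
    for cand in candidates:
        if best - cand.coarse_score > margin:
            continue  # clearly worse than the leader: skip tool calls
        for attr, expected in query_attrs.items():
            # Hypothesis step: state what would have to be true.
            trace.append(Step("hypothesis", f"{cand.cand_id} may have {attr}={expected}"))
            # Verification step: gather evidence with the tool.
            observed = tool(cand, attr)
            trace.append(Step("verification", f"{cand.cand_id} has {attr}={observed}"))
            # Assumed update rule: reward confirmed evidence, penalize mismatches.
            scores[cand.cand_id] += 1.0 if observed == expected else -1.0
    ranking = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return ranking, trace


# Toy data: "a" leads on coarse score, but inspection shows "b" matches the query.
candidates = [
    Candidate("a", 0.82, {"logo": "red"}),
    Candidate("b", 0.80, {"logo": "blue"}),
    Candidate("c", 0.40, {"logo": "blue"}),
]
ranking, trace = rerank_with_evidence({"logo": "blue"}, candidates)
print(ranking)
```

In this toy setup the tool is invoked only for the two candidates whose coarse scores are close, which mirrors the idea of acquiring evidence selectively rather than inspecting every candidate.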

Code & Models

Models

V-Retrver/V-Retrver-SFT-7B
model · 8 dl · ♡ 2

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Visual Attention and Saliency Detection