V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

Jiazhou Zhou; Yucheng Chen; Hongyang Li; Qing Jiang; Hu Zhou; Ying-Cong Chen; Lei Zhang

arXiv:2604.03307 · cs.CV · April 17, 2026

TL;DR

V-Reflection enhances multimodal large language models by enabling them to actively interrogate visual inputs during reasoning, improving accuracy in perception-intensive tasks without increasing inference complexity.

Contribution

This work introduces a novel 'think-then-look' framework with a two-stage distillation strategy, transforming passive models into active visual interrogators.

Findings

01 Significantly improves performance on six perception-intensive benchmarks.

02 Enables models to localize task-critical visual evidence autonomously.

03 Maintains end-to-end autoregressive decoding during inference.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states. To overcome this, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a "think-then-look" visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step in task-critical evidence. Our approach employs a two-stage distillation strategy.…
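
The abstract does not spell out how a latent state "interrogates" the visual feature space. One common way to picture such a probe is single-head cross-attention, where the latent state is the query and the visual features serve as keys and values. The sketch below is an illustrative assumption, not the authors' implementation: the function name `visual_probe`, the scaled dot-product scoring, and the toy inputs are all hypothetical, and the two-stage distillation is not modeled.

```python
import math

def visual_probe(latent_state, visual_feats):
    """Hypothetical probe: attend over visual features with a latent query.

    latent_state: list of d floats (the current reasoning state).
    visual_feats: list of N feature vectors, each a list of d floats.
    Returns (evidence, weights): the attention-weighted feature vector
    and the attention weight assigned to each visual feature.
    """
    d = len(latent_state)
    # Scaled dot-product score between the query and every visual feature.
    scores = [
        sum(q * k for q, k in zip(latent_state, feat)) / math.sqrt(d)
        for feat in visual_feats
    ]
    # Numerically stable softmax over the N scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Evidence is the weighted average of the visual features.
    evidence = [
        sum(w * feat[j] for w, feat in zip(weights, visual_feats))
        for j in range(d)
    ]
    return evidence, weights

# Toy input (assumed): four orthogonal "patch" features and a latent
# state aligned with patch 2.
visual_feats = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
latent_state = [4.0 * x for x in visual_feats[2]]

evidence, weights = visual_probe(latent_state, visual_feats)
print(weights)
```

In this toy setup the largest weight falls on the patch the latent state is aligned with, which is the sense in which a reasoning state could localize task-critical visual evidence.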

Code & Models

Models

🤗 garlandchou/V-Reflection · model · 9 dl · ♡ 5
