TL;DR
V-Reflection lets multimodal large language models actively interrogate visual inputs during reasoning, improving accuracy on perception-intensive tasks while keeping inference end-to-end autoregressive.
Contribution
This work introduces a "think-then-look" framework, trained with a two-stage distillation strategy, that turns passive MLLMs into active visual interrogators.
Findings
Significantly improves performance on six perception-intensive benchmarks.
Enables models to localize task-critical visual evidence autonomously.
Maintains end-to-end autoregressive decoding during inference.
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states. To overcome this, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a "think-then-look" visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step in task-critical evidence. Our approach employs a two-stage distillation strategy.…
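To make the "latent states as dynamic probes" idea concrete, here is a minimal sketch of one plausible reading: a single reasoning-step latent state is used as a query over visual patch features via scaled dot-product attention. The abstract does not specify the actual mechanism, so the function name `probe_visual_features`, the toy vectors, and the choice of plain dot-product attention are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def probe_visual_features(latent_state, visual_features):
    """Hypothetical helper: treat one latent reasoning state as a query
    over visual patch features (assumed scaled dot-product attention).

    Returns the attention weights over patches and the weighted sum of
    patch features, i.e. the visual evidence retrieved for this step.
    """
    dim = len(latent_state)
    # Similarity between the latent query and every visual patch.
    scores = [
        sum(q * k for q, k in zip(latent_state, patch)) / math.sqrt(dim)
        for patch in visual_features
    ]
    weights = softmax(scores)
    # Weighted combination of patch features.
    evidence = [
        sum(w * patch[i] for w, patch in zip(weights, visual_features))
        for i in range(dim)
    ]
    return weights, evidence


# Toy data (made up for illustration): three 2-d patch features and a
# latent state that points toward the first patch.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
state = [4.0, 0.0]
weights, evidence = probe_visual_features(state, patches)
most_attended = weights.index(max(weights))
```

In this toy setup the weights concentrate on the patch most aligned with the latent state, which mirrors the finding that the model localizes task-critical visual evidence; how V-Reflection learns such probes through its two-stage distillation is not covered by this sketch.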
