See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
Yongchang Zhang, Oliver Ma, Tianyi Liu, Guangquan Zhou, Yang Chen

TL;DR
This paper introduces a training-free, iterative framework for multimodal reasoning in LVLMs that ensures visual evidence justifies each reasoning step, significantly reducing hallucinations and improving accuracy across benchmarks.
Contribution
A novel, training-free, plug-and-play method that supervises reasoning with visual evidence at test time, avoiding costly reinforcement learning training.
Findings
Achieves 16.5%-29.5% improvements on TreeBench.
Gains 13.7% RH-AUC on RH-Bench.
Reduces hallucination rates while enhancing reasoning accuracy.
Abstract
Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. In contrast, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with…
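The abstract is cut off before it describes the mechanism, so the following is only a generic illustration of the idea it states: each reasoning step is checked against visual evidence before later steps can build on it. This is not the authors' algorithm. The function `grounded_reasoning`, its parameters, and the toy stubs are all hypothetical names introduced here, and the retry-with-feedback policy is an assumption.

```python
from typing import Callable, List, Optional


def grounded_reasoning(
    propose_step: Callable[[List[str], Optional[str]], str],
    is_supported: Callable[[str], bool],
    is_final: Callable[[str], bool],
    max_steps: int = 8,
    max_retries: int = 2,
) -> List[str]:
    """Build a reasoning chain, keeping only steps the verifier accepts.

    All three callables are hypothetical stand-ins: `propose_step` for an
    LVLM call, `is_supported` for a visual-evidence check, `is_final` for
    answer detection.
    """
    chain: List[str] = []
    for _ in range(max_steps):
        feedback: Optional[str] = None
        accepted: Optional[str] = None
        for _ in range(max_retries + 1):
            step = propose_step(chain, feedback)
            if is_supported(step):
                accepted = step
                break
            # Assumed policy: tell the proposer which claim was rejected.
            feedback = f"Not supported by the image: {step}"
        if accepted is None:
            break  # stop rather than build on an unverified step
        chain.append(accepted)
        if is_final(accepted):
            break
    return chain


# Toy stubs: the "image" shows a red cup, and the first proposal is wrong.
def toy_propose(chain: List[str], feedback: Optional[str]) -> str:
    if not chain:
        return "the cup is red" if feedback else "the cup is blue"
    return "answer: red"


def toy_supported(step: str) -> bool:
    return "red" in step  # toy verifier: only claims mentioning red pass


def toy_final(step: str) -> bool:
    return step.startswith("answer:")


chain = grounded_reasoning(toy_propose, toy_supported, toy_final)
print(chain)
```

In the toy run, the unsupported first proposal ("the cup is blue") is rejected and never enters the chain, so the final answer is derived only from steps that passed the check. Because the loop runs purely at inference time around an existing model, a scheme of this shape needs no additional training.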
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
