Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Zhe Qian, Yanbiao Ma, Zhuohan Ouyang, Zhonghua Wang, Zhongxing Xu, Fei Luo, Xinyu Liu, Zongyuan Ge, Yike Guo, and Jungong Han

TL;DR
This paper links hallucinations in multimodal reasoning models to high-uncertainty cognitive bifurcation points and proposes V-STAR, a training paradigm that uses attention reinforcement and reflection mechanisms to improve visual grounding.
Contribution
The paper introduces V-STAR, a training framework that combines hierarchical attention rewards with reflection strategies to mitigate hallucinations in multimodal reasoning models.
Findings
V-STAR improves visual grounding during high-uncertainty reasoning.
The hierarchical attention reward enhances visual evidence querying.
Reflection mechanisms reduce hallucination frequency.
Abstract
Multimodal Large Reasoning Models (MLRMs) have made remarkable strides in visual reasoning through test-time compute scaling, yet long-chain reasoning remains prone to hallucinations. We identify a concerning phenomenon termed the Reasoning Vision Truth Disconnect (RVTD): hallucinations are strongly correlated with cognitive bifurcation points that often exhibit high-entropy states. We attribute this vulnerability to a breakdown in visual semantic anchoring, localized within the network's intermediate layers; specifically, during these high-uncertainty transitions, the model fails to query visual evidence, reverting instead to language priors. Consequently, we advocate a shift from solely outcome-level supervision to augmenting it with fine-grained internal attention guidance. To this end, we propose V-STAR (Visual Structural Training with Attention Reinforcement), a lightweight,…
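
The abstract ties hallucinations to high-entropy positions in the reasoning chain. As a rough illustration only, the sketch below shows one way such positions might be flagged, assuming access to the model's next-token logits at each decoding step. The function names, the toy logits, and the threshold value are hypothetical and are not taken from the paper, which may define bifurcation points differently.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_steps(step_logits, threshold):
    """Return indices of decoding steps whose entropy exceeds the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [
        i for i, logits in enumerate(step_logits)
        if token_entropy(logits) > threshold
    ]

# Toy logits over a 4-token vocabulary for three decoding steps.
steps = [
    [10.0, 0.0, 0.0, 0.0],  # confident prediction, low entropy
    [1.0, 1.0, 1.0, 1.0],   # uniform distribution, entropy = ln(4)
    [5.0, 0.0, 0.0, 0.0],   # fairly confident, low entropy
]
flagged = high_entropy_steps(steps, threshold=1.0)
print(flagged)
```

In this toy setting only the uniform step is flagged. In practice the logits would come from the model's decoding loop, and a suitable threshold would have to be chosen empirically.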
