Visual Attention Reasoning via Hierarchical Search and Self-Verification
Wei Cai, Jian Zhao, Yuchen Yuan, Tianle Zhang, Ming Zhu, Haichuan Tang, Xuelong Li

TL;DR
This paper introduces Visual Attention Reasoning (VAR), a reinforcement learning framework that improves multimodal large language models by enabling hierarchical search and self-verification to reduce hallucinations and enhance visual grounding.
Contribution
The paper presents a hierarchical search and self-verification framework with explicit visual grounding, supported by a theoretical reliability analysis and by experiments in which it outperforms state-of-the-art methods.
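As a rough illustration of the search-with-verification idea, the sketch below runs a depth-first search over reasoning steps and backtracks whenever a verifier rejects a partial path. It is not the paper's algorithm: the function name `verified_search` and the callbacks `expand`, `verify`, and `is_complete` are hypothetical stand-ins for the model's step proposals, its self-verification check, and its stopping test.

```python
from typing import Callable, Iterable, List, Optional

Step = str  # hypothetical: one reasoning step, represented as text


def verified_search(
    path: List[Step],
    expand: Callable[[List[Step]], Iterable[Step]],
    verify: Callable[[List[Step]], bool],
    is_complete: Callable[[List[Step]], bool],
    max_depth: int = 8,
) -> Optional[List[Step]]:
    """Depth-first search that only descends into verified partial paths."""
    if is_complete(path):
        return list(path)
    if len(path) >= max_depth:
        return None  # depth budget exhausted on this branch
    for step in expand(path):
        candidate = path + [step]
        if not verify(candidate):
            continue  # verifier rejected this step: try the next sibling
        result = verified_search(candidate, expand, verify, is_complete, max_depth)
        if result is not None:
            return result
        # No completion below this step: fall through, i.e. backtrack.
    return None


# Toy usage: every path may be extended with "a" or "b", but the verifier
# rejects any path that starts with "a", so the search must back off to "b".
found = verified_search(
    [],
    expand=lambda p: ["a", "b"],
    verify=lambda p: p[0] != "a",
    is_complete=lambda p: len(p) == 2,
)
```

In contrast, a linear chain of thought corresponds to always taking the first proposed step and never revisiting it, so a single early error propagates to the final answer.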
Findings
Significantly reduces hallucinations in MLLMs
Enforces traceable evidence grounding with bounding boxes
Outperforms state-of-the-art methods on complex hallucination and safety benchmarks
Abstract
Multimodal Large Language Models (MLLMs) frequently hallucinate due to their reliance on fragile, linear reasoning and weak visual grounding. We propose Visual Attention Reasoning (VAR), a reinforcement learning framework that reformulates reasoning as a hierarchical search with self-verification. VAR enforces traceable evidence grounding by generating explicit bounding boxes, guided by a novel reward function combining geometric precision and semantic sufficiency. Furthermore, it replaces linear Chain-of-Thought with a tree-search policy capable of backtracking to correct logical errors. Theoretical analysis validates the framework's reliability, and extensive experiments demonstrate that VAR significantly outperforms state-of-the-art methods on complex hallucination and safety benchmarks.
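The abstract does not give the reward function itself, so the following is only a minimal sketch of how a grounding reward might combine a geometric term with a semantic one. It assumes that geometric precision is measured by intersection-over-union (IoU) against a reference box, that semantic sufficiency is a score in [0, 1] supplied by some external judge, and that the two are mixed linearly with a weight `alpha`. The names `iou`, `grounding_reward`, and `alpha` are hypothetical and do not come from the paper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(
    pred_box: Box, ref_box: Box, sufficiency: float, alpha: float = 0.5
) -> float:
    """Linear mix of box overlap and a semantic-sufficiency score.

    Assumption: `sufficiency` in [0, 1] says whether the cropped region
    contains enough evidence to answer the question; how it is computed
    is left open here.
    """
    if not 0.0 <= sufficiency <= 1.0:
        raise ValueError("sufficiency must lie in [0, 1]")
    return alpha * iou(pred_box, ref_box) + (1.0 - alpha) * sufficiency
```

Under these assumptions, a predicted box that matches the reference exactly and is judged fully sufficient receives a reward of 1.0, while a tight box around the wrong region is penalized by the overlap term even if the generated text sounds plausible.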
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks
