Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation

Xingyu Zhu; Kesen Zhao; Liang Yi; Shuo Wang; Zhicai Wang; Beier Zhu; Hanwang Zhang

arXiv:2602.24041·cs.CV·March 2, 2026

Open Access·3 Reviews

TL;DR

This paper introduces AIR, a training-free, adaptive visual reinforcement method for multimodal large language models that reduces hallucinations by selectively emphasizing salient visual cues during decoding.

Contribution

AIR is a novel, training-free framework that improves hallucination mitigation in MLLMs through prototype-based token reduction and OT-guided patch reinforcement.
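
To make "prototype-based token reduction" concrete, here is a minimal sketch that condenses a set of visual tokens into a few prototypes. It assumes plain k-means as the clustering step, which this page does not confirm as AIR's actual procedure; the function name `reduce_tokens` and its parameters are hypothetical.

```python
import numpy as np


def reduce_tokens(tokens: np.ndarray, num_prototypes: int,
                  iters: int = 10, seed: int = 0) -> np.ndarray:
    """Condense (N, D) visual tokens into at most `num_prototypes` centroids.

    ASSUMPTION: plain k-means stands in for the paper's prototype step.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    k = min(num_prototypes, n)  # cannot have more prototypes than tokens
    # Initialise prototypes from k distinct tokens.
    prototypes = tokens[rng.choice(n, size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Squared Euclidean distance from every token to every prototype: (N, k).
        d2 = ((tokens[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                prototypes[j] = members.mean(axis=0)
    return prototypes
```

Calling `reduce_tokens(tokens, 16)` on a `(576, 32)` array returns a `(16, 32)` array; the token count and width here are illustrative only.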

Findings

01 Significantly reduces hallucinations in MLLMs

02 Maintains core reasoning and understanding capabilities

03 Effective across multiple MLLMs and tasks

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning, yet they remain vulnerable to hallucination, where generated content deviates from visual evidence. Existing mitigation strategies either require costly supervision during training or introduce additional latency at inference time. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding, but they typically inject all tokens indiscriminately, which causes interference from background regions and distracts the model from critical cues. To overcome this challenge, we propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs. AIR consists of two components. Prototype-based token reduction condenses the large pool of visual tokens into a compact subset to suppress redundancy. OT-guided patch reinforcement…
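
The abstract is cut off before it describes the OT-guided step, so the following is only a sketch of how an entropic optimal transport cost between hidden states and patch embeddings could be computed. It assumes uniform marginals, a cosine-distance cost, and standard Sinkhorn iterations; none of these choices is confirmed by the text above. `sinkhorn_cost`, `patch_costs`, and the default `eps` are hypothetical names and values.

```python
import numpy as np


def sinkhorn_cost(x: np.ndarray, y: np.ndarray,
                  eps: float = 0.1, iters: int = 200):
    """Entropic OT between point clouds x (n, d) and y (m, d).

    ASSUMPTIONS: uniform weights and cosine-distance cost.
    Returns (transport cost, transport plan of shape (n, m)).
    """
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - xn @ yn.T                      # cosine distance in [0, 2]
    a = np.full(x.shape[0], 1.0 / x.shape[0])   # uniform source weights
    b = np.full(y.shape[0], 1.0 / y.shape[0])   # uniform target weights
    kernel = np.exp(-cost / eps)                # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        # Alternate scaling so the plan matches column, then row marginals.
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan


def patch_costs(hidden: np.ndarray, patches: list) -> np.ndarray:
    """OT cost from the hidden states to each patch's token set."""
    return np.array([sinkhorn_cost(hidden, p)[0] for p in patches])
```

A lower cost means a patch's tokens lie closer to the hidden states under this metric. How AIR turns such scores into a selection rule is not stated in this excerpt, so the sketch stops at the scores.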

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

Efficiency-aware design. Prototype reduction + selective patch fusion yields small overhead, acceptable for many deployments. Clarity. Method is easy to implement in existing FFN-reinjection pipelines.

Weaknesses

Benchmark coverage is narrow. Evaluation centers on CHAIR (captioning) and POPE (binary VQA). Absent are harder hallucination suites probing language bias and visual illusions, such as HallusionBench, RLHF-v, MMHal-Bench, and V*; including them would strengthen claims of robustness. Pure performance: The performance is incremental compared to VAF. Maybe provide curves for ε (entropic regularization), τ, Top-Q, and #patches on at least two models, and adversarial stress tests (noisy crops,

Reviewer 02·Rating 4·Confidence 5

Strengths

1. The proposed method operates purely at inference, making it broadly applicable to existing MLLMs without retraining. 2. Integrating optimal transport to quantify alignment between hidden states and patch embeddings is a creative idea.

Weaknesses

1. The central claim, that reinforcing salient patches directly causes lower hallucination, is not rigorously demonstrated. The supporting evidence is purely descriptive and does not establish a causal relationship between visual emphasis and reduced hallucination. The observed gains could equally arise from reduced visual redundancy or implicit regularization rather than genuine enhancement of visual grounding. 2. All experiments restrict generation to 64 tokens (Table 1), whereas hallucination

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The method achieves notable improvements across multiple models and benchmarks, showing robustness under both standard and adversarial conditions. 2. The OT-based analysis is well-motivated and supported by proof and visualization, providing a clear justification for the proposed selection mechanism. 3. The paper is clearly written, visually well-presented, and includes detailed experimental settings for replication.

Weaknesses

1. AIR assumes well-aligned hidden and visual spaces; if this alignment is weak, OT distance may emphasize irrelevant correlations, limiting reliability on misaligned models. 2. Despite its name, AIR uses fixed thresholds and token counts. Introducing data- or entropy-driven adaptation could further enhance robustness across tasks. 3. Experiments focus on standard datasets with clean imagery; robustness under distribution shifts or noisy visuals remains untested.

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hallucinations in medical conditions · Multimodal Machine Learning Applications