Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation
Yebo Wu, Han Jin, Zhijiang Guo, Li Li

TL;DR
This paper proposes DaID, a contrastive decoding framework that reduces hallucinations in multimodal large language models by dynamically calibrating token generation using visual attention and internal model discrepancies.
Contribution
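The paper introduces Dual-Anchor Introspective Decoding (DaID), a contrastive decoding method that uses visual attention distributions to select, for each generated token, a Spotlight layer that amplifies visual factual signals and a Shadow layer that suppresses textual inertia.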
Findings
DaID significantly reduces hallucinations across multiple benchmarks and MLLMs.
DaID also improves the general reasoning capabilities of MLLMs.
The method effectively calibrates token generation using visual attention signals.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities yet continue to suffer from hallucination, where generated text contradicts visual content. In this paper, we introduce Dual-Anchor Introspective Decoding (DaID), a novel contrastive decoding framework that dynamically calibrates each token generation by mining the model's internal perceptual discrepancies. Specifically, DaID identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia. By leveraging visual attention distributions to guide this dual-anchor selection process, our method ensures precise, token-specific adaptation. Experimental results across multiple benchmarks and MLLMs demonstrate that DaID significantly mitigates hallucination while enhancing general reasoning capabilities.
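The abstract does not state how the two anchor layers are chosen or how their outputs are combined, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that per-layer next-token logits are available (for example, by applying the language-model head to intermediate hidden states), that each layer has a scalar "visual attention mass" for the current step, that the Spotlight layer is the one with the highest mass and the Shadow layer the one with the lowest, and that the final score adds the Spotlight log-probabilities and subtracts the Shadow log-probabilities. The names `select_anchors`, `dual_anchor_scores`, `alpha`, and `beta` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def select_anchors(visual_attention):
    """Pick anchor layers from per-layer visual attention mass.

    ASSUMPTION: Spotlight = layer attending most to image tokens,
    Shadow = layer attending least. The paper's actual rule may differ.
    """
    layers = range(len(visual_attention))
    spotlight = max(layers, key=lambda i: visual_attention[i])
    shadow = min(layers, key=lambda i: visual_attention[i])
    return spotlight, shadow


def dual_anchor_scores(layer_logits, visual_attention, alpha=1.0, beta=1.0):
    """Contrast the final layer against two anchors for one decoding step.

    ASSUMPTION: score = final + alpha * spotlight - beta * shadow,
    all in log-probability space. alpha and beta are hypothetical weights.
    """
    spotlight, shadow = select_anchors(visual_attention)
    final = log_softmax(layer_logits[-1])
    spot = log_softmax(layer_logits[spotlight])
    shad = log_softmax(layer_logits[shadow])
    return [f + alpha * s - beta * d for f, s, d in zip(final, spot, shad)]


# Toy step: 3 layers, vocabulary of 3 tokens.
layer_logits = [
    [2.0, 0.0, 0.0],  # layer 0 favors token 0 (treated as text-driven here)
    [0.0, 2.0, 0.0],  # layer 1 favors token 1 (treated as vision-driven here)
    [1.0, 1.0, 0.0],  # final layer is undecided between tokens 0 and 1
]
visual_attention = [0.1, 0.6, 0.3]  # made-up per-layer attention mass on image tokens

scores = dual_anchor_scores(layer_logits, visual_attention)
best_token = max(range(len(scores)), key=lambda i: scores[i])  # token 1 in this toy setup
```

Because the anchors are re-selected from the attention values at every step, the contrast adapts per token, which is the property the abstract emphasizes. In this toy step the final layer alone cannot separate tokens 0 and 1, and the contrast tips the choice toward the token favored by the high-visual-attention layer.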
