TL;DR
This paper introduces LaSCD, a training-free decoding method that uses the Laplacian energy of visual attention to identify and reduce hallucinations in multimodal large language models.
Contribution
It shows that the high-frequency structure of visual attention is linked to hallucinations and proposes a training-free decoding strategy to mitigate them.
Findings
LaSCD reduces hallucinations across multiple benchmarks.
Laplacian energy identifies layers where hallucinations emerge.
The method preserves the models' general capabilities.
Abstract
Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that selects informative layers via Laplacian…
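The abstract does not spell out how the layer-wise Laplacian energy is computed, so the following is only a minimal sketch under stated assumptions: the attention mass a layer places on image tokens is reshaped into a 2D patch grid, a 4-neighbour discrete Laplacian is applied with replicated borders, and the squared responses are summed. The function names `laplacian_energy` and `layerwise_energy`, the grid layout, and the boundary handling are all hypothetical choices made for illustration, not the authors' definition.

```python
import numpy as np


def laplacian_energy(attn_grid):
    """Sum of squared 4-neighbour discrete Laplacian responses of a 2D map.

    Assumption: borders are handled by replicating edge values.
    """
    a = np.asarray(attn_grid, dtype=np.float64)
    p = np.pad(a, 1, mode="edge")  # replicate edges so border cells have neighbours
    lap = (
        p[:-2, 1:-1]    # neighbour above
        + p[2:, 1:-1]   # neighbour below
        + p[1:-1, :-2]  # neighbour to the left
        + p[1:-1, 2:]   # neighbour to the right
        - 4.0 * a
    )
    return float(np.sum(lap ** 2))


def layerwise_energy(visual_attn, grid_hw):
    """Energy per layer.

    visual_attn: array of shape (num_layers, num_image_tokens) holding the
    attention each layer assigns to image tokens (assumed already averaged
    over heads and taken from the current query position).
    grid_hw: (height, width) of the image patch grid.
    """
    h, w = grid_hw
    attn = np.asarray(visual_attn, dtype=np.float64)
    return np.array([laplacian_energy(row.reshape(h, w)) for row in attn])


if __name__ == "__main__":
    uniform = np.full(9, 1.0 / 9.0)  # smooth: no high-frequency content
    spike = np.zeros(9)
    spike[4] = 1.0                   # all mass on the centre patch
    energies = layerwise_energy(np.stack([uniform, spike]), (3, 3))
    print(energies)                  # the spiky layer has the larger energy
```

In this toy setup a flat attention map has zero energy, while a map concentrated on a single patch has high energy, which is the sense in which the quantity measures high-frequency structure. How LaSCD turns these per-layer values into a layer selection is not covered by the truncated abstract and is left out here.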
