TL;DR
This paper introduces ILVAD, a training-free, plug-and-play method that reduces hallucinations in LVLMs by enhancing attention to visual evidence based on inter-layer attention discrepancies.
Contribution
The paper uncovers layer-specific sensitivity to visual evidence in LVLMs and proposes a novel attention discrepancy-based method to mitigate hallucinations without additional training.
Findings
ILVAD consistently reduces hallucinations across five LVLMs.
The method improves visual grounding and response accuracy.
It is effective across various architectures and tasks.
Abstract
Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and…
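The abstract is cut off before it says how the inter-layer discrepancy is computed or how visual evidence is enhanced, so the following is only a rough sketch of the general idea, not the paper's method. It assumes attention weights are available as an array of shape (layers, early generated tokens, visual tokens), and it scores each visual token by how far its strongest layer exceeds its cross-layer average. That max-minus-mean score, and the function names `visual_attention_discrepancy` and `select_evidence_tokens`, are hypothetical choices made for illustration.

```python
import numpy as np


def visual_attention_discrepancy(attn):
    """Score visual tokens by inter-layer attention discrepancy.

    attn: array-like of shape (layers, early_tokens, visual_tokens) holding
    attention weights from early generated tokens to visual tokens.
    Returns (per_layer, discrepancy) with shapes (layers, visual_tokens)
    and (visual_tokens,).

    ASSUMPTION: max-minus-mean across layers is an illustrative score,
    not the formula used by ILVAD.
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 3:
        raise ValueError("expected shape (layers, early_tokens, visual_tokens)")
    # Average over the early generated tokens to get one profile per layer.
    per_layer = attn.mean(axis=1)
    # A token that a few layers attend to strongly, while the rest mostly
    # ignore it, gets a high score.
    discrepancy = per_layer.max(axis=0) - per_layer.mean(axis=0)
    return per_layer, discrepancy


def select_evidence_tokens(discrepancy, k):
    """Return the indices of the k visual tokens with the highest score."""
    order = np.argsort(discrepancy)[::-1]
    return order[:k]


# Toy input: 2 layers, 1 early token, 3 visual tokens.
# Only the second layer attends strongly to visual token 1.
toy_attn = [
    [[0.1, 0.1, 0.1]],
    [[0.1, 0.7, 0.1]],
]
per_layer, discrepancy = visual_attention_discrepancy(toy_attn)
top_tokens = select_evidence_tokens(discrepancy, k=1)
```

In this toy input, visual token 1 is the one selected, since it is the only token whose attention differs between layers. How the selected tokens would then be amplified during generation is not covered by the visible part of the abstract and is left out here.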
