DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination
Xuan Gong, Tianshi Ming, Xinpeng Wang, Zhihua Wei

TL;DR
DAMRO is a training-free method that reduces object hallucination in LVLMs by filtering out background tokens during attention, improving the accuracy of visual-language models without additional training.
Contribution
The paper introduces DAMRO, a novel training-free approach that leverages attention mechanisms to mitigate object hallucination in LVLMs, addressing a key flaw in visual encoders.
Findings
Significantly reduces object hallucination in LVLMs
Effective across multiple LVLM architectures and benchmarks
Improves alignment between attention focus and referred objects
Abstract
Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. Both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of the LLM decoder on image tokens is highly consistent with that of the visual encoder, and that both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute this unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to overemphasize redundant information and generate object hallucination. To address the issue, we propose DAMRO, a novel training-free strategy that Dive into Attention Mechanism of LVLM to Reduce Object…
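The abstract is cut off before it describes how DAMRO selects or uses the background tokens, so the following is only a toy illustration of the two ideas it does state: the encoder and decoder attention distributions concentrate on the same few image tokens, and those tokens can be filtered out. The function names, the top-k selection rule, and the attention values are all hypothetical and are not taken from the paper.

```python
from typing import List


def top_k_tokens(attn: List[float], k: int) -> List[int]:
    """Return the indices of the k image tokens with the highest attention."""
    return sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)[:k]


def top_k_overlap(attn_a: List[float], attn_b: List[float], k: int) -> float:
    """Fraction of top-k tokens shared by two attention distributions."""
    shared = set(top_k_tokens(attn_a, k)) & set(top_k_tokens(attn_b, k))
    return len(shared) / k


def drop_and_renormalize(attn: List[float], drop_idx: List[int]) -> List[float]:
    """Zero out the given tokens and rescale the rest to sum to 1."""
    dropped = set(drop_idx)
    kept = [0.0 if i in dropped else a for i, a in enumerate(attn)]
    total = sum(kept)
    if total == 0:
        raise ValueError("all attention mass was dropped")
    return [a / total for a in kept]


# Made-up attention over six image tokens (assumption, not real data).
# Tokens 1 and 3 play the role of high-attention background tokens.
encoder_attn = [0.05, 0.40, 0.05, 0.35, 0.10, 0.05]
decoder_attn = [0.10, 0.35, 0.05, 0.30, 0.15, 0.05]

# Hypothetical rule: treat the encoder's top-2 tokens as background outliers.
outliers = top_k_tokens(encoder_attn, k=2)
overlap = top_k_overlap(encoder_attn, decoder_attn, k=2)
filtered = drop_and_renormalize(decoder_attn, outliers)
```

In this toy setup the decoder's attention mass moves onto the remaining tokens once the outliers are dropped. Whether DAMRO filters tokens in this way, or uses them differently at decoding time, cannot be determined from the truncated abstract.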
Taxonomy
Topics: Hallucinations in medical conditions · Functional Brain Connectivity Studies · EEG and Brain-Computer Interfaces
Methods: Softmax · Attention Is All You Need · Focus
