IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models
Jiabing Yang, Chenhang Cui, Yiyang Zhou, Yixiang Chen, Peng Xia, Ying Wei, Tao Yu, Yan Huang, Liang Wang

TL;DR
This paper introduces IKOD, a lightweight decoding method that reduces hallucinations in large vision-language models (LVLMs) by counteracting the degradation of visual attention, without extra training or external data.
Contribution
The paper proposes IKOD, a novel decoding strategy that mitigates attention degradation and hallucinations in LVLMs without additional training or external data.
Findings
IKOD effectively reduces hallucinations in LVLMs.
IKOD improves model performance on multiple benchmarks.
IKOD is computationally efficient and easy to integrate.
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress across multiple domains. However, these models still face the inherent challenge of integrating vision and language for collaborative inference, which often leads to "hallucinations", outputs that are not grounded in the corresponding images. Many efforts have been made to address these issues, but each comes with its own limitations, such as high computational cost or expensive dataset annotation. Recent research shows that LVLMs exhibit a long-term bias where hallucinations increase as the sequence length grows, yet the underlying cause remains poorly understood. Building on extensive research into attention mechanisms in LVLMs, we analyze the relationship between this long-term bias and visual attention. In our research, we identify a consistent phenomenon in current LVLMs: the model's…
