Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
Shi Liu, Kecheng Zheng, Wei Chen

TL;DR
This paper proposes a training-free method that adjusts attention weights and logits to improve image focus in LVLMs, significantly reducing hallucinations and enhancing multi-modal understanding.
Contribution
It introduces a novel, training-free algorithm that balances image and text influence in LVLMs, addressing hallucination issues without additional training.
Findings
Reduces hallucination frequency in various LVLMs
Improves alignment between visual input and language output
Enhances multi-modal comprehension without extra training
Abstract
Existing Large Vision-Language Models (LVLMs) primarily align the image features of a vision encoder with Large Language Models (LLMs) to leverage their superior text generation capabilities. However, the scale disparity between the vision encoder and the language model may lead to the LLM assuming a predominant role in multi-modal comprehension. This imbalance in LVLMs may result in instances of hallucination. Concretely, LVLMs may generate consistent descriptions with or without visual input, indicating that certain outputs are influenced solely by context text. We refer to this phenomenon as "text inertia." To counteract this issue, we introduce a training-free algorithm to find an equilibrium point between image comprehension and language inference. Specifically, we adaptively adjust and amplify the attention weights assigned to image tokens, thereby granting greater prominence to…
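The idea of amplifying attention on image tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `amplify_image_attention`, the strength parameter `alpha`, and the specific rule of adding `alpha * |score|` to pre-softmax scores at image positions are all assumptions made for illustration. It operates on a single query's scores in plain Python rather than inside a real LVLM.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def amplify_image_attention(scores, image_mask, alpha=0.5):
    """Boost pre-softmax attention scores at image-token positions.

    scores:     raw attention scores of one query over all keys.
    image_mask: True where the key is an image token.
    alpha:      hypothetical amplification strength (an assumption,
                not a value taken from the abstract).
    """
    if len(scores) != len(image_mask):
        raise ValueError("scores and image_mask must have the same length")
    # Adding alpha * |score| never lowers a score, whatever its sign.
    boosted = [
        s + alpha * abs(s) if is_img else s
        for s, is_img in zip(scores, image_mask)
    ]
    return softmax(boosted)


# Toy example: keys 1 and 2 are image tokens, keys 0 and 3 are text tokens.
scores = [2.0, 1.0, 1.0, 3.0]
image_mask = [False, True, True, False]

before = softmax(scores)
after = amplify_image_attention(scores, image_mask, alpha=0.5)

image_mass_before = sum(w for w, m in zip(before, image_mask) if m)
image_mass_after = sum(w for w, m in zip(after, image_mask) if m)
print(image_mass_before, image_mass_after)
```

In this toy case the share of attention falling on image tokens grows after the boost while the weights still sum to one, which is the rebalancing effect the abstract describes in words. How the real method selects layers, heads, or the amplification strength is not stated in the excerpt above.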
Taxonomy
Topics: Hallucinations in medical conditions
Methods: Softmax · Attention Is All You Need · ALIGN
