PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model
Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas

TL;DR
PAINT is a framework that reduces hallucinations in large vision-language models by selectively boosting attention to key visual tokens, improving caption accuracy without sacrificing performance.
Contribution
The paper introduces PAINT, a novel plug-and-play method that selectively enhances attention to important visual tokens to mitigate hallucinations in LVLMs.
Findings
Hallucinations arise from weakened attention to visual tokens in deeper layers.
Selective attention boosting to local and summary tokens significantly reduces hallucination rates.
PAINT achieves up to a 62.3% reduction in hallucinations on the MSCOCO dataset.
Abstract
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models often generate descriptions containing objects or details that are absent in the input image, a phenomenon commonly known as hallucination. Our work investigates the key reasons behind this issue by analyzing the pattern of self-attention in transformer layers. We find that hallucinations often arise from the progressive weakening of attention weight to visual tokens in the deeper layers of the LLM. Some previous works naively boost the attention of all visual tokens to mitigate this issue, resulting in suboptimal hallucination reduction. To address this, we identify two critical sets of visual tokens that facilitate the transfer of visual information from the…
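The layer-wise analysis mentioned in the abstract can be approximated by measuring how much attention mass the current query token places on visual tokens at each layer. The following is a minimal sketch under stated assumptions: the array layout `(layers, heads, keys)`, the helper name `visual_attention_share`, and the random toy data are hypothetical and do not come from the paper.

```python
import numpy as np

def visual_attention_share(attn, visual_idx):
    """Per-layer share of attention that goes to visual tokens.

    attn: array of shape (layers, heads, keys) holding post-softmax
          attention weights of one query token (assumed layout).
    Returns an array of shape (layers,), averaged over heads.
    """
    per_head = attn[:, :, visual_idx].sum(axis=-1)  # (layers, heads)
    return per_head.mean(axis=-1)                   # (layers,)

# Toy data: random weights normalized over keys, standing in for real attention maps.
rng = np.random.default_rng(0)
layers, heads, keys = 4, 2, 8
attn = rng.random((layers, heads, keys))
attn /= attn.sum(axis=-1, keepdims=True)

visual_idx = np.arange(1, 5)  # hypothetical positions of visual tokens
share = visual_attention_share(attn, visual_idx)
```

With attention maps from a real model, a share that shrinks as the layer index grows would correspond to the weakening the abstract describes.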
Taxonomy
Topics: Brain Tumor Detection and Classification · Functional Brain Connectivity Studies · CCD and CMOS Imaging Sensors
Methods: Softmax · Attention Is All You Need
