PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model

Kazi Hasan Ibn Arif; Sajib Acharjee Dip; Khizar Hussain; Lang Zhang; Chris Thomas

arXiv:2501.12206 · cs.CV · March 27, 2025

TL;DR

PAINT is a framework that reduces hallucinations in large vision-language models by selectively boosting attention to key visual tokens, improving caption accuracy without sacrificing performance.

Contribution

The paper introduces PAINT, a novel plug-and-play method that selectively enhances attention to important visual tokens to mitigate hallucinations in LVLMs.

Findings

01. Hallucinations arise from weakened attention to visual tokens in deeper layers.

02. Selective attention boosting to local and summary tokens significantly reduces hallucination rates.

03. PAINT achieves up to a 62.3% reduction in hallucinations on the MSCOCO dataset.
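
The first finding concerns how much attention the visual tokens receive at each layer. A minimal sketch of that kind of diagnostic follows; it is not the authors' analysis code. The function name `visual_attention_share`, the toy attention rows, and the choice of which positions count as visual are all assumptions made for illustration.

```python
def visual_attention_share(attn_rows, visual_idx):
    """Return, per layer, the share of attention mass placed on visual tokens.

    attn_rows: one attention distribution per layer for a single query token
               (each inner list is assumed to be post-softmax and sum to 1).
    visual_idx: key positions that hold visual tokens (hypothetical layout).
    """
    visual = set(visual_idx)
    return [
        sum(w for j, w in enumerate(row) if j in visual)
        for row in attn_rows
    ]


# Toy data, invented for illustration: positions 0 and 1 are visual tokens,
# positions 2 and 3 are text tokens. Rows are ordered shallow to deep.
toy_layers = [
    [0.30, 0.30, 0.20, 0.20],
    [0.20, 0.20, 0.30, 0.30],
    [0.05, 0.05, 0.40, 0.50],
]

shares = visual_attention_share(toy_layers, visual_idx=[0, 1])
for depth, share in enumerate(shares):
    print(f"layer {depth}: visual share = {share:.2f}")
```

In this toy input the visual share falls with depth, which is the pattern the finding describes. In a real model the rows would come from the model's attention maps, typically averaged over heads.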

Abstract

Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models often generate descriptions containing objects or details that are absent in the input image, a phenomenon commonly known as hallucination. Our work investigates the key reasons behind this issue by analyzing the pattern of self-attention in transformer layers. We find that hallucinations often arise from the progressive weakening of attention weight to visual tokens in the deeper layers of the LLM. Some previous works naively boost the attention of all visual tokens to mitigate this issue, resulting in suboptimal hallucination reduction. To address this, we identify two critical sets of visual tokens that facilitate the transfer of visual information from the…
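
The contrast the abstract draws between boosting every visual token and boosting only a chosen subset can be sketched as follows. This is an illustration under stated assumptions, not PAINT itself: the additive rule `logit + alpha * |logit|`, the value of `alpha`, and the helper names are hypothetical, and the way the paper identifies its token sets is not shown on this page, so the selected positions are simply passed in by the caller.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def boost_selected(logits, selected_idx, alpha):
    """Raise the pre-softmax scores of selected key positions, then renormalise.

    The boost rule (add alpha * |logit|) is an assumption for illustration.
    """
    selected = set(selected_idx)
    boosted = [
        x + alpha * abs(x) if j in selected else x
        for j, x in enumerate(logits)
    ]
    return softmax(boosted)


# Toy pre-softmax scores, invented for illustration:
# positions 0 and 1 are visual tokens, positions 2 and 3 are text tokens.
logits = [1.0, 0.5, 2.0, 2.0]

before = softmax(logits)
boost_all = boost_selected(logits, selected_idx=[0, 1], alpha=0.5)
boost_one = boost_selected(logits, selected_idx=[0], alpha=0.5)

print("no boost      :", [round(w, 3) for w in before])
print("all visual    :", [round(w, 3) for w in boost_all])
print("selected only :", [round(w, 3) for w in boost_one])
```

Because softmax renormalises, any mass gained by the boosted positions is taken from the others. Boosting a subset therefore concentrates the added attention on the chosen tokens rather than spreading it over every visual position.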

Taxonomy

Topics: Brain Tumor Detection and Classification · Functional Brain Connectivity Studies · CCD and CMOS Imaging Sensors

Methods: Softmax · Attention Is All You Need