Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

Ailin Deng; Zhirui Chen; Bryan Hooi

arXiv:2402.15300·cs.CV·April 24, 2024·3 cites

TL;DR

This paper introduces CLIP-Guided Decoding (CGD), a training-free method that uses CLIP image-text similarity at decoding time to reduce object hallucinations in large vision-language models (LVLMs), improving their reliability while preserving the utility of the generated text.

Contribution

The paper presents a training-free decoding technique that uses CLIP similarity to mitigate object hallucinations in LVLMs, and reports that it outperforms existing methods.

Findings

01  CLIP similarity is a stronger hallucination indicator than token likelihoods.

02  CGD significantly reduces object hallucinations across multiple LVLMs.

03  The approach maintains the utility of text generation while improving visual grounding.

Abstract

Large Vision-Language Models (LVLMs) are susceptible to object hallucinations, an issue in which their generated text contains non-existent objects, greatly limiting their reliability and practicality. Current approaches often rely on the model's token likelihoods or other internal information, instruction tuning on additional datasets, or incorporating complex external tools. We first perform empirical analysis on sentence-level LVLM hallucination, finding that CLIP similarity to the image acts as a stronger and more robust indicator of hallucination compared to token likelihoods. Motivated by this, we introduce our CLIP-Guided Decoding (CGD) approach, a straightforward but effective training-free approach to reduce object hallucination at decoding time. CGD uses CLIP to guide the model's decoding process by enhancing visual grounding of generated text with the image. Experiments…
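The idea of letting CLIP similarity guide decoding can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule (a weighted sum of length-normalized log-likelihood and an image-text similarity), the weight `alpha`, the function names, and the toy scores below are all assumptions made for illustration. A real setup would replace `clip_score` with the similarity between a CLIP text embedding of the candidate sentence and the CLIP embedding of the image.

```python
from typing import Callable, List, Sequence, Tuple

# A candidate is a sentence plus the per-token log-probabilities
# that the LVLM assigned to it (hypothetical data layout).
Candidate = Tuple[str, Sequence[float]]


def length_normalized_logprob(token_logprobs: Sequence[float]) -> float:
    """Average log-probability per token, so long sentences are not penalized."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)


def clip_guided_select(
    candidates: List[Candidate],
    clip_score: Callable[[str], float],
    alpha: float = 0.5,
    top_k: int = 1,
) -> List[str]:
    """Rank candidate sentences by a mix of likelihood and visual grounding.

    alpha = 0 ranks by the model's likelihood only;
    alpha = 1 ranks by the image-text similarity only.
    The weighted sum is an assumed scoring rule, not taken from the paper.
    """
    scored = []
    for sentence, token_logprobs in candidates:
        likelihood = length_normalized_logprob(token_logprobs)
        grounding = clip_score(sentence)  # stand-in for CLIP similarity to the image
        scored.append(((1.0 - alpha) * likelihood + alpha * grounding, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_k]]


# Toy data: the second sentence mentions an object that is not in the image.
# The model finds it more likely, but its (made-up) similarity score is lower.
toy_candidates: List[Candidate] = [
    ("a dog on a bench", [-0.9, -1.1]),
    ("a dog and a cat on a bench", [-0.2, -0.4]),
]
toy_similarity = {
    "a dog on a bench": 0.31,
    "a dog and a cat on a bench": 0.18,
}


def toy_clip_score(sentence: str) -> float:
    """Hypothetical lookup standing in for a real CLIP forward pass."""
    return toy_similarity[sentence]


if __name__ == "__main__":
    print(clip_guided_select(toy_candidates, toy_clip_score, alpha=0.0))
    print(clip_guided_select(toy_candidates, toy_clip_score, alpha=1.0))
```

With `alpha=0.0` the ranking follows the model's likelihood and keeps the sentence with the extra object; with `alpha=1.0` it follows the similarity score and keeps the grounded sentence. Because log-likelihoods and similarity scores live on different scales, intermediate values of `alpha` would need tuning in practice.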

Taxonomy

Methods: Contrastive Language-Image Pre-training (CLIP)