Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Sicong Leng; Hang Zhang; Guanzheng Chen; Xin Li; Shijian Lu; Chunyan Miao; Lidong Bing

arXiv:2311.16922·cs.CV·November 29, 2023·2 cites

TL;DR

This paper introduces Visual Contrastive Decoding (VCD), a training-free method that reduces object hallucinations in large vision-language models by contrasting outputs from original and distorted images, improving factual accuracy.

Contribution

The paper presents VCD, a novel, training-free approach that effectively mitigates object hallucinations in LVLMs by leveraging visual contrastive decoding without external tools.

Findings

01. VCD significantly reduces object hallucinations across various LVLMs.

02. VCD improves accuracy on general LVLM benchmarks.

03. VCD does not require additional training or external tools.

Abstract

Large Vision-Language Models (LVLMs) have advanced considerably, intertwining visual recognition and language understanding to generate content that is not only coherent but also contextually attuned. Despite their success, LVLMs still suffer from the issue of object hallucinations, where models generate plausible yet incorrect outputs that include objects that do not exist in the images. To mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations. This adjustment ensures the generated content is closely grounded to visual inputs, resulting in contextually accurate outputs. Our experiments show that VCD, without…
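The contrast between the two output distributions can be illustrated with a minimal sketch. It assumes the model has already produced next-token logits twice, once conditioned on the original image and once on a distorted copy. The function names, the toy vocabulary, and the weighting form `(1 + alpha) * original - alpha * distorted` are assumptions made for illustration (this is a common way to write a contrastive decoding step); the excerpt above does not spell out the exact formula or any additional constraints the method may apply.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def contrastive_next_token_probs(logits_original, logits_distorted, alpha=1.0):
    """Contrast logits from the original and distorted visual inputs.

    Assumed form (hypothetical, not quoted from the source):
        (1 + alpha) * logits_original - alpha * logits_distorted
    Tokens that stay likely even when the image is distorted are
    presumably driven by priors rather than visual evidence, so they
    are down-weighted. alpha = 0 recovers ordinary decoding.
    """
    logits_original = np.asarray(logits_original, dtype=float)
    logits_distorted = np.asarray(logits_distorted, dtype=float)
    contrasted = (1.0 + alpha) * logits_original - alpha * logits_distorted
    return softmax(contrasted)


# Toy three-token vocabulary with made-up logits, for illustration only.
vocab = ["dog", "frisbee", "ball"]
logits_with_image = [2.0, 1.0, 1.5]      # conditioned on the original image
logits_with_distorted = [2.0, 0.2, 1.6]  # conditioned on the distorted image

probs = contrastive_next_token_probs(logits_with_image, logits_with_distorted)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

In this toy case "frisbee" loses most of its support once the image is distorted, so the contrast raises its probability, while "ball", which stays likely regardless of the image, is pushed down. Because the step only needs two forward passes and some arithmetic on the logits, it is consistent with the training-free claim above.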

Code & Models

Models

🤗 MMR1/MMR1-32B-SFT · model · 9 dl

Taxonomy

Topics: COVID-19 diagnosis using AI · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning