Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models

Yufang Liu; Tao Ji; Changzhi Sun; Yuanbin Wu; Aimin Zhou

arXiv:2410.03176·cs.CV·October 7, 2024

TL;DR

This paper investigates where object hallucinations in vision-language systems originate, shows that the CLIP model is prone to them even when evaluated in isolation, and proposes a counterfactual data augmentation method that mitigates them.

Contribution

The study shows that object hallucinations already arise within CLIP itself, rather than solely from the interaction between the vision and language modalities, and introduces a counterfactual data augmentation approach that builds negative samples to reduce them.

Findings

01

CLIP models exhibit object hallucinations even in isolation.

02

Counterfactual data augmentation reduces hallucination occurrences.

03

Enhanced CLIP models improve reliability in vision-language tasks.
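One way to picture the first finding is as a pairwise test: given an image embedding, does the encoder score a faithful caption above captions that mention an object not in the image? The sketch below is a minimal illustration under assumptions, not the authors' evaluation code. The function names, the toy two-dimensional vectors, and the strict "faithful must beat every hallucinated caption" criterion are all hypothetical; in practice the embeddings would come from a CLIP image and text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prefers_faithful(image_emb, faithful_emb, hallucinated_embs):
    """True if the faithful caption outscores every hallucinated caption."""
    s_pos = cosine(image_emb, faithful_emb)
    return all(s_pos > cosine(image_emb, h) for h in hallucinated_embs)

def pairwise_accuracy(examples):
    """Fraction of examples where the faithful caption wins."""
    hits = sum(prefers_faithful(*ex) for ex in examples)
    return hits / len(examples)

# Toy embeddings (hypothetical values, chosen only to show both outcomes).
toy_examples = [
    # Faithful caption is closest to the image: counted as correct.
    ([1.0, 0.0], [0.9, 0.1], [[0.5, 0.5]]),
    # A hallucinated caption is closer to the image: counted as a failure.
    ([0.0, 1.0], [0.6, 0.4], [[0.2, 0.8]]),
]

accuracy = pairwise_accuracy(toy_examples)
```

A low accuracy on such a test, measured on the encoder alone with no language model attached, is the kind of evidence that would point to hallucination inside CLIP rather than in a downstream component.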

Abstract

Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. To address this, we propose a counterfactual data augmentation method by creating negative samples with a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations…
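The abstract describes the augmentation only as "creating negative samples with a variety of hallucination issues". As a rough sketch of what such negatives could look like, the snippet below builds counterfactual captions by swapping a present object for an absent one, or by appending an absent object. This is an assumption-laden illustration rather than the paper's procedure: `make_negative_captions`, its arguments, and the two perturbation types are hypothetical names and choices.

```python
import re

def make_negative_captions(caption, present_objects, candidate_objects):
    """Build hallucinated variants of a caption (illustrative only).

    present_objects: objects assumed to be in the image.
    candidate_objects: vocabulary from which absent objects are drawn.
    """
    absent = [o for o in candidate_objects if o not in present_objects]
    negatives = []

    # Substitution: replace a mentioned object with one that is not in the image.
    for obj in present_objects:
        pattern = r"\b" + re.escape(obj) + r"\b"
        if not re.search(pattern, caption):
            continue
        for sub in absent:
            negatives.append(re.sub(pattern, sub, caption, count=1))

    # Addition: mention an extra object that is not in the image.
    for extra in absent:
        negatives.append(caption.rstrip(".") + " and a " + extra + ".")

    return negatives

caption = "A dog sits on a bench."
negatives = make_negative_captions(
    caption,
    present_objects=["dog", "bench"],
    candidate_objects=["dog", "cat", "bench", "car"],
)
```

Each negative differs from the original caption by a single object, so a model fine-tuned to rank the original above these variants is pushed to attend to which objects are actually present.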

Code & Models

Repositories

yufang-liu/clip_hallucination
PyTorch · Official

Videos

Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models

Taxonomy

Topics: Functional Brain Connectivity Studies · CCD and CMOS Imaging Sensors · EEG and Brain-Computer Interfaces

Methods: Contrastive Language-Image Pre-training