Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models
Yufang Liu, Tao Ji, Changzhi Sun, Yuanbin Wu, Aimin Zhou

TL;DR
This paper investigates where object hallucinations in vision-language systems originate, shows that the CLIP model hallucinates objects even when evaluated in isolation, and proposes a counterfactual data augmentation method to mitigate these hallucinations.
Contribution
The study shows that object hallucinations can arise within CLIP itself, rather than solely from the interaction between the vision and language modalities, and introduces a counterfactual data augmentation approach that builds negative samples to reduce them.
Findings
CLIP models exhibit object hallucinations even in isolation.
Counterfactual data augmentation reduces hallucination occurrences.
Enhanced CLIP models improve reliability in vision-language tasks.
Abstract
Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. To address this, we propose a counterfactual data augmentation method by creating negative samples with a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations…
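A minimal sketch of what creating counterfactual negative samples could look like, assuming the simplest case where one object mentioned in a caption is swapped for an object absent from the image. The function name `make_counterfactual_captions`, its parameters, and the word-level replacement strategy are hypothetical illustrations, not the authors' implementation; the abstract mentions a variety of hallucination issues that this sketch does not attempt to cover.

```python
import random


def make_counterfactual_captions(caption, present_objects, vocabulary,
                                 num_negatives=2, seed=0):
    """Build negative captions by replacing a present object with an absent one.

    Assumption (not from the paper): objects are single whitespace-separated
    words, and a hallucinated caption differs from the original by one object.
    """
    rng = random.Random(seed)
    words = caption.split()
    # Objects from the vocabulary that do NOT appear in the image.
    absent = sorted(set(vocabulary) - set(present_objects))
    # Present objects that the caption actually mentions.
    mentioned = [obj for obj in present_objects if obj in words]
    if not absent or not mentioned:
        return []

    negatives = []
    for _ in range(num_negatives):
        target = rng.choice(mentioned)       # object to remove
        replacement = rng.choice(absent)     # hallucinated object to insert
        swapped = [replacement if w == target else w for w in words]
        negatives.append(" ".join(swapped))
    return negatives


if __name__ == "__main__":
    negs = make_counterfactual_captions(
        "a dog on a sofa",
        present_objects=["dog", "sofa"],
        vocabulary=["dog", "sofa", "cat", "car"],
        num_negatives=3,
    )
    for n in negs:
        print(n)
```

In a contrastive fine-tuning setup, such captions would typically be paired with the original image as hard negatives, so that the model is penalized for scoring a caption with a non-existent object as highly as the true caption.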
Taxonomy
Topics: Functional Brain Connectivity Studies · CCD and CMOS Imaging Sensors · EEG and Brain-Computer Interfaces
Methods: Contrastive Language-Image Pre-training
