Towards Alleviating Text-to-Image Retrieval Hallucination for CLIP in Zero-shot Learning

Hanyao Wang; Yibing Zhan; Liu Liu; Liang Ding; Yan Yang; Jun Yu

arXiv:2402.18400 · cs.MM · June 28, 2024 · 2 cites

TL;DR

This paper identifies and addresses the hallucination problem in CLIP for zero-shot text-to-image retrieval, proposing a novel method called BSAP that improves retrieval accuracy by using auxiliary prompts and normalization.

Contribution

The paper introduces Balanced Score with Auxiliary Prompts (BSAP), which combines auxiliary prompts with score normalization to reduce CLIP's text-to-image retrieval hallucination in zero-shot retrieval tasks.

Findings

01 BSAP increases CLIP's performance by 20.6% on RefCOCO.

02 The method is effective on referring expression comprehension (REC) and referring image segmentation (RIS) tasks.

03 BSAP is also applicable to other models such as ALBEF and BLIP.

Abstract

Pre-trained cross-modal models, most representatively CLIP, have recently led to a boom in using pre-trained models for cross-modal zero-shot tasks, owing to their generalization properties. However, we analytically discover that CLIP suffers from text-to-image retrieval hallucination, which limits its capabilities under zero-shot learning: when asked to figure out which image perfectly matches a given query text among several candidate images, CLIP would select the image with the highest score even though CLIP knows the contents of the image. Accordingly, we propose a Balanced Score with Auxiliary Prompts (BSAP) to mitigate CLIP's text-to-image retrieval hallucination under zero-shot learning. Specifically, we first design auxiliary prompts to provide multiple reference outcomes for every single image retrieval, then the outcomes derived from each retrieved…
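
The idea of balancing a query score against reference outcomes from auxiliary prompts can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract is truncated before it describes the normalization, so the per-image softmax used here is an assumed stand-in, and the score values and function names are hypothetical.

```python
import math

def softmax(values):
    # Numerically stable softmax over a list of floats.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def raw_retrieval(scores):
    # Baseline: pick the image with the highest raw query score.
    # scores[i][0] is the query score of candidate image i.
    return max(range(len(scores)), key=lambda i: scores[i][0])

def balanced_retrieval(scores):
    # Assumed normalization: for each image, normalize its scores across
    # all prompts (query first, then auxiliary prompts) and compare the
    # normalized query scores instead of the raw ones.
    balanced = [softmax(row)[0] for row in scores]
    return max(range(len(balanced)), key=lambda i: balanced[i])

# Hypothetical similarity scores.
# Rows: candidate images. Columns: [query, auxiliary prompt 1, auxiliary prompt 2].
scores = [
    [5.0, 6.0, 5.5],  # image 0 scores highly for every prompt
    [4.5, 2.0, 2.5],  # image 1 scores lower overall, but the query stands out
]

print(raw_retrieval(scores), balanced_retrieval(scores))
```

In this toy example the raw score favors image 0, which scores highly against every prompt, while the normalized score favors image 1, where the query clearly dominates the auxiliary prompts.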

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI · Multimodal Machine Learning Applications