Towards Alleviating Text-to-Image Retrieval Hallucination for CLIP in Zero-shot Learning
Hanyao Wang, Yibing Zhan, Liu Liu, Liang Ding, Yan Yang, Jun Yu

TL;DR
This paper identifies and addresses the hallucination problem in CLIP for zero-shot text-to-image retrieval, proposing a novel method called BSAP that improves retrieval accuracy by using auxiliary prompts and normalization.
Contribution
The paper introduces BSAP (Balanced Score with Auxiliary Prompts), which combines auxiliary prompts with score normalization to reduce hallucination in CLIP on zero-shot retrieval tasks.
Findings
BSAP increases CLIP's performance by 20.6% on RefCOCO.
The method is effective on referring expression comprehension (REC) and referring image segmentation (RIS) tasks.
The method also applies to other models such as ALBEF and BLIP.
Abstract
Pre-trained cross-modal models, most notably CLIP, have recently led to a boom in using pre-trained models for cross-modal zero-shot tasks, owing to their generalization properties. However, we analytically discover that CLIP suffers from text-to-image retrieval hallucination, which limits its capabilities under zero-shot learning: when asked which of several candidate images perfectly matches a given query text, CLIP selects the image with the highest score, even though CLIP knows the contents of the image. Accordingly, we propose a Balanced Score with Auxiliary Prompts (BSAP) to mitigate CLIP's text-to-image retrieval hallucination under zero-shot learning. Specifically, we first design auxiliary prompts to provide multiple reference outcomes for every single image retrieval, then the outcomes derived from each retrieved…
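The idea in the abstract can be illustrated with a small sketch. The abstract is truncated before it describes the normalization step, so everything below is an assumption made for illustration and not the authors' formulation: the softmax across prompts, the `temperature` parameter, the function names, and the toy similarity values are all hypothetical. The sketch only shows why comparing raw scores across images can mislead, and how reference scores from auxiliary prompts can rebalance the comparison.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def raw_retrieval(query_scores):
    """Baseline: pick the image with the highest raw query similarity."""
    return max(range(len(query_scores)), key=lambda i: query_scores[i])


def balanced_retrieval(score_rows, temperature=1.0):
    """Pick the image whose query score stands out most against its
    own auxiliary-prompt scores.

    score_rows[i] = [sim(image_i, query), sim(image_i, aux_1), ...]
    The per-image softmax is an assumed normalization, not the paper's.
    """
    balanced = []
    for row in score_rows:
        probs = softmax([s / temperature for s in row])
        balanced.append(probs[0])  # share of mass assigned to the query
    best = max(range(len(balanced)), key=lambda i: balanced[i])
    return best, balanced


# Toy similarities (made up): image 0 scores high on every prompt,
# image 1 scores high only on the actual query.
rows = [
    [0.30, 0.32, 0.31],
    [0.27, 0.18, 0.17],
]
raw_choice = raw_retrieval([row[0] for row in rows])
balanced_choice, balanced_scores = balanced_retrieval(rows)
```

In this toy case the raw score favors image 0, which matches every prompt about equally well, while the balanced score favors image 1, whose similarity to the query clearly exceeds its similarity to the auxiliary prompts.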
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI · Multimodal Machine Learning Applications
