One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness
Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai

TL;DR
This paper exposes vulnerabilities in cross-modal encoders caused by hubness, demonstrating that a single hub text can artificially inflate similarity scores in image captioning evaluation and image-to-text retrieval.
Contribution
The authors propose a method to identify hub embeddings and texts, revealing how a single hub can compromise the reliability of cross-modal similarity assessments.
Findings
A single hub text can score about as high a similarity against many unrelated images as human-written captions do.
Hubness can cause false positives in image-text retrieval tasks.
The method effectively identifies problematic hub texts in cross-modal models.
Abstract
The hubness problem, in which hub embeddings are close to many unrelated examples, occurs often in high-dimensional embedding spaces and may pose a practical threat for purposes such as information retrieval and automatic evaluation metrics. In particular, since cross-modal similarity between text and images cannot be calculated by direct comparisons, such as string matching, cross-modal encoders that project different modalities into a shared space are helpful for various cross-modal applications, and thus, the existence of hubs may pose practical threats. To reveal the vulnerabilities of cross-modal encoders, we propose a method for identifying the hub embedding and its corresponding hub text. Experiments on image captioning evaluation in MSCOCO and nocaps along with image-to-text retrieval tasks in MSCOCO and Flickr30k showed that our method can identify a single hub text that…
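To make the notion of a hub concrete, the following is a minimal sketch of the standard k-occurrence diagnostic: for each candidate text, count how many images rank it among their k most similar texts. It is not the authors' identification method, which the summary above does not describe in detail. The function name `k_occurrence` and the toy 2-D vectors are assumptions for illustration; in practice the embeddings would come from a cross-modal encoder such as CLIP.

```python
import numpy as np

def k_occurrence(image_emb, text_emb, k=1):
    """Count, for each text, how many images rank it in their top-k by cosine similarity.

    image_emb: array of shape (n_images, dim)
    text_emb:  array of shape (n_texts, dim)
    A text with a count far above the others behaves like a hub.
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = img @ txt.T  # shape (n_images, n_texts)
    # Indices of the k most similar texts for every image.
    topk = np.argsort(-sim, axis=1)[:, :k]
    # Tally how often each text index appears across all images.
    return np.bincount(topk.ravel(), minlength=txt.shape[0])

# Toy data (hypothetical, not from the paper): text 0 lies between all images.
images = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
texts = np.array([[1.0, 1.0], [1.0, -2.0], [-2.0, 1.0]])
counts = k_occurrence(images, texts, k=1)
print(counts)  # text 0 is the top match for all three images
```

A heavily skewed count distribution, where one text dominates the top-k lists of many unrelated images, is the symptom that the abstract describes as a practical threat to retrieval and to similarity-based evaluation metrics.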
