One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

Hiroyuki Deguchi; Katsuki Chousa; Yusuke Sakai

arXiv:2604.27674·cs.CL·May 1, 2026

TL;DR

This paper exposes a vulnerability in cross-modal encoders caused by hubness: a single hub text can receive artificially high similarity scores against many unrelated images, which undermines image captioning evaluation and image-to-text retrieval.

Contribution

The authors propose a method to identify hub embeddings and texts, revealing how a single hub can compromise the reliability of cross-modal similarity assessments.

Findings

01

A single hub text can match many different images about as well as human-written captions do.

02

Hubness can cause false positives in image-text retrieval tasks.

03

The proposed method identifies such hub texts for cross-modal encoders, with experiments on MSCOCO, nocaps, and Flickr30k.

Abstract

The hubness problem, in which hub embeddings are close to many unrelated examples, occurs often in high-dimensional embedding spaces and may pose a practical threat for purposes such as information retrieval and automatic evaluation metrics. In particular, since cross-modal similarity between text and images cannot be calculated by direct comparisons, such as string matching, cross-modal encoders that project different modalities into a shared space are helpful for various cross-modal applications, and thus, the existence of hubs may pose practical threats. To reveal the vulnerabilities of cross-modal encoders, we propose a method for identifying the hub embedding and its corresponding hub text. Experiments on image captioning evaluation in MSCOCO and nocaps along with image-to-text retrieval tasks in MSCOCO and Flickr30k showed that our method can identify a single hub text that…
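To make the notion of a hub concrete, the sketch below computes a common hubness statistic, the k-occurrence count: how many queries include a given candidate among their k nearest neighbors. This is not the authors' identification method, which the abstract does not detail. The function name `k_occurrence`, the synthetic embeddings, and the artificially constructed hub are all assumptions made for illustration; real use would substitute image and text embeddings from an actual cross-modal encoder.

```python
import numpy as np


def k_occurrence(queries: np.ndarray, candidates: np.ndarray, k: int = 5) -> np.ndarray:
    """Count how often each candidate is among the top-k cosine neighbors of a query."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T  # cosine similarities, shape (n_queries, n_candidates)
    # Indices of the k most similar candidates per query (order within the top-k is irrelevant).
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=candidates.shape[0])


# Toy data (assumption): "image" embeddings share a common offset, so they
# occupy a narrow region of the space, while ordinary "text" embeddings do not.
rng = np.random.default_rng(0)
n_images, n_texts, dim, k = 200, 50, 64, 5
images = rng.normal(size=(n_images, dim)) + np.ones(dim)
texts = rng.normal(size=(n_texts, dim))

# Hypothetical hub: a single text embedding placed at the centroid of the images.
hub = images.mean(axis=0, keepdims=True)
candidates = np.vstack([texts, hub])
hub_index = n_texts  # the hub is the last candidate row

counts = k_occurrence(images, candidates, k=k)
print("hub k-occurrence:", counts[hub_index])
print("largest k-occurrence among ordinary texts:", counts[:hub_index].max())
```

In this toy setting the hub candidate lands in the top-k list of far more image queries than any ordinary text does, which is the retrieval false-positive pattern the summary describes. A heavily skewed k-occurrence distribution over real embeddings would be one hedged signal that hubs are present.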
