On the Brittleness of CLIP Text Encoders
Allie Tran, Luca Rossetto

TL;DR
This paper systematically analyzes the stability of CLIP text encoders against various non-semantic perturbations, revealing significant brittleness especially to syntactic and semantic changes, which impacts multimedia retrieval robustness.
Contribution
It provides a comprehensive evaluation of CLIP's vulnerability to different query perturbations, emphasizing the importance of robustness in vision-language models.
Findings
Syntactic and semantic perturbations cause the largest instability in CLIP.
Brittleness is most pronounced with trivial surface edits like punctuation and case.
Robustness should be a key evaluation dimension beyond accuracy.
Abstract
Multimodal co-embedding models, especially CLIP, have advanced the state of the art in zero-shot classification and multimedia information retrieval in recent years by aligning images and text in a shared representation space. However, such models trained on a contrastive alignment can lack stability towards small input perturbations. Especially when dealing with manually expressed queries, minor variations in the query can cause large differences in the ranking of the best-matching results. In this paper, we present a systematic analysis of the effect of multiple classes of non-semantic query perturbations in a multimedia information retrieval scenario. We evaluate a diverse set of lexical, syntactic, and semantic perturbations across multiple CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. Across models, we find that syntactic and semantic…
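The kind of measurement the abstract describes, comparing the ranking for a query against the ranking for a perturbed copy of it, might be sketched as follows. This is an illustration only, not the authors' protocol: `toy_embed` is a hypothetical stand-in for a CLIP text encoder, the four perturbations are examples of surface edits, and top-k Jaccard overlap is an assumed stability measure that the summary above does not name.

```python
import hashlib
import math
import string


def toy_embed(text, dim=16):
    """Hypothetical stand-in for a text encoder (NOT CLIP): hashes
    whitespace-separated tokens into a unit-length vector. It is case- and
    punctuation-sensitive, so surface edits change the embedding."""
    vec = [0.0] * dim
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def rank(query, corpus, embed=toy_embed):
    """Return corpus indices sorted by descending dot-product similarity,
    breaking ties by index so the result is deterministic."""
    q = embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, embed(doc))), i)
        for i, doc in enumerate(corpus)
    ]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


def perturb(query):
    """Example surface perturbations (assumed, not the paper's exact set)."""
    strip_punct = str.maketrans("", "", string.punctuation)
    return {
        "lowercase": query.lower(),
        "uppercase": query.upper(),
        "no_punctuation": query.translate(strip_punct),
        "trailing_period": query.rstrip(".") + ".",
    }


def topk_jaccard(ranking_a, ranking_b, k):
    """Jaccard overlap of the top-k items of two rankings (1.0 = identical sets)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)


def stability_report(query, corpus, k=3, embed=toy_embed):
    """Top-k overlap between the original ranking and each perturbed ranking."""
    base = rank(query, corpus, embed)
    return {
        name: topk_jaccard(base, rank(variant, corpus, embed), k)
        for name, variant in perturb(query).items()
    }
```

To try this against a real model, one would pass an `embed` function that wraps an actual CLIP text encoder and replace `corpus` with precomputed image or video embeddings; values well below 1.0 for edits such as case or punctuation would indicate the kind of brittleness the paper reports.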
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
