Half-Truths Break Similarity-Based Retrieval
Bora Kargi, Arnas Uselis, Seong Joon Oh

TL;DR
This paper identifies a vulnerability in CLIP-style models where adding incorrect details can increase similarity scores, and proposes CS-CLIP to improve robustness and compositional understanding.
Contribution
The paper introduces CS-CLIP (Component-Supervised CLIP), a fine-tuning method that decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and trains the model to distinguish the correct unit from its foil.
Findings
CS-CLIP increases half-truth accuracy to 69.3%.
Performance on compositional benchmarks improves by 5.7 points.
On COCO, baseline CLIP prefers the correct shorter description only 40.6% of the time, dropping to 32.9% when the added detail is a relation.
Abstract
When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit…
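The half-truth test described in the abstract can be sketched as a simple pairwise comparison. The snippet below is a minimal illustration, not the authors' evaluation code: the function names, the toy 2-D embeddings, and the choice to count ties as failures are all assumptions made for this example. In practice the vectors would come from a CLIP-style image encoder and text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def half_truth_accuracy(triples):
    """Fraction of (image, short, half_truth) embedding triples in which
    the correct short caption scores strictly higher than the extended
    caption containing one wrong detail.
    Assumption: a tie counts as a failure."""
    wins = sum(
        1 for image, short, half_truth in triples
        if cosine(image, short) > cosine(image, half_truth)
    )
    return wins / len(triples)

# Hypothetical toy embeddings (image, short caption, half-truth caption).
triples = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 1.0]),  # short caption scores higher
    ([1.0, 1.0], [1.0, 0.0], [1.0, 1.0]),  # half-truth scores higher
]
accuracy = half_truth_accuracy(triples)
print(accuracy)
```

Under this reading, a model that always penalizes the wrong detail would reach 1.0, and the 40.6% figure reported for CLIP on COCO corresponds to an accuracy of 0.406.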
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
