Half-Truths Break Similarity-Based Retrieval

Bora Kargi; Arnas Uselis; Seong Joon Oh

arXiv:2602.23906 · cs.CV · March 2, 2026

TL;DR

This paper identifies a vulnerability in CLIP-style models where adding incorrect details can increase similarity scores, and proposes CS-CLIP to improve robustness and compositional understanding.

Contribution

The paper introduces CS-CLIP, a method that decomposes captions into entities and relations, and fine-tunes the model to better distinguish correct from incorrect details.
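One way to picture "distinguishing correct from incorrect details" is a pairwise ranking loss over caption units. The sketch below is an illustrative assumption, not the objective used by CS-CLIP, which this summary does not specify: the function name `pairwise_unit_loss`, the softplus form, and the `temperature` value are all hypothetical. It takes the image-text similarity of each correct unit and of its foil, and penalizes cases where the foil scores as high as or higher than the correct unit.

```python
import numpy as np

def pairwise_unit_loss(sim_correct, sim_foil, temperature=0.07):
    """Hypothetical ranking loss over (correct unit, foil) similarity pairs.

    Computes mean(log(1 + exp(-(s_correct - s_foil) / temperature))).
    The loss is small when the correct unit scores well above its foil
    and grows when the foil scores higher. This is an assumed form for
    illustration only, not the loss from the paper.
    """
    margin = (np.asarray(sim_correct, dtype=float)
              - np.asarray(sim_foil, dtype=float)) / temperature
    # logaddexp(0, -m) is a numerically stable log(1 + exp(-m)).
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Toy similarities (made-up numbers, not model outputs).
loss_good = pairwise_unit_loss([0.30, 0.28], [0.20, 0.15])  # correct units win
loss_tied = pairwise_unit_loss([0.25], [0.25])              # no preference
loss_bad = pairwise_unit_loss([0.20, 0.15], [0.30, 0.28])   # foils win
```

With these toy inputs the loss is lowest when correct units outscore their foils, equals log 2 when the two are tied, and is highest when the foils win.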

Findings

01. CS-CLIP increases half-truth accuracy to 69.3%.

02. Performance on compositional benchmarks improves by 5.7 points.

03. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, falling to 32.9% when the added detail is a relation.

Abstract

When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit…
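The half-truth test described above can be sketched as a simple comparison of similarity scores. The code below is a minimal illustration under stated assumptions: it uses cosine similarity on precomputed embeddings, the function names `cosine_similarity` and `half_truth_accuracy` are hypothetical, and the toy vectors are made up rather than produced by CLIP. An example counts as correct when the short, fully correct caption scores strictly higher than its extension with a wrong detail.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def half_truth_accuracy(image_emb, short_emb, extended_emb):
    """Fraction of examples where the correct short caption beats
    the half-truth (short caption plus an incorrect detail)."""
    s_short = cosine_similarity(image_emb, short_emb)
    s_extended = cosine_similarity(image_emb, extended_emb)
    return float(np.mean(s_short > s_extended))

# Toy 2-D embeddings (hypothetical values, not model outputs).
image_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
short_emb = np.array([[1.0, 0.1], [0.2, 1.0], [1.0, 0.0]])
extended_emb = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]])

acc = half_truth_accuracy(image_emb, short_emb, extended_emb)
```

In this toy setup only the first example prefers the short caption, so `acc` is one third. A real evaluation would replace the toy arrays with image and text embeddings from the model under test.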

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis