Contrastive vision-language learning with paraphrasing and negation
Kwun Ho Ngan, Saman Sadeghi Afgeh, Joe Townsend, Artur d'Avila Garcez

TL;DR
This paper introduces SemCLIP, a contrastive learning method that improves vision-language models' robustness to paraphrasing and negation by using a new loss function and LLM-generated training triples, enhancing image retrieval and classification accuracy.
Contribution
SemCLIP's novel loss function and training approach effectively distinguish negated captions from original images while maintaining performance on paraphrased captions, improving robustness to semantic variations.
Findings
SemCLIP improves negation robustness, increasing accuracy from 68.1% to 78.1% on CC-Neg.
SemCLIP maintains comparable performance to CLIP on original caption retrieval.
SemCLIP outperforms models trained with negated captions on downstream tasks.
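The negation accuracy quoted above can be read as a pairwise test: for each image, is the original caption scored higher than its negated counterpart? The sketch below assumes that scoring rule and assumes cosine similarity over precomputed embeddings; neither detail is confirmed by this summary, and the function name `negation_accuracy` is hypothetical.

```python
import numpy as np

def negation_accuracy(img, orig, neg):
    """Fraction of images whose original caption beats the negated one.

    img, orig, neg: arrays of shape (n, d), row i belonging to the same sample.
    Assumption: similarity is cosine similarity between embeddings.
    """
    # Normalise rows so that dot products are cosine similarities.
    img, orig, neg = (
        x / np.linalg.norm(x, axis=-1, keepdims=True) for x in (img, orig, neg)
    )
    sim_orig = (img * orig).sum(-1)  # image vs. original caption
    sim_neg = (img * neg).sum(-1)    # image vs. negated caption
    return float((sim_orig > sim_neg).mean())
```

Under this reading, a score of 50% corresponds to chance, since each image has exactly two candidate captions.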
Abstract
Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original,…
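To make the idea of a loss over triples concrete, here is a minimal NumPy sketch. It is not the paper's loss, whose exact form is not given in this summary. It assumes a standard symmetric CLIP-style contrastive term plus two illustrative extras: a term that pulls the paraphrase embedding toward the original caption, and a hinge term that pushes the negated caption away from the image. The function names, the weights `w_para` and `w_neg`, the `margin`, and the temperature value are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, temperature=0.07):
    # Symmetric cross-entropy over the image-text similarity matrix;
    # matching pairs sit on the diagonal.
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (i2t + t2i)

def semantic_triple_loss(img, orig, para, neg, w_para=1.0, w_neg=1.0, margin=0.0):
    # All inputs have shape (n, d); row i of each array belongs to one triple.
    img, orig, para, neg = map(l2_normalize, (img, orig, para, neg))
    contrastive = clip_loss(img, orig)
    # Attraction (assumed form): paraphrase should match the original caption.
    paraphrase = (1.0 - (orig * para).sum(-1)).mean()
    # Repulsion (assumed form): penalise image-negation similarity above margin.
    negation = np.maximum(0.0, (img * neg).sum(-1) - margin).mean()
    return contrastive + w_para * paraphrase + w_neg * negation
```

In this sketch the two extra terms act in opposite directions on nearby text embeddings, so the weights control how much paraphrase alignment is traded against negation separation.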
Peer Reviews
Decision · Submitted to ICLR 2026
The paper proposes to jointly model the two opposing yet critical semantic transformations—equivalence (paraphrasing) and contradiction (negation)—within a single unified contrastive learning framework. This approach is intriguing, establishing a necessary research direction for exploring the holistic semantic robustness of multimodal models.
* Despite the joint objective, paraphrase robustness does not improve: on SCPP, SemCLIP underperforms the CLIP baseline ($53.1\%$ vs. $60.0\%$), and on CC-Neg paraphrase it trails a "Paraphrase-only" variant ($21.0\%$ vs. $23.0\%$). This pattern suggests a practical tension between the attractive force of $L_{\text{paraphrase}}$ and the repulsive force of $L_{\text{negation}}$.
* Although negation robustness improves, it remains far from CoN-CLIP ($\text{CC-Neg Acc}_{\text{neg}}$ $78.1\\
N/A
1. Lack of technical novelty. It is not a new idea to finetune CLIP with negation data or paraphrasing data.
2. Lack of comprehensive evaluation. The proposed SemCLIP model was only evaluated on two compositionality benchmarks and 5 classification benchmarks (CIFAR-10, CIFAR-100, FOODS101, FLOWERS102, OXFORD Pet). This is clearly insufficient to evaluate a CLIP model. Evaluation on more benchmarks (e.g. VTAB+ for classification, COCO/Flickr for text-image retrieval) is necessary for a solid pap
**Strengths:** Authors tackle an important problem of negation in multimodal retrieval.
**Weaknesses:**
- CLIP is now outdated, and many newer multimodal models perform much better than CLIP. See the MMEB leaderboard (V1) and the models on it.
- Most of these models are expected to be very robust to paraphrases.
- Comparisons with ConCLIP, NegCLIP and ParaCLIP are missing.
- Missing ablations:
  - What is the need for the extra projection layer? Ablations need to be performed.
  - Why not use a contrastive loss with the new (anchor, paraphrase, negative) triples? Why use two separate losses? Ablation need
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
