Contrastive vision-language learning with paraphrasing and negation

Kwun Ho Ngan; Saman Sadeghi Afgeh; Joe Townsend; Artur d'Avila Garcez

arXiv:2511.16527·cs.CV·November 21, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces SemCLIP, a contrastive learning method that improves vision-language models' robustness to paraphrasing and negation by using a new loss function and LLM-generated training triples, enhancing image retrieval and classification accuracy.

Contribution

SemCLIP's novel loss function and training approach effectively distinguish negated captions from original images while maintaining performance on paraphrased captions, improving robustness to semantic variations.

Findings

01. SemCLIP improves negation robustness, increasing accuracy from 68.1% to 78.1% on CC-Neg.

02. SemCLIP maintains comparable performance to CLIP on original caption retrieval.

03. SemCLIP outperforms models trained with negated captions on downstream tasks.
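
A negation accuracy such as the CC-Neg figure above can be read as a pairwise preference test. The sketch below assumes that protocol: an image counts as correct when its true caption scores a higher cosine similarity than its negated caption. The source does not state the exact evaluation procedure, so the function name `negation_accuracy` and the scoring rule are assumptions made for illustration.

```python
import numpy as np

def negation_accuracy(img, true_cap, negated_cap):
    """Fraction of images whose true caption outscores its negated caption.

    Assumed protocol (not confirmed by the source): each argument is an
    (n, d) array of embeddings, and row i of each array belongs together.
    """
    def unit(x):
        # Normalise rows so that dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    img, true_cap, negated_cap = unit(img), unit(true_cap), unit(negated_cap)
    sim_true = (img * true_cap).sum(axis=1)    # image vs. true caption
    sim_neg = (img * negated_cap).sum(axis=1)  # image vs. negated caption
    return float((sim_true > sim_neg).mean())

# Toy embeddings: the first two images prefer the true caption, the last two do not.
images = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
true_caps = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
negated_caps = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
print(negation_accuracy(images, true_caps, negated_caps))
```

Under this reading, a model that ignores the negation word and embeds both captions almost identically would land near chance, which is why the metric is sensitive to small lexical changes.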

Abstract

Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original,…
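
One way to picture a contrastive objective that accounts for both paraphrasing and negation is the NumPy sketch below. It is an illustrative assumption, not the loss defined in the paper: the function names, the cosine attraction term, the hinge margin, and the weights `w_para` and `w_neg` are all hypothetical choices made for this example. Only the symmetric CLIP-style term follows the standard, well-known formulation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss; matching image/text pairs lie on the diagonal."""
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def triple_loss(img, orig, para, neg, temperature=0.07,
                w_para=1.0, w_neg=1.0, margin=0.2):
    """Hypothetical loss over (original, paraphrase, negation) caption triples."""
    img, orig, para, neg = (l2_normalize(e) for e in (img, orig, para, neg))
    base = clip_loss(img, orig, temperature)
    # Attraction (assumed form): a paraphrase should stay close to its original caption.
    l_para = (1.0 - (orig * para).sum(axis=1)).mean()
    # Repulsion (assumed form): hinge that reaches zero once the image is closer
    # to the original caption than to the negated caption by at least `margin`.
    sim_pos = (img * orig).sum(axis=1)
    sim_neg = (img * neg).sum(axis=1)
    l_neg = np.maximum(0.0, margin - sim_pos + sim_neg).mean()
    return base + w_para * l_para + w_neg * l_neg

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
# Negated captions embedded opposite the image: the repulsion term is zero.
well_separated = triple_loss(img, img, img, -img)
# Negated captions embedded exactly like the originals: the full margin is paid.
collapsed = triple_loss(img, img, img, img)
print(well_separated < collapsed)
```

In this toy setup the two totals differ by exactly the margin, since the CLIP-style term and the paraphrase term are identical in both calls. Keeping the attraction and repulsion terms separate makes it easy to weight them independently, at the cost of two extra hyperparameters.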

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper proposes to jointly model the two opposing yet critical semantic transformations—equivalence (paraphrasing) and contradiction (negation)—within a single unified contrastive learning framework. This approach is intriguing, establishing a necessary research direction for exploring the holistic semantic robustness of multimodal models.

Weaknesses

* Despite the joint objective, paraphrase robustness does not improve: on SCPP, SemCLIP underperforms the CLIP baseline ($53.1\%$ vs. $60.0\%$), and on CC-Neg paraphrase it trails a "Paraphrase-only" variant ($21.0\%$ vs. $23.0\%$). This pattern suggests a practical tension between the attractive force of $L_{\text{paraphrase}}$ and the repulsive force of $L_{\text{negation}}$.
* Although negation robustness improves, it remains far from CoN-CLIP ($\text{CC-Neg Acc}_{\text{neg}}$ $78.1\%$ …

Reviewer 02 · Rating 2 · Confidence 4

Strengths

N/A

Weaknesses

1. Lack of technical novelty. Fine-tuning CLIP with negation data or paraphrasing data is not a new idea.
2. Lack of comprehensive evaluation. The proposed SemCLIP model was only evaluated on two compositionality benchmarks and five classification benchmarks (CIFAR-10, CIFAR-100, Food-101, Flowers-102, Oxford-IIIT Pet). This is clearly insufficient to evaluate a CLIP model. Evaluation on more benchmarks (e.g., VTAB+ for classification, COCO/Flickr for text-image retrieval) is necessary for a solid pap…

Reviewer 03 · Rating 2 · Confidence 5

Strengths

The authors tackle the important problem of negation in multimodal retrieval.

Weaknesses

- CLIP is now outdated, and many newer multimodal models perform much better than CLIP. See the MMEB leaderboard (V1) and the models on it.
- Most of these models are expected to be very robust to paraphrases.
- Comparisons with CoN-CLIP, NegCLIP, and ParaCLIP are missing.
- Missing ablations:
  - What is the need for the extra projection layer? Ablations need to be performed.
  - Why not use a contrastive loss with the new (anchor, paraphrase, negative) triple? Why use two separate losses? Ablation need…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques