Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

Jingqi Xu

arXiv:2603.29258·cs.CV·May 5, 2026

TL;DR

Omni-NegCLIP is a fine-tuned CLIP model that significantly improves negation understanding in vision-language tasks by modifying CLIP's InfoNCE contrastive loss and fine-tuning the front layers of the text encoder.

Contribution

The paper introduces a modified contrastive training objective and fine-tunes the front layers of the text encoder to enhance negation comprehension in CLIP.
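
Front-layer fine-tuning can be pictured as a choice of which parameters receive gradient updates. The sketch below is illustrative only and is not taken from the paper: the parameter naming scheme (`text_model.encoder.layers.<index>.…`), the helper names, and the cutoff `num_front_layers` are all assumptions, and it further assumes that everything outside the first text-encoder blocks stays frozen.

```python
import re

# Assumed naming scheme for text-encoder blocks: "text_model.encoder.layers.<index>.<rest>".
# Both the pattern and the helpers below are hypothetical, for illustration only.
_TEXT_LAYER = re.compile(r"^text_model\.encoder\.layers\.(\d+)\.")


def is_trainable(param_name, num_front_layers):
    """Return True only for parameters in the first `num_front_layers` text blocks."""
    match = _TEXT_LAYER.match(param_name)
    if match is None:
        # Not a text-encoder block (e.g. vision tower, embeddings): assumed frozen.
        return False
    return int(match.group(1)) < num_front_layers


def split_parameters(param_names, num_front_layers):
    """Partition parameter names into (trainable, frozen) lists, preserving order."""
    trainable = [n for n in param_names if is_trainable(n, num_front_layers)]
    frozen = [n for n in param_names if not is_trainable(n, num_front_layers)]
    return trainable, frozen
```

In a PyTorch training script, a split like this would typically be applied by setting `requires_grad` to `False` on every frozen parameter before building the optimizer. How many front layers to train is a hyperparameter that this page does not state.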

Findings

01. Improves negation understanding by up to 52.65% for presence-based negation.

02. Enhances overall image-text retrieval performance by up to 19.62%.

03. Demonstrates superior ability to handle multiple negation types compared to prior methods.

Abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs such as CLIP perform poorly at understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP's understanding of two types of negation by modifying CLIP's original InfoNCE contrastive loss. Presence-based negation refers to negated expressions of objects that are actually present in an image; absence-based negation refers to negated expressions of objects that may plausibly exist in an image but are in fact absent. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings,…
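
The pull-and-push behaviour of the presence-based objective can be illustrated with a minimal sketch. It is not the paper's exact loss, which the truncated abstract does not fully specify. It assumes one common formulation: an image-to-text InfoNCE term in which each image's negated caption is appended to the softmax denominator as an extra hard negative. The function names and the temperature value of 0.07 are assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def presence_contrastive_loss(image_embs, caption_embs, negated_embs, temperature=0.07):
    """Image-to-text InfoNCE with one extra hard negative per image.

    Assumed formulation: for image i, caption i is the positive, the other
    captions in the batch are negatives, and negated caption i is an
    additional negative.
    """
    images = [normalize(v) for v in image_embs]
    captions = [normalize(v) for v in caption_embs]
    negated = [normalize(v) for v in negated_embs]

    total = 0.0
    for i, img in enumerate(images):
        # Similarity to every original caption in the batch (index i is the positive).
        logits = [dot(img, c) / temperature for c in captions]
        # Similarity to this image's own negated caption, treated as a negative.
        logits.append(dot(img, negated[i]) / temperature)
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy for the positive pair: -log softmax(logits)[i].
        total += log_denominator - logits[i]
    return total / len(images)
```

Under this formulation the loss falls as an image embedding moves toward its original caption and away from its negated caption. If the negated caption embedding coincides with the original, the two are indistinguishable and the per-image loss approaches log 2.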
