Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
Jingqi Xu

TL;DR
Omni-NegCLIP is a fine-tuned CLIP model that improves negation understanding in vision-language tasks by modifying CLIP's InfoNCE contrastive loss and concentrating fine-tuning on the front layers of the text encoder.
Contribution
It introduces a novel contrastive training approach and leverages front-layer text encoder fine-tuning to enhance negation comprehension in CLIP.
Findings
Improves negation understanding by up to 52.65% for presence-based negation.
Enhances overall image-text retrieval performance by up to 19.62%.
Demonstrates superior ability to handle multiple negation types compared to prior methods.
Abstract
Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs such as CLIP perform poorly at understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP's understanding of two types of negation by modifying CLIP's original InfoNCE contrastive loss. Presence-based negation refers to negated expressions of objects that are actually present in an image; absence-based negation refers to negated expressions of objects that may plausibly exist in an image but are in fact absent. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings,…
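The pull/push behaviour of the presence-based objective can be illustrated with a minimal sketch. The abstract is truncated before the loss is written out, so this is not the paper's formulation: it assumes one simple way to get that behaviour, namely an image-to-text InfoNCE loss in which each image's own negated caption is appended as one extra negative. The function name `presence_contrastive_loss`, the helper names, and the temperature of 0.07 are all hypothetical choices made for illustration.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def presence_contrastive_loss(image_embs, caption_embs, negated_embs,
                              temperature=0.07):
    """Image-to-text InfoNCE with one extra hard negative per image.

    ASSUMPTION: the negated caption of image i is appended to the batch
    captions as an additional negative. This is an illustrative variant,
    not the loss defined in the paper.
    """
    images = [normalize(v) for v in image_embs]
    captions = [normalize(v) for v in caption_embs]
    negated = [normalize(v) for v in negated_embs]

    total = 0.0
    for i, img in enumerate(images):
        # Similarities to every caption in the batch (index i is the positive).
        logits = [dot(img, c) / temperature for c in captions]
        # Extra negative: this image's own presence-based negated caption.
        logits.append(dot(img, negated[i]) / temperature)
        # Numerically stable log-sum-exp, then cross-entropy for the positive.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(images)

# Toy 2-D embeddings: each image matches its own caption exactly.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]

# Negated captions far from the image vs. indistinguishable from the caption.
loss_far = presence_contrastive_loss(images, captions, [[-1.0, 0.0], [0.0, -1.0]])
loss_near = presence_contrastive_loss(images, captions, [[1.0, 0.0], [0.0, 1.0]])
print(loss_far, loss_near)
```

In this toy setup the loss is larger when the negated caption embeds at the same point as the original caption than when it embeds far from the image, which is the pressure that pushes negated captions away during fine-tuning.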
