Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations

Jaisidh Singh; Ishaan Shrivastava; Mayank Vatsa; Richa Singh; Aparna Bharati

arXiv:2403.20312·cs.CV·March 13, 2025·1 cites

TL;DR

This paper introduces CC-Neg and CoN-CLIP to enhance vision-language models' understanding of negations, leading to better semantic encoding and improved zero-shot classification accuracy.

Contribution

The paper presents a new dataset CC-Neg and a training framework CoN-CLIP that significantly improve VLMs' comprehension of negations and compositional semantics.

Findings

01. 3.85% average gain in zero-shot accuracy across 8 datasets
02. Outperforms CLIP on compositionality benchmarks by 4.4%
03. Enhances semantic encoding with reduced computational cost

Abstract

Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt and impairing visual semantic matching and reasoning. An important aspect of reasoning in logic and language is negation. This paper highlights the limitations of popular VLMs, such as CLIP, at understanding the implications of negations, i.e., the effect of the word "not" in a given prompt. To enable evaluation of VLMs on fluent prompts with negations, we present CC-Neg, a dataset containing 228,246 images, true captions, and their corresponding negated captions. Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework has an improved understanding of negations. This training paradigm improves CoN-CLIP's ability to encode semantics reliably, resulting in a 3.85% average gain in top-1 accuracy for zero-shot image classification…
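
The abstract describes two ideas that a small example can make concrete: a contrastive loss that also sees negated captions, and an evaluation that checks whether an image matches its true caption more closely than the negated one. The sketch below is illustrative only and is not the paper's exact objective. The function names `negation_aware_loss` and `negation_accuracy`, the choice to append negated captions as extra negatives in an image-to-text InfoNCE loss, and the temperature of 0.07 are all assumptions made for this example; embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negation_aware_loss(image_embs, caption_embs, negated_embs, temperature=0.07):
    """Image-to-text InfoNCE with negated captions as extra negatives.

    Hypothetical sketch: for image i the positive is caption i, and the
    negatives are every other true caption plus every negated caption.
    """
    # Candidate texts: all true captions first, then all negated captions.
    candidates = caption_embs + negated_embs
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, text) / temperature for text in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching true caption (index i) as target.
        total += log_denominator - logits[i]
    return total / len(image_embs)


def negation_accuracy(image_embs, caption_embs, negated_embs):
    """Fraction of images scored closer to the true caption than to its negation."""
    correct = sum(
        cosine(img, cap) > cosine(img, neg)
        for img, cap, neg in zip(image_embs, caption_embs, negated_embs)
    )
    return correct / len(image_embs)


# Toy embeddings (made up for illustration, not taken from CC-Neg).
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]
negated = [[0.1, 0.9], [0.9, 0.1]]

loss = negation_aware_loss(images, captions, negated)
accuracy = negation_accuracy(images, captions, negated)
```

On the toy data, the true captions point in the same direction as their images, so the accuracy is perfect and the loss is small; swapping the caption and negated-caption lists makes the loss grow, which is the signal a model trained this way would be pushed to avoid.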

Code & Models

Repositories

jaisidhsingh/con-clip (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Contrastive Language-Image Pre-training