Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations
Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, Aparna Bharati

TL;DR
This paper introduces CC-Neg and CoN-CLIP to enhance vision-language models' understanding of negations, leading to better semantic encoding and improved zero-shot classification accuracy.
Contribution
The paper presents a new dataset, CC-Neg, and a training framework, CoN-CLIP, that together improve VLMs' comprehension of negations and compositional semantics.
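The summary does not spell out how CoN-CLIP changes the contrastive loss, so the following is only a minimal sketch of one common way to use negated captions in a CLIP-style objective: treat them as extra negative columns in the image-to-text softmax. The function name `image_to_text_loss`, the toy similarity values, and the temperature are all assumptions made for illustration, not the paper's formulation.

```python
import math

def image_to_text_loss(sim_rows, temperature=0.07):
    """Mean cross-entropy over images, where row i's correct caption is column i.

    Columns beyond the number of images act as additional negatives
    (here: negated captions). Hypothetical helper, not the CoN-CLIP loss.
    """
    total = 0.0
    for i, row in enumerate(sim_rows):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(sim_rows)

# Toy cosine similarities for 2 images (assumed values).
# Columns 0-1: true captions; columns 2-3: negated captions.
base = [[0.9, 0.1],
        [0.2, 0.8]]
with_negated = [[0.9, 0.1, 0.85, 0.05],
                [0.2, 0.8, 0.10, 0.75]]

loss_base = image_to_text_loss(base)
loss_with_negated = image_to_text_loss(with_negated)
print(loss_base, loss_with_negated)
```

In this toy setup the negated captions score almost as high as the true ones, so adding them raises the loss. A model trained under such an objective would be pushed to separate an image from the negated form of its own caption.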
Findings
3.85% average gain in zero-shot accuracy across 8 datasets
Outperforms CLIP on compositionality benchmarks by 4.4%
Enhances semantic encoding with reduced computational cost
Abstract
Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt and impairing visual semantic matching and reasoning. An important aspect of reasoning in logic and language is negation. This paper highlights the limitations of popular VLMs such as CLIP at understanding the implications of negations, i.e., the effect of the word "not" in a given prompt. To enable evaluation of VLMs on fluent prompts with negations, we present CC-Neg, a dataset containing 228,246 images, true captions, and their corresponding negated captions. Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework has an improved understanding of negations. This training paradigm improves CoN-CLIP's ability to encode semantics reliably, resulting in a 3.85% average gain in top-1 accuracy for zero-shot image classification…
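As a rough illustration of how image, true-caption, and negated-caption triples like those in CC-Neg could be used for evaluation, the sketch below counts how often an image embedding is closer to its true caption than to the negated one. The helper names (`cosine`, `negation_accuracy`) and the two-dimensional toy embeddings are assumptions for illustration; a real evaluation would use embeddings from a VLM's image and text encoders, and the paper's exact protocol may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def negation_accuracy(triples):
    """Fraction of (image, true caption, negated caption) embedding triples
    in which the image is more similar to the true caption."""
    correct = sum(
        1
        for image, true_caption, negated_caption in triples
        if cosine(image, true_caption) > cosine(image, negated_caption)
    )
    return correct / len(triples)

# Toy embeddings (assumed values, not model outputs).
toy_triples = [
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),  # true caption is closer
    ([0.0, 1.0], [0.2, 0.8], [0.9, 0.1]),  # true caption is closer
    ([1.0, 1.0], [1.0, 0.0], [1.0, 1.0]),  # negated caption is closer (failure)
]

accuracy = negation_accuracy(toy_triples)
print(f"negation accuracy: {accuracy:.3f}")
```

A model that ignores the word "not" would embed a caption and its negation almost identically, so its score on a check like this would sit near chance.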
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Contrastive Language-Image Pre-training
