Vision-Language Models Do Not Understand Negation
Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi

TL;DR
This paper evaluates how well current vision-language models understand negation, introduces the NegBench benchmark, and demonstrates that fine-tuning on synthetic negation data significantly improves their performance.
Contribution
The study introduces NegBench, a comprehensive benchmark for negation understanding, and shows that fine-tuning models on synthetic negation data enhances their negation comprehension.
Findings
Modern VLMs struggle with negation, often performing at chance level.
Fine-tuning on synthetic negation datasets improves recall on retrieval queries with negation by 10%.
Accuracy on multiple-choice questions with negated captions increases by 28%.
Abstract
Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address…
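The two task types named in the abstract are typically scored with recall@k (retrieval) and accuracy (multiple choice). The following is a minimal sketch of those two metrics, not the NegBench evaluation code: the function names, the toy similarity scores, and the assumption of four answer options per question are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], targets: Sequence[int], k: int) -> float:
    """Fraction of text queries whose target image is among the k most similar images.

    sim[q][i] is a (hypothetical) query-to-image similarity score, e.g. from a CLIP-style model.
    """
    hits = 0
    for scores, target in zip(sim, targets):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


def mcq_accuracy(caption_scores: Sequence[Sequence[float]], answers: Sequence[int]) -> float:
    """Fraction of images for which the highest-scoring candidate caption is the correct one."""
    correct = 0
    for scores, answer in zip(caption_scores, answers):
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        if predicted == answer:
            correct += 1
    return correct / len(answers)


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options candidates."""
    return 1.0 / num_options


# Toy retrieval example: 2 negated queries, 3 candidate images (made-up scores).
toy_sim = [
    [0.9, 0.2, 0.1],  # query 0: target image 0 is ranked first
    [0.3, 0.4, 0.8],  # query 1: target image 1 is ranked second
]
toy_targets = [0, 1]
r_at_1 = recall_at_k(toy_sim, toy_targets, k=1)
r_at_2 = recall_at_k(toy_sim, toy_targets, k=2)

# Toy multiple-choice example: 2 images, 4 candidate captions each (assumed option count).
toy_caption_scores = [
    [0.1, 0.7, 0.2, 0.3],  # picks caption 1 (correct)
    [0.6, 0.5, 0.1, 0.2],  # picks caption 0 (incorrect; answer is 1)
]
toy_answers = [1, 1]
acc = mcq_accuracy(toy_caption_scores, toy_answers)
baseline = chance_accuracy(4)
```

Under these assumptions, a model "performing at chance level" on the multiple-choice task would score close to `chance_accuracy(num_options)`, which is the natural baseline against which any reported accuracy gain should be read.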
Taxonomy
Topics: Speech and dialogue systems · Language, Metaphor, and Cognition · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
