NegVQA: Can Vision Language Models Understand Negation?
Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy

TL;DR
NegVQA is a new benchmark designed to evaluate vision language models' understanding of negation, revealing significant performance gaps and a U-shaped scaling trend as models grow larger.
Contribution
We introduce NegVQA, a comprehensive negation-focused VQA benchmark, and evaluate leading models, uncovering their struggles and the non-linear effects of model size on negation comprehension.
Findings
Models score substantially lower on negated questions than on the corresponding original questions.
Performance drops initially with increasing model size, then improves at larger scales.
NegVQA exposes critical gaps in current VLMs' negation understanding.
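The first finding can be made concrete with a minimal sketch of how a gap between original and negated accuracy might be computed for two-choice questions. This is an illustration under stated assumptions, not the authors' evaluation code: the `Item` record, the `accuracy` and `negation_gap` helpers, and the toy predictions are all hypothetical and do not reflect any reported result.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Item:
    """One two-choice question (hypothetical record layout, not from the paper)."""
    answer: str      # gold choice, "A" or "B"
    prediction: str  # the model's chosen option, "A" or "B"


def accuracy(items: Sequence[Item]) -> float:
    """Fraction of items where the prediction matches the gold answer."""
    if not items:
        raise ValueError("accuracy is undefined for an empty item list")
    return sum(item.prediction == item.answer for item in items) / len(items)


def negation_gap(original: Sequence[Item], negated: Sequence[Item]) -> float:
    """Accuracy on original questions minus accuracy on their negated versions.

    A positive value means the model does worse once the question is negated.
    """
    return accuracy(original) - accuracy(negated)


# Toy data, invented purely for illustration (NOT results from the benchmark).
original_items = [Item("A", "A"), Item("B", "B"), Item("A", "A"), Item("B", "A")]
negated_items = [Item("B", "A"), Item("A", "A"), Item("B", "A"), Item("A", "B")]

gap = negation_gap(original_items, negated_items)
```

Because every question has exactly two options, random guessing yields about 50% accuracy in expectation, so a negated-question accuracy near or below that level suggests the model is ignoring or misreading the negation.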
Abstract
Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
