Disparities In Negation Understanding Across Languages In Vision-Language Models
Charikleia Moraitaki, Sarah Pan, Skyler Pulling, Gwendolyn Flusche, Kumail Alhamoud, Marzyeh Ghassemi

TL;DR
This paper introduces a multilingual negation benchmark to evaluate vision-language models across diverse languages, revealing disparities in negation understanding linked to linguistic features and model architecture.
Contribution
The paper presents the first human-verified multilingual negation benchmark, covering seven typologically diverse languages. It evaluates three VLMs (CLIP, SigLIP, and MultiCLIP) and analyzes the linguistic factors that affect negation comprehension.
Findings
Standard CLIP performs at or below chance on non-Latin-script languages.
MultiCLIP achieves the highest and most uniform accuracy across languages.
Negation correction methods improve performance variably across languages.
Abstract
Vision-language models (VLMs) exhibit affirmation bias: a systematic tendency to select positive captions ("X is present") even when the correct description contains negation ("no X"). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs - CLIP, SigLIP, and MultiCLIP - we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate…
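To make the notion of affirmation bias concrete, the sketch below scores a two-way caption choice, assuming each benchmark item pairs one image with an affirmative caption ("X is present") and a negated caption ("no X"), and that the model picks whichever caption has the higher image-text similarity. The `Item` record, its field names, and the two-way setup are illustrative assumptions, not the paper's actual evaluation protocol.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Item:
    """One hypothetical benchmark item (field names are assumptions)."""
    language: str
    sim_affirmative: float   # image-text similarity for the "X is present" caption
    sim_negated: float       # image-text similarity for the "no X" caption
    correct_is_negated: bool  # True if the negated caption is the right answer


def picks_negated(item: Item) -> bool:
    # The model "chooses" the caption with the higher similarity score.
    return item.sim_negated > item.sim_affirmative


def accuracy(items: List[Item]) -> float:
    # Fraction of items where the chosen caption is the correct one.
    hits = sum(picks_negated(it) == it.correct_is_negated for it in items)
    return hits / len(items)


def affirmation_bias_rate(items: List[Item]) -> float:
    # Among items whose correct caption is negated, the fraction where
    # the model still prefers the affirmative caption.
    negated = [it for it in items if it.correct_is_negated]
    if not negated:
        return 0.0
    return sum(not picks_negated(it) for it in negated) / len(negated)


def per_language(items: List[Item],
                 metric: Callable[[List[Item]], float]) -> Dict[str, float]:
    # Group items by language and apply the metric to each group.
    groups: Dict[str, List[Item]] = {}
    for it in items:
        groups.setdefault(it.language, []).append(it)
    return {lang: metric(group) for lang, group in groups.items()}
```

Calling `per_language(items, accuracy)` gives one accuracy per language, which can be compared against the 50% chance level of a two-way choice; `per_language(items, affirmation_bias_rate)` isolates how often the affirmative caption wins when the negated one is correct.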
