SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad

TL;DR
The paper introduces the SUGARCREPE++ dataset to evaluate vision-language and unimodal language models' sensitivity to semantic and lexical changes, revealing current models' limitations in understanding precise semantics.
Contribution
It presents a new dataset and comprehensive evaluation framework to benchmark models' ability to handle semantic and lexical variations, highlighting gaps in current model understanding.
Findings
VLMs struggle with lexical and semantic variations, especially in object attributes and spatial relations.
Larger models and more pre-training improve performance but do not fully solve the problem.
Performance on compositionality datasets does not guarantee success on SUGARCREPE++, indicating different challenges.
Abstract
Despite their remarkable successes, state-of-the-art large language models (LLMs), including vision-and-language models (VLMs) and unimodal language models (ULMs), fail to understand precise semantics. For example, semantically equivalent sentences expressed using different lexical compositions elicit diverging representations. The degree of this divergence and its impact on encoded semantics is not very well understood. In this paper, we introduce the SUGARCREPE++ dataset to analyze the sensitivity of VLMs and ULMs to lexical and semantic alterations. Each sample in the SUGARCREPE++ dataset consists of an image and a corresponding triplet of captions: a pair of semantically equivalent but lexically different positive captions and one hard negative caption. This poses a 3-way semantic (in)equivalence problem to the language models. We comprehensively evaluate VLMs and ULMs that differ in…
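The caption triplet described in the abstract can be illustrated with a small scoring sketch. This is not taken from the paper: it assumes one plausible reading of the 3-way task, in which a sample counts as correct only when both positive captions score higher against the image than the hard negative. The names `is_triplet_correct` and `triplet_accuracy` are hypothetical, and the similarity scores are assumed to come from whatever model is being evaluated.

```python
from typing import Sequence, Tuple

# Each entry holds precomputed image-caption similarity scores for one sample:
# (positive caption 1, positive caption 2, hard negative caption).
# Where the scores come from (e.g. a VLM's image-text similarity) is left open.
ScoreTriplet = Tuple[float, float, float]


def is_triplet_correct(s_pos1: float, s_pos2: float, s_neg: float) -> bool:
    """Assumed criterion: both positives must outscore the hard negative."""
    return s_pos1 > s_neg and s_pos2 > s_neg


def triplet_accuracy(scores: Sequence[ScoreTriplet]) -> float:
    """Fraction of samples that satisfy the assumed criterion."""
    if not scores:
        raise ValueError("scores must contain at least one triplet")
    correct = sum(is_triplet_correct(*triplet) for triplet in scores)
    return correct / len(scores)


# Illustrative, made-up scores: the second sample fails because one
# positive caption scores below the negative.
example_scores = [(0.9, 0.8, 0.3), (0.9, 0.2, 0.5)]
accuracy = triplet_accuracy(example_scores)
```

Under this reading, a model that ranks only one of the two paraphrases above the negative gets no credit for that sample, which is what makes lexical variation between the positives matter.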
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Categorization, perception, and language
