SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations

Sri Harsha Dumpala; Aman Jaiswal; Chandramouli Sastry; Evangelos Milios; Sageev Oore; Hassan Sajjad

arXiv:2406.11171 · cs.CV · June 21, 2024

TL;DR

The paper introduces the SUGARCREPE++ dataset to evaluate vision-language and unimodal language models' sensitivity to semantic and lexical changes, revealing current models' limitations in understanding precise semantics.

Contribution

It presents a new dataset and comprehensive evaluation framework to benchmark models' ability to handle semantic and lexical variations, highlighting gaps in current model understanding.

Findings

1. VLMs struggle with lexical and semantic variations, especially in object attributes and spatial relations.

2. Larger models and more pre-training improve performance but do not fully solve the problem.

3. Performance on compositionality datasets does not guarantee success on SUGARCREPE++, indicating different challenges.

Abstract

Despite their remarkable successes, state-of-the-art large language models (LLMs), including vision-and-language models (VLMs) and unimodal language models (ULMs), fail to understand precise semantics. For example, semantically equivalent sentences expressed using different lexical compositions elicit diverging representations. The degree of this divergence and its impact on encoded semantics is not very well understood. In this paper, we introduce the SUGARCREPE++ dataset to analyze the sensitivity of VLMs and ULMs to lexical and semantic alterations. Each sample in the SUGARCREPE++ dataset consists of an image and a corresponding triplet of captions: a pair of semantically equivalent but lexically different positive captions and one hard negative caption. This poses a 3-way semantic (in)equivalence problem to the language models. We comprehensively evaluate VLMs and ULMs that differ in…
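
To make the 3-way setup concrete, here is a minimal sketch of how one sample (an image, two positive captions, one hard negative) might be scored from embeddings. This is an illustration under assumptions, not the paper's evaluation code: the function names, the use of cosine similarity, and the rule that both positives must beat the negative are all hypothetical choices, and the toy vectors stand in for real model embeddings.

```python
import math
from typing import Iterable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def image_to_text_correct(image, pos1, pos2, neg) -> bool:
    """Assumed VLM criterion: the image is closer to BOTH positives than to the negative."""
    s_neg = cosine(image, neg)
    return cosine(image, pos1) > s_neg and cosine(image, pos2) > s_neg


def text_only_correct(pos1, pos2, neg) -> bool:
    """Assumed text-only criterion: the two positives are closer to each other
    than either of them is to the negative."""
    s_pos = cosine(pos1, pos2)
    return s_pos > cosine(pos1, neg) and s_pos > cosine(pos2, neg)


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of samples judged correct (0.0 for an empty input)."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


# Toy 2-D embeddings (hypothetical values, for illustration only).
image = [1.0, 0.0]
pos1 = [0.9, 0.1]
pos2 = [0.8, 0.2]
neg = [0.1, 0.9]

# A "lexically tempting" case: the negative sits closer to the image
# than the second, differently worded positive does.
pos2_far = [0.5, 0.5]
neg_close = [0.95, 0.05]

easy_ok = image_to_text_correct(image, pos1, pos2, neg)
hard_ok = image_to_text_correct(image, pos1, pos2_far, neg_close)
text_ok = text_only_correct(pos1, pos2, neg)
```

Under this assumed rule a sample counts as correct only when neither positive loses to the negative, so a model that keys on surface wording fails a sample even if it ranks one of the two positives correctly.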

Code & Models

Repositories

Sri-Harsha/scpp (PyTorch, official)

Datasets

Aman-J/SugarCrepe_pp (dataset, 546 downloads)

Videos

SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Categorization, perception, and language