VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad

TL;DR
The VISLA benchmark evaluates how well vision-language and unimodal language models understand semantic and lexical nuances, revealing their sensitivities and limitations without requiring fine-tuning.
Contribution
This paper introduces the VISLA benchmark, unifying image-to-text and text-to-text retrieval tasks for off-the-shelf evaluation of semantic and lexical understanding in models.
Findings
Text encoders of VLMs show greater sensitivity to semantic and lexical variations than unimodal text encoders (ULMs).
Models struggle to distinguish between lexical and semantic differences.
Spatial semantics are highly sensitive to lexical information.
Abstract
Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions…
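The 3-way task described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the scoring rule, so everything here is an assumption made for illustration: each triplet is taken to hold two semantically equivalent captions (p1, p2) and one lexically close but semantically different caption (n), similarity is taken to be cosine similarity over embeddings, and the function names and toy vectors are hypothetical rather than drawn from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def text_to_text_correct(p1, p2, n):
    """Assumed text-to-text criterion (hypothetical, not from the paper):
    the two equivalent captions must be closer to each other than either
    one is to the semantically altered caption."""
    sim_pos = cosine(p1, p2)
    return sim_pos > cosine(p1, n) and sim_pos > cosine(p2, n)


def image_to_text_correct(image, p1, p2, n):
    """Assumed image-to-text criterion (hypothetical, not from the paper):
    the image embedding must be closer to both equivalent captions than
    to the altered caption."""
    sim_neg = cosine(image, n)
    return cosine(image, p1) > sim_neg and cosine(image, p2) > sim_neg


# Toy 2-d embeddings standing in for real encoder outputs.
image = [1.0, 0.05]
p1 = [1.0, 0.0]          # caption
p2 = [0.9, 0.1]          # paraphrase with the same meaning
n_far = [0.0, 1.0]       # altered caption that the encoder separates well
n_near = [1.0, 0.01]     # altered caption that the encoder barely separates

easy_case = text_to_text_correct(p1, p2, n_far)
hard_case = text_to_text_correct(p1, p2, n_near)
image_case = image_to_text_correct(image, p1, p2, n_far)
```

In this sketch, a model that relies mostly on lexical overlap would place the altered caption close to p1 (as n_near does), and the triplet would count as a failure. Because the check uses only frozen embeddings, it matches the off-the-shelf, no-fine-tuning setting the summary describes.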
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification
