Language-Guided Invariance Probing of Vision-Language Models
Jae Joong Lee

TL;DR
This paper introduces LGIP, a benchmark that applies controlled linguistic perturbations to vision-language models to test whether their image-text scores stay stable under paraphrase and drop when the meaning of a caption changes.
Contribution
The paper presents LGIP, a novel diagnostic benchmark that measures VLMs' invariance to paraphrases and sensitivity to semantic flips, providing insights beyond standard retrieval metrics.
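To make the idea of a semantic flip concrete, here is a minimal sketch of a rule-based edit that changes an object category, a color or a count in a caption. The swap tables and the `flip_caption` helper are hypothetical illustrations, not the rules used in the paper.

```python
import re

# Hypothetical swap tables; the paper's actual rule set is not given here.
COLOR_SWAPS = {"red": "blue", "blue": "red", "black": "white", "white": "black"}
COUNT_SWAPS = {"one": "two", "two": "three", "three": "two"}
OBJECT_SWAPS = {"dog": "cat", "cat": "dog", "car": "bus", "bus": "car"}


def flip_caption(caption, swaps):
    """Replace the first whole-word match from `swaps`.

    Returns the edited caption, or None when no rule applies.
    The replacement is inserted in lowercase, so the original
    capitalization of the matched word is not preserved.
    """
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, swaps)) + r")\b", re.IGNORECASE
    )
    match = pattern.search(caption)
    if match is None:
        return None
    replacement = swaps[match.group(0).lower()]
    return caption[: match.start()] + replacement + caption[match.end():]


if __name__ == "__main__":
    print(flip_caption("A red car parked on the street", COLOR_SWAPS))
    # A blue car parked on the street
    print(flip_caption("A red car parked on the street", OBJECT_SWAPS))
    # A red bus parked on the street
```

Because matching is done on whole words, plural forms such as "dogs" are left untouched in this sketch; a fuller rule set would need to handle inflection.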
Findings
EVA02-CLIP and large OpenCLIP variants strike a strong balance between invariance and sensitivity.
SigLIP models exhibit high invariance errors and prefer flipped captions over the originals.
LGIP reveals robustness issues that standard retrieval metrics do not show.
Abstract
Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than…
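The three summary statistics named in the abstract could be computed along the following lines. The abstract does not give their exact definitions, so the formulas below are assumptions: invariance error as the mean absolute score change between a caption and its paraphrases, sensitivity gap as the mean score difference between original and flipped captions, and positive rate as the fraction of cases where the original outscores the flip. The function names and the toy scores are also illustrative.

```python
from statistics import mean


def invariance_error(orig_scores, paraphrase_scores):
    """Assumed definition: mean |s(original) - s(paraphrase)| over all pairs.

    orig_scores: one image-text score per original caption.
    paraphrase_scores: for each original, a list of scores of its paraphrases.
    """
    diffs = [
        abs(o - p)
        for o, group in zip(orig_scores, paraphrase_scores)
        for p in group
    ]
    return mean(diffs)


def sensitivity_gap(orig_scores, flipped_scores):
    """Assumed definition: mean of s(original) - s(flipped)."""
    return mean(o - f for o, f in zip(orig_scores, flipped_scores))


def positive_rate(orig_scores, flipped_scores):
    """Assumed definition: fraction of pairs where the original scores higher."""
    return mean(
        1.0 if o > f else 0.0 for o, f in zip(orig_scores, flipped_scores)
    )


if __name__ == "__main__":
    # Toy similarity scores, made up for illustration only.
    orig = [0.30, 0.25]
    paraphrases = [[0.28, 0.31], [0.25, 0.27]]
    flipped = [0.20, 0.26]
    print(invariance_error(orig, paraphrases))
    print(sensitivity_gap(orig, flipped))
    print(positive_rate(orig, flipped))
```

Under these assumed definitions, a model on the favorable frontier would show a small invariance error together with a positive sensitivity gap and a positive rate close to 1.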
