CAST: Cross-modal Alignment Similarity Test for Vision Language Models
Gautier Dagan, Olga Loginova, Anil Batra

TL;DR
CAST is a novel evaluation method that tests vision-language models for self-consistency across modalities, revealing their internal alignment without relying on ground-truth accuracy.
Contribution
We introduce CAST, a new self-consistency test for VLMs that probes their cross-modal alignment without requiring ground-truth labels.
Findings
VLMs vary in self-consistency across tasks
Self-consistency correlates with model capability
CAST reveals modality misalignments
Abstract
Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks, which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. The test asks a model to identify similarities between two scenes given text-only, image-only, or both kinds of input, and then to assess the truthfulness of the similarities it generated. Since there is no ground truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent…
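The two-step procedure described in the abstract (generate similarities, then judge them) can be illustrated with a minimal sketch. This is not the authors' implementation: the `generate` and `validate` callables, the modality names, the idea of scoring every generation/validation modality pair, and the toy model below are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumed modality labels; the abstract only names text-only, image-only, or both.
MODALITIES = ("text", "image", "both")

# Hypothetical model interfaces (not from the paper):
#   generate(pair_id, modality) -> list of similarity statements
#   validate(pair_id, modality, statement) -> True if the model judges it truthful
Generate = Callable[[str, str], List[str]]
Validate = Callable[[str, str, str], bool]


def self_consistency(
    pair_id: str, generate: Generate, validate: Validate
) -> Dict[Tuple[str, str], float]:
    """Fraction of the model's own statements it later accepts,
    for each (generation modality, validation modality) pair."""
    scores: Dict[Tuple[str, str], float] = {}
    for gen_mod in MODALITIES:
        statements = generate(pair_id, gen_mod)
        for val_mod in MODALITIES:
            if not statements:
                # No statements were generated, so the rate is undefined.
                scores[(gen_mod, val_mod)] = float("nan")
                continue
            accepted = sum(validate(pair_id, val_mod, s) for s in statements)
            scores[(gen_mod, val_mod)] = accepted / len(statements)
    return scores


# Toy stand-in for a VLM, used only to exercise the function.
def toy_generate(pair_id: str, modality: str) -> List[str]:
    return ["both scenes contain a dog", "both scenes contain a red car"]


def toy_validate(pair_id: str, modality: str, statement: str) -> bool:
    # The toy model rejects the colour claim when it only sees the images.
    return not (modality == "image" and "red" in statement)


scores = self_consistency("pair-0", toy_generate, toy_validate)
```

No ground-truth label appears anywhere in this sketch: a score below 1.0 only indicates that the model rejected a statement it had produced itself, which is the kind of internal inconsistency the test is meant to surface.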
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
