CAST: Cross-modal Alignment Similarity Test for Vision Language Models

Gautier Dagan; Olga Loginova; Anil Batra

arXiv:2409.11007 · cs.CL · September 18, 2024

TL;DR

CAST is a novel evaluation method that tests vision-language models for self-consistency across modalities, revealing their internal alignment without relying on ground-truth accuracy.

Contribution

We introduce CAST, a new self-consistency test for VLMs that probes their cross-modal alignment without requiring ground-truth labels.

Findings

01. VLMs vary in self-consistency across tasks
02. Self-consistency correlates with model capability
03. CAST reveals modality misalignments

Abstract

Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks, which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. The test asks a model to identify similarities between two scenes given text-only, image-only, or both kinds of input, and then to assess the truthfulness of the similarities it generated. Since there is no ground truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent…
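
The abstract is cut off above, so the exact prompts, protocol, and scoring used by CAST are not shown here. As a rough illustration of the generate-then-verify idea only, the following minimal sketch scores how often statements produced under one input mode are accepted when checked under another. Everything in it is an assumption made for illustration: `Scene`, `toy_generate`, `toy_verify`, and `self_consistency` are hypothetical names, the "model" is a toy that treats scenes as sets of attributes, and none of it is taken from the gautierdag/cast repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

MODES = ("text", "image", "both")  # assumed labels for the three input modes


@dataclass(frozen=True)
class Scene:
    # Toy stand-ins (assumption): each modality is reduced to a set of attributes.
    image: frozenset
    text: frozenset


def visible(scene: Scene, mode: str) -> frozenset:
    """Return the attributes the toy model can 'see' under the given mode."""
    if mode == "text":
        return scene.text
    if mode == "image":
        return scene.image
    return scene.image | scene.text


def toy_generate(a: Scene, b: Scene, mode: str) -> List[str]:
    """Hypothetical generation step: list similarities visible in both scenes."""
    return sorted(visible(a, mode) & visible(b, mode))


def toy_verify(statement: str, a: Scene, b: Scene, mode: str) -> bool:
    """Hypothetical verification step: does the statement hold for both scenes?"""
    return statement in visible(a, mode) and statement in visible(b, mode)


def self_consistency(
    generate: Callable[[Scene, Scene, str], List[str]],
    verify: Callable[[str, Scene, Scene, str], bool],
    pairs: Sequence[Tuple[Scene, Scene]],
) -> Dict[Tuple[str, str], float]:
    """Fraction of generated statements accepted, per (generation, verification) mode."""
    scores = {}
    for gen_mode in MODES:
        for eval_mode in MODES:
            accepted = total = 0
            for a, b in pairs:
                for statement in generate(a, b, gen_mode):
                    total += 1
                    accepted += verify(statement, a, b, eval_mode)
            scores[(gen_mode, eval_mode)] = accepted / total if total else float("nan")
    return scores


# Toy data: the captions mention "park", which the images do not show.
pairs = [
    (
        Scene(image=frozenset({"dog", "grass"}), text=frozenset({"dog", "park"})),
        Scene(image=frozenset({"dog", "grass", "ball"}), text=frozenset({"dog", "park"})),
    )
]
scores = self_consistency(toy_generate, toy_verify, pairs)
```

In this toy setup, a score below 1.0 for a pair of different modes means that some statements generated from one kind of input are rejected when checked against another. No ground-truth labels are involved at any point. With a real VLM, `generate` and `verify` would wrap model calls, whose form is not specified in the text above.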

Code & Models

Repositories

gautierdag/cast (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Focus