Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability
Joël Roman Ky, Salah Ghamizi, Maxime Cordy

TL;DR
This paper introduces a new benchmark and metric for evaluating the true cross-modal reasoning of vision-language models, addressing limitations of existing unimodal evaluation methods.
Contribution
It proposes Synergistic Faithfulness, a scalable metric based on the Shapley Interaction Index, to accurately measure joint modality interactions in VLMs.
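To make the Shapley Interaction Index concrete, here is a minimal sketch of its textbook pairwise definition, computed by exact enumeration. It is not the paper's scalable estimator, which this summary does not specify. The names `shapley_interaction`, `and_game`, and `additive_game` are hypothetical, and the value functions are toy stand-ins for a model score on a subset of input features.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(value, players, i, j):
    """Exact pairwise Shapley Interaction Index of players i and j.

    `value` maps a frozenset of players to a real number.
    Cost is exponential in len(players): this is for illustration only.
    """
    rest = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for k in range(len(rest) + 1):
        # Weight of a context coalition of size k (excluding i and j).
        weight = factorial(k) * factorial(n - k - 2) / factorial(n - 1)
        for subset in combinations(rest, k):
            s = frozenset(subset)
            # Discrete second difference: the joint effect of adding i and j
            # beyond the sum of their individual effects in context s.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


# Toy value functions (assumptions, not taken from the paper).
def and_game(s):
    # Pays off only when both "a" and "b" are present: pure synergy.
    return 1.0 if {"a", "b"} <= s else 0.0


def additive_game(s):
    # Each player contributes independently: no interaction.
    return float(len(s))


players = ["a", "b", "c"]
print(shapley_interaction(and_game, players, "a", "b"))
print(shapley_interaction(additive_game, players, "a", "b"))
```

In the toy AND game the two players are only useful together, so their interaction is 1. In the additive game every contribution is independent, so the interaction is 0.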
Findings
Existing explainers over-rely on visual salience.
Current metrics are biased by language priors and modality biases.
The new metric achieves high correlation with ground truth and is computationally efficient.
Abstract
Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other. To resolve this, we introduce Synergistic Faithfulness, a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate while…
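The idea of a joint Harsanyi dividend between modalities can be illustrated with a two-player game in which the players are "image" and "text". With only two players, the dividend of the pair is v(both) - v(image) - v(text) + v(none). The sketch below is a hedged illustration, not the paper's implementation: `harsanyi_dividend` is a hypothetical helper, and the four scores are made-up numbers standing in for a model's confidence in the correct answer under each masking condition.

```python
def harsanyi_dividend(v_none, v_image, v_text, v_both):
    """Joint dividend of the {image, text} coalition in a two-player game.

    Each argument is a (hypothetical) model score when only the named
    modalities are left unmasked.
    """
    return v_both - v_image - v_text + v_none


# Redundant case: text alone already recovers the full score,
# so adding the image contributes nothing jointly.
redundant = harsanyi_dividend(v_none=0.25, v_image=0.25, v_text=0.75, v_both=0.75)

# Synergistic case: neither modality helps alone, both together do.
synergistic = harsanyi_dividend(v_none=0.25, v_image=0.25, v_text=0.25, v_both=1.0)

print(redundant, synergistic)
```

In the redundant case a text-only perturbation accounts for the whole change in score, which is the situation the abstract describes as cross-modal redundancy. The dividend is zero there and positive only when the two modalities must be combined.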
