Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

Joël Roman Ky; Salah Ghamizi; Maxime Cordy

arXiv:2605.22168 · cs.AI · May 22, 2026

TL;DR

This paper introduces a new benchmark and metric for evaluating the true cross-modal reasoning of vision-language models, addressing limitations of existing unimodal evaluation methods.

Contribution

It proposes Synergistic Faithfulness, a scalable metric based on the Shapley Interaction Index, to accurately measure joint modality interactions in VLMs.
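For orientation, the pairwise Shapley Interaction Index in its standard textbook form is given below. The symbols are assumptions introduced here, not taken from the paper: $N$ is the set of players (for example image patches and text tokens), $n = |N|$, and $v$ is a value function that scores the model output when only a subset of players is kept. This is the general definition the metric is described as building on, not the paper's own formula for $F_{syn}$, which this page does not state.

$$
I_{ij} = \sum_{S \subseteq N \setminus \{i,j\}} \frac{|S|!\,(n - |S| - 2)!}{(n - 1)!} \Big[ v(S \cup \{i,j\}) - v(S \cup \{i\}) - v(S \cup \{j\}) + v(S) \Big]
$$

The bracketed term is the joint effect of $i$ and $j$ beyond their individual effects in context $S$. With only two players, the sum has a single term and $I_{ij}$ reduces to the Harsanyi dividend of the pair, $v(\{i,j\}) - v(\{i\}) - v(\{j\}) + v(\emptyset)$.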

Findings

01. Existing explainers over-rely on visual salience.

02. Current metrics are biased by language priors and modality biases.

03. The new metric achieves high correlation with ground truth and is computationally efficient.

Abstract

Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other (Kendall's $\tau = -0.06$). To resolve this, we introduce Synergistic Faithfulness ($F_{syn}$), a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate ($\rho = 0.92$) while…
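To make the distinction between cross-modal redundancy and synergy concrete, here is a minimal sketch of the exact pairwise Shapley interaction, computed by brute-force enumeration over a toy value function. It is not the paper's $F_{syn}$ implementation: the function name `shapley_interaction`, the two toy value functions, and the scores 0.9 and 0.1 are all hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(value, players, i, j):
    """Exact pairwise Shapley interaction of players i and j.

    `value` maps a frozenset of players to a score. The cost is
    exponential in len(players), so this only suits tiny examples.
    """
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        # Weight of every context subset S with |S| = size.
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Joint effect of i and j beyond their individual effects.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


# Hypothetical toy models with one "image" and one "text" player.
def redundant_model(kept):
    # Answers correctly from text alone; the image adds nothing.
    return 0.9 if "text" in kept else 0.1


def synergistic_model(kept):
    # Answers correctly only when both modalities are present.
    return 0.9 if {"image", "text"} <= kept else 0.1


players = ["image", "text"]
redundant_score = shapley_interaction(redundant_model, players, "image", "text")
synergy_score = shapley_interaction(synergistic_model, players, "image", "text")
```

Under these assumptions, `redundant_score` comes out near zero because the text alone already determines the answer, while `synergy_score` is about 0.8 because the answer depends on both modalities jointly. With two players the interaction equals the Harsanyi dividend of the pair, the quantity the abstract says the metric isolates.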
