Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
Hubert Baniecki, Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer, Eyke Hüllermeier, Przemyslaw Biecek

TL;DR
This paper introduces FIxLIP, a game-theoretic approach using weighted Banzhaf interactions to provide more comprehensive explanations of vision-language models' similarity outputs, capturing complex cross-modal interactions.
Contribution
The paper presents FIxLIP, a novel method based on game theory that extends explanation techniques to second-order interactions for vision-language encoders, improving interpretability and computational efficiency.
Findings
Second-order explanations outperform first-order methods in benchmarks.
FIxLIP provides high-quality, faithful model explanations.
FIxLIP's utility is demonstrated by using it to compare different vision-language models.
Abstract
Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how…
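To make the idea of a second-order interaction concrete, here is a minimal sketch of a pairwise weighted Banzhaf interaction computed by exhaustive enumeration on a toy game. It uses the textbook definition, in which each coalition of the remaining players is weighted by p^|S| (1-p)^(n-2-|S|); the paper's exact formulation and its efficient estimator may differ. The names `weighted_banzhaf_interaction` and `toy_similarity` are hypothetical, and the toy value function only stands in for a model's similarity output on masked image-text inputs.

```python
from itertools import combinations


def weighted_banzhaf_interaction(value, n, i, j, p=0.5):
    """Pairwise weighted Banzhaf interaction of players i and j.

    `value` maps a frozenset of player indices to a real number.
    Exhaustive enumeration: only feasible for a small number of players n.
    """
    rest = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(rest) + 1):
        # Probability of a specific coalition of this size when every
        # remaining player is included independently with probability p.
        weight = p**size * (1 - p) ** (len(rest) - size)
        for subset in combinations(rest, size):
            s = frozenset(subset)
            # Second-order discrete derivative of the game at coalition s.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_similarity(coalition):
    """Hypothetical stand-in for a similarity score: additive main
    effects plus one synergy between players 0 and 1."""
    main = {0: 0.2, 1: 0.1, 2: 0.3}
    score = sum(main[k] for k in coalition)
    if 0 in coalition and 1 in coalition:
        score += 0.5
    return score


if __name__ == "__main__":
    for pair in [(0, 1), (0, 2), (1, 2)]:
        print(pair, weighted_banzhaf_interaction(toy_similarity, 3, *pair, p=0.3))
```

In this toy game the only non-additive term is the synergy between players 0 and 1, so the interaction for that pair recovers the synergy and the other pairs come out at zero, whatever value of p is chosen. A first-order attribution would have to spread that synergy across the two individual players, which is the limitation of saliency maps that the abstract points to.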
