Attributions All the Way Down? The Metagame of Interpretability
Hubert Baniecki, Przemyslaw Biecek, Fabian Fumagalli

TL;DR
This paper introduces the metagame, a framework for quantifying second-order interaction effects of model explanations. It provides a hierarchical decomposition of attributions and empirical insights across three interpretability settings: language models, vision-language encoders, and text-to-image models.
Contribution
It presents a novel metagame approach that measures directional influence among features in attribution methods, extending existing interaction indices with theoretical and empirical validation.
Findings
Hierarchical decomposition of attributions into meta-attributions.
Meta-attributions serve as directional extensions of interaction indices.
Empirical applications include token interactions, cross-modal similarity, and multimodal concept interpretation.
Abstract
We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution explaining a model, we measure the directional influence of one feature on the attribution of another feature, termed a meta-attribution, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in…
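The core idea in the abstract can be illustrated with a small sketch. It rests on assumptions not stated on this page: the metagame for a target feature is taken to assign each coalition the target's Shapley value computed with only that coalition's features present (zero if the target is absent), and `toy_model`, `shapley`, and `meta_attributions` are hypothetical names for illustration. The paper's exact definition may differ.

```python
from itertools import combinations
from math import factorial


def subsets(players):
    """Yield every subset of `players` as a frozenset."""
    players = list(players)
    for r in range(len(players) + 1):
        for combo in combinations(players, r):
            yield frozenset(combo)


def shapley(value, players):
    """Exact Shapley values of the game `value` (a function on frozensets)."""
    players = list(players)
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for S in subsets(others):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (value(S | {i}) - value(S))
        phi[i] = total
    return phi


def toy_model(S):
    """Hypothetical game: additive effects plus one interaction between 0 and 1."""
    out = 0.0
    if 0 in S:
        out += 1.0
    if 1 in S:
        out += 2.0
    if 2 in S:
        out += 0.5
    if 0 in S and 1 in S:
        out += 3.0
    return out


def meta_attributions(value, players, target):
    """Shapley values of the assumed metagame for `target`.

    Assumption: the metagame value of coalition S is the attribution of
    `target` in the game restricted to S, and 0 when `target` is not in S.
    """
    def metagame(S):
        if target not in S:
            return 0.0
        return shapley(value, S)[target]

    return shapley(metagame, players)


players = [0, 1, 2]
phi = shapley(toy_model, players)
meta = {i: meta_attributions(toy_model, players, i) for i in players}

for i in players:
    # Under this construction, Shapley efficiency makes the meta-attributions
    # of a target sum to its first-order attribution.
    print(i, round(phi[i], 4), {j: round(m, 4) for j, m in meta[i].items()})
```

In this toy game, feature 2 takes part in no interaction, so its influence on the attribution of feature 0 is zero, while feature 1 receives a nonzero share of it. The sum-to-attribution property follows here from the efficiency axiom of the Shapley value applied to the assumed metagame; it is meant only to mirror the hierarchical decomposition the abstract describes, not to reproduce the paper's proof.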
