Playing the network backward: A Game Theoretic Attribution Framework
Jakob Paul Zimmermann, Jim Berend, Georg Loho, Sebastian Lapuschkin, Wojciech Samek

TL;DR
This paper introduces a game-theoretic framework for backward attribution methods that unifies gradients and the alpha-beta-LRP family, lets desired explanation properties be specified as game-theoretic concepts, and is demonstrated on Vision Transformer models.
Contribution
It recasts backward attribution as a two-player game, unifies existing methods, and proposes novel adaptations that improve localization in transformer models.
Findings
A proposed alpha-beta-LRP adaptation outperforms prior methods on ViT-B/16.
The framework allows explanation properties to be specified as game-theoretic concepts.
Backward attribution maps are viewed as projections of trajectory distributions in the game.
Abstract
Attribution methods explain which input features drive a model's prediction, making them central to model debugging and mechanistic interpretability. Yet backward attribution methods, including gradients, LRP, and transformer-specific rules, lack a shared framework in which to compare the underlying backward calculations. We introduce such a framework by recasting backward attribution as a two-player game on an extended network graph, building on Gaubert and Vlassopoulos' ReLU Net Game. Gradients and the full alpha-beta-LRP family arise as integrals over game trajectories under specific equilibria, so attribution maps become projections of trajectory distributions rather than the primary object. Desired explanation properties, such as localisation focus, robustness to input noise, or stable attention routing, can be specified as game-theoretic concepts, including policy regularization,…
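For readers unfamiliar with the alpha-beta-LRP family named in the abstract, the following is a minimal sketch of the standard alpha-beta redistribution rule for a single bias-free dense layer. It is not the paper's game-theoretic formulation or its transformer adaptation. The function name `alpha_beta_lrp_dense`, the toy weights, and the choice alpha = 2, beta = 1 are assumptions made for illustration only.

```python
import numpy as np

def alpha_beta_lrp_dense(a, W, R_out, alpha=2.0, beta=1.0):
    """Standard alpha-beta-LRP rule for one bias-free dense layer z = a @ W.

    Illustrative sketch only: positive and negative contributions
    z_ij = a_i * w_ij are normalized separately per output neuron j and
    weighted by alpha and beta, with alpha - beta = 1.
    """
    assert abs(alpha - beta - 1.0) < 1e-9, "alpha - beta must equal 1"
    a = np.asarray(a, dtype=float)
    W = np.asarray(W, dtype=float)
    R_out = np.asarray(R_out, dtype=float)

    z = a[:, None] * W                 # contributions, shape (n_in, n_out)
    z_pos = np.clip(z, 0.0, None)      # positive parts
    z_neg = np.clip(z, None, 0.0)      # negative parts
    s_pos = z_pos.sum(axis=0)          # per-output positive totals
    s_neg = z_neg.sum(axis=0)          # per-output negative totals

    # Safe division: a zero total yields a zero share for that output.
    frac_pos = np.divide(z_pos, s_pos, out=np.zeros_like(z), where=s_pos != 0)
    frac_neg = np.divide(z_neg, s_neg, out=np.zeros_like(z), where=s_neg != 0)

    # Each input collects its weighted share of every output's relevance.
    return (alpha * frac_pos - beta * frac_neg) @ R_out

# Toy example (hypothetical numbers): 3 inputs, 2 outputs.
a = np.array([1.0, 2.0, 0.5])
W = np.array([[1.0, -1.0],
              [-0.5, 1.0],
              [2.0, -2.0]])
R_out = np.array([1.0, 0.5])
R_in = alpha_beta_lrp_dense(a, W, R_out)
```

In this toy case every output neuron receives both positive and negative contributions, so the input relevances sum to the output relevance. If an output has no negative (or no positive) contributions, this simple version no longer conserves relevance for that neuron, which is a known property of the plain rule rather than something specific to the paper.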
