Uncovering Model Processing Strategies with Non-Negative Per-Example Fisher Factorization
Michael Matena, Colin Raffel

TL;DR
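NPEFF is an interpretability method that decomposes per-example Fisher matrices into components that correspond to the processing strategies a language model uses to make its predictions. The components can also be turned into parameter perturbations that selectively disrupt a given strategy, and they can be used to analyze and mitigate collateral effects of unlearning.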
Contribution
This paper introduces NPEFF, an approach that decomposes per-example Fisher matrices to interpret and selectively disrupt model processing strategies. It outperforms gradient clustering and autoencoder baselines.
Findings
NPEFF components align with model processing strategies.
Parameter perturbations can selectively disrupt specific strategies.
NPEFF outperforms gradient clustering and autoencoder baselines.
Abstract
We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that aims to uncover strategies used by a model to generate its predictions. NPEFF decomposes per-example Fisher matrices using a novel decomposition algorithm that learns a set of components represented by learned rank-1 positive semi-definite matrices. Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to model processing strategies for a variety of language models and text processing tasks. We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing. Along with conducting extensive ablation studies, we include experiments to show how NPEFF can be used to analyze and mitigate collateral effects of unlearning and use NPEFF to study…
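The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes the per-example Fisher matrix F_i is approximated as a sum over components j of W[i, j] * outer(H[j], H[j]), with non-negative coefficients W and one vector H[j] per rank-1 positive semi-definite component. The toy softmax classifier, the squared-error objective, the projected gradient descent with step halving, and every name below are illustrative assumptions and not the paper's algorithm.

```python
import numpy as np


def per_example_fisher(theta, x):
    """Exact Fisher matrix of a toy softmax-linear classifier for one input.

    theta has shape (n_classes, n_features). The Fisher is taken with respect
    to the flattened parameters, so the result has shape (P, P), P = theta.size.
    """
    n_classes = theta.shape[0]
    logits = theta @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    fisher = np.zeros((theta.size, theta.size))
    for y in range(n_classes):
        # Gradient of log p(y | x) w.r.t. theta is outer(onehot(y) - p, x).
        g = np.outer(np.eye(n_classes)[y] - p, x).ravel()
        fisher += p[y] * np.outer(g, g)
    return fisher


def reconstruct(W, H):
    # For each example i: sum_j W[i, j] * outer(H[j], H[j]); shape (n, P, P).
    return np.einsum("ij,ja,jb->iab", W, H, H)


def loss_fn(W, H, fishers):
    # Summed squared Frobenius error (an assumed objective).
    return float(((reconstruct(W, H) - fishers) ** 2).sum())


def factorize(fishers, n_components, steps=300, lr=1e-2, seed=0):
    """Projected gradient descent; W is clipped at zero to stay non-negative."""
    rng = np.random.default_rng(seed)
    n, P, _ = fishers.shape
    W = rng.uniform(0.1, 1.0, size=(n, n_components))
    H = rng.normal(scale=0.1, size=(n_components, P))
    losses = [loss_fn(W, H, fishers)]
    for _ in range(steps):
        R = reconstruct(W, H) - fishers  # symmetric residuals, (n, P, P)
        grad_W = 2.0 * np.einsum("iab,ja,jb->ij", R, H, H)
        grad_H = 4.0 * np.einsum("ij,iab,jb->ja", W, R, H)
        W_new = np.maximum(W - lr * grad_W, 0.0)
        H_new = H - lr * grad_H
        new_loss = loss_fn(W_new, H_new, fishers)
        if new_loss < losses[-1]:
            W, H = W_new, H_new
            losses.append(new_loss)
        else:
            lr *= 0.5  # reject the step and retry with a smaller step size
    return W, H, losses


rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 2))    # 3 classes, 2 features, so P = 6
inputs = rng.normal(size=(12, 2))  # 12 toy examples
fishers = np.stack([per_example_fisher(theta, x) for x in inputs])
W, H, losses = factorize(fishers, n_components=4)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In this sketch, row i of W gives the non-negative weight of each component for example i, so examples could be grouped by their largest coefficients when inspecting a component. Forming full P x P matrices is only feasible at toy scale; how the method handles the parameter counts of real language models is not covered here.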
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Neural Networks and Applications
