Uncovering Model Processing Strategies with Non-Negative Per-Example Fisher Factorization

Michael Matena; Colin Raffel

arXiv:2310.04649 · cs.LG · May 12, 2025

Open Access

TL;DR

NPEFF is an interpretability method that decomposes per-example Fisher matrices into learned rank-1 positive semi-definite components. These components correspond to processing strategies in language models and can be selectively disrupted through parameter perturbations.

Contribution

This paper introduces NPEFF, a decomposition of per-example Fisher matrices for interpreting and selectively disrupting model processing strategies. It outperforms gradient clustering and autoencoder baselines.

Findings

01. NPEFF components align with model processing strategies.

02. Parameter perturbations can selectively disrupt specific strategies.

03. NPEFF outperforms gradient clustering and autoencoder baselines.

Abstract

We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that aims to uncover strategies used by a model to generate its predictions. NPEFF decomposes per-example Fisher matrices using a novel decomposition algorithm that learns a set of components represented by learned rank-1 positive semi-definite matrices. Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to model processing strategies for a variety of language models and text processing tasks. We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing. Along with conducting extensive ablation studies, we include experiments to show how NPEFF can be used to analyze and mitigate collateral effects of unlearning and use NPEFF to study…
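
The abstract describes two objects: a per-example Fisher matrix, and a set of rank-1 positive semi-definite components combined with non-negative weights. The sketch below illustrates only those objects on a toy softmax-linear classifier. It does not reproduce the paper's decomposition algorithm, which this summary does not spell out. The function names `per_example_fisher` and `reconstruct`, the toy model, and the random coefficients are all assumptions made for illustration.

```python
import numpy as np

def per_example_fisher(theta, x):
    """Exact per-example Fisher of a toy softmax-linear classifier (illustrative).

    theta: (num_classes, dim) weights; x: (dim,) input.
    Returns a (P, P) matrix with P = num_classes * dim.
    """
    logits = theta @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    num_classes = theta.shape[0]
    fisher = np.zeros((theta.size, theta.size))
    for y in range(num_classes):
        # Gradient of log p(y | x) w.r.t. theta is (onehot(y) - probs) outer x.
        grad = np.outer(np.eye(num_classes)[y] - probs, x).ravel()
        # Expectation over labels drawn from the model's own prediction.
        fisher += probs[y] * np.outer(grad, grad)
    return fisher

def reconstruct(coeffs, components):
    """Return sum_j coeffs[j] * h_j h_j^T, where h_j is row j of `components`."""
    return (components.T * coeffs) @ components

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))   # hypothetical model: 3 classes, 4 features
x = rng.normal(size=4)
fisher = per_example_fisher(theta, x)

# Hypothetical "learned" quantities: 2 components and non-negative coefficients.
components = rng.normal(size=(2, theta.size))
coeffs = rng.random(2)
approx = reconstruct(coeffs, components)
```

Because each term `coeffs[j] * h_j h_j^T` is positive semi-definite when `coeffs[j] >= 0`, any such reconstruction is itself positive semi-definite, which matches the structure of a Fisher matrix. In an actual fit the coefficients and components would be learned so that `approx` is close to `fisher` for every example, rather than drawn at random as here.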

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Neural Networks and Applications