Discovering Knowledge-Critical Subnetworks in Pretrained Language Models
Deniz Bayazit, Negar Foroutan, Zeming Chen, Gail Weiss, Antoine Bosselut

TL;DR
This paper introduces a method to identify and remove knowledge-critical subnetworks in pretrained language models, enabling precise knowledge suppression while preserving overall model performance.
Contribution
It proposes a multi-objective differentiable masking scheme to discover sparse subnetworks responsible for specific knowledge in language models.
Findings
Highly sparse subnetworks (98%+ sparsity) are critical for specific knowledge.
Removing these subnetworks suppresses targeted knowledge with minimal impact on other abilities.
The method works effectively on multiple GPT2 variants.
Abstract
Pretrained language models (LMs) encode implicit representations of knowledge in their parameters. However, localizing these representations and disentangling them from each other remains an open problem. In this work, we investigate whether pretrained language models contain various knowledge-critical subnetworks: particular sparse computational subgraphs that can, if removed, precisely suppress specific knowledge the model has memorized. We propose a multi-objective differentiable masking scheme that can be applied to both weights and neurons to discover such subnetworks and show that we can use them to precisely remove specific knowledge from models while minimizing adverse effects on the behavior of the original model. We demonstrate our method on multiple GPT2 variants, uncovering highly sparse subnetworks (98%+ sparsity) that are critical for expressing specific collections of…
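To make the idea of a multi-objective differentiable mask concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. It assumes a four-weight linear "model" in place of GPT2, squared-error losses as stand-ins for the paper's objectives, and three terms inferred from the abstract's wording: suppress the target knowledge, maintain behavior on control inputs, and keep the removed subnetwork sparse. All names (`discover_mask`, `PRETRAINED`, `TARGET`, `CONTROL`) and hyperparameters are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, x):
    # Toy "model": a single linear unit.
    return sum(w * xi for w, xi in zip(weights, x))


def discover_mask(weights, target_inputs, control_inputs,
                  sparsity_weight=0.05, lr=0.5, steps=500):
    """Learn a soft mask m in (0, 1) per weight; m near 1 marks the weight as
    part of the subnetwork to remove. Pretrained weights stay frozen."""
    logits = [0.0] * len(weights)  # mask = sigmoid(logits), starts at 0.5
    for _ in range(steps):
        mask = [sigmoid(s) for s in logits]
        # Model that remains after removing the masked subnetwork.
        remaining = [w * (1.0 - m) for w, m in zip(weights, mask)]

        # Sparsity objective: sparsity_weight * sum(mask), gradient is constant.
        grad_mask = [sparsity_weight] * len(weights)

        # Suppression objective (toy): remaining output on target inputs -> 0.
        for x in target_inputs:
            out = forward(remaining, x)
            for i, (w, xi) in enumerate(zip(weights, x)):
                grad_mask[i] += 2.0 * out * (-w * xi)

        # Maintenance objective (toy): remaining output on control inputs
        # should match the original model's output.
        for x in control_inputs:
            err = forward(remaining, x) - forward(weights, x)
            for i, (w, xi) in enumerate(zip(weights, x)):
                grad_mask[i] += 2.0 * err * (-w * xi)

        # Chain rule through the sigmoid, then a gradient-descent step.
        for i, m in enumerate(mask):
            logits[i] -= lr * grad_mask[i] * m * (1.0 - m)
    return [sigmoid(s) for s in logits]


# Hypothetical setup: the target input depends only on weight 0,
# while the control inputs depend on weights 1, 2, and 3.
PRETRAINED = [2.0, -1.0, 0.5, 1.5]
TARGET = [[1.0, 0.0, 0.0, 0.0]]
CONTROL = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

mask = discover_mask(PRETRAINED, TARGET, CONTROL)
critical = [i for i, m in enumerate(mask) if m > 0.5]  # discretize the mask
pruned = [0.0 if i in critical else w for i, w in enumerate(PRETRAINED)]
```

In this toy setup, only the weight that the target input relies on should end up in `critical`, so zeroing it silences the target output while leaving the control outputs unchanged. The three terms pull against each other by design: suppression grows the mask, while maintenance and the sparsity penalty shrink it.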
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
