Auditing Language Model Unlearning via Information Decomposition
Anmol Goel, Alan Ritter, Iryna Gurevych

TL;DR
This paper introduces an information-theoretic framework using Partial Information Decomposition to audit language model unlearning, revealing residual information about forgotten data and proposing a risk score for privacy protection.
Contribution
It presents a novel, interpretable method to evaluate unlearning effectiveness at the representation level, exposing residual knowledge and guiding privacy-preserving inference.
Findings
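Information about the forgotten data remains linearly decodable from internal representations after unlearning.
Redundant information shared by the original and unlearned models correlates with susceptibility to known adversarial reconstruction attacks.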
Proposed risk score helps mitigate privacy risks.
Abstract
We expose a critical limitation in current approaches to machine unlearning in language models: despite the apparent success of unlearning algorithms, information about the forgotten data remains linearly decodable from internal representations. To systematically assess this discrepancy, we introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID). By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge. Our analysis reveals that redundant information, shared across both models, constitutes residual knowledge that persists post-unlearning and correlates with susceptibility to known adversarial reconstruction attacks. Leveraging these insights, we propose a…
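To make the decomposition described in the abstract concrete, here is a minimal sketch on toy discrete variables. It assumes the classic Williams-Beer I_min redundancy measure, which may differ from the estimator the paper uses on real model representations. Here Y stands for the forgotten data, X1 for the original model's representation, and X2 for the unlearned model's representation. All function names are hypothetical, and reading "unlearned knowledge" as the information unique to the original model is an interpretation of the abstract, not something this summary confirms.

```python
from collections import defaultdict
from math import log2

# Joint distributions are dicts {(y, x1, x2): probability}.
# Index 0 = forgotten data Y, 1 = original-model representation X1,
# 2 = unlearned-model representation X2 (toy discrete stand-ins).

def marginal(joint, idxs):
    """Marginalize the joint onto the variables at the given indices."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idxs)] += p
    return dict(out)

def mutual_information(joint, a_idxs, b_idxs):
    """I(A; B) in bits for two groups of variables selected by index."""
    pa, pb = marginal(joint, a_idxs), marginal(joint, b_idxs)
    pab = marginal(joint, a_idxs + b_idxs)
    total = 0.0
    for key, p in pab.items():
        if p > 0:
            a, b = key[:len(a_idxs)], key[len(a_idxs):]
            total += p * log2(p / (pa[a] * pb[b]))
    return total

def specific_information(joint, y, src_idx):
    """Williams-Beer specific information I(Y = y; X_src) in bits."""
    py, px = marginal(joint, (0,)), marginal(joint, (src_idx,))
    pyx = marginal(joint, (0, src_idx))
    total = 0.0
    for (yy, x), p in pyx.items():
        if yy == y and p > 0:
            p_x_given_y = p / py[(y,)]
            p_y_given_x = p / px[(x,)]
            total += p_x_given_y * log2(p_y_given_x / py[(y,)])
    return total

def pid_imin(joint):
    """Split I(Y; X1, X2) into redundant, unique, and synergistic parts."""
    py = marginal(joint, (0,))
    redundant = sum(
        p * min(specific_information(joint, y, 1),
                specific_information(joint, y, 2))
        for (y,), p in py.items()
    )
    i1 = mutual_information(joint, (0,), (1,))
    i2 = mutual_information(joint, (0,), (2,))
    i12 = mutual_information(joint, (0,), (1, 2))
    return {
        "redundant": redundant,               # residual knowledge (per the abstract)
        "unique_original": i1 - redundant,    # assumed reading of "unlearned knowledge"
        "unique_unlearned": i2 - redundant,
        "synergistic": i12 - i1 - i2 + redundant,
    }

# Toy case 1: unlearning failed, both representations still copy Y.
leaky = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# Toy case 2: unlearning worked, X2 is a coin flip independent of Y.
clean = {(y, y, x2): 0.25 for y in (0, 1) for x2 in (0, 1)}

print(pid_imin(leaky))
print(pid_imin(clean))
```

In the first toy case the full bit about Y is redundant across the two representations, which corresponds to residual knowledge. In the second case the same bit is unique to the original representation and the redundant term vanishes. Real representations are continuous and high-dimensional, so an actual audit would need an estimator suited to that setting.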
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
