An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations
Yichen Gao, Altay Unal, Akshay Rangamani, Zhihui Zhu

TL;DR
This paper examines the internal representations of models produced by machine unlearning (MU) methods, shows that many output-level success metrics are misleading, and proposes a feature-classifier alignment approach for more faithful unlearning.
Contribution
The paper identifies feature-classifier misalignment in models produced by MU methods and introduces CMF-based techniques to improve unlearning fidelity at the representation level.
Findings
Hidden features remain highly discriminative after unlearning.
Adjusting only the classifier can drive forget accuracy to a negligible level.
CMF-based methods reduce forgotten information while maintaining accuracy.
Abstract
While numerous machine unlearning (MU) methods have recently been developed with promising results in erasing the influence of forgotten data, classes, or concepts, they are also highly vulnerable: for example, simple fine-tuning can inadvertently reintroduce erased concepts. In this paper, we address this contradiction by examining the internal representations of unlearned models, in contrast to prior work that focuses primarily on output-level behavior. Our analysis shows that many state-of-the-art MU methods appear successful mainly due to a misalignment between last-layer features and the classifier, a phenomenon we call feature-classifier misalignment. In fact, hidden features remain highly discriminative, and simple linear probing can recover near-original accuracy. Assuming neural collapse in the original model, we further demonstrate that adjusting only the classifier can achieve…
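To make the linear-probing claim concrete, here is a minimal sketch, not the authors' code or experimental setup. It uses synthetic Gaussian clusters as a stand-in for frozen last-layer features, and it imitates feature-classifier misalignment by suppressing the forget-class logit of a nearest-class-mean head. The names `fit_linear_probe`, `probe_accuracy`, `make_features`, and `forget_class` are hypothetical, and the ridge least-squares probe is one simple choice of linear probe among several.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a ridge-regularized least-squares probe on frozen features."""
    n, d = features.shape
    X = np.hstack([features, np.ones((n, 1))])   # append a bias column
    Y = np.eye(num_classes)[labels]              # one-hot targets
    return np.linalg.solve(X.T @ X + ridge * np.eye(d + 1), X.T @ Y)

def probe_accuracy(W, features, labels):
    """Accuracy of the probe W on the given features."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return float(np.mean(np.argmax(X @ W, axis=1) == labels))

# ASSUMPTION: synthetic, well-separated clusters stand in for the
# last-layer features of an unlearned model.
rng = np.random.default_rng(0)
num_classes, dim, per_class = 4, 16, 200
means = 3.0 * rng.normal(size=(num_classes, dim))

def make_features():
    labels = np.repeat(np.arange(num_classes), per_class)
    feats = means[labels] + rng.normal(size=(labels.size, dim))
    return feats, labels

train_x, train_y = make_features()
test_x, test_y = make_features()

# ASSUMPTION: a nearest-class-mean head whose forget-class logit is
# suppressed imitates a classifier that is misaligned with the features.
forget_class = 0
logits = test_x @ means.T - 0.5 * np.sum(means ** 2, axis=1)
logits[:, forget_class] = -np.inf
forget_mask = test_y == forget_class
head_forget_acc = float(
    np.mean(np.argmax(logits, axis=1)[forget_mask] == forget_class)
)

# A fresh linear probe on the same frozen features.
W = fit_linear_probe(train_x, train_y, num_classes)
probe_forget_acc = probe_accuracy(W, test_x[forget_mask], test_y[forget_mask])

print(f"forget accuracy via suppressed head: {head_forget_acc:.2f}")
print(f"forget accuracy via linear probe:    {probe_forget_acc:.2f}")
```

In this toy setting the output-level metric reports the forget class as erased, while the probe still separates it, because the features themselves were never changed. With a real model, the features would instead be extracted from the penultimate layer of the unlearned network on forget-set and retain-set inputs.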
