Distillation Robustifies Unlearning

Bruce W. Lee; Addie Foote; Alex Infanger; Leni Shor; Harish Kamath; Jacob Goldman-Wetzler; Bryce Woodworth; Alex Cloud; Alexander Matt Turner

arXiv:2506.06278·cs.LG·October 27, 2025



TL;DR

This paper shows that distilling an unlearned language model into a randomly initialized or noised student makes unlearning robust to finetuning, removing unwanted capabilities with less compute and data than retraining from scratch.

Contribution

The authors introduce Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a noised copy of itself, approaching the robustness of retraining at reduced compute and data cost.

Findings

01. UNDO matches retraining robustness with less compute

02. Distillation transfers behaviors while leaving latent capabilities behind

03. UNDO is effective on synthetic and real benchmarks

Abstract

Current LLM unlearning methods are not robust. A few steps of finetuning can revert their effects. We begin by showing that this is true even for an idealized form of unlearning: training to imitate a model that was never trained on unwanted information. This shows that training a model can drastically modify its input-output behavior while leaving its underlying capabilities intact. In light of this dynamic, we show our main result. Training a randomly initialized student on the outputs of an unlearned model transfers behaviors while leaving latent capabilities behind. In short, distillation robustifies unlearning. Based on this result, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic…
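The noise and distill steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the interpolation rule in `noise_weights`, the mixing parameter `alpha`, the Gaussian noise with standard deviation `scale`, and the choice of KL divergence in `distill_loss` are all assumptions made for illustration, and weights are represented as a flat list of floats rather than a real network.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noise_weights(weights, alpha, rng, scale=1.0):
    """Assumed noising rule: interpolate each weight toward a fresh random value.

    alpha = 0.0 returns the unlearned weights unchanged;
    alpha = 1.0 returns a purely random initialization.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * w + alpha * rng.gauss(0.0, scale) for w in weights]


def distill_loss(teacher_logits, student_logits):
    """Assumed distillation objective: KL(teacher || student) on output distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    rng = random.Random(0)
    unlearned = [0.5, -1.2, 2.0]  # hypothetical weights of the unlearned teacher
    student_init = noise_weights(unlearned, alpha=0.5, rng=rng)
    print("noised student init:", student_init)
    print("loss on matching outputs:", distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print("loss on differing outputs:", distill_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

In this sketch, `alpha` plays the role of the tunable knob the abstract describes: more noise moves the student closer to a random initialization, which plausibly costs more distillation compute in exchange for greater robustness.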



Taxonomy

Topics: Adversarial Robustness in Machine Learning · Machine Learning in Materials Science · Topic Modeling