Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM
Xiaoyu Wu, Yifei Pang, Terrance Liu, Zhiwei Steven Wu

TL;DR
This paper demonstrates that exact unlearning in large language models can still leak sensitive data through a novel extraction attack, especially when both pre- and post-unlearning models are accessible, raising privacy concerns.
Contribution
The authors introduce a new data extraction attack that uses signals from the pre-unlearning model to guide the post-unlearning model, significantly improving extraction success rates in practical deployment scenarios.
Findings
Extraction success doubles in some benchmarks
Attack effective on medical diagnosis dataset
Unlearning may increase privacy risks
Abstract
Large Language Models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning -- which retrains the model from scratch without the target data -- is widely regarded as the gold standard for mitigating privacy risks in deployment. In this paper, we revisit this assumption in a practical deployment setting where both the pre- and post-unlearning logits APIs are exposed, such as in open-weight scenarios. Targeting this setting, we introduce a novel data extraction attack that leverages signals from the pre-unlearning model to guide the post-unlearning model, uncovering patterns that reflect the removed data distribution. Combining model guidance with a…
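To make the idea of "model guidance" concrete, here is a minimal sketch of one way next-token scores from the two models could be combined when both logits APIs are available. The combination rule (extrapolating from the post-unlearning log-probabilities toward the pre-unlearning ones), the weight `w`, and the function names `log_softmax`, `guided_scores`, and `greedy_next_token` are assumptions made for illustration; the abstract does not state the exact formula, and this is not presented as the authors' implementation.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def guided_scores(pre_logits, post_logits, w=2.0):
    """Hypothetical guidance rule (an assumption, not taken from the paper):
    start from the post-unlearning log-probabilities and move along the
    direction (pre - post). w = 0 gives the post-unlearning model, w = 1 gives
    the pre-unlearning model, and w > 1 amplifies what only the
    pre-unlearning model prefers."""
    pre = log_softmax(pre_logits)
    post = log_softmax(post_logits)
    return [q + w * (p - q) for p, q in zip(pre, post)]

def greedy_next_token(pre_logits, post_logits, w=2.0):
    """Pick the token index with the highest guided score."""
    scores = guided_scores(pre_logits, post_logits, w)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 3-token vocabulary: both models rank token 0 first, but the
# pre-unlearning model gives token 1 far more mass than the post-unlearning one.
pre = [2.0, 1.9, 0.0]
post = [2.0, 0.5, 0.0]

print(greedy_next_token(pre, post, w=0.0))  # post-unlearning model alone
print(greedy_next_token(pre, post, w=1.0))  # pre-unlearning model alone
print(greedy_next_token(pre, post, w=2.0))  # guided: amplifies the difference
```

In this toy setup, each model on its own picks token 0, while the guided score with `w = 2.0` picks token 1, the token whose probability dropped most after unlearning. This illustrates how a difference between the two models could point toward the removed data.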
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
