Leverage Unlearning to Sanitize LLMs
Antoine Boutet, Lucas Magnana

TL;DR
This paper introduces SANI, a novel unlearning method that sanitizes large language models by disrupting memorization of sensitive data through neuron resetting and fine-tuning, reducing privacy risks efficiently.
Contribution
SANI provides an effective unlearning approach to sanitize LLMs without extensive retraining, focusing on disrupting memorization of sensitive information in a computationally efficient manner.
Findings
Significantly reduces regurgitation of sensitive data
Effective on models trained with medical and confidential data
Requires only a few additional training epochs
Abstract
Pre-trained large language models (LLMs) are becoming useful for various tasks. To improve their performance on certain tasks, it is necessary to fine-tune them on specific data corpora (e.g., medical reports, business data). These specialized data corpora may contain sensitive data (e.g., personal or confidential data) that will be memorized by the model and are likely to be regurgitated during its subsequent use. This memorization of sensitive information by the model poses a significant privacy or confidentiality issue. To remove this memorization and sanitize the model without requiring costly additional fine-tuning on a secured data corpus, we propose SANI. SANI is an unlearning approach to sanitize language models. It relies on an erasure phase and a repair phase that 1) reset certain neurons in the last layers of the model to disrupt the memorization of fine-grained information, and…
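The erasure phase described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `reset_neurons`, the Gaussian re-initialization with standard deviation 0.02, and the list-of-rows weight layout are all assumptions made for illustration. The default fraction of 0.5 follows the reviewers' remark that 50 percent of final-layer neurons are reset at random.

```python
import random

def reset_neurons(weights, biases, fraction=0.5, scale=0.02, seed=0):
    """Reinitialize a random subset of output neurons of one dense layer.

    weights: list of rows, one row of incoming weights per output neuron.
    biases:  one bias value per output neuron.
    Returns (new_weights, new_biases, reset_indices); inputs are not modified.
    The Gaussian re-initialization is an assumption, not taken from the paper.
    """
    rng = random.Random(seed)
    n_neurons = len(weights)
    n_reset = int(n_neurons * fraction)
    reset_indices = sorted(rng.sample(range(n_neurons), n_reset))
    chosen = set(reset_indices)

    new_weights, new_biases = [], []
    for i, (row, bias) in enumerate(zip(weights, biases)):
        if i in chosen:
            # Erase whatever this neuron had learned.
            new_weights.append([rng.gauss(0.0, scale) for _ in row])
            new_biases.append(0.0)
        else:
            # Keep the neuron untouched.
            new_weights.append(list(row))
            new_biases.append(bias)
    return new_weights, new_biases, reset_indices

# Toy layer with 4 output neurons and 3 inputs each.
layer_w = [[1.0, 2.0, 3.0] for _ in range(4)]
layer_b = [0.5] * 4
new_w, new_b, reset = reset_neurons(layer_w, layer_b, fraction=0.5, seed=0)
```

In a real model the same idea would be applied to the weight tensors of the last layers, followed by the repair phase to recover utility.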
Peer Reviews
Decision: Submitted to ICLR 2026
1. Effective and Fast Sanitization: SANI achieves rapid and substantial reduction in sensitive information regurgitation after just 1–2 unlearning epochs.
2. Strong Utility-Privacy Trade-off: It maintains downstream task performance (e.g., word prediction, classification) while improving privacy, demonstrating minimal utility loss even after sanitization.
3. Versatility Across Use Cases: SANI works for both fine-tuned (e.g., medical data) and pre-trained models (e.g., public corpora with confi
1. Limited methodological novelty. The paper presents SANI as a new unlearning method, but it lacks theoretical depth or clear innovation. The main idea of resetting part of the model and then fine-tuning it while avoiding certain tokens is similar to existing erasure and repair approaches. There are no formal definitions, equations, or theoretical analysis to help evaluate how or why the method works.
2. Insufficient baseline comparisons. The evaluation does not include enough recent or strong
The paper tackles a relevant and timely problem: avoiding privacy leakage in pretrained or fine-tuned LLMs prior to model sharing. The approach is conceptually simple and computationally inexpensive, which can be practical for organizations with limited resources. The evaluation includes two architectures (BERT and GPT-2) and two application scenarios, which broadens relevance beyond a single case. The results are easy to interpret since the proposed metrics directly measure regurgitation and ut
The technical novelty is limited. The method is essentially a straightforward application of selective last-layer reinitialization combined with masked LM training that excludes sensitive tokens. Both ideas are known, and the paper relies heavily on existing work (e.g., erase-and-repair strategies). The choice of randomly resetting 50 percent of final-layer neurons is not justified, nor is there analysis of which neurons actually encode sensitive content. The evaluation lacks strong baselines su
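The "masked LM training that excludes sensitive tokens" mentioned in this review can be sketched as follows. This is one possible reading, not the paper's procedure: the helper names `choose_mask_positions` and `apply_mask`, the `[MASK]` placeholder, and the default masking probability of 0.15 (a common BERT-style choice) are assumptions. The key point is that sensitive tokens are never selected as prediction targets.

```python
import random

def choose_mask_positions(tokens, sensitive, mask_prob=0.15, seed=0):
    """Pick positions to mask, never selecting a sensitive token."""
    rng = random.Random(seed)
    return [
        i for i, tok in enumerate(tokens)
        if tok not in sensitive and rng.random() < mask_prob
    ]

def apply_mask(tokens, positions, mask_token="[MASK]"):
    """Return the masked sequence and the labels the model must predict."""
    masked = list(tokens)
    labels = {}
    for i in positions:
        labels[i] = masked[i]   # training target at this position
        masked[i] = mask_token  # hide the token from the model
    return masked, labels

tokens = "the patient john doe has mild asthma".split()
sensitive = {"john", "doe"}  # hypothetical sensitive vocabulary

# mask_prob=1.0 masks every non-sensitive token, which makes the
# exclusion easy to see; a real run would use a much lower probability.
positions = choose_mask_positions(tokens, sensitive, mask_prob=1.0)
masked, labels = apply_mask(tokens, positions)
```

Whether sensitive tokens should also be hidden from the input context during repair is a separate design choice that this sketch leaves open.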
1. The paper proposes a simple yet effective erase-and-repair method that does not require full retraining or architectural changes, making it practical for real-world deployment of LLMs that contain sensitive or private data.
2. Extensive experiments on both fine-tuned medical models and pre-trained language models show that SANI significantly reduces sensitive information regurgitation while preserving downstream performance, outperforming pruning and repair-only baselines.
1. The method resets only the final layer to erase memorized content, but does not provide strong theoretical support for why this layer alone is sufficient to remove deeper internal representations. The approach may fail if sensitive information is stored in earlier layers or attention patterns.
2. SANI primarily focuses on direct regurgitation of exact n-grams. It is unclear whether the method can remove more abstract or paraphrased forms of sensitive knowledge, such as latent identity informa
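The reviewer's concern about exact n-gram regurgitation can be made concrete with a small sketch. The metric below is hypothetical and is not the one defined in the paper: `regurgitation_rate` and the choice of n = 3 are assumptions. It measures the fraction of sensitive n-grams that reappear verbatim in generated text, and shows that a reordered or paraphrased leak goes undetected.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def regurgitation_rate(generated_tokens, sensitive_sequences, n=3):
    """Fraction of sensitive n-grams found verbatim in the generated text."""
    generated = ngrams(generated_tokens, n)
    sensitive = set()
    for seq in sensitive_sequences:
        sensitive |= ngrams(seq, n)
    if not sensitive:
        return 0.0
    return len(sensitive & generated) / len(sensitive)

secret = ["john doe was admitted".split()]  # hypothetical sensitive record

verbatim = "patient john doe was admitted on monday".split()
reworded = "the patient named doe john arrived".split()

rate_verbatim = regurgitation_rate(verbatim, secret)  # 1.0: both 3-grams leak
rate_reworded = regurgitation_rate(reworded, secret)  # 0.0: same identity, no match
```

The second case still reveals the patient's name, which is why an exact-match metric alone may understate leakage.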
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Privacy-Preserving Technologies in Data
