LLM Unlearning with LLM Beliefs
Kemou Li, Qizhou Wang, Yue Wang, Fengpeng Li, Jun Liu, Bo Han, Jiantao Zhou

TL;DR
This paper introduces a novel unlearning method for large language models that counters the squeezing effect caused by traditional gradient-based approaches, using model beliefs to achieve more effective forgetting of sensitive content.
Contribution
The paper proposes a bootstrapping framework that leverages model beliefs to improve unlearning effectiveness, addressing limitations of existing gradient-based methods.
Findings
The proposed method outperforms existing unlearning techniques across multiple benchmarks.
Incorporating model beliefs reduces the squeezing effect and enhances forgetting accuracy.
The approach maintains model utility while achieving more thorough unlearning.
Abstract
Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs. Prevailing unlearning methods generally rely on gradient ascent and its variants to lower the probability of specific target responses. However, we find that this strategy induces a critical side effect: probability mass is redistributed into high-likelihood regions, often corresponding to semantically related rephrasings of the targets. We refer to this as the squeezing effect, which explains why many methods yield merely spurious unlearning, a problem further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport actual success. To address this, we propose a bootstrapping (BS) framework that explicitly links the squeezing effect with the model's own high-confidence generations, namely its model beliefs. Since model beliefs…
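The squeezing effect described in the abstract can be reproduced on a single softmax distribution. The sketch below is a toy illustration under stated assumptions, not the authors' experimental setup: the four-token vocabulary, the initial logits, the learning rate, and the step count are all hypothetical, and the function names are introduced here only for illustration. It applies plain gradient ascent to the negative log-likelihood of one target token and compares the distribution before and after.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_ascent_unlearn(logits, target, lr=1.0, steps=50):
    """Gradient ascent on the loss -log p[target], taken w.r.t. the logits.

    The gradient of -log p[target] with respect to logit i is
    p[i] - 1[i == target], so each ascent step lowers the target logit
    and raises every other logit in proportion to its current probability.
    """
    logits = list(logits)
    for _ in range(steps):
        p = softmax(logits)
        for i in range(len(logits)):
            grad = p[i] - (1.0 if i == target else 0.0)
            logits[i] += lr * grad
    return softmax(logits)


# Hypothetical vocabulary: index 0 is the target response to forget,
# index 1 stands in for a semantically related rephrasing that the model
# already rates as likely, indices 2 and 3 are unrelated alternatives.
initial_logits = [3.0, 2.0, 0.0, -1.0]
before = softmax(initial_logits)
after = gradient_ascent_unlearn(initial_logits, target=0)

print("before:", [round(x, 4) for x in before])
print("after: ", [round(x, 4) for x in after])
```

Because each non-target logit grows in proportion to its own probability, the freed mass does not spread evenly: it concentrates on the alternative that was already most likely. In this toy setting that alternative plays the role of the rephrasing, which is the pattern the abstract calls spurious unlearning.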
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is very readable, with a logical flow from motivation → analysis → method → theory → experiments. Figures and appendices are well-organized, and pseudocode makes the algorithms easy to reproduce. 2. The authors make a thoughtful observation about the squeezing effect and systematically demonstrate its existence through both qualitative and quantitative analysis. The proposed bootstrapping strategy is a creative extension of this insight, and the experiments convincingly show that BS
1. While the paper focuses on redistributing likelihood as the core cause of spurious unlearning, the explanation still feels surface-level from a semantic standpoint. The essence of the problem may not lie solely in likelihood shifts, but rather in the fact that current unlearning methods attempt to correct predictions without accounting for semantic relatedness. Unlearning should arguably target semantic classes of knowledge, rather than isolated outputs or sequences. A more principled formula
1. The identification and mechanistic analysis of the "squeezing effect" is a novel and significant contribution. It provides a clear diagnosis for a subtle but critical flaw in widely-used unlearning methods. This finding is highly significant for the field, as it suggests many existing methods may offer a false sense of security regarding privacy and safety. 2. The core claim of the "squeezing effect" is not just asserted but convincingly demonstrated through empirical analysis of probability
1. One small weakness is the practical cost of BS-S. Algorithm 2 implies sampling $N$ high-confidence sequences for every sample in a batch during training. This requires $N$ inference passes for each training step, which seems computationally prohibitive and scales poorly. Figure 6 shows BS-S is ~2x slower than NPO, and it might be even worse as $N$ grows. The paper also notes OOM issues when setting $N=5$. It would be better to add an ablation on the frequency of belief sampling (e.g., once per
- The paper is easy to follow.
- The content of the paper is substantial, with both a summary of existing work and sufficient theoretical evidence.
- The paper acknowledges in Appendix G that this method is very sensitive to the settings of hyperparameters such as the bootstrapping coefficient, and often requires extensive tuning for specific datasets and models. This seriously affects the method's application in practical scenarios.
- Lack of comparison of computational overhead between various baseline methods.
- The Bootstrapping framework relies on the high-confidence results generated by the model itself to guide forgetting. However, t
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
