Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs
Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann

TL;DR
This paper introduces Partial Model Collapse (PMC), a novel unlearning method for large language models that leverages deliberate model collapse to remove private data without explicitly optimizing on unlearning targets, improving privacy and utility.
Contribution
The paper proposes PMC, a new unlearning approach that induces model collapse to effectively erase specific data, avoiding reinforcement of sensitive information and aligning better with privacy principles.
Findings
PMC effectively removes private information from model outputs.
PMC overcomes key limitations of existing unlearning methods.
Theoretical analysis confirms convergence to desired unlearning outcomes.
Abstract
Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We argue this not only risks reinforcing exposure to sensitive data, but also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from model outputs. Our central insight is that model collapse can be leveraged for machine unlearning by deliberately triggering it for data we aim to remove. We show theoretically that our approach converges to the desired outcome, i.e., the model unlearns the data targeted for removal. We…
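The idea of deliberately triggering collapse can be illustrated with a toy categorical model. The following is a minimal sketch, not the authors' algorithm: the function names, the mixing update, the best-of-n selection, and the reward are all assumptions made for illustration. In particular, the hypothetical reward only checks whether a sampled answer differs from the original model's most likely answer; the paper's actual reward is not shown in this excerpt.

```python
import random

# Toy "model": a categorical distribution over four candidate answers.
# Index 0 plays the role of the memorized (private) answer.
INITIAL = [0.7, 0.1, 0.1, 0.1]

# Assumption: the reward is computed from the ORIGINAL model's own most
# likely output, not from a ground-truth unlearning target.
ORIGINAL_MODE = max(range(len(INITIAL)), key=lambda i: INITIAL[i])


def reward(answer):
    """Hypothetical reward: 1 if the answer deviates from the original mode."""
    return 0.0 if answer == ORIGINAL_MODE else 1.0


def collapse_step(probs, n_samples, lr, rng):
    """Sample from the current model, keep the preferred sample, retrain on it."""
    candidates = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    best = max(candidates, key=reward)  # preference-guided selection
    # "Fine-tuning" stand-in: move probability mass toward the chosen sample.
    return [(1 - lr) * p + lr * (1.0 if i == best else 0.0)
            for i, p in enumerate(probs)]


def run_partial_collapse(probs, steps=500, n_samples=8, lr=0.1, seed=0):
    """Iterate self-training; mass on the memorized answer should drain away."""
    rng = random.Random(seed)
    for _ in range(steps):
        probs = collapse_step(probs, n_samples, lr, rng)
    return probs


if __name__ == "__main__":
    final = run_partial_collapse(INITIAL)
    print("initial:", INITIAL)
    print("final:  ", [round(p, 4) for p in final])
```

In this sketch the expected change in the memorized answer's probability per step is `lr * (p0**n_samples - p0)`, which is negative for any `p0` strictly between 0 and 1, so the probability of the memorized answer tends to shrink while the model only ever trains on its own samples.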
Peer Reviews
Decision·ICLR 2026 Poster
Overall, the study offers a novel unlearning approach based on model collapse, which is often viewed as a defect. The proposed PMC method is original in both formulation and intuition, achieving unlearning without relying on the sensitive information that needs to be removed. The preference-guided self-training mechanism is also an interesting idea. The technical quality of this paper is strong, with a solid theoretical foundation and convincing empirical validation on the benchmark. The compar
The experiments rely solely on the TOFU dataset, which is somewhat limiting. It would be beneficial to validate the performance on additional benchmarks such as MUSE, WMDP, or others to strengthen the empirical evidence. PMC is compared against GA, GD, NPO, and SimNPO, but not against several recent unlearning methods, such as SCRUB, DPO, or Negative Preference Fine-Tuning. Including these comparisons would provide a more comprehensive evaluation of the proposed approach. It would also be valu
- The paper offers an interesting perspective: leveraging model collapse via iterative relearning on self-generated data as a mechanism for unlearning, with a clear derivation from categorical settings and a principled LLM instantiation.
- The theoretical section establishes convergence of the reward and vanishing variance for the iterative update under stated assumptions, providing a clear link between the objective and unlearning behavior.
- The algorithmic procedure is explicit and the narra
- Evaluation scope is narrow. Experiments focus on a single unlearning benchmark (TOFU) and two LLMs (Phi-1.5, Llama-3.2-3B-Instruct), limiting generality. Additional datasets and tasks would bolster claims.
- Computational cost. The method depends on sampling from the model distribution and the paper acknowledges overhead for LLMs. A clearer cost–benefit analysis or experimental comparisons versus baselines would enhance soundness.
- Assumptions behind the theory are strong. Theoretical argument
- The paper reinterprets model collapse, typically seen as undesirable, as a constructive mechanism for unlearning, an elegant and theoretically grounded insight.
- Clear mathematical derivation of convergence properties (Lemma 1, Theorem 2) and ablation studies validating the hyperparameter behavior.
- The paper is clearly structured, visually intuitive, and well-written.
- While conceptually elegant, PMC requires multiple sampling and fine-tuning iterations, potentially increasing computational cost. Could the authors quantify runtime and explore lightweight approximations?
- The method’s performance hinges on the choice of reward function r(x). How robust is PMC to alternative reward definitions, and can it generalize beyond ROUGE-based divergence metrics?
- The theoretical analysis assumes idealized convergence. How does PMC behave when unlearning large sets
Taxonomy
Topics: Reservoir Engineering and Simulation Methods
