Catastrophic Failure of LLM Unlearning via Quantization

Zhiwei Zhang; Fali Wang; Xiaomin Li; Zongyu Wu; Xianfeng Tang; Hui Liu; Qi He; Wenpeng Yin; Suhang Wang

arXiv:2410.16454·cs.CL·March 24, 2025·2 cites

TL;DR

This paper demonstrates that quantization can reverse the effects of LLM unlearning, revealing that models may still retain significant forgotten knowledge despite unlearning efforts.

Contribution

It uncovers a catastrophic failure mode where quantization restores unlearned information, challenging the reliability of current unlearning benchmarks for LLMs.

Findings

01 In the paper's experiments, an unlearned model retains on average 21% of the intended forgotten knowledge at full precision; after 4-bit quantization this rises to 83%.

02 Unlearning methods with utility constraints are vulnerable to quantization-based restoration.

03 Current benchmarks may not accurately measure true forgetting in LLM unlearning.
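The first finding can be made more concrete with a toy experiment. The sketch below is not the paper's code or setup: it assumes (following the reviewers' summary further down that unlearning makes only minimal weight changes) that an "unlearned" model differs from the original by a small perturbation, and it uses a simplified symmetric round-to-nearest quantizer. The function names `quantize` and `fraction_unchanged`, the weight distribution, and the perturbation size are all hypothetical choices made for illustration.

```python
import random


def quantize(weights, bits):
    """Map weights to integer codes on a uniform symmetric grid.

    Simplified round-to-nearest scheme for illustration only; real LLM
    quantizers (group-wise, calibrated, etc.) are more involved.
    """
    levels = 2 ** (bits - 1) - 1                     # e.g. 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / levels    # width of one quantization bin
    return [round(w / scale) for w in weights]


def fraction_unchanged(original, unlearned, bits):
    """Fraction of weights whose quantized code is identical in both models."""
    q_orig = quantize(original, bits)
    q_unl = quantize(unlearned, bits)
    same = sum(a == b for a, b in zip(q_orig, q_unl))
    return same / len(original)


random.seed(0)

# Hypothetical "original" weights and an "unlearned" copy that differs only
# by a tiny perturbation (assumed stand-in for a small-learning-rate update).
original = [random.gauss(0.0, 0.02) for _ in range(10_000)]
unlearned = [w + random.gauss(0.0, 1e-4) for w in original]

match_4bit = fraction_unchanged(original, unlearned, bits=4)
match_8bit = fraction_unchanged(original, unlearned, bits=8)

print(f"4-bit: {match_4bit:.1%} of weights quantize to the same code")
print(f"8-bit: {match_8bit:.1%} of weights quantize to the same code")
```

Under these assumptions, the coarser 4-bit grid maps more of the perturbed weights back onto the same codes as the original weights than the 8-bit grid does, which is one intuition for why low-bit quantization could bring back behavior that the small unlearning update had suppressed. This is only an analogy at the level of individual weights; it says nothing about the retained-knowledge percentages reported above.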

Abstract

Large language models (LLMs) have shown remarkable proficiency in generating text, benefiting from extensive training on vast textual corpora. However, LLMs may also acquire unwanted behaviors from the diverse and sensitive nature of their training data, which can include copyrighted and private content. Machine unlearning has been introduced as a viable solution to remove the influence of such problematic content without the need for costly and time-consuming retraining. This process aims to erase specific knowledge from LLMs while preserving as much model utility as possible. Despite the effectiveness of current unlearning methods, little attention has been given to whether existing unlearning methods for LLMs truly achieve forgetting or merely hide the knowledge, which current unlearning benchmarks fail to detect. This paper reveals that applying quantization to models that have…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

1. The paper is original to my knowledge.
2. The writing is clear and flows well.
3. The problem is well motivated: I do think unlearning is largely studied in full precision, and the effects of quantization are important to understand given that it's a common practice.

Weaknesses

1. **Presentation** I found the results incredibly difficult to parse. The large tables are hard to extract trends from, and I think at times the results that matter are split across more than one of these tables or duplicated. For example, Table 3 has some of the data in Table 1 but is missing other rows one might care about. I was flipping back and forth a lot, and it is very hard to internalize the trends. An informative visualization of this data is missing.
2. **Unclear Experimenta

Reviewer 02·Rating 8·Confidence 4

Strengths

- The paper identifies a critical flaw in current unlearning methodologies and demonstrates a major consequence - lack of robustness to quantization. This calls for closer examination of learning rates used in the unlearning literature.
- The experimental validation is thorough, covering two datasets, multiple metrics for measuring unlearning performance, and different unlearning approaches (gradient ascent and negative preference optimization). This comprehensive evaluation builds confidence t

Weaknesses

- It is unclear whether robustness to quantization is a more useful way of measuring unlearning robustness compared to few-shot finetuning, which has been studied more extensively recently. The explanation for the mechanism behind vulnerability to quantization - minimal weight changes during unlearning - suggests that both quantization and few-shot finetuning could exploit similar weaknesses.
- Recent unlearning methods [1] [2] [3] potentially encourage deeper forgetting in model representation

Reviewer 03·Rating 6·Confidence 4

Strengths

- [S1] **Interesting topic and observation.** The field of LLM unlearning is of growing interest and could be of great potential use to the ML community in practice. The observation that a simple 4-bit quantization of an unlearned model can restore "forgotten" knowledge is very interesting, and would benefit many researchers working on machine unlearning.
- [S2] **Strong empirical results.** Experiments show that the proposed method SURE consistently retains its unlearning efficacy after 4-bit q

Weaknesses

- [W1] **Narrow scope of the paper.** There exist a number of ways to attack an unlearned model so that it generates "forgotten" data, such as jailbreaking or in-context relearning [A], but the paper and its newly proposed method are motivated solely by one particular attack mechanism (i.e., quantization), which makes the overall scope of the paper rather narrow. To make things worse, the main issue of restoring "forgotten" data does not appear when using 8-bit quantization, but only appears with 4-

Code & Models

Repositories

zzwjames/failurellmunlearning
PyTorch·Official

Taxonomy

Topics: Comparative and International Law Studies · Artificial Intelligence in Law · Legal Education and Practice Innovations

Methods: Softmax · Attention Is All You Need