Learning-Time Encoding Shapes Unlearning in LLMs
Ruihan Wu, Konstantin Garov, Kamalika Chaudhuri

TL;DR
This paper investigates how the way knowledge is encoded during training affects the ability to unlearn specific facts in large language models, highlighting the importance of encoding choices for effective post-hoc unlearning.
Contribution
It empirically demonstrates that learning with paraphrased descriptions enhances unlearning, and that unlearning from text chunks is inherently challenging, emphasizing the role of encoding strategies.
Findings
Paraphrased descriptions improve unlearning effectiveness.
Unlearning from text chunks is difficult.
Learning-time encoding influences unlearning success.
Abstract
As large language models (LLMs) are increasingly deployed in the real world, the ability to "unlearn", or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our experiments reveal two key findings: (1) learning with paraphrased descriptions improves unlearning performance and (2) unlearning an individual piece of knowledge from a chunk of text is challenging. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.
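To make the two findings concrete, here is a minimal toy sketch of three ways the same facts could be encoded as training text: one description per fact, several paraphrases per fact, or several facts packed into one chunk. Everything in it (the facts, the function names, the `forget_set` helper) is hypothetical and chosen for illustration; it is not the paper's data pipeline or benchmark.

```python
from typing import Callable, Dict, List

# Hypothetical toy facts (not from the paper); each has several paraphrases.
FACTS: Dict[str, List[str]] = {
    "alice_birthplace": [
        "Alice Example was born in Lyon.",
        "The birthplace of Alice Example is Lyon.",
        "Lyon is the city where Alice Example was born.",
    ],
    "bob_occupation": [
        "Bob Sample works as a cartographer.",
        "The occupation of Bob Sample is cartographer.",
        "Cartography is the profession of Bob Sample.",
    ],
}

Encoding = Callable[[Dict[str, List[str]]], List[str]]


def single_encoding(facts: Dict[str, List[str]]) -> List[str]:
    """One training example per fact: only the first description."""
    return [descs[0] for descs in facts.values()]


def paraphrase_encoding(facts: Dict[str, List[str]]) -> List[str]:
    """One training example per paraphrase, so each fact appears in several wordings."""
    return [d for descs in facts.values() for d in descs]


def chunk_encoding(facts: Dict[str, List[str]]) -> List[str]:
    """A single training example: the first description of every fact, joined into one chunk."""
    return [" ".join(descs[0] for descs in facts.values())]


def forget_set(facts: Dict[str, List[str]], fact_id: str, encoding: Encoding) -> List[str]:
    """Training examples that contain any description of the target fact."""
    targets = facts[fact_id]
    return [ex for ex in encoding(facts) if any(t in ex for t in targets)]


if __name__ == "__main__":
    for enc in (single_encoding, paraphrase_encoding, chunk_encoding):
        print(enc.__name__, forget_set(FACTS, "alice_birthplace", enc))
```

In this toy setup, the forget set for one fact under the chunk encoding is the whole chunk, which also carries the unrelated fact about the second person. That gives an intuition for why removing a single fact from chunked text may be harder to do without side effects, though the sketch itself demonstrates nothing about model behavior.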
Peer Reviews
Decision·ICLR 2026 Poster
(S1) This work is well motivated, and the core question is important and novel, while not trivial. (S2) I commend the authors for the clear organization of research questions, appropriate experimental designs, and well-organized writing.
(W1) **Limited mechanistic understanding**: While the intuition that presenting factual knowledge in multiple paraphrased formats leads to a more structured representation is compelling and aligns with the experimental results, the analysis relies on the observation of knowledge unlearning success, and the understanding of the mechanism governing this behavior is limited. For example: - Is there any difference in the distribution of the update vector or knowledge circuits [1], that is computed for unle
- The paper explores a novel aspect on data shapes for unlearning in LLMs. Most unlearning research focuses on post-hoc algorithms. This work provides a new and important perspective by showing that data curation strategies during fine-tuning are a critical and overlooked factor.
- The findings are clear and consistent across two different model families (Llama2, Gemma2), two different datasets (Eval-DU+ and TOFU+), and two representative unlearning algorithms (Gradient Ascent and Task Vectors)
- Paraphrasing vs. Frequency: The effect of using multiple paraphrases (e.g., in FT-Unlearn-Mul) is closely related to simply increasing the frequency of the fact in the training data. The paper argues this encourages "structured" representations and distinguishes itself from related work on frequency, but it doesn't empirically disentangle the effect of the paraphrasing from the frequency.
- No real-world dataset: The use of synthetic data is a key strength for experimental control, but also a
* This paper approaches the unlearning problem from the perspective of learning-time encoding, analyzing how the way target knowledge is represented and learned during training affects the effectiveness of unlearning. The framing that "how and what is learned determines how well it can be forgotten" clearly distinguishes this work from prior studies focused solely on algorithmic improvements. This is a valid and valuable perspective for deepening our understanding of knowledge unlearning in LLM
* In each experimental setting (e.g., FT-Single vs FT-Unlearn-Mul), the initial learning strength of the forget and retain knowledge differs. Models trained with paraphrased data tend to encode the same facts more strongly, resulting in higher initial scores and making unlearning appear more difficult. The paper attempts to account for this difference using the Norm-AUC metric, but this measure has a structural limitation: models with higher initial scores may be disadvantaged in relative evalua
Taxonomy
Topics: Natural Language Processing Techniques
