Dissecting Fine-Tuning Unlearning in Large Language Models

Yihuai Hong; Yuelin Zou; Lijie Hu; Ziqian Zeng; Di Wang; Haiqin Yang

arXiv:2410.06606 · cs.CL · October 16, 2024

TL;DR

This paper critically examines fine-tuning-based unlearning in large language models, revealing that such methods do not truly erase knowledge but instead alter retrieval processes, impacting overall model behavior.

Contribution

It uncovers the limitations of current unlearning techniques, highlighting the role of MLP coefficients and showing their effects on model behavior and knowledge retention.

Findings

1. Unlearning methods do not genuinely erase embedded knowledge.
2. MLP coefficients are key to controlling model behavior.
3. Unlearning impacts unrelated knowledge and capabilities.
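
To make the term "MLP coefficients" concrete, the sketch below uses the common key-value reading of a transformer MLP, in which the output is a coefficient-weighted sum of value vectors. This is a toy illustration under stated assumptions, not the paper's code: the weights, the ReLU activation, and the names `mlp_coefficients` and `mlp_output` are all hypothetical. It shows how changing only the coefficients can suppress part of the output while the value vectors stay untouched.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def mlp_coefficients(keys, hidden):
    """Coefficients m = ReLU(K h): one scalar per value vector."""
    return [max(0.0, v) for v in matvec(keys, hidden)]


def mlp_output(coeffs, values):
    """Output = sum_i m_i * v_i, a coefficient-weighted mix of value vectors."""
    dim = len(values[0])
    return [sum(m * v[d] for m, v in zip(coeffs, values)) for d in range(dim)]


# Toy weights (hypothetical, chosen only for illustration).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 neurons, input dim 2
values = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]  # 3 value vectors, output dim 2
hidden = [1.0, 2.0]

coeffs = mlp_coefficients(keys, hidden)
original = mlp_output(coeffs, values)

# A stand-in for an "unlearned" model: only the keys change, so neuron 1's
# coefficient drops to zero. The value vectors are left exactly as they were.
unlearned_keys = [[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
unlearned_coeffs = mlp_coefficients(unlearned_keys, hidden)
suppressed = mlp_output(unlearned_coeffs, values)

# Feeding the original coefficients back in recovers the original output,
# because the information in the value vectors was never removed.
recovered = mlp_output(coeffs, values)

print(coeffs, original)
print(unlearned_coeffs, suppressed)
print(recovered == original)
```

In this toy setting the suppressed output differs from the original only because one coefficient was zeroed, which is the sense in which behavior can change without the stored content being erased.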

Abstract

Fine-tuning-based unlearning methods are the prevailing approach for suppressing targeted harmful, sensitive, or copyrighted information in large language models while preserving overall capabilities. However, the true effectiveness of these methods is unclear. In this work, we delve into the limitations of fine-tuning-based unlearning through activation patching and parameter restoration experiments. Our findings reveal that these methods alter the model's knowledge retrieval process, providing further evidence that they do not genuinely erase the problematic knowledge embedded in the model parameters. Instead, the coefficients generated by the MLP components in the model's final layer are the primary contributors to these seemingly positive unlearning effects, playing a crucial role in controlling the model's behaviors. Furthermore, behavioral tests demonstrate that this unlearning mechanism…
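
The two interventions named in the abstract, activation patching and parameter restoration, can be illustrated on a toy network. The sketch below is a minimal example under stated assumptions rather than the authors' experimental setup: the two-layer linear+ReLU model, the weights, and the helper names `run` and `recovery` are hypothetical. Activation patching copies a cached activation from the original model into the unlearned model's forward pass; parameter restoration copies original weights back into the unlearned model. In both cases the score is the fraction of the original-versus-unlearned output gap that the intervention closes.

```python
def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def run(layers, x, patch=None, cache=None):
    """Run a stack of linear+ReLU layers.

    patch: optional {layer_index: activation} that overrides a layer's output.
    cache: optional dict that is filled with every layer's output.
    """
    patch = patch or {}
    h = x
    for i, weights in enumerate(layers):
        h = [max(0.0, v) for v in linear(weights, h)]
        if i in patch:
            h = list(patch[i])
        if cache is not None:
            cache[i] = list(h)
    return h


def recovery(original_out, unlearned_out, intervened_out, target):
    """Fraction of the original-vs-unlearned gap closed on one output unit."""
    gap = original_out[target] - unlearned_out[target]
    return (intervened_out[target] - unlearned_out[target]) / gap


# Toy models (hypothetical): only the final layer differs after "unlearning".
original = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 1.0]]]
unlearned = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 1.0]]]
x = [1.0, 2.0]
target = 0  # output unit standing in for the "forgotten" answer

cache = {}
original_out = run(original, x, cache=cache)
unlearned_out = run(unlearned, x)

# Activation patching: insert the original model's activation at one layer.
patch_scores = {
    layer: recovery(original_out, unlearned_out,
                    run(unlearned, x, patch={layer: cache[layer]}), target)
    for layer in range(len(unlearned))
}

# Parameter restoration: copy the original final-layer weights back.
restored = unlearned[:1] + original[1:]
restore_score = recovery(original_out, unlearned_out, run(restored, x), target)

print(patch_scores)
print(restore_score)
```

A score near 1 for a component suggests that the component accounts for the behavioral change, while a score near 0 suggests it was left largely intact. Real experiments would apply the same idea to individual components of a transformer rather than to whole toy layers.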

Code & Models

Repositories

yihuaihong/dissecting-ft-unlearning (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Activation Patching