Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization

Phillip Guo; Aaquib Syed; Abhay Sheshadri; Aidan Ewart; Gintare Karolina Dziugaite

arXiv:2410.12949·cs.LG·December 6, 2024

TL;DR

This paper demonstrates that localizing edits to the model components behind specific mechanisms makes knowledge unlearning and editing in large language models more robust and effective, reducing side effects and increasing resistance to relearning.

Contribution

The paper introduces an approach based on mechanistic interpretability to improve the precision and robustness of knowledge unlearning and editing in language models.

Findings

01 Localizing edits to the lookup-table mechanism for factual recall makes them more robust across different input and output formats.

02 Mechanistic localization reduces unintended side effects in unlearning.

03 Certain localized edits make models more resistant to relearning unwanted information.

Abstract

Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability -- which, in part, aims to identify model components (circuits) associated with specific interpretable mechanisms that make up a model capability -- can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high-level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the lookup-table mechanism for factual recall…
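To make the idea of "training components localized by different methods" concrete, here is a minimal sketch in plain Python. It assumes that localization yields a set of component names and that the edit is an ordinary gradient step applied only to those components, with everything else frozen. The component names (`attn.0`, `mlp.3`, `mlp.4`), the helper `localized_sgd_step`, and the toy gradients are hypothetical illustrations, not the paper's implementation.

```python
from typing import Dict, List, Set

# A toy "model": component name -> flat list of weights.
Params = Dict[str, List[float]]


def localized_sgd_step(
    params: Params, grads: Params, localized: Set[str], lr: float = 0.1
) -> Params:
    """Return new parameters where only the localized components are updated.

    Components outside `localized` are copied unchanged, i.e. frozen.
    """
    updated: Params = {}
    for name, weights in params.items():
        if name in localized:
            # Plain SGD update restricted to the localized component.
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            # Frozen component: the gradient is ignored.
            updated[name] = list(weights)
    return updated


# Hypothetical components and gradients of an unlearning/editing loss.
params = {"attn.0": [0.5, -0.2], "mlp.3": [1.0, 2.0], "mlp.4": [0.0, 1.0]}
grads = {"attn.0": [1.0, 1.0], "mlp.3": [0.5, -0.5], "mlp.4": [2.0, 0.0]}

# Assumption: a localization method selected these MLP components.
localized = {"mlp.3", "mlp.4"}

new_params = localized_sgd_step(params, grads, localized, lr=0.5)
print(new_params)
```

Under this framing, different localization methods differ only in which set of names they pass as `localized`; the update rule itself stays the same.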

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning