Investigating Model Editing for Unlearning in Large Language Models
Shariqah Hossain, Lalana Kagal

TL;DR
This paper explores the use of model editing algorithms like ROME, IKE, and WISE for unlearning in large language models, demonstrating they can sometimes outperform traditional unlearning methods in forgetting specific information.
Contribution
It introduces new editing targets for unlearning and compares model editing algorithms to baseline unlearning methods, highlighting their strengths and limitations.
Findings
Model editing can surpass baseline unlearning in forgetting quality.
Both approaches struggle to fully unlearn without affecting overall performance.
Model editing approaches are promising but struggle to capture the full scope of what should be unlearned.
Abstract
Machine unlearning aims to remove unwanted information from a model, but many methods are inefficient for LLMs with large numbers of parameters or fail to fully remove the intended information without degrading performance on knowledge that should be retained. Model editing algorithms solve a similar problem of changing information in models, but they focus on redirecting inputs to a new target rather than removing that information altogether. In this work, we explore the editing algorithms ROME, IKE, and WISE and design new editing targets for an unlearning setting. Through this investigation, we show that model editing approaches can exceed baseline unlearning methods in terms of quality of forgetting depending on the setting. Like traditional unlearning techniques, they struggle to encapsulate the scope of what is to be unlearned without damage to the overall model performance.
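To make "redirecting inputs to a new target" concrete, here is a minimal sketch of a rank-one weight edit in the spirit of ROME. It is a simplification under stated assumptions, not the authors' implementation: the key covariance is taken to be the identity, the layer is a plain linear map, and the names `rank_one_edit`, `forget_key`, and `refusal_value` are hypothetical. The all-zero "refusal" vector merely stands in for an unlearning-style target; the paper's actual targets are not reproduced here.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, k: Vector) -> Vector:
    """Multiply matrix W by vector k."""
    return [sum(w * x for w, x in zip(row, k)) for row in W]


def rank_one_edit(W: Matrix, key: Vector, target: Vector) -> Matrix:
    """Return W' = W + (target - W key) key^T / (key^T key).

    After the edit, W' maps `key` exactly to `target`, while any input
    orthogonal to `key` is mapped as before. Assumes an identity key
    covariance, which is a simplification of the ROME update.
    """
    norm_sq = sum(x * x for x in key)
    if norm_sq == 0:
        raise ValueError("key must be non-zero")
    residual = [t - c for t, c in zip(target, matvec(W, key))]
    return [
        [w + r * kj / norm_sq for w, kj in zip(row, key)]
        for row, r in zip(W, residual)
    ]


# Hypothetical toy layer: 2 outputs, 3 inputs.
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
forget_key = [1.0, 1.0, 0.0]      # stands in for the fact to be unlearned
refusal_value = [0.0, 0.0]        # hypothetical unlearning-style target
unrelated_key = [1.0, -1.0, 2.0]  # orthogonal to forget_key

W_edited = rank_one_edit(W, forget_key, refusal_value)
```

In this toy setting the edit is perfectly local only for inputs orthogonal to the edited key. Inputs that overlap with the key are shifted too, which gives a rough intuition for why capturing the scope of what to unlearn without harming retained knowledge is difficult.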
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
