Large Language Model Unlearning via Embedding-Corrupted Prompts
Chris Yuhao Liu, Yaxuan Wang, Jeffrey Flanigan, Yang Liu

TL;DR
This paper introduces ECO Prompts, a lightweight framework for efficiently unlearning specific knowledge in large language models by corrupting prompt embeddings, achieving near-zero side effects and scalability across models.
Contribution
The paper proposes a novel embedding-corruption method that uses prompt classifiers and zeroth-order optimization for effective unlearning in large language models.
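Zeroth-order optimization tunes a parameter using only function evaluations, with no gradients backpropagated through the model. The sketch below uses a symmetric two-point finite-difference estimate to tune a single scalar corruption strength `sigma`. It is a minimal illustration under stated assumptions, not the authors' procedure: the function `zo_minimize`, its step sizes, and the quadratic `toy_objective` are all hypothetical. In a real setting the objective would query the LLM with corrupted prompts and score how closely its outputs match the desired unlearned behavior.

```python
from typing import Callable


def zo_minimize(
    objective: Callable[[float], float],
    sigma0: float,
    lr: float = 0.1,
    mu: float = 1e-3,
    steps: int = 200,
) -> float:
    """Minimize a black-box scalar objective over one parameter (sigma)
    with a symmetric two-point zeroth-order gradient estimate.
    Hypothetical helper for illustration only."""
    sigma = sigma0
    for _ in range(steps):
        # Estimate d(objective)/d(sigma) from two forward evaluations only.
        grad_est = (objective(sigma + mu) - objective(sigma - mu)) / (2 * mu)
        sigma -= lr * grad_est
        # A corruption strength (e.g. a noise std) must stay non-negative.
        sigma = max(sigma, 0.0)
    return sigma


# Hypothetical stand-in objective with its minimum at sigma = 2.5.
# A real objective would score model outputs on corrupted prompts.
def toy_objective(sigma: float) -> float:
    return (sigma - 2.5) ** 2


best = zo_minimize(toy_objective, sigma0=0.0)
print(round(best, 3))
```

Only forward evaluations are needed, so the same loop applies when the model is available purely as a black box.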
Findings
Effective unlearning with minimal side effects
Scalable to models from 0.5B to 236B parameters
No additional cost as model size increases
Abstract
Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what a large language model should not know is important for ensuring alignment and thus safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging due to the potential collateral damage caused by the fuzzy boundary between retention and forgetting, and the large computational requirements for optimization across state-of-the-art models with hundreds of billions of parameters. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models to address both the challenges of knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and…
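The inference-time idea, where a prompt classifier gates whether a prompt's embeddings are corrupted before they reach an unchanged LLM, can be sketched as follows. This is a hedged illustration, not the paper's implementation: additive Gaussian noise is assumed as the corruption function, and `toy_embed`, `toy_classifier`, and `eco_style_forward` are hypothetical stand-ins for a real embedding layer, a trained prompt classifier, and the model's input pipeline.

```python
import random
from typing import Callable, List

Embedding = List[float]


def corrupt_embeddings(
    embeddings: List[Embedding], sigma: float, rng: random.Random
) -> List[Embedding]:
    """Add Gaussian noise with std `sigma` to every token embedding
    (assumed corruption function, for illustration)."""
    return [[x + rng.gauss(0.0, sigma) for x in tok] for tok in embeddings]


def eco_style_forward(
    prompt: str,
    embed: Callable[[str], List[Embedding]],
    should_forget: Callable[[str], bool],
    sigma: float,
    seed: int = 0,
) -> List[Embedding]:
    """Return the embeddings that would be passed to the unmodified LLM."""
    embeddings = embed(prompt)
    if should_forget(prompt):
        # Classifier flags the prompt: corrupt its embeddings.
        return corrupt_embeddings(embeddings, sigma, random.Random(seed))
    # Unflagged prompts pass through untouched, so unrelated queries
    # are unaffected.
    return embeddings


# Hypothetical toy components: a 2-d per-word "embedding" and a keyword check.
def toy_embed(prompt: str) -> List[Embedding]:
    return [[float(len(w)), float(ord(w[0]))] for w in prompt.split()]


def toy_classifier(prompt: str) -> bool:
    return "potter" in prompt.lower()


clean = eco_style_forward(
    "What is the capital of France", toy_embed, toy_classifier, sigma=1.0
)
flagged = eco_style_forward("Who is Harry Potter", toy_embed, toy_classifier, sigma=1.0)
print(clean == toy_embed("What is the capital of France"))
print(flagged != toy_embed("Who is Harry Potter"))
```

Because the model weights are never modified, the per-prompt overhead here comes from the classifier and the embedding corruption rather than from retraining the LLM.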
