Erasing Conceptual Knowledge from Language Models
Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau

TL;DR
This paper presents ELM, a method for concept-level unlearning in language models that reduces the likelihood of generating content related to undesired concepts while preserving overall performance and robustness.
Contribution
Introduces ELM, a principled approach leveraging the model's own introspective capabilities for targeted concept unlearning via low-rank updates.
Findings
ELM effectively erases targeted concepts from language models.
ELM-modified models achieve near-random performance on assessments targeting the erased concepts.
ELM preserves model performance on unrelated tasks and enhances robustness.
Abstract
In this work, we introduce Erasure of Language Memory (ELM), a principled approach to concept-level unlearning that operates by matching distributions defined by the model's own introspective classification capabilities. Our key insight is that effective unlearning should leverage the model's ability to evaluate its own knowledge, using the language model itself as a classifier to identify and reduce the likelihood of generating content related to undesired concepts. ELM applies this framework to create targeted low-rank updates that reduce generation probabilities for concept-specific content while preserving the model's broader capabilities. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative evaluation reveals that ELM-modified models achieve near-random performance on assessments targeting erased concepts, while simultaneously…
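The abstract names two ingredients: the model acting as its own classifier, and a low-rank update. The sketch below is one possible reading of each, not the paper's actual objective or training code. It assumes next-token logits are available under three prompts (the plain context, a concept-bearing context, and a neutral context), and it treats the log-probability gap between the last two as the classifier signal. The function names, the `eta` strength parameter, and the toy numbers are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def softmax(logits):
    """Softmax built on log_softmax so both share the same stabilisation."""
    return [math.exp(v) for v in log_softmax(logits)]


def erased_target(base_logits, concept_logits, neutral_logits, eta=1.0):
    """Assumed target distribution for unlearning (illustrative only).

    Tokens that the model itself rates as more likely under the
    concept-bearing context than under the neutral one are down-weighted.
    eta = 0 leaves the base distribution unchanged.
    """
    base = log_softmax(base_logits)
    with_concept = log_softmax(concept_logits)
    without_concept = log_softmax(neutral_logits)
    scores = [
        b - eta * (c - n)
        for b, c, n in zip(base, with_concept, without_concept)
    ]
    return softmax(scores)


def low_rank_update(W, B, A, scale=1.0):
    """Return W + scale * (B @ A), with B of shape d_out x r and A of shape r x d_in.

    Only B and A would be trained; W stays frozen (the usual low-rank setup).
    """
    rank = len(A)
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(len(W[0]))
        ]
        for i in range(len(W))
    ]


# Toy vocabulary of three tokens; token 0 is strongly tied to the concept.
base_logits = [2.0, 1.0, 0.0]
concept_logits = [4.0, 1.0, 0.0]
neutral_logits = [1.0, 1.0, 0.0]
target = erased_target(base_logits, concept_logits, neutral_logits, eta=1.0)

# Rank-1 update applied to a 2 x 2 identity matrix.
updated = low_rank_update([[1, 0], [0, 1]], B=[[1], [2]], A=[[3, 4]])
```

In a training loop built on this reading, the low-rank factors would be fitted so that the edited model's next-token distribution moves toward `target` on concept text while staying close to the original distribution elsewhere. That loop is omitted here because the abstract does not specify it.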
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
