Precise In-Parameter Concept Erasure in Large Language Models

Yoav Gur-Arieh; Clara Suslik; Yihuai Hong; Fazl Barez; Mor Geva

arXiv:2505.22586 · cs.CL · October 30, 2025

TL;DR

This paper introduces PISCES, a novel method for precisely erasing entire concepts from large language models by editing parameter directions, improving specificity and robustness over existing techniques.

Contribution

PISCES is the first framework to directly edit concept-encoding directions in model parameters for targeted knowledge removal.

Findings

01 Reduces accuracy on the target concept to as low as 7.7%

02 Improves erasure specificity by up to 31% over existing techniques

03 Improves robustness by up to 38% over existing techniques

Abstract

Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest…
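To make the "remove a direction from parameters" step concrete, here is a minimal toy sketch. It is not the PISCES procedure itself: the abstract does not spell out the edit, so plain orthogonal projection is swapped in as an assumption, and the names `remove_direction`, `mlp_vectors`, and `concept_direction` are hypothetical. In the paper the direction would come from a disentangler feature identified as concept-related; here it is simply given.

```python
import math


def normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_direction(vectors, direction):
    """Project each vector onto the subspace orthogonal to `direction`.

    Assumption: orthogonal projection stands in for the paper's edit,
    which the abstract does not describe in detail.
    """
    d = normalize(direction)
    edited = []
    for v in vectors:
        coeff = dot(v, d)  # how strongly v points along the concept direction
        edited.append([x - coeff * y for x, y in zip(v, d)])
    return edited


# Hypothetical toy "MLP vectors" (one per row) in a 3-dimensional space.
mlp_vectors = [[1.0, 2.0, 0.0], [0.0, 3.0, 4.0]]
# Hypothetical concept direction, assumed to be already identified.
concept_direction = [0.0, 1.0, 0.0]

edited_vectors = remove_direction(mlp_vectors, concept_direction)
```

After the edit, every vector has zero component along `concept_direction`, while components in the other directions are left unchanged. In this toy setting, that is the sense in which the edit is targeted.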

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications