Intrinsic Test of Unlearning Using Parametric Knowledge Traces

Yihuai Hong; Lei Yu; Haiqin Yang; Shauli Ravfogel; Mor Geva

arXiv:2406.11614·cs.CL·September 3, 2025

TL;DR

This paper introduces an internal evaluation method for unlearning in large language models that checks whether the parametric knowledge traces of an unlearned concept actually change, and presents a benchmark dataset for assessing unlearning effectiveness.

Contribution

The paper proposes a parameter-based evaluation approach for unlearning, together with the ConceptVectors benchmark dataset, and uses them to reveal limitations of existing behavioral assessments.

Findings

01

Existing unlearning methods minimally affect concept vectors

02

Ablating concept vectors removes associated knowledge effectively

03

Behavioral tests may not reflect true unlearning success

Abstract

The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance in mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general evaluation methodology that leverages vocabulary projections to inspect concepts encoded in model parameters. We use this approach to localize "concept vectors" - parameter vectors that…
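The vocabulary projection mentioned in the abstract can be illustrated with a minimal sketch. It assumes a toy unembedding matrix and a hand-made candidate parameter vector; the names `project_to_vocab`, `unembedding`, `vocab`, and `candidate` are hypothetical and do not come from the paper or its repository. The idea shown is only the general one: multiply a parameter vector by the output embedding matrix and read off the highest-scoring tokens to see which concept, if any, the vector appears to encode.

```python
import numpy as np


def project_to_vocab(vector, unembedding, vocab, top_k=3):
    """Score every vocabulary token against `vector` and return the top_k.

    `unembedding` has shape (vocab_size, hidden_dim); `vector` has shape
    (hidden_dim,). Both are illustrative stand-ins for real model weights.
    """
    logits = unembedding @ vector           # one score per vocabulary token
    top = np.argsort(-logits)[:top_k]       # indices of the highest scores
    return [(vocab[i], float(logits[i])) for i in top]


# Toy setup (assumption): 6 tokens, hidden size 3.
vocab = ["wizard", "wand", "spell", "tax", "invoice", "the"]
unembedding = np.array([
    [1.0, 0.0, 0.0],   # wizard
    [0.9, 0.1, 0.0],   # wand
    [0.8, 0.0, 0.2],   # spell
    [0.0, 1.0, 0.0],   # tax
    [0.0, 0.9, 0.1],   # invoice
    [0.1, 0.1, 0.1],   # the
])

# A hypothetical parameter vector aligned with the first hidden direction.
candidate = np.array([2.0, 0.0, 0.0])

top_tokens = project_to_vocab(candidate, unembedding, vocab)
print([token for token, _ in top_tokens])
```

In this toy case the top tokens cluster around a single theme, which is the kind of signal one would look for when deciding whether a parameter vector is a candidate "concept vector". With a real model, the unembedding matrix and the parameter vectors would be read from the model's weights rather than written by hand.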

Code & Models

Repositories

yihuaihong/conceptvectors
PyTorch · Official

Datasets

YihuaiHong/ConceptVectors
dataset · 47 dl

Videos

Intrinsic Test of Unlearning Using Parametric Knowledge Traces · underline

Taxonomy

Topics

Intelligent Tutoring Systems and Adaptive Learning · Advanced Data Processing Techniques

Methods

Softmax · Attention Is All You Need