Intrinsic Test of Unlearning Using Parametric Knowledge Traces
Yihuai Hong, Lei Yu, Haiqin Yang, Shauli Ravfogel, Mor Geva

TL;DR
This paper introduces a new internal evaluation method for unlearning in large language models, focusing on changes in parametric knowledge traces, and presents a benchmark dataset to assess unlearning effectiveness.
Contribution
The paper proposes a parameter-based evaluation approach for unlearning, together with the ConceptVectors benchmark dataset, and uses them to expose limitations of existing behavioral assessments.
Findings
Existing unlearning methods minimally affect concept vectors
Ablating concept vectors removes associated knowledge effectively
Behavioral tests may not reflect true unlearning success
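The first two findings describe a parametric check: compare a concept vector before and after unlearning, and see what happens when it is ablated. Below is a minimal sketch of such a check in Python with NumPy. It assumes a concept vector is one row of a weight matrix. The matrices, the row index, and the helper names `cosine_similarity` and `ablate` are hypothetical, and zeroing the row is only a simple stand-in, since this summary does not say how the authors perform ablation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ablate(matrix, index):
    """Return a copy of `matrix` with row `index` zeroed (a stand-in for ablation)."""
    out = matrix.copy()
    out[index] = 0.0
    return out

# Hypothetical weights: each row is one parameter vector; row 1 plays the
# role of the concept vector. The values are made up for illustration.
before = np.array([[0.5, 1.0, -0.2],
                   [2.0, 0.0, 1.0]])

# Hypothetical weights after an unlearning method that barely moves the row.
after_unlearning = before + np.array([[0.0, 0.0, 0.0],
                                      [0.01, -0.01, 0.0]])

concept_row = 1
sim = cosine_similarity(before[concept_row], after_unlearning[concept_row])
ablated = ablate(before, concept_row)

print(f"cosine similarity before vs. after unlearning: {sim:.5f}")
print("concept vector after ablation:", ablated[concept_row])
```

A similarity close to 1 would indicate that the vector is essentially unchanged, which is the kind of residual trace a purely behavioral test cannot see.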
Abstract
The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance in mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general evaluation methodology that leverages vocabulary projections to inspect concepts encoded in model parameters. We use this approach to localize "concept vectors" - parameter vectors that…
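The vocabulary-projection idea mentioned in the abstract can be illustrated with a small sketch. It assumes a parameter vector lives in the model's hidden space and that an unembedding matrix maps hidden states to vocabulary logits. The toy vocabulary, the matrix values, and the function name `project_to_vocab` are hypothetical and are not taken from the paper.

```python
import numpy as np

def project_to_vocab(vector, unembedding, vocab, k=3):
    """Return the k tokens whose unembedding rows score highest against `vector`."""
    logits = unembedding @ vector      # one logit per vocabulary entry
    top = np.argsort(-logits)[:k]      # indices of the k largest logits
    return [vocab[i] for i in top]

# Toy vocabulary and unembedding matrix (made-up values, hidden size 2).
vocab = ["wizard", "wand", "spell", "tax", "invoice"]
unembedding = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.1, 0.9],
])

# A candidate parameter vector that points along the first hidden dimension.
candidate = np.array([1.0, 0.0])
top_tokens = project_to_vocab(candidate, unembedding, vocab)
print(top_tokens)
```

If the top-scoring tokens cluster around a single concept, the vector is a plausible candidate for encoding that concept. How the authors select and validate such vectors is not described in this summary.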
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Advanced Data Processing Techniques
Methods: Softmax · Attention Is All You Need
