Has this Fact been Edited? Detecting Knowledge Edits in Language Models

Paul Youssef; Zhixue Zhao; Christin Seifert; Jörg Schlötterer

arXiv:2405.02765·cs.CL·February 11, 2025



TL;DR

This paper introduces a new task to detect whether knowledge in language models has been edited or remains original, using features like hidden states and probability distributions, to enhance transparency and trust.

Contribution

It proposes the novel task of detecting knowledge edits in language models and demonstrates effective detection methods using simple classifiers with limited data.

Findings

01 Simple AdaBoost classifiers perform well in detection.

02 Detection remains challenging for related but unedited knowledge.

03 The method is robust across different domains.

Abstract

Knowledge editing methods (KEs) can update language models' obsolete or inaccurate knowledge learned from pre-training. However, KEs can be used for malicious applications, e.g., inserting misinformation and toxic content. Knowing whether a generated output is based on edited knowledge or first-hand knowledge from pre-training can increase users' trust in generative models and provide more transparency. Driven by this, we propose a novel task: detecting edited knowledge in language models. Given an edited model and a fact retrieved by a prompt from an edited model, the objective is to classify the knowledge as either unedited (based on the pre-training), or edited (based on subsequent editing). We instantiate the task with four KEs, two LLMs, and two datasets. Additionally, we propose using the hidden state representations and the probability distributions as features for the detection.…
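As a rough illustration of the idea of using probability distributions as detection features, the sketch below turns a next-token distribution into a small feature vector (top-k probabilities plus entropy) and trains a binary classifier to label it as edited or unedited. Everything in it is an assumption made for illustration: the function names, the feature choice, the toy data generator (which simply assumes that edited facts produce more peaked distributions), and the classifier, which is a plain logistic regression trained by gradient descent for brevity rather than the AdaBoost classifiers mentioned under Findings. It is not the authors' implementation.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distribution_features(probs, top_k=3):
    """Hypothetical feature map: top-k probabilities followed by the entropy."""
    top = sorted(probs, reverse=True)[:top_k]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return top + [entropy]

def toy_distribution(rng, edited, vocab_size=20):
    """Toy data only. ASSUMPTION: an edit boosts one token's logit."""
    logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
    if edited:
        logits[rng.randrange(vocab_size)] += 6.0
    return softmax(logits)

def train_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss (stand-in classifier)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (edited) or 0 (unedited)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0

def make_split(rng, n):
    """Build n labeled toy examples, alternating edited and unedited."""
    labels = [i % 2 for i in range(n)]
    feats = [distribution_features(toy_distribution(rng, bool(l))) for l in labels]
    return feats, labels

rng = random.Random(0)
X_train, y_train = make_split(rng, 200)
X_test, y_test = make_split(rng, 100)

w, b = train_logreg(X_train, y_train)
correct = sum(predict(w, b, x) == y for x, y in zip(X_test, y_test))
accuracy = correct / len(y_test)
print(f"toy held-out accuracy: {accuracy:.2f}")
```

In a real setting the distributions would come from the edited model's output for each prompt, and hidden state representations could be used as features in the same way; the toy generator only stands in for that step.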


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Logistic Regression