Has this Fact been Edited? Detecting Knowledge Edits in Language Models
Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer

TL;DR
This paper introduces a new task to detect whether knowledge in language models has been edited or remains original, using features like hidden states and probability distributions, to enhance transparency and trust.
Contribution
It proposes the novel task of detecting knowledge edits in language models and demonstrates effective detection methods using simple classifiers with limited data.
Findings
Simple AdaBoost classifiers perform well in detection
Detection remains challenging for related but unedited knowledge
Method is robust across different domains
Abstract
Knowledge editing methods (KEs) can update language models' obsolete or inaccurate knowledge learned from pre-training. However, KEs can be used for malicious applications, e.g., inserting misinformation and toxic content. Knowing whether a generated output is based on edited knowledge or first-hand knowledge from pre-training can increase users' trust in generative models and provide more transparency. Driven by this, we propose a novel task: detecting edited knowledge in language models. Given an edited model and a fact retrieved by a prompt from an edited model, the objective is to classify the knowledge as either unedited (based on the pre-training), or edited (based on subsequent editing). We instantiate the task with four KEs, two LLMs, and two datasets. Additionally, we propose using the hidden state representations and the probability distributions as features for the detection.…
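The detection task described above is a binary classification over features read from the edited model. The following is a minimal sketch of that idea, not the authors' implementation: it summarizes a probability distribution with two hand-picked features (maximum probability and entropy) and fits a single decision stump, the kind of weak learner that AdaBoost typically combines. The function names, the choice of features, and the toy distributions are all hypothetical, and the pattern that edited facts give more peaked distributions is assumed here purely for illustration.

```python
import math

def distribution_features(probs):
    """Summarize a probability distribution as [max probability, entropy]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return [max(probs), entropy]

def fit_stump(X, y):
    """Search for the (feature index, threshold, polarity) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in thresholds:
            for polarity in (1, -1):
                preds = [1 if polarity * (row[j] - t) > 0 else 0 for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    return best[1:]

def predict(stump, features):
    """Return 1 (edited) or 0 (unedited) for one feature vector."""
    j, t, polarity = stump
    return 1 if polarity * (features[j] - t) > 0 else 0

# Hypothetical toy data: (probability distribution, label), label 1 = edited.
# The "edited means more peaked" pattern is an assumption made for this sketch.
toy_distributions = [
    ([0.97, 0.01, 0.01, 0.01], 1),
    ([0.90, 0.05, 0.03, 0.02], 1),
    ([0.55, 0.25, 0.15, 0.05], 0),
    ([0.40, 0.30, 0.20, 0.10], 0),
]

X = [distribution_features(p) for p, _ in toy_distributions]
y = [label for _, label in toy_distributions]
stump = fit_stump(X, y)

print(predict(stump, distribution_features([0.95, 0.03, 0.01, 0.01])))
```

In a real setting the feature vectors would come from the edited model's hidden states or output probabilities, and a boosted ensemble of such stumps would replace the single one used here.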
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Logistic Regression
