How to Make LLMs Forget: On Reversing In-Context Knowledge Edits
Paul Youssef, Zhixue Zhao, Jörg Schlötterer, Christin Seifert

TL;DR
This paper presents methods to detect and reverse in-context knowledge edits (IKE-edits) in large language models without changing model parameters, improving transparency and helping to counter malicious manipulation.
Contribution
The paper shows that IKE-edits can be detected with high accuracy and introduces the novel task of reversing them, using specially tuned reversal tokens to recover the original outputs.
Findings
IKE-edits can be detected with F1 > 80% using only the top-10 output probabilities of the next token, even in a black-box setting.
Reversal tokens achieve over 80% accuracy in restoring original outputs.
Continuous reversal tokens are highly effective with minimal impact on unedited prompts.
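
To make the detection finding concrete, here is a minimal sketch of a detector that looks only at the top-10 next-token probabilities and is scored with F1. The source does not say which classifier the authors use, so the logistic regression, the helper names (top_k_features, train_logreg, predict, f1_score), and the hyperparameters are all assumptions chosen for illustration, not the paper's implementation.

```python
import math
from typing import Dict, List, Sequence


def top_k_features(next_token_probs: Dict[str, float], k: int = 10) -> List[float]:
    """Return the k largest next-token probabilities, sorted descending, zero-padded."""
    probs = sorted(next_token_probs.values(), reverse=True)[:k]
    return probs + [0.0] * (k - len(probs))


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _score(w: Sequence[float], x: Sequence[float]) -> float:
    # w holds one weight per feature plus a trailing bias term.
    return sum(wj * xj for wj, xj in zip(w, x)) + w[-1]


def train_logreg(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 500) -> List[float]:
    """Fit logistic regression by batch gradient descent (assumed classifier)."""
    dim, n = len(X[0]), len(X)
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for xi, yi in zip(X, y):
            err = _sigmoid(_score(w, xi)) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    """Label 1 means 'prompt contains an IKE-edit', 0 means 'unedited'."""
    return int(_sigmoid(_score(w, x)) >= 0.5)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive (edited) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)
```

In use, one would call top_k_features on the next-token distribution returned for each prompt, train on prompts whose edited or unedited status is known, and report f1_score on held-out prompts. Because the features are only the returned probabilities, the same sketch applies to a black-box API that exposes a limited top-k output.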
Abstract
In-context knowledge editing (IKE) enables efficient modification of large language model (LLM) outputs without parameter changes and at zero cost. However, it can be misused to manipulate responses opaquely, e.g., insert misinformation or offensive content. Such malicious interventions could be incorporated into high-level wrapped APIs where the final input prompt is not shown to end-users. To address this issue, we investigate the detection and reversal of IKE-edits. First, we demonstrate that IKE-edits can be detected with high accuracy (F1 > 80%) using only the top-10 output probabilities of the next token, even in a black-box setting, e.g., proprietary LLMs with limited output information. Further, we introduce the novel task of reversing IKE-edits using specially tuned reversal tokens. We explore using both continuous and discrete reversal tokens, achieving over 80% accuracy in…
Taxonomy
Topics: Biomedical Text Mining and Ontologies
Methods: Softmax · Attention Is All You Need
