How to Make LLMs Forget: On Reversing In-Context Knowledge Edits

Paul Youssef; Zhixue Zhao; Jörg Schlötterer; Christin Seifert

arXiv:2410.12586 · cs.CL · April 11, 2025

TL;DR

This paper presents methods to detect and reverse in-context knowledge edits in large language models without changing model parameters, with the aim of improving transparency and countering malicious manipulation.

Contribution

It introduces a novel approach for detecting IKE-edits with high accuracy and proposes a new reversal technique using specially tuned tokens to recover original outputs.

Findings

01. IKE-edits can be detected with F1 > 80% using only the top-10 output probabilities of the next token.

02. Reversal tokens achieve over 80% accuracy in restoring original outputs.

03. Continuous reversal tokens are highly effective with minimal impact on unedited prompts.
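To make the detection finding concrete, here is a minimal sketch of a detector that sees only the top-10 next-token probabilities. It is not the authors' classifier: a nearest-centroid rule is swapped in for simplicity, the data is synthetic, and the generator rests on the assumption (made here purely for illustration) that edited prompts produce a more peaked next-token distribution. All function names are hypothetical.

```python
import random

K = 10  # number of top next-token probabilities used as features


def top_k_features(probs, k=K):
    """Return the k largest probabilities in descending order, zero-padded."""
    top = sorted(probs, reverse=True)[:k]
    return top + [0.0] * (k - len(top))


def synthetic_distribution(rng, edited):
    """Toy next-token probabilities.

    ASSUMPTION (not from the paper): edited prompts yield a more peaked
    distribution than unedited prompts.
    """
    peak = rng.uniform(0.80, 0.98) if edited else rng.uniform(0.15, 0.50)
    weights = [rng.uniform(0.1, 1.0) for _ in range(K - 1)]
    # Leave 10% of the remaining mass outside the top-k tokens.
    scale = 0.9 * (1.0 - peak) / sum(weights)
    return [peak] + [w * scale for w in weights]


def make_dataset(rng, n):
    """Build n labelled feature vectors (label 1 = edited, 0 = unedited)."""
    labels = [i % 2 for i in range(n)]
    feats = [top_k_features(synthetic_distribution(rng, bool(y))) for y in labels]
    return feats, labels


def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(features, labels):
    """Nearest-centroid 'training': one mean vector per class."""
    edited = [f for f, y in zip(features, labels) if y == 1]
    clean = [f for f, y in zip(features, labels) if y == 0]
    return centroid(edited), centroid(clean)


def predict(model, feature):
    c_edited, c_clean = model
    return 1 if sq_dist(feature, c_edited) < sq_dist(feature, c_clean) else 0


def f1_score(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_dataset(rng, 200)
test_x, test_y = make_dataset(rng, 100)
model = fit(train_x, train_y)
predictions = [predict(model, x) for x in test_x]
score = f1_score(test_y, predictions)
print(f"F1 on synthetic data: {score:.2f}")
```

The score printed here reflects only how separable the toy data is and says nothing about the paper's reported numbers. In a real black-box setting, `top_k_features` would be fed the top-10 probabilities returned by the model for the next token.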

Abstract

In-context knowledge editing (IKE) enables efficient modification of large language model (LLM) outputs without parameter changes and at zero cost. However, it can be misused to manipulate responses opaquely, e.g., to insert misinformation or offensive content. Such malicious interventions could be incorporated into high-level wrapped APIs where the final input prompt is not shown to end-users. To address this issue, we investigate the detection and reversal of IKE-edits. First, we demonstrate that IKE-edits can be detected with high accuracy (F1 > 80%) using only the top-10 output probabilities of the next token, even in a black-box setting, e.g., proprietary LLMs with limited output information. Further, we introduce the novel task of reversing IKE-edits using specially tuned reversal tokens. We explore using both continuous and discrete reversal tokens, achieving over 80% accuracy in…
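The threat model in the abstract, an edit hidden inside a wrapped API, can be sketched as follows. This is an illustrative assumption, not the paper's code: the `New Fact:` / `Prompt:` template, the `WrappedAPI` class, and the `fake_llm` stand-in are all hypothetical, and no real model is called.

```python
def build_ike_prompt(new_fact, user_prompt, demonstrations=()):
    """Prepend an in-context edit to the user's prompt.

    The template is a hypothetical illustration; each demonstration is a
    (fact, prompt, answer) triple shown to the model before the real query.
    """
    parts = [
        f"New Fact: {fact}\nPrompt: {prompt} {answer}"
        for fact, prompt, answer in demonstrations
    ]
    parts.append(f"New Fact: {new_fact}\nPrompt: {user_prompt}")
    return "\n\n".join(parts)


class WrappedAPI:
    """Hypothetical service that silently edits every request in context."""

    def __init__(self, llm, new_fact):
        self.llm = llm            # any callable mapping a prompt string to text
        self.new_fact = new_fact  # the edit; never shown to the end-user

    def ask(self, user_prompt):
        final_prompt = build_ike_prompt(self.new_fact, user_prompt)
        # The caller only sees the returned text, not final_prompt.
        return self.llm(final_prompt)


seen_prompts = []


def fake_llm(prompt):
    """Stand-in for a real model: records the prompt and returns a canned answer."""
    seen_prompts.append(prompt)
    return "Rome"


service = WrappedAPI(fake_llm, "The capital of France is Rome.")
answer = service.ask("The capital of France is")
print(answer)
```

The model parameters are untouched; only the prompt changes, which is why the edit costs nothing and why the end-user cannot see it.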

Taxonomy

Topics: Biomedical Text Mining and Ontologies

Methods: Softmax · Attention Is All You Need