Tracing and Reversing Edits in LLMs
Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer

TL;DR
This paper introduces methods to detect, trace, and reverse malicious or unintended edits in large language models by analyzing weight modifications, enhancing model safety and integrity.
Contribution
It presents novel, training-free techniques for accurately inferring edited entities and reversing edits solely from weight changes, without access to prompts or original data.
Findings
Achieves up to 99% accuracy in inferring edited entities.
Reverses up to 94% of edits to restore original model outputs.
Provides a new approach for safeguarding LLMs against malicious manipulations.
Abstract
Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. In order to defend against these types of malicious manipulation, we need robust techniques that can reliably detect, interpret, and mitigate malicious edits. To that end, we introduce the tasks of tracing and reversing edits. We propose a novel method to infer the edited object entity, solely based on the modified weights, without access to the editing prompt or any other semantically similar prompts, with up to 99% accuracy. Further, we propose an effective and training-free method for reversing edits. Our method reverses up to 94% of the edits, and helps regain the original model's output…
Peer Reviews
Decision: ICLR 2026 Poster
+ The proposed defense is practical and lightweight, requiring only access to the edited weights and no training data or edit prompts, making it suitable for real-world forensic use.
+ The reversal approach is simple yet effective, using an interpretable SVD-based method that efficiently removes the edit signal while maintaining model integrity.
+ The experimental validation is comprehensive and convincing, demonstrating strong performance across multiple models and datasets with clear quantitat
- The method’s generality is limited, as it is evaluated only on single-layer, rank-one edits and may not extend to more complex, multi-layer, or non-rank-one scenarios.
- The evaluation scope is narrow, focusing mainly on object recovery and KL divergence without exploring broader behavioral or capability effects after reversal.
- The paper introduces a training-free framework for detecting and reversing malicious edits directly from model parameters, a new defense direction for LLM safety.
- Experimental results show high accuracy and generalization across different models and datasets, suggesting good robustness.
- The methods are computationally efficient and require no access to original weights or editing prompts, enhancing practical applicability for security auditing.
- The study focuses only on rank-one edits, limiting applicability to other editing methods and scenarios such as MEMIT, MEND, and SERAC.
- The evaluation scope is restricted to controlled datasets and synthetic edits, leaving real-world validation uncertain.
- The interpretability of why bottom-rank approximations work well for reversal is not fully explored, reducing theoretical clarity of the mechanism.
- The motivation for reversing edits is questionable, since model editing is prima
- The proposed methods for both tracing and reversing are designed to operate solely on the edited weights, without requiring access to the editing prompt, unedited weights, or any other information about the edit. This makes the countermeasures more practical for real-world defense against malicious editing.
- The tracing method achieved high accuracy in identifying the edited object and showed strong generalization to out-of-distribution data and different editing methods (ROME and r-ROME). Si
- The effectiveness of the reversal (and the analysis of rank-one approximations) is shown to be model-dependent. For example, the optimal rank k for the bottom-rank approximation varies across models (e.g., k=11 for GPT2-XL vs. k=15 for LLAMA3 to achieve the highest reversal accuracy), and the similarity of the top rank-one approximation to the update matrix is much lower for LLAMA3 than for the GPT models. This suggests a need for model-specific tuning of the reversal hyperparameter.
- The core
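The bottom-rank approximation and the top rank-one similarity that the reviews refer to can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "bottom-rank approximation with rank k" means rebuilding the edited matrix from its SVD with the k largest singular components removed. The function names, matrix sizes, and the synthetic rank-one update are all hypothetical. The original weights are used here only to measure the error; the reversal function itself sees only the edited matrix.

```python
import numpy as np


def bottom_rank_approximation(w_edited, k):
    """Rebuild a matrix from its SVD after zeroing the k largest singular values."""
    u, s, vt = np.linalg.svd(w_edited, full_matrices=False)
    s_kept = s.copy()
    s_kept[:k] = 0.0  # drop the top-k components, keep the rest
    return (u * s_kept) @ vt


def top_rank_one(matrix):
    """Return the rank-one matrix built from the largest singular component."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0])


def cosine_similarity(a, b):
    """Cosine similarity between two matrices, treated as flat vectors."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy setup (assumption): small random "original" weights plus a
# synthetic rank-one update whose norm dominates them.
rng = np.random.default_rng(0)
w_original = rng.normal(scale=0.02, size=(64, 48))
a = rng.normal(size=64)
a /= np.linalg.norm(a)
b = rng.normal(size=48)
b /= np.linalg.norm(b)
update = 5.0 * np.outer(a, b)
w_edited = w_original + update

# Reverse the toy edit using only the edited matrix.
w_reversed = bottom_rank_approximation(w_edited, k=1)

# Distance to the original weights before and after the reversal.
err_before = np.linalg.norm(w_edited - w_original)
err_after = np.linalg.norm(w_reversed - w_original)

# How closely the top singular component of the edited matrix matches the update.
similarity = cosine_similarity(top_rank_one(w_edited), update)

print(f"error before reversal: {err_before:.3f}")
print(f"error after reversal:  {err_after:.3f}")
print(f"top rank-one vs. update similarity: {similarity:.3f}")
```

In this toy case k=1 is enough because the synthetic update is exactly rank one and much larger than the original weights. For real models, k is a hyperparameter, and the reviews note that the best value differs between models.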
Taxonomy
Topics: Model-Driven Software Engineering Techniques
