Can Editing LLMs Inject Harm?
Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, Kai Shu

TL;DR
This paper introduces the concept of Editing Attacks on LLMs, revealing their potential to stealthily inject misinformation and bias, thereby threatening safety alignment and highlighting new misuse risks.
Contribution
It formulates Editing Attack as a novel safety threat, constructs the EditAttack dataset, and systematically investigates the effectiveness and stealthiness of misinformation and bias injection into LLMs.
Findings
Editing attacks can inject both commonsense and long-tail misinformation.
Biased sentences can be injected with high effectiveness, degrading fairness.
Editing attacks are highly stealthy, posing new safety risks.
Abstract
Large Language Models (LLMs) have emerged as a new information channel. Meanwhile, one critical but under-explored question is: Is it possible to bypass the safety alignment and inject harmful information into LLMs stealthily? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. Specifically, we focus on two typical safety risks of Editing Attack including Misinformation Injection and Bias Injection. For the first risk, we find that editing attacks can inject both commonsense and long-tail misinformation into LLMs, and the effectiveness for the former one is particularly high. For the second risk, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also one single biased sentence injection…
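The abstract speaks of the "effectiveness" of an injection without this page stating how it is scored. As a minimal sketch only, assuming effectiveness is measured as the percentage of probe questions for which the edited model's answer contains the attacker's target answer, the idea might look like the following. The names `EditRequest`, `efficacy_score`, and `make_stub_model` are hypothetical, the "models" are plain lookup stubs rather than real LLMs, and the placeholder questions are deliberately fictional.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EditRequest:
    """Hypothetical record: a probe question and the answer the attacker wants."""
    question: str
    target_answer: str


def efficacy_score(model: Callable[[str], str], edits: List[EditRequest]) -> float:
    """Percentage of probe questions whose answer contains the target answer.

    Assumption: substring matching stands in for whatever answer-matching
    procedure the paper actually uses.
    """
    if not edits:
        return 0.0
    hits = sum(
        1 for e in edits
        if e.target_answer.lower() in model(e.question).lower()
    )
    return 100.0 * hits / len(edits)


def make_stub_model(memory: Dict[str, str]) -> Callable[[str], str]:
    """Stand-in for an LLM: answers from a lookup table, else a fixed fallback."""
    def model(question: str) -> str:
        return memory.get(question, "I don't know.")
    return model


# Fictional placeholder probes, not taken from the EditAttack dataset.
edits = [
    EditRequest("What is the capital of Examplestan?", "Zed City"),
    EditRequest("Who founded the Placeholder Guild?", "A. Nobody"),
]

# "Before" model knows nothing; "after" model has absorbed one of the two edits.
model_before = make_stub_model({})
model_after = make_stub_model(
    {"What is the capital of Examplestan?": "The capital is Zed City."}
)

score_before = efficacy_score(model_before, edits)
score_after = efficacy_score(model_after, edits)
```

Comparing the score before and after editing isolates the change caused by the edit itself; a real evaluation would replace the stubs with calls to the pre-edit and post-edit LLM.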
Peer Reviews
Decision: Submitted to ICLR 2025
Important problem that may become even more salient as we increasingly rely on LLMs. New dataset seems valuable to future research in this area. While not too surprising (see below), the conclusions remain important.
My main concerns are with the writing and positioning of the paper. In particular: The paper claims to "reformulate knowledge editing as a new type of threats for LLMs" (line 117-119, again 468-469). But this does not seem to be a new idea. I'm not an expert in knowledge editing, so please correct me if I'm mistaken, but my cursory search found a survey https://arxiv.org/pdf/2310.16218 that highlighted "if KME is maliciously applied to inject harmful knowledge into language models, the edited m
1. The authors conduct a thorough investigation into the effectiveness of editing attacks on misinformation and bias injection, providing a comprehensive analysis of the risks involved.
2. The construction of the EDITATTACK dataset contributes to the field by offering a new resource for benchmarking LLMs against editing attacks, which can facilitate future research and development of defense mechanisms.
1. Although framing knowledge editing as a potential threat is helpful, the technical contribution is somewhat limited, and the results are not entirely surprising.
2. The paper’s experiments focus on a few smaller LLMs (e.g., Llama3-8b, Mistral-v0.2-7b), limiting the findings' applicability to larger, state-of-the-art models that may respond differently to editing attacks. This narrow scope weakens the generalizability and robustness of the conclusions. For instance, ICE exper
This is new work that alerts researchers to the risk of Misinformation Injection and Bias Injection into LLMs through knowledge editing. The authors provide implementation code. The paper is well-written with clear takeaways.
1. The motivation and practicality of editing attacks need further justification. The authors present three types of attack injection methods: ROME, Fine-Tuning, and In-Context Editing. However, these attacks are based on open-source LLMs, and users typically do not directly use personally trained LLMs. The models that are widely used and would have a significant social impact are usually black boxes, making the attacks proposed by the authors infeasible.
2. All the attacks are conducted on the original LLMs, such as
