Position: Editing Large Language Models Poses Serious Safety Risks
Paul Youssef, Zhixue Zhao, Daniel Braun, Jörg Schlötterer, Christin Seifert

TL;DR
This paper highlights the overlooked safety risks of knowledge editing in large language models, emphasizing potential malicious uses, ecosystem vulnerabilities, and the need for countermeasures.
Contribution
It identifies safety concerns related to knowledge editing in LLMs, discusses malicious use cases, ecosystem vulnerabilities, and calls for research on tamper-resistant models and security.
Findings
Knowledge editing tools are accessible and efficient, enabling malicious use.
Malicious actors can easily adapt KE techniques for harmful purposes.
The current AI ecosystem lacks safeguards for model updates and verification.
Abstract
Large Language Models (LLMs) contain large amounts of facts about the world. These facts can become outdated over time, which has led to the development of knowledge editing methods (KEs) that can change specific facts in LLMs with limited side effects. This position paper argues that editing LLMs poses serious safety risks that have been largely overlooked. First, we note that KEs are widely available, computationally inexpensive, highly performant, and stealthy, which makes them an attractive tool for malicious actors. Second, we discuss malicious use cases of KEs, showing how KEs can be easily adapted for a variety of malicious purposes. Third, we highlight vulnerabilities in the AI ecosystem that allow unrestricted uploading and downloading of updated models without verification. Fourth, we argue that a lack of social and institutional awareness exacerbates this risk, and discuss…
Taxonomy
Topics: Software Reliability and Analysis Research · Topic Modeling · Natural Language Processing Techniques
