Identifying Knowledge Editing Types in Large Language Models
Xiaopeng Li, Shasha Li, Shangwen Wang, Shezheng Song, Bin Ji, Huijun Liu, Jun Ma, Jie Yu

TL;DR
This paper introduces KETI, a new task for identifying different types of knowledge edits in large language models, aiming to detect malicious modifications and prevent harmful content generation.
Contribution
It proposes KETIBench with five harmful and one benign edit types, and develops baseline models demonstrating effective identification of malicious LLM edits.
Findings
Baseline models achieve decent performance in identifying malicious edits.
Identification performance is independent of editing method reliability.
Models generalize across domains and unknown sources.
Abstract
Knowledge editing has emerged as an efficient technique for updating the knowledge of large language models (LLMs), attracting increasing attention in recent years. However, there is a lack of effective measures to prevent the malicious misuse of this technique, which could lead to harmful edits in LLMs. These malicious modifications could cause LLMs to generate toxic content, misleading users into inappropriate actions. To address this risk, we introduce a new task, Knowledge Editing Type Identification (KETI), aimed at identifying different types of edits in LLMs, thereby providing timely alerts to users when encountering illicit edits. As part of this task, we propose KETIBench, which includes five types of harmful edits covering the most popular toxic types, as well as one benign factual edit. We develop five classical classification…
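The task described above amounts to six-way classification: five harmful edit types plus one benign factual edit. The following is a minimal sketch of that framing only, not the paper's baselines. The label names, the synthetic feature vectors, and the nearest-centroid classifier are all assumptions made for illustration; this listing does not state which features or models the authors use.

```python
import math
import random

# Hypothetical label names: the listing says there are five harmful types and
# one benign type but does not name them, so these are placeholders.
LABELS = [f"harmful_{i}" for i in range(1, 6)] + ["benign_fact"]


def make_synthetic(n_per_class, dim, seed=0):
    """Build toy (feature_vector, label) pairs; stands in for real edit features."""
    rng = random.Random(seed)
    data = []
    for k, label in enumerate(LABELS):
        for _ in range(n_per_class):
            # Each class is a Gaussian cluster shifted along its own coordinate.
            vec = [rng.gauss(3.0 if j == k else 0.0, 0.5) for j in range(dim)]
            data.append((vec, label))
    return data


def fit_centroids(data):
    """Average the feature vectors of each label."""
    sums, counts = {}, {}
    for vec, label in data:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


def is_harmful(label):
    """An alert would be raised for every predicted type except the benign one."""
    return label != "benign_fact"


if __name__ == "__main__":
    centroids = fit_centroids(make_synthetic(20, 8, seed=0))
    held_out = make_synthetic(10, 8, seed=1)
    correct = sum(predict(centroids, vec) == label for vec, label in held_out)
    print(f"held-out accuracy on toy data: {correct / len(held_out):.2f}")
```

In a real setting the toy vectors would be replaced by whatever signal is available about the edited model, and the harmful/benign split in `is_harmful` is what would drive the user-facing alert.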
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
Methods: Softmax · Attention Is All You Need
