Precision Knowledge Editing: Enhancing Safety in Large Language Models
Xuying Li, Zhuo Li, Yuji Kosuga, Yasuhiro Yoshida, Victor Bian

TL;DR
This paper introduces Precision Knowledge Editing (PKE), a method that improves the safety of large language models by identifying and modifying the parameter regions responsible for toxic content at a finer granularity than earlier techniques such as DINM.
Contribution
The paper presents PKE, a new knowledge editing approach that enhances toxicity mitigation in LLMs through neuron weight tracking and activation pathway tracing.
Findings
PKE significantly reduces attack success rate across multiple models.
PKE maintains overall model performance after editing.
Models edited with PKE outperform some closed-source models in safety.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, but they also pose risks related to the generation of toxic or harmful content. This work introduces Precision Knowledge Editing (PKE), an advanced technique that builds upon existing knowledge editing methods to more effectively identify and modify toxic parameter regions within LLMs. By leveraging neuron weight tracking and activation pathway tracing, PKE achieves finer granularity in toxic content management than previous methods such as Detoxifying Instance Neuron Modification (DINM). Our experiments demonstrate that PKE significantly reduces the attack success rate (ASR) across various models, including Llama2-7b and Llama-3-8b-instruct, while maintaining overall model performance. We also compared the performance of some closed-source models (gpt-4-0613 and Claude 3 Sonnet) in our…
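For readers unfamiliar with the metric, attack success rate (ASR) is commonly computed as the fraction of adversarial prompts whose responses are judged harmful. The sketch below illustrates that common definition only; it is not the paper's evaluation code. The function `attack_success_rate`, the refusal-prefix `toy_judge`, and the sample responses are all hypothetical placeholders, and the numbers they produce are not results from the paper.

```python
from typing import Callable, Sequence


def attack_success_rate(
    responses: Sequence[str],
    is_harmful: Callable[[str], bool],
) -> float:
    """Return the fraction of responses that the judge flags as harmful."""
    if not responses:
        raise ValueError("need at least one response")
    flagged = sum(1 for r in responses if is_harmful(r))
    return flagged / len(responses)


def toy_judge(response: str) -> bool:
    # Hypothetical stand-in for a real safety classifier (assumption):
    # any response that does not open with a refusal counts as harmful.
    return not response.lower().startswith("i can't")


# Made-up responses to the same four adversarial prompts,
# before and after a hypothetical safety edit.
before_edit = [
    "Sure, here is how...",
    "I can't help with that.",
    "Step 1: ...",
    "Here you go: ...",
]
after_edit = [
    "I can't help with that.",
    "I can't assist with this request.",
    "I can't help with that.",
    "Here you go: ...",
]

asr_before = attack_success_rate(before_edit, toy_judge)
asr_after = attack_success_rate(after_edit, toy_judge)
print(f"ASR before: {asr_before:.2f}, after: {asr_after:.2f}")
```

A lower ASR after editing means fewer adversarial prompts succeed. In practice the judge would be a trained classifier or human annotation rather than a prefix check, and the choice of judge strongly affects the reported number.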
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
