Precision Knowledge Editing: Enhancing Safety in Large Language Models

Xuying Li; Zhuo Li; Yuji Kosuga; Yasuhiro Yoshida; Victor Bian

arXiv:2410.03772 · cs.CL · October 14, 2024

TL;DR

This paper introduces Precision Knowledge Editing (PKE), a method that improves the safety of large language models by identifying and modifying toxic parameter regions at a finer granularity than earlier knowledge editing methods such as Detoxifying Instance Neuron Modification (DINM).

Contribution

The paper presents PKE, a new knowledge editing approach that enhances toxicity mitigation in LLMs through neuron weight tracking and activation pathway tracing.

Findings

01

PKE significantly reduces attack success rate (ASR) across multiple models, including Llama2-7b and Llama-3-8b-instruct.

02

PKE maintains overall model performance after editing.

03

Models edited with PKE outperform some closed-source models in safety.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, but they also pose risks related to the generation of toxic or harmful content. This work introduces Precision Knowledge Editing (PKE), a technique that builds upon existing knowledge editing methods to identify and modify toxic parameter regions within LLMs more effectively. By leveraging neuron weight tracking and activation pathway tracing, PKE achieves finer granularity in toxic content management than previous methods such as Detoxifying Instance Neuron Modification (DINM). Our experiments demonstrate that PKE significantly reduces the attack success rate (ASR) across various models, including Llama2-7b and Llama-3-8b-instruct, while maintaining overall model performance. We also compared the performance of some closed-source models (gpt-4-0613 and Claude 3 Sonnet) in our…
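The abstract reports results as attack success rate (ASR). As a rough illustration only, ASR is commonly computed as the fraction of adversarial prompts whose response is judged harmful. The sketch below assumes that definition; the function name `attack_success_rate` and the judgement lists are hypothetical and are not taken from the paper, whose exact judging procedure is not given in this excerpt.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the fraction of responses that were judged harmful.

    Each entry is True when the attack on that prompt succeeded.
    This is an assumed, generic definition of ASR, not the paper's code.
    """
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


# Hypothetical judgements for the same 8 adversarial prompts,
# before and after a safety edit. Illustrative data, not results.
before_edit = [True, True, False, True, True, False, True, True]
after_edit = [False, False, False, True, False, False, False, False]

asr_before = attack_success_rate(before_edit)
asr_after = attack_success_rate(after_edit)
print(f"ASR before: {asr_before:.1%}, after: {asr_after:.1%}")
```

A lower ASR after editing indicates that fewer adversarial prompts elicited harmful output. It says nothing on its own about whether general model performance was preserved, which has to be measured separately.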

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling