Towards Safer Large Language Models through Machine Unlearning

Zheyuan Liu; Guangyao Dou; Zhaoxuan Tan; Yijun Tian; Meng Jiang

arXiv:2402.10058 · cs.CL · June 6, 2024 · 1 citation

TL;DR

This paper introduces SKU, a novel framework for selectively removing harmful knowledge from large language models to enhance safety without sacrificing their normal utility.

Contribution

SKU is a new two-stage unlearning method that effectively eliminates harmful knowledge while maintaining model performance on regular prompts.

Findings

01. SKU balances harmful knowledge removal and utility preservation.

02. Experiments show SKU effectively reduces harmful outputs across various LLMs.

03. SKU maintains high performance on normal prompts while removing harmful content.

Abstract

The rapid advancement of Large Language Models (LLMs) has demonstrated their vast potential across various domains, attributed to their extensive pretraining knowledge and exceptional generalizability. However, LLMs often generate harmful content when faced with problematic prompts. To address this problem, existing work has used gradient-ascent-based approaches to prevent LLMs from producing harmful output. While these methods can be effective, they frequently degrade the model's utility in responding to normal prompts. To address this gap, we introduce Selective Knowledge negation Unlearning (SKU), a novel unlearning framework for LLMs, designed to eliminate harmful knowledge while preserving utility on normal prompts. Specifically, SKU consists of two stages: a harmful knowledge acquisition stage and a knowledge negation stage. The first stage…
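
The abstract is cut off before it describes either stage, so the following is only a minimal sketch of one plausible reading of "knowledge negation": weight arithmetic that subtracts, from the pretrained parameters, the shift a model picked up while learning harmful behavior. The function name `negate_knowledge`, the `scale` parameter, and the toy parameter dictionaries are all hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List

# Toy stand-in for model weights: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def negate_knowledge(pretrained: Params, harmful: Params, scale: float = 1.0) -> Params:
    """Return pretrained - scale * (harmful - pretrained), parameter by parameter.

    ASSUMPTION: `harmful` holds weights after a harmful-knowledge acquisition
    step; the subtraction rule is an illustrative guess, not the paper's
    confirmed procedure.
    """
    result: Params = {}
    for name, base in pretrained.items():
        tuned = harmful[name]
        # (t - b) is the shift attributed to harmful knowledge; move the other way.
        result[name] = [b - scale * (t - b) for b, t in zip(base, tuned)]
    return result


# Hypothetical toy weights, chosen so the arithmetic is easy to check by hand.
pretrained = {"w": [1.0, 2.0], "b": [0.5]}
harmful = {"w": [1.5, 1.0], "b": [0.5]}

unlearned = negate_knowledge(pretrained, harmful)
print(unlearned)
```

Parameters that the harmful stage left unchanged (here `b`) are untouched by the negation, which is one intuition for how such a scheme could preserve utility on normal prompts; whether SKU works this way is not stated in the visible text.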

Code & Models

Repositories

franciscoliu/sku (PyTorch, official)

Videos

Towards Safer Large Language Models through Machine Unlearning · underline

Taxonomy

Topics: Natural Language Processing Techniques