Exclusive Unlearning
Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma

TL;DR
This paper introduces Exclusive Unlearning, a method that removes harmful content from large language models broadly by forgetting everything except the knowledge and expressions to be retained, while preserving the ability to respond accurately in specific domains.
Contribution
It proposes a novel unlearning approach that, instead of listing harmful targets one by one, forgets everything except the knowledge and expressions to be retained, improving safety without sacrificing domain-specific capabilities.
Findings
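Models trained with Exclusive Unlearning remain safe against a wide range of inputs, including jailbreaks.
The method maintains the ability to respond to diverse instructions in specific domains such as medicine and mathematics.
Because forgetting targets do not have to be listed individually, the approach sidesteps the difficulty of enumerating diverse harmful content.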
Abstract
When introducing Large Language Models (LLMs) into industrial applications, such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, diverse harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except for the knowledge and expressions we wish to retain. We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to specific domains such as medicine and mathematics.
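To make the "forget everything except what we wish to retain" idea concrete, here is a minimal toy sketch. The abstract does not specify the training objective, so everything below is an assumption for illustration rather than the authors' formulation: the retain term is an ordinary negative log-likelihood on in-domain targets, the forget term pushes predictions on all other inputs toward a uniform distribution, and the names `retain_loss`, `forget_loss`, `exclusive_objective`, and `forget_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def retain_loss(logits, target_index):
    """Negative log-likelihood of the desired token; lower means better retained."""
    return -math.log(softmax(logits)[target_index])


def forget_loss(logits):
    """KL(uniform || model); zero exactly when the prediction is uniform.

    Assumption: 'forgetting' is modeled as carrying no information about the
    next token. The paper may define it differently.
    """
    probs = softmax(logits)
    u = 1.0 / len(probs)
    return sum(u * math.log(u / p) for p in probs)


def exclusive_objective(retain_batch, other_batch, forget_weight=1.0):
    """Combine both terms.

    retain_batch: list of (logits, target_index) from the domain to keep.
    other_batch:  list of logits from any input outside that domain.
    """
    r = sum(retain_loss(l, t) for l, t in retain_batch) / len(retain_batch)
    f = sum(forget_loss(l) for l in other_batch) / len(other_batch)
    return r + forget_weight * f


# Toy usage: a confident in-domain prediction and a still-confident
# out-of-domain prediction that the forget term penalizes.
loss = exclusive_objective(
    retain_batch=[([4.0, 0.0, 0.0], 0)],
    other_batch=[[3.0, 0.0, 0.0]],
)
```

The point of the sketch is only that the forget term needs no list of harmful targets: it applies to any input that is not in the retain set, which is what distinguishes this framing from unlearning a specified forget set.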
