Offset Unlearning for Large Language Models
James Y. Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, Muhao Chen

TL;DR
This paper introduces δ-Unlearning, a novel method for unlearning sensitive data in black-box large language models by learning logit offsets through smaller models, addressing ethical concerns while preserving performance.
Contribution
The paper presents δ-Unlearning, a versatile offset unlearning framework applicable to black-box LLMs that does not require internal model access or data retention.
Findings
Effectively unlearns target data from black-box LLMs.
Maintains or improves performance on general tasks.
Compatible with various unlearning algorithms.
Abstract
Despite the strong capabilities of Large Language Models (LLMs) to acquire knowledge from their training corpora, the memorization of sensitive information in the corpora such as copyrighted, biased, and private content has led to ethical and legal concerns. In response to these challenges, unlearning has emerged as a potential remedy for LLMs affected by problematic training data. However, previous unlearning techniques are either not applicable to black-box LLMs due to required access to model internal weights, or violate data protection principles by retaining sensitive data for inference-time correction. We propose δ-Unlearning, an offset unlearning framework for black-box LLMs. Instead of tuning the black-box LLM itself, δ-Unlearning learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller models. Experiments demonstrate that…
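The logit-offset idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes one plausible reading: the offset is the elementwise difference between the logits of a small model tuned for unlearning and an untouched copy of that small model, and this offset is added to the black-box model's logits at inference time. The function name `offset_logits`, the three-token vocabulary, and all numbers are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def offset_logits(blackbox_logits, tuned_small_logits, frozen_small_logits):
    """Add the small-model logit difference (the assumed offset) to the
    black-box logits. Only the black-box outputs are used, not its weights."""
    return [
        b + (t - f)
        for b, t, f in zip(blackbox_logits, tuned_small_logits, frozen_small_logits)
    ]


# Hypothetical next-token logits over a 3-token vocabulary.
blackbox = [2.0, 1.0, 0.0]       # black-box LLM favors token 0
frozen_small = [1.0, 0.5, 0.0]   # small model before unlearning
tuned_small = [-1.0, 0.5, 0.0]   # small model tuned to suppress token 0

combined = offset_logits(blackbox, tuned_small, frozen_small)
probs_before = softmax(blackbox)
probs_after = softmax(combined)

print("combined logits:", combined)
print("P(token 0) before:", round(probs_before[0], 3))
print("P(token 0) after: ", round(probs_after[0], 3))
```

In this sketch the black-box model is only queried for logits and never modified, which matches the constraint stated in the abstract. How the small model is tuned is left out entirely, since the summary does not specify it.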
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
