Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
Ruiqi Zhang, Licong Lin, Yu Bai, Song Mei

TL;DR
This paper introduces Negative Preference Optimization (NPO), a method for unlearning undesirable data from large language models. NPO mitigates catastrophic collapse and outperforms approaches based on gradient ascent.
Contribution
NPO is a simple, alignment-inspired approach that effectively unlearns data while preserving model utility, with theoretical and empirical advantages over existing methods.
Findings
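NPO-based methods achieve reasonable unlearning results when forgetting 50% or more of the training data, whereas existing methods already struggle when forgetting 10%.
NPO-based methods generate more sensible outputs than gradient ascent-based methods.
Theoretical analysis shows that progression toward catastrophic collapse under the NPO loss is exponentially slower than under gradient ascent.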
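The slower progression toward collapse can be illustrated with a small sketch. It assumes the commonly cited form of the NPO gradient, in which each forget sample's gradient ascent update is scaled by a weight W = 2 * pi_theta^beta / (pi_theta^beta + pi_ref^beta). This summary does not state that formula, so the formula, the function name `npo_weight`, and the default `beta` are assumptions made for illustration.

```python
import math

def npo_weight(logp_theta: float, logp_ref: float, beta: float = 0.1) -> float:
    """Per-sample weight on the gradient-ascent update (assumed NPO form).

    logp_theta: log-probability of the forget sample under the current model.
    logp_ref:   log-probability of the same sample under the frozen reference model.
    W = 2 * sigmoid(beta * (logp_theta - logp_ref)), computed stably.
    """
    z = beta * (logp_theta - logp_ref)
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    return 2.0 * s

# Plain gradient ascent corresponds to a constant weight of 1 on every sample.
# Under the assumed NPO form, the weight shrinks toward 0 once a sample's
# likelihood has fallen far below the reference, so already-forgotten samples
# stop pushing the model further.
for drop in (0.0, 10.0, 50.0, 100.0):
    print(drop, npo_weight(-10.0 - drop, -10.0))
```

At initialization the current model equals the reference, so the weight is 1 and the update matches gradient ascent. The weight decays as the sample is unlearned, which is one way to read the "slower progression" finding.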
Abstract
Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from the pre-trained model while preserving the model's utilities on other tasks. Several practical methods have recently been proposed for LLM unlearning, mostly based on gradient ascent (GA) on the loss of undesirable data. However, on certain unlearning tasks, these methods either fail to effectively unlearn the target data or suffer from catastrophic collapse -- a drastic degradation of the model's utilities. In this paper, we propose Negative Preference Optimization (NPO), a simple alignment-inspired method that could efficiently and effectively unlearn a target dataset. We theoretically show that the progression toward catastrophic collapse by minimizing the NPO loss is exponentially slower than GA. Through…
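As a rough illustration of the contrast drawn in the abstract, the sketch below compares a gradient ascent objective with an NPO-style objective on sequence log-probabilities. The NPO form used here, (2/beta) * mean(log(1 + (pi_theta/pi_ref)^beta)), is the commonly cited one and is not stated in this summary. Treat it, along with the function names `ga_loss` and `npo_loss` and the default `beta`, as assumptions rather than the authors' implementation.

```python
import math

def ga_loss(logp_theta):
    """Gradient ascent objective: minimize the mean log-likelihood of the
    forget samples. It has no lower bound, so it can be driven down forever."""
    return sum(logp_theta) / len(logp_theta)

def npo_loss(logp_theta, logp_ref, beta=0.1):
    """NPO-style objective (assumed form) over forget samples.

    logp_theta[i]: log-probability of sample i under the model being unlearned.
    logp_ref[i]:   log-probability of sample i under the frozen reference model.
    """
    total = 0.0
    for lt, lr in zip(logp_theta, logp_ref):
        z = beta * (lt - lr)  # beta times the log-likelihood ratio
        # Numerically stable softplus: log(1 + exp(z))
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return (2.0 / beta) * total / len(logp_theta)

# Hypothetical log-probabilities for two forget samples.
ref = [-12.0, -15.0]
for step_drop in (0.0, 20.0, 200.0):
    cur = [lp - step_drop for lp in ref]
    print(step_drop, ga_loss(cur), npo_loss(cur, ref))
```

In this sketch the gradient ascent objective keeps decreasing as the log-probabilities fall, while the NPO-style objective is bounded below by zero and flattens out, which is consistent with the abstract's claim of slower progression toward collapse.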
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)
Methods: Gradient Ascent · TOFU
