Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

Ruiqi Zhang; Licong Lin; Yu Bai; Song Mei

arXiv:2404.05868 · cs.LG · October 14, 2024 · 3 cites

TL;DR

This paper introduces Negative Preference Optimization (NPO), a novel method for unlearning undesirable data in large language models that mitigates catastrophic collapse and outperforms gradient ascent-based approaches.

Contribution

NPO is a simple, alignment-inspired approach that effectively unlearns data while preserving model utility, with theoretical and empirical advantages over existing methods.

Findings

01

On TOFU, NPO-based methods achieve reasonable unlearning when forgetting 50% of the training data, outperforming existing methods.

02

NPO-based methods generate more sensible outputs than GA-based methods, whose outputs are often gibberish.

03

Theoretical analysis shows that minimizing the NPO loss progresses toward catastrophic collapse exponentially more slowly than GA.

Abstract

Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from the pre-trained model while preserving the model's utilities on other tasks. Several practical methods have recently been proposed for LLM unlearning, mostly based on gradient ascent (GA) on the loss of undesirable data. However, on certain unlearning tasks, these methods either fail to effectively unlearn the target data or suffer from catastrophic collapse -- a drastic degradation of the model's utilities. In this paper, we propose Negative Preference Optimization (NPO), a simple alignment-inspired method that could efficiently and effectively unlearn a target dataset. We theoretically show that the progression toward catastrophic collapse by minimizing the NPO loss is exponentially slower than GA. Through…
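To make the contrast between GA and NPO in the abstract more concrete, here is a minimal Python sketch. It assumes the NPO loss has the form (2/β)·E[log(1 + (π_θ/π_ref)^β)] over forget-set sequences, which is our reading of the paper's formulation and is not stated on this page. The function names, the default β = 0.1, and the toy log-probabilities are hypothetical choices made for illustration only.

```python
import math


def ga_loss(logp_theta):
    """GA objective (assumed form): mean log-likelihood of forget-set
    sequences under the current model; minimizing it is gradient ascent
    on the usual next-token prediction loss."""
    return sum(logp_theta) / len(logp_theta)


def _softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def npo_loss(logp_theta, logp_ref, beta=0.1):
    """NPO loss (assumed form): (2/beta) * mean(log(1 + (pi_theta/pi_ref)^beta)).

    logp_theta / logp_ref are sequence log-probabilities under the current
    and the frozen reference model; beta = 0.1 is an illustrative default.
    """
    terms = [_softplus(beta * (lt - lr)) for lt, lr in zip(logp_theta, logp_ref)]
    return 2.0 / beta * sum(terms) / len(terms)


def npo_weight(logp_theta, logp_ref, beta=0.1):
    """Per-sample factor that scales the GA gradient inside the NPO gradient:
    d/d(logp_theta) of the per-sample NPO term = 2 * sigmoid(beta * (lt - lr))."""
    return 2.0 * _sigmoid(beta * (logp_theta - logp_ref))


if __name__ == "__main__":
    # Toy values: the reference model assigns log-prob -5 to a forget sample;
    # the current model's log-prob falls as unlearning proceeds.
    logp_ref = -5.0
    for logp_theta in (-5.0, -20.0, -50.0):
        print(
            logp_theta,
            ga_loss([logp_theta]),
            npo_loss([logp_theta], [logp_ref]),
            npo_weight(logp_theta, logp_ref),
        )
```

Under this assumed form, the GA objective is unbounded below as the model's log-probability falls, whereas the NPO loss is bounded below by zero. The NPO gradient for a sample equals the GA gradient scaled by `npo_weight`, which shrinks toward zero once the current model assigns the sample much lower probability than the reference model. That shrinking weight is one intuition for the slower progression toward collapse claimed in the abstract.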

Code & Models

Repositories

ucsb-nlp-chang/uld
pytorch

Models

girishgupta/deep-ignorance-unfiltered_unlearned_npo
model · 69 dl

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)

Methods: Genetic Algorithms · Tofu