WPN: An Unlearning Method Based on N-pair Contrastive Learning in Language Models

Guitao Chen; Yunshen Wang; Hongye Sun; Guang Chen

arXiv:2408.09459·cs.CL·August 20, 2024


Open Access

TL;DR

This paper introduces WPN, a novel unlearning method based on N-pair contrastive learning, which effectively reduces harmful outputs in language models while preserving overall performance.

Contribution

The paper proposes WPN, a new unlearning approach that mitigates harmful knowledge in language models without significantly degrading their performance.

Findings

01

WPN achieves a harmless response rate of up to 95.8%.

02

Maintains less than 2% performance degradation on benchmarks.

03

Demonstrates robustness against out-of-distribution and adversarial attacks.

Abstract

Generative language models (LMs) offer numerous advantages but may produce inappropriate or harmful outputs due to harmful knowledge acquired during pre-training. This knowledge often manifests as undesirable correspondences, such as "harmful prompts" leading to "harmful outputs," which our research aims to mitigate through unlearning techniques. However, existing unlearning methods based on gradient ascent can significantly impair the performance of LMs. To address this issue, we propose a novel approach called Weighted Positional N-pair (WPN) Learning, which leverages position-weighted mean pooling within an n-pair contrastive learning framework. WPN is designed to modify the output distribution of LMs by eliminating specific harmful outputs (e.g., replacing toxic responses with neutral ones), thereby transforming the model's behavior from "harmful prompt-harmful output" to…
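The abstract names two building blocks: position-weighted mean pooling and an n-pair contrastive loss. The sketch below illustrates both in plain Python. It is not the authors' implementation. The linear weighting scheme (token i weighted by i divided by the sum of positions), the function names, and the pairing of one anchor with one positive and several negatives are assumptions made for illustration; the loss follows the standard N-pair formulation, log(1 + Σ exp(a·nᵢ − a·p)).

```python
import math


def position_weighted_mean_pool(hidden_states):
    """Pool token vectors into one vector.

    ASSUMPTION: linear position weights, w_i = i / (1 + 2 + ... + S),
    so later tokens contribute more. The paper's exact weights may differ.
    """
    seq_len = len(hidden_states)
    total = seq_len * (seq_len + 1) / 2  # sum of positions 1..S
    dim = len(hidden_states[0])
    pooled = [0.0] * dim
    for pos, vec in enumerate(hidden_states, start=1):
        weight = pos / total
        for d in range(dim):
            pooled[d] += weight * vec[d]
    return pooled


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def n_pair_loss(anchor, positive, negatives):
    """Standard N-pair loss: log(1 + sum_i exp(a.n_i - a.p)).

    The loss shrinks as the anchor scores higher with the positive
    than with every negative.
    """
    pos_score = dot(anchor, positive)
    return math.log1p(
        sum(math.exp(dot(anchor, neg) - pos_score) for neg in negatives)
    )


# Toy usage with hypothetical 2-dimensional "hidden states".
prompt_vec = position_weighted_mean_pool([[1.0, 0.0], [0.5, 0.5]])
harmless_vec = position_weighted_mean_pool([[1.0, 0.0], [1.0, 0.2]])
harmful_vecs = [position_weighted_mean_pool([[0.0, 1.0], [-1.0, 1.0]])]
loss = n_pair_loss(prompt_vec, harmless_vec, harmful_vecs)
```

In this toy setup the pooled prompt vector plays the anchor, a harmless response the positive, and harmful responses the negatives. That assignment is one plausible reading of the abstract, not a confirmed detail of WPN.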

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: GPT-Neo · OPT · Contrastive Learning