WPN: An Unlearning Method Based on N-pair Contrastive Learning in Language Models
Guitao Chen, Yunshen Wang, Hongye Sun, Guang Chen

TL;DR
This paper introduces WPN, a novel unlearning method based on N-pair contrastive learning, which effectively reduces harmful outputs in language models while preserving overall performance.
Contribution
The paper proposes WPN, a new unlearning approach that mitigates harmful knowledge in language models without significantly degrading their performance.
Findings
WPN reduces the proportion of harmful responses, reaching a harmless rate of up to 95.8%.
Maintains less than 2% performance degradation on benchmarks.
Demonstrates robustness against out-of-distribution and adversarial attacks.
Abstract
Generative language models (LMs) offer numerous advantages but may produce inappropriate or harmful outputs due to the harmful knowledge acquired during pre-training. This knowledge often manifests as undesirable correspondences, such as "harmful prompts" leading to "harmful outputs," which our research aims to mitigate through unlearning techniques. However, existing unlearning methods based on gradient ascent can significantly impair the performance of LMs. To address this issue, we propose a novel approach called Weighted Positional N-pair (WPN) Learning, which leverages position-weighted mean pooling within an n-pair contrastive learning framework. WPN is designed to modify the output distribution of LMs by eliminating specific harmful outputs (e.g., replacing toxic responses with neutral ones), thereby transforming the model's behavior from "harmful prompt-harmful output" to…
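The abstract names two building blocks, position-weighted mean pooling and an n-pair contrastive objective, but the excerpt here does not give their formulas. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes the pooling weight grows linearly with token position, and it uses the commonly cited N-pair loss form log(1 + Σ exp(a·n_j − a·p)). The function names and the assignment of roles (prompt embedding as anchor, harmless response as positive, harmful responses as negatives) are hypothetical.

```python
import math


def position_weighted_mean_pool(token_vectors):
    """Pool token vectors into one vector, weighting later tokens more.

    Assumption: token i (1-based) gets weight i / (1 + 2 + ... + n).
    """
    n = len(token_vectors)
    weight_sum = n * (n + 1) / 2
    dim = len(token_vectors[0])
    pooled = [0.0] * dim
    for i, vec in enumerate(token_vectors, start=1):
        w = i / weight_sum
        for d in range(dim):
            pooled[d] += w * vec[d]
    return pooled


def dot(u, v):
    """Plain dot product used as the similarity score."""
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(anchor, positive, negatives):
    """N-pair loss: log(1 + sum_j exp(anchor.neg_j - anchor.positive)).

    The loss shrinks as the anchor scores higher with the positive
    than with every negative.
    """
    pos_score = dot(anchor, positive)
    return math.log1p(
        sum(math.exp(dot(anchor, neg) - pos_score) for neg in negatives)
    )


# Hypothetical toy embeddings: two tokens per sequence, two dimensions.
prompt = position_weighted_mean_pool([[1.0, 0.0], [1.0, 0.0]])
harmless = position_weighted_mean_pool([[1.0, 0.0], [1.0, 0.0]])
harmful = position_weighted_mean_pool([[0.0, 1.0], [0.0, 1.0]])
loss = n_pair_loss(prompt, harmless, [harmful])
```

In this toy setup, minimizing the loss would pull the prompt representation toward the harmless response and away from the harmful one, which matches the "harmful prompt to harmless output" shift described in the abstract at a very coarse level.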
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: GPT-Neo · OPT · Contrastive Learning
