UCD: Unlearning in LLMs via Contrastive Decoding
Vinith M. Suriyakumar, Ayush Sekhari, Ashia Wilson

TL;DR
This paper introduces an inference-time unlearning method for large language models that uses contrastive decoding with two smaller auxiliary models to remove specific information while preserving overall performance.
Contribution
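It presents a contrastive decoding approach that guides the original model's outputs at inference time using the difference between two smaller auxiliary models, one trained without the forget set and one trained with it, improving the tradeoff between unlearning effectiveness and model utility.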
Findings
Notable gains in both forget quality and retained performance compared with prior approaches
Removal of specific information (e.g. sensitive or undesirable content) from LLMs at inference time
Evaluated on two unlearning benchmarks, TOFU and MUSE
Abstract
Machine unlearning aims to remove specific information, e.g. sensitive or undesirable content, from large language models (LLMs) while preserving overall performance. We propose an inference-time unlearning algorithm that uses contrastive decoding, leveraging two auxiliary smaller models, one trained without the forget set and one trained with it, to guide the outputs of the original model using their difference during inference. Our strategy substantially improves the tradeoff between unlearning effectiveness and model utility. We evaluate our approach on two unlearning benchmarks, TOFU and MUSE. Results show notable gains in both forget quality and retained performance in comparison to prior approaches, suggesting that incorporating contrastive decoding can offer an efficient, practical avenue for unlearning concepts in large-scale models.
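The abstract describes the mechanism only in words, so the following is a minimal sketch of one way the idea could look at a single decoding step, not the authors' implementation. It assumes all three models share a vocabulary and that the guidance is an additive shift of the original model's logits by the difference between the two auxiliary models' logits. The function names, the `alpha` scaling factor, and the toy logit values are all hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_unlearn_logits(base_logits, retain_aux_logits,
                               forget_aux_logits, alpha=1.0):
    """Shift the original model's logits by the auxiliary-model difference.

    base_logits:       next-token logits from the original (large) model
    retain_aux_logits: logits from the small model trained WITHOUT the forget set
    forget_aux_logits: logits from the small model trained WITH the forget set
    alpha:             hypothetical strength knob; alpha=0 leaves the model unchanged

    The additive form is an assumption made for illustration only.
    """
    if not (len(base_logits) == len(retain_aux_logits) == len(forget_aux_logits)):
        raise ValueError("all logit vectors must cover the same vocabulary")
    return [
        b + alpha * (r - f)
        for b, r, f in zip(base_logits, retain_aux_logits, forget_aux_logits)
    ]


def greedy_next_token(base_logits, retain_aux_logits, forget_aux_logits, alpha=1.0):
    """Pick the highest-scoring token index after the contrastive shift."""
    adjusted = contrastive_unlearn_logits(
        base_logits, retain_aux_logits, forget_aux_logits, alpha
    )
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Toy three-token vocabulary; token 0 stands in for a forget-set answer.
base = [3.0, 2.0, 0.0]        # original model prefers token 0
retain_aux = [0.0, 1.0, 0.0]  # small model that never saw the forget set
forget_aux = [2.0, 1.0, 0.0]  # small model that did see the forget set

unchanged = greedy_next_token(base, retain_aux, forget_aux, alpha=0.0)
steered = greedy_next_token(base, retain_aux, forget_aux, alpha=1.0)
probs = softmax(contrastive_unlearn_logits(base, retain_aux, forget_aux))
```

In this toy setup the two auxiliary models disagree only on token 0, so the shift lowers that token's score and leaves the others untouched. That is the intuition behind using the difference rather than either auxiliary model alone.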
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
Methods: Sparse Evolutionary Training
