CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment
Zhengbang Yang, Yisheng Zhong, Junyuan Hong, Zhuangdi Zhu

TL;DR
CATNIP introduces a novel unlearning method for large language models that calibrates gradient updates based on model confidence, enabling effective forgetting of undesirable knowledge without needing retention data.
Contribution
The paper proposes CATNIP, an unlearning approach that uses token-level model confidence to control forgetting more precisely, removing the need for retention data.
Findings
Effective unlearning without retention data.
A stronger tradeoff between forgetting and knowledge preservation.
Robust to data scarcity and length variation.
Abstract
Pretrained knowledge memorized in LLMs raises critical concerns over safety and privacy, which has motivated LLM Unlearning as a technique for selectively removing the influence of undesirable knowledge. Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge and rely on retention data or curated contrastive pairs, which can be impractical or prohibitive in data and compute. Negative Preference Alignment has been explored for unlearning to address the limitations of GA, but it remains constrained by its choice of reference model and underperforms in realistic data settings. These limitations raise two key questions: i) Can we achieve effective unlearning that quantifies model confidence in undesirable knowledge and uses it to calibrate gradient updates more precisely, thus reducing catastrophic forgetting? ii) Can…
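For readers unfamiliar with the two baselines the abstract names, the sketch below contrasts a Gradient Ascent objective with the commonly used Negative Preference Optimization (NPO) form of negative preference alignment, evaluated on one forget sample. It is not CATNIP's objective, which this page does not specify. The function names, the plain-list representation of per-token log-probabilities, and the default `beta` are assumptions made for illustration, as is the choice of NPO as the formulation the abstract refers to.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ga_loss(policy_token_logprobs):
    """Gradient Ascent objective on one forget sample.

    Minimizing the sequence log-likelihood is the same as ascending
    the usual negative log-likelihood training loss.
    """
    return sum(policy_token_logprobs)


def npo_loss(policy_token_logprobs, ref_token_logprobs, beta=0.1):
    """NPO objective on one forget sample (illustrative, not CATNIP):
    -(2 / beta) * log sigmoid(-beta * log(pi_theta / pi_ref)).
    """
    log_ratio = sum(policy_token_logprobs) - sum(ref_token_logprobs)
    # -log sigmoid(-z) equals softplus(z).
    return (2.0 / beta) * softplus(beta * log_ratio)


def npo_gradient_weight(policy_token_logprobs, ref_token_logprobs, beta=0.1):
    """Factor by which NPO scales the GA gradient for this sample."""
    log_ratio = sum(policy_token_logprobs) - sum(ref_token_logprobs)
    return 2.0 * sigmoid(beta * log_ratio)
```

When the policy and the reference model assign the same likelihood to a forget sample, the NPO weight is 1 and the update matches GA; as the policy's likelihood falls below the reference, the weight shrinks toward 0. This weighting acts on the whole sequence and depends on the reference model, which is the dependence the abstract flags as a limitation.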
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
