TL;DR
This paper recasts unlearning in large language models as a constrained optimization problem, which stabilizes training and removes targeted information more effectively while preserving model utility.
Contribution
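The paper formulates LLM unlearning as a constrained optimization problem: a new logit-margin flattening loss enforces forgetting on a forget set, a hard constraint on a separate retain set preserves utility, and a scalable primal-dual algorithm solves the resulting problem. The authors report that it outperforms existing methods on the TOFU and MUSE benchmarks.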
Findings
Effectively removes targeted information from LLMs.
Maintains model utility and performance on retained data.
Demonstrates superior results on TOFU and MUSE benchmarks.
Abstract
Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more…
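The constrained formulation described in the abstract can be illustrated with a toy primal-dual loop. This is only a minimal sketch under stated assumptions, not the paper's algorithm: the model is a linear classifier rather than an LLM, `flatten_loss` is a hypothetical softmax-free stand-in (squared deviation of each logit from the per-example mean logit) because the excerpt does not define the logit-margin flattening loss, and `retain_loss`, `eps`, `lr`, and `lr_dual` are illustrative names and values. The loop minimizes the forget objective subject to `retain_loss <= eps`, taking a gradient descent step on the weights and a projected ascent step on the multiplier `lam`.

```python
import numpy as np

def flatten_loss(W, X):
    """Hypothetical softmax-free flattening loss (NOT the paper's definition).

    Penalizes the squared deviation of each logit from the per-example mean,
    so it is zero exactly when all logits of an example are equal.
    """
    Z = X @ W.T                                # (n, k) logits
    D = Z - Z.mean(axis=1, keepdims=True)      # centered logits
    loss = 0.5 * np.mean(np.sum(D ** 2, axis=1))
    grad = D.T @ X / X.shape[0]                # (k, d) gradient w.r.t. W
    return loss, grad

def retain_loss(W, W_ref, X):
    """Assumed retention measure: squared drift of logits from the original model."""
    R = X @ W.T - X @ W_ref.T
    loss = 0.5 * np.mean(np.sum(R ** 2, axis=1))
    grad = R.T @ X / X.shape[0]
    return loss, grad

def primal_dual_unlearn(W_ref, X_forget, X_retain,
                        eps=1.0, lr=0.01, lr_dual=0.02, steps=20000):
    """Minimize flatten_loss subject to retain_loss <= eps via the Lagrangian."""
    W, lam = W_ref.copy(), 0.0
    for _ in range(steps):
        _, grad_f = flatten_loss(W, X_forget)
        g, grad_g = retain_loss(W, W_ref, X_retain)
        W -= lr * (grad_f + lam * grad_g)            # primal descent step
        lam = max(0.0, lam + lr_dual * (g - eps))    # projected dual ascent step
    return W, lam

# Synthetic data: 4 classes, 8 features, 5 forget and 20 retain examples.
rng = np.random.default_rng(0)
W_ref = rng.normal(size=(4, 8))
X_forget = rng.normal(size=(5, 8))
X_retain = rng.normal(size=(20, 8))
eps = 1.0

f_before, _ = flatten_loss(W_ref, X_forget)
W_new, lam = primal_dual_unlearn(W_ref, X_forget, X_retain, eps=eps)
f_after, _ = flatten_loss(W_new, X_forget)
g_after, _ = retain_loss(W_new, W_ref, X_retain)
print(f"forget loss {f_before:.3f} -> {f_after:.3f}, "
      f"retain drift {g_after:.3f} (budget {eps}), lambda {lam:.3f}")
```

In contrast to a fixed-weight scalarized loss, the multiplier `lam` here adapts during training: it grows while the retain constraint is violated and shrinks toward zero when the constraint is slack, which is the general mechanism a primal-dual method relies on.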
Taxonomy
Methods, TOFU
