LLM Unlearning using Gradient Ratio-Based Influence Estimation and Noise Injection
Ameya Anjarlekar, Sandeep Pombra

TL;DR
This paper introduces GRIN, a modular framework for unlearning in large language models. It uses a gradient-ratio metric to identify the parameters most responsible for memorizing sensitive data, then selectively injects noise into those parameters before fine-tuning, which improves unlearning effectiveness.
Contribution
The paper presents a novel gradient-ratio-based metric for targeted unlearning and combines it with selective noise injection to enhance unlearning in LLMs while preserving utility.
Findings
Effective unlearning demonstrated on the TOFU, WMDP, and SafePKU benchmarks
Gradient ratio metric accurately identifies parameters for forgetting
Noise injection improves unlearning without significant utility loss
Abstract
The growing legal and ethical scrutiny of large language models (LLMs) necessitates effective machine unlearning, particularly for sensitive or unauthorized data. Existing empirical methods often yield incomplete forgetting or unintended degradation of unrelated knowledge due to poor localization. In this work, we propose GRIN: a modular and targeted framework for LLM unlearning. GRIN introduces a novel gradient-ratio-based metric to identify parameters most responsible for memorizing forget data. We then perform selective noise injection into these parameters prior to fine-tuning, which improves unlearning performance while maintaining model utility. Finally, we propose new evaluation metrics tailored to the LLM setting and validate our approach on standard benchmarks such as TOFU, WMDP, and SafePKU.
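The pipeline described in the abstract (score parameters, select the most implicated ones, perturb them before fine-tuning) can be illustrated with a minimal sketch. The abstract does not give the exact form of the gradient-ratio metric, so the score used here, the absolute forget-set gradient divided by the absolute retain-set gradient, is an assumption, as are the function names, the top-k selection rule, the Gaussian noise model, and the toy values. It is not the authors' implementation.

```python
import random

def gradient_ratio_scores(forget_grads, retain_grads, eps=1e-8):
    """Assumed metric: |forget gradient| / (|retain gradient| + eps) per parameter.

    A high score means the parameter matters much more to the forget data
    than to the retain data.
    """
    return [abs(gf) / (abs(gr) + eps) for gf, gr in zip(forget_grads, retain_grads)]

def select_top_k(scores, k):
    """Return the indices of the k highest-scoring parameters, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def inject_noise(params, indices, sigma, rng):
    """Return a copy of params with Gaussian noise added only at the given indices."""
    noisy = list(params)
    for i in indices:
        noisy[i] += rng.gauss(0.0, sigma)
    return noisy

# Hypothetical toy values: five scalar "parameters" with per-parameter gradients
# averaged over a forget set and a retain set.
params = [0.5, -1.2, 0.3, 2.0, -0.7]
forget_grads = [0.9, 0.1, 0.8, 0.05, 0.2]
retain_grads = [0.1, 0.5, 0.05, 0.6, 0.2]

scores = gradient_ratio_scores(forget_grads, retain_grads)
selected = select_top_k(scores, k=2)  # selected == [0, 2]
noisy_params = inject_noise(params, selected, sigma=0.1, rng=random.Random(0))
```

Only the selected parameters are perturbed; the rest keep their original values. In this sketch that is how the localization is meant to limit damage to unrelated knowledge. Fine-tuning on the unlearning objective would then start from `noisy_params`.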
Taxonomy
Topics: Image and Signal Denoising Methods
