GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models
Peizhi Niu, Evelyn Ma, Huiting Zhou, Duo Zhou, Huan Zhang, S. Rasoul Etesami, Olgica Milenkovic

TL;DR
GUARD introduces a data attribution-guided framework for large language model unlearning that effectively balances forgetting specific data while retaining valuable information, outperforming prior methods in utility preservation.
Contribution
The paper proposes a novel data attribution metric and unlearning objective that adaptively weights samples, significantly improving retention during unlearning in large language models.
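The contribution pairs a per-sample score with an adaptively weighted objective. The sketch below illustrates that idea under stated assumptions and is not the paper's formulation. It assumes a gradient-difference-style loss (ascent on the forget set, descent on the retain set). It also assumes a hypothetical softmax mapping, `scores_to_weights`, that gives lower weight to forget samples with higher scores, meaning stronger alignment with the retain set. The function names, `temperature`, and `retain_coef` are all illustrative.

```python
import math


def scores_to_weights(scores, temperature=1.0):
    """Hypothetical inverse mapping (an assumption, not the paper's rule):
    higher alignment score -> smaller weight. Softmax over -score/temperature,
    rescaled so the weights average to 1."""
    logits = [-s / temperature for s in scores]
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    n = len(scores)
    return [n * e / total for e in exps]


def weighted_unlearning_loss(forget_losses, weights, retain_losses, retain_coef=1.0):
    """Objective to minimize (assumed gradient-difference form):
    ascend on the weighted mean forget loss, descend on the mean retain loss."""
    forget_term = sum(w * l for w, l in zip(weights, forget_losses)) / len(forget_losses)
    retain_term = sum(retain_losses) / len(retain_losses)
    return -forget_term + retain_coef * retain_term


# Uniform weights recover the plain gradient-difference objective.
print(weighted_unlearning_loss([1.0, 3.0], [1.0, 1.0], [0.5, 1.5]))  # -1.0
```

With all weights equal to 1, the objective reduces to the unweighted baseline. This matches the uniform-versus-attribution-weights comparison described in the reviews.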
Findings
Reduces utility loss by up to 194.92% on the TOFU benchmark.
Improves knowledge retention by 16.20% on MUSE NEWS.
Maintains comparable privacy loss to state-of-the-art methods.
Abstract
Unlearning in large language models is becoming increasingly important due to regulatory compliance, copyright protection, and privacy concerns. However, a key challenge in LLM unlearning is unintended forgetting, where the removal of specific data inadvertently impairs the utility of the model and its retention of valuable, desired information. While prior work has primarily focused on architectural innovations, the influence of data-level factors on unlearning performance remains underexplored. As a result, existing methods often suffer from degraded retention when forgetting high-impact data. To address this problem, we propose GUARD, a novel framework for Guided Unlearning And Retention via Data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning, which quantifies the alignment between the Forget and Retain sets while…
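The abstract describes the metric only as a lightweight proxy for the alignment between the Forget and Retain sets. The reviews on this page describe the core score more concretely, as the inner product between a forget-sample gradient and the average retain gradient. The minimal sketch below computes that score. It assumes a toy linear model with squared loss in place of an LLM, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def squared_loss_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 w.r.t. w (toy stand-in for an LLM loss)."""
    residual = dot(w, x) - y
    return [residual * xi for xi in x]


def mean_gradient(w, data):
    """Average per-sample gradient over a dataset of (x, y) pairs."""
    grads = [squared_loss_grad(w, x, y) for x, y in data]
    return [sum(col) / len(grads) for col in zip(*grads)]


def alignment_scores(w0, forget_set, retain_set):
    """Score each forget sample by <grad of its loss at w0, mean retain grad at w0>."""
    g_r = mean_gradient(w0, retain_set)
    return [dot(squared_loss_grad(w0, x, y), g_r) for x, y in forget_set]


w0 = [0.0, 0.0]  # stands in for the original parameters theta_0
retain_set = [([1.0, 0.0], 1.0), ([1.0, 0.0], 3.0)]
forget_set = [([1.0, 0.0], 2.0), ([0.0, 1.0], 2.0)]
print(alignment_scores(w0, forget_set, retain_set))  # [4.0, 0.0]
```

Here the first forget sample's gradient points the same way as the average retain gradient, so ascending on its loss would also raise the retain loss. The second sample is orthogonal to the retain gradient. A retention-aware weighting would down-weight the first sample relative to the second.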
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. The proposed method is intuitively simple, and experiments show that it outperforms strong baselines, especially on retain performance. 2. The results show that using the proposed sample weights consistently outperforms uniform weights when applied to different unlearning methods, models, and datasets. 3. The proposed method is theoretically grounded.
1. The method assumes access to the retain set in order to calculate the attribution scores. However, what if we don't know what questions are in the retain set? Does the method still work if we use a general corpus, such as a subset of the pre-training corpus, to calculate the attribution score? Are there any ways to obtain some surrogates for the retain set? 2. The method needs to first calculate gradients over the retain and forget set, which is more expensive than baselines. How much additio
* The proposed weighting method can be applied to many different LLM unlearning losses, and it appears to benefit multiple methods in the experiments. * The paper presents a theoretical guarantee about the weighting's effectiveness for unlearning training.
* Potentially unreasonable unlearning setting. The weight computation in Equation (2) depends on the gradients of the original model $\theta_0$ on the $D_r$ dataset and on the forget-set example. This does not look reasonable to me, since typical unlearning does not assume access to the pre-trained model weights before fine-tuning on the knowledge data. * Potentially unreasonable data assumption. Assumption 1 (line 291) states condition 1, $\langle \bar{g}_r, \bar{g}_f \rangle > 0$, which seems abrupt and lacks sufficient mot
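A first-order expansion shows why the sign of $\langle \bar{g}_r, \bar{g}_f \rangle$ matters for retention. This sketch assumes the unlearning update is a single gradient-ascent step of size $\eta$ on the average forget loss $\mathcal{L}_f$. That update rule is an illustrative assumption and may not be the paper's. The notation $\mathcal{L}_r$, $\mathcal{L}_f$ for the average retain and forget losses is also introduced here for illustration.

```latex
\bar{g}_r = \nabla_\theta \mathcal{L}_r(\theta), \qquad
\bar{g}_f = \nabla_\theta \mathcal{L}_f(\theta), \qquad
\theta' = \theta + \eta\,\bar{g}_f

\mathcal{L}_r(\theta') \approx \mathcal{L}_r(\theta) + \eta\,\langle \bar{g}_r, \bar{g}_f \rangle
```

So when $\langle \bar{g}_r, \bar{g}_f \rangle > 0$, ascending on the forget loss raises the retain loss to first order. This is the unintended-forgetting regime that a retention-aware weighting aims to mitigate.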
1. Recasts unlearning as retention-aware weighting using a gradient alignment score, directly aligned with the goal of preserving retained knowledge. 2. The GUARD pipeline has clear properties and theory guaranteeing lower retain loss, comparable forgetting, and a better retention–forget tradeoff. 3. The proposed GUARD pipeline shows gains across strong baselines and metrics, with large improvements on Retain while keeping Forget performance on target, and it is easy to plug into existing procedures. 4.
1. The core score is the inner product between a forget-sample gradient and the average retain gradient. I am curious whether the inverse weighting could be justified by deriving it from an explicit objective, for example, maximizing the loss on the forget set subject to a first-order constraint on the retain set, and presenting the solution through a Lagrangian or projection analysis. 2. In the experiment section, the comparison mainly focuses on current MU methods with and without the GUARD pipeline, and sinc
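The review's suggested derivation can be worked through for a single aggregate update. The setup chooses a step $\delta$ that maximizes the first-order increase in forget loss, keeps the first-order change in retain loss non-positive, and has norm at most $\epsilon$. This is the reviewer's proposal under illustrative assumptions, not a result from the paper. It also assumes $\bar{g}_f$ is not parallel to $\bar{g}_r$.

```latex
\max_{\delta}\ \langle \bar{g}_f, \delta \rangle
\quad \text{s.t.} \quad \langle \bar{g}_r, \delta \rangle \le 0,\quad \|\delta\|_2 \le \epsilon

\mathcal{L}(\delta,\mu,\nu) = \langle \bar{g}_f, \delta \rangle
  - \mu\,\langle \bar{g}_r, \delta \rangle
  - \tfrac{\nu}{2}\left(\|\delta\|_2^2 - \epsilon^2\right), \qquad \mu,\nu \ge 0

\nabla_\delta \mathcal{L} = 0 \;\Rightarrow\; \delta = \frac{\bar{g}_f - \mu\,\bar{g}_r}{\nu}

\text{if } \langle \bar{g}_f, \bar{g}_r \rangle > 0:\quad
\mu^\star = \frac{\langle \bar{g}_f, \bar{g}_r \rangle}{\|\bar{g}_r\|_2^2},\qquad
\delta^\star = \epsilon\,\frac{\bar{g}_f - \mu^\star \bar{g}_r}{\|\bar{g}_f - \mu^\star \bar{g}_r\|_2}

\text{otherwise:}\quad \mu^\star = 0,\qquad \delta^\star = \epsilon\,\frac{\bar{g}_f}{\|\bar{g}_f\|_2}
```

When the two gradients are positively aligned, the retain constraint is active. The optimal step then removes the component of the forget gradient along the retain gradient. A per-sample analogue would penalize each forget sample by its own inner product $\langle g_f^{(i)}, \bar{g}_r \rangle$, which is the score this review describes.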
1. Attribution-driven unlearning is underexplored in the literature, and the proposed method is simple. 2. The experiments show that the method is effective compared to baselines. 3. The authors provide proofs supporting improvements in retention efficiency.
1. The writing could be improved. For example, the introduction is lengthy and somewhat redundant, which reduces clarity and readability. 2. No other data attribution methods are tested. It would strengthen the paper to explore different attribution measures or analyze sensitivity to gradient noise.
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI
Methods: TOFU · Sparse Evolutionary Training
