Reinforcement Unlearning via Group Relative Policy Optimization
Efstratios Zaradoukas, Bardh Prenkaj, Gjergji Kasneci

TL;DR
This paper introduces PURGE, an unlearning method for large language models that removes sensitive or copyrighted data from a deployed model without retraining from scratch, while preserving utility, fluency, and robustness. It targets compliance with legal frameworks such as the GDPR and the EU AI Act.
Contribution
PURGE formulates unlearning as a verifiable optimization problem within the Group Relative Policy Optimization (GRPO) framework. It uses an intrinsic reward signal that penalizes any mention of forbidden concepts, rather than depending on a costly external reward model.
Findings
- Up to 46x lower token usage per unlearning target
- Improved fluency (+5.48%) and robustness (+12.02%)
- Achieves 11% unlearning effectiveness while preserving 98% utility
Abstract
During pretraining, LLMs inadvertently memorize sensitive or copyrighted data, posing significant compliance challenges under legal frameworks like the GDPR and the EU AI Act. Fulfilling these mandates demands techniques that can remove information from a deployed model without retraining from scratch. Existing unlearning approaches attempt to address this need, but often leak the very data they aim to erase, sacrifice fluency and robustness, or depend on costly external reward models. We introduce PURGE (Policy Unlearning through Relative Group Erasure), a novel method grounded in the Group Relative Policy Optimization framework that formulates unlearning as a verifiable problem. PURGE uses an intrinsic reward signal that penalizes any mention of forbidden concepts, allowing safe and consistent unlearning. Our approach achieves up to 46x lower token usage per target than…
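As a rough illustration of the idea in the abstract, the sketch below scores a group of sampled completions with a binary intrinsic reward and converts those scores into group-relative advantages in the usual GRPO style (reward minus group mean, divided by group standard deviation). This is a minimal sketch under stated assumptions, not the authors' implementation: the whole-word matching rule, the 0/1 reward values, the function names, and the example terms "Jane Doe" and "Acme Labs" are all hypothetical.

```python
import re
from statistics import mean, pstdev


def forbidden_reward(completion, forbidden_terms):
    """Assumed reward: 1.0 if no forbidden term appears, else 0.0.

    Matching is case-insensitive and on whole words; the paper's exact
    matching rule may differ.
    """
    text = completion.lower()
    for term in forbidden_terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, text):
            return 0.0
    return 1.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical forget target and a group of sampled completions.
forbidden = ["Jane Doe", "Acme Labs"]
group = [
    "Jane Doe founded Acme Labs in 1999.",
    "I'm not sure who that is.",
    "I don't have information about that person.",
    "That would be jane doe.",
]

rewards = [forbidden_reward(c, forbidden) for c in group]
advantages = group_relative_advantages(rewards)

for completion, r, a in zip(group, rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.2f} | {completion}")
```

Completions that avoid the forbidden terms receive positive advantages and those that mention them receive negative ones, so a policy-gradient update would push probability mass away from the forbidden concepts. Because the reward is computed directly from the text, no external reward model is needed in this sketch.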
Peer Reviews
Decision: ICLR 2026 Poster
- The paper's structure is logical and accessible, with clear transitions from motivation to methods and results.
- Introducing reinforcement learning to the LLM unlearning domain is relatively novel, bringing fresh perspectives that could inspire future work and contributing to the paper's innovative quality.
- Empirical results are comprehensive and well-supported, including breakdowns across multiple RWKU sub-tasks with quantitative comparisons to baselines, enhancing the paper's credibility

- The effectiveness of the proposed reward model is questionable, as relying solely on extracted entities may not compactly represent the knowledge to be forgotten; in some cases, knowledge has many variants (e.g., complex concepts), while in others, like copyright protection, only specific text needs forgetting without erasing concepts, potentially leading to over-penalization based on entity presence alone.
- To maintain training stability, GRPO includes a clipping mechanism that limits policy

1. The paper is well-written and easy to follow. The logical flow from the proposed algorithm (PURGE), through its theoretical analysis, and into the experimental results is organic, making the core contributions clear and understandable.
2. A primary contribution is the novel formulation of LLM unlearning as a verifiable task, shifting the paradigm from standard preference-optimization or gradient-ascent methods. This re-framing is creatively combined with Group Relative Policy Optimization (G

1. Limited Discussion of Recent Literature: While the paper cites foundational unlearning works, it lacks engagement with the most recent literature, particularly the significant volume of unlearning papers from ICLR 2025 [1-5]. The authors should incorporate this discussion to more clearly differentiate their contributions from these recent papers.
2. Dependency on External Proprietary Models: The "Synthetic Forget Corpus Construction" (Sec 4.1) creates a significant external dependency on a p

1. The paper introduces a novel approach, PURGE, which reframes unlearning as a reinforcement learning problem. The use of group-relative policy optimization and the introduction of a verifiable unlearning process are both creative contributions, distinguishing this work from previous methods that often require full retraining or lack theoretical guarantees.
2. The method is rigorously designed with both theoretical guarantees and practical effectiveness. The authors provide a clear framework wi

1. The work only compares performance on a single benchmark, RWKU. Other benchmarks, such as TOFU, should also be included in the experiments to better demonstrate the generalizability of the proposed method.
2. According to Table 1, PURGE lags behind the baseline on many metrics (including QA, FM, GA, etc.). Although Section 5.2 provides some explanations, the method's broad effectiveness remains questionable. Given the generally poor performance of the Utility Set, there may be a potential iss
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
