Explainable LLM Unlearning Through Reasoning
Junfeng Liao, Qizhou Wang, Shanshan Ye, Xin Yu, Ling Chen, Zhen Fang

TL;DR
This paper introduces a reasoning-based unlearning method for large language models that improves the precision and reliability of removing specific knowledge while maintaining overall capabilities and robustness.
Contribution
It proposes a novel reasoning-based unlearning target and a targeted reasoning unlearning (TRU) method that enhances unlearning accuracy and robustness in LLMs.
Findings
TRU achieves more reliable unlearning than the baselines.
TRU better preserves general capabilities of LLMs.
TRU shows superior robustness under attack scenarios.
Abstract
LLM unlearning is essential for mitigating safety, copyright, and privacy concerns in pre-trained large language models (LLMs). Compared to preference alignment, it offers a more explicit approach: removing undesirable knowledge characterized by specific unlearning datasets. In previous work, gradient ascent (GA) and its variants have shown promise for implementing unlearning, yet their untargeted nature results in unintended degradation of general capabilities, incomplete removal of knowledge, and the generation of incoherent responses, among other problems. We argue that these issues stem from the absence of explicit guidance on what and how models should unlearn. To fill this gap, we introduce a novel unlearning target, the reasoning-based unlearning target, which satisfies both the specified unlearning scope and the specified post-unlearning response. Building on this, we propose targeted…
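As a rough illustration of the idea, the sketch below combines a supervised loss on a reasoning-plus-refusal target with a gradient-ascent term on the forget data, which is how the reviews further down characterize TRU. It is a minimal sketch under stated assumptions, not the authors' objective: the function names, the `ga_weight` parameter, and the use of mean token negative log-likelihood are all hypothetical, and per-token probabilities are passed in directly instead of being produced by a model.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of a sequence.

    token_probs holds the probability the model assigns to each
    reference token; every value must lie in (0, 1].
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def combined_unlearning_loss(
    target_probs: Sequence[float],
    forget_probs: Sequence[float],
    ga_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> float:
    """Illustrative loss: supervised target term minus a gradient-ascent term.

    target_probs: token probabilities of the desired post-unlearning
        response (assumed here to be a reasoning trace plus a refusal).
    forget_probs: token probabilities of the original answer that
        should be unlearned.
    """
    # Minimizing this term teaches the model to produce the target response.
    target_loss = sequence_nll(target_probs)
    # Gradient ascent on the forget data means maximizing its NLL,
    # so the NLL enters the minimized loss with a negative sign.
    forget_nll = sequence_nll(forget_probs)
    return target_loss - ga_weight * forget_nll


if __name__ == "__main__":
    # Toy numbers only: a confident target and a confident forget answer.
    print(combined_unlearning_loss([0.9, 0.8], [0.7, 0.6], ga_weight=0.5))
```

With `ga_weight=0.0` the sketch reduces to plain supervised fine-tuning on the target. Raising the weight puts more pressure on lowering the likelihood of the original answer.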
Peer Reviews
Decision: ICLR 2026 Poster
1) The paper articulates two concrete failure modes (scope and response control) and motivates why prior GA-style methods fail. The case studies are persuasive. 2) Combining supervised training on reasoning+refusal traces with a GA loss is conceptually straightforward yet addresses both criteria (scope + response). The objective and algorithm are easy to implement. 3) The paper ablates the GA component, the target loss, and the reasoning traces to show the role of each piece (Table 2 - Table
1) TRU relies on reasoning traces and refusal responses generated by a reasoning LLM (DeepSeek) to define what to learn to refuse. This raises important questions: 1-a) If the same or a closely related large reasoning model produced the targets and is used in evaluation (via LLM-as-a-Judge (LaaJ) or shared model families), the method may partly be learning the style/behavior of that external model rather than an independent notion of refusal. The paper needs to discuss whether (and how) target-generation LLMs a
The paper is well-written and easy to read. The problem raised in the paper about in-scope unlearning failure is of significant importance for LLM unlearning. Finally, they perform comprehensive experiments with different unlearning benchmarks and a wide range of unlearning methods.
Please address the following major concerns I have: - Kindly describe how you converted responses from a reasoning model (which should contain <think> tokens) to non-reasoning models like Zephyr-7B-beta. Do you ignore the think tokens and concatenate the reasoning trace with the refusal? - If so, do you have any intuition as to why the reasoning is important? As we see in the ablation study in Table 2, there seems to be a tradeoff between UQ and RQ with and without reasoning. Thus it se
Originality – Proposes reasoning-guided unlearning, a novel perspective that goes beyond optimization tricks. Technical soundness – The combination of supervised reasoning loss with GA-based unlearning is simple yet effective. Comprehensive experiments – Evaluated on three benchmarks and multiple backbones (Llama-2/3, Zephyr). Explainability and robustness – Demonstrates interpretable refusals, cross-lingual generalization, and resilience to jailbreak/relearning. Strong ablation and analysis
Limited human evaluation – The reliance on LLM-as-a-Judge may inherit biases; some human validation would strengthen the claims. Computational cost – The reasoning-target generation via DeepSeek and extra supervision might be expensive for large-scale deployment. Generality – While TRU is effective for safety/copyright removal, it’s unclear how well it generalizes to factual correction or bias unlearning. Limited interpretability metrics – The paper claims explainability, but lacks quantitati
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling
