Explainable LLM Unlearning Through Reasoning

Junfeng Liao; Qizhou Wang; Shanshan Ye; Xin Yu; Ling Chen; Zhen Fang

arXiv:2603.09980 · cs.LG · March 12, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a reasoning-based unlearning method for large language models that improves the precision and reliability of removing specific knowledge while maintaining overall capabilities and robustness.

Contribution

It proposes a novel reasoning-based unlearning target and a targeted reasoning unlearning (TRU) method that enhances unlearning accuracy and robustness in LLMs.

Findings

01. TRU achieves more reliable unlearning compared to baselines.

02. TRU better preserves general capabilities of LLMs.

03. TRU shows superior robustness under attack scenarios.

Abstract

LLM unlearning is essential for mitigating safety, copyright, and privacy concerns in pre-trained large language models (LLMs). Compared to preference alignment, it offers a more explicit approach: it removes undesirable knowledge characterized by specific unlearning datasets. In previous work, gradient ascent (GA) and its variants have shown promise for implementing unlearning, yet their untargeted nature results in unintended degradation of general capabilities, incomplete removal of knowledge, and incoherent responses, among other problems. We argue that these issues stem from the absence of explicit guidance on what and how models should unlearn. To fill this gap, we introduce a novel unlearning target, the reasoning-based unlearning target, which satisfies both the specified unlearning scope and the specified post-unlearning response. Building on this, we propose targeted…
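To make the "untargeted" criticism of gradient ascent concrete, here is a minimal toy sketch in pure Python. It is not the paper's implementation: the four-entry logit vector, the learning rate, and all function names are assumptions chosen for illustration. The sketch ascends the negative log-likelihood of a single "forget" token and shows that the objective only pushes probability away from that token, without saying where the mass should go.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits, target):
    """Negative log-likelihood of the target index."""
    return -math.log(softmax(logits)[target])

def nll_grad(logits, target):
    """Gradient of the NLL with respect to the logits: softmax minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def gradient_ascent_step(logits, forget_target, lr=0.5):
    """One GA step: move the logits in the direction that INCREASES the NLL."""
    grad = nll_grad(logits, forget_target)
    return [x + lr * g for x, g in zip(logits, grad)]

# Hypothetical toy "model": logits over a 4-token vocabulary.
logits = [2.0, 0.5, 0.1, -1.0]
forget_target = 0  # index of the token to be unlearned (assumption)

loss_before = nll(logits, forget_target)
for _ in range(20):
    logits = gradient_ascent_step(logits, forget_target)
loss_after = nll(logits, forget_target)
probs_after = softmax(logits)
```

After the loop, the probability of the forget token has dropped, but the remaining mass is spread over the other tokens according to their initial logits rather than according to any specified post-unlearning response. In this toy setting, that is the gap a targeted objective is meant to close.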

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

1) The paper articulates two concrete failure modes (scope and response control) and motivates why prior GA-style methods fail. The case studies are persuasive.
2) Combining supervised training on reasoning+refusal traces with a GA loss is conceptually straightforward yet addresses both criteria (scope + response). The objective and algorithm are easy to implement.
3) The paper ablates the GA component, the target loss, and the reasoning traces to show the role of each piece (Table 2 - Table
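The combination Reviewer 01 describes in point 2, supervised training on reasoning+refusal traces plus a GA loss, can be sketched as a single scalar objective. This is a hedged illustration rather than the paper's actual loss: the additive form, the `ga_weight` parameter, the function names, and the per-token probabilities below are all assumptions.

```python
import math

def mean_nll(token_probs):
    """Average negative log-likelihood over a sequence of token probabilities."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def combined_unlearning_loss(target_trace_probs, forget_answer_probs, ga_weight=1.0):
    """Hypothetical combined objective (to be minimized).

    supervised_term: standard NLL on the reasoning+refusal target trace.
    ga_term: NEGATED NLL on the original forget answer, so minimizing the
             total performs gradient ascent on the forget data.
    """
    supervised_term = mean_nll(target_trace_probs)
    ga_term = -mean_nll(forget_answer_probs)
    return supervised_term + ga_weight * ga_term

# Made-up per-token probabilities under a model before and after unlearning.
loss_before = combined_unlearning_loss(
    target_trace_probs=[0.1, 0.2, 0.1],   # refusal trace is unlikely
    forget_answer_probs=[0.9, 0.8, 0.9],  # original answer is likely
)
loss_after = combined_unlearning_loss(
    target_trace_probs=[0.8, 0.9, 0.7],     # refusal trace is likely
    forget_answer_probs=[0.05, 0.1, 0.02],  # original answer is unlikely
)
```

In this sketch the supervised term supplies the "response" criterion and the negated term supplies the "scope" criterion. The negated term is unbounded below, so a practical variant would likely need a bound or a small weight; how the paper handles this is not stated in the excerpt above.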

Weaknesses

1) TRU relies on reasoning traces and refusal responses generated by a reasoning LLM (DeepSeek) to define what to learn to refuse. This raises important questions:
1-a) If the same or a closely related large reasoning model produced the targets and is used in evaluation (via LLM-as-a-Judge (LaaJ) or shared model families), the method may partly be learning the style/behavior of that external model rather than an independent notion of refusal. The paper needs to discuss whether (and how) target-generation LLMs a

Reviewer 02 · Rating 2 · Confidence 5

Strengths

The paper is well-written and easy to read.
The problem raised in the paper about in-scope unlearning failure is of significant importance for LLM unlearning.
Finally, they perform comprehensive experiments with different unlearning benchmarks and a wide range of unlearning methods.

Weaknesses

Please address the following major concerns I have:
- Kindly describe how you converted responses from a reasoning model (which should contain <think> tokens) to non-reasoning-based models like Zephyr-7B-beta. Do you ignore the think tokens and concatenate the reasoning trace with the refusal?
- If so, do you have any intuition as to why the reasoning is important? As we see in the ablation study in Table 2, there seems to be a tradeoff between UQ and RQ with and without reasoning. Thus it se
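The option Reviewer 02 asks about in the first concern, dropping the think tokens and concatenating the reasoning trace with the refusal, could look like the sketch below. The excerpt does not confirm that the authors do this. The `<think>`/`</think>` tag convention, the `build_target` helper, and the blank-line separator are assumptions made for illustration.

```python
import re

# Assumed tag convention for the reasoning model's output.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def build_target(raw_response):
    """Turn a tagged reasoning-model response into a plain-text training target.

    The tags are dropped; the reasoning text and the final refusal are kept
    and joined with a blank line. Responses without tags pass through as-is.
    """
    match = THINK_BLOCK.search(raw_response)
    if match is None:
        return raw_response.strip()
    reasoning = match.group(1).strip()
    refusal = raw_response[match.end():].strip()
    return f"{reasoning}\n\n{refusal}"

# Hypothetical example response.
raw = "<think>The question asks about X.</think>\nI cannot help with that."
target = build_target(raw)
```

A target built this way contains no special tokens, so it could in principle be used to fine-tune a model whose tokenizer has no think tokens. Whether the reasoning text helps or hurts in that setting is exactly the tradeoff the reviewer raises.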

Reviewer 03 · Rating 6 · Confidence 4

Strengths

Originality – Proposes reasoning-guided unlearning, a novel perspective that goes beyond optimization tricks.
Technical soundness – The combination of supervised reasoning loss with GA-based unlearning is simple yet effective.
Comprehensive experiments – Evaluated on three benchmarks and multiple backbones (Llama-2/3, Zephyr).
Explainability and robustness – Demonstrates interpretable refusals, cross-lingual generalization, and resilience to jailbreak/relearning.
Strong ablation and analysis

Weaknesses

Limited human evaluation – The reliance on LLM-as-a-Judge may inherit biases; some human validation would strengthen the claims.
Computational cost – The reasoning-target generation via DeepSeek and extra supervision might be expensive for large-scale deployment.
Generality – While TRU is effective for safety/copyright removal, it’s unclear how well it generalizes to factual correction or bias unlearning.
Limited interpretability metrics – The paper claims explainability, but lacks quantitati

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling