DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher

Yisheng Zhong; Zhengbang Yang; Zhuangdi Zhu

arXiv:2601.21283·cs.LG·March 2, 2026

TL;DR

DUET is a distillation-based unlearning method for large language models that effectively removes undesirable knowledge while maintaining utility, offering higher efficiency and robustness against attacks compared to existing approaches.

Contribution

The paper introduces DUET, an unlearning approach that combines in-contextualized unlearning with distillation to improve the efficiency and robustness of LLM unlearning.

Findings

01. DUET outperforms existing methods in forgetting and utility preservation.

02. It is significantly more data-efficient than prior unlearning techniques.

03. Extensive evaluations show DUET's robustness against prompt removal and reverse engineering.

Abstract

LLM unlearning is a technique to remove the impact of undesirable knowledge from a model without retraining from scratch, which is indispensable for trustworthy AI. Existing unlearning methods face significant limitations: conventional tuning-based unlearning is computationally heavy and prone to catastrophic forgetting. In contrast, in-contextualized unlearning is lightweight and enables precise unlearning, but it is vulnerable to prompt removal and reverse engineering attacks. In response, we propose Distilled Unlearning from an Efficient Teacher (DUET), a novel distillation-based unlearning method that combines the merits of these two lines of work. It trains a student model to imitate the behavior of a prompt-steered teacher that effectively refuses undesirable knowledge generation while preserving general domain knowledge. Extensive evaluations on existing benchmarks with our enriched…
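The distillation idea in the abstract can be illustrated with a small sketch. It assumes that the teacher sees a steering prefix in its context while the student sees only the raw query, and that the student is trained to match the teacher's next-token distribution under a KL divergence. The abstract does not state the exact loss, so the objective, the toy logits, and the names `softmax`, `kl_divergence`, and `distillation_loss` are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_loss(teacher_logits, student_logits):
    """Assumed objective: pull the student's distribution toward the teacher's."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


# Toy 4-token vocabulary (hypothetical): index 0 is a refusal token,
# index 3 is a token that would reveal the knowledge to be forgotten.
teacher_logits = [4.0, 1.0, 0.5, -2.0]  # teacher: steering prefix + query
student_logits = [-1.0, 1.0, 0.5, 4.0]  # student: query only, before training

loss_before = distillation_loss(teacher_logits, student_logits)
loss_after = distillation_loss(teacher_logits, teacher_logits)  # perfect imitation
print(f"loss before: {loss_before:.4f}, loss after perfect imitation: {loss_after:.4f}")
```

Once the student matches the teacher without needing the prefix, the refusal behavior lives in the student's weights instead of its prompt, which is the property the abstract contrasts with in-contextualized unlearning.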

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The proposed approach is novel and sound. The paper reveals that training-based unlearning achieves stronger robustness but risks greater utility degradation, while contextualized unlearning enables more precise unlearning yet can be easily reversed. The proposed method strikes a good balance between these two paradigms.
2. The paper proposes top-k logits distillation to further enhance performance.
3. The paper is enjoyable to read.
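The top-k logits distillation mentioned in the strengths above can be sketched as follows. This is a minimal illustration under stated assumptions: the loss is a KL divergence, the support is the teacher's k highest-scoring tokens, and both distributions are renormalized over that support. The review does not specify these details, and the helper names `top_k_indices`, `softmax_over`, and `top_k_distillation_loss` are hypothetical.

```python
import math


def top_k_indices(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def softmax_over(logits, indices):
    """Softmax restricted to (and renormalized over) the given token indices."""
    m = max(logits[i] for i in indices)
    exps = {i: math.exp(logits[i] - m) for i in indices}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def top_k_distillation_loss(teacher_logits, student_logits, k):
    """Assumed loss: KL(teacher || student) over the teacher's top-k tokens only."""
    idx = top_k_indices(teacher_logits, k)
    p = softmax_over(teacher_logits, idx)
    q = softmax_over(student_logits, idx)
    return sum(p[i] * math.log(p[i] / q[i]) for i in idx if p[i] > 0)


# Toy logits over a hypothetical 4-token vocabulary.
teacher = [4.0, 1.0, 0.5, -2.0]
student_a = [0.0, 2.0, 0.5, 4.0]
student_b = [0.0, 2.0, 0.5, -9.0]  # differs from student_a only outside the top-2

loss_a = top_k_distillation_loss(teacher, student_a, k=2)
loss_b = top_k_distillation_loss(teacher, student_b, k=2)
print(f"top-2 loss, student_a: {loss_a:.4f}; student_b: {loss_b:.4f}")
```

Under this renormalized variant, tokens outside the teacher's top-k contribute nothing to the loss, so `student_a` and `student_b` receive the same value. A variant that keeps the remaining probability mass in a single "other" bucket would behave differently.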

Weaknesses

The authors should compare their method with more baselines, such as [1][2][3], as well as additional distillation-based approaches, for instance distillation from gradient ascent methods. How about distillation from multiple unlearning teachers?
[1] Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement, NeurIPS'24
[2] SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation, ICLR'24
[3] Towards Natural Mach

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The novelty of DUET lies in its distillation-based unlearning approach, where a student model learns from a prompt-steered teacher to selectively suppress undesirable knowledge while preserving useful information. This method combines efficiency, robustness, and effectiveness, outperforming existing unlearning techniques in both forgetting unwanted content and maintaining model utility. The research provides:
• Efficient unlearning without full retraining, saving computational resources.
• Ef

Weaknesses

The research demonstrates some shortcomings:
• Reliance on teacher quality: from my understanding, effectiveness depends on how well the teacher model suppresses unwanted knowledge.
• The work makes a valuable contribution and builds effectively on current advances. However, including a discussion of remaining challenges and possible avenues for future research would strengthen the paper and highlight its long-term potential.
• Limited evaluation: generalisation across LLMs is not shown.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. **Simple method and clear motivation.** The approach is conceptually straightforward and is well justified relative to tuning-based and purely in-context unlearning.
2. **Good empirical performance.** On MUSE-Books and WMDP, DUET achieves lower forgetting scores and stronger utility preservation than most baselines.
3. **Robustness to attack.** The distilled model is notably less sensitive to a reverse-prompt attack than a purely in-context teacher.

Weaknesses

1. **Distilling only the first decoding step is not fully convincing.** Many tasks (e.g., math reasoning) often start with stereotyped lead tokens (e.g., _“To solve the problem, I need to…”_), so aligning only the first-step logits may fail to shape downstream generation in a robust way. The paper explicitly trains on **first-position** logits only; a deeper justification and multi-step ablations would help.
2. **Limited experimental breadth.** The forget set and retention set used for training

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning