Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation
Minsang Kim, Seung Jun Baek

TL;DR
This paper introduces Token-Selective Dual Knowledge Distillation (TSD-KD), a novel framework that enhances reasoning in smaller models by focusing on important tokens and combining indirect and direct feedback mechanisms, leading to state-of-the-art results.
Contribution
TSD-KD is a new student-centric distillation method that selectively distills tokens and uses indirect feedback, improving reasoning abilities of small models beyond existing methods.
Findings
Achieves up to 54.4% accuracy improvement on reasoning benchmarks.
Outperforms baseline and runner-up models significantly.
In some cases, student models surpass their teachers by up to 20.3%.
Abstract
Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the cost of generating Chain-of-Thought for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision, causing a distribution mismatch, especially in complex reasoning tasks. We propose Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for student-centric distillation. TSD-KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD-KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re-ranks those…
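The token-selective idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that "important" tokens are the positions where the student's next-token distribution has the highest entropy (the reviews below mention entropy-based gating), that a fixed fraction of positions is kept, and that the per-token loss is KL(teacher || student). The function names and the `fraction` parameter are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def select_high_entropy_positions(student_logits, fraction):
    """Assumed selection rule: keep the top `fraction` of positions by student entropy."""
    entropies = [entropy(softmax(row)) for row in student_logits]
    keep = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:keep])

def selective_kd_loss(student_logits, teacher_logits, fraction=0.5):
    """Average KL(teacher || student) over the selected positions only."""
    positions = select_high_entropy_positions(student_logits, fraction)
    losses = [
        kl_divergence(softmax(teacher_logits[i]), softmax(student_logits[i]))
        for i in positions
    ]
    return sum(losses) / len(losses), positions

if __name__ == "__main__":
    # Four token positions over a toy three-word vocabulary.
    student = [[5.0, 0.0, 0.0], [0.1, 0.0, 0.1], [3.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    teacher = [[5.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
    loss, positions = selective_kd_loss(student, teacher, fraction=0.5)
    print("supervised positions:", positions)
    print("selective KD loss:", loss)
```

In this toy input the student is confident at positions 0 and 2, so those positions receive no teacher supervision; only the two uncertain positions contribute to the loss.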
Peer Reviews
Decision: ICLR 2026 Poster
1. The core idea of Token-Selective Direct Distillation is well motivated. 2. The authors provide a comprehensive ablation study demonstrating the effect of each component and show strong empirical results over baselines.
W1: Hyperparameter Sensitivity: The framework relies on an extremely sensitive set of hyperparameters ($c$, $k$, $s$, $\beta$), as demonstrated by sharp performance drop-offs in the appendix analyses. This suggests the method is brittle and lacks practical generalizability. In Table 1, the authors also report only the performance from the best hyperparameter selections. I wonder how well this complex setup would transfer to new domains or tasks. W2: Conflict Between On-Policy Learning and Entr
1) TSD-KD achieves State-of-the-Art performance across 10 challenging reasoning benchmarks. Experimental results demonstrate its significant superiority over existing baseline methods across multiple tasks. 2) The student model, after training, even surpasses its teacher model on some reasoning tasks (with improvements up to 20.3%). This result strongly suggests the framework is not merely imitative but effectively promotes the student model in building its own, more generalizable reasoning log
1) The core insight of the paper—that "high-entropy/uncertain tokens are critical branching points in reasoning" and should be targeted for selective supervision—is not an original discovery. This phenomenon, which guides the model learning process, has been well-established in antecedent works (such as the RL-based methods by Wang et al. (2025) and Lei et al. (2025)). Therefore, the paper's contribution lies primarily in the engineering application and integration of this existing principle int
1. The use of student-generated candidates (preference-based indirect distillation) and selective, entropy-based token gating in direct distillation is thoughtfully motivated and distinguishes the framework from prior “teacher-forcing” approaches. The focus on letting the student “explain in its own words” resonates with cognitive insights and supports the central claim. 2. The explicit combination of indirect and direct knowledge distillation, each carefully limited to critical tokens, is well-po
1. The preference-based indirect distillation encourages the student to align with the teacher’s ranking on top-$k$ student candidates. However, this assumes that the student's beam search is likely to generate candidates close to the correct reasoning trace, which may not hold for weaker students or for highly ambiguous problems. 2. While tables and figures provide extensive quantitative results, the paper lacks qualitative or error analysis on the types of reasoning improvements the student m
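The preference-based indirect distillation discussed in the abstract and reviews above can be illustrated with a generic pairwise ranking loss. This is a sketch under stated assumptions, not the paper's objective: it assumes the teacher assigns a scalar score to each student-generated candidate, and that the student is trained with a logistic (Bradley-Terry style) loss so that its sequence log-probabilities follow the teacher's ordering. The function name and the `scale` parameter are hypothetical and are not claimed to correspond to any hyperparameter in the paper.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pairwise_ranking_loss(student_logprobs, teacher_scores, scale=1.0):
    """Mean of -log sigmoid(scale * (logp_winner - logp_loser)) over all candidate pairs.

    student_logprobs: the student's sequence log-probability for each candidate.
    teacher_scores:   assumed scalar teacher scores; higher means preferred.
    """
    if len(student_logprobs) != len(teacher_scores):
        raise ValueError("one teacher score is needed per candidate")
    if len(student_logprobs) < 2:
        raise ValueError("at least two candidates are needed to form a pair")
    # Candidate indices ordered from most to least preferred by the teacher.
    order = sorted(range(len(teacher_scores)),
                   key=lambda i: teacher_scores[i], reverse=True)
    losses = []
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            winner, loser = order[a], order[b]
            margin = scale * (student_logprobs[winner] - student_logprobs[loser])
            losses.append(softplus(-margin))  # equals -log(sigmoid(margin))
    return sum(losses) / len(losses)

if __name__ == "__main__":
    teacher_scores = [3.0, 2.0, 1.0]  # teacher prefers candidate 0, then 1, then 2
    agrees = pairwise_ranking_loss([-1.0, -2.0, -3.0], teacher_scores)
    disagrees = pairwise_ranking_loss([-3.0, -2.0, -1.0], teacher_scores)
    print("loss when student agrees with teacher ranking:", agrees)
    print("loss when student disagrees:", disagrees)
```

Because the candidates come from the student itself, a loss of this kind only reorders what the student can already produce, which is the dependence on candidate quality that the first weakness above points out.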
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
