Reinforcement-aware Knowledge Distillation for LLM Reasoning
Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto

TL;DR
This paper introduces RLAD, an RL-aware distillation method that performs selective imitation during RL training: a smaller student LLM is guided toward the teacher only in selected cases, improving its reasoning while balancing exploration and imitation.
Contribution
The paper proposes RLAD with TRRD, a distillation approach that applies teacher guidance during the student's RL training, addressing the distribution mismatch and objective interference that arise when SFT-oriented KD is combined with RL.
Findings
RLAD outperforms offline distillation and standard methods on reasoning benchmarks.
TRRD provides advantage-aware, trust-region-bounded distillation.
The approach effectively balances exploration, exploitation, and imitation during training.
Abstract
Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the…
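The objective interference the abstract attributes to KL-regularized distillation can be illustrated with a toy single-token example. The sketch below is not RLAD or TRRD; it is an assumed, simplified form of the baseline being criticized: a policy-gradient surrogate plus a teacher-student KL penalty weighted by a hypothetical coefficient `beta`. The function names, the reverse-KL direction, and the logits are all illustrative choices, not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_regularized_loss(student_logits, teacher_logits, action, advantage, beta):
    """Toy baseline loss: policy-gradient term + beta * KL(student || teacher).

    Assumption: reverse KL and a single sampled token; real systems sum
    over tokens and rollouts and may use a different KL direction.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    # Reward term: raise the log-probability of a positively advantaged action.
    pg_loss = -advantage * math.log(student[action])
    # Imitation term: pull the whole student distribution toward the teacher.
    kd_loss = kl_divergence(student, teacher)
    return pg_loss + beta * kd_loss, pg_loss, kd_loss


if __name__ == "__main__":
    # The student's rollout chose token 0 and was rewarded,
    # but the teacher prefers token 1, so the two terms disagree.
    student_logits = [2.0, 0.0, -1.0]
    teacher_logits = [0.0, 2.0, -1.0]
    for beta in (0.0, 0.1, 1.0):
        total, pg, kd = kl_regularized_loss(
            student_logits, teacher_logits, action=0, advantage=1.0, beta=beta
        )
        print(f"beta={beta}: total={total:.4f} pg={pg:.4f} kd={kd:.4f}")
```

In this setup the reward term favors token 0 while the KL term favors the teacher's token 1, so the outcome depends on `beta`. This is the kind of loss balancing the abstract says these approaches require.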
