KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning
Hongling Xu, Qi Zhu, Heyuan Deng, Jinpeng Li, Lu Hou, Yasheng Wang, Lifeng Shang, Ruifeng Xu, Fei Mi

TL;DR
KDRL introduces a unified post-training framework combining knowledge distillation and reinforcement learning to enhance reasoning capabilities of large language models, achieving better performance and efficiency on reasoning benchmarks.
Contribution
This work presents the first unified framework that jointly optimizes reasoning LLMs through teacher supervision and self-exploration, integrating KD and RL with a systematic analysis of their interactions.
Findings
KDRL outperforms existing methods on multiple reasoning benchmarks.
The unified approach balances reasoning performance and token efficiency.
Different KL approximations and reward strategies significantly influence training dynamics.
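The last finding can be made concrete with a small sketch. This summary does not say which KL approximations the paper compares, so the code below assumes the three per-sample estimators commonly labeled k1, k2, and k3 in RL practice. It applies them to the reverse KL, KL(student || teacher), with samples drawn from the student. All function names are hypothetical and the distributions are toy values.

```python
import math
import random


def kl_estimators(logp_student, logp_teacher):
    """Per-sample estimates of KL(student || teacher) for one token sampled from the student.

    Assumption: these are the k1/k2/k3 estimators common in RL practice,
    not necessarily the ones studied in the paper.
    """
    log_ratio = logp_teacher - logp_student  # log r, with r = p_teacher / p_student
    k1 = -log_ratio                          # unbiased, can be negative
    k2 = 0.5 * log_ratio ** 2                # biased, always >= 0
    k3 = math.expm1(log_ratio) - log_ratio   # (r - 1) - log r: unbiased, always >= 0
    return k1, k2, k3


def exact_reverse_kl(student, teacher):
    """Exact KL(student || teacher) for two categorical distributions."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


def monte_carlo_kl(student, teacher, n_samples=20000, seed=0):
    """Average each estimator over tokens sampled from the student distribution."""
    rng = random.Random(seed)
    tokens = rng.choices(range(len(student)), weights=student, k=n_samples)
    totals = [0.0, 0.0, 0.0]
    for tok in tokens:
        estimates = kl_estimators(math.log(student[tok]), math.log(teacher[tok]))
        for i, value in enumerate(estimates):
            totals[i] += value
    return [t / n_samples for t in totals]


if __name__ == "__main__":
    student = [0.5, 0.3, 0.2]  # toy next-token distributions (illustrative only)
    teacher = [0.4, 0.4, 0.2]
    print("exact:", exact_reverse_kl(student, teacher))
    print("k1, k2, k3:", monte_carlo_kl(student, teacher))
```

The estimators trade bias against variance and sign behavior, which is one plausible reason the choice would affect training dynamics: k1 can go negative on individual samples, while k2 and k3 never do.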
Abstract
Recent advances in large language model (LLM) post-training have leveraged two distinct paradigms to enhance reasoning capabilities: reinforcement learning (RL) and knowledge distillation (KD). While RL enables the emergence of complex reasoning behaviors, it often suffers from low sample efficiency when the initial policy struggles to explore high-reward trajectories. Conversely, KD improves learning efficiency via mimicking the teacher model but tends to generalize poorly to out-of-domain scenarios. In this work, we present KDRL, a unified post-training framework that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). Specifically, KDRL leverages policy gradient optimization to simultaneously minimize the reverse Kullback-Leibler divergence (RKL) between the student and teacher distributions while maximizing the expected…
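One way to picture the joint objective is as reward shaping. The abstract is cut off before it names the quantity being maximized, so this sketch assumes a scalar task reward per sampled response, as is usual in RL post-training. Under that assumption, maximizing E[R(y)] - beta * KL(student || teacher) with a score-function policy gradient amounts to weighting each sampled response by R(y) - beta * (log pi_student(y) - log pi_teacher(y)). The coefficient `beta` and every function name here are hypothetical. This is not the authors' implementation.

```python
def shaped_returns(rewards, student_logps, teacher_logps, beta=0.1):
    """Combine a task reward with a reverse-KL penalty toward the teacher.

    rewards:        one scalar per sampled response (assumed reward signal)
    student_logps:  per-token log-probs of each response under the student
    teacher_logps:  per-token log-probs of the same tokens under the teacher
    beta:           hypothetical weight on the KD term
    """
    shaped = []
    for reward, s_lp, t_lp in zip(rewards, student_logps, teacher_logps):
        # log pi_student(y) - log pi_teacher(y), summed over the response tokens
        seq_log_ratio = sum(s - t for s, t in zip(s_lp, t_lp))
        shaped.append(reward - beta * seq_log_ratio)
    return shaped


def centered_advantages(values):
    """Subtract the batch mean as a simple variance-reducing baseline."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


if __name__ == "__main__":
    rewards = [1.0, 0.0]                       # e.g. correct vs. incorrect answer
    student_logps = [[-0.5, -0.5], [-1.0, -1.0]]
    teacher_logps = [[-0.5, -0.5], [-3.0, -3.0]]
    shaped = shaped_returns(rewards, student_logps, teacher_logps, beta=0.1)
    print(shaped)
    print(centered_advantages(shaped))
```

In this toy batch the second response is both unrewarded and unlikely under the teacher, so the KD term pushes its weight further down. The first response matches the teacher exactly and keeps its raw reward.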
Taxonomy
Topics: Artificial Intelligence in Law
Methods: Knowledge Distillation
