KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning

Hongling Xu; Qi Zhu; Heyuan Deng; Jinpeng Li; Lu Hou; Yasheng Wang; Lifeng Shang; Ruifeng Xu; Fei Mi

arXiv:2506.02208·cs.LG·June 4, 2025

Open Access

TL;DR

KDRL introduces a unified post-training framework combining knowledge distillation and reinforcement learning to enhance reasoning capabilities of large language models, achieving better performance and efficiency on reasoning benchmarks.

Contribution

This work presents the first unified framework that jointly optimizes reasoning LLMs through teacher supervision and self-exploration, integrating KD and RL with a systematic analysis of their interactions.

Findings

01. KDRL outperforms existing methods on multiple reasoning benchmarks.

02. The unified approach balances reasoning performance and token efficiency.

03. Different KL approximations and reward strategies significantly influence training dynamics.

Abstract

Recent advances in large language model (LLM) post-training have leveraged two distinct paradigms to enhance reasoning capabilities: reinforcement learning (RL) and knowledge distillation (KD). While RL enables the emergence of complex reasoning behaviors, it often suffers from low sample efficiency when the initial policy struggles to explore high-reward trajectories. Conversely, KD improves learning efficiency via mimicking the teacher model but tends to generalize poorly to out-of-domain scenarios. In this work, we present KDRL, a unified post-training framework that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). Specifically, KDRL leverages policy gradient optimization to simultaneously minimize the reverse Kullback-Leibler divergence (RKL) between the student and teacher distributions while maximizing the expected…
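
The abstract is cut off before it states how the two terms are combined, so the following is only a toy sketch of the general idea, not the paper's formulation. It assumes a three-way categorical "policy", a hypothetical weight `beta` on the reverse-KL term, and the commonly used k1/k2/k3 sample-based KL estimators as stand-ins for the "KL approximations" mentioned in the findings. All names (`combined_objective`, `sampled_reverse_kl`, `beta`) are illustrative.

```python
import math
import random


def exact_reverse_kl(student, teacher):
    """Exact KL(student || teacher) for two categorical distributions."""
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(student, teacher)
        if p > 0
    )


def sampled_reverse_kl(student, teacher, n_samples=20000, seed=0):
    """Monte Carlo estimates of KL(student || teacher) from student samples.

    Returns the means of three common per-sample estimators:
    k1 = -log r, k2 = 0.5 * (log r)^2, k3 = (r - 1) - log r,
    where r = teacher(a) / student(a). Using these three here is an assumption.
    """
    rng = random.Random(seed)
    actions = rng.choices(range(len(student)), weights=student, k=n_samples)
    k1 = k2 = k3 = 0.0
    for a in actions:
        log_ratio = math.log(teacher[a]) - math.log(student[a])
        k1 += -log_ratio
        k2 += 0.5 * log_ratio ** 2
        k3 += math.exp(log_ratio) - 1.0 - log_ratio
    return k1 / n_samples, k2 / n_samples, k3 / n_samples


def combined_objective(student, teacher, rewards, beta):
    """Expected reward minus beta times the reverse KL to the teacher.

    The additive form and the weight beta are assumptions for illustration.
    """
    expected_reward = sum(p * r for p, r in zip(student, rewards))
    return expected_reward - beta * exact_reverse_kl(student, teacher)


# Toy distributions over three candidate answers; only the first is rewarded.
student = [0.5, 0.3, 0.2]
teacher = [0.7, 0.2, 0.1]
rewards = [1.0, 0.0, 0.0]

exact = exact_reverse_kl(student, teacher)
k1, k2, k3 = sampled_reverse_kl(student, teacher)
objective = combined_objective(student, teacher, rewards, beta=0.1)
```

In this toy setting, k1 and k3 are unbiased estimates of the exact reverse KL, while k2 is biased but always non-negative; differences of this kind are one plausible reason the choice of approximation could affect training dynamics.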


Taxonomy

Topics: Artificial Intelligence in Law

Methods: Knowledge Distillation