RLKD: Distilling LLMs' Reasoning via Reinforcement Learning
Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng

TL;DR
This paper introduces RLKD, a reinforcement learning framework that distills authentic multi-branch reasoning structures from teacher LLMs into student models, surpassing traditional supervised fine-tuning methods.
Contribution
RLKD employs a novel Generative Structure Reward Model to align reasoning structures, enabling effective distillation of complex reasoning paths via reinforcement learning.
Findings
RLKD outperforms standard SFT-RL pipelines.
RLKD distills reasoning structure effectively with only 0.1% of the data.
Student models reach greater reasoning potential than under SFT-based distillation.
Abstract
Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher's reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning…
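To make the meta-reasoning/solving distinction concrete, the following is a minimal sketch of how a reasoning path might be represented as (meta-reasoning, solving) steps and scored for structural agreement with a teacher. Everything in it is an illustrative assumption: the `Step` type, the `toy_structure_reward` function, and the exact-match prefix scoring are hypothetical and are not the paper's Generative Structure Reward Model, whose details are not given in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One reasoning step (hypothetical representation, not from the paper)."""
    meta: str   # meta-reasoning: which sub-problem was selected
    solve: str  # solving: how that sub-problem was addressed


def toy_structure_reward(teacher: List[Step], student: List[Step]) -> float:
    """Toy prefix-match reward over meta-reasoning choices.

    Counts how many leading steps of the student select the same
    sub-problem as the teacher, stopping at the first divergence,
    and normalizes by the teacher's path length. Exact string
    matching is a deliberate simplification for illustration.
    """
    if not teacher:
        return 0.0
    matched = 0
    for t_step, s_step in zip(teacher, student):
        if t_step.meta != s_step.meta:
            break  # the student branched away from the teacher's structure
        matched += 1
    return matched / len(teacher)


teacher_path = [
    Step("define variables", "let x be the unknown length"),
    Step("set up equation", "2x + 3 = 11"),
    Step("solve equation", "x = 4"),
]
student_path = [
    Step("define variables", "call the length x"),
    Step("set up equation", "2x + 3 = 11"),
    Step("guess and check", "try x = 5"),
]

reward = toy_structure_reward(teacher_path, student_path)
```

Here the student follows the teacher's first two sub-problem choices and diverges on the third, so the toy reward is 2/3 even though the wording of the solving text differs. A token-level SFT loss would instead penalize the differing wording in step one and could not separate that from the structural divergence in step three.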
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Multimodal Machine Learning Applications
