Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
Shuyao Xu, Cheng Peng, Jiangxuan Long, Weidi Xu, Wei Chu, Yuan Qi

TL;DR
This paper introduces a reinforcement distillation method that trains on both correct and incorrect teacher reasoning traces to improve LLM reasoning, reaching strong results with far less data (83.1% on MATH-500 from 131k traces).
Contribution
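It proposes the REDI objective and a two-stage training recipe (SFT on positive traces, then refinement on both positive and negative traces) that makes effective use of negative reasoning traces and outperforms established preference optimization methods such as DPO and SimPO.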
Findings
Qwen-REDI-1.5B achieves 83.1% on MATH-500 with only 131k traces.
The approach matches the performance of models trained on much larger proprietary datasets.
Utilizing negative traces significantly improves data efficiency in LLM distillation.
Abstract
Recent advances in model distillation show that data from advanced reasoning models can effectively train smaller student models. However, standard practices discard incorrect reasoning traces -- valuable, yet underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? We employ a two-stage training recipe: first, Supervised Fine-Tuning (SFT) on positive traces, followed by a refinement stage using both positive and negative traces. We find that a simple REINFORCE-style objective, which we term the Reinforcement Distillation (REDI) objective, outperforms established preference optimization methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate the effectiveness of this approach. Notably, our…
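The abstract names a "REINFORCE-style objective" but the excerpt above does not give its formula. The block below is therefore only a minimal sketch of a generic offline loss of that family: it raises the likelihood of a positive trace and lowers the likelihood of a negative one. The length normalization, the weight `alpha`, and both function names are assumptions made for illustration and are not taken from the paper.

```python
def mean_logprob(token_logprobs):
    """Length-normalized log-likelihood of one trace.

    token_logprobs: per-token log-probabilities the student assigns
    to the trace (hypothetical input format).
    """
    if not token_logprobs:
        raise ValueError("trace must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def reinforce_style_loss(pos_token_logprobs, neg_token_logprobs, alpha=0.5):
    """Illustrative offline loss over one (positive, negative) trace pair.

    Assumed form, not the paper's stated objective:
        loss = -mean_logprob(positive) + alpha * mean_logprob(negative)
    Minimizing it pushes the positive trace's likelihood up and the
    negative trace's likelihood down. alpha is a hypothetical weight on
    the negative term; alpha = 0 leaves only the SFT-style positive term.
    """
    pos_term = -mean_logprob(pos_token_logprobs)
    neg_term = alpha * mean_logprob(neg_token_logprobs)
    return pos_term + neg_term
```

In the two-stage recipe described above, a loss of this shape would apply only in the second stage, after the student has already been fine-tuned on positive traces alone. Setting `alpha=0` recovers that first-stage behavior for a single trace.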
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification
Methods: Shrink and Fine-Tune · Direct Preference Optimization
