Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning

Shuyao Xu; Cheng Peng; Jiangxuan Long; Weidi Xu; Wei Chu; Yuan Qi

arXiv:2505.24850 · cs.LG · December 16, 2025

TL;DR

This paper introduces Reinforcement Distillation (REDI), an offline method that trains a student model on both correct and incorrect reasoning traces from a teacher model. Qwen-REDI-1.5B reaches 83.1% on MATH-500 using only 131k traces.

Contribution

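The paper proposes the REDI objective and a two-stage training recipe: Supervised Fine-Tuning (SFT) on positive traces, followed by a refinement stage on both positive and negative traces. In this distillation setting, REDI outperforms established preference optimization methods such as DPO and SimPO.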

Findings

01. Qwen-REDI-1.5B achieves 83.1% on MATH-500 with only 131k traces.

02. The approach matches the performance of models trained on much larger proprietary datasets.

03. Utilizing negative traces significantly improves data efficiency in LLM distillation.

Abstract

Recent advances in model distillation show that data from advanced reasoning models can effectively train smaller student models. However, standard practices discard incorrect reasoning traces -- valuable, yet underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? We employ a two-stage training recipe: first, Supervised Fine-Tuning (SFT) on positive traces, followed by a refinement stage using both positive and negative traces. We find that a simple REINFORCE-style objective, which we term the Reinforcement Distillation (REDI) objective, outperforms established preference optimization methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate the effectiveness of this approach. Notably, our…
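The abstract describes the REDI objective only as "a simple REINFORCE-style objective" over positive and negative traces and does not give its formula here. As an illustration of what such an objective can look like, the following is a minimal sketch, not the authors' implementation: it raises the average token log-probability of a correct trace and lowers that of an incorrect trace. The function names, the weight `alpha` on the negative term, and the per-token length normalization are all assumptions made for this example.

```python
from typing import List


def mean_logprob(token_logprobs: List[float]) -> float:
    """Average per-token log-probability of one trace (length normalization is an assumption)."""
    if not token_logprobs:
        raise ValueError("trace must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def reinforce_style_loss(
    pos_logprobs: List[float],
    neg_logprobs: List[float],
    alpha: float = 1.0,  # hypothetical weight on the negative-trace term
) -> float:
    """Illustrative loss for one (positive, negative) trace pair.

    Minimizing it increases the likelihood of the positive trace and
    decreases the likelihood of the negative trace, scaled by alpha.
    """
    positive_term = -mean_logprob(pos_logprobs)
    negative_term = alpha * mean_logprob(neg_logprobs)
    return positive_term + negative_term


if __name__ == "__main__":
    # Toy per-token log-probabilities under the student model (made-up numbers).
    pos = [-0.5, -0.5]
    neg = [-2.0, -2.0]
    print(reinforce_style_loss(pos, neg, alpha=0.5))
```

With `alpha = 0` the sketch reduces to a length-normalized SFT loss on the positive trace alone, which mirrors the first stage of the recipe described in the abstract; how the paper actually weights the negative term is not stated on this page.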

Code & Models

Repositories

tim-siu/reinforcement-distillation (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification

Methods: Shrink and Fine-Tune · Direct Preference Optimization