Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?

Yudi Zhang; Lu Wang; Meng Fang; Yali Du; Chenghua Huang; Jun Wang; Qingwei Lin; Mykola Pechenizkiy; Dongmei Zhang; Saravan Rajmohan; Qi Zhang

arXiv:2502.19557·cs.CL·February 28, 2025

Open Access

TL;DR

This paper introduces a novel distillation method that transfers both data and reward signals from large language models, enabling smaller models to outperform larger ones through self-supervised reward learning and reinforcement learning.

Contribution

The paper proposes a distillation pipeline that combines reward transfer with self-supervised reward learning, lifting the performance of smaller language models beyond that of their teachers.

Findings

01 Smaller models surpass larger teacher models on benchmarks.

02 Self-supervised reward transfer enhances model performance.

03 The method reduces reliance on external reward supervision.

Abstract

Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT). However, this approach neglects the potential to distill both data (output content) and reward signals (quality evaluations). Extracting reliable reward signals directly from teacher models is challenging, as LLMs are optimized for generation rather than evaluation, often resulting in biased or inconsistent assessments. To address this limitation, we propose a novel distillation pipeline that transfers both responses and rewards. Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses, enabling reward learning without explicit external evaluation. The reward model subsequently guides reinforcement learning (RL), allowing iterative refinement of the student…
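The abstract is truncated here and does not spell out how the pseudo-rewards are constructed. As a rough illustration only, the sketch below assumes one plausible reading: for each prompt, the teacher's response is treated as preferred over the student's, and a reward model is trained on those pairs with a standard Bradley-Terry pairwise loss. The names `PreferencePair`, `build_pseudo_pairs`, and `bradley_terry_loss` are hypothetical, and the pairing rule is an assumption rather than the authors' confirmed method.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # ASSUMPTION: teacher response is treated as higher quality
    rejected: str  # ASSUMPTION: student response is treated as lower quality


def build_pseudo_pairs(prompts, teacher_responses, student_responses):
    """Build pseudo-labelled preference pairs without an external judge.

    Pairs where teacher and student give the same text are skipped,
    because they carry no preference signal.
    """
    pairs = []
    for prompt, teacher, student in zip(prompts, teacher_responses, student_responses):
        if teacher.strip() == student.strip():
            continue
        pairs.append(PreferencePair(prompt, teacher, student))
    return pairs


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pairs = build_pseudo_pairs(
        ["What is 2 + 3?", "Name a prime number."],
        ["2 + 3 = 5.", "7"],
        ["The answer is 6.", "7"],
    )
    print(len(pairs), "pair(s) kept")
    # The loss shrinks as the reward model scores the chosen response higher.
    print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 0.0))
```

In a full pipeline, a reward model trained this way would then score student samples during the RL stage the abstract mentions. That stage is omitted here because the source gives no details about it.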

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Multimodal Machine Learning Applications

Methods: Shrink and Fine-Tune