Online Knowledge Distillation with Reward Guidance

Chen Jia

arXiv:2505.18952·cs.LG·May 27, 2025

TL;DR

This paper introduces a reward-guided imitation learning framework for sequential knowledge distillation of large language models, optimizing preference alignment through a min-max formulation and extending to white-box scenarios.

Contribution

The paper proposes a reward-guided KD framework, supported by theoretical analysis, that treats distillation as a preference optimization problem and covers offline, online, and white-box settings.

Findings

01 Effective preference alignment demonstrated in experiments

02 Framework achieves near-optimal performance in KD tasks

03 Theoretical guarantees support empirical results

Abstract

This work studies knowledge distillation (KD) for large language models (LLMs) through preference optimization. We propose a reward-guided imitation learning framework for sequential KD, formulating a min-max optimization problem between the policy and reward model (RM) to minimize the performance gap between the student and teacher policies. Specifically, the reward optimization is constrained to achieve near-optimality within a confidence set for preference alignment. For preference data construction, we explore both offline and online preference-based KD. Additionally, we reformulate the RM using the $Q$-value function and extend the framework to white-box KD, where the teacher policy's predicted probabilities are accessible. Theoretical analysis and empirical results demonstrate the effectiveness of the proposed framework.
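
The abstract describes the min-max problem only in words. As a rough sketch of what such an objective could look like, and not the paper's own equation, one possible reading is given below. All notation here is assumed for illustration: $\pi$ is the student policy, $\pi_T$ the teacher policy, $r$ the reward model, $\mathcal{R}$ the confidence set of reward models consistent with the preference data, $x$ a prompt drawn from a dataset $\mathcal{D}$, and $y$ a response.

```latex
% Hypothetical notation; the paper's actual objective may differ.
% Inner max: the reward model in the confidence set that most separates teacher from student.
% Outer min: the student policy that closes that worst-case performance gap.
\min_{\pi} \; \max_{r \in \mathcal{R}} \;
\Big(
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_T(\cdot \mid x)} \big[ r(x, y) \big]
  -
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big]
\Big)
```

Under this reading, restricting $r$ to $\mathcal{R}$ keeps the adversarial reward consistent with the observed preferences, so the student is only penalized for gaps that a plausible reward model could detect.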

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Optimization and Search Problems · Computability, Logic, AI Algorithms

Methods: Knowledge Distillation · Sparse Evolutionary Training