Knowledge Distillation with Training Wheels

Guanlin Liu; Anand Ramachandran; Tanmay Gangwani; Yan Fu; Abhinav Sethy

arXiv:2502.17717·cs.CL·February 26, 2025

Open Access

TL;DR

This paper introduces a generalized framework for knowledge distillation that allows models to learn from teachers during training and selectively seek help at test-time, improving performance and flexibility in language tasks.

Contribution

It formulates knowledge distillation as an entropy-regularized optimization problem and develops a new algorithm using Path Consistency Learning and constrained reinforcement learning for test-time assistance.
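As a rough illustration of the path-consistency idea named above, the sketch below checks the generic entropy-regularized consistency condition from the Path Consistency Learning literature on a toy one-step problem. It is not the paper's algorithm: the function names (`soft_value`, `soft_policy`, `path_consistency_residual`), the temperature `tau`, the discount `gamma`, and the reward numbers are all hypothetical choices made for this example.

```python
import math

def soft_value(rewards, tau):
    """Soft (entropy-regularized) optimal value of a one-step problem:
    V = tau * log(sum_a exp(r_a / tau)), computed stably."""
    m = max(rewards)
    return m + tau * math.log(sum(math.exp((r - m) / tau) for r in rewards))

def soft_policy(rewards, tau):
    """Matching soft-optimal policy: pi(a) = exp((r_a - V) / tau)."""
    v = soft_value(rewards, tau)
    return [math.exp((r - v) / tau) for r in rewards]

def path_consistency_residual(v_start, v_end, rewards, log_probs, tau, gamma=1.0):
    """Residual along a path of length d:
    -V(s_0) + gamma^d * V(s_d) + sum_j gamma^j * (r_j - tau * log pi(a_j | s_j)).
    It is zero on every path when V and pi are soft-optimal."""
    d = len(rewards)
    path_sum = sum(
        gamma ** j * (r - tau * lp)
        for j, (r, lp) in enumerate(zip(rewards, log_probs))
    )
    return -v_start + gamma ** d * v_end + path_sum

# Toy check with made-up rewards (assumption: a terminal state has value 0).
toy_rewards = [1.0, 0.5, -2.0]
tau = 0.7
v = soft_value(toy_rewards, tau)
pi = soft_policy(toy_rewards, tau)
residuals = [
    path_consistency_residual(v, 0.0, [r], [math.log(p)], tau)
    for r, p in zip(toy_rewards, pi)
]
print(residuals)  # each entry is zero up to floating-point error
```

In a learning setting, the squared residual would typically serve as a loss over sampled paths, which is what lets this family of methods use both on-policy and off-policy data. How the paper defines rewards and paths for distillation is not stated in this listing.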

Findings

01. Improved translation and summarization accuracy.

02. Enhanced control over teacher assistance during inference.

03. Unlocks new operating points beyond existing decoding methods.

Abstract

In generative language modeling, knowledge distillation trains a smaller student model with the help of a larger teacher model, improving the student's capabilities. In this paper, we formulate a more general framework for knowledge distillation in which the student learns from the teacher during training and also learns to ask for the teacher's help at test time, following rules that specify test-time restrictions. To this end, we first formulate knowledge distillation as an entropy-regularized value optimization problem. Adopting Path Consistency Learning to solve this problem leads to a new knowledge distillation algorithm that uses on-policy and off-policy demonstrations. Using constrained reinforcement learning, we extend this to a framework that incorporates the teacher model as a test-time reference, within constraints. In this situation, akin to a…

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Educational Games and Gamification

Methods: Knowledge Distillation