Language Model Distillation: A Temporal Difference Imitation Learning Perspective

Zishun Yu; Shangzhe Li; Xinhua Zhang

arXiv:2505.20335 · cs.CL · January 6, 2026

TL;DR

This paper presents a general framework for temporal difference-based language model distillation. The framework exploits the distributional sparsity of the teacher model to improve efficiency and performance.

Contribution

The paper introduces a general temporal difference-based distillation framework that exploits the sparsity of the teacher's distribution over the vocabulary, with the aim of more efficient and effective model compression.

Findings

01. Improved distillation performance using the proposed framework.

02. Demonstrated efficiency gains by operating on reduced action spaces.

03. Validated the approach with practical algorithms and experiments.
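
The "reduced action spaces" in finding 02 can be pictured with a small sketch. This is an illustration only, not the paper's algorithm: it assumes the reduced space is the smallest set of tokens whose teacher probability mass reaches a threshold p (a top-p style rule, which this page does not confirm). The function names, the threshold, and the 8-token toy vocabulary are all hypothetical.

```python
def top_p_action_set(teacher_probs, p=0.9):
    """Return indices of the smallest token set whose teacher mass reaches p.

    Assumption: a top-p style cut-off; the page does not state the exact rule.
    """
    # Sort token indices by teacher probability, highest first.
    order = sorted(range(len(teacher_probs)),
                   key=lambda i: teacher_probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += teacher_probs[i]
        if mass >= p:
            break
    return kept


def renormalize(probs, kept):
    """Restrict a distribution to the kept indices and rescale it to sum to 1."""
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Toy teacher distribution over a hypothetical 8-token vocabulary.
teacher = [0.55, 0.25, 0.12, 0.03, 0.02, 0.01, 0.01, 0.01]
actions = top_p_action_set(teacher, p=0.9)
reduced = renormalize(teacher, actions)
print(actions, reduced)
```

In this toy case three of the eight tokens already cover the chosen mass, so any per-token computation downstream would touch three actions instead of eight. With a real vocabulary of tens of thousands of tokens the saving could be much larger, but how large depends on the teacher and on the cut-off rule.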

Abstract

Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models…
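
The behavior cloning view mentioned in the abstract can be sketched as follows: the student is trained to match the teacher's next-token distribution at each position, with no look-ahead to later tokens. This is a minimal sketch under assumptions, not the paper's method: the forward KL divergence is one common choice that the abstract does not specify, and the function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def forward_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one position (assumed divergence choice)."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def behavior_cloning_loss(teacher_logits_seq, student_logits_seq):
    """Average per-position KL: each token is imitated independently."""
    per_token = [
        forward_kl(softmax(t), softmax(s))
        for t, s in zip(teacher_logits_seq, student_logits_seq)
    ]
    return sum(per_token) / len(per_token)


# Hypothetical logits for a 2-position sequence over a 4-token vocabulary.
teacher_logits = [[4.0, 1.0, 0.0, -2.0], [0.5, 3.0, 0.0, -1.0]]
student_logits = [[2.0, 2.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
loss = behavior_cloning_loss(teacher_logits, student_logits)
print(loss)
```

Temporal difference methods differ from this per-position matching in that they bootstrap from value estimates of later states; the sketch shows only the behavior cloning baseline that the abstract contrasts them with.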

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling