Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation

Sungmin Cha; Kyunghyun Cho

arXiv:2505.13111·cs.LG·January 16, 2026

TL;DR

This paper explains how knowledge distillation improves generative models: it induces a precision-recall trade-off in which the student concentrates on high-quality samples at the expense of diversity. The explanation is validated in simulations and in large language models.

Contribution

The paper offers a minimal, general explanation of why KD is effective in generative models, identifying the precision-recall trade-off as the underlying mechanism.

Findings

01 Distillation induces a trade-off between precision and recall in generative models.

02 Higher teacher selectivity leads to more focused, high-quality samples.

03 The precision-recall dynamics are validated in large-scale language models.

Abstract

Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly large language models (LLMs). While its empirical benefits are well documented -- enabling smaller student models to emulate the performance of much larger teachers -- the underlying mechanisms by which KD improves generative quality remain poorly understood. In this work, we present a minimal working explanation of KD in generative modeling. Using a controlled simulation with mixtures of Gaussians, we demonstrate that distillation induces a trade-off between precision and recall in the student model. As the teacher distribution becomes more selective, the student concentrates more probability mass on high-likelihood regions at the expense of coverage, a behavior modulated by a single entropy-controlling parameter. We then validate this effect in a…
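
The trade-off described in the abstract can be illustrated with a small numerical sketch. It is not the paper's simulation. It makes three assumptions: the mixture is reduced to its mode weights only, the teacher is made more selective by raising those weights to a power `beta` and renormalizing, and the student matches the selective teacher exactly. The names `base`, `sharpen`, `precision_proxy`, and `recall_proxy` are hypothetical, and the two proxies are expected log-probabilities rather than the precision and recall metrics used by the authors.

```python
import numpy as np

def sharpen(weights, beta):
    """Assumed selectivity rule: raise mode weights to the power beta, renormalize."""
    w = np.asarray(weights, dtype=float) ** beta
    return w / w.sum()

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution with positive entries."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p)).sum())

def precision_proxy(reference, student):
    """Expected reference log-probability of a mode drawn from the student."""
    return float((student * np.log(reference)).sum())

def recall_proxy(reference, student):
    """Expected student log-probability of a mode drawn from the reference."""
    return float((reference * np.log(student)).sum())

# Hypothetical mode weights of a four-component mixture (the reference distribution).
base = np.array([0.5, 0.3, 0.15, 0.05])

rows = []
for beta in [1.0, 2.0, 4.0]:
    # Assumption: an ideal student reproduces the selective teacher exactly.
    student = sharpen(base, beta)
    rows.append((beta,
                 entropy(student),
                 precision_proxy(base, student),
                 recall_proxy(base, student)))

for beta, h, prec, rec in rows:
    print(f"beta={beta:.1f}  entropy={h:.3f}  precision={prec:.3f}  recall={rec:.3f}")
```

Under these assumptions, increasing `beta` lowers the student's entropy, raises the precision proxy, and lowers the recall proxy: the student places more mass on the modes the reference considers most likely and less on the rare ones.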

Videos

Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation · SlidesLive

Taxonomy

Topics: Topic Modeling · Machine Learning and Algorithms · Computational and Text Analysis Methods