Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation
Sungmin Cha, Kyunghyun Cho

TL;DR
This paper explains how knowledge distillation improves generative models: it induces a precision-recall trade-off in which the student concentrates on high-quality samples at the expense of diversity. The explanation is validated in simulations and in large language models.
Contribution
It offers a minimal, general explanation of KD's effectiveness in generative models, highlighting the precision-recall trade-off mechanism.
Findings
Distillation induces a trade-off between precision and recall in generative models.
Higher teacher selectivity leads to more focused, high-quality samples.
The precision-recall dynamics are validated in large-scale language models.
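To make the trade-off in these findings concrete, here is a minimal sketch using 1-D Gaussian mixtures. It assumes one common proxy for the two quantities, which may differ from the paper's exact definitions: precision is the average log-likelihood of student samples under the ground truth, and recall is the average log-likelihood of ground-truth samples under the student. The distributions TRUTH, FOCUSED, and BROAD and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def log_mixture_pdf(x, weights, means, stds):
    """Log-density of a Gaussian mixture, computed with a stable log-sum-exp."""
    terms = [math.log(w) + log_normal_pdf(x, m, s)
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def sample_mixture(rng, n, weights, means, stds):
    """Draw n samples: pick a component by weight, then sample from it."""
    samples = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append(rng.gauss(means[k], stds[k]))
    return samples


def precision_recall(rng, truth, student, n=5000):
    """Assumed proxies: precision = E_student[log p_truth], recall = E_truth[log p_student]."""
    student_samples = sample_mixture(rng, n, *student)
    truth_samples = sample_mixture(rng, n, *truth)
    precision = sum(log_mixture_pdf(x, *truth) for x in student_samples) / n
    recall = sum(log_mixture_pdf(x, *student) for x in truth_samples) / n
    return precision, recall


# Hypothetical setup: a two-mode ground truth and two single-Gaussian students.
TRUTH = ([0.5, 0.5], [-3.0, 3.0], [1.0, 1.0])
FOCUSED = ([1.0], [3.0], [1.0])             # sits on one mode only
BROAD = ([1.0], [0.0], [math.sqrt(10.0)])   # spreads over both modes

rng = random.Random(0)
focused_pr = precision_recall(rng, TRUTH, FOCUSED)
broad_pr = precision_recall(rng, TRUTH, BROAD)
print("focused student (precision, recall):", focused_pr)
print("broad student   (precision, recall):", broad_pr)
```

Under these assumptions the focused student scores higher on precision and lower on recall than the broad one, which is the direction of the trade-off described above.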
Abstract
Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly large language models (LLMs). While its empirical benefits are well documented -- enabling smaller student models to emulate the performance of much larger teachers -- the underlying mechanisms by which KD improves generative quality remain poorly understood. In this work, we present a minimal working explanation of KD in generative modeling. Using a controlled simulation with mixtures of Gaussians, we demonstrate that distillation induces a trade-off between precision and recall in the student model. As the teacher distribution becomes more selective, the student concentrates more probability mass on high-likelihood regions at the expense of coverage, which is a behavior modulated by a single entropy-controlling parameter. We then validate this effect in a…
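The role of a single entropy-controlling parameter can be illustrated with a small sketch. The abstract does not say how the teacher is made more selective, so the code below assumes one simple mechanism: the teacher's mixture weights are raised to a power beta and renormalized. The name beta, the base weights, and the helper functions are all hypothetical and not taken from the paper.

```python
import math


def sharpen(weights, beta):
    """Assumed selectivity mechanism: raise weights to the power beta, then renormalize."""
    powered = [w ** beta for w in weights]
    total = sum(powered)
    return [p / total for p in powered]


def entropy(weights):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def mass_on_top(weights, k):
    """Total probability mass carried by the k largest components."""
    return sum(sorted(weights, reverse=True)[:k])


# Hypothetical mixture weights for a four-component teacher.
base_weights = [0.4, 0.3, 0.2, 0.1]

for beta in (1.0, 2.0, 4.0, 8.0):
    w = sharpen(base_weights, beta)
    print(f"beta={beta}: entropy={entropy(w):.3f}, top-1 mass={mass_on_top(w, 1):.3f}")
```

With beta = 1 the weights are unchanged. As beta grows, entropy falls and mass concentrates on the dominant component, so a student trained on such a teacher would see fewer samples from low-weight modes.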
Taxonomy
Topics: Topic Modeling · Machine Learning and Algorithms · Computational and Text Analysis Methods
