Generative Distribution Distillation

Jiequan Cui; Beier Zhu; Qingshan Xu; Xiaogang Xu; Pengguang Chen; Xiaojuan Qi; Bei Yu; Hanwang Zhang; Richang Hong

arXiv:2507.14503·cs.LG·July 22, 2025

3 Reviews

TL;DR

This paper introduces Generative Distribution Distillation (GenDD), a framework that formulates knowledge distillation as a conditional generative problem. It addresses two challenges of a naive baseline, the curse of high-dimensional optimization and the lack of semantic supervision from labels, and supports the approach with a theoretical proof and empirical evaluation.

Contribution

The paper proposes GenDD with two components: Split Tokenization, which makes unsupervised knowledge distillation stable and effective, and Distribution Contraction, which integrates label supervision into the reconstruction objective for efficient supervised training.

Findings

01

Surpasses KL baseline by 16.29% on ImageNet validation

02

Achieves 82.28% top-1 accuracy with ResNet-50 in 600 epochs

03

Performs competitively in unsupervised settings
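For readers unfamiliar with the "KL baseline" named in the first finding, the following is a minimal sketch of the standard temperature-scaled KL-divergence distillation loss between teacher and student logits. It is not taken from the paper: the function names, the `temperature` parameter, and the `temperature ** 2` scaling follow the common convention and are assumptions here, and the paper's exact baseline configuration may differ.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over softened class distributions.

    Hypothetical helper for illustration only. The temperature ** 2 factor
    is the usual convention for keeping gradient magnitudes comparable
    across temperatures.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must have the same number of classes")
    log_p = log_softmax(teacher_logits, temperature)  # teacher distribution
    log_q = log_softmax(student_logits, temperature)  # student distribution
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return (temperature ** 2) * kl


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [1.5, 0.7, -0.8]
    print(kl_distillation_loss(teacher, student, temperature=4.0))
```

This loss matches the teacher's output point-wise for each input, which is the discriminative paradigm that the reviews contrast with GenDD's generative formulation.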

Abstract

In this paper, we formulate the knowledge distillation (KD) as a conditional generative problem and propose the Generative Distribution Distillation (GenDD) framework. A naive GenDD baseline encounters two major challenges: the curse of high-dimensional optimization and the lack of semantic supervision from labels. To address these issues, we introduce a Split Tokenization strategy, achieving stable and effective unsupervised KD. Additionally, we develop the Distribution Contraction technique to integrate label supervision into the reconstruction objective. Our theoretical proof demonstrates that GenDD with Distribution Contraction serves as a gradient-level surrogate for multi-task learning, realizing efficient supervised training without explicit classification loss on multi-step sampling image representations. To evaluate the…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 5

Strengths

1. Casting the point-wise discriminative knowledge distillation paradigm into a generative one is conceptually novel and well-motivated. 2. Experimental results show that the proposed method is effective and yields significant gains on different datasets and settings.

Weaknesses

1. Relevant work (e.g., MGD [1]) and recent strong KD baselines (e.g., LSKD [2], CRLD [3]) are neither introduced nor compared against. For example, Tables 2 and 5 present only outdated KD methods; recent strong baselines such as FCFD [4], LSKD [2], CRLD [3], and SDD [5] are absent. 2. The diffusion process is notorious for its time-consuming multi-step sampling. The 1000-step and 64-step sampling used in the training and inference of the proposed method could make it substantially

Reviewer 02 · Rating 2 · Confidence 5

Strengths

The method is novel and quite interesting to read. In contrast to prior works, which just introduce an additional alignment loss, they propose to model the problem as a generative task, where the reconstruction loss can be seen as a regularization to align with the teacher. This incorporation of label supervision is very natural. There is a good motivation for the SplitTok component and it is clearly ablated to show its importance. The ablation experiments are on ImageNet, which is really goo

Weaknesses

In the abstract: "With label supervision, our ResNet-50 achieves 82.28% top-1 accuracy on ImageNet in 600 epochs of training". This statement is misleading. [1] is a very well-known KD paper that also explicitly uses label supervision to train a ResNet-50 model and achieves an accuracy of 82.8%. If the authors wish to highlight this result, they should say state-of-the-art within this given training budget. Missing related work [2, 3, 4], which to my understanding do not have any extensive hype

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The experiments cover a wide range of models and datasets. 2. The writing is clear and easy to follow.

Weaknesses

1. The proof of Theorem 1 is questionable. A critical step of this proof is line 776, where the authors claim $W_y = c_y$. However, this claim is not close to the truth without any other assumptions, and the authors did **not** mention this assumption elsewhere in the main body of the paper. If this problem is not addressed, then the proof can be viewed as invalid. 2. The motivation of the proposed method is unclear and not well justified. Since the features of both teacher and student models
