Multi-student Diffusion Distillation for Better One-step Generators

Yanke Song; Jonathan Lorraine; Weili Nie; Karsten Kreis; James Lucas

arXiv:2410.23274·cs.LG·December 4, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces Multi-Student Diffusion Distillation (MSD), a framework that distills a teacher diffusion model into multiple smaller, faster single-step generators, improving quality and inference speed for image generation.

Contribution

MSD enables training multiple smaller student generators from a teacher diffusion model, enhancing generation quality and speed over single-student approaches.

Findings

01

MSD achieves competitive FID scores with faster inference.

02

Multiple students outperform single-student baselines.

03

MSD provides a lightweight quality boost.

Abstract

Diffusion models achieve high-quality sample generation at the cost of a lengthy multistep inference procedure. To overcome this, diffusion distillation techniques produce student generators capable of matching or surpassing the teacher in a single step. However, the student model's inference speed is limited by the size of the teacher architecture, preventing real-time generation for computationally heavy applications. In this work, we introduce Multi-Student Distillation (MSD), a framework to distill a conditional teacher diffusion model into multiple single-step generators. Each student generator is responsible for a subset of the conditioning data, thereby obtaining higher generation quality for the same capacity. MSD trains multiple distilled students, allowing smaller sizes and, therefore, faster inference. Also, MSD offers a lightweight quality boost over single-student…
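The routing idea in the abstract, where each student is responsible for a subset of the conditioning data, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the conditions are integer class labels, the partition is a simple contiguous split, and the names `partition_conditions`, `build_router`, and `generate` are hypothetical. The students are toy callables standing in for single-step generators; the paper's actual partitioning scheme and training procedure are not reproduced.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A toy "student": maps (condition, noise) to a sample in one call.
# Real students would be distilled single-step generator networks.
Student = Callable[[int, float], Tuple[int, int, float]]


def partition_conditions(conditions: Sequence[int], num_students: int) -> List[List[int]]:
    """Split conditions into contiguous, near-equal blocks.

    Assumption: a contiguous split over sorted labels. Any other
    partitioning rule could be substituted here.
    """
    ordered = sorted(conditions)
    base, extra = divmod(len(ordered), num_students)
    blocks: List[List[int]] = []
    start = 0
    for k in range(num_students):
        size = base + (1 if k < extra else 0)  # first `extra` blocks get one more
        blocks.append(ordered[start:start + size])
        start += size
    return blocks


def build_router(blocks: List[List[int]]) -> Dict[int, int]:
    """Map each condition to the index of the student that owns it."""
    return {c: k for k, block in enumerate(blocks) for c in block}


def make_toy_student(student_id: int) -> Student:
    """Hypothetical stand-in for a generator; it only tags its output with its id."""
    def student(condition: int, noise: float) -> Tuple[int, int, float]:
        return (student_id, condition, noise)
    return student


def generate(condition: int, noise: float,
             students: List[Student], router: Dict[int, int]) -> Tuple[int, int, float]:
    """Single-step generation: exactly one student is evaluated per sample."""
    return students[router[condition]](condition, noise)


if __name__ == "__main__":
    blocks = partition_conditions(range(10), num_students=4)
    router = build_router(blocks)
    students = [make_toy_student(k) for k in range(len(blocks))]
    print(blocks)
    print(generate(7, 0.5, students, router))
```

Because only the student that owns a condition is evaluated, the per-sample inference cost in this sketch is that of one student, regardless of how many students exist.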

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

The paper is easy to read, and its setting is easy to understand: it focuses on domain-specific students, each handling a partition of the dataset. Using different objectives and a better initialization for distillation makes sense, and the resulting effectiveness is demonstrated empirically. The paper also shows that finetuning with adversarial training further improves the quality of the distilled model, which makes sense.

Weaknesses

Currently, this work lacks strong motivation or useful analysis. There are previous works like eDiff that specialize different diffusion models per timestep, and also works exploring MoE for efficient inference; w.r.t. efficiency as motivation, more effective pruning, efficient architectures, caching across timesteps, etc. have been proposed to achieve smaller models and/or lower latency. This work explores splitting the student into multiple models w.r.t. the dataset; while that is practical, this work does

Reviewer 02 · Rating 8 · Confidence 5

Strengths

1. The paper is well written and presented. I enjoyed reading the paper. Though MoE is not a new idea, using it for distillation is new, and further using it to accelerate inference is commendable. 2. The idea of using multiple students for distillation to trade off inference time against quality is quite intuitive. Moreover, assigning a student to a subset of conditions is a smart choice to increase the capacity of the overall model. 3. The authors solve the obvious problem with the above choice - initialization from

Weaknesses

1. The paper focuses exclusively on DMD (Distribution Matching Distillation) and its extension ADA, which limits the demonstration of the method's generality. While the authors acknowledge this limitation, can they demonstrate preliminary results with other distillation approaches, particularly Consistency Distillation [1-3], on simple datasets like a mixture of Gaussians? Such experiments would better establish MSD's generality beyond DMD/ADA. 2. There is insufficient clarity regarding the

Reviewer 03 · Rating 3 · Confidence 4

Strengths

Overall, this paper is well-written and easy to follow, with relatively new comparison methods.

Weaknesses

1. The authors state in Line 257 that 'Conditions within each partition should be more semantically similar than those in other partitions, so networks require less capacity to achieve a set quality on their partition.' However, there are no experiments presented to support this claim. I believe that implementing this idea is challenging and will demand additional computational resources. I recommend including relevant experiments and source code to facilitate a comprehensive review. 2. The sta
