Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, Lei Yang

TL;DR
Phased DMD is a multi-step distillation method that improves the diversity, fidelity, and motion dynamics of generative models through progressive distribution matching within subintervals, addressing the limitations of one-step distillation.
Contribution
The paper proposes a novel phased distillation framework that combines progressive distribution matching with score matching within subintervals, enhancing model capacity and stability for complex generative tasks.
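The idea of working "within subintervals" can be illustrated with a minimal sketch. It assumes a normalized time range split into equal-width phases, with each phase handled by its own stage of training; the page does not say how the paper chooses its boundaries, so the equal-width split and the helper names `split_into_phases` and `phase_index` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def split_into_phases(t_start: float, t_end: float, num_phases: int) -> List[Tuple[float, float]]:
    """Split [t_start, t_end] into num_phases equal-width subintervals.

    Equal widths are an assumption made for illustration only.
    """
    if num_phases < 1:
        raise ValueError("num_phases must be at least 1")
    width = (t_end - t_start) / num_phases
    edges = [t_start + i * width for i in range(num_phases)] + [t_end]
    return list(zip(edges[:-1], edges[1:]))


def phase_index(t: float, phases: List[Tuple[float, float]]) -> int:
    """Return the index of the subinterval containing t.

    Intervals are treated as half-open [a, b), except the last one,
    which also includes its right endpoint.
    """
    lower, upper = phases[0][0], phases[-1][1]
    if not lower <= t <= upper:
        raise ValueError("t lies outside the covered range")
    interior_edges = [b for _, b in phases[:-1]]
    return bisect_right(interior_edges, t)


phases = split_into_phases(0.0, 1.0, 4)
print(phases)
print(phase_index(0.6, phases))
```

In a phased setup, an index like the one returned by `phase_index` could select which phase (or expert) is responsible for a given noise level, so that each one only has to match the distribution inside its own subinterval.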
Findings
Enhanced motion dynamics in video generation
Improved visual fidelity in image generation
Increased diversity of generated outputs
Abstract
Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. Yet, the limited capacity of one-step distilled models compromises generative diversity and degrades performance in complex generative tasks, e.g., generating intricate object motions in the text-to-video task. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generative diversity in text-to-image generation and slows motion dynamics in video generation, reducing performance to the level of one-step models. To address these limitations, we propose Phased DMD, a…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1) The proposed method demonstrates increased generation quality and sample diversity compared to the baselines; 2) Applicability of the method is demonstrated in high-dimensional settings corresponding to the state-of-the-art image and video models.
### Positioning and contributions
1) First, the score matching within subintervals, proposed as one of the key contributions, is not novel. It is a straightforward consequence of the general score identity $\nabla_y \log p_{Y}(y) = \int \nabla_y \log p_{Y | X}(y | x) p_{X | Y}(x | y) d x$, discussed in depth in e.g. [1]. A similar formulation was applied to subsequent discrete timesteps in DSB [2]. The exact same subinterval formulation as in Phased DMD was applied in e.g. [3];
2) I think th
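For reference, the score identity cited in this review follows in a few lines from differentiating the marginal density. The derivation below is a sketch that assumes $p_Y(y) = \int p_{Y | X}(y | x)\, p_X(x)\, dx$ and enough regularity to exchange the gradient and the integral; the symbol $p_X$ for the density of $x$ is introduced here and does not appear in the review.

$$
\begin{aligned}
\nabla_y \log p_Y(y)
&= \frac{\nabla_y p_Y(y)}{p_Y(y)}
 = \frac{1}{p_Y(y)} \int \nabla_y p_{Y | X}(y | x)\, p_X(x)\, dx \\
&= \int \nabla_y \log p_{Y | X}(y | x)\, \frac{p_{Y | X}(y | x)\, p_X(x)}{p_Y(y)}\, dx \\
&= \int \nabla_y \log p_{Y | X}(y | x)\, p_{X | Y}(x | y)\, dx
\end{aligned}
$$

The second line uses $\nabla_y p_{Y | X} = p_{Y | X}\, \nabla_y \log p_{Y | X}$, and the last line applies Bayes' rule.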
- The phased, SNR-subinterval approach provides a theoretically grounded extension to DMD.
- This research improves the diversity of image and video distillation based on the extensive experimental results.
- The experimental results are tested with large models, which proves the scalability.
- From my perspective, the paper is easy to read and follow as the mathematical formulas are neat.
Although the research improves the diversity of distilled results, as claimed, the paper still contains several drawbacks.
- The paper claims that the proposed approach can improve the diversity of generation. However, the theoretical connection between phase-based SNR learning is weak. The improved results might come from the larger capacity of a mixture of experts, which can store more information, but not phase learning.
- Although the paper claims that the huge computation cost and large am
- While the idea of progressive diffusion distillation under various criteria has been explored in previous studies such as [1, 2], the specific idea, splitting the SNR range into sub‑intervals to perform progressive multi‑step DMD coupled with MoE, is simple, novel, and interesting.
- In addition, Section 2.3.2 derives an objective that theoretically enables score distillation within sub‑intervals, which is another strength.
- Although the experimental evaluation is not comprehensive, the mai
Although the proposed method is new and interesting, the major weakness of this work as a scientific paper is the insufficient experimental evaluation. For instance, diversity is evaluated only for image generation (without video generation), and the metrics (DINOv3 and LPIPS) used do not seem to be standard for image generation. For video generation, the evaluation is limited to optical flow, dynamic degree, and screenshots of generated samples. To comprehensively assess the effectiveness of t
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation
