Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield
Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, Steven Hoi

TL;DR
This paper reexamines diffusion model distillation, revealing that CFG Augmentation, not distribution matching, is the main driver of few-step performance, leading to improved methods and practical applications.
Contribution
It uncovers the primary role of CFG Augmentation in diffusion distillation, decouples it from distribution matching, and proposes principled modifications for better performance.
Findings
CFG Augmentation is the core engine of distillation.
Distribution Matching acts mainly as a regularizer.
Decoupling noise schedules improves performance.
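The third finding can be pictured with a small sketch: each of the two terms is evaluated on its own re-noised copy of the student sample, at its own noise level, and the results are summed. Everything here is an illustrative assumption rather than the authors' implementation: the linear-interpolation noising in `add_noise`, the toy callables `toy_dm` and `toy_ca`, the function name `decoupled_update_direction`, and the two sampling ranges are all hypothetical.

```python
import numpy as np

def add_noise(x, tau, rng):
    """Toy forward process (assumed): blend x with Gaussian noise at level tau in [0, 1]."""
    return (1.0 - tau) * x + tau * rng.normal(size=x.shape)

def decoupled_update_direction(x, dm_term, ca_term, tau_dm, tau_ca, rng):
    """Evaluate the DM and CA terms at separately chosen noise levels and sum them."""
    x_dm = add_noise(x, tau_dm, rng)  # re-noised sample seen by the DM term
    x_ca = add_noise(x, tau_ca, rng)  # independently re-noised sample seen by the CA term
    return ca_term(x_ca, tau_ca) + dm_term(x_dm, tau_dm)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))  # stand-in for a batch of student outputs

# Hypothetical stand-ins for the two terms; real ones would query score networks.
toy_dm = lambda z, tau: -z
toy_ca = lambda z, tau: 0.5 * np.ones_like(z)

tau_dm = rng.uniform(0.0, 1.0)  # assumed range for the DM term
tau_ca = rng.uniform(0.5, 1.0)  # assumed, different range for the CA term
direction = decoupled_update_direction(x, toy_dm, toy_ca, tau_dm, tau_ca, rng)
```

A coupled schedule is the special case where `tau_dm` and `tau_ca` are the same draw; the sketch only shows where the two levels enter, not which ranges work best.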
Abstract
Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. Among these, Distribution Matching Distillation (DMD) and its variants stand out for their impressive performance, which is widely attributed to their core mechanism of matching the student's output distribution to that of a pre-trained teacher model. In this work, we challenge this conventional understanding. Through a rigorous decomposition of the DMD training objective, we reveal that in complex tasks like text-to-image generation, where CFG is typically required for desirable few-step performance, the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We demonstrate that this term acts as the core "engine" of distillation, while the Distribution Matching (DM) term…
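The decomposition mentioned in the abstract can be checked numerically under one common set of assumptions. The sketch below assumes the DMD update direction is the difference between a CFG-guided teacher score and the student's (fake) conditional score, and that CFG takes the form `uncond + w * (cond - uncond)`. Under those assumptions the direction splits exactly into a distribution-matching part and a part proportional to `w - 1`. The function names, the guidance scale `w`, and the random arrays standing in for score-network outputs are hypothetical, and the paper's own notation may differ.

```python
import numpy as np

def cfg_score(cond, uncond, w):
    """Classifier-free guidance (assumed form): extrapolate from uncond toward cond."""
    return uncond + w * (cond - uncond)

def decompose_dmd_direction(real_cond, real_uncond, fake_cond, w):
    """Split (CFG teacher score - student score) into DM and CA parts."""
    dm = real_cond - fake_cond                   # distribution-matching term
    ca = (w - 1.0) * (real_cond - real_uncond)   # CFG-augmentation term
    return dm, ca

rng = np.random.default_rng(0)
# Random arrays standing in for score-network outputs on a batch of 4 samples.
real_cond, real_uncond, fake_cond = rng.normal(size=(3, 4, 8))
w = 4.0  # hypothetical guidance scale

dm, ca = decompose_dmd_direction(real_cond, real_uncond, fake_cond, w)
full = cfg_score(real_cond, real_uncond, w) - fake_cond

# The two parts sum to the full CFG-guided direction.
assert np.allclose(full, dm + ca)
```

With `w = 1` the CA part vanishes and only the DM part remains, which is one way to see why the two terms can be studied separately.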
Peer Reviews
Decision · ICLR 2026 Poster
* Provides a timely and insightful analysis of the functional roles of DMD’s two loss terms, addressing the open question of why DMD excels in few-step or one-step generation.
* The authors design careful and hypothesis-driven experiments to isolate and test the contribution of each loss term, leading to well-supported conclusions.
* Based on these insights, the paper proposes using distinct $\tau$ values for the two terms, leading to measurable performance gains.
Most experiments rely primarily on qualitative evaluation (visual inspection of generated images). While visualization is valuable for illustrating effects, heavy reliance on qualitative judgments risks confirmation bias—highlighting supportive examples while overlooking contradictory ones. A more scientifically rigorous approach would involve defining quantitative metrics and validating observations across the entire test set, to ensure statistical robustness and reproducibility.
1. This paper identifies a discrepancy between theory and practice in DMD, namely that CFG is used only in the teacher model and not in the student model. This is an interesting observation and a natural motivation for this important research topic.
2. The decomposition of the DMD loss into the DM and CA terms provides novel and valuable insights toward a better and more principled understanding of the underlying mechanism of DMD.
3. The arguments and hypotheses in the paper are supported by extensive experimen
Overall, I like the paper very much. My only concern is the paper's claim about the CA term being the engine for DMD, which is a bit strong to me. For example, early DMD papers achieved great distillation performance on unconditional generation for CIFAR images, which is not discussed or explored in this paper.
I really like this research topic and believe distribution-matching distillation is under-explored. A divergence perspective alone cannot explain why it works or why it fails in some scenarios, so I think the topic of this paper is very valuable. The experiments are also sound and support the argument.
My major concern with this paper is that I found the conclusion a little too conclusive. The argument is that CFG Augmentation is the engine for distillation, and Distribution Matching is the regularizer for stability. However, many CIFAR experiments do not use label conditioning and still achieve one-step distillation, e.g. the original Diff-Instruct paper or a more recent paper: https://arxiv.org/pdf/2502.08005. In this case, the only driving engine is the distribution matching term, which couldn't
Code & Models
- 🤗 Tongyi-MAI/Z-Image-Turbo · model · 824k dl · ♡ 4375
- 🤗 unsloth/Z-Image-Turbo-GGUF · model · 39k dl · ♡ 120
- 🤗 unsloth/Z-Image-Turbo-unsloth-bnb-4bit · model · 439 dl · ♡ 5
- 🤗 not-pegasus/IMAGE_MODAL · model · 5 dl
- 🤗 tsqn/Z-Image-Turbo_fp32-fp16-bf16_full_and_ema-only · model · 609 dl · ♡ 12
- 🤗 kp-forks/Z-Image-Turbo · model · 1 dl
- 🤗 tsqn/Z-Image-Turbo_fp32-fp16-bf16_comfyui · model · 1.2k dl · ♡ 4
- 🤗 tsqn/Z-Image-Turbo_GGUF · model · 196 dl · ♡ 1
- 🤗 tsqn/Z-Image-Turbo_fp8_comfyui · model · 602 dl · ♡ 3
- 🤗 srcphag/Z-Image-Turbo · model · 2 dl
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
