TL;DR
This paper provides a theoretical analysis of classifier-free guidance in masked diffusion models, revealing how guidance timing affects sample quality and proposing a simple, effective modification to improve generation results.
Contribution
It offers a low-dimensional theoretical insight into CFG effects, identifies issues in current implementations, and introduces a straightforward guidance mechanism that enhances sample quality.
Findings
Early guidance harms generation quality when inputs are heavily masked.
Late-stage guidance improves sample quality.
A simple code change, a one-line normalization in the CFG implementation, corrects imbalanced unmasking and improves guidance effectiveness.
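The first two findings can be illustrated with a guidance schedule that keeps guidance weak while the input is heavily masked and strengthens it as tokens are revealed. The following is a minimal sketch, not the paper's schedule: the function name `guidance_weight`, the linear ramp, and the parameter `w_max` are all assumptions made for illustration.

```python
def guidance_weight(mask_fraction, w_max):
    """Hypothetical linear guidance schedule (illustrative only).

    mask_fraction: share of positions still masked, in [0, 1].
    w_max: guidance strength reached when nothing is masked.
    Returns 0 when fully masked (early sampling) and w_max when
    fully unmasked (late sampling).
    """
    if not 0.0 <= mask_fraction <= 1.0:
        raise ValueError("mask_fraction must lie in [0, 1]")
    return w_max * (1.0 - mask_fraction)


# Example: guidance weights over a 5-step sampler that unmasks evenly.
steps = 5
weights = [guidance_weight(1.0 - k / (steps - 1), w_max=3.0) for k in range(steps)]
```

Other monotone ramps (cosine, step functions) would fit the same finding; the linear form is chosen here only for readability.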
Abstract
Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and its extensions to discrete diffusion have recently started to be investigated. In order to improve the algorithms in a principled way, this paper starts by analyzing the exact effect of CFG in the context of a low-dimensional masked diffusion model, with a special emphasis on the guidance schedule. Our analysis shows that high guidance early in sampling (when inputs are heavily masked) harms generation quality, while late-stage guidance improves it. These findings provide a theoretical explanation for empirical observations in recent studies on guidance schedules. The analysis also reveals an imperfection of the current CFG implementations. These implementations can unintentionally cause imbalanced transitions, such as unmasking too rapidly…
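For readers unfamiliar with CFG on discrete tokens, the sketch below shows the commonly used form, in which conditional and unconditional logits are combined and then renormalized with a softmax over the vocabulary. It is an illustration under stated assumptions, not the paper's implementation or its proposed fix: the function names and the `(1 + w)` weighting convention are assumptions, and some codebases parameterize the guidance strength differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cfg_token_probs(cond_logits, uncond_logits, w):
    """Guided token distribution for one masked position.

    Assumed convention: guided logits = (1 + w) * cond - w * uncond,
    i.e. p_guided is proportional to p_cond^(1 + w) / p_uncond^w.
    w = 0 recovers the plain conditional distribution.
    """
    guided = [(1.0 + w) * c - w * u for c, u in zip(cond_logits, uncond_logits)]
    return softmax(guided)


# Toy vocabulary of three tokens; the condition favors token 0.
cond = [2.0, 0.0, 0.0]
uncond = [0.0, 0.0, 0.0]
plain = cfg_token_probs(cond, uncond, w=0.0)
guided = cfg_token_probs(cond, uncond, w=2.0)
```

In this toy case, raising `w` concentrates more probability on the token that the condition favors relative to the unconditional model.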
Peer Reviews
Decision·ICLR 2026 Poster
The paper presents a principled low-dimensional analysis of CFG for masked discrete diffusion, showing that strong early guidance is harmful while late guidance is beneficial. Building on this insight, it introduces an elegant, theory-grounded tweak—a one-line column/softmax normalization—that corrects imbalanced early unmasking. Importantly, by linking this tractable normalization to improved robustness and better FID/ImageReward/MATH-500 results, the paper offers a practical change likely to b
1. The paper develops analysis and closed-form expressions only in 1–2 dimensions for masked discrete diffusion. As a result, guarantees for realistic high-dimensional CTMCs remain implicit, leaving the theoretical treatment somewhat loose.
2. Some results isolate the mechanism using a simple sampler without remasking and with fixed step counts (e.g., 50 steps on ImageNet-256), which may limit generality. Moreover, sampling schedules and samplers are crucial to implementing the guidance mechani
1. CFG is well researched in continuous diffusion, but its use in masked/discrete diffusion is still under active exploration. Clarifying scheduling effects is valuable for both image inpainting/masked modeling and text infilling models.
2. Section 3.4 provides exhaustive analysis across factors such as time parameters and guidance strength.
3. The experiment in Section 2.3 is intuitive, improving explainability.
4. The method can be applied at inference time.
1. **Novelty:** The importance of guidance scheduling and rescaling by conditional/unconditional norms has been reported by Kynkäänniemi et al. (2024). Can you clarify how your approach differs?
2. **Metrics/Benchmarks:** For text-to-image evaluation, you only report ImageReward. Could you also evaluate with **HPSv2** to check for aesthetic trade-offs? Additionally, please test on T2I benchmarks like **GenEval** and **T2I-CompBench**.

[1] Kynkäänniemi, T., Aittala, M., Karras, T., Laine, S., Ai
- Since guidance in discrete diffusion models is underexplored, this paper bridges an important gap between CFG in continuous diffusion models and discrete diffusion models. The results have potential implications for all systems that rely on discrete diffusion.
- The proposed method is simple and can be easily integrated into existing sampling pipelines.
- The theoretical results provide intuition and reasoning behind the method, although their presentation could be significantly improved.
- In my opinion, the main weakness of the paper is its presentation. Several parameters are either misused in notation or not defined prior to their introduction in the text. This makes the paper difficult to follow and obscures the intuition and analysis behind the proposed method.
- Section 3.4 appears to contradict the main message of the paper. It states that “effective schedules have higher guidance in the beginning and middle phases of generation,” whereas the best performance is reported
