No Training, No Problem: Rethinking Classifier-Free Guidance for Diffusion Models
Seyedmorteza Sadat, Manuel Kansy, Otmar Hilliges, Romann M. Weber

TL;DR
This paper introduces independent condition guidance (ICG) and time-step guidance (TSG), novel methods that enhance diffusion model performance without additional training or modifications, applicable to both conditional and unconditional models.
Contribution
The paper proposes ICG and TSG, new guidance techniques that eliminate the need for special training procedures and extend CFG to unconditional models, simplifying diffusion model training and inference.
Findings
ICG matches standard CFG performance across various models.
TSG improves generation quality without conditional information.
Both guidance techniques are easy to implement and have the same sampling cost as CFG.
Abstract
Classifier-free guidance (CFG) has become the standard method for enhancing the quality of conditional diffusion models. However, employing CFG requires either training an unconditional model alongside the main diffusion model or modifying the training procedure by periodically inserting a null condition. There is also no clear extension of CFG to unconditional models. In this paper, we revisit the core principles of CFG and introduce a new method, independent condition guidance (ICG), which provides the benefits of CFG without the need for any special training procedures. Our approach streamlines the training process of conditional diffusion models and can also be applied during inference on any pre-trained conditional model. Additionally, by leveraging the time-step information encoded in all diffusion networks, we propose an extension of CFG, called time-step guidance (TSG), which…
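As a rough illustration of the ICG idea described in the abstract, the sketch below replaces the unconditional branch of CFG with the same conditional model evaluated on a condition drawn independently of the input. It is a minimal sketch, not the authors' implementation: the function names, the `model(x_t, t, y)` call signature, the uniform sampling over class labels, and the guidance-scale convention (`w = 1` meaning no guidance) are all assumptions made for illustration.

```python
import numpy as np

def guided_combine(eps_cond, eps_ref, w):
    """CFG-style combination: reference prediction plus w times the difference."""
    return eps_ref + w * (eps_cond - eps_ref)

def icg_guided_prediction(model, x_t, t, y, num_classes, w, rng):
    """One guided prediction where a random condition stands in for the null condition.

    `model` is a hypothetical conditional network called as model(x_t, t, y).
    """
    # Draw y_hat independently of x_t (assumption: uniform over class labels).
    y_hat = rng.integers(0, num_classes, size=y.shape)
    eps_cond = model(x_t, t, y)      # prediction for the true condition
    eps_ref = model(x_t, t, y_hat)   # used in place of the unconditional prediction
    return guided_combine(eps_cond, eps_ref, w)
```

Both branches call the same network, so the cost per sampling step is two forward passes, which matches the cost of standard CFG as stated in the findings.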
Peer Reviews
Decision: ICLR 2025 Poster
The paper is well-written and the authors perform extensive experiments and provide algorithmic and experimental details to reproduce their experiments. **Experiments** The experiments show that ICG and CFG perform identically and ICG “simulates the behavior of CFG across several conditional models” (line 249). They also run a number of ablations, for instance, they compare using an independent Gaussian vector versus sampling $y \sim p(y)$ in ICG, providing stronger experimental evidence.
The main question left unanswered by this work is whether a marginal model, either trained separately or with a null token, is necessary for a sampling method like CFG: 1. An analysis of the variance of the estimate $s_\theta(x_t \mid y)$ is missing from the paper. 2. An analysis of the impact of the variance on sampling is also missing from the paper. 3. Is there a trade-off between the slower convergence of training with a null token with estimation of the marginal model? See the questions s
1. The paper is generally well written and easy to follow. 2. The introduction and related works sections provided a good justification for the limitations of CFG and the benefits of solutions like ICG and TSG. There is certainly a potential for this paper to provide a meaningful contribution to the field of diffusion-guidance, if my concerns about the theoretical claims and empirical validation are adequately addressed by the authors. 3. The experiments are for the most part appropriate and the
1. A central claim is that ICG provides the same benefits of CFG without the need for additional training requirements. This hinges on the validity of the theoretical claim in Section 4, which avoids the need for a potentially expensive marginalization over classes. However, the theoretical justification appears insufficient. It seems incorrect to state that sampling a class label $\hat{y}$ independently from $z_t$ at time-step $t$ implies the equivalence $\nabla_{z_t} \log p_t(\hat{y} \mid z_t)
1. The idea of ICG is novel and theoretically simple to implement. ICG provides a new perspective on CFG. 2. The presentation is good and the paper is easy to follow. 3. Extensive experiments show consistent effectiveness of the proposed two methods
1. My major concern lies in the proposed TSG: - the authors explain TSG from its connection to Langevin dynamics, but the connection is built on a Taylor approximation where $\tilde{t} = t + \Delta t$, and $\Delta t$ should be sufficiently small. But in practice, the authors use $\Delta t = s t^\alpha n$ where $n \sim N(0, I)$, $s=2$, $\alpha=1$ and $t$ is the time-step, therefore I do not think $\Delta t$ is sufficiently small anymore. - I suspect the effectiveness of TSG comes from the error correct
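The time-step perturbation quoted in the last review, $\tilde{t} = t + \Delta t$ with $\Delta t = s t^\alpha n$, can be sketched as follows. This is a hedged illustration only: the function names, the `model(x_t, t)` signature, and the CFG-like combination of the clean and perturbed predictions are assumptions. This summary does not specify whether the perturbation is applied to the scalar time step or to its embedding, or whether the result is clipped.

```python
import numpy as np

def perturb_timestep(t, s, alpha, rng):
    """Return t_tilde = t + s * t**alpha * n, with n drawn from a standard normal."""
    n = rng.standard_normal(np.shape(t))
    return t + s * t**alpha * n

def tsg_guided_prediction(model, x_t, t, w, s, alpha, rng):
    """Guide the clean-time-step prediction away from the perturbed one.

    `model` is a hypothetical network called as model(x_t, t); no condition is used.
    """
    t_tilde = perturb_timestep(t, s, alpha, rng)
    eps_clean = model(x_t, t)        # prediction at the true time step
    eps_pert = model(x_t, t_tilde)   # prediction at the perturbed time step
    # Assumed CFG-like form: perturbed prediction plays the role of the reference.
    return eps_pert + w * (eps_clean - eps_pert)
```

With $s = 2$ and $\alpha = 1$, the standard deviation of $\Delta t$ is $2t$, which is larger than $t$ itself. This makes the reviewer's point concrete: the perturbation is not small relative to the time step, and without clipping $\tilde{t}$ can fall outside the valid range.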
Taxonomy
Topics: Neural Networks and Applications
Methods: Diffusion
