DICE: Distilling Classifier-Free Guidance into Text Embeddings
Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen, Siwei Lyu

TL;DR
DICE replaces classifier-free guidance in text-to-image diffusion models with refined text embeddings, reducing computational cost while maintaining high-quality, well-aligned image generation.
Contribution
DICE distills CFG into a CFG-free model by sharpening text embeddings, enabling faster sampling without sacrificing image quality.
Findings
Maintains image quality comparable to CFG-based models
Halves the computational cost of sampling relative to CFG
Effective across multiple diffusion model variants
Abstract
Text-to-image diffusion models are capable of generating high-quality images, but suboptimal pre-trained text representations often result in these images failing to align closely with the given text prompts. Classifier-free guidance (CFG) is a popular and effective technique for improving text-image alignment in the generative process. However, CFG introduces significant computational overhead. In this paper, we present DIstilling CFG by sharpening text Embeddings (DICE) that replaces CFG in the sampling process with half the computational complexity while maintaining similar generation quality. DICE distills a CFG-based text-to-image diffusion model into a CFG-free version by refining text embeddings to replicate CFG-based directions. In this way, we avoid the computational drawbacks of CFG, enabling high-quality, well-aligned image generation at a fast sampling speed. Furthermore,…
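To make the cost argument concrete, here is a minimal sketch rather than the authors' implementation. Standard CFG evaluates the denoiser twice per sampling step, once with the prompt embedding and once with a null embedding, and extrapolates between the two predictions. A CFG-free sampler evaluates it once with a refined embedding. `ToyDenoiser`, `cfg_prediction`, `single_pass_prediction`, and `demo` are hypothetical names. The toy network is linear in the embedding, so a closed-form "sharpened" embedding reproduces the guided output exactly; real diffusion networks are nonlinear, which is why DICE has to learn the refinement by distillation.

```python
import numpy as np


class ToyDenoiser:
    """Hypothetical stand-in for a noise-prediction network; counts forward passes."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_x = rng.standard_normal((dim, dim)) * 0.1
        self.w_c = rng.standard_normal((dim, dim)) * 0.1
        self.calls = 0

    def __call__(self, x, emb):
        self.calls += 1
        # Linear in both inputs (an assumption made only for this toy example).
        return x @ self.w_x + emb @ self.w_c


def cfg_prediction(model, x, cond_emb, null_emb, scale):
    """Standard CFG: two forward passes, then extrapolate away from the unconditional output."""
    eps_uncond = model(x, null_emb)
    eps_cond = model(x, cond_emb)
    return eps_uncond + scale * (eps_cond - eps_uncond)


def single_pass_prediction(model, x, sharpened_emb):
    """CFG-free sampling step: one forward pass with a refined embedding."""
    return model(x, sharpened_emb)


def demo():
    dim, scale = 4, 7.5
    rng = np.random.default_rng(1)
    x = rng.standard_normal(dim)
    cond_emb = rng.standard_normal(dim)
    null_emb = np.zeros(dim)

    model = ToyDenoiser(dim)
    guided = cfg_prediction(model, x, cond_emb, null_emb, scale)
    cfg_calls = model.calls

    model.calls = 0
    # Closed form valid ONLY because the toy model is linear in the embedding.
    sharpened_emb = null_emb + scale * (cond_emb - null_emb)
    single = single_pass_prediction(model, x, sharpened_emb)
    return guided, single, cfg_calls, model.calls
```

Calling `demo()` returns the two predictions together with the forward-pass counts for each path, which shows where the factor-of-two saving per sampling step comes from.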
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: ALIGN · Diffusion
