Shifted Diffusion for Text-to-image Generation
Yufan Zhou, Bingchen Liu, Yizhe Zhu, Xiao Yang, Changyou Chen, Jinhui Xu

TL;DR
Corgi introduces a shifted diffusion model that improves text-to-image generation by better integrating CLIP knowledge, enabling efficient semi-supervised training and outperforming existing models like DALL-E 2 and Lafite.
Contribution
The paper proposes a novel shifted diffusion approach that incorporates CLIP knowledge into diffusion models, enhancing text-to-image generation and enabling language-free training.
Findings
Outperforms the DALL-E 2 baseline at generating image embeddings from text, in both efficiency and effectiveness.
Achieves state-of-the-art results in language-free generation.
Enables semi-supervised training with minimal captioned data.
Abstract
We present Corgi, a novel method for text-to-image generation. Corgi is based on our proposed shifted diffusion model, which achieves better image embedding generation from input text. Unlike the baseline diffusion model used in DALL-E 2, our method seamlessly encodes prior knowledge of the pre-trained CLIP model in its diffusion process by designing a new initialization distribution and a new transition step of the diffusion. Compared to the strong DALL-E 2 baseline, our method performs better in generating image embedding from the text in terms of both efficiency and effectiveness, resulting in better text-to-image generation. Extensive large-scale experiments are conducted and evaluated in terms of both quantitative measures and human evaluation, indicating a stronger generation ability of our method compared to existing ones. Furthermore, our model enables semi-supervised and…
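The abstract names two changes to the baseline diffusion: a new initialization distribution and a new transition step. A minimal sketch of what those could look like is given below. It is an illustration under stated assumptions, not the authors' implementation: the shift term `(1 - sqrt(1 - beta)) * mu`, the diagonal covariance, and every function name are hypothetical choices. Here `mu` and `sigma` stand in for statistics of CLIP image embeddings, so the forward process drifts toward a Gaussian centered on `mu` rather than on the origin.

```python
import math
import random

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) over the first t steps."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod

def shifted_step(z_prev, beta, mu, sigma, rng):
    """One forward step with an assumed shift toward mu.

    z_t = sqrt(1 - beta) * z_{t-1} + (1 - sqrt(1 - beta)) * mu + sqrt(beta) * sigma * eps
    """
    keep = math.sqrt(1.0 - beta)
    return [
        keep * z + (1.0 - keep) * m + math.sqrt(beta) * s * rng.gauss(0.0, 1.0)
        for z, m, s in zip(z_prev, mu, sigma)
    ]

def shifted_marginal_mean(z0, betas, t, mu):
    """Closed-form mean of z_t given z_0 implied by repeating shifted_step."""
    a = math.sqrt(alpha_bar(betas, t))
    return [a * z + (1.0 - a) * m for z, m in zip(z0, mu)]

def sample_initial(mu, sigma, rng):
    """Start of the reverse process: N(mu, diag(sigma^2)) instead of N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

Subtracting `mu` from every state turns this recursion into the standard forward diffusion, which is why the mean of `z_t` interpolates between `z0` and `mu`, and why the process ends at a Gaussian centered on `mu` under these assumptions.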
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training · Diffusion
