ECNet: Effective Controllable Text-to-Image Diffusion Models
Sicheng Li, Keqiang Sun, Zhixin Lai, Xiaoshi Wu, Feng Qiu, Haoran Xie, Kazunori Miyata, Hongsheng Li

TL;DR
ECNet adds a guidance module that sharpens the condition input and a loss that supervises the denoised latent, with the aim of improving the controllability and robustness of text-to-image diffusion models and producing more precise images from complex conditions.
Contribution
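The paper presents the Spatial Guidance Injector (SGI) and the Diffusion Consistency Loss (DCL). SGI targets ambiguous condition inputs by encoding text together with precise annotation information; DCL targets limited conditional supervision by supervising the denoised latent code at any time step.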
Findings
Enhanced controllability over various conditions
Outperforms existing state-of-the-art models
Improved robustness and precision in image generation
Abstract
Conditional text-to-image diffusion models have garnered significant attention in recent years. However, their precision is often compromised for two main reasons: ambiguous condition input and inadequate condition guidance from a single denoising loss. To address these challenges, we introduce two solutions. First, we propose a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs with precise annotation information. This method directly tackles the issue of ambiguous control inputs by providing clear, annotated guidance to the model. Second, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss (DCL), which applies supervision on the denoised latent code at any given time step. This encourages consistency between the latent code at each time step and the input signal, thereby…
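The abstract says only that DCL supervises the denoised latent code at any time step; it does not give the form of the loss. The sketch below is therefore an illustration under stated assumptions, not the paper's implementation. It uses the standard DDPM identity for recovering a denoised estimate from a noisy latent and a predicted noise, then compares a condition extracted from that estimate with the input condition. The linear beta schedule, the function names, and the toy_condition extractor (adjacent differences standing in for something like an edge or pose detector) are all hypothetical.

```python
import math

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s <= t, linear schedule (assumed)."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(x0, eps, a_bar):
    """Standard forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
            for x, e in zip(x0, eps)]

def predict_x0(x_t, eps_pred, a_bar):
    """Denoised estimate obtained by inverting the forward process."""
    return [(x - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar)
            for x, e in zip(x_t, eps_pred)]

def toy_condition(latent):
    """Hypothetical condition extractor: differences between neighbours."""
    return [b - a for a, b in zip(latent, latent[1:])]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def consistency_loss(x_t, eps_pred, a_bar, condition, extract=toy_condition):
    """Hypothetical consistency term: the condition recovered from the
    denoised estimate should match the condition given as input."""
    x0_hat = predict_x0(x_t, eps_pred, a_bar)
    return mse(extract(x0_hat), condition)

# Toy usage with a 4-element "latent" at time step 500.
x0 = [0.5, -1.0, 2.0, 0.25]
eps = [1.0, -0.5, 0.3, 0.0]
a_bar_t = alpha_bar(500)
x_t = add_noise(x0, eps, a_bar_t)
condition = toy_condition(x0)

loss_good = consistency_loss(x_t, eps, a_bar_t, condition)         # true noise
loss_bad = consistency_loss(x_t, [0.0] * 4, a_bar_t, condition)    # wrong noise
```

With the true noise as the prediction, the denoised estimate matches the clean latent and the loss is zero up to rounding; a wrong prediction gives a strictly larger loss. Because the estimate can be formed at any time step, a term of this kind can be evaluated alongside the usual denoising loss during training.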
Taxonomy
Topics: Topic Modeling · Music and Audio Processing
Methods: Diffusion
