Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis
Zhiyu Jin, Xuli Shen, Bin Li, Xiangyang Xue

TL;DR
This paper introduces a training-free method to adapt diffusion models for variable-sized text-to-image synthesis, improving image quality and text alignment across different resolutions without additional training.
Contribution
It proposes a novel scaling factor that adjusts attention entropy to handle various image sizes, eliminating the need for retraining or fine-tuning.
Findings
Enhanced image quality across different resolutions
Improved text-image alignment without extra training
Validated effectiveness through extensive experiments
Abstract
Diffusion models (DMs) have recently gained attention with state-of-the-art performance in text-to-image synthesis. Abiding by the tradition in deep learning, DMs are trained and evaluated on the images with fixed sizes. However, users are demanding for various images with specific sizes and various aspect ratio. This paper focuses on adapting text-to-image diffusion models to handle such variety while maintaining visual fidelity. First we observe that, during the synthesis, lower resolution images suffer from incomplete object portrayal, while higher resolution images exhibit repetitively disordered presentation. Next, we establish a statistical relationship indicating that attention entropy changes with token quantity, suggesting that models aggregate spatial information in proportion to image resolution. The subsequent interpretation on our observations is that objects are…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis
MethodsDiffusion
