Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance
Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou

TL;DR
Ctrl-X introduces a fast, flexible, and training-free framework for controlling structure and appearance in text-to-image generation, supporting arbitrary condition images and outperforming existing methods in quality and versatility.
Contribution
Ctrl-X proposes a plug-and-play approach to structure and appearance control in T2I diffusion models that requires no additional training or guidance.
Findings
Supports condition images of any modality for structure and appearance control
Achieves superior image quality and transfer performance compared to existing methods
Provides fast, flexible control for both T2I and T2V models
Abstract
Recent controllable generation approaches such as FreeControl and Diffusion Self-Guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance. Ctrl-X designs feed-forward structure control to enable the structure alignment with a structure image and semantic-aware appearance transfer to facilitate the appearance transfer from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model…
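The abstract names "semantic-aware appearance transfer" but does not spell out how it works. As a rough illustration only, the sketch below assumes one plausible reading: each token of the image being generated attends to the appearance image's tokens, and the attention-weighted mean and standard deviation of the appearance features are used to re-style the normalized token. The function names (`normalize`, `attention`, `transfer_appearance`) and the toy list-of-lists feature layout are hypothetical and are not taken from the paper or its code.

```python
import math

def normalize(feats, eps=1e-6):
    """Standardize each channel across tokens (zero mean, unit variance)."""
    n, d = len(feats), len(feats[0])
    means = [sum(row[c] for row in feats) / n for c in range(d)]
    stds = [
        math.sqrt(sum((row[c] - means[c]) ** 2 for row in feats) / n) + eps
        for c in range(d)
    ]
    return [[(row[c] - means[c]) / stds[c] for c in range(d)] for row in feats]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys):
    """Scaled dot-product attention weights, one row per query token."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(q * k for q, k in zip(q_row, k_row)) for k_row in keys])
        for q_row in queries
    ]

def transfer_appearance(output_feats, appearance_feats):
    """Assumed mechanism: re-style each output token with attention-weighted
    mean/std statistics of the appearance features."""
    out_norm = normalize(output_feats)
    app_norm = normalize(appearance_feats)
    attn = attention(out_norm, app_norm)  # semantic correspondence (assumed)
    d = len(appearance_feats[0])
    result = []
    for weights, normed in zip(attn, out_norm):
        mean = [sum(w * row[c] for w, row in zip(weights, appearance_feats))
                for c in range(d)]
        second = [sum(w * row[c] ** 2 for w, row in zip(weights, appearance_feats))
                  for c in range(d)]
        # Clamp at zero to guard against tiny negative values from rounding.
        std = [math.sqrt(max(second[c] - mean[c] ** 2, 0.0)) for c in range(d)]
        result.append([std[c] * normed[c] + mean[c] for c in range(d)])
    return result
```

In this toy form, an appearance image whose tokens are all identical pulls every output token to that single feature vector, while a varied appearance image lets each output token borrow statistics mainly from the appearance tokens it most resembles. No optimization loop or guidance gradient is involved, which is consistent with the "without guidance" framing, though the actual method may differ in its details.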
Taxonomy
Topics: Augmented Reality Applications · Human Motion and Animation
Methods: Diffusion
