Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

Kuan Heng Lin; Sicheng Mo; Ben Klingher; Fangzhou Mu; Bolei Zhou

arXiv:2406.07540·cs.CV·December 12, 2024·1 cites

TL;DR

Ctrl-X introduces a fast, flexible, and training-free framework for controlling structure and appearance in text-to-image generation, supporting arbitrary condition images and outperforming existing methods in quality and versatility.

Contribution

The paper proposes a plug-and-play approach to structure and appearance control in T2I diffusion models that requires neither additional training nor guidance.

Findings

01 Supports condition images of arbitrary modality for structure and appearance control

02 Achieves superior image quality and transfer performance compared to existing methods

03 Provides instant, flexible control for T2I and T2V models

Abstract

Recent controllable generation approaches such as FreeControl and Diffusion Self-Guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance. Ctrl-X designs feed-forward structure control to enable the structure alignment with a structure image and semantic-aware appearance transfer to facilitate the appearance transfer from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model…
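The abstract does not spell out how the semantic-aware appearance transfer works, so the following is only an illustrative sketch of one way such a transfer could be set up, not the authors' implementation. It assumes that each output feature attends to the appearance image's features and is then re-scaled with the attention-weighted mean and standard deviation of those features. The function names (`transfer_appearance_stats`, `_standardize`, `_softmax`) and the toy list-of-lists feature layout are hypothetical; a real pipeline would operate on diffusion feature tensors.

```python
import math


def _standardize(feats, eps=1e-5):
    """Per-channel zero-mean, unit-variance normalization over positions.

    feats: list of N positions, each a list of C channel values.
    """
    n, c = len(feats), len(feats[0])
    means = [sum(row[j] for row in feats) / n for j in range(c)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in feats) / n)
        for j in range(c)
    ]
    return [[(row[j] - means[j]) / (stds[j] + eps) for j in range(c)] for row in feats]


def _softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transfer_appearance_stats(out_feats, app_feats, eps=1e-5):
    """Hypothetical sketch: attention-weighted statistics transfer.

    Each output position attends to all appearance positions, then its
    standardized feature is re-scaled by the attention-weighted std and
    shifted by the attention-weighted mean of the appearance features.
    """
    q = _standardize(out_feats, eps)  # queries from the output features
    k = _standardize(app_feats, eps)  # keys from the appearance features
    c = len(out_feats[0])
    scale = 1.0 / math.sqrt(c)

    result = []
    for q_row in q:
        logits = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = _softmax(logits)
        new_row = []
        for j in range(c):
            mean = sum(w * a[j] for w, a in zip(weights, app_feats))
            second = sum(w * a[j] ** 2 for w, a in zip(weights, app_feats))
            std = math.sqrt(max(second - mean * mean, 0.0))
            new_row.append(std * q_row[j] + mean)
        result.append(new_row)
    return result
```

Under this assumption, the output keeps its own spatial layout (through the standardized features) while taking on the appearance image's local feature statistics. In the degenerate case where every appearance position carries the same feature, all output positions collapse to that feature.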

Videos

Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance · SlidesLive

Taxonomy

Topics: Augmented Reality Applications · Human Motion and Animation

Methods: Diffusion