Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing
Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen

TL;DR
This paper reveals that diffusion transformer latent spaces are inherently semantically disentangled and proposes a zero-shot editing method that uses text prompts to control image attributes precisely without extra training.
Contribution
It uncovers the intrinsic semantic disentanglement of DiT latent spaces and introduces a zero-shot editing framework using text-guided directions without additional training.
Findings
DiT's latent space is semantically disentangled.
Effective zero-shot fine-grained image editing is achieved.
A new metric quantifies latent space disentanglement.
Abstract
Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation. In image editing, DiTs project text and image inputs to a joint latent space, from which they decode and synthesize new images. However, it remains largely unexplored how multimodal information collectively forms this joint space and how it guides the semantics of the synthesized images. In this paper, we investigate the latent space of DiT models and uncover two key properties: First, DiT's latent space is inherently semantically disentangled, where different semantic attributes can be controlled by specific editing directions. Second, consistent semantic editing requires utilizing the entire joint latent space, as neither the encoded image nor the text alone contains enough semantic information. We show that these editing directions can be obtained directly from text prompts, enabling…
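The idea of an editing direction obtained from text prompts can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract as shown here does not say how the directions are computed. The sketch assumes, purely for illustration, that a direction is the normalized difference between the embeddings of a target prompt and a source prompt, and that an edit adds a scaled copy of that direction to a latent vector. The function names, the toy 4-dimensional vectors, and the `strength` parameter are all hypothetical.

```python
import math


def edit_direction(source_emb, target_emb):
    """Unit vector from the source-prompt embedding toward the target-prompt embedding.

    Assumption: both embeddings live in the same space and have equal length.
    """
    diff = [t - s for s, t in zip(source_emb, target_emb)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("source and target embeddings are identical")
    return [d / norm for d in diff]


def apply_edit(latent, direction, strength):
    """Shift a latent vector along the direction; strength sets the edit intensity."""
    return [z + strength * d for z, d in zip(latent, direction)]


# Toy stand-ins for prompt embeddings and a joint latent (hypothetical values).
source_emb = [1.0, 0.0, 0.0, 0.0]
target_emb = [1.0, 3.0, 0.0, 4.0]
latent = [0.5, 0.5, 0.5, 0.5]

direction = edit_direction(source_emb, target_emb)
edited = apply_edit(latent, direction, strength=2.0)
```

In this toy setup only the coordinates where the two prompt embeddings differ are moved, which loosely mirrors the claim that a disentangled space lets one attribute change while the others stay fixed. A real DiT pipeline would operate on high-dimensional joint text-image latents and decode the edited latent back to an image.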
Taxonomy
Topics: Innovative Microfluidic and Catalytic Techniques Innovation · Domain Adaptation and Few-Shot Learning
Methods: Diffusion
