Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing

Zitao Shuai; Chenwei Wu; Zhengxu Tang; Bowen Song; Liyue Shen

arXiv:2411.08196·cs.CV·November 14, 2024

TL;DR

This paper reveals that diffusion transformer latent spaces are inherently semantically disentangled and proposes a zero-shot editing method that uses text prompts to control image attributes precisely without extra training.

Contribution

It uncovers the intrinsic semantic disentanglement of DiT latent spaces and introduces a zero-shot editing framework using text-guided directions without additional training.

Findings

01. DiT's latent space is semantically disentangled.

02. Effective zero-shot fine-grained image editing is achieved.

03. A new metric quantifies latent space disentanglement.

Abstract

Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation. In image editing, DiTs project text and image inputs to a joint latent space, from which they decode and synthesize new images. However, it remains largely unexplored how multimodal information collectively forms this joint space and how it guides the semantics of the synthesized images. In this paper, we investigate the latent space of DiT models and uncover two key properties: First, DiT's latent space is inherently semantically disentangled, where different semantic attributes can be controlled by specific editing directions. Second, consistent semantic editing requires utilizing the entire joint latent space, as neither encoded image nor text alone contains enough semantic information. We show that these editing directions can be obtained directly from text prompts, enabling…
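The idea of an editing direction obtained from text prompts can be illustrated with a toy sketch. This is not the paper's implementation: the vectors below are random stand-ins for real DiT embeddings, and the function names `editing_direction` and `apply_edit`, the difference-of-embeddings construction, and the `strength` parameter are all assumptions made for illustration.

```python
import numpy as np


def editing_direction(e_source: np.ndarray, e_target: np.ndarray) -> np.ndarray:
    """Unit vector pointing from a source-prompt embedding to a target-prompt embedding.

    Assumption: a direction is taken as the normalized difference of two
    embeddings. The paper may construct its directions differently.
    """
    diff = e_target - e_source
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("source and target embeddings are identical")
    return diff / norm


def apply_edit(z: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Shift a latent vector along a direction by a scalar strength."""
    return z + strength * direction


# Toy stand-ins for embeddings in a joint latent space (hypothetical data).
rng = np.random.default_rng(0)
e_source = rng.normal(size=16)  # e.g. embedding of a neutral prompt
e_target = rng.normal(size=16)  # e.g. embedding of a prompt with the attribute
z = rng.normal(size=16)         # latent of the image being edited

direction = editing_direction(e_source, e_target)
z_edited = apply_edit(z, direction, strength=2.0)
```

In a disentangled space, moving along one such direction would ideally change only the targeted attribute; the scalar `strength` then controls how strongly the attribute is expressed.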

Taxonomy

Topics: Innovative Microfluidic and Catalytic Techniques Innovation · Domain Adaptation and Few-Shot Learning

Methods: Diffusion