Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

Joonghyuk Shin; Alchan Hwang; Yujin Kim; Daneul Kim; Jaesik Park

arXiv:2508.07519 · cs.CV · August 12, 2025

TL;DR

This paper analyzes the attention mechanism of multimodal diffusion transformers (MM-DiT), the architecture behind state-of-the-art models such as Stable Diffusion 3 and Flux.1, and proposes a prompt-based editing method that supports diverse edits across MM-DiT variants.

Contribution

It provides a systematic analysis of MM-DiT's attention matrices and introduces a robust prompt-based editing technique adaptable to various MM-DiT architectures.

Findings

01

Decomposition of attention matrices into four blocks reveals their characteristics.

02

The proposed editing method supports edits ranging from global to local across MM-DiT variants.

03

Insights bridge U-Net-based and transformer-based diffusion models.

Abstract

Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1. Previous approaches have relied on unidirectional cross-attention mechanisms, with information flowing from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates input projections from both modalities and performs a single full attention operation, allowing bidirectional information flow between text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT's attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. Through these…
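
To make the unified attention described in the abstract concrete, the following is a minimal single-head NumPy sketch, not the authors' implementation. It concatenates text and image projections, runs one full attention operation, and slices the resulting attention matrix into four blocks by query and key modality. The function names (`joint_attention`, `split_blocks`), the block labels, the text-first token order, and the toy sizes are all assumptions made for illustration; real MM-DiT variants use multiple heads, batching, and may order the tokens differently.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(q_txt, k_txt, v_txt, q_img, k_img, v_img):
    # Hypothetical helper: concatenate both modalities (text first, an
    # assumption) and perform a single full attention over all tokens.
    q = np.concatenate([q_txt, q_img], axis=0)
    k = np.concatenate([k_txt, k_img], axis=0)
    v = np.concatenate([v_txt, v_img], axis=0)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn

def split_blocks(attn, n_txt):
    # Rows index queries, columns index keys. Labels are illustrative.
    return {
        "text_query_text_key": attn[:n_txt, :n_txt],
        "text_query_image_key": attn[:n_txt, n_txt:],
        "image_query_text_key": attn[n_txt:, :n_txt],
        "image_query_image_key": attn[n_txt:, n_txt:],
    }

# Toy sizes (assumed): 4 text tokens, 9 image tokens, head dimension 8.
rng = np.random.default_rng(0)
n_txt, n_img, d = 4, 9, 8
q_txt, k_txt, v_txt = (rng.normal(size=(n_txt, d)) for _ in range(3))
q_img, k_img, v_img = (rng.normal(size=(n_img, d)) for _ in range(3))

out, attn = joint_attention(q_txt, k_txt, v_txt, q_img, k_img, v_img)
blocks = split_blocks(attn, n_txt)
for name, block in blocks.items():
    print(name, block.shape)
```

Because the softmax is taken over the full concatenated key axis, each row's probability mass is shared between text keys and image keys. That coupling is what distinguishes this setup from unidirectional cross-attention, where image queries attend only to text keys.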

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Cell Image Analysis Techniques