Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Zhengyao Lv; Tianlin Pan; Chenyang Si; Zhaoxi Chen; Wangmeng Zuo; Ziwei Liu; Kwan-Yee K. Wong

arXiv:2506.07986·cs.CV·July 24, 2025

TL;DR

This paper introduces TACA, a novel attention mechanism that dynamically balances cross-modal interactions in diffusion transformers, significantly improving text-image alignment with minimal extra computation.

Contribution

The paper proposes Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that rebalances attention between text and visual tokens in MM-DiT models through temperature scaling and timestep-dependent adjustment, leading to better text-image alignment.

Findings

01

Combined with LoRA fine-tuning, TACA improves text-image alignment on the T2I-CompBench benchmark.

02

TACA enhances object, attribute, and spatial relationship fidelity.

03

TACA requires minimal additional computational resources.

Abstract

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder this alignment: 1) the suppression of cross-modal attention due to token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational…
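The two issues named in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names, the choice to multiply visual-to-text logits by a factor `gamma`, the value `gamma=2.0`, the threshold `t_threshold=0.5`, and the convention that `t` lies in [0, 1] with larger values meaning noisier steps are all assumptions made for illustration. The sketch takes the attention logits of a single visual query over a few text keys and many visual keys, and shows how a timestep-dependent temperature factor changes the share of attention that reaches the text tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_for_timestep(t, t_threshold=0.5, gamma=2.0):
    """Hypothetical schedule: boost cross-modal logits only at noisy steps.

    Assumes t in [0, 1] with t = 1 the noisiest step. The threshold and
    gamma values are illustrative, not taken from the paper.
    """
    return gamma if t >= t_threshold else 1.0


def text_attention_mass(text_logits, visual_logits, t):
    """Share of one visual query's attention that lands on text keys.

    Text and visual keys compete in one joint softmax, as in joint
    MM-DiT attention. Only the visual-to-text logits are rescaled.
    """
    g = temperature_for_timestep(t)
    joint = [g * x for x in text_logits] + list(visual_logits)
    weights = softmax(joint)
    return sum(weights[: len(text_logits)])


if __name__ == "__main__":
    # Token imbalance: 4 text keys compete with 64 visual keys.
    text_logits = [1.0] * 4
    visual_logits = [1.0] * 64

    late = text_attention_mass(text_logits, visual_logits, t=0.1)
    early = text_attention_mass(text_logits, visual_logits, t=0.9)
    print(f"text share without boost (t=0.1): {late:.4f}")
    print(f"text share with boost    (t=0.9): {early:.4f}")
```

With equal logits, the unboosted text share is simply the fraction of text keys (4 of 68), which is the token-imbalance effect the abstract describes. One caveat of this particular sketch: multiplying a logit by `gamma > 1` raises attention only where the logit is positive, so the exact form of the rescaling should be checked against the paper and the official repository.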

Code & Models

Repositories

vchitect/taca
PyTorch · Official

Models

🤗
ldiex/TACA
model · 94 downloads · 6 likes

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling

Methods: Softmax · Attention Is All You Need · Diffusion