Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation

Yushe Cao; Dianxi Shi; Xing Fu; Xuechao Zou; Haikuo Peng; Xueqi Li; Chun Yu; Junliang Xing

arXiv:2511.12631 · cs.CV · January 8, 2026

TL;DR

This paper introduces MDiTFace, a diffusion transformer framework for facial generation conditioned on semantic masks and text. It improves cross-modal feature interaction and uses a novel decoupled attention mechanism to reduce computational cost.

Contribution

The paper proposes a unified tokenization strategy and a decoupled attention mechanism for better cross-modal interaction and efficiency in facial generation.

Findings

1. Outperforms existing methods in facial fidelity
2. Reduces computational overhead by over 94%
3. Maintains high conditional consistency

Abstract

While significant progress has been achieved in multimodal facial generation using semantic masks and textual descriptions, conventional feature fusion approaches often fail to enable effective cross-modal interactions, thereby leading to suboptimal generation outcomes. To address this challenge, we introduce MDiTFace, a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs, eliminating discrepancies between heterogeneous modality representations. The framework facilitates comprehensive multimodal feature interaction through stacked, newly designed multivariate transformer blocks that process all conditions synchronously. Additionally, we design a novel decoupled attention mechanism by dissociating implicit dependencies between mask tokens and temporal embeddings. This mechanism segregates internal computations…
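
The abstract is cut off before it explains what the separation is used for, so the following is only an illustrative sketch, not the authors' implementation. It assumes one plausible reading: if mask tokens do not depend on the timestep embedding, their attention keys and values can be computed once and reused at every denoising step. The class name `DecoupledAttentionSketch`, the scalar `t_scale` stand-in for timestep modulation, and the counter `mask_projections` are all hypothetical, and only image tokens act as queries here for brevity.

```python
import math


def project(rows, weight):
    """Multiply each row vector by a (d_in x d_out) weight matrix."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
            for row in rows]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on plain Python lists."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


class DecoupledAttentionSketch:
    """Hypothetical layer: image tokens depend on the timestep, mask tokens do not."""

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self._mask_kv = None       # cached (keys, values) for the mask tokens
        self.mask_projections = 0  # counts how often the mask was projected

    def _mask_keys_values(self, mask_tokens):
        # Assumption: the mask is fixed for the whole sampling run, so its
        # keys and values are computed once and reused at every step.
        if self._mask_kv is None:
            self._mask_kv = (project(mask_tokens, self.w_k),
                             project(mask_tokens, self.w_v))
            self.mask_projections += 1
        return self._mask_kv

    def forward(self, image_tokens, mask_tokens, t_scale):
        # Dynamic part: a scalar stand-in for timestep modulation.
        modulated = [[x * t_scale for x in row] for row in image_tokens]
        q = project(modulated, self.w_q)
        k_img = project(modulated, self.w_k)
        v_img = project(modulated, self.w_v)
        # Static part: timestep-independent, served from the cache.
        k_mask, v_mask = self._mask_keys_values(mask_tokens)
        return attend(q, k_img + k_mask, v_img + v_mask)


# Toy usage: three denoising steps share one mask projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
layer = DecoupledAttentionSketch(identity, identity, identity)
image = [[0.1, 0.2], [0.3, 0.4]]
mask = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs = [layer.forward(image, mask, t) for t in (1.0, 0.5, 0.25)]
print(layer.mask_projections)
```

In this toy run the mask is projected once even though `forward` is called three times, which is the kind of saving a timestep-independent mask pathway would allow. How MDiTFace actually structures its computations should be checked against the full paper.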

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications