MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation

Bharath Krishnamurthy; Ajita Rattani

arXiv:2603.29029 · cs.CV · April 1, 2026

TL;DR

MMFace-DiT is a dual-stream diffusion transformer that jointly fuses spatial priors (segmentation masks, sketches, or edge maps) with text to produce high-fidelity, controllable face images that stay consistent with both the spatial layout and the prompt.

Contribution

It proposes a novel dual-stream transformer with shared RoPE attention and a Modality Embedder for dynamic, end-to-end multimodal face synthesis.
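
To make the idea of two token streams meeting in one rotary-position attention step concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`rope`, `dual_stream_attention`), the single attention head, the per-stream projection matrices, and the choice of one position index running across both streams are all assumptions made for illustration, and the Modality Embedder is not modeled.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (tokens, dim) by position-dependent angles.

    dim must be even. Feature j in the first half is paired with feature j
    in the second half.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dual_stream_attention(img, txt, w_img, w_txt):
    """Single-head joint attention over an image stream and a text stream.

    Each stream keeps its own (hypothetical) Q/K/V projection matrices, but
    all tokens attend to each other in one attention matrix.
    """
    n_img = img.shape[0]
    # Per-stream projections: each stream has separate weights.
    q_i, k_i, v_i = (img @ w for w in w_img)
    q_t, k_t, v_t = (txt @ w for w in w_txt)
    # Concatenate the streams so attention spans both modalities.
    q = np.concatenate([q_i, q_t])
    k = np.concatenate([k_i, k_t])
    v = np.concatenate([v_i, v_t])
    # Assumption: one position index shared by both streams.
    pos = np.arange(q.shape[0], dtype=np.float64)
    q, k = rope(q, pos), rope(k, pos)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    # Split the fused output back into the two streams.
    return out[:n_img], out[n_img:], attn
```

Calling `dual_stream_attention` with, say, 6 image tokens and 4 text tokens of width 8 returns two outputs with the same shapes as the inputs plus a 10 x 10 attention matrix, so every image token can attend to every text token and vice versa while each stream keeps its own weights.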

Findings

01. Achieves a 40% improvement in visual fidelity and prompt alignment.

02. Fuses semantic and spatial modalities without the architectural constraints that approaches built on pre-trained text-to-image pipelines inherit.

03. Outperforms six state-of-the-art models in multimodal face generation.

Abstract

Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis.…

Code & Models

Models

BharathK333/MMFace-DiT-Models (model)

Datasets

BharathK333/MMFace-DiT-Datasets (dataset, 631 downloads)