MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
Bharath Krishnamurthy, Ajita Rattani

TL;DR
MMFace-DiT is a dual-stream diffusion transformer that fuses spatial priors (such as segmentation masks, sketches, or edge maps) with text prompts to achieve high-fidelity, controllable face generation with strong spatial-semantic consistency.
Contribution
The paper proposes a dual-stream transformer with shared RoPE attention and a Modality Embedder for dynamic, end-to-end multimodal face synthesis.
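To make the idea of two streams sharing one RoPE attention more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (rope, dual_stream_attention), the single attention head, the 1D token positions, and the text-then-image token ordering are all assumptions chosen for illustration, and the Modality Embedder is not modeled. The sketch only shows the general pattern of per-stream projections feeding one joint attention whose queries and keys are rotated with the same positional scheme.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding for x of shape (seq, dim), with dim even."""
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per channel pair
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_j, x2_j) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)            # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dual_stream_attention(img, txt, w_img, w_txt):
    """Hypothetical joint attention: separate Q/K/V weights per stream,
    one attention over the concatenated tokens with a shared RoPE."""
    q_i, k_i, v_i = (img @ w for w in w_img)         # image-stream projections
    q_t, k_t, v_t = (txt @ w for w in w_txt)         # text-stream projections
    q = np.concatenate([q_t, q_i])
    k = np.concatenate([k_t, k_i])
    v = np.concatenate([v_t, v_i])
    # Assumption: simple 1D positions over the joint sequence.
    pos = np.arange(q.shape[0], dtype=np.float64)
    q, k = rope(q, pos), rope(k, pos)
    out = softmax(q @ k.T / np.sqrt(q.shape[1])) @ v
    n_txt = txt.shape[0]
    return out[n_txt:], out[:n_txt]                  # (image tokens, text tokens)

rng = np.random.default_rng(0)
img_tokens = rng.normal(size=(6, 8))                 # 6 image tokens, width 8
txt_tokens = rng.normal(size=(3, 8))                 # 3 text tokens, width 8
w_img = [rng.normal(size=(8, 8)) for _ in range(3)]
w_txt = [rng.normal(size=(8, 8)) for _ in range(3)]
img_out, txt_out = dual_stream_attention(img_tokens, txt_tokens, w_img, w_txt)
```

Because every token attends over the concatenated sequence, image tokens can read from text tokens and vice versa within a single attention call, while each stream keeps its own projection weights.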
Findings
Achieves a reported 40% improvement in visual fidelity and prompt alignment over the compared baselines.
Effectively fuses semantic and spatial modalities without architectural constraints.
Outperforms six state-of-the-art models in multimodal face generation.
Abstract
Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis.…
