MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention
Qi Xie, Yongjia Ma, Donglin Di, Xuehao Gao, Xun Yang

TL;DR
MoCA is a diffusion-transformer-based framework that uses a mixture of cross-attention mechanisms to improve identity preservation and temporal coherence in text-to-video generation. It is reported to outperform existing T2V methods by over 5% in face similarity, with results on a new, diverse dataset.
Contribution
The paper presents MoCA, a new diffusion model with a mixture of cross-attention modules and a hierarchical temporal pooling strategy for improved ID-preserving T2V generation.
Findings
Outperforms existing T2V methods by over 5% in face similarity.
Effectively captures fine-grained facial dynamics and temporal identity coherence.
Demonstrates strong generalization across diverse identities on the CelebIPVid dataset.
Abstract
Achieving ID-preserving text-to-video (T2V) generation remains challenging despite recent advances in diffusion-based models. Existing approaches often fail to capture fine-grained facial dynamics or maintain temporal identity coherence. To address these limitations, we propose MoCA, a novel Video Diffusion Model built on a Diffusion Transformer (DiT) backbone, incorporating a Mixture of Cross-Attention mechanism inspired by the Mixture-of-Experts paradigm. Our framework improves inter-frame identity consistency by embedding MoCA layers into each DiT block, where Hierarchical Temporal Pooling captures identity features over varying timescales, and Temporal-Aware Cross-Attention Experts dynamically model spatiotemporal relationships. We further incorporate a Latent Video Perceptual Loss to enhance identity coherence and fine-grained details across video frames. To train this model, we…
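As a rough illustration of the idea described in the abstract, the NumPy sketch below pools per-frame identity features over several timescales and lets each video token softly mix the outputs of one cross-attention "expert" per timescale. It is not the authors' implementation. The function names, the window sizes, the use of average pooling, the soft gating, and the absence of learned query/key/value projections are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_temporal_pool(id_feats, window_sizes=(1, 2, 4)):
    """Average-pool per-frame identity features at several timescales.

    id_feats: (T, D) array, one identity feature per frame.
    Returns one array per window size w, with shape (ceil(T / w), D).
    Window sizes are hypothetical; the paper's pooling may differ.
    """
    T, _ = id_feats.shape
    pooled = []
    for w in window_sizes:
        n_chunks = -(-T // w)  # ceiling division
        chunks = [id_feats[i * w:(i + 1) * w].mean(axis=0) for i in range(n_chunks)]
        pooled.append(np.stack(chunks))
    return pooled

def cross_attention(queries, context):
    # Single-head scaled dot-product attention; learned projections omitted.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

def mixture_of_cross_attention(video_tokens, id_feats, gate_weights,
                               window_sizes=(1, 2, 4)):
    """Mix one cross-attention expert per timescale with a soft gate.

    video_tokens: (N, D) latent video tokens.
    gate_weights: (D, E) gating matrix, E == len(window_sizes) (assumed form).
    Returns the residual-updated tokens (N, D) and the gate (N, E).
    """
    contexts = hierarchical_temporal_pool(id_feats, window_sizes)
    # Each expert attends to identity features at a different timescale.
    expert_out = np.stack(
        [cross_attention(video_tokens, c) for c in contexts], axis=1
    )  # (N, E, D)
    gate = softmax(video_tokens @ gate_weights, axis=-1)  # (N, E)
    mixed = (gate[:, :, None] * expert_out).sum(axis=1)   # (N, D)
    return video_tokens + mixed, gate

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))      # 6 video tokens, feature dim 8
identity = rng.normal(size=(5, 8))    # 5 frames of identity features
gate_w = rng.normal(size=(8, 3))      # 3 experts, one per window size
out, gate = mixture_of_cross_attention(tokens, identity, gate_w)
print(out.shape, gate.shape)
```

In a real DiT block the queries, keys, and values would pass through learned projections and the gate would be trained jointly with the experts; the sketch only shows how pooled identity features at different timescales can feed separate attention paths that are then combined per token.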
