MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention

Qi Xie; Yongjia Ma; Donglin Di; Xuehao Gao; Xun Yang

arXiv:2508.03034 · cs.CV · August 14, 2025

TL;DR

MoCA is a video diffusion model built on a Diffusion Transformer (DiT) backbone that adds Mixture of Cross-Attention layers to improve identity preservation and temporal coherence in text-to-video (T2V) generation. It reports over 5% higher face similarity than existing T2V methods and strong generalization across identities on the CelebIPVid dataset.

Contribution

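The paper presents MoCA, a video diffusion model that embeds Mixture of Cross-Attention layers into each DiT block, pairs them with a Hierarchical Temporal Pooling strategy that captures identity features over varying timescales, and adds a Latent Video Perceptual Loss for ID-preserving T2V generation.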

Findings

01. Outperforms existing T2V methods by over 5% in face similarity.

02. Effectively captures fine-grained facial dynamics and temporal identity coherence.

03. Demonstrates strong generalization across diverse identities on the CelebIPVid dataset.

Abstract

Achieving ID-preserving text-to-video (T2V) generation remains challenging despite recent advances in diffusion-based models. Existing approaches often fail to capture fine-grained facial dynamics or maintain temporal identity coherence. To address these limitations, we propose MoCA, a novel Video Diffusion Model built on a Diffusion Transformer (DiT) backbone, incorporating a Mixture of Cross-Attention mechanism inspired by the Mixture-of-Experts paradigm. Our framework improves inter-frame identity consistency by embedding MoCA layers into each DiT block, where Hierarchical Temporal Pooling captures identity features over varying timescales, and Temporal-Aware Cross-Attention Experts dynamically model spatiotemporal relationships. We further incorporate a Latent Video Perceptual Loss to enhance identity coherence and fine-grained details across video frames. To train this model, we…
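
The abstract names the components but does not give their details, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes average pooling over non-overlapping frame windows for the hierarchical temporal pooling, one cross-attention expert per timescale, a per-token softmax gate in the Mixture-of-Experts style, and no learned query/key/value projections. All function names, the window sizes (1, 2, 4), and the gate matrix `gate_weights` are hypothetical and chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def hierarchical_temporal_pool(id_feats, windows=(1, 2, 4)):
    """Average-pool per-frame identity features (T, D) at several timescales.

    Assumption: non-overlapping windows, with the last frame repeated as
    padding. Returns one (ceil(T / w), D) array per window size w.
    """
    T, D = id_feats.shape
    pooled = []
    for w in windows:
        pad = (-T) % w  # frames needed to make T divisible by w
        padded = np.concatenate([id_feats, np.repeat(id_feats[-1:], pad, axis=0)])
        pooled.append(padded.reshape(-1, w, D).mean(axis=1))
    return pooled


def cross_attention(queries, context):
    # Scaled dot-product cross-attention; learned projections are omitted.
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context


def mixture_of_cross_attention(video_tokens, id_feats, gate_weights, windows=(1, 2, 4)):
    """Blend one cross-attention expert per timescale with a softmax gate."""
    contexts = hierarchical_temporal_pool(id_feats, windows)
    # expert_out has shape (N, E, D): one output per token and expert.
    expert_out = np.stack([cross_attention(video_tokens, c) for c in contexts], axis=1)
    gates = softmax(video_tokens @ gate_weights)  # (N, E), rows sum to 1
    mixed = (gates[:, :, None] * expert_out).sum(axis=1)
    return mixed, gates


# Toy inputs: 6 video tokens, 5 frames of identity features, feature dim 8.
rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(6, 8))
id_feats = rng.normal(size=(5, 8))
gate_weights = rng.normal(size=(8, 3))  # hypothetical gate, one column per expert

out, gates = mixture_of_cross_attention(video_tokens, id_feats, gate_weights)
print(out.shape, gates.shape)
```

In this sketch each video token attends to the identity features at every timescale, and the gate decides how much each timescale contributes to that token. A real DiT block would add learned projections, multiple heads, and residual connections around this step.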
