LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang

TL;DR
LaVieID introduces a local autoregressive diffusion framework for identity-preserving text-to-video generation, improving facial detail retention and temporal consistency in personalized videos.
Contribution
The paper proposes a local autoregressive diffusion model that combines a local router with a temporal autoregressive module to enhance identity preservation in video synthesis.
Findings
Achieves state-of-the-art identity preservation in text-to-video tasks.
Produces high-fidelity, personalized videos with improved temporal consistency.
Outperforms existing methods in qualitative and quantitative evaluations.
Abstract
In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before…
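The abstract's phrase "weighted combinations of fine-grained local facial structures" can be illustrated with a small sketch. This is not the authors' implementation: the function name `route_token`, the scaled dot-product scoring, and the softmax weighting are all assumptions chosen for illustration, and the toy region vectors stand in for whatever local facial features the paper actually uses.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, region_features):
    """Hypothetical router: express one latent token as a weighted
    combination of K local region features (all vectors of length d).

    Scoring by scaled dot product is an assumption, not the paper's method.
    """
    d = len(token)
    scores = [
        sum(t * r for t, r in zip(token, region)) / math.sqrt(d)
        for region in region_features
    ]
    weights = softmax(scores)
    # Weighted sum of the region features, one output value per dimension.
    mixed = [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(d)
    ]
    return weights, mixed


# Toy example: three made-up "regions" in a 2-dimensional latent space.
token = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, mixed = route_token(token, regions)
```

In this toy setup the region most aligned with the token receives the largest weight, so each token is dominated by the local structure it resembles most instead of a single unstructured global feature.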
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Emotion and Mood Recognition
