LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

Wenhui Song; Hanhui Li; Jiehui Huang; Panwen Hu; Yuhao Cheng; Long Chen; Yiqiang Yan; Xiaodan Liang

arXiv:2508.07603 · cs.CV · August 12, 2025

Open Access

TL;DR

LaVieID introduces a local autoregressive diffusion framework for identity-preserving text-to-video generation, improving facial detail retention and temporal consistency in personalized videos.

Contribution

It proposes a novel local autoregressive diffusion model with a local router and temporal autoregression to enhance identity preservation in video synthesis.

Findings

1. Achieves state-of-the-art identity preservation in text-to-video tasks.
2. Produces high-fidelity, personalized videos with improved temporal consistency.
3. Outperforms existing methods in qualitative and quantitative evaluations.

Abstract

In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before…
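The abstract describes the local router only at a high level: latent states are represented as weighted combinations of fine-grained local facial structures. The toy sketch below illustrates that general idea and is not the authors' implementation. The scaled dot-product scoring, the softmax weighting, the function name `route_token`, and the example feature vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_token(token, local_feats):
    """Re-express one latent token as a weighted mix of local features.

    Hypothetical router: weights come from a softmax over scaled
    dot-product similarity (an assumption, not taken from the paper).
    """
    dim = len(token)
    scores = [dot(token, feat) / math.sqrt(dim) for feat in local_feats]
    weights = softmax(scores)
    mixed = [
        sum(w * feat[i] for w, feat in zip(weights, local_feats))
        for i in range(dim)
    ]
    return mixed, weights


# Hypothetical 2-D features standing in for local facial regions.
local_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token = [2.0, 0.0]
mixed, weights = route_token(token, local_feats)
```

In this toy setup the weights sum to one, so each routed token is a convex combination of the local features. The regions most similar to the token contribute the most, which is one simple way to read "weighted combinations" in the abstract.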

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Emotion and Mood Recognition