StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang

TL;DR
StableAvatar is an end-to-end video diffusion transformer that generates infinite-length, high-quality audio-driven avatar videos. It targets two weaknesses of previous models on long videos: unnatural audio synchronization and poor identity consistency.
Contribution
StableAvatar introduces a Time-step-aware Audio Adapter and an Audio Native Guidance Mechanism to prevent error accumulation and improve audio synchronization in long video generation.
Findings
Effective infinite-length video generation demonstrated on benchmarks.
Improved audio-visual synchronization and identity preservation.
Qualitative and quantitative validation of model performance.
Abstract
Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across…
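To make the baseline described in the abstract concrete, the following is a minimal sketch of injecting audio embeddings into video latents through cross-attention. It illustrates only the direct-injection approach attributed to existing models, not StableAvatar's Time-step-aware Audio Adapter. All names (`cross_attention`, `rand_matrix`), the dimensions, and the random weights are assumptions chosen for illustration; a real model would use learned projections and multi-head attention.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_attention(latents, audio, w_q, w_k, w_v):
    """Video latents act as queries; audio embeddings act as keys and values.

    Returns the latents with the attended audio features added residually,
    plus the attention weights (one row per latent token).
    """
    q = matmul(latents, w_q)   # (num_latents x d_k)
    k = matmul(audio, w_k)     # (num_audio x d_k)
    v = matmul(audio, w_v)     # (num_audio x latent_dim)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s * scale for s in row]) for row in scores]
    attended = matmul(weights, v)
    fused = [[x + a for x, a in zip(l_row, a_row)]
             for l_row, a_row in zip(latents, attended)]
    return fused, weights


def rand_matrix(rows, cols):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
latents = rand_matrix(4, 8)   # 4 video latent tokens, dimension 8 (assumed)
audio = rand_matrix(6, 5)     # 6 audio frames, dimension 5 (assumed)
w_q = rand_matrix(8, 16)
w_k = rand_matrix(5, 16)
w_v = rand_matrix(5, 8)

fused, weights = cross_attention(latents, audio, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the audio frames, and `fused` keeps the shape of `latents`. Because the audio features are added straight into the latents, the backbone must absorb them without any audio-specific prior, which is the weakness the abstract identifies in this scheme.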
Taxonomy
Topics: Music Technology and Sound Studies · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis
