StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

Shuyuan Tu; Yueming Pan; Yinming Huang; Xintong Han; Zhen Xing; Qi Dai; Chong Luo; Zuxuan Wu; Yu-Gang Jiang

arXiv:2508.08248 · cs.CV · August 12, 2025

TL;DR

StableAvatar is an end-to-end video diffusion transformer that generates infinite-length, high-quality avatar videos from a reference image and audio, without post-processing. It targets the weak audio synchronization and identity consistency that earlier models show on long videos.

Contribution

It introduces a novel Time-step-aware Audio Adapter and Audio Native Guidance Mechanism to prevent error accumulation and enhance audio synchronization in long video generation.

Findings

01. Effective infinite-length video generation demonstrated on benchmarks.

02. Improved audio-visual synchronization and identity preservation.

03. Qualitative and quantitative validation of model performance.

Abstract

Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across…
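To make the abstract's description concrete, the following is a minimal sketch of the baseline conditioning it attributes to existing models: audio embeddings injected into the diffusion latents via cross-attention. It is not StableAvatar's Time-step-aware Audio Adapter or its Audio Native Guidance Mechanism. The function names, the toy two-dimensional vectors, the omitted learned query/key/value projections, and the residual addition are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_tokens, audio_tokens):
    """Let each video latent token (query) attend over audio tokens (keys/values).

    Assumption: learned Q/K/V projections are replaced by the identity so the
    sketch stays short; a real model would apply trained linear layers.
    """
    dim = len(audio_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    conditioned = []
    for query in latent_tokens:
        # Scaled dot-product score between this latent token and every audio token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in audio_tokens]
        weights = softmax(scores)
        # Weighted sum of audio tokens, one output value per feature dimension.
        attended = [
            sum(w * value[j] for w, value in zip(weights, audio_tokens))
            for j in range(dim)
        ]
        # Residual connection (assumed): add the audio context to the latent.
        conditioned.append([q + a for q, a in zip(query, attended)])
    return conditioned


# Hypothetical toy inputs: 2 latent tokens and 3 audio tokens, each of dimension 2.
latents = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
conditioned = cross_attention(latents, audio)
```

In this sketch the audio embeddings enter the latents unchanged at every call, which is the kind of direct injection the abstract identifies as a source of accumulated latent distribution error in long videos.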

Code & Models

Models

🤗 FrancisRing/StableAvatar
model · 72 dl · ♡ 79

Taxonomy

Topics: Music Technology and Sound Studies · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis