JoyStreamer-Flash: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion
Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo, Huan Zhang, Youzheng Wu, Xiaodong He

TL;DR
JoyStreamer-Flash is a real-time, autoregressive audio-driven avatar generation model that produces infinite-length videos with improved stability and coherence, overcoming previous limitations of high computational costs and short video durations.
Contribution
The paper introduces three novel techniques—Progressive Step Bootstrapping, Motion Condition Injection, and Unbounded RoPE via Cache-Resetting—that enhance real-time, long-duration avatar synthesis.
Findings
Achieves 16 FPS inference on a single GPU.
Maintains high visual quality and temporal consistency.
Supports infinite-length video generation.
Abstract
Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by high computational overhead and the inability to synthesize long-duration videos. Block-wise autoregressive diffusion methods address these limitations, but they suffer from error accumulation and quality degradation. To address this, we propose JoyStreamer-Flash, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion…
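The first two ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the linear decay rule, the default step counts, and the simple blend used to corrupt a previous frame are all assumptions chosen for illustration. It only shows the general shape of "more denoising steps for early blocks" and "previous frames perturbed with noise before being reused as a condition".

```python
import random


def psb_step_schedule(num_blocks, max_steps=8, min_steps=2, decay=2):
    """Hypothetical schedule: early blocks get more denoising steps.

    The linear decay and the default values are assumptions, not
    numbers taken from the paper.
    """
    steps = []
    for block_idx in range(num_blocks):
        # Drop `decay` steps per block until the floor `min_steps` is reached.
        steps.append(max(min_steps, max_steps - decay * block_idx))
    return steps


def corrupt_with_noise(frame, noise_level, rng):
    """Blend a previous frame (a flat list of floats) with Gaussian noise.

    A simple linear blend is assumed here; the paper may use a different
    noising rule for its motion condition.
    """
    return [
        (1.0 - noise_level) * x + noise_level * rng.gauss(0.0, 1.0)
        for x in frame
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    print(psb_step_schedule(6))
    print(corrupt_with_noise([0.5, -0.25, 1.0], noise_level=0.3, rng=rng))
```

Under these assumed defaults, the first block receives the full step budget and later blocks settle at the floor, which is one plausible way to spend compute where errors would otherwise start to accumulate.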
