AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising
Liyuan Cui, Wentao Hu, Wenyuan Zhang, Zesong Yang, Fan Shi, Xiaoqiang Liu

TL;DR
AvatarForcing is a one-step streaming diffusion framework for real-time talking avatar generation. It achieves low latency and stable long-form synthesis through dual-anchor temporal forcing and distillation.
Contribution
The paper proposes AvatarForcing, a one-step streaming diffusion method with dual-anchor temporal forcing for stable, real-time talking avatar synthesis. It targets two limitations of prior work: error accumulation from exposure bias in autoregressive models, and the prohibitive compute cost of full-sequence diffusion transformers.
Findings
Achieves 34 ms/frame inference time with high visual quality.
Maintains lip synchronization and stability over long videos.
Outperforms existing methods on standard benchmarks and a new long-form dataset.
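As a quick sanity check on the latency figure, the reported 34 ms/frame can be converted to a frame rate. The helper name below is hypothetical and the calculation assumes the figure is a steady-state per-frame cost; it is simple arithmetic, not a measurement from the paper.

```python
def frames_per_second(ms_per_frame: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    if ms_per_frame <= 0:
        raise ValueError("ms_per_frame must be positive")
    return 1000.0 / ms_per_frame


if __name__ == "__main__":
    # 34 ms/frame is the inference time reported in the findings above.
    print(f"{frames_per_second(34.0):.1f} frames per second")  # prints 29.4
```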
Abstract
Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks…
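The sliding-window idea in the abstract can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not the authors' implementation: the noise levels are assumed to form a linear ladder across the window (the abstract only says they are heterogeneous), the denoiser is replaced by a counter decrement, and the names `Block` and `simulate_stream` are hypothetical. It shows only the bookkeeping: each step makes one pass over a fixed-size window, the oldest block becomes clean and is emitted, and a fresh fully noised block enters at the far end.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Block:
    index: int       # position of the block in the output stream
    steps_left: int  # remaining denoising passes; 0 means clean


def simulate_stream(num_blocks: int, window: int) -> list[tuple[int, int]]:
    """Return (step, emitted_block_index) pairs for a toy sliding-window run.

    Assumption: block k in the window starts with k + 1 passes remaining,
    i.e. a linear noise ladder from nearly clean (head) to pure noise (tail).
    """
    active: deque[Block] = deque()
    next_index = 0

    # Fill the initial window with the assumed noise ladder.
    while next_index < min(window, num_blocks):
        active.append(Block(next_index, next_index + 1))
        next_index += 1

    emitted = []
    step = 0
    while active:
        # Stand-in for ONE denoiser call over the whole window:
        # every block moves one level closer to clean.
        for block in active:
            block.steps_left -= 1

        # The head of the window is now clean and leaves the window.
        head = active.popleft()
        assert head.steps_left == 0
        emitted.append((step, head.index))

        # Slide the window: append a fully noised future block, if any remain.
        if next_index < num_blocks:
            active.append(Block(next_index, window))
            next_index += 1
        step += 1
    return emitted


if __name__ == "__main__":
    for step, index in simulate_stream(num_blocks=6, window=3):
        print(f"step {step}: emitted block {index}")
```

In this toy version the window never grows beyond `window` blocks, so the work per step stays bounded regardless of stream length, which is the property the abstract describes as constant per-step cost.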
Taxonomy
Topics: Speech and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis
