AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising

Liyuan Cui; Wentao Hu; Wenyuan Zhang; Zesong Yang; Fan Shi; Xiaoqiang Liu

arXiv:2603.14331 · cs.CV · March 19, 2026

TL;DR

AvatarForcing is a one-step streaming diffusion framework for real-time talking avatar generation. It targets low latency and stable long-form synthesis through dual-anchor temporal forcing and distillation.

Contribution

The paper proposes AvatarForcing, a one-step streaming diffusion method with dual-anchor temporal forcing for stable, real-time talking avatar synthesis. It targets two limitations of prior work: exposure bias and error accumulation in autoregressive forcing, and the computational cost of full-sequence diffusion transformers.

Findings

01. Achieves 34 ms/frame inference time with high visual quality.

02. Maintains lip synchronization and stability over long videos.

03. Outperforms existing methods on standard benchmarks and a new long-form dataset.
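
To put the 34 ms/frame figure in context, the short sketch below converts a per-frame latency into throughput and checks it against a playback frame rate. The helper names and the 25 and 30 frames/s targets are illustrative assumptions; this page does not state the frame rate the paper targets.

```python
def frames_per_second(ms_per_frame: float) -> float:
    """Convert a per-frame latency in milliseconds to throughput in frames/s."""
    return 1000.0 / ms_per_frame


def meets_realtime(ms_per_frame: float, target_fps: float) -> bool:
    """True if each frame is produced within the per-frame budget of target_fps."""
    budget_ms = 1000.0 / target_fps
    return ms_per_frame <= budget_ms


if __name__ == "__main__":
    latency_ms = 34.0  # figure reported in the findings above
    print(f"{frames_per_second(latency_ms):.1f} frames/s")  # 29.4 frames/s
    # The target rates below are hypothetical examples, not values from the paper.
    for target in (25.0, 30.0):
        print(target, meets_realtime(latency_ms, target))
```

Under these assumptions, 34 ms/frame fits inside the 40 ms budget of 25 frames/s and sits just above the roughly 33.3 ms budget of 30 frames/s.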

Abstract

Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks…
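
The sliding-window idea in the abstract can be illustrated with a toy scheduler. This is only a sketch under stated assumptions, not the authors' implementation: it tracks noise levels rather than real latents, assumes a linear ladder of integer noise "rungs" across the window, starts directly in the steady state (no warm-up), and leaves out the denoiser network, the audio conditioning, and both anchors. The names `stream_blocks`, `window`, and `rung` are hypothetical.

```python
from collections import deque
from typing import Iterator, List, Tuple


def stream_blocks(num_blocks: int, window: int = 4) -> Iterator[Tuple[int, int, List[int]]]:
    """Yield (step, emitted_block_id, rungs_before_step) for a toy sliding window.

    Each block carries an integer noise rung: 0 is clean, `window` is pure noise.
    The window holds blocks at rungs 1..window (heterogeneous noise levels).
    """
    # Assumed steady-state start: the block nearest the present is almost clean,
    # the block farthest in the future is pure noise.
    queue = deque((block_id, block_id + 1) for block_id in range(window))
    next_id = window
    for step in range(num_blocks):
        rungs_before = [rung for _, rung in queue]
        # One denoising pass over the whole window per step (constant cost):
        # every block moves down exactly one rung.
        queue = deque((block_id, rung - 1) for block_id, rung in queue)
        # The front block has reached rung 0, so it is emitted as clean.
        block_id, rung = queue.popleft()
        assert rung == 0
        yield step, block_id, rungs_before
        # Slide the window forward by appending a fresh pure-noise block.
        queue.append((next_id, window))
        next_id += 1


if __name__ == "__main__":
    for step, block_id, rungs in stream_blocks(num_blocks=3, window=4):
        print(step, block_id, rungs)
```

In this toy setup every step touches the same number of blocks and emits exactly one, which mirrors the "one clean block per step under constant per-step cost" property described above.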

Code & Models

Models

🤗 lycui/AvatarForcing (model, ♡ 4)

Taxonomy

Topics: Speech and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis