StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

Yifan Yang; Zhi Cen; Sida Peng; Xiangwei Chen; Yifu Deng; Xinyu Zhu; Fan Jia; Xiaowei Zhou; Hujun Bao

arXiv:2511.14223 · cs.CV · May 19, 2026

TL;DR

StreamingTalker introduces an autoregressive diffusion model for real-time, audio-driven 3D facial animation that handles varying audio lengths with low latency, outperforming previous methods in naturalness and efficiency.

Contribution

The paper presents a novel streaming autoregressive diffusion approach for speech-driven 3D facial animation, enabling real-time synthesis with flexible audio length handling.

Findings

01 Achieves high-quality, real-time facial animation from speech inputs.

02 Handles varying audio lengths with low latency.

03 Demonstrates effectiveness through a real-time interactive demo.

Abstract

This paper focuses on speech-driven 3D facial animation, which aims to generate realistic facial motions synchronized with speech input. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the entire audio sequence in a single pass, which poses two major challenges: they tend to perform poorly on audio sequences that exceed the training horizon, and they suffer from significant latency on long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past…
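The streaming idea in the abstract can be illustrated with a toy loop. This is a minimal sketch, not the authors' implementation: the abstract is cut off before it says what the model conditions on, so the sketch assumes, purely for illustration, a fixed-size window of previously generated frames. The names `toy_denoise`, `stream_animation`, `context_size`, and `steps` are hypothetical, each frame is reduced to a single float, and the "denoiser" is an arbitrary stand-in rather than a trained diffusion model.

```python
from collections import deque
import random


def toy_denoise(audio_chunk, context, steps=4, seed=0):
    """Hypothetical stand-in for a diffusion denoiser (assumption, not the paper's model).

    Starts from Gaussian noise and iteratively moves each frame toward an
    arbitrary target built from the audio feature and the context mean.
    """
    rng = random.Random(seed)
    ctx_mean = sum(context) / len(context) if context else 0.0
    frames = [rng.gauss(0.0, 1.0) for _ in audio_chunk]
    for _ in range(steps):
        # Each step halves the distance between a frame and its target.
        frames = [
            f + 0.5 * ((0.8 * a + 0.2 * ctx_mean) - f)
            for f, a in zip(frames, audio_chunk)
        ]
    return frames


def stream_animation(audio_chunks, context_size=8, steps=4):
    """Yield motion frames chunk by chunk, conditioning on a bounded history."""
    # deque(maxlen=...) silently drops the oldest frames once it is full,
    # so the conditioning window never grows with the audio length.
    history = deque(maxlen=context_size)
    for i, chunk in enumerate(audio_chunks):
        frames = toy_denoise(chunk, list(history), steps=steps, seed=i)
        history.extend(frames)
        yield frames


if __name__ == "__main__":
    chunks = [[0.1] * 5, [0.2] * 3, [0.3] * 7]  # chunks of varying length
    for frames in stream_animation(chunks, context_size=4):
        print(len(frames))
```

Because the history is capped at `context_size` frames, the work done per chunk in this sketch depends only on the chunk length, the window size, and the number of denoising steps, not on how much audio has already been processed. That is the property the abstract describes as latency independent of audio duration.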

Code & Models

Repositories

https://zju3dv.github.io/StreamingTalker

Videos

StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing