ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model

Xuangeng Chu; Nabarun Goswami; Ziteng Cui; Hanqin Wang; Tatsuya Harada

arXiv:2502.20323 · cs.CV · September 5, 2025

TL;DR

This paper presents ARTalk, an autoregressive model for real-time, speech-driven 3D head animation. It produces synchronized lip movements, head poses, and eye blinks, and can adapt to unseen speaking styles.

Contribution

Introduces a novel autoregressive approach to real-time 3D head animation from speech that adapts to unseen speaking styles and outperforms existing methods in lip synchronization accuracy and perceived quality.

Findings

01. Achieves real-time generation of synchronized facial motions.

02. Outperforms existing methods in lip synchronization accuracy.

03. Demonstrates adaptability to unseen speaking styles.

Abstract

Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks by learning a mapping from speech to a multi-scale motion codebook. Furthermore, our model can adapt to unseen speaking styles, enabling the creation of 3D talking avatars with unique personal styles beyond the identities seen during training. Extensive evaluations and user studies demonstrate that our method outperforms existing approaches in lip synchronization accuracy and perceived quality.
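
The abstract does not say how the multi-scale motion codebook is built. As a rough illustration only, the sketch below shows one generic way to turn a motion sequence into coarse-to-fine token lists using residual vector quantization at several temporal resolutions. Everything in it is an assumption made for illustration and not the authors' implementation: the function names (`resample`, `nearest_code`, `multiscale_quantize`), the shared random codebook, the linear resampling, and the scale schedule `(2, 4, 8)`.

```python
import numpy as np


def resample(x, length):
    """Linearly resample a (T, D) sequence to (length, D) along the time axis."""
    t_old = np.linspace(0.0, 1.0, num=x.shape[0])
    t_new = np.linspace(0.0, 1.0, num=length)
    return np.stack(
        [np.interp(t_new, t_old, x[:, d]) for d in range(x.shape[1])], axis=1
    )


def nearest_code(x, codebook):
    """For each row of x, return the index of the closest codebook row (L2)."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def multiscale_quantize(motion, codebook, scales):
    """Quantize a (T, D) motion sequence coarse-to-fine (hypothetical scheme).

    At each scale the remaining residual is resampled to that temporal
    length, snapped to the nearest codebook entries, upsampled back to T
    frames, added to the reconstruction, and subtracted from the residual.
    """
    residual = motion.copy()
    recon = np.zeros_like(motion)
    tokens = []
    for s in scales:
        coarse = resample(residual, s)          # (s, D) view of what is left
        idx = nearest_code(coarse, codebook)    # (s,) discrete tokens
        tokens.append(idx)
        up = resample(codebook[idx], motion.shape[0])  # back to (T, D)
        recon += up
        residual -= up
    return tokens, recon


# Toy data: 8 frames of 3-dimensional motion, 16 random code vectors.
rng = np.random.default_rng(0)
motion = rng.normal(size=(8, 3))
codebook = rng.normal(size=(16, 3))
tokens, recon = multiscale_quantize(motion, codebook, scales=(2, 4, 8))
print([len(t) for t in tokens], recon.shape)
```

In a setup like this, an autoregressive model would predict the token lists one scale at a time, conditioned on speech features and on the coarser tokens already produced. The codebook here is random and untrained, so the reconstruction is not meaningful; the sketch only shows the data flow.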

Code & Models

Models

xg-chu/ARTalk (Hugging Face model)

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings