TL;DR
StreamingTalker introduces an autoregressive diffusion model for real-time, audio-driven 3D facial animation that handles varying audio lengths with low latency, outperforming previous methods in naturalness and efficiency.
Contribution
The paper presents a novel streaming autoregressive diffusion approach for speech-driven 3D facial animation, enabling real-time synthesis with flexible audio length handling.
Findings
Achieves high-quality, real-time facial animation from speech inputs.
Handles varying audio lengths with low latency.
Demonstrates effectiveness through a real-time interactive demo.
Abstract
This paper focuses on speech-driven 3D facial animation, which aims to generate realistic facial motion synchronized with speech input. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the entire audio sequence in a single pass, which poses two major challenges: they tend to perform poorly on audio sequences that exceed the training horizon, and they suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past…
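The abstract is cut off above, so the paper's actual selection rule and network are not shown here. As a rough illustration only of the general idea it describes (generating motion chunk by chunk while conditioning on a bounded window of past frames), the Python sketch below may help. Every name in it (`toy_denoise_step`, `stream_animation`, `history_len`, `num_steps`) is hypothetical, and a toy arithmetic update stands in for a learned denoiser.

```python
import random
from collections import deque
from typing import Deque, Iterable, Iterator, List

Frame = List[float]


def toy_denoise_step(x: Frame, audio_chunk: List[float], history: List[Frame],
                     step: int, num_steps: int) -> Frame:
    """Hypothetical stand-in for a learned denoiser (NOT the paper's network).

    Moves the noisy sample toward a target derived from the current audio
    chunk and the retained past frames. Assumes audio_chunk is non-empty.
    """
    audio_mean = sum(audio_chunk) / len(audio_chunk)
    past = [v for frame in history for v in frame]
    past_mean = sum(past) / len(past) if past else 0.0
    target = audio_mean + 0.5 * past_mean
    remaining = num_steps - step  # 1 on the final step, so x lands on target
    return [v + (target - v) / remaining for v in x]


def stream_animation(audio_chunks: Iterable[List[float]], dim: int = 4,
                     history_len: int = 2, num_steps: int = 5,
                     seed: int = 0) -> Iterator[Frame]:
    """Yield one motion frame per incoming audio chunk (illustrative only)."""
    rng = random.Random(seed)
    # Bounded history: only the last `history_len` frames are kept as context.
    history: Deque[Frame] = deque(maxlen=history_len)
    for chunk in audio_chunks:
        # Each frame starts from Gaussian noise and is refined step by step.
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for step in range(num_steps):
            x = toy_denoise_step(x, chunk, list(history), step, num_steps)
        history.append(x)
        yield x  # emitted immediately, before later audio is seen


if __name__ == "__main__":
    chunks = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    for i, frame in enumerate(stream_animation(chunks)):
        print(i, [round(v, 3) for v in frame])
```

In this toy, each chunk costs exactly `num_steps` denoiser calls and the context never grows beyond `history_len` frames, so the per-chunk work does not depend on how long the stream is. That is the property the abstract attributes to streaming; the real model's conditioning and sampler may differ substantially.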
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
