Model See Model Do: Speech-Driven Facial Animation with Style Control
Yifang Pan, Karan Singh, Luiz Gustavo Hafemann

TL;DR
This paper introduces a style-conditioned diffusion model for speech-driven 3D facial animation that captures nuanced expressive styles while maintaining accurate lip synchronization.
Contribution
It proposes a novel style basis conditioning mechanism that effectively transfers subtle stylistic cues in facial animations from reference clips.
Findings
High-quality style transfer in facial animations
Superior lip synchronization across speech scenarios
Effective capture of subtle stylistic nuances
Abstract
Speech-driven 3D facial animation plays a key role in applications such as virtual avatars, gaming, and digital content creation. While existing methods have made significant progress in achieving accurate lip synchronization and generating basic emotional expressions, they often struggle to capture and effectively transfer nuanced performance styles. We propose a novel example-based generation framework that conditions a latent diffusion model on a reference style clip to produce highly expressive and temporally coherent facial animations. To address the challenge of accurately adhering to the style reference, we introduce a novel conditioning mechanism called style basis, which extracts key poses from the reference and additively guides the diffusion generation process to fit the style without compromising lip synchronization quality. This approach enables the model to capture subtle…
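The style basis idea described in the abstract (pick key poses from the reference clip, then add their contribution to the generated motion) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how key poses are chosen or how the additive term is weighted, so the greedy farthest-point selection, the function names `select_key_poses` and `add_style_basis`, and the per-frame `weights` array are all assumptions made for illustration. In the actual model the additive guidance would presumably act inside the latent diffusion process rather than on raw motion frames.

```python
import numpy as np

def select_key_poses(reference, k):
    """Pick k representative frames from a (T, D) reference clip.

    Assumption: greedy farthest-point sampling stands in for the paper's
    key-pose extraction, which the abstract does not specify.
    """
    reference = np.asarray(reference, dtype=float)
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the number of frames")
    # Start from the frame farthest from the clip mean (the most extreme pose).
    center = reference.mean(axis=0)
    chosen = [int(np.argmax(np.linalg.norm(reference - center, axis=1)))]
    # Track each frame's distance to its nearest already-chosen pose.
    dist = np.linalg.norm(reference - reference[chosen[0]], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(reference - reference[nxt], axis=1))
    return reference[chosen]

def add_style_basis(base_motion, key_poses, weights):
    """Additively combine key poses into a base motion.

    base_motion: (T, D) motion, e.g. a lip-synced prediction (hypothetical input)
    key_poses:   (K, D) poses extracted from the reference clip
    weights:     (T, K) per-frame mixing weights (hypothetical; in a real
                 system these would be predicted by a network)
    """
    return np.asarray(base_motion, dtype=float) + np.asarray(weights) @ key_poses
```

For example, `select_key_poses(clip, 2)` on a four-frame clip returns the two most mutually distant frames, and `add_style_basis` then shifts every frame of the base motion by a weighted sum of those poses while leaving the base motion itself as the starting point.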
Taxonomy
Methods: Latent Diffusion Model · Diffusion · Contrastive Language-Image Pre-training · ALIGN
