Model See Model Do: Speech-Driven Facial Animation with Style Control

Yifang Pan; Karan Singh; Luiz Gustavo Hafemann

arXiv:2505.01319 · cs.GR · July 16, 2025

TL;DR

This paper introduces a style-conditioned diffusion model for speech-driven 3D facial animation that captures nuanced expressive styles while maintaining accurate lip synchronization.

Contribution

The paper proposes a style basis conditioning mechanism that extracts key poses from a reference clip and additively guides the diffusion generation process, transferring subtle stylistic cues without compromising lip synchronization.

Findings

01 High-quality style transfer in facial animations
02 Superior lip synchronization across speech scenarios
03 Effective capture of subtle stylistic nuances

Abstract

Speech-driven 3D facial animation plays a key role in applications such as virtual avatars, gaming, and digital content creation. While existing methods have made significant progress in achieving accurate lip synchronization and generating basic emotional expressions, they often struggle to capture and effectively transfer nuanced performance styles. We propose a novel example-based generation framework that conditions a latent diffusion model on a reference style clip to produce highly expressive and temporally coherent facial animations. To address the challenge of accurately adhering to the style reference, we introduce a novel conditioning mechanism called style basis, which extracts key poses from the reference and additively guides the diffusion generation process to fit the style without compromising lip synchronization quality. This approach enables the model to capture subtle…
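
The abstract describes the style basis only at a high level: key poses are extracted from the reference clip and used to additively guide generation. The sketch below is one possible reading of that idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the use of farthest-point sampling to choose key poses, the least-squares projection onto the key poses, and the `scale` parameter. It also works on plain NumPy arrays rather than on the latents of a diffusion model.

```python
import numpy as np


def extract_key_poses(reference, k):
    """Pick k frames from a (T, D) reference clip.

    Assumption: greedy farthest-point sampling stands in for whatever
    key-pose extraction the paper actually uses.
    """
    # Start from the frame farthest from the clip's mean pose.
    start = int(np.argmax(np.linalg.norm(reference - reference.mean(axis=0), axis=1)))
    chosen = [start]
    dists = np.linalg.norm(reference - reference[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        # Track each frame's distance to its nearest chosen key pose.
        dists = np.minimum(dists, np.linalg.norm(reference - reference[nxt], axis=1))
    return reference[chosen]


def project_onto_basis(frames, basis):
    """Least-squares projection of (T, D) frames onto the span of (k, D) key poses."""
    weights, *_ = np.linalg.lstsq(basis.T, frames.T, rcond=None)
    return (basis.T @ weights).T


def guided_step(denoised, basis, scale):
    """Additive guidance (hypothetical): nudge a denoised estimate toward the basis.

    scale = 0 leaves the estimate unchanged; scale = 1 replaces it with
    its projection onto the key poses.
    """
    return denoised + scale * (project_onto_basis(denoised, basis) - denoised)


rng = np.random.default_rng(0)
reference_clip = rng.normal(size=(50, 8))   # stand-in for a reference style clip
basis = extract_key_poses(reference_clip, k=3)
estimate = rng.normal(size=(10, 8))         # stand-in for one denoising output
styled = guided_step(estimate, basis, scale=0.5)
print(basis.shape, styled.shape)
```

In this toy version, `scale` trades off how strongly the output is pulled toward the reference poses against how much of the original (speech-driven) estimate is kept, which mirrors the abstract's stated goal of fitting the style without hurting lip synchronization.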

Taxonomy

Methods: Latent Diffusion Model · Diffusion · Contrastive Language-Image Pre-training · ALIGN