SAiD: Speech-driven Blendshape Facial Animation with Diffusion
Inkyu Park, Jaewoong Cho

TL;DR
This paper introduces SAiD, a diffusion-based Transformer model for speech-driven 3D facial animation that targets better lip synchronization and more diverse lip movements, and presents BlendVOCA, a new benchmark dataset pairing speech audio with blendshape parameters.
Contribution
The paper proposes a novel diffusion model with a cross-modality alignment bias between audio and visual features for improved lip sync and diversity, along with a new dataset, BlendVOCA, for training and benchmarking.
Findings
SAiD achieves lip synchronization comparable to or better than the baselines.
It generates more diverse lip movements.
It streamlines the animation editing process.
Abstract
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets despite extensive research. Most prior works, typically focused on learning regression models on a small dataset using the method of least squares, encounter difficulties generating diverse lip movements from speech and require substantial effort in refining the generated outputs. To address these issues, we propose a speech-driven 3D facial animation with a diffusion model (SAiD), a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual to enhance lip synchronization. Moreover, we introduce BlendVOCA, a benchmark dataset of pairs of speech audio and parameters of a blendshape facial model, to address the scarcity of public resources. Our experimental results demonstrate that the proposed approach achieves comparable or superior performance…
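The abstract refers to "parameters of a blendshape facial model". As a rough illustration of what such parameters drive, the sketch below evaluates a standard linear blendshape model: each frame's mesh is the neutral mesh plus a weighted sum of per-blendshape vertex offsets. This is a generic sketch, not code from SAiD or BlendVOCA; the function names (`apply_blendshapes`, `animate`) and the toy mesh are hypothetical, and the offset-based formulation is an assumption about how the blendshapes are stored.

```python
from typing import List, Sequence

Vertex = List[float]   # one [x, y, z] coordinate triple
Mesh = List[Vertex]    # one entry per vertex


def apply_blendshapes(
    neutral: Mesh,
    deltas: Sequence[Mesh],
    weights: Sequence[float],
) -> Mesh:
    """Return neutral + sum_i weights[i] * deltas[i], coordinate by coordinate.

    `deltas[i]` holds the per-vertex offset of blendshape i from the neutral
    mesh (an assumed storage convention, not taken from the paper).
    """
    if len(deltas) != len(weights):
        raise ValueError("need exactly one weight per blendshape")
    result = [list(vertex) for vertex in neutral]  # copy; do not mutate input
    for delta, weight in zip(deltas, weights):
        for v_idx, offset in enumerate(delta):
            for c_idx, value in enumerate(offset):
                result[v_idx][c_idx] += weight * value
    return result


def animate(
    neutral: Mesh,
    deltas: Sequence[Mesh],
    weight_frames: Sequence[Sequence[float]],
) -> List[Mesh]:
    """Evaluate the blendshape model once per frame of weights."""
    return [apply_blendshapes(neutral, deltas, w) for w in weight_frames]


if __name__ == "__main__":
    # Toy two-vertex mesh with two hypothetical blendshapes.
    neutral = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    deltas = [
        [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],  # moves vertex 0 along y
        [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]],  # moves vertex 1 along z
    ]
    for frame in animate(neutral, deltas, [[0.0, 0.0], [0.5, 0.25]]):
        print(frame)
```

Under this formulation, a model that predicts a sequence of weight vectors from speech yields an animation that can be edited by adjusting individual weights per frame, rather than by moving mesh vertices directly.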
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing
Methods: Max Pooling · Concatenated Skip Connection · Diffusion · Convolution · U-Net
