SAiD: Speech-driven Blendshape Facial Animation with Diffusion

Inkyu Park; Jaewoong Cho

arXiv:2401.08655 · cs.CV · January 26, 2024 · 1 citation

TL;DR

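This paper introduces SAiD, a lightweight diffusion model built on a Transformer-based U-Net for speech-driven 3D facial animation, which improves lip synchronization and the diversity of generated lip movements. It also presents BlendVOCA, a benchmark dataset that pairs speech audio with blendshape parameters to address the scarcity of public resources.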

Contribution

The paper proposes a novel diffusion model with cross-modality alignment for improved lip sync and diversity, along with a new dataset, BlendVOCA, for training and benchmarking.
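
To make the idea of a cross-modality alignment bias concrete, here is a minimal sketch of one common way to build such a bias for cross-attention between animation frames and audio features. It is an illustration under stated assumptions, not the authors' implementation: the uniform frame-to-audio mapping, the `window` parameter, and the function names `alignment_bias` and `masked_softmax` are all hypothetical.

```python
import math


def alignment_bias(num_frames: int, num_audio: int, window: int = 0) -> list[list[float]]:
    """Additive attention bias: 0.0 where frame i may attend to audio step j, -inf elsewhere.

    Assumes frames and audio features cover the same time span uniformly
    (an assumption made for this sketch).
    """
    bias = []
    for i in range(num_frames):
        # Audio steps that fall inside the time span of frame i.
        start = i * num_audio // num_frames
        end = max(start + 1, (i + 1) * num_audio // num_frames)
        # Optionally widen the span by `window` audio steps on each side.
        lo = max(0, start - window)
        hi = min(num_audio, end + window)
        bias.append([0.0 if lo <= j < hi else -math.inf for j in range(num_audio)])
    return bias


def masked_softmax(scores: list[float], bias_row: list[float]) -> list[float]:
    """Softmax over attention scores after adding one row of the bias."""
    shifted = [s + b for s, b in zip(scores, bias_row)]
    peak = max(shifted)
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]
```

With 2 frames and 4 audio steps, the first frame can only attend to the first two audio steps, so attention weights on unaligned audio become exactly zero. In a real model the same bias would be added to the attention logits before the softmax.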

Findings

01. Achieves lip synchronization comparable to or better than the baselines.

02. Produces more diverse lip movements.

03. Streamlines the animation editing process.

Abstract

Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets despite extensive research. Most prior works, typically focused on learning regression models on a small dataset using the method of least squares, encounter difficulties generating diverse lip movements from speech and require substantial effort in refining the generated outputs. To address these issues, we propose a speech-driven 3D facial animation with a diffusion model (SAiD), a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual to enhance lip synchronization. Moreover, we introduce BlendVOCA, a benchmark dataset of pairs of speech audio and parameters of a blendshape facial model, to address the scarcity of public resources. Our experimental results demonstrate that the proposed approach achieves comparable or superior performance…
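
For readers unfamiliar with blendshape facial models, the sketch below shows the standard linear blendshape formulation: each output vertex is the neutral vertex plus a weighted sum of per-blendshape offsets. It is a generic illustration, not code from the paper or from BlendVOCA; the function name `apply_blendshapes`, the data layout, and the blendshape name `jawOpen` in the usage note are assumptions made for this example.

```python
Vertex = tuple[float, float, float]


def apply_blendshapes(
    neutral: list[Vertex],
    deltas: dict[str, list[Vertex]],
    coeffs: dict[str, float],
) -> list[Vertex]:
    """Return deformed vertices: neutral + sum_k coeffs[k] * deltas[k].

    neutral: vertex positions of the neutral face.
    deltas:  per-blendshape vertex offsets from the neutral face.
    coeffs:  per-blendshape weights for one animation frame.
    """
    out = [list(v) for v in neutral]
    for name, weight in coeffs.items():
        for vertex, delta in zip(out, deltas[name]):
            for axis in range(3):
                vertex[axis] += weight * delta[axis]
    return [(v[0], v[1], v[2]) for v in out]
```

For example, with a two-vertex mesh and a single hypothetical `jawOpen` blendshape, calling `apply_blendshapes([(0, 0, 0), (1, 0, 0)], {"jawOpen": [(0, -1, 0), (0, -2, 0)]}, {"jawOpen": 0.5})` moves each vertex half of the way along its offset. A speech-driven model in this setting predicts one coefficient vector per frame, which is what makes the output easy to edit afterwards.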

Code & Models

Repositories

yunik1004/said (PyTorch, official)

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing

Methods: Max Pooling · Concatenated Skip Connection · Diffusion · Convolution · U-Net