Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation

Yingruo Fan; Zhaojiang Lin; Jun Saito; Wenping Wang; Taku Komura

arXiv:2112.02214 · cs.CV · December 8, 2021

TL;DR

This paper introduces a joint audio-text model that leverages high-level contextual text embeddings to improve expressive speech-driven 3D facial animation, capturing diverse facial motions and expressions more realistically.

Contribution

The novel integration of pre-trained language model embeddings with audio features enhances the synthesis of expressive facial animations beyond phoneme-level approaches.

Findings

01. Outperforms existing state-of-the-art methods in realism and synchronization.

02. Effectively captures diverse upper face expressions.

03. Demonstrates superior results in quantitative, qualitative, and perceptual evaluations.

Abstract

Speech-driven 3D facial animation with accurate lip synchronization has been widely studied. However, synthesizing realistic motions for the entire face during speech has rarely been explored. In this work, we present a joint audio-text model that captures contextual information for expressive speech-driven 3D facial animation. Existing datasets are collected to cover as many different phonemes as possible rather than sentences, which limits the ability of an audio-based model to learn more diverse contexts. To address this, we propose to leverage contextual text embeddings extracted from a powerful pre-trained language model that has learned rich contextual representations from large-scale text data. Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio. In contrast to prior…
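
To make the idea of combining audio features with contextual text embeddings concrete, here is a minimal sketch. It is not the authors' architecture, which the truncated abstract above does not specify. It assumes one simple fusion strategy: token-level text embeddings are linearly interpolated onto the audio frame timeline, then concatenated with the audio features frame by frame. The function names `resample_to_frames` and `fuse_audio_text`, the feature sizes, and the interpolation-plus-concatenation scheme are all hypothetical choices made for illustration.

```python
from typing import List

Vector = List[float]


def resample_to_frames(token_embeddings: List[Vector], num_frames: int) -> List[Vector]:
    """Linearly interpolate token-level embeddings onto num_frames time steps.

    Assumption: tokens are spread uniformly over the utterance. A real system
    would more likely use word-level timestamps or a learned alignment.
    """
    if not token_embeddings or num_frames <= 0:
        raise ValueError("need at least one token embedding and one frame")
    num_tokens = len(token_embeddings)
    if num_tokens == 1 or num_frames == 1:
        # Nothing to interpolate between: repeat the first embedding.
        return [list(token_embeddings[0]) for _ in range(num_frames)]
    frames = []
    for f in range(num_frames):
        # Position of frame f on the token axis, in [0, num_tokens - 1].
        pos = f * (num_tokens - 1) / (num_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, num_tokens - 1)
        w = pos - lo
        frames.append(
            [(1 - w) * a + w * b
             for a, b in zip(token_embeddings[lo], token_embeddings[hi])]
        )
    return frames


def fuse_audio_text(audio_features: List[Vector],
                    token_embeddings: List[Vector]) -> List[Vector]:
    """Concatenate each audio frame with its time-aligned text embedding."""
    text_frames = resample_to_frames(token_embeddings, len(audio_features))
    return [list(a) + t for a, t in zip(audio_features, text_frames)]


# Toy example: 3 audio frames (1-D features) and 2 text tokens (2-D embeddings).
audio = [[0.1], [0.2], [0.3]]
text = [[0.0, 1.0], [1.0, 0.0]]
fused = fuse_audio_text(audio, text)
```

Each fused frame carries both the local acoustic signal and a sentence-level context vector. In a design along these lines, a downstream decoder could then use the text part to help with upper face motion, which the abstract describes as only weakly correlated with the audio.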

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Facial Nerve Paralysis Treatment and Research