Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation

Hao Li; Ju Dai; Xin Zhao; Feng Zhou; Junjun Pan; Lei Li

arXiv:2505.23290·cs.SD·May 30, 2025

TL;DR

Wav2Sem is a plug-and-play semantic decorrelation module for 3D speech-driven facial animation. It decouples self-supervised audio features, which reduces the averaging effect among phonetically similar syllables and makes the generated animation more natural.

Contribution

The paper presents a novel plug-and-play semantic decorrelation module, Wav2Sem, that effectively decouples audio features to improve lip motion generation in facial animation.

Findings

01. Significantly reduces averaging of similar syllables in lip shapes.

02. Enhances the naturalness and accuracy of facial animations.

03. Effective across multiple speech-driven models.

Abstract

In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, because language contains many phonetically similar syllables with distinct lip shapes, these near-homophone syllables tend to be strongly coupled in self-supervised audio feature spaces, which leads to an averaging effect in subsequent lip motion generation. To address this issue, this paper proposes Wav2Sem, a plug-and-play semantic decorrelation module. The module extracts semantic features corresponding to the entire audio sequence and uses this added semantic information to decorrelate audio encodings within the feature space, yielding more expressive audio features. Extensive experiments across multiple speech-driven models indicate that the Wav2Sem module effectively decouples audio features, significantly…
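The coupling problem described in the abstract can be illustrated with a toy example. The sketch below is not the Wav2Sem architecture: the feature values are made up, and `fuse_with_semantics` is a hypothetical helper that simply concatenates a sequence-level vector onto a frame feature, which is an assumption made for illustration. It only shows that two nearly identical frame features become less similar once each is combined with a distinct sequence-level semantic vector.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse_with_semantics(frame_feat, sem_vec):
    """Hypothetical fusion: append a sequence-level semantic vector to a frame feature."""
    return np.concatenate([frame_feat, sem_vec])

# Made-up frame-level features for two near-homophone syllables (assumed values).
frame_a = np.array([1.0, 0.9, 0.1, 0.0])
frame_b = np.array([1.0, 0.8, 0.2, 0.0])

# Made-up sequence-level semantic vectors for the two utterances (assumed values).
sem_a = np.array([1.0, 0.0])
sem_b = np.array([0.0, 1.0])

before = cosine(frame_a, frame_b)
after = cosine(fuse_with_semantics(frame_a, sem_a),
               fuse_with_semantics(frame_b, sem_b))

print(f"similarity before fusion: {before:.3f}")
print(f"similarity after fusion:  {after:.3f}")
```

In this toy setting the frame features start out almost collinear, and the appended semantic vectors pull them apart. How the actual module extracts and injects semantic features is not specified in the text above.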

Code & Models

Repositories

wslh852/wav2sem (PyTorch, official)

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing