Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model
Fan Zhang, Naye Ji, Fuxing Gao, Siyuan Zhao, Zhaohan Wang, Shunman Li

TL;DR
This paper presents 'diffmotion-v2', a novel speech-driven gesture synthesis model using WavLM, capable of generating natural, stylized co-speech gestures directly from raw speech audio without manual annotations.
Contribution
It introduces a diffusion-based transformer model that uses WavLM to extract rich audio features, enabling stylized gesture generation from speech audio alone and simplifying previous multimodal approaches.
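To make the idea of speech-conditioned diffusion concrete, here is a minimal sketch of one generic DDPM-style training step, in which a gesture sequence is noised and a denoiser conditioned on per-frame audio features is asked to predict that noise. This is an illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the step count, the tensor shapes, and the names `forward_noise`, `placeholder_denoiser`, and `training_loss` are all hypothetical, and random arrays stand in for both the motion data and the WavLM features.

```python
import numpy as np

NUM_STEPS = 1000  # assumed number of diffusion steps (not taken from the paper)

# Assumed linear beta schedule; alpha_bar[t] is the cumulative signal retention.
betas = np.linspace(1e-4, 0.02, NUM_STEPS)
alpha_bar = np.cumprod(1.0 - betas)


def forward_noise(x0, t, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


def placeholder_denoiser(x_t, audio_feats, t):
    """Hypothetical stand-in for a conditional transformer.

    A real model would attend over audio_feats to predict the noise in x_t;
    this placeholder only checks frame alignment and predicts zero noise.
    """
    assert x_t.shape[0] == audio_feats.shape[0], "one audio feature vector per frame"
    return np.zeros_like(x_t)


def training_loss(x0, audio_feats, t, rng):
    """Mean squared error between the true and the predicted noise."""
    x_t, eps = forward_noise(x0, t, rng)
    eps_hat = placeholder_denoiser(x_t, audio_feats, t)
    return float(np.mean((eps - eps_hat) ** 2))


rng = np.random.default_rng(0)
frames, pose_dim, feat_dim = 40, 6, 768  # illustrative sizes only
gestures = rng.standard_normal((frames, pose_dim))     # stand-in for pose data
audio_feats = rng.standard_normal((frames, feat_dim))  # stand-in for audio features
noisy, _ = forward_noise(gestures, 500, rng)
loss = training_loss(gestures, audio_feats, t=500, rng=rng)
```

Because the denoiser sees the whole noised sequence at once rather than one frame at a time, a setup like this is non-autoregressive in the sense the summary describes; how the actual model injects the audio condition is not specified here.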
Findings
Successfully synthesizes natural co-speech gestures with various styles.
Outperforms existing methods in subjective evaluations.
Effectively captures personality and emotion traits from speech.
Abstract
The generation of co-speech gestures for digital humans is an emerging area in the field of virtual human creation. Prior research has made progress by using acoustic and semantic information as input and adopting classification methods to identify the person's ID and emotion for driving co-speech gesture generation. However, this endeavour still faces significant challenges. These challenges go beyond the intricate interplay between co-speech gestures, speech acoustics, and semantics; they also encompass the complexities associated with personality, emotion, and other obscure but important factors. This paper introduces "diffmotion-v2," a speech-conditional, diffusion-based, non-autoregressive transformer-based generative model built on the pre-trained WavLM model. It can produce individual and stylized full-body co-speech gestures using only raw speech audio, eliminating the need for complex…
Taxonomy
Topics: Emotion and Mood Recognition · Human Motion and Animation · Music and Audio Processing
