SIGGesture: Generalized Co-Speech Gesture Synthesis via Semantic Injection with Large-Scale Pre-Training Diffusion Models

Qingrong Cheng; Xu Li; Xinghui Fu; Fei Xia; Zhongqian Sun

arXiv:2405.13336·cs.HC·September 24, 2024

Open Access

TL;DR

SIGGesture is a diffusion-based framework that synthesizes high-quality, semantically relevant 3D gestures from speech by combining large-scale pre-training and semantic injection, improving over previous methods in realism and generalization.

Contribution

The paper introduces a novel diffusion model with semantic injection and leverages large language models for semantic gesture synthesis, advancing the state-of-the-art in speech-driven gesture generation.

Findings

01. Outperforms existing baselines in gesture quality and relevance.

02. Demonstrates strong generalization to in-the-wild speech data.

03. Provides controllability over semantic gesture synthesis.

Abstract

The automated synthesis of high-quality 3D gestures from speech is of significant value in virtual humans and gaming. Previous methods focus on synthesizing gestures that are synchronized with speech rhythm, yet they frequently overlook semantic gestures. These are sparse and follow a long-tailed distribution across the gesture sequence, making them difficult to learn in an end-to-end manner. Moreover, methods that generate gestures rhythmically aligned with speech struggle to generalize to in-the-wild speech. To address these issues, we introduce SIGGesture, a novel diffusion-based approach for synthesizing realistic gestures that are both high quality and semantically pertinent. Specifically, we first build a strong diffusion-based foundation model for rhythmical gesture synthesis by pre-training it on a collected large-scale dataset with…

Taxonomy

Topics: Hand Gesture Recognition Systems · Speech and dialogue systems · Hearing Impairment and Communication