A conversational gesture synthesis system based on emotions and semantics
Thanh Hoang-Minh

TL;DR
DeepGesture is a diffusion-based framework that generates expressive, emotionally aware co-speech gestures from multimodal inputs, improving the realism and contextual appropriateness of digital humans' motion.
Contribution
It introduces novel architectural enhancements to DiffuseStyleGesture, enabling semantic alignment, emotion control, and generalization to synthetic speech in gesture synthesis.
Findings
Improved human-likeness and contextual appropriateness of gestures
Supports interpolation between emotional states
Generalizes to out-of-distribution speech, including synthetic voices
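The summary does not say how interpolation between emotional states is implemented. One common approach, sketched below purely as an illustration, is to blend the conditioning vectors of two emotions linearly before passing the result to the generator. The emotion list, the one-hot encoding, and the function names here are hypothetical and are not taken from DeepGesture.

```python
# Illustrative sketch only: the emotion labels and encoding are assumptions,
# not the conditioning scheme used by DeepGesture.
EMOTIONS = ["neutral", "happy", "sad", "angry"]  # hypothetical label set


def one_hot(name):
    """Return a one-hot emotion code for a label in EMOTIONS."""
    return [1.0 if e == name else 0.0 for e in EMOTIONS]


def interpolate_emotion(emb_a, emb_b, alpha):
    """Linearly blend two emotion vectors.

    alpha = 0 returns emb_a unchanged, alpha = 1 returns emb_b.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(emb_a) != len(emb_b):
        raise ValueError("emotion vectors must have the same length")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]


# A condition that is mostly "happy" with a quarter weight on "sad".
blend = interpolate_emotion(one_hot("happy"), one_hot("sad"), 0.25)
```

Sweeping alpha from 0 to 1 over successive clips would give a gradual transition between the two emotional states, under the assumption that the generator responds smoothly to the blended condition.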
Abstract
Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals: text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. To visualize…
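As a rough illustration of the classifier-free guidance mentioned in the abstract, the standard formulation combines two noise predictions from the same denoiser, one made with the condition (here, emotion) and one made without it. The sketch below shows only that combination step on toy vectors; the function name, the example values, and the guidance scale are assumptions, and the actual DeepGesture denoiser and its inputs are not reproduced here.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    scale = 0 ignores the condition, scale = 1 uses the conditional
    prediction as-is, and scale > 1 extrapolates past it to strengthen
    the influence of the condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions standing in for the outputs of a denoiser
# (hypothetical values, chosen only to make the arithmetic easy to follow).
eps_uncond = [0.0, 1.0, -1.0]  # prediction with the emotion condition dropped
eps_cond = [1.0, 1.0, 0.0]     # prediction with the emotion condition applied

guided = classifier_free_guidance(eps_uncond, eps_cond, 2.0)
```

In a full sampler this combination would be applied at every denoising step, with the guidance scale acting as the knob that trades fidelity to the condition against sample diversity.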
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition
Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer · Diffusion · Attentive Walk-Aggregating Graph Neural Network
