LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation
Yihao Zhi, Xiaodong Cun, Xuelin Chen, Xi Shen, Wen Guo, Shaoli Huang, Shenghua Gao

TL;DR
LivelySpeaker is a framework for semantics-aware co-speech gesture generation. It decouples script-based gesture generation from audio-guided rhythm refinement, which lets it produce realistic, controllable gestures that match the meaning of the speech.
Contribution
It introduces a two-stage gesture generation method using CLIP embeddings and diffusion models, achieving semantic alignment and style control in co-speech gesture synthesis.
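As a rough illustration of the two-stage data flow only, the sketch below chains a script stage and an audio stage. Everything in it is an assumption made for illustration: the function names `script_to_gesture` and `refine_with_audio`, the random projection standing in for a trained generator, the plain vector standing in for a CLIP text embedding, and the energy-based rescaling standing in for the diffusion-based refinement. It is not the authors' implementation.

```python
import numpy as np

def script_to_gesture(text_embedding, num_frames, num_joints, seed=0):
    """Stage 1 stand-in (hypothetical): map a text embedding to a pose sequence."""
    rng = np.random.default_rng(seed)
    # A fixed random projection takes the place of a trained generator.
    proj = rng.standard_normal((text_embedding.shape[0], num_joints * 3))
    base_pose = text_embedding @ proj  # shape: (num_joints * 3,)
    phase = np.linspace(0.0, 2.0 * np.pi, num_frames)[:, None]
    # Repeat the pose over time with a smooth sway so the sequence has motion.
    return base_pose[None, :] * (1.0 + 0.1 * np.sin(phase))

def refine_with_audio(gesture, audio_energy, strength=0.5):
    """Stage 2 stand-in (hypothetical): rescale per-frame motion by audio energy."""
    num_frames = gesture.shape[0]
    # Resample the audio energy envelope to one value per gesture frame.
    energy = np.interp(
        np.linspace(0.0, 1.0, num_frames),
        np.linspace(0.0, 1.0, audio_energy.shape[0]),
        audio_energy,
    )
    energy = energy / (energy.max() + 1e-8)
    # Scale frame-to-frame displacement, then integrate back to poses.
    velocity = np.diff(gesture, axis=0)
    scaled = velocity * (1.0 + strength * energy[1:, None])
    rest = gesture[:1] + np.cumsum(scaled, axis=0)
    return np.concatenate([gesture[:1], rest], axis=0)

# Usage: a placeholder embedding and a placeholder audio energy envelope.
embedding = np.ones(8)
coarse = script_to_gesture(embedding, num_frames=10, num_joints=4)
refined = refine_with_audio(coarse, np.array([0.0, 1.0, 0.5]))
print(coarse.shape, refined.shape)
```

The point of the split is that the first stage sees only the script and the second sees only the audio, so each can be swapped or edited independently.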
Findings
Outperforms existing methods in semantic alignment and realism
Achieves state-of-the-art results on two benchmark datasets
Enables flexible gesture editing and style control
Abstract
Gestures are non-verbal but important behaviors accompanying people's speech. While previous methods are able to generate speech rhythm-synchronized gestures, the semantic context of the speech is generally lacking in the gesticulations. Although semantic gestures do not occur very regularly in human speech, they are indeed the key for the audience to understand the speech context in a more immersive environment. Hence, we introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation and offers several control handles. In particular, our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement. Specifically, the script-based gesture generation leverages the pre-trained CLIP text embeddings as the guidance for generating gestures that are highly semantically aligned with the script. Then, we devise a…
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Motion and Animation · Human Pose and Action Recognition
Methods: Contrastive Language-Image Pre-training
