LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation

Yihao Zhi; Xiaodong Cun; Xuelin Chen; Xi Shen; Wen Guo; Shaoli Huang; Shenghua Gao

arXiv:2309.09294 · cs.CV · September 19, 2023 · 1 citation

TL;DR

LivelySpeaker is a framework for generating semantics-aware co-speech gestures. It decouples script-based gesture generation from audio-guided rhythm refinement, which supports realistic, controllable gestures aligned with the content of the speech.

Contribution

The paper introduces a two-stage gesture generation method built on CLIP text embeddings and diffusion models, achieving semantic alignment and style control in co-speech gesture synthesis.
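As a rough illustration of the two-stage decoupling described above, the following toy sketch first builds a pose sequence from a script embedding alone and then adjusts its rhythm with a per-frame audio signal. Everything in it is an assumption made for illustration: `toy_text_embedding` is a stand-in for a pre-trained text encoder such as CLIP, the function names and pose layout are hypothetical, and the second stage uses a simple energy-based gain rather than a diffusion model. It is not the authors' implementation.

```python
import math
from typing import List

Pose = List[float]  # one frame: a flat vector of joint values (hypothetical layout)


def toy_text_embedding(script: str, dim: int = 8) -> List[float]:
    """Stand-in for a pre-trained text encoder (assumption: a real system
    would query an encoder such as CLIP here). Returns a unit-length vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(script):
        vec[i % dim] += ord(ch) / 128.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def script_to_gesture(embedding: List[float], num_frames: int,
                      pose_dim: int = 4) -> List[Pose]:
    """Stage 1 (toy): a pose sequence conditioned only on the script embedding."""
    frames = []
    for t in range(num_frames):
        phase = 2.0 * math.pi * t / num_frames
        frames.append([embedding[j % len(embedding)] * math.sin(phase + j)
                       for j in range(pose_dim)])
    return frames


def refine_with_audio(gestures: List[Pose], audio_energy: List[float],
                      strength: float = 0.5) -> List[Pose]:
    """Stage 2 (toy): scale each frame by its audio energy in [0, 1].
    This is a plain gain, used here in place of a learned refinement model."""
    if len(gestures) != len(audio_energy):
        raise ValueError("need one audio energy value per gesture frame")
    refined = []
    for pose, energy in zip(gestures, audio_energy):
        gain = (1.0 - strength) + strength * energy
        refined.append([gain * x for x in pose])
    return refined


# Usage: the script fixes the motion content, the audio only modulates it.
emb = toy_text_embedding("hello everyone", dim=8)
stage1 = script_to_gesture(emb, num_frames=6)
audio_energy = [0.0, 1.0, 0.5, 1.0, 0.0, 1.0]
stage2 = refine_with_audio(stage1, audio_energy, strength=0.5)
```

Keeping the stages separate means the script-conditioned output can be edited or swapped without touching the audio step, which is the kind of control handle the summary refers to.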

Findings

01. Outperforms existing methods in semantic alignment and realism
02. Achieves state-of-the-art results on two benchmark datasets
03. Enables flexible gesture editing and style control

Abstract

Gestures are non-verbal but important behaviors accompanying people's speech. While previous methods are able to generate speech rhythm-synchronized gestures, the semantic context of the speech is generally lacking in the gesticulations. Although semantic gestures do not occur very regularly in human speech, they are indeed the key for the audience to understand the speech context in a more immersive environment. Hence, we introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation and offers several control handles. In particular, our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement. Specifically, the script-based gesture generation leverages the pre-trained CLIP text embeddings as the guidance for generating gestures that are highly semantically aligned with the script. Then, we devise a…

Code & Models

Repositories

zyhbili/livelyspeaker (PyTorch, official)

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Motion and Animation · Human Pose and Action Recognition

Methods: Contrastive Language-Image Pre-training