TL;DR
This paper introduces a novel co-speech gesture synthesis method that explicitly models rhythm and semantics using hierarchical neural embeddings, resulting in more realistic and coherent gestures aligned with speech.
Contribution
It proposes a hierarchical neural embedding framework and a rhythm-based segmentation pipeline to improve the realism and coherence of synthesized co-speech gestures.
Findings
Outperforms state-of-the-art systems on objective metrics
Achieves better rhythm and semantic alignment in gestures
Receives positive human feedback on gesture realism
Abstract
Automatic synthesis of realistic co-speech gestures is an increasingly important yet challenging task in artificial embodied agent creation. Previous systems mainly focus on generating gestures in an end-to-end manner, which makes it difficult to capture clear rhythm and semantics because of the complex yet subtle harmony between speech and gestures. We present a novel co-speech gesture synthesis method that achieves convincing results on both rhythm and semantics. For rhythm, our system contains a robust rhythm-based segmentation pipeline that explicitly ensures temporal coherence between the vocalization and the gestures. For gesture semantics, we devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory. The high-level embedding corresponds to semantics, while the low-level embedding relates to…
