M3G: Multi-Granular Gesture Generator for Audio-Driven Full-Body Human Motion Synthesis
Zhizhuo Yin, Yuk Hang Tsui, Pan Hui

TL;DR
This paper introduces M3G, a multi-granular framework for generating natural full-body human gestures from audio by modeling gesture patterns at different temporal granularities.
Contribution
The paper proposes a novel multi-granular tokenization and prediction framework for audio-driven gesture synthesis, addressing the fixed-granularity limitation of prior methods.
Findings
Outperforms state-of-the-art methods in gesture naturalness and expressiveness
Effective multi-granular tokenization of motion patterns
Improved gesture diversity and realism
Abstract
Generating full-body human gestures encompassing face, body, hands, and global movements from audio is a valuable yet challenging task in virtual avatar creation. Previous systems tokenize human gestures frame by frame and predict the tokens of each frame from the input audio. However, the number of frames required for a complete expressive human gesture, defined as granularity, varies among gesture patterns. Existing systems fail to model these gesture patterns because their gesture tokens have a fixed granularity. To solve this problem, we propose a novel framework named Multi-Granular Gesture Generator (M3G) for audio-driven holistic gesture generation. In M3G, we propose a novel Multi-Granular VQ-VAE (MGVQ-VAE) to tokenize motion patterns and reconstruct motion sequences at different temporal granularities. Subsequently, we…
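To make the notion of granularity concrete, the following is a minimal sketch of tokenizing one motion sequence at several window lengths. It is not the paper's MGVQ-VAE: there is no learned encoder or decoder, the codebooks are random, and each window is simply mean-pooled before a nearest-neighbour lookup. The names `quantize` and `multi_granular_tokenize`, the feature dimension of 6, the codebook size of 16, and the granularities 1, 4, and 8 are all assumptions chosen for illustration.

```python
import numpy as np

def quantize(vectors, codebook):
    """Return, for each row of `vectors`, the index of the nearest codebook row (L2)."""
    # Broadcast (N, 1, D) against (1, K, D) to get an (N, K) table of squared distances.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def multi_granular_tokenize(motion, codebooks):
    """Tokenize a (T, D) motion sequence at several temporal granularities.

    `codebooks` maps a granularity g (frames per token) to a (K, D) array.
    Illustrative assumption: each window of g frames is mean-pooled to one
    D-vector and then quantized; a real model would use a learned encoder.
    """
    T, D = motion.shape
    tokens = {}
    for g, codebook in codebooks.items():
        n = T // g  # trailing frames that do not fill a window are dropped
        windows = motion[: n * g].reshape(n, g, D).mean(axis=1)
        tokens[g] = quantize(windows, codebook)
    return tokens

rng = np.random.default_rng(0)
motion = rng.normal(size=(32, 6))  # 32 frames, 6 hypothetical pose features
codebooks = {g: rng.normal(size=(16, 6)) for g in (1, 4, 8)}
tokens = multi_granular_tokenize(motion, codebooks)
for g, ids in tokens.items():
    print(f"granularity {g}: {ids.shape[0]} tokens")
```

With a granularity of 1 every frame gets its own token, which corresponds to the frame-wise tokenization the abstract attributes to earlier systems; larger granularities yield fewer tokens, each summarizing a longer stretch of motion.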
Taxonomy
Methods: VQ-VAE
