HumanTOMATO: Text-aligned Whole-body Motion Generation
Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, Heung-Yeung Shum

TL;DR
HumanTOMATO introduces a novel framework for text-driven whole-body motion generation, effectively capturing facial expressions, hand gestures, and body motions with improved text alignment and diversity.
Contribution
This work is the first to integrate holistic motion generation with fine-grained control and explicit text-motion alignment using hierarchical VQ-VAE and GPT models.
Findings
Outperforms existing methods in motion quality
Achieves better text-motion alignment
Generates diverse and coherent motions
Abstract
This work targets a novel text-driven whole-body motion generation task, which takes a given textual description as input and aims at generating high-quality, diverse, and coherent facial expressions, hand gestures, and body motions simultaneously. Previous works on text-driven motion generation tasks mainly have two limitations: they ignore the key role of fine-grained hand and face controlling in vivid whole-body motion generation, and lack a good alignment between text and motion. To address such limitations, we propose a Text-aligned whOle-body Motion generATiOn framework, named HumanTOMATO, which is the first attempt to our knowledge towards applicable holistic motion generation in this research area. To tackle this challenging task, our solution includes two key designs: (1) a Holistic Hierarchical VQ-VAE (aka H²VQ) and a Hierarchical-GPT for fine-grained body and hand motion…
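The hierarchical quantization idea named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`quantize`, `hierarchical_quantize`, `interleave`), the linear fusion matrix `fuse`, the hand-first quantization order, and the body-then-hand token ordering are all assumptions made for illustration. It only shows the general pattern of using two codebooks, where the body code is chosen after seeing the quantized hand feature, and of laying out the two code streams alternately for an autoregressive model.

```python
import numpy as np

def quantize(features, codebook):
    """Nearest-neighbour lookup: returns (indices, quantized vectors)."""
    # Squared Euclidean distance between every feature row and every code.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

def hierarchical_quantize(hand_feat, body_feat, hand_codebook, body_codebook, fuse):
    """Two-level quantization (assumed layout, for illustration only).

    1. Quantize the hand features with the hand codebook.
    2. Concatenate the quantized hand vectors with the body features,
       project with the hypothetical matrix `fuse`, and quantize the
       result with the body codebook.
    """
    hand_idx, hand_q = quantize(hand_feat, hand_codebook)
    fused = np.concatenate([body_feat, hand_q], axis=1) @ fuse
    body_idx, body_q = quantize(fused, body_codebook)
    return hand_idx, body_idx, hand_q, body_q

def interleave(body_idx, hand_idx):
    """Alternate body and hand tokens: b0, h0, b1, h1, ... (assumed order)."""
    tokens = np.empty(2 * len(body_idx), dtype=np.int64)
    tokens[0::2] = body_idx
    tokens[1::2] = hand_idx
    return tokens

# Toy usage with random features; all sizes are arbitrary.
rng = np.random.default_rng(0)
T, D = 4, 8                                  # frames, feature width
hand_cb = rng.normal(size=(16, D))           # 16 hand codes
body_cb = rng.normal(size=(16, D))           # 16 body codes
fuse = rng.normal(size=(2 * D, D))           # hypothetical fusion projection
h_idx, b_idx, _, _ = hierarchical_quantize(
    rng.normal(size=(T, D)), rng.normal(size=(T, D)), hand_cb, body_cb, fuse
)
tokens = interleave(b_idx, h_idx)            # length 2 * T token sequence
```

In a trained model the encoders, codebooks, and fusion layer would be learned jointly with a decoder; here they are random arrays so that the lookup and interleaving logic can be run in isolation.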
Peer Reviews
Decision·ICML 2024 Poster
The authors propose a promising task and give a thoughtful solution. From the results, we can easily judge the effectiveness of the proposed method. The writing is also very fluent.
1. The title '3.1.2 Evaluation' somehow is easily misleading. In this section, you introduced the evaluation metrics and compared methods. How about changing to 'Evaluation Details'? 2. I feel that the focus of this article is on how to generate physical movements in the hands. The discussion about facial cVAE is limited, and I have not found any experiments to analyze this module. 3. Although it's hard to find compared methods in text-aligned whole-body motion generation, the authors can comp
The authors pioneered the task of whole-body motion (including the face and hand motion) generation from speech. To generate fine-grained hand and face motions, two core technique designs were introduced: 1) a holistic hierarchical VQ-VAE based on RVQ for body and hand motion reconstruction; and 2) a hierarchical GPT-based generation. To achieve a good alignment between text and motion, a text-motion retrieval model is pre-trained and used as a prior for the text-motion generation stage explicit
1. The technical contribution of this paper seems somewhat limited in the following respects: (1) the pipeline is similar to T2M-GPT, where a VQ-VAE is used for motion reconstruction and a transformer-based GPT model is used for motion generation, although the tasks are different; (2) pretraining a motion encoder and a text encoder by aligning text and motion contrastively is also not new (see, e.g., TMR), and further using it to replace CLIP is natural in the prediction stag
1. The paper is the first to target the text-driven whole-body motions generation task, aiming at generating high-quality, diverse, and coherent facial expressions, hand gestures, and body motions. 2. The paper introduces a novel framework for text-driven whole-body motion generation, featuring a Holistic Hierarchical Vector Quantization for learning informative and compact representations at low bit rates, along with a Hierarchical-GPT for predicting hierarchical discrete codes for body and ha
In my view, the proposed H2VQ and Hierarchical-GPT just extend the model introduced in T2M-GPT by incorporating hand gesture modeling. These modifications are rather straightforward. Firstly, in the context of vector quantization, they integrate the hand pose vector quantization with the body pose using a hierarchical strategy rather than directly quantizing the whole body pose. Secondly, T2M-GPT has been modified to decode the body pose code and hand pose code alternately, rather than direc
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: ALIGN · VQ-VAE
