HumanTOMATO: Text-aligned Whole-body Motion Generation

Shunlin Lu; Ling-Hao Chen; Ailing Zeng; Jing Lin; Ruimao Zhang; Lei Zhang; Heung-Yeung Shum

arXiv:2310.12978 · cs.CV · October 20, 2023 · 2 cites

TL;DR

HumanTOMATO introduces a novel framework for text-driven whole-body motion generation, effectively capturing facial expressions, hand gestures, and body motions with improved text alignment and diversity.

Contribution

This work is the first to integrate holistic motion generation with fine-grained control and explicit text-motion alignment using hierarchical VQ-VAE and GPT models.

Findings

1. Outperforms existing methods in motion quality
2. Achieves better text-motion alignment
3. Generates diverse and coherent motions

Abstract

This work targets a novel text-driven whole-body motion generation task, which takes a given textual description as input and aims at generating high-quality, diverse, and coherent facial expressions, hand gestures, and body motions simultaneously. Previous works on text-driven motion generation tasks mainly have two limitations: they ignore the key role of fine-grained hand and face controlling in vivid whole-body motion generation, and lack a good alignment between text and motion. To address such limitations, we propose a Text-aligned whOle-body Motion generATiOn framework, named HumanTOMATO, which is the first attempt to our knowledge towards applicable holistic motion generation in this research area. To tackle this challenging task, our solution includes two key designs: (1) a Holistic Hierarchical VQ-VAE (aka H$^{2}$VQ) and a Hierarchical-GPT for fine-grained body and hand motion…
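As a rough illustration of the discrete-code idea behind a VQ-VAE, the sketch below maps a continuous feature to its nearest codebook entry, using separate codebooks for body and hand. The toy codebooks, the 2-D features, and the names `quantize` and `quantize_frame` are hypothetical. The abstract excerpt does not say how H$^{2}$VQ couples the two levels, so this should not be read as the authors' implementation.

```python
import math

def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector` (Euclidean)."""
    best = min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))
    return best, codebook[best]

# Hypothetical toy codebooks; real ones are learned and much larger.
HAND_CODEBOOK = [[0.0, 0.0], [1.0, 1.0]]
BODY_CODEBOOK = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

def quantize_frame(hand_feat, body_feat):
    """Quantize one frame's hand and body features independently (an assumption)."""
    hand_idx, _ = quantize(hand_feat, HAND_CODEBOOK)
    body_idx, _ = quantize(body_feat, BODY_CODEBOOK)
    return body_idx, hand_idx

body_idx, hand_idx = quantize_frame([0.1, 0.2], [0.9, 0.1])
```

Each frame is thereby reduced to a pair of integer indices, which is the kind of token sequence a GPT-style model can then predict.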

Peer Reviews

Decision · ICML 2024 Poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

The authors propose a promising task and give a thoughtful solution. From the results, we can easily judge the effectiveness of the proposed method. The writing is also very fluent.

Weaknesses

1. The title '3.1.2 Evaluation' somehow is easily misleading. In this section, you introduced the evaluation metrics and compared methods. How about changing to 'Evaluation Details'?
2. I feel that the focus of this article is on how to generate physical movements in the hands. The discussion about facial cVAE is limited, and I have not found any experiments to analyze this module.
3. Although it's hard to find compared methods in text-aligned whole-body motion generation, the authors can comp

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

The authors pioneered the task of whole-body motion (including the face and hand motion) generation from speech. To generate fine-grained hand and face motions, two core technique designs were introduced: 1) a holistic hierarchical VQ-VAE based on RVQ for body and hand motion reconstruction; and 2) a hierarchical GPT-based generation. To achieve a good alignment between text and motion, a text-motion retrieval model is pre-trained and used as a prior for the text-motion generation stage explicit

Weaknesses

1. The technical contribution of this paper seems somewhat limited in the following: (1) the pipeline of the method is similar to T2M-GPT, where a VQ-VAE is used for motion reconstruction and a transformer-based GPT model is used for motion generation, while the tasks are different; (2) the pretraining of a motion encoder and a text encoder via aligning text and motion in a contrastive way is also not new (e.g., TMR), and further using it to replace CLIP is natural in the prediction stag
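The contrastive text-motion alignment this reviewer refers to can be sketched with a symmetric InfoNCE-style loss, which is one common form of such objectives. This is a minimal sketch under stated assumptions: the excerpt does not give the paper's actual objective, and `cosine`, `info_nce`, and the `temperature` value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(text_embs, motion_embs, temperature=0.1):
    """Symmetric contrastive loss; pair i of text and motion is the positive."""
    n = len(text_embs)
    sims = [[cosine(t, m) / temperature for m in motion_embs] for t in text_embs]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the positive logit.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(s - top) for s in logits))
        return log_denom - logits[target]

    text_to_motion = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    motion_to_text = sum(
        cross_entropy([sims[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (text_to_motion + motion_to_text)

# Toy embeddings: matched pairs give a low loss, swapped pairs a high one.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Encoders trained this way place matching text and motion close together, which is what allows the text encoder to stand in for a generic one at generation time.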

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. The paper is the first to target the text-driven whole-body motions generation task, aiming at generating high-quality, diverse, and coherent facial expressions, hand gestures, and body motions.
2. The paper introduces a novel framework for text-driven whole-body motion generation, featuring a Holistic Hierarchical Vector Quantization for learning informative and compact representations at low bit rates, along with a Hierarchical-GPT for predicting hierarchical discrete codes for body and ha

Weaknesses

In my view, the proposed H2VQ and Hierarchical-GPT just extend the model introduced in T2M-GPT by incorporating hand gesture modeling. These modifications are rather straightforward. Firstly, in the context of vector quantization, they integrate the hand pose vector quantization with the body pose using a hierarchical strategy rather than directly quantizing the whole body pose. Secondly, the T2M-GPT has been modified to decode the body pose code and hand pose code alternately, rather than direc
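The alternating decoding order this reviewer describes can be pictured as interleaving two code streams into one sequence for an autoregressive model. This is a minimal sketch, assuming body comes before hand within each step and that both streams have equal length. `interleave` and `deinterleave` are hypothetical names, not functions from the official repository.

```python
def interleave(body_codes, hand_codes):
    """Merge per-step body and hand codes into one alternating sequence."""
    if len(body_codes) != len(hand_codes):
        raise ValueError("body and hand streams must have the same length")
    sequence = []
    for body, hand in zip(body_codes, hand_codes):
        # Assumed order: body code first, then hand code, for each step.
        sequence.append(("body", body))
        sequence.append(("hand", hand))
    return sequence

def deinterleave(sequence):
    """Split an alternating sequence back into body and hand streams."""
    body_codes = [code for kind, code in sequence if kind == "body"]
    hand_codes = [code for kind, code in sequence if kind == "hand"]
    return body_codes, hand_codes

tokens = interleave([3, 7], [1, 4])
```

A single next-token predictor over `tokens` would then emit body and hand codes in turn, each conditioned on everything generated before it.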

Code & Models

Repositories

IDEA-Research/HumanTOMATO (PyTorch, official)

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: ALIGN · VQ-VAE