M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

Mingshuang Luo; Ruibing Hou; Zhuo Li; Hong Chang; Zimo Liu; Yaowei Wang; Shiguang Shan

arXiv:2405.16273 · cs.CV · November 5, 2024

TL;DR

M$^3$GPT is a multimodal, multitask framework for motion comprehension and generation. It quantizes text, music, and motion/dance into a single LLM vocabulary and learns the connections among motion-relevant tasks within one model.

Contribution

It introduces a unified representation space for motion-relevant modalities, motion generation modeled directly in the raw motion space, and joint learning of the connections and synergies among motion-relevant tasks.

Findings

01 Superior performance on motion tasks

02 Effective zero-shot generalization

03 Seamless multimodal integration

Abstract

This paper presents M$^3$GPT, an advanced Multimodal, Multitask framework for Motion comprehension and generation. M$^3$GPT operates on three fundamental principles. The first focuses on creating a unified representation space for various motion-relevant modalities. We employ discrete vector quantization for multimodal conditional signals, such as text, music, and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second involves modeling motion generation directly in the raw motion space. This strategy circumvents the information loss associated with a discrete tokenizer, resulting in more detailed and comprehensive motion generation. Third, M$^3$GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the most familiar and well-understood modality for LLMs, is…
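
The first principle, discrete vector quantization feeding a single vocabulary, can be illustrated with a minimal sketch. This is not the authors' implementation. The toy codebooks, the feature values, the text vocabulary size, and the offset scheme for placing motion and music codes after the text tokens are all assumptions made for illustration. In practice the codebooks would be learned by a tokenizer, whereas here they are fixed by hand and only the nearest-code lookup is shown.

```python
# Hypothetical sketch: nearest-codebook quantization plus a shared token vocabulary.
# Codebooks, features, and vocabulary sizes below are invented for illustration.

def quantize(features, codebook):
    """Return, for each feature vector, the index of the nearest code (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]

def to_shared_vocab(indices, offset):
    """Shift per-modality code indices into a disjoint range of one vocabulary."""
    return [offset + i for i in indices]

TEXT_VOCAB_SIZE = 10  # assumed size of the text vocabulary (ids 0..9)

motion_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # toy motion codes
music_codebook = [[0.0, 1.0], [1.0, 0.0]]               # toy music codes

# Assumed layout: text ids first, then motion codes, then music codes.
motion_offset = TEXT_VOCAB_SIZE
music_offset = motion_offset + len(motion_codebook)

motion_feats = [[0.1, -0.1], [1.9, 0.2], [0.9, 1.2]]
music_feats = [[0.9, 0.1], [0.1, 0.8]]

motion_ids = to_shared_vocab(quantize(motion_feats, motion_codebook), motion_offset)
music_ids = to_shared_vocab(quantize(music_feats, music_codebook), music_offset)

print(motion_ids)  # [10, 12, 11]
print(music_ids)   # [14, 13]
```

Because each modality occupies its own id range, a sequence model can consume text, motion, and music tokens through one embedding table, which is the property the abstract refers to as a single vocabulary.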

Code & Models

Repositories

luomingshuang/m3gpt · PyTorch · Official

Videos

M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation · SlidesLive

Taxonomy

Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Multimodal Machine Learning Applications