CoMA: Compositional Human Motion Generation with Multi-modal Agents

Shanlin Sun; Gabriel De Araujo; Jiaqi Xu; Shenghan Zhou; Hanwen Zhang; Ziheng Huang; Chenyu You; Xiaohui Xie

arXiv:2412.07320 · cs.CV · January 9, 2025

TL;DR

CoMA is an agent-based framework that combines large language and vision models with a mask transformer generator to improve the generation, editing, and understanding of complex human motion, particularly detailed motions unseen in training data.

Contribution

The paper presents a multi-agent system paired with a motion generator that uses body-part-specific encoders and codebooks, enabling detailed, controllable, and high-quality human motion synthesis and editing.
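
To make the idea of body-part-specific codebooks concrete, here is a minimal sketch of per-part vector quantization. It is an illustration under stated assumptions, not the paper's implementation: the three-way body-part split, the 16-dimensional frame features, the codebook size of 32, and the names `BODY_PARTS`, `nearest_code`, and `tokenize_motion` are all hypothetical, and the learned encoders are omitted (raw features are quantized directly, with random codebooks standing in for trained ones).

```python
import random

# Hypothetical partition of a 16-dim per-frame feature vector (illustrative only,
# not the partition used in the paper).
BODY_PARTS = {"torso": (0, 4), "arms": (4, 10), "legs": (10, 16)}


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    under squared Euclidean distance."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def tokenize_motion(motion, codebooks):
    """Map each frame to one discrete token per body part,
    using that part's own codebook."""
    tokens = {}
    for part, (start, stop) in BODY_PARTS.items():
        tokens[part] = [
            nearest_code(frame[start:stop], codebooks[part]) for frame in motion
        ]
    return tokens


# Toy data: 8 frames of random features and random (untrained) codebooks.
random.seed(0)
motion = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
codebooks = {
    part: [[random.gauss(0, 1) for _ in range(stop - start)] for _ in range(32)]
    for part, (start, stop) in BODY_PARTS.items()
}
tokens = tokenize_motion(motion, codebooks)
```

Because each body part has its own token stream, the tokens for one part (say, `tokens["arms"]`) could in principle be replaced while the others stay fixed, which is one plausible reading of how per-part codebooks support fine-grained control.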

Findings

01. Competitive performance on the HumanML3D dataset

02. Significant improvement in long and detailed motion generation

03. User studies favor CoMA over existing methods

Abstract

3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D…

Videos

CoMA: Compositional Human Motion Generation with Multi-modal Agents · underline

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Hand Gesture Recognition Systems

Methods: Sparse Evolutionary Training