VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension
Zeyu Ling, Bo Han, Shiyang Li, Jikang Cheng, Hongdeng Shen, Changqing Zou

TL;DR
VersatileMotion is a comprehensive multimodal motion model that unifies multiple tasks, supports single and multi-agent motions, and enables cross-modal translation, achieving state-of-the-art results across diverse motion-related applications.
Contribution
It introduces a novel motion tokenizer combining VQ-VAE and flow matching, and a unified framework supporting nine motion tasks, including cross-modal translation and multi-agent motion understanding.
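To make the tokenizer idea concrete, here is a minimal sketch of the vector-quantization lookup at the core of any VQ-VAE: each continuous feature vector is replaced by the index of its nearest codebook entry, yielding discrete tokens that an autoregressive transformer can predict. This is not the authors' implementation. The encoder, the flow-matching component, and the learned codebook are all omitted, and the names `quantize`, `dequantize`, and the toy 2-D codebook are assumptions made purely for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each token (the coarse reconstruction)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy codebook with three 2-D entries; a real one would be learned.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical per-frame features, standing in for encoder outputs.
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = quantize(frames, codebook)
reconstruction = dequantize(tokens, codebook)
```

In this sketch the lookup discards whatever detail lies between codebook entries. A generative decoder conditioned on the tokens is one plausible way to restore that detail, but how VersatileMotion combines quantization with flow matching is not specified in the summary above.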
Findings
Achieves state-of-the-art performance on seven tasks.
Supports cross-modal translation between motion, text, music, and speech.
Handles both single-agent and multi-agent motions in a unified framework.
Abstract
Large language models (LLMs) are, by design, inherently capable of multi-task learning: through a unified next-token prediction paradigm, they can naturally address a wide variety of downstream tasks. Prior work in the motion domain has demonstrated some generality by adapting LLMs via a Motion Tokenizer coupled with an autoregressive Transformer to generate and understand human motion. However, this generality remains limited in scope and yields only modest performance gains. We introduce VersatileMotion, a unified multimodal motion LLM that combines a novel motion tokenizer, integrating VQ-VAE with flow matching, and an autoregressive transformer backbone to seamlessly support at least nine distinct motion-related tasks. VersatileMotion is the first method to handle single-agent and multi-agent motions in a single framework and enable cross-modal conversion between motion, text,…
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Spatial Cognition and Navigation
