AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond
Zixiang Zhou, Yu Wan, Baoyuan Wang

TL;DR
AvatarGPT is a comprehensive framework that unifies multiple human motion understanding and generation tasks using a shared large language model, enabling seamless task integration and long-motion synthesis.
Contribution
It introduces an all-in-one LLM-based framework for human motion tasks, including encoding motion as tokens and enabling iterative long-motion synthesis.
Findings
Achieves state-of-the-art results on low-level motion tasks
Demonstrates promising results on high-level tasks
Enables unlimited long-motion synthesis through task traversal
Abstract
Large Language Models (LLMs) have shown remarkable emergent abilities in unifying almost all (if not every) NLP tasks. In the human motion-related realm, however, researchers still develop siloed models for each task. Inspired by InstructGPT and the generalist concept behind Gato, we introduce AvatarGPT, an All-in-One framework for motion understanding, planning, and generation, as well as other tasks such as motion in-between synthesis. AvatarGPT treats each task as one type of instruction fine-tuned on the shared LLM. All the tasks are seamlessly interconnected with language as the universal interface, constituting a closed loop within the framework. To achieve this, human motion sequences are first encoded as discrete tokens, which serve as the extended vocabulary of the LLM. Then, an unsupervised pipeline to generate natural language descriptions of human action sequences from in-the-wild…
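The idea of encoding motion as discrete tokens that extend the LLM vocabulary can be illustrated with a minimal sketch. It assumes a nearest-neighbour codebook lookup as the quantizer; the abstract does not say which tokenizer the authors use, so the codebook, the function names, and the `text_vocab_size` value below are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def nearest_code(frame: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `frame` (squared L2)."""
    best_index = 0
    best_dist = float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def motion_to_token_ids(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    text_vocab_size: int,
) -> List[int]:
    """Quantize each frame and shift its code index past the text vocabulary.

    Motion token k gets the id text_vocab_size + k, so motion ids never
    collide with ordinary text token ids.
    """
    return [text_vocab_size + nearest_code(frame, codebook) for frame in frames]


def token_ids_to_motion(
    token_ids: Sequence[int],
    codebook: Sequence[Sequence[float]],
    text_vocab_size: int,
) -> List[List[float]]:
    """Map motion token ids back to codebook vectors, skipping text tokens."""
    return [
        list(codebook[token_id - text_vocab_size])
        for token_id in token_ids
        if token_id >= text_vocab_size
    ]


# Hypothetical toy setup: 2-D "pose features" and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
text_vocab_size = 32000  # assumed size of the base text vocabulary
frames = [[0.1, -0.1], [0.9, 0.2], [0.1, 0.8]]

motion_ids = motion_to_token_ids(frames, codebook, text_vocab_size)
decoded = token_ids_to_motion(motion_ids, codebook, text_vocab_size)
```

In this toy setup the three frames map to ids 32000, 32001, and 32002, which a language model could then consume or emit alongside ordinary text tokens. A real system would learn the codebook and typically quantize short windows of motion rather than single frames.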
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Human Motion and Animation
