Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Action Generation
Sai Shashank Kalakonda, Shubh Maheshwari, Ravi Kiran Sarvadevabhatla

TL;DR
Action-GPT uses large language models to expand terse action phrases into detailed descriptions, which improves text-to-motion alignment and synthesis quality in motion generation models. It also supports multiple descriptions per action and zero-shot generation.
Contribution
The paper introduces a versatile framework that enhances text-based action generation by integrating LLMs for richer descriptions, applicable to various models and enabling zero-shot motion synthesis.
Findings
Noticeable qualitative and quantitative improvement in synthesized motions.
Effective use of multiple LLM-generated descriptions.
Demonstrated zero-shot generation capabilities.
Abstract
We introduce Action-GPT, a plug-and-play framework for incorporating Large Language Models (LLMs) into text-based action generation models. Action phrases in current motion capture datasets contain minimal and to-the-point information. By carefully crafting prompts for LLMs, we generate richer and fine-grained descriptions of the action. We show that utilizing these detailed descriptions instead of the original action phrases leads to better alignment of text and motion spaces. We introduce a generic approach compatible with stochastic (e.g. VAE-based) and deterministic (e.g. MotionCLIP) text-to-motion models. In addition, the approach enables multiple text descriptions to be utilized. Our experiments show (i) noticeable qualitative and quantitative improvement in the quality of synthesized motions, (ii) benefits of utilizing multiple LLM-generated descriptions, (iii) suitability of the…
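The pipeline the abstract describes (prompt an LLM with the action phrase, collect several detailed descriptions, and feed them to the text side of a motion model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `toy_llm` and `toy_text_encoder` stand-ins, and the choice to average the description embeddings are all assumptions made here for the example.

```python
import hashlib
from typing import Callable, List


def build_prompt(action_phrase: str) -> str:
    """Wrap a terse action phrase in a request for body-level detail.

    The wording is hypothetical; the paper's actual prompt is not given here.
    """
    return (
        "Describe in detail the body movements of a person "
        f"performing the action: {action_phrase}"
    )


def toy_llm(prompt: str, seed: int) -> str:
    """Placeholder for a real LLM call; returns a deterministic fake description."""
    return f"[description {seed}] {prompt}"


def toy_text_encoder(text: str, dim: int = 8) -> List[float]:
    """Placeholder text encoder: hashes the text into `dim` values in [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def encode_action(
    action_phrase: str,
    k: int = 4,
    llm: Callable[[str, int], str] = toy_llm,
    encoder: Callable[[str], List[float]] = toy_text_encoder,
) -> List[float]:
    """Generate k descriptions for one action phrase and combine their embeddings.

    Averaging is an assumed aggregation strategy, chosen for simplicity.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    prompt = build_prompt(action_phrase)
    descriptions = [llm(prompt, seed) for seed in range(k)]
    embeddings = [encoder(d) for d in descriptions]
    dim = len(embeddings[0])
    # Element-wise mean over the k description embeddings.
    return [sum(e[i] for e in embeddings) / k for i in range(dim)]


if __name__ == "__main__":
    text_embedding = encode_action("jump", k=4)
    print(len(text_embedding))
```

In a real setup, the resulting vector would replace the embedding of the original action phrase as the conditioning input to the text-to-motion model; swapping in an actual LLM client and the model's own text encoder only requires passing different `llm` and `encoder` callables.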
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Video Analysis and Summarization
