TL;DR
PlanMoGPT introduces a hierarchical planning and flow-enhanced tokenization approach for text-to-motion synthesis, significantly improving global semantic alignment and motion detail preservation, leading to state-of-the-art results.
Contribution
It proposes a novel LLM-based framework with progressive planning and flow-enhanced tokenization to address granularity issues in text-to-motion generation.
Findings
Improves FID on long sequences by 63.8%.
Increases motion diversity by 49.9%.
Reports state-of-the-art results for text-to-motion synthesis.
Abstract
Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs' autoregressive capabilities to hierarchically generate motion tokens by starting from sparse global plans and iteratively refining them into full…
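The sparse-to-dense idea behind progressive planning can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the interval schedule `[4, 2, 1]`, the separator token `-1`, and the choice to build each plan by plain strided subsampling are all assumptions made for illustration. The sketch only shows how a fine-grained token sequence could be arranged so that an autoregressive model emits a coarse plan first and the full sequence last.

```python
from typing import List, Sequence


def progressive_stages(tokens: Sequence[int], intervals: Sequence[int]) -> List[List[int]]:
    """Split a token sequence into sparse-to-dense stages.

    Hypothetical helper: stage k keeps every `intervals[k]`-th token,
    so early stages act as sparse global plans and the last stage
    (interval 1) is the full fine-grained sequence.
    """
    if not intervals or intervals[-1] != 1:
        raise ValueError("the last interval must be 1 so the final stage is the full sequence")
    if any(a <= b for a, b in zip(intervals, intervals[1:])):
        raise ValueError("intervals must be strictly decreasing (sparse to dense)")
    return [list(tokens[::step]) for step in intervals]


def flatten_for_autoregression(stages: List[List[int]], sep: int = -1) -> List[int]:
    """Concatenate stages into one training target, closing each stage with `sep`.

    The separator id is an assumption; a real vocabulary would reserve
    a dedicated special token for it.
    """
    target: List[int] = []
    for stage in stages:
        target.extend(stage)
        target.append(sep)
    return target


if __name__ == "__main__":
    motion_tokens = list(range(10))  # stand-in for codebook indices
    stages = progressive_stages(motion_tokens, [4, 2, 1])
    for interval, stage in zip([4, 2, 1], stages):
        print(f"interval {interval}: {stage}")
    print(flatten_for_autoregression(stages))
```

Under this layout, every dense token is predicted with the coarser plans already in the context, which is one way to read the abstract's claim that planning counteracts the over-emphasis on short-term coherence.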
Peer Reviews
Decision·Submitted to ICLR 2026
- The writing and structure of the paper are clear and easy to follow.
- The authors conducted comprehensive experiments on multiple public datasets, demonstrating improvements in numerical metrics for the proposed method.
- The paper lacks video samples. For a 3D motion generation model, providing diverse generated video samples is crucial, as it intuitively showcases the model's generation capabilities and quality. Without video samples, it is difficult for me to assess the model's actual performance, and as a reviewer, I cannot accept a 3D motion generation paper without any video samples.
- The baseline methods compared are outdated. The authors should include comparisons with the latest state-of-the-art appro
The paper proposes PlanMoGPT, which demonstrates notable performance improvements on the authors’ customized benchmarks.
1. Lacks novelty:
   - The paper appears to be an incremental improvement, and the scientific contribution is not clearly articulated. Much of the work seems engineering-oriented (e.g., “doubles the downsampling resolution and expands the codebook size by eight times” as stated in the abstract).
2. Writing and presentation issues.
   1. The overall narrative lacks clarity. The introduction discusses problems of LLMs, but the method actually targets issues inherent to Transformers in general, n
This paper focuses on the issue of the granularity of motion tokenization and introduces flow-matching into motion tokenization to propose flow-enhanced fine-grained motion tokenization. This paper also introduces progressive generation for an LLM-based motion generation model. Comprehensive ablation experiments demonstrate the effectiveness of the proposed method.
Two experimental results in this paper do not support its claimed contribution:
1. PlanMoGPT achieves suboptimal results on the KIT-ML dataset.
2. Introducing time interval 8 does not improve the text-to-motion performance, and time interval 6 leads to higher FID.
