PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis

Chuhao Jin; Haosen Li; Bingzi Zhang; Che Liu; Xiting Wang; Ruihua Song; Wenbing Huang; Ying Qin; Fuzheng Zhang; Di Zhang

arXiv:2506.17912·cs.CV·June 24, 2025

3 Reviews

TL;DR

PlanMoGPT introduces a hierarchical planning and flow-enhanced tokenization approach for text-to-motion synthesis, significantly improving global semantic alignment and motion detail preservation, leading to state-of-the-art results.

Contribution

The paper proposes an LLM-based framework that combines progressive planning with flow-enhanced fine-grained motion tokenization to address the tokenization-granularity bottleneck in text-to-motion generation.

Findings

1. Achieves 63.8% improvement in FID scores on long sequences.
2. Enhances motion diversity by 49.9%.
3. Sets new benchmarks for text-to-motion synthesis.

Abstract

Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs' autoregressive capabilities to hierarchically generate motion tokens by starting from sparse global plans and iteratively refining them into full…
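
The sparse-to-dense idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the subsampling rule (keep every k-th token), the interval values, the separator id, and the function names `build_progressive_plan` and `flatten_plan` are all assumptions made for illustration only.

```python
from typing import List, Sequence


def build_progressive_plan(tokens: Sequence[int], intervals: Sequence[int]) -> List[List[int]]:
    """Return one token list per level, coarsest first.

    Assumption: a plan at interval k keeps every k-th token of the full
    sequence, so interval 1 reproduces the full sequence.
    """
    if any(k < 1 for k in intervals):
        raise ValueError("intervals must be positive integers")
    # Order levels from sparse (large interval) to dense (small interval).
    ordered = sorted(set(intervals), reverse=True)
    return [list(tokens[::k]) for k in ordered]


def flatten_plan(levels: Sequence[Sequence[int]], sep: int = -1) -> List[int]:
    """Concatenate levels into one autoregressive target, separated by `sep`.

    Assumption: `sep` is a hypothetical separator id that does not collide
    with any motion token id.
    """
    flat: List[int] = []
    for i, level in enumerate(levels):
        if i > 0:
            flat.append(sep)
        flat.extend(level)
    return flat


motion_tokens = [10, 11, 12, 13, 14, 15, 16, 17]  # hypothetical token ids
levels = build_progressive_plan(motion_tokens, intervals=[4, 2, 1])
target = flatten_plan(levels)
```

Under these assumptions, a model trained on `target` would emit the sparse plan first and only then the denser levels, so each refinement step is conditioned on a coarse outline of the whole motion.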

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 5

Strengths

- The writing and structure of the paper are clear and easy to follow.
- The authors conducted comprehensive experiments on multiple public datasets, demonstrating improvements in numerical metrics for the proposed method.

Weaknesses

- The paper lacks video samples. For a 3D motion generation model, providing diverse generated video samples is crucial, as it intuitively showcases the model's generation capabilities and quality. Without video samples, it is difficult for me to assess the model's actual performance, and as a reviewer, I cannot accept a 3D motion generation paper without any video samples.
- The baseline methods compared are outdated. The authors should include comparisons with the latest state-of-the-art appro

Reviewer 02·Rating 2·Confidence 4

Strengths

The paper proposes PlanMoGPT, which demonstrates notable performance improvements on the authors' customized benchmarks.

Weaknesses

1. Lacks novelty:
   - The paper appears to be an incremental improvement, and the scientific contribution is not clearly articulated. Much of the work seems engineering-oriented (e.g., "doubles the downsampling resolution and expands the codebook size by eight times" as stated in the abstract).
2. Writing and presentation issues.
   1. The overall narrative lacks clarity. The introduction discusses problems of LLMs, but the method actually targets issues inherent to Transformers in general, n

Reviewer 03·Rating 6·Confidence 3

Strengths

This paper focuses on the issue of the granularity of motion tokenization and introduces flow-matching into motion tokenization to propose flow-enhanced fine-grained motion tokenization. This paper also introduces progressive generation for an LLM-based motion generation model. Comprehensive ablation experiments demonstrate the effectiveness of the proposed method.

Weaknesses

Two experimental results in this paper do not support its claimed contribution:

1. PlanMoGPT achieves suboptimal results on the KIT-ML dataset.
2. Introducing time interval 8 does not improve the text-to-motion performance, and time interval 6 leads to higher FID.
