Motion-R1: Enhancing Motion Generation with Decomposed Chain-of-Thought and RL Binding

Runqi Ouyang; Haoyun Li; Zhenyuan Zhang; Xiaofeng Wang; Zeyu Zhang; Zheng Zhu; Guan Huang; Sirui Han; Xingang Wang

arXiv:2506.10353·cs.CV·November 25, 2025

3 Reviews

TL;DR

Motion-R1 introduces a novel framework combining decomposed Chain-of-Thought reasoning with reinforcement learning to improve the quality, interpretability, and semantic accuracy of text-to-motion generation, addressing temporal and causal complexities.

Contribution

The paper presents the Decomposed CoT Data Engine and the RL Binding strategy, which together enable better modeling of temporal dependencies and causal relationships in human motion generation.

Findings

1. Achieved a 3.5% improvement in MM-Dist on HumanML3D
2. Surpassed existing methods in R-Precision and FID metrics
3. Demonstrated state-of-the-art performance across benchmark datasets
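For readers unfamiliar with the retrieval metrics named in the findings, the sketch below shows how MM-Dist and R-Precision are commonly computed from paired text and motion feature vectors. It is a simplified illustration, not the paper's evaluation code: the function names are hypothetical, the embeddings are assumed to come from a pretrained text-motion feature extractor, and the whole batch is used as the retrieval pool (standard protocols typically use a fixed pool of one matched and 31 mismatched descriptions).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mm_dist(text_embs, motion_embs):
    """Mean distance between each text feature and its paired motion feature.

    Lower is better: generated motions sit closer to their descriptions.
    """
    pairs = list(zip(text_embs, motion_embs))
    return sum(euclidean(t, m) for t, m in pairs) / len(pairs)


def r_precision(text_embs, motion_embs, k=3):
    """Fraction of motions whose own description is among the k nearest texts.

    Assumption: the whole batch serves as the candidate pool.
    """
    hits = 0
    for i, motion in enumerate(motion_embs):
        ranked = sorted(
            range(len(text_embs)),
            key=lambda j: euclidean(motion, text_embs[j]),
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(motion_embs)


# Toy example: the third motion is closest to the second description,
# so it counts as a miss at k=1 but a hit at k=3.
texts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
motions = [[0.1, 0.0], [0.9, 0.0], [1.0, 0.1]]
top1 = r_precision(texts, motions, k=1)
top3 = r_precision(texts, motions, k=3)
distance = mm_dist(texts, motions)
```

An improvement in MM-Dist means a lower value, while R-Precision improves as it approaches 1.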

Abstract

Text-to-Motion generation has become a fundamental task in human-machine interaction, enabling the synthesis of realistic human motions from natural language descriptions. Although recent advances in large language models and reinforcement learning have contributed to high-quality motion generation, two major challenges remain. Existing approaches often fail to capture the temporal and causal complexities inherent in natural language, leading to oversimplified or incoherent motions. Additionally, RL-based methods are frequently overly complex, hindering their scalability and adaptability across various motion generation tasks. To address these challenges, we propose Motion-R1, a novel framework that combines decomposed Chain-of-Thought reasoning with reinforcement learning to enhance both the quality and interpretability of generated motions. Specifically, we introduce the Decomposed…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

1. Skillfully combines the Chain-of-Thought paradigm from natural language processing with reinforcement learning, providing a novel and effective framework for applying large language models to complex motion generation tasks.
2. Not only achieves performance improvements but also provides transparency into the generation process through structured CoT reasoning steps, enhancing model interpretability and controllability, which are crucial aspects for practical deployment.
3. Directly tackles two core cha

Weaknesses

1. The overall framework performance is highly dependent on the quality of CoT data generated by LLMs. Errors in the LLM's understanding of certain actions could be amplified in subsequent processes, yet the paper lacks systematic analysis of such error propagation.
2. The three reward functions (format, motion, semantic), while intuitive and effective, may not cover all important aspects of motion generation, such as physical plausibility, energy consumption, style consistency, and other nuanced qu
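The three reward terms named in this review (format, motion, semantic) can be pictured as one scalar reward per generated sample. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the `<think>`/`<motion>` tag layout, the use of cosine similarity, the precomputed embedding inputs, and the equal default weights are all hypothetical choices made for the example.

```python
import math
import re

# Assumed output layout (hypothetical): reasoning inside <think>...</think>,
# followed by motion tokens inside <motion>...</motion>.
_FORMAT = re.compile(r"^<think>.+?</think>\s*<motion>.+?</motion>$", re.DOTALL)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def format_reward(output):
    """1.0 when the output follows the assumed tag layout, else 0.0."""
    return 1.0 if _FORMAT.match(output.strip()) else 0.0


def motion_reward(gen_emb, gt_emb):
    """Similarity between generated and ground-truth motion embeddings."""
    return cosine(gen_emb, gt_emb)


def semantic_reward(gen_emb, text_emb):
    """Similarity between the generated motion and the text prompt embedding."""
    return cosine(gen_emb, text_emb)


def total_reward(output, gen_emb, gt_emb, text_emb, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are an assumption."""
    w_format, w_motion, w_semantic = weights
    return (
        w_format * format_reward(output)
        + w_motion * motion_reward(gen_emb, gt_emb)
        + w_semantic * semantic_reward(gen_emb, text_emb)
    )


sample = "<think>step forward, then turn</think><motion>12 7 43</motion>"
reward = total_reward(sample, [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

Because every term is computed from existing text and motion pairs, no human preference labels are required. The reviewer's point still applies: properties such as physical plausibility are not measured by any of the three terms.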

Reviewer 02·Rating 6·Confidence 4

Strengths

1. Elegant and Effective Methodology: The core contribution, combining an automated CoT data generation pipeline with a streamlined RL mechanism, directly addresses a clear limitation of prior end-to-end models, which often struggle to interpret and execute multi-step or complex instructions.
2. Eliminates the Need for Human Annotation: RL Binding obviates the need for costly, time-consuming, and often subjective human preference labeling. By cleverly using the existing ground-truth data (text an

Weaknesses

Dependency on External LLM Quality: The performance of the entire framework is fundamentally tied to the reasoning and decomposition capabilities of the LLM used in the Decomposed CoT Data Engine. The paper acknowledges that the LLM can produce "noisy or suboptimal plans," which could introduce errors into the training data. This dependency might limit the reproducibility and robustness of the approach if a different or less capable LLM is used.
Inconsistent LLM output across multiple decomposi

Reviewer 03·Rating 2·Confidence 5

Strengths

1. Paper writing: The writing, figures, and overall structure make the paper easy to follow.
2. Technical novelty: Replacing human preference modeling with automatic motion/text similarity rewards in RL Binding is somewhat novel, but its effectiveness is not properly evaluated (see weaknesses).
3. Good quantitative results: Based on the reported results in Table 1, Motion-R1 achieves consistent gains across major metrics (yet the improvement is not significant, and results are worse on some metrics)

Weaknesses

1. Unfair and incomplete comparison with Motion-Agent: The paper compares Motion-R1 only against MotionLLM, which is merely one internal component of Motion-Agent (Wu et al., 2024). Motion-Agent integrates MotionLLM with GPT-4o for reasoning, task decomposition, long-sequence composition, and interactive motion editing. Thus, the comparison omits the very agent capabilities that Motion-R1 aims to emulate with its CoT Data Engine and RL Binding. Claims of superiority are therefore not substantiat

Taxonomy

Methods·ADaptive gradient method with the OPTimal convergence rate