MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation
Hongpeng Wang, Zeyu Zhang, Wenhao Li, Hao Tang

TL;DR
MoRL is a unified multimodal motion model that enhances human motion understanding and generation through reinforcement learning, reasoning, and test-time planning, achieving significant improvements over existing methods.
Contribution
Introduces MoRL, a novel model combining supervised fine-tuning and reinforcement learning with reasoning-based rewards for improved motion understanding and generation.
Findings
Significant performance improvements on the HumanML3D and KIT-ML datasets.
Effective reasoning and planning via the Chain-of-Motion (CoM) method.
Construction of two large-scale reasoning datasets, MoUnd-CoT-140K and MoGen-CoT-140K.
Abstract
Human motion understanding and generation are crucial for vision and robotics but remain limited in reasoning capability and test-time planning. We propose MoRL, a unified multimodal motion model trained with supervised fine-tuning and reinforcement learning with verifiable rewards. Our task-specific reward design combines semantic alignment and reasoning coherence for understanding with physical plausibility and text-motion consistency for generation, improving both logical reasoning and perceptual realism. To further enhance inference, we introduce Chain-of-Motion (CoM), a test-time reasoning method that enables step-by-step planning and reflection. We also construct two large-scale CoT datasets, MoUnd-CoT-140K and MoGen-CoT-140K, to align motion sequences with reasoning traces and action descriptions. Experiments on HumanML3D and KIT-ML show that MoRL achieves significant gains over…
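The task-specific reward design described in the abstract can be illustrated with a minimal sketch. The abstract names the reward components but not how they are combined, so the weighted sum, the equal default weights, and every class and function name below are assumptions made for illustration, not the authors' implementation. Component scores are assumed to be precomputed values in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class UnderstandingScores:
    """Hypothetical container for the two understanding-side reward terms."""
    semantic_alignment: float   # assumed to lie in [0, 1]
    reasoning_coherence: float  # assumed to lie in [0, 1]


@dataclass
class GenerationScores:
    """Hypothetical container for the two generation-side reward terms."""
    physical_plausibility: float    # assumed to lie in [0, 1]
    text_motion_consistency: float  # assumed to lie in [0, 1]


def understanding_reward(s: UnderstandingScores,
                         w_sem: float = 0.5, w_coh: float = 0.5) -> float:
    # Weighted sum of the two terms; the weights are illustrative assumptions.
    return w_sem * s.semantic_alignment + w_coh * s.reasoning_coherence


def generation_reward(s: GenerationScores,
                      w_phys: float = 0.5, w_align: float = 0.5) -> float:
    # Weighted sum of the two terms; the weights are illustrative assumptions.
    return w_phys * s.physical_plausibility + w_align * s.text_motion_consistency
```

Keeping the two reward functions separate mirrors the abstract's split between understanding and generation; how the actual model scores each component is not specified in this summary.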
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper presents a unified multimodal framework that effectively integrates semantic, reasoning, and physical consistency rewards, leading to more logically coherent and perceptually realistic motion understanding and generation.
2. It introduces Chain of Motion reasoning and large-scale CoT datasets, which significantly enhance the model’s interpretability and performance, achieving measurable improvements over state-of-the-art baselines on standard benchmarks.
3. The paper is well-written
1. The proposed method does not outperform, or even match, all of the baselines, especially on CIDEr, where certain baselines surpass it significantly.
2. In the ablation study, the full MoRL appears to perform comparably to the variants without certain reward terms or without CoM. Moreover, the full MoRL does not show a significant improvement even over SFT alone.
- Well-motivated problem. Motion understanding and generation have been studied separately, but unifying them with a shared representation is valuable. Table 3 shows that joint training improves both understanding and generation, suggesting these tasks reinforce each other.
- The four rewards (semantic, coherence, physical, alignment) cover complementary aspects, and I appreciate the ablations in Table 3 showing that each contributes meaningfully.
- Strong experimental validation with two benchmark
- Missing an analysis of the quality of the synthetic CoT data: no human evaluation of the dataset is provided, so we have no idea which Gemini reasoning traces are actually correct. I hope the authors can share more data quality metrics and analysis.
- I would like to see when MoRL fails and what types of motions or captions are challenging. A qualitative and quantitative error analysis would be nice.
- The paper unifies motion understanding and generation with task-specific rewards. It adopts a simple but effective RL recipe: GRPO-style group sampling with KL regularization to a frozen reference avoids heavy heuristics yet yields consistent gains.
- The work has a strong CoT data engine. Two large CoT datasets align motions with reasoning traces and concise answers, providing good supervision for both directions.
- Evaluations cover comprehensive metrics for both understanding and generation c
- Results are reported only on HumanML3D and KIT-ML; harder and more diverse settings (such as long sequences and multi-person) are not covered.
- CoM samples K=8 candidates with T=2 refinement. The paper calls the overhead “modest” but reports no latency or throughput numbers.
- The headline gains are 4.17% on BERT and 3% on FID, and MoRL is not the best on all metrics (for example, FID against diffusion baselines).
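Two mechanisms mentioned in the reviews above can be sketched briefly: GRPO-style group sampling, where each sampled response is scored relative to the other samples drawn for the same prompt, and CoM's test-time procedure of sampling K=8 candidates with T=2 refinement. The advantage computation follows the commonly used group-normalized form and leaves out the KL term. The CoM loop is only a guess at the overall shape: `sample`, `refine`, and `score` are hypothetical callables, T is interpreted as a number of refinement rounds, and keeping a refinement only when it scores higher is an assumption not stated in the source.

```python
import statistics
from typing import Callable, List, TypeVar

C = TypeVar("C")  # candidate type, e.g. a motion plan or a token sequence


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def chain_of_motion_select(sample: Callable[[], C],
                           refine: Callable[[C], C],
                           score: Callable[[C], float],
                           k: int = 8, t: int = 2) -> C:
    """Sample k candidates, refine each for t rounds, return the best-scoring one.

    Hypothetical sketch: a refinement replaces its candidate only if it scores higher.
    """
    candidates = [sample() for _ in range(k)]
    for _ in range(t):
        refined = []
        for cand in candidates:
            new = refine(cand)
            refined.append(new if score(new) > score(cand) else cand)
        candidates = refined
    return max(candidates, key=score)
```

Under these assumptions the procedure makes k sampling calls and k × t refinement calls per query, plus the scoring calls, which is the kind of overhead the latency question in the review is asking about.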
- MoRL presents an integrated approach for both motion understanding and generation, offering a single framework that closes the gap between perception and synthesis in motion-language modeling.
- The reward functions go beyond prior generic similarity metrics: the logical coherence reward for reasoning traces and explicit physical plausibility for motion generation are clear step-ups over most existing motion-language systems.
- Quantitative results across linguistic (BLEU, ROUGE, CIDEr, BERTSc
- **Lack of supplementary materials and qualitative evidence (visualizations/videos)**: The submission provides neither a supplementary document nor qualitative visualizations (e.g., trajectory plots, attention/activation maps, sample reasoning traces aligned with frames) or videos of generated motions. For a generative motion system, the absence of side-by-side videos against baselines and ablations makes it difficult to assess realism, physical plausibility, temporal coherence, and failure mod
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Human Motion and Animation
