MotionGPT3: Human Motion as a Second Modality

Bingfan Zhu; Biao Jiang; Sunyi Wang; Shixiang Tang; Tao Chen; Linjie Luo; Youyi Zheng; Xin Chen

arXiv:2506.24086 · cs.CV · November 4, 2025

TL;DR

MotionGPT3 introduces a bimodal motion-language model that encodes motion in a continuous latent space and uses a dual-stream Transformer to improve understanding and generation, achieving faster convergence and state-of-the-art results.

Contribution

It proposes a novel bimodal model with continuous motion encoding and a dual-stream Transformer to reduce interference and accelerate training.
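
To make the dual-stream idea concrete, here is a minimal sketch of one attention layer in which text and motion tokens keep separate projection weights but attend over a single joint sequence. This is an illustration under stated assumptions, not the paper's implementation: the `Branch` class, the `dual_stream_attention` function, the single attention head, the random weights, and the absence of a causal mask, feed-forward layers, and normalization are all hypothetical simplifications.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Branch:
    """Hypothetical modality-specific weights: own Q, K, V and output projections."""

    def __init__(self, rng, dim):
        def rand():
            return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
        self.Wq, self.Wk, self.Wv, self.Wo = rand(), rand(), rand(), rand()


def dual_stream_attention(text_tokens, motion_tokens, text_branch, motion_branch):
    """One attention layer: per-modality projections, one shared attention pass."""
    # Tag every token with the branch that owns its weights.
    tagged = [(x, text_branch) for x in text_tokens]
    tagged += [(x, motion_branch) for x in motion_tokens]
    # Each token is projected with its own modality's Q/K/V matrices.
    qs = [matvec(b.Wq, x) for x, b in tagged]
    ks = [matvec(b.Wk, x) for x, b in tagged]
    vs = [matvec(b.Wv, x) for x, b in tagged]
    scale = math.sqrt(len(qs[0]))
    outputs = []
    for (_, branch), q in zip(tagged, qs):
        # Shared attention: every query scores the keys of BOTH modalities.
        weights = softmax([sum(a * c for a, c in zip(q, k)) / scale for k in ks])
        mixed = [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(len(q))]
        # The output projection is again modality-specific.
        outputs.append(matvec(branch.Wo, mixed))
    n = len(text_tokens)
    return outputs[:n], outputs[n:]


rng = random.Random(0)
dim = 4
text_branch, motion_branch = Branch(rng, dim), Branch(rng, dim)
text = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
motion = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(2)]
text_out, motion_out = dual_stream_attention(text, motion, text_branch, motion_branch)
```

In this sketch the only place the two modalities interact is the attention weights, so each branch's parameters are updated only through its own tokens, while information still flows across modalities.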

Findings

1. 2x faster convergence in training loss
2. up to 4x faster convergence in validation
3. state-of-the-art performance on motion benchmarks

Abstract

With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention…
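
The claim that quantization introduces approximation error can be illustrated with a toy example. The sketch below isolates only the quantization step: it snaps random latent vectors to the nearest entry of a small random codebook and measures the resulting error, against passing the same latents through unchanged. The codebook size, latent dimension, and random data are assumptions chosen for illustration. A real VAE still has its own reconstruction error, which this sketch does not model.

```python
import random


def nearest_code(x, codebook):
    """Return the codebook entry closest to x in squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c)))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


random.seed(0)
dim, num_codes = 4, 8  # hypothetical sizes, far smaller than any real model
codebook = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_codes)]
latents = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(100)]

# Discrete path: each latent is replaced by its nearest code before decoding.
vq_error = sum(mse(z, nearest_code(z, codebook)) for z in latents) / len(latents)

# Continuous path: the latent reaches the decoder unchanged.
continuous_error = sum(mse(z, z) for z in latents) / len(latents)

print(f"quantization error: {vq_error:.4f}, continuous error: {continuous_error:.4f}")
```

A larger codebook or residual quantization shrinks the first number but cannot make it zero for arbitrary inputs, which is the sense in which quantization caps the precision of the encoded motion.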

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The motivation of utilizing the continuous motion latent space for lossless motion encoding and the diffusion header to bridge the gap between the next-token generation framework is reasonable.
- The dual-branch framework to preserve modality-specific information and the shared attention for cross-modal communication is well motivated, and the three-stage training schemes stabilize the optimization of the proposed framework.
- Experimental results on benchmarks of the two tasks are strong, a

Weaknesses

- The paper claims continuous VAE for motion encoding is better, but lacks an experimental comparison on motion encoding and decoding quality with previous schemes. Specifically, how is the improvement of the continuous VAE compared to the recently stronger motion quantization methods, e.g., the residual VQ proposed by MoMask (CVPR 2024) and the 2D motion quantization in MoGenTS (NeurIPS 2024)?
- Experiments are only conducted on the HumanML3D datasets. Adding more diverse datasets, e.g., Motion

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. The paper contains numerous figures and tables, as well as abundant visualization results, with a relatively clear overall structure.
2. Video demos are provided, demonstrating excellent performance.
3. Motion generation and motion understanding tasks are realized through two different branches and fine-tuning of the LLM.
4. The experimental results have achieved significant improvements.

Weaknesses

1. The description of Figure 2 and the method section is not clear enough, making it difficult to intuitively grasp the authors' entire training process design and the detailed reasoning procedure.
2. Although the paper achieves good results, the method feels relatively incremental and highly hierarchical, lacking overall simplicity. Compared with MotionGPT and MotionGPT2, it does not bring a strong sense of novelty.
3. The autoregressive continuous token proposed by the authors has been used

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The technical design in this paper is highly targeted, with each core component (a continuous VAE, a dual-stream architecture, and three-stage training) precisely solving a specific problem, and its necessity is rigorously validated through ablation studies, resulting in a lean and non-redundant overall architecture.
2. The experimental validation is comprehensive, encompassing quantitative comparisons, qualitative examples, and thorough ablation studies.
3. The study ensures high reproducibi

Weaknesses

1. The bimodal branch architecture proposed in this paper to address cross-modal interference in motion-language modeling is not a particularly novel approach, as similar frameworks have been proposed in existing unified text-image understanding and generation work, such as BAGEL [1]. However, the paper lacks discussion on how the proposed method differs from these existing approaches.
2. Baseline comparisons are outdated, lacking recent models (e.g., MotionGPT-2 2024 [2], MG-MotionLLM 2025 [3])

Code & Models

Models

🤗
OpenMotionLab/motiongpt3
model · ♡ 7

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Human Pose and Action Recognition