MotionGPT3: Human Motion as a Second Modality
Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen

TL;DR
MotionGPT3 introduces a bimodal motion-language model that encodes motion in a continuous latent space and uses a dual-stream Transformer to improve understanding and generation, achieving faster convergence and state-of-the-art results.
Contribution
It proposes a novel bimodal model with continuous motion encoding and a dual-stream Transformer to reduce interference and accelerate training.
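As a rough illustration of the dual-stream idea (modality-specific weights, with attention computed jointly over both token sequences), here is a minimal NumPy sketch. It is not the authors' implementation. The `Branch` class, the single attention head, the causal mask, the text-then-motion token order, and the random weights are all assumptions made for the example; layer norms and feed-forward layers are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class Branch:
    """Hypothetical per-modality projection weights (query, key, value, output)."""
    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.wq, self.wk, self.wv, self.wo = (
            rng.normal(0.0, scale, (dim, dim)) for _ in range(4)
        )

def shared_attention_block(text, motion, text_branch, motion_branch):
    """One attention layer: separate projections per modality, one joint attention."""
    n_text = text.shape[0]
    # Each modality is projected with its own weights...
    q = np.concatenate([text @ text_branch.wq, motion @ motion_branch.wq], axis=0)
    k = np.concatenate([text @ text_branch.wk, motion @ motion_branch.wk], axis=0)
    v = np.concatenate([text @ text_branch.wv, motion @ motion_branch.wv], axis=0)
    # ...but attention runs over the concatenated sequence (assumed causal).
    n = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    mixed = softmax(scores) @ v
    # Outputs are split back and passed through modality-specific output weights.
    text_out = text + mixed[:n_text] @ text_branch.wo
    motion_out = motion + mixed[n_text:] @ motion_branch.wo
    return text_out, motion_out

rng = np.random.default_rng(0)
dim = 8
text_branch, motion_branch = Branch(dim, rng), Branch(dim, rng)
text = rng.normal(size=(5, dim))     # 5 text token embeddings
motion = rng.normal(size=(3, dim))   # 3 continuous motion latents
text_out, motion_out = shared_attention_block(text, motion, text_branch, motion_branch)
print(text_out.shape, motion_out.shape)
```

In this sketch the only place the two modalities interact is the joint attention matrix; every learned weight belongs to exactly one branch, which is the property the dual-stream design relies on to limit cross-modal interference.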
Findings
2x faster convergence in training loss
up to 4x faster convergence in validation
state-of-the-art performance on motion benchmarks
Abstract
With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention…
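The observation about quantization error can be made concrete with a toy example. The sketch below uses plain nearest-neighbour vector quantization on random vectors; the codebooks, sizes, and data are invented for illustration and do not correspond to the paper's tokenizer or to any baseline. It only shows the error added by the quantization step itself: a continuous latent passed through unchanged has none, although a VAE still has its own reconstruction error.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def nearest_code(z, codebook):
    """Return the codebook entry closest to z (the quantized version of z)."""
    return min(codebook, key=lambda c: mse(z, c))

def mean_quantization_error(latents, codebook):
    """Average error introduced by replacing each latent with its nearest code."""
    return sum(mse(z, nearest_code(z, codebook)) for z in latents) / len(latents)

def random_vectors(rng, count, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
dim = 4
latents = random_vectors(rng, 200, dim)            # stand-ins for motion latents
small_codebook = random_vectors(rng, 8, dim)       # hypothetical 8-entry codebook
large_codebook = small_codebook + random_vectors(rng, 56, dim)  # 64 entries

err_small = mean_quantization_error(latents, small_codebook)
err_large = mean_quantization_error(latents, large_codebook)
print("mean error, 8 codes: ", err_small)
print("mean error, 64 codes:", err_large)
```

Because the larger codebook contains the smaller one, its error can never be higher, but it stays above zero for any finite codebook on continuous data. That residual is the floor on reconstruction quality that a discrete tokenizer imposes.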
Peer Reviews
Decision: ICLR 2026 Poster
- The motivation for using a continuous motion latent space for lossless motion encoding, and a diffusion header to bridge the gap to the next-token generation framework, is reasonable.
- The dual-branch framework for preserving modality-specific information, with shared attention for cross-modal communication, is well motivated, and the three-stage training scheme stabilizes the optimization of the proposed framework.
- Experimental results on benchmarks of the two tasks are strong, a

- The paper claims that a continuous VAE is better for motion encoding, but lacks an experimental comparison of motion encoding and decoding quality against previous schemes. Specifically, how does the continuous VAE compare with recent, stronger motion quantization methods, e.g., the residual VQ proposed by MoMask (CVPR 2024) and the 2D motion quantization in MoGenTS (NeurIPS 2024)?
- Experiments are conducted only on the HumanML3D dataset. Adding more diverse datasets, e.g., Motion

1. The paper contains numerous figures and tables, as well as abundant visualization results, with a relatively clear overall structure.
2. Video demos are provided, demonstrating excellent performance.
3. Motion generation and motion understanding tasks are realized through two different branches and fine-tuning of the LLM.
4. The experimental results show significant improvements.

1. The description of Figure 2 and the method section is not clear enough, making it difficult to intuitively grasp the authors' entire training process design and the detailed reasoning procedure.
2. Although the paper achieves good results, the method feels relatively incremental and highly hierarchical, lacking overall simplicity. Compared with MotionGPT and MotionGPT2, it does not bring a strong sense of novelty.
3. The autoregressive continuous token proposed by the authors has been used

1. The technical design in this paper is highly targeted: each core component (a continuous VAE, a dual-stream architecture, and three-stage training) precisely solves a specific problem, and its necessity is rigorously validated through ablation studies, resulting in a lean and non-redundant overall architecture.
2. The experimental validation is comprehensive, encompassing quantitative comparisons, qualitative examples, and thorough ablation studies.
3. The study ensures high reproducibi

1. The bimodal branch architecture proposed in this paper to address cross-modal interference in motion-language modeling is not a particularly novel approach, as similar frameworks have been proposed in existing unified text-image understanding and generation work, such as BAGEL [1]. However, the paper lacks discussion of how the proposed method differs from these existing approaches.
2. Baseline comparisons are outdated, lacking recent models (e.g., MotionGPT-2 2024 [2], MG-MotionLLM 2025 [3])
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Human Pose and Action Recognition
