UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework
Youxin Pang, Yong Zhang, Ruizhi Shao, Xiang Deng, Feng Gao, Xu Xiaoming, Xiaoming Wei, Yebin Liu

TL;DR
UniMo introduces a novel autoregressive framework that jointly models 2D videos and 3D human motions, enabling simultaneous generation and understanding of both modalities, which was previously unexplored due to their structural differences.
Contribution
The paper presents a unified modeling approach for 2D videos and 3D motions using token sequences, a new 3D motion tokenizer, and a sequence modeling strategy that integrates two tasks within one framework.
Findings
Successfully generates synchronized videos and 3D motions.
Accurately captures and reconstructs 3D human motion from videos.
Demonstrates potential for multimodal human-centric modeling.
Abstract
We propose UniMo, an innovative autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework, enabling simultaneous generation and understanding of these two modalities for the first time. Current methods predominantly focus on generating one modality given another as the condition or integrating either of them with other modalities such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored, presenting significant challenges due to their substantial structural and distributional differences. Inspired by the LLM's ability to unify different modalities, our method models videos and 3D motions as a unified token sequence, utilizing separate embedding layers to mitigate distribution gaps. Additionally, we devise a sequence modeling strategy that integrates two distinct tasks…
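The separate-embedding idea in the abstract can be illustrated with a minimal sketch. It assumes each modality has its own tokenizer vocabulary and its own embedding table, and that both tables map into one shared vector space so a single autoregressive model can read the mixed sequence. The vocabulary sizes, the embedding width, and the names `make_table`, `TABLES`, and `embed_sequence` are hypothetical and not taken from the paper; random values stand in for learned weights.

```python
import random

# Hypothetical sizes, chosen only for illustration (not from the paper).
VIDEO_VOCAB = 8
MOTION_VOCAB = 4
DIM = 3


def make_table(vocab_size, dim, seed):
    """Build a random embedding table; random values stand in for learned weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


# One embedding table per modality, so each modality keeps its own ID space.
TABLES = {
    "video": make_table(VIDEO_VOCAB, DIM, seed=0),
    "motion": make_table(MOTION_VOCAB, DIM, seed=1),
}


def embed_sequence(tokens):
    """Map a mixed list of (modality, token_id) pairs to one list of vectors."""
    embedded_tokens = []
    for modality, token_id in tokens:
        table = TABLES[modality]
        if not 0 <= token_id < len(table):
            raise ValueError(f"{modality} token id {token_id} out of range")
        embedded_tokens.append(table[token_id])
    return embedded_tokens


# A toy unified sequence: video tokens followed by motion tokens.
sequence = [("video", 5), ("video", 2), ("motion", 3), ("motion", 0)]
embedded = embed_sequence(sequence)
```

Here `embedded` holds four vectors of equal width, two looked up in the video table and two in the motion table, so a token ID in one modality never collides with the same ID in the other.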
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The motivation and task are clearly defined, and the proposed approach is technically sound and promising.
2. Extensive comparisons and ablation studies convincingly demonstrate the method’s effectiveness and isolate the impact of key components.
1. The paper compares the proposed motion tokenizer to SOLAMI, but SOLAMI employs multiple VQ-VAEs for body, hands, and inter-character relative transforms, whereas this work focuses only on body motion. The absence of SOLAMI’s detailed settings further undermines the validity of the comparison and may bias the results.
2. Since the main baselines generate only video, the current I2VM results are not sufficient to support the claim that motion supervision improves video generation. The evaluatio
- The motivation is clear and well grounded: unifying I2VM and V2M provides a natural way to achieve mutual benefits between reconstruction and generation quality, and such joint modeling has been shown to be effective in other domains as well (e.g., GENMO [ICCV 2025]).
- The design of key components (e.g., independent embeddings, SMPL-X token expansion) is well explained and ablated.
- The framework consistently improves both motion reconstruction and generation quality, demonstrating the ef
- Clarity and organization could be improved. For example, the meaning of s=1 in Figure 5 could be clarified directly in the caption for easier reference. Several important quantitative results and ablations, such as the I2VM-only and V2M-only settings, are presented only in the supplementary material, though they are essential for substantiating the claimed benefits of joint modeling.
- The Cosmos baseline used for comparison appears relatively weak, as human motion dynamics constitutes only a
- By performing joint modeling of 2D human videos and 3D motion, the framework ensures consistency between video and motion. Task-specific token arrangements are defined for each task, enabling effective generation.
- By increasing the resolution of motion tokens using a scaling factor (s), the method improves motion prediction accuracy. Although the theoretical basis is limited, experimental results demonstrate its effectiveness.
- The paper is easy to follow.
- Possible evaluation bias stemming from reliance on GVHMR pseudo ground truth. The framework is both trained and evaluated on data derived from GVHMR, which also serves as the reference for metric computation. This design could introduce a subtle circular bias, as the model may partially learn to replicate GVHMR characteristics rather than demonstrating genuine generalization. A clearer discussion on how this dependency is handled or additional evaluation using independent ground-truth would s
1. The tasks the paper tries to address are interesting and novel to me. It's intuitive to expect co-improvement by modeling the 3D motions and 2D videos in one framework. Video2Motion would also be useful for extracting 3D motion from in-the-wild videos.
2. The performance reported in the table and shown by the demo video is impressive.
3. The design of a GPT-like model and independent embeddings to unify two modalities is straightforward and intuitive. The experiments demonstrated the effectiv
1. One major issue is that the paper repeatedly mentions an "understanding task", but I cannot find any design or experiment related to understanding. For such a task, I would expect the model to receive video or motion together with text as input and to generate text. The use of "LLM" is also improper. I would suggest describing the model as a GPT-like model rather than an LLM-based model.
2. The design to extend the temporal dimension to match the number of visual tokens is relatively naive. The author mention
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · 3D Shape Modeling and Analysis
