EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer
Yuxiao Yang, Hualian Sheng, Sijia Cai, Jing Lin, Jiahao Wang, Bing Deng, Junzhe Lu, Haoqian Wang, Jieping Ye

TL;DR
EchoMotion introduces a dual-modality diffusion transformer framework that jointly models appearance and human motion, significantly enhancing the generation of complex human action videos by leveraging synchronized 3D positional encoding and a two-stage training strategy.
Contribution
The paper presents EchoMotion, a novel framework that unifies human video and motion generation using a dual-branch architecture, MVS-RoPE encoding, and a large-scale dataset HuMoVe for improved coherence.
Findings
Explicit motion representation improves video plausibility.
Joint modeling enhances complex human action synthesis.
Unified approach outperforms appearance-only models.
Abstract
Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Synchronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a…
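To make the idea of a unified 3D positional encoding concrete, here is a minimal sketch of how video and motion tokens might be assigned (time, row, column) positions on a shared axis system. It is not the authors' implementation. The function names, the one-token-per-motion-frame layout, the time rescaling, and the diagonal spatial offset are all assumptions made for illustration; only the general goals (a shared 3D index, synchronized time, and keeping the two modalities from colliding) come from the text on this page.

```python
from typing import List, Tuple

Position = Tuple[float, int, int]  # (time, row, column)


def video_positions(frames: int, height: int, width: int) -> List[Position]:
    """Standard 3D grid positions for video latent tokens."""
    return [(float(t), h, w)
            for t in range(frames)
            for h in range(height)
            for w in range(width)]


def motion_positions(motion_frames: int, frames: int,
                     height: int, width: int) -> List[Position]:
    """Hypothetical placement of one motion token per motion frame.

    Assumption 1: time is rescaled so motion frame i lands at
    i * frames / motion_frames on the video's latent time axis.
    Assumption 2: the spatial index is (height, width), one diagonal step
    past the last video cell (height - 1, width - 1), so no motion token
    shares a (row, column) pair with a video token.
    """
    scale = frames / motion_frames
    return [(i * scale, height, width) for i in range(motion_frames)]


def rope_angles(position: float, dim: int, base: float = 10000.0) -> List[float]:
    """Rotation angles of ordinary 1D RoPE for one axis; dim must be even."""
    return [position / base ** (2 * k / dim) for k in range(dim // 2)]


video = video_positions(frames=2, height=2, width=3)
motion = motion_positions(motion_frames=8, frames=2, height=2, width=3)
print(len(video), len(motion))
print(motion[:5])
```

Because RoPE angles are a product of position and frequency, fractional time positions are valid inputs to `rope_angles`, which is what lets a motion stream with a higher frame rate share a time axis with coarser video latents in this sketch.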
Peer Reviews
Decision: ICLR 2026 Poster
1. This work proposes a Dual-Modality DiT architecture that accepts inputs and outputs of different modalities. 2. The proposed Motion-Video Synchronized RoPE is an interesting way to add motion information to the model. 3. The paper proposes a new high-quality dataset of video, human motion, and text.
1. The novelty of the Dual-Modality DiT and Motion-Video Synchronized RoPE is limited. The notion of a multi-modality DiT is not new, and the idea of adding motion information is well studied in human mesh and skeleton generation tasks. 2. The only baseline results are for Wan-1.3B and Wan-5B, which is not enough to evaluate the proposed architecture accurately. 3. There is no ablation study showing the effectiveness of each proposed block in the architecture. 4. The model efficienc
1. The paper clearly identifies a fundamental weakness of current human-centric video generation models in kinematic correctness and proposes to explicitly model the joint distribution of video and motion as a strong inductive bias to enhance video generation performance. 2. The MVS-RoPE design is clear and well justified for the non-trivial problem of aligning modalities with different temporal resolutions. 3. The creation of the 80,000-pair HuMoVe dataset is a substantial contribution to t
1. The paper does not provide a clear description of the specific "open-source datasets, movies, and the internet" used to build the HuMoVe dataset. Furthermore, the extracted motion could be noisy when used as ground truth. 2. The framework's reliance on the SMPL model as its parametric motion representation creates an inherent bottleneck for fine-grained realism. SMPL is a whole-body model that offers very limited, or no, supervision for highly articulated and expressive areas like individual hand g
- The paper proposes the large-scale HuMoVe dataset. Since the dataset includes test captions, videos, and motion parameter pairs, it is highly useful for multi-modal modeling tasks. - MVS-RoPE that can be jointly applied to visual and motion embeddings is proposed. This encoding technique utilizes diagonal positioning to prevent interference between vision and motion latents, which is a reasonable approach (although more experimental evidence is needed to support this). - The paper is easy to
- The deep network structure is only a simple extension of existing networks. Except for MVS-RoPE, the network mainly uses self-attention on concatenated features for joint modeling, which is quite simple and straightforward. Discussion on whether other components could be improved to better support joint modeling would strengthen the paper. - The quantitative evaluation relies only on self-evaluation. Even if direct comparison with prior studies is difficult, the paper should include analyses
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Human Pose and Action Recognition
