MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling

Haoyu Wang; Hao Tang; Donglin Di; Zhilu Zhang; Wangmeng Zuo; Feng Gao; Siwei Ma; Shiliang Zhang

arXiv:2508.17404·cs.CV·February 25, 2026



TL;DR

MoSA introduces a novel decoupled approach for human video generation that separately models structure and appearance, enabling more realistic and complex human motions with fine-grained control and improved interaction modeling.

Contribution

The paper presents MoSA, a structure-appearance decoupling framework for human video synthesis, and introduces a large-scale dataset with diverse complex motions.

Findings

01

MoSA outperforms existing methods on multiple evaluation metrics.

02

The decoupling approach improves structural coherence and motion realism.

03

The dataset enables better training and evaluation of human video models.

Abstract

Existing video generation models predominantly emphasize appearance fidelity while exhibiting limited ability to synthesize complex human motions, such as whole-body movements, long-range dynamics, and fine-grained human-environment interactions. This often leads to unrealistic or physically implausible movements with inadequate structural coherence. To conquer these challenges, we propose MoSA, which decouples the process of human video generation into two components, i.e., structure generation and appearance generation. MoSA first employs a 3D structure transformer to generate a human motion sequence from the text prompt. The remaining video appearance is then synthesized under the guidance of this structural sequence. We achieve fine-grained control over the sparse human structures by introducing Human-Aware Dynamic Control modules with a dense tracking constraint during training.…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Decoupling motion and appearance to guide video generation toward more complex human motion makes sense.
2. The Movid dataset is an important contribution to the field.

Weaknesses

1. The method uses motion correspondences to constrain video consistency. These correspondences can be noisy, especially in textureless regions. Since these correspondences are only estimated from RGB frames, i.e., the final output of the diffusion process, unrolling the gradient through multi-step diffusion is expensive.
2. It is unclear how the 3D contact constraint is differentiable.
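One common way to limit the effect of noisy correspondences, such as those in textureless regions, is to weight each tracked point by a confidence score. The sketch below illustrates that idea only; it is not the paper's dense tracking constraint. The function name `weighted_tracking_loss`, the track layout, and the confidence weights are all assumptions made for illustration.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]


def weighted_tracking_loss(
    pred_tracks: List[List[Point2D]],
    ref_tracks: List[List[Point2D]],
    confidence: List[List[float]],
) -> float:
    """Confidence-weighted mean squared distance between two sets of point tracks.

    Each argument is indexed as [frame][point]. The confidence values are
    hypothetical per-point weights in [0, 1]; a weight of 0 removes an
    unreliable correspondence from the loss entirely.
    """
    weighted_error = 0.0
    total_weight = 0.0
    for pred_frame, ref_frame, conf_frame in zip(pred_tracks, ref_tracks, confidence):
        for (px, py), (rx, ry), w in zip(pred_frame, ref_frame, conf_frame):
            weighted_error += w * ((px - rx) ** 2 + (py - ry) ** 2)
            total_weight += w
    # If every correspondence was rejected, there is nothing to penalize.
    return weighted_error / total_weight if total_weight > 0 else 0.0


# One frame, two tracked points; the second point is badly mismatched.
pred = [[(0.0, 0.0), (1.0, 1.0)]]
ref = [[(0.0, 0.0), (4.0, 5.0)]]
trusted = weighted_tracking_loss(pred, ref, [[1.0, 1.0]])
masked = weighted_tracking_loss(pred, ref, [[1.0, 0.0]])
```

With equal weights the mismatched point dominates the loss; setting its weight to zero removes its contribution. This does not address the reviewer's second concern, the cost of backpropagating through multi-step diffusion.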

Reviewer 02 · Rating 4 · Confidence 5

Strengths

The motivation is interesting, and decoupling structure from appearance addresses a long-standing challenge in text-to-human video generation. Experiments on multiple benchmarks show substantial improvements compared with existing methods.

Weaknesses

Does the proposed method ensure physical realism, for example the deformation of the trampoline shown in Figure 3? The overall framework combines known ideas (DiT backbone + structural priors + control modules), so the innovation is mainly architectural integration. This paper adopts a mask-based generation approach. How does the method handle cases where the human body is partially occluded?

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. This paper proposes 3D-to-2D structure guidance, and HADC integrates cleanly into DiT backbones. Experiments show that it works with CogVideoX and Wan 2.1. Removing the structure branch or replacing 3D with direct 2D pose generation degrades plausibility (missing/occluded limbs).
2. To enhance motion coherence and physics, dense-tracking and contact terms are proposed and shown to be effective.
3. Dataset contribution addresses a real gap (full-body, complex motion) and is used in co

Weaknesses

1. Is projected 2D structure necessary for T2V? The authors argue that generating 3D keypoints and projecting to 2D improves plausibility and occlusion handling over directly predicting 2D skeletons. Ablations show failures like missing limbs when using a 2D structure generator, which supports that claim. However, the current metrics (FVD, CLIP similarity, and some VBench dimensions) are not targeted at structural correctness, so they don’t decisively isolate the benefit of 3D to 2D conditioning
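The argument for 3D-to-2D conditioning rests on the fact that a projected skeleton keeps per-joint depth, which a directly predicted 2D skeleton lacks. The sketch below illustrates this with a standard pinhole camera model. It is an illustration under stated assumptions, not the paper's projection code: the function names and the intrinsics (`focal`, `cx`, `cy`) are hypothetical.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]
Projected = Optional[Tuple[float, float, float]]  # (u, v, depth), or None if not visible


def project_keypoints(
    points: List[Point3D],
    focal: float = 1000.0,  # assumed focal length in pixels
    cx: float = 512.0,      # assumed principal point (x)
    cy: float = 512.0,      # assumed principal point (y)
) -> List[Projected]:
    """Project camera-space 3D keypoints to pixel coordinates with a pinhole model.

    The depth z is kept alongside (u, v) so that overlapping joints can
    still be ordered front to back after projection.
    """
    projected: List[Projected] = []
    for x, y, z in points:
        if z <= 0:
            # At or behind the camera plane: no valid projection.
            projected.append(None)
            continue
        projected.append((focal * x / z + cx, focal * y / z + cy, z))
    return projected


def front_joint(projected: List[Projected], i: int, j: int) -> int:
    """Return the index of whichever visible joint is closer to the camera."""
    a, b = projected[i], projected[j]
    if a is None or b is None:
        raise ValueError("both joints must have a valid projection")
    return i if a[2] <= b[2] else j


# Two joints on the same viewing ray: identical pixels, different depths.
joints = [(0.0, 0.0, 2.0), (0.0, 0.0, 2.5), (0.5, -0.25, 2.0)]
pixels = project_keypoints(joints)
occluder = front_joint(pixels, 0, 1)
```

The first two joints land on the same pixel, so a 2D-only skeleton cannot tell which one is in front. The retained depth resolves the order.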


Taxonomy

Topics: Advanced Vision and Imaging · Face recognition and analysis · Image Enhancement Techniques