RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space
Jingyun Liang, Jingkai Zhou, Shikai Li, Chenjie Cao, Lei Sun, Yichen Qian, Weihua Chen, Fan Wang

TL;DR
RealisMotion introduces a decomposed control framework for human video generation. It allows the foreground subject, background video, human trajectory, and action pattern to be controlled independently, with motion editing performed directly in a 3D world coordinate system.
Contribution
The paper presents a novel framework that explicitly decouples key video elements and performs motion editing in 3D space, enhancing controllability and flexibility in human video synthesis.
Findings
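Reports state-of-the-art controllability and video quality; according to the reviews, this includes the lowest translation and rotation errors on the authors' trajectory benchmark and the highest quality scores on the action control benchmark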
Enables flexible mix-and-match of video elements
Demonstrates effectiveness on benchmark and real-world data
Abstract
Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank…
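The trajectory-control step described above (unprojecting a 2D trajectory into 3D, then adjusting orientation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an ideal pinhole camera with a known focal length, no roll or pitch, and a flat ground plane a fixed height below the camera. The function names `unproject_to_ground` and `headings` and all parameter names are hypothetical, and speed alignment is not covered.

```python
import math
from typing import List, Tuple

Point2 = Tuple[float, float]


def unproject_to_ground(u: float, v: float, focal: float,
                        cx: float, cy: float, cam_height: float) -> Point2:
    """Intersect the camera ray through pixel (u, v) with a flat ground plane.

    Assumptions (illustrative, not from the paper): pinhole camera with
    x right, y down, z forward; no roll or pitch; ground plane located
    cam_height below the camera centre. Returns (X, Z) on the ground,
    in the same unit as cam_height.
    """
    # Ray direction in camera coordinates, scaled so its z component is 1.
    dx = (u - cx) / focal
    dy = (v - cy) / focal
    if dy <= 0:
        # The ray points at or above the horizon and never reaches the ground.
        raise ValueError("pixel is at or above the horizon")
    # Scale the ray so that its y component equals the camera height.
    t = cam_height / dy
    return (t * dx, t)


def headings(points: List[Point2]) -> List[float]:
    """Yaw angle (radians) at each ground point, facing the next point.

    Yaw is measured from the +Z (forward) axis towards +X. The last point
    reuses the previous heading because it has no successor.
    """
    if len(points) < 2:
        return [0.0] * len(points)
    yaws = []
    for (x0, z0), (x1, z1) in zip(points, points[1:]):
        yaws.append(math.atan2(x1 - x0, z1 - z0))
    yaws.append(yaws[-1])
    return yaws


if __name__ == "__main__":
    # Hypothetical 1920x1080 camera, focal length 1000 px, 1.5 m above ground.
    pixels = [(960.0, 1540.0), (1460.0, 1040.0)]
    ground = [unproject_to_ground(u, v, 1000.0, 960.0, 540.0, 1.5)
              for u, v in pixels]
    print(ground)
    print(headings(ground))
```

In this simplified setting the focal length directly sets the scale of the recovered ground path, which gives some intuition for why the abstract lists focal-length calibration as part of the unprojection step.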
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The method section presents several non-trivial and well-designed components that contribute to the effectiveness of the proposed approach. For example, Lines 195–215 describe a technique to address the foot-sliding problem. These designs are valuable, though providing more intuitive explanations would improve the readability of the paper.
2. The visual results in the supplementary videos are impressive, demonstrating high-quality human motion that is well integrated with the environment. The
1. The paper adopts a relatively synthetic setup, and the required input format is not user-friendly. It needs a subject image, a background video, and a sequence of trajectory, orientation, and body poses. This means that the user-provided human motion must be naturally aligned with the background video, which would demand substantial effort from users.
2. It is unclear how users can accurately draw a 2D trajectory on the ground that aligns with the video background and can be projected into a
- The authors propose and implement a sophisticated 3D-conditioned video generation system. The paper is well-structured, mainly organized into two parts: 3D motion generation and editing, and 2D human video generation, making it easy to follow.
- The system incorporates four key inputs: (1) a reference image, (2) a background video, (3) target translation and orientation, and (4) a motion sequence, allowing independent control over these four dimensions.
- The paper systematically discusses how
**Discussion on System Limitations and Trade-offs**
The system relies on 3D guidance for control and a video generation model to produce the final video. This introduces a natural trade-off between the richness of the 3D guidance and the generalization capacity of the video model, i.e., balancing 3D physical priors with video diffusion priors. For instance:
- Using detailed 3D meshes as guidance enhances body shape consistency but limits the diversity of expressions (e.g., it might struggle to
1. RealisMotion introduces an unparalleled level of explicit, independent control over four fundamental video elements, including subject, background, trajectory, and action.
2. The framework successfully integrates 3D motion priors with modern video diffusion priors, i.e., WAN-2.1-T2V. Experimental results verify the effectiveness of the proposed method.
1. This reliance on large-scale internal data creates a major hurdle for reproducibility and makes it impossible for the academic community to verify the results or build upon the fine-tuned model. The performance gains might stem more from the scale and quality of this undisclosed training data than from the architectural novelty of RealisMotion itself.
2. The comparison against existing state-of-the-art methods (e.g., Animate Anyone, MotionCtrl, 3DTrajMaster) is inherently unfair if these meth
1. The paper tackles a central challenge in video generation: the full, independent control of subject, background, and motion.
2. The method successfully combines 3D physical priors with a learned video diffusion prior.
3. The method demonstrates clear state-of-the-art performance, achieving the lowest translation and rotation errors on its trajectory benchmark (Table 2) and the highest quality scores on the action control benchmark (Table 3).
1. The model is fine-tuned on an internal dataset of 3,300 hours of video. This is a major reproducibility flaw. It makes it impossible to distinguish whether the model's SOTA performance comes from the novel architecture or from this massive, proprietary dataset, which the baseline models did not use.
2. The primary benchmark for trajectory control, Trajectory 100, was created by the authors and is not public. This, combined with the internal training data, makes the SOTA claims difficult to ve
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
