3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, Dahua Lin

TL;DR
3DTrajMaster introduces a novel 3D motion control framework for multi-entity video generation, enabling precise manipulation of 3D trajectories and poses, surpassing previous 2D-based methods in accuracy and generalization.
Contribution
The paper presents a new 3D-motion grounded controller built around a plug-and-play object injector, together with the large-scale 360°-Motion Dataset, advancing controllable multi-entity video synthesis.
Findings
Achieves state-of-the-art accuracy in 3D motion control.
Demonstrates improved generalization across diverse entities.
Provides a new dataset linking 3D assets with trajectories.
Abstract
This paper aims to manipulate multi-entity 3D motions in video generation. Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved remarkable synthesis results. However, 2D control signals are inherently limited in expressing the 3D nature of object motions. To overcome this problem, we introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space, given user-desired 6DoF pose (location and rotation) sequences of entities. At the core of our approach is a plug-and-play 3D-motion grounded object injector that fuses multiple input entities with their respective 3D trajectories through a gated self-attention mechanism. In addition, we exploit an injector architecture to preserve the video diffusion prior, which is crucial for generalization ability. To mitigate video quality…
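The gated self-attention fusion mentioned in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the tensor shapes, the linear pose projection `w_pose`, the choice of three Euler angles for rotation, and the `tanh` gate (borrowed from common gated self-attention designs) are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over a token sequence.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_injection(video_tokens, entity_embeds, pose_seqs, w_pose,
                    w_q, w_k, w_v, gate):
    """Hypothetical injector: fuse entity + 6DoF trajectory tokens into video tokens.

    video_tokens:  (N, d) latent video tokens
    entity_embeds: (E, d) one text embedding per entity
    pose_seqs:     (E, F, 6) per-frame location (xyz) and rotation
                   (three Euler angles, an assumption of this sketch)
    gate:          scalar; tanh(gate) scales the injected residual
    """
    num_entities = entity_embeds.shape[0]
    # Project each flattened pose sequence to the token width (assumed linear map).
    pose_embeds = pose_seqs.reshape(num_entities, -1) @ w_pose
    # Pair every entity with its own trajectory.
    grounded = entity_embeds + pose_embeds
    # Attend jointly over video tokens and grounded entity tokens.
    joint = np.concatenate([video_tokens, grounded], axis=0)
    attended = self_attention(joint, w_q, w_k, w_v)
    # Keep only the video positions and add them back through the gate.
    n = video_tokens.shape[0]
    return video_tokens + np.tanh(gate) * attended[:n]

rng = np.random.default_rng(0)
d, frames = 8, 4
video = rng.normal(size=(5, d))
entities = rng.normal(size=(2, d))
poses = rng.normal(size=(2, frames, 6))
w_pose = rng.normal(size=(frames * 6, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

closed = gated_injection(video, entities, poses, w_pose, w_q, w_k, w_v, gate=0.0)
opened = gated_injection(video, entities, poses, w_pose, w_q, w_k, w_v, gate=1.0)
```

With `gate=0.0` the residual is multiplied by zero and the video tokens pass through unchanged, which is one way an injector of this kind could leave a pretrained video diffusion prior intact at the start of training. A non-zero gate lets the entity and trajectory tokens influence the video tokens.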
Peer Reviews
Decision: ICLR 2025 Poster
The paper addresses the lack of 6-DoF controllability of existing video generation methods. The method is well-motivated and method designs are clearly explained. The advantage of 6-DoF control over 2D motion control is clearly demonstrated in experiments.
* The related-work section discusses prior methods for motion control and motion synthesis, but could also cover techniques for injecting controls into video foundation models, including ControlNet [1] and methods that enable 2D image editing by manipulating attention maps. In particular, ControlNet [1] is mentioned but not cited in the paper.
* The proposed dataset is restricted to human and animal categories, and locations are limited to cities. Whether it's
The proposed 3D-motion grounded object injector, combining 6DoF pose sequences with entity descriptions, is an innovative contribution that extends beyond 2D control limitations.
**Dataset Creation**: The construction of the 360°-Motion Dataset addresses a notable gap in available training data, particularly for multi-entity scenarios, using an innovative combination of GPT and UE.
**Flexibility**: The plug-and-play nature of the proposed object injector facilitates broader applicability acros
**Dataset Limitation**: The reliance on synthetic data and a limited number of assets may hinder real-world generalization. The "city" setting constraint for the dataset also limits the diversity of possible outputs.
**Generalizability**: The model's performance on general 3D scenes beyond those captured in the MatrixCity platform remains unclear. More evaluation on diverse, real-world datasets would strengthen the contributions.
**Evaluation Scope**: While evaluation metrics like FVD and
1. The proposed method is the first to control entities' motion with 3D trajectories in video generation. The task is novel and reasonable, as 3D control signals can fully express the inherent 3D nature of motion and offer better controllability in video generation than 2D control signals.
2. The method design is clear and reasonable.
3. The paper constructs a new synthetic dataset for this task. The dataset could benefit subsequent work on video generation with 3D entity control.
4. Th
1. The dataset lacks diversity in terms of background and motion types. The setting is restricted to a "City" environment (as noted in the paper's Limitations section), and the actions are primarily limited to walking. Consequently, models trained on this dataset are also constrained in their generalizability.
2. Foot skating/floating issues are prevalent in the dataset. This appears to result from inconsistencies between the relative motion and global motion of the dynamic entities, which could
Taxonomy
Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Human Motion and Animation
Methods: Diffusion
