MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer
Penghui Liu, Jiangshan Wang, Yutong Shen, Shanhui Mo, Chenyang Qi, Yue Ma

TL;DR
MultiMotion is a unified framework that combines Maskaware Attention Motion Flow (AMF) with the RectPC solver for precise multi-object video motion transfer, addressing the motion entanglement and lack of object-level control that limit Diffusion Transformer (DiT) architectures.
Contribution
The paper presents Maskaware Attention Motion Flow (AMF), which uses SAM2 masks to disentangle the motion of individual objects, and RectPC, a high-order predictor-corrector solver for efficient sampling, along with a new benchmark dataset for evaluating multi-object motion transfer.
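To make the idea of mask-based motion disentanglement concrete, here is a minimal sketch, not the paper's AMF. It assumes a dense displacement field is already available and that each object has a binary mask (such as one produced by SAM2). The function name `split_flow_by_masks`, the array shapes, and the toy data are all hypothetical.

```python
import numpy as np

def split_flow_by_masks(flow, masks):
    """Split one dense motion field into per-object motion fields.

    flow:  (H, W, 2) array of per-pixel displacements (assumed given).
    masks: (K, H, W) boolean array, one binary mask per object.
    Returns the masked flow per object, shape (K, H, W, 2), and the
    mean displacement of each object, shape (K, 2).
    """
    masks = masks.astype(flow.dtype)
    # Broadcasting zeroes out every pixel that lies outside object k.
    per_object = masks[..., None] * flow[None]
    area = masks.sum(axis=(1, 2))
    # Guard against empty masks when averaging.
    mean = per_object.sum(axis=(1, 2)) / np.maximum(area, 1.0)[:, None]
    return per_object, mean

# Toy scene: the left half moves right, the right half moves up by 2.
H, W = 4, 6
flow = np.zeros((H, W, 2))
flow[:, :3] = [1.0, 0.0]
flow[:, 3:] = [0.0, -2.0]
masks = np.zeros((2, H, W), dtype=bool)
masks[0, :, :3] = True
masks[1, :, 3:] = True

per_object, mean_motion = split_flow_by_masks(flow, masks)
```

In this toy case each object's mean displacement recovers exactly the motion assigned to its region, which is what object-level control needs. The paper itself operates on attention-derived motion features inside the DiT pipeline, and those details are not given in this summary.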
Findings
Achieves semantically aligned, temporally coherent multi-object motion transfer.
Maintains high quality and scalability of diffusion transformer models.
Provides the first benchmark dataset for DiT-based multi-object motion transfer.
Abstract
Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Maskaware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining…
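The abstract describes RectPC only as a high-order predictor-corrector solver, so its actual formulation is not known from this summary. As a generic illustration of the predictor-corrector idea, the sketch below uses Heun's second-order method on an ODE of the form dx/dt = v(x, t), the kind of equation integrated during flow-based sampling. It is a stand-in, not RectPC: `predictor_corrector_sample` and `toy_velocity` are hypothetical names, and the toy velocity field replaces a learned model.

```python
import numpy as np

def predictor_corrector_sample(velocity, x0, num_steps=50, t0=0.0, t1=1.0):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with Heun's method."""
    x = np.asarray(x0, dtype=float)
    ts = np.linspace(t0, t1, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        h = t_next - t_cur
        v_cur = velocity(x, t_cur)
        # Predictor: a plain Euler step gives a first guess at the next state.
        x_pred = x + h * v_cur
        # Corrector: re-evaluate the velocity at the guess and average
        # the two slopes (trapezoidal rule).
        v_next = velocity(x_pred, t_next)
        x = x + 0.5 * h * (v_cur + v_next)
    return x

def toy_velocity(x, t):
    # Stand-in for a learned velocity field; the exact solution is x0 * exp(-t).
    return -x

x_final = predictor_corrector_sample(toy_velocity, np.array([1.0, 2.0]))
```

The corrector step costs a second velocity evaluation per step, but raises the accuracy from first to second order, so fewer steps are needed for the same error. This trade-off is the general motivation for predictor-corrector samplers.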
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Human Pose and Action Recognition
