MotionGrounder: Grounded Multi-Object Motion Transfer via Diffusion Transformer
Samuel Teodoro, Yun Chen, Agus Gunawan, Soo Ye Kim, Jihyong Oh, Munchurl Kim

TL;DR
MotionGrounder is a novel diffusion transformer framework that enables multi-object motion transfer in videos, incorporating object grounding and alignment to improve controllability and realism.
Contribution
It introduces a multi-object motion transfer method with a flow-based motion signal, object-caption alignment loss, and a new grounding score, advancing multi-object controllability in diffusion-based video synthesis.
Findings
Outperforms recent baselines in quantitative evaluations.
Achieves better spatial and semantic alignment in generated videos.
Enhances multi-object controllability in diffusion transformer models.
Abstract
Motion transfer enables controllable video generation by transferring temporal dynamics from a reference video to synthesize a new video conditioned on a target caption. However, existing Diffusion Transformer (DiT)-based methods are limited to single-object videos, restricting fine-grained control in real-world scenes with multiple objects. In this work, we introduce MotionGrounder, a DiT-based framework that is the first to handle motion transfer with multi-object controllability. Our Flow-based Motion Signal (FMS) in MotionGrounder provides a stable motion prior for target video generation, while our Object-Caption Alignment Loss (OCAL) grounds object captions to their corresponding spatial regions. We further propose a new Object Grounding Score (OGS), which jointly evaluates (i) spatial alignment between source video objects and their generated counterparts and (ii) semantic consistency…
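The abstract excerpt does not define how OGS measures spatial alignment. As a rough illustration only, one common way to compare where an object sits in a source frame and in the generated frame is intersection-over-union (IoU) of per-frame bounding boxes, averaged over time. The sketch below assumes that setup. The names `box_iou` and `mean_track_iou` and the `(x_min, y_min, x_max, y_max)` box format are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format (not from the paper): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_track_iou(source: Sequence[Box], generated: Sequence[Box]) -> float:
    """Average per-frame IoU between one object's source and generated boxes.

    Illustrative stand-in for a spatial alignment term, not the paper's OGS.
    """
    if len(source) != len(generated):
        raise ValueError("both tracks need one box per frame")
    if not source:
        return 0.0
    return sum(box_iou(s, g) for s, g in zip(source, generated)) / len(source)
```

A multi-object variant could average `mean_track_iou` over all objects. The semantic consistency part of OGS would need a separate text-image similarity model and is not sketched here.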
