MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu

TL;DR
MotiMotion introduces a reasoning-based framework for motion-controlled video generation that enhances plausibility and interaction realism by refining trajectories and hallucinating secondary motions.
Contribution
The paper presents a novel reasoning-then-generation approach, a confidence-aware control scheme, and a new benchmark for more realistic motion-controlled video synthesis.
Findings
Produces more plausible object behaviors and interactions.
Outperforms existing methods in human evaluations.
Demonstrates effectiveness on the new MotiBench dataset.
Abstract
Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative…
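The abstract does not specify how the confidence-aware control scheme modulates guidance strength. As a rough illustration of the general idea only, the sketch below assumes a per-trajectory confidence score in [0, 1] and a linear mapping to a blending weight between the model's unconstrained prediction and the trajectory-controlled one. The names `guidance_weight` and `blend_predictions`, the linear schedule, and the `w_min`/`w_max` parameters are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def guidance_weight(confidence: float, w_min: float = 0.0, w_max: float = 1.0) -> float:
    """Map a confidence score to a guidance strength.

    Hypothetical linear schedule; the paper's actual mapping is not given
    in the abstract.
    """
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range scores to [0, 1]
    return w_min + c * (w_max - w_min)


def blend_predictions(
    free: Sequence[float],
    controlled: Sequence[float],
    confidence: float,
) -> List[float]:
    """Interpolate between an unconstrained and a trajectory-controlled prediction.

    High confidence follows the controlled prediction closely; low confidence
    falls back toward the model's own (unconstrained) prediction.
    """
    w = guidance_weight(confidence)
    return [(1.0 - w) * f + w * c for f, c in zip(free, controlled)]


if __name__ == "__main__":
    free_pred = [0.0, 2.0]        # stand-in for the model's own motion estimate
    controlled_pred = [1.0, 4.0]  # stand-in for the trajectory-following estimate
    for conf in (0.0, 0.5, 1.0):
        print(conf, blend_predictions(free_pred, controlled_pred, conf))
```

Under this assumption, a low-confidence plan contributes little to the final prediction, which mirrors the abstract's description of correcting artifacts under low-confidence inputs while closely following high-confidence plans.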
