Characterizing Motion Encoding in Video Diffusion Timesteps
Vatsal Baherwani, Yixuan Ren, Abhinav Shrivastava

TL;DR
This paper systematically characterizes how motion and appearance are encoded across timesteps in video diffusion models, revealing an early motion-dominant and a later appearance-dominant regime, and simplifies motion transfer by focusing on the motion-dominant phase.
Contribution
It introduces a quantitative protocol to map motion and appearance trade-offs across diffusion timesteps and proposes a simplified motion transfer method based on this characterization.
Findings
The study identifies an early, motion-dominant regime in the diffusion timesteps.
It establishes a later, appearance-dominant regime in the diffusion timesteps.
Focusing on the motion-dominant regime enables strong motion transfer without auxiliary modules or specialized objectives.
Abstract
Text-to-video diffusion models synthesize temporal motion and spatial appearance through iterative denoising, yet how motion is encoded across timesteps remains poorly understood. Practitioners often exploit the empirical heuristic that early timesteps mainly shape motion and layout while later ones refine appearance, but this behavior has not been systematically characterized. In this work, we proxy motion encoding in video diffusion timesteps by the trade-off between appearance editing and motion preservation induced when injecting new conditions over specified timestep ranges, and characterize this proxy through a large-scale quantitative study. This protocol allows us to factor motion from appearance by quantitatively mapping how they compete along the denoising trajectory. Across diverse architectures, we consistently identify an early, motion-dominant regime and a later,…
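The protocol described in the abstract, injecting a new condition over a chosen timestep range and measuring the resulting trade-off, can be illustrated with a small sketch. This is not the authors' implementation. The names `condition_schedule`, `sweep_switch_points`, and `run_and_score` are hypothetical, and the scoring callback stands in for whatever sampler and appearance/motion metrics a real study would use. Step index 0 is assumed to be the first (noisiest) denoising step.

```python
from typing import Callable, List, Sequence, Tuple


def condition_schedule(num_steps: int, inject_start: int, inject_end: int,
                       source_cond: str, new_cond: str) -> List[str]:
    """Return the condition used at each denoising step.

    Assumption: step 0 is the first (noisiest) step. The new condition is
    injected on steps in [inject_start, inject_end); every other step keeps
    the source condition.
    """
    if not 0 <= inject_start <= inject_end <= num_steps:
        raise ValueError("inject range must lie within [0, num_steps]")
    return [new_cond if inject_start <= i < inject_end else source_cond
            for i in range(num_steps)]


def sweep_switch_points(
    num_steps: int,
    switch_points: Sequence[int],
    run_and_score: Callable[[List[str]], Tuple[float, float]],
    source_cond: str,
    new_cond: str,
) -> List[Tuple[int, float, float]]:
    """For each switch point s, inject the new condition on steps [s, num_steps).

    `run_and_score` is a hypothetical callback: it should run the sampler with
    the given per-step conditions and return
    (appearance_edit_score, motion_preservation_score).
    """
    results = []
    for s in switch_points:
        schedule = condition_schedule(num_steps, s, num_steps,
                                      source_cond, new_cond)
        edit_score, motion_score = run_and_score(schedule)
        results.append((s, edit_score, motion_score))
    return results


if __name__ == "__main__":
    # Show which condition each of 5 steps would receive when the new
    # condition is injected from step 2 onward.
    print(condition_schedule(5, 2, 5, "source prompt", "edited prompt"))
```

Plotting the two scores against the switch point would give the kind of trade-off curve the abstract refers to. The actual metrics, models, and timestep ranges are those of the paper and are not reproduced here.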
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Computer Graphics and Visualization Techniques
