Video Diffusion Models are Training-free Motion Interpreter and Controller
Zeqi Xiao, Yifan Zhou, Shuai Yang, Xingang Pan

TL;DR
This paper reveals that video diffusion models inherently encode motion features, introduces a training-free motion control method using these features, and demonstrates its effectiveness in generating natural, controllable video motion.
Contribution
It uncovers the existence of interpretable motion-aware features in video diffusion models and proposes a training-free framework for motion control leveraging these features.
Findings
Motion-aware features are inherently encoded in diffusion models.
The proposed MOFT method enables training-free motion extraction.
The framework achieves competitive results in natural motion generation.
Abstract
Video generation primarily aims to model authentic and customized motion across frames, making understanding and controlling the motion a crucial topic. Most diffusion-based studies on video motion focus on motion customization with training-based paradigms, which, however, demand substantial training resources and necessitate retraining for diverse models. Crucially, these approaches do not explore how video diffusion models encode cross-frame motion information in their features, lacking interpretability and transparency in their effectiveness. To answer this question, this paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. Through analysis using Principal Component Analysis (PCA), our work discloses that a robust motion-aware feature already exists in video diffusion models. We present a new MOtion FeaTure…
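The abstract mentions using PCA to find motion-related structure in diffusion features. The sketch below illustrates only that general idea, on synthetic data. It is not the paper's MOFT procedure, whose details are not given in this excerpt. The array shapes, the planted "motion direction", and the helper name pca_project are all assumptions made for illustration.

```python
import numpy as np


def pca_project(features, n_components=3):
    """Project rows of `features` (n_samples, n_channels) onto the top principal components.

    Hypothetical helper for illustration; returns the projected coordinates and
    the fraction of variance explained by each kept component.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # SVD of the centered data: rows of `vt` are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]


# Synthetic stand-in for intermediate features: (videos, frames, channels).
# Real features would come from a video diffusion model's internal layers.
rng = np.random.default_rng(0)
n_videos, n_frames, n_channels = 8, 16, 64

# Assumption: motion shows up as drift along one fixed direction in channel space.
motion_direction = rng.normal(size=n_channels)
motion_direction /= np.linalg.norm(motion_direction)

strength = np.linspace(-1, 1, n_videos)  # per-video signed motion strength
time = np.linspace(-1, 1, n_frames)      # normalized frame index
signal = strength[:, None] * time[None, :]  # (videos, frames)

features = (
    5.0 * signal[:, :, None] * motion_direction
    + 0.1 * rng.normal(size=(n_videos, n_frames, n_channels))
)

# Treat every (video, frame) pair as one sample and run PCA over channels.
flat = features.reshape(-1, n_channels)
projected, explained = pca_project(flat, n_components=3)

# How well does the first component track the planted motion signal?
correlation = abs(np.corrcoef(projected[:, 0], signal.ravel())[0, 1])
print("explained variance ratios:", explained)
print("|corr(PC1, planted motion)|:", correlation)
```

In this toy setup the first principal component recovers the planted direction because that direction dominates the variance. Whether and where real diffusion features behave this way is the empirical question the paper investigates.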
Taxonomy
Topics: Model Reduction and Neural Networks
Methods: Sparse Evolutionary Training · Focus · Diffusion
