Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics
Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi

TL;DR
Puppet-Master is an interactive video generation model that synthesizes part-level object motions from minimal user inputs, leveraging synthetic training data and novel attention mechanisms to outperform existing methods and generalize to real images.
Contribution
We propose Puppet-Master, a novel model that captures internal object dynamics for video synthesis, using synthetic data and all-to-first attention to improve motion realism and out-of-domain generalization.
Findings
Generates realistic part-level motions from minimal inputs.
Outperforms existing methods on real-world benchmarks.
Successfully generalizes to out-of-domain images in zero-shot settings.
Abstract
We introduce Puppet-Master, an interactive video generator that captures the internal, part-level motion of objects, serving as a proxy for modeling object dynamics universally. Given an image of an object and a set of "drags" specifying the trajectory of a few points on the object, the model synthesizes a video where the object's parts move accordingly. To build Puppet-Master, we extend a pre-trained image-to-video generator to encode the input drags. We also propose all-to-first attention, an alternative to conventional spatial attention that mitigates artifacts caused by fine-tuning a video generator on out-of-domain data. The model is fine-tuned on Objaverse-Animation-HQ, a new dataset of curated part-level motion clips obtained by rendering synthetic 3D animations. Unlike real videos, these synthetic clips avoid confounding part-level motion with overall object and camera motion.…
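The abstract names all-to-first attention but does not spell out its mechanics. One plausible reading is that every frame's queries attend to keys and values taken from the first frame, rather than from the frame itself as in conventional per-frame spatial attention. The toy sketch below illustrates that reading only. The function names (`attend`, `all_to_first_attention`, `per_frame_attention`) are hypothetical, learned query/key/value projections are omitted for brevity, and none of this is taken from the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors.

    Assumption: identity projections, i.e. tokens are used directly as Q/K/V.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def per_frame_attention(frames):
    """Conventional spatial attention: each frame attends to its own tokens."""
    return [attend(frame, frame, frame) for frame in frames]


def all_to_first_attention(frames):
    """Assumed all-to-first variant: every frame's queries attend to the
    first frame's tokens, which serve as both keys and values."""
    first = frames[0]
    return [attend(frame, first, first) for frame in frames]


# Two frames, two tokens per frame, two channels per token.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],   # first (conditioning) frame
    [[0.5, 0.5], [2.0, -1.0]],  # a later frame
]
out = all_to_first_attention(frames)
```

Under this reading, the first frame's output is identical in both variants, while every later frame's output becomes a weighted mixture of first-frame tokens, which is one way the conditioning image's appearance could be carried through the clip.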
Taxonomy
Topics: Human Motion and Animation · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training · Diffusion
