Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Ruining Li; Chuanxia Zheng; Christian Rupprecht; Andrea Vedaldi

arXiv:2408.04631·cs.CV·August 29, 2025

TL;DR

Puppet-Master is an interactive video generation model that synthesizes part-level object motion from a single image and a few user-specified drags. It is fine-tuned on synthetic data with all-to-first attention, outperforms existing methods, and generalizes to real images.

Contribution

We propose Puppet-Master, a novel model that captures internal object dynamics for video synthesis, using synthetic data and all-to-first attention to improve motion realism and out-of-domain generalization.
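
As a rough illustration of what "all-to-first attention" could look like, here is a minimal NumPy sketch. It assumes, as the name suggests, that the queries of every frame attend to the keys and values of the first frame only; the listing here does not spell out the mechanism, so the function name, tensor layout, and scaling are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def all_to_first_attention(q, k, v):
    """Hypothetical sketch: q, k, v have shape (frames, tokens, dim).

    Assumption: each frame's queries attend to the keys/values of
    frame 0, instead of to the keys/values of their own frame as in
    conventional per-frame spatial self-attention.
    """
    k0, v0 = k[0], v[0]                      # (tokens, dim) from the first frame
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k0.T) * scale              # (frames, tokens, tokens)
    weights = softmax(scores, axis=-1)
    return weights @ v0                      # (frames, tokens, dim)


rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4, 8))
k = rng.normal(size=(3, 4, 8))
v = rng.normal(size=(3, 4, 8))
out = all_to_first_attention(q, k, v)
print(out.shape)
```

Under this assumption, the keys and values of later frames never influence the output, so every generated frame is tied to the appearance of the first (conditioning) frame.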

Findings

01 Generates realistic part-level motions from minimal inputs.

02 Outperforms existing methods on real-world benchmarks.

03 Successfully generalizes to out-of-domain images in zero-shot settings.

Abstract

We introduce Puppet-Master, an interactive video generator that captures the internal, part-level motion of objects, serving as a proxy for modeling object dynamics universally. Given an image of an object and a set of "drags" specifying the trajectory of a few points on the object, the model synthesizes a video where the object's parts move accordingly. To build Puppet-Master, we extend a pre-trained image-to-video generator to encode the input drags. We also propose all-to-first attention, an alternative to conventional spatial attention that mitigates artifacts caused by fine-tuning a video generator on out-of-domain data. The model is fine-tuned on Objaverse-Animation-HQ, a new dataset of curated part-level motion clips obtained by rendering synthetic 3D animations. Unlike real videos, these synthetic clips avoid confounding part-level motion with overall object and camera motion.…
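
To make the notion of a "drag" concrete, the sketch below shows one possible way to represent it: a start and end point in image coordinates, expanded into one position per video frame. The `Drag` class, its field names, and the straight-line interpolation are all hypothetical choices for illustration; the abstract only states that drags specify the trajectory of a few points on the object.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Drag:
    """Hypothetical drag: a point on the object and where it should end up."""
    start: Point
    end: Point

    def trajectory(self, num_frames: int) -> List[Point]:
        # Assumption: the point moves along a straight line at constant
        # speed; a real system may use arbitrary per-frame trajectories.
        if num_frames < 2:
            return [self.start]
        (x0, y0), (x1, y1) = self.start, self.end
        steps = num_frames - 1
        return [
            (x0 + (x1 - x0) * t / steps, y0 + (y1 - y0) * t / steps)
            for t in range(num_frames)
        ]


# A "set of drags" is then just a short list, one entry per dragged point.
drags = [Drag(start=(0.0, 0.0), end=(4.0, 8.0))]
path = drags[0].trajectory(num_frames=5)
print(path)
```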

Code & Models

Datasets

rayli/Drag-a-Move-test-split
dataset · 94 dl

Taxonomy

Topics: Human Motion and Animation · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training · Diffusion