Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance
Haoxuan Wang, Jiachen Tao, Junyi Wu, Gaowen Liu, Ramana Rao Kompella, Yan Yan

TL;DR
Motion Marionette introduces a zero-shot framework for rigid motion transfer that uses an internal spatial-temporal prior, enabling generalizable, temporally consistent, and controllable video synthesis from monocular videos and images.
Contribution
It proposes a novel internal prior based on 3D representations and motion trajectories, avoiding external priors and improving generalizability and temporal consistency.
Findings
Generalizes across diverse objects
Produces temporally consistent videos
Supports controllable motion transfer
Abstract
We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is well written and easy to follow. 2. The internal-prior formulation (SpaT) is category- and semantics-agnostic because it builds on multiple generalized approaches. 3. The use of 3DGS provides explicit control over objects.
1. The method's core assumption of rigid motion significantly restricts its scope if the goal is motion-transferred video generation. Non-rigid motion (cloth, humans) is largely out of scope but plays a big role in real-life videos. 2. The outputs are predominantly object-centric, whereas most competing baselines are designed for full-scene motion transfer. This mismatch in focus reduces the fairness and interpretability of the quantitative and qualitative comparisons. 3. The choice of 3DGS as
- The core idea of creating an "internal" spatial-temporal prior that only captures motion is good. - Because the motion is represented as an explicit velocity field, the method allows for easy control over motion speed, camera view, and video length. The reported efficiency (under 3 minutes for transfer after prior extraction) is a strong practical advantage over slower generative models.
- The paper compares itself to diffusion models (DMT) and physics simulation (PhysGaussian), but it completely ignores the large body of work on flow-based or trajectory-conditioned video generation (like https://motion-prompting.github.io/). Methods that use optical flow or motion trajectories as guidance are highly relevant competitors, and not comparing against them makes the claim of a "new paradigm" feel overstated. - The paper doesn't give a good reason for using 3DGS. The motion is modele
1. The paper proposes a clear and principled method for constructing the Spatial-temporal (SpaT) prior. It effectively extracts 3D motion trajectories from the source video and distills them into a sequence of rigid transformations (rotation and translation) using the Umeyama algorithm. 2. This paper proposes a robust motion transfer process. It begins by applying the SpaT prior to the target's 3DGS representation to derive a velocity field. A crucial refinement stage is then introduced, featuri
1. The framework's performance is fundamentally capped by the quality of its inputs. The process of reconstructing 3D representations from a monocular video and a single target image is inherently error-prone. These initial inaccuracies in 3D geometry can propagate through the pipeline, compromising the visual quality and stability of the final rendered video. 2. The translation examples shown in Figure 2 are overly simplistic. For such straightforward movements, a much simpler baseline (e.g., u
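One of the reviews above notes that the SpaT prior distills 3D motion trajectories into a sequence of rigid transformations (rotation and translation) using the Umeyama algorithm. The following is a minimal sketch of that idea, not the authors' implementation: it assumes trajectories arrive as a NumPy array of shape (frames, points, 3), fits one rigid transform per consecutive frame pair, and omits the scale term of the full Umeyama formulation. The names `rigid_fit` and `trajectories_to_transforms` are hypothetical.

```python
import numpy as np


def rigid_fit(src, dst):
    """Least-squares rotation R and translation t with dst ~= src @ R.T + t.

    This is the Umeyama solution with the scale factor fixed to 1.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    # Cross-covariance of the centered point sets (3 x 3).
    cov = (dst - mu_dst).T @ (src - mu_src) / len(src)
    U, _, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    t = mu_dst - R @ mu_src
    return R, t


def trajectories_to_transforms(traj):
    """Fit one rigid transform per consecutive frame pair.

    traj: array of shape (frames, points, 3); this layout is an assumption.
    Returns a list of (R, t) tuples of length frames - 1.
    """
    traj = np.asarray(traj, dtype=float)
    return [rigid_fit(traj[i], traj[i + 1]) for i in range(len(traj) - 1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.normal(size=(50, 3))
    theta = 0.3
    R_true = np.array([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta), np.cos(theta), 0.0],
        [0.0, 0.0, 1.0],
    ])
    t_true = np.array([0.5, -1.0, 2.0])
    moved = points @ R_true.T + t_true
    R, t = rigid_fit(points, moved)
    print(np.allclose(R, R_true), np.allclose(t, t_true))
```

Because each fitted transform depends only on how the tracked points move between frames, the same sequence can in principle be applied to a different object's points, which is the property the reviews describe as geometry- and semantics-agnostic.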
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Human Motion and Animation
