Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Manuel Kansy; Jacek Naruniec; Christopher Schroers; Markus Gross; Romann M. Weber

arXiv:2408.00458 · cs.CV · May 27, 2025

TL;DR

This paper introduces motion-textual inversion, a method for semantic video motion transfer. It optimizes a motion-text embedding on a single motion reference video using a pre-trained image-to-video model, then applies that embedding to a target image to generate a video that follows the reference motion.

Contribution

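The paper represents motion as text/image embedding tokens fed to a pre-trained image-to-video model through cross-attention, while appearance comes from the (latent) image input. This separation allows semantic video motion transfer without requiring spatial alignment between the motion reference video and the target image.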

Findings

01 Outperforms existing methods in semantic video motion transfer

02 Enables high temporal motion granularity with multiple embedding tokens per frame

03 Generalizes across various domains and tasks such as reenactment and object motion control

Abstract

Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By…
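The core idea in the abstract (keep the generator frozen, take appearance from the image input, and optimize only a per-frame set of embedding tokens against a reference video) can be illustrated with a toy sketch. This is not the paper's implementation: the linear `generate` function and its fixed weights `W` are hypothetical stand-ins for a pre-trained image-to-video diffusion model, the plain reconstruction loss stands in for whatever training objective the authors use, and all names and sizes are assumptions chosen for readability.

```python
# Toy sketch: optimize a per-frame, multi-token "motion embedding" against a
# frozen generator, then reuse it on a different image.
# ASSUMPTION: `generate`, `W`, and all sizes are hypothetical stand-ins,
# not the pre-trained image-to-video model used in the paper.

# Frozen "generator" weights: map a D-dim embedding to a P-pixel offset (P=4, D=2).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]


def generate(image, embedding):
    """Frozen toy generator.

    Appearance comes from `image`; the per-frame change comes from `embedding`,
    where embedding[f][n] is the n-th token (a length-D list) for frame f.
    """
    dim = len(W[0])
    frames = []
    for tokens in embedding:
        # Tokens of one frame are combined by summation in this toy model.
        summed = [sum(tok[d] for tok in tokens) for d in range(dim)]
        frames.append([
            image[i] + sum(W[i][d] * summed[d] for d in range(dim))
            for i in range(len(image))
        ])
    return frames


def invert_motion(image, reference, num_tokens=2, steps=100, lr=0.05):
    """Gradient descent on the embedding only; the generator weights stay fixed."""
    dim = len(W[0])
    emb = [[[0.0] * dim for _ in range(num_tokens)] for _ in reference]
    for _ in range(steps):
        pred = generate(image, emb)
        for f, tokens in enumerate(emb):
            resid = [pred[f][i] - reference[f][i] for i in range(len(image))]
            # Gradient of the squared error with respect to each token of frame f.
            grad = [2.0 * sum(resid[i] * W[i][d] for i in range(len(image)))
                    for d in range(dim)]
            for tok in tokens:
                for d in range(dim):
                    tok[d] -= lr * grad[d]
    return emb


# Build a 3-frame reference video from a source image and a known motion.
source_image = [0.2, 0.4, 0.6, 0.8]
true_motion = [[[0.0, 0.0]], [[0.5, -0.25]], [[1.0, -0.5]]]
reference_video = generate(source_image, true_motion)

# Learn the motion embedding from the reference, then apply it to a new image.
motion_embedding = invert_motion(source_image, reference_video)
target_image = [0.9, 0.1, 0.5, 0.3]
transferred = generate(target_image, motion_embedding)
```

In this toy setting the transferred frames keep the target image's values as their base and reproduce the frame-to-frame offsets of the reference video, which mirrors the appearance/motion split the abstract describes. A real image-to-video model is nonlinear, so no such exact recovery should be expected there.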

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Motion and Animation