Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion
Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber

TL;DR
This paper introduces motion-textual inversion, a method for semantic video motion transfer. It represents the motion of a reference video as text/image embedding tokens and feeds them to a pre-trained image-to-video model, which generates a video that applies the reference motion to a target image while preserving that image's appearance and position.
Contribution
The paper proposes a new approach that leverages motion-textual embeddings and a pre-trained image-to-video model to improve semantic video motion transfer without requiring spatial alignment.
Findings
Outperforms existing methods in semantic video motion transfer
Enables high temporal motion granularity with multiple embedding tokens per frame
Generalizes across various domains and tasks such as reenactment and object motion control
Abstract
Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By…
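The idea of representing motion as embedding tokens that a frozen model consumes through cross-attention can be illustrated with a small numerical toy. The sketch below is not the authors' implementation. It assumes, as the name "motion-textual inversion" suggests, that the embedding is fit by gradient-based optimization on the reference video while the model stays frozen. The single attention layer, the plain reconstruction loss (a real diffusion model would use a denoising objective), the backtracking step size, and every name and size (`forward`, `invert_motion`, `F`, `N`, `D`) are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, D = 4, 2, 8  # frames, tokens per frame, channel width (hypothetical sizes)

# Frozen stand-in for the pre-trained model: one cross-attention layer.
W_k = rng.normal(size=(D, D)) / np.sqrt(D)
W_v = rng.normal(size=(D, D)) / np.sqrt(D)

def forward(image_latent, emb):
    """Map one image latent (D,) and an embedding (F, N, D) to F frame latents."""
    K, V = emb @ W_k, emb @ W_v                    # keys and values, (F, N, D)
    scores = K @ image_latent / np.sqrt(D)         # image latent acts as the query
    scores -= scores.max(axis=1, keepdims=True)    # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # (F, N)
    out = np.einsum("fn,fnd->fd", attn, V)         # (F, D)
    return image_latent + out, (attn, V)           # appearance + motion residual

def loss_and_grad(image_latent, emb, target):
    """Squared reconstruction error and its gradient w.r.t. the embedding only."""
    pred, (attn, V) = forward(image_latent, emb)
    g = pred - target
    loss = 0.5 * np.sum(g ** 2)
    dV = attn[:, :, None] * g[:, None, :]
    dattn = np.einsum("fnd,fd->fn", V, g)
    dscores = attn * (dattn - np.sum(attn * dattn, axis=1, keepdims=True))
    dK = dscores[:, :, None] * image_latent[None, None, :] / np.sqrt(D)
    return loss, dK @ W_k.T + dV @ W_v.T

def invert_motion(image_latent, target, steps=200, lr=0.5):
    """Fit the embedding by gradient descent; the model weights are never updated."""
    emb = 0.1 * rng.normal(size=(F, N, D))
    loss, grad = loss_and_grad(image_latent, emb, target)
    history = [loss]
    for _ in range(steps):
        candidate = emb - lr * grad
        cand_loss, cand_grad = loss_and_grad(image_latent, candidate, target)
        if cand_loss < loss:                       # accept only improving steps
            emb, loss, grad = candidate, cand_loss, cand_grad
        else:
            lr *= 0.5                              # otherwise shrink the step
        history.append(loss)
    return emb, history

# Synthetic "reference video": produced by a hidden embedding we try to recover.
ref_image = rng.normal(size=D)
target_image = rng.normal(size=D)
hidden_emb = rng.normal(size=(F, N, D))
ref_video, _ = forward(ref_image, hidden_emb)

motion_emb, history = invert_motion(ref_image, ref_video)
transferred, _ = forward(target_image, motion_emb)  # reuse motion on a new image
print(f"loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

In this toy, appearance enters only through the image latent and the learned tokens contribute a per-frame residual, which loosely mirrors the division of roles described in the abstract. The `(F, N, D)` layout mirrors the finding that multiple embedding tokens per frame are used for temporal granularity.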
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Motion and Animation
