Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation
Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, Gal Chechik

TL;DR
This paper examines how self-attention query (Q) features in text-to-video diffusion models influence motion, structure, and identity. It shows that motion is hard to transfer without also transferring identity, and proposes efficient control methods for motion transfer and multi-shot video consistency.
Contribution
It uncovers the dual role of query features in governing motion and identity, and introduces a zero-shot motion transfer method and a training-free multi-shot video generation technique.
Findings
Self-attention query (Q) features influence both layout and subject identity during denoising.
The proposed zero-shot motion transfer is reported to be 10 times more efficient than existing approaches.
The multi-shot technique maintains character identity across videos.
Abstract
Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query (Q) features simultaneously govern motion, structure, and identity and examine the challenges arising when these representations interact. Our analysis reveals that Q affects not only layout, but that during denoising Q also has a strong effect on subject identity, making it hard to transfer motion without the side-effect of transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method - implemented with VideoCrafter2 and WAN 2.1 - that is 10 times more efficient than existing…
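To make "Q injection" concrete, the following is a minimal sketch of the general idea only, not the authors' implementation: a single-head self-attention layer in NumPy in which the queries computed for a target sample can be replaced by queries recorded from a reference sample. The function name `self_attention`, the `q_override` parameter, the toy dimensions, and the random weights are all illustrative assumptions. The paper's actual scheme for controlling when and where Q is injected is not described in the text above and is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, q_override=None):
    """Single-head self-attention over tokens x of shape (n_tokens, dim).

    If q_override is given, it replaces the queries computed from x.
    This is the hypothetical "Q injection" hook for this sketch.
    Returns the attention output and the queries that were used.
    """
    q = x @ w_q if q_override is None else q_override
    k = x @ w_k
    v = x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v, q

# Toy setup: random weights and random token features (assumed sizes).
rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
x_ref = rng.standard_normal((n_tokens, dim))  # stands in for the reference video
x_tgt = rng.standard_normal((n_tokens, dim))  # stands in for the target video

# 1) Record queries from the reference pass.
_, q_ref = self_attention(x_ref, w_q, w_k, w_v)

# 2) Run the target pass with and without the recorded queries.
out_plain, _ = self_attention(x_tgt, w_q, w_k, w_v)
out_injected, q_used = self_attention(x_tgt, w_q, w_k, w_v, q_override=q_ref)
```

In this sketch the keys and values still come from the target, so only the attention pattern is steered by the reference queries. The abstract's observation is that in a real model this steering carries subject identity along with layout, which is why injection has to be controlled rather than applied uniformly.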
Taxonomy
Topics: Video Analysis and Summarization · Human Motion and Animation · Natural Language Processing Techniques
Methods: Diffusion
