Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation

Yuval Atzmon; Rinon Gal; Yoad Tewel; Yoni Kasten; Gal Chechik

arXiv:2412.07750·cs.CV·May 23, 2025

TL;DR

This paper explores how self-attention query features in text-to-video diffusion models influence motion, structure, and identity, revealing challenges and proposing efficient control methods for motion transfer and multi-shot video consistency.

Contribution

It uncovers the dual role of query features in governing motion and identity, and introduces a zero-shot motion transfer method and a training-free multi-shot video generation technique.

Findings

01 Self-attention query (Q) features influence both layout and subject identity during denoising.

02 The proposed zero-shot motion transfer method is reported to be 10 times more efficient.

03 The multi-shot technique maintains character identity across videos.

Abstract

Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query (Q) features simultaneously govern motion, structure, and identity, and examine the challenges that arise when these representations interact. Our analysis reveals that Q affects not only layout: during denoising, Q also has a strong effect on subject identity, making it hard to transfer motion without the side effect of transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method (implemented with VideoCrafter2 and WAN 2.1) that is 10 times more efficient than existing…
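To make the idea of Q injection concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a single attention head, tokens stored as lists of floats, and a hypothetical step-based schedule (`injection_schedule`) that limits injection to early denoising steps. All function and parameter names are illustrative. The sketch only shows the mechanism named in the abstract: the queries of a self-attention layer are replaced by queries recorded from another generation pass, while keys and values stay with the target video.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for q_i in q:
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        weights = softmax(scores)
        out.append(
            [sum(w * v_j[d] for w, v_j in zip(weights, v)) for d in range(len(v[0]))]
        )
    return out


def self_attention_with_q_injection(q_target, k_target, v_target,
                                    q_source=None, inject=False):
    """Swap in source queries when `inject` is set; keys and values are untouched.

    Falls back to the target's own queries if no source queries are given.
    """
    q = q_source if (inject and q_source is not None) else q_target
    return attention(q, k_target, v_target)


def injection_schedule(num_steps, inject_steps):
    """Hypothetical schedule (an assumption, not taken from the paper):
    inject only during the first `inject_steps` denoising steps."""
    return [step < inject_steps for step in range(num_steps)]
```

In a sampling loop, the flag from `injection_schedule` would be passed as `inject` at each denoising step. Restricting injection to part of the process is one plausible way to limit identity leakage, given the abstract's observation that Q strongly affects subject identity during denoising.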

Taxonomy

Topics: Video Analysis and Summarization · Human Motion and Animation · Natural Language Processing Techniques

Methods: Diffusion