Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

Guy Yariv; Yuval Kirstain; Amit Zohar; Shelly Sheynin; Yaniv Taigman; Yossi Adi; Sagie Benaim; Adam Polyak

arXiv:2501.03059·cs.CV·January 7, 2025

TL;DR

This paper introduces a mask-based motion trajectory representation for image-to-video generation, significantly improving motion accuracy and temporal coherence in multi-object scenarios by decomposing the task into intermediate representation generation and video synthesis.

Contribution

The paper proposes a novel two-stage framework with a mask-based motion trajectory as an intermediate representation, enhancing motion realism and consistency in image-to-video generation.
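
The two-stage decomposition can be illustrated with a toy sketch. This is not the authors' implementation: both stages in the paper are learned generative models conditioned on the image and the text prompt, whereas here they are replaced by hypothetical stand-ins (`generate_mask_trajectory`, `synthesize_video`) that use a fixed rightward shift and a per-object color lookup. The text prompt is omitted entirely. The only point of the sketch is the data flow: stage 1 produces per-frame object masks, and stage 2 renders frames conditioned on those masks.

```python
from typing import Dict, List

# A grid of integers. For masks: object id per pixel, 0 = background.
# For images and video frames: a grayscale intensity per pixel.
Grid = List[List[int]]


def generate_mask_trajectory(first_mask: Grid, num_frames: int, step: int = 1) -> List[Grid]:
    """Stage 1 stand-in (assumption): shift every object right by `step` pixels per frame.

    In the paper this stage is a learned model; the fixed shift is only a placeholder.
    """
    width = len(first_mask[0])
    trajectory = []
    for t in range(num_frames):
        shift = t * step
        frame = [
            [row[x - shift] if x - shift >= 0 else 0 for x in range(width)]
            for row in first_mask
        ]
        trajectory.append(frame)
    return trajectory


def synthesize_video(image: Grid, first_mask: Grid, trajectory: List[Grid]) -> List[Grid]:
    """Stage 2 stand-in (assumption): paint each object's reference intensity where its mask is."""
    # Record one intensity per object id from the reference image.
    appearance: Dict[int, int] = {}
    for mask_row, image_row in zip(first_mask, image):
        for obj_id, pixel in zip(mask_row, image_row):
            if obj_id != 0:
                appearance.setdefault(obj_id, pixel)
    # Render every frame from its mask; background pixels become 0.
    return [
        [[appearance.get(obj_id, 0) for obj_id in row] for row in frame]
        for frame in trajectory
    ]


def image_to_video(image: Grid, first_mask: Grid, num_frames: int) -> List[Grid]:
    """Compose the two stages: masks first, then frames conditioned on the masks."""
    trajectory = generate_mask_trajectory(first_mask, num_frames)
    return synthesize_video(image, first_mask, trajectory)


image = [[9, 0, 0], [0, 7, 0]]
first_mask = [[1, 0, 0], [0, 2, 0]]
video = image_to_video(image, first_mask, num_frames=2)
```

Because the intermediate trajectory keeps one id per object, each object's motion is tracked separately from its appearance, which is the property the summary attributes to the mask-based representation in multi-object scenes.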

Findings

01. Achieves state-of-the-art results in temporal coherence and motion realism.

02. Introduces a new benchmark for multi-object image-to-video generation.

03. Demonstrates superior performance on challenging benchmarks.

Abstract

We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the…

Code & Models

Datasets

GuyYariv/SA-V-128-Benchmark
dataset · 34 downloads

Taxonomy

Topics: Augmented Reality Applications

Methods: Softmax · Attention Is All You Need