Make It Move: Controllable Image-to-Video Generation with Text Descriptions
Yaosi Hu, Chong Luo, Zhenzhong Chen

TL;DR
This paper introduces a new task called Text-Image-to-Video generation (TI2V) that creates controllable videos from a static image and text, addressing challenges in aligning appearance and motion while handling uncertainty.
Contribution
It proposes the MAGE model with a novel motion anchor structure and a recursive transformer-based approach for controllable, diverse video generation from images and text.
Findings
MAGE effectively aligns appearance and motion for video generation.
TI2V demonstrates controllability and diversity in generated videos.
Experiments on new datasets show promising results for the approach.
Abstract
Generating controllable videos conforming to user intentions is an appealing yet challenging topic in computer vision. To enable maneuverable control in line with user intentions, a novel video generation task, named Text-Image-to-Video generation (TI2V), is proposed. With both controllable appearance and motion, TI2V aims at generating videos from a static image and a text description. The key challenges of the TI2V task lie both in aligning appearance and motion from different modalities and in handling uncertainty in text descriptions. To address these challenges, we propose a Motion Anchor-based video GEnerator (MAGE) with an innovative motion anchor (MA) structure to store an appearance-motion aligned representation. To model the uncertainty and increase the diversity, it further allows the injection of explicit conditions and implicit randomness. Through three-dimensional axial…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Human Motion and Animation
