TL;DR
This paper introduces CRAFT, a model that generates scene videos from natural language captions by predicting a layout for the mentioned entities, retrieving matching spatio-temporal segments from a video database, and fusing them into a single video.
Contribution
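The paper presents CRAFT, a model that explicitly predicts a temporal layout of the entities mentioned in a caption, retrieves spatio-temporal entity segments from a video database, and fuses them into a scene video. Its components are trained sequentially while layout and appearance are modeled jointly, and its losses encourage compositional representations for retrieval.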
Findings
CRAFT outperforms direct pixel generation approaches in semantic fidelity and visual quality.
It generalizes well to unseen captions and video databases.
It is demonstrated on the new FLINTSTONES dataset, which contains over 25,000 videos.
Abstract
Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. Towards this goal, we present the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning this knowledge from video-caption data and applying it while generating videos from novel captions. CRAFT explicitly predicts a temporal-layout of mentioned entities (characters and objects), retrieves spatio-temporal entity segments from a video database and fuses them to generate scene videos. Our contributions include sequential training of components of CRAFT while jointly modeling layout and appearances, and losses that encourage learning compositional representations for retrieval. We evaluate CRAFT on semantic fidelity to caption, composition consistency, and visual quality. CRAFT outperforms direct…
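The three stages named in the abstract (layout prediction, retrieval, fusion) can be illustrated with a toy sketch. This is not the authors' implementation: the learned networks are replaced with hand-written placeholders, and every name here (`predict_layout`, `retrieve`, `fuse`, `generate`, the `patch` and `embedding` fields) is hypothetical. It assumes entity embeddings are given, that retrieval is a cosine-similarity nearest-neighbour lookup, and that fusion is a simple paste onto a blank canvas.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def predict_layout(entities, num_frames, width):
    """Placeholder layout: spread entities evenly along x, static over time.

    CRAFT learns this step; the even spacing here is only an assumption.
    """
    step = width // (len(entities) + 1)
    return {
        name: [(step * (i + 1), 0)] * num_frames
        for i, name in enumerate(entities)
    }


def retrieve(query, database):
    """Return the database entry whose embedding is closest to the query."""
    return max(database, key=lambda entry: cosine(query, entry["embedding"]))


def fuse(layout, segments, num_frames, width, height):
    """Paste each retrieved patch at its predicted (x, y) in every frame."""
    video = [[[0] * width for _ in range(height)] for _ in range(num_frames)]
    for name, positions in layout.items():
        patch = segments[name]
        for t, (x0, y0) in enumerate(positions):
            for dy, row in enumerate(patch):
                for dx, value in enumerate(row):
                    y, x = y0 + dy, x0 + dx
                    if 0 <= y < height and 0 <= x < width:
                        video[t][y][x] = value
    return video


def generate(entities, embeddings, database, num_frames=2, width=8, height=2):
    """Compose, retrieve, then fuse: the pipeline order in the abstract."""
    layout = predict_layout(entities, num_frames, width)
    segments = {e: retrieve(embeddings[e], database)["patch"] for e in entities}
    return fuse(layout, segments, num_frames, width, height)


# Hypothetical two-entry database; each "segment" is a 1x1 patch.
database = [
    {"name": "fred", "embedding": [1.0, 0.0], "patch": [[1]]},
    {"name": "wilma", "embedding": [0.0, 1.0], "patch": [[2]]},
]
embeddings = {"fred": [0.9, 0.1], "wilma": [0.2, 0.8]}
video = generate(["fred", "wilma"], embeddings, database)
```

In this toy run, each entity's query embedding selects the nearest database patch, and the patch is written at the same position in every frame. A learned model would instead predict positions that vary over time and would blend the segments with a retrieved background.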
