Learning Semantic-Aware Dynamics for Video Prediction
Xinzhu Bei, Yanchao Yang, Stefano Soatto

TL;DR
This paper introduces a video prediction model that decomposes a scene into its layout and motion, explicitly modeling scene semantics, dis-occlusions, and object-specific motion to generate more accurate future frames.
Contribution
It presents a new architecture that separately predicts scene layouts and motions, then fuses them with content-aware inpainting for improved video prediction accuracy.
Findings
Outperforms existing methods on benchmark datasets.
Effectively models object-specific motion and dis-occlusions.
Generates more coherent and semantically consistent future frames.
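One way to picture the object-specific motion listed above is to split a flow field by semantic class, so that each class gets its own motion layer. The snippet below is a minimal NumPy sketch of that idea only; the function name `decompose_into_layers`, the array shapes, and the toy inputs are assumptions made for illustration and are not taken from the paper's implementation.

```python
import numpy as np

def decompose_into_layers(semantic_map, flow, num_classes):
    """Split a label map and a flow field into per-class layers.

    semantic_map: (H, W) integer class ids in [0, num_classes)
    flow:         (H, W, 2) per-pixel displacement
    Returns binary masks of shape (C, H, W) and flow layers of shape (C, H, W, 2).
    """
    # One binary mask per class (hypothetical layer representation).
    masks = np.stack([semantic_map == c for c in range(num_classes)])
    masks = masks.astype(flow.dtype)
    # Each flow layer keeps the motion of one class and is zero elsewhere.
    flow_layers = masks[..., None] * flow[None]
    return masks, flow_layers

# Toy example: a 2x3 scene with two classes and uniform motion.
semantic_map = np.array([[0, 0, 1],
                         [0, 1, 1]])
flow = np.ones((2, 3, 2), dtype=np.float32)
masks, flow_layers = decompose_into_layers(semantic_map, flow, num_classes=2)
```

Because every pixel belongs to exactly one class here, summing the flow layers over the class axis recovers the original flow field.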
Abstract
We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions and capturing the evolution of semantically consistent regions in the video. The scene layout (semantic map) and motion (optical flow) are decomposed into layers, which are predicted and fused with their context to generate future layouts and motions. The appearance of the scene is warped from past frames using the predicted motion in co-visible regions; dis-occluded regions are synthesized with content-aware inpainting utilizing the predicted scene layout. The result is a predictive model that explicitly represents objects and learns their class-specific motion, which we evaluate on video prediction benchmarks.
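The warp-then-inpaint step described in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' method: the warp is nearest-neighbor, the co-visibility mask is approximated by checking whether the source pixel lies inside the image (the paper predicts dis-occlusions with a learned model), and the inpainted content is a constant placeholder. The helper names `backward_warp` and `compose` are hypothetical.

```python
import numpy as np

def backward_warp(prev_frame, flow):
    """Nearest-neighbor backward warp of prev_frame (H, W, C).

    flow[y, x] = (dx, dy) points from each target pixel to its source
    location in prev_frame. Returns the warped frame and a boolean mask
    that is True where the source location falls inside the image.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    # Clip indices so the lookup is safe; invalid pixels are masked out later.
    warped = prev_frame[np.clip(src_y, 0, h - 1), np.clip(src_x, 0, w - 1)]
    return warped, valid

def compose(warped, inpainted, covisible):
    """Use warped pixels where co-visible and inpainted pixels elsewhere."""
    m = covisible[..., None].astype(warped.dtype)
    return m * warped + (1 - m) * inpainted

# Toy example: a 1x4 single-channel frame whose content shifts right by one pixel.
prev_frame = np.array([[[10.0], [20.0], [30.0], [40.0]]])
flow = np.zeros((1, 4, 2))
flow[..., 0] = -1.0                      # each target pixel reads from x - 1
inpainted = np.full_like(prev_frame, -1.0)  # placeholder for synthesized content

warped, covisible = backward_warp(prev_frame, flow)
next_frame = compose(warped, inpainted, covisible)
print(next_frame[0, :, 0])               # [-1. 10. 20. 30.]
```

In the toy example the leftmost pixel has no source in the previous frame, so it is filled from the placeholder; in the paper's setting that region would instead be synthesized by content-aware inpainting guided by the predicted scene layout.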
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Advanced Vision and Imaging
Methods: Inpainting
