Video Extrapolation in Space and Time
Yunzhi Zhang, Jiajun Wu

TL;DR
This paper introduces Video Extrapolation in Space and Time (VEST), a unified approach that leverages both novel view synthesis and video prediction to improve scene understanding in spatial-temporal contexts.
Contribution
The paper proposes a model that combines the novel view synthesis (NVS) and video prediction (VP) tasks using self-supervision, outperforming or matching state-of-the-art methods on real-world datasets.
Findings
Achieves performance better than or comparable to several state-of-the-art methods.
Effectively leverages complementary cues from spatial and temporal observations.
Demonstrates versatility on indoor and outdoor datasets.
Abstract
Novel view synthesis (NVS) and video prediction (VP) are typically considered disjoint tasks in computer vision. However, they can both be seen as ways to observe the spatial-temporal world: NVS aims to synthesize a scene from a new point of view, while VP aims to see a scene from a new point of time. These two tasks provide complementary signals to obtain a scene representation, as viewpoint changes from spatial observations inform depth, and temporal observations inform the motion of cameras and individual objects. Inspired by these observations, we propose to study the problem of Video Extrapolation in Space and Time (VEST). We propose a model that leverages the self-supervision and the complementary cues from both tasks, while existing methods can only solve one of them. Experiments show that our method achieves performance better than or comparable to several state-of-the-art NVS…
Taxonomy
Topics: Advanced Vision and Imaging · Image Enhancement Techniques · Advanced Image Processing Techniques
