Local Frequency Domain Transformer Networks for Video Prediction
Hafez Farazi, Jan Nogga, Sven Behnke

TL;DR
This paper introduces a fully differentiable building block for video prediction that disentangles three tasks: extracting transformations, projecting them into the future, and transforming the current frame. Keeping these tasks separate makes the model interpretable, and the method extends to motion segmentation.
Contribution
It proposes an interpretable, differentiable module for video prediction that performs transformation extraction, projection into the future, and frame transformation as separate steps, and that can be extended to scene understanding and motion segmentation.
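To make the separation of tasks concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single global, circular translation between frames and relies on the Fourier shift theorem, whereas the paper's module is local and differentiable. The function names (`extract_phase_shift`, `project_shift`, `transform_frame`) and the constant-velocity assumption are hypothetical choices made for illustration only.

```python
import numpy as np

def extract_phase_shift(prev, curr, eps=1e-8):
    """Task 1 (extract): per-frequency phase change from prev to curr."""
    cross = np.fft.fft2(curr) * np.conj(np.fft.fft2(prev))
    # Normalize to (near) unit magnitude so only the phase difference remains.
    return cross / (np.abs(cross) + eps)

def project_shift(phase_shift, steps=1):
    """Task 2 (project): extrapolate the motion, assuming constant velocity."""
    return phase_shift ** steps

def transform_frame(frame, phase_shift):
    """Task 3 (transform): apply the phase shift to a frame in the frequency domain."""
    return np.real(np.fft.ifft2(np.fft.fft2(frame) * phase_shift))

# Toy data: a bright box that moves by (1, 2) pixels per frame.
frame0 = np.zeros((32, 32))
frame0[8:14, 10:16] = 1.0
frame1 = np.roll(frame0, shift=(1, 2), axis=(0, 1))

shift = extract_phase_shift(frame0, frame1)
predicted_frame2 = transform_frame(frame1, project_shift(shift, steps=1))
expected_frame2 = np.roll(frame1, shift=(1, 2), axis=(0, 1))
max_error = float(np.max(np.abs(predicted_frame2 - expected_frame2)))
```

Because each step is a separate function, the intermediate quantity (`shift`) can be inspected on its own, which is the kind of interpretability the separation is meant to provide. A global phase shift cannot represent several objects moving differently, which is one reason a local formulation is needed for real scenes.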
Findings
Effective on synthetic and real data
Enables motion segmentation and scene composition understanding
Produces reliable, interpretable predictions
Abstract
Video prediction is commonly referred to as forecasting future frames of a video sequence provided several past frames thereof. It remains a challenging domain as visual scenes evolve according to complex underlying dynamics, such as the camera's egocentric motion or the distinct motility per individual object viewed. These are mostly hidden from the observer and manifest as often highly non-linear transformations between consecutive video frames. Therefore, video prediction is of interest not only in anticipating visual changes in the real world but has, above all, emerged as an unsupervised learning rule targeting the formation and dynamics of the observed environment. Many of the deep learning-based state-of-the-art models for video prediction utilize some form of recurrent layers like Long Short-Term Memory (LSTMs) or Gated Recurrent Units (GRUs) at the core of their models.…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Advanced Vision and Imaging
