Pair-wise Layer Attention with Spatial Masking for Video Prediction
Ping Li, Chenhan Zhang, Zheng Yang, Xianghua Xu, Mingli Song

TL;DR
This paper introduces the PLA-SM framework for video prediction, combining layer-wise semantic dependency enhancement and spatial feature masking to improve the quality of predicted frames by capturing detailed textures and spatiotemporal dynamics.
Contribution
The paper proposes a novel Pair-wise Layer Attention with Spatial Masking framework that enhances feature dependencies and utilizes spatial features more effectively for improved video prediction.
Findings
Outperforms existing methods on five benchmarks.
Enriches texture details in predicted frames.
Effectively captures spatiotemporal dynamics.
Abstract
Video prediction generates future frames from historical frames and has shown great potential in many applications, e.g., meteorological prediction and autonomous driving. Previous works often decode only the final high-level semantic features into future frames, losing texture details and degrading prediction quality. Motivated by this, we develop a Pair-wise Layer Attention (PLA) module that enhances the layer-wise semantic dependency of the feature maps derived from the U-shape structure in the Translator by coupling low-level visual cues with high-level features. This enriches the texture details of the predicted frames. Moreover, most existing methods capture the spatiotemporal dynamics with the Translator but fail to sufficiently utilize the spatial features of the Encoder. This inspires us to design a Spatial Masking (SM) module to mask partial encoding features during…
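To make the idea of masking part of the encoder's spatial features concrete, here is a minimal NumPy sketch. It is not the authors' SM module: the abstract above is truncated before it describes how or when masking is applied, so the function name `mask_spatial_features`, the `mask_ratio` parameter, the (C, H, W) layout, and the choice of uniform random masking shared across channels are all assumptions made for illustration only.

```python
import numpy as np


def mask_spatial_features(features, mask_ratio=0.5, rng=None):
    """Zero out a random subset of spatial positions in a (C, H, W) feature map.

    Hypothetical illustration only: the masking strategy, `mask_ratio`, and the
    tensor layout are assumptions, not details taken from the paper.
    Returns the masked features and the boolean keep-mask of shape (H, W).
    """
    if rng is None:
        rng = np.random.default_rng()
    channels, height, width = features.shape
    num_positions = height * width
    num_masked = int(round(mask_ratio * num_positions))

    # Pick distinct spatial positions to hide (sampling without replacement).
    masked_idx = rng.choice(num_positions, size=num_masked, replace=False)

    keep = np.ones(num_positions, dtype=bool)
    keep[masked_idx] = False
    keep = keep.reshape(height, width)

    # Broadcast the (H, W) mask over the channel axis so every channel
    # loses the same spatial positions.
    return features * keep[None, :, :], keep


if __name__ == "__main__":
    feats = np.ones((4, 8, 8), dtype=np.float32)
    masked, keep = mask_spatial_features(feats, mask_ratio=0.25,
                                         rng=np.random.default_rng(0))
    print(masked.shape, int(keep.sum()))
```

In this sketch the masked positions are simply set to zero. A real model might instead use a learned mask token or apply masking only during training, and the source does not say which.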
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Human Pose and Action Recognition
