VPTR: Efficient Transformers for Video Prediction
Xi Ye, Guillaume-Alexandre Bilodeau

TL;DR
This paper introduces efficient Transformer architectures for video prediction in both autoregressive and non-autoregressive variants. The models use local spatial-temporal attention to reduce attention cost, and a contrastive feature loss to keep predicted future frames from becoming near-duplicates of one another.
Contribution
It presents a novel Transformer block with local spatial-temporal attention and compares autoregressive and non-autoregressive models for video prediction.
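To make the efficiency argument concrete, the sketch below counts query-key pairs for full attention over all video tokens versus a separated scheme that attends within local spatial windows per frame and then across time at each spatial location. The function name, the non-overlapping window assumption, and the example sizes are illustrative assumptions, not values or code from the paper.

```python
def attention_pairs(T: int, H: int, W: int, window: int) -> dict:
    """Count query-key pairs for full vs. separated local attention.

    Assumptions (illustrative only): T frames of H x W feature tokens,
    spatial attention restricted to non-overlapping window x window
    patches, temporal attention across all T frames per location.
    """
    tokens = T * H * W
    full = tokens * tokens                    # every token attends to every token
    spatial = tokens * window * window        # each token attends within its window
    temporal = H * W * T * T                  # each location attends across frames
    return {"full": full, "separated": spatial + temporal}


costs = attention_pairs(T=10, H=16, W=16, window=4)
ratio = costs["full"] / costs["separated"]
print(costs, f"full/separated = {ratio:.1f}x")
```

Under these assumed sizes the separated scheme needs far fewer pairwise comparisons, which is the general motivation for splitting spatial and temporal attention.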
Findings
Competitive performance with state-of-the-art models
Non-autoregressive model increases inference speed
Contrastive loss improves prediction quality
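The speed difference between the two model types comes from how many forward passes a rollout needs. The sketch below is a toy illustration, not the paper's implementation: `toy_step`, `toy_predict_all`, and `CallCounter` are hypothetical stand-ins for real networks, and frames are plain lists of floats.

```python
from typing import Callable, List

Frame = List[float]  # toy stand-in for a frame feature vector


class CallCounter:
    """Wraps a function and counts how often it is called."""

    def __init__(self, fn: Callable):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def autoregressive_rollout(step, past: List[Frame], horizon: int) -> List[Frame]:
    """Predict one frame at a time, feeding each prediction back as context."""
    context = list(past)
    preds = []
    for _ in range(horizon):
        nxt = step(context)   # one forward pass per future frame
        preds.append(nxt)
        context.append(nxt)   # any error in nxt is carried into later steps
    return preds


def non_autoregressive_rollout(predict_all, past: List[Frame], horizon: int) -> List[Frame]:
    """Predict all future frames in a single forward pass."""
    return predict_all(list(past), horizon)


# Hypothetical toy "models": each future frame adds 1.0 to the previous one.
def toy_step(context: List[Frame]) -> Frame:
    return [x + 1.0 for x in context[-1]]


def toy_predict_all(context: List[Frame], horizon: int) -> List[Frame]:
    return [[x + k for x in context[-1]] for k in range(1, horizon + 1)]


past = [[0.0, 0.0], [1.0, 1.0]]
ar_model = CallCounter(toy_step)
nar_model = CallCounter(toy_predict_all)
ar_preds = autoregressive_rollout(ar_model, past, horizon=3)
nar_preds = non_autoregressive_rollout(nar_model, past, horizon=3)
print(ar_model.calls, nar_model.calls)
```

The autoregressive loop calls the model once per predicted frame and conditions on its own outputs, while the non-autoregressive call produces the whole horizon at once.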
Abstract
In this paper, we propose a new Transformer block for video future frames prediction based on an efficient local spatial-temporal separation attention mechanism. Based on this new Transformer block, a fully autoregressive video future frames prediction Transformer is proposed. In addition, a non-autoregressive video prediction Transformer is also proposed to increase the inference speed and reduce the accumulated inference errors of its autoregressive counterpart. In order to avoid the prediction of very similar future frames, a contrastive feature loss is applied to maximize the mutual information between predicted and ground-truth future frame features. This work is the first that makes a formal comparison of the two types of attention-based video future frames prediction models over different scenarios. The proposed models reach a performance competitive with more complex…
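A contrastive feature loss of the kind described above can be sketched with an InfoNCE-style objective, which is commonly used as a lower bound on mutual information. Whether this matches the paper's exact formulation is an assumption; the function name `info_nce`, the temperature value, and the use of same-index features as positives are illustrative choices.

```python
import math
from typing import List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(pred_feats: List[Vector], true_feats: List[Vector],
             temperature: float = 0.1) -> float:
    """InfoNCE-style loss: predicted feature i should match ground-truth
    feature i (positive) and differ from all other ground-truth features
    (negatives). Assumed formulation, not taken from the paper."""
    preds = [_normalize(v) for v in pred_feats]
    trues = [_normalize(v) for v in true_feats]
    total = 0.0
    for i, p in enumerate(preds):
        # Cosine similarity to every ground-truth feature, scaled by temperature.
        logits = [sum(a * b for a, b in zip(p, t)) / temperature for t in trues]
        # Numerically stable log-sum-exp over positives and negatives.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]   # cross-entropy with target index i
    return total / len(preds)


true_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], true_feats)
swapped = info_nce([[0.0, 1.0], [1.0, 0.0]], true_feats)
print(aligned, swapped)
```

Because the features of other frames act as negatives, predictions that collapse toward a single repeated frame are penalized, which is consistent with the stated goal of avoiding very similar future frames.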
Taxonomy
Topics: Image and Video Quality Assessment · Advanced Data Compression Techniques · Video Coding and Compression Technologies
Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Residual Connection · Softmax · Absolute Position Encodings · Layer Normalization · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer
