VPTR: Efficient Transformers for Video Prediction

Xi Ye; Guillaume-Alexandre Bilodeau

arXiv:2203.15836 · cs.CV · March 31, 2022

TL;DR

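This paper introduces efficient Transformer architectures for video prediction in both autoregressive and non-autoregressive forms. Local spatial-temporal attention keeps the attention cost low, the non-autoregressive variant speeds up inference and reduces accumulated error, and a contrastive feature loss discourages near-identical predicted frames.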

Contribution

The paper presents a new Transformer block built on local spatial-temporal separation attention, and formally compares autoregressive and non-autoregressive attention-based models for video prediction across different scenarios.
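To make the efficiency argument concrete, the sketch below counts query-key pairs for one attention layer under two layouts: joint attention over every space-time token, and a separated layout with windowed spatial attention followed by per-location temporal attention. The grid size, the non-overlapping window partition, and the function name are illustrative assumptions, not the configuration used in the paper.

```python
def attention_pair_counts(T, H, W, window):
    """Count query-key pairs per layer for a T x H x W grid of feature tokens.

    Assumption (not from the paper): spatial attention is restricted to
    non-overlapping window x window patches, and temporal attention runs
    independently at each spatial location.
    """
    if H % window or W % window:
        raise ValueError("window must divide both H and W")
    tokens = T * H * W
    # Joint attention: every token attends to every other space-time token.
    joint = tokens ** 2
    # Local spatial attention: each token attends to the window*window tokens
    # in its own patch of its own frame.
    local_spatial = tokens * window * window
    # Temporal attention: each of the H*W locations attends across T frames.
    temporal = H * W * T * T
    return {"joint": joint, "separated": local_spatial + temporal}


counts = attention_pair_counts(T=10, H=16, W=16, window=4)
print(counts)  # {'joint': 6553600, 'separated': 66560}
```

For this hypothetical grid the separated layout needs roughly 98 times fewer pairs than joint attention, which is the kind of saving that motivates splitting space and time.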

Findings

01 Competitive performance with state-of-the-art models

02 Non-autoregressive model increases inference speed

03 Contrastive loss improves prediction quality
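The speed difference between the two model types comes from how many forward passes are needed at inference time. The toy sketch below illustrates only that control flow: scalar values stand in for frames, and `toy_step` and `toy_predict_all` are hypothetical linear extrapolators, not the VPTR networks.

```python
def rollout_autoregressive(step, context, n_future):
    """Predict one frame per call, feeding each prediction back as input."""
    frames = list(context)
    calls = 0
    for _ in range(n_future):
        frames.append(step(frames))  # later steps consume earlier predictions
        calls += 1
    return frames[len(context):], calls


def rollout_non_autoregressive(predict_all, context, n_future):
    """Predict all future frames in a single call."""
    return predict_all(list(context), n_future), 1


# Hypothetical stand-ins for a trained model: linear extrapolation on scalars.
def toy_step(frames):
    return frames[-1] + (frames[-1] - frames[-2])


def toy_predict_all(frames, n_future):
    delta = frames[-1] - frames[-2]
    return [frames[-1] + delta * (k + 1) for k in range(n_future)]


ar_frames, ar_calls = rollout_autoregressive(toy_step, [0.0, 1.0], 3)
nar_frames, nar_calls = rollout_non_autoregressive(toy_predict_all, [0.0, 1.0], 3)
print(ar_frames, ar_calls)    # [2.0, 3.0, 4.0] 3
print(nar_frames, nar_calls)  # [2.0, 3.0, 4.0] 1
```

Because the autoregressive loop feeds its own outputs back in, any error in an early frame propagates to every later frame, whereas the single-pass variant conditions only on the observed context.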

Abstract

In this paper, we propose a new Transformer block for video future frames prediction based on an efficient local spatial-temporal separation attention mechanism. Based on this new Transformer block, a fully autoregressive video future frames prediction Transformer is proposed. In addition, a non-autoregressive video prediction Transformer is also proposed to increase the inference speed and reduce the accumulated inference errors of its autoregressive counterpart. In order to avoid the prediction of very similar future frames, a contrastive feature loss is applied to maximize the mutual information between predicted and ground-truth future frame features. This work is the first that makes a formal comparison of the two types of attention-based video future frames prediction models over different scenarios. The proposed models reach a performance competitive with more complex…
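As a rough illustration of a contrastive feature loss, the sketch below uses an InfoNCE-style objective, a common way to maximize a lower bound on mutual information. Whether the paper uses this exact form, the cosine similarity, the temperature value, and the choice of other frames as negatives are all assumptions here; the feature vectors are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(pred, target, temperature=0.1):
    """InfoNCE-style loss: pred[i] should match target[i] and no other target.

    Assumption: the targets of the other frames act as negatives.
    """
    n = len(pred)
    total = 0.0
    for i in range(n):
        logits = [cosine(pred[i], target[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # cross-entropy with the positive at index i
    return total / n


# Made-up 2-D features for three future frames.
targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # each prediction matches its frame
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # every predicted frame is identical

loss_aligned = info_nce(aligned, targets)
loss_collapsed = info_nce(collapsed, targets)
print(loss_aligned < loss_collapsed)  # True
```

The collapsed case, where all predicted frames share one feature vector, is penalized heavily because two of the three predictions sit closer to the wrong target than to their own, which is the behavior a loss of this kind is meant to discourage.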

Code & Models

Repositories

xiye20/vptr (PyTorch, official)

Taxonomy

Topics: Image and Video Quality Assessment · Advanced Data Compression Techniques · Video Coding and Compression Technologies

Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Residual Connection · Softmax · Absolute Position Encodings · Layer Normalization · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer