ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis

Zhouyong Liu; Shun Luo; Wubin Li; Jingben Lu; Yufan Wu; Shilei Sun; Chunguo Li; Luxi Yang

arXiv:2011.10185·cs.CV·June 3, 2021·60 cites

TL;DR

ConvTransformer introduces a novel convolutional Transformer architecture with multi-head self-attention for improved video frame synthesis, outperforming previous methods in quality and parallelization.

Contribution

This paper presents the first ConvTransformer architecture for video frame synthesis, combining convolutional and Transformer models with a new attention layer for better sequence learning.

Findings

01 Superior quality in video future frame extrapolation

02 More parallelizable than convolutional LSTM-based approaches

03 First application of ConvTransformer to video synthesis

Abstract

Deep Convolutional Neural Networks (CNNs) are powerful models that have achieved excellent performance on difficult computer vision tasks. Although CNNs perform well whenever large labeled training sets are available, they work badly on video frame synthesis due to objects deforming and moving, scene lighting changes, and camera motion in video sequences. In this paper, we present a novel and general end-to-end architecture, called convolutional Transformer or ConvTransformer, for video frame sequence learning and video frame synthesis. The core ingredient of ConvTransformer is the proposed attention layer, i.e., the multi-head convolutional self-attention layer, which learns the sequential dependence of a video sequence. ConvTransformer uses an encoder, built upon the multi-head convolutional self-attention layer, to encode the sequential dependence between the input frames, and then a decoder…
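To make the idea of convolutional self-attention over frames more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: queries, keys, and values are produced by 3x3 convolutions of each frame, and the compatibility between two frames is taken to be a per-pixel scaled dot product followed by a softmax over the frame axis. The excerpt above does not specify the paper's compatibility function, kernel sizes, or head count, so all of these, along with the function names `conv2d_same`, `conv_self_attention_head`, and `multi_head`, are hypothetical.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded 2D cross-correlation (the usual deep-learning 'convolution').
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k. Returns (C_out, H, W)."""
    c_out, _, k, _ = w.shape
    p = k // 2
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, H, W))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + H, dx:dx + W]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out

def conv_self_attention_head(frames, wq, wk, wv):
    """One attention head over a frame sequence.
    frames: (T, C, H, W). Returns out (T, D, H, W) and attn (T, T, H, W)."""
    q = np.stack([conv2d_same(f, wq) for f in frames])
    k = np.stack([conv2d_same(f, wk) for f in frames])
    v = np.stack([conv2d_same(f, wv) for f in frames])
    d = q.shape[1]
    # Assumed compatibility: per-pixel scaled dot product between frames i and j.
    scores = np.einsum("idhw,jdhw->ijhw", q, k) / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over source frames j
    out = np.einsum("ijhw,jdhw->idhw", attn, v)  # weighted sum of value maps
    return out, attn

def multi_head(frames, heads):
    """Concatenate the outputs of several heads along the channel axis."""
    return np.concatenate(
        [conv_self_attention_head(frames, *h)[0] for h in heads], axis=1
    )

rng = np.random.default_rng(0)
T, C, H, W, D, n_heads = 4, 3, 8, 8, 5, 2
frames = rng.normal(size=(T, C, H, W))
heads = [
    tuple(rng.normal(scale=0.1, size=(D, C, 3, 3)) for _ in range(3))
    for _ in range(n_heads)
]
out = multi_head(frames, heads)
print(out.shape)  # (4, 10, 8, 8)
```

Every frame attends to every other frame in a single pass with no recurrent state carried from one time step to the next, which is the kind of structure that makes attention-based sequence models easier to parallelize than convolutional LSTMs.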

Taxonomy

Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image Processing Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Multi-Head Attention · Byte Pair Encoding · Residual Connection · Softmax · Adam · Attention Is All You Need · Dropout