Video Super-Resolution Transformer
Jiezhang Cao, Yawei Li, Kai Zhang, Luc Van Gool

TL;DR
This paper introduces a Transformer-based model for video super-resolution (VSR) that combines a spatial-temporal convolutional self-attention layer with optical-flow-based feature alignment, and reports better results than existing VSR methods on benchmark datasets.
Contribution
It is the first to adapt the Transformer architecture specifically to VSR, addressing the data-locality and feature-alignment problems with new attention and feed-forward layers.
Findings
Outperforms existing VSR methods on benchmark datasets
Effectively exploits spatial-temporal locality in video data
Demonstrates the importance of feature alignment for VSR
Abstract
Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem. Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling. Thus, it seems to be straightforward to apply the vision Transformer to solve VSR. However, the typical block design of Transformer with a fully connected self-attention layer and a token-wise feed-forward layer does not fit well for VSR due to the following two reasons. First, the fully connected self-attention layer neglects to exploit the data locality because this layer relies on linear layers to compute attention maps. Second, the token-wise feed-forward layer lacks the feature alignment which is important for VSR since this layer independently processes each of the input token embeddings…
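The second point in the abstract, that a token-wise feed-forward layer processes each token embedding independently, can be checked with a small sketch. This is only an illustration of the standard position-wise feed-forward layer, not the layer proposed in the paper; the function names (`linear`, `token_wise_ffn`, `make_params`, `indices_unchanged_after_edit`), the sizes, and the random weights are all assumptions made for the example.

```python
import random


def linear(x, w, b):
    """Compute x @ w + b for a single token vector x (w has shape d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def relu(v):
    return [max(0.0, a) for a in v]


def token_wise_ffn(tokens, w1, b1, w2, b2):
    """Standard position-wise FFN: the same two-layer MLP is applied to every
    token separately, so no information flows between tokens (or frames)."""
    return [linear(relu(linear(t, w1, b1)), w2, b2) for t in tokens]


def make_params(d_model, d_hidden, seed=0):
    """Hypothetical random weights, used only for this demonstration."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
    b1 = [rng.uniform(-1, 1) for _ in range(d_hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    b2 = [rng.uniform(-1, 1) for _ in range(d_model)]
    return w1, b1, w2, b2


def indices_unchanged_after_edit(tokens, params, edited_index):
    """Perturb one token and report which output positions stay identical."""
    before = token_wise_ffn(tokens, *params)
    perturbed = [list(t) for t in tokens]
    perturbed[edited_index] = [v + 1.0 for v in perturbed[edited_index]]
    after = token_wise_ffn(perturbed, *params)
    return [i for i in range(len(tokens)) if before[i] == after[i]]


if __name__ == "__main__":
    rng = random.Random(1)
    params = make_params(d_model=4, d_hidden=8)
    tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
    # Outputs at every position other than the edited one are unaffected.
    print(indices_unchanged_after_edit(tokens, params, edited_index=3))
```

Because each output depends only on its own input token, changing one token (for example, a patch from a neighbouring frame) cannot influence any other position. That is the sense in which such a layer cannot, on its own, align features across frames.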
Taxonomy
Topics: Advanced Image Processing Techniques · Advanced Vision and Imaging · Image and Signal Denoising Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Vision Transformer · Label Smoothing · Residual Connection
