Rethinking Alignment in Video Super-Resolution Transformers
Shuwei Shi, Jinjin Gu, Liangbin Xie, Xintao Wang, Yujiu Yang, Chao Dong

TL;DR
This paper questions whether video super-resolution (VSR) Transformers need alignment modules. It shows that they perform well with no alignment at all, and that a new patch alignment method achieves state-of-the-art results.
Contribution
The paper shows that replacing traditional alignment modules with patch alignment can improve both the performance and the efficiency of VSR Transformers.
Findings
VSR Transformers can utilize unaligned multi-frame information effectively.
Existing alignment methods may sometimes harm VSR Transformer performance.
Patch alignment achieves state-of-the-art results on benchmarks.
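This summary does not spell out how patch alignment works. The following is a minimal sketch, assuming that it moves each patch of a supporting frame as a whole, by the mean motion estimated inside that patch, and leaves the pixels within the patch untouched. The function name `patch_align`, the flow layout (`flow[y][x] = (dy, dx)`, meaning reference pixel `(y, x)` appears at `(y + dy, x + dx)` in the supporting frame), the rounding, and the border clamping are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Frame = List[List[float]]               # H x W grayscale frame
Flow = List[List[Tuple[float, float]]]  # per-pixel (dy, dx), reference -> supporting


def patch_align(support: Frame, flow: Flow, patch: int) -> Frame:
    """Shift whole patches of `support` onto the reference grid (hypothetical helper).

    For every patch-sized cell of the reference grid, average the flow inside
    the cell, round it to an integer offset, and copy the patch found at that
    offset in the supporting frame without resampling its pixels.
    """
    h, w = len(support), len(support[0])
    aligned = [[0.0] * w for _ in range(h)]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            ph = min(patch, h - top)    # patch height (smaller at the border)
            pw = min(patch, w - left)   # patch width (smaller at the border)
            cells = [(y, x) for y in range(top, top + ph)
                     for x in range(left, left + pw)]
            mean_dy = sum(flow[y][x][0] for y, x in cells) / len(cells)
            mean_dx = sum(flow[y][x][1] for y, x in cells) / len(cells)
            # Assumption: clamp so the source patch stays inside the frame.
            src_top = min(max(top + round(mean_dy), 0), h - ph)
            src_left = min(max(left + round(mean_dx), 0), w - pw)
            for y in range(ph):
                for x in range(pw):
                    aligned[top + y][left + x] = support[src_top + y][src_left + x]
    return aligned
```

Under these assumptions, each patch is copied verbatim, so the relative arrangement of pixels inside a patch is preserved and only the patch position changes. Patches whose source would fall outside the frame are clamped to the border and may therefore stay misaligned.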
Abstract
The alignment of adjacent frames is considered an essential operation in video super-resolution (VSR). Advanced VSR models, including the latest VSR Transformers, are generally equipped with well-designed alignment modules. However, the progress of the self-attention mechanism may violate this common sense. In this paper, we rethink the role of alignment in VSR Transformers and make several counter-intuitive observations. Our experiments show that: (i) VSR Transformers can directly utilize multi-frame information from unaligned videos, and (ii) existing alignment methods are sometimes harmful to VSR Transformers. These observations indicate that we can further improve the performance of VSR Transformers simply by removing the alignment module and adopting a larger attention window. Nevertheless, such designs will dramatically increase the computational burden, and cannot deal with large…
Taxonomy
Topics: Advanced Image Processing Techniques · Advanced Vision and Imaging
