Video Frame Interpolation with Transformer
Liying Lu, Ruizheng Wu, Huaijia Lin, Jiangbo Lu, Jiaya Jia

TL;DR
This paper introduces a Transformer-based framework for video frame interpolation. It models long-range pixel correlations across frames and uses a cross-scale window-based attention mechanism to aggregate multi-scale information. The authors report state-of-the-art results on various benchmarks.
Contribution
The paper proposes a Transformer-based approach to video frame interpolation (VFI) that uses cross-scale window-based attention to model long-range pixel correlations.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Effectively models long-range pixel dependencies.
Utilizes cross-scale attention to enhance multi-scale information aggregation.
Abstract
Video frame interpolation (VFI), which aims to synthesize intermediate frames of a video, has made remarkable progress with the development of deep convolutional networks over the past years. Existing methods built upon convolutional networks generally face challenges in handling large motion due to the locality of convolution operations. To overcome this limitation, we introduce a novel framework, which takes advantage of the Transformer to model long-range pixel correlation among video frames. Further, our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other. This design effectively enlarges the receptive field and aggregates multi-scale information. Extensive quantitative and qualitative experiments demonstrate that our method achieves new state-of-the-art results on various benchmarks.
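As a rough illustration of the cross-scale window idea described in the abstract, the following NumPy sketch lets each fine-scale window attend to its own tokens plus the tokens of a coarse-scale window centered on the same location. It is not the authors' implementation. The function name `cross_scale_window_attention`, the 2x average pooling, the edge padding, the single attention head, the absence of learned query/key/value projections, and the concatenation of fine and coarse keys are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def avg_pool2(x):
    """Downsample an (H, W, C) feature map by 2 using average pooling."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))


def cross_scale_window_attention(x, window=4):
    """Hypothetical single-head cross-scale window attention (illustrative only).

    Each non-overlapping `window` x `window` fine-scale window supplies the
    queries. Keys and values are the same fine tokens concatenated with a
    `window` x `window` patch taken from a 2x downsampled map, centered on the
    fine window, so the coarse patch spans a region twice as wide.
    """
    x = np.asarray(x, dtype=np.float64)
    H, W, C = x.shape
    assert window % 4 == 0, "window must be a multiple of 4 in this sketch"
    assert H % window == 0 and W % window == 0, "H and W must be multiples of window"

    coarse = avg_pool2(x)
    pad = window // 4  # padding (in coarse pixels) that centers the coarse patch
    coarse = np.pad(coarse, ((pad, pad), (pad, pad), (0, 0)), mode="edge")

    out = np.empty_like(x)
    for i in range(H // window):
        for j in range(W // window):
            r, c = i * window, j * window
            fine = x[r:r + window, c:c + window].reshape(-1, C)
            rc, cc = r // 2, c // 2  # top-left corner in the padded coarse map
            wide = coarse[rc:rc + window, cc:cc + window].reshape(-1, C)

            kv = np.concatenate([fine, wide], axis=0)      # (2 * window^2, C)
            attn = softmax(fine @ kv.T / np.sqrt(C))       # (window^2, 2 * window^2)
            out[r:r + window, c:c + window] = (attn @ kv).reshape(window, window, C)
    return out


# Example usage on a random feature map.
features = np.random.default_rng(0).normal(size=(8, 8, 16))
result = cross_scale_window_attention(features, window=4)
print(result.shape)  # (8, 8, 16)
```

In this sketch the coarse patch has the same number of tokens as the fine window but covers a region twice as wide, so each query sees a larger context at a fixed token count per scale. This corresponds to the enlarged receptive field and multi-scale aggregation that the abstract attributes to the design.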
Taxonomy
Topics: Advanced Image Processing Techniques · Advanced Vision and Imaging · Image Processing Techniques and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings
