Cross-Attention Transformer for Video Interpolation
Hannah Halin Kim, Shuzhi Yu, Shuai Yuan, Carlo Tomasi

TL;DR
This paper introduces TAIN, a novel transformer-based neural network for video frame interpolation that leverages cross similarity and image attention modules to improve accuracy without flow estimation.
Contribution
The paper presents a new transformer module, Cross Similarity, and an Image Attention mechanism for efficient, flow-free video interpolation.
Findings
Outperforms existing flow-free methods on the Vimeo90k, UCF101, and SNU-FILM benchmarks
Performs comparably to flow-based methods on the same benchmarks
Remains computationally efficient in terms of inference time
Abstract
We propose TAIN (Transformers and Attention for video INterpolation), a residual neural network for video interpolation, which aims to interpolate an intermediate frame given two consecutive image frames around it. We first present a novel vision transformer module, named Cross Similarity (CS), to globally aggregate input image features with similar appearance as those of the predicted interpolated frame. These CS features are then used to refine the interpolated prediction. To account for occlusions in the CS features, we propose an Image Attention (IA) module to allow the network to focus on CS features from one frame over those of the other. TAIN outperforms existing methods that do not require flow estimation and performs comparably to flow-based methods while being computationally efficient in terms of inference time on Vimeo90k, UCF101, and SNU-FILM benchmarks.
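The two modules described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes that Cross Similarity behaves like scaled dot-product cross-attention with queries taken from the predicted frame's features and keys/values taken from one input frame's features, and that Image Attention reduces to a per-location softmax over the two input frames. The function names, the flattened `(locations, channels)` layout, and the externally supplied `logits` are hypothetical; a real network would learn its projections and its attention scores.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_similarity(pred_feat, frame_feat):
    """Aggregate input-frame features that look like the predicted frame.

    pred_feat:  (N, C) features of the predicted intermediate frame (queries).
    frame_feat: (M, C) features of one input frame (keys and values).
    Returns an (N, C) array; each row is a convex combination of frame_feat rows.
    Assumption: plain scaled dot-product similarity, no learned projections.
    """
    scale = np.sqrt(pred_feat.shape[-1])
    similarity = pred_feat @ frame_feat.T / scale   # (N, M)
    weights = softmax(similarity, axis=-1)          # global over all M locations
    return weights @ frame_feat


def image_attention_blend(cs0, cs1, logits):
    """Blend the CS features of the two input frames per location.

    logits: (N, 2) scores, one per input frame. Assumption: in a trained
    network these would be predicted; here they are passed in directly.
    A location occluded in frame 1 would get a high score for frame 0.
    """
    w = softmax(logits, axis=-1)
    return w[:, :1] * cs0 + w[:, 1:] * cs1


# Toy usage: 6 spatial locations, 4 channels.
rng = np.random.default_rng(0)
pred = rng.normal(size=(6, 4))   # features of the initial interpolated guess
f0 = rng.normal(size=(6, 4))     # features of the first input frame
f1 = rng.normal(size=(6, 4))     # features of the second input frame

cs0 = cross_similarity(pred, f0)
cs1 = cross_similarity(pred, f1)
logits = np.zeros((6, 2))        # equal trust in both frames
refined = pred + image_attention_blend(cs0, cs1, logits)  # residual refinement
```

With zero logits the blend is a plain average of the two frames' CS features; pushing one logit far above the other makes the output follow a single frame, which is the behavior the abstract motivates for occluded regions.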
Taxonomy
Topics: Advanced Image Processing Techniques · Advanced Vision and Imaging · Video Coding and Compression Technologies
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Linear Layer · Dense Connections · Residual Connection · Layer Normalization · Vision Transformer
