Stand-Alone Inter-Frame Attention in Video Models
Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, and Tao Mei

TL;DR
This paper introduces SIFA, a novel inter-frame attention mechanism that explicitly models local deformations between video frames, improving video understanding models by better capturing motion dynamics.
Contribution
The paper proposes SIFA, a stand-alone inter-frame attention block that explicitly estimates local deformation for improved temporal feature aggregation in video models.
Findings
SIFA-Net and SIFA-Transformer outperform existing models on multiple datasets.
SIFA-Transformer achieves 83.1% accuracy on Kinetics-400.
The method effectively captures local deformations across frames.
Abstract
Motion, as the uniqueness of a video, has been critical to the development of video understanding models. Modern deep learning models leverage motion by either executing spatio-temporal 3D convolutions, factorizing 3D convolutions into spatial and temporal convolutions separately, or computing self-attention along temporal dimension. The implicit assumption behind such successes is that the feature maps across consecutive frames can be nicely aggregated. Nevertheless, the assumption may not always hold especially for the regions with large deformation. In this paper, we present a new recipe of inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), that novelly delves into the deformation across frames to estimate local self-attention on each spatial location. Technically, SIFA remoulds the deformable design via re-scaling the offset predictions by the difference…
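As a rough illustration of the idea of estimating local attention on each spatial location across frames, the NumPy sketch below lets every location in the current frame attend over a fixed k x k neighbourhood in the next frame. It is a simplified sketch and not the authors' implementation: it omits the deformable offset prediction and the offset re-scaling mentioned in the abstract, and the function name `local_inter_frame_attention`, the window size `k`, and the edge padding are assumptions made for this example.

```python
import numpy as np

def local_inter_frame_attention(cur, nxt, k=3):
    """Hypothetical helper: local attention from one frame to the next.

    cur, nxt: feature maps of two consecutive frames, shape (H, W, C).
    Each location in `cur` is the query; the k x k neighbourhood around the
    same location in `nxt` supplies the keys and values.
    Returns the aggregated features (H, W, C) and the weights (H, W, k*k).
    """
    H, W, C = cur.shape
    r = k // 2
    # Assumption: replicate border features so every location has k*k neighbours.
    padded = np.pad(nxt, ((r, r), (r, r), (0, 0)), mode="edge")
    # Stack the k*k shifted views of the next frame: shape (k*k, H, W, C).
    neigh = np.stack([padded[dy:dy + H, dx:dx + W]
                      for dy in range(k) for dx in range(k)])
    # Scaled dot-product between each query and its neighbourhood keys.
    logits = np.einsum("hwc,nhwc->hwn", cur, neigh) / np.sqrt(C)
    # Numerically stable softmax over the k*k neighbours.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of neighbourhood values from the next frame.
    out = np.einsum("hwn,nhwc->hwc", weights, neigh)
    return out, weights
```

In this simplified form the neighbourhood is a fixed grid; a deformable variant would instead sample the next frame at predicted offsets, which is the part left out here.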
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Surveillance and Tracking Methods
Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Softmax · Absolute Position Encodings · Dropout · Adam · Residual Connection · Byte Pair Encoding · Position-Wise Feed-Forward Layer
