Stand-Alone Inter-Frame Attention in Video Models

Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, and Tao Mei

arXiv:2206.06931 · cs.CV · June 15, 2022

TL;DR

This paper introduces SIFA, a novel inter-frame attention mechanism that explicitly models local deformations between video frames, improving video understanding models by better capturing motion dynamics.

Contribution

The paper proposes SIFA, a stand-alone inter-frame attention block that explicitly estimates local deformation for improved temporal feature aggregation in video models.

Findings

01. SIFA-Net and SIFA-Transformer outperform existing models on multiple datasets.

02. SIFA-Transformer achieves 83.1% accuracy on Kinetics-400.

03. The method effectively captures local deformations across frames.

Abstract

Motion, the defining characteristic of video, has been critical to the development of video understanding models. Modern deep learning models leverage motion by executing spatio-temporal 3D convolutions, factorizing 3D convolutions into separate spatial and temporal convolutions, or computing self-attention along the temporal dimension. The implicit assumption behind such successes is that the feature maps across consecutive frames can be nicely aggregated. Nevertheless, this assumption may not always hold, especially for regions with large deformation. In this paper, we present a new inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), that delves into the deformation across frames to estimate local self-attention on each spatial location. Technically, SIFA remoulds the deformable design via re-scaling the offset predictions by the difference…
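
To make the idea of local attention across frames concrete, here is a minimal sketch in plain Python. It is not the authors' SIFA block: it omits the deformable offsets and their re-scaling entirely, and only shows the simpler baseline in which each location of the current frame attends to a fixed neighbourhood in the next frame. The function name `inter_frame_local_attention`, the `window` parameter, and the [H][W][C] nested-list layout are all assumptions made for illustration.

```python
import math


def inter_frame_local_attention(frame_t, frame_next, window=3):
    """Illustrative local attention from frame_t to frame_next.

    frame_t, frame_next: nested lists shaped [H][W][C] (hypothetical layout).
    Each location in frame_t acts as a query; keys and values are the
    features of frame_next inside a window x window neighbourhood.
    Returns a nested list with the same [H][W][C] shape.
    """
    height, width = len(frame_t), len(frame_t[0])
    channels = len(frame_t[0][0])
    radius = window // 2
    scale = 1.0 / math.sqrt(channels)  # standard dot-product scaling

    output = []
    for y in range(height):
        row = []
        for x in range(width):
            query = frame_t[y][x]
            # Gather next-frame features around (y, x); positions that fall
            # outside the frame are simply skipped (an assumption, no padding).
            neighbours = [
                frame_next[ny][nx]
                for ny in range(y - radius, y + radius + 1)
                for nx in range(x - radius, x + radius + 1)
                if 0 <= ny < height and 0 <= nx < width
            ]
            scores = [
                scale * sum(q * k for q, k in zip(query, key))
                for key in neighbours
            ]
            # Numerically stable softmax over the neighbourhood.
            max_score = max(scores)
            exps = [math.exp(s - max_score) for s in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
            # Weighted sum of next-frame features.
            row.append([
                sum(w * value[c] for w, value in zip(weights, neighbours))
                for c in range(channels)
            ])
        output.append(row)
    return output


frame_a = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
frame_b = [[[2.0, 3.0], [2.0, 3.0]], [[2.0, 3.0], [2.0, 3.0]]]
aggregated = inter_frame_local_attention(frame_a, frame_b, window=3)
```

With a fixed window, a region that moves farther than the window radius between frames falls outside the attended neighbourhood, which is the kind of large-deformation case the abstract points to as the motivation for predicting offsets.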

Code & Models

Repositories

fuchenustc/sifa
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Surveillance and Tracking Methods

Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Softmax · Absolute Position Encodings · Dropout · Adam · Residual Connection · Byte Pair Encoding · Position-Wise Feed-Forward Layer