SSAN: Separable Self-Attention Network for Video Representation Learning

Xudong Guo; Xun Guo; Yan Lu

arXiv:2105.13033 · cs.CV · May 28, 2021 · 1 citation

TL;DR

The paper introduces SSAN, a video representation learning model whose separable self-attention module models spatial correlations first and temporal correlations second, improving action recognition on Something-Something and Kinetics-400 and video retrieval on MSR-VTT and Youcook2.

Contribution

It proposes a separable self-attention module that models spatial and temporal dependencies sequentially, enhancing video understanding.

Findings

01. Outperforms state-of-the-art methods on Something-Something and Kinetics-400.

02. Achieves better results with shallower networks and fewer modalities.

03. Improves video retrieval performance on MSR-VTT and Youcook2.

Abstract

Self-attention has been successfully applied to video representation learning due to its effectiveness in modeling long-range dependencies. Existing approaches build these dependencies merely by computing pairwise correlations along the spatial and temporal dimensions simultaneously. However, spatial correlations and temporal correlations represent different contextual information of scenes and temporal reasoning. Intuitively, learning spatial contextual information first will benefit temporal modeling. In this paper, we propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially, so that spatial contexts can be efficiently used in temporal modeling. By adding the SSA module to a 2D CNN, we build an SSA network (SSAN) for video representation learning. On the task of video action recognition, our approach outperforms state-of-the-art methods on…
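
The ordering described in the abstract (spatial correlations first, then temporal) can be sketched as follows. This is a minimal NumPy illustration, not the authors' SSA module: the function names, the use of the features themselves as queries, keys, and values (no learned projections), and the residual additions are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(tokens):
    """Scaled dot-product self-attention over the token axis.

    tokens: array of shape (..., N, C). Queries, keys, and values are the
    tokens themselves (assumption: no learned projections in this sketch).
    """
    c = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(c)  # (..., N, N)
    return softmax(scores, axis=-1) @ tokens                    # (..., N, C)


def separable_self_attention(x):
    """Spatial attention followed by temporal attention.

    x: video features of shape (T, H, W, C).
    Returns an array of the same shape.
    """
    t, h, w, c = x.shape

    # Step 1 (spatial): each frame attends over its own H*W positions.
    spatial = attention(x.reshape(t, h * w, c))          # (T, HW, C)
    x = x + spatial.reshape(t, h, w, c)                  # residual (assumed)

    # Step 2 (temporal): each spatial position attends over the T frames,
    # using features that already carry spatial context from step 1.
    per_position = x.reshape(t, h * w, c).transpose(1, 0, 2)  # (HW, T, C)
    temporal = attention(per_position).transpose(1, 0, 2)     # (T, HW, C)
    return x + temporal.reshape(t, h, w, c)              # residual (assumed)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.normal(size=(8, 7, 7, 16))  # hypothetical T=8, 7x7 map, C=16
    print(separable_self_attention(clip).shape)
```

In this factorized form, a clip with T frames and H×W positions compares T·(HW)² + HW·T² token pairs, rather than the (THW)² pairs of joint spatio-temporal attention.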

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization