SSAN: Separable Self-Attention Network for Video Representation Learning
Xudong Guo, Xun Guo, Yan Lu

TL;DR
The paper introduces SSAN, a video representation learning model built on a separable self-attention module that models spatial correlations first and temporal correlations second. It reports gains on video action recognition (Something-Something, Kinetics-400) and video retrieval (MSR-VTT, Youcook2).
Contribution
The paper proposes a separable self-attention (SSA) module that models spatial and temporal dependencies sequentially rather than simultaneously, so that spatial context is available when temporal correlations are computed.
Findings
Outperforms state-of-the-art methods on Something-Something and Kinetics-400 for video action recognition.
Achieves better results with shallower networks and fewer modalities.
Improves video retrieval performance on MSR-VTT and Youcook2.
Abstract
Self-attention has been successfully applied to video representation learning due to the effectiveness of modeling long range dependencies. Existing approaches build the dependencies merely by computing the pairwise correlations along spatial and temporal dimensions simultaneously. However, spatial correlations and temporal correlations represent different contextual information of scenes and temporal reasoning. Intuitively, learning spatial contextual information first will benefit temporal modeling. In this paper, we propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially, so that spatial contexts can be efficiently used in temporal modeling. By adding SSA module into 2D CNN, we build a SSA network (SSAN) for video representation learning. On the task of video action recognition, our approach outperforms state-of-the-art methods on…
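To make the "spatial first, then temporal" idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the identity query/key/value projections, the residual connections, and the choice to run temporal attention independently at each spatial position are all assumptions made for illustration, and the actual SSA module may differ in each of these respects. A joint spatio-temporal variant is included only for contrast.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Scaled dot-product self-attention over the second-to-last axis.
    # x has shape (..., L, C). Queries, keys, and values are x itself
    # (identity projections: an assumption to keep the sketch short).
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def separable_self_attention(video):
    # video: (T, N, C) = frames, spatial positions per frame, channels.
    # Step 1: spatial attention inside each frame (residual is an assumption).
    spatial = video + attention(video)            # (T, N, C)
    # Step 2: temporal attention across frames, applied per spatial position
    # to features that already carry spatial context (hypothetical design).
    per_position = np.swapaxes(spatial, 0, 1)     # (N, T, C)
    temporal = per_position + attention(per_position)
    return np.swapaxes(temporal, 0, 1)            # (T, N, C)

def joint_self_attention(video):
    # Contrast case: all T*N tokens attend to each other at once.
    t, n, c = video.shape
    tokens = video.reshape(t * n, c)
    return (tokens + attention(tokens)).reshape(t, n, c)

rng = np.random.default_rng(0)
video = rng.standard_normal((4, 6, 8))            # 4 frames, 6 positions, 8 channels
out = separable_self_attention(video)
print(out.shape)
```

In this sketch the joint variant builds one score matrix with (T·N)² entries, while the separable variant builds T matrices of N² entries followed by N matrices of T² entries. Whether the paper's module has the same cost profile depends on design details not given in the summary above.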
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
