TL;DR
This paper introduces the Gate-Shift Module (GSM), a lightweight component that lets a 2D CNN adaptively route features through time for video action recognition. It reaches state-of-the-art results on Something Something-V1 and Diving48 with almost no additional parameters or computation.
Contribution
The paper proposes GSM, an efficient module that introduces spatial gating into the spatial-temporal decomposition of 3D kernels, turning a 2D CNN into a spatio-temporal feature extractor for video recognition.
Findings
Achieves state-of-the-art results on the Something Something-V1 and Diving48 datasets.
Obtains competitive results on EPIC-Kitchens with lower model complexity.
GSM introduces minimal additional parameters and computational overhead.
Abstract
Deep 3D CNNs for video action recognition are designed to learn powerful representations in the joint spatio-temporal feature space. In practice, however, because of the large number of parameters and computations involved, they may under-perform in the absence of sufficiently large datasets for training them at scale. In this paper we introduce spatial gating in spatial-temporal decomposition of 3D kernels. We implement this concept with the Gate-Shift Module (GSM). GSM is lightweight and turns a 2D-CNN into a highly efficient spatio-temporal feature extractor. With GSM plugged in, a 2D-CNN learns to adaptively route features through time and combine them, at almost no additional parameters and computational overhead. We perform an extensive evaluation of the proposed module to study its effectiveness in video action recognition, achieving state-of-the-art results on Something Something-V1 and…
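The idea of gating features and routing them through time can be illustrated with a small NumPy sketch. This is a simplified toy, not the authors' implementation: the function name `gate_shift`, the weight array `w_gate`, the split into two channel groups, and the use of a 1x1 channel-weighted gate (in place of whatever spatial convolution the paper uses) are all assumptions made for illustration. A tanh gate decides, per spatial location, how much of each feature is shifted to a neighbouring frame and how much stays at its own time step.

```python
import numpy as np

def gate_shift(x, w_gate):
    """Toy gate-shift step on features x of shape (T, C, H, W), C even.

    w_gate has shape (2, C // 2): one hypothetical 1x1 gating filter
    per channel group (an assumption, not the paper's gating layer).
    """
    T, C, H, W = x.shape
    half = C // 2
    out = np.empty_like(x)
    for g, step in enumerate((+1, -1)):  # group 0 shifts forward, group 1 backward
        xg = x[:, g * half:(g + 1) * half]  # (T, half, H, W)
        # Spatial gate in (-1, 1): one plane per frame, shared by the group's channels.
        gate = np.tanh(np.einsum("c,tchw->thw", w_gate[g], xg))[:, None]
        gated = gate * xg        # part that is routed through time
        residual = xg - gated    # part that stays at its own time step
        shifted = np.zeros_like(gated)
        if step == 1:
            shifted[1:] = gated[:-1]   # frame t receives gated features from t-1
        else:
            shifted[:-1] = gated[1:]   # frame t receives gated features from t+1
        out[:, g * half:(g + 1) * half] = shifted + residual
    return out
```

In this sketch the only learnable values are the `C` gate weights, and a gate of zero leaves the input untouched, so the block reduces to a plain 2D pathway when no temporal routing is useful.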
Videos
Gate-Shift Networks for Video Action Recognition · YouTube
