Gate-Shift-Fuse for Video Action Recognition
Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

TL;DR
This paper introduces Gate-Shift-Fuse (GSF), a novel module that enhances 2D CNNs for video action recognition by adaptively modeling spatio-temporal features with minimal overhead, achieving state-of-the-art results.
Contribution
The paper proposes GSF, a data-driven, learnable spatio-temporal feature extraction module that can be integrated into 2D CNNs to improve video action recognition performance.
Findings
Achieves state-of-the-art results on five benchmarks.
Increases spatio-temporal modeling capacity with negligible overhead.
Demonstrates compatibility with popular 2D CNN architectures.
Abstract
Convolutional Neural Networks are the de facto models for image recognition. However, 3D CNNs, the straightforward extension of 2D CNNs for video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance of 3D CNNs is their increased computational complexity, which requires large-scale annotated datasets to train them at scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs. Existing kernel factorization approaches follow hand-designed and hard-wired techniques. In this paper, we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner. GSF leverages grouped spatial gating to decompose…
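The gate, shift, and fuse steps named in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the abstract is truncated here, so every detail below is an assumption made for illustration. In particular, the function names temporal_shift and gate_shift_fuse_sketch are hypothetical, the gate is a fixed tanh of the per-group channel mean standing in for a learned spatial gating layer, the two channel groups are shifted by one frame in opposite directions, and the fusion weights come from a sigmoid over globally pooled features rather than learned parameters.

```python
import numpy as np


def temporal_shift(x, step):
    """Shift a (T, C, H, W) tensor along time by `step` frames, zero-filling the gap."""
    out = np.zeros_like(x)
    if step > 0:
        out[step:] = x[:-step]      # frame t receives frame t - step
    elif step < 0:
        out[:step] = x[-step:]      # frame t receives frame t + |step|
    else:
        out[:] = x
    return out


def gate_shift_fuse_sketch(x):
    """Illustrative gate -> shift -> fuse pass over features of shape (T, C, H, W).

    Assumption: all "weights" here are fixed functions of the input, used as
    stand-ins for the learned layers a real module would contain.
    """
    T, C, H, W = x.shape
    assert C % 2 == 0, "the sketch splits channels into two equal groups"
    half = C // 2
    groups = (x[:, :half], x[:, half:])
    steps = (1, -1)                 # one group shifted forward, the other backward

    outputs = []
    for feats, step in zip(groups, steps):
        # Gate: one spatial map per group and frame, values in (-1, 1).
        gate = np.tanh(feats.mean(axis=1, keepdims=True))    # (T, 1, H, W)
        gated = gate * feats        # part of the features routed through time
        residual = feats - gated    # part that stays in its own frame

        # Shift: move only the gated part along the temporal axis.
        shifted = temporal_shift(gated, step)

        # Fuse: per-channel weights in (0, 1) blend shifted and residual features.
        pooled = (shifted + residual).mean(axis=(0, 2, 3))    # (half,)
        w = 1.0 / (1.0 + np.exp(-pooled))
        w = w[None, :, None, None]
        outputs.append(w * shifted + (1.0 - w) * residual)

    return np.concatenate(outputs, axis=1)


if __name__ == "__main__":
    clip = np.random.default_rng(0).standard_normal((4, 6, 5, 5))
    print(gate_shift_fuse_sketch(clip).shape)
```

The sketch preserves the input shape, which is the property that lets a module of this kind be dropped between the layers of an existing 2D CNN. A trainable version would replace the fixed gate and fusion weights with learned layers.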
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Gait Recognition and Analysis
