StNet: Local and Global Spatial-Temporal Modeling for Action Recognition
Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Limin Wang, Shilei Wen

TL;DR
StNet is an action recognition architecture for video that combines local and global spatial-temporal modeling: it stacks frames into super-images for local modeling and applies a new temporal Xception block for global modeling, outperforming several state-of-the-art methods on Kinetics.
Contribution
The paper proposes a new spatial-temporal network architecture that stacks frames into super-images and employs a novel temporal Xception block for improved video action recognition.
Findings
Outperforms several state-of-the-art methods on the Kinetics dataset.
Achieves a good balance between accuracy and model complexity.
Generalizes well to the UCF101 dataset.
Abstract
Despite the success of deep learning for static image understanding, it remains unclear what are the most effective network architectures for the spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos. Particularly, StNet stacks N successive video frames into a super-image which has 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationship. To model global spatial-temporal relationship, we apply temporal convolution on the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet. It employs a separate channel-wise and temporal-wise convolution over the feature sequence of video. Extensive…
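The two ideas in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the list-based data layout, the zero padding, and the order in which the two convolutions are applied are all assumptions made for illustration, and it omits normalization, nonlinearities, and anything else the actual temporal Xception block may contain. The first function concatenates the channels of N successive frames into one super-image with 3N channels; the other two separate a convolution over the feature sequence into a temporal-wise step (one 1D kernel per channel, along time) and a channel-wise step (kernel size 1, mixing channels at each time step).

```python
def stack_super_images(frames, n):
    """Group every n consecutive frames into one super-image.

    frames: list of frames, each a list of 3 channel planes (R, G, B).
    Returns a list of super-images, each a list of 3*n channel planes.
    """
    if len(frames) % n != 0:
        raise ValueError("number of frames must be a multiple of n")
    super_images = []
    for start in range(0, len(frames), n):
        channels = []
        for frame in frames[start:start + n]:
            channels.extend(frame)  # concatenate along the channel axis
        super_images.append(channels)
    return super_images


def temporal_wise_conv(features, kernels):
    """Per-channel 1D convolution along time (zero padded, no kernel flip).

    features: list of T feature vectors, each of length C.
    kernels: list of C kernels of odd length, one per channel.
    """
    t_len, c_len = len(features), len(features[0])
    out = [[0.0] * c_len for _ in range(t_len)]
    for c in range(c_len):
        kernel = kernels[c]
        half = len(kernel) // 2
        for t in range(t_len):
            acc = 0.0
            for j, w in enumerate(kernel):
                src = t + j - half
                if 0 <= src < t_len:  # positions outside the sequence count as zero
                    acc += w * features[src][c]
            out[t][c] = acc
    return out


def channel_wise_conv(features, weights):
    """Kernel-size-1 convolution that mixes channels at each time step.

    weights: C_out rows, each of length C_in.
    """
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights]
            for vec in features]


# Hypothetical toy data: 4 RGB frames, with strings standing in for image planes.
frames = [[f"f{i}_{ch}" for ch in "RGB"] for i in range(4)]
supers = stack_super_images(frames, n=2)  # 2 super-images, 6 channels each

# Feature sequence with T=3 time steps and C=2 channels.
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
smoothed = temporal_wise_conv(feats, [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]])
mixed = channel_wise_conv(smoothed, [[1.0, 1.0]])
print(len(supers), len(supers[0]), smoothed, mixed)
```

Splitting the operation this way keeps the parameter count low: the temporal step needs one short kernel per channel and the channel step one weight per input/output channel pair, rather than a full kernel for every pair.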
Taxonomy
Topics: Human Pose and Action Recognition · Diabetic Foot Ulcer Assessment and Management · Gait Recognition and Analysis
Methods: 3D Convolution · Convolution
