StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

Dongliang He; Zhichao Zhou; Chuang Gan; Fu Li; Xiao Liu; Yandong Li; Limin Wang; Shilei Wen

arXiv:1811.01549 · cs.CV · December 12, 2018 · 23 citations

TL;DR

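StNet is a video action recognition architecture that models local spatial-temporal relationships by applying 2D convolution to super-images (N stacked frames, 3N channels) and models global relationships with temporal convolution, including a new temporal Xception block. It is reported to outperform several state-of-the-art methods on the Kinetics dataset.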

Contribution

The paper proposes a new spatial-temporal network architecture that stacks frames into super-images and employs a novel temporal Xception block for improved video action recognition.

Findings

01. Outperforms several state-of-the-art methods on the Kinetics dataset.

02. Achieves a good balance between accuracy and model complexity.

03. Generalizes well to the UCF101 dataset.

Abstract

Despite the success of deep learning for static image understanding, it remains unclear what are the most effective network architectures for the spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos. Particularly, StNet stacks N successive video frames into a super-image which has 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationship. To model global spatial-temporal relationship, we apply temporal convolution on the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet. It employs a separate channel-wise and temporal-wise convolution over the feature sequence of video. Extensive…
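The two ideas in the abstract can be illustrated with a small dependency-free sketch. It is not the authors' implementation: the function names, the nested-list data layout, the zero padding, and the order of the two convolution steps are all assumptions made for illustration, and the actual temporal Xception block is not specified in the text above. The first function stacks N frames into a super-image by concatenating their channels (3N channels for RGB). The other two show one way to separate a temporal-wise convolution (each channel filtered along time) from a channel-wise one (a 1x1 mix across channels).

```python
def stack_super_images(frames, n):
    """Group T frames into T // n super-images by concatenating channels.

    frames: list of T frames, each a list of C channel planes (H x W lists).
    Returns a list of super-images, each holding n * C channel planes.
    (Hypothetical helper; the layout is an assumption for illustration.)
    """
    if n <= 0 or len(frames) % n != 0:
        raise ValueError("number of frames must be a positive multiple of n")
    super_images = []
    for start in range(0, len(frames), n):
        channels = []
        for frame in frames[start:start + n]:
            channels.extend(frame)  # append this frame's C planes
        super_images.append(channels)
    return super_images


def temporal_depthwise_conv(seq, kernels):
    """Temporal-wise step: filter each channel along time with its own kernel.

    seq: list of T feature vectors, each of length C.
    kernels: list of C odd-length weight lists, one per channel.
    Uses cross-correlation (no kernel flip) and zero padding, so the
    output keeps length T.
    """
    t_len, c_len = len(seq), len(seq[0])
    out = [[0.0] * c_len for _ in range(t_len)]
    for c in range(c_len):
        kernel = kernels[c]
        half = len(kernel) // 2
        for t in range(t_len):
            acc = 0.0
            for j, w in enumerate(kernel):
                src = t + j - half
                if 0 <= src < t_len:  # positions outside the sequence count as zero
                    acc += w * seq[src][c]
            out[t][c] = acc
    return out


def channel_pointwise_conv(seq, weights):
    """Channel-wise step: mix channels at each time step (a 1x1 convolution).

    weights: C_out x C_in matrix given as nested lists.
    """
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights]
            for vec in seq]


# Usage: 6 RGB frames of size 2x2 with N = 3 give 2 super-images of 9 channels.
frames = [[[[float(t)] * 2 for _ in range(2)] for _ in range(3)]
          for t in range(6)]
supers = stack_super_images(frames, 3)
print(len(supers), len(supers[0]))
```

In a deep learning framework, the super-image step would typically be a reshape followed by an ordinary 2D convolution whose input channel count is 3N, and the temporal-wise step would be a grouped 1D convolution. The plain-Python loops here only make the indexing explicit.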

Taxonomy

Topics: Human Pose and Action Recognition · Diabetic Foot Ulcer Assessment and Management · Gait Recognition and Analysis

Methods: 3D Convolution · Convolution