STM: SpatioTemporal and Motion Encoding for Action Recognition

Boyuan Jiang; Mengmeng Wang; Weihao Gan; Wei Wu; Junjie Yan

arXiv:1908.02486·cs.CV·August 19, 2019·59 cites

TL;DR

This paper introduces the STM network, a unified 2D framework that efficiently encodes spatiotemporal and motion features for video action recognition, outperforming state-of-the-art methods with minimal additional computation.

Contribution

The authors propose the STM block with CSTM and CMM modules, replacing residual blocks in ResNet to effectively combine spatiotemporal and motion features in a simple, efficient network.
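As a rough illustration of how two such branches could be combined, the toy sketch below works on features shaped `[T][C]` (frames by channels, spatial dimensions omitted). It rests on assumptions that this summary does not state: the CSTM-like branch is stood in for by a per-channel 1D convolution over time, the CMM-like branch by differences between adjacent frames' features, and the two are summed with a residual connection. The function names are hypothetical, and this is not the authors' implementation.

```python
# Toy sketch only: features are nested lists shaped [T][C] (one scalar per
# channel per frame). Spatial dimensions and learned 2D convolutions are omitted.


def channelwise_temporal_conv(feats, kernels):
    """Per-channel 1D convolution over time (kernel size 3, zero padding).

    Assumption: stands in for the temporal part of a CSTM-like module.
    kernels[c] holds the three taps applied to channel c.
    """
    T, C = len(feats), len(feats[0])
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            for k, w in enumerate(kernels[c]):
                s = t + k - 1  # taps cover frames t-1, t, t+1
                if 0 <= s < T:
                    out[t][c] += w * feats[s][c]
    return out


def adjacent_frame_motion(feats):
    """Feature difference between frame t+1 and frame t; zeros for the last frame.

    Assumption: stands in for a CMM-like motion cue.
    """
    T, C = len(feats), len(feats[0])
    motion = [
        [feats[t + 1][c] - feats[t][c] for c in range(C)] for t in range(T - 1)
    ]
    motion.append([0.0] * C)  # pad so the output keeps T frames
    return motion


def stm_like_block(feats, kernels):
    """Sum both branches and add the input back, residual-style."""
    st = channelwise_temporal_conv(feats, kernels)
    mo = adjacent_frame_motion(feats)
    return [
        [x + s + m for x, s, m in zip(fx, fs, fm)]
        for fx, fs, fm in zip(feats, st, mo)
    ]
```

Calling `stm_like_block(feats, kernels)` with `feats` as a list of `T` frames and `kernels` as one three-tap list per channel returns a list of the same `[T][C]` shape, which is what lets a block like this stand in for a residual block without changing the surrounding network.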

Findings

01. Outperforms state-of-the-art on multiple datasets

02. Efficiently encodes features with limited extra computation

03. Works well on both temporal- and scene-related datasets

Abstract

Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e.,…

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis

Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection