STM: SpatioTemporal and Motion Encoding for Action Recognition
Boyuan Jiang, Mengmeng Wang, Weihao Gan, Wei Wu, Junjie Yan

TL;DR
This paper introduces the STM network, a unified 2D framework that efficiently encodes spatiotemporal and motion features for video action recognition, outperforming state-of-the-art methods with minimal additional computation.
Contribution
The authors propose the STM block with CSTM and CMM modules, replacing residual blocks in ResNet to effectively combine spatiotemporal and motion features in a simple, efficient network.
Findings
Outperforms state-of-the-art on multiple datasets
Efficiently encodes features with limited extra computation
Works well on both temporal- and scene-related datasets
Abstract
Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e.,…
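The abstract names the two branches of the STM block but does not spell out their internals, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes that the spatiotemporal branch can be approximated by a per-channel (depthwise) 3-tap convolution over time, that the motion branch can be approximated by per-channel differences between adjacent frames, and that the two branches are summed and added to the input as a residual. Spatial dimensions, the spatial convolution, and any 1x1 convolutions are omitted, and all function names are hypothetical.

```python
# Toy sketch (assumptions, not the paper's code): features are a list of
# T frames, each a list of C channel values. Spatial dimensions are omitted.

def channelwise_temporal_conv(frames, kernels):
    """Depthwise 1D convolution over time: channel c uses only kernels[c].

    frames:  T x C nested list
    kernels: C x 3 nested list (one 3-tap kernel per channel)
    Zero padding keeps the output length equal to T.
    """
    T, C = len(frames), len(frames[0])
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            for k, w in enumerate(kernels[c]):
                s = t + k - 1  # taps at t-1, t, t+1
                if 0 <= s < T:
                    out[t][c] += w * frames[s][c]
    return out


def channelwise_motion(frames):
    """Feature-level motion: per-channel difference between adjacent frames.

    The last frame has no successor, so its motion is zero-padded.
    """
    T, C = len(frames), len(frames[0])
    out = [[frames[t + 1][c] - frames[t][c] for c in range(C)]
           for t in range(T - 1)]
    out.append([0.0] * C)
    return out


def stm_like_block(frames, kernels):
    """Sum both branches and add the input back as a residual connection."""
    st = channelwise_temporal_conv(frames, kernels)
    mo = channelwise_motion(frames)
    return [[x + s + m for x, s, m in zip(fx, fs, fm)]
            for fx, fs, fm in zip(frames, st, mo)]


if __name__ == "__main__":
    frames = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]   # T=3 frames, C=2 channels
    kernels = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]    # identity and box filter
    print(stm_like_block(frames, kernels))
```

Because every operation acts on one channel at a time, the added cost grows linearly with the number of channels, which is one plausible reading of why the abstract describes the extra computation as very limited.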
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection
