A Closer Look at Spatiotemporal Convolutions for Action Recognition
Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri

TL;DR
This paper compares several forms of spatiotemporal convolution for video action recognition within residual networks. It shows that 3D CNNs are more accurate than 2D CNNs, and that factorizing 3D convolutions into separate spatial and temporal components, as in the proposed R(2+1)D block, improves accuracy further.
Contribution
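It introduces the R(2+1)D convolutional block, which factorizes each 3D convolution into separate spatial and temporal components. Networks built from this block achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101 and HMDB51.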
Findings
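3D CNNs are more accurate than 2D CNNs within the framework of residual learning.
Factorizing 3D convolutional filters into separate spatial and temporal components yields significant accuracy gains.
R(2+1)D networks achieve results comparable or superior to the state of the art.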
Abstract
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant advantages in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which gives rise to CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101 and HMDB51.
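The factorization described in the abstract can be illustrated with a small parameter-count sketch. It assumes a t x d x d 3D kernel is replaced by a 1 x d x d spatial convolution into M intermediate channels, followed by a t x 1 x 1 temporal convolution, with M chosen so that the factorized block has roughly as many weights as the full 3D convolution. The function names are hypothetical, biases are ignored, and the rule used for M is recalled from the full paper rather than stated in this summary, so treat it as an assumption.

```python
# Sketch: weight counts for a full 3D convolution vs. a (2+1)D factorization.
# All function names are illustrative; biases are ignored.

def conv3d_params(n_in, n_out, t=3, d=3):
    """Weights in a full 3D convolution with a t x d x d kernel."""
    return n_in * n_out * t * d * d

def mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channel count M that keeps the factorized block at or
    below the 3D weight count (assumed rule, see lead-in)."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def conv2plus1d_params(n_in, n_out, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    m = mid_channels(n_in, n_out, t, d)
    spatial = n_in * m * d * d   # 2D spatial filters: n_in -> m channels
    temporal = m * n_out * t     # 1D temporal filters: m -> n_out channels
    return spatial + temporal

if __name__ == "__main__":
    for n_in, n_out in [(64, 64), (64, 128)]:
        print(n_in, n_out,
              mid_channels(n_in, n_out),
              conv3d_params(n_in, n_out),
              conv2plus1d_params(n_in, n_out))
```

With matched weight counts, the factorized block still inserts an extra nonlinearity between the spatial and temporal convolutions, so any accuracy difference is not explained by model size alone.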
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
Methods: Dense Connections · Batch Normalization · Average Pooling · Global Average Pooling · Residual Connection · Linear Warmup · Random Resized Crop · Weight Decay · SGD with Momentum
