Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition
Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh

TL;DR
This paper introduces 3D Residual Networks (3D ResNets) for action recognition in videos. In experiments on the ActivityNet and Kinetics datasets, the 3D ResNets trained on Kinetics did not overfit despite their large number of parameters and performed better than shallower 3D CNNs.
Contribution
The paper proposes a novel 3D ResNet architecture for video action recognition, enabling deeper networks that outperform shallow models like C3D without overfitting.
Findings
3D ResNets outperform shallow 3D CNNs like C3D.
3D ResNets trained on Kinetics did not suffer from overfitting despite the model's large number of parameters.
Results are reported on the ActivityNet and Kinetics datasets, with the improvement over shallower networks shown for models trained on Kinetics.
Abstract
Convolutional neural networks with spatio-temporal 3D kernels (3D CNNs) can directly extract spatio-temporal features from videos for action recognition. Although 3D kernels tend to overfit because of their large number of parameters, 3D CNNs are greatly improved by using recent huge video databases. However, the architecture of 3D CNNs is relatively shallow compared with the very deep neural networks that have succeeded among 2D-based CNNs, such as residual networks (ResNets). In this paper, we propose 3D CNNs based on ResNets toward a better action representation. We describe the training procedure of our 3D ResNets in detail. We experimentally evaluate the 3D ResNets on the ActivityNet and Kinetics datasets. The 3D ResNets trained on Kinetics did not suffer from overfitting despite the large number of parameters of the model, and achieved better performance than…
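To make the abstract's point about parameter count concrete, the short sketch below compares the number of weights in a 2D and a 3D convolution layer of the same width. The channel count (64) and kernel size (3) are illustrative assumptions, not values taken from this page, and the helper name `conv_weight_count` is hypothetical.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, spatial_dims):
    """Number of weights in one convolution layer (bias terms ignored)."""
    return in_channels * out_channels * kernel_size ** spatial_dims


# Assumed layer sizes, chosen only for illustration.
channels = 64
kernel = 3

weights_2d = conv_weight_count(channels, channels, kernel, spatial_dims=2)
weights_3d = conv_weight_count(channels, channels, kernel, spatial_dims=3)

print(weights_2d)               # 36864  (3 x 3 kernel)
print(weights_3d)               # 110592 (3 x 3 x 3 kernel)
print(weights_3d / weights_2d)  # 3.0
```

Under these assumptions, adding the temporal dimension triples the weights of every 3 x 3 layer, which is one way to see why 3D kernels are more prone to overfitting on small video datasets.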
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
