Initialization Strategies of Spatio-Temporal Convolutional Neural Networks
Elman Mansimov, Nitish Srivastava, Ruslan Salakhutdinov

TL;DR
This paper introduces initialization techniques for 3D convolutional layers in spatio-temporal ConvNets that reuse 2D convolutional weights learned on ImageNet, so video models do not have to be trained from scratch.
Contribution
It presents several strategies for initializing the 3D convolutional weights of a spatio-temporal ConvNet from 2D ImageNet weights, and shows that the choice of initialization matters for learning temporal representations of videos.
Findings
Improved accuracy over Spatial ConvNets on the UCF-101 dataset.
Judicious initialization of 3D convolutional weights is important for learning temporal representations.
Avoids training spatio-temporal ConvNets from scratch.
Abstract
We propose a new way of incorporating temporal information present in videos into Spatial Convolutional Neural Networks (ConvNets) trained on images, that avoids training Spatio-Temporal ConvNets from scratch. We describe several initializations of weights in 3D Convolutional Layers of Spatio-Temporal ConvNet using 2D Convolutional Weights learned from ImageNet. We show that it is important to initialize 3D Convolutional Weights judiciously in order to learn temporal representations of videos. We evaluate our methods on the UCF-101 dataset and demonstrate improvement over Spatial ConvNets.
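The abstract does not spell out the individual initialization schemes, so the following is only an illustrative sketch of the general idea of building a 3D kernel from a pretrained 2D one. The two schemes shown (spreading the 2D kernel evenly over time, and placing it in a single temporal slice with zeros elsewhere) and the function names `init_average` and `init_center` are assumptions made for this example, not necessarily the methods evaluated in the paper. Both are chosen so that, on a clip of identical frames, the 3D filter gives the same response as the original 2D filter.

```python
import numpy as np


def init_average(w2d, t):
    """Hypothetical scheme: copy the 2D kernel into every temporal slice, scaled by 1/t.

    w2d has shape (out_channels, in_channels, kh, kw);
    the result has shape (out_channels, in_channels, t, kh, kw).
    """
    w3d = np.repeat(w2d[:, :, None, :, :], t, axis=2)
    return w3d / t


def init_center(w2d, t):
    """Hypothetical scheme: put the 2D kernel in the middle temporal slice, zeros elsewhere."""
    out_c, in_c, kh, kw = w2d.shape
    w3d = np.zeros((out_c, in_c, t, kh, kw), dtype=w2d.dtype)
    w3d[:, :, t // 2] = w2d
    return w3d


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for pretrained 2D weights: 4 filters, 3 input channels, 3x3 kernels.
    w2d = rng.standard_normal((4, 3, 3, 3))
    for init in (init_average, init_center):
        w3d = init(w2d, t=3)
        # Summing over the temporal axis recovers the 2D kernel, so a static
        # clip produces the same activation as the image model would.
        print(init.__name__, w3d.shape, np.allclose(w3d.sum(axis=2), w2d))
```

Under either scheme the network starts out behaving like the spatial ConvNet on static input, and any sensitivity to motion has to be learned during fine-tuning; how the temporal slices are filled is exactly the kind of design choice the abstract says must be made judiciously.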
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
