In Defense of Image Pre-Training for Spatiotemporal Recognition
Xianhang Li, Huiyu Wang, Chen Wei, Jieru Mei, Alan Yuille, Yuyin Zhou, and Cihang Xie

TL;DR
This paper advocates for the use of image pre-training in video recognition by decomposing spatial and temporal features, introducing STS convolution, and demonstrating improved accuracy and efficiency across multiple datasets.
Contribution
It introduces Spatial-Temporal Separable (STS) convolution to decompose the learning of spatial and temporal features, and revisits image pre-training as an appearance prior for initializing the 3D kernels of video recognition CNNs.
Findings
STS convolution improves 3D CNN performance without extra parameters.
Pre-trained image models enhance video recognition accuracy.
The proposed method speeds up training while achieving better results.
Abstract
Image pre-training, the current de-facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to directly train with spatiotemporal convolutional neural networks (CNNs) from scratch. Nonetheless, interestingly, by taking a closer look at these from-scratch learned CNNs, we note there exist certain 3D kernels that exhibit much stronger appearance modeling ability than others, arguably suggesting appearance information is already well disentangled in learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in the decomposition of learning spatial and temporal features, and revisiting image pre-training as the appearance prior to initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits…
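The idea of using image pre-training as an appearance prior for 3D kernels can be illustrated with a small sketch. The code below is not taken from the paper; it shows two generic ways to turn a 2D kernel into a 3D one, and the function name `inflate_2d_kernel` and the mode names `"center"` and `"average"` are hypothetical choices made for this illustration. Both modes preserve the 2D response on a static clip, because the temporal slices of the inflated kernel sum back to the original 2D kernel.

```python
import numpy as np


def inflate_2d_kernel(w2d, t, mode="center"):
    """Build a 3D conv kernel from a 2D one (illustrative sketch only).

    w2d  : array of shape (out_channels, in_channels, kh, kw)
    t    : temporal extent of the 3D kernel (e.g. 3)
    mode : "center"  -> copy w2d into the middle temporal slice, zeros elsewhere
           "average" -> repeat w2d along time and scale by 1/t
    Returns an array of shape (out_channels, in_channels, t, kh, kw).
    """
    if t < 1:
        raise ValueError("temporal extent t must be >= 1")
    out_c, in_c, kh, kw = w2d.shape
    if mode == "center":
        w3d = np.zeros((out_c, in_c, t, kh, kw), dtype=w2d.dtype)
        w3d[:, :, t // 2] = w2d  # only the middle frame sees appearance weights
    elif mode == "average":
        # Every temporal slice gets an equal share of the 2D weights.
        w3d = np.repeat(w2d[:, :, None], t, axis=2) / t
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return w3d


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for weights from an image pre-trained layer (random here).
    w2d = rng.standard_normal((4, 3, 3, 3)).astype(np.float32)
    w3d = inflate_2d_kernel(w2d, t=3, mode="center")
    print(w3d.shape)
```

Which scheme, if either, matches the authors' initialization is not stated in this summary, so treat the sketch as background on the general technique rather than a reproduction of their method.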
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: 3D Convolution · Convolution
