In Defense of Image Pre-Training for Spatiotemporal Recognition

Xianhang Li; Huiyu Wang; Chen Wei; Jieru Mei; Alan Yuille; Yuyin Zhou; Cihang Xie

arXiv:2205.01721 · cs.CV · August 3, 2022 · 1 citation

Open Access · 1 Repo

TL;DR

This paper advocates for the use of image pre-training in video recognition by decomposing spatial and temporal features, introducing STS convolution, and demonstrating improved accuracy and efficiency across multiple datasets.

Contribution

It introduces Spatial-Temporal Separable (STS) convolution for better decomposition of spatial and temporal features, and revisits image pre-training as an appearance prior for initializing the kernels of 3D CNNs in video recognition.

Findings

01. STS convolution improves 3D CNN performance without extra parameters.

02. Pre-trained image models enhance video recognition accuracy.

03. The proposed method speeds up training while achieving better results.

Abstract

Image pre-training, the current de-facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to directly train with spatiotemporal convolutional neural networks (CNNs) from scratch. Nonetheless, interestingly, by taking a closer look at these from-scratch learned CNNs, we note there exist certain 3D kernels that exhibit much stronger appearance modeling ability than others, arguably suggesting appearance information is already well disentangled in learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in the decomposition of learning spatial and temporal features, and revisiting image pre-training as the appearance prior to initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits…
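
To make the idea of an image-pre-trained "appearance prior" for 3D kernels more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation (the official repository listed under Code & Models is the reference for that). It assumes one simple initialization scheme, chosen here purely for illustration: a 2D kernel is copied into the centre temporal slice of a 3D kernel and the other slices are set to zero, so the 3D kernel initially responds only to the appearance of the middle frame. The function names and the hand-written Sobel filter (a stand-in for a pre-trained image filter) are hypothetical.

```python
def inflate_2d_to_3d(kernel_2d, t):
    """Embed a k x k kernel in the centre slice of a t x k x k kernel.

    Assumption for this sketch: all other temporal slices start at zero,
    so the 3D kernel initially models appearance only.
    """
    if t % 2 == 0:
        raise ValueError("temporal size t is assumed odd so a centre slice exists")
    k_h, k_w = len(kernel_2d), len(kernel_2d[0])
    kernel_3d = [[[0.0] * k_w for _ in range(k_h)] for _ in range(t)]
    kernel_3d[t // 2] = [[float(w) for w in row] for row in kernel_2d]
    return kernel_3d


def response_2d(patch, kernel_2d):
    """Response of a 2D kernel at one location (sum of element-wise products)."""
    return sum(p * w
               for p_row, w_row in zip(patch, kernel_2d)
               for p, w in zip(p_row, w_row))


def response_3d(clip, kernel_3d):
    """Response of a 3D kernel at one location of a clip of t patches."""
    return sum(response_2d(frame, k_slice)
               for frame, k_slice in zip(clip, kernel_3d))


# Hand-written horizontal-edge filter standing in for a pre-trained 2D kernel.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
kernel_3d = inflate_2d_to_3d(sobel_x, t=3)

# A three-frame clip: the middle frame contains a vertical edge,
# the neighbouring frames are uniform.
edge_frame = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]
flat_frame = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
clip = [flat_frame, edge_frame, flat_frame]

# At initialization the 3D response equals the 2D response on the centre frame.
print(response_3d(clip, kernel_3d) == response_2d(edge_frame, sobel_x))
```

Under this assumed scheme, the network starts from the image model's appearance features, and any temporal modelling has to be learned during fine-tuning on video.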

Code & Models

Repositories

ucsc-vlaa/image-pretraining-for-video
PyTorch · Official

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: 3D Convolution · Convolution