Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

Zhaofan Qiu; Ting Yao; Tao Mei

arXiv:1711.10305 · cs.CV · November 29, 2017 · 251 cites

TL;DR

This paper introduces Pseudo-3D Residual Networks (P3D ResNet), a novel architecture that efficiently captures spatio-temporal features in videos by recycling 2D CNN components, achieving superior performance on multiple benchmarks.

Contribution

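The paper proposes P3D ResNet, whose residual bottleneck blocks simulate $3 \times 3 \times 3$ convolutions with $1 \times 3 \times 3$ spatial convolutions (equivalent to a 2D CNN) plus $3 \times 1 \times 1$ temporal convolutions. This lets the network reuse off-the-shelf 2D components and avoids the computational cost and memory demand of training a very deep 3D CNN from scratch.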

Findings

01

P3D ResNet outperforms 3D CNN and frame-based 2D CNN on the Sports-1M dataset by 5.3% and 1.8%, respectively.

02

Video representations from the pre-trained P3D ResNet generalize well across five benchmarks and three tasks.

03

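The architecture places different block variants at different positions in the ResNet, on the premise that more structural diversity at greater depth improves the network's capacity for video tasks.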

Abstract

Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial when utilizing a CNN for learning spatio-temporal video representation. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle off-the-shelf 2D networks for a 3D CNN. In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating $3 \times 3 \times 3$ convolutions with $1 \times 3 \times 3$ convolutional filters on spatial domain (equivalent to 2D CNN) plus $3 \times 1 \times 1$ convolutions to construct temporal connections on adjacent feature maps in time.…
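
The saving from this factorization can be illustrated by counting weights. The sketch below is not from the paper: it assumes equal input and output channel counts, no bias terms, and a hypothetical channel width of 64, and it compares one full $3 \times 3 \times 3$ convolution with a $1 \times 3 \times 3$ spatial convolution followed by a $3 \times 1 \times 1$ temporal convolution. The helper names are illustrative only.

```python
def conv3d_weights(c_in, c_out, kt, kh, kw):
    """Weight count of a 3D convolution with a kt x kh x kw kernel (bias omitted)."""
    return c_in * c_out * kt * kh * kw


def full_3d_weights(channels):
    """One 3x3x3 convolution, assuming channels in == channels out."""
    return conv3d_weights(channels, channels, 3, 3, 3)


def pseudo_3d_weights(channels):
    """A 1x3x3 spatial convolution plus a 3x1x1 temporal convolution."""
    spatial = conv3d_weights(channels, channels, 1, 3, 3)
    temporal = conv3d_weights(channels, channels, 3, 1, 1)
    return spatial + temporal


if __name__ == "__main__":
    channels = 64  # assumed width, chosen only for illustration
    full = full_3d_weights(channels)
    pseudo = pseudo_3d_weights(channels)
    print(f"3x3x3: {full} weights")
    print(f"1x3x3 + 3x1x1: {pseudo} weights")
    print(f"ratio: {pseudo / full:.3f}")
```

Under these assumptions the factorized pair uses 12 weights per channel pair instead of 27, a ratio of 4/9 regardless of channel width. This counts weights only; it says nothing about the accuracy or runtime figures reported in the paper.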

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods

Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection