V4D: 4D Convolutional Neural Networks for Video-level Representation Learning

Shiwen Zhang, Sheng Guo, Weilin Huang, Matthew R. Scott, and Limin Wang

arXiv:2002.07442·cs.CV·February 19, 2020·48 cites

TL;DR

V4D introduces 4D convolutional neural networks for video-level representation learning. It captures long-range temporal evolution across clips while preserving strong clip-level 3D features through residual connections, and it outperforms recent 3D CNNs on three video recognition benchmarks.

Contribution

The paper proposes a new 4D residual block that can be integrated into existing 3D CNNs to perform long-range video modeling hierarchically.

Findings

01

V4D outperforms recent 3D CNNs on three benchmarks.

02

The 4D residual blocks effectively capture inter-clip interactions.

03

V4D achieves state-of-the-art results in video recognition tasks.

Abstract

Most existing 3D CNNs for video representation learning are clip-based methods, and thus do not consider the video-level temporal evolution of spatio-temporal features. In this paper, we propose Video-level 4D Convolutional Neural Networks, referred to as V4D, to model the evolution of long-range spatio-temporal representations with 4D convolutions while preserving strong 3D spatio-temporal representations with residual connections. Specifically, we design a new 4D residual block that captures inter-clip interactions, which can enhance the representation power of the original clip-level 3D CNNs. The 4D residual blocks can be easily integrated into existing 3D CNNs to perform long-range modeling hierarchically. We further introduce the training and inference methods for the proposed V4D. Extensive experiments are conducted on three video recognition benchmarks, where V4D…
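The residual idea in the abstract can be illustrated with a deliberately simplified NumPy sketch. This is not the authors' 4D residual block: a real 4D convolution would also span time and space and would learn channel-mixing weights. Here the kernel runs over the clip axis only, and the function name `inter_clip_residual`, the kernel `w`, and the `(U, C, T, H, W)` layout are all assumptions made for illustration.

```python
import numpy as np

def inter_clip_residual(x, w):
    """Add an inter-clip term to clip-level features (illustrative only).

    x: array of shape (U, C, T, H, W), features of U clips from one video
       (layout is an assumption of this sketch).
    w: 1D kernel of odd length k applied along the clip axis U
       (hypothetical stand-in for a learned 4D kernel).
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = w.shape[0]
    pad = k // 2
    # Zero-pad along the clip axis only, so the output keeps U clips.
    padded = np.pad(x, [(pad, pad)] + [(0, 0)] * (x.ndim - 1))
    mixed = np.zeros(x.shape, dtype=float)
    for i in range(k):
        # Each kernel tap reads a neighbouring clip's features.
        mixed += w[i] * padded[i:i + x.shape[0]]
    # Residual connection: the original clip-level features pass through.
    return x + mixed

# Four clips, each with a small (C, T, H, W) feature volume.
x = np.arange(4 * 2 * 3 * 2 * 2, dtype=float).reshape(4, 2, 3, 2, 2)
out = inter_clip_residual(x, [1.0, 0.0, 1.0])
```

With an all-zero kernel the function returns `x` unchanged, which is the sense in which a residual connection preserves the clip-level 3D representation while the added term carries information between clips.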

Code & Models

Repositories

MalongTech/research-v4d
pytorch

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Surveillance and Tracking Methods

Methods: Convolution · Batch Normalization · Residual Block · Residual Connection