V4D: 4D Convolutional Neural Networks for Video-level Representation Learning
Shiwen Zhang, Sheng Guo, Weilin Huang, Matthew R. Scott, and Limin Wang

TL;DR
V4D introduces 4D convolutional neural networks for video-level representation learning. The 4D convolutions capture long-range temporal evolution across clips, residual connections preserve the strong clip-level 3D features, and the resulting model outperforms recent 3D CNNs on three video recognition benchmarks.
Contribution
The paper proposes a novel 4D residual block and integrates it into 3D CNNs for hierarchical long-range video modeling, a new approach in video representation learning.
Findings
V4D outperforms recent 3D CNNs on three benchmarks.
The 4D residual blocks effectively capture inter-clip interactions.
V4D achieves state-of-the-art results in video recognition tasks.
Abstract
Most existing 3D CNNs for video representation learning are clip-based methods, and thus do not consider video-level temporal evolution of spatio-temporal features. In this paper, we propose Video-level 4D Convolutional Neural Networks, referred to as V4D, to model the evolution of long-range spatio-temporal representation with 4D convolutions, and at the same time, to preserve strong 3D spatio-temporal representation with residual connections. Specifically, we design a new 4D residual block able to capture inter-clip interactions, which could enhance the representation power of the original clip-level 3D CNNs. The 4D residual blocks can be easily integrated into the existing 3D CNNs to perform long-range modeling hierarchically. We further introduce the training and inference methods for the proposed V4D. Extensive experiments are conducted on three video recognition benchmarks, where V4D…
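As a rough illustration of the residual idea in the abstract, the sketch below mixes features across neighbouring clips and adds the result back to each clip's own feature. It is a toy under stated assumptions, not the authors' implementation: each clip's 3D feature map is flattened to a list of floats, the kernel spans only the clip axis (a k x 1 x 1 x 1 shape chosen here for simplicity), and normalisation and non-linearities are left out. The names `clip_axis_conv` and `residual_4d_block` are hypothetical.

```python
from typing import List

# Assumption: one clip's 3D-CNN feature map, flattened to a vector for illustration.
Clip = List[float]


def clip_axis_conv(clips: List[Clip], kernel: List[float]) -> List[Clip]:
    """Convolve along the clip axis only, with zero padding at both ends."""
    k = len(kernel)
    if k % 2 != 1:
        raise ValueError("kernel length must be odd")
    pad = k // 2
    n = len(clips)
    dim = len(clips[0])
    out: List[Clip] = []
    for u in range(n):
        acc = [0.0] * dim
        for j, w in enumerate(kernel):
            src = u + j - pad  # index of the neighbouring clip
            if 0 <= src < n:
                for i in range(dim):
                    acc[i] += w * clips[src][i]
        out.append(acc)
    return out


def residual_4d_block(clips: List[Clip], kernel: List[float]) -> List[Clip]:
    """Add inter-clip mixing on top of the unchanged clip-level features."""
    mixed = clip_axis_conv(clips, kernel)
    return [[x + m for x, m in zip(c, mc)] for c, mc in zip(clips, mixed)]


if __name__ == "__main__":
    clips = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # Each clip receives the features of the clip before it.
    print(residual_4d_block(clips, [1.0, 0.0, 0.0]))
```

With an all-zero kernel the block returns its input unchanged, which is the sense in which the residual connection keeps the original clip-level representation intact while the extra branch learns inter-clip interactions.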
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Surveillance and Tracking Methods
Methods: Convolution · Batch Normalization · Residual Block · Residual Connection
