Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification

Youngwan Lee; Hyung-Il Kim; Kimin Yun; Jinyoung Moon

arXiv:2012.00317 · cs.CV · April 23, 2021

TL;DR

VoV3D is an efficient 3D architecture for video classification that combines temporal one-shot aggregation (T-OSA) with depthwise spatiotemporal factorization (D(2+1)D), reporting higher accuracy than prior temporal modeling methods with fewer parameters and less computation.

Contribution

The paper proposes VoV3D, a novel architecture integrating T-OSA and D(2+1)D modules for efficient and effective temporal modeling in video classification.

Findings

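01  VoV3D-L has 6x fewer parameters and 16x less computation than the state-of-the-art temporal modeling method it is compared against.

02  VoV3D surpasses state-of-the-art temporal modeling methods on the Something-Something and Kinetics-400 benchmarks.

03  VoV3D shows better temporal modeling ability than X3D, an efficient 3D architecture of comparable model capacity.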

Abstract

Recent video classification research has concentrated on two directions: temporal modeling and efficient 3D architectures. However, temporal modeling methods tend to be inefficient, while efficient 3D architectures pay less attention to temporal modeling. To bridge this gap, we propose VoV3D, an efficient 3D architecture for temporal modeling that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. T-OSA builds a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSA modules enables the network itself to model both short-range and long-range temporal relationships across frames without any external modules. Inspired by kernel factorization and channel factorization, we also design a depthwise spatiotemporal factorization module, named,…
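
The idea of aggregating features with different temporal receptive fields can be illustrated with simple arithmetic. The sketch below is not the authors' implementation: it assumes each layer in a module is a temporal convolution with kernel size 3 and stride 1, and that the module input is concatenated together with every intermediate output exactly once at the end. The function names `temporal_receptive_fields` and `one_shot_aggregate` and all channel counts are hypothetical.

```python
def temporal_receptive_fields(num_layers, kernel_t=3, stride_t=1):
    """Temporal receptive field (in frames) after each stacked temporal conv."""
    fields = []
    rf = 1    # receptive field of the module input: a single frame
    jump = 1  # distance in input frames between adjacent output positions
    for _ in range(num_layers):
        rf += (kernel_t - 1) * jump
        jump *= stride_t
        fields.append(rf)
    return fields


def one_shot_aggregate(fields, channels_per_layer, input_channels):
    """Concatenate the input and every layer output once (assumed layout).

    Returns the channel count of the concatenated tensor and the list of
    temporal receptive fields that it mixes.
    """
    total_channels = input_channels + channels_per_layer * len(fields)
    return total_channels, [1] + fields


# Hypothetical sizes: 3 layers of 16 channels each on a 32-channel input.
fields = temporal_receptive_fields(3)
channels, mixed_fields = one_shot_aggregate(fields, 16, 32)
print(fields)        # [3, 5, 7]
print(channels)      # 80
print(mixed_fields)  # [1, 3, 5, 7]
```

Under these assumptions a single module already mixes features spanning 1, 3, 5, and 7 frames, which gives an intuition for how stacking such modules could cover both short-range and long-range relationships.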

Code & Models

Repositories

youngwanLEE/VoV3D (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Surveillance and Tracking Methods

Methods: 1x1 Convolution · Convolution · Concatenated Skip Connection · Depthwise Convolution · Batch Normalization · One-Shot Aggregation
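
To give a feel for why depthwise and 1x1 convolutions matter for efficient 3D blocks, the sketch below counts convolution weights under kernel factorization, channel factorization, and both combined. It is a generic illustration, not the paper's D(2+1)D module: the channel count, the 3x3x3 kernel, the equal channel width at every step, and the placement of the 1x1x1 pointwise convolution are all assumptions, and `conv3d_params` is a hypothetical helper.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, groups=1):
    """Weight count of a 3D convolution (bias omitted)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kt * kh * kw


c = 64  # assumed channel width, same for input and output

# Standard dense 3x3x3 convolution.
full = conv3d_params(c, c, 3, 3, 3)

# Kernel factorization: 1x3x3 spatial conv followed by 3x1x1 temporal conv.
kernel_fact = conv3d_params(c, c, 1, 3, 3) + conv3d_params(c, c, 3, 1, 1)

# Channel factorization: depthwise 3x3x3 conv plus 1x1x1 pointwise conv.
channel_fact = conv3d_params(c, c, 3, 3, 3, groups=c) + conv3d_params(c, c, 1, 1, 1)

# Both: depthwise spatial, depthwise temporal, then 1x1x1 pointwise.
both = (
    conv3d_params(c, c, 1, 3, 3, groups=c)
    + conv3d_params(c, c, 3, 1, 1, groups=c)
    + conv3d_params(c, c, 1, 1, 1)
)

print(full)          # 110592
print(kernel_fact)   # 49152
print(channel_fact)  # 5824
print(both)          # 4864
```

With these assumed sizes the 1x1x1 pointwise convolution dominates the cost once the spatiotemporal kernels are depthwise, so the combined factorization saves only a little more than channel factorization alone in terms of weights.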