Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification
Youngwan Lee, Hyung-Il Kim, Kimin Yun, Jinyoung Moon

TL;DR
VoV3D introduces an efficient 3D architecture for video classification that combines temporal aggregation and depthwise spatiotemporal factorization, achieving superior accuracy with fewer parameters and less computation.
Contribution
The paper proposes VoV3D, a novel architecture integrating T-OSA and D(2+1)D modules for efficient and effective temporal modeling in video classification.
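To give a rough sense of why depthwise spatiotemporal factorization reduces model size, the sketch below counts convolution weights for four layer variants. It is an illustrative calculation, not the authors' implementation: the channel width of 64, the 3x3x3 kernel, the helper name `conv3d_params`, and the assumption that D(2+1)D splits a depthwise 3D convolution into a spatial depthwise and a temporal depthwise convolution are all assumptions made for this example.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, groups=1):
    """Weight count of a 3D convolution without bias (hypothetical helper)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kt * kh * kw


channels = 64  # assumed width, chosen only for illustration

# Full 3D convolution: one 3x3x3 kernel mixing all channels.
full_3d = conv3d_params(channels, channels, 3, 3, 3)

# Kernel factorization: spatial 1x3x3 followed by temporal 3x1x1.
# Simplified: the intermediate width is kept equal to `channels`.
kernel_factorized = (
    conv3d_params(channels, channels, 1, 3, 3)
    + conv3d_params(channels, channels, 3, 1, 1)
)

# Channel factorization: depthwise 3x3x3 (one filter per channel).
depthwise_3d = conv3d_params(channels, channels, 3, 3, 3, groups=channels)

# Both factorizations together: depthwise spatial plus depthwise temporal.
depthwise_factorized = (
    conv3d_params(channels, channels, 1, 3, 3, groups=channels)
    + conv3d_params(channels, channels, 3, 1, 1, groups=channels)
)

for name, count in [
    ("full 3D", full_3d),
    ("kernel factorized", kernel_factorized),
    ("depthwise 3D", depthwise_3d),
    ("depthwise + kernel factorized", depthwise_factorized),
]:
    print(f"{name}: {count} weights")
```

The counts exclude the 1x1x1 pointwise convolutions that depthwise designs normally need for channel mixing, so they show only the relative cost of the spatiotemporal kernels.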
Findings
VoV3D-L has 6x fewer parameters and 16x less computation than previous models.
VoV3D surpasses state-of-the-art temporal modeling methods on benchmarks.
VoV3D demonstrates better temporal modeling than X3D with similar capacity.
Abstract
Recent video classification research has focused on two areas: temporal modeling and efficient 3D architectures. However, temporal modeling methods are not efficient, and efficient 3D architectures pay little attention to temporal modeling. To bridge this gap, we propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. T-OSA is designed to build a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSA modules enables the network itself to model short-range as well as long-range temporal relationships across frames without any external modules. Inspired by kernel factorization and channel factorization, we also design a depthwise spatiotemporal factorization module, named,…
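The idea of aggregating features with different temporal receptive fields can be illustrated with a small calculation. The sketch below assumes a chain of temporal convolutions with kernel size 3 and stride 1 whose outputs are all concatenated once at the end; the layer count, kernel size, and function name `temporal_receptive_fields` are hypothetical choices for this example, not values taken from the paper.

```python
def temporal_receptive_fields(num_layers, kernel_t=3, start_rf=1):
    """Receptive field (in frames) after each stacked temporal conv, stride 1.

    Hypothetical helper: each layer with temporal kernel `kernel_t`
    widens the receptive field by (kernel_t - 1) frames.
    """
    fields = []
    rf = start_rf
    for _ in range(num_layers):
        rf += kernel_t - 1
        fields.append(rf)
    return fields


# One aggregation module with three stacked temporal convolutions:
# the concatenated output mixes features that see 3, 5, and 7 frames.
first_module = temporal_receptive_fields(3)
print(first_module)

# Stacking a second module on top of the deepest feature of the first
# extends the range further, covering longer temporal relationships.
second_module = temporal_receptive_fields(3, start_rf=first_module[-1])
print(second_module)
```

Under these assumptions, a single module already combines short-range and medium-range views of the clip, and each additional module pushes the longest view further out, which matches the short-range and long-range behavior described in the abstract.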
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Surveillance and Tracking Methods
Methods: 1x1 Convolution · Convolution · Concatenated Skip Connection · Depthwise Convolution · Batch Normalization · One-Shot Aggregation
