SmallBigNet: Integrating Core and Contextual Views for Video Classification
Xianhang Li, Yali Wang, Zhipeng Zhou, Yu Qiao

TL;DR
SmallBigNet enhances video classification by integrating core and contextual views through a dual-branch network, improving accuracy and robustness while maintaining model compactness.
Contribution
The paper introduces SmallBigNet, a novel architecture that combines small and big view branches with shared convolutions to improve video representation learning.
Findings
Outperforms recent state-of-the-art methods on Kinetics400 and on Something-Something V1 and V2.
Achieves high accuracy with model size comparable to 2D CNNs.
Demonstrates improved robustness and discriminative power in video classification.
Abstract
Temporal convolution has been widely used for video classification. However, it is performed on spatio-temporal contexts in a limited view, which often weakens its capacity of learning video representation. To alleviate this problem, we propose a concise and novel SmallBig network, with the cooperation of small and big views. For the current time step, the small view branch is used to learn the core semantics, while the big view branch is used to capture the contextual semantics. Unlike traditional temporal convolution, the big view branch can provide the small view branch with the most activated video features from a broader 3D receptive field. Via aggregating such big-view contexts, the small view branch can learn more robust and discriminative spatio-temporal representations for video classification. Furthermore, we propose to share convolution in the small and big view branch, which…
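The two-branch idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation. It assumes that "most activated video features from a broader 3D receptive field" can be read as stride-1 3D max pooling over a 3x3x3 neighbourhood, that the shared convolution can be stood in for by a single scalar weight on a single-channel feature map, and that the two views are aggregated by summation. The names `max_pool3d` and `small_big_unit` are hypothetical, and any normalization or nonlinearity is left out.

```python
from itertools import product


def max_pool3d(x, k=3):
    """Stride-1 max pooling over a (T, H, W) nested list.

    Windows are clipped at the borders, so the output has the same shape as x.
    """
    T, H, W = len(x), len(x[0]), len(x[0][0])
    r = k // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t, h, w in product(range(T), range(H), range(W)):
        out[t][h][w] = max(
            x[tt][hh][ww]
            for tt in range(max(0, t - r), min(T, t + r + 1))
            for hh in range(max(0, h - r), min(H, h + r + 1))
            for ww in range(max(0, w - r), min(W, w + r + 1))
        )
    return out


def small_big_unit(x, weight, k=3):
    """Toy small-big unit on single-channel (T, H, W) features.

    Assumption: `weight` is one scalar standing in for the shared convolution,
    and the two branches are combined by addition.
    """
    T, H, W = len(x), len(x[0]), len(x[0][0])
    big_input = max_pool3d(x, k)  # most activated feature in each 3D neighbourhood
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t, h, w in product(range(T), range(H), range(W)):
        small = weight * x[t][h][w]        # small view: the current position itself
        big = weight * big_input[t][h][w]  # big view: same weight, pooled context
        out[t][h][w] = small + big
    return out


# A 2-frame, 2x2 clip with a single active feature at frame 0, position (0, 0).
clip = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
clip[0][0][0] = 1.0
out = small_big_unit(clip, weight=2.0)
print(out[0][0][0], out[1][1][1])  # 4.0 2.0
```

In this toy clip the single active feature reaches every position through the big view, so even the position in the other frame receives a non-zero response. Because both branches reuse the same weight, adding the big view introduces no extra parameters in this sketch.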
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods
Methods: Convolution
