More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation
Quanfu Fan, Chun-Fu Chen, Hilde Kuehne, Marco Pistoia, David Cox

TL;DR
This paper introduces a lightweight, efficient video action recognition architecture combining a deep low-resolution subnet with a compact high-resolution subnet, significantly reducing computational costs while maintaining or improving accuracy.
Contribution
The paper proposes a novel Big-Little Network architecture with depthwise temporal aggregation, enabling high efficiency and accuracy in video recognition with reduced resource requirements.
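As a rough illustration of what depthwise temporal aggregation can look like, the sketch below mixes each frame's features with those of its temporal neighbours using a separate weight per channel and per temporal offset. The function name, the list-based data layout, and the zero padding at clip boundaries are all assumptions made for this example; it is not the authors' implementation.

```python
from typing import List, Sequence

Frame = List[float]  # per-frame feature vector: one value per channel


def depthwise_temporal_aggregate(
    frames: Sequence[Frame],
    weights: Sequence[Sequence[float]],
) -> List[Frame]:
    """Mix each frame with its temporal neighbours, channel by channel.

    weights[k][c] is the weight applied to channel c at temporal offset
    k - len(weights) // 2, so len(weights) must be odd.
    Frames outside the clip are treated as zeros (an assumption).
    """
    if len(weights) % 2 == 0:
        raise ValueError("need an odd number of temporal offsets")
    radius = len(weights) // 2
    num_frames = len(frames)
    num_channels = len(frames[0])
    out: List[Frame] = []
    for t in range(num_frames):
        mixed = [0.0] * num_channels
        for k, w in enumerate(weights):
            src = t + k - radius  # index of the neighbouring frame
            if 0 <= src < num_frames:
                # Depthwise: channel c only ever sees channel c.
                for c in range(num_channels):
                    mixed[c] += w[c] * frames[src][c]
        out.append(mixed)
    return out


# Hypothetical toy clip: 3 frames, 2 channels.
clip = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
# Channel 0 sums previous, current, and next frame; channel 1 is left unchanged.
taps = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
aggregated = depthwise_temporal_aggregate(clip, taps)
```

Because no weight couples different channels, the cost grows with the number of channels times the number of temporal offsets, which is far cheaper than a full 3D convolution over the same features.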
Findings
Achieves a 3-4x reduction in FLOPs compared to the baseline.
Uses about 2x less memory than the baseline while maintaining performance.
Performs well on Kinetics, Something-Something, and Moments-in-Time.
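To see why pairing a deep low-resolution branch with a compact high-resolution branch can save computation, the arithmetic can be sketched for a single convolution layer. The layer size, the `width_divisor` parameter, and the helper names below are illustrative assumptions; the resulting ratio is a toy figure and not a measurement reported by the paper.

```python
def conv_macs(height: int, width: int, c_in: int, c_out: int, kernel: int = 3) -> int:
    """Multiply-accumulates for one stride-1, same-padded 2D convolution."""
    return height * width * c_in * c_out * kernel * kernel


def big_little_macs(height: int, width: int, channels: int,
                    width_divisor: int, kernel: int = 3) -> int:
    """Cost of a paired layer: full-width branch at half resolution
    plus a narrow branch at full resolution (illustrative assumption)."""
    big = conv_macs(height // 2, width // 2, channels, channels, kernel)
    little_channels = channels // width_divisor
    little = conv_macs(height, width, little_channels, little_channels, kernel)
    return big + little


# Hypothetical layer: 56x56 feature map, 64 channels, 3x3 kernel.
baseline = conv_macs(56, 56, 64, 64)
paired = big_little_macs(56, 56, 64, width_divisor=4)
ratio = baseline / paired
print(f"baseline={baseline} paired={paired} ratio={ratio:.2f}")
```

Halving the spatial resolution cuts the big branch to a quarter of the baseline cost, and dividing the channel count by 4 cuts the little branch to a sixteenth, so the pair costs 5/16 of the single full-resolution layer in this toy setting.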
Abstract
Current state-of-the-art models for video action recognition are mostly based on expensive 3D ConvNets. This results in a need for large GPU clusters to train and evaluate such architectures. To address this problem, we present a lightweight and memory-friendly architecture for action recognition that performs on par with or better than current architectures by using only a fraction of resources. The proposed architecture is based on a combination of a deep subnet operating on low-resolution frames with a compact subnet operating on high-resolution frames, allowing for high efficiency and accuracy at the same time. We demonstrate that our approach achieves a reduction of 3 to 4 times in FLOPs and about 2 times in memory usage compared to the baseline. This enables training deeper models with more input frames under the same computational budget. To further obviate the need for…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging
