More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

Quanfu Fan; Chun-Fu Chen; Hilde Kuehne; Marco Pistoia; David Cox

arXiv:1912.00869·cs.CV·July 27, 2021·90 cites

TL;DR

This paper introduces a lightweight video action recognition architecture that pairs a deep subnet operating on low-resolution frames with a compact subnet operating on high-resolution frames. It cuts FLOPs by 3 to 4 times and memory usage by about 2 times relative to the baseline while performing on par with or better than current architectures.

Contribution

The paper proposes a Big-Little Network architecture for video, combined with depthwise temporal aggregation. The resulting savings in computation and memory allow deeper models with more input frames to be trained under the same computational budget.
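This page names depthwise temporal aggregation without spelling out the module, so the following is only a minimal sketch of one plausible reading: each channel gets its own weight per temporal offset, and a frame's features are replaced by the per-channel weighted sum of its neighbours. The function name `depthwise_temporal_aggregate`, the offset-to-weights dictionary, the zero padding at clip boundaries, and the omission of spatial dimensions are all assumptions made for illustration, not the authors' implementation.

```python
def depthwise_temporal_aggregate(frames, weights):
    """Per-channel weighted sum over neighbouring frames (hypothetical sketch).

    frames:  list of T frames, each a list of C channel values
             (spatial dimensions are omitted to keep the example small).
    weights: dict mapping a temporal offset to a list of C per-channel weights.
    Frames outside the clip are treated as zeros.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    out = []
    for t in range(num_frames):
        acc = [0.0] * num_channels
        for offset, w in weights.items():
            s = t + offset
            if 0 <= s < num_frames:
                # "Depthwise": channel c only ever mixes with channel c.
                for c in range(num_channels):
                    acc[c] += w[c] * frames[s][c]
        out.append(acc)
    return out


# Toy input: 3 frames, 2 channels.
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
# Channel 0 blends with its neighbours; channel 1 keeps only the current frame.
weights = {-1: [0.5, 0.0], 0: [1.0, 1.0], 1: [0.5, 0.0]}
aggregated = depthwise_temporal_aggregate(frames, weights)
print(aggregated)  # [[2.0, 10.0], [4.0, 20.0], [4.0, 30.0]]
```

Because the weights are per channel rather than a full channel-mixing matrix, the cost of the temporal step grows linearly with the number of channels, which is the usual motivation for depthwise operations.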

Findings

01

Achieves a 3 to 4 times reduction in FLOPs compared to the baseline.

02

Uses about 2 times less memory than the baseline while maintaining performance.

03

Performs well on Kinetics, Something-Something, and Moments-in-Time.

Abstract

Current state-of-the-art models for video action recognition are mostly based on expensive 3D ConvNets. This results in a need for large GPU clusters to train and evaluate such architectures. To address this problem, we present a lightweight and memory-friendly architecture for action recognition that performs on par with or better than current architectures by using only a fraction of resources. The proposed architecture is based on a combination of a deep subnet operating on low-resolution frames with a compact subnet operating on high-resolution frames, allowing for high efficiency and accuracy at the same time. We demonstrate that our approach achieves a reduction by $3 \sim 4$ times in FLOPs and $\sim 2$ times in memory usage compared to the baseline. This enables training deeper models with more input frames under the same computational budget. To further obviate the need for…
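To see why splitting work between a deep low-resolution subnet and a compact high-resolution subnet can save computation, a back-of-the-envelope count helps: the cost of a convolution scales with the spatial area of its input, so halving height and width cuts the cost of every layer by four. The sketch below uses made-up layer counts, channel widths, and resolutions (`H`, `W`, `C`, `DEPTH` are all hypothetical) and is not meant to reproduce the figures reported in the paper.

```python
def conv2d_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates of one k x k convolution, stride 1, 'same' padding."""
    return h * w * c_in * c_out * k * k


def branch_macs(h, w, channels, depth):
    """Cost of a stack of `depth` identical convolutions."""
    return depth * conv2d_macs(h, w, channels, channels)


# Hypothetical toy configuration, chosen only for illustration.
H = W = 56
C = 64
DEPTH = 8

# Baseline: every layer runs at full resolution and full width.
baseline = branch_macs(H, W, C, DEPTH)

# Deep subnet on low-resolution input: same depth and width, half the resolution.
deep_lowres = branch_macs(H // 2, W // 2, C, DEPTH)

# Compact subnet on high-resolution input: one layer, half the width.
compact_highres = branch_macs(H, W, C // 2, 1)

two_branch = deep_lowres + compact_highres
print(f"deep low-res branch:     {deep_lowres / baseline:.3f} of baseline")
print(f"compact high-res branch: {compact_highres / baseline:.3f} of baseline")
print(f"overall reduction:       {baseline / two_branch:.2f}x")
```

In this toy setting the deep branch costs a quarter of the baseline and the compact branch adds only a small fraction on top, so most of the saving comes from running the expensive layers at reduced resolution.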

Code & Models

Repositories

IBM/bLVNet-TAM
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging