NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition

Boyang Xia; Wenhao Wu; Haoran Wang; Rui Su; Dongliang He; Haosen Yang; Xiaoran Fan; Wanli Ouyang

arXiv:2207.10388 · cs.CV · July 22, 2022 · 1 citation

TL;DR

NSNet introduces a novel approach for efficient video recognition by suppressing non-salient frames and leveraging dual-level supervisions, achieving state-of-the-art accuracy and significantly faster inference speeds.

Contribution

The paper proposes NSNet, a new framework that effectively distinguishes salient from non-salient frames using pseudo labels and dual supervisions, improving efficiency and accuracy.

Findings

01. Achieves 2.4 to 4.3 times faster inference speed.

02. Outperforms existing methods on four benchmarks.

03. Balances accuracy and efficiency effectively.

Abstract

It is challenging for artificial intelligence systems to achieve accurate video recognition under the scenario of low computation costs. Adaptive inference based efficient video recognition methods typically preview videos and focus on salient parts to reduce computation costs. Most existing works focus on complex networks learning with video classification based objectives. Taking all frames as positive samples, few of them pay attention to the discrimination between positive samples (salient frames) and negative samples (non-salient frames) in supervisions. To fill this gap, in this paper, we propose a novel Non-saliency Suppression Network (NSNet), which effectively suppresses the responses of non-salient frames. Specifically, on the frame level, effective pseudo labels that can distinguish between salient and non-salient frames are generated to guide the frame saliency learning. On…
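The general idea of previewing a video and keeping only its salient frames can be illustrated with a small sketch. This is not the NSNet implementation: the function name `select_salient_frames`, the score values, and the plain top-k rule are all assumptions made for illustration, and the per-frame saliency scores are assumed to come from some lightweight preview model that is not shown here.

```python
from typing import List, Sequence


def select_salient_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order.

    Illustrative top-k rule only (an assumption), not the sampler from the paper.
    """
    if k <= 0:
        return []
    k = min(k, len(scores))
    # Rank frame indices by score, highest first; ties go to the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore temporal order so a downstream classifier sees frames in sequence.
    return sorted(ranked[:k])


# Hypothetical saliency scores for an 8-frame preview of a video.
scores = [0.05, 0.80, 0.10, 0.65, 0.02, 0.90, 0.30, 0.07]
selected = select_salient_frames(scores, k=3)
print(selected)
```

Only the selected frames would then be passed to the heavier recognition network, which is where the computation saving in this kind of adaptive-inference pipeline comes from.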

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image Fusion Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings