NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition
Boyang Xia, Wenhao Wu, Haoran Wang, Rui Su, Dongliang He, Haosen Yang, Xiaoran Fan, Wanli Ouyang

TL;DR
NSNet makes video recognition more efficient by suppressing the responses of non-salient frames and training with dual-level supervision, reaching state-of-the-art accuracy with 2.4 to 4.3 times faster inference.
Contribution
The paper proposes NSNet, a framework that distinguishes salient from non-salient frames using frame-level pseudo labels and dual-level supervision, improving both efficiency and accuracy.
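As a rough illustration of why separating salient from non-salient frames saves computation, the sketch below keeps only the highest-scoring frames of a video. It assumes per-frame saliency scores are already available; the function name `select_salient_frames`, the top-k selection rule, and the score values are hypothetical and are not taken from the paper, which may sample frames differently.

```python
from typing import List, Sequence


def select_salient_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring frames, in temporal order.

    Illustrative only: top-k selection is an assumption, not NSNet's sampler.
    """
    if k <= 0:
        return []
    k = min(k, len(scores))
    # Rank frame indices by descending score; ties keep the earlier frame first.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Keep the top k and restore temporal order for the downstream classifier.
    return sorted(ranked[:k])


# Hypothetical saliency scores for an 8-frame video (made-up values).
scores = [0.05, 0.10, 0.90, 0.85, 0.20, 0.02, 0.70, 0.15]
selected = select_salient_frames(scores, k=3)
print(selected)  # [2, 3, 6]
```

In this toy setting, a heavy recognition model would run on 3 frames instead of 8, which is the kind of saving that adaptive frame sampling targets.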
Findings
Runs inference 2.4 to 4.3 times faster than prior methods.
Outperforms existing methods on four benchmarks.
Balances accuracy and efficiency effectively.
Abstract
It is challenging for artificial intelligence systems to achieve accurate video recognition under the scenario of low computation costs. Adaptive inference based efficient video recognition methods typically preview videos and focus on salient parts to reduce computation costs. Most existing works focus on complex networks learning with video classification based objectives. Taking all frames as positive samples, few of them pay attention to the discrimination between positive samples (salient frames) and negative samples (non-salient frames) in supervisions. To fill this gap, in this paper, we propose a novel Non-saliency Suppression Network (NSNet), which effectively suppresses the responses of non-salient frames. Specifically, on the frame level, effective pseudo labels that can distinguish between salient and non-salient frames are generated to guide the frame saliency learning. On…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image Fusion Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
