TL;DR
This paper introduces BaS-Net (Background Suppression Network) for weakly-supervised temporal action localization. It separates background frames from action frames by adding an auxiliary background class and training a two-branch, weight-sharing architecture.
Contribution
It proposes a two-branch, weight-sharing network with an auxiliary background class, trained with an asymmetrical strategy so that activations from background frames are suppressed during weakly-supervised action localization.
Findings
Outperforms previous state-of-the-art methods on the THUMOS'14 and ActivityNet benchmarks.
Suppresses activations from background frames, which improves localization performance.
The improvement holds on both benchmarks rather than on a single dataset.
Abstract
Weakly-supervised temporal action localization is a very challenging problem because frame-wise labels are not given in the training stage while the only hint is video-level labels: whether each video contains action frames of interest. Previous methods aggregate frame-level class scores to produce video-level prediction and learn from video-level action labels. This formulation does not fully model the problem in that background frames are forced to be misclassified as action classes to predict video-level labels accurately. In this paper, we design Background Suppression Network (BaS-Net) which introduces an auxiliary class for background and has a two-branch weight-sharing architecture with an asymmetrical training strategy. This enables BaS-Net to suppress activations from background frames to improve localization performance. Extensive experiments demonstrate the effectiveness of…
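The formulation in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not say how frame scores are pooled, how the second branch differs from the first, or what labels each branch receives. The sketch assumes top-k mean pooling, a per-frame foreground weight (`fg_attention`) applied to the features in the second branch, and targets that set the background class to 1 for the first branch and 0 for the second. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def video_level_scores(frame_scores, k):
    """Pool frame-level scores of shape (T, C + 1) into video-level probabilities.

    Assumption: the k highest scores of each class are averaged over time.
    """
    top_k = np.sort(frame_scores, axis=0)[-k:]  # (k, C + 1)
    return softmax(top_k.mean(axis=0))


def asymmetric_targets(action_labels):
    """Build one target per branch; the last slot is the auxiliary background class.

    Assumption: the first branch is told that background is present (1),
    the second branch is told that it is absent (0).
    """
    base = np.append(action_labels, 1.0)
    supp = np.append(action_labels, 0.0)
    return base / base.sum(), supp / max(supp.sum(), 1e-8)


def two_branch_loss(features, weights, fg_attention, action_labels, k):
    """Cross-entropy summed over two branches that share one classifier.

    features:      (T, D) frame features
    weights:       (D, C + 1) classifier shared by both branches
    fg_attention:  (T,) hypothetical foreground weights in [0, 1]
    """
    base_scores = features @ weights
    # Second branch: same classifier, but frames are down-weighted first.
    supp_scores = (fg_attention[:, None] * features) @ weights

    p_base = video_level_scores(base_scores, k)
    p_supp = video_level_scores(supp_scores, k)
    y_base, y_supp = asymmetric_targets(action_labels)

    def cross_entropy(y, p):
        return float(-(y * np.log(p + 1e-8)).sum())

    return cross_entropy(y_base, p_base) + cross_entropy(y_supp, p_supp)


# Toy usage: 20 frames, 8-dim features, 3 action classes plus background.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(20, 8))
toy_weights = rng.normal(size=(8, 4))
toy_attention = rng.uniform(size=20)
toy_labels = np.array([1.0, 0.0, 0.0])
toy_loss = two_branch_loss(toy_features, toy_weights, toy_attention, toy_labels, k=5)
```

Because both branches use the same classifier but receive opposite background targets, the only way to satisfy both in this sketch is for the foreground weights to shrink on background frames, which is the suppression effect the abstract describes.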
