Background-Click Supervision for Temporal Action Localization
Le Yang, Junwei Han, Tao Zhao, Tianwei Lin, Dingwen Zhang, Jianxin Chen

TL;DR
This paper introduces BackTAL, a background-click supervision method for weakly supervised temporal action localization. It trains the localizer with click-level labels on background frames rather than action frames, targeting the background errors that limit existing approaches.
Contribution
The paper proposes background-click supervision and a two-fold modeling of the background frames to improve localization accuracy over existing weakly supervised methods.
Findings
BackTAL outperforms previous weakly supervised methods on three benchmarks.
Background-click supervision effectively reduces background errors in localization.
The proposed modules improve the distinction between action and background frames.
Abstract
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion. To overcome this challenge, one recent work builds an action-click supervision framework. It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods. In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames. To this end, we convert the action-click supervision to the background-click supervision and develop a novel method, called BackTAL. Specifically, BackTAL implements two-fold modeling on the background…
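To make the idea of click-level background labels concrete, here is a minimal sketch of how such labels could supervise a per-frame actionness score. It is an illustration under stated assumptions, not the BackTAL implementation: the function name `background_click_loss`, the binary cross-entropy form, and the toy scores are all hypothetical and do not come from the paper.

```python
import math

def background_click_loss(actionness, background_clicks, eps=1e-8):
    """Mean binary cross-entropy over clicked background frames only.

    actionness: per-frame scores in [0, 1] (1 = action, 0 = background).
    background_clicks: indices of frames an annotator marked as background.
    Frames without a click contribute nothing to this term.
    (Hypothetical helper for illustration; not the paper's loss.)
    """
    if not background_clicks:
        return 0.0
    total = 0.0
    for t in background_clicks:
        # The target for a clicked frame is 0 (background),
        # so the cross-entropy reduces to -log(1 - p).
        total += -math.log(max(1.0 - actionness[t], eps))
    return total / len(background_clicks)

# Toy per-frame actionness scores for a five-frame video (made-up values).
scores = [0.9, 0.8, 0.2, 0.1, 0.7]

# A click on a frame the model already scores as background gives a small loss.
loss_good = background_click_loss(scores, [3])

# A click on a frame the model wrongly scores as action gives a large loss.
loss_bad = background_click_loss(scores, [0])
```

In this sketch, a clicked frame that the model mistakes for action is penalized heavily, which is the kind of background error the abstract identifies as the performance bottleneck.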
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Multimodal Machine Learning Applications
