Weakly Supervised Action Selection Learning in Video
Junwei Ma, Satya Krishna Gorti, Maksims Volkovs, Guangwei Yu

TL;DR
This paper introduces Action Selection Learning (ASL), a novel weakly supervised approach that improves video action localization by capturing the concept of 'actionness' and outperforms existing methods on popular benchmarks.
Contribution
The paper proposes ASL, a class-agnostic training method that enhances weakly supervised action localization by modeling 'actionness', reducing class bias in frame selection.
Findings
ASL outperforms baselines on THUMOS-14 and ActivityNet-1.2.
ASL achieves relative improvements of 10.3% on THUMOS-14 and 5.7% on ActivityNet-1.2.
Actionness is crucial for effective weakly supervised localization.
Abstract
Localizing actions in video is a core task in computer vision. The weakly supervised temporal localization problem investigates whether this task can be adequately solved with only video-level labels, significantly reducing the amount of expensive and error-prone annotation that is required. A common approach is to train a frame-level classifier where frames with the highest class probability are selected to make a video-level prediction. Frame-level activations are then used for localization. However, the absence of frame-level annotations causes the classifier to impart class bias on every frame. To address this, we propose the Action Selection Learning (ASL) approach to capture the general concept of action, a property we refer to as "actionness". Under ASL, the model is trained with a novel class-agnostic task to predict which frames will be selected by the classifier. Empirically,…
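The selection mechanism described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the choice of averaging the top-k frame logits, the rule that a frame counts as "selected" when it is in the top-k for any class present in the video, and the plain binary cross-entropy loss are all assumptions made for illustration, and the toy logits are invented.

```python
import numpy as np

def topk_video_scores(frame_logits, k):
    """Pool frame-level logits of shape (T, C) into video-level scores of shape (C,).

    Assumption: the video-level score for a class is the mean of its k
    highest frame logits. Also returns the selected frame indices, shape (k, C).
    """
    topk_idx = np.argsort(-frame_logits, axis=0)[:k]
    topk_vals = np.take_along_axis(frame_logits, topk_idx, axis=0)
    return topk_vals.mean(axis=0), topk_idx

def selection_targets(topk_idx, video_labels, num_frames):
    """Class-agnostic binary target per frame.

    Assumption: a frame is labeled 1 if the classifier selected it (top-k)
    for any class that the video-level label marks as present.
    """
    target = np.zeros(num_frames)
    for c in np.flatnonzero(video_labels):
        target[topk_idx[:, c]] = 1.0
    return target

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy; a stand-in for whatever loss the paper uses."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)).mean())

# Toy example (hypothetical values): 5 frames, 2 classes, only class 0 present.
frame_logits = np.array([
    [0.1, 2.0],
    [3.0, 0.2],
    [2.5, 0.1],
    [0.0, 1.5],
    [0.2, 0.3],
])
video_labels = np.array([1, 0])

video_scores, topk_idx = topk_video_scores(frame_logits, k=2)
target = selection_targets(topk_idx, video_labels, num_frames=frame_logits.shape[0])

# An "actionness" head would be trained to predict `target` for each frame;
# here a perfect prediction is scored to show the loss goes to (almost) zero.
loss = binary_cross_entropy(target.copy(), target)
```

In this toy case frames 1 and 2 are selected for the present class, so they become the positive targets for the class-agnostic prediction task, regardless of which action class caused the selection.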
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
