TL;DR
This paper introduces a weakly-supervised approach for spatiotemporal action localization in videos using actor proposals and an attention mechanism, achieving state-of-the-art results with only video class labels.
Contribution
It proposes an actor-supervised architecture with actor proposals and attention, enabling effective action localization with minimal supervision.
Findings
Achieves state-of-the-art weakly-supervised localization performance.
Competitive with some fully-supervised methods.
Effective on multiple action datasets.
Abstract
This paper addresses the problem of spatiotemporal localization of actions in videos. Compared to leading approaches, which all learn to localize based on carefully annotated boxes on training video frames, we adhere to a weakly-supervised solution that only requires a video class label. We introduce an actor-supervised architecture that exploits the inherent compositionality of actions in terms of actor transformations to localize actions. We make two contributions. First, we propose actor proposals derived from a detector for human and non-human actors intended for images, which are linked over time by Siamese similarity matching to account for actor deformations. Second, we propose an actor-based attention mechanism that enables the localization of the actions from action class labels and actor proposals and is end-to-end trainable. Experiments on three human and non-human action…
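The two ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the sketch assumes plain cosine similarity as a stand-in for the learned Siamese similarity, greedy frame-to-frame linking, and softmax attention pooling over proposals. All function names (`cosine`, `link_tube`, `attend`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def link_tube(frames, start=0):
    """Greedily link one detection per frame into an actor tube.

    frames: list of frames, each a list of detection feature vectors.
    Assumption: cosine similarity replaces the learned Siamese matcher.
    Returns the index of the chosen detection in each frame.
    """
    tube = [start]
    current = frames[0][start]
    for detections in frames[1:]:
        # Pick the detection most similar to the one chosen in the previous frame.
        best = max(range(len(detections)),
                   key=lambda i: cosine(current, detections[i]))
        tube.append(best)
        current = detections[best]
    return tube

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(attention_logits, class_scores):
    """Pool per-proposal class scores into one video-level score vector.

    attention_logits: one scalar per actor proposal.
    class_scores: one list of per-class scores per actor proposal.
    Returns (video_scores, weights); the weights rank the proposals.
    """
    weights = softmax(attention_logits)
    num_classes = len(class_scores[0])
    video_scores = [
        sum(w * scores[c] for w, scores in zip(weights, class_scores))
        for c in range(num_classes)
    ]
    return video_scores, weights
```

In a weakly-supervised setup of this kind, only the pooled video-level scores would be compared against the video class label during training, while the attention weights indicate which actor proposal is taken as the localization.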
