Pointly-Supervised Action Localization
Pascal Mettes, Cees G. M. Snoek

TL;DR
This paper introduces a point-supervised approach to spatio-temporal action localization in videos. It cuts annotation cost by replacing per-frame bounding boxes with sparse point annotations, reaches accuracy comparable to box-supervision, and remains robust to sparse and noisy points.
Contribution
It proposes a novel point-supervised training method leveraging spatio-temporal proposals and pseudo-points, offering an effective alternative to box-supervision for action localization.
Findings
Achieves comparable accuracy to box-supervision with fewer annotations
Robust to sparse and noisy point annotations
Outperforms recent weakly-supervised methods
Abstract
This paper strives for spatio-temporal localization of human actions in videos. In the literature, the consensus is to achieve localization by training on bounding box annotations provided for each frame of each training video. As annotating boxes in video is expensive, cumbersome and error-prone, we propose to bypass box-supervision. Instead, we introduce action localization based on point-supervision. We start from unsupervised spatio-temporal proposals, which provide a set of candidate regions in videos. While normally used exclusively for inference, we show spatio-temporal proposals can also be leveraged during training when guided by a sparse set of point annotations. We introduce an overlap measure between points and spatio-temporal proposals and incorporate them all into a new objective of a Multiple Instance Learning optimization. During inference, we introduce pseudo-points,…
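To make the idea of an overlap measure between points and spatio-temporal proposals concrete, here is a minimal sketch. It is not the measure defined in the paper, whose exact form is not given in the abstract. It assumes a proposal is a mapping from frame index to a box `(x_min, y_min, x_max, y_max)`, that point annotations are a mapping from frame index to `(x, y)`, and that overlap is simply the fraction of annotated points falling inside the proposal's box in the same frame. The names `point_in_box` and `point_proposal_overlap` are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout
Point = Tuple[float, float]              # (x, y)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (borders included)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def point_proposal_overlap(points: Dict[int, Point],
                           proposal: Dict[int, Box]) -> float:
    """Illustrative overlap: fraction of annotated points that fall inside
    the proposal's box in the same frame. An assumption for illustration,
    not the paper's definition."""
    if not points:
        return 0.0
    hits = sum(
        1
        for frame, point in points.items()
        if frame in proposal and point_in_box(point, proposal[frame])
    )
    return hits / len(points)
```

Under these assumptions, a proposal covering two of three annotated points scores 2/3, and a training procedure could rank candidate proposals in a video by this score. A practical measure would likely also account for proposal size, since a box spanning the whole frame trivially contains every point.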
