TL;DR
This paper introduces a weakly-supervised, few-shot action localization method that learns to localize actions in untrimmed videos from one or a few examples of the target action and video-level class labels, without needing boundary annotations.
Contribution
It proposes the first end-to-end trainable network for weakly-supervised, few-shot action localization using Temporal Similarity Matrices and Class Activation Maps.
Findings
Achieves performance comparable to or better than fully-supervised methods.
Localizes actions effectively from only one or a few examples.
Works on untrimmed videos with minimal annotation.
Abstract
Learning to localize actions in long, cluttered, and untrimmed videos is a hard task that the literature has typically addressed by assuming the availability of large amounts of annotated training samples for each class, either in a fully-supervised setting, where action boundaries are known, or in a weakly-supervised setting, where only class labels are known for each video. In this paper, we go a step further and show that it is possible to learn to localize actions in untrimmed videos when a) only one or a few trimmed examples of the target action are available at test time, and b) a large collection of videos with only class label annotation (some trimmed and some weakly annotated untrimmed ones) is available for training, with no overlap between the classes used during training and testing. To do so, we propose a network that learns to estimate Temporal Similarity Matrices…
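To make the idea of a temporal similarity matrix concrete, here is a minimal sketch. It is not the paper's method: the paper's network learns to estimate these matrices, and the abstract above is truncated before the details. The sketch instead assumes fixed per-frame feature vectors and plain cosine similarity, and every name in it (`cosine`, `temporal_similarity_matrix`, the toy features) is hypothetical.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors (eps avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def temporal_similarity_matrix(query_feats, video_feats):
    """Entry [i][t] compares query frame i with untrimmed-video frame t."""
    return [[cosine(q, v) for v in video_feats] for q in query_feats]

# Toy data (assumed): a 2-frame trimmed example and a 4-frame untrimmed video
# in which frames 1 and 2 match the example and frames 0 and 3 are background.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
video = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

tsm = temporal_similarity_matrix(query, video)

# Collapse over query frames to get one score per video frame; high scores
# mark frames that resemble some part of the example.
frame_scores = [max(row[t] for row in tsm) for t in range(len(video))]
```

In this toy case the per-frame scores are high only for the two frames copied from the example, which is the kind of signal a localization step could threshold. How the paper actually turns its learned matrices into localizations is not stated in the text available here.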
