Few-Shot Transformation of Common Actions into Time and Space

Pengwan Yang; Pascal Mettes; Cees G. M. Snoek

arXiv:2104.02439·cs.CV·April 7, 2021

TL;DR

This paper proposes a few-shot transformer that localizes a common action in a long untrimmed query video, in both time and space, given only a few trimmed support videos and no class labels, interval bounds, or bounding boxes. The method remains effective when support videos are noisy and extends to pixel-level localization.

Contribution

Introduces a few-shot transformer architecture with a dedicated encoder-decoder structure for joint commonality learning and spatio-temporal localization prediction, requiring neither proposals nor class labels.

Findings

01. Effective localization on AVA and UCF101-24 datasets

02. Performs well even with noisy support videos

03. Extensible to pixel-level localization

Abstract

This paper introduces the task of few-shot common action localization in time and space. Given a few trimmed support videos containing the same but unknown action, we strive for spatio-temporal localization of that action in a long untrimmed query video. We do not require any class labels, interval bounds, or bounding boxes. To address this challenging task, we introduce a novel few-shot transformer architecture with a dedicated encoder-decoder structure optimized for joint commonality learning and localization prediction, without the need for proposals. Experiments on our reorganizations of the AVA and UCF101-24 datasets show the effectiveness of our approach for few-shot common action localization, even when the support videos are noisy. Although our approach is not specifically designed for common localization in time only, we also compare favorably against the few-shot and one-shot…
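To make the task setup concrete, the following is a minimal sketch of the general idea of commonality-based localization. It is not the authors' architecture. It assumes that each support clip and each query time step has already been reduced to a feature vector, and it replaces the encoder-decoder transformer with a single scaled dot-product cross-attention step followed by thresholding. It covers the temporal dimension only. All function names (`cross_attend`, `commonality_scores`, `intervals_above`) and the threshold are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query_feats, support_feats):
    """For each query feature, return the attention-weighted mean of the support features."""
    d = len(support_feats[0])
    scale = math.sqrt(d)
    attended = []
    for q in query_feats:
        weights = softmax([dot(q, s) / scale for s in support_feats])
        attended.append(
            [sum(w * s[i] for w, s in zip(weights, support_feats)) for i in range(d)]
        )
    return attended


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def commonality_scores(query_feats, support_feats):
    """Score each query time step by its similarity to its attended support summary."""
    attended = cross_attend(query_feats, support_feats)
    return [cosine(q, a) for q, a in zip(query_feats, attended)]


def intervals_above(scores, threshold):
    """Group consecutive time steps with score >= threshold into inclusive (start, end) pairs."""
    intervals, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, len(scores) - 1))
    return intervals
```

Calling `intervals_above(commonality_scores(query_feats, support_feats), threshold)` yields candidate temporal segments of the query video that resemble the support clips. Note that no class label enters the computation at any point, which mirrors the label-free setting described in the abstract. Spatial (bounding-box or pixel-level) localization would require per-region features and is outside this sketch.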

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Multimodal Machine Learning Applications