TL;DR
TA2N introduces a two-stage alignment approach for few-shot action recognition, addressing temporal and spatial misalignments to improve matching accuracy and achieve state-of-the-art results.
Contribution
The paper proposes a Two-stage Action Alignment Network (TA2N) for few-shot action recognition that sequentially corrects two kinds of misalignment between query and support videos: action duration misalignment and action evolution misalignment.
Findings
Achieves state-of-the-art performance on benchmark datasets.
Effectively aligns action duration and evolution across videos.
Demonstrates robustness to temporal and spatial variations.
Abstract
Few-shot action recognition aims to recognize novel action classes (query) using just a few samples (support). The majority of current approaches follow the metric learning paradigm, which learns to compare the similarity between videos. Recently, it has been observed that directly measuring this similarity is not ideal, since different action instances may show distinctive temporal distributions, resulting in severe misalignment between query and support videos. In this paper, we tackle this problem from two distinct aspects: action duration misalignment and action evolution misalignment. We address them sequentially through a Two-stage Action Alignment Network (TA2N). The first stage locates the action by learning a temporal affine transform, which warps each video feature to its action duration while dismissing action-irrelevant features (e.g. background). Next, the second…
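To make the first-stage idea concrete, here is a minimal sketch of a temporal affine warp applied to per-frame features. It is an illustration under stated assumptions, not the authors' implementation: the function name `temporal_affine_warp`, the two-parameter form (`scale`, `shift`), the normalized time axis in [-1, 1], linear interpolation, and boundary clamping are all choices made here for clarity. In TA2N the transform is learned; in this sketch the parameters are simply passed in.

```python
import numpy as np

def temporal_affine_warp(features, scale, shift):
    """Resample per-frame features along time with an affine map.

    features: array of shape (T, D), one D-dim feature per frame.
    scale, shift: hypothetical affine parameters on a normalized
        time axis in [-1, 1]; output position t reads from source
        position scale * t + shift.
    Returns an array of shape (T, D).
    """
    T = features.shape[0]
    # Output sampling grid on the normalized time axis.
    grid = np.linspace(-1.0, 1.0, T)
    # Affine map from output positions to source positions.
    src = scale * grid + shift
    # Convert to fractional frame indices and clamp to the clip.
    pos = np.clip((src + 1.0) * 0.5 * (T - 1), 0.0, T - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (pos - lo)[:, None]
    # Linear interpolation between neighbouring frames.
    return (1.0 - w) * features[lo] + w * features[hi]

# Toy input: 9 frames whose single feature equals the frame index.
feats = np.arange(9, dtype=float)[:, None]
# scale=0.5, shift=0 keeps the middle half of the clip (frames 2..6)
# and stretches it over all 9 output steps.
warped = temporal_affine_warp(feats, scale=0.5, shift=0.0)
```

With `scale < 1` the warp zooms into a sub-interval of the clip, which is one way to picture how frames outside the action (for example background at the start or end) could be left out of the warped feature.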
