Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition
Fei Guo, Li Zhu, YiWang Wang, Jing Sun

TL;DR
This paper introduces TSA-MLT, an end-to-end framework for few-shot action recognition that filters action-irrelevant frames and combines multi-level features with a fusion loss, achieving state-of-the-art results on HMDB51 and UCF101 and competitive results on Kinetics and Something-Something V2.
Contribution
The paper proposes TSA-MLT, a novel method combining task-specific frame filtering and multi-level feature fusion with a fusion loss for improved few-shot video action recognition.
Findings
Achieves state-of-the-art results on HMDB51 and UCF101 datasets.
Demonstrates competitive performance on Kinetics and Something-Something V2 datasets.
Effectively filters irrelevant frames and exploits multi-level features for better recognition.
Abstract
In few-shot learning research, the main difference between image-based and video-based tasks is the additional temporal dimension. In recent years, some works have used the Transformer to process frames and obtain attention features and enhanced prototypes, with competitive results. However, some video frames may relate little to the action, and using only single frame-level or segment-level features may not mine enough information. We address these problems sequentially through an end-to-end method named "Task-Specific Alignment and Multiple-level Transformer Network (TSA-MLT)". The first module (TSA) aims at filtering the action-irrelevant frames for action duration alignment. An affine transformation of the frame sequence along the time dimension is used for linear sampling. The second module (MLT) focuses on the multiple-level features of the support prototype and query…
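The abstract's description of TSA (an affine transformation along the time dimension followed by linear sampling) can be illustrated with a small sketch. This is not the authors' implementation: the function name `temporal_affine_sample`, the normalized time grid on [-1, 1], the clipping at the sequence boundaries, and the hand-chosen `scale` and `shift` values are all assumptions made for illustration, and the abstract does not say how the method obtains the affine parameters.

```python
import numpy as np

def temporal_affine_sample(frames, scale, shift, num_out):
    """Resample a (T, D) sequence of per-frame features along time.

    Hypothetical helper: output positions lie on a normalized grid in
    [-1, 1], are mapped through t' = scale * t + shift, and are then read
    from the input sequence by linear interpolation between neighbours.
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_in = frames.shape[0]

    grid = np.linspace(-1.0, 1.0, num_out)      # normalized output times
    src = np.clip(scale * grid + shift, -1.0, 1.0)  # affine map, kept in range
    pos = (src + 1.0) * 0.5 * (num_in - 1)      # fractional frame indices

    lo = np.floor(pos).astype(int)              # left neighbour
    hi = np.minimum(lo + 1, num_in - 1)         # right neighbour
    w = (pos - lo)[:, None]                     # interpolation weight
    return (1.0 - w) * frames[lo] + w * frames[hi]

# Toy input: 8 frames with a 1-D feature equal to the frame index.
feats = np.arange(8, dtype=float).reshape(8, 1)
# scale < 1 narrows the sampled window to the middle of the clip,
# which is one way to drop leading and trailing frames.
middle = temporal_affine_sample(feats, scale=0.5, shift=0.0, num_out=4)
```

With `scale=1.0` and `shift=0.0` the sketch reproduces the input when `num_out` equals the number of frames; a smaller `scale` shortens the sampled window and a non-zero `shift` moves it earlier or later in the clip.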
Taxonomy
Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Residual Connection
