On the Importance of Spatial Relations for Few-shot Action Recognition

Yilun Zhang; Yuqian Fu; Xingjun Ma; Lizhe Qi; Jingjing Chen; Zuxuan Wu; Yu-Gang Jiang

arXiv:2308.07119·cs.CV·August 15, 2023

Open Access

TL;DR

This paper emphasizes the significance of spatial relations in few-shot action recognition and introduces a novel Spatial Alignment Cross Transformer (SA-CT) that effectively leverages spatial and temporal information, achieving competitive results.

Contribution

The paper proposes a new model, SA-CT, that focuses on spatial relations and can additionally incorporate temporal information through a Temporal Mixer, advancing few-shot action recognition methods.

Findings

01. SA-CT performs comparably to temporal-based methods without using any temporal information.

02. Adding the Temporal Mixer improves video representation and overall accuracy.

03. Large-scale pretrained models enhance few-shot action recognition performance.

Abstract

Deep learning has achieved great success in video recognition, yet still struggles to recognize novel actions when faced with only a few examples. To tackle this challenge, few-shot action recognition methods have been proposed to transfer knowledge from a source dataset to a novel target dataset with only one or a few labeled videos. However, existing methods mainly focus on modeling the temporal relations between the query and support videos while ignoring the spatial relations. In this paper, we find that spatial misalignment between objects also occurs in videos and is notably more common than temporal inconsistency. We are thus motivated to investigate the importance of spatial relations and propose a more accurate few-shot action recognition method that leverages both spatial and temporal information. Particularly, a novel Spatial Alignment Cross Transformer (SA-CT) which learns…

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Layer Normalization · Adam · Softmax · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection