TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition
Mina Bishay, Georgios Zoumpourlis, Ioannis Patras

TL;DR
TARN is a neural network that combines temporal attention with meta-learning for few-shot and zero-shot action recognition. It compares representations of variable temporal length (two videos, or a video and a semantic representation such as a word vector) without fine-tuning in the target domain.
Contribution
Introduces a temporal attentive relation network that uses attention mechanisms to temporally align two representations and learns a deep-distance measure on the aligned representations at video segment level, for few-shot and zero-shot action recognition.
Findings
Outperforms state-of-the-art in few-shot action recognition
Achieves competitive results in zero-shot action recognition
Does not require fine-tuning or additional memory representations
Abstract
In this paper we propose a novel Temporal Attentive Relation Network (TARN) for the problems of few-shot and zero-shot action recognition. At the heart of our network is a meta-learning approach that learns to compare representations of variable temporal length, that is, either two videos of different lengths (in the case of few-shot action recognition) or a video and a semantic representation such as a word vector (in the case of zero-shot action recognition). In contrast to other works in few-shot and zero-shot action recognition, we a) utilise attention mechanisms so as to perform temporal alignment, and b) learn a deep-distance measure on the aligned representations at video segment level. We adopt an episode-based training scheme and train our network in an end-to-end manner. The proposed method does not require any fine-tuning in the target domain or maintaining additional…
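The two steps named in the abstract, attention-based temporal alignment followed by a segment-level comparison, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the dot-product softmax attention, the helper names (`align`, `relation_score`, `classify`), and the toy 2-dimensional segment features. In particular, TARN learns its distance measure with a network trained episodically, whereas this sketch substitutes a fixed cosine similarity as a stand-in.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def align(query, sample):
    """Temporal alignment (assumed form): for each query segment, build an
    attention-weighted mixture of the sample's segments. The output has one
    vector per query segment, whatever the sample's length."""
    dim = len(sample[0])
    aligned = []
    for q in query:
        weights = softmax([dot(q, s) for s in sample])
        aligned.append([sum(w * s[d] for w, s in zip(weights, sample))
                        for d in range(dim)])
    return aligned

def relation_score(query, sample):
    """Stand-in for the learned deep-distance measure: mean cosine similarity
    between each query segment and its aligned counterpart."""
    aligned = align(query, sample)
    sims = []
    for q, a in zip(query, aligned):
        denom = math.sqrt(dot(q, q)) * math.sqrt(dot(a, a))
        sims.append(dot(q, a) / denom if denom > 0 else 0.0)
    return sum(sims) / len(sims)

def classify(query, support):
    """Pick the support class whose example scores highest against the query."""
    return max(support, key=lambda label: relation_score(query, support[label]))

# Toy one-shot episode with made-up features; videos differ in segment count.
query = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2]]
support = {
    "jump": [[1.0, 0.1], [0.8, 0.0]],
    "run": [[0.0, 1.0], [0.1, 0.9], [0.0, 0.8], [0.2, 1.0]],
}
predicted = classify(query, support)
```

Because alignment produces one vector per query segment, videos with different numbers of segments can be compared directly. In the zero-shot setting described above, the `sample` argument would hold a semantic representation rather than video segments; how that representation is embedded is not specified in this summary.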
Taxonomy
Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
