TL;DR
The paper introduces Action Transformer (AcT), a fully self-attentional model for real-time human action recognition from 2D pose data that outperforms more elaborate architectures, and it releases MPOSE2021 as a new benchmark dataset.
Contribution
It presents a novel, fully self-attentional architecture for human action recognition (HAR) and introduces MPOSE2021, a large-scale dataset for benchmarking real-time action recognition.
Findings
Action Transformer outperforms existing models on MPOSE2021.
The approach achieves low latency and high accuracy in real-time HAR.
The dataset facilitates standardized evaluation for short-time HAR.
Abstract
Deep neural networks based purely on attention have been successful across several domains, relying on minimal architectural priors from the designer. In Human Action Recognition (HAR), attention mechanisms have been primarily adopted on top of standard convolutional or recurrent layers, improving the overall generalization capability. In this work, we introduce Action Transformer (AcT), a simple, fully self-attentional architecture that consistently outperforms more elaborated networks that mix convolutional, recurrent and attentive layers. In order to limit computational and energy requests, building on previous human action recognition research, the proposed approach exploits 2D pose representations over small temporal windows, providing a low latency solution for accurate and effective real-time performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as an…
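As a rough illustration of the idea described in the abstract (self-attention applied to 2D poses over a short temporal window), the sketch below treats each frame's flattened keypoints as one token and applies single-head scaled dot-product self-attention. It is not the authors' AcT implementation: the learned query/key/value projections, multiple heads, position encodings, and classification head are all omitted, mean pooling is an assumption made here for simplicity, and the function names and toy window are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_frame(keypoints):
    """Flatten one frame of (x, y) keypoints into a single token vector."""
    return [coord for point in keypoints for coord in point]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), whereas a trained model would learn them.
    Returns the attended tokens and the attention weight matrix.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended, weights = [], []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        attended.append(
            [sum(w * value[i] for w, value in zip(row, tokens)) for i in range(dim)]
        )
    return attended, weights


def mean_pool(tokens):
    """Average tokens over time to get one vector per window (assumed pooling)."""
    count = len(tokens)
    return [sum(tok[i] for tok in tokens) / count for i in range(len(tokens[0]))]


# Hypothetical toy window: 3 frames, 2 keypoints per frame, normalized (x, y).
window = [
    [(0.10, 0.20), (0.30, 0.40)],
    [(0.12, 0.22), (0.33, 0.41)],
    [(0.15, 0.25), (0.36, 0.43)],
]

tokens = [flatten_frame(frame) for frame in window]
attended, weights = self_attention(tokens)
clip_embedding = mean_pool(attended)
```

Each row of `weights` describes how strongly one frame attends to every frame in the window; in a full model, `clip_embedding` (or a dedicated class token) would feed a classifier over action labels.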
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Layer Normalization · Dropout · Multi-Head Attention · Label Smoothing
