SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
Jeonghyeok Do, Munchurl Kim

TL;DR
SkateFormer introduces a skeletal-temporal transformer that partitions joints and frames to efficiently capture key relations for improved human action recognition from skeleton data.
Contribution
It proposes a novel partition-based self-attention mechanism that models skeletal and temporal relations efficiently, surpassing existing GCN and transformer methods.
Findings
Outperforms recent state-of-the-art methods on benchmark datasets.
Effectively captures key skeletal-temporal relations with partitioned attention.
Reduces computational resources compared to full self-attention models.
Abstract
Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relation (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHuman Pose and Action Recognition · Gait Recognition and Analysis · Anomaly Detection Techniques and Applications
MethodsAttention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Layer Normalization · Absolute Position Encodings · Residual Connection · Softmax · Linear Layer · Multi-Head Attention · EfficientNet
