ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning
Aman Anand, Amir Eskandari, Elyas Rashno, Farhana Zulkernine

TL;DR
This paper introduces ASMa, a novel self-supervised learning method for skeleton action recognition that uses asymmetric spatio-temporal masking strategies and a feature alignment module, resulting in improved accuracy and efficiency, especially on edge devices.
Contribution
The paper proposes a new asymmetric spatio-temporal masking approach with a learnable alignment module and model compression, advancing skeleton action representation learning.
Findings
Outperforms existing SSL methods by 2.7-4.4% in fine-tuning accuracy.
Achieves up to 5.9% improvement in transfer learning on noisy datasets.
Reduces model size by 91.4% and triples inference speed on edge devices.
Abstract
Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints, such as joints with degree 3 or 4. This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies for learning the full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. These masking strategies ensure a…
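The two complementary masking strategies described in the abstract can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (joint_degrees, motion_per_frame, asymmetric_masks), the masking ratios, the definition of per-frame motion as mean joint displacement between consecutive frames, and the deterministic top-k selection are all hypothetical choices made here for clarity. The paper may well sample masked joints and frames differently.

```python
import numpy as np

def joint_degrees(edges, num_joints):
    """Degree of each joint in the skeleton graph (number of attached bones)."""
    deg = np.zeros(num_joints, dtype=int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def motion_per_frame(x):
    """Assumed motion score: mean joint displacement between consecutive frames.

    x has shape (T, J, C) with T >= 2. The first frame reuses the score of the
    first transition so the result has length T.
    """
    disp = np.linalg.norm(np.diff(x, axis=0), axis=-1).mean(axis=-1)  # (T-1,)
    return np.concatenate([disp[:1], disp])

def asymmetric_masks(x, edges, joint_ratio=0.4, frame_ratio=0.4):
    """Build two masked views of a skeleton sequence (True in a mask = kept).

    View A hides high-degree joints and low-motion frames.
    View B hides low-degree joints and high-motion frames.
    Ratios and top-k selection are illustrative assumptions.
    """
    T, J, _ = x.shape
    k_j = max(1, int(round(joint_ratio * J)))
    k_t = max(1, int(round(frame_ratio * T)))

    # Ascending order: lowest degree / lowest motion first.
    joints_by_deg = np.argsort(joint_degrees(edges, J), kind="stable")
    frames_by_motion = np.argsort(motion_per_frame(x), kind="stable")

    mask_a = np.ones((T, J), dtype=bool)
    mask_a[:, joints_by_deg[-k_j:]] = False      # high-degree joints
    mask_a[frames_by_motion[:k_t], :] = False    # low-motion frames

    mask_b = np.ones((T, J), dtype=bool)
    mask_b[:, joints_by_deg[:k_j]] = False       # low-degree joints
    mask_b[frames_by_motion[-k_t:], :] = False   # high-motion frames

    # Masked entries are zeroed out; a learnable mask token is another option.
    view_a = x * mask_a[..., None]
    view_b = x * mask_b[..., None]
    return view_a, view_b, mask_a, mask_b
```

In an SSL pipeline, the two views would typically be fed to the encoder as complementary augmentations of the same sequence, so that the model sees both the regions favored by conventional masking and the ones it tends to ignore.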
Taxonomy
Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Context-Aware Activity Recognition Systems
