ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning

Aman Anand; Amir Eskandari; Elyas Rahsno; Farhana Zulkernine

arXiv:2602.06251·cs.CV·February 9, 2026

Open Access

TL;DR

This paper introduces ASMa, a novel self-supervised learning method for skeleton action recognition that uses asymmetric spatio-temporal masking strategies and a feature alignment module, resulting in improved accuracy and efficiency, especially on edge devices.

Contribution

The paper proposes a new asymmetric spatio-temporal masking approach with a learnable alignment module and model compression, advancing skeleton action representation learning.

Findings

01. Outperforms existing SSL methods by 2.7-4.4% in fine-tuning accuracy.

02. Achieves up to 5.9% improvement in transfer learning on noisy datasets.

03. Reduces model size by 91.4% and triples inference speed on edge devices.

Abstract

Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints, such as joints with degree 3 or 4. This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies that learns the full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. These masking strategies ensure a…
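The two complementary masks described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the top-k selection rule, the sum-of-absolute-displacement motion score, the zero fill value, and the toy 5-joint skeleton are all assumptions made for illustration, since this listing does not give the selection procedure.

```python
from collections import Counter


def joint_degrees(num_joints, edges):
    """Degree of each joint in the skeleton graph (number of incident bones)."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return [counts[j] for j in range(num_joints)]


def frame_motion(seq):
    """Per-frame motion score (assumed definition, needs at least 2 frames).

    seq[t][j] is the coordinate tuple of joint j at frame t. A frame's score is
    the summed absolute displacement from the previous frame; frame 0 reuses
    the score of frame 1 because it has no predecessor.
    """
    diffs = [
        sum(abs(c - p) for pj, cj in zip(prev, cur) for p, c in zip(pj, cj))
        for prev, cur in zip(seq, seq[1:])
    ]
    return [diffs[0]] + diffs


def top_k(scores, k, highest=True):
    """Indices of the k highest (or lowest) scores; ties go to the lower index."""
    order = sorted(
        range(len(scores)),
        key=lambda i: (-scores[i] if highest else scores[i], i),
    )
    return set(order[:k])


def asymmetric_masks(seq, edges, joint_k, frame_k):
    """Build the two complementary masks (hypothetical top-k selection)."""
    degrees = joint_degrees(len(seq[0]), edges)
    motion = frame_motion(seq)
    # View A: high-degree joints and low-motion frames.
    view_a = {"joints": top_k(degrees, joint_k, True),
              "frames": top_k(motion, frame_k, False)}
    # View B: low-degree joints and high-motion frames.
    view_b = {"joints": top_k(degrees, joint_k, False),
              "frames": top_k(motion, frame_k, True)}
    return view_a, view_b


def apply_mask(seq, mask, fill=(0, 0)):
    """Replace masked joints (in every frame) and masked frames with `fill`."""
    return [
        [fill if (t in mask["frames"] or j in mask["joints"]) else joint
         for j, joint in enumerate(frame)]
        for t, frame in enumerate(seq)
    ]


# Toy input: 5 joints in 2D, 5 frames, all joints shifted by a per-frame offset.
toy_edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
toy_seq = [[(j, off) for j in range(5)] for off in [0, 1, 5, 6, 6]]
view_a, view_b = asymmetric_masks(toy_seq, toy_edges, joint_k=1, frame_k=1)
masked_a = apply_mask(toy_seq, view_a)
masked_b = apply_mask(toy_seq, view_b)
```

In this toy setup the two views hide disjoint joints and disjoint frames, which is the sense in which the masks are complementary; how ASMa actually scores motion, chooses mask ratios, and aligns the two views is not specified in this listing.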

Taxonomy

Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Context-Aware Activity Recognition Systems