Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence
Ruizhuo Xu, Linzhi Huang, Mei Wang, Jiani Hu, Weihong Deng

TL;DR
Skeleton2vec is a self-supervised learning framework for skeleton-based action recognition. It predicts high-level contextualized features rather than low-level ones and uses a motion-aware masking strategy, outperforming previous methods on the NTU-60, NTU-120, and PKU-MMD datasets.
Contribution
It proposes a novel transformer-based teacher encoder for contextualized target representations and a motion-aware tube masking strategy to enhance spatio-temporal learning.
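The motion-aware tube masking idea can be pictured with the rough sketch below. It is an illustration under stated assumptions, not the authors' implementation: the input layout (frames, joints, 3), the tube length, the mask ratio, the softmax temperature, and the function name `motion_aware_tube_mask` are all hypothetical. Each joint is scored by how far it moves inside a temporal tube, and the same joints are masked in every frame of that tube, with faster-moving joints more likely to be picked.

```python
import numpy as np


def motion_aware_tube_mask(seq, tube_len=4, mask_ratio=0.5, temperature=1.0, rng=None):
    """Return a boolean mask of shape (T, J); True marks a masked joint.

    seq: array of shape (T, J, 3) holding 3D joint coordinates (assumed layout).
    All hyperparameters are illustrative, not values taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames, num_joints, _ = seq.shape
    num_masked = int(round(mask_ratio * num_joints))
    mask = np.zeros((num_frames, num_joints), dtype=bool)

    for start in range(0, num_frames, tube_len):
        tube = seq[start:start + tube_len]
        if tube.shape[0] > 1:
            # Motion intensity: mean frame-to-frame displacement of each joint.
            motion = np.linalg.norm(np.diff(tube, axis=0), axis=-1).mean(axis=0)
        else:
            motion = np.zeros(num_joints)

        # Softmax over joints turns motion scores into sampling probabilities.
        logits = motion / temperature
        probs = np.exp(logits - logits.max()) + 1e-12
        probs /= probs.sum()

        # Sample joints without replacement and mask them across the whole tube.
        chosen = rng.choice(num_joints, size=num_masked, replace=False, p=probs)
        mask[start:start + tube_len, chosen] = True

    return mask


# Example: 8 frames of a 25-joint skeleton, split into two tubes of 4 frames.
example_seq = np.random.default_rng(0).normal(size=(8, 25, 3))
example_mask = motion_aware_tube_mask(
    example_seq, tube_len=4, mask_ratio=0.4, rng=np.random.default_rng(1)
)
```

Sharing one joint selection across a tube keeps a masked joint hidden in neighbouring frames, so the model cannot recover it by copying from the adjacent frame.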
Findings
Outperforms previous methods on NTU-60, NTU-120, and PKU-MMD datasets.
Achieves state-of-the-art results in skeleton-based action recognition.
Utilizes high-level features and motion priors for improved self-supervised learning.
Abstract
Self-supervised pre-training paradigms have been extensively explored in the field of skeleton-based action recognition. In particular, methods based on masked prediction have pushed the performance of pre-training to a new height. However, these methods take low-level features, such as raw joint coordinates or temporal motion, as prediction targets for the masked regions, which is suboptimal. In this paper, we show that using high-level contextualized features as prediction targets can achieve superior performance. Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework, which utilizes a transformer-based teacher encoder taking unmasked training samples as input to create latent contextualized representations as prediction targets. Benefiting from the self-attention mechanism, the latent representations generated by the…
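The prediction setup described in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's code: a single self-attention layer stands in for the transformer encoder, and the names `self_attention`, `masked_target_loss`, and `ema_update` are hypothetical. The exponential-moving-average teacher update follows the data2vec recipe and is an assumption here, since the excerpt above does not say how the teacher's weights are obtained.

```python
import numpy as np


def self_attention(x, w):
    """Single-head self-attention, a toy stand-in for a transformer encoder."""
    q, k, v = x @ w["q"], x @ w["k"], x @ w["v"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v


def masked_target_loss(tokens, mask, student_w, teacher_w, mask_token):
    """Mean squared error between student and teacher outputs at masked tokens."""
    # The teacher sees the unmasked sequence, so its outputs are contextualized.
    targets = self_attention(tokens, teacher_w)
    # The student sees the sequence with masked tokens replaced by a mask token.
    student_in = np.where(mask[:, None], mask_token, tokens)
    preds = self_attention(student_in, student_w)
    diff = preds[mask] - targets[mask]
    return float(np.mean(diff ** 2))


def ema_update(teacher_w, student_w, tau=0.999):
    """Assumed data2vec-style update: teacher tracks an average of the student."""
    for name in teacher_w:
        teacher_w[name] = tau * teacher_w[name] + (1.0 - tau) * student_w[name]


# Example: 12 tokens (e.g. joint embeddings) of width 8, three of them masked.
rng = np.random.default_rng(0)
dim = 8
student_w = {n: rng.normal(scale=0.1, size=(dim, dim)) for n in "qkv"}
teacher_w = {n: w.copy() for n, w in student_w.items()}
tokens = rng.normal(size=(12, dim))
mask = np.zeros(12, dtype=bool)
mask[[1, 4, 7]] = True
loss = masked_target_loss(tokens, mask, student_w, teacher_w, np.zeros(dim))
ema_update(teacher_w, student_w)
```

Because the teacher attends over the full sequence, each target vector mixes information from every token, which is what distinguishes it from a low-level target such as the raw coordinates of the masked joint.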
Taxonomy
Topics: Human Pose and Action Recognition · Gait Recognition and Analysis · Anomaly Detection Techniques and Applications
