TL;DR
This paper introduces a self-supervised learning method for skeleton-based action recognition that leverages inter-skeleton contrastive learning and specialized augmentations to learn robust spatio-temporal features, achieving state-of-the-art results.
Contribution
The paper proposes an inter-skeleton contrastive learning framework, combined with skeleton-specific spatial and temporal augmentations, to learn better representations of skeleton-based actions.
Findings
Achieves state-of-the-art performance on PKU and NTU datasets.
Effective in action recognition, retrieval, and semi-supervised learning.
Outperforms previous self-supervised methods on benchmark datasets.
Abstract
This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via a noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations in a cross-contrastive manner. In addition, we contribute several skeleton-specific spatial and temporal augmentations which further encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning similarities between different skeleton representations as well as augmented views of the same sequence, the network is encouraged to learn higher-level semantics of the skeleton data than when only using the augmented views. Our approach achieves state-of-the-art performance…
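The cross-contrastive idea in the abstract can be illustrated with a minimal sketch. It assumes two encoders have already mapped the same batch of sequences, given in two different input skeleton representations, to embeddings `emb_a` and `emb_b`. The function names, the use of in-batch negatives, and the temperature value are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def info_nce(queries, keys, temperature=0.07):
    # InfoNCE loss: row i of `queries` should match row i of `keys`;
    # all other rows in the batch act as negatives (an assumption here).
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature                       # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def cross_contrastive_loss(emb_a, emb_b, temperature=0.07):
    # Symmetric loss: representation A retrieves its counterpart in B and vice versa.
    return 0.5 * (info_nce(emb_a, emb_b, temperature)
                  + info_nce(emb_b, emb_a, temperature))

# Hypothetical usage: 8 sequences embedded in 16 dimensions by two encoders.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(8, 16))
emb_b = emb_a + 0.01 * rng.normal(size=(8, 16))  # nearly aligned embeddings
loss = cross_contrastive_loss(emb_a, emb_b)
```

The loss is low when embeddings of the same sequence agree across the two representations and high when they do not, which is the invariance the abstract describes.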
