STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences
Soroush Mehraban, Mohammad Javad Rajabi, Andrea Iaboni, Babak Taati

TL;DR
STARS introduces a self-supervised tuning method combining masked prediction and contrastive learning to improve 3D skeleton-based action recognition, especially in few-shot scenarios, achieving state-of-the-art results.
Contribution
The paper proposes a novel self-supervised tuning approach that enhances clustering and generalization in skeleton-based action recognition without relying on hand-crafted augmentations.
Findings
Achieves state-of-the-art results on NTU-60, NTU-120, and PKU-MMD benchmarks.
Significantly improves few-shot action recognition performance.
Outperforms masked prediction models in generalization tasks.
Abstract
Self-supervised pretraining methods with masked prediction demonstrate remarkable within-dataset performance in skeleton-based action recognition. However, we show that, unlike contrastive learning approaches, they do not produce well-separated clusters. Additionally, these methods struggle with generalization in few-shot settings. To address these issues, we propose Self-supervised Tuning for 3D Action Recognition in Skeleton sequences (STARS). Specifically, STARS first runs a masked prediction stage with an encoder-decoder architecture. It then employs nearest-neighbor contrastive learning to partially tune the weights of the encoder, enhancing the formation of semantic clusters for different actions. By tuning the encoder for a few epochs, and without using hand-crafted data augmentations, STARS achieves state-of-the-art self-supervised results in various benchmarks, including…
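To make the nearest-neighbor contrastive step concrete, the following is a minimal NumPy sketch of a generic nearest-neighbor InfoNCE loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `support` queue of past embeddings, the temperature value, and the use of other batch items as negatives are all hypothetical choices made here. Each embedding's nearest neighbor in the support set serves as its positive, so no augmented second view is needed.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def nn_contrastive_loss(z, support, temperature=0.1):
    """Generic nearest-neighbor InfoNCE loss (illustrative assumption only).

    z:       (B, D) embeddings of the current batch (single view).
    support: (Q, D) hypothetical queue of embeddings from earlier batches.
    """
    z = l2_normalize(z)
    support = l2_normalize(support)

    # For each batch item, pick its most similar support embedding.
    nn_idx = np.argmax(z @ support.T, axis=1)
    neighbors = support[nn_idx]                      # (B, D)

    # Similarity of each neighbor to every batch item; the diagonal
    # holds the positive pairs, off-diagonal entries act as negatives.
    logits = neighbors @ z.T / temperature           # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = rng.normal(size=(8, 16))
    queue = rng.normal(size=(64, 16))
    print(nn_contrastive_loss(batch, queue))
```

In a tuning setup like the one the abstract describes, a loss of this kind would be backpropagated only through the encoder layers chosen for tuning; which layers those are is not specified in this summary.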
Peer Reviews
Decision · Submitted to ICLR 2025
- The approach is interesting and tries to leverage the advantages of both masked autoencoder and contrastive-learning-based pre-training.
- Simplicity of the approach is a strength of the paper. The approach needs minimal additional training to achieve the improvements.
- The paper is well written and easy to follow for the most part.
- The presented ablations and design decisions could be helpful to the community.
- Prior work: While simplicity is the strength of this paper, it proposes a combination of two existing approaches. The authors must make it clear whether the two stages of training differ from the original approaches. A missing reference [a] also discusses differences in representations of MAE and CL-based pre-training for images and simple ways to use both. Why was the proposed approach used instead of adapting one of the approaches in Section 2.2 for skeleton-based representat
1. This paper proposes the STARS framework, which combines MAE with contrastive learning and can significantly improve the output representation of the MAE encoder with only a small amount of contrastive tuning.
2. Extensive experiments and ablation studies have been conducted on three large-scale 3D skeleton action recognition datasets, demonstrating the effectiveness of the method and, in most cases, reaching state-of-the-art performance.
1. Compared to some other contrastive learning methods (such as AimCLR, CMD), the STARS method relies only on single-view sequences and does not use two different augmented views. Theoretically, its performance may be limited under cross-view evaluation on the NTU dataset. However, the cross-view evaluation results in Table 1 and Table 3 are better than theirs. The article lacks relevant explanatory analysis.
2. As shown in the experimental results of Table 1, when the pre
1. The paper is well-written, and the techniques sound reliable.
2. The work provides comprehensive experiments that are effective.
1. The novelty of this paper is limited. It seems to be a composition of existing methods, and the contribution is not clear.
2. Multi-stage pretraining is more complex than previous studies. Although the training time is decreased, the computation overhead must be considered.
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
Methods: Contrastive Learning
