STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences

Soroush Mehraban; Mohammad Javad Rajabi; Andrea Iaboni; Babak Taati

arXiv:2407.10935·cs.CV·November 11, 2025·1 cites

TL;DR

STARS introduces a self-supervised tuning method combining masked prediction and contrastive learning to improve 3D skeleton-based action recognition, especially in few-shot scenarios, achieving state-of-the-art results.

Contribution

The paper proposes a novel self-supervised tuning approach that enhances clustering and generalization in skeleton-based action recognition without relying on hand-crafted augmentations.

Findings

01. Achieves state-of-the-art results on NTU-60, NTU-120, and PKU-MMD benchmarks.

02. Significantly improves few-shot action recognition performance.

03. Outperforms masked prediction models in generalization tasks.

Abstract

Self-supervised pretraining methods with masked prediction demonstrate remarkable within-dataset performance in skeleton-based action recognition. However, we show that, unlike contrastive learning approaches, they do not produce well-separated clusters. Additionally, these methods struggle with generalization in few-shot settings. To address these issues, we propose Self-supervised Tuning for 3D Action Recognition in Skeleton sequences (STARS). Specifically, STARS first uses a masked prediction stage using an encoder-decoder architecture. It then employs nearest-neighbor contrastive learning to partially tune the weights of the encoder, enhancing the formation of semantic clusters for different actions. By tuning the encoder for a few epochs, and without using hand-crafted data augmentations, STARS achieves state-of-the-art self-supervised results in various benchmarks, including…
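The nearest-neighbor contrastive stage described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it follows the generic nearest-neighbor InfoNCE formulation, and the function names, the `support` set, and the `temperature` value are all assumptions made for illustration. Each embedding is swapped for its nearest neighbor in a support set, and that neighbor is then contrasted against every prediction in the batch, with the prediction at the same index treated as the positive.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbor(query, support):
    """Return the support vector most similar to the (normalized) query."""
    return max(support, key=lambda s: dot(query, s))


def nn_contrastive_loss(embeddings, predictions, support, temperature=0.1):
    """Illustrative nearest-neighbor InfoNCE loss (hypothetical helper).

    embeddings[i] and predictions[i] are two representations of sample i.
    embeddings[i] is replaced by its nearest neighbor in `support`, and
    that neighbor must pick predictions[i] out of all predictions.
    """
    z = [l2_normalize(v) for v in embeddings]
    p = [l2_normalize(v) for v in predictions]
    q = [l2_normalize(v) for v in support]

    total = 0.0
    for i, z_i in enumerate(z):
        nn_i = nearest_neighbor(z_i, q)
        logits = [dot(nn_i, p_k) / temperature for p_k in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the positive class.
        total += log_denominator - logits[i]
    return total / len(z)
```

In this sketch the loss is small when each sample's neighbor agrees with its own prediction and large when predictions are mismatched, which is the pressure that pulls semantically similar sequences into the same cluster. In a real pipeline the embeddings would come from the pretrained encoder, and only part of its weights would be updated, as the abstract notes.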

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- The approach is interesting and tries to leverage the advantages of both masked-autoencoder and contrastive-learning-based pre-training.
- Simplicity of the approach is a strength of the paper. The approach needs minimal additional training to achieve the improvements.
- The paper is well written and easy to follow for the most part.
- The presented ablations and design decisions could be helpful to the community.

Weaknesses

- Prior work: While simplicity is a strength of this paper, it proposes a combination of two existing approaches. The authors must make clear whether the two training stages differ in any way from the original approaches. A missing reference [a] also discusses differences between the representations of MAE and CL-based pre-training for images and simple ways to use both. Why was the proposed approach used instead of adapting one of the approaches in Section 2.2 for skeleton-based representat

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. This paper proposes the STARS framework, which combines MAE with contrastive learning and can significantly improve the output representation of the MAE encoder with only a small amount of contrastive tuning.
2. Extensive experiments and ablation studies have been conducted on three large-scale 3D skeleton action recognition datasets, demonstrating the effectiveness of the method and, in most cases, reaching state-of-the-art performance.

Weaknesses

1. Compared to some other contrastive learning methods (such as AimCLR and CMD), the STARS method relies only on single-view sequences and does not use two different augmented views. Theoretically, its performance may be limited under cross-view evaluation on the NTU dataset. However, the cross-view evaluation results in Table 1 and Table 3 are better than theirs. The article lacks an analysis explaining this.
2. As shown in the experimental results of Table 1, when the pre

Reviewer 03 · Rating 5 · Confidence 5

Strengths

1. The paper is well-written, and the techniques sound reliable.
2. The work provides comprehensive experiments that are effective.

Weaknesses

1. The novelty of this paper is limited. It seems to be a composition of existing methods, and the contribution is not clear.
2. Multi-stage pretraining is more complex than previous studies. Although the training time is decreased, the computation overhead must be considered.

Code & Models

Repositories

TaatiTeam/STARS
pytorch

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis

Methods: Contrastive Learning