Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning
Jeonghyeok Do, Yun Chen, Geunhyuk Youk, Munchurl Kim

TL;DR
SLiM is a decoder-free masked modeling framework for skeleton representation learning that combines contrastive learning with masked modeling through a shared encoder. It reports state-of-the-art accuracy on downstream tasks while reducing inference computational cost by 7.89x compared to existing MAE methods.
Contribution
SLiM is the first decoder-free masked modeling approach for skeleton representation learning. It integrates contrastive learning and masked modeling in a unified framework built on a shared encoder.
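To make the idea of a unified objective concrete, here is a minimal sketch of how a contrastive term and a decoder-free masked term might be summed when both are computed from the output of one shared encoder. This is not the authors' implementation: the summary does not say what the masked objective predicts or how the terms are weighted, so the InfoNCE form, the feature-space regression target, the `weight` parameter, and all function names are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] pairs with anchors[i], the rest act as negatives."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def masked_feature_loss(predicted, targets, masked_idx):
    """ASSUMPTION: mean squared error in feature space, over masked positions only.
    No decoder maps features back to joint coordinates."""
    errs = [(x - y) ** 2
            for i in masked_idx
            for x, y in zip(predicted[i], targets[i])]
    return sum(errs) / len(errs)

def combined_loss(anchors, positives, predicted, targets, masked_idx, weight=1.0):
    """ASSUMPTION: a simple weighted sum of the two terms."""
    return info_nce(anchors, positives) + weight * masked_feature_loss(
        predicted, targets, masked_idx)

# Toy sequence-level embeddings from two views of the same two clips (hypothetical values).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.0], [0.0, 1.0]]
# Toy token features: token 0 is masked and its feature is predicted imperfectly.
pred_tokens = [[1.0, 1.0], [0.0, 0.0]]
target_tokens = [[0.0, 0.0], [0.0, 0.0]]

cl_term = info_nce(view_a, view_b)
mm_term = masked_feature_loss(pred_tokens, target_tokens, masked_idx=[0])
total = combined_loss(view_a, view_b, pred_tokens, target_tokens, masked_idx=[0])
```

In this sketch both terms consume encoder outputs directly, so no reconstruction decoder is run in either term.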
Findings
Achieves state-of-the-art performance on downstream tasks.
Reduces inference computational cost by 7.89x compared to existing MAE methods.
Effectively captures discriminative features without a decoder.
Abstract
The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry -- benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework with decoder-free masked modeling of…
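The computational asymmetry mentioned in the abstract can be illustrated with a rough count. The sketch below assumes a Transformer-style encoder whose self-attention cost grows with the square of the token count. The token count and masking ratio are hypothetical values chosen for illustration and are not taken from the paper.

```python
def visible_tokens(num_tokens: int, mask_ratio: float) -> int:
    """Tokens the encoder actually processes when a fraction mask_ratio is hidden."""
    return num_tokens - int(round(num_tokens * mask_ratio))

def attention_pair_count(num_tokens: int) -> int:
    """Token-pair interactions in one full self-attention layer (quadratic in length)."""
    return num_tokens * num_tokens

# Hypothetical settings, for illustration only.
NUM_TOKENS = 1000  # e.g. joints x temporal patches in one skeleton sequence
MASK_RATIO = 0.9   # fraction of tokens hidden during masked pre-training

pretrain_tokens = visible_tokens(NUM_TOKENS, MASK_RATIO)
pretrain_pairs = attention_pair_count(pretrain_tokens)
downstream_pairs = attention_pair_count(NUM_TOKENS)  # full sequence, nothing masked
asymmetry = downstream_pairs / pretrain_pairs

print(f"pre-training: {pretrain_tokens} tokens, {pretrain_pairs} attention pairs")
print(f"downstream:   {NUM_TOKENS} tokens, {downstream_pairs} attention pairs")
print(f"downstream / pre-training attention cost: {asymmetry:.0f}x")
```

Under these assumed numbers, an encoder that is cheap to pre-train on the visible tokens becomes far more expensive per layer once it must process the full sequence downstream, which is the asymmetry the abstract describes.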
Taxonomy
Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Face recognition and analysis
