Foundation Model for Skeleton-Based Human Action Understanding
Hongsong Wang, Wanjiang Weng, Junbo Wang, Fang Zhao, Guo-Sen Xie, Xin Geng, Liang Wang

TL;DR
This paper introduces a unified Transformer-based framework for skeleton-based human action understanding, achieving superior performance across diverse benchmarks and tasks by leveraging dense spatio-temporal encoding and self-supervised training.
Contribution
The paper proposes Unified Skeleton-based Dense Representation Learning (USDRL), a framework that combines a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT), with the aim of broad applicability and improved accuracy across action understanding tasks.
Findings
Outperforms state-of-the-art on 25 benchmarks
Effective in dense, coarse, and transferred action prediction tasks
Enhances feature learning through multi-view and multi-modal consistency
Abstract
Human action understanding serves as a foundational pillar in the field of intelligent motion perception. Skeletons serve as a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has potential applications in humanoid robot control and interaction. However, existing works often lack the scalability and generalization required to handle diverse action understanding tasks. There is no skeleton foundation model that can be adapted to a wide range of action understanding tasks. This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundational model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE…
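The abstract names Multi-Grained Feature Decorrelation (MG-FD), but the excerpt is cut off before describing it. As a generic illustration of what a feature decorrelation objective can look like, the sketch below penalizes the off-diagonal entries of a batch covariance matrix, in the style of covariance-regularization losses used in self-supervised learning. It is an assumption made for illustration, not the paper's MG-FD formulation; the function names `covariance_matrix` and `decorrelation_penalty` are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List


def covariance_matrix(features: List[List[float]]) -> List[List[float]]:
    """Sample covariance (D x D) of an N x D batch of feature vectors."""
    n = len(features)
    d = len(features[0])
    if n < 2:
        raise ValueError("need at least two samples")
    # Center each feature dimension across the batch.
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def decorrelation_penalty(features: List[List[float]]) -> float:
    """Sum of squared off-diagonal covariance entries, divided by D.

    Hypothetical stand-in for a decorrelation loss: it is zero when
    feature dimensions are pairwise uncorrelated over the batch.
    """
    cov = covariance_matrix(features)
    d = len(cov)
    off_diag = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return off_diag / d


if __name__ == "__main__":
    # Second dimension is exactly twice the first: fully redundant.
    redundant = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    # Dimensions vary independently: no cross-covariance.
    independent = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    print(decorrelation_penalty(redundant))
    print(decorrelation_penalty(independent))
```

In this sketch the redundant batch receives a positive penalty while the independent batch receives none, which is the behavior a decorrelation term is meant to encourage when added to a training loss.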
Taxonomy
Topics: Human Pose and Action Recognition · Biomedical Text Mining and Ontologies
