MaskSem: Semantic-Guided Masking for Learning 3D Hybrid High-Order Motion Representation

Wei Wei; Shaojie Zhang; Yonghao Dang; Jianqin Yin

arXiv:2508.12948·cs.CV·August 19, 2025

Open Access

TL;DR

MaskSem introduces a semantic-guided masking strategy and hybrid high-order motion reconstruction to improve 3D skeleton-based action recognition, especially for complex motions in human-robot interaction.

Contribution

The paper proposes MaskSem, a novel framework that uses Grad-CAM guided masking and hybrid high-order motion targets to enhance self-supervised learning of complex motion patterns.
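As a rough illustration of what saliency-guided masking could look like, the sketch below masks the highest-scoring (frame, joint) positions of a skeleton sequence. It is not the authors' implementation: the `saliency` input is a hypothetical stand-in for a Grad-CAM map, the function names and the `mask_ratio` parameter are invented for this example, and deterministic top-k selection is a simplifying assumption (the paper may sample positions differently).

```python
def select_masked_positions(saliency, mask_ratio):
    """Return the set of (frame, joint) pairs with the highest saliency.

    saliency: list of frames, each a list of per-joint scores. This is a
    hypothetical stand-in for a Grad-CAM map, not the paper's exact output.
    mask_ratio: fraction of all (frame, joint) positions to mask.
    """
    positions = [
        (score, t, j)
        for t, frame in enumerate(saliency)
        for j, score in enumerate(frame)
    ]
    num_masked = int(len(positions) * mask_ratio)
    # Highest scores first; ties fall back to (frame, joint) order so the
    # result is deterministic.
    positions.sort(key=lambda p: (-p[0], p[1], p[2]))
    return {(t, j) for _, t, j in positions[:num_masked]}


def apply_mask(sequence, masked, mask_value=(0.0, 0.0, 0.0)):
    """Replace the 3D coordinates at masked positions with mask_value.

    sequence: list of frames, each a list of (x, y, z) joint coordinates.
    """
    return [
        [mask_value if (t, j) in masked else joint
         for j, joint in enumerate(frame)]
        for t, frame in enumerate(sequence)
    ]


# Toy example: 2 frames, 2 joints.
saliency = [[0.1, 0.9], [0.8, 0.2]]
sequence = [[(1.0, 1.0, 1.0), (2.0, 2.0, 2.0)],
            [(3.0, 3.0, 3.0), (4.0, 4.0, 4.0)]]
masked = select_masked_positions(saliency, mask_ratio=0.5)
masked_sequence = apply_mask(sequence, masked)
```

In a reconstruction setup of this kind, the model would see `masked_sequence` and be trained to predict targets at the positions listed in `masked`.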

Findings

01 Improves recognition accuracy on NTU60, NTU120, and PKU-MMD datasets.

02 Enhances the model's understanding of complex motion patterns.

03 Suitable for human-robot interaction applications.

Abstract

Human action recognition is a crucial task for intelligent robotics, particularly within the context of human-robot collaboration research. In self-supervised skeleton-based action recognition, the mask-based reconstruction paradigm learns the spatial structure and motion patterns of the skeleton by masking joints and reconstructing the target from unlabeled data. However, existing methods focus on a limited set of joints and low-order motion patterns, limiting the model's ability to understand complex motion patterns. To address this issue, we introduce MaskSem, a novel semantic-guided masking method for learning 3D hybrid high-order motion representations. The framework leverages Grad-CAM based on relative motion to guide the masking of joints, which can be represented as the most semantically rich temporal regions. The semantic-guided masking process can encourage the model to…
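To make the idea of motion order concrete, the sketch below derives first- and second-order motion from joint positions by repeated temporal differencing. The truncated abstract does not say which orders make up the hybrid target, so treating order 1 as velocity and order 2 as acceleration, and computing them with finite differences, is an assumption made for illustration. The function names are hypothetical.

```python
def temporal_difference(sequence):
    """First-order finite difference along time.

    sequence: list of T frames, each a list of (x, y, z) joint coordinates.
    Returns T - 1 frames of per-joint differences between consecutive frames.
    """
    return [
        [
            tuple(b - a for a, b in zip(joint_prev, joint_next))
            for joint_prev, joint_next in zip(frame_prev, frame_next)
        ]
        for frame_prev, frame_next in zip(sequence, sequence[1:])
    ]


def motion_targets(sequence, max_order=2):
    """Return {order: differenced sequence} for orders 1..max_order.

    Assumption for this sketch: order 1 approximates velocity and
    order 2 approximates acceleration.
    """
    targets = {}
    current = sequence
    for order in range(1, max_order + 1):
        current = temporal_difference(current)
        targets[order] = current
    return targets


# Toy example: one joint whose x coordinate follows t**2 over four frames.
sequence = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)],
            [(4.0, 0.0, 0.0)], [(9.0, 0.0, 0.0)]]
targets = motion_targets(sequence)
```

For this quadratic trajectory the first-order differences grow linearly while the second-order differences are constant, which is the kind of structure a position-only or velocity-only target would not expose directly.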

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Human Pose and Action Recognition