# Modeling the Internal and Contextual Attention for Self-Supervised Skeleton-Based Action Recognition

**Authors:** Wentian Xin, Yue Teng, Jikang Zhang, Yi Liu, Ruyi Liu, Yuzhi Hu, Qiguang Miao

PMC · DOI: 10.3390/s25216532 · Sensors (Basel, Switzerland) · 2025-10-23

## TL;DR

This paper introduces a new self-supervised method for skeleton-based action recognition that improves performance by modeling attention in both internal and contextual features.

## Contribution

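The paper proposes MICA, a novel language-skeleton contrastive learning framework with two components: Feature Modulation, which builds a shared skeleton–language action conceptual domain, and Frequency Feature Learning, which introduces the Frequency-domain Spatial–Temporal block (FreST).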
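To make "language-skeleton contrastive learning" concrete, the following is a minimal sketch of a generic symmetric InfoNCE objective of the kind commonly used to align two modalities. It is not MICA's actual loss, which this summary does not specify. The function names, the `temperature` value of 0.07, and the assumption that each skeleton embedding in a batch is paired with the text embedding at the same index are all illustrative.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(skeleton_emb, text_emb, temperature=0.07):
    """Generic symmetric contrastive loss (an assumption, not the paper's objective).

    skeleton_emb, text_emb: arrays of shape (batch, dim); row i of each is
    assumed to describe the same action sample.
    """
    s = l2_normalize(skeleton_emb)
    t = l2_normalize(text_emb)
    logits = s @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(s))                 # matching pairs sit on the diagonal
    loss_s2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # skeleton -> text
    loss_t2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> skeleton
    return 0.5 * (loss_s2t + loss_t2s)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    print("aligned pairs:   ", symmetric_info_nce(emb, emb))
    print("mismatched pairs:", symmetric_info_nce(emb, np.roll(emb, 1, axis=0)))
```

Under these assumptions, correctly paired embeddings yield a lower loss than shuffled ones, which is the signal a contrastive framework uses to pull matching skeleton and language representations together.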

## Key findings

- MICA achieves strong action recognition performance on the NTU RGB+D 60 and NTU RGB+D 120 benchmarks.
- On the more challenging PKU-MMD dataset, MICA improves on classical methods such as CrosSCLR and AimCLR by at least 4.6%.
- The framework effectively captures internal and contextual attention information.

## Abstract

Multimodal contrastive learning has achieved significant performance advantages in self-supervised skeleton-based action recognition. Previous methods, however, are limited by modality imbalance, which reduces alignment accuracy and makes it difficult to combine important spatial–temporal frequency patterns, leading to confusion between modalities and weaker feature representations. To overcome these problems, we explore intra-modality feature-wise self-similarity and inter-modality instance-wise cross-consistency, and identify two inherent correlations that benefit recognition: (i) Global Perspective, which expresses how action semantics carry a broad, high-level understanding and supports the use of globally discriminative feature representations; and (ii) Focus Adaptation, which refers to the role of the frequency spectrum in guiding attention toward key joints by emphasizing compact and salient signal patterns. Building on these insights, we propose MICA, a novel language–skeleton contrastive learning framework comprising two key components: (a) Feature Modulation, which constructs a skeleton–language action conceptual domain to minimize the expected information gain between vision and language modalities; and (b) Frequency Feature Learning, which introduces a Frequency-domain Spatial–Temporal block (FreST) that focuses on sparse key human joints in the frequency domain with compact signal energy. Extensive experiments demonstrate that our method achieves strong action recognition performance on widely used benchmark datasets, including NTU RGB+D 60 and NTU RGB+D 120. On the more challenging PKU-MMD dataset in particular, MICA improves on classical methods such as CrosSCLR and AimCLR by at least 4.6%, demonstrating its ability to capture internal and contextual attention information.
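The Focus Adaptation idea, that the frequency spectrum can point toward a few key joints with compact signal energy, can be illustrated with a small sketch. This is not the FreST block, whose design is not described in this summary; it only ranks joints by the spectral energy of their motion over time. The `(frames, joints, channels)` array layout, the function names, and the choice to drop the zero-frequency (DC) bin so that static posture does not count as motion are assumptions made for illustration.

```python
import numpy as np


def joint_spectral_energy(sequence):
    """Per-joint motion energy in the temporal frequency domain.

    sequence: array of shape (frames, joints, channels), an assumed layout
    for a skeleton clip (e.g. channels = x, y, z coordinates).
    """
    spectrum = np.fft.rfft(sequence, axis=0)   # FFT along the time axis
    power = np.abs(spectrum) ** 2              # power per frequency bin
    # Skip bin 0 (DC): a joint's constant position carries no motion signal.
    return power[1:].sum(axis=(0, 2))          # sum over frequencies and channels


def top_k_joints(sequence, k):
    """Indices of the k joints with the highest spectral energy, highest first."""
    energy = joint_spectral_energy(sequence)
    return np.argsort(energy)[::-1][:k]


if __name__ == "__main__":
    frames = np.arange(64)
    clip = np.ones((64, 5, 3))                 # five static joints
    # Hypothetical example: only joint 2 oscillates, along its first channel.
    clip[:, 2, 0] += np.sin(2 * np.pi * 4 * frames / 64)
    print("energy per joint:", joint_spectral_energy(clip))
    print("most active joint:", top_k_joints(clip, 1))
```

In this toy clip the single oscillating joint dominates the energy ranking, which shows how a frequency-domain view can concentrate attention on a sparse set of joints.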

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12608129/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12608129/full.md

## References

73 references — full list in the complete paper: https://tomesphere.com/paper/PMC12608129/full.md

---
Source: https://tomesphere.com/paper/PMC12608129