# Motion feature extraction based on semi-supervised learning and long short-term memory network in digital dance

**Authors:** Xue Yang, Hanmin Sun, Yin Lyu, Yang Sun

PMC · DOI: 10.3389/fnbot.2026.1743288 · Frontiers in Neurorobotics · 2026-01-29

## TL;DR

The paper proposes a semi-supervised LSTM-CNN framework that extracts motion features from depth sequences of dance movements and maps them to 3D key-points using only 20% labeled samples.

## Contribution

An LSTM-CNN framework that combines semi-supervised learning with an online hard-example mining (OHEM) strategy and a Dice-cross-entropy weighted loss to estimate 3D key-points from depth sequences with limited annotations.
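
As a rough illustration of the online hard-example mining (OHEM) idea mentioned here, the sketch below keeps only the highest-loss samples of a mini-batch and averages their losses. The function names `ohem_select` and `ohem_loss` and the `keep_ratio` parameter are hypothetical; this summary does not state how the paper selects hard examples or what ratio it uses.

```python
from typing import List, Sequence


def ohem_select(losses: Sequence[float], keep_ratio: float = 0.25) -> List[int]:
    """Return the indices of the hardest (largest-loss) samples in a batch.

    keep_ratio is an assumed hyperparameter, not a value from the paper.
    """
    if not losses:
        raise ValueError("losses must not be empty")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one sample so the batch loss is defined.
    k = max(1, int(len(losses) * keep_ratio))
    # Rank sample indices by loss, hardest first.
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])


def ohem_loss(losses: Sequence[float], keep_ratio: float = 0.25) -> float:
    """Average the loss over the selected hard samples only."""
    hard = ohem_select(losses, keep_ratio)
    return sum(losses[i] for i in hard) / len(hard)
```

In a training loop, only the selected samples would contribute to the gradient, so easy postures that dominate the batch do not drown out the rare, complex ones.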

## Key findings

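- On the public MSR-Action3D dataset (567 sequences, 20 actions), the proposed model achieved an average recognition rate of 96.9% with 20% labeled samples, surpassing the best comparison method (MSST) by 1.1%.
- On the self-established dataset (99 sequences, 11 actions), accuracy reached 97.99% while training time was reduced by 35% compared with the previous best Multi_perspective_MHPCs approach.
- RMSE between predicted and ground-truth key-points was low (≤ 0.032) on both datasets, confirming spatial precision in key-point prediction.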
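
The RMSE figure above can be read against a computation like the following minimal sketch. It assumes RMSE is the square root of the mean squared Euclidean distance per key-point; the summary does not give the paper's exact definition, coordinate units, or normalization, and `keypoint_rmse` is a hypothetical helper name.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def keypoint_rmse(pred: Sequence[Point3D], truth: Sequence[Point3D]) -> float:
    """RMSE between predicted and ground-truth 3D key-points.

    Assumed definition: sqrt(mean over key-points of squared Euclidean distance).
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    squared = [
        sum((p - t) ** 2 for p, t in zip(a, b))  # squared distance for one key-point
        for a, b in zip(pred, truth)
    ]
    return math.sqrt(sum(squared) / len(squared))
```

Under this definition a value of 0.032 is only meaningful relative to the coordinate scale, which is why the normalization used in the paper matters when comparing against other methods.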

## Abstract

Digital-image technology has broadened the creative space of dance, yet accurately capturing the semantic correspondence between low-level motion data and high-level dance key-points remains challenging, especially when labeled data are scarce. We aim to establish a lightweight, semi-supervised pipeline that can extract discriminative motion features from depth sequences and map them to 3-D key-points of dancers in real time. To achieve pixel-level alignment between dance movement targets and high-dimensional sensory data, we propose a novel LSTM-CNN (Long Short-Term Memory-Convolutional Neural Network) framework. Temporal-context features are first extracted by LSTM, after which multi-dimensional spatial features are captured by three convolutional layers and one max-pooling layer; the fused representation is finally regressed to 3-D body key-points. To relieve class imbalance caused by complex postures, an online hard-example mining (OHEM) strategy together with a Dice-cross-entropy weighted loss (3:1) is embedded into semi-supervised learning, enabling the network to converge with only 20% labeled samples. Experiments on the public MSR-Action3D dataset (567 sequences, 20 actions) yielded an average recognition rate of 96.9%, surpassing the best comparison method (MSST) by 1.1%. On our self-established dataset (99 sequences, 11 actions) the accuracy reached 97.99% while training time was reduced by 35% compared with the previous best Multi_perspective_MHPCs approach. Both datasets show low RMSE (≤ 0.032) between predicted and ground-truth key-points, confirming spatial precision. The results demonstrate that the proposed model can reliably track subtle dance gestures under limited annotation, offering an efficient, low-cost solution for digital choreography, motion-style transfer and interactive stage performance.
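
The Dice-cross-entropy weighted loss described in the abstract could look roughly like the sketch below, written for binary per-element targets in plain Python. Several points are assumptions: the abstract gives the ratio 3:1 but not which term carries the weight 3 (Dice is assumed here, following the order of the name), whether the weights are normalized, or how the Dice term is smoothed. The names `dice_loss`, `cross_entropy`, `dice_ce_loss`, `w_dice`, and `w_ce` are hypothetical.

```python
import math
from typing import Sequence


def dice_loss(probs: Sequence[float], targets: Sequence[float],
              smooth: float = 1e-6) -> float:
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|), with a small smoothing term."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


def cross_entropy(probs: Sequence[float], targets: Sequence[float],
                  eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_ce_loss(probs: Sequence[float], targets: Sequence[float],
                 w_dice: float = 3.0, w_ce: float = 1.0) -> float:
    """Weighted sum of Dice and cross-entropy terms (assumed 3:1, unnormalized)."""
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and the same length")
    return w_dice * dice_loss(probs, targets) + w_ce * cross_entropy(probs, targets)
```

The Dice term responds to overlap between prediction and target regardless of how rare the positive class is, which is the usual motivation for pairing it with cross-entropy when classes are imbalanced.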

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12894404/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12894404/full.md

## References

48 references — full list in the complete paper: https://tomesphere.com/paper/PMC12894404/full.md

---
Source: https://tomesphere.com/paper/PMC12894404