# Lightweight Multi-Scale Framework for Human Pose and Action Classification

**Authors:** Alireza Saber, Mohammad-Mehdi Hosseini, Amirreza Fateh, Mansoor Fateh, Vahid Abolghasemi

PMC · DOI: 10.3390/s26041102 · 2026-02-08

## TL;DR

This paper introduces a lightweight deep learning framework for human pose and action classification that reaches 90.40% accuracy on 6-class Yoga-82 and 94.28% on Stanford 40 Actions with only 0.79 million parameters.

## Contribution

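The main contribution is a modular attention-based architecture that pairs a Swin Transformer backbone for multi-scale feature extraction with three attention modules: a Spatial Attention module, a Context-Aware Channel Attention Module, and a novel Dual Weighted Cross Attention module, which together fuse spatial and channel-wise cues.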

## Key findings

- The model achieves 90.40% accuracy on the 6-class Yoga-82 configuration and 87.44% on the 20-class configuration.
- It outperforms state-of-the-art methods on Stanford 40 Actions with 94.28% accuracy.
- The model maintains high performance with only 0.79 million parameters.
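
To put the 0.79 million parameter figure in perspective, the short calculation below converts a parameter count into approximate weight storage. The bytes-per-parameter values (4 for float32, 2 for float16) are assumptions about storage precision rather than something reported in the paper, and `weight_footprint_mb` is a hypothetical helper name.

```python
def weight_footprint_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate storage for model weights in decimal megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


PARAMS = 790_000  # 0.79 million parameters, as reported in the summary

# Assumed storage precisions; the paper does not state which one is used.
print(f"float32: {weight_footprint_mb(PARAMS, 4):.2f} MB")  # float32: 3.16 MB
print(f"float16: {weight_footprint_mb(PARAMS, 2):.2f} MB")  # float16: 1.58 MB
```

This counts weights only; activations, optimizer state, and file-format overhead are not included.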

## Abstract

Human pose classification, along with related tasks such as action recognition, is a crucial area in deep learning due to its wide range of applications in assisting human activities. Despite significant progress, it remains a challenging problem because of high inter-class similarity, dataset noise, and the large variability in human poses. In this paper, we propose a lightweight yet highly effective modular attention-based architecture for human pose classification, built upon a Swin Transformer backbone for robust multi-scale feature extraction. The proposed design integrates the Spatial Attention module, the Context-Aware Channel Attention Module, and a novel Dual Weighted Cross Attention module, enabling effective fusion of spatial and channel-wise cues. Additionally, explainable AI techniques are employed to improve the reliability and interpretability of the model. We train and evaluate our approach on two distinct datasets: Yoga-82 (in both main-class and subclass configurations) and Stanford 40 Actions. Experimental results show that our model outperforms state-of-the-art baselines across accuracy, precision, recall, F1-score, and mean average precision, while maintaining an extremely low parameter count of only 0.79 million. Specifically, our method achieves accuracies of 90.40% and 87.44% for the 6-class and 20-class Yoga-82 configurations, respectively, and 94.28% for the Stanford 40 Actions dataset.
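
The abstract reports accuracy, precision, recall, and F1-score for multi-class classification. As a minimal sketch of how such numbers are typically computed, the code below derives them from predicted and true labels using macro averaging. Macro averaging is an assumption here, since the abstract does not say which averaging scheme the authors used, and the pose labels in the example are invented for illustration. Mean average precision is left out because it needs per-class confidence scores rather than hard predictions.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging (unweighted mean over classes) is an assumption;
    the paper's exact averaging scheme is not given in the abstract.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels, for illustration only.
y_true = ["tree", "tree", "plank", "plank", "cobra", "cobra"]
y_pred = ["tree", "plank", "plank", "plank", "cobra", "tree"]
metrics = macro_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

With many classes of uneven size, as in the 20-class Yoga-82 configuration, macro averages can differ noticeably from accuracy, which is one reason papers report several metrics side by side.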

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12944327/full.md

---
Source: https://tomesphere.com/paper/PMC12944327