An Algorithm for Identifying Unsafe Behaviors of Miners Based on the Improved AlphaPose
Xiaopei Liu, Cong Song, Feng Tian

TL;DR
This paper introduces an improved AlphaPose algorithm to better identify unsafe behaviors of miners in complex underground environments using video surveillance.
Contribution
The novel RS-AlphaPose algorithm integrates enhanced detection and attention mechanisms for improved accuracy in miner behavior recognition.
Findings
The RS-AlphaPose algorithm achieved 72.5% average accuracy on the COCO2017 dataset, 2.2% higher than the base model.
On a miner behavior dataset, the algorithm reached 94.5% accuracy for identifying unsafe behaviors like climbing and crossing.
The method effectively handles complex underground conditions like occlusion and chaotic backgrounds.
Abstract
Utilizing video surveillance in mines to identify unsafe behaviors of miners is an important technical means for preventing coal mine accidents and achieving safety control. However, the complex underground environment (such as chaotic backgrounds, personnel occlusion, etc.) severely affects the estimation of human postures and feature extraction, resulting in low accuracy of unsafe behavior identification. To address this issue, this paper proposes a miner unsafe behavior recognition algorithm based on improved AlphaPose (RS-AlphaPose). Firstly, the improved real-time detection Transformer (RTDETR) is adopted to replace the original target detection network. Through the deformable attention mechanism and the addition of small target detection layers, the target detection ability in complex scenes is enhanced. Secondly, the sliding window attention and channel attention mechanisms are…
1. Introduction
Coal mine accidents not only seriously threaten the safety of miners and disrupt the normal production order of the mine, but also cause significant economic losses. Research shows that the unsafe behavior of miners is the main cause of coal mine accidents [1]. Real-time monitoring of the behavior of personnel in coal mines through video surveillance is an important means to improve safety levels in coal mines. Consequently, the study of miner unsafe behavior recognition technology has very important practical significance.
Human behavior recognition is a crucial research direction in the field of computer vision. Early approaches to human behavior recognition predominantly relied on traditional machine learning techniques. The features extracted by such methods are susceptible to the influences of scene variations, lighting conditions, and other environmental factors, leading to relatively low recognition accuracy. Benefiting from the significant improvement in computing power, deep learning-based methods have undergone rapid development [2]. These methods can be generally categorized into CNN and spatio-temporal fusion-based approaches [3,4], Transformer-based methods [5,6], and graph convolutional network-based techniques [7,8], among others. However, owing to the challenging conditions in coal mine environments, such as poor illumination and frequent human occlusion, the direct application of the aforementioned methods fails to achieve accurate and reliable recognition of the behaviors of underground personnel.
In recent years, the identification of unsafe behaviors of personnel in underground coal mines has gradually attracted the attention of researchers. Methods based on deep learning have become the mainstream in this field, and can mainly be divided into two categories: static behavior recognition methods based on object detection and dynamic behavior recognition methods based on human skeletons. The former is based on single-frame images for recognition, and its core idea is to transform the problem of miner behavior recognition into the detection of human targets in specific behavioral states through object detection models. For example, Xin et al. [9] proposed an improved YOLOV8n model to detect miner actions such as sitting, standing, operating, leaning, and falling; Wang et al. [10] used deformable convolution and SimAM attention mechanism to improve YOLOv8, achieving the recognition of five types of miner unsafe actions including falling, not wearing a helmet, illegal riding, crossing equipment, and entering dangerous areas; Yao et al. [11] proposed an attentional multi-scale cascaded feature fusion-based YOLOv5s model to address issues such as small targets, low light, and coal dust occlusion for the recognition of miner actions such as helmet wearing and falling; Li et al. [12] used the YOLOv8 model for miner helmet detection; Gao et al. [13] proposed an end-to-end MSD-DETR model to detect the behavior of miners in mine overhead passenger conveyors. Xu et al. [14], based on the YOLOv5 framework, introduced adaptive feature fusion modules at the head and tail of the backbone network, respectively, for deep data processing, achieving the recognition of unsafe behaviors such as construction workers not wearing safety helmets or reflective vests. 
These methods enable fast inference and perform well in recognizing single static behaviors without interaction, such as not wearing a helmet and leaning against equipment, but they struggle to handle complex dynamic behaviors.
To address the recognition of complex dynamic behaviors among miners, skeleton-based methods first extract human keypoints using models such as OpenPose [15], AlphaPose [16], and HRNet [17]. They then construct pose skeleton sequences and apply temporal models such as ST-GCN [18], LSTM [19], and TCN [20] to achieve behavior recognition, which has led to their wide use. Relevant scholars have conducted extensive research on such methods. Wang et al. [21] improved the YOLOv7-Pose model to extract the keypoints of miner skeletons and recognize miner actions such as standing, operating, and sitting; Shi et al. [22] optimized the target detection module of AlphaPose and integrated spatio-temporal graph convolutional networks to achieve miner behavior recognition, balancing speed and accuracy. In addition, Cao et al. [23] improved the behavior recognition ability in complex environments by reconstructing the graph structure and introducing self-attention mechanisms, and Yang et al. [24] proposed the ANODE-GCN model to address the problem of missing skeleton keypoints. All of these studies have made progress in specific scenarios. Wang et al. [25] combined the spatio-temporal dual-branch structure with the transposed attention representation mechanism to construct a behavior recognition model for coal miners, which enhanced the model’s classification ability for highly similar behaviors.
Based on the above analysis, the behavior recognition method using the human skeleton can significantly reduce the interference from environmental factors and has made progress in the identification of miners’ unsafe behaviors. However, due to the complex underground environment, this field still faces two core challenges:
- (1)The monitoring area in the underground is equipped with numerous devices, most of which have similar colors to miners’ clothes, which brings significant interference to the detection of human targets and affects subsequent behavior recognition;
- (2)The occurrence of underground scenes with personnel occlusion is frequent. The existing algorithms have limited ability to estimate postures and extract features in the presence of occlusion, directly resulting in low behavior recognition accuracy.
To address the above problems, this paper takes the AlphaPose model as the baseline for posture estimation, and conducts research on the algorithm for identifying unsafe behaviors of personnel in underground coal mines. The main contributions are as follows:
- (1)Based on RTDETR, a lightweight convolutional structure was introduced to reduce the computational burden, a deformable attention mechanism was integrated to dynamically focus on blurry or occluded areas, and a dedicated small target detection layer was added to enhance the ability to capture tiny miner targets, thereby addressing the issue of low accuracy in human target detection in complex mine environments.
- (2)In the single-person pose estimation network, the Swin Transformer sliding window multi-head attention mechanism was introduced, aiming to model local human structures and infer the positions of keypoints in the presence of occlusion. At the same time, combined with the channel attention mechanism, the weights of different feature channels were dynamically adjusted to suppress background noise caused by dust and equipment, thereby improving the stability and accuracy of keypoint extraction in chaotic environments.
- (3)For the optimization of skeleton representation for miner behavior analysis, the traditional global keypoint set was simplified to a 13-point skeleton representation, retaining discriminative information while reducing computational costs.
2. Materials and Methods
2.1. Analysis of the AlphaPose Algorithm
Given the complexity of the mining environment, this paper adopts a top-down AlphaPose human pose estimation framework to achieve efficient and accurate human pose recognition. The structure of AlphaPose is illustrated in Figure 1.
First, Faster R-CNN is employed to extract human object candidate boxes. Subsequently, these candidate boxes are fed into the Regional Multi-Person Pose Estimation (RMPE) framework for pose estimation. RMPE consists of three components: a Symmetric Spatial Transformation Network, Parametric Pose Non-Maximum Suppression, and a Pose-Guided Region-Based Bounding Box Generator. This framework primarily addresses issues of inaccurate bounding box localization and pose redundancy, thereby enhancing the accuracy and efficiency of pose estimation.
The Symmetric Spatial Transformation Network (SSTN), as the core module of the RMPE framework, consists of two parts: the Spatial Transformation Network and the Inverse Spatial Transformation Network. The Spatial Transformation Network (STN) corrects biased human bounding boxes to enhance pose estimation accuracy. Within the single-person pose estimation module, this network adopts a multi-layer stacked Hourglass network structure, which efficiently detects multiple human joint points, such as shoulders, knees, and elbows. By rapidly identifying these key regions, the system can swiftly infer the overall human pose, thereby improving the algorithm’s response speed. Furthermore, the stacked architecture enables the network to process input feature maps across multiple levels. This multi-scale approach allows the model to extract diverse information at different hierarchical levels, resulting in deeper feature representations. Finally, considering the scale variations among human joints, this network structure flexibly handles such differences. It not only accurately identifies local joints but also effectively detects overall human features. Consequently, the network achieves more precise pose estimation.
2.2. RS-AlphaPose Model for Identifying Unsafe Behaviors of Miners
Although AlphaPose demonstrates excellent performance across multiple scenarios, its application in identifying unsafe behaviors among miners remains constrained by the complex mine environment. This manifests as insufficient localization accuracy for small targets and the tendency for key human body points to be lost in occlusion scenarios, directly compromising the reliability of behavior recognition. To address this, this paper proposes the RS-AlphaPose human pose estimation network by enhancing the AlphaPose architecture. Its dynamic unsafe behavior recognition workflow is illustrated in Figure 2.
Firstly, the input video is processed by the miner object detection module (the PDP-RTDETR algorithm) to accurately locate the position of the miners. Secondly, the detection area is input into the Human Keypoints Extraction Module, which employs a Swin Transformer backbone network with a fused sliding window attention mechanism and a SENet channel attention mechanism to achieve robust keypoint estimation. Finally, the ST-GCN is used to model the temporal sequence of keypoints and identify the miner’s behavior. In Figure 2, the images displayed below each module are the corresponding output results of that module.
2.2.1. Optimization of Object Detection Module
AlphaPose significantly improves human detection accuracy by integrating SSTN and PGPG technologies, yet its target detection still relies on the Faster R-CNN algorithm. As an early two-stage object detection method, Faster R-CNN achieves high accuracy but struggles with feature extraction in complex underground coal mine environments. Moreover, its large convolutional layers and feature maps increase network parameters, ultimately reducing computational efficiency.
RTDETR is an end-to-end one-stage detection algorithm that operates without anchor design. It employs a lightweight transformer architecture for object detection, streamlining computational processes and reducing parameter counts while maintaining high detection accuracy. In complex coal mining scenarios, RTDETR demonstrates superior region localization capabilities compared to Faster R-CNN. When integrated with AlphaPose’s pose estimation algorithm, RTDETR significantly enhances pose estimation precision, providing robust support for human behavior analysis in complex environments.
To address the challenges of large-scale target variations and insufficient feature extraction in mining environments, the RTDETR model is optimized to enhance detection performance in complex settings. First, a PConv-Block architecture is designed and integrated into the backbone network, forming the lightweight feature extraction module BasicBlock-PConv. This reduces network parameters and computational load, accelerating processing speed. Second, a deformable attention mechanism is introduced, enabling the model to dynamically adjust attention positions based on input image content. This flexible approach allows the network attention to focus on relevant regions and capture target information in high-resolution images, enhancing the robustness of the network model. Finally, a small object detection layer is added to strengthen the extraction of small object features in shallow networks, improving detection accuracy while reducing false negatives and false positives. The improved network architecture is illustrated in Figure 3.
The improved RTDETR model replaces the original two-stage object detection algorithm Faster R-CNN in AlphaPose. This enhanced model enables rapid and accurate localization and extraction of human bodies, providing a solid foundation for subsequent dynamic behavior recognition.
2.2.2. Improvement of SPPE Network
The single-person pose estimation network architecture in AlphaPose is FastPose, which consists of SE-ResNet, DUC (Dense Upsampling Convolution), PixelShuffle, and the output convolution layer. This structure uses the relatively simple ResNet50 as the backbone network, with limited feature extraction capabilities. In recent years, the Transformer architecture has demonstrated strong performance in computer vision tasks, especially in feature representation. Therefore, this study introduces Swin Transformer [26] as the backbone network for AlphaPose. The hierarchical structure and sliding window mechanism of Swin Transformer enable it to effectively process features at different scales. However, the importance of feature maps varies across channels. To address this, the SE-Swin Transformer architecture was developed. This structure combines Swin Transformer with SENet’s channel attention mechanism. SENet learns weight relationships between channels, adaptively adjusting each channel’s relative importance. This enhances the semantic expression of global information and improves human pose feature extraction in complex backgrounds or occlusion scenarios. Based on this, the S-FastPose network is established to more accurately extract keypoint information of human skeletons. The improved FastPose network model structure is shown in Figure 4.
The S-FastPose architecture comprises modules such as Swin Transformer, DUC, and PixelShuffle. Building upon the original Swin Transformer, it incorporates the SENet module to construct the SE-Swin Transformer network structure. The SENet module enhances the network’s expressive power by adaptively calibrating channel features, enabling the network to focus more on important features. Multiple SE-Swin Transformer blocks are stacked to progressively extract and amplify features. This architectural design enables the S-FastPose network to more accurately capture key skeletal points in complex scenes, thereby improving human pose estimation accuracy.
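The channel recalibration performed by a SENet-style module can be illustrated with a minimal, dependency-free sketch. It shows only the generic squeeze (global average pooling), excitation (bottleneck, ReLU, expansion, sigmoid), and scale steps; the function name `se_recalibrate`, the omission of biases, and the plain nested-list data layout are assumptions made for illustration, and the sketch does not reproduce how the module is wired into the SE-Swin Transformer blocks.

```python
import math


def se_recalibrate(feature_map, w_reduce, w_expand):
    """Apply squeeze-and-excitation channel reweighting (illustrative sketch).

    feature_map: list of C channels, each a list of H rows of W floats.
    w_reduce:    (C // r) x C weight matrix of the bottleneck layer.
    w_expand:    C x (C // r) weight matrix restoring the channel count.
    Biases are omitted to keep the sketch short (an assumption).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(weights, squeezed)))
        for weights in w_reduce
    ]
    # Excitation, step 2: expand back to C channels, sigmoid gate in (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(weights, hidden))))
        for weights in w_expand
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Toy input: 2 channels of size 1 x 2 with hand-picked weights.
toy_map = [[[1.0, 3.0]], [[0.0, 0.0]]]
toy_scaled, toy_gates = se_recalibrate(toy_map, [[1.0, 0.0]], [[0.0], [1.0]])
```

In a trained network the two weight matrices are learned, so channels dominated by background clutter can receive gates close to zero while informative channels are kept.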
2.3. Behavior Recognition Based on ST-GCN
ST-GCN is a deep learning model that integrates graph convolutional networks with time series analysis [27]. Due to its outstanding spatio-temporal data processing capabilities, it is widely applied in human pose estimation and action recognition. This paper employs it for dynamic unsafe behavior recognition based on skeleton sequences extracted by an improved AlphaPose. The extracted skeleton information is arranged frame-by-frame, with the same joint connecting adjacent frames, yielding the spatio-temporal graph of the skeleton sequence as shown in Figure 5.
In the skeleton sequence spatio-temporal graph, dots represent body joints, solid lines indicate connections between joints, dashed lines denote connections between the same joint across adjacent frames, and the green arrow in the lower left corner signifies changes in the temporal dimension. As shown in Figure 5, this graph is composed of N skeleton joints and a T-frame skeleton sequence, which can be expressed as:
$$G = (V, E), \quad V = \{ v_{ti} \mid t = 1, \dots, T;\ i = 1, \dots, N \}$$
where $V$ denotes the set of all joints in a skeleton sequence, and $E$ is the edge set. $v_{ti}$ denotes the i-th joint on frame t, $T$ denotes the total number of frames, and $N$ denotes the number of joints, taken as 17 here. $E_S$ represents the set of connecting edges between different key nodes within the same frame, defined as $E_S = \{ v_{ti} v_{tj} \mid (i, j) \in H \}$, where $H$ is the set of natural joint connections in the human body. This subset describes the internal skeletal connections of the human body in each frame, representing the spatial information of human motion. $E_F$ denotes the set of inter-frame edges, which connect the same joints in consecutive frames: $E_F = \{ v_{ti} v_{(t+1)i} \}$. This subset describes the connections between the same joint nodes of the human body across different frames, representing the temporal information of human motion. Together, these two sets of edges construct the spatio-temporal graph of the human skeleton, enabling spatio-temporal convolutions to capture both spatial and temporal information simultaneously.
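The two edge sets described above can be enumerated directly. The following is a minimal sketch in which a node is written as a pair (t, i), the intra-frame edges repeat the natural joint connections in every frame, and the inter-frame edges link each joint to itself in the next frame. The function name `build_spatiotemporal_edges` and the three-joint toy skeleton are hypothetical and serve only to make the counts easy to check.

```python
def build_spatiotemporal_edges(num_frames, num_joints, bone_pairs):
    """Return (spatial_edges, temporal_edges) over nodes written as (t, i).

    bone_pairs plays the role of the set of natural joint connections:
    each (i, j) is a pair of joint indices joined by a bone.
    """
    for i, j in bone_pairs:
        if not (0 <= i < num_joints and 0 <= j < num_joints):
            raise ValueError(f"bone ({i}, {j}) is outside 0..{num_joints - 1}")
    # Intra-frame edges: the same bones are repeated in every frame.
    spatial_edges = [
        ((t, i), (t, j))
        for t in range(num_frames)
        for i, j in bone_pairs
    ]
    # Inter-frame edges: each joint is linked to itself in the next frame.
    temporal_edges = [
        ((t, i), (t + 1, i))
        for t in range(num_frames - 1)
        for i in range(num_joints)
    ]
    return spatial_edges, temporal_edges


# Toy skeleton (hypothetical): a 3-joint chain observed over 4 frames.
spatial, temporal = build_spatiotemporal_edges(4, 3, [(0, 1), (1, 2)])
print(len(spatial), len(temporal))
```

For T frames, N joints, and |H| bones, the sketch yields T·|H| intra-frame edges and (T−1)·N inter-frame edges.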
The spatio-temporal graph data of skeleton sequences is fed into the ST-GCN model. After batch normalization, the input undergoes convolution through nine spatio-temporal graph convolution layers. The feature vector is then averaged via a pooling layer to obtain a fixed-size representation. Finally, the fully connected layer outputs the recognition result through a Softmax classifier. The ST-GCN recognition workflow is illustrated in Figure 6.
3. Experiments and Analysis
3.1. Datasets
(1)Pose Estimation Dataset
The human skeleton point extraction model was trained using the MSCOCO2017 dataset. The training set annotations were sourced from person_keypoints_train2017.json, comprising over 50,000 images, while the validation set utilized person_keypoints_val2017.json, containing approximately 5000 images. On average, each image in these datasets contains two human instances. The MSCOCO pose estimation dataset defines 17 key joints as human body landmarks; the specific joint labels and names are shown in Table 1.
In the COCO dataset, AlphaPose outputs information for 17 human keypoints. These skeletal points cover major body regions, enabling effective description of human poses and movements. The connectivity relationships among these joints are illustrated in Figure 7a.
The advantage of this design lies in the ability to clearly represent the spatial relationships between different parts of the human body through structured connection relationships, and to adapt to various postures ranging from simple to complex. However, in the identification of miners’ unsafe behaviors, the facial keypoints contribute little to the recognition of actions such as walking, climbing, and jumping. Therefore, this study omits the four facial keypoints of the left eye, right eye, left ear, and right ear while maintaining the original connection relationships of keypoints such as the neck, shoulders, elbows, hips, and knees, so that the number of skeleton keypoints is reduced to 13, as shown in Figure 7b. This adjustment optimized the model’s input representation, reduced the computational cost, and simultaneously improved the recognition accuracy and efficiency.
To adapt to the graph structure of these 13 keypoints, we retrained the ST-GCN model. Specifically, we adjusted the input dimensions and parameters of the graph convolutional layers to match the topological structure of the 13 nodes, and correspondingly modified the input dimension of the final classification layer. All comparative experiments were conducted based on this adapted and retrained model to ensure the fairness and consistency of the evaluation.
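The reduction from 17 to 13 keypoints amounts to an index selection followed by a renumbering of the graph edges. A minimal sketch is given below; the keypoint order is the standard COCO order, while the edge list is the skeleton shipped with the COCO keypoint annotations, used here as an assumption, since the connections actually drawn in Figure 7 may differ. The names `reduce_pose` and `remap_edges` are hypothetical.

```python
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
# The four facial keypoints omitted in the 13-point representation.
DROPPED = {"left_eye", "right_eye", "left_ear", "right_ear"}
KEPT_INDICES = [
    i for i, name in enumerate(COCO_KEYPOINT_NAMES) if name not in DROPPED
]

# Assumption: 0-based version of the COCO annotation skeleton.
COCO_EDGES_17 = [
    (15, 13), (13, 11), (16, 14), (14, 12), (11, 12), (5, 11), (6, 12),
    (5, 6), (5, 7), (6, 8), (7, 9), (8, 10),
    (1, 2), (0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6),
]


def reduce_pose(keypoints_17):
    """Keep the 13 retained keypoints of one 17-keypoint pose."""
    if len(keypoints_17) != len(COCO_KEYPOINT_NAMES):
        raise ValueError("expected 17 keypoints in COCO order")
    return [keypoints_17[i] for i in KEPT_INDICES]


def remap_edges(edges_17):
    """Drop edges touching a removed keypoint and renumber the rest to 0..12."""
    new_index = {old: new for new, old in enumerate(KEPT_INDICES)}
    return [
        (new_index[a], new_index[b])
        for a, b in edges_17
        if a in new_index and b in new_index
    ]


EDGES_13 = remap_edges(COCO_EDGES_17)
```

With this particular edge list the nose is left without a neighbor after the facial edges are removed, so any link from the head to the shoulders has to be supplied by the skeleton definition that is actually used.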
(2)Self-built Behavior Recognition Dataset
This study collected monitoring videos in areas such as the tail end of the belt conveyor, the coal mining face, and the excavation face of a certain coal mine in Shaanxi Province. The study used fixed mine monitoring cameras for collection, with the highest resolution of the cameras reaching 5 million pixels, and the frame rate ranging from 1 to 30 FPS. The videos were collected from 20 cameras, and the original video formats are FLV and MPEG-4. During the data processing stage, FFmpeg was used to extract frames and crop the monitoring videos, setting appropriate frame rates for different target categories and movement speeds to generate image samples. The collection conditions included low illumination, dust interference, equipment obstruction, and self-obstruction by humans. The obstructions mainly came from belt conveyors, supports, and mine tunnel equipment, and there were also small targets of miners caused by the near-far size effect. These videos contained three behaviors of miners: walking, climbing and crossing. Their labels were respectively “xingzou”, “panpa”, and “kuayue”. Some video segments in the dataset are shown in Figure 8.
During the construction of the miner action dataset, the original collected videos were first edited to remove irrelevant actions and redundant segments, retaining only the complete sequences of the target actions to ensure the quality of the dataset. After processing, each single action video clip was approximately 10 s long, with 300 video clips for each behavior category, totaling 900 video clips in the dataset. Next, LabelImg was used to manually delineate the miner’s body area in each frame of the video, generating precise bounding box annotations, and cross-validation was adopted. Subsequently, the processed videos were input into the improved AlphaPose network to extract the keypoint information of the human skeleton, with each frame generating corresponding data. Finally, the motion features corresponding to each video were stored in a JSON file named after the action. All videos were processed according to the above steps, and then all JSON files were divided into training, validation and test sets in an 8:1:1 ratio.
Furthermore, to avoid information leakage caused by the high similarity between adjacent frames, we adopt a grouping strategy based on capture batch, camera ID, and time segment before dataset splitting. This ensures that clips from the same original continuous video source (i.e., the same camera and time period) are not assigned to different subsets simultaneously, and the test set is excluded from all training and parameter tuning processes.
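A grouped split of this kind can be sketched as follows. Clips are first bucketed by (capture batch, camera ID, time segment), and whole buckets, rather than individual clips, are assigned to the subsets. The field names `clip_id`, `batch`, `camera_id`, and `segment`, the function name `grouped_split`, and the seed handling are assumptions for illustration; the procedure actually used may differ in detail.

```python
import random
from collections import defaultdict


def grouped_split(clips, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split clips into train/val/test without separating a source group.

    clips: list of dicts with the (hypothetical) keys
           'clip_id', 'batch', 'camera_id', 'segment'.
    The ratios are applied to groups, so clip-level proportions are only
    approximate when groups differ in size.
    """
    groups = defaultdict(list)
    for clip in clips:
        key = (clip["batch"], clip["camera_id"], clip["segment"])
        groups[key].append(clip["clip_id"])

    keys = sorted(groups)                # deterministic order before shuffling
    random.Random(seed).shuffle(keys)    # reproducible for a fixed seed

    n_train = int(round(ratios[0] * len(keys)))
    n_val = int(round(ratios[1] * len(keys)))
    parts = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],
    }
    # Expand each group back into its clip identifiers.
    return {
        name: [clip_id for key in part for clip_id in groups[key]]
        for name, part in parts.items()
    }
```

Because assignment happens at the group level, near-duplicate frames from one continuous recording cannot end up on both sides of the train/test boundary.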
3.2. Performance Evaluation of Improved Pose Estimation Network
3.2.1. Implementation Details
Experiments were conducted on the Linux Ubuntu 18.04 operating system with an RTX 1080 Ti GPU. The development environment was built based on the PyTorch framework, with specific configurations of PyTorch 1.1.0 and Python 3.8. The hyperparameters for model training were set as follows: the batch size was set to 32, the number of epochs was set to 100, and the optimizer used was Adam, where the learning rate was set as 0.001 and the learning rate scheduler was configured as None; no data augmentations were applied, and a fixed random seed of 0 was used for all experiments.
We select Kps AP (Keypoints Average Precision) as the evaluation metric. In the COCO Keypoints evaluation, the keypoint similarity is defined using OKS as:
$$\mathrm{OKS} = \frac{\sum_i \exp\left(-d_i^2 / (2 s^2 k_i^2)\right)\, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
where $d_i$ denotes the keypoint error distance, $s$ denotes the target scale term, $k_i$ denotes the keypoint scale constant, and $v_i$ denotes the visibility flag. Evaluation is conducted based on the official COCO evaluation API. On the val2017 dataset, the average precision (AP) is computed under different OKS thresholds within the specified threshold range $[0.50, 0.95]$ (step 0.05), and the mean value is taken as the final result. A higher Kps AP value indicates greater accuracy in keypoint detection and therefore superior algorithm performance.
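For a single ground-truth and predicted pose, the OKS computation can be sketched without external dependencies. The per-keypoint constants below are the values used by the COCO evaluation code, where the keypoint scale constant is twice the listed sigma and the squared scale term is the object area. The function name `object_keypoint_similarity` and the list-based interface are assumptions; in practice the official COCO evaluation API should be used to obtain reportable numbers.

```python
import math
import sys

# Per-keypoint sigmas from the COCO evaluation code, in COCO keypoint order.
COCO_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
]

# OKS thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged.
COCO_AP_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def object_keypoint_similarity(gt, pred, visibility, area, sigmas=COCO_SIGMAS):
    """OKS between one ground-truth pose and one predicted pose.

    gt, pred:   lists of (x, y) coordinates, one entry per keypoint.
    visibility: ground-truth flags; keypoints with flag 0 are unlabeled.
    area:       object area, used as the squared scale term.
    """
    total, labeled = 0.0, 0
    for (gx, gy), (px, py), flag, sigma in zip(gt, pred, visibility, sigmas):
        if flag <= 0:
            continue                     # unlabeled keypoints are skipped
        squared_distance = (gx - px) ** 2 + (gy - py) ** 2
        k = 2.0 * sigma                  # keypoint scale constant
        # A tiny epsilon guards against division by zero for empty areas.
        denominator = 2.0 * (area + sys.float_info.epsilon) * k ** 2
        total += math.exp(-squared_distance / denominator)
        labeled += 1
    return total / labeled if labeled else 0.0
```

A prediction counts as a match at a given threshold when its OKS with the ground truth is at least that threshold, and AP is then averaged over the ten thresholds.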
3.2.2. Train Results and Analysis
Figure 9 shows the trend of loss values and training accuracy during the training process of the RS-AlphaPose model. The training process iterated for 100 epochs. The loss value decreased rapidly as the training rounds increased. When the epoch reached 88, the curve stabilized, indicating that the network achieved good convergence. Similarly, the model’s training accuracy climbed rapidly in the early stage, indicating that the model quickly learned the basic feature patterns of the data. As the training rounds increased, the accuracy showed a continuous upward trend, reflecting a gradual improvement in the model’s ability to fit complex features. In the later stage of training, the growth of accuracy slowed and eventually converged, indicating that the model had fully learned the effective information of the training data.
3.2.3. Experimental Results of Improved AlphaPose
To evaluate the effectiveness of the improved human pose estimation algorithm in practical applications, images of a slightly occluded human in a coal mine environment were selected as test samples and input into the model for inference testing. The test results of the RS-AlphaPose model are presented in Figure 10. Figure 10a,c show the original input images, where the primary challenges arise from partial occlusion of human body parts and the dim, low-contrast lighting conditions in the scene. These factors can lead to incorrect connections between human keypoints or a failure to accurately detect the human pose. Figure 10b,d illustrate the corresponding human poses estimated from the images in Figure 10a,c, respectively.
It can be clearly observed from Figure 10a that the miner’s body is partially occluded during climbing, while in Figure 10c, the miner is obstructed by mining equipment. Nevertheless, the proposed algorithm in this study is still able to accurately estimate the human pose under such occlusion, achieving relatively high precision in human detection and connecting the full-body keypoints. Even for occluded body parts, the algorithm can recognize them with reasonable accuracy. This demonstrates that the algorithm is capable of making plausible predictions for occluded human body parts based on the current pose, indicating its strong adaptability to scenarios with a certain degree of occlusion.
3.2.4. Comparison of Pose Estimation Models
A comparative experiment was conducted on the MSCOCO dataset using the OpenPose [16], Mask RCNN pose, and AlphaPose [17] baseline model algorithms, with the experimental results shown in Table 2.
As shown in Table 2, the improved AlphaPose network outperforms the other pose estimation algorithms in the extraction accuracy of human skeletal keypoints. Its detection accuracy on the COCO2017 dataset reaches 72.5%, representing improvements of 2.2%, 11.3%, and 9.4% over the baseline model, OpenPose, and Mask RCNN, respectively. The experimental results demonstrate that the improved algorithm extracts human body keypoints more accurately, verifying its effectiveness. In terms of parameters, the improved RS-AlphaPose model has 32.2 million parameters. Although this is slightly higher than that of the original AlphaPose model, the 2.2% gain in average precision (AP) indicates a favorable balance between model accuracy and parameter quantity.
3.3. Behavior Recognition Model Validation and Results Analysis
The specific training configuration for the miner action recognition model is as follows: the deep learning framework runs on a cloud server with the Linux Ubuntu 18.04 operating system, CUDA version 10.0, and PyTorch 1.1.0, with a training period of 200 epochs and a batch size of 16. In this paper, ST-GCN is combined with the improved AlphaPose algorithm and other similar pose estimation algorithms, and validated on a self-built dataset of miners’ unsafe actions. Three behaviors of underground miners are identified: panpa (climbing), kuayue (crossing), and xingzou (walking). The evaluation metrics adopted are precision, recall and F1-score.
3.3.1. Behavior Recognition Results
This paper employs the improved AlphaPose algorithm to extract human skeleton keypoint sequence data, which is then fed into ST-GCN for identifying unsafe miner behaviors; the temporal window size for ST-GCN is set to 30 frames. To validate the accuracy of the proposed method in classifying miner actions, experimental verification was conducted on a self-built miner action recognition dataset. The results are shown in Table 3. It can be seen that the overall average Precision, Recall and F1-score of the model reach 94.5%, 94.4% and 94.4% respectively, indicating that the proposed method exhibits favorable stability across different action categories. Moreover, the inference speed of the tested model can reach 22 frames per second.
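Feeding ST-GCN with a fixed temporal window implies cutting each per-frame skeleton sequence into 30-frame segments. A minimal sketch of that step follows; only the window length of 30 comes from the text, while the stride, the handling of leftover frames (dropped here), and the function name `sliding_windows` are assumptions.

```python
def sliding_windows(frames, window=30, stride=30):
    """Cut a per-frame skeleton sequence into fixed-length windows.

    frames: list with one entry per frame (for example, 13 keypoints each).
    window: number of frames per segment (30 in the text).
    stride: step between window starts; an assumption, since overlapping
            (stride < window) and non-overlapping windows are both plausible.
    Trailing frames that do not fill a whole window are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(frames) - window
    return [
        frames[start:start + window]
        for start in range(0, last_start + 1, stride)
    ]
```

A smaller stride yields more, overlapping segments from the same clip, which increases the number of predictions per second of video at the cost of additional computation.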
Figure 11 shows the confusion matrix of the model on the test set. The diagonal elements of the matrix are dominant, indicating that most samples were correctly classified and that the overall classification performance is good. The recognition accuracy rates of the three behavior types in the rows, xingzou, panpa, and kuayue, are all above 93%, which indicates that the model can effectively distinguish different action categories, with only a small amount of confusion between categories.
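Precision, recall, and F1-score for each class can be read off a confusion matrix whose rows are the true classes. The sketch below illustrates the calculation on a small made-up matrix; the counts are purely illustrative and are not the values of Figure 11 or Table 3, and the use of an unweighted (macro) average for the overall figures is an assumption.

```python
def per_class_metrics(confusion, labels):
    """Per-class (precision, recall, F1) from a confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)   # column sum
        actually_k = sum(confusion[k])                      # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        if precision + recall:
            f1 = 2.0 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall, and F1 over the classes."""
    count = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / count for i in range(3))


# Made-up counts for illustration only (rows: true class, columns: predicted).
LABELS = ["xingzou", "panpa", "kuayue"]
TOY_CONFUSION = [
    [9, 1, 0],
    [0, 8, 2],
    [1, 0, 9],
]
toy_metrics = per_class_metrics(TOY_CONFUSION, LABELS)
toy_overall = macro_average(toy_metrics)
```

In the toy matrix, two panpa samples are predicted as kuayue, which lowers the recall of panpa and the precision of kuayue at the same time.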
Figure 12 presents example visualization results. In Figure 12a, the confidence of the “panpa” category is 74.1%; in Figure 12b,c, the confidence of the “kuayue” category is 82.5% and 83.4%, respectively; in Figure 12d, the confidence of “xingzou” is 87.2%. The proposed method thus identifies human behaviors accurately in complex environments. Although the behavior in Figure 12a was correctly identified as “panpa”, human occlusion lowered its confidence to 74.1%.
It can also be observed from Figure 12 that uneven illumination, dust interference, and cluttered backgrounds exist in the scenes shown in (b–d). Furthermore, the color tone of the miners’ clothing is similar to that of the background. Nevertheless, the proposed algorithm still recognizes the behaviors of the miners accurately, which demonstrates its effectiveness under these conditions.
To further evaluate the applicability of the proposed algorithm, scenarios involving partial occlusion, severe occlusion, similar actions, and excessive illumination were selected for testing; the results are shown in Figure 13. The figure shows that the algorithm fails to achieve effective recognition under extreme conditions, namely severe occlusion (more than half of the human body), over-exposure, and similar actions.
In Figure 13a, strong overexposure and glare saturate the pixels in the lower-body area of the human target, so the contour and keypoint information of the lower limbs is largely missing. Pose estimation therefore fails and the skeleton sequence is incomplete, which prevents the subsequent temporal action recognition from producing a valid result. In Figure 13b, the postures and motion trajectories of panpa and kuayue-like movements vary little within a short time segment, and the spatial relationships among the hip, knee, and ankle joints are highly similar. The two movements are therefore hard to discriminate in the temporal model, and the panpa sample was misidentified as kuayue during the transition stage.
3.3.2. Comparison of Behavior Recognition Models
The proposed miner unsafe behavior recognition model is validated on our self-built dataset of unsafe miner behaviors, which includes three dynamic actions performed by underground workers: climbing, crossing, and walking. ST-GCN is used as the behavior classifier, and its performance is compared using skeleton sequences extracted by the proposed RS-AlphaPose and by several other pose estimation algorithms.
Table 4 presents the comparative results of the different models in the miner behavior recognition task. The compared models include OpenPose, Mask RCNN, HRNet, and MoveNet, as well as AlphaPose before and after the improvement. All models were tested on our self-built dataset.
The experimental results show that the proposed algorithm outperforms the other methods while keeping a moderate number of parameters. This verifies the advantage of the proposed algorithm in the behavior recognition task, where it accurately identifies unsafe behaviors of underground miners, and confirms the effectiveness of the improvement measures.
To verify the real-time performance, experiments were run on a platform equipped with an NVIDIA GeForce GTX 1080 Ti graphics card (NVIDIA Corporation, Santa Clara, CA, USA) under the Ubuntu 18.04 LTS operating system (Canonical Ltd., London, UK). An end-to-end inference pipeline was used to measure the system-level throughput (FPS) and the single-sample latency (ms). A continuous video stream with an input resolution of 1920 × 1080 and a frame rate of 16 FPS was tested, and the mean latency and the P95 latency were computed over 1000 frames. The results show that the end-to-end throughput of the system reaches approximately 22 FPS, the average inference latency per frame is about 45 ms, and the P95 latency is about 85 ms. The system can therefore meet the real-time monitoring requirements of underground video streams.
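The reported figures are mutually consistent if frames are processed one at a time: a mean latency of 45 ms corresponds to 1000 / 45 ≈ 22.2 FPS. The following minimal sketch shows how such statistics could be derived from per-frame latencies; the function name, the nearest-rank definition of P95, and the sequential-processing assumption are illustrative choices, not the authors' measurement code.

```python
from typing import Dict, Sequence


def latency_summary(latencies_ms: Sequence[float],
                    input_fps: float = 16.0) -> Dict[str, float]:
    """Summarize per-frame end-to-end latencies given in milliseconds.

    Throughput is computed as 1000 / mean latency, which assumes frames are
    processed sequentially with no pipelining or batching.
    """
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    mean_ms = sum(ordered) / n
    rank = (95 * n + 99) // 100          # nearest-rank P95: ceil(0.95 * n)
    return {
        "mean_ms": mean_ms,
        "p95_ms": ordered[rank - 1],
        "throughput_fps": 1000.0 / mean_ms,
        # Time between consecutive input frames at the stream's frame rate.
        "frame_interval_ms": 1000.0 / input_fps,
    }


if __name__ == "__main__":
    # Placeholder samples: a constant 45 ms per frame, not measured data.
    summary = latency_summary([45.0] * 1000)
    print(summary)
```

Comparing `mean_ms` with `frame_interval_ms` (62.5 ms for a 16 FPS stream) indicates whether the pipeline keeps pace with the input on average.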
3.4. Ablation Experiments
To further clarify the contributions of each component within the proposed framework, we conducted a series of ablation experiments on the detection, pose estimation, and action recognition modules. All experiments were performed under the same training configuration to ensure fair comparisons.
Due to complex environmental factors, miner targets typically exhibit characteristics such as small scale, low contrast, and susceptibility to occlusion, placing high demands on the robustness of detection models. Training of the improved RTDETR network was conducted on a cloud server with the Ubuntu 20.04 operating system and Python 3.8 programming language, and the training environment was built based on the PyTorch 1.13.1 deep learning framework. The CPU is AMD EPYC 7601 and the GPU is NVIDIA RTX 3060 Ti (12 GB video memory). The training hyperparameters were set as follows: batch size was 16, the number of training epochs was 200, the Adam optimizer was adopted, and the learning rate was set to 0.001 with no learning rate scheduler employed. Data augmentations consisting of random horizontal flip (probability = 0.5) and random rotation within ±10° were applied, and a fixed random seed of 0 was used across all experiments.
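The stated hyperparameters and augmentation settings can be collected into a small configuration sketch. The class and function names are hypothetical, and the box-flipping helper is included only because a horizontal flip in a detection task must also mirror the bounding boxes; none of this is taken from the authors' training code.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Values taken from the text; field names are illustrative."""
    batch_size: int = 16
    epochs: int = 200
    optimizer: str = "adam"
    learning_rate: float = 1e-3      # constant, no scheduler
    flip_prob: float = 0.5
    max_rotation_deg: float = 10.0
    seed: int = 0


def sample_augmentation(rng: random.Random, cfg: TrainConfig) -> Tuple[bool, float]:
    """Draw one (flip?, rotation angle in degrees) pair for a training image."""
    flip = rng.random() < cfg.flip_prob
    angle = rng.uniform(-cfg.max_rotation_deg, cfg.max_rotation_deg)
    return flip, angle


Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def flip_box_horizontally(box: Box, image_width: float) -> Box:
    """Mirror a bounding box about the vertical centre line of the image."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


if __name__ == "__main__":
    cfg = TrainConfig()
    rng = random.Random(cfg.seed)    # fixed seed makes the draws repeatable
    for _ in range(3):
        print(sample_augmentation(rng, cfg))
    print(flip_box_horizontally((100, 50, 300, 400), image_width=1920))
```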
Table 5 compares the performance obtained by adding the PConv-Block, the deformable attention module, and the small object detection layer to the baseline RTDETR model.
Experimental results demonstrate that PConv-Block effectively enhances the ability to distinguish foreground and background features, exhibiting superior feature representation capabilities in mining surveillance scenarios characterized by dim lighting and significant dust interference. The deformable attention module improves the model’s spatial deformation modeling capacity by introducing an adaptive sampling mechanism, enabling relatively accurate localization even when subjects undergo substantial pose changes or partial occlusions occur. The small object detection layer delivers the most significant improvement in the AP small metric, primarily because miners typically occupy only a minimal pixel area in raw surveillance images. The FPS of the improved model decreases slightly to 28.2, while the AP and small-object detection accuracy improve significantly, which overall reflects a reasonable trade-off between detection accuracy and inference speed.
To evaluate the impact of pose estimation networks on subsequent skeleton-based action recognition performance, a comparative analysis was conducted on three variants of pose estimation models, namely FastPose, Swin Transformer, and the proposed SE-Swin. The relevant experimental results are shown in Table 6.
Experimental results demonstrate that the Swin Transformer, benefiting from its hierarchical self-attention mechanism, significantly outperforms SEResNet-based architectures by achieving more comprehensive fusion of global and local features. This capability proves particularly crucial in scenarios involving miners with substantial pose variations and complex limb movements. The subsequent integration of SENet gives the network adaptive channel-wise feature recalibration, which strengthens the representation of critical information while suppressing the background noise prevalent in underground coal mine surveillance footage. The proposed SE-Swin achieves a 2.2% improvement over FastPose in AP, validating the significant role of network depth design and attention mechanisms in boosting keypoint localization accuracy and stability. Furthermore, the impact of redundant facial keypoints in the skeleton representation on subsequent action recognition performance was analyzed, with the experimental results shown in Table 7.
Although reducing the number of keypoints may, from a structural information perspective, lose some human body topology information, experimental results indicate that removing four facial keypoints slightly improved recognition accuracy. This suggests that facial landmarks play a relatively limited role in modeling task-related behaviors during excavation operations, and that including them may introduce noise and exacerbate model overfitting to some extent. Therefore, a 13-point human skeleton configuration can provide a more compact and efficient feature representation while maintaining adequate motion discrimination capabilities.
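The reduction from 17 to 13 keypoints can be sketched as an index selection over the standard COCO keypoint order. The paper does not say which four facial keypoints are removed; dropping both eyes and both ears while keeping the nose is an assumption made here, and the names `DROPPED` and `reduce_skeleton` are illustrative.

```python
from typing import List, Sequence

# Standard COCO 17-keypoint order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# ASSUMPTION: the four removed facial keypoints are the eyes and ears.
DROPPED = {"left_eye", "right_eye", "left_ear", "right_ear"}

KEPT_INDICES = [i for i, name in enumerate(COCO_KEYPOINTS) if name not in DROPPED]


def reduce_skeleton(frame: Sequence[Sequence[float]]) -> List[List[float]]:
    """Keep only the 13 retained joints of one 17-keypoint frame."""
    if len(frame) != len(COCO_KEYPOINTS):
        raise ValueError("expected 17 keypoints per frame")
    return [list(frame[i]) for i in KEPT_INDICES]


if __name__ == "__main__":
    # Placeholder frame: keypoint i sits at (i, i) with confidence 1.0.
    frame = [[float(i), float(i), 1.0] for i in range(17)]
    reduced = reduce_skeleton(frame)
    print(len(reduced), [COCO_KEYPOINTS[i] for i in KEPT_INDICES][:3])
```

Any graph adjacency used by the downstream classifier would need to be rebuilt for the 13 retained joints, since the edges through the removed keypoints no longer exist.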
4. Discussion
The miner behavior recognition method proposed in this paper, based on the RS-AlphaPose algorithm, achieves relatively good recognition results on the self-built dataset. However, it still has certain limitations. Firstly, the algorithm often fails to produce satisfactory results in complex scenarios with dense crowds and severe occlusions. In such situations the extraction of human keypoints is strongly affected and keypoints are usually missing, which causes the subsequent action recognition to fail. Secondly, the self-built dataset is still insufficient in quantity and diversity. The video data mainly comes from a single mining area, so the generalization ability across different mining areas and working scenarios cannot be systematically verified. Thirdly, the algorithm mainly focuses on single-person behavior recognition. In future research, the data scale will be expanded to cover real working scenarios in multiple mining areas and different occupations, and the integration of multi-target tracking and occlusion-handling techniques will be explored to further improve the robustness and generalization ability of the model in complex mine environments.
5. Conclusions
This paper proposes RS-AlphaPose, a spatio-temporal feature fusion algorithm based on an improved AlphaPose. To address the limitations of AlphaPose in object detection and keypoint feature extraction, it replaces the original Faster R-CNN detector with the RTDETR object detection algorithm, thereby enhancing the network’s capability for miner target recognition. The Swin Transformer module is integrated into the single-person pose estimation network. Its hierarchical structure and sliding window mechanism effectively process features at different scales, significantly enhancing the model’s ability to extract local features within images. Additionally, the algorithm incorporates the SENet channel attention mechanism, further strengthening the semantic expression of global information. This enables the extraction of human pose features even in complex backgrounds or occlusion scenarios, thereby improving overall model performance. Validation on the self-built miner action recognition dataset and the COCO pose dataset demonstrates that RS-AlphaPose achieves a detection accuracy of 72.5% on the pose dataset, a 2.2% improvement over the baseline model. Moreover, when the different pose estimation networks are each combined with ST-GCN, the average F1-score of the proposed algorithm on the self-built miner action recognition dataset is 94.4%, which is 7.4%, 5.4%, 3.2%, 3.3%, and 2.2% higher than the OpenPose, Mask RCNN, AlphaPose, HRNet, and MoveNet models, respectively. It achieves accurate recognition of miner climbing, crossing, and other dynamic unsafe behaviors in the self-built dataset.
