Vision-Based Hand Function Evaluation with Soft Robotic Rehabilitation Glove

Mukun Tong; Michael Cheung; Yixing Lei; Mauricio Villarroel; Liang He

PMC · DOI: 10.3390/s26010138 · December 25, 2025

Open Access

TL;DR

This paper introduces a vision-based system to evaluate hand function during rehabilitation using a soft robotic glove, enabling accurate and quantitative tracking of hand movements.

Contribution

The paper proposes an RGB-based evaluation system with fine-tuned hand estimation to overcome occlusion and enable accurate tracking with soft robotic gloves.

Findings

01

The system achieves joint angle tracking errors below 10°, assessed with the mean per joint angle error (MPJAE) and the range of motion (ROM).

02

Additional metrics (APCK, MPJAVE and SPARC error) are introduced to assess movement stability and smoothness.

03

The solution is extensible and adaptable for clinical and home-based rehabilitation assessment.

Abstract

Advances in robotic technology for hand rehabilitation, particularly soft robotic gloves, have significant potential to improve patient outcomes. While vision-based algorithms pave the way for fast and convenient hand pose estimation, most current models struggle to accurately track hand movements when soft robotic gloves are used, primarily due to severe occlusion. This limitation reduces the applicability of soft robotic gloves in digital and remote rehabilitation assessment. Furthermore, traditional clinical assessments like the Fugl-Meyer Assessment (FMA) rely on manual measurements and subjective scoring scales, lacking the efficiency and quantitative accuracy needed to monitor hand function recovery in data-driven personalised rehabilitation. Consequently, few integrated evaluation systems provide reliable quantitative assessments. In this work, we propose an RGB-based evaluation…

Funding

Podium Institute for Sports Medicine and Technology

Keywords

soft robotic glove; computer vision; hand pose estimation; hand rehabilitation; quantitative evaluation

Taxonomy

Topics: Stroke Rehabilitation and Recovery · Soft Robotics and Applications · Robot Manipulation and Learning

Full text

1. Introduction

Hand function plays a crucial role in daily activities, and loss of hand motor abilities, including the joint range of motion (ROM), can significantly impact a person’s quality of life [1]. In hand function rehabilitation, soft robotic gloves and assistive technologies have attracted growing interest for aiding hand function recovery. Soft robotic gloves are wearable assistive devices with compliant, pneumatically or tendon-driven actuators that support finger flexion and extension during rehabilitation tasks, enabling safe, adaptive and comfortable assistance for motor recovery [2,3,4,5]. While recent advances in computer vision have shown excellent results in bare-hand pose estimation, applying these models to evaluate hand function with soft robotic gloves remains challenging [6,7]. The primary difficulty lies in accurately tracking hand poses when the patient’s hand is fully or partially covered by a soft robotic glove. Issues like occlusion, finger slippage and glove size variability also contribute to significant tracking errors [8,9,10]. When parts of the hand or fingers are hidden from the camera’s view, current vision-based models struggle to maintain accuracy [11,12,13,14].

Moreover, an effective approach to quantitatively evaluate hand function as well as the effectiveness of the latest technologies, including soft robotic gloves, remains an open challenge. Clinically, the ROM is often assessed through structured tests like the Fugl-Meyer Assessment (FMA) [15]. These standardised tests evaluate the hand motor function, balance and sensation of patients by scoring their performance in various movements across multiple joints and fingers as well as in both hands. However, they typically rely on manual measurements and subjective scoring scales, which can yield inconsistent results and demand considerable staff time, thereby limiting their usability. Quantitative metrics like the joint ROM in degrees provide more precise information to help clinicians understand patient recovery stages and further develop personalised recovery plans.

Recent advances have explored embedding sensors in soft robotic gloves to quantitatively measure joint movements [16]. For example, flexible sensors and inertial measurement units (IMUs) can provide real-time data on finger joint angles, offering valuable information for feedback control of gloves in rehabilitation [17]. Nevertheless, incorporating embedded sensors would significantly increase the cost of soft robotic gloves, and many flexible sensors still face challenges with reliability and durability under repeated use [16,18]. In contrast, using camera-based systems to measure the posture of soft robotic gloves and evaluate rehabilitation performance presents a promising solution for fast, reliable and low-cost implementation in both clinical and home environments [19,20].

The motivation for this research lies in overcoming these limitations. We propose a hand motion evaluation system integrating camera-based hand tracking with minimal fine-tuning of the Hand Mesh Reconstruction (HaMeR) model [7] to manage occlusions caused by gloves or hand positioning, as shown in Figure 1. It compensates for occluded areas, ensuring continuous tracking even in difficult scenarios. Additionally, this method introduces quantitative evaluation metrics, including the range of motion (ROM) and mean per joint angle error (MPJAE), which provide standardised assessments in rehabilitation and clearer insights into movement dynamics. Unlike models that require frequent recalibration, this method offers an adaptable and scalable solution for various rehabilitation applications, achieving tracking errors below 10° even under challenging conditions. The system reduces the need for extensive retraining, making it suitable for long-term rehabilitation in both clinical and home-based settings.

2. Materials and Methods

2.1. Overview of the Vision-Based Evaluation System

This computer vision-based system was built to quantitatively evaluate hand recovery assisted by soft robotic gloves. We aimed to fine-tune a powerful hand pose estimation model for scenarios in which the hand is wearing a rehabilitation glove, and to define the metrics for evaluation.

We applied the HaMeR [7] model for fine-tuning, which is a fully transformer-based approach designed for reconstructing 3D hand meshes from monocular images or video frames. Its architecture uses a large-scale vision transformer (ViT) as the backbone (also known as H-ViT), which processes the image patches and returns a series of output tokens. A transformer head then maps the tokens to MANO and camera parameters, which can in turn be converted to joint positions and hand meshes [21]. HaMeR excels at capturing complex hand configurations, thanks to its capacity to scale with larger datasets and to exploit powerful deep learning architectures. It consistently outperforms previous state-of-the-art methods in hand pose benchmarks, particularly in challenging in-the-wild scenarios, such as hands interacting with objects or other hands, or hands captured from different viewpoints. However, the model could not be applied directly to our setting because its training data lack hands wearing rehabilitation gloves, and its results without fine-tuning were unsatisfactory, as reported in Section 3. Thus, a fine-tuning method is proposed in our system to help bridge this performance gap.

In evaluation, unlike the qualitative measurements used in clinical tests, we adopted joint angle accuracy metrics from hand pose estimation, including the angle error and the percentage of correctly predicted joints within different error thresholds. These metrics are commonly used in hand pose estimation when joint angles are considered, and the ROM of each joint, a comparatively accurate measurement standard, can be calculated directly from the estimated angles.
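The paragraph above relies on joint angles derived from 3D joint positions. A minimal sketch of one way to do this is given below, assuming the angle at a joint is the angle between the two bones that meet there, with 0° meaning a straight finger. The function names, the point ordering (wrist, MCP, PIP, DIP, tip) and the angle convention are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def joint_angle_deg(parent, joint, child):
    """Angle at `joint` between the incoming and outgoing bones (assumed convention).

    Returns 0 deg when the two bones are collinear, i.e. the finger is straight.
    """
    u = np.asarray(joint, float) - np.asarray(parent, float)  # proximal bone
    v = np.asarray(child, float) - np.asarray(joint, float)   # distal bone
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def finger_angles_deg(points):
    """points: (5, 3) array ordered wrist, MCP, PIP, DIP, tip (assumed ordering).

    Returns the three angles [MCP, PIP, DIP] in degrees.
    """
    p = np.asarray(points, float)
    return [joint_angle_deg(p[i - 1], p[i], p[i + 1]) for i in (1, 2, 3)]

# Example: a finger bent by 90 degrees at the PIP joint only.
bent = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [2, 1, 0], [2, 2, 0]]
angles = finger_angles_deg(bent)
```

Five points per finger match the five markers described in the data collection, but any consistent ordering would work as long as the same convention is used for predictions and ground truth.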

Our training and fine-tuning framework is shown in Figure 2. Firstly, the model was trained following the process from the original HaMeR [7], computing the 3D joint loss, 2D joint loss and mesh loss. After training on large datasets, the transformer head was then fine-tuned on our small-scale dataset collected from gloved hands to adapt it to our environment, using only the 3D joint loss. The details are given in Section 2.3.

2.2. Data Collection

As our dataset included RGB images and corresponding 3D joint positions for calibration, we used a motion capture system to capture the data we needed, as shown in Figure 2. Here, we abbreviate the joints of a finger as the metacarpophalangeal joint (MCP), proximal interphalangeal joint (PIP) and distal interphalangeal joint (DIP). Note that the joints of the thumb, from the wrist to the tip, are abbreviated as the trapeziometacarpal joint (TM), MCP and interphalangeal joint (IP). For clarity, the thumb joints (TM–MCP–IP) follow a similar notation scheme to the remaining fingers (MCP–PIP–DIP), and the same abbreviations are used interchangeably when no ambiguity arises. The setup, covered by green cloth, used 8 NOKOV MARS4H motion capture cameras (NOKOV Science & Technology Co., Ltd., Beijing, China) to locate the joint coordinates and 1 RGB camera (720p, 30 fps) facing the hand to capture images. To represent the positions of the joints, we attached grey spherical markers to the glove. Because the markers interfered with one another when all 21 joints of the hand were marked simultaneously, each image focused on only one finger and the wrist. Thus, there were five markers to capture.

Our dataset consisted of ∼3000 images for training, ∼1000 for validation and ∼1000 for testing, with corresponding 3D joint positions, all captured from one subject. Current automatic 2D hand pose annotation methods such as MMPose [22] have been shown to perform poorly, especially when dealing with hand poses with self- and object occlusions, and manually annotating 2D keypoints for each view is fairly expensive. However, our experiments have shown that 3D data alone is enough for fine-tuning. Mesh data, which requires scanners to collect, is likewise unnecessary in our framework.

2.3. Training and Fine-Tuning

In the training process, the dataset was contributed by the authors of HaMeR [7], which consists of 40,400 RGB images in total. The model computes the 3D joint position losses, 2D joint position losses and mesh losses. For the ground-truth 3D joint positions $[eqn]$ , 2D joint positions $[eqn]$ and MANO parameters $[eqn]$ and the predicted 3D joint positions X, 2D joint positions x and MANO parameters $[eqn]$ , the loss is calculated as follows [7]:

[eqn]

where $[eqn]$ refer to the weights of the 3D (0.05), 2D (0.01) and MANO parameter (0.0005) losses, respectively. Different from the training process, we merely utilised the 3D joint losses for fine-tuning:

[eqn]

This simplified process removes the need for complex annotation of the 2D ground-truth joint positions and supervises the model at the level of the 3D joints, encouraging consistency in 3D space.
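The relationship between the two losses can be sketched numerically as follows. The weights are the ones quoted in the text; the choice of norms (L1 for joint positions, squared L2 for MANO parameters) and the sum reduction are assumptions made for illustration, since the extracted equations are not available here. The function names are hypothetical.

```python
import numpy as np

# Loss weights quoted in the text for the training stage.
W_3D, W_2D, W_MANO = 0.05, 0.01, 0.0005

def training_loss(X, X_gt, x, x_gt, theta, theta_gt):
    """Weighted sum of 3D joint, 2D joint and MANO parameter losses.

    Norm choices are assumptions: L1 on joints, squared L2 on MANO parameters.
    """
    loss_3d = np.abs(np.asarray(X, float) - np.asarray(X_gt, float)).sum()
    loss_2d = np.abs(np.asarray(x, float) - np.asarray(x_gt, float)).sum()
    loss_mano = np.square(np.asarray(theta, float) - np.asarray(theta_gt, float)).sum()
    return float(W_3D * loss_3d + W_2D * loss_2d + W_MANO * loss_mano)

def finetune_loss(X, X_gt):
    """Fine-tuning objective: the 3D joint term only (same assumed L1 norm)."""
    return float(np.abs(np.asarray(X, float) - np.asarray(X_gt, float)).sum())

# Example: five tracked points, each off by 1 unit along every axis.
X_pred, X_true = np.ones((5, 3)), np.zeros((5, 3))
ft = finetune_loss(X_pred, X_true)
```

The fine-tuning function needs only the predicted and ground-truth 3D joints, which is why motion capture data alone suffices for this stage.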

2.4. Implementation Details

All the experiments were conducted on a single Tesla V100 GPU, with Python 3.10.15, PyTorch 2.6.0 + CUDA 12.6 on a Linux workstation. Numerical and scientific computations were performed with NumPy 1.26.1 and SciPy 1.14.1, while image processing and visualisation relied on OpenCV 4.11.0 and Matplotlib 3.9.2. The MANO hand model used in this work corresponds to version 1.2, obtained from the official release [21]. All software packages were installed in a controlled environment to ensure reproducibility.

In training, all the settings remained the same as those provided by the authors of HaMeR [7], with the number of epochs being 1000, the learning rate being $[eqn]$ and the weight decay being $[eqn]$. In the fine-tuning process, the number of epochs was 40, with a learning rate of $[eqn]$ and weight decay of $[eqn]$. In our data collection, the markers were attached to the glove rather than the skin, so there was an offset between the real joint positions and those of the markers. As the offset of the markers on the fingers was consistent, the model could adapt to it during the fine-tuning process. However, the marker representing the wrist was comparatively far from the true joint position, and thus we added 40° to the output MCP angle (since the marker sat higher than the skin). Note that this did not influence the fine-tuning process, as the angles were calculated from the joint positions and not included in the loss computation. We compare the performance of our fine-tuned HaMeR model with the model without fine-tuning as the baseline to demonstrate the effect of our framework.
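The wrist-marker correction described above is a post-processing step on the computed angles. A minimal sketch follows, assuming the angles are stored as an array whose first column is the MCP angle; the column layout and function name are hypothetical, while the 40° value is the one stated in the text.

```python
import numpy as np

MCP_OFFSET_DEG = 40.0  # correction value stated in the text

def correct_mcp(angles_deg, mcp_column=0, offset_deg=MCP_OFFSET_DEG):
    """Add a constant offset to the MCP angle only.

    angles_deg: array of shape (..., 3) with columns [MCP, PIP, DIP]
    (assumed layout). The input is copied, so the raw angles are untouched.
    """
    out = np.array(angles_deg, dtype=float)  # np.array copies by default
    out[..., mcp_column] += offset_deg
    return out

raw = np.array([[10.0, 20.0, 30.0], [15.0, 25.0, 35.0]])
corrected = correct_mcp(raw)
```

Because the correction acts on angles computed after inference, it never enters the loss, which is consistent with the statement that it does not influence fine-tuning.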

3. Results

In addition to the original HaMeR model, we included Hamba as a representative interaction-aware RGB-based baseline [23]. Hamba has demonstrated robustness in challenging hand–object interaction scenarios and serves as a competitive comparison for evaluating performance under glove-induced occlusion.

3.1. Joint Angle Accuracy

To evaluate the accuracy of three joint angles per finger, we chose the mean per joint angle error (MPJAE) [24] to measure the error:

[eqn]

where N is the number of joints and $[eqn]$ and $[eqn]$ are the ground-truth and predicted angles, respectively. Based on the percentage of correct keypoints (PCK) score [25], we introduce the angle PCK (APCK) score, which represents the percentage of angles predicted correctly within a given error threshold (unit: °). Subsequently, we could draw the curve of the APCK score with different thresholds and calculate the area under the curve (AUC).
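These three quantities can be sketched as below, assuming MPJAE is the mean absolute angle difference, APCK counts an angle as correct when its error is at most the threshold, and the AUC is normalised by the largest threshold. The inclusive comparison, the 30° upper threshold and the function names are assumptions for illustration.

```python
import numpy as np

def mpjae(pred_deg, gt_deg):
    """Mean absolute difference between predicted and ground-truth angles (deg)."""
    err = np.abs(np.asarray(pred_deg, float) - np.asarray(gt_deg, float))
    return float(err.mean())

def apck(pred_deg, gt_deg, threshold_deg):
    """Fraction of angles whose absolute error is within the threshold."""
    err = np.abs(np.asarray(pred_deg, float) - np.asarray(gt_deg, float))
    return float(np.mean(err <= threshold_deg))

def apck_auc(pred_deg, gt_deg, max_threshold_deg=30.0, steps=61):
    """Area under the APCK curve, normalised to [0, 1] (trapezoidal rule)."""
    thresholds = np.linspace(0.0, max_threshold_deg, steps)
    scores = np.array([apck(pred_deg, gt_deg, t) for t in thresholds])
    area = np.sum((scores[1:] + scores[:-1]) / 2.0 * np.diff(thresholds))
    return float(area / max_threshold_deg)

# Toy example with absolute errors of 2, 0, 6 and 15 degrees.
pred = [10.0, 20.0, 30.0, 40.0]
gt = [12.0, 20.0, 24.0, 55.0]
error = mpjae(pred, gt)        # 5.75
score_5 = apck(pred, gt, 5.0)   # 0.5
score_10 = apck(pred, gt, 10.0) # 0.75
```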

We present the MPJAE and angle PCK scores with error thresholds of 5° and 10° in Table 1, where we abbreviate our fine-tuned HaMeR model as HaMeR-F. We observed that over 60% of the angle predictions had an error lower than 5°, and more than 80% of the angles were accurately predicted with a threshold of 10°. The APCK score curves with different error thresholds are illustrated in Figure 3a, where the AUC of our fine-tuned model reached almost twice those of the Hamba model and the HaMeR model without fine-tuning. However, it can also be seen that the model performed worst for the MCP angle. This can be partly explained by the large offset of the wrist marker, which is hard to correct precisely across different environments and movements.

Table 2 reports the ground-truth and predicted ROMs of the joints, each given as the minimum and maximum angle of the joint. To observe the dynamic estimation of the joint angles, we also tested our model on a 10 s (300 frames) video of middle finger movements and compared the predicted angle curves with the ground truth in Figure 3b. In general, the predictions followed the finger movements closely over time, further supporting the feasibility of the model for ROM monitoring.
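Since the ROM is reported as a minimum and maximum angle per joint, it can be read directly off an angle trajectory. The following is a minimal sketch under that definition; the function names and the example trajectories are hypothetical and do not reproduce any values from the paper.

```python
import numpy as np

def rom_deg(angles_deg):
    """Return (min, max) joint angle over time.

    angles_deg: shape (T,) for one joint or (T, J) for J joints; the time
    axis is assumed to be axis 0.
    """
    a = np.asarray(angles_deg, float)
    return a.min(axis=0), a.max(axis=0)

def rom_span_error(pred_deg, gt_deg):
    """Absolute difference between predicted and ground-truth ROM spans."""
    p_lo, p_hi = rom_deg(pred_deg)
    g_lo, g_hi = rom_deg(gt_deg)
    return np.abs((p_hi - p_lo) - (g_hi - g_lo))

# Illustrative trajectories for a single joint (degrees per frame).
gt_traj = [5.0, 40.0, 85.0, 30.0]
pred_traj = [8.0, 42.0, 80.0, 33.0]
lo, hi = rom_deg(gt_traj)
span_err = rom_span_error(pred_traj, gt_traj)
```

Because the minimum and maximum are sensitive to single-frame outliers, a practical pipeline might smooth or use percentiles first; that choice is outside what the text specifies.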

3.2. Qualitative Results

Figure 4 visualises the predicted 3D hand joints for the thumb, middle finger and pinky finger. After fine-tuning, the model became consistently robust and accurate in predicting the joint angles as well as the finger poses, even with the glove.

3.3. Ablation Study: 3D Joint Accuracy and Kinematic Smoothness

In this ablation study, we further analysed both the 3D joint accuracy and the kinematic quality of the reconstructed motion. The 3D accuracy was quantified with the mean per joint position error (MPJPE), which measures the Euclidean distance between the predicted and ground-truth joint positions in millimetres. Beyond static positional accuracy, we introduce two kinematic-level metrics to assess temporal fidelity:

Mean per Joint Angular Velocity Error (MPJAVE): This was computed as the mean absolute difference between the predicted and ground-truth angular velocities (in °/s) of each joint. This metric reflects how well the model captured the motion dynamics, i.e., the speed consistency of each finger joint during flexion and extension. A smaller MPJAVE indicates more temporally stable motion estimation.
Angular SPARC Error: This is derived from the spectral arc length (SPARC) metric, a frequency-domain measure of motion smoothness [26]. The SPARC quantifies how smoothly a joint angle trajectory evolves over time by measuring the arc length of the normalised amplitude spectrum of its velocity profile. Here, we report the absolute SPARC difference between the estimated and ground-truth trajectories, where lower values denote higher smoothness consistency.
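The three metrics above can be sketched as follows. This is an illustrative implementation, not the authors' code: angular velocity is approximated with finite differences, SPARC is applied to the absolute angular velocity, and the zero-padding level, 10 Hz cut-off and 0.05 amplitude threshold are commonly used defaults that are assumed here rather than taken from the paper.

```python
import numpy as np

def mpjpe_mm(pred_xyz, gt_xyz):
    """Mean Euclidean distance between predicted and true joints (inputs in mm)."""
    diff = np.asarray(pred_xyz, float) - np.asarray(gt_xyz, float)
    return float(np.mean(np.linalg.norm(diff, axis=-1)))

def angular_velocity(angles_deg, fps):
    """Finite-difference angular velocity in deg/s along the time axis (axis 0)."""
    return np.gradient(np.asarray(angles_deg, float), 1.0 / fps, axis=0)

def mpjave(pred_deg, gt_deg, fps):
    """Mean absolute angular velocity difference in deg/s."""
    dv = angular_velocity(pred_deg, fps) - angular_velocity(gt_deg, fps)
    return float(np.mean(np.abs(dv)))

def sparc(speed, fps, pad_level=4, cutoff_hz=10.0, amp_threshold=0.05):
    """Spectral arc length of a 1D speed profile (more negative = less smooth).

    Assumes a non-constant, non-zero profile; parameter defaults are assumptions.
    """
    speed = np.asarray(speed, float)
    nfft = int(2 ** (np.ceil(np.log2(len(speed))) + pad_level))
    freqs = np.arange(nfft) * fps / nfft
    mag = np.abs(np.fft.fft(speed, nfft))
    mag = mag / mag.max()                       # normalise the amplitude spectrum
    keep = freqs <= cutoff_hz                   # fixed upper frequency bound
    freqs, mag = freqs[keep], mag[keep]
    above = np.nonzero(mag >= amp_threshold)[0] # adaptive bound from amplitude
    freqs = freqs[above[0]:above[-1] + 1]
    mag = mag[above[0]:above[-1] + 1]
    df = np.diff(freqs) / (freqs[-1] - freqs[0])
    return float(-np.sum(np.sqrt(df ** 2 + np.diff(mag) ** 2)))

def angular_sparc_error(pred_deg, gt_deg, fps):
    """Absolute SPARC difference between two 1D joint angle trajectories."""
    s_pred = sparc(np.abs(angular_velocity(pred_deg, fps)), fps)
    s_gt = sparc(np.abs(angular_velocity(gt_deg, fps)), fps)
    return abs(s_pred - s_gt)

# Example: a 10 s, 30 fps sinusoidal flexion-extension trajectory.
fps = 30
t = np.arange(300) / fps
trajectory = 45.0 + 40.0 * np.sin(2.0 * np.pi * 0.5 * t)
smoothness = sparc(np.abs(angular_velocity(trajectory, fps)), fps)
```

All three functions operate per frame or per trajectory without any learned component, so they can be applied to the output of any pose estimator for comparison.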

Table 3 summarises the per-joint results of the baseline models (HaMeR and Hamba) and the fine-tuned model (HaMeR-F). While fine-tuning significantly improved the spatial accuracy in terms of joint position and angle estimation, the gains in temporal metrics such as the MPJAVE and SPARC were more limited. This is expected, as the fine-tuning process optimises the per-frame spatial alignment using 3D joint supervision without explicitly modeling temporal dynamics or motion smoothness priors. Consequently, improvements in velocity consistency and spectral smoothness arise primarily as indirect effects of reduced spatial noise rather than from dedicated temporal constraints. Incorporating sequence-level supervision or explicit temporal regularisation is therefore a promising direction for further improving kinematic smoothness.

4. Discussion

This work brings soft-robotic-glove-based hand and finger tracking closer to accessible rehabilitation environments, particularly by easing its integration into clinical settings. With the fine-tuned vision-based model and a single RGB camera, this approach offers a promising solution for accurately estimating hand ROMs. The method also supports standardised rehabilitation assessment through quantitative evaluation metrics, such as the ROM, MPJAE, MPJPE, MPJAVE and SPARC errors, which provide clearer insights into movement dynamics.

However, it should be noted that the current study focused on a subject-specific calibration scenario, with all training, validation and testing data collected from a single participant wearing one soft robotic glove under controlled conditions. This setting is aligned with practical rehabilitation use cases, where soft robotic gloves are typically calibrated to individual users due to variations in hand size, glove fit and motor ability. With calibration, the proposed framework already enables accurate and continuous estimation of joint angles and functional metrics (e.g., ROM and kinematic consistency) using only an RGB camera, providing a low-cost and easily deployable evaluation tool that can be readily adopted for subject-level assessment, system benchmarking and algorithm validation.

At the same time, cross-subject generalisation remains a well-recognised challenge in vision-based hand pose estimation. Prior studies have shown that models trained on limited subject populations often struggle to transfer reliably to unseen individuals with different hand geometries, appearance characteristics and motion patterns, particularly under occlusion and domain shifts [27]. Recent systematic reviews further indicate that despite rapid progress in deep learning-based methods, robust generalisation across diverse real-world conditions and subject populations is still an open problem due to anatomical variability, occlusions and dataset bias [28]. Thus, the present study is positioned as an initial feasibility investigation rather than a cross-subject evaluation.

The evaluation was therefore conducted using isolated single-finger motions, fixed camera viewpoints and uniform backgrounds to intentionally isolate the effect of visual occlusion in a controlled setting. In addition, the wrist angle correction applied in this study was used only during post-processing for evaluation purposes and did not affect model training or optimisation. This subject-specific adjustment further motivates the need for systematic calibration strategies. Building on the current framework, future work will expand the dataset to include multiple subjects and more diverse motion patterns, investigate cross-subject generalisation strategies and explore more comprehensive functional and kinematic metrics that better capture movement coordination, temporal consistency and rehabilitation-relevant performance beyond joint-level accuracy.

Although clinical relevance is a central motivation of this work, therapist involvement in the present study was limited and did not include qualitative clinical assessments. Consequently, the study did not include validation on patient populations or direct alignment with established clinical assessment scales, such as the Fugl-Meyer Assessment (FMA) scale [15]. The quantitative metrics reported here (ROM, MPJAE, MPJPE, MPJAVE and angular SPARC error) are therefore intended as complementary, objective descriptors rather than replacements for clinician-rated scales. From a clinical workflow perspective, the proposed system is best regarded as a supportive tool for assessments with subject-specific calibration, enabling quantitative monitoring of rehabilitation progress alongside established protocols. Extending the approach to pathological movement patterns such as spasticity, abnormal synergies or compensatory motions will also face limitations and will require pathology-specific data collection, temporal modeling and therapist-guided validation. Future work will focus on clinical studies to relate the proposed quantitative metrics to established clinical assessments and to facilitate translation from laboratory evaluation to real-world rehabilitation practice.

Bibliography

1. Meng, F.; Liu, C.; Li, Y.; Hao, H.; Li, Q.; Lyu, C.; Wang, Z.; Ge, G.; Yin, J.; Ji, X. Personalized and Safe Soft Glove for Rehabilitation Training. Electronics 2023, 12, 2531. doi:10.3390/electronics12112531
2. Zhang, Y.; Orban, M.; Wu, Y.; Liu, C.; Wang, J.; Elsamanty, M.; Yang, H.; Guo, K. A review of soft robotics and soft rehabilitation gloves: Exploring alternative soft robots actuation techniques. Int. J. Intell. Robot. Appl. 2025, 9, 1368–1393. doi:10.1007/s41315-025-00474-y
3. Zhang, T.; Zheng, K.; Tao, H.; Liu, J. A Soft Wearable Modular Assistive Glove Based on Novel Miniature Foldable Pouch Motor Unit. Adv. Intell. Syst. 2025, 7, 2500274. doi:10.1002/aisy.202500274
4. Proulx, C.E.; Beaulac, M.; David, M.; Deguire, C.; Haché, C.; Klug, F.; Kupnik, M.; Higgins, J.; Gagnon, D.H. Review of the effects of soft robotic gloves for activity-based rehabilitation in individuals with reduced hand function and manual dexterity following a neurological event. J. Rehabil. Assist. Technol. Eng. 2020, 7, 2055668320918130. doi:10.1177/2055668320918130. PMID: 32435506. PMCID: PMC7223210
5. Kottink, A.I.; Nikamp, C.D.; Bos, F.P.; Sluis, C.K.v.d.; Broek, M.v.d.; Onneweer, B.; Stolwijk-Swüste, J.M.; Brink, S.M.; Voet, N.B.; Rietman, J.S. Therapy effect on hand function after home use of a wearable assistive soft-robotic glove supporting grip strength. PLoS ONE 2024, 19, e0306713. doi:10.1371/journal.pone.0306713. PMID: 38990858. PMCID: PMC11239026
6. Jiang, C.; Xiao, Y.; Wu, C.; Zhang, M.; Zheng, J.; Cao, Z.; Zhou, J.T. A2J-Transformer: Anchor-to-joint transformer network for 3D interacting hand pose estimation from a single RGB image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 8846–8855.
7. Pavlakos, G.; Shan, D.; Radosavovic, I.; Kanazawa, A.; Fouhey, D.; Malik, J. Reconstructing hands in 3D with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–20 June 2024; pp. 9826–9836.
8. Hampali, S.; Rad, M.; Oberweger, M.; Lepetit, V. HOnnotate: A method for 3D annotation of hand and object poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3196–3206.