# Beyond peak accuracy: a stability-centric framework for reliable multimodal student engagement assessment

**Authors:** Ismail Said Almuniri, Hitham Alhussian, Norshakirah Aziz, Sallam O. F. Khairy, AlWaleed Sulaiman AlAbri, Zaid Fawaz Jarallah, Saidu Yahaya, Shamsuddeen Adamu

PMC · DOI: 10.1038/s41598-025-31215-7 · 2026-01-02

## TL;DR

This paper introduces a new framework for assessing student engagement using multimodal data, focusing on stability and interpretability to improve reliability.

## Contribution

The contribution is a stability-centric framework that combines class-aware loss functions, temporal data augmentation with heterogeneous ensembling, and SHAP-based interpretability for multimodal student engagement assessment.
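To make the idea of a class-aware loss concrete, here is a minimal sketch of a class-weighted cross-entropy. The summary does not say which class-aware loss the authors use, so the inverse-frequency weighting, the function names, and the toy labels below are all illustrative assumptions rather than the paper's method.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed weighting scheme: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs is a list of dicts mapping class -> predicted probability.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Hypothetical imbalanced label set: 8 "engaged" vs 2 "disengaged" samples.
labels = ["engaged"] * 8 + ["disengaged"] * 2
weights = inverse_frequency_weights(labels)

# A model that is equally unsure about every sample.
probs = [{"engaged": 0.5, "disengaged": 0.5} for _ in labels]
loss = weighted_cross_entropy(probs, labels, weights)
```

With these weights an error on the minority class costs four times as much as an error on the majority class, which is the general mechanism by which a class-aware loss counteracts imbalance.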

## Key findings

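- The framework achieved a mean accuracy of 0.901 ± 0.043 and a mean macro F1 of 0.847 ± 0.068 under repeated cross-validation with multiple seeds, which the authors report as surpassing baselines such as ResNet, Inception, and LightGBM.
- Ablation studies identified temporal augmentation and ensemble diversity as key contributors, and sensitivity analyses reported variance below 0.07 across seeds and folds.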
- SHAP-based analysis provided reliable interpretability, linking predictions to behavioral and cognitive cues.
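
The findings report macro F1 alongside accuracy. The sketch below uses the standard definitions of both metrics to show why that matters under class imbalance; the toy labels are hypothetical and are not data from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced case: the model always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here accuracy is 0.8 while macro F1 is only about 0.44, because the minority class is never predicted. Reporting both metrics exposes this kind of failure.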

## Abstract

Accurate assessment of student engagement is central to technology-enhanced learning, yet existing models remain constrained by class imbalance, instability across data splits, and limited interpretability. This study introduces a multimodal engagement assessment framework that addresses these issues through three complementary strategies: (1) class-aware loss functions to alleviate class imbalance, (2) temporal data augmentation and heterogeneous ensembling to enhance model stability, and (3) SHAP-based analysis of the most stable component for reliable interpretability. Reliability was established through repeated cross-validation with multiple seeds across seven deep learning architectures and the proposed ensemble. The framework established a mean accuracy of 0.901 ± 0.043 and a mean macro F1 of 0.847 ± 0.068, surpassing baselines such as ResNet (Accuracy = 0.917), Inception (Macro F1 = 0.862), and LightGBM (Accuracy = 0.922). Ablation studies highlighted temporal augmentation and ensemble diversity as key contributors, while sensitivity analyses confirmed robustness with variance consistently below 0.07 across seeds and folds. Efficiency profiling established MCNN and TimeCNN as the optimal deployment architecture, combining near-optimal accuracy with superior computational efficiency. SHAP-based interpretation was extended to provide feature-level and class-wise attribution, revealing consistent relationships between predictions and behavioral or cognitive cues. Overall, the study demonstrates that balanced evaluation and ensemble stability are essential for reliable engagement assessment.
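
The evaluation protocol described in the abstract, repeated cross-validation with multiple seeds summarized as mean ± spread, can be sketched as follows. The fold count, the seeds, the stratified splitter, and the majority-class scorer are assumptions chosen for illustration; the abstract does not give these details.

```python
import random
import statistics
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed):
    """Yield (train_idx, test_idx) pairs with each class spread over k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]


def repeated_cv(labels, score_fn, k=5, seeds=(0, 1, 2)):
    """Run k-fold CV once per seed and summarize all fold scores."""
    scores = [
        score_fn(train, test)
        for seed in seeds
        for train, test in stratified_folds(labels, k, seed)
    ]
    return statistics.mean(scores), statistics.stdev(scores), scores


# Hypothetical labels and a placeholder scorer (majority-class accuracy).
labels = [0] * 15 + [1] * 5


def majority_baseline(train, test):
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test) / len(test)


mean_score, std_score, all_scores = repeated_cv(labels, majority_baseline)
```

In practice `score_fn` would train a model on `train` and return its accuracy or macro F1 on `test`. The spread across seeds and folds is what a stability-centric evaluation reports next to the mean.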

## Full-text entities

- **Diseases:** confusion (MESH:D003221), fatigue (MESH:D005221)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12764593/full.md

---
Source: https://tomesphere.com/paper/PMC12764593