Detection of Cortical Arousals in Sleep Using Multimodal Wearable Sensors and Machine Learning

Murat Kucukosmanoglu; Sarah Conklin; Kanika Bansal; Sena Kaya; Yumna Anwar; Quang Dang; Golshan Kargosha; Justin Brooks; Cody Feltch; Nilanjan Banerjee

PMC · DOI:10.21203/rs.3.rs-6574148/v1·May 16, 2025

Open Access

TL;DR

This study introduces a wearable device and machine learning framework to detect sleep disruptions in children with ADHD.

Contribution

A noninvasive framework using wearable sensors and machine learning to detect cortical arousals in sleep.

Findings

01

Movement intensity features were most effective for arousal detection.

02

Random Forest model achieved a ROC AUC of 0.94 in detecting cortical arousals.

03

The framework was tested in a pediatric ADHD cohort with sleep concerns.

Abstract

Cortical arousals are brief brain activations that disrupt sleep continuity and contribute to cardiovascular, cognitive, and behavioral impairments. Although polysomnography is the gold standard for arousal detection, its cost and complexity limit use in long-term or home-based monitoring. This study presents a noninvasive machine learning based framework for detecting cortical arousals using the RestEaze™ system, a leg-worn wearable that records multimodal physiological signals including accelerometry, gyroscope, photoplethysmography (PPG), and temperature. Across multiple methods tested, including logistic regression, XGBoost, and Random Forest classifiers, we found that features related to movement intensity were the most effective in identifying cortical arousals, while heart rate variability had a comparatively lower impact. The framework was evaluated in 14 children with…

Keywords

cortical arousals · RestEaze · ADHD · wearables · machine learning · sleep monitoring

Taxonomy

Topics: EEG and Brain-Computer Interfaces · Sleep and Wakefulness Research · Sleep and Work-Related Fatigue

Full text

Introduction

Cortical arousals are brief interruptions in electroencephalographic (EEG) activity that fragment sleep without full awakening. Although transient, these arousals contribute to autonomic activation and disrupted sleep patterns, with growing evidence linking them to hypertension, cognitive decline, and elevated cardiovascular risk^1–3^. Total sleep duration less than 5 hours per night is considered high-risk for cardiovascular morbidity and mortality^4^. Disrupted or insufficient sleep has also been associated with systemic inflammation, metabolic dysfunction, and increased all-cause mortality^5^. Elevated rates of sleep disturbances, including cortical and autonomic arousals, have also been observed in children with attention-deficit/hyperactivity disorder (ADHD)^6–8^. Early and accurate detection of these arousals may offer clinical insights into the relationship between poor sleep quality and daytime behavioral symptoms, and may reveal patterns that differ by clinical subtype.

Polysomnography remains the gold standard for detecting cortical arousals^9,10^, yet its high cost, complexity, and requirement for overnight clinical supervision limit its use for large-scale or long-term monitoring^11^. Consumer sleep technologies, such as sleep trackers, offer a non-invasive, scalable approach to sleep monitoring, with the potential to support early identification of sleep fragmentation in home environments. While these devices offer greater accessibility, they often suffer from poor agreement with polysomnography, particularly in detecting brief or motionless arousals^12^. A multicenter validation study involving 11 wearable, nearable, and airable consumer sleep trackers confirmed substantial variation in performance across devices, with some showing macro F1-scores as low as 0.26 when compared to polysomnography^13^. However, the growing integration of wearable sleep technologies into daily life offers a valuable opportunity to develop advanced frameworks that can effectively use these technologies to detect clinically relevant features of sleep.

One promising solution involves tracking leg movements during sleep, which frequently occur alongside cortical arousals, especially in populations with conditions like restless legs syndrome, periodic limb movement disorder, or ADHD^14–17^. Recent studies using wearable leg sensors have shown that features of leg movements during sleep can effectively distinguish arousals, and that leg-EEG signal coupling may reflect deeper physiological mechanisms of sleep disruption^18,19^. In this study, we evaluate multimodal sensor data from a leg-worn wearable, RestEaze^™^, to detect cortical arousals using interpretable machine learning models, with the aim of advancing practical and reliable sleep health monitoring solutions outside of traditional clinical settings.

The RestEaze^™^ system integrates accelerometry, gyroscope, photoplethysmography (PPG), and temperature sensors, offering a comprehensive view of movement and physiological dynamics during sleep. In a prior pilot study using a similar platform, we introduced neuro-extremity analysis, a novel approach that employed Granger causal modeling to assess the temporal and directional relationships between cortical arousals and leg movements^20^. That study revealed that textile-based capacitive sensors showed stronger temporal and spectral coupling with EEG-theta oscillations than inertial sensors, and more accurately identified expert-labeled cortical arousals. These findings support the hypothesis that leg movements and cortical arousals are driven by coordinated activity within a shared central arousal system. The current study builds upon this work by incorporating PPG and temperature sensors into the previously studied system and focusing exclusively on inertial sensors for movement detection, as they were found to reliably capture arousal-related leg movements while avoiding the redundancy and implementation challenges associated with textile-based capacitive sensors. This setup allows extraction of heart rate (HR) and heart rate variability (HRV) features that may offer additional insight into autonomic activation during sleep^21–23^.

Results

Sleep is composed of two main states: rapid eye movement (REM) sleep and non-rapid eye movement (NREM) sleep. NREM includes three stages: N1, N2, and N3, which progress from light to deep sleep. These stages repeat in cycles throughout the night^24^. We began by examining the distribution of cortical arousals across sleep stages to establish a physiological context for the classification task. Arousals occurred most frequently during N2 sleep, with a mean proportion of 56.77% (95% confidence interval [CI]: 46.14–67.40%), followed by N1 at 17.47% (95% CI: 8.15–26.79%), REM at 13.17% (95% CI: 4.43–21.90%), and N3 at 12.60% (95% CI: 7.17–18.02%), averaged across subjects. This distribution aligns with established sleep physiology: N2 sleep not only comprises a larger portion of total sleep time but also has a lower arousal threshold, making it more prone to cortical arousals due to its transitional nature between wakefulness and deeper sleep stages^24^. Similarly, the elevated rate of arousals during N1 reflects its light sleep status and proximity to wakefulness. Interestingly, we also observed notable levels of arousals during N3 and REM sleep, suggesting increased cortical arousal beyond the lighter stages. This pattern may support prior findings showing that adolescents with ADHD and learning disorders exhibit increased cortical arousal during N2 and N3 sleep, particularly in central and frontal brain regions^25^.

To enable real-time detection of these arousal events using wearable data, we implemented and evaluated machine learning models designed to classify arousals from multimodal physiological signals. We evaluated the performance of three machine learning classifiers (Logistic Regression, XGBoost, and Random Forest) for detecting cortical arousals based on multimodal physiological data from a leg-worn wearable device in the full cohort of 14 children with ADHD, a population known to experience elevated levels of sleep fragmentation and frequent cortical arousals^6^. We chose these models to represent different levels of complexity and explainability: Logistic Regression as a simple linear baseline, Random Forest as a robust ensemble method, and XGBoost as a state-of-the-art gradient boosting algorithm.

All models were trained using a leave-one-subject-out cross-validation (LOOCV) approach to ensure robust subject-independent evaluation. The classification task involved identifying arousal events versus non-arousal periods. Evaluation metrics included class-wise precision, recall, F1-score, and overall Receiver Operating Characteristic - Area Under the Curve (ROC-AUC).

Model training and performance

The performance of each model is summarized in Table 1. While all three classifiers showed high accuracy in detecting non-arousal periods (Class 0), their ability to detect arousal events (Class 1) varied considerably. Logistic Regression achieved a Class 1 F1-score of 0.57 and a ROC-AUC of 0.90. XGBoost improved precision but had lower recall for Class 1, with a resulting F1-score of 0.61 and a ROC-AUC of 0.93. Random Forest achieved the best balance, with a Class 1 F1-score of 0.65 and the highest ROC-AUC of 0.94. Based on these results, the Random Forest model was selected for further analysis.

Feature Importance

Figure 1 presents the ranked list of the most important features contributing to cortical arousal classification, as determined by the Random Forest model. These features were predominantly derived from accelerometer and gyroscope signals, with a smaller contribution from HR and HRV metrics. The most important features included statistical, energy-based, and entropy-related measures. Importantly, standard deviation, root mean square (RMS), maximum, and range from the x-axis of the accelerometer appeared prominently in the ranking. This suggests that lateral leg movement (x-direction) plays a critical role in arousal episodes, consistent with biomechanical patterns observed during limb movement–related arousals.

Entropy-based features such as spectral entropy from both accelerometer and gyroscope signals were also among the top-ranked predictors. These features reflect the signal complexity or irregularity during sleep and are useful for capturing subtle variations in movement associated with arousals. Similarly, RMS AUC (Root Mean Square Area Under the Curve) quantifies cumulative signal energy, which is often elevated during microarousals due to brief bursts of leg activity.

Other contributing features included HRV-derived indices such as HRV Higuchi fractal dimension (HRV-HFD), HRV Cardiac Sympathetic Index (HRV-CSI), and HRV Fuzzy Entropy (HRV-FuzzyEn), all of which reflect beat-to-beat HRV complexity, physiological markers known to fluctuate during autonomic arousals^26^. However, they were less important than movement-based metrics, suggesting a stronger motor component to arousals in children with ADHD. Similarly, temperature-based features were not among the top-ranked predictors, indicating minimal relevance to arousal classification in this context.

In addition to feature rankings, we analyzed PPG signal quality across arousal categories. The mean PPG quality score was 0.818 (95% CI: 0.738–0.899) during non-arousal periods and 0.488 (95% CI: 0.420–0.556) during arousal events. This significant decline in signal quality during arousals suggests increased motion artifacts or sensor dropout, which may explain the lower importance of PPG-derived features in the final model.

Agreement with Ground Truth

Figure 2 shows the model prediction of the arousal rates against the true arousal rates (ground truth). In this study, arousal rate refers to the number of 60-second windows that contain at least one cortical arousal event, normalized per hour of total sleep time. The predicted rates exhibited a strong correlation with the ground truth, yielding a Spearman’s rank correlation coefficient

[eqn]

These results show a strong relationship, suggesting that the model successfully preserves subject-wise ranking in arousal frequency, which is crucial for estimating severity and comparing individuals.

The fitted linear regression line further supports the alignment between predicted and true values. The slope below 1.0 indicates underestimation at higher arousal rates, yet the close clustering of points around the line reflects consistency in the overall prediction trend. The regression slope was statistically significant (p < 0.01), with a 95% CI of [0.383, 1.050].
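The two agreement statistics described above can be sketched with SciPy's `spearmanr` and `linregress`. The per-subject rates below are hypothetical placeholder values chosen for illustration, not the study data.

```python
import numpy as np
from scipy.stats import spearmanr, linregress

# Hypothetical per-subject arousal rates (arousal windows per hour).
true_rate = np.array([2.0, 3.5, 5.0, 6.5, 8.0, 10.0])
pred_rate = np.array([3.1, 4.0, 5.6, 6.9, 8.2, 9.4])

# Rank correlation: does the model preserve the ordering of subjects?
rho, p_rho = spearmanr(true_rate, pred_rate)

# Least-squares line of predicted on true rates; a slope below 1 means
# the predictions grow more slowly than the ground truth.
fit = linregress(true_rate, pred_rate)

print(f"Spearman rho = {rho:.2f}, slope = {fit.slope:.2f}, p = {fit.pvalue:.4f}")
```

With these placeholder values the ordering is fully preserved while the slope stays below 1, which mirrors the qualitative pattern reported in the text.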

To further assess agreement, a Bland–Altman analysis was conducted (Fig. 3). This plot shows the differences between predicted and true arousal rates as a function of their average, both expressed in arousals per hour. The mean difference was +0.88 arousals/hour (Predicted – True), indicating a slight overall tendency of the model to overestimate arousal frequency. The 95% limits of agreement ranged from −1.40 to +3.17 arousals/hour.
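As a minimal sketch of the Bland–Altman quantities reported above, the bias is the mean of the paired differences and the 95% limits of agreement are the bias plus or minus 1.96 standard deviations of those differences. The use of the sample standard deviation (`ddof=1`) and the input values are assumptions for illustration.

```python
import numpy as np

def bland_altman(pred, true):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diff = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation (assumed convention)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical rates in arousal windows per hour.
bias, lower, upper = bland_altman([3.0, 4.0, 7.0, 9.0], [2.0, 4.0, 5.0, 8.0])
print(f"bias = {bias:+.2f}, limits of agreement = [{lower:+.2f}, {upper:+.2f}]")
```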

Temporal Prediction Patterns

To evaluate model behavior across time, we visualized prediction sequences for three subjects who showed distinct arousal patterns. Figure 4 shows minute-by-minute comparisons between predicted and true arousals across the sleep duration.

For Subject A (Fig. 4a), who exhibited frequent and widely distributed arousals, the model effectively captured both isolated and clustered events throughout the night. Minute-by-minute inspection showed that most predictions were temporally aligned with ground truth, with several pre-arousal predictions appearing within one to two minutes of labeled events.

In contrast, Subject B (Fig. 4b) presented arousals that occurred in distinct temporal clusters during the early and late portions of the recording. The model maintained high temporal precision, correctly identifying contiguous arousal periods while avoiding false positives during quiescent intervals. Subject C (Fig. 4c) exhibited a sparser distribution of arousals. The model’s predictions closely matched the few true events, with overclassification toward the end.

The agreement between predicted and true arousals is quantified using Arousals (Class 1) F1-scores: 0.62 (a), 0.68 (b), and 0.54 (c). These scores indicate strong model performance given the substantial class imbalance, where arousals make up only ~6% of the data. For context, random guessing would yield an F1-score near 0.06, making the observed values highly meaningful. These subject-level, minute-by-minute visualizations highlight the model’s adaptability to inter-individual variability in sleep and arousal patterns.
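The random-guessing baseline mentioned above can be worked out directly. If each window is flagged as an arousal independently with probability q, and the true prevalence is p, then in expectation precision approaches p and recall approaches q, so F1 is approximately 2pq / (p + q). The sketch below is a large-sample approximation under that independence assumption; the function name is illustrative.

```python
def expected_f1_random(prevalence, positive_rate):
    """Approximate F1 of a classifier that flags windows at random.

    With independent random flags, precision tends to the prevalence and
    recall tends to the rate at which windows are flagged.
    """
    p, q = prevalence, positive_rate
    return 2 * p * q / (p + q)

PREVALENCE = 0.066  # share of arousal windows reported in the Methods

f1_at_prevalence = expected_f1_random(PREVALENCE, PREVALENCE)
f1_all_positive = expected_f1_random(PREVALENCE, 1.0)
print(round(f1_at_prevalence, 3), round(f1_all_positive, 3))
```

Guessing at the prevalence rate gives an F1 equal to the prevalence itself, which matches the "near 0.06" figure. Under the same approximation, flagging every window would reach roughly 0.12, still far below the per-subject scores reported above.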

Discussion

This study demonstrates the feasibility of using multimodal wearable sensors and machine learning to detect cortical arousals during sleep, offering an accessible alternative to traditional in-clinic polysomnography. Among the classifiers tested, the Random Forest model achieved the best balance between recall and precision, yielding the highest ROC-AUC of 0.94. This result is consistent with Random Forest’s ability to handle complex patterns, feature interactions, and imbalanced data. Its ensemble-based architecture and embedded feature selection likely contributed to its robustness in this complex real-world dataset. Compared to Logistic Regression, which assumes linearity, and XGBoost, which can be sensitive to hyperparameter tuning in small datasets, the Random Forest model proved particularly effective at capturing subtle, subject-specific arousal signatures.

Feature importance analysis further revealed that the most predictive signals were derived from accelerometry and gyroscope data, particularly features reflecting signal variability and complexity, such as root mean square amplitude, standard deviation, and spectral entropy. These findings are consistent with prior work suggesting that leg movements are linked with cortical arousals^14,16,17^. Entropy measures likely captured the fragmented nature of movement during arousals. In contrast, HR and HRV features extracted from PPG contributed less prominently to model performance. This was not entirely unexpected, as the original sampling rate of 25 Hz may be insufficient for accurate HRV estimation. Prior work has shown that HRV metrics like Standard Deviation of NN Intervals (SDNN) and Root Mean Square of Successive Differences (RMSSD) require significantly higher sampling rates to ensure reliability, at least 50 Hz for SDNN and 100 Hz or more for RMSSD without interpolation^27^. Additionally, signal quality issues further limited the reliability of PPG-derived features. These noise sources, primarily motion artifacts and high-frequency noise, are inevitable in wearable-based health and well-being monitoring systems and can significantly impact peak detection accuracy^28^. In our study, the average PPG signal quality declined from 0.818 during non-arousal periods to 0.488 during arousal events, indicating a consistent drop in signal quality during arousals.

Interestingly, the model predicted more arousals than were annotated by experts, particularly in subjects with sparse arousal profiles (Subject C). Rather than representing pure false positives, these predictions may reflect physiological events, such as sub-threshold arousals or autonomic activations, that were not captured by EEG-based criteria. This raises the possibility that wearable sensors may detect some physiological markers of sleep disruption that fall outside the boundaries of current clinical scoring systems. Indeed, prior research has shown that physiological changes surrounding arousal events can be significant, often extending beyond the boundaries of EEG-defined arousals^29,30^. These findings highlight how machine learning and wearables can improve sleep assessment beyond conventional methods. Additionally, the use of 60-second windows may have contributed to some discrepancy by grouping multiple arousals into a single event or capturing signal fluctuations surrounding true arousals.

Lastly, our subject-independent and interpretable framework provides minute-level temporal precision, making it suitable for clinical applications that require generalizable detection. It shows promise for individuals with ADHD, a group often underserved by traditional sleep diagnostics. Pediatric restless legs syndrome, for example, can cause significant sleep disruption, behavioral issues, and impaired daytime functioning that mimic ADHD symptoms^31,32^. While ADHD’s recognized subtypes (inattentive, hyperactive-impulsive, and combined) are well-described, their association with distinct sleep profiles remains unclear, highlighting the need for detailed pediatric sleep assessment^33^. Refined at-home monitoring could help identify specific sleep disorders and support more personalized, subtype-targeted treatments for pediatric ADHD. Building on these findings, this work presents multiple opportunities for future development. Priorities include expanding to larger and more diverse datasets, using deep learning to model long-range patterns, and incorporating continuous arousal scoring to reflect subtle physiological changes. Real-world feedback such as sleep staging, user experiences, and device usability will be vital for transforming this research into a practical home-based health solution. Ultimately, these efforts aim to bring clinical-quality sleep analytics into everyday environments through smart and accessible wearables.

Conclusion

This study presents a non-invasive, wearable-based framework for detecting cortical arousals using multimodal physiological signals from a leg-worn device. Among the classifiers evaluated, the Random Forest model performed best, achieving a ROC-AUC of 0.94 and showing strong alignment with expert-labeled EEG annotations. Key predictive features, such as leg movement variability and signal entropy, support the role of movement-related physiological signals as markers of central arousals. These findings demonstrate the potential of systems like RestEaze^™^ for clinically meaningful, at-home sleep monitoring. Future work should include larger, more diverse populations and explore continuous arousal scoring to enhance clinical relevance.

Methods

Participants and Data Acquisition

Physiological and movement data were collected from 14 children diagnosed with ADHD using the RestEaze^™^ Movement Analyzer, a wireless, leg-worn wearable designed for non-intrusive sleep monitoring and arousal detection. More details about the RestEaze^™^ can be found in a previous publication^18^. As illustrated in Fig. 5, the RestEaze^™^ device integrates multiple synchronized sensors:

- A 3-D accelerometer and 3-D gyroscope embedded within an inertial measurement unit (IMU) for leg movement and orientation tracking,
- A PPG sensor for capturing cardiovascular dynamics, and
- Object and ambient temperature sensors for thermal signature during sleep.

The accelerometer (X, Y, Z axes), gyroscope (X, Y, Z axes), and PPG channels (IR, red, green LEDs) were all sampled at 25 Hz, providing high-resolution capture of biomechanical and cardiovascular signals. Temperature data was sampled at 0.2 Hz, appropriate for monitoring slow-changing thermal conditions.

This setup enables continuous, multimodal recording throughout the night, capturing both fine-grained leg movements and physiological fluctuations associated with cortical arousals. Across the 14 participants, the average total sleep time was approximately 7.59 hours per subject, totaling 106.32 hours of recorded sleep data. Data collection was conducted during natural sleep in a home or clinical setting.

All study procedures were approved by the Institutional Review Board of Johns Hopkins University. Research was conducted in accordance with the Declaration of Helsinki and all relevant ethical guidelines and regulations, including obtaining informed consent from all participants and/or their legal guardians.

Cortical Arousals Rate

Cortical arousals (ground truth) were identified and scored according to the guidelines set by the American Academy of Sleep Medicine (AASM)^34^, which define arousals as abrupt shifts in EEG frequency, including alpha, theta, or activity exceeding 16 Hz, that last for at least 3 seconds and occur after a minimum of 10 seconds of uninterrupted sleep^21^. Arousal rate was calculated as the number of 60-second windows labeled with at least one cortical arousal event, normalized per hour of total sleep time. Specifically, if any arousal occurred within a given 60-second segment, the entire window was labeled as an arousal window (Class 1). The resulting arousal rate, expressed in arousal windows per hour, provides a temporally consistent metric for comparing arousal frequency across individuals.
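The windowed labeling and arousal-rate definition above can be sketched as follows. Arousal onsets and durations are assumed to be given in seconds from sleep onset, and an arousal that crosses a window boundary marks both windows. The function names and the example events are hypothetical.

```python
import numpy as np

WINDOW_S = 60  # window length in seconds, as described in the text

def label_windows(onsets_s, durations_s, total_sleep_s):
    """Return one 0/1 label per 60-second window (1 = contains an arousal)."""
    n_windows = int(np.ceil(total_sleep_s / WINDOW_S))
    labels = np.zeros(n_windows, dtype=int)
    for onset, dur in zip(onsets_s, durations_s):
        first = int(onset // WINDOW_S)
        # Last window touched by the event; an event ending exactly on a
        # boundary does not spill into the next window.
        last = max(first, int(np.ceil((onset + dur) / WINDOW_S)) - 1)
        labels[first:min(last, n_windows - 1) + 1] = 1
    return labels

def arousal_rate_per_hour(labels):
    """Arousal windows per hour of total sleep time."""
    hours = len(labels) * WINDOW_S / 3600.0
    return labels.sum() / hours

# Hypothetical events: three arousals in 10 minutes of sleep.
labels = label_windows([10, 55, 200], [5, 10, 8], total_sleep_s=600)
rate = arousal_rate_per_hour(labels)
print(labels, rate)
```

In this example the first two arousals share window 0 and the second also reaches into window 1, so three windows are labeled even though three separate events occurred, which illustrates how the metric counts windows rather than events.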

In addition to cortical arousals, sleep stages and limb movements were scored manually by trained technicians according to the AASM guidelines^34^. Bilateral limb movement events were also manually annotated, whereas leg movement channels were scored using an automated algorithm via the Sleepware G3 platform (Philips Respironics, US). Final scoring was reviewed and confirmed by a board-certified sleep physician and AASM fellow.

Preprocessing and Feature Generation

All raw sensor signals were processed using a unified preprocessing pipeline (see Fig. 5), which included filtering, segmentation into 60-second non-overlapping windows, and modality-specific feature extraction. The choice of a 60-second window was guided by the need to balance temporal resolution with physiological interpretability. Each one-minute segment contains sufficient cardiac cycles (typically 60–100 beats) to allow reliable estimation of HR and HRV, while also being short enough to detect changes in physiological state over time.

For the PPG signal, the preprocessing began with upsampling to 200 Hz using linear interpolation. This step was essential for achieving the temporal resolution required for accurate peak detection and compatibility with feature extraction functions that assume higher sampling rates. Several methods did not perform at the native 25 Hz resolution, especially those involving frequency-domain HRV metrics. The upsampled signal was then bandpass filtered between 0.2 and 5 Hz using a Butterworth filter to remove baseline drift and suppress motion artifacts. The filter was implemented in Python 3.11 using the butter and filtfilt functions from the scipy.signal module, which apply zero-phase forward and reverse filtering to avoid phase distortion^35^.
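A minimal sketch of the PPG preprocessing described above: linear interpolation from 25 Hz to 200 Hz, followed by a zero-phase Butterworth band-pass between 0.2 and 5 Hz using `butter` and `filtfilt`. The filter order is not stated in the text, so `order=2` is an assumption, and the input is a synthetic signal rather than recorded data.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS_NATIVE = 25.0   # Hz, native PPG sampling rate from the text
FS_TARGET = 200.0  # Hz, upsampled rate from the text

def upsample_linear(x, fs_in=FS_NATIVE, fs_out=FS_TARGET):
    """Resample x onto a denser time grid by linear interpolation."""
    x = np.asarray(x, dtype=float)
    t_in = np.arange(len(x)) / fs_in
    n_out = int(round(t_in[-1] * fs_out)) + 1
    t_out = np.arange(n_out) / fs_out
    return np.interp(t_out, t_in, x)

def bandpass_ppg(x, fs=FS_TARGET, low=0.2, high=5.0, order=2):
    """Zero-phase Butterworth band-pass; the order is an assumed value."""
    nyq = fs / 2.0
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, x)

# Synthetic 60 s test signal: offset + slow drift + a 1.2 Hz "pulse" (72 bpm).
t = np.arange(1500) / FS_NATIVE
raw = 10.0 + 2.0 * np.sin(2 * np.pi * 0.05 * t) + np.sin(2 * np.pi * 1.2 * t)
clean = bandpass_ppg(upsample_linear(raw))
```

Linear interpolation adds samples but no new information above the native Nyquist frequency of 12.5 Hz, so this step mainly serves downstream functions that expect a higher sampling rate.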

Following filtering, we evaluated several peak detection strategies to identify heartbeats from the PPG waveform. Among these, the `ppg_findpeaks` function from the NeuroKit2 library^36^ provided reliable results in terms of peak timing consistency and robustness to signal noise. Figure 6 shows the effects of preprocessing: the top panel displays the raw PPG signal with notable baseline fluctuations (Fig. 6a), the middle panel shows the filtered waveform with clearly resolved peaks (Fig. 6b), and the bottom panel plots the computed PPG signal quality over time (Fig. 6c). This quality metric, ranging from 0 to 1, reflects the reliability of the signal for physiological analysis.

Once peaks were detected, HR and HRV features were extracted from each 60-second window. HR metrics included minimum, maximum, and mean HR. HRV features encompassed time-domain measures (e.g., RMSSD, SDNN), frequency-domain indices (e.g., low-frequency/high-frequency ratio), and nonlinear metrics such as entropy, coefficient of signal irregularity, coefficient of variation of intervals, and fractal complexity (e.g., Higuchi fractal dimension).
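To make the time-domain features concrete, the sketch below derives HR statistics, SDNN, and RMSSD from the peak times of one window using their standard definitions. It assumes peak times in seconds and does not reproduce the frequency-domain or nonlinear metrics; the function name and example peaks are hypothetical.

```python
import numpy as np

def hr_hrv_features(peak_times_s):
    """Basic HR and time-domain HRV features from PPG peak times (seconds)."""
    ibi_ms = np.diff(np.asarray(peak_times_s, dtype=float)) * 1000.0
    hr_bpm = 60000.0 / ibi_ms  # instantaneous heart rate per interval
    return {
        "hr_mean": hr_bpm.mean(),
        "hr_min": hr_bpm.min(),
        "hr_max": hr_bpm.max(),
        # SDNN: standard deviation of the inter-beat intervals.
        "sdnn": ibi_ms.std(ddof=1),
        # RMSSD: root mean square of successive interval differences.
        "rmssd": np.sqrt(np.mean(np.diff(ibi_ms) ** 2)),
    }

# Hypothetical peaks with intervals alternating between 800 ms and 1000 ms.
feats = hr_hrv_features([0.0, 0.8, 1.8, 2.6, 3.6])
print(feats)
```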

Signals from the 3-D accelerometer and 3-D gyroscope were high-pass filtered with a cutoff frequency of 0.2 Hz to reduce low-frequency drift and artifacts. Each axis (X, Y, Z) was segmented into non-overlapping 60-second windows and processed to extract statistical features (mean, standard deviation, variance, skewness, kurtosis, minimum, maximum, and range), signal energy features (RMS and AUC), and spectral characteristics (dominant frequency and spectral entropy). Object and ambient temperature signals were not filtered but were similarly segmented into 60-second windows and processed to extract basic descriptive statistics, including mean, median, standard deviation, minimum, maximum, and range.
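The inertial feature extraction for a single axis might look like the sketch below: a 0.2 Hz high-pass filter, 60-second non-overlapping windows, and a subset of the listed features. The filter order, the definition of AUC as the integral of the absolute signal, and the normalization of spectral entropy to the range 0 to 1 are all assumptions, since the text does not specify them.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 25.0           # Hz, IMU sampling rate from the text
WIN = int(60 * FS)  # samples per 60-second window

def highpass(x, fs=FS, cutoff=0.2, order=2):
    """Zero-phase high-pass; the order is an assumed value."""
    b, a = butter(order, cutoff / (fs / 2.0), btype="high")
    return filtfilt(b, a, x)

def spectral_entropy(x):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]."""
    psd = np.abs(np.fft.rfft(x)) ** 2
    total = psd.sum()
    if total == 0:
        return 0.0
    p = psd / total
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)) / np.log2(len(psd)))

def window_features(axis_signal):
    """Per-window features for one accelerometer or gyroscope axis."""
    x = highpass(np.asarray(axis_signal, dtype=float))
    n = len(x) // WIN
    freqs = np.fft.rfftfreq(WIN, d=1.0 / FS)
    rows = []
    for w in x[: n * WIN].reshape(n, WIN):
        psd = np.abs(np.fft.rfft(w)) ** 2
        rows.append({
            "std": w.std(),
            "rms": np.sqrt(np.mean(w ** 2)),
            "range": w.max() - w.min(),
            "auc": np.sum(np.abs(w)) / FS,  # assumed AUC definition
            "dominant_freq": freqs[np.argmax(psd[1:]) + 1],  # skip the DC bin
            "spectral_entropy": spectral_entropy(w),
        })
    return rows

# Synthetic example: a quiet minute followed by a minute with a brief burst.
rng = np.random.default_rng(0)
sig = 0.01 * rng.standard_normal(2 * WIN)
sig[WIN + 500 : WIN + 600] += rng.standard_normal(100)
feats = window_features(sig)
```

In the synthetic example the second window contains a four-second burst, so its RMS and range are much larger than in the quiet window, which is the kind of contrast the movement-intensity features are meant to capture.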

All features across modalities were combined into a unified feature matrix indexed by timestamp and subject ID. Arousal labels were resampled into 60-second non-overlapping windows to match the feature segmentation. A window was labeled as an arousal event if it contained any arousal occurrence within its duration, ensuring sensitivity to even brief arousal activity. This binary labeling approach allowed the model to learn from both isolated and clustered arousal events, supporting robust temporal prediction. The dataset was imbalanced, with arousal windows (Class 1) comprising 6.6% of the data and non-arousal windows (Class 0) accounting for 93.4%, reflecting the rarity of cortical arousals during sleep.

While this approach simplifies the classification task, it introduces a limitation: multiple arousals occurring within the same 60-second window are treated as a single event. This may underestimate the actual number of arousals in windows with dense activity. We initially experimented with shorter windows (e.g., 30 seconds) to capture finer temporal dynamics. However, this led to increased false positives, likely because pre- and post-arousal changes over the signals extended beyond the arousal itself. Thus, the 60-second window length was selected as an optimal trade-off between capturing relevant signal changes and maintaining specificity. Additionally, arousals that spanned multiple windows, a potential source of edge effects, were observed in approximately 10% of cases. Given that most arousals lasted 8 to 12 seconds, this level of boundary overlap was considered acceptable within the 60-second segmentation framework.

Machine Learning Framework and Feature Selection

We evaluated and compared the performance of three classifiers:

Logistic Regression

As a baseline, we trained a Logistic Regression model with L2 regularization (Ridge penalty), which helps prevent overfitting and handles multicollinearity. The model was trained with subject-level z-scored features, class balancing, and LOOCV. Hyperparameters, including the regularization strength, were tuned using RandomizedSearchCV with 50 randomized iterations. While it offers greater interpretability, it lacks the capacity to model nonlinear interactions present in physiological time-series data.

Gradient-Boosted Decision Tree Model (XGBoost)

We also implemented XGBoost, a high-performance gradient-boosted decision tree model that incorporates both first- and second-order gradients. We tuned hyperparameters including learning rate, tree depth, subsampling rate, and L1/L2 penalties using RandomizedSearchCV with 50 randomized iterations. All training followed the same LOOCV protocol as the previous model.

Bagged Tree Ensemble Model (Random Forest)

We used a Random Forest classifier, known for its robustness to noise, ability to model nonlinear relationships and embedded feature importance analysis. Hyperparameters were optimized using RandomizedSearchCV with 50 randomized iterations. Tuned parameters included the number of trees, maximum depth, minimum samples per split and leaf node, and feature subsampling ratio. All training followed the same LOOCV protocol as the other models. The best-performing hyperparameters for each model, selected based on cross-validation performance across folds, are summarized in Table 2.

To account for inter-individual variability in physiological signals, all features were standardized per subject using z-score normalization. Columns with excessive missingness were removed, and the remaining missing values were imputed using subject-level k-nearest neighbors^37^. This method estimates missing values by averaging the feature values from the most similar observations in the dataset. Dimensionality reduction and feature selection were performed using Recursive Feature Elimination^38^ within the training folds to retain only the most informative features for classification.
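A minimal sketch of the per-subject standardization step, assuming a feature matrix with one row per window and a parallel array of subject IDs. Constant features are left at zero rather than divided by zero, which is a choice made for this illustration.

```python
import numpy as np

def zscore_per_subject(X, subject_ids):
    """Standardize each feature within each subject's own windows."""
    X = np.asarray(X, dtype=float)
    subject_ids = np.asarray(subject_ids)
    Xz = np.empty_like(X)
    for s in np.unique(subject_ids):
        mask = subject_ids == s
        mu = X[mask].mean(axis=0)
        sd = X[mask].std(axis=0)
        sd[sd == 0] = 1.0  # avoid division by zero for constant features
        Xz[mask] = (X[mask] - mu) / sd
    return Xz

# Two hypothetical subjects whose raw features live on very different scales.
X = np.array([[1.0, 10.0], [3.0, 10.0], [100.0, 5.0], [300.0, 7.0]])
Xz = zscore_per_subject(X, [0, 0, 1, 1])
print(Xz)
```

After this step both subjects' features are on a common scale, so a movement that is large relative to one child's baseline is comparable to a large movement relative to another's.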

A LOOCV scheme was used, where each subject was held out in turn as the test fold while the remaining subjects were used for training. This approach ensured strict subject-level separation and prevented data leakage, supporting robust evaluation of model generalizability.
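A minimal sketch of the subject-level split, assuming each window carries a subject identifier (the identifiers below are made up):

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out each subject once."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx

# One entry per window; windows from the same subject always land on the same side.
subject_ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_subject_out(subject_ids))
```

Because the split is made on subject identifiers rather than on individual windows, no subject contributes windows to both the training and test sides of a fold.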

To address the natural class imbalance between arousal and non-arousal events, a two-step resampling strategy was applied within each training fold. First, Tomek Links^39^ were removed to clean the decision boundary, followed by Random Undersampling^40^ to balance the class distribution during model fitting. Importantly, the held-out test subject was never undersampled, preserving the original data distribution for evaluation. Thresholds for classification were selected based on the precision-recall curve computed on the raw (non-resampled) version of the training data, ensuring that decision thresholds reflected realistic class ratios. The selected threshold was then applied to the test fold.
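The undersampling and threshold-selection steps can be sketched as below. Several things here are assumptions rather than details from the paper: Tomek link removal is omitted for brevity, the threshold criterion is taken to be maximum F1 (the text does not state which point on the precision-recall curve was chosen), and the scores are made-up stand-ins for model outputs on the raw training data.

```python
import random

def random_undersample(X, y, seed=0):
    """Drop majority-class rows at random until both classes have equal counts."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]

def best_f1_threshold(y_true, scores):
    """Scan each observed score as a threshold; return the one with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(pred, y_true))
        fp = sum(p == 1 and y == 0 for p, y in zip(pred, y_true))
        fn = sum(p == 0 and y == 1 for p, y in zip(pred, y_true))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

X = [[0.1], [0.2], [0.3], [0.4], [0.8], [0.9]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_undersample(X, y)           # balanced data for model fitting only
scores = [0.05, 0.20, 0.30, 0.60, 0.70, 0.90]     # hypothetical scores on the raw training data
threshold, f1 = best_f1_threshold(y, scores)      # threshold chosen at realistic class ratios
```

The balanced set is used only to fit the model, while the threshold is picked on the unbalanced labels `y`, so it reflects the class ratio the model will face at test time.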

Together, these classifiers enabled direct performance comparisons. The outputs were evaluated using window-based overlap metrics and correlation analyses, described in the next section.

Model Comparison and Evaluation

Model performance was assessed using both classification-based metrics and agreement-based statistical analyses, with subject-level separation maintained through LOOCV. For each model, the area under the receiver operating characteristic curve (ROC AUC) was computed to quantify overall discriminative ability. In addition, precision, recall, and F1-score, defined in equations (1) through (3), were calculated separately for the arousal (Class 1) and non-arousal (Class 0) classes on a per-window basis:

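$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

Here $TP$, $FP$, and $FN$ denote the numbers of true-positive, false-positive, and false-negative windows for the class being evaluated.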
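The per-class computation of precision, recall, and F1-score on windowed labels can be sketched as follows. The label vectors are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the paper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One label per window: 1 = arousal, 0 = non-arousal (illustrative values).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
arousal = precision_recall_f1(y_true, y_pred, positive=1)
non_arousal = precision_recall_f1(y_true, y_pred, positive=0)
```

With one missed arousal and one false alarm, the arousal class scores 2/3 on all three metrics while the non-arousal class scores 0.8, which shows why the two classes are reported separately under class imbalance.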

To ensure equal contribution from each subject and prevent performance estimates from being skewed by subjects with longer recordings or more events, all metrics (precision, recall, F1-score) were first computed individually for each left-out subject in the LOOCV framework. The final reported values (Table 1) represent the mean of per-subject metrics, formalized as:

$$\bar{M} = \frac{1}{S} \sum_{s=1}^{S} M_s$$

Where:

$\bar{M}$: Subject-averaged metric (e.g., precision, recall, F1-score)

$S$: Total number of subjects

$M_s$: Value of the metric computed on held-out subject $s$
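The difference between averaging per-subject metrics and pooling all windows can be illustrated with a small example. The true-positive and false-negative counts below are invented, with one subject deliberately contributing far more events than the others.

```python
from statistics import mean

# (TP, FN) counts per held-out subject; s1 has many more arousal windows (made-up numbers).
counts = {"s1": (90, 10), "s2": (5, 5), "s3": (7, 3)}

# Recall computed separately for each subject, then averaged with equal weight.
per_subject = {s: tp / (tp + fn) for s, (tp, fn) in counts.items()}
subject_averaged = mean(per_subject.values())

# Recall computed once over all windows pooled together, for contrast.
pooled = sum(tp for tp, _ in counts.values()) / sum(tp + fn for tp, fn in counts.values())
```

Here the subject-averaged recall is about 0.70 while the pooled recall is 0.85, because pooling lets the subject with the most events dominate the estimate.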

In addition to discrete classification metrics, we evaluated the agreement between predicted and ground-truth arousals across subjects. The predicted arousal rate for each subject, defined as the number of arousal events per hour of total sleep time, was compared with the true arousal rate using Spearman’s rank correlation coefficient (ρ) and Kendall’s tau (τ) to assess monotonic relationships. Agreement between predicted and true arousal rates was further examined using Bland–Altman analysis^41^, which visualizes the bias and limits of agreement between model estimates and expert-scored references.
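The agreement analysis can be sketched with the standard library alone. The per-subject rates are made up, Kendall's tau is omitted for brevity, and the limits of agreement use the conventional bias ± 1.96 standard deviations of the differences, which is assumed here rather than stated in the text.

```python
from statistics import mean, stdev

def arousal_rate(n_events, total_sleep_hours):
    """Arousal events per hour of total sleep time."""
    return n_events / total_sleep_hours

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def bland_altman(pred, true):
    """Return (bias, lower limit, upper limit) of agreement."""
    diffs = [p - t for p, t in zip(pred, true)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

true_rate = [10.0, 12.0, 8.0, 15.0]   # expert-scored arousals per hour (illustrative)
pred_rate = [11.0, 13.0, 8.5, 14.0]   # model-predicted arousals per hour (illustrative)
rho = spearman(pred_rate, true_rate)
bias, lower, upper = bland_altman(pred_rate, true_rate)
```

In this toy example the ranking of subjects is preserved perfectly even though the model overestimates the rate by 0.375 events per hour on average, which is why rank correlation and Bland–Altman analysis are reported together.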

Feature Importance Analysis

After model training and evaluation, we analyzed feature importances using the Random Forest model trained on the entire dataset to capture generalizable patterns across all subjects. Random Forest determines feature importance from the total decrease in node impurity (such as Gini impurity) that each feature contributes across all decision trees in the ensemble. Features that produce larger impurity reductions when used for splitting are considered more important^42^. This approach allows the model to naturally account for nonlinear relationships and feature interactions. To enhance interpretability and reduce noise from low-importance variables, we selected the 30 top-ranked features for post hoc analysis. This number was chosen empirically: including more than 30 features resulted in only marginal improvements in classification performance while increasing model complexity and the risk of overfitting. The selected features represented a balanced trade-off between performance and interpretability and were used in downstream visualizations and interpretation.
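The impurity decrease that underlies this importance measure can be worked through for a single split. This is a minimal sketch for binary labels with invented data. A full Random Forest importance additionally weights each node by the fraction of samples reaching it and sums the contributions over all splits and trees that use the feature.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = [0, 0, 0, 0, 1, 1, 1, 1]
# A weak split leaves both children mixed; a perfect split separates the classes.
weak = impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1])
perfect = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])
```

The weak split reduces impurity by 0.125 and the perfect split by 0.5, so a feature that repeatedly enables cleaner splits accumulates a higher importance score.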

Bibliography


1. Morgan B. J. Neurocirculatory consequences of abrupt change in sleep state in humans. J. Appl. Physiol. 80, 1627–1636 (1996). PMID 8727549. doi:10.1152/jappl.1996.80.5.1627
2. Xue Y. Durative sleep fragmentation with or without hypertension suppress rapid eye movement sleep and generate cerebrovascular dysfunction. Neurobiol. Dis. 184, 106222 (2023). PMID 37419254. doi:10.1016/j.nbd.2023.106222
3. Chouchou F. Sympathetic overactivity due to sleep fragmentation is associated with elevated diurnal systolic blood pressure in healthy elderly subjects: the PROOF-SYNAPSE study. Eur. Heart J. 34, 2122–2131 (2013). PMID 23756334. doi:10.1093/eurheartj/eht208
4. Cappuccio F. P., Cooper D., D’Elia L., Strazzullo P. & Miller M. A. Sleep duration predicts cardiovascular outcomes: a systematic review and meta-analysis of prospective studies. Eur. Heart J. 32, 1484–1492 (2011). PMID 21300732. doi:10.1093/eurheartj/ehr007
5. Duan D., Kim L. J., Jun J. C. & Polotsky V. Y. Connecting insufficient sleep and insomnia with metabolic dysfunction. Ann. N Y Acad. Sci. 1519, 94–117 (2023). PMID 36373239. doi:10.1111/nyas.14926. PMCID PMC9839511
6. Wajszilber D., Santiseban J. A. & Gruber R. Sleep disorders in patients with ADHD: impact and management challenges. Nat. Sci. Sleep. 10, 453–480 (2018). PMID 30588139. doi:10.2147/NSS.S163074. PMCID PMC6299464
7. Lal C., Strange C. & Bachman D. Neurocognitive impairment in obstructive sleep apnea. Chest 141, 1601–1610 (2012). PMID 22670023. doi:10.1378/chest.11-2214
8. Owens J. A. A clinical overview of sleep and attention-deficit/hyperactivity disorder in children and adolescents. J. Can. Acad. Child. Adolesc. Psychiatry J. Acad. Can. Psychiatr Enfant Adolesc. 18, 92–102 (2009). PMID 19495429. PMCID PMC2687494