Combining Signals for EEG-Free Arousal Detection during Home Sleep Testing: A Retrospective Study
Safa Boudabous, Juliette Millet, Emmanuel Bacry

TL;DR
This study shows that combining simple physiological signals can help detect sleep arousals without EEG during home sleep tests.
Contribution
The study introduces a method to detect arousal using non-EEG signals in home sleep testing with deep learning.
Findings
Combining multiple signals improved arousal detection performance over single-signal models.
Thoracic effort, heart rate, and a wake/sleep indicator achieved 61.59% precision and 56.46% recall.
The method offers a competitive alternative to EEG-based arousal detection for home sleep devices.
Abstract
Introduction: Accurately detecting arousal events during sleep is essential for evaluating sleep quality and diagnosing sleep disorders, such as sleep apnea/hypopnea syndrome. While the American Academy of Sleep Medicine guidelines associate arousal events with electroencephalogram (EEG) signal variations, EEGs are often not recorded during home sleep testing (HST) using wearable devices or smartphone applications. Objectives: The primary objective of this study was to explore the potential of alternatively relying on combinations of easily measurable physiological signals during HST for arousal detection where EEGs are not recorded. Methods: We conducted a data-driven retrospective study following an incremental device-agnostic analysis approach, where we simulated a limited-channel setting using polysomnography data and used deep learning to automate the detection task. During the…
- Agence Nationale de la Recherche
- NIH-NHLBI Association of Sleep Disorders with Cardiovascular Health across Ethnic Groups
- National Heart, Lung, and Blood Institute
- NCATS
1. Introduction
Frequent arousals during sleep are a strong marker of pathologically disordered sleep. Several sleep disorders are known to cause recurrent arousals during sleep [1]. One such disorder is Obstructive Sleep Apnea/Hypopnea Syndrome (OSAHS) [2,3]. In OSAHS patients, arousals occur frequently during the recovery phase after an apnea/hypopnea event. The arousal allows the restoration of normal breathing and offsets hypoxia due to inspiratory resistance linked to upper airway narrowing. Arousals are not the long periods of full behavioral wakefulness that sleepers often perceive during the night. They are rather brief, transient interruptions of sleep that often go unnoticed but disrupt sleep cycles significantly. The frequency of sleep arousals is often estimated during polysomnography (PSG), the standard sleep study for assessing sleep quality and diagnosing sleep disorders. It involves the patient spending a night with multiple sensors attached to the body for the overnight measurement of different biological signals, including brain waves from the electroencephalogram (EEG), which are required for arousal identification according to the American Academy of Sleep Medicine’s (AASM) recommended definition.
While the electrode placement for PSG is non-invasive and painless, patients commonly experience discomfort, stress, and difficulty falling asleep due to the multiple external wires. This may lead to frequent awakenings and compromise the study results. With the continuous development of sensing techniques, various lightweight diagnostic tools have been proposed as alternatives to PSG, including wearable devices like the Belun Sleep Ring [4], the PneaVoX [5] and Clebre [6] tracheal sound systems, the neck-cuff system proposed in [7], and the Wearable Intelligent Sleep Monitor (WISM) [8], a device placed above the palmar thenar major muscles. Smartphone-based sleep testing solutions that need no extra devices have also been proposed, such as the Apneal application [9], which records data from a smartphone attached to the patient’s chest. These alternative tools record fewer signals than full PSG. They mainly focus on parameters like heart rate, respiratory effort, breathing sounds, movement activity, and body position, and they often do not record EEG signals, which limits their use for arousal scoring.
Several previous studies have shown that different autonomic responses can be used as sensitive markers of transient sleep arousals, such as increases in respiratory rate, heart rate, blood pressure, and skin vasoconstriction. Studies based on auditory-stimulated arousal have shown that arousal events lead to vasoconstriction [10], variations in blood pressure [11], and an increase in ventilation [12]. In [1], Catcheside et al. compared cardiovascular responses with arousal during normoxia and hypoxia. Their results confirmed that changes in heart rate, pulse transit time, and skin blood flow were primarily related to arousal rather than hypoxia. In [13], Davies et al. also reported that arousals change systolic blood pressure. Smith et al. [14] studied cardiovascular changes concomitant to respiratory-induced and spontaneous arousals by examining different aspects of the ECG. Their findings confirmed that both types of arousal shorten the time between R-waves (RR interval) and the time between the start of the Q wave and the end of the T wave (QT interval), while simultaneously lengthening the period between the onset of atrial depolarization and the onset of ventricular depolarization (PR interval). Another study involving apneic patients [15] found that increased sympathetic activity during sleep is due to chronic exposure to the periodic episodes of hypoxia, hypercapnia, and arousal events accompanying the recurring apneas.
Some attempts have been made to identify sleep arousal without brain signals. Rule-based algorithms have been proposed that associate arousals with changes in autonomic nervous system activity, such as peaks in systolic blood pressure in [13] or drops in the peripheral arterial tonometry signal’s amplitude accompanied by increases in pulse rate in [16]. In [17], Foussier et al. used a Mahalanobis distance-based ranking algorithm and the multivariate analysis of variance method to extract the most discriminative uncorrelated heart rate variability (HRV) features for arousal detection. They applied a linear mixed model to account for inter-subject variation and emphasize the individual features’ discriminative power. Basner et al. [18] employed a Bayesian approach to calculate how likely a given heartbeat is to correspond to the start of an arousal. Deep learning models have also been considered for this task. Olsen et al. [19] used a feedforward neural network trained on different HRV features. Ehrlich et al. [20] used an ensemble of fully convolutional networks to detect arousal from RR interval signals extracted from an electrocardiogram, and Li et al. [21] developed a dedicated deep learning model called DeepCAD to identify arousal using a raw single-lead ECG signal as input. The methods and models above each used a single input signal for detecting arousals, mainly relying on cardiovascular activations. The exception is [19], where the authors experimented with adding sleep stages to HRV features and showed a significant improvement in model accuracy.
In this work, we present a retrospective study on the large Multi-Ethnic Study of Atherosclerosis (MESA) dataset that explores the potential of combining different signals easily measurable by wearable and smartphone-based diagnostic devices to improve the identification of cortical arousals without access to brain activity. We simulate a limited-channel diagnostic context from PSG records, keeping only signals that lightweight diagnostic devices commonly record or can estimate. This simulation-based approach ensures that the study is device-type agnostic. It also allows the evaluation of signal combinations that have not yet been explored, which can drive the design of new diagnostic tools. In this study, we rely on deep learning to automate the arousal detection task. The results on the MESA dataset confirmed our initial hypothesis that combining different signals improves performance. The combination of the thoracic effort signal with heart rate and a binary wake/sleep indicator achieved results competitive with the state of the art in detecting arousal events during sleep without EEG signals. The wake/sleep indicator annotates sleep records per 30 s epoch based on wakefulness; wake epochs are periods of complete wakefulness and differ from transient arousals, which do not necessarily lead to the patient’s awakening.
2. Method
Different lightweight and wearable devices have been introduced to facilitate and improve home sleep testing (HST). These devices rely on different techniques and technologies for signal acquisition, resulting in variations in the quality of the signals they capture. Several studies have compared the performances of these devices in tasks such as sleep stage classification or identifying apnea/hypopnea events. However, the comparison results may be influenced by the size of the study group and its characteristics. To address this issue, we chose to simulate a limited-channel setting, similar to HST with wearable devices, using PSG signals from the Multi-Ethnic Study of Atherosclerosis (MESA), which keeps the study device-agnostic. We only included PSG signals that can be easily recorded and accessed during HST using wearable devices or smartphone applications. We also applied lenient inclusion criteria to the MESA dataset, allowing the use of signals affected by noise from movement or sensor misplacement.
2.1. Dataset
This study is based on data from MESA [22]. MESA is a multi-center longitudinal research study, sponsored by the National Heart, Lung, and Blood Institute (NHLBI), involving asymptomatic participants aged between 45 and 84 from six communities in the United States. The study was designed to investigate factors associated with the development and progression of subclinical cardiovascular disease in an ethnically diverse population.
Between 2010 and 2012, 2237 participants underwent an overnight in-home PSG as part of the follow-up exams. Institutional review board approval was obtained at each study site, and written informed consent was obtained from all participants. Only 2055 of the 2237 MESA PSG records are available in the National Sleep Research Resource (NSRR) repository [23].
PSG was conducted using a 15-channel monitor (Compumedics Somte System; Compumedics Ltd., Abbotsville, Australia). Each PSG recording includes electroencephalography (EEG), bilateral electrooculograms, chin electromyography (EMG), bipolar electrocardiography (ECG), thoracic and abdominal respiratory inductance plethysmography, airflow measured by a thermocouple and nasal pressure cannula, finger pulse oximetry, bilateral limb movement piezoelectric sensors, and a position sensor. The PSG recordings also provide a heart rate signal derived from the ECG and a snore signal measured as vibrations related to breathing at the nasal pressure cannula level. PSG signals from the MESA study are listed in Table A1 in Appendix A.1.
Certified scorers manually scored the PSG records from MESA for sleep stages and arousal events. Each record is reviewed on an epoch-by-epoch basis. Each epoch is assigned a sleep stage, and EEG change periods that meet the arousal criteria are marked. The scoring is performed according to the AASM guidelines. Additional rules are provided in the MESA Sleep Reading Center (SRC) manual of operations and scoring rules (https://sleepdata.org/datasets/mesa/files/m/browser/documentation/MESA_Sleep_Polysomnography_Scoring_Manual.pdf (last accessed on 30 August 2024)). Regarding the sleep stage scoring, the rules mainly help to guide score assignment in sleep stage transition epochs and during an arousal event. In connection with scoring arousal, the SRC scoring rules offer useful guidance for distinguishing arousal events from artifacts caused by movement or isolated bursts of delta waves. They also provide advice on scoring arousal during rapid eye movement periods. To minimize inter-scorer scoring differences, MESA scorers participated in rule-based training and regularly took part in reliability exercises, which included re-scoring a standard set of 20 MESA records.
The SRC manual presents the results of a scoring reliability assessment over a set of MESA records. The results show the excellent quality of wake/sleep scoring, estimating the average difference in total sleep time to be 7 min and the inter-scorer correlation for total sleep time estimation to be between and 1. Inter- and intra-scorer reliability for arousal scoring were also evaluated. The intra-class correlation coefficients for the arousal index range from to , indicating strong consistency in scoring.
Out of the 2055 records available, 1311 records were included for analysis. Of the 744 discarded records, 604 lacked manual annotation for arousal, which is essential for model training and evaluation. The rest were discarded mainly because at least one of the selected signals was of poor quality, defined as missing or noisy for more than of sleep time. The flowchart of the exclusion criteria is depicted in Figure A1 in Appendix A.2.
2.2. Model Architecture
In this study, we use the same neural network architecture as the DeepCAD model. The DeepCAD model, introduced in [21], is designed to automatically identify cortical arousals using a single-lead ECG signal as its input. The model learns a non-linear function through a neural network that assigns an arousal probability to each second of an ECG signal it processes. Compared with other methods that rely on cardiac features for arousal detection, the DeepCAD model has demonstrated leading performance, as highlighted in [20].
The architecture of the DeepCAD model comprises an inception block consisting of four parallel convolutional blocks with varying receptive fields, followed by residual blocks to downsample the input to 1 Hz. As shown in Figure 1b, each residual block comprises two components, each containing two convolutional blocks and corresponding skip connections. In the second component, a downsampling stride of two is employed. The output of the last residual block is fed into two Long Short-Term Memory (LSTM) layers. LSTM [24] is a type of recurrent neural network (RNN) that addresses the vanishing gradient problem of traditional RNNs and provides extended short-term memory to the network. Each LSTM unit includes a cell to remember values across multiple time steps, and an input gate, an output gate, and a forget gate that regulate the flow of information in and out of the cell. The last LSTM layer is followed by a fully connected layer with a sigmoid activation function to produce arousal probabilities.
The inception and residual blocks are based on a similar convolutional structure shown in Figure 1a, which includes a convolutional layer followed by batch normalization and ReLU activation.
To adapt to the 4 Hz resolution of the input signals, we use two residual blocks instead of the eight used in the DeepCAD model [21]. Moreover, we adjust the kernel sizes of the convolutional layers for both the inception block and the residual blocks. We consider larger receptive fields for the inception block’s convolutional layers by fixing kernel sizes to 5, 33, 65, and 129. We set kernel sizes to 1, 2, 7, and 2 for residual blocks.
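The block count can be checked with simple arithmetic. The sketch below assumes that each residual block halves the sampling rate exactly once (through the stride-2 component described above). The function name is hypothetical, and the 256 Hz ECG rate used for the DeepCAD comparison is an assumption that is not stated in this section.

```python
def output_rate_hz(input_rate_hz: float, n_residual_blocks: int, stride: int = 2) -> float:
    """Sampling rate after a stack of residual blocks.

    Assumes each block downsamples exactly once by `stride`.
    """
    return input_rate_hz / (stride ** n_residual_blocks)


# This study: 4 Hz inputs and two residual blocks give a 1 Hz output.
rate_this_study = output_rate_hz(4, 2)

# DeepCAD with eight residual blocks; 256 Hz is an assumed ECG sampling rate.
rate_deepcad = output_rate_hz(256, 8)
```

Under these assumptions, both configurations end at one prediction per second, which matches the per-second arousal probabilities the model is described as producing.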
We depict the deep learning (DL) model architecture in Figure 1c.
2.3. Selected Input Signals
To simulate a limited-channel HST setting, we only consider signals commonly recorded and made available during HST using wearable devices or smartphone applications. The candidate signals are the heart rate, the thoracic effort, the snoring signal, the position signal, and the binary wake/sleep and position change signals. The thoracic effort, snoring, and position signals are directly acquired with sensors during PSG. In the MESA study, standard PSG sensors were used for acquisition: a respiratory inductance plethysmograph (RIP) thoracic belt for the thoracic effort signal, a built-in movement detector for the position, and a nasal cannula for snoring. The heart rate, wake/sleep, and position change signals are derived from recorded PSG signals: the heart rate signal is derived from the raw ECG signal, the position change signal is a binary signal set to 1 when a position change is detected on the position signal, and the wake/sleep signal is extracted from the hypnogram and takes the value 0 or 1 depending on whether the patient is awake or asleep. It is important to note that the “Wake” epochs refer to periods of complete wakefulness observed in the hypnogram, which differ from arousals that may not necessarily lead to the patient’s awakening.
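As an illustration of how the two binary signals could be derived, here is a minimal sketch. It assumes a position signal with one integer code per sample and a hypnogram with one stage label per 30 s epoch. The function names, the "W" wake label, and the 0 = awake / 1 = asleep encoding are assumptions made for the example; the 4 Hz rate follows the resampling rate used in this study.

```python
def position_change_signal(position):
    """Binary signal equal to 1 where the position differs from the previous sample."""
    change = [0] * len(position)
    for i in range(1, len(position)):
        if position[i] != position[i - 1]:
            change[i] = 1
    return change


def wake_sleep_signal(hypnogram, epoch_s=30, rate_hz=4, wake_label="W"):
    """Expand per-epoch stages into a per-sample signal (assumed: 0 = awake, 1 = asleep)."""
    samples_per_epoch = epoch_s * rate_hz
    signal = []
    for stage in hypnogram:
        value = 0 if stage == wake_label else 1
        signal.extend([value] * samples_per_epoch)
    return signal


# Toy inputs: five position samples and two scored epochs.
pos_chg = position_change_signal([0, 0, 1, 1, 2])
ws = wake_sleep_signal(["W", "N2"])
```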
We point out that the signals selected from PSG can already be captured or derived by most lightweight wearable home sleep testing (HST) devices and smartphone-based solutions. These solutions use different acquisition technologies and often depend on specific algorithms and methods to extract and reconstruct each signal. For example, heart rate can be estimated using a seismocardiogram from an accelerometer placed on the patient’s chest (seismocardiography (SCG) is used by Apneal [9] to estimate the heart rate and reconstruct the thoracic effort signal). A photoplethysmogram or a tracheal sound signal can also be used to estimate the heart rate. Additionally, accelerometer and gyroscope data can be used to collect information about the patient’s position and wake/sleep stages and reconstruct the thoracic effort signal. The accuracy of measured and derived signals using a specific HST device largely depends on the device sensing technique and the defined method for signal processing. It is important to note that signals from lightweight HST devices are not always less accurate. For instance, high-quality audio recordings can better estimate the likelihood of snoring instead of relying solely on the vibration signal from the cannula.
Table 1 summarizes the selected physiological signals and their sources in the MESA dataset. An example of measurements of selected signals around a given annotated arousal event extracted from one MESA participant’s recording is illustrated in Figure 2.
The selected signals are considered as candidate input signals to the detection model. An incremental analysis approach is defined to assess each signal’s impact and identify the signal combinations that improve arousal detection without EEG signals.
2.4. Incremental Analysis Approach
We follow an analysis process to determine which combination of biological input signals enhances the model’s ability to detect arousals. To do so, we use a greedy incremental approach.
Round I: During the first round, we train three models: the first using the heart rate signal as input, the second using thoracic effort, and the third using the snoring signal. In this round, we omit the body position and wake/sleep signals listed in Table 1, since they are not meaningful as standalone inputs for our task. We then compare the performances of these models and select, for the next round, the input signal corresponding to the model that performs best.
Round II: In the second round, we build pairs of input signals by adding to the previously selected signal any signal (different from the previously selected signal) listed in Table 1. We train a model for each of the pairs, and we evaluate its performance. We then select for the next round the pair of input signals that corresponds to the best-performing model.
Successive Rounds: In the following rounds, we recursively follow the same process by adding, at each round, a new input signal to the combination selected in the previous round. We then select, for the next round, the new combination of signals corresponding to the best-performing model.
We illustrate the incremental approach with a flowchart in Figure 3. We differentiate between the initial round and the incremental subprocess, including all the subsequent rounds. Each round of the incremental subprocess involves training models using input signal combinations generated by adding a signal to the input of the best model from the preceding round, evaluating their performance, and assessing for improvement.
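The rounds described above amount to a greedy forward selection, which can be sketched as follows. This is a simplified illustration, not the study's code: `evaluate` is a hypothetical stand-in for training a model on a signal combination and returning its validation score, the rule of stopping when no candidate improves the score is an assumption, and the toy gains are made-up integers that do not come from the study.

```python
def greedy_signal_selection(first_round, all_signals, evaluate):
    """Greedy forward selection of input signals.

    `evaluate(combo)` is a hypothetical callback returning a score (higher is better).
    Returns the list of (combination, score) pairs selected at each round.
    """
    best_combo = max(((s,) for s in first_round), key=evaluate)
    best_score = evaluate(best_combo)
    history = [(best_combo, best_score)]
    while len(best_combo) < len(all_signals):
        candidates = [best_combo + (s,) for s in all_signals if s not in best_combo]
        challenger = max(candidates, key=evaluate)
        challenger_score = evaluate(challenger)
        if challenger_score <= best_score:
            break  # assumed stopping rule: no candidate improves on the previous round
        best_combo, best_score = challenger, challenger_score
        history.append((best_combo, best_score))
    return history


# Toy scoring with made-up gains, used only to exercise the selection loop.
TOY_GAINS = {"Thor": 5, "HR": 3, "Snore": 2, "W/S": 1, "Pos": -1}


def toy_evaluate(combo):
    return sum(TOY_GAINS[s] for s in combo)


history = greedy_signal_selection(
    first_round=["HR", "Thor", "Snore"],
    all_signals=["HR", "Thor", "Snore", "W/S", "Pos"],
    evaluate=toy_evaluate,
)
```

In practice each call to `evaluate` would be a full training run, so a real implementation would cache scores instead of calling the function twice per candidate.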
2.5. Experiment
2.5.1. Preprocessing
In this study, we aimed to minimize the amount of data preprocessing. First, all selected signals from PSG were resampled at 4 Hz. Then, we standardized each signal by removing the median and dividing each sample by the interquartile range. No filtering was applied to the signals to deal with noise. We only corrected outliers in the heart rate signal to address errors caused by R peak misdetection on the ECG signal.
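The standardization step can be sketched as below. This minimal example uses only the Python standard library; the function name is hypothetical, and the "inclusive" quantile method is one of several reasonable conventions, chosen here as an assumption since the text does not specify how quartiles were computed.

```python
import statistics


def robust_standardize(samples):
    """Subtract the median and divide by the interquartile range (IQR)."""
    median = statistics.median(samples)
    # Quartiles with the "inclusive" method (an assumed convention).
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the signal cannot be scaled")
    return [(x - median) / iqr for x in samples]


scaled = robust_standardize([1, 2, 3, 4, 5])
```

Median and IQR are less sensitive to isolated extreme samples than mean and standard deviation, which fits the decision to leave the signals unfiltered.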
The decision not to apply filtering is primarily due to using a deep architecture for the learned model. The model’s robustness to noise is mainly credited to the convolutional layers. These layers allow for extracting essential and relevant features from input signals through filters and moving windows and effectively skip noise and irrelevant information. It is worth noting that experiments involving filtering on some input signals (e.g., thoracic effort signal) were conducted, and they showed no significant difference in the model performance. The results of those experiments were omitted to keep the Results section clear and smooth.
2.5.2. Training
The selected 1311 records from the MESA dataset were split for model training, validation, and evaluation, resulting in 952 records for training, 104 for validation, and 255 for testing. Table A2 in Appendix A.1 shows the characteristics of data from each set.
Similar to [21], we use cross-entropy loss as the loss function, and we train the models on batches of size 30 using truncated backpropagation through time [25] with a depth of 90 and an Adam optimizer = ) [26]. We also initialize the learning rate to and reduce it by a factor of 10 when the performance stops improving for four consecutive epochs. We train the model for 30 epochs and select the model with the best area under the precision–recall curve on the validation set.
2.5.3. Postprocessing
The deep learning model takes a sequence of selected signals as input. After processing and downsampling to 1 Hz, it outputs a vector of per-second likelihoods of arousal event occurrence. These outputs result from applying a sigmoid activation function to the output of the final linear prediction layer. As the values produced by the sigmoid function range between 0 and 1, they can be intuitively interpreted as the likelihood of an event occurring at a specific time interval. The closer the output is to 1, the more likely it is that an arousal occurred. The continuous output values of the sigmoid function are thresholded to determine class labels for the binary arousal/no arousal classification task. A decision threshold is defined so that the label is set to 1 if the arousal probability is above a specified decision threshold and 0 otherwise. Arousal events are then defined as a series of continuous positive labels.
We optimize the decision threshold value to maximize the event-based F1-score on the validation set. To do this, we start by selecting three threshold values, each spaced apart, centered on the threshold that improves the pointwise F1 score. We then adjust these values, either higher or lower, based on the order of the event-based F1 scores obtained. If the central value maximizes the event-based F1 score, we decrease the minimum and maximum values. The process is repeated until the difference between the successive values is , and the central value is the one that maximizes the event-based F1-score.
The model’s output is further postprocessed to better comply with the AASM guidelines and to smooth the model predictions. A two-step postprocessing is applied. A first merging step ensures that two arousal events are separated by at least 5 s. This 5 s gap is used instead of the 10 s gap recommended by the AASM guidelines to account for up to 3 s of uncertainty in arousal onset and end detection. The second postprocessing step discards arousals lasting less than 3 s to meet the AASM’s recommended minimum arousal duration. Table A3 in Appendix A.4 details the number and percentage of detected arousals that were combined and those that were discarded for each trained model.
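The thresholding and the two postprocessing steps can be sketched as follows. This is an illustrative implementation under stated assumptions: events are represented as (start, end) pairs in seconds with an exclusive end, the function names are hypothetical, and the 0.5 threshold in the example is arbitrary and is not the optimized value used in the study.

```python
def probabilities_to_events(probs, threshold):
    """Turn per-second probabilities into (start, end) events, end exclusive."""
    events, start = [], None
    for t, p in enumerate(probs):
        if p > threshold and start is None:
            start = t
        elif p <= threshold and start is not None:
            events.append((start, t))
            start = None
    if start is not None:
        events.append((start, len(probs)))
    return events


def merge_close_events(events, min_gap_s=5):
    """Merge consecutive events separated by less than `min_gap_s` seconds."""
    merged = []
    for start, end in events:
        if merged and start - merged[-1][1] < min_gap_s:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged


def drop_short_events(events, min_duration_s=3):
    """Discard events shorter than `min_duration_s` seconds."""
    return [(s, e) for s, e in events if e - s >= min_duration_s]


# Toy output: two bursts 2 s apart, then an isolated 1 s burst.
probs = [0.1, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.1] + [0.1] * 6 + [0.9, 0.1]
raw = probabilities_to_events(probs, threshold=0.5)
final = drop_short_events(merge_close_events(raw))
```

In the toy example, the first two bursts are merged into one 6 s event, and the isolated 1 s burst is discarded.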
2.6. Evaluation Metrics
To evaluate the model’s performance, we only consider the events annotated during sleep and discard any events annotated or detected during periods of stable wakefulness. This is consistent with the definition of arousals as brief shifts from sleep to wakefulness. Moreover, only arousals during sleep time are included in the arousal index calculation. In our study, we used the hypnogram provided in the MESA dataset to remove wake periods from our analysis. However, it is worth noting that several EEG-free sleep–wake classifiers exist that could be accurate enough to replace the hypnogram.
We evaluate the results in three ways: pointwise, event-based, and recordwise. Unlike the recordwise evaluation, the first two evaluations are global and patient-agnostic. Pointwise evaluation enables precise assessment of the model’s performance at a resolution of 1 s: the accuracy of the output label is evaluated at each time bin. Event-based evaluation, on the other hand, focuses on the model’s ability to detect arousal events as they occur and accurately identify the autonomic activations related to them. Finally, recordwise evaluation complements the two global evaluations by taking inter-patient variability into account; it assesses the stability of the model’s performance across patients.
2.6.1. Point-Wise Evaluation
To calculate pointwise metrics, we first concatenate the test set records into one output sequence and then compare it with the sequence of reference ground truth annotations. We classify each sequence point based on the presence or absence of an arousal event and whether it was detected.
We evaluate the model performance based on precision, recall, and F1-score metrics. We further calculate the area under the precision–recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC).
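For reference, the pointwise precision, recall, and F1-score can be computed from two aligned per-second label sequences as in the sketch below. The function name is hypothetical, and the AUPRC and AUROC, which require the continuous probabilities, are left out.

```python
def pointwise_scores(reference, predicted):
    """Precision, recall, and F1 from aligned binary label sequences (1 = arousal)."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy sequences: 2 true positives, 1 false positive, 1 false negative.
precision, recall, f1 = pointwise_scores([0, 1, 1, 1, 0, 0], [0, 1, 1, 0, 1, 0])
```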
2.6.2. Event-Based Evaluation
For event-based evaluation, we consider a lenient definition of true detections based on the overlap rate between the detected and the reference events. A detection is considered a true positive if it sufficiently covers at least one ground truth arousal (i.e., there is sufficient overlap between a ground truth arousal event and the detected one).
Let us denote by the required minimum proportion of the ground truth event that should be covered by the model output and by the function set to 1 if the overlap criterion is satisfied between and .
As for pointwise evaluation, we use precision, recall, and F1-score metrics. In this context, the recall metric allows for measuring the ability of the model to correctly identify true arousal events while the precision allows for measuring the accuracy of the detections made by the model. In this study, the required minimum overlap rate is set to .
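One possible reading of this overlap criterion is sketched below. The exact formulas are not reproduced here, so the definitions are assumptions: a detection counts as true if it covers at least a fraction `min_overlap` of some reference event, precision is the share of such detections, and recall is the share of reference events covered by some detection. The names are hypothetical, and the 0.5 value in the example is arbitrary rather than the study's setting.

```python
def overlap_s(a, b):
    """Overlap in seconds between two (start, end) events."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def event_scores(reference, detected, min_overlap):
    """Event-based precision, recall, and F1 under an assumed overlap criterion."""

    def covers(det, ref):
        return overlap_s(det, ref) >= min_overlap * (ref[1] - ref[0])

    true_detections = sum(1 for d in detected if any(covers(d, r) for r in reference))
    found_references = sum(1 for r in reference if any(covers(d, r) for d in detected))
    precision = true_detections / len(detected) if detected else 0.0
    recall = found_references / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy events in seconds; 0.5 is an arbitrary illustrative overlap rate.
reference = [(10, 20), (40, 50)]
detected = [(12, 22), (30, 33), (49, 60)]
ev_precision, ev_recall, ev_f1 = event_scores(reference, detected, min_overlap=0.5)
```

Here only the first detection covers enough of a reference event, so one of three detections is true and one of two reference events is found.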
2.6.3. Record-Wise Evaluation
In the recordwise evaluation, we compute event-based recall, precision, and F1-score separately for each test record (i.e., each patient). We analyze the distributions of recall, precision, and F1-score to assess the model performance stability. We use the bootstrapping paired t-test [27] with the Holm–Bonferroni multiple comparisons method [28] for significance testing when comparing the results using different input combinations. The bootstrap approach was chosen due to its lack of strict assumptions about the underlying distribution of metrics’ output, as it does not presume normality or equality of variances. Instead, it mainly relies on the assumption that the empirical distribution from the data accurately reflects the actual population characteristics. The bootstrapping paired t-test involves repeatedly resampling the paired differences with replacement to create a bootstrap distribution of t-statistics, which is then used to calculate a p-value as the proportion of bootstrap t-statistics more extreme than the original t-statistic.
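A minimal sketch of this testing procedure is given below. It makes several assumptions that the text does not state: the paired differences are centered before resampling so that the bootstrap distribution reflects the null hypothesis, the test is two-sided, and the number of resamples, the seed, and the significance level are arbitrary. The function names are hypothetical.

```python
import math
import random
import statistics


def t_statistic(diffs):
    """One-sample t-statistic of paired differences."""
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))


def bootstrap_paired_t_test(scores_a, scores_b, n_resamples=2000, seed=0):
    """Two-sided bootstrap p-value for paired per-record scores."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    t_obs = t_statistic(diffs)
    mean = statistics.mean(diffs)
    centered = [d - mean for d in diffs]  # assumed: resample under the null
    extreme = 0
    for _ in range(n_resamples):
        sample = rng.choices(centered, k=len(centered))
        if abs(t_statistic(sample)) >= abs(t_obs):
            extreme += 1
    return extreme / n_resamples


def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure; returns a reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


decisions = holm_bonferroni([0.01, 0.04, 0.03])
```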
When performance is compared with the state of the art, we additionally compute the arousal index (ArI) for each record as the average number of arousal events per hour of sleep. We compare the ArI based on detected events with the ArI based on the ground truth arousal events using the Pearson correlation. We also analyze the difference between the two indexes with a Bland–Altman plot.
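The quantities involved can be sketched as follows, using only the standard library. The function names are hypothetical, the 1.96 standard deviation limits of agreement are the conventional Bland–Altman choice and are assumed here, and the example values are made up for illustration.

```python
import math
import statistics


def arousal_index(n_events, sleep_time_s):
    """Average number of arousal events per hour of sleep."""
    return n_events / (sleep_time_s / 3600)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def bland_altman(detected_ari, reference_ari):
    """Bias and assumed 1.96-SD limits of agreement of detected minus reference."""
    diffs = [d - r for d, r in zip(detected_ari, reference_ari)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Made-up per-record values, for illustration only.
ari = arousal_index(n_events=120, sleep_time_s=6 * 3600)
r = pearson_r([1, 2, 3], [2, 4, 6])
bias, lower, upper = bland_altman([10, 20, 30], [12, 18, 33])
```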
3. Results
Our analysis process results in the evaluation of 16 input data combinations. The combinations include three training settings using a single signal as input, five using two signals, six using three signals, one using four signals, and one using all available signals.
3.1. Round I: Model Trained with a Single Input Signal
In the first round, we train the model using a single signal from HR, Thor, and Snore to detect arousals. We show the results of the event-based, pointwise, and recordwise evaluations in Table 2.
The results presented in Table 2 indicate that the model trained using the Thor signal as input achieved the best performance. This model achieves a improvement in the pointwise F1-score and a improvement in the event-based F1-score. The improvements in F1 scores result from both higher precision and recall scores. We also obtain higher AUPRC and AUROC scores when using the Thor signal.
Figure 4 represents the average recordwise F1-scores of the trained models. We can observe that the model trained using the Thor signal significantly outperformed those trained using HR or Snore signals. As detailed in Table 2, the Thor signal model obtains a mean recordwise F1-score of compared with a mean recordwise F1-score of around by the HR or the Snore signal. The statistical significance of these results was tested using the bootstrapping test of Berg-Kirkpatrick et al. [27], as explained in Section 2.6.3.
In Table 3, we use relative confusion matrices to compare the model performance to detect arousal events using the Thor signal with its performance using the DHR or the Snore signal. Relative confusion matrices represent the counts of true events that both models commonly detect, events detected only by the first model, events detected only by the second model, and events not detected by either of the models. By focusing on the off-diagonal values, we conclude that about of arousals were detected only when the model used the Thor signal as input. However, about of true events were missed when using the Thor signal, while they were correctly identified using the other signals. Hence, combining the Thor signal with other physiological signals such as DHR or Snore may increase the arousal detection rate. This is what we investigate in the next round.
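A relative confusion matrix of this kind can be tallied as in the sketch below. It assumes that, for each ground truth event, we already know whether each of the two models detected it; the function name and the toy flags are hypothetical.

```python
def relative_confusion(detected_by_a, detected_by_b):
    """Count true events detected by both models, only one of them, or neither.

    Each input holds one boolean per ground truth event.
    """
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for a, b in zip(detected_by_a, detected_by_b):
        if a and b:
            counts["both"] += 1
        elif a:
            counts["only_a"] += 1
        elif b:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


# Toy detection flags for five ground truth events.
counts = relative_confusion(
    [True, True, False, False, True],
    [True, False, True, False, True],
)
```

The "only_a" and "only_b" counts are the off-diagonal entries: events that one model finds and the other misses, which is what motivates combining the signals.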
3.2. Round II: Model Trained with Two Input Signals
In this second round, we combine the Thor signal with one of the other signals at our disposal (DHR, Pos, Pos_chg, Snore, and W/S signals). The obtained evaluation results are summarized in Table 4.
The results show that including one signal from DHR, Snore, or W/S to the model inputs enhances both event-based and pointwise F1-scores. Event-based and pointwise F1-scores were 56.29% and 51.22% when using only the Thor signal (cf. Table 2). Among those combinations, the one combining the Thor signal with the DHR signal achieved the highest improvement in the event-based F1-score of approximately 3.7% when compared with the model trained using only the Thor signal (cf. Table 2). Meanwhile, the model trained with the Thor and W/S signals resulted in the best improvement in pointwise AUROC, AUPRC, and F1-scores. Combining the Thor signal with position information (Pos or Pos_chg) significantly improves recall scores but results in lower precision due to more falsely detected events.
Recordwise evaluation results indicate that using a combination of signals (DHR, Snore, or W/S) along with the Thor signal provides significantly better F1-scores on test records when compared with the model using only the Thor signal (cf. Table 2). They show a statistically significant improvement in the recall score but no significant difference in the precision score over the test records. The bar plot in Figure 5 illustrates the average recordwise F1-scores of models trained during the second round of experiments. Additionally, the average F1-score of the model trained using only the Thor signal is provided for comparison. Upon comparison, we state that the model trained with the Thor and DHR signals achieves the highest average F1-score.
3.3. Round III: Model Trained with Three Input Signals
In the third round, we examine every possible combination of three input signals that includes the Thor signal and at least one of the DHR, Snore, or W/S signals.
We compare the results obtained from these combinations with the results of the following combinations from the previous round: Thor+DHR, Thor+Snore, and Thor+W/S (summarized in Table 4). We note further improvements in terms of event-based F1-score, especially when training the model with Thor, W/S, and DHR signals as input or when considering Thor, DHR, and Snore signals. The first combination enhances the event detection precision by at least 8% compared with previous round results (cf. Table 4). Additionally, it provides the best result in terms of pointwise AUPRC, AUROC, and F1-score while maintaining a balanced trade-off between precision and recall. On the other hand, the combination of Thor, Snore, and DHR signals gives a balanced trade-off between event-based precision and recall of around 59%. However, it has significantly higher recall (58.94%) than precision (47.21%) in pointwise evaluation.
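For reference, the F1-score is the harmonic mean of precision and recall. The short sketch below applies that standard definition to the pointwise precision and recall quoted above for the Thor, Snore, and DHR combination. The function name is illustrative, and the resulting value is derived here from the two quoted numbers; it is not a figure taken from the tables.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Pointwise values quoted in the text, in percent.
print(round(f1_score(47.21, 58.94), 2))  # 52.43
```

Because the harmonic mean is pulled toward the smaller of its two arguments, a model with unbalanced precision and recall is penalized relative to a balanced one with the same arithmetic mean.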
Recordwise evaluation provides the same conclusions regarding the model trained using Thor, DHR, and W/S signals, achieving the highest F1-score (cf. Figure 6). Details in Table 5 show that the improvement in F1-score is due to higher precision on true arousal event identification. The results also reveal that combining Thor, DHR, and Snore signals or Thor, DHR, and Pos signals significantly improves the performance in terms of F1-score. Unlike the other two combinations, using Thor with DHR and Pos signals as input enhances the detection rate of the model (higher recall) while maintaining the same level of precision when compared with the model trained with Thor and DHR signals in Table 4.
3.4. Rounds IV and V: Model Trained with Four and Five Input Signals
In the fourth round, we train the model by using all of the Thor, Snore, DHR, and W/S signals. The results show only a slight improvement in the model’s performance compared with the model trained without the Snore signal (cf. Table 5). The model achieves a higher precision score but a lower recall.
In the fifth round, we further add position information to the input, which improves the model's recall. However, the improvement is not significant compared with the performance obtained in Table 5 using only the Thor, DHR, and W/S signals. Table 6 summarizes the results of the fourth and fifth rounds.
In Figure 7, we summarize the results of the different rounds of the incremental analysis approach. We focus on the per-record F1-score metric, where the F1-score is calculated for each model and record of the test set.
The bar plot in Figure 7 illustrates per-model F1-scores, showing the mean and the interquartile interval. It underlines the performance improvement when using the Thor signal compared with the DHR and Snore signals, as well as the gradual, continuous enhancement in score when combining the Thor signal with other signals. It is also worth noting that the model trained with Thor+W/S+DHR performs similarly to the ECG-based DeepCAD model. Additionally, the model combining all signals slightly improves the F1-score compared with DeepCAD.
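The per-record summary behind such a bar plot can be sketched as follows. This is a minimal illustration, assuming that event-level true positives, false positives, and false negatives are already available for each test record; the function name, the input layout, and the convention of scoring a record with no events and no detections as 0 are assumptions, not details taken from the study.

```python
import statistics


def summarize_recordwise_f1(per_record_counts):
    """Mean and interquartile interval of per-record F1-scores.

    per_record_counts: list of (tp, fp, fn) tuples, one per test record
    (hypothetical input layout; at least two records are required).
    """
    scores = []
    for tp, fp, fn in per_record_counts:
        denom = 2 * tp + fp + fn
        # Assumed convention: a record with nothing to detect scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    return {"mean": statistics.fmean(scores), "q1": q1, "q3": q3}


counts = [(5, 5, 5), (10, 0, 0), (0, 3, 4), (8, 2, 2)]
print(summarize_recordwise_f1(counts))
```

Reporting the interquartile interval next to the mean shows how much the score varies across records, which a single pooled F1-score would hide.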
3.5. Comparison with the State-of-the-Art DeepCAD Model
In this section, we compare the performance of our model when trained using Thor, DHR, and W/S signals as input with the state-of-the-art DeepCAD model proposed by Ao Li et al. [21]. The DeepCAD model is trained using the raw single-lead ECG to detect arousals during sleep. The DeepCAD model’s performance has already been tested on the MESA dataset and has shown promising results.
The comparison results at all levels of evaluation are summarized in Table 7. Using only the Thor, DHR, and W/S signals as input, we obtain performance comparable to that of the DeepCAD model. Our model provides higher pointwise AUROC and AUPRC scores and slightly enhances the pointwise precision and, hence, the F1-score. However, the DeepCAD model performed slightly better according to event-based evaluation (both globally and per record).
In Table 8, we present the relative confusion matrix comparing the performances of the two models. The off-diagonal values indicate that both models missed almost the same number of events, implying similar detection rates.
We also compare the two models in terms of the accuracy of the derived arousal indexes (ArIs). Our findings reveal a significant Pearson correlation between the ground truth ArI and the calculated ArI: 0.79 for the model trained with Thor, DHR, and W/S signals and 0.74 for the DeepCAD model.
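As an illustration of this comparison, the sketch below computes an arousal index as the number of arousals per hour of sleep and the Pearson correlation between true and estimated indexes across records. The function names and the example numbers are hypothetical and are not data from the study.

```python
import math


def arousal_index(n_arousals, total_sleep_time_hours):
    """Arousals per hour of sleep for one record."""
    if total_sleep_time_hours <= 0:
        raise ValueError("total sleep time must be positive")
    return n_arousals / total_sleep_time_hours


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up example: (scored arousals, detected arousals, hours of sleep).
records = [(120, 100, 6.0), (60, 66, 7.5), (200, 150, 5.0), (90, 81, 6.5)]
true_ari = [arousal_index(t, h) for t, _, h in records]
pred_ari = [arousal_index(p, h) for _, p, h in records]
print(pearson_r(true_ari, pred_ari))
```

A high correlation indicates that the model ranks records consistently with the reference scoring, but it does not reveal a systematic offset, which is why an agreement analysis is a useful complement.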
Figure 8 displays Bland–Altman plots that show the difference between the true ArIs and the calculated ArIs of the two models. The plots show that both models slightly underestimate the true ArIs. We can also infer that both models exhibit the same bias distribution with slightly less biased estimates using the Thor, W/S, and DHR signals.
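The quantities usually read from a Bland–Altman plot can be sketched as below: the bias is the mean difference between estimated and true values, and the limits of agreement are the bias plus or minus 1.96 standard deviations of the differences. The sign convention (estimated minus true), the function name, and the example values are assumptions made for this illustration.

```python
import statistics


def bland_altman_stats(true_values, estimated_values):
    """Return bias and 95% limits of agreement for paired measurements."""
    if len(true_values) != len(estimated_values) or len(true_values) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    # Assumed convention: negative differences mean underestimation.
    diffs = [e - t for t, e in zip(true_values, estimated_values)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return {"bias": bias, "loa_low": bias - 1.96 * sd, "loa_high": bias + 1.96 * sd}


# Made-up arousal indexes for three records.
print(bland_altman_stats([10.0, 20.0, 30.0], [8.0, 19.0, 27.0]))
```

With this convention, a negative bias corresponds to a model that underestimates the true index on average.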
4. Discussion
Findings from the first round of our incremental analysis highlighted the significantly better results obtained when the model is trained solely with the thoracic signal as input, compared with the two other models trained with the heart rate and the snore signals, respectively. Analyzing changes in the thoracic signal for arousal identification has been scarcely explored in the literature, making it a promising area for further research.
Additionally, our subsequent rounds of analysis demonstrated the effectiveness of a multi-signal approach compared with single-input signal models. By comparing different combinations of input signals, we were able to discern the most important signals for supporting the arousal identification task. Our comparison demonstrated competitive performance with the state-of-the-art DeepCAD model by combining thoracic effort, heart rate, and a wake/sleep indicator signal. The wake/sleep signal used in our study is constructed from the hypnogram. However, our sleep/wake misclassification simulation tests, detailed in Appendix A.5, indicate that even with a reasonable error rate (up to 30%), the W/S signal still improves arousal detection. Identifying new signal combinations for arousal identification holds great promise for future diagnostic device design.
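One simple way to simulate sleep/wake misclassification is to flip each epoch label of the binary W/S signal independently with a fixed probability. The sketch below illustrates that idea only; the procedure in Appendix A.5 may differ, and the function name, the 0/1 encoding, and the independent-flip noise model are assumptions.

```python
import random


def corrupt_wake_sleep(labels, error_rate, seed=0):
    """Flip each binary wake/sleep label with probability error_rate.

    labels: sequence of 0/1 epoch labels (assumed encoding).
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie in [0, 1]")
    rng = random.Random(seed)
    return [1 - x if rng.random() < error_rate else x for x in labels]


clean = [0, 0, 1, 1, 1, 0, 1, 1] * 100
noisy = corrupt_wake_sleep(clean, error_rate=0.3)
flipped = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
print(f"fraction of flipped epochs: {flipped:.2f}")
```

Feeding the corrupted signal to a trained model at several error rates gives a rough view of how sensitive the detection scores are to an imperfect sleep/wake classifier. Real classifiers tend to make errors in bursts around transitions, so independent flips are only a first approximation.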
While our findings are encouraging, it is essential to acknowledge that EEG-free arousal identification remains a complex challenge. Further research is required to develop more accurate and reliable models for detecting and characterizing autonomic activations associated with arousal.
One of the key advantages of our data-driven retrospective study is the ability to test a wide range of signal combinations independent of specific testing device capabilities, making it device-agnostic. However, this approach does come with its limitations. First, even when conducted at home, as in the case of MESA, PSG is often performed in a controlled and assisted testing setting that may not accurately replicate real-world unattended home sleep testing conditions using wearable devices. Second, smart wearables may rely on different technologies and sensing techniques, impacting measurement quality and reliability. While signals from smart wearables may not always be less accurate, it is worth noting that the viability of measuring the tested non-EEG signals will depend on the device’s capacity to record or reproduce them. Lastly, while using the MESA dataset allows us to train and evaluate models on large and diverse annotated data, the results may potentially be biased toward specific characteristics of the population targeted by MESA, such as the participants’ age, requiring careful consideration in generalizing our findings to other populations. In the future, we could consider stratifying the training data into different groups based on specific population characteristics and then training the model on a smaller set of the overall training set within each group. This approach would allow for the analysis of how certain population traits affect the model’s performance, providing a more detailed understanding of the generalizability of the study results.
5. Conclusions
In conclusion, our study explores the potential of using easily measurable physiological signals to detect arousal events during home sleep testing when EEGs are unavailable. Through a data-driven retrospective analysis utilizing data from MESA and simulating a limited-channel setting from PSG, we found that the thoracic effort signal was the most effective for arousal identification when compared with heart rate and simulated snoring signals. Furthermore, our results showed that incrementally combining the thoracic effort signal with other easy-to-measure signals led to improved model performance, highlighting the effectiveness of a multi-signal approach. Our model, trained with the thoracic effort signal, heart rate, and a binary sleep/wake signal, performed similarly to the state-of-the-art ECG-based DeepCAD model. To the best of our knowledge, this combination of input signals has not yet been explored for the EEG-free arousal identification task, even though some existing portable HST devices can easily acquire these signals. Moreover, the comparison results of our study shed light on the most valuable signals for arousal identification. Those results encourage further research to improve the techniques for acquiring and sensing these signals. Finally, we emphasize that additional research is still needed to enhance the model's ability to detect autonomic activations associated with arousals more accurately and to reduce false detections. This will be the focus of our future research.
References
- 1. Catcheside P.G., Chiong S.C., Orr R.S., Mercer J., Saunders N.A., McEvoy R.D. Acute cardiovascular responses to arousal from non-REM sleep during normoxia and hypoxia. Sleep 2001, 24, 895–902. doi:10.1093/sleep/24.8.895. PMID: 11766159.
- 2. Eaton E.J., Hume K.I., Stone P.A., Woodcock A.A. Respiratory paradox as an indicator of arousal from non-REM sleep. Sleep 1999, 22, 1059–1065. doi:10.1093/sleep/22.8.1059. PMID: 10617166.
- 3. Shneerson J. Sleep Medicine: A Guide to Sleep and Its Disorders. Wiley, Hoboken, NJ, USA, 2005.
- 4. Yeh E., Wong E., Tsai C.W., Gu W., Chen P.L., Leung L., Wu I.C., Strohl K.P., Folz R.J., Yar W. Detection of obstructive sleep apnea using Belun Sleep Platform wearable with neural network-based algorithm and its combined use with STOP-Bang questionnaire. PLoS ONE 2021, 16, e0258040. doi:10.1371/journal.pone.0258040. PMID: 34634070. PMCID: PMC8504733.
- 5. Penzel T., Sabil A. The use of tracheal sounds for the diagnosis of sleep apnoea. Breathe 2017, 13, e37–e45. doi:10.1183/20734735.008817. PMID: 29184596. PMCID: PMC5702894.
- 6. Kukwa W., Lis T., Łaba J., Mitchell R.B., Młyńczak M. Sleep position detection with a wireless audio-motion sensor–A validation study. Diagnostics 2022, 12, 1195. doi:10.3390/diagnostics12051195. PMID: 35626350. PMCID: PMC9139663.
- 7. Rofouei M., Sinclair M., Bittner R., Blank T., Saw N., DeJean G., Heffron J. A non-invasive wearable neck-cuff system for real-time sleep monitoring. Proceedings of the 2011 International Conference on Body Sensor Networks, Dallas, TX, USA, 23–25 May 2011, pp. 156–161.
- 8. Xu Y., Ou Q., Cheng Y., Lao M., Pei G. Comparative study of a wearable intelligent sleep monitor and polysomnography monitor for the diagnosis of obstructive sleep apnea. Sleep Breath. 2023, 27, 205–212. doi:10.1007/s11325-022-02599-x. PMID: 35347656. PMCID: PMC9992231.
