Identification of Comorbidities in Obstructive Sleep Apnea Using Diverse Data and a One-Dimensional Convolutional Neural Network

Kristina Zovko; Ljiljana Šerić; Toni Perković; Ivana Pavlinac Dodig; Renata Pecotić; Zoran Đogaš; Petar Šolić

PMC · DOI: 10.3390/s26031056 · February 6, 2026

Open Access

TL;DR

This study uses a deep learning model to identify comorbidities in obstructive sleep apnea patients using physiological signals and clinical data.

Contribution

A novel 1D-CNN framework for multi-label classification of OSA-related comorbidities using diverse biomedical data.

Findings

01

The 1D-CNN model outperformed traditional ML classifiers with macro AUC-ROC of 0.731 and AUC-PR of 0.750.

02

The model showed consistent performance across age, gender, and BMI groups, indicating strong generalization.

03

SpO2 and airflow signals contain comorbidity-specific patterns useful for efficient OSA comorbidity screening.

Abstract

Recent advances in deep learning (DL) have enabled the integration of diverse biomedical data for disease prediction and risk stratification. Building on this progress, the overall objective of this study was to develop and evaluate a multimodal DL framework for robust multi-label classification (MLC) of major comorbidities in patients with obstructive sleep apnea (OSA) using physiological time series signals and clinical data. We conducted a retrospective observational study including a convenience sample of 144 patients referred for overnight polysomnography at the Sleep Medicine Center (SleepLab Split), University Hospital Centre Split (KBC Split), Split, Croatia. Patients were selected based on predefined inclusion…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species (1): Homo sapiens (human)

Chemicals (1): oxygen

Diseases (6): obstructive sleep apnea; diabetes mellitus; asthma; hypertension; OSA; COPD

Figures (15)

Funding (1): Croatian Science Foundation

Keywords

obstructive sleep apnea (OSA); deep learning (DL); 1D-CNN; multi label classification (MLC); multi label confusion matrix (MLCM); sleep medicine; polysomnography

Taxonomy

Topics: Obstructive Sleep Apnea Research · Machine Learning in Healthcare · Phonocardiography and Auscultation Techniques

Full text

1. Introduction

Obstructive sleep apnea (OSA) is one of the most common sleep-related breathing disorders and affects a large portion of the global population. Current estimates suggest that nearly one billion individuals have some degree of OSA, which highlights its significance as a major public health concern [1,2]. OSA is characterized by repeated pauses or reductions in airflow during sleep due to upper airway obstruction. These events cause oxygen desaturation, fragmented sleep, and increased physiological stress. When these disturbances occur chronically, they can lead to serious long-term cardiovascular, metabolic, and respiratory consequences [3,4].

Extensive research has shown that OSA is strongly associated with several important comorbidities, including hypertension [5], type 2 diabetes mellitus [6], and chronic respiratory diseases such as asthma or chronic obstructive pulmonary disease (COPD) [7,8]. The likelihood of developing these conditions increases with OSA severity. In addition, demographic and clinical factors such as age, body mass index (BMI), metabolic irregularities, and inflammatory processes further influence the interaction between OSA and its comorbidities [9,10]. Detecting these comorbidities early is essential for preventing complications and improving patient outcomes.

Polysomnography (PSG) remains the clinical gold standard for OSA diagnosis. PSG includes continuous monitoring of several physiological signals, such as oxygen saturation ($\mathrm{SpO_2}$), airflow, heart activity, brain waves, and muscle tone. Although PSG is highly reliable, it is resource intensive, costly, and time consuming. The interpretation of PSG data typically requires manual scoring performed by trained clinicians, which can introduce variability, slow down the diagnostic process, and limit scalability [11,12,13,14]. As the volume and complexity of biomedical data continue to grow, traditional manual analysis becomes increasingly challenging.

Advances in machine learning (ML) and deep learning (DL) have opened new possibilities for improving diagnostic support in sleep medicine. DL models are capable of automatically learning patterns from physiological time series data and often achieve higher performance than traditional analytic methods. Convolutional Neural Networks (CNNs) are particularly effective for analyzing biomedical signals such as $\mathrm{SpO_2}$ and nasal airflow (FP0) because they can extract meaningful temporal features directly from raw input data [14]. These advantages make DL promising for building automated systems aimed at identifying OSA-related health risks.

Most existing research focuses primarily on detecting, monitoring, or classifying the severity of OSA itself, without addressing the broader clinical presentation in which multiple health conditions often occur together [15]. Studies in this field mainly concentrate on identifying apnea and hypopnea events or estimating OSA severity levels, while very few investigate the prediction or classification of comorbidities associated with OSA [16,17]. To the best of our knowledge, only one study highlights the importance of identifying comorbidities in this patient population, and no DL approaches have been developed specifically for multi-label classification (MLC) of comorbidities. This gap indicates a clear need for models that can analyze diverse physiological and clinical features in order to detect several coexisting conditions more accurately. The aim of this study is to develop a Deep Neural Network (DNN)-based approach for multi-label classification of OSA-related comorbidities using different types of data, including PSG signals, clinical variables, and signal-derived features.

In this study, a DL-based method using PSG signals and additional clinical information is explored to identify several comorbidities associated with OSA. In the next sections of this paper, the dataset and preprocessing steps applied to the physiological signals will be described. The extraction and preparation of clinical and signal-derived features will be explained. The architecture of the proposed one-dimensional Convolutional Neural Network (1D-CNN) for MLC will be presented, the evaluation metrics and comparison procedures will be outlined, and the experimental results and interpretation of model performance will be discussed. This structure provides a complete overview of how DL can support more accurate and efficient identification of comorbidities in patients with OSA. The general objective of this study is to develop and evaluate a multimodal DL framework for multi-label identification of major comorbidities in patients with obstructive sleep apnea by integrating $\mathrm{SpO_2}$, nasal airflow (FP0), and structured clinical parameters. The specific objectives are to preprocess physiological signals and derive representative features, design a fusion-based 1D-CNN for multi-label prediction, evaluate performance using established multi-label metrics in comparison with baseline approaches, and assess robustness across demographic subgroups. We hypothesize that multimodal fusion of $\mathrm{SpO_2}$, FP0, and clinical features improves multi-label comorbidity identification compared with single-modality inputs and traditional baseline models, particularly in threshold-independent metrics under class imbalance.

2. Related Work

Artificial intelligence (AI) techniques are increasingly used in medicine as the availability of large and diverse datasets grows and as clinical practice demands faster and more accessible diagnostic solutions. Different biomedical modalities, including physiological signals, medical imaging, wearable sensor data, and electronic health records, require analytical methods capable of capturing their temporal, spatial, and structural patterns. While traditional ML approaches remain valuable for structured and interpretable data, modern DL architectures such as CNNs, RNNs, LSTMs, and Transformers dominate contemporary research due to their ability to model complex patterns in signals, images, and multimodal inputs [18]. Nevertheless, despite these advances, many existing AI studies remain largely monomodal, and relatively few address the prediction or identification of OSA-related comorbidities.

In the domain of physiological time series, common approaches include RNNs, LSTMs, GRUs, 1D-CNNs, and ensemble methods such as Random Forest (RF) and XGBoost [19], as well as multimodal architectures combining CNN and LSTM models on EEG, ECG, $\mathrm{SpO_2}$, and airflow signals [20]. More advanced systems integrate physiological signals and EHR data through hybrid DL/ML frameworks [21], while GAN-based models have been explored for enhancing minority classes in $\mathrm{SpO_2}$ or airflow datasets [22,23]. Other studies fuse physiological signals with CT imaging or EHR data using CNN–Transformer pipelines [24,25]. Similar methodological diversity is observed in EHR, questionnaire, and population-based clinical research, where Transformers, Graph Neural Networks (GNNs), and hybrid architectures are commonly applied [26,27,28]. In medical imaging, CNNs continue to serve as the foundation, with increasing adoption of Transformers, 3D-CNNs, and hybrid systems that integrate images with physiological or behavioral signals [29,30,31]. Such multimodal solutions often rely on CNN, GNN, and U-Net-based components or combine multiple sensor types through CNN–Transformer frameworks [32,33,34,35]. A separate line of work has investigated multi-label classification with combinations of ECG, EEG, EMG, MRI, CT, and wearable sensor data using SVM, GAN, or reinforcement learning approaches [36,37,38].

AI has also become increasingly prominent in sleep medicine, particularly for automated detection of OSA. Early studies relied on handcrafted features extracted from $\mathrm{SpO_2}$ or ECG signals processed with Fully Connected Neural Networks (FCNNs) or classical ML methods [39,40]. More recent research employs ResNet models, contrastive learning, multiscale architectures, and attention mechanisms to detect apnea events or estimate AHI directly from physiological signals [41,42,43]. EEG-based and multimodal systems combine wavelet-based features, CNNs, BiLSTMs, and attention models to improve event detection and sleep staging [44,45,46,47,48,49]. Additional OSA-related work integrates anatomical imaging, acoustic data, or thermal and depth information using CNN-based architectures [50,51,52,53]. Clinical variable models using logistic regression (LR), XGBoost, SVM, and RF also remain widely used [54,55,56], together with an increasing emphasis on explainable AI (XAI) for improved clinical trust [57].

Within this broader landscape, prior work specifically targeting OSA comorbidity prediction has mostly focused on predicting individual comorbid conditions or clinically relevant outcomes using ML models trained on demographic, clinical, and sleep-derived descriptors. For instance, ref. [58] addressed OSA-related hypertension prediction by benchmarking multiple ML classifiers, including LR, gradient boosting-based methods (e.g., GBM/XGBoost), ensemble techniques, and a multilayer perceptron, in a cohort of 1493 OSA patients, and further interpreted the best-performing models using permutation importance and SHAP to highlight the relevance of demographic characteristics (e.g., age, BMI) and oxygenation-related measures (e.g., minimum $\mathrm{SpO_2}$, time below 90%). Their best-performing model, GBM, achieved strong discrimination (AUC-ROC = 0.873) and identified key contributors such as family history of hypertension and the percentage of time with $\mathrm{SpO_2} < 90\%$. Ref. [59] extended comorbidity prediction beyond cardiometabolic risk by developing depression risk models in OSA-Hypopnea Syndrome patients, comparing traditional approaches such as LR and Least Absolute Shrinkage and Selection Operator (LASSO) regularization with tree-based models such as RF, demonstrating that the combination of clinical factors and lifestyle variables can improve stratification of mental health comorbidity. In addition, ref. [60] proposed an ML-enhanced framework for predicting incident atrial fibrillation in patients with concurrent type 2 diabetes and OSA syndrome, integrating ML-based risk modeling with clinical predictors and showing that metabolic indices alongside sleep-disordered breathing severity contribute to cardiovascular comorbidity development. Finally, ref. [61] demonstrated the applicability of ML to long-term prognostic modeling by using an RF predictor with feature selection to estimate 10-year cardiovascular disease-related mortality risk in an OSA cohort, illustrating how data-driven models can capture clinically meaningful outcome risk beyond cross-sectional comorbidity status.

Overall, most existing studies address single comorbidities or long-term outcomes using predominantly tabular predictors, whereas fewer works explore multi-label prediction across multiple comorbidities simultaneously. Moreover, despite rapid progress in AI for sleep medicine, OSA-related research still mainly focuses on estimating AHI, detecting apnea and hypopnea events, or assessing disease severity. As a result, the systematic classification of OSA-related comorbidities remains largely underexplored, and many proposed solutions rely on a single data modality, whether imaging, physiological time series, or clinical metadata, thereby overlooking important cross-modal relationships and interactions. In contrast, the proposed approach formulates comorbidity identification as a multi-label task and integrates signal-derived representations with additional clinical parameters, enabling direct assessment of the contribution of non-signal features through comparison with a signal-only baseline model. This combination addresses a clear methodological gap by moving beyond monomodal, single-outcome modeling toward clinically relevant multi-label comorbidity prediction in OSA.

3. Data Description and Preprocessing

3.1. Dataset Description

This study was designed as a retrospective observational analysis conducted at the SleepLab, KBC Split, Croatia. The study protocol was approved by the Ethics Committee of the School of Medicine, University of Split (Approval No. 003-08/23-03/0015; Date: 17 October 2023). No a priori sample size calculation was performed, as this study represents an exploratory DL model development and evaluation. Therefore, the final sample size was determined by data availability and predefined inclusion and exclusion criteria. A non-probabilistic purposive (criterion-based) sampling approach was applied by first selecting eligible patients with complete physiological recordings and available clinical information required for multi-label comorbidity annotation. When multiple patients met the same eligibility criteria, a random selection was performed to obtain sufficient representation of each target comorbidity for model training.

The dataset used in this study consists of 144 patients who underwent standard overnight polysomnography (PSG) at the SleepLab Split. All recordings were acquired in clinically supervised conditions using a full PSG system (ALICE 6 Diagnostic System [63]), and the data were exported in European data format (.edf) [62]. In addition, standardized questionnaires were administered as part of the clinical assessment, including STOP-Bang, the Berlin Questionnaire, the Pittsburgh Sleep Quality Index (PSQI), and the Epworth Sleepiness Scale (ESS). All PSG recordings and clinical data were de-identified prior to analysis by removing direct personal identifiers. The dataset was stored on secure institutional systems with access restricted to authorized research personnel, and the analyses were performed in compliance with applicable data protection regulations. Due to patient privacy and ethical restrictions, the raw data are not publicly available. For the purpose of this study, two physiological signals were selected due to their strong relevance for respiratory analysis: oxygen saturation ($\mathrm{SpO_2}$) measured via pulse oximetry and nasal airflow (FP0) recorded using a nasal pressure transducer. Both signals capture essential information about respiratory disturbances during sleep. $\mathrm{SpO_2}$ reflects blood oxygen fluctuations associated with apnea and hypopnea events, whereas FP0 reflects airflow amplitude and respiratory cycles. Alongside the time series signals, the dataset includes clinical and demographic information, such as age, gender, BMI, heart rate (HR), and Apnea–Hypopnea Index (AHI). A set of signal-derived features was also computed to describe the timing and severity of respiratory events, including the duration of airflow cessation, desaturation duration, the delay between airflow loss and oxygen decline, and slope-based markers reflecting the dynamics of oxygen drops and recovery.

This multimodal structure provides a comprehensive representation of each patient’s physiological and clinical profile. The overall process, including the 1D-CNN model, is illustrated in Figure 1.

3.2. Signal Preprocessing

The PSG recordings contained noise, artifacts, and varying sampling rates depending on the recording system. Because .edf files occupy a large amount of memory, the $\mathrm{SpO_2}$ and FP0 channels were extracted from the original PSG recordings and stored in the Feather format to enable faster and more efficient processing [65]. To ensure reliable analysis and allow uniform model input, the signals underwent a structured preprocessing procedure. Figure 2 shows the $\mathrm{SpO_2}$ and FP0 signals of one patient during the night before artifact removal and filtering. The original recordings had sampling rates that were much higher than required for analyzing slow respiratory processes.

Therefore, both $\mathrm{SpO_2}$ and FP0 signals were resampled to a uniform sampling rate of 5 Hz. This rate is widely used in respiratory signal analysis because it preserves the essential shape of desaturation and airflow events while significantly reducing data volume and computational complexity. The raw signals included invalid values caused by sensor displacement, signal loss, saturation clipping, or patient movement. Such artifacts were detected and corrected using interpolation for short missing segments, replacement of physiologically impossible values, smoothing of extreme spikes, and correction of baseline drift in airflow signals. These steps ensure that only physiologically meaningful patterns remain available for feature extraction and model training. A combination of filtering methods was applied to improve signal smoothness and suppress high-frequency noise: low-pass filtering to preserve slow respiratory components [66], moving average smoothing to stabilize short-term fluctuations [67], and Savitzky–Golay filtering before computing derivatives. These filters help reveal true desaturation patterns and airflow changes while avoiding distortion of clinically relevant events.
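The resampling, artifact correction, and smoothing steps can be sketched as follows, using only the Python standard library. This is an illustrative sketch, not the study's implementation: the 100 Hz source rate, the 50 to 100% plausibility range for oxygen saturation, the smoothing width, and all function names are assumptions, and block averaging stands in for the resampling method, which the text does not specify.

```python
import math

def downsample_mean(signal, fs_in, fs_out=5):
    """Lower the sampling rate by averaging consecutive blocks of samples."""
    if fs_in % fs_out != 0:
        raise ValueError("fs_in must be an integer multiple of fs_out")
    step = fs_in // fs_out
    n_blocks = len(signal) // step
    return [sum(signal[i * step:(i + 1) * step]) / step for i in range(n_blocks)]

def interpolate_invalid(signal, low=50.0, high=100.0):
    """Replace NaN or out-of-range samples by linear interpolation."""
    valid = [i for i, v in enumerate(signal)
             if not math.isnan(v) and low <= v <= high]
    if not valid:
        raise ValueError("signal contains no valid samples")
    valid_set = set(valid)
    out = list(signal)
    for i in range(len(out)):
        if i in valid_set:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:        # leading gap: copy the first valid value
            out[i] = signal[right]
        elif right is None:     # trailing gap: copy the last valid value
            out[i] = signal[left]
        else:                   # interior gap: interpolate between neighbours
            w = (i - left) / (right - left)
            out[i] = (1 - w) * signal[left] + w * signal[right]
    return out

def moving_average(signal, width=3):
    """Centered moving average; the window shrinks at the edges."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        seg = signal[max(0, i - half):i + half + 1]
        out.append(sum(seg) / len(seg))
    return out

# Two seconds of synthetic oxygen saturation at an assumed 100 Hz,
# with a sensor dropout (value 0) in the middle.
raw = [97.0] * 100 + [0.0] * 20 + [95.0] * 80
resampled = downsample_mean(raw, fs_in=100)   # 10 samples at 5 Hz
cleaned = interpolate_invalid(resampled)      # dropout replaced
smooth = moving_average(cleaned, width=3)
```

In practice, a library resampler with an anti-aliasing filter and a dedicated Savitzky–Golay routine would likely replace these hand-written helpers.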

After cleaning and filtering, both signals were divided into fixed-length windows covering short time intervals. Windowing allows the model to learn local temporal patterns such as apnea onset, airflow reduction, and the progression of oxygen decline. It also allows the dataset to be converted into multiple training samples per patient, improving model robustness. Figure 3 shows the $\mathrm{SpO_2}$ and FP0 signals of one patient during the night after artifact and outlier removal.
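A minimal sketch of fixed-length windowing is given below. The window length and step are not stated in the text, so the 60 s values and the function name `make_windows` are purely illustrative assumptions.

```python
def make_windows(signal, fs=5, window_s=60, step_s=60):
    """Split a signal into fixed-length windows; trailing samples that
    do not fill a complete window are dropped."""
    size = int(fs * window_s)
    step = int(fs * step_s)
    return [signal[start:start + size]
            for start in range(0, len(signal) - size + 1, step)]

# Five minutes of a synthetic signal sampled at 5 Hz.
signal = list(range(1500))
windows = make_windows(signal)                 # non-overlapping windows
overlapping = make_windows(signal, step_s=30)  # 50% overlap
```

Overlapping windows yield more training samples per patient, at the cost of stronger correlation between neighbouring samples.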

3.3. Feature Engineering

To complement the raw physiological signals, a set of features was engineered to quantify specific aspects of abnormal breathing events. Desaturation events were identified in the $\mathrm{SpO_2}$ waveform using thresholds related to amplitude drop and minimum event duration. Airflow cessation events were identified in the FP0 signal and paired with corresponding $\mathrm{SpO_2}$ desaturations to ensure that extracted events were clinically meaningful. From each matched event pair, the following temporal and morphological parameters were calculated:

t3–t1 mean: Defined as the average delay between the onset of FP0 cessation (t1) and the beginning of oxygen desaturation (t3). This parameter quantifies the latency between respiratory obstruction and its physiological manifestation in blood $\mathrm{SpO_2}$.
t4–t2 mean: Defined as the average delay between the resumption of FP0 (t2) and the start of oxygen recovery (t4). This reflects the time needed for $\mathrm{SpO_2}$ to normalize once breathing resumes.
$[eqn]$ : The mean duration of FP0 cessation episodes, computed directly from the FP0 signal between markers t1 (start of apnea) and t2 (end of apnea). This value represents the average length of respiratory arrest events.
$[eqn]$ : The mean duration of oxygen desaturation episodes, calculated as the time interval between t3 (start of desaturation) and t4 (end of desaturation). It provides a measure of how long $\mathrm{SpO_2}$ remains depressed during events.
$[eqn]$ : The average desaturation difference, i.e., the difference between the initial $\mathrm{SpO_2}$ value and the minimum value reached during all detected desaturation events throughout the night. This quantifies the drop in oxygen during the night.
mean slope: The average slope of the desaturation curves, calculated as $\Delta\mathrm{SpO_2}$ / $\Delta t$ during the fall phase of all events. It describes the rate of decline in $\mathrm{SpO_2}$, distinguishing between abrupt and gradual desaturations.
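The parameters listed above can be sketched as a small aggregation over matched events. The `Event` container, the field names, the dictionary keys, and the example values are hypothetical; the slope is taken over the interval t3 to t4 with a negative sign for falling saturation, which is one possible reading of the definition rather than the study's confirmed convention.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Event:
    t1: float          # onset of FP0 cessation (s)
    t2: float          # resumption of FP0 (s)
    t3: float          # start of oxygen desaturation (s)
    t4: float          # end of desaturation / start of recovery (s)
    spo2_start: float  # saturation at t3 (%)
    spo2_min: float    # minimum saturation during the event (%)

def event_features(events):
    """Average the per-event timing and morphology measures."""
    return {
        "t3_t1_mean": mean([e.t3 - e.t1 for e in events]),
        "t4_t2_mean": mean([e.t4 - e.t2 for e in events]),
        "fp0_duration_mean": mean([e.t2 - e.t1 for e in events]),
        "desat_duration_mean": mean([e.t4 - e.t3 for e in events]),
        "desat_drop_mean": mean([e.spo2_start - e.spo2_min for e in events]),
        # Slope of the fall phase in % per second (negative when falling).
        "mean_slope": mean([(e.spo2_min - e.spo2_start) / (e.t4 - e.t3)
                            for e in events]),
    }

# Two synthetic matched events (values are illustrative only).
events = [
    Event(t1=10.0, t2=30.0, t3=18.0, t4=40.0, spo2_start=96.0, spo2_min=90.0),
    Event(t1=100.0, t2=115.0, t3=106.0, t4=121.0, spo2_start=95.0, spo2_min=92.0),
]
features = event_features(events)
```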

In addition to the above signal-derived features, standard clinical parameters were included: age, gender (female: 1, male: 0), BMI, AHI, and heart rate. All features were then integrated into a feature matrix, where each row represents a patient and each column a clinical or derived feature, shown in Table 1. The additional parameters are shown in Table 2, along with their measurement units and clinical lower limits.

These features describe the physiological relationship between airflow changes and oxygen regulation, capturing clinically relevant respiratory patterns that may reflect underlying comorbidities (Figure 4). All extracted features, along with demographic and clinical variables, were normalized to ensure consistent scaling across patients. The final dataset integrates three types of information: preprocessed $\mathrm{SpO_2}$ and FP0 time series windows, computed temporal and morphological features, and clinical and demographic variables. This structure enables the DL model to simultaneously analyze short-term respiratory dynamics and longer-term patient characteristics.

3.4. Analysis of Data

To provide clinical context for the dataset, baseline characteristics were analyzed across patient subgroups defined by the presence of comorbidities (hypertension, diabetes mellitus, asthma/COPD), as well as an overall disease/no disease split. This analysis summarizes how demographic, clinical, and signal-derived parameters vary between groups and supports the interpretation of the extracted features used in subsequent modeling.

Table 3 compares patients without comorbidities (NO: $[eqn]$ ) and those with at least one comorbidity (YES: $[eqn]$ ). For each comparison (no comorbidity vs. comorbidity, and each individual comorbidity subgroup), both p-values and effect sizes are reported. Effect sizes are given as absolute values of Cohen’s d [68], computed as the standardized difference between group means using the pooled standard deviation. Although small numerical differences are observed in several variables, BMI shows the most pronounced separation ( $[eqn]$ ), with higher mean values in patients with comorbidities. In contrast, AHI values remain similar across groups. Figure 5 visualizes the distributions of selected parameters and illustrates substantial overlap between the two populations. Similar comparisons are reported for the individual comorbidity subgroups (Table 4 (YES: $[eqn]$ ), Table 5 (YES: $[eqn]$ ), and Table 6 (YES: $[eqn]$ )), highlighting patterns associated with each diagnosis.
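As a worked illustration of the effect size used here, the sketch below computes the absolute Cohen's d with the pooled standard deviation. The function name and the BMI values are invented for the example and are not taken from the study's tables.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Absolute standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return abs(mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Illustrative BMI values only.
bmi_yes = [28.0, 30.0, 32.0]   # patients with at least one comorbidity
bmi_no = [24.0, 26.0, 28.0]    # patients without comorbidities
d = cohens_d(bmi_yes, bmi_no)
```

Here both groups have a sample variance of 4, so the pooled standard deviation is 2 and the mean difference of 4 gives d = 2.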

In addition to baseline comparisons, slope-based respiratory features were examined across AHI severity categories (Figure 6, Figure 7 and Figure 8). The mean slope increases with higher AHI levels, particularly in the severe OSA group, indicating greater variability in respiratory signal dynamics with increasing apnea severity. When normalized by age and BMI, the same trend remains visible, suggesting that respiratory slope characteristics capture meaningful changes in breathing morphology across severity strata. Statistical testing across severity categories showed increasing separation between patients with and without comorbidities, reaching significance in the moderate OSA group ( $[eqn]$ ).

Overall, these observations support the subsequent modeling stage, in which both signal-derived features and clinical parameters are considered for comorbidity prediction. Even when univariate differences are modest, predictive information may arise from multivariate and nonlinear interactions that are not reflected in classical statistical comparisons.

3.5. Limitations of Data

Although the dataset provides valuable clinical information, several limitations should be acknowledged. The label distribution is imbalanced: hypertension is considerably more prevalent than diabetes mellitus and asthma/COPD, while multi-label combinations are rare. This imbalance increases the difficulty of multi-label learning, may bias the model toward the most frequent conditions, and can reduce the stability of performance estimates for underrepresented comorbidities. In addition, the dataset does not include healthy control subjects, since all individuals were referred for PSG due to suspected or confirmed sleep-disordered breathing, which may limit generalizability to screening or population-based settings.

Although full PSG contains multiple physiological channels, this study focused on two core channels [69] ($\mathrm{SpO_2}$ and FP0) for model development. Full PSG recordings include additional modalities (e.g., EEG, ECG, EMG, respiratory belts) that provide complementary information about sleep stages, autonomic regulation, and respiratory effort and could potentially improve classification performance. Therefore, restricting the input to two channels reduces the available multimodal context and may limit the model’s ability to capture complex interactions between physiological systems. Finally, the dataset originates from a single clinical center and reflects local referral patterns, which may introduce selection bias and motivates further validation on independent cohorts.

4. Methods

4.1. Problem Definition

The goal of this study is to develop an MLC model capable of identifying three clinically relevant comorbidities commonly associated with OSA: hypertension, diabetes mellitus, and asthma/COPD. Each patient can simultaneously exhibit zero, one, two, or all three comorbidities. The predictive task therefore requires assigning a vector of three binary outputs, where each element indicates the presence or absence of a specific condition.
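The target representation can be illustrated with a short sketch. The label order, the helper names, and the 0.5 decision threshold are assumptions made for the example; the study may use a different ordering or tuned thresholds.

```python
LABELS = ("hypertension", "diabetes mellitus", "asthma/COPD")

def encode_labels(diagnoses):
    """Map a set of diagnoses to a binary vector, one element per comorbidity."""
    return [1 if label in diagnoses else 0 for label in LABELS]

def decode_predictions(probabilities, threshold=0.5):
    """Return every label whose predicted probability reaches the threshold."""
    return [label for label, p in zip(LABELS, probabilities) if p >= threshold]

no_comorbidity = encode_labels(set())
two_conditions = encode_labels({"hypertension", "asthma/COPD"})
predicted = decode_predictions([0.8, 0.2, 0.6])
```

Because each element is decided independently, any of the eight possible label combinations can be represented, including the all-zero vector.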

This setup differs from traditional single-label classification because labels are not mutually exclusive. The model must learn to capture shared patterns across conditions while also distinguishing features unique to each disease.

4.2. 1D-CNN Architecture

Convolutional Neural Networks (CNNs) are a core DL architecture originally developed for image analysis, where they extract spatial patterns using trainable filters [70]. Their fundamental mechanism for learning local features through convolution extends naturally to one-dimensional data, making them highly suitable for biomedical time series signals. In the context of sleep medicine, 1D-CNNs are effective because physiological waveforms such as $\mathrm{SpO_2}$ and airflow contain characteristic temporal structures associated with respiratory instability and oxygen desaturation events. These temporal signatures can be difficult to capture using traditional ML methods but can be efficiently learned through convolutional layers that scan the signal and detect recurring patterns [71].

In 1D form, each convolutional filter slides along the temporal axis of the signal and computes a dot product between the kernel and local segments of the waveform. This enables the model to detect short-term events such as rapid desaturation declines or airflow cessations, as well as more gradual patterns related to apnea severity or recovery dynamics. Additional components such as activation functions, padding, and stride control the nonlinearity and temporal resolution of the learned representations. The use of the ReLU activation function enhances gradient flow and prevents saturation effects, while padding ensures that the temporal length of the output remains aligned with the input signal [72].
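The sliding dot product, zero padding, and ReLU described above can be sketched in plain Python. The kernel here is hand-picked to respond to falling values, whereas in a trained network the kernel weights are learned; the function names and the baseline-subtracted example signal are assumptions for illustration.

```python
def conv1d_same(signal, kernel, stride=1):
    """Slide the kernel along the signal and take a dot product at each position.
    Zero padding keeps the output length equal to the input length when stride is 1."""
    k = len(kernel)
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    padded = [0.0] * pad_left + list(signal) + [0.0] * pad_right
    out = []
    for start in range(0, len(signal), stride):
        window = padded[start:start + k]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out

def relu(values):
    """Keep positive activations and set negative ones to zero."""
    return [max(0.0, v) for v in values]

# Oxygen saturation relative to baseline (percentage points), one desaturation.
deviation = [0, 0, -1, -4, -6, -6, -3, 0]
falling_edge = [1.0, 0.0, -1.0]   # responds where the signal is decreasing
response = conv1d_same(deviation, falling_edge)
activated = relu(response)
```

The activation is non-zero only during the fall phase of the desaturation, which is the kind of local temporal pattern a learned filter can pick up. With a stride of 2, the output is half as long.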

The predictive task in this study is formulated as an MLC problem, where each patient may simultaneously exhibit several comorbidities rather than belonging to a single diagnostic category. This stands in contrast to traditional single-label classification, where each instance is associated with exactly one class. A simple visual illustration of this difference is shown in Figure 9, which compares mutually exclusive labels with multi-label assignments applicable to real-world biomedical data [73,74].

Within this learning framework, convolutional layers play a key role in extracting temporal patterns from biomedical signals. A conceptual overview of how convolutional operations progressively transform the input signal through stacked feature extraction blocks is shown in Figure 10, which illustrates the hierarchical flow from raw time series data to deeper learned representations [71].

Building on these concepts, the proposed model uses a multi-branch 1D-CNN architecture designed to integrate PSG time series with clinical information. Separate branches process the physiological signals independently: one for oxygen saturation ($\mathrm{SpO}_2$), one for the derived $\mathrm{SpO}_2$ signal, and one for nasal airflow (FP0). Each signal branch contains convolutional layers that learn relevant temporal motifs, followed by batch normalization to stabilize training and dropout to reduce overfitting. Global Average Pooling (GAP) condenses each feature map into a compact representation, emphasizing dominant temporal patterns rather than specific signal positions. In parallel, a further branch processes clinical and signal-derived parameters such as age, BMI, AHI, heart rate, and respiratory timing features. This structured input is passed through a fully connected pathway to generate a dense embedding compatible with the signal-based representations. The outputs of all branches are then concatenated to form a unified feature vector that captures temporal dependencies in the $\mathrm{SpO}_2$ waveform, airflow-related respiratory patterns, and broader patient-level characteristics. To support multi-label prediction, the final output layer uses three independent sigmoid units, enabling simultaneous estimation of hypertension, diabetes mellitus, and asthma/COPD. This design aligns the model with the multi-label nature of the task.
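The data flow of such a fusion model can be sketched in plain Python. This is an inference-time simplification under stated assumptions: only two signal branches plus one clinical branch are shown, batch normalization and dropout are omitted, and all weights, layer sizes, and names (`conv_relu_gap`, `dense`, `forward`, `params`) are invented for illustration rather than taken from the authors' implementation.

```python
import math
import random


def conv_relu_gap(signal, filters):
    """One signal branch: 1D convolution per filter, ReLU, global average pooling."""
    pooled = []
    for kernel in filters:
        k = len(kernel)
        activations = [
            max(0.0, sum(w * x for w, x in zip(kernel, signal[i:i + k])))
            for i in range(len(signal) - k + 1)
        ]
        pooled.append(sum(activations) / len(activations))  # one value per filter
    return pooled


def dense(inputs, weights, biases, activation):
    """Fully connected layer: one output per weight row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(spo2, airflow, clinical, params):
    spo2_features = conv_relu_gap(spo2, params["spo2_filters"])
    flow_features = conv_relu_gap(airflow, params["flow_filters"])
    clinical_features = dense(clinical, params["clin_w"], params["clin_b"], relu)
    fused = spo2_features + flow_features + clinical_features  # concatenation
    # Independent sigmoids: the three probabilities need not sum to 1.
    return dense(fused, params["out_w"], params["out_b"], sigmoid)


rng = random.Random(0)  # random weights, used only to illustrate shapes


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


params = {
    "spo2_filters": rand_matrix(4, 5),   # 4 filters of width 5 (assumed sizes)
    "flow_filters": rand_matrix(4, 5),
    "clin_w": rand_matrix(3, 5),         # 5 clinical inputs -> 3 hidden units
    "clin_b": [0.0] * 3,
    "out_w": rand_matrix(3, 4 + 4 + 3),  # fused vector -> 3 comorbidity outputs
    "out_b": [0.0] * 3,
}
spo2 = [rng.uniform(-1, 1) for _ in range(60)]
airflow = [rng.uniform(-1, 1) for _ in range(60)]
clinical = [0.2, -0.1, 0.5, 0.0, 0.3]  # standardized, invented values
probabilities = forward(spo2, airflow, clinical, params)
print(probabilities)  # one probability per comorbidity
```

Because GAP reduces each feature map to a single number, the fused vector has a fixed length regardless of recording duration, which is what allows it to be concatenated with the clinical embedding.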

A detailed overview of the complete architecture, including convolutional hyperparameters, dense layers, dropout rates, and the focal loss configuration, is presented in Figure 11, which summarizes every component used in the final implementation.

Overall, the integration of convolutional feature extraction, clinical feature processing, and multi-label prediction enables the model to leverage diverse biomedical data and to learn both short-term physiological patterns and long-term patient characteristics. This architecture proved effective in identifying comorbidity-related signatures within PSG signals, supporting its use in automated risk assessment for OSA populations.

4.3. Class Imbalance Handling

The dataset exhibits notable class imbalance, especially for diabetes mellitus and asthma/COPD, which occur less frequently than hypertension. Multi-label combinations further amplify this imbalance and can bias the model toward majority classes. To address this issue, the training process incorporates weighted binary cross entropy, where each label is assigned a class-specific weight inversely proportional to its frequency. This ensures that rare comorbidity classes contribute more strongly to the loss function, encouraging the model to learn from underrepresented cases. A detailed analysis of comorbidity distribution was performed to calculate appropriate weights and to identify the imbalance between single-label, dual-label, and triple-label cases.
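One way to realize inverse-frequency weighting is sketched below. The exact weighting scheme used in the study is not specified in this section, so the formula (samples divided by positives, applied to the positive term only), the toy label matrix, and the function names are assumptions.

```python
import math


def inverse_frequency_weights(label_rows):
    """Positive-class weight per label: n_samples / n_positive (assumed scheme)."""
    n = len(label_rows)
    weights = []
    for j in range(len(label_rows[0])):
        positives = sum(row[j] for row in label_rows)
        weights.append(n / positives if positives else 1.0)
    return weights


def weighted_bce(y_true, y_prob, weights, eps=1e-7):
    """Mean binary cross-entropy with the positive term scaled per label."""
    total, count = 0.0, 0
    for t_row, p_row in zip(y_true, y_prob):
        for t, p, w in zip(t_row, p_row, weights):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(w * t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# Invented labels: columns = hypertension, diabetes mellitus, asthma/COPD.
labels = [
    [1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0],
    [1, 0, 1], [1, 1, 0], [0, 0, 0], [0, 0, 0],
]
weights = inverse_frequency_weights(labels)

# The same confident miss costs more on the rare label than on the common one.
miss_rare = weighted_bce([[0, 0, 1]], [[0.1, 0.1, 0.1]], weights)
miss_common = weighted_bce([[1, 0, 0]], [[0.1, 0.1, 0.1]], weights)
print(weights, miss_rare, miss_common)
```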

4.4. Evaluation Metrics

The evaluation of MLC models requires metrics that capture the fact that each instance may contain multiple labels simultaneously. Unlike single-label classification, where each sample has exactly one true class, MLC allows partial correctness, meaning that a prediction may overlap with the true label set even when the match is not exact. For this reason, model performance was assessed using a combination of label-based, example-based, threshold-independent, and error-based metrics [75,76].

Label-based metrics evaluate each comorbidity independently by computing standard binary measures such as accuracy, precision, recall, and F1-score [77]. Here, TP, FP, TN, and FN denote the number of true positives, false positives, true negatives, and false negatives, respectively, and N denotes the total number of samples. These metrics rely on the following definitions:

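$$\mathrm{Accuracy} = \frac{TP + TN}{N}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$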

To summarize results across all labels, macro and micro averaging were applied. Macro averaging assigns equal weight to each label, while micro averaging aggregates all true positives, false positives, true negatives, and false negatives across labels before computing the metric:

[eqn]

[eqn]
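The difference between the two averaging schemes can be shown with the F1-score. The per-label counts below are invented, and the helper names are hypothetical; the sketch assumes the standard count-based form F1 = 2TP / (2TP + FP + FN).

```python
def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts: list of (TP, FP, FN) tuples, one per label."""
    # Macro: score each label separately, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / len(counts)
    # Micro: pool the counts over labels first, then score once.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1_from_counts(tp, fp, fn)
    return macro, micro


# Invented counts: one frequent label followed by two rarer ones.
counts = [(8, 2, 2), (1, 1, 3), (2, 2, 2)]
macro_f1, micro_f1 = macro_micro_f1(counts)
print(round(macro_f1, 3), round(micro_f1, 3))
```

In this toy case the micro average is higher than the macro average because the frequent, well-predicted label dominates the pooled counts, while the macro average is pulled down by the weaker rare labels.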

Example-based metrics evaluate prediction correctness at the level of each patient. Subset accuracy is the strictest measure, requiring the entire predicted label set to match the true label set exactly,

[eqn]

Flat accuracy evaluates correctness at the level of individual labels,

[eqn]

Partial accuracy quantifies the overlap between the predicted and true label sets,

[eqn]
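A minimal sketch of the three example-based measures follows. Subset and flat accuracy are implemented as described in the text; for partial accuracy the Jaccard overlap (intersection over union, with two empty sets scored as 1) is assumed, which may differ from the exact definition used in the study. All data are invented.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of patients whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def flat_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    decisions = sum(len(t) for t in y_true)
    correct = sum(a == b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return correct / decisions


def partial_accuracy(y_true, y_pred):
    """Mean Jaccard overlap between true and predicted label sets (assumed form)."""
    scores = []
    for t, p in zip(y_true, y_pred):
        intersection = sum(a and b for a, b in zip(t, p))
        union = sum(a or b for a, b in zip(t, p))
        scores.append(intersection / union if union else 1.0)
    return sum(scores) / len(scores)


# Invented example: 4 patients, 3 labels each.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 1], [0, 0, 0]]

print(subset_accuracy(y_true, y_pred))   # only exact matches count
print(flat_accuracy(y_true, y_pred))     # every label decision counts
print(partial_accuracy(y_true, y_pred))  # overlap of label sets
```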

Threshold-independent metrics were also used. The area under the ROC curve (AUC-ROC) and the area under the precision–recall curve (AUC-PR) measure discriminative performance across different thresholds. AUC-PR is particularly informative for imbalanced datasets.
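For a single label, both areas can be computed directly from predicted scores without choosing a threshold. The sketch below uses the pairwise-ranking form of AUC-ROC and the step-wise average-precision estimate of AUC-PR; whether the study used this estimator or an interpolated curve is not stated here, and the scores are invented. Macro values would be obtained by averaging the per-label results.

```python
def auc_roc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (tied scores not merged)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    if hits == 0:
        raise ValueError("average precision needs at least one positive")
    return total / hits


# Invented scores for one label.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
roc = auc_roc(y_true, scores)
ap = average_precision(y_true, scores)
print(round(roc, 3), round(ap, 3))
```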

To quantify fine-grained prediction errors, Hamming loss was calculated,

[eqn]
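Hamming loss in its standard form is the fraction of individual label decisions that are wrong, which makes it the complement of flat accuracy. The sketch below uses invented data and a hypothetical function name; the final line checks that the values reported in this paper (Hamming loss 0.365, flat accuracy 0.635) are consistent with that relation.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (patient, label) decisions that are incorrect."""
    decisions = sum(len(row) for row in y_true)
    errors = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return errors / decisions


# Invented example: 4 patients, 3 labels, 2 wrong decisions out of 12.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 1], [0, 0, 0]]
loss = hamming_loss(y_true, y_pred)
print(loss)

# Reported values: 1 - 0.365 = 0.635.
reported_consistent = abs((1 - 0.365) - 0.635) < 1e-9
print(reported_consistent)
```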

Finally, model errors were analyzed using a multi-label confusion matrix (MLCM). Unlike traditional confusion matrices, which assume a single true class, the multi-label version computes true positives, false positives, true negatives, and false negatives separately for each comorbidity. This enables detailed inspection of label-wise misclassifications and reveals dependencies between comorbidities [78,79,80,81].
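The per-label counting step can be sketched as follows. This covers only the separate TP/FP/TN/FN tallies per comorbidity; the cross-label entries of a full MLCM are not reproduced, and the label names and data are invented.

```python
def per_label_confusion(y_true, y_pred, label_names):
    """Count TP, FP, TN, FN separately for each label."""
    table = {}
    for j, name in enumerate(label_names):
        tp = fp = tn = fn = 0
        for t_row, p_row in zip(y_true, y_pred):
            if t_row[j] and p_row[j]:
                tp += 1
            elif p_row[j]:
                fp += 1
            elif t_row[j]:
                fn += 1
            else:
                tn += 1
        table[name] = {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
    return table


names = ["hypertension", "diabetes_mellitus", "asthma_copd"]  # assumed order
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 1], [0, 0, 0]]
confusion = per_label_confusion(y_true, y_pred, names)
for name, cells in confusion.items():
    print(name, cells)
```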

The combined use of these complementary metrics provides a comprehensive evaluation framework for assessing the models’ ability to detect multiple comorbidities simultaneously and supports a detailed interpretation of predictive performance.

5. Results

This section presents the quantitative evaluation of the proposed 1D-CNN architecture for MLC of comorbidities in patients with OSA. All experiments were conducted on a workstation equipped with an Intel® Core™ i5-1135G7 CPU (2.40 GHz) and 16 GB RAM. The complete training procedure for the final model required less than one hour, depending on the selected hyperparameters. At inference time, the model required only a few seconds per patient under the same hardware configuration.

The dataset was divided at the patient level into training (70%), validation (15%), and test (15%) subsets to prevent data leakage. Hyperparameter optimization was performed on the training and validation subsets, while the independent test subset, fully unseen during model development, was reserved exclusively for the final evaluation. The model outputs three binary labels corresponding to hypertension, diabetes mellitus, and asthma/COPD.
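A patient-level split of this kind can be sketched as below. The patient identifiers, the random seed, and the rounding rule are assumptions, so the resulting subset sizes are not necessarily those of the study.

```python
import random


def split_patients(patient_ids, train=0.70, val=0.15, seed=42):
    """Shuffle unique patient IDs and cut them into train/validation/test.

    Splitting by patient (not by signal segment) keeps all data from one
    person in a single subset, which is what prevents leakage.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(train * len(ids))
    n_val = round(val * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


patient_ids = [f"P{i:03d}" for i in range(144)]  # hypothetical identifiers
train_ids, val_ids, test_ids = split_patients(patient_ids)
print(len(train_ids), len(val_ids), len(test_ids))
```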

Table 7 summarizes the raw MLCM, including class-wise precision and recall. The diagonal values represent correctly predicted labels, while non-diagonal entries represent misclassifications or clinically realistic co-occurrences.

The diabetes mellitus label demonstrates perfect recall (1.00), indicating that all true instances were successfully detected, despite its relatively small prevalence. The hypertension and asthma/COPD labels exhibit moderate recall (0.50 and 0.70), reflecting typical overlaps between clinically related comorbidities. To investigate cross-label relationships, the precision and recall matrices were computed from the raw MLCM.

High diagonal values in both matrices (Table 8 and Table 9) confirm that the 1D-CNN maintains stable precision and recall across all labels. Misclassifications primarily occur among physiologically or clinically related comorbidities, which is expected due to natural co-occurrence patterns.

To address the study objective of evaluating multi-label comorbidity identification performance, we report both label-wise metrics derived from the multi-label confusion matrix and global multi-label evaluation scores on the independent test set. The proposed 1D-CNN demonstrates stable and generalizable performance across all metrics. The precision for $[eqn]$ and $[eqn]$ (0.89 and 0.64) indicates that most positive predictions for these labels were correct (Table 8), while diabetes mellitus achieves perfect recall (1.00), confirming the model’s ability to recognize all true cases of this comorbidity (Table 9). This behavior suggests that the network successfully differentiates subtle temporal features embedded in the $\mathrm{SpO}_2$ and FP0 signals and effectively integrates structured clinical parameters.

Table 10 presents the full set of multi-label evaluation metrics. Subset accuracy (strictest measure) reached 0.286, indicating an exact match of all labels for 29% of samples. Flat and partial accuracy achieved markedly higher values (0.635), reflecting consistent partial correctness. F1-scores (macro, micro, and weighted) ranged between 0.53 and 0.55, demonstrating balanced performance across both frequent and rare labels. Macro AUC-ROC = 0.731 and AUC-PR = 0.750 indicate strong threshold-independent discriminative capability.

Moderate recall for hypertension and asthma/COPD primarily stems from the natural overlap between comorbidities (e.g., hypertension and metabolic disorders), not from instability of the model. Importantly, such confusions are clinically plausible, as several of the considered conditions frequently co-occur in OSA patients and share overlapping physiological patterns. Precision and recall matrices confirm this by showing the highest off-diagonal confusion occurring between clinically correlated labels.

Global metrics further support model robustness. While subset accuracy is intentionally strict and penalizes any partially incorrect prediction, flat/partial accuracy better reflects practical screening utility by quantifying how many comorbidity labels are correctly identified per patient. The model achieves approximately two-thirds correctness at the label level (flat/partial accuracy = 0.635) and balanced F1-scores despite class imbalance. High macro AUC-ROC and AUC-PR values reflect excellent ranking performance even when labels overlap.

Overall, these results demonstrate that the CNN reliably extracts meaningful temporal patterns from physiological waveforms. Comorbidities exhibit consistent and interpretable prediction behavior, not random confusion. The architecture generalizes well to unseen patients, confirming robustness for real-world deployment.

To address the objective of assessing robustness across demographic subgroups, we performed a stratified evaluation by age, gender, and BMI on the independent test set. This analysis indicates that the model performs consistently across all demographic subsets.

Age groups (Figure 12): Performance is most stable in the 40–69 range, the groups with the largest representation and typical comorbidity prevalence. Slight deviations in the youngest and oldest ranges are attributable to small sample sizes.

Gender (Figure 13): No significant differences were observed. Minor fluctuations reflect natural prevalence differences rather than model bias.

BMI categories (Figure 14): Stability is highest in the overweight/obese class I ranges (BMI 25–34.9). The extremes (BMI > 40) show a moderate increase in errors due to physiological variability and fewer samples.

Overall, the fluctuations observed at the extremes of the distributions are primarily attributable to smaller subgroup sample sizes and the underlying dataset distribution rather than systematic model bias, suggesting suitability for diverse clinical populations.
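A stratified evaluation of this type can be sketched by grouping test patients and computing a metric per group together with the group size, so that unstable estimates from small subgroups are visible. The age bands, the choice of flat accuracy as the metric, and all data below are assumptions made for illustration.

```python
from collections import defaultdict


def age_band(age):
    """Coarse age bands (assumed cut-offs)."""
    if age < 40:
        return "<40"
    if age < 70:
        return "40-69"
    return ">=70"


def flat_accuracy_by_group(groups, y_true, y_pred):
    """Return {group: (flat accuracy, number of patients)}."""
    correct = defaultdict(int)
    decisions = defaultdict(int)
    patients = defaultdict(int)
    for group, t_row, p_row in zip(groups, y_true, y_pred):
        patients[group] += 1
        decisions[group] += len(t_row)
        correct[group] += sum(a == b for a, b in zip(t_row, p_row))
    return {g: (correct[g] / decisions[g], patients[g]) for g in patients}


# Invented test patients.
ages = [35, 45, 52, 66, 74]
y_true = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 1], [0, 1, 1]]

by_age = flat_accuracy_by_group([age_band(a) for a in ages], y_true, y_pred)
for band, (accuracy, n) in by_age.items():
    print(band, round(accuracy, 3), f"n={n}")
```

Reporting the subgroup size next to each score makes it easier to judge whether a deviation reflects the model or simply a handful of patients.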

The proposed 1D-CNN successfully captures short-term oxygen desaturation dynamics, airflow interruption patterns, and patient-level clinical characteristics. This multimodal integration enables reliable multi-label prediction of comorbidities directly from PSG-derived signals. Results confirm that $\mathrm{SpO}_2$ and FP0 alone carry strong discriminative potential for identifying hypertension, diabetes mellitus, and asthma/COPD, supporting the development of cost-efficient screening tools. The strong threshold-independent performance (AUC-ROC, AUC-PR), balanced F1 scores, and clinically plausible confusion patterns highlight the model’s potential for real-world deployment.

Analysis of the 1D-CNN Model and Comparison with Baseline Model Approach

Table 11 summarizes the MLC results of the proposed model and compares them with a baseline CNN model trained without additional clinical parameters. The baseline model used only the transformed time series signals as input, enabling a direct assessment of the contribution of the additional parameters to overall performance.

The proposed model achieves the best threshold-independent performance, with AUC-ROC (macro) = 0.731 and AUC-PR (macro) = 0.745. In contrast, the CNN baseline model, which uses only the transformed time series signals without additional clinical parameters, obtains substantially lower ranking performance (AUC-ROC = 0.459, AUC-PR = 0.599). This highlights the clear benefit of integrating additional clinical parameters alongside signal features in the proposed architecture, indicating that the proposed 1D-CNN ranks positive cases more reliably across labels and is better aligned with the underlying class imbalance. In threshold-dependent metrics, the CNN baseline also demonstrates inferior performance, with subset accuracy = 0.238 and Hamming loss = 0.413, while the proposed 1D-CNN achieves stronger performance, including subset accuracy = 0.286 and Hamming loss = 0.365. Overall, the comparison confirms that predictive improvements stem not only from convolutional modeling but also from combining temporal features with clinically informative parameters.

6. Discussion

The overall objective of this study was to develop and evaluate a multimodal DL framework for multi-label identification of major comorbidities in patients with OSA by integrating $\mathrm{SpO}_2$, FP0 airflow, and structured clinical parameters. The findings of this study demonstrate that the proposed multi-branch 1D-CNN model can effectively extract clinically meaningful temporal patterns from $\mathrm{SpO}_2$ and FP0 signals and integrate them with structured clinical variables to identify key comorbidities associated with OSA. The model achieved balanced multi-label performance across hypertension, diabetes mellitus, and asthma/COPD, despite notable class imbalance within the dataset. This indicates that weighted loss functions and multimodal feature fusion successfully mitigated the dominance of majority labels and encouraged the network to learn discriminative representations even for less frequent comorbidities.

Recent literature increasingly supports the use of DL models for automated analysis of physiological signals in OSA screening, particularly when leveraging multimodal signal fusion to improve robustness and generalizability [82,83,84]. The raw MLCM (Table 7) shows that hypertension and asthma/COPD achieved moderate recall (0.50 and 0.70), while diabetes mellitus reached perfect recall (1.00). This result is particularly noteworthy given the relatively low prevalence of diabetes within the dataset, suggesting that the model learned subtle temporal clinical signatures specific to the metabolic profile of diabetic patients. Precision remained high across all labels (0.64–0.89), indicating low false positive rates and confirming that the classifier avoids overpredicting comorbidities, which is crucial for clinical usability.

Analysis of the precision and recall matrices (Table 8 and Table 9) further highlights that misclassifications predominantly occur between clinically related comorbidities, most notably hypertension and diabetes mellitus. This is consistent with well-known physiological and metabolic interactions in OSA patients, where sympathetic activation, intermittent hypoxia, and obesity contribute to overlapping risk profiles [85,86,87]. In addition, the bidirectional association between OSA and cardiometabolic dysfunction has been repeatedly highlighted in recent reviews and meta-analyses, supporting the clinical plausibility of these label overlaps [88]. These errors therefore likely reflect meaningful comorbidity co-occurrence rather than model instability.

Evaluation metrics (Table 10) reinforce these observations. Flat accuracy and partial accuracy of 0.635 show that approximately two-thirds of labels per patient were correctly predicted, while macro/micro/weighted F1-scores (0.533–0.551) indicate consistent performance across both frequent and rare labels. Threshold-independent metrics revealed strong discriminative capability (macro AUC-ROC = 0.731; AUC-PR = 0.750), confirming that the learned representations generalize well to unseen patient data. Combined, these results support the robustness of the proposed multimodal architecture. This aligns with recent methodological recommendations emphasizing threshold-independent evaluation (e.g., AUC-PR) for imbalanced clinical prediction tasks, where fixed threshold metrics may underestimate ranking performance [89,90].

Subgroup analyses (Figures 12–14) provide additional insights into demographic generalization. The model maintains stable performance across gender, with only minor variations reflecting natural prevalence differences rather than systematic bias. Age-based performance shows the highest stability in the 40–69 cohort, consistent with the highest sample density, while extremes of age exhibit greater variance due to limited representation. Similarly, accuracy remains highest within BMI ranges 25–35, aligning with typical OSA and comorbidity prevalence, whereas reduced stability in BMI ≥ 40 groups reflects physiological heterogeneity and smaller sample sizes. These findings suggest that performance variability is driven primarily by data distribution rather than architectural limitations.

Importantly, the study demonstrates that rich comorbidity-related physiological information is encoded within only two PSG-derived signals, $\mathrm{SpO}_2$ and FP0. The ability of the 1D-CNN to detect comorbidities without relying on full PSG channels (EEG, EMG, ECG) underscores the potential for simplified and more accessible diagnostic workflows. This aligns with the growing need for scalable, low-cost screening tools in clinical and home-based environments [90]. A comparable trend is observed in the broader OSA literature, where multimodal but reduced sensor approaches (e.g., oxygen saturation combined with other accessible signals) can achieve competitive performance while improving practicality and scalability [82].

While the majority of prior work focuses on OSA detection and severity estimation, studies addressing comorbidity level prediction remain limited, which makes direct comparison challenging and highlights the clinical novelty of the presented multi-label framework [84]. Additionally, some studies report that simpler or traditional ML approaches can appear competitive under fixed threshold metrics, particularly when probability calibration or per-label threshold tuning is not applied. This may partly explain why differences between models are sometimes smaller for subset/flat accuracy than for AUC-based metrics, despite clear improvements in ranking performance.
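To make the point about fixed thresholds concrete, the sketch below shows what per-label threshold tuning could look like: for one label, the cut-off is chosen on validation scores to maximize F1 instead of defaulting to 0.5. This is an illustration of the general technique, not a step reported in this study; the candidate grid, the scores, and the function names are invented.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for one label when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(y_true, scores):
    """Pick the grid threshold (0.05 to 0.95) with the best validation F1."""
    candidates = [i / 20 for i in range(1, 20)]
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# Invented validation scores for a rare label: the ranking is perfect,
# but every score lies below the default cut-off of 0.5.
y_val = [1, 1, 0, 0, 0, 1]
scores_val = [0.47, 0.42, 0.12, 0.22, 0.32, 0.37]

best = tune_threshold(y_val, scores_val)
print(best, f1_at_threshold(y_val, scores_val, best))
print(0.5, f1_at_threshold(y_val, scores_val, 0.5))
```

In this toy case the ranking-based metrics would be perfect while F1 at the default threshold is zero, which is the kind of gap between AUC-based and fixed-threshold metrics described above.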

Overall, the results show that 1D-CNN-based multimodal learning offers a promising direction for early identification of OSA-related comorbidities. The model captures both short-term respiratory dynamics and long-term clinical characteristics, achieving clinically interpretable and stable performance. These findings support the future integration of such models into decision support systems and telemedicine platforms.

7. Limitations and Future Work

Despite the encouraging results, several limitations should be acknowledged. First, the dataset size (144 patients) is modest and originates from a single clinical center, which restricts the model’s exposure to broader population variability and may limit generalizability. Second, only two physiological channels ($\mathrm{SpO}_2$ and FP0 nasal airflow) were used in this study. Although these signals are highly informative for capturing respiratory disturbances in OSA and contributed to strong predictive performance, the absence of additional PSG channels (e.g., EEG, ECG, EMG, and thoracoabdominal effort belts) reduces the physiological context available to the model. Future studies could incorporate a richer set of PSG modalities, supported by expert validation, to provide complementary information and potentially improve comorbidity classification. Future work should include external validation on datasets from other institutions and multi-center experiments to confirm generalizability across different clinical settings and patient populations.

Class imbalance remains another important challenge. Hypertension was substantially more prevalent than diabetes mellitus and asthma/COPD, while multi-label comorbidity combinations were rare, reflecting the underlying clinical distribution rather than a sampling artifact. Although weighted loss functions improved learning stability, performance variability across labels persisted, indicating the need for additional strategies such as data augmentation, synthetic minority oversampling (e.g., GAN-based generation), and targeted rebalancing techniques in future work.

Finally, model interpretability was not explicitly addressed. Although convolutional architectures can provide more structured feature extraction than fully connected models, the present study did not incorporate explainable AI (XAI) methods such as Grad-CAM, SHAP, or LIME. Integrating XAI techniques in future research could improve clinical transparency and trust by highlighting the signal segments and clinical variables that most strongly influence comorbidity predictions.

8. Conclusions

This study demonstrates that a multimodal 1D-CNN can integrate $\mathrm{SpO}_2$ and FP0 airflow signals with structured clinical variables to identify multiple OSA-related comorbidities within a unified multi-label framework. The findings show that the proposed fusion-based approach can reliably detect hypertension, diabetes mellitus, and asthma/COPD using a reduced set of PSG-derived inputs, supporting the feasibility of comorbidity screening without relying on full polysomnography.

The subgroup analyses indicate stable performance across age, BMI, and gender strata, suggesting that the learned representations capture clinically meaningful physiological patterns rather than systematic demographic bias. These results highlight the potential of simplified multimodal architectures to support scalable risk assessment in both clinical and home monitoring scenarios.

Despite limitations related to the modest single-center cohort and residual label imbalance, the proposed framework provides a methodological basis for further development toward clinical translation. Future work should focus on external multi-center validation, expansion of datasets, incorporation of additional PSG modalities, and the integration of explainable AI techniques to improve transparency and clinical trust. Overall, the study supports the use of deep-learning-based analysis of simplified PSG signals as a promising direction for automated decision support tools aimed at early identification of comorbidity profiles in OSA patients.

Bibliography

1. Benjafield A.V., Ayas N.T., Eastwood P.R., Heinzer R., Ip M.S.M., Morrell M.J., Nunez C.M., Patel S.R., Penzel T., Pépin J. Estimation of the global prevalence and burden of obstructive sleep apnoea: A literature-based analysis. Lancet Respir. Med. 2019, 7, 687–698. doi:10.1016/S2213-2600(19)30198-5. PMID: 31300334. PMCID: PMC7007763.
2. Iannella G., Pace A., Bellizzi M.G., Magliulo G., Greco A., De Virgilio A., Croce E., Gioacchini F.M., Re M., Costantino A. The Global Burden of Obstructive Sleep Apnea. Diagnostics 2025, 15, 1088. doi:10.3390/diagnostics15091088. PMID: 40361906. PMCID: PMC12071658.
3. Abrishami A., Khajehdehi A., Chung F. A systematic review of screening questionnaires for obstructive sleep apnea. Can. J. Anesth. 2010, 57, 423. doi:10.1007/s12630-010-9280-x. PMID: 20143278.
4. Deviaene M., Varon C., Testelmans D., Buyse B., Van Huffel S. Assessing cardiovascular comorbidities in sleep apnea patients using SpO2. Proceedings of the 2017 Computing in Cardiology (CinC); IEEE: New York, NY, USA, 2017; pp. 1–4. doi:10.22489/CinC.2017.232-224.
5. Chadia K., Archontogeorgis K., Drakopanagiotakis F., Bonelis K., Anevlavis S., Steiropoulos P. Clinical and Sleep Characteristics and the Effect of CPAP Treatment on Obese Patients with Obstructive Sleep Apnea and Asthma—A Retrospective Study. Healthcare 2025, 13, 2240. doi:10.3390/healthcare13172240. PMID: 40941591. PMCID: PMC12427636.
6. Gentile S., Monda V.M., Guarino G., Satta E., Chiarello M., Caccavale G., Mattera E., Marfella R., Strollo F. Obstructive Sleep Apnea and Type 2 Diabetes: An Update. J. Clin. Med. 2025, 14, 5574. doi:10.3390/jcm14155574. PMID: 40807193. PMCID: PMC12347911.
7. Tondo P., Hoxhallari A., Lacedonia D., Magaletti P., Sabato R., Foschino Barbaro M.P., Scioscia G. The CORE syndrome: An overlap of severe asthma, obstructive sleep apnea, rhinosinusitis, and esophageal reflux. Sleep Breath. 2024, 28, 1761–1765. doi:10.1007/s11325-024-03028-x. PMID: 38627338.
8. Kainulainen S., Töyräs J., Oksenberg A., Korkalainen H., Sefa S., Kulkas A., Leppänen T. Severity of desaturations reflects OSA-related daytime sleepiness better than AHI. J. Clin. Sleep Med. 2019, 15, 1135–1142. doi:10.5664/jcsm.7806. PMID: 31482835. PMCID: PMC6707054.