Development and validation of a bedside-available machine learning model to predict discrepancies between SaO₂ and SpO₂: Exploring factors related to the discrepancies
Raito Sato, Naoki Ito, Sakina Kadomatsu, Norikazu Hanioka, Mikio Nakajima, Tadahiro Goto, Mohsen Mehrabi

TL;DR
This study developed a machine learning model to predict when pulse oximeter readings may be inaccurate in critically ill patients, helping identify hidden hypoxemia.
Contribution
A bedside-available machine learning model was developed and validated to predict discrepancies between SpO₂ and SaO₂ using non-invasive data.
Findings
The XGBoost model achieved an AUROC of 0.73 in the development dataset and 0.70 after validation.
Worse vital signs, such as low blood pressure and temperature, were key factors associated with the discrepancy.
The model was deployed as a web-based application for clinical accessibility.
Abstract
In critically ill patients, a discrepancy frequently exists between percutaneous oxygen saturation (SpO₂) and arterial blood oxygen saturation (SaO₂), which can lead to potential hypoxemia being overlooked. The aim of this study was to explore the factors related to the discrepancy and to develop an easy-to-use prediction model that uses readily available bedside information to predict the discrepancy and suggest the need for arterial blood gas measurement. This is a prognostic study that used the eICU Collaborative Research Database from 2014 to 2015 for model development and MIMIC-IV data from 2008 to 2019 for model validation. To predict the outcome of SpO₂ exceeding SaO₂ by 3% or more, non-invasive, readily available bedside information (patient demographics, vital signs, vasopressor use, ventilator use) was used to develop prediction models with three machine learning methods (decision…
Taxonomy
Topics: Respiratory Support and Mechanisms · Hemodynamic Monitoring and Therapy · Sepsis Diagnosis and Treatment
Introduction
The pulse oximeter is essential in clinical settings for non-invasively monitoring percutaneous oxygen saturation (SpO₂), which serves as an estimate of arterial oxygen saturation (SaO₂), typically measured by arterial blood gas (ABG) analysis [1,2]. However, it is well known that pulse oximeters sometimes overestimate arterial oxygen saturation [2,3]. The accuracy of SpO₂ varies with patient demographics (e.g., race/ethnicity) and conditions (e.g., COVID-19, high HbA1c) [4–8]. Overestimation of SaO₂ by pulse oximeters can delay the detection of hypoxemia, which is especially critical in severe cases, where it may contribute to increased tissue dysfunction and in-hospital mortality [9,10]. Thus, accurately predicting the discrepancy between SpO₂ and SaO₂ is crucial in real clinical settings.
Several studies have reported machine learning models that predict the partial pressure of oxygen (PaO₂) from SpO₂ in patients in the intensive care unit (ICU) [11], but these models are limited to patients on ventilators. Given that potential hypoxemia can also occur in patients who are not mechanically ventilated, it is critically important to identify pulse oximeter inaccuracies across various clinical settings. In addition, the factors associated with the discrepancies have not been well determined.
This study aimed to develop and validate a model to predict the discrepancies between SpO₂ and SaO₂ using simple, non-invasive bedside information. This model is designed to predict pulse oximeter overestimations, inform the need for ABG analysis, and aid in preventing missed diagnoses of potential hypoxemia. The contribution of each variable to the discrepancy was further investigated.
Materials and methods
Study design and settings
This is a prognostic study that uses two publicly available patient-level ICU databases: the eICU Collaborative Research Database (eICU) version 2.0 and the Medical Information Mart for Intensive Care IV (MIMIC-IV) dataset version 2.2. The eICU comprises data from over 200,000 ICU admissions to one of 335 units at 208 hospitals located throughout the United States, spanning 2014 to 2015 [12]. MIMIC-IV, an extensive database from the Beth Israel Deaconess Medical Center in the United States, includes de-identified health-related data of over 60,000 patients admitted from 2008 to 2019 [13]. The eICU database was used for model development, and the MIMIC-IV database for both model fine-tuning and validation. The eICU database was accessed on March 20, 2024, and the MIMIC-IV database was accessed on September 1, 2023. Ethical approval and informed consent were deemed unnecessary for this study because eICU and MIMIC-IV data are de-identified in accordance with HIPAA’s Safe Harbor provision, and data access was restricted to credentialed authors who complied with the specified data use agreement. Consequently, the TXP Medical Ethical Review Board waived the requirement for ethical approval and informed consent (TXPREC-008). This study adhered to the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) guidelines for prognostic studies [14].
Study samples
This study included adult patients (aged ≥18 years) identified as Black, White, Hispanic, or Asian. SpO₂ and SaO₂ had to be recorded within 10 minutes of each other, with SpO₂ ranging from 80% to 100% and SaO₂ from 50% to 100%, and the difference between SpO₂ and SaO₂ had to be no more than 20%. The following ranges were also required for inclusion: mean blood pressure (MBP) from 50 to 180 mm Hg, respiratory rate (RR) from 5 to 50 breaths per minute, heart rate (HR) from 30 to 140 beats per minute, body mass index (BMI) from 15 to 50 kg/m², and body temperature from 32 to 42°C. Additionally, MBP, RR, and HR had to be measured within 10 minutes of the SaO₂ reading, and body temperature within 3 hours of it. Records with any missing data were excluded from the analysis.
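The inclusion criteria above can be read as a single per-record filter. The following is a minimal sketch in plain Python, not the authors’ code: the field names, the dictionary layout, and the choice of inclusive bounds are assumptions made for illustration.

```python
# Plausibility ranges from the inclusion criteria (inclusive bounds are an assumption).
RANGES = {
    "spo2": (80.0, 100.0),   # %
    "sao2": (50.0, 100.0),   # %
    "mbp": (50.0, 180.0),    # mm Hg
    "rr": (5.0, 50.0),       # breaths per minute
    "hr": (30.0, 140.0),     # beats per minute
    "bmi": (15.0, 50.0),     # kg/m^2
    "temp_c": (32.0, 42.0),  # degrees Celsius
}
# Maximum allowed time gap (minutes) between each measurement and the SaO2 reading.
MAX_GAP_MIN = {"spo2": 10, "mbp": 10, "rr": 10, "hr": 10, "temp_c": 180}
ALLOWED_RACES = {"Black", "White", "Hispanic", "Asian"}


def is_eligible(values: dict, gaps_min: dict) -> bool:
    """Return True if one SpO2/SaO2 pair meets every inclusion criterion.

    values   -- hypothetical mapping of field name to measured value
    gaps_min -- hypothetical mapping of field name to minutes from the SaO2 reading
    """
    age = values.get("age")
    if age is None or age < 18:
        return False
    if values.get("race") not in ALLOWED_RACES:
        return False
    # Missing values and out-of-range values are both excluded.
    for name, (low, high) in RANGES.items():
        value = values.get(name)
        if value is None or not (low <= value <= high):
            return False
    # Each timed measurement must fall inside its window around the SaO2 reading.
    for name, limit in MAX_GAP_MIN.items():
        gap = gaps_min.get(name)
        if gap is None or abs(gap) > limit:
            return False
    # The SpO2-SaO2 difference may not exceed 20 percentage points.
    return abs(values["spo2"] - values["sao2"]) <= 20
```

Keeping the ranges and time windows in two small tables makes it easy to see that the filter matches the written criteria one for one.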
Predictors
The predictors for the machine learning models were chosen from data routinely available in the ICU. Specifically, the predictors were patient age, sex, BMI, race, vital signs (temperature, HR, MBP, RR, and SpO₂), the use of invasive ventilation (yes/no), and the use of vasopressors (yes/no).
Outcomes
The outcome of the study was SpO₂ exceeding SaO₂ by 3 percentage points or more.
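As a small illustration, the binary outcome can be written as a one-line labelling rule. This sketch assumes the threshold is inclusive (a difference of 3 points or more, as in the abstract); the function name is hypothetical.

```python
def overestimation_label(spo2: float, sao2: float, threshold: float = 3.0) -> int:
    """Return 1 when the pulse oximeter reads at least `threshold` points above SaO2."""
    return int(spo2 - sao2 >= threshold)


# A reading of 97% against an arterial value of 94% counts as a discrepancy;
# underestimation (SpO2 below SaO2) never does.
examples = [(97.0, 94.0), (97.0, 95.0), (92.0, 96.0)]
labels = [overestimation_label(spo2, sao2) for spo2, sao2 in examples]
```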
Statistical analysis
The development data were randomly split into a training set (80%) and a test set (20%). To reduce the risk of algorithmic bias, three machine learning methods were applied and compared: (1) Decision Tree, (2) Logistic Regression, and (3) eXtreme Gradient Boosting (XGBoost). Hyperparameter tuning for each model was performed in the training set using a grid search combined with 10-fold cross-validation, with the area under the receiver operating characteristic curve (AUROC) as the optimization metric; 95% confidence intervals (CI) for AUROC were calculated. The specific parameter ranges and selected values for XGBoost are provided in S2 Table. Final model evaluation was conducted on the remaining 20% test set using multiple diagnostic metrics: AUROC, sensitivity, specificity, positive likelihood ratio (Positive LR), negative likelihood ratio (Negative LR), and diagnostic odds ratio. The model with the highest overall diagnostic value was selected for further fine-tuning and validation, and its calibration plot and calibration slope were reported. The calibration plot visually illustrates the agreement between predicted probabilities and actual outcomes, while the calibration slope quantifies the degree of calibration.
As an exploratory analysis of whether performance could be improved in the new independent dataset (MIMIC-IV), hyperparameters were re-optimized using admissions from 2008–2013, and the updated model was then evaluated on the held-out 2014–2019 data.
Two techniques were used for model interpretation: SHapley Additive exPlanations (SHAP) analysis and partial dependence plots (PDPs). Both methods show how individual features affect the model’s predictive output.
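The evaluation metrics named in this section have standard definitions: sensitivity, specificity, the likelihood ratios, and the diagnostic odds ratio follow from a 2×2 confusion matrix, and AUROC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The sketch below illustrates those definitions in plain Python. The function names and the example counts are illustrative assumptions, not values or code from this study.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    positive_lr = sensitivity / (1 - specificity)
    negative_lr = (1 - sensitivity) / specificity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_lr": positive_lr,
        "negative_lr": negative_lr,
        "diagnostic_or": positive_lr / negative_lr,
    }


def auroc(y_true: list, scores: list) -> float:
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up counts and scores, for illustration only.
metrics = diagnostic_metrics(tp=60, fp=20, fn=40, tn=80)
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise AUROC is quadratic in the number of records, which is fine for a worked example but slower than the rank-based routines in statistical libraries.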
These analyses, visualizations, and the model publication were conducted using R (version 4.2.2), Python (version 3.10.12), and the Python library Streamlit (version 1.26.0).
Results
During 2014–2015, the eICU database recorded 70,304 data points containing both SpO₂ and SaO₂ measurements. Data points were excluded for the following reasons: 44,330 had time gaps of more than 10 minutes between the SpO₂ and SaO₂ measurements, 4,789 contained outliers, and 1,381 had missing data. Finally, 4,781 admissions and 19,804 data points were identified as suitable for analysis. There were 5,231 data points where SpO₂ exceeded SaO₂ by 3% or more (Tables 1 and 2). In the validation cohort using the MIMIC-IV database (2008–2019), there were 4,267 admissions and 9,339 data points. Of these, 5,953 data points were from 2008 to 2013, and 3,386 data points were from 2014 to 2019. Study flows are shown in Fig 1.
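The cohort counts reported above can be checked with simple arithmetic. The short sketch below uses only the numbers stated in this paragraph; the variable names are illustrative.

```python
# eICU development cohort: start from all paired SpO2/SaO2 data points.
total_pairs = 70_304
excluded = {
    "time gap over 10 minutes": 44_330,
    "outliers": 4_789,
    "missing data": 1_381,
}
remaining = total_pairs - sum(excluded.values())  # 19,804 data points

# Share of analysed data points with the outcome (SpO2 above SaO2 by 3% or more).
positives = 5_231
outcome_share = positives / remaining  # about 0.264, i.e. roughly 26%

# MIMIC-IV validation cohort: the two temporal subsets add up to the reported total.
mimic_total = 5_953 + 3_386  # 9,339 data points
```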
Table 1: Predictor variables and outcome in 19,804 data from eICU.
Table 2: Patient characteristics stratified by SpO₂–SaO₂ discrepancy ≥3%.
Fig 1. Flow diagram of patients eligible for analysis in the development and validation cohorts. (A) In the eICU dataset (2014–2015), 70,304 data points included both SpO₂ and SaO₂ measurements. Among these, 19,804 data points also contained all other predictors (i.e., patient age, sex, BMI, race, temperature, HR, MBP, RR, the use of invasive ventilation, and the use of vasopressor). (B) In the MIMIC-IV dataset (2008–2019), 81,797 data points included both SpO₂ and SaO₂ measurements. Among these, 9,350 data points also contained all other predictors. SpO₂ = Percutaneous Oxygen Saturation, SaO₂ = Arterial Oxygen Saturation, eICU = eICU Collaborative Research Database, MIMIC-IV = Medical Information Mart for Intensive Care IV, BMI = Body mass index, HR = Heart rate, MBP = Mean blood pressure, RR = Respiratory rate.
Comparison of three training models
Among the three models compared, the XGBoost model demonstrated the highest predictive performance on the test set (20% of the dataset): the AUROC was 0.73 (95% CI: 0.71–0.74) for XGBoost, 0.60 (95% CI: 0.58–0.62) for the Decision Tree model, and 0.57 (95% CI: 0.55–0.59) for Logistic Regression (S1 Fig). After comparing the other evaluation scores (S1 Table), the XGBoost model was selected for prediction. A calibration plot was also constructed to further assess the model’s performance; the calibration slope was 0.90 (S2 Fig).
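The text does not state how the calibration slope was computed. One common definition, assumed here, is the slope coefficient from a logistic regression of the observed outcome on the logit of the predicted probability. The sketch below fits that two-parameter model with Newton's method in plain Python; the function name and the toy data are illustrative, and real analyses would normally use a statistics library.

```python
import math


def calibration_slope(y_true: list, probs: list, iterations: int = 25) -> tuple:
    """Fit outcome ~ intercept + slope * logit(predicted probability).

    Returns (intercept, slope). Predicted probabilities must lie strictly
    between 0 and 1 so that the logit is defined.
    """
    x = [math.log(p / (1 - p)) for p in probs]
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            g_a += yi - mu          # gradient of the log-likelihood
            g_b += (yi - mu) * xi
            h_aa += w               # entries of the information matrix
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Toy data: 2 of 10 events in the low-risk group, 8 of 10 in the high-risk group.
outcomes = [1, 1] + [0] * 8 + [1] * 8 + [0] * 2
well_calibrated = [0.2] * 10 + [0.8] * 10   # predictions match observed frequencies
too_extreme = [0.1] * 10 + [0.9] * 10       # predictions overstate the contrast

_, slope_ok = calibration_slope(outcomes, well_calibrated)
_, slope_extreme = calibration_slope(outcomes, too_extreme)
```

In this toy example the well-calibrated predictions give a slope of about 1, while the too-extreme predictions give a slope below 1, which is the direction of the 0.90 reported above.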
SHAP and partial dependence plot
The SHAP summary plot revealed the top 14 features identified by the XGBoost model, ranked by their average SHAP values. These values indicate the positive or negative impact of each feature (Fig 2). To further elucidate the associations, PDPs were created, focusing on the relationship between six key variables (BMI, Age, HR, Temperature, MBP, RR) and the discrepancy in pulse oximetry measurements (Fig 2). Among these variables, Age, HR and RR showed increasing trends in association with the discrepancy, whereas Temperature and MBP displayed inverse trends.
Fig 2. SHAP value and PDPs in the development cohort. (A) SHAP value plot showing the impact of each predictor on the model’s output in the development cohort using the eICU dataset. Each point represents a single data point, with the color indicating the feature’s value (red for high values and blue for low values). The position on the x-axis shows the effect of the feature on the prediction, with values to the right indicating a higher predicted outcome. (B) Partial Dependence Plots (PDPs) for key predictors in the development cohort using the eICU dataset. Each plot displays the relationship between a specific predictor and the predicted outcome, while averaging out the effects of all other predictors in the model. The x-axis represents the range of the predictor’s values, and the y-axis shows the average predicted outcome, providing insight into how changes in the predictor’s value influence the model’s predictions. SHAP = SHapley Additive exPlanations, eICU = eICU Collaborative Research Database, MIMIC-IV = Medical Information Mart for Intensive Care IV.
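The averaging described for the PDPs can be sketched in a few lines: for each grid value of one predictor, that predictor is overwritten in every record and the model's predictions are averaged. The code below is a minimal plain-Python illustration. The `toy_predict` function is a hypothetical stand-in for a fitted model, and its coefficients and the example records are invented for demonstration only.

```python
def partial_dependence(predict, rows: list, feature: str, grid: list) -> list:
    """Average prediction at each grid value, with `feature` overwritten in every row."""
    curve = []
    for value in grid:
        modified = [{**row, feature: value} for row in rows]
        curve.append(sum(predict(row) for row in modified) / len(modified))
    return curve


def toy_predict(row: dict) -> float:
    """Hypothetical risk score: rises with heart rate, falls with mean blood pressure."""
    return 0.01 * row["hr"] - 0.002 * row["mbp"]


rows = [{"mbp": 60, "hr": 100}, {"mbp": 90, "hr": 80}]
mbp_curve = partial_dependence(toy_predict, rows, "mbp", [50, 100])
# mbp_curve is approximately [0.8, 0.7]: a downward slope, i.e. an inverse trend.
```

Because every other predictor keeps its observed values, the curve reflects the average effect of the chosen predictor across the cohort rather than its effect in any single patient.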
Fine-tuning and temporal validation
The original XGBoost model was first validated on the MIMIC-IV dataset, which yielded an AUROC of 0.56. After the XGBoost model was fine-tuned on the earlier years of MIMIC-IV (2008–2013) and validated on the remaining years (2014–2019), the AUROC was 0.70 (95% CI: 0.68–0.72) and the calibration slope was 0.85 (S3 Fig). SHAP values in the validation cohort showed a pattern similar to that observed in the training cohort (Fig 3). The PDPs, shown in the same figure, also exhibited trends similar to those in the development cohort, indicating that vital signs influence the prediction.
Fig 3. SHAP value and PDPs in the validation cohort. (A) SHAP value plot for the validation cohort using the MIMIC-IV dataset. The model developed with the eICU data (development cohort) was applied to the MIMIC-IV (2014–2019) to assess its generalizability. Each point represents a single data point, with the color indicating the feature’s value. The position on the x-axis shows the effect of the feature on the prediction, with values to the right indicating a higher predicted outcome. (B) Partial Dependence Plots (PDPs) for key predictors in the validation cohort using the MIMIC-IV (2014–2019) dataset. The PDPs are aligned with those used in the development cohort (eICU) to ensure consistency in comparison. Each plot displays the relationship between a specific predictor and the predicted outcome, while averaging out the effects of all other predictors in the model. The x-axis represents the range of the predictor’s values, and the y-axis shows the average predicted outcome, providing insight into how changes in the predictor’s value influence the model’s predictions. SHAP = SHapley Additive exPlanations, eICU = eICU Collaborative Research Database, MIMIC-IV = Medical Information Mart for Intensive Care IV.
Discussion
In this study, a machine learning model was developed using 19,804 data points from the eICU database to predict overestimation of SaO₂ by pulse oximeters. In both datasets, worse vital signs were associated with the SpO₂–SaO₂ discrepancy: low body temperature, low MBP, and high RR were each associated with it. This indicates that the discrepancy warrants attention in the ICU, particularly in critically ill patients.
Compared with a previously reported model that achieved an AUROC of about 0.83 for estimating PaO₂ from SpO₂ in ventilated patients [11], this model shows lower discrimination. Its exclusive reliance on non-invasive inputs may partly explain the reduced precision in estimating SaO₂. Even so, an exploratory temporal-validation analysis produced an AUROC of 0.70, suggesting reasonably stable performance in unseen patients and offering a clinically relevant gauge of generalizability. Further gains may be possible by expanding the training dataset and fine-tuning the timing of the paired measurements.
Interestingly, the SHAP analysis showed that the top-ranked predictors were consistent in the training (eICU) and validation (MIMIC-IV) datasets, indicating that these variables retain their importance across settings and may support wider generalizability. Partial dependence plots for heart rate, mean blood pressure, and respiratory rate displayed similar risk gradients in both cohorts, illustrating how changes in vital signs affect the predicted risk and making the model less of a black box for clinicians. The agreement between SHAP values and PDP trends suggests that these routinely recorded clinical features are key drivers of the predictions.
The finding that worsening vital signs (e.g., low temperature and low blood pressure), which lead to low perfusion, are associated with decreased pulse oximeter accuracy is consistent with previous studies. For example, the accuracy of the pulse oximeter decreases when patients have low perfusion [15,16]. Pulse oximeter accuracy is also influenced by body temperature, with readings often overestimated when body temperature is low [17].
A key strength of this study lies in offering clear insights into how various factors affect discrepancies between SpO₂ and SaO₂, simplifying the understanding of these complex interactions. These factors may contribute to overlooking hypoxemia, a condition associated with poor prognosis. This study emphasizes the necessity of vigilant monitoring for hidden hypoxemia in critically ill patients. It also contributes to the ongoing development of accurate, non-invasive methods for assessing oxygenation status, marking a significant step forward in improving patient care in critical care environments.
This model demonstrated stable predictive performance in a temporally split validation using the MIMIC-IV dataset, suggesting its potential for integration into clinical workflows as a decision-support tool. In particular, it may help identify patients at risk of hypoxemia, especially those with low perfusion or other critical vital signs. The model outputs can be used to guide the prioritization of patient monitoring in ICU settings, while still relying on clinical judgment and confirmatory tests such as blood gas analysis. Clinicians can use the model’s predictions to support early recognition and intervention for patients who may require closer observation.
Model implementation
A lightweight, publicly accessible web application (https://spo2-to-sao2.streamlit.app/) implementing this prediction model has been released. Users manually enter readily available information—vital signs and basic patient characteristics—and the app instantly returns the estimated probability of an SpO₂–SaO₂ discrepancy (≥ 3 percentage points). Although the tool is not yet embedded in hospital information systems and therefore requires hand entry, it still allows clinicians to gauge risk at the bedside and may serve as an engaging proof-of-concept for future, fully integrated deployments.
Potential limitation
This study has several limitations. First, SaO₂-predictor pairs combine measurements that are not always collected simultaneously. To mitigate this, the data were restricted to vital signs and ABG tests recorded within 10 minutes of each other. Second, the database used for training in this study is solely from the United States, which may limit the generalizability of these results. To address this concern and enhance the robustness of the findings, external validation was conducted. Third, bias may arise from using multiple data points from the same patients. This aspect is critical in interpreting the data and suggests the need for future research to consider individual patient variability more thoroughly.
Conclusion
By using non-invasive, readily available bedside information, a machine learning model was developed to predict when SpO₂ exceeds SaO₂ by 3% or more, although its predictive ability was suboptimal in a different dataset. Vital signs (e.g., temperature and heart rate) were identified as factors associated with these discrepancies. These findings underscore the need for awareness of hidden hypoxemia and provide a basis for further studies to identify hidden hypoxemia in critically ill patients.
Supporting information
S1 Fig. Comparison of three machine learning models’ ROC curves and AUC values. This figure shows the ROC curves and AUC values for three machine learning models: (1) Decision Tree, (2) Logistic Regression, and (3) XGBoost. The ROC curves illustrate each model’s performance by plotting the true positive rate against the false positive rate. The AUC values, displayed on each curve, indicate the overall performance of the models, with higher values representing better discriminatory ability. ROC = Receiver operating characteristic, AUC = Area under the curve, XGBoost = eXtreme Gradient Boosting. (TIF)
S1 Table. Three models’ scores. AUROC = Area under the receiver operating characteristic curve, XGBoost = eXtreme Gradient Boosting, Positive LR = Positive Likelihood Ratio, Negative LR = Negative Likelihood Ratio, Diagnostic OR = Diagnostic Odds Ratio. (TIF)
S2 Table. Hyperparameter search space for the XGBoost model. XGBoost = eXtreme Gradient Boosting. (TIF)
S2 Fig. Calibration plot and calibration slope in the development cohort. This figure presents the calibration plot and calibration slope for the development cohort using the eICU dataset. The calibration plot compares the predicted probabilities of the model with the actual outcomes, illustrating how well the model’s predictions match the observed results. The ideal line represents perfect calibration, where predicted probabilities exactly match the observed frequencies. The calibration slope indicates the agreement between predicted probabilities and actual outcomes. A slope of 1 suggests perfect calibration, while deviations from 1 indicate under- or over-estimation of the predicted risks. eICU = eICU Collaborative Research Database. (TIF)
S3 Fig. ROC curve and calibration plot in the validation cohort. This figure presents the ROC curve and calibration plot for the validation cohort using the MIMIC-IV dataset (2014–2019). The model was trained using data from the eICU database and validated using the MIMIC-IV dataset (2014–2019). ROC = Receiver Operating Characteristic, MIMIC-IV = Medical Information Mart for Intensive Care IV, eICU = eICU Collaborative Research Database. (TIF)
References
1. Miyasaka K, Shelley K, Takahashi S, Kubota H, Ito K, Yoshiya I, et al. Tribute to Dr. Takuo Aoyagi, inventor of pulse oximetry. J Anesth. 2021;35(5):671–709. doi: 10.1007/s00540-021-02967-z. PMID: 34338865; PMCID: PMC8327306.
2. Al-Halawani R, Charlton PH, Qassem M, Kyriacou PA. A review of the effect of skin pigmentation on pulse oximeter accuracy. Physiol Meas. 2023;44(5):05TR01. doi: 10.1088/1361-6579/acd51a. PMID: 37172609; PMCID: PMC10391744.
3. Milner QJW, Mathews GR. An assessment of the accuracy of pulse oximeters. Anaesthesia. 2012;67(4):396–401. doi: 10.1111/j.1365-2044.2011.07021.x. PMID: 22324874.
4. Sjoding MW, Dickson RP, Iwashyna TJ, Gay SE, Valley TS. Racial bias in pulse oximetry measurement. N Engl J Med. 2020;383(25):2477–8. doi: 10.1056/NEJMc2029240. PMID: 33326721; PMCID: PMC7808260.
5. Okunlola OE, Lipnick MS, Batchelder PB, Bernstein M, Feiner JR, Bickler PE. Pulse oximeter performance, racial inequity, and the work ahead. Respir Care. 2022;67(2):252–7. doi: 10.4187/respcare.09795. PMID: 34772785.
6. Fawzy A, Wu TD, Wang K, Sands KE, Fisher AM, Arnold Egloff SA, et al. Clinical outcomes associated with overestimation of oxygen saturation by pulse oximetry in patients hospitalized with COVID-19. JAMA Netw Open. 2023;6(8):e2330856. doi: 10.1001/jamanetworkopen.2023.30856. PMID: 37615985; PMCID: PMC10450566.
7. Rose N, Sriram RB, Sudheer R. An observational study of simultaneous pulse oximetry and arterial oxygen saturation readings in intensive care unit/high dependency unit in COVID-19 patients. Asian J Med Sci. 2022;13(3):18–22. doi: 10.3126/ajms.v13i3.41218.
8. Pu LJ, Shen Y, Lu L, Zhang RY, Zhang Q, Shen WF. Increased blood glycohemoglobin A1c levels lead to overestimation of arterial oxygen saturation by pulse oximetry in patients with type 2 diabetes. Cardiovasc Diabetol. 2012;11:110. doi: 10.1186/1475-2840-11-110. PMID: 22985301; PMCID: PMC3489581.
