Assessing perioperative risks in a mixed elderly surgical population using machine learning: A multi-objective symbolic regression approach to cardiorespiratory fitness derived from cardiopulmonary exercise testing

Pietro Arina; Davide Ferrari; Maciej R. Kaczorek; Nicholas Tetlow; Amy Dewar; Robert Stephens; Daniel Martin; Ramani Moonesinghe; Mervyn Singer; John Whittle; Evangelos B. Mazomenos; Dhiya Al-Jumeily OBE; Dhiya Al-Jumeily OBE

PMC · DOI:10.1371/journal.pdig.0000851·May 16, 2025

Assessing perioperative risks in a mixed elderly surgical population using machine learning: A multi-objective symbolic regression approach to cardiorespiratory fitness derived from cardiopulmonary exercise testing

Pietro Arina, Davide Ferrari, Maciej R. Kaczorek, Nicholas Tetlow, Amy Dewar, Robert Stephens, Daniel Martin, Ramani Moonesinghe, Mervyn Singer, John Whittle, Evangelos B. Mazomenos, Dhiya Al-Jumeily OBE, Dhiya Al-Jumeily OBE

PDF

Open Access

TL;DR

This study uses machine learning to better predict post-surgery complications in elderly patients by incorporating cardiorespiratory fitness data, with a new method called MOSR showing the best performance.

Contribution

The novel Multi-Objective Symbolic Regression (MOSR) model improves postoperative morbidity prediction by integrating cardiorespiratory fitness data and demonstrating superior accuracy and interpretability.

Findings

01

Models using cardiorespiratory fitness data improved postoperative morbidity prediction by 20% compared to clinical data alone.

02

MOSR outperformed existing risk scores and other ML models in predicting postoperative complications.

03

CPET time-series data models showed a 12% improvement over cardiorespiratory fitness models.

Abstract

Accurate preoperative risk assessment is of great value to both patients and clinical teams. Several risk scores have been developed but are often not calibrated to the local institution, limited in terms of data input into the underlying models, and/or lack individual precision. Machine Learning (ML) models have the potential to address limitations in existing scoring systems. A database of 1190 elderly patients who underwent major elective surgery was analyzed retrospectively. Preoperative cardiorespiratory fitness data from cardiopulmonary exercise testing (CPET), demographic and clinical data were extracted and integrated into advanced machine learning (ML) algorithms. Multi-Objective-Symbolic-Regression (MOSR), a novel algorithm utilizing Genetic Programming to generate mathematical formulae for learning tasks, was employed to predict patient morbidity at Postoperative Day 3, as…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes2

SLTM SHROOM4

Proteins2

Species1

Homo sapiens(human · species)

Chemicals5

PDIG-D-24-00447 CO2 Na O2 K

Diseases12

TS ML CRF POMS postoperative complications trauma MOSR CPET AT morbidities stroke PPOSSUM

Figures3

Click any figure to enlarge with its caption.

Fig 1 — A - Cardiopulmonary exercise testing (CPET).On the left, the different phases of CPET are detailed, including Rest, Ventilatory Anaerobic Threshold (AT), Peak, and Recovery, describing the exercise intensity levels and physiological responses. The central graph displays the progression of exercise intensity, marked by the green ramp, alongside traces of oxygen uptake (V̇O2) and carbon dioxide production (V̇CO2). The right section elucidates primary features derived from CPET such as V̇O2, V̇CO2, heart rate (HR), oxygen pulse (V̇O2/HR), ventilation (VE), and the respiratory exchange ratio (RER). The image portrays a patient undergoing CPET, equipped with the appropriate testing apparatus. The graph is presented with the consent of Quick O, Reed-Poysden C, from their work “Cardiopulmonary Exercise Test: Interpretation and Application in Perioperative Medicine” 2022, https://doi.org/10.28923/atotw.473. B - Bar chart illustrating the average number of minor and major components contributing to each Postoperative Morbidity Survey (POMS) score on day 3. The bars represent the mean count of individual minor (blue) and major (orange) morbidity factors for patients with POMS scores ranging from 0 to 6. C - Scatter plot with linear regression analysis illustrating the relationship between patient age and Postoperative Morbidity Survey (POMS) scores on day 3. Each datapoint represents an individual patient’s age against their corresponding POMS score. D - Scatter plot with linear regression analysing the relationship between V̇O2/Kg at the Anaerobic Threshold (AT) and Postoperative Morbidity Survey (POMS) scores on day 3. Each datapoint signifies the V̇O2/Kg AT value for an individual patient and their respective POMS score.

Fig 2 — A - Receiver operating characteristic (ROC) curve comparison for the Multi-objective Symbolic Regression (MOSR) model’s performance in predicting Postoperative Morbidity Survey (POMS) scores at day 3.The ROC curves of three models are presented: MOSR using Cardiorespiratory Fitness (CRF) dataset (blue), MOSR using Clinical dataset (orange), and MOSR utilizing a combination of both CRF and Clinical dataset (green). B - SHAP Beeswarm Plot of MOSR models using Cardiorespiratory Fitness (CRF) and Clinical database. Each row represents a feature used in the model. Each dot on a row corresponds to a datapoint, with its position on the x-axis indicating the SHAP value, or the contribution of that feature to the model’s prediction for that datapoint. The colour of the dots represents the value of that feature, with blue indicating low values and red indicating high values. V̇O2/Kg AT: Oxygen Consumption per Kilogram at Anaerobic Threshold, V̇O2/Kg VOP: Oxygen Consumption per Kilogram at Peak, VE/V̇CO2 AT: Ventilatory Equivalent for Carbon Dioxide at Anaerobic Threshold, BMI: Body Mass Index, V̇O2/Kg Rest = Oxygen Consumption per Kilogram at Rest, MET = Metabolic Equivalent of Task, V̇O2 AT = Oxygen Consumption at Anaerobic Threshold, HR AT = Heart Rate at Anaerobic Threshold, VT VOP = Tidal Volume at Peak, VE VOP = Ventilation at Peak, V̇O2/HR AT = Oxygen Consumption per Heart Rate at Anaerobic Threshold, V̇CO2 AT = Carbon Dioxide Production at Anaerobic Threshold, PetCO2 VOP = Partial End-tidal Carbon Dioxide at Peak, WR AT = Work Rate at Anaerobic Threshold, VE AT = Ventilation at Anaerobic Threshold, WR VOP = Work Rate at Peak, PetCO2 Rest = Partial End-tidal Carbon Dioxide at Rest, RR VOP = Respiratory Rate at Peak. C - Receiver Operating Characteristic (ROC) curves comparing the Multi-objective Symbolic Regression (MOSR), utilizing either CRT data alone or with Clinical data, against established clinical assessment scores in predicting Postoperative Morbidity Survey (POMS) outcomes on day 3. MOSR CRF dataset model (blue), Cardiopulmonary Exercise Testing (CPET) (red), American Society of Anaesthesiologists Score (ASA) (yellow), Physiological and Operative Severity Score for the enumeration of Mortality and morbidity (PPOSSUM) (purple), and Duke Activity Status Index (DASI) (green). D–F - Receiver Operating Characteristic (ROC) curves comparing the Multi-objective Symbolic Regression (MOSR) model, against DecisionTree Classifier, LGBM Classifier, Logistic Regression, Random Forest Classifier, and XGB Classifier. Panel D shows ROC curves for models using only Clinical dataset, Panel E displays ROC curves for models using only Cardiorespiratory Fitness dataset and Panel F presents ROC curves for models using both CRF and Clinical data.

Fig 3 — A – Comparison of 585 time-series showing oxygen uptake per kilogram per minute (V̇O2/Kg/min) in cardiopulmonary exercise testing (CPET).The x-axis shows the moment of the exam, while the y-axis displays the V̇O2/Kg/min values measured in ml/kg/min. Patients are categorized and color-coded based on their Postoperative Morbidity Survey (POMS) scores: those with POMS scores equal or less than 1 are represented in light blue, POMS scores exactly at or more than 2 in orange, and POMS scores greater than 3 in red. B – Illustration of the effect of resampling a data track on its resolution and detail. The top left plot shows the original track with high-resolution data points, depicted in blue. The other three plots demonstrate the track after resampling to different numbers of points, with each plot color-coded to represent a specific resampling: 1000 points in green (top right), 500 points in orange (bottom left), and 100 points in red (bottom right). C - Receiver Operating Characteristic (ROC) curves comparing the Multi-objective Symbolic Regression (MOSR) model against other machine learning classifiers: Decision Tree, LGBM Classifier, Logistic Regression, Random Forest Classifier, and XGB Classifier. Each curve represents the respective model’s performance in predicting day 3 Postoperative Morbidity Survey (POMS) scores using Cardiorespiratory Fitness Time Series (CRF-TS) dataset. D - Receiver Operating Characteristic (ROC) curves comparing the Multi-objective Symbolic Regression (MOSR) model against other machine learning classifiers: Decision Tree, LGBM Classifier, Logistic Regression, Random Forest Classifier, and XGB Classifier. Each curve represents the respective model’s performance in predicting day 3 Postoperative Morbidity Survey (POMS) scores using the subset of patient used in the Time-Series experiment from the Cardiorespiratory Fitness (CFR 585) dataset.

Funding2

—http://dx.doi.org/10.13039/100010269Wellcome Trust
—http://dx.doi.org/10.13039/100010269Wellcome Trust

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsCardiac, Anesthesia and Surgical Outcomes · Hemodynamic Monitoring and Therapy · Hip and Femur Fractures

Full text

Introduction

As the number of operative procedures carried out on a population that is becoming more comorbid and frailer grows [1,2], so does the incidence of postoperative morbidity and mortality [3–8]. Accurate preoperative risk prediction supports patients in making informed decisions and guides clinical decision-making. Existing risk scores include the Portsmouth-Physiological and Operative Severity Score for the enumeration of Mortality and Morbidity (PPOSSUM) [9]. However, such scores are limited by their reliance on linear regression models that often omit or barely incorporate physiological data such as measures of cardiorespiratory fitness (CRF) that can predict postoperative outcomes [10].

Cardiopulmonary Exercise Testing (CPET) is the gold standard for preoperative assessment of CRF. Several headline parameters are commonly applied to preoperative outcome prediction such as ventilatory anaerobic threshold (AT) and peak oxygen consumption (V̇O_2_) (Fig 1A) [11–16]. Inclusion of CPET variables into risk prediction models has also been limited in scope. In addition, the complexity of time-series physiological data has prevented large scale analysis. Mechanisms through which CRF affects postoperative outcomes are under investigation [13]. There is a distinct need to better characterize the utility of CPET, both alone and alongside other tools used for multifactorial risk stratification.

A - Cardiopulmonary exercise testing (CPET).On the left, the different phases of CPET are detailed, including Rest, Ventilatory Anaerobic Threshold (AT), Peak, and Recovery, describing the exercise intensity levels and physiological responses. The central graph displays the progression of exercise intensity, marked by the green ramp, alongside traces of oxygen uptake (V̇O2) and carbon dioxide production (V̇CO2). The right section elucidates primary features derived from CPET such as V̇O2, V̇CO2, heart rate (HR), oxygen pulse (V̇O2/HR), ventilation (VE), and the respiratory exchange ratio (RER). The image portrays a patient undergoing CPET, equipped with the appropriate testing apparatus. The graph is presented with the consent of Quick O, Reed-Poysden C, from their work “Cardiopulmonary Exercise Test: Interpretation and Application in Perioperative Medicine” 2022, https://doi.org/10.28923/atotw.473. B - Bar chart illustrating the average number of minor and major components contributing to each Postoperative Morbidity Survey (POMS) score on day 3. The bars represent the mean count of individual minor (blue) and major (orange) morbidity factors for patients with POMS scores ranging from 0 to 6. C - Scatter plot with linear regression analysis illustrating the relationship between patient age and Postoperative Morbidity Survey (POMS) scores on day 3. Each datapoint represents an individual patient’s age against their corresponding POMS score. D - Scatter plot with linear regression analysing the relationship between V̇O2/Kg at the Anaerobic Threshold (AT) and Postoperative Morbidity Survey (POMS) scores on day 3. Each datapoint signifies the V̇O2/Kg AT value for an individual patient and their respective POMS score.

Machine Learning (ML) models utilizing demographic and clinical data have been applied extensively in perioperative medicine [17,18]. However, no ML models have yet incorporated CRF data or focused exclusively on preoperative information [18]. We aimed to identify the optimal ML algorithm that utilizes CRF data for predicting postoperative outcomes. One option we chose to explore was the Multi-objective Symbolic Regression (MOSR) approach [19,20], an algorithm based on genetic programming that generates a set of readable mathematical equations that can be used as predictive models [20].

We hypothesized that applying ML to perioperative data, particularly incorporating preoperative CPET-derived CRF, would accurately predict postoperative morbidity in a population of elderly patients with multiple comorbidities undergoing major surgery. Our goals were to assess the impact of CPET and preoperative electronic healthcare record (EHR) data in a broad perioperative elective surgical fragile patient dataset on predicting postoperative outcomes, and to compare the effectiveness of MOSR against current machine learning algorithms and clinical risk scores.

Methods

University College London Hospitals NHS Foundation Trust (UCLH) maintains a prospective research database of patients undergoing CPET before major surgery. All participants provided written consent for their CPET outcomes to be included in the database for future research, in compliance with the Declaration of Helsinki. Ethical approval was initially granted in 2012, and reaffirmed in 2019, with no specified time constraints (NRES Committee London – Southeast reference: 12/LO/0192, London - Westminster Research Ethics Committee reference: 19/LO/1371). The database was created to collect data for assessing and studying short- and long-term postoperative morbidity and mortality, encompassing a broad range of patients who underwent CPET prior to major surgery. The initial analysis was performed by two clinical exercise physiologists and was later confirmed by a consultant anaesthetist. The database was queried for patients enrolled between 2012 and 2022. Adherence to Caldicott principles ensured data confidentiality and proper collection methods.

Population and selection criteria

Eligible participants over 18 years routinely referred for CPET as part of their preoperative assessment and listed for elective major non-cardiac surgery at University College London Hospitals NHS Foundation Trust were included. Patients under 18 years or unable to provide consent were excluded. CPET was performed in accordance with International Peri-Operative Exercise Testing and Training Society (iPOETTS) guidance. Where iPOETTS medical criteria precluded testing, patients were excluded [21]. Of 2,145 patients assessed, 1,190 were included in the tabular database, and 585 with full CPET recordings for time-series analysis (CONSORT flow diagram; S1 Fig). Table 1 contains relevant demographic data.

Table 1: Population demographics, comorbidities, laboratory variables, medications and surgery types, cardiopulmonary exercise test (CPET) values and perioperative outcomes.

Database

The main dataset was divided into four components. A full feature list in provided in the Appendix:

(i) Clinical dataset (39 features) with demographics and medical history – age, biometrics, medical and drug history, laboratory values, and surgery details (including severity and technique).(ii) CRF dataset (46 features) - derived cardiorespiratory fitness values were extracted during different phases of the CPET (Fig 1A). Measurements were made of minute oxygen uptake (V̇O_2_), minute carbon dioxide production (V̇CO_2_), and end-tidal gas tensions (P_ET_O_2_ and P_ET_CO_2_) during a ramped exercise protocol performed on a cycle ergometer. Ventilatory equivalents for oxygen (the slope between minute ventilation and oxygen uptake: V_E_/ V̇O_2_) and carbon dioxide (the slope between minute ventilation and carbon dioxide production: V_E_/V̇CO_2_), and oxygen pulse (a correlate of stroke volume, the slope between oxygen uptake and heart rate (HR): V̇O_2_/HR) were derived. V̇O_2_ peak was defined as the highest average V̇O_2_ over the last 30 seconds of ramped exercise. The ventilatory anaerobic threshold (AT) was determined using the V-slope method, ventilatory equivalents, and end-tidal gas tensions [21]. Two clinical exercise physiologists independently interpreted the tests and then subsequently verified by a consultant anesthetist. V̇O_2_ Peak and Anaerobic Threshold (AT) values were adjusted for body weight (ml.min^-1^.kg^-1^). Electrocardiographic and expired gas data were collected at intervals of one sample per second and median filtered for each patient.(iii) CRF-TS (time-series) (18 features). The dataset of a subset of 585 patients was extracted from the CRF database. This dataset contained CPET time-series data for 15 features routinely recorded during the CPET exam (V̇O2, V̇CO_2_, etc.), with a resolution of one value per second.(iv) Outcomes and clinical scores - American Society of Anesthesiologists (ASA) score, Duke Activity Status Index (DASI) score, post-operative care destination, length of hospital stay, mortality and morbidity (as described below) were recorded onto the database.

Data were imported from Microsoft Excel read-only files on secured university storage. Routine laboratory results were extracted from the hospital electronic healthcare record system (EPIC, Verona, Wisconsin, USA). Dataset quality was verified by three physicians.

Major and minor POMS morbidity classification

To evaluate postoperative morbidity, we employed the Post-Operative Morbidity Survey (POMS) on Postoperative Days 3, 5, and 7. POMS is straightforward approach for detecting and quantifying postoperative complications. Designed to be applicable across all types of surgery, its focus is on identifying complications that could hinder a patient’s discharge from the hospital. This method has been previously validated within complex surgical populations, underscoring its reliability [22]. POMS was prospectively collected in person by a trained researcher in a controlled hospital environment. The POMS score was calculated as a binary, non-weighted result for any positive score across the nine POMS domains, as outlined in S1 Table. The POMS domains were also categorized into major and minor complications, as reported in literature [23]. Routine care elements such as the presence of nasogastric tubes after upper gastrointestinal surgery or urinary catheters post-cystectomy (also classified as minor POMS) were not classified as morbidities and these domains were excluded from our analysis [23].

Based on initial analysis, the patient population could be classified into two distinct groups (Fig 1B) and this was instrumental for subsequent model development. This classification emerged from the priori observation of the fact major and minor factors influencing the POMS score on day 3 post-surgery, leading to the identification of 557 patients with a day 3 POMS score between 0 and 1 (POMS 0-1), indicative of no or minor morbidities, and 633 patients with a score equal to or above 2 (POMS ≥2), signifying moderate or severe postoperative complications. The decision to prioritize model development on Postoperative Day 3 was influenced by the findings presented in S2 Fig. This figure, produced using linear regression analysis, demonstrated a high correlation between the count of POMS-positive domains on Day 3 and the scores on Days 5 and 7. This approach was further justified by previous reports that the trajectory of postoperative POMS is indicative of 5-year mortality rates [5].

Statistical analysis

Data analysis and statistical computations were conducted using Python (version 3.10.12) [24] and Pandas (version 1.4.2). Binary encoding was applied to categorical variables, assigning 1 for the occurrence of an event and 0 for its absence. The categorical attributes of sex, surgery type, and specialty were subjected to one-hot encoding due to their non-ordinal nature. Continuous variables underwent normalization to fit within a 0-1 scale. Assessment of data normality was conducted, with results presented either as median with interquartile range or as mean with standard deviation, depending on the distribution. Frequency distributions were evaluated for categorical data. To discern differences between groups, statistical tests such as Student’s t-test, Mann-Whitney U test, Chi-Squared test, and Fisher’s Exact Test were utilized to analyse the demographic. To assess the relationship between features and outcome, a linear Pearson correlation analysis was performed. A threshold of p < 0.05 was established for statistical significance.

MORS algorithm and classification models

ML classification models were developed based on the POMS classification to predict patients with POMS 0-1 or POMS ≥2 at day 3, utilizing only features available before surgery. Model performances are reported in Fig 2 and Table 2. The initial 80:20 random split was applied to the entire dataset of 1,190 patients, creating two primary subsets: a training set (952 samples) and a test set (238 samples), while maintaining outcome prevalence across groups. This 80/20 division follows standard machine learning practices to ensure a balance between training and evaluation data.

Table 2: Machine learning model performances.

A - Receiver operating characteristic (ROC) curve comparison for the Multi-objective Symbolic Regression (MOSR) model’s performance in predicting Postoperative Morbidity Survey (POMS) scores at day 3.The ROC curves of three models are presented: MOSR using Cardiorespiratory Fitness (CRF) dataset (blue), MOSR using Clinical dataset (orange), and MOSR utilizing a combination of both CRF and Clinical dataset (green). B - SHAP Beeswarm Plot of MOSR models using Cardiorespiratory Fitness (CRF) and Clinical database. Each row represents a feature used in the model. Each dot on a row corresponds to a datapoint, with its position on the x-axis indicating the SHAP value, or the contribution of that feature to the model’s prediction for that datapoint. The colour of the dots represents the value of that feature, with blue indicating low values and red indicating high values. V̇O2/Kg AT: Oxygen Consumption per Kilogram at Anaerobic Threshold, V̇O2/Kg VOP: Oxygen Consumption per Kilogram at Peak, VE/V̇CO2 AT: Ventilatory Equivalent for Carbon Dioxide at Anaerobic Threshold, BMI: Body Mass Index, V̇O2/Kg Rest = Oxygen Consumption per Kilogram at Rest, MET = Metabolic Equivalent of Task, V̇O2 AT = Oxygen Consumption at Anaerobic Threshold, HR AT = Heart Rate at Anaerobic Threshold, VT VOP = Tidal Volume at Peak, VE VOP = Ventilation at Peak, V̇O2/HR AT = Oxygen Consumption per Heart Rate at Anaerobic Threshold, V̇CO2 AT = Carbon Dioxide Production at Anaerobic Threshold, PetCO2 VOP = Partial End-tidal Carbon Dioxide at Peak, WR AT = Work Rate at Anaerobic Threshold, VE AT = Ventilation at Anaerobic Threshold, WR VOP = Work Rate at Peak, PetCO2 Rest = Partial End-tidal Carbon Dioxide at Rest, RR VOP = Respiratory Rate at Peak. C - Receiver Operating Characteristic (ROC) curves comparing the Multi-objective Symbolic Regression (MOSR), utilizing either CRT data alone or with Clinical data, against established clinical assessment scores in predicting Postoperative Morbidity Survey (POMS) outcomes on day 3. MOSR CRF dataset model (blue), Cardiopulmonary Exercise Testing (CPET) (red), American Society of Anaesthesiologists Score (ASA) (yellow), Physiological and Operative Severity Score for the enumeration of Mortality and morbidity (PPOSSUM) (purple), and Duke Activity Status Index (DASI) (green). D–F - Receiver Operating Characteristic (ROC) curves comparing the Multi-objective Symbolic Regression (MOSR) model, against DecisionTree Classifier, LGBM Classifier, Logistic Regression, Random Forest Classifier, and XGB Classifier. Panel D shows ROC curves for models using only Clinical dataset, Panel E displays ROC curves for models using only Cardiorespiratory Fitness dataset and Panel F presents ROC curves for models using both CRF and Clinical data.

To train the model and optimize its hyperparameters, we conducted a 10-fold cross-validation grid search on the training set. In each iteration, a 90:10 random split was performed, assigning 857 samples for training and 95 for validation. This approach, widely used in machine learning, allows us to systematically evaluate different parameter combinations by assessing model performance on the validation set, leading to the selection of the best hyperparameters.

After finalizing the model’s parameters and weights, we evaluated its predictive performance and robustness through 10 independent test runs. Each test run utilized a randomly selected 90% subset (214 samples) of the test set (238 samples), and we report the average accuracy obtained across these runs.

Data leakage was not present in our study, as the training and testing sets were strictly partitioned at the outset using an 80:20 random split. The test set (238 samples) remained completely independent and was not utilized during model training, hyperparameter optimization, or cross-validation. This strict separation ensures that model performance was evaluated exclusively on unseen data, maintaining the validity and reliability of the results.

MOSR is an algorithm that distinguishes itself because of its flexibility, using Genetic Programming to create mathematical formulae for learning tasks. It stands out as it can utilize combinations of mathematical operations, covering both simple and complex functions, without adhering to a set model [19,20]. Another major advantage of MOSR is its capability to automatically identify the most crucial features during training. For this reason, we included all available variables in our experiments. For each experiment, we evolved a population of 300 individual models for 500 generations.

The ML library for MOSR is open source and available at: https://github.com/davideferrari92/multiobjective_symbolic_regression.

The Python package PyCaret was utilized to train Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Light Gradient Boosting (LGBM), and Extreme Gradient Boosting (XGB) for comparison against MOSR [25]. All models underwent training in three distinct experiments: the first focused exclusively on the clinical dataset, the second on the CRF dataset, and the final experiment utilized both databases. For hyperparameter optimization of the benchmark algorithms, we implemented grid search using PyCaret’s default configuration. The hyperparameter search space was predefined, and the optimized parameters for each algorithm fell within this exploration range.

In accordance with TRIPOD guidelines, models underwent evaluation using established classification metrics, including accuracy, sensitivity, specificity, F1-Score, Positive Predictive Value (PPV – also defined precision), Negative Predictive Value (NPV), and Area Under the Curve (AUC) [26]. The results of the test set are reported in Table 2, the training test in S2 Table and the precision-recall curve for MORS in S3 Fig.

Feature analysis

Shapley Additive exPlanations (SHAP) was applied to analyze contributions of the features included in the models. This method measures the impact of each individual feature enabling additional interpretation of relative contributions to final model performance [27,28].

Clinical scores

ML model performance was benchmarked against existing clinical risk prediction scores, PPOSSUM, ASA and DASI. To predict binary outcomes of POMS 0-1 and POMS ≥2, thresholds were established for these scores. A threshold of risk of 30% was set for PPOSSUM, while the ASA score threshold was set at a score of 2, and DASI at a score of 34 [29–32]. Logistic regression was performed using a CPET-derived ventilatory anaerobic threshold of 11 ml/O_2_/kg/min (V̇O_2_/Kg AT), a value commonly used to predict postoperative outcomes [33,34]. PPOSSUM was computed retrospectively for comparison. The Surgical Outcome Risk Tool (SORT) was omitted due to impractical retrospective physician assessment [35].

Time-series (TS) features extraction and classification models

The CRF-TS database, which includes data from 585 patients, was utilized to explore how CPET time-series data contribute to predicting patients with POMS 0-1 or POMS ≥2 on Postoperative Day 3. We experimentally rescaled the time-series to a lower dimension to reduce the complexity of the ML task without losing the informative shape of the curve; as shown in Fig 3A and 3B, the shape does not change when down sampled to 1000, 500 and 100. The MOSR model, as previously mentioned, was applied to this database, and its performance compared with that of LR, DT, RF, LGBM and XGB models. The robustness of these models was evaluated in a similar way to the previous experiments. Finally, we evaluated the models trained on the CFR-TS dataset against the same subset of patients using data from the CFR dataset (CRF 585 patient subset), employing the same algorithms for comparison (Fig 3C and 3D).

A – Comparison of 585 time-series showing oxygen uptake per kilogram per minute (V̇O2/Kg/min) in cardiopulmonary exercise testing (CPET).The x-axis shows the moment of the exam, while the y-axis displays the V̇O2/Kg/min values measured in ml/kg/min. Patients are categorized and color-coded based on their Postoperative Morbidity Survey (POMS) scores: those with POMS scores equal or less than 1 are represented in light blue, POMS scores exactly at or more than 2 in orange, and POMS scores greater than 3 in red. B – Illustration of the effect of resampling a data track on its resolution and detail. The top left plot shows the original track with high-resolution data points, depicted in blue. The other three plots demonstrate the track after resampling to different numbers of points, with each plot color-coded to represent a specific resampling: 1000 points in green (top right), 500 points in orange (bottom left), and 100 points in red (bottom right). C - Receiver Operating Characteristic (ROC) curves comparing the Multi-objective Symbolic Regression (MOSR) model against other machine learning classifiers: Decision Tree, LGBM Classifier, Logistic Regression, Random Forest Classifier, and XGB Classifier. Each curve represents the respective model’s performance in predicting day 3 Postoperative Morbidity Survey (POMS) scores using Cardiorespiratory Fitness Time Series (CRF-TS) dataset. D - Receiver Operating Characteristic (ROC) curves comparing the Multi-objective Symbolic Regression (MOSR) model against other machine learning classifiers: Decision Tree, LGBM Classifier, Logistic Regression, Random Forest Classifier, and XGB Classifier. Each curve represents the respective model’s performance in predicting day 3 Postoperative Morbidity Survey (POMS) scores using the subset of patient used in the Time-Series experiment from the Cardiorespiratory Fitness (CFR 585) dataset.

Results

Population description

1190 patients were included, with a median age of 71 years (61–79). Sixty-nine percent were male. The types of surgery reflect those routinely performed within our university hospital (Table 1). Intensive Care and Surgical High Dependency Units were the predominant initial postoperative destinations (64%). CPET data and key outcomes are shown in Table 1. Thirty-day mortality was 1.9% and one-year mortality 5.5%. A modest correlation existed between advancing age and an increased number of POMS-positive domains at day 3 (correlation coefficient: 0.27, p < 0.05), as indicated by the POMS score (Fig 1C). A similar, yet weaker, correlation existed for female sex (correlation coefficient: 0.091, p < 0.05). By contrast, V̇O_2_/kg at ventilatory anaerobic threshold (AT) demonstrated a stronger inverse correlation with the number of POMS-positive domains on Postoperative Day 3 (correlation coefficient: 0.69, p < 0.05) (Fig 1D).

MOSR models – Dataset comparison and SHAP analysis

The MOSR model incorporating only the clinical dataset, as presented in Fig 2A and Table 2, recorded the lowest metrics (AUC 0.733, F1-Score 0.720). Conversely, the MOSR model that exclusively used CRF data outperformed the others (AUC 0.946, F1-Score 0.890). The combined MOSR model, which included both CRF and clinical data, exhibited a performance slightly inferior to the model utilizing only CRF data (AUC 0.930, F1-Score 0.879). SHAP feature analysis of the MOSR model based on both CRF and clinical datasets highlights individual feature contributions to model performance and ranking, specifically lower V̇O_2_peak and V̇O_2_ at AT, elevated body mass index (BMI) and, finally, severity of the surgical procedure as being critical for model performance (Fig 2B).

MOSR models – Comparison with clinical scores

Comparison of MOSR models against the clinical risk prediction scores (Fig 2C and Table 2) showed the MOSR CRF model outperformed all clinical scores (AUC 0.946, F1-score 0.890). Notably, the ASA (AUC 0.630, F1-score 0.553), DASI (AUC 0.602, F1-score 0.720) and PPOSSUM (AUC 0.630, F1-score 0.720) scores all had lower discriminative power compared to the MOSR models utilizing either CRF or clinical data. The model using an V̇O_2_/Kg AT of 11 ml/kg/min as a discriminator achieved an AUC of 0.751 and F1-score of 0.740. Finally, the MOSR models outperformed other clinical scores in terms of PPV and NPV.

MOSR model – Comparison with state-of-art ML models

Superior performance was exhibited by MOSR models compared to current state-of-the-art machine learning models, as indicated in Fig 2D–2F and Table 2. The CRF MOSR model outperformed the others (AUC 0.946, F1-score 0.890). The accuracy of the MOSR CRF model was 0.859 on the test set and 0.888 on the training set.

Consistent outcomes were observed when aggregating all CRF and clinical datasets, with the MOSR model surpassing others in performance (AUC 0.930, F1-score 0.879). The remaining models displayed marginally lower metrics. In the analysis using solely the clinical dataset, all models exhibited reduced performance relative to other tests. Overall, the MOSR models demonstrated higher NPV and PPV values compared to the other models.

The MOSR model showed a small difference in accuracy between the test (0.859) and training sets (0.886), whereas the gap was significantly larger for other models, The MOSR model led with the best performances (AUC 0.733, F1-score 0.720).

Time-series MOSR model – Analysis and comparison

Using CRF from Time-Series (CRF-TS, Fig 3C), the MOSR model was the sole model to achieve performance metrics with an AUC of 0.85 and an F1-Score of 0.84. For the MOSR CRF-TS model, the accuracy was 0.781 on the test set and 0.783 on the training set, showing minimal difference. When the same subset of patients from the time-series were analyzed using the CRF dataset (CFR 585 subset), all models exhibited lower performances compared to the previously mentioned results (Fig 3D). The MOSR model achieved the highest performance (AUC 0.747, F1-Score 0.783). The main features utilized in the MOSR CRF-TS were V̇O_2_/HR, V̇O2/Kg and V̇CO_2_/Kg.

Discussion

Principal findings

We used a new machine learning method (called Multiobjective Symbolic Regression-MOSR), alongside other established Machine Learning models, to better predict which elderly patients might develop complications after a variety of major surgeries. Our approach builds on standard clinical information (like age, other illnesses, and lab results) by adding real data about a patient’s fitness. We based our models on prospectively collected databases of comparable size to those used to develop the widely used Surgical Outcomes Risk Tool (SORT) [22,23]. Our findings highlight the critical role of CPET-derived cardiorespiratory fitness data in enhancing model performance over demographic and clinical features, underscoring the utility of formal evaluation of a patient’s fitness prior to undergoing major surgery. We employed MOSR, an advanced and novel machine learning model, to develop predictive classifications that surpass the performance of state-of-the-art machine learning algorithms. Our models performed better than the clinical risk prediction scores PPOSSUM, DASI and ASA, and could eventually be used to inform risk prediction for patients undergoing major surgery. We also described usage of SHAP analysis to explore the impact of individual features within ML models and to identify candidate mechanisms for future research. Furthermore, we detailed the innovative application of data obtained from a CPET time-series in training the MOSR; this surpasses the performance of manually extracted CPET data.

The impact of physiology on predicting postoperative outcomes

Our analysis establishes a connection between V̇O_2_ peak, V̇O_2_ at AT, and V_E_/V̇CO_2_ and postoperative outcomes, supporting findings from previous research [33,34] while also underscoring the significance of age in surgical outcomes. However, the link between Cardiorespiratory Fitness features (V̇O_2_ peak_,_ AT and outcomes) was found to have greater influence than age on model performance. Linear regression alone proved insufficient for identifying patients at high risk for accumulating multiple POMS defined postoperative morbidities.

Our risk models, not specific to any type of surgery but considering operative severity, underscore that in an elderly population of patients with multiple comorbidities, the role of patient physiology and fitness rather than named comorbidity in the development of perioperative complications. The best model, utilizing the MOSR algorithm and incorporating CRF data, demonstrated the best performance (Positive Predictive Value, or PPV increased by 49%). PPV tells us how many patients identified by models as “at risk” ended up having complications. For example, if using a hypothetical alternative method, if 100 patients were flagged as high risk, and only 20 truly experienced complications, a 49% improvement in PPV would mean that, with our new model, if you flagged 100 patients, roughly 30 of them (an extra 10 out of 100) would be at risk. In short, our model is much better at correctly identifying the patients who will experience problems, reducing the number of false positives. In addition to improvements in PPV, our model also showed a 20% improvement in NPV (negative predictive value), meaning it’s better at identifying when a patient is not at risk (avoiding false negatives). Furthermore, overall accuracy measures (like AUC and F1-score) improved by 27% and 22%, respectively. These combined improvements help ensure that more patients are correctly classified, which could lead to better-tailored preoperative care through improved shared decision-making discussions and better risk data to support clinical decision making. In practical terms, this suggests that if incorporated into clinical risk prediction, the MOSR algorithm with cardiorespiratory fitness data could potentially improve diagnostic accuracy, leading to better identification of relevant cases and better-informed, personalized clinical decisions.

Cardiorespiratory fitness assessed through CPET is individual to the patient and likely reflects individual resilience. The results of this study indicate that the physiological response to acute stress (in this case exercise) is a better predictor of postoperative outcomes than traditional clinical diagnoses alone. This supports the incorporation of formal assessment of cardiorespiratory fitness in preoperative assessment when considering personalized risk assessment.

These findings are supported by the SHAP features analysis of the model utilizing the entire database, which reveals a significant impact of CPET parameters on the model’s final performance. Notably, the most influential features were, in order, AT and V̇O_2_ peak, ventilatory equivalents for CO_2_ at AT and then body mass index, as described in the literature [36]. Although the severity of the operation was important, it played a lesser role. Therefore, the relationships and correlations between exercise response, patient body mass and composition, and the severity of surgery may lay the groundwork for future research.

MOSR and other machine learning models surpassed clinical risk prediction scores PPOSSUM, ASA and DASI scores by roughly 30%, and the AT-based regression model by approximately 20%. The modest effectiveness of PPOSSUM and DASI observed in this study aligns with findings from other recent studies [37–39]. Additionally, the ultimate MOSR fitness model showed promising results when compared to the Surgical Outcome Risk Tool (SORT) for morbidity (AUROC 0.72, 95% CI: 0.67-0.77), which was developed using a population that overlaps with the UCLH CPET database [23].

CPET time-series data for CRF fitness

Most CPET studies have traditionally focused on a narrow set of individual non-dynamic measures, including V̇O_2_ peak, AT, and VE:V̇CO_2_ at AT, due to their predictive power and physiological relevance. However, this approach overlooks most data obtainable from CPET. Our time-series based models emphasized the significance of dynamic exercise responses. Training the models on to the same subset of patients, the model using time-series data achieved, on average, a 12% increase in AUC and a 7% improvement in F1-score compared to models developed with hand-extracted CPET data on the population. These data could be used to enhance our comprehension of how various elements of the physiological response to exercise contribute to resilience against surgical trauma. Interestingly, the primary features utilized by the MOSR model are derived from the V̇O_2_/HR, V̇O2/Kg, and V̇CO_2_/Kg time-series. This highlights the significance of cardiac output and heart rate adaptation, lending physiological credibility to the model’s responses and supporting the concept of cross-stressor adaptation [15,40,41].

MOSR performances and model interpretation

MOSR consistently outperformed all other algorithms and, analyzing the features included in the final models, we can value its intrinsic feature selection behavior. Moreover, in contrast to all other ML approaches that rely on Binary Cross entropy for classification training, MOSR was trained by also optimizing for the Partial Akaike Information Criterium [42] that penalizes high complexity models, effectively maximizing the ratio between performance and model complexity. In our experiments, MOSR demonstrated excellent predictive performance and generalization capability; most importantly, it did not overfit the training data like other algorithms, making it the most favorable choice. Particularly in models with CRF derived from time-series data, MOSR produced the best AUC and F1-scores on the test set, proving to be the most reliable algorithm out of our selection. In our study the potential of MOSR for correlating patient clinical and physiological features with perioperative outcomes is highlighted, showcasing its capability to autonomously select important features and create models that are both simple and accurate, demonstrating strong generalizability.

Limitations and strengths

This study represents the first investigation using a perioperative CRF fitness dataset with ML to predict postoperative outcomes in an elderly population. The study’s primary limitation stems from being conducted at a single center and its retrospective nature. However, the 28-day mortality after elective non-cardiac major surgery in our institution (2%) is comparable with the wider literature (1.1 to 3.6%), implying that our population is not significantly different from that in other comparable centers [5,43]. Given encouraging initial results, there is an intention to carry out prospective external validation. The strength of our work lies in usage of POMS-defined morbidity and the usage of MOSR to moderate traditional limitations of ML techniques such as overfitting to single datasets and its incumbent capacity for easy external translation to external datasets.

Conclusion

We have demonstrated that incorporating CPET-derived CRF fitness data into a machine learning model, specifically MOSR, can significantly improve preoperative risk prediction in elderly surgical patients compared to traditional clinical risk scores. These data offer deeper insights compared to the clinical data conventionally used for risk prediction. This approach could enable clinicians to better identify patients at high risk of postoperative complications, allowing for targeted interventions and potentially reducing morbidity. Additionally, leveraging data directly from the CPET time-series appears to enhance model performance, underscoring the value of examining elder patient physiology and adaptability to acute stress. We also introduced a novel ML model, MOSR, to the clinical field, which outperformed existing state-of-the-art ML models. These findings lay the groundwork for developing clinical decision support tools that could optimize postoperative care pathways for high-risk elderly surgical patients and advance precision medicine, potentially improving patient care and reducing hospital budgets [44–46].

Supporting information

S1 FigCONSORT flow diagram illustrating the selection process of patients for the study, demonstrating final sample sizes for tabular and time-series analysis.(TIF)

S2 FigLinear regression between number of POMS positivity at Day 3, Day 5, and Day 7.(TIF)

S3 FigPR curves for the Multi-objective Symbolic Regression (MOSR) models: A - Clinical Dataset, B - CFR Dataset, C - Combined CFR and Clinical Dataset.(TIF)

S1 TablePostoperative Morbidity Score (POMS), adapted from [22].(DOCX)

S2 TableMachine learning model performances TRAINING set.Comparison of Multi-Objective Symbolic Regression (MOSR), versus Decision Tree Classifier, Light Gradient Bosting Machine (LGBM) Classifier, Logistic Regression, Random Forest Classifier and Extreme Gradient Boosting (XGB) Classifier. Values are expressed in mean and 95% CI from 10 execution on training set. CRF: Cardiorespiratory Fitness, CI: 95% Confidence interval, PPV: Positive Predictive Value, NPV: Negative Predictive Value, ASA: American Society of Anaesthesiologist score, DASI: Duke Activity Status Index, CPET: Cardiopulmonary Exercise Testing, PPOSSUM: Portsmouth Physiological and Operative Severity Score for the enumeration of Mortality and morbidity.(DOCX)

Bibliography46

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Rose J, Weiser TG, Hider P, Wilson L, Gruen RL, Bickler SW. Estimated need for surgery worldwide based on prevalence of diseases: a modelling strategy for the WHO Global Health Estimate. Lancet Glob Health. 2015;3(Suppl 2):S 13–20. doi: 10.1016/S 2214-109X(15)70087-2 25926315 PMC 5746187 · doi ↗ · pubmed ↗
2Lee SW, Lee H-C, Suh J, Lee KH, Lee H, Seo S, et al. Multi-center validation of machine learning model for preoperative prediction of postoperative mortality. NPJ Digit Med. 2022;5(1):91. doi: 10.1038/s 41746-022-00625-6 35821515 PMC 9276734 · doi ↗ · pubmed ↗
3Dencker EE, Bonde A, Troelsen A, Varadarajan KM, Sillesen M. Postoperative complications: an observational study of trends in the United States from 2012 to 2018. BMC Surg. 2021;21(1):393. doi: 10.1186/s 12893-021-01392-z 34740362 PMC 8571843 · doi ↗ · pubmed ↗
4Derogar M, Orsini N, Sadr-Azodi O, Lagergren P. Influence of major postoperative complications on health-related quality of life among long-term survivors of esophageal cancer surgery. J Clin Oncol. 2012;30(14):1615–9. doi: 10.1200/JCO.2011.40.3568 22473157 · doi ↗ · pubmed ↗
5Moonesinghe SR, Harris S, Mythen MG, Rowan KM, Haddad FS, Emberton M, et al. Survival after postoperative morbidity: a longitudinal observational cohort study. Br J Anaesth. 2014;113(6):977–84. doi: 10.1093/bja/aeu 224 25012586 PMC 4235571 · doi ↗ · pubmed ↗
6Stefani LC, Gamermann PW, Backof A, Guollo F, Borges RMJ, Martin A, et al. Perioperative mortality related to anesthesia within 48 h and up to 30 days following surgery: A retrospective cohort study of 11,562 anesthetic procedures. J Clin Anesth. 2018;49:79–86. doi: 10.1016/j.jclinane.2018.06.025 29909205 · doi ↗ · pubmed ↗
7Tjeertes EKM, Ultee KHJ, Stolker RJ, Verhagen HJM, Bastos Gonçalves FM, Hoofwijk AGM, et al. Perioperative Complications are Associated With Adverse Long-Term Prognosis and Affect the Cause of Death After General Surgery. World J Surg. 2016;40(11):2581–90. doi: 10.1007/s 00268-016-3600-4 27302465 PMC 5073115 · doi ↗ · pubmed ↗
8Johnson ML, Gordon HS, Petersen NJ, Wray NP, Shroyer AL, Grover FL, et al. Effect of definition of mortality on hospital profiles. Med Care. 2002;40(1):7–16. doi: 10.1097/00005650-200201000-00003 11748422 · doi ↗ · pubmed ↗