Interpretable Machine Learning with SHAP Identifies Key Biomarkers in a Multi-Factorial Spectrum of Age-Related Neurological and Metabolic Conditions
Daniil V. Artamonov, Polina I. Popova, Ekaterina A. Korf, Natalia G. Voitenko, Alisa A. Chernysheva, Pavel V. Avdonin, Richard O. Jenkins, Nikolay V. Goncharov

TL;DR
This study uses interpretable machine learning to identify key blood biomarkers for diagnosing age-related neurological and metabolic conditions.
Contribution
The study introduces a variance-aware workflow with SHAP to identify stable biomarkers in multi-factorial age-related conditions.
Findings
Gradient boosting with SHAP identified iron, transferrin, and glucose as key biomarkers with synergistic interactions.
Statistical significance and SHAP values showed moderate correlation in feature importance across clinical groups.
Unsupervised clustering poorly aligned with clinical labels, highlighting the need for interpretable ML in diagnostics.
Abstract
Vascular and metabolic disorders in the elderly—including acute ischemic stroke (AIS), chronic cerebral circulation insufficiency (CCCI), type 2 diabetes mellitus (DM), and subcortical ischemic vascular dementia (SIVD)—pose a major diagnostic challenge due to their reliance on multi-parameter blood chemistry. In this study, 49 biochemical features were analyzed within a cohort of 120 patients. The application of variance-aware statistical testing revealed that several features (e.g., Fe, Transf, RDW%, LDL) exhibited statistically significant heterogeneity of variance (p < 0.05), which is known to distort standard ANOVA inference. While standard machine-learning (ML) classifiers demonstrated variable performance across clinical groups, a gradient boosting model with restricted tree depth (max depth = 3) achieved high discriminative accuracy, yielding F1-scores between 0.87 and 0.96…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16- —Russian Science Foundation
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMachine Learning in Healthcare · Acute Ischemic Stroke Management · Artificial Intelligence in Healthcare
1. Introduction
Diagnostic error is commonly defined as a missed, delayed, or incorrect diagnosis and is considered one of the most serious threats to patient safety [1]. A large-scale study of patients transferred to intensive care units or who died in hospitals found that diagnostic errors were present in nearly one-quarter of cases. Approximately every fifteenth death was associated with a diagnostic error. These errors were most often linked to delays in diagnosis and incorrect selection of diagnostic tests, leading to worse outcomes and high mortality rates [2]. In 2010–2011, an estimated 40,000 adult patients died annually in the United States due to misdiagnosis in intensive care units and emergency departments, while 10 years later, this number had risen to 250,000 [3,4]. The leading causes of death are vascular disease and infections. The implementation of artificial intelligence (AI) systems in clinical diagnosis represents a promising direction for developing reliable support tools that can improve the accuracy of diagnostic decisions, optimize examination time, and reduce the risk of medical errors [5].
However, beyond the emergency departments, vascular and metabolic disorders represent a substantial clinical and public health burden in the aging population, with acute ischemic stroke (AIS), chronic cerebral circulation insufficiency (CCCI), prediabetes, type 2 diabetes (DM), and subcortical ischemic vascular dementia (SIVD) being among the most prevalent and debilitating conditions affecting elderly individuals [6]. AIS, a major cause of mortality and morbidity in the elderly, has become increasingly prevalent in recent years, with ischemic stroke accounting for approximately 70–80% of all stroke cases and demonstrating age-dependent increases in incidence, particularly among individuals aged 85 years and older [7]. Chronic cerebral vascular insufficiency and cerebral small vessel disease contribute substantially to cognitive decline and neurological dysfunction through progressive deterioration of small penetrating arteries, leading to white matter hyperintensities, lacunar infarctions, and impaired cerebral autoregulation [8]. Prediabetes and DM affect a large proportion of elderly individuals, with up to 70% of patients with prediabetes eventually progressing to diabetes [9]. In elderly patients with DM, macrovascular complications including coronary heart disease (42.4%), stroke (11.4%), and chronic obliterating disease of the lower limb arteries (10.1%) are highly prevalent, alongside microvascular complications such as retinopathy (64.9%) and polyneuropathy (90.7%) [10]. Subcortical ischemic vascular dementia, often referred to as Binswanger’s disease in its severe form, represents the most common subtype of vascular dementia and is characterized by confluent white matter hyperintensities, multiple lacunar infarcts, and progressive cognitive decline with prominent executive dysfunction, psychomotor slowing, and gait disturbances [11]. The clinical significance of these interconnected vascular and metabolic disorders extends beyond individual morbidity and mortality, as cerebral small vessel disease has been identified as a critical contributor to approximately 50% of dementia cases globally and serves as both an initiator and amplifier of neurodegenerative pathology [12,13]. Biomarkers serve as a fundamental basis for optimizing primary screening by providing enhanced criteria for early detection and risk stratification. They contribute to the development of personalized approaches to diagnosis and prevention in oncology, neurodegenerative, and metabolic disorders [14].
Classical analysis of variance (ANOVA) remains a widely used statistical method for comparing mean outcomes across multiple treatment or diagnostic groups in biomedical research. However, a fundamental assumption of ANOVA—homogeneity of variance (homoscedasticity)—is frequently violated in clinical and biochemical datasets, leading to potentially spurious results and compromised statistical inference. When the assumption of equal variances is violated, the standard ANOVA F-test becomes unreliable, with Type I error rates deviating substantially from the nominal significance level [15]. Heteroscedasticity in biomedical data is particularly problematic because it distorts estimated standard errors, rendering confidence intervals inaccurate and hypothesis tests unreliable, which can lead to incorrect conclusions about treatment efficacy or diagnostic group differences [16]. The presence of heteroscedasticity is common in clinical trials and multi-site studies, where inherent biological variability, differences in measurement precision across subgroups, and varying sample sizes combine to produce unequal variances that systematically reduce statistical power to detect true treatment effects [17].
Importantly, Welch’s ANOVA maintains comparable or even superior statistical power to classical ANOVA when variances are equal, meaning that it can be used routinely without concern for power loss, effectively eliminating the need to test for homogeneity of variance as a prerequisite for analysis [18]. In clinical trials and biomedical research settings, Welch’s ANOVA has been successfully applied to analyze biochemical markers and clinical endpoints where unequal variances are anticipated; for instance, in studies examining biomarker changes in knee synovial fluid from patients with anterior cruciate ligament injuries, Welch’s ANOVA enabled valid comparison of cytokine concentrations across injury groups despite significant heteroscedasticity [19]. Comparative studies have demonstrated that Welch’s ANOVA consistently provides more accurate Type I error control and better power performance than both the traditional ANOVA F-test and several other alternatives under conditions of heteroscedasticity, establishing it as a preferred method for analyzing clinical and experimental data with unequal variances [20].
Machine learning (ML) algorithms have demonstrated substantial efficacy in classification and diagnosis tasks based on biochemical and clinical data, with various supervised learning methods including linear discriminant analysis (LDA), random forest (RF), support vector machines (SVM), logistic regression (LR), k-nearest neighbors (kNN), and decision trees (DT) showing strong performance across diverse clinical applications [21]. Random forest, an ensemble learning method that constructs multiple decision trees and aggregates their predictions, has emerged as one of the most widely applied ML techniques in medical research with particular success in handling high-dimensional datasets, such as electronic health records, and demonstrating robustness to missing data and resistance to overfitting [22]. Investigations into oncological applications have shown that SVM and DT approaches can yield accuracies exceeding 97% when differentiating between benign and malignant lesions, as seen in skin cancer detection studies, underscoring the capacity of these algorithms to handle high-dimensional data while maintaining exceptional sensitivity and specificity [23]. Logistic Regression continues to play a vital role in clinical prediction models by providing a robust probabilistic framework that can integrate multiple covariates to predict disease risk, offering interpretable coefficients that directly translate into clinical risk estimates [24]. Support vector machines, employed in 32% of contemporary studies, excel in handling high-dimensional data and nonlinear relationships between features, demonstrating particular utility in genomic data analysis, medical imaging classification, and precision medicine applications where complex patterns must be identified from multi-modal datasets [25]. K-nearest neighbors algorithms have been successfully applied to medical diagnosis tasks, particularly in scenarios where local patterns in feature space are informative for classification, although their performance can be sensitive to the choice of distance metric and the number of neighbors [26]. Furthermore, the flexibility of these ML algorithms to adapt to high-dimensional and heterogeneous data environments underscores their pivotal role in modern clinical diagnostics, where they not only enhance predictive performance but also contribute to a deeper understanding of underlying disease mechanisms [27].
Shapley Additive Explanations (SHAP), a framework grounded in cooperative game theory for explaining machine learning predictions, has become a mainstream approach in data science for interpreting complex models and enhancing the clinical utility of ML-based diagnostic and prognostic systems. SHAP values assign each feature an importance value for a particular prediction by fairly distributing the ‘payout’ among the features, thus providing explainability locally for individual predictions as well as globally across datasets [28]. Recent studies integrating convolutional neural networks with attention mechanisms and SHAP for personalized health monitoring have achieved remarkable accuracies of 97.86%, with SHAP providing both population-level feature importance rankings and patient-specific explanations through force plots and dependence visualizations, which enable clinicians to understand why specific predictions were made for individual patients [29]. The XGBoost model combined with SHAP analysis accurately predicted lymph node metastasis in gastric cancer patients. SHAP values identified inflammatory indices and peripheral lymphocyte subpopulations as top contributors, with SHAP values ranging from 0.15 to 0.32 on average. This approach enhanced model interpretability and achieved high predictive accuracy (AUC > 0.85) [30]. In acute heart failure prediction, XGBoost models combined with SHAP analysis have demonstrated superior performance with AUC values of 0.817 for in-hospital mortality and 0.766 for worsening heart failure, while SHAP clustering has enabled risk stratification by forming interpretable patient clusters with similar mortality and complication rates [31].
This study focuses on comparing one-way ANOVA and Welch’s ANOVA for selecting biochemical features across patient groups, followed by the evaluation of ML classifiers (LDA, LR, kNN, DT, RF, SVM) and interpretation using SHAP. The objective is to develop ML models for a personalized diagnosis of cardiovascular diseases, providing enhanced accuracy and interpretability of results.
2. Results
2.1. Data Analysis
Statistical analysis of the dataset provides an essential foundation for understanding the distribution and variability of biochemical features across patient groups. Initial screening was performed with classical statistical methods to identify features that differ between groups and to inform subsequent feature selection for machine learning. In particular, assessment of variance homogeneity and between-group differences is important because these characteristics influence both the interpretability and predictive performance of supervised models.
To test variance homogeneity, Levene’s test was applied to each biochemical parameter. Figure 1 shows the Levene W statistics and corresponding p-values (right axis, log scale) for all parameters. Several features exhibited statistically significant heterogeneity of variance (p < 0.05, e.g., Fe, Transf, RDW%, LDL, LDH), indicating violation of the equal-variance assumption required by classical one-way ANOVA.
Due to the observed heteroscedasticity, the results of classical one-way ANOVA were compared with Welch’s ANOVA, which relaxes the equal-variance assumption by adjusting the F-statistic and degrees of freedom. Figure 2 presents p-values from both methods for all blood parameters (p-values plotted on a log scale and sorted by Welch’s ANOVA). The comparison highlights marked discrepancies in statistical significance for a subset of parameters.
Notably, the divergence between methods is pronounced for several parameters (e.g., Fe, Transferrin (Transf), RDW%, Glucose, Albumin, RDWa, von Willebrand Factor (vWF), LDL). For these features, the p-values produced by Welch’s ANOVA differ substantially from those of classical ANOVA, reflecting the influence of unequal variances across groups. Box plots of the top 5 features ranked by Welch’s ANOVA and by one-way ANOVA are shown in Figure 3 and Figure 4, respectively. For the boxplots, feature values were standardized (z-score) prior to plotting to facilitate visual comparison of group distributions.
This phenomenon—substantial differences in ranked lists of significant features between the two ANOVA variants—indicates that heteroskedasticity materially affects significance calls. Selecting an appropriate test (e.g., Welch’s ANOVA when variances are unequal) therefore improves the reliability of feature selection, which in turn supports construction of more robust and interpretable machine-learning models.
Multicollinearity Check: To ensure that features included in the models are not highly correlated with one another, Pearson correlation was used to examine the relationships between all features. A heatmap of the correlation matrix was generated to visualize multicollinearity, with highly correlated features (correlation coefficient > 0.8) flagged for potential exclusion from subsequent analysis to reduce collinearity and improve model performance. The correlation heatmap is shown in Figure 5.
2.2. Comparison of ANOVA Variants Under Heteroscedasticity
To quantitatively validate the impact of heteroscedasticity on classical ANOVA inference, five statistical tests were compared: classical ANOVA, Welch’s ANOVA, Levene’s test, Brown-Forsythe (Levene on medians), and Heteroskedasticity-Consistent Covariance Matrix Estimator (HC3) (Table S1). The comparative testing results reveal the fundamental limitations of ANOVA under violated homoscedasticity assumptions. As shown in Table S1, which presents the number of significant parameters (at α = 0.05 significance level), classical ANOVA identified 25 significant features, clearly illustrating the tendency toward Type I error inflation in the presence of unequal variances across patient groups. This inflation occurs because the standard ANOVA F-test is highly sensitive to deviations from variance homogeneity, leading to distorted statistical inference. In contrast, Welch’s ANOVA identified 26 significant parameters, demonstrating a more conservative and robust approach that maintains statistical power without compromising Type I error control even under heteroscedasticity conditions. Comparison with alternative robust methods—including Brown-Forsythe (15 significant), HC3 (27 significant), and Kruskal-Wallis (27 significant)—confirms consensus between Welch’s ANOVA and Kruskal-Wallis (Spearman ρ = 0.92), providing strong methodological justification for selecting Welch’s ANOVA as the primary feature selection method under the heteroscedastic conditions characteristic of biochemical biomarker datasets.
2.3. Machine Learning
A set of classical supervised classifiers was evaluated to quantify discriminative performance and the effect of different feature-selection strategies. The following models were considered: LDA, LR, kNN, DT, RF and SVM. These methods cover linear and non-linear paradigms and provide complementary perspectives on class separability and feature relevance.
Three feature sets were used in the experiments: (1) the full preprocessed feature set including all biochemical variables, (2) a subset of features selected by Welch’s ANOVA, and (3) a subset selected by classical one-way ANOVA. For both ANOVA-based selections, only features with p-values below 0.05 were retained. The one-way ANOVA subset included the following features: Transf, ALB, Glu, Chol, Ca, BChE, LDL, LDH, Fe, RBC, RDW%, GRA%, NEFA, LYM%, HGB, GRAN, HDL, HCT, ALP, vWF, CK-NAC, Ur.Acid, TP, a1-AGP, MCHC, and MCV. The Welch ANOVA subset consisted of: Fe, Transf, RDW%, LDL, LDH, vWF, Chol, BChE, Glu, ALB, RBC, MCHC, NEFA, GRAN, HDL, HGB, Ca, HCT, Ur.Acid, PDW, RDWa, GRA%, ALP, CK-NAC, LYM%, TP, WBC, and a1-AGP. All numerical predictors were standardized (z-score), and all preprocessing steps were performed within cross-validation folds to prevent information leakage.
Model selection and hyperparameter optimization were performed using stratified 3-fold cross-validation with an inner grid search for each classifier. Performance was summarized by per-class F1-score (reported in tables) together with macro-average F1, Precision, and Recall metrics. Detailed per-class results and aggregated metrics are provided in Table 1.
Figure 6 shows a radar (spider) plot of the mean per-class F_1_-score for six classifiers (LDA, kNN, LR, DT, RF, SVM) under three feature-selection strategies. Each axis represents a classifier, and the closed polygons correspond to the F_1_-score averaged across all classes for each feature set: the full set of features, features selected by one-way ANOVA, and features selected by Welch’s ANOVA. The plot provides a concise visual comparison of classifier performance across feature-selection strategies, highlighting relative differences rather than absolute F_1_ values.
2.4. Personalized and Interpretable Models
For the SHAP analysis, a Gradient Boosting model with a tree depth of 3 was used to prevent overfitting on the full dataset while maintaining interpretability. This shallow tree structure ensures that the computed SHAP values are stable and reliable. In fact, this choice is heuristic, so we obtained the learning curves for the gradient boosting classifier (Figure S1) and conducted a sensitivity analysis of the gradient boosting model (Figure S2). Tree SHAP allows for exact and computationally efficient calculation of Shapley values, capturing not only the individual contributions of each biochemical feature to the model’s predictions, but also interactions between features. As a result, the analysis provides a nuanced understanding of how combinations of variables influence patient classification and helps identify the most influential features beyond simple correlations. The model’s performance is presented in the confusion matrix (Figure 7) and summarized in Table 2.
Subsequently, SHAP values were computed for all features (blood test parameters) across all classes to evaluate their individual contributions to the model’s predictions. The results are presented for each class using paired plots: one showing feature interactions and the other summarizing individual feature impacts. Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 display these visualizations for classes 1 through 5, highlighting the most influential features and the strength and direction of their effects on the predicted probabilities. Features are ranked by global importance. Each dot represents a patient; red indicates a high biomarker value, while blue indicates a low value. Dots to the right of the midline increase the diagnostic probability for the respective class, whereas dots to the left decrease it.
2.5. Mutual Information for Feature Synergy (MI)
To further investigate feature relationships and interactions, two complementary visualizations were constructed. First, a feature interaction network was created where each feature was represented as a node. The top 50% of features were selected based on mutual information with the target variable. Edges were added between feature pairs exhibiting absolute Pearson correlations above 0.3, with edge weights proportional to correlation strength. Isolated nodes were removed, retaining only interacting features. Community detection was performed using the Louvain algorithm, and nodes were arranged in a circular layout, with cluster membership indicated by distinct pastel colors. Edge widths reflected correlation magnitude, providing an intuitive representation of clusters of features that may jointly contribute to class discrimination.
Second, to capture interaction effects identified by the predictive model, the absolute SHAP interaction values were computed and averaged across all classes. The resulting mean interaction matrix was visualized as a heatmap (Figure 13a). This visualization highlights pairs of features whose combined effects contribute substantially to model predictions, complementing the correlation-based network by revealing non-linear and class-dependent interactions. Together, these analyses provide a multi-level perspective on both linear dependencies and model-informed synergistic relationships among features. The combined results of the feature interaction network and the mean SHAP interaction values across classes are presented in Figure 13.
2.6. Unsupervised Pattern Discovery
Unsupervised clustering was performed on the full feature space using the KMeans algorithm. The optimal number of clusters was determined via the elbow method, which involves plotting the within-cluster sum of squares (WCSS) as a function of the number of clusters and identifying the point where the rate of decrease sharply diminishes. This approach suggested k = 5, coincidentally matching the number of predefined classes in the dataset (Figure 14a).
To visualize the clustering results, the high-dimensional feature space was projected into two dimensions using t-distributed Stochastic Neighbor Embedding (t-SNE), which preserves local neighborhood structures and allows for intuitive visualization of complex patterns. The resulting 2D embeddings of the clustered objects are shown in Figure 14b.
To assess the quality of the clustering and its correspondence to the ground-truth classes, the Silhouette score for the original class labels in the full feature space was first evaluated, yielding a value of −0.139. After applying KMeans clustering, the Silhouette score increased to 0.122, indicating a moderate improvement in the cohesion of the assigned groups. The agreement between the derived clusters and the true labels was further quantified using the Adjusted Rand Index (ARI = 0.198) and Normalized Mutual Information (NMI = 0.286). The ARI measures the similarity between two labels while correcting for chance, with values closer to 1 indicating strong agreement, whereas the NMI evaluates the mutual dependence between cluster assignments and true classes, also ranging from 0 (no correspondence) to 1 (perfect match). Despite the moderate improvements in these metrics, the results suggest that the clusters do not fully capture the original class structure. This discrepancy may be partly due to the high dimensionality of the feature space, which can complicate the separation of clusters and obscure intrinsic patterns in the data. The composition of each cluster with respect to the original classes is summarized in Table 3.
To further characterize the identified clusters, differences in mean blood parameter values were analyzed. Prior to the analysis, all features were standardized to ensure valid comparisons between variables with different scales. The resulting differences are visualized as a heatmap in Figure 15, illustrating the relative profiles of blood indicators for each cluster. This visualization highlights distinct trends in certain biochemical parameters, suggesting potential physiological distinctions among the identified groups.
To objectively assess the agreement between the model-estimated and statistical significance of features [6], a rank comparison was performed and the Spearman correlation was employed. The negative logarithm of the p-value from the nonparametric Kruskal-Wallis tests and Dunn’s post hoc test was used as a measure of the statistical significance of each feature: . The model-estimated significance of features was expressed as normalized mean absolute values:
where are the SHAP values for feature (blood parameter) j, patient i, class k.
The Spearman correlation coefficient was calculated based on the ranks:
where d_i_ is the difference in ranks between the ranked sets of statistical significance (the negative logarithm of statistical significance) and the features selected as a result of the SHAP analysis; n is the number of features.
To determine the Spearman correlation coefficient, the sets of statistical significance of features and SHAP values were ranked—each element was assigned a rank and according to the order in the set. The results of the comparative analysis are presented in Table 4 and Table 5.
2.7. External Validation on a Single-Class Cohort
To evaluate the generalization capability of the proposed framework and substantiate the clinical relevance of the identified biomarker signatures, external validation was performed on an independent, single-class cohort of patients diagnosed with SIVD. Given that the initial findings were considered exploratory due to the high-dimensional nature of the biochemical feature space (p ≫ n), this step aimed to determine whether the model-estimated feature importance remains stable when applied to a non-overlapping dataset.
The validation cohort comprised 28 patients, representing a clinical phenotype closely aligned with the SIVD group analyzed in the primary study. It should be noted that certain biomarkers, specifically {‘BilAc’, ‘LAC’, ‘Glyc’, ‘Transf’, ‘PON1’, ‘NEFA’, ‘Ur.Acid’, ‘a1-AGP’, ‘K’, ‘vWF’}, were unavailable in the validation dataset due to differences in clinical recording protocols. To maintain the integrity of the input feature vector and enable the deployment of the pre-trained classifier, these missing values were imputed using the mean values derived from the primary training dataset.
Furthermore, a rigorous comparative analysis was conducted between the primary and validation cohorts to assess potential out-of-distribution (OOD) effects and distributional shifts. This evaluation included mean values with 95% confidence intervals (CIs), medians, and interquartile ranges (IQRs) for each biochemical parameter. The Population Stability Index (PSI) was calculated for every feature; the average PSI was 2.111, indicating substantial distributional differences between the training and validation cohorts. These detailed comparative statistics and PSI values are provided in the Supplementary Materials (Tables S2 and S3).
To ensure methodological consistency and mitigate the impact of heteroscedasticity, the pre-trained Gradient Boosting model (maximum tree depth = 3) was deployed alongside the previously optimized variance-aware preprocessing pipeline with accuracy 71.43%. Performance was quantified not only by classification accuracy but also by the consistency of feature attribution, using SHAP (Figure 16) to assess whether key markers—such as iron (Fe), glucose, and transferrin—retain their synergistic predictive roles in an independent clinical setting. This external validation serves to confirm that the identified biomarker signatures are not artifacts of the original sample’s variance structure but reflect stable physiological patterns associated with vascular cognitive impairment.
3. Discussion
Classical classifiers (LDA, LR, kNN, DT, RF, SVM) were evaluated on three feature set variants: the full set, features selected via one-way ANOVA, and features selected via Welch’s ANOVA. The results revealed considerable variability across classes: for some classes, the models achieved acceptable accuracy, whereas for others, F1 scores were low (see Table 1). SHAP analysis and the interaction network identified a number of stable markers and inter-feature relationships (Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13).
Several factors likely contribute to the observed limitations in classification performance. First, the dataset exhibits high dimensionality relative to the sample size, with approximately 49 numerical features for only 120 patients. This “wide” setup (p ≫ n) increases estimate variability, promotes overfitting, and destabilizes model parameters, particularly for linear models and SVMs with nonlinear kernels. Second, class imbalance shifts the optimization toward the more prevalent classes, reducing macro-F1 scores even when overall accuracy or dominant-class F1 appears satisfactory, as reflected in Table 1. Third, heterogeneity within clinical groups, arising from diverse comorbidities, increases within-class variance and diminishes between-class discriminability. This is consistent with high heteroscedasticity observed in several features (Levene test), explaining differences in feature rankings between Welch’s and standard ANOVA. Finally, strong correlations and redundancy among features (e.g., RBC/HGB/HCT; PLT/PCT; Chol/LDL) complicate coefficient interpretation and dilute the contribution of individual variables in linear models.
SHAP analysis further provided insights into feature importance. A consistent set of markers—including Fe, Transferrin, RDW%, LDL/Chol, LDH, vWF, Glu, ALB, and several hematological parameters—was repeatedly identified across classes. These features not only show high individual contributions but also participate in synergistic interactions, as evidenced in the SHAP interaction network (Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13). Importantly, feature relevance is class-specific: certain markers are critical for distinguishing vascular-metabolic conditions, while others are more relevant for cognitive-vascular syndromes. This class-specificity explains why global metrics, such as macro-F1, may not fully capture the clinical utility of the models for particular diagnoses.
The results of the comparative analysis of ranks showed that there is a consistency between the results of the statistical analysis and the SHAP method—it is especially pronounced for combinations of groups 1–2 (0.5277) and 1–5 (0.5919) according to the Spearman coefficient. However, blood parameters that are important for predicting groups, but not for statistical analysis, were also identified: for group 2, they were Glu, Transf, LAC, WBC, LDL; for group 3, they were UREA, Ca, CREA, Fe, BilAc; for group 4, they were Chol, Fe, HDL, Ur. Acid, Ca; and for group 5, they were MCHC, Ca, K-RUV, BilAc, NEFA.
To further assess model robustness, all analyses were validated on an independent cohort consisting exclusively of patients with dementia. Statistical comparisons between the original and dementia-specific cohorts revealed a significant shift in data distributions, and several features present in the original dataset were unavailable in the new cohort. Despite these limitations, the gradient boosting model, combined with SHAP-based interpretability, maintained high classification accuracy. Notably, the same set of stable markers and class-specific feature interactions identified in the original cohort were largely preserved, indicating that the predictive patterns uncovered are robust to cohort heterogeneity and missing variables, at least within the context of dementia patients.
4. Materials and Methods
4.1. Chemicals
PBS (pH 7.4) was purchased from Biolot (St. Petersburg, Russia). Diagnostic kits were produced by Randox Laboratories (Crumlin, UK). All other reagents were from Sigma-Aldrich (Rockville, MD, USA).
4.2. Study Design and Setting
The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the Research Institute of Hygiene, Occupational Pathology and Human Ecology of the Federal Medical Biological Agency (Approval No. 3, registration date: 2 June 2022). Informed consent was obtained from all subjects involved in this study. The period during which the patient groups were formed, and their neurological examination and blood sampling for subsequent analysis were performed was approximately one year—from mid-2022 to mid-2023. The dataset comprised 120 elderly patients (60 to 78 years old) living in St. Petersburg and registered in the geriatric departments of city clinics. Structured interviews were conducted to obtain socio-demographic data. Medical history and neurological examination data were collected. The patients were divided into five groups based on clinical diagnoses established according to standard criteria detailed in the prior study [6]. Exclusion criteria included data on progressive cancer and recent infectious diseases. As for the inclusion criteria, numerical labels (1–5) map to the following groups for consistency across analyses (Table 6). For validation purposes, a group of 28 patients aged 60–80 undergoing inpatient treatment at City Hospital No. 28 in St. Petersburg was formed in the fall of 2025.
4.3. Sample Preparation
Blood samples were collected, processed, and stored in accordance with international guidelines [32]. Blood was collected from the subjects on an empty stomach from the cubital vein into BD Vacutainer vacuum tubes, Becton Dickinson, Franklin Lakes, NJ, USA with anticoagulants (K3EDTA, heparin, and citrate). The plasma was stored at −70 °C until needed.
4.4. Biochemical and Hematological Analysis
Biochemical parameters were determined using a Sapphire 400 analyzer (Tokyo Boeki Medisys, Tokyo, Japan) and commercial RANDOX kits. Immunological and hematological studies of whole blood were performed on the day of blood collection. Baseline parameters and histograms of the distribution of erythrocytes, leukocytes, and platelets by volume were determined using a Medonic hematology analyzer (Boule Diagnostics, Spånga, Sweden).
4.5. Von Willebrand Factor and ADAMTS13
To quantify vWF in blood plasma using the standard method, the Technozym vWF:Ag ELISA kit (Technoclone GmbH, Vienna, Austria) was used. The activity of the von Willebrand factor in the plasma of patients was determined in the agglutination reaction of lyophilized platelets with ristocetin using a set of reagents from the Renam company (Moscow, Russia, No. AG-5). The determination of ADAMTS13 activity (VWF cleaving agent) in plasma was carried out using a Synergy 2 plate fluorometer (BioTech, Hudson, MA, USA) and the fluorescent substrate FRETS-VWF73.
4.6. Dataset Description
This study utilized a dataset comprising biochemical blood parameters from 120 patients divided into four groups: control (group 1), AIS (group 2), CCCI (group 3), DM (group 4) and SIVD (group 5). The dataset included 49 features such as RBC, MCV, PLT, and various enzymes (e.g., BChE, TP, vWF). In Table 7 below is the detail description of the dataset. Data preprocessing involved imputation of missing values using median filling, normalization via StandardScaler to ensure zero mean and unit variance, and outlier detection with a contamination level of 0.01 to reduce noise in heterogeneous biomedical data. All analyses were conducted using Python 3.10 with scikit-learn (v. 1.8.0), NumPy (v. 2.3.5), Pandas (v. 2.3.3), SHAP (v. 0.48.0), CatBoost (v. 1.2.8) and Matplotlib (v. 3.10.8) libraries.
4.6.1. Feature Selection Methods
To address heteroscedasticity and enhance model robustness, three feature selection methods were applied, with a specific focus on mitigating the risks associated with violated statistical assumptions.
- Full Feature Set: All 49 blood parameters were used without filtering, serving as a baseline to assess the impact of the complete biomarker set on classification performance. This “wide” setup (p ≫ n) often increases estimate variability and promotes overfitting in high-dimensional biomedical data, necessitating comparative selection strategies.
- One-Way ANOVA Method: Features were selected based on standard ANOVA, which evaluates differences between groups by comparing mean values. While this method assumes homogeneity of variances (homoscedasticity), it was employed as a comparative benchmark to identify biomarkers with significant variance across imbalanced groups. However, when this assumption is violated, the standard F-test becomes unreliable, leading to Type I error rates that deviate substantially from nominal significance levels and potentially producing spurious results.
- Welch’s ANOVA Method (Variance-Aware): To account for the unequal variances typical in clinical biochemical data, Welch’s ANOVA was applied. This method is specifically designed to handle heteroscedasticity by adjusting the F-statistic and degrees of freedom
The justification for this approach is twofold:
- (1)Empirical Evidence: In our study, Levene’s test confirmed statistically significant heterogeneity of variance (p < 0.05) for several critical features, including iron (Fe), transferrin (Transf), RDW%, and LDL receptor (LDL). These findings directly demonstrated that the equal-variance assumption was untenable for our dataset.
- (2)Methodological Robustness: Welch’s ANOVA provides more accurate Type I error control and maintains comparable or superior statistical power to classical ANOVA even when variances are equal. It effectively eliminates the need for homogeneity as a prerequisite, ensuring that the identified biomarker signatures are not artifacts of unequal group variances.
Features were ranked according to their F-statistics, and the comparison of p-values between the two ANOVA variants revealed marked discrepancies in statistical significance for key parameters, underscoring the necessity of using variance-aware workflows to ensure the reliability of subsequent machine learning tasks.
4.6.2. ML Models
Six classical supervised classification algorithms were employed to predict patient group membership based on the selected blood biomarkers. Prior to model training, feature values were standardized to ensure consistent scaling across variables. The chosen models represent a complementary spectrum of analytical approaches: linear methods facilitate interpretability and identification of direct associations, whereas nonlinear models are capable of capturing intricate, nonlinearly separable patterns inherent in heterogeneous clinical datasets.
Linear Discriminant Analysis (LDA)
LDA is a classical statistical approach that projects data onto a lower-dimensional space to achieve maximum class separability. It constructs a linear combination of features that optimally distinguishes between predefined groups by maximizing the ratio of between-class to within-class variance [33].
Logistic Regression (LR)
LR was employed as a complementary linear classifier to benchmark model performance. It models the probability of class membership using the logistic (sigmoid) function and provides interpretable coefficients that quantify each feature’s contribution to the prediction. In this study, LR was extended to multiclass classification using the one-vs-rest strategy. To mitigate overfitting and improve generalization, both L1 (Lasso) and L2 (Ridge) regularization techniques were applied [34].
k-Nearest Neighbor (kNN)
KNN is a non-parametric, instance-based learning algorithm that classifies samples based on their similarity to previously observed data. Rather than building an explicit model during training, KNN stores all training examples and determines the class of a new sample by analyzing the labels of its closest neighbors in the feature space. kNN captured local patterns in the biochemical feature space [35].
Decision Tree (DT)
A non-linear, tree-based method that recursively partitions the feature space based on feature thresholds to minimize impurity. DT was included to model potential non-linear decision boundaries [36,37].
Random Forest (RF)
An ensemble of decision trees that aggregates predictions via majority voting, offering robustness to overfitting and high-dimensional data. RF was used here to handle feature redundancy and interactions among correlated biochemical markers [38]. Key parameters tuned during hyperparameter optimization include:
N estimators: This defines the number of decision trees in the RF. Values 50, 100 and 200 were tested to strike a balance between predictive performance and computational efficiency.
Max_depth: This controls the maximum number of splits per decision tree. A low value risks underfitting, while a high value may lead to overfitting. In this study, tree depths were adjusted to 5, 10 and 15 to capture meaningful patterns without excessive complexity.
These parameter adjustments resulted in a robust and accurate model, well-suited for handling the complexity and high dimensionality of biological data.
Support Vector Machines (SVM)
A kernel-based method that finds an optimal hyperplane, maximizing the margin between classes, particularly effective for high-dimensional data. SVM was applied to capture potentially non-linear separations in the biomarker space; tuned parameters included regularization parameter C, kernel type, and gamma parameter for RBF and polynomial kernels [39].
4.6.3. Cross-Validation
To ensure robust model evaluation and prevent overfitting, stratified k-fold cross-validation was employed with k = 3. Stratification preserved the class distribution within each fold, reflecting the multiclass nature of the dataset (four patient groups). The dataset was partitioned into non-overlapping training and validation subsets for each iteration, with models trained on the training set and evaluated on the test set. This process was repeated across all folds, and performance metrics were averaged to improve reliability and assess model sensitivity to training subsets.
4.6.4. Hyperparameter Optimization
Hyperparameters were tuned using GridSearchCV, an exhaustive search over a predefined parameter grid with internal 3-fold cross-validation, using the macro-averaged F1-score as the evaluation metric. The tuning parameters for each model included:
LDA: Solver type (‘svd’, ‘lsqr’, ‘eigen’) and shrinkage parameter (applicable for ‘lsqr’ and ‘eigen’ solvers). In this study, the default ‘svd’ solver was used without additional shrinkage optimization, as it is robust to multicollinearity and does not require regularization tuning.
Logistic Regression (LR): Regularization type (‘l1’, ‘l2’, ‘elasticnet’, ‘none’), regularization strength (C ∈ {0.01, 0.1, 1, 10, 100}), solver (‘saga’, compatible with all penalty types), and L1 ratio (∈{0, 0.5, 1}, applicable only for elastic-net regularization). Based on GridSearchCV results, the optimal LR configuration employed L1 regularization with C = 1, using the saga solver.
K-Nearest Neighbors (KNN): Number of neighbors (k ∈ {3, 5, 7, 9, 11}), weighting strategy (‘uniform’, ‘distance’), and distance metric (‘euclidean’, ‘manhattan’). Based on GridSearchCV results, the optimal KNN configuration employed k = 9 neighbors, distance-based weighting, and the Euclidean distance metric.
Decision Tree (DT): Maximum tree depth (None, 3, 5, 7, 9), minimum number of samples required to split an internal node (2, 5, 10), minimum number of samples required at a leaf node (1, 2, 4), and splitting criterion (‘gini’, ‘entropy’). Based on GridSearchCV results, the optimal DT configuration employed a maximum tree depth of 5, a minimum of 2 samples per leaf, a minimum of 2 samples to split an internal node, and the entropy-based splitting criterion.
Random Forest (RF): Number of trees (n_estimators ∈ {50, 100, 200}), maximum tree depth (None, 5, 10, 15), minimum number of samples required to split an internal node (2, 5, 10), minimum number of samples required at a leaf node (1, 2, 4), and splitting criterion (‘gini’, ‘entropy’). Based on GridSearchCV results, the optimal RF configuration employed 50 trees, no explicit limit on tree depth, a minimum of 5 samples to split an internal node, a minimum of 1 sample per leaf, and the entropy-based splitting criterion.
Support Vector Machine (SVM): Regularization parameter C (0.1, 1, 10, 100), kernel type (‘linear’, ‘rbf’, ‘poly’), and kernel coefficient gamma (‘scale’, ‘auto’; used only for ‘rbf’ and ‘poly’ kernels). Based on GridSearchCV results, the optimal SVM configuration employed an RBF kernel with C = 10 and γ = scale.
Optimal hyperparameters were selected based on the highest cross-validated F1-score to ensure strong generalization capability on unseen data.
4.6.5. Evaluation Metrics
Model performance was assessed using the macro-averaged F1-score, balancing precision and recall across all classes, appropriate for multiclass, imbalanced datasets.
Mean F1-scores and standard deviations were calculated across cross-validation folds for each pipeline (full feature set, one-way ANOVA, Welch’s ANOVA).
Ethical considerations complied with institutional guidelines, and data were anonymized to protect patient confidentiality [40].
5. Conclusions
In this study, a variance-robust and interpretable analytical framework was proposed and validated for biomarker discovery in multivariate clinical datasets characterized by small sample sizes, heteroscedasticity, and overlapping diagnostic categories. It was demonstrated that ignoring variance heterogeneity leads to unstable feature selection and potentially misleading interpretations, while the use of Welch’s ANOVA provides more robust results without sacrificing statistical power.
Integrating interpretable machine learning methods with SHAP analysis enables the identification of nonlinear, class-specific, and synergistic effects between biochemical markers that go beyond traditional single-variate statistics. The results demonstrate that interpretability is not simply an auxiliary tool for explaining patterns, but a critical component of robust clinical machine learning.
The proposed workflow can be directly applied to other biomedical fields where data heterogeneity and limited sample sizes make robust statistical inference difficult, providing a transparent link between statistical testing, predictive modeling, and clinical interpretation.
6. Limitations of the Research
Several limitations of the present study should be acknowledged. First, the analysis is based on a relatively small cohort (n = 120) while the feature space has high dimensionality (≈49 biochemical parameters), resulting in a p ≫ n configuration that increases estimator variability and the risk of overfitting despite cross-validation and model regularization. This limited sample size reduces the statistical power to detect small-to-moderate effects and constrains the generalizability of the findings.
Second, the dataset contains substantial inter-feature correlation and redundancy (e.g., RBC/HGB/HCT, PLT/PCT, Chol/LDL), which complicates model interpretability and can destabilize feature importance rankings; although SHAP provides local and global explainability, correlated predictors may still lead to ambiguous attribution of effects.
Third, several parameters exhibit pronounced heteroscedasticity (Levene’s test), and one of the groups shows marked separation in the clustering analysis. Unequal variances across groups violate assumptions of classical parametric tests and can bias significance rankings and cross-group comparisons; while Welch’s ANOVA and variance-aware procedures were employed to mitigate this issue, residual effects of heteroscedasticity and group-specific variance structure may still affect both statistical and machine-learning inferences.
Finally, class imbalance and the moderate agreement between unsupervised clusters and clinical labels (low ARI/NMI and modest Silhouette scores) limit the immediate clinical translatability of the classifiers. Together, these limitations mean that the identified biomarker signatures should be considered exploratory; future work must validate the results on larger, multi-center cohorts, apply more extensive external validation, and test robustness to alternative preprocessing, dimensionality-reduction, and class-balancing strategies.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Gunderson C.G. Bilan V.P. Holleck J.L. Nickerson P. Cherry B.M. Chui P. Bastian L.A. Grimshaw A.A. Rodwin B.A. Prevalence of Harmful Diagnostic Errors in Hospitalised Adults: A Systematic Review and Meta-Analysis BMJ Qual. Saf.2020291008101810.1136/bmjqs-2019-01082232269070 · doi ↗ · pubmed ↗
- 2Auerbach A.D. Lee T.M. Hubbard C.C. Ranji S.R. Raffel K. Valdes G. Boscardin J. Dalal A.K. Harris A. Flynn E. Diagnostic Errors in Hospitalized Adults Who Died or Were Transferred to Intensive Care JAMA Intern. Med.202418416410.1001/jamainternmed.2023.734738190122 PMC 10775080 · doi ↗ · pubmed ↗
- 3Newman-Toker D.E. Peterson S.M. Badihian S. Hassoon A. Nassery N. Parizadeh D. Wilson L.M. Jia Y. Omron R. Tharmarajah S. Diagnostic Errors in the Emergency Department: A Systematic Review AHRQ Comparative Effectiveness Reviews Agency for Healthcare Research and Quality (US)Rockville, MD, USA 202236574484 · pubmed ↗
- 4Winters B. Custer J. Galvagno S.M. Colantuoni E. Kapoor S.G. Lee H. Goode V. Robinson K. Nakhasi A. Pronovost P. Diagnostic Errors in the Intensive Care Unit: A Systematic Review of Autopsy Studies BMJ Qual. Saf.20122189490210.1136/bmjqs-2012-00080322822241 · doi ↗ · pubmed ↗
- 5Kai Q. Artificial Intelligence Applications in Healthcare Diagnosis and Treatment in the Era of Big Data Pac. Int. J.20258606610.55014/pij.v 8i 2.791 · doi ↗
- 6Goncharov N.V. Popova P.I. Kudryavtsev I.V. Golovkin A.S. Savitskaya I.V. Avdonin P.P. Korf E.A. Voitenko N.G. Belinskaia D.A. Serebryakova M.K. Immunological Profile and Markers of Endothelial Dysfunction in Elderly Patients with Cognitive Impairments Int. J. Mol. Sci.202425188810.3390/ijms 2503188838339164 PMC 10855959 · doi ↗ · pubmed ↗
- 7Hua S. Dong Z. Wang H. Liu T. Global, Regional, and National Burden of Ischemic Stroke in Older Adults (≥60 Years) from 1990 to 2021 and Projections to 2030 Front. Neurol.202516156760910.3389/fneur.2025.156760940406698 PMC 12094992 · doi ↗ · pubmed ↗
- 8Deshpande A. Zhang L.Q. Balu R. Yahyavi-Firouz-Abadi N. Badjatia N. Laksari K. Tahsili-Fahadan P. Cerebrovascular Morphology: Insights into Normal Variations, Aging Effects and Disease Implications J. Cereb. Blood Flow Metab.2025451249126410.1177/0271678 X 25132853740314210 PMC 12048404 · doi ↗ · pubmed ↗
