# Systematic Review and Meta-Analysis of Explainable Machine Learning Models for Clinical Depression Detection

**Authors:** Ariosto Trelles, Tomás Fontaines Ruiz, Antonio Ponce Rojo

PMC · DOI: 10.3390/bs15111476 · 2025-10-30

## TL;DR

This systematic review and meta-analysis of 20 studies finds that depression-detection performance depends more on data quality, context, and interpretability than on the choice of machine learning algorithm.

## Contribution

The study systematically compares the accuracy, interpretability, and generalizability of supervised machine learning models (SVM, Random Forest, XGBoost, and GCN) for clinical depression detection using real-world data.

## Key findings

- XGBoost achieved the best average performance (F1-Score 0.86, AUC-ROC 0.84), but differences across algorithms were not statistically significant (p > 0.05).
- SHAP was the most commonly used interpretability method, appearing in 70% of the studies.
- F1-Score strongly correlated with AUC-ROC (r = 0.950, p < 0.001), but both metrics showed considerable heterogeneity across studies (I² = 74.37% for F1; I² = 71.49% for AUC).
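
The F1/AUC-ROC correlation reported above is a Pearson coefficient computed over study-level metric pairs. The sketch below shows that calculation in plain Python. The function name `pearson_r` and the five input pairs are illustrative assumptions, not the paper's code or its extracted data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL per-study values for illustration only (not from the paper).
f1_scores = [0.70, 0.78, 0.82, 0.86, 0.91]
auc_scores = [0.68, 0.75, 0.83, 0.84, 0.90]

r = pearson_r(f1_scores, auc_scores)
print(f"Pearson r = {r:.3f}")
```

A coefficient near 1 means studies that report a high F1-Score also tend to report a high AUC-ROC; it says nothing about how consistent either metric is across studies.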

## Abstract

Depression is among the most prevalent mental disorders, and its early detection is essential to improving therapeutic outcomes in psychotherapy. This systematic review and meta-analysis evaluated the accuracy, interpretability, and generalizability of supervised algorithms (SVM, Random Forest, XGBoost, and GCN) for clinical detection of depression using real-world data. Following PRISMA guidelines, 20 studies published between 2014 and 2025 were analyzed across major scientific databases. Extracted metrics included F1-Score, AUC-ROC, interpretability methods (SHAP/LIME), and cross-validation strategies, with statistical analyses using ANOVA and Pearson correlations. Results showed that XGBoost achieved the best average performance (F1-Score: 0.86; AUC-ROC: 0.84), although differences across algorithms were not statistically significant (p > 0.05), challenging claims of algorithmic superiority. SHAP was the predominant interpretability approach (70% of studies). Studies implementing combined SHAP+LIME showed higher F1-Score values (F(1,7) = 8.71, p = 0.021), although this association likely reflects greater overall methodological rigor rather than a direct causal effect of interpretability on predictive performance. Clinical surveys and electronic health records demonstrated the most stable predictive outputs across validation schemes, whereas neurophysiological data achieved the highest point estimates but with limited sample representation. F1-Score strongly correlated with AUC-ROC (r = 0.950, p < 0.001). Considerable heterogeneity was observed for both metrics (I² = 74.37% for F1; I² = 71.49% for AUC), and Egger’s test indicated a publication bias for AUC (p = 0.0048). Overall, findings suggest that algorithmic performance depends more on data quality, context, and interpretability than on the choice of model, with explainable approaches offering practical value for personalized and collaborative clinical decision-making.
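
The heterogeneity figures in the abstract are I² statistics, conventionally derived from Cochran's Q as I² = max(0, (Q − df) / Q) × 100, with df equal to the number of studies minus one. The sketch below assumes inverse-variance (fixed-effect) weights. The function name `cochran_q_and_i2` and the inputs are hypothetical, and the paper's exact pooling procedure is not described in this summary.

```python
def cochran_q_and_i2(effects, variances):
    """Return (Q, I2 in percent) for per-study effects and their variances.

    Assumes inverse-variance (fixed-effect) weights; this is a common
    convention, not necessarily the procedure used in the paper.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    # Inverse-variance weighted pooled estimate.
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return q, 0.0
    # I2: share of total variation attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0
    return q, i2


# HYPOTHETICAL study-level F1 estimates and variances (not from the paper).
f1_estimates = [0.72, 0.80, 0.86, 0.91]
f1_variances = [0.0009, 0.0016, 0.0004, 0.0025]

q_stat, i2_pct = cochran_q_and_i2(f1_estimates, f1_variances)
print(f"Q = {q_stat:.2f}, I2 = {i2_pct:.1f}%")
```

Values above roughly 75% are often read as considerable heterogeneity, which is why the pooled averages reported for each algorithm should be interpreted cautiously.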

## Linked entities

- **Diseases:** depression (MONDO:0002050)

## Full-text entities

- **Diseases:** Depression (MESH:D003866), mental disorders (MESH:D001523)

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12649417/full.md

---
Source: https://tomesphere.com/paper/PMC12649417