Functional connectivity–based classification and subtyping of major depression for precision mental health: An ensemble graph neural network approach

Kaizhong Zheng; Hongbing Lu; Huaning Wang; Dewen Hu; Badong Chen; Baojuan Li; Dhiya Al-Jumeily OBE; Phat Kim Huynh; Dhiya Al-Jumeily OBE; Phat Kim Huynh

PMC · DOI:10.1371/journal.pdig.0001261·March 4, 2026

Functional connectivity–based classification and subtyping of major depression for precision mental health: An ensemble graph neural network approach

Kaizhong Zheng, Hongbing Lu, Huaning Wang, Dewen Hu, Badong Chen, Baojuan Li, Dhiya Al-Jumeily OBE, Phat Kim Huynh, Dhiya Al-Jumeily OBE, Phat Kim Huynh

PDF

Open Access

TL;DR

This study uses brain connectivity patterns from MRI scans to detect depression and identify three distinct subtypes, offering a more objective and biologically grounded approach to diagnosis and treatment.

Contribution

The novel contribution is an ensemble graph neural network framework for classifying and subtyping MDD using functional connectivity data with cross-national validation.

Findings

01

The classifier achieved 0.73 leave-one-site-out accuracy on the REST-meta-MDD cohort.

02

Three MDD subtypes were identified with distinct connectivity signatures involving key brain networks.

03

The method retained 0.78 sensitivity when transferred from the Chinese to the Japanese cohort.

Abstract

Major depressive disorder (MDD) remains clinically diagnosed based on subjective symptoms rather than objective neurobiological markers, which limits diagnostic accuracy and the ability to tailor treatment. We present an ensemble hybrid framework that integrates graph neural networks (GNN) with unsupervised clustering to classify and subtype MDD using resting-state functional connectivity (rs-fMRI) profiles. A GNN was trained to distinguish MDD from healthy controls using functional connectivity derived brain graphs, and the resulting subject level embeddings were clustered to uncover subtype structure. We evaluated the approach on two public multisite cohorts, REST-meta-MDD (China; N = 1,604; 17 sites) and SRPBS (Japan; N = 446; 4 sites), using leave-one-site-out cross-validation and cross-national transfer. The classifier achieved 0.73 leave-one-site-out accuracy on REST-meta-MDD and…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes1

COX1

Proteins1

Species1

Homo sapiens(human · species)

Chemicals2

ComBat PDIG-D-25-00628R1

Diseases14

Major depressive disorder depression ARI weight loss SBN agitation impairments in cingulate areas NMI MDD schizophrenia ASD insomnia Mental Disorders Alzheimer's disease

Figures7

Click any figure to enlarge with its caption.

Fig 1 — Demographic characteristics of participants in both datasets across different sites.The principal dataset contained 17 sites, while independent validation dataset contained 4 sites. (A) Total number of subjects for each contributing site in both datasets including the principal dataset (left side) and independent validation dataset (right side). (B) Number of female participants and male participants for each site in both datasets including the principal dataset (left side) and independent validation dataset (right side). (C) Age (in years) for all subjects per site in both datasets including the principal dataset (above side) and independent validation dataset (below side).

Fig 2 — Data analysis pipeline of EH-BrainGNN.Data Preprocessing: Raw fMRI data were preprocessed and then mean time series of regions of interest (ROIs) were extracted from brain parcellation. Subsequently, the functional connectivity (FC) matrices were calculated using Pearson correlation between ROIs. From the FC we constructed the brain functional graph G=(A,X), where A was the graph adjacency matrix and X is node feature matrix. In this study, A was a binarized FC matrix where only the top 20-percentile absolute values of FC matrix were transformed into ones, while the rest were transformed into zeros. For node feature X, Xk for node k was an entire row of FC matrix. MDD Classification: We first learn subgraph mask M to generate explanation subgraph Gsub=G⨀M, ⨀ is element-wise multiplication. Then the graph encoder was employed to learn graph embeddings Z and Zsub from G and Gsub, respectively. Finally, we used Zsub to predict label Y. MDD Subtyping: The learned Zsub is used to extract subject similarity matrix by Pearson correlation between graph embeddings of every two participants. Then the subject similarity matrix was sent to the clustering by fast search and find of density peaks (CFDP) method to generate MDD subtypes.

Fig 3 — A schematic of the full network with parameter counts (A) and a clear statement of the role of the subgraph generator (B).

Fig 4 — Generalization performance of EH-BrainGNN.(A) Tenfold cross validation performance on different models in principal dataset. (B) Tenfold cross validation performance on different models in replication dataset. (C) The leave-one-site-out cross-validation performance ranking of different models based on their accuracy. (D-E) Generalization Performance of the classifier for independent cohorts.

Fig 5 — Determination of the optimal number of CFDP subtypes.(A-B) CFDP decision graphs for the two datasets. (C-D) Two dimensional embeddings illustrating the spatial distribution of subjects colored by CFDP subtype. (E-F) Silhouette coefficients as a function of the number of clusters (K=2–9). (G-H) Dunn indices for K=2–9.

Fig 6 — Visualization of distinct patterns in different subtypes of MDD in principal dataset and independent validation dataset, respectively.The colors of brain neural systems are described as: visual network (VN), somatomotor network (SMN), dorsal attention network (DAN), ventral attention network (VAN), limbic network (LIN), frontoparietal network (FPN), default mode network (DMN), cerebellum (CBL) and subcortical network (SBN), respectively.

Fig 7 — Graph embeddings define depression subtypes.(a) Comparison of clinical profiles for depression symptoms (HAMD-17) among three depression subtypes using Wilcoxon rank-sum tests with FDR correction on the principal dataset. (b) Comparison of HAMD total scores among three depression subtypes using Wilcoxon rank-sum tests with FDR correction on the principal dataset. (c) Comparison of BDI-II scores among three depression subtypes using Wilcoxon rank-sum tests on the independent validation dataset. (d) The overlap and distinction of brain patterns and symptom profiles within three subtypes in principal dataset and independent validation dataset.

Equations23

Funding3

—http://dx.doi.org/10.13039/501100001809National Natural Science Foundation of China
—http://dx.doi.org/10.13039/501100001809National Natural Science Foundation of China
—http://dx.doi.org/10.13039/501100001809National Natural Science Foundation of China

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsFunctional Brain Connectivity Studies · Mental Health Research Topics · Digital Mental Health Interventions

Full text

1. Introduction

As a debilitating mental disorder, major depressive disorder (MDD) is regarded as the world’s most serious psychiatric disorder [1,2] and represents the second leading contributor to chronic disease [3,4]. However, the treatment efficacy is still unsatisfactory and responses have been divergent. Current diagnosis with the Diagnostic and Statistical Manual of Mental Disorders 5th edition (DSM - 5) is based on subjective symptoms and signs rather than the underlying biological mechanisms. As a result, patients can meet diagnostic criteria through different combinations of symptoms [5], contributing to pronounced within-diagnosis heterogeneity and overlap with other psychiatric conditions [6–9]. This heterogeneity has been widely recognized as a major obstacle to developing mechanism-informed stratification and improving treatment outcomes.

To address these limitations, precision psychiatry initiatives and the Research Domain Criteria (RDoC) framework have advocated for redefining mental disorders using neurobiological systems, particularly brain circuits, rather than relying solely on symptom clusters [10]. In this context, resting-state functional connectivity (FC) has emerged as a promising candidate biomarker, offering a system-level characterization of distributed neural circuits [11,12]. Prior studies have shown that FC patterns can support machine-learning-based discrimination between patients and controls and can reveal clinically meaningful subgroups. For example, neuroimaging-driven approaches have been used to identify circuit-based signatures in schizophrenia [13] and related disorders, and several influential studies in depression have reported FC-defined subtypes associated with symptom profiles and treatment response [5]. Collectively, these findings suggest that connectome-based models may help deconstruct symptom-based categories and move toward biologically grounded subtyping.

However, a critical barrier to translation is limited generalizability. Many neuroimaging classifiers are trained and evaluated within a single dataset, and performance often degrades when models are tested on unseen sites or independent cohorts [5,14–16]. This fragility is likely driven by both disorder heterogeneity and substantial site-related variability in data acquisition and preprocessing, including differences in scanner hardware, sequence parameters and nuisance signals [17]. Consequently, robust validation across multiple sites and cohorts has been proposed as a minimal requirement for clinically relevant neuroimaging biomarkers, yet remains challenging in practice [18]. Even when harmonization strategies are applied, residual site effects and cohort differences can still limit reproducibility of both classification performance and data-driven subtyping. Despite extensive research on Major Depressive Disorder (MDD) classification and the optimization of generalization performance [16,19–22], achieving satisfactory results remains challenging, limiting the clinical applicability of these methods.

These challenges motivate the development of frameworks that simultaneously prioritize interpretability and generalizability, while supporting both case–control discrimination and subtype discovery. Here we present an ensemble hybrid framework, EH-BrainGNN, which learns compact and interpretable connectome-derived signatures and leverages them for classification and subsequent data-driven subtyping. We evaluate generalization in a large multi-site dataset and further test cross-cohort transfer in an independent validation cohort. By emphasizing reproducibility across sites and populations, our work aims to advance neuroimaging-informed stratification of depression that complements symptom-based diagnosis and provides a more biologically grounded basis for future precision-treatment studies.

2. Method

2.1. Participants

We used two rs-fMRI datasets in the current study. The “principal dataset” included data from 1604 participants (848 MDDs and 794 HCs from 17 sites; Table 1) and the “independent validation dataset” included data from 446 participants (177 MDDs and 269 HCs from 4 sites; Table 1). The most data of “principal dataset” and “independent validation dataset” can be downloaded publicly from the DIRECT consortium (http://rfmri.org/REST-meta-MDD) and SRPBS dataset (https://bicr-resource.atr.jp/srpbsfc/). The demographic characteristics of participants in both datasets across different sites are shown in Fig 1. Additional demographic details are provided in S1 Fig and S2 Fig, while the data acquisition parameters for the principal dataset are presented in Table C in S1 Text.

Table 1: Demographic characteristics of participants in both datasets.

Demographic characteristics of participants in both datasets across different sites.The principal dataset contained 17 sites, while independent validation dataset contained 4 sites. (A) Total number of subjects for each contributing site in both datasets including the principal dataset (left side) and independent validation dataset (right side). (B) Number of female participants and male participants for each site in both datasets including the principal dataset (left side) and independent validation dataset (right side). (C) Age (in years) for all subjects per site in both datasets including the principal dataset (above side) and independent validation dataset (below side).

The principal dataset was provided from the DIRECT consortium which was launched in 2017 aiming to pooling neuroimaging data collected from multiple independent sites to boost the statistical power of data analysis and finally facilitate the clinical application of findings from neuroimaging studies. DIRECT consortium provided one of the largest MDD dataset (http://rfmri.org/REST-meta-MDD). Investigators from more than twenty-five research groups including our own had shared resting-state fMRI data. In the current study, 1604 participants (848 MDDs and 794 NCs) were selected according to exclusion criteria from a previous study [23] (S1 Fig).

The independent validation dataset was provided from the Japanese Strategic Research Program for the Promotion of Brain Science (SRPBS) dataset which included neuroimaging data of 2414 psychiatric patients and healthy controls collected from eight Japanese universities and institutes. In this study, we selected 446 participants (117 MDDs and 269 NCs) from 4 sites including Hiroshima University Hospital (HUH), Hiroshima Rehabilitation Center (HRC), Hiroshima Kajikawa Hospital (HKH) and Hiroshima COI (COI).

2.2. Preprocessing

Standard preprocessing of the principal dataset and independent validation dataset were done using the Data Processing Assistant for Resting-State fMRI (DPARSF, http://rfmri.org/DPARSF/) and Graph Theoretical Network Analysis (GRETNA, https://www.nitrc.org/projects/gretna), which were based on Statistical Parametric Mapping (SPM, https://www.fil.ion.ucl.ac.uk/spm/software/spm12/). Standard preprocessing included discarding the initial 10 volumes, slice-timing correction, head motion correction, space normalization and temporal bandpass filtering (0.01-0.1 HZ). Statistical corrections removed effects of head motion, white matter and cerebrospinal fluid signals, as well as linear trends. Here, we adopted the Friston 24-parameter model to regress out head motion effects. After fMRI preprocessing, the brain was parcellated into 116 ROIs (regions of interest) according to the AAL (automated anatomical labelling) atlas [24]. The mean time courses of all the 116 ROIs were then extracted. We evaluated functional connectivity (Fisher’s r-to-z transformed Pearson’s correlation) between all ROI pairs. Therefore, a 116 × 116 symmetric matrix (functional connectivity network) F was obtained for each subject.

2.3. Control of site differences and covariates

To mitigate site differences and confounding effects, we adopted ComBat harmonization method [25–28] to control for site differences and covariates in function connectivity. This method enabled us to preserve biological variability and subtract the variation introduced by site. We assumed that the fMRI data came from m completely different multi-sites, including each ni participants for i = 1, 2, …, m. For each functional connectivity, Combat model can be written as follows:

[eqn]

in which connectivityij represents the FC value for the participant j at site i ( $[eqn]$ ) and const is the average FC value across all subjects from all sites, X is a design matrix for the covariates of interest ( $[eqn]$ , p is the number of covariates including gender, age, education and head motion in the principal dataset, while p includes gender, age and head motion in the replication dataset), $[eqn]$ is the vector of coefficients associated with X ( $[eqn]$ ), We further assume that the residual terms $[eqn]$ follow a normal distribution with mean zero and the terms $[eqn]$ and $[eqn]$ represent the additive and multiplicative site effects of site i.

Next, the Combat estimates $[eqn]$ and $[eqn]$ of the site effect parameters using uses Empirical Bayes. The final ComBat-harmonized functional connectivity is defined as:

[eqn]

in which $[eqn]$ represents the estimated average FC values. More detailed information has been previously described [25–27].

2.4. Problem definition

Given a set of weighted brain networks $[eqn]$ , the model outputs corresponding graph labels $[eqn]$ . Specifically, brain functional graph $[eqn]$ mainly contains the graph adjacency $[eqn]$ characterizing the structure of graph and the node feature matrix $[eqn]$ characterizing the feature of each node. Here, nodes are defined as regions of interest (ROIs) based on brain parcellation. To create $[eqn]$ , we binarized FC matrix by transforming only the top 20-percentile absolute correlation values into ones, and the rest were transformed into zeros. As for the node feature $[eqn]$ , specifically $[eqn]$ for node k, it is defined as $[eqn]$ , where $[eqn]$ is the Pearson’s correlation coefficient for node $[eqn]$ and node $[eqn]$ . Fig 2A illustrates the pipeline from resting-state raw fMRI data to brain functional graphs. Generally, $[eqn]$ is defined as the number of subjects and $[eqn]$ is defined as the number of regions of interest (ROIs).

Data analysis pipeline of EH-BrainGNN.Data Preprocessing: Raw fMRI data were preprocessed and then mean time series of regions of interest (ROIs) were extracted from brain parcellation. Subsequently, the functional connectivity (FC) matrices were calculated using Pearson correlation between ROIs. From the FC we constructed the brain functional graph G=(A,X), where A was the graph adjacency matrix and X is node feature matrix. In this study, A was a binarized FC matrix where only the top 20-percentile absolute values of FC matrix were transformed into ones, while the rest were transformed into zeros. For node feature X, Xk for node k was an entire row of FC matrix. MDD Classification: We first learn subgraph mask M to generate explanation subgraph Gsub=G⨀M, ⨀ is element-wise multiplication. Then the graph encoder was employed to learn graph embeddings Z and Zsub from G and Gsub, respectively. Finally, we used Zsub to predict label Y. MDD Subtyping: The learned Zsub is used to extract subject similarity matrix by Pearson correlation between graph embeddings of every two participants. Then the subject similarity matrix was sent to the clustering by fast search and find of density peaks (CFDP) method to generate MDD subtypes.

2.5. Overall framework of EH-BrainGNN

EH-BrainGNN follows a two-stage pipeline that decouples supervised representation learning from downstream subtyping (Fig 2 B and C and Fig 3). In the first stage, the model is trained to classify MDD versus healthy controls and, in doing so, learns subject-level graph representations. In the second stage, these learned graph-level features are used for unsupervised clustering to identify reproducible MDD subtypes. Importantly, the subtyping procedure operates on fixed embeddings and does not affect classifier training. To improve transparency, we provide a schematic of the complete deep network architecture and report the number of trainable parameters for each module and in total (Fig 3). Parameter counts are computed as the total number of trainable weights and biases in each learnable layer, whereas parameter free operations such as activation, dropout, pooling, and flattening are counted as zero.

A schematic of the full network with parameter counts (A) and a clear statement of the role of the subgraph generator (B).

2.5.1. Stage 1 Supervised representation learning.

Given an input functional connectivity graph $[eqn]$ , EH-BrainGNN learns discriminative representations under the supervised classification objective. This stage comprises three tightly coupled components.

(1)Subgraph generator (explainer regularizer). Subgraph generator learns a probabilistic edge mask and samples a discriminative subgraph $[eqn]$ from the original graph $[eqn]$ . The edge mask is optimized jointly with the classifier, encouraging the model to focus on task relevant connections and providing an interpretable set of edges that contribute most to the prediction.(2)Graph encoder. The graph encoder processes either the full graph $[eqn]$ and the sampled subgraph $[eqn]$ to produce node level embeddings and a compact graph level embedding via bilinear mapping second order pooling. The graph level embedding is then fed to a classifier head to predict the diagnostic label.(3)Mutual information estimator. To prevent the subgraph generator from selecting spurious or overly sparse edges, we introduce a mutual information-based constraint between $[eqn]$ and $[eqn]$ . This term encourages $[eqn]$ to preserve informative content from the original graph while remaining compact and discriminative. The subgraph generator and graph encoder are trained end to end with the combined classification loss and mutual information regularization.

2.5.2. Stage 2 Downstream subtyping.

After training, we extract the graph level embeddings for all MDD subjects and perform clustering to derive subtypes. Because embeddings are produced by the supervised model, the resulting subtypes reflect heterogeneity in connectivity patterns that are relevant to MDD discrimination. This subtyping module is applied post hoc and does not alter the learned classifier.

2.6. Module A: Subgraph generator

The procedure of subgraph generator module is shown in Fig 3B. We firstly learned edge mask $[eqn]$ for each subject to extract the most informative or interpretable subgraph $[eqn]$ from $[eqn]$ . Specifically, the explanation subgraph $[eqn]$ is induced by M, where $[eqn]$ , $[eqn]$ is element-wise multiplication.

To ensure the gradient, w.r.t., $[eqn]$ is computable, We learned $[eqn]$ using the reparameterization trick [29,30], where each element of M represented the probability of each edge existing $[eqn]$ . Here, $[eqn]$ could be calculated as:

[eqn]

where $[eqn]$ denotes the sigmoid function, $[eqn]$ is the learned parameter, and $[eqn]$ is a temperature for the concrete distribution.

Furthermore, to encourage the compactness of the explanation and the discreteness of M, we adopted Lsps and Lent [31], respectively:

[eqn]

2.7. Module B: Graph encoder

Graph encoder was mainly consisted of Graph Isomorphism Network (GIN) [32] encoder and the bilinear mapping second order pooling [33] (see Fig 3A). GINs use the $[eqn]$ and $[eqn]$ to learn the node embeddings $[eqn]$ through a neighborhood aggregation strategy. Specifically, we iteratively updated the representation of a node $[eqn]$ by aggregating and combining the representations of neighboring nodes. In the first iteration, we initialized $[eqn]$ , and $[eqn]$ was the set of neighboring nodes $[eqn]$ . At the $[eqn]$ -th iteration, $[eqn]$ are inputs and $[eqn]$ are outputs. Formally, the $[eqn]$ -th layer of a GIN is

[eqn]

where the multi-layer perceptron (MLP) is used to model and learn node representations, and $[eqn]$ is a learnable parameter.

After obtaining the node embeddings, global graph pooling was applied to gain graph embeddings $[eqn]$ from node embeddings $[eqn]$ . Recent work has proposed a novel global graph pooling methods based on second-order pooling for graph classification tasks (bilinear mapping second order pooling) [33]. Second-order pooling (SOPOOL) is described as:

[eqn]

SOPOOL is able to effectively capture the correlations among features, and the encoded topology information. In contrast to conventional graph pooling approaches, bilinear mapping-based second-order pooling leverages information from all nodes, captures second-order statistical dependencies, and achieves parameter efficiency by substantially reducing the number of trainable variables.

Here, we used bilinear mapping second order pooling to obtain graph representations or graph embeddings $[eqn]$ .

S2 Fig (b) shows an illustration of generating graph embeddings. First, we reduced the dimensionality of $[eqn]$ using a linear mapping:

[eqn]

where $[eqn]$ is a trainable matrix representing a linear mapping. Then we flatten the matrix and obtain a graph embedding:

[eqn]

Note that graph embedding $[eqn]$ is an $[eqn]$ -dimensional vector.

2.8. Module C: Mutual information estimator

We introduced the information bottleneck principle to obtain robust and compact subgraph structure. Specifically, we minimize the mutual information between the original graph G and the explanation subgraph $[eqn]$ and maximize the mutual information between subgraph Gsub and label Y. Formally, the optimization framework is formulated as:

[eqn]

where MI denotes the mutual information and M is an edge mask.

Based on a previous study [34], $[eqn]$ could be interpreted as a classic cross-entropy loss. Maximizing $[eqn]$ encourages $[eqn]$ is most predictable to graph label Y. Mathematically, it could be defined as:

[eqn]

in which $[eqn]$ is the variational approximation to the true mapping $[eqn]$ from $[eqn]$ to Y.

In order to minimize $[eqn]$ , we first extract graph embeddings $[eqn]$ and $[eqn]$ from the original graph $[eqn]$ and its subgraph $[eqn]$ using the graph encoder. According to sufficient encoder assumption [35] that the information of Z was lossless in the encoder precess, we further approximated $[eqn]$ with $[eqn]$ , where Zsub and Z represented graph embeddings of Gsub and G obtained from the graph encoder, respectively. Thus, Eq. (9) was changed as:

[eqn]

where N represents the number of participants, $[eqn]$ is label for the n-th subject, $[eqn]$ is the prediction of model for the n-th subject.

In addition, we used the recently proposed matrix-based R $[eqn]$ nyi’s $[eqn]$ mutual information [36,37] to directly estimate $[eqn]$ . It’s mathematically well-defined and computationally efficient, and dosen’t require an additional neural network unlike the mutual information neural estimator (MINE) [38]. R $[eqn]$ nyi’s $[eqn]$ entropy, defined via the normalized eigen spectrum of the Gram matrix corresponding to the data projected onto a reproducing kernel Hilbert space (RKHS), allows for direct estimation of $[eqn]$ , $[eqn]$ and $[eqn]$ from the data itself. This approach eliminates the need for discrete probability density function (PDF) estimation or the use of auxiliary neural networks.

According to Shannon’s chain rule, $[eqn]$ can be decomposed as:

[eqn]

where $[eqn]$ indicates the entropy of Zsub, $[eqn]$ indicates the entropy of Z and $[eqn]$ represents the joint entropy between Zsub and Z. The specific calculation process is directly given by the definitions.

Definition 1. Let $[eqn]$ be a real valued positive definite kernel that is also infinitely divisible. Given $[eqn]$ , each $[eqn]$ can be a real-valued scalar or vector, and the Gram matrix $[eqn]$ calculated as $[eqn]$ , a matrix-based analogue to Rényi’s $[eqn]$ -entropy can be given by the following functional:

[eqn]

in which $[eqn]$ . A is the normalized K, i.e., $[eqn]$ . $[eqn]$ denotes the i-th eigenvalue of A.

Definition 2. Given a set of $[eqn]$ samples $[eqn]$ , each sample includes two measurements $[eqn]$ and $[eqn]$ obtained from the same realization. Given positive definite kernels $[eqn]$ and $[eqn]$ , a matrix-based analogue to Rényi’s $[eqn]$ -order joint-entropy can be defined as:

[eqn]

where $[eqn]$ , $[eqn]$ and $[eqn]$ shows the Hadamard product between the matrices A and B.

Given $[eqn]$ in a mini-batch of $[eqn]$ samples, we first estimate two Gram metrics $[eqn]$ and $[eqn]$ associated with $[eqn]$ and $[eqn]$ , respectively. According to Definitions 1 and 2, the entropy and joint entropy terms in Eq. (13) and (14), such as $[eqn]$ and $[eqn]$ , can be simply computed over the eigen spectrum of $[eqn]$ and $[eqn]$ . Here, we use the radial basis function (RBF) kernel $[eqn]$ to obtain $[eqn]$ and $[eqn]$ :

[eqn]

and for the kernel width $[eqn]$ , we estimate the mean of the $[eqn]$ (where $[eqn]$ ) nearest distances for each sample.

Thus, the final loss function of EH-BrainGNN is described as:

[eqn]

where $[eqn]$ , $[eqn]$ and $[eqn]$ are the hyper-parameters.

2.9. Subtyping analysis

We proposed a unsupervised model with the clustering by fast search and find of density peaks (CFDP) [39] to automatically identify the depression subtypes using the subgraph embedding signatures (Fig 2C). For the clustering procedure, we first computed the Pearson correlation $[eqn]$ between graph embeddings of every two participants. Then a distance $[eqn]$ between every pair of participants was defined as $[eqn]$ . Finally, we used CFDP method using the $[eqn]$ as the input to obtain the depression subtypes.

Next, we describe the rules used with CFDP to select cluster centers. For each subject, CFDP computes a local density $[eqn]$ and a distance to the nearest point with higher density $[eqn]$ . We plot $[eqn]$ against $[eqn]$ to obtain the decision graph and define a combined prominence score:

[eqn]

Points with large values of both $[eqn]$ and $[eqn]$ correspond to high $[eqn]$ and are treated as candidate cluster centers. The decision graph identifies the prominent corner region with high density and large distance, and select as centers those points with the largest $[eqn]$ . In addition, to validate the reliability of number of subtypes, we added cluster quality analyses across a range of candidate values of $[eqn]$ . Using the learned graph embeddings and the corresponding CFDP assignments, we compute the silhouette coefficient and Dunn index for different choices of $[eqn]$ .

To obtain edge-level connectivity patterns of each subtype, we implemented a three-step procedure.

(1)Computation of subject-level explanation masks. For each subject $[eqn]$ , we consider the input graph $[eqn]$ with functional connectivity edges $[eqn]$ . Using the subgraph generator module of our framework, we obtain an edge-wise explanation mask for subject $[eqn]$ , that assigns to each edge a non-negative importance score

[eqn]

which quantifies how strongly edge $[eqn]$ contributes to the model’s prediction for that subject.

(2)Subject-level edge selection (top-20 edges). Within each subject, we rank all edges in descending order of their explanation scores $[eqn]$ . Let $[eqn]$ denote the $[eqn]$ edges with the largest mask values for subject $[eqn]$ , i.e.,

[eqn]

with $[eqn]$ in all experiments. We then define the subject-level top- $[eqn]$ edge set as

[eqn]

which provides a sparse, subject-specific summary of the most influential connections.

(3)Aggregation across subjects within each subtype. Let $[eqn]$ index the CFDP-derived subtypes, and

[eqn]

denote the set of subjects in subtype $[eqn]$ . For each edge $[eqn]$ in the common edge index space, we compute a selection frequency:

[eqn]

where $[eqn]$ denotes the indicator function, which equals 1 if the condition is satisfied and 0 otherwise.

To assess the stability of CFDP-derived subtypes, we conducted bootstrap resampling with 1,000 iterations. In each iteration, subjects were sampled with replacement and the CFDP pipeline was re-run on the learned embeddings using identical hyperparameters, yielding a bootstrap clustering. Agreement between the reference clustering and each bootstrap clustering was quantified using the adjusted Rand index (ARI) and normalized mutual information (NMI).

We further evaluated the robustness of subtype-associated edges via 1,000 bootstrap resamples. For each resample, we recomputed the subject-level top-20 edges, aggregated them within each subtype to generate resampled subtype-level top-20 sets, and compared these sets to the corresponding reference sets using the Jaccard index.

To determine whether the observed clustering stability and edge-selection frequencies could arise from random variation, we performed a non-parametric label-permutation test (1,000 iterations). In each iteration, subtype labels were randomly permuted across subjects and the full analysis pipeline was repeated: explanation masks were re-aggregated and the edge-selection procedure was re-applied to construct null distributions of edge-selection frequencies, and the CFDP clustering was re-run to obtain null distributions for ARI and NMI. Empirical p-values were derived from these null distributions and adjusted for multiple comparisons using the Benjamini–Hochberg false discovery rate (FDR) procedure.

2.10. Brain-symptoms associations in different subtypes

Brain-symptoms associations were investigated using leave-one-out ridge regression models. On the brain side, graph embeddings for each patient were used to predict clinical symptoms. Principal component analysis (PCA) was used to reduce the dimension of graph embeddings and nine components which explained 95% of variance were retained. On the symptoms side, there were 7 significantly different clinical profiles for depression symptoms. For each ridge regression analysis, we built only one model to examine the relationship between graph embeddings and one clinical symptom score. In each loop, we optimized the regularization coefficient lambda to minimize prediction error through identifying the value of lambda between 10^-5^ and 10^5^. Furthermore, the performance of leave-one-subject-out ridge regression models was quantified using the coefficient of determination (r^2^), which was defined as:

[eqn]

where $[eqn]$ is the true data, $[eqn]$ is prediction data and $[eqn]$ is the average of true data. To further characterize predictive performance, we have also added the mean absolute error (MAE) as a complementary metric to r².

Permutation-based significance testing was performed for all cross-validated r² estimates. For each predictive model, we generated an empirical null distribution of r² values (1,000 iterations) by randomly permuting symptom scores and rerunning the full analysis pipeline, while keeping the functional features, data splits and model hyperparameters identical to those used in the original analysis. The observed cross-validated r² was then compared against this null distribution to derive an empirical permutation p value. To account for multiple testing across symptom dimensions, p values were adjusted using false discovery rate (FDR) correction.

2.11. Baselines

For comparison, we evaluate the performance of our EH-BrainGNN against four traditional psychiatric diagnostic classifiers including support vector machines (SVM) with linear and RBF kernel [40], LASSO [41] and random forest (RF) [42]; three baseline GNNs including GCN [43], GAT [44] and GIN [32]; two state-of-the-art (SOTA) GNN explainers including DIR-GNN and ProtGNN [45]; and three GNNs designed for brain networks including BrainGNN [46], CI-GNN [20] and BrainIB [21].

2.12. Training, testing and hyperparameter optimization

We evaluated model performance using cross-validation [47] and independent external validation, and report the area under the receiver operating characteristic curve (AUC) and accuracy across folds. For cross-national validation, to provide a balanced assessment of classification performance, we additionally report the F1 score and sensitivity alongside balanced accuracy. For primary experiments, we adopted an outer 10-fold cross-validation scheme to split the dataset into training and held-out test folds. Within each outer fold, all preprocessing and model fitting steps were performed using only the outer training data, and the resulting model was then evaluated on the corresponding held-out fold.

Hyperparameters were optimized using a nested cross-validation procedure to ensure unbiased performance estimation. Specifically, within each outer training split, we conducted an inner 5-fold cross-validation to select hyperparameters. For each candidate configuration defined in Table 2 (including learning rate, batch size, number of layers, hidden units, weight decay, dropout rate and other model-specific settings), the model was trained on four inner folds and evaluated on the remaining fold, iterating over all inner folds. The configuration that maximized mean inner-fold accuracy was selected. Using this fixed configuration, the model was then retrained on the full outer training set and evaluated on the outer test fold. Early stopping was applied with a patience of 10 epochs, monitoring validation accuracy on an internal validation split drawn exclusively from the outer training data. At no stage were outer test folds or the independent validation cohort used for hyperparameter tuning, model selection or early-stopping calibration.

Table 2: Range of hyper-parameters and final specification for EH-BrainGNN.

Model training used the Adam optimizer with a learning rate of 0.001, a dropout rate of 0.5 and weight decay of 0.0005, unless otherwise specified. The GIN architecture comprised five layers and was trained with a batch size of 64. For EH-BrainGNN, hyperparameters were selected either by grid search within the nested cross-validation framework or set according to recommended values from prior work; the search ranges and final specifications are reported in Table 2.

To prevent information leakage during site harmonization, ComBat parameters were always estimated exclusively from training data within each evaluation split and then applied to held-out data without refitting. For the 10-fold cross-validation experiments, ComBat was fitted on the outer training data within each fold and applied to the corresponding held-out fold. For leave-one-site-out (LOSO) cross-validation, ComBat was fitted using data from the training sites only and then applied to the left-out site; no information from the held-out site, including its distributional characteristics, was used when fitting the harmonization model. For independent cross-national validation, ComBat was fitted solely on the Chinese training sites, and the learned parameters were subsequently applied to the Japanese cohort, which was treated as a strictly unseen target domain; no ComBat parameters were estimated using any Japanese labels or Japanese distributional information.

3. Results

3.1. Generalization performance: Ten-fold cross-validation

To assess the performance, we conducted 10-fold cross-validation procedure [47] in both datasets. Furthermore, we used ComBat harmonization method [25–28] to mitigate distinct site differences and artifact or confounding effect. Table 3 demonstrate the classification performance for different models. Using 10-fold cross-validation, EH-BrainGNN achieved an accuracy of 74% in the principal dataset and an accuracy of 73% in the independent validation dataset. Compared with other methods, EH-BrainGNN yielded significant and consistent improvements in both datasets, suggesting that the model had formidable discriminatory ability. To assess the statistical significance of performance differences, we compared EH-BrainGNN against the next-best method using paired tests across the 10 cross-validation folds (fold-wise accuracy as paired observations). The results are shown in Fig 4 A-B. For the principal dataset, EH-BrainGNN was compared to BrainIB in terms of accuracy: t = 4.3529, p = 0.0018, Cohen’s d = 1.3765, with a bootstrap 95% confidence interval (CI) of [0.7998, 2.9158]. For the independent validation dataset, EH-BrainGNN was compared to CI-GNN (accuracy): t = 3.1623, p = 0.0115, Cohen’s d = 1.0000, with a bootstrap 95% CI of [0.4454, 2.4244]. In addition to the high classification performance of EH-BrainGNN, the framework provided an information-condensed signature (graph embeddings) which yielded meaningful binary discrimination between depressed patients and healthy controls (S2 Fig A-B).

Table 3: Tenfold cross validation performance on different models in both datasets. The highest performance is highlighted with boldface.

Generalization performance of EH-BrainGNN.(A) Tenfold cross validation performance on different models in principal dataset. (B) Tenfold cross validation performance on different models in replication dataset. (C) The leave-one-site-out cross-validation performance ranking of different models based on their accuracy. (D-E) Generalization Performance of the classifier for independent cohorts.

3.2. Generalization Performance: Leave-one site out cross-validation

To assess the generalizability of classification model to unseen data collected at completely different sites, a leave-one-site-out analysis was performed by dividing the dataset into the training set (16 sites out of 17 sites) to train the model, and the testing set (the left site out of 17 sites) for testing a model (S3 Fig) in the principal dataset. A ComBat harmonization method [25–28] for training set was applied to mitigate distinct site differences and artifact or confounding effect. Then we fitted a model to each site-sample resulting 17 MDD classifiers. We then calculated the area under the curve (AUC) and accuracy for every classifier which were shown in Table 4. And we evaluated the leave-one-site-out cross-validation performance ranking of different models based on their accuracy (Fig 4C).

Table 4: Leave-one-site-out cross validation performance on different models in principal dataset. The highest performance is highlighted with boldface.

As can be seen, EH-BrainGNN achieved the highest average generalization accuracy across sites (73%) compared with the baseline models. Improvements were broadly consistent across sites, indicating better generalization under site-specific variability. Moreover, from the performance rankings across different models, our model consistently maintained the top position across most sites. Although it ranked fourth at site 11, the accuracy reached 0.84, which is still competitive, with the highest accuracy across models being 0.88. This discrepancy may be attributed to the relatively older average age of participants at this site, where the mean age of MDD patients is 46 years.

3.3. Generalization Performance: Generalization of the classifier for independent cohorts

To further test the generalizability of EH-BrainGNN, we used Chinese-population-based data to construct classifier to predict Japanese-population-based data. To be specific, we used data from the principal dataset to train the model, and applied trained classifier to the independent validation dataset (data for each site). Table 5 and Fig 4D-E demonstrate generalization performance in HUH, HRC, HKH and COI from the independent validation dataset. EH-BrainGNN demonstrated a generalization accuracy of 73% and an AUC of 72% at HUH, with an absolute improvement of over 16% in accuracy compared to RBF-SVM. More importantly, EH-BrainGNN achieved the highest performance in terms of multiple metrics, including accuracy, sensitivity, and F1-score, across most sites (Table 5). Specifically, in HRC, HKH, and COI, EH-BrainGNN outperformed all baseline models, showing significant improvements in both accuracy and AUC. At HUH, the model achieved an accuracy of 73%, sensitivity of 79%, and F1-score of 76%. At HRC, the accuracy was 74%, sensitivity 86%, and F1-score 83%. For HKH, EH-BrainGNN yielded an accuracy of 66%, an AUC of 67%, and a sensitivity of 72%, while in COI, the model achieved a balanced accuracy of 66%, an AUC of 63%, and sensitivity of 75%. In contrast, other models like RBF-SVM, CI-GNN and BrainIB showed lower generalization performance across these sites, with accuracy values mostly below 70%, and poor sensitivity and F1-scores in certain cases. These results suggested that a reliable classifier that we developed by only using the training data obtained from China, could be successfully applied to classify MDD and HCs in the Japan validation cohorts. Generalization performance across different models on the independent validation dataset, trained on the principal dataset. The best-performing result is highlighted in bold (Fig 5).

Table 5: Generalization performance on different models in the independent validation dataset using the principal dataset. The highest performance is highlighted with boldface.

Determination of the optimal number of CFDP subtypes.(A-B) CFDP decision graphs for the two datasets. (C-D) Two dimensional embeddings illustrating the spatial distribution of subjects colored by CFDP subtype. (E-F) Silhouette coefficients as a function of the number of clusters (K=2–9). (G-H) Dunn indices for K=2–9.

3.4. Subtyping Analysis: Graph embeddings define three depression subtypes

According to our overall frameworks, we further adopted unsupervised learning to these high-dimensional signatures obtained from deep-learning model to identify subtypes of MDD. To ensure that cluster discovery was not confounded by the data distribution and the number of clusters, we employed a data-driven method (clustering by fast search and find of density peaks, CFDP) which automatically identified the number of clusters and MDD subtypes. The decision graph and subtype distribution are shown in Fig 6A, C. Three depression subtypes were identified, comprising 356 (43.0%), 226 (27.3%) and 246 (29.7%) patients, respectively. To assess clustering robustness, we computed the silhouette coefficient and Dunn index across a range of candidate subtype numbers (Fig 6E, G). Both indices consistently supported a three-subtype solution.

Visualization of distinct patterns in different subtypes of MDD in principal dataset and independent validation dataset, respectively.The colors of brain neural systems are described as: visual network (VN), somatomotor network (SMN), dorsal attention network (DAN), ventral attention network (VAN), limbic network (LIN), frontoparietal network (FPN), default mode network (DMN), cerebellum (CBL) and subcortical network (SBN), respectively.

To ensure that subtypes were biologically meaningful, we compared HAMD total scores and each item of HAMD using two-sided Wilcoxon rank-sum tests. Multiple comparisons across items were controlled using the Benjamini–Hochberg false discovery rate (FDR) procedure, with significance assessed at a preset threshold of q < 0.05. We observed that HAMD total scores were significantly different between subtype1 and subtype3 (p = 0.029, d = 0.41, 95%CI = [0.04, 0.77], Fig 7B). In addition, seven symptom items of HAMD showed significant differences (Fig 7A) including depressed mood (subtype2 vs. subtype3: p = 0.027, d = 0.29, 95% CI =[0.11, 0.47]), insomnia-early (subtype1 vs. subtype3: p = 0.010, d = 0.61, 95% CI = [0.45, 0.78]; subtype2 vs. subtype3: p = 0.018, d = 0.43, 95% CI =[0.24, 0.61]), insomnia-middle (subtype2 vs. subtype3: p = 0.007, d = 0.31, 95% CI = [0.13, 0.50]), agitation (subtype2 vs. subtype3: p = 0.026, d = 0.32, 95% CI = [0.13, 0.50]), genital (subtype2 vs. subtype3: p = 0.026, d = 0.47, 95% CI = [0.28, 0.65]), weight loss (subtype2 vs. subtype3: p = 0.048, d = 0.24, 95% CI = [0.06, 0.42]) and insight (subtype2 vs. subtype3: p = 0.026, d = 0.30, 95% CI = [0.12, 0.48]).

Graph embeddings define depression subtypes.(a) Comparison of clinical profiles for depression symptoms (HAMD-17) among three depression subtypes using Wilcoxon rank-sum tests with FDR correction on the principal dataset. (b) Comparison of HAMD total scores among three depression subtypes using Wilcoxon rank-sum tests with FDR correction on the principal dataset. (c) Comparison of BDI-II scores among three depression subtypes using Wilcoxon rank-sum tests on the independent validation dataset. (d) The overlap and distinction of brain patterns and symptom profiles within three subtypes in principal dataset and independent validation dataset.

Furthermore, we further validated the consistency and reproducibility of our identified MDD subtypes in the independent validation dataset. Interestingly, we consistently identified three MDD subtypes in the independent validation dataset (Fig 6B, D, F, H). Three depression subtypes were identified, comprising 43 (24.3%), 60 (33.9%) and 74 (41.8%) patients, respectively. A trend towards a reduced pattern of BDI scores was observed between subtype 2 and subtype 3 (BDI scores of subtype 2 = 27.21 ± 10.69, and BDI scores of subtype 3 = 30.05 ± 9.79, respectively; p = 0.066, d = -0.28, 95% CI = [-0.67, 0.10]), but no statistically significant difference was found (Fig 7C).

3.5. Subtyping Analysis: Subtype-specific brain connections

To illustrate specific brain patterns in each subtype of MDD, we compared subgraphs obtained from explanation generator among three subtypes (Fig 7). The consistent subtype-specific brain connections on both datasets are summarized in Fig 8d and Table D in S1 Text, and we have the following observations:

(1)We observed that connection between left and right thalamus overlapped among three subtypes, suggesting that this connection played an essential role in MDD diagnosis no matter subtype of MDD. This result is consistent with a recent study [48], where they found that MDD classifications were driven by stronger thalamic connections and the connection between the left and right thalamus is one of the most important connection for classification MDD vs HC.(2)Common diagnostic features existed among different subtypes of major depression and identified 3 common connections between a) left and right cerebellum.6 between subtype1 and subtype2; b) left and right paracentral lobule between subtype1 and subtype3; c) left and right cuneus between subtype2 and subtype3.(3)Different subtypes of MDD also showed distinct brain patterns. Specifically, MDD subtype 1 was characterized by connections within the DMN (precuneus)-Cerebellum network which were associated with abnormal insomnia-early. Previous study has reported that insomnia is associated with impaired disengagement of brain regions involved in self-referential processes (precuneus) [49]. Brain patterns of MDD patients of subtype 2 mainly encompassed connections within Insula-Cingulum-Temporal network, where connections between left and right insula which was involved in socio-emotional processing and general attention [50], were associated with reduced depressed mood and insight in subtype 2. MDD subtype 3 was characterized by connections within the frontostriatal network involved in reward processing and action initiation [51–53], including connections between 1) left and right putamen; 2) left and right caudate; 3) left superior frontal cortex and left superior medial frontal cortex. These connections were associated with increased insomnia, agitation and weight loss.

3.6. Subtyping Analysis: Stability and significance of subtype-specific patterns

To assess the stability of CFDP-derived subtypes, we performed bootstrap resampling (1,000 iterations) and quantified agreement with the original solution using the adjusted Rand index (ARI) and normalized mutual information (NMI). Subtype assignments showed substantial reproducibility (ARI: mean = 0.523, 95% CI = [0.136, 0.863]; NMI: mean = 0.543, 95% CI = [0.198, 0.809]). To determine whether this stability exceeded chance, we compared the observed stability metrics with a null distribution generated from random cluster assignments matched for subtype number and size. The observed distributions were markedly shifted relative to null, yielding large effect sizes (Cohen’s d for ARI = 3.13; Cohen’s d for NMI = 4.38), indicating that CFDP-derived subtypes were substantially more reproducible than expected under random clustering.

Subtype-associated edge sets showed moderate-to-high stability across bootstrap resamples (S4 Fig A). The mean Jaccard overlap was 0.433 (95% CI [0.250, 0.600]) for subtype 1, 0.405 (95% CI [0.176, 0.667]) for subtype 2 and 0.335 (95% CI [0.143, 0.538]) for subtype 3. As both the reference and bootstrap-derived sets comprised 20 edges, these overlaps correspond to ~10–12 shared edges on average, indicating that more than half of the reported edges were consistently recovered across resamples, whereas the remainder formed a more variable peripheral set. The overlaps were well above chance (random top-20 selection yields Jaccard values close to zero), supporting a reproducible core of subtype-specific connections. Consistently, many reference edges showed high bootstrap selection frequencies (> 0.5): 19/20 edges for subtype 1, 16/20 for subtype 2 and 9/20 for subtype 3 (S4 Fig B-D).

Permutation testing further indicated that both the subtype-specific top-20 edge patterns and the CFDP clustering stability metrics were highly unlikely under a random-label null model (empirical p < 0.0001 across 1,000 permutations).

3.7. Correspondences between functional signature and MDD symptoms

To further investigate the association between patterns of discriminative functional signature and clinical symptoms, we used graph embeddings of subgraph and significantly different clinical profiles for depression symptoms (HAMD) to generate GE-HAMD models using leave-one-subject-out ridge regression. We present the r^2^ (95% CI) and MAE for each subtype in Table 6, along with scatter plots showing predicted versus observed symptom scores in S5 Fig. Our findings indicate that graph embeddings serve as strong predictors of clinical indicators across all subtypes, with R² values greater than 0.3 for each subtype. Additionally, symptom profiles in subtype 3 were better predicted by graph embeddings, which aligns with the significantly enhanced symptom profiles observed in subtype 3. To further assess the robustness of our results, we performed a permutation test (Table 6), which showed that, after multiple comparison correction, the p-values were less than 0.0001.

Table 6: Subtype-specific clinical profiles for significantly different depression symptoms. The highest performance is highlighted with boldface.

3.8. Validation analysis

We validated our results by considering several potential confounding factors.

(1)we added an additional analysis to investigate the extent of head motion contribution to predictability with/without age gender effects. Specifically, we conducted a ten-fold cross validation under four different settings in the principal dataset. We observed the performances of classifiers did not show any significant changes under different settings. These results suggested that it is the neural basis rather than the artifact or confounding that contribute to classification (Table 7).(2)we have added an additional experiment to investigate the effect of ROI selection. Specifically, we repeated the entire 10-fold cross-validation procedure using the most popular brain atlas (Power Atlas 49) including 264 brain regions. We observed that model based on Power atlas outperformed model based on AAL atlas for all approaches except random forest, suggesting that model performance was positively correlated with the number of ROIs (Table 8).(3)we repeated leave-one-site-out analysis without harmonization and compared prediction performances to assess effects of Combat harmonization method. We found significant improvements in the prediction performances using harmonization for some sites (Table 9). To ensure that harmonization did not remove disease-relevant effects, we performed a case–control diagnostic comparing MDD versus HC functional connectivity before and after ComBat. As shown in S6 Fig, the spatial distribution and overall pattern of case–control differences remained highly consistent, indicating that the biological signal was preserved. Moreover, the change in effect sizes was minimal, with the absolute difference |∆d| between pre- and post-ComBat estimates being less than 0.08 across edges.(4)We conducted an ablation study over five candidate thresholds for graph construction (10%, 15%, 20%, 25%, and 30%) across all three evaluation schemes used in this study: (i) 10-fold cross-validation, (ii) leave-one-site-out cross-validation, and (iii) training on the principal dataset with testing on the independent validation cohort. The results are summarized in S7 Fig and Table E-F in S1 Text. Across all three evaluation schemes, the 20% threshold consistently achieved the highest or near-highest mean AUC, with only minimal fluctuations in performance at adjacent thresholds.(5)we systematically evaluated the impact of GNN architecture on performance and explicitly report both cross validation and independent validation results for the requested variants. The full results are now summarized in S8 Fig and S9 Fig. Across both datasets and both evaluation schemes, the configuration with 5 layers and a 128-dimensional hidden space consistently achieved the best or near best performance, while the other settings showed only moderate fluctuations in accuracy and AUC.(6)we added a learning curve analysis in which we vary the size of the training set and plot test accuracy as a function of training sample size (S10 Fig). The results show a monotonic increase in test accuracy as the number of training subjects grows, with no evidence of abrupt degradation at larger sample sizes.(7)To test whether medication status affects the subtyping solution, we performed a sensitivity analysis in which medication was treated as a covariate. As shown in S11 Fig A–B, the analysis again yielded three subtypes, and subtype assignments were highly stable: fewer than 2% of participants changed subtype relative to the original solution. Additionally, we conducted an additional analysis restricted to medication-naïve participants and repeated the subtyping. As shown in S11 Fig C–D, this restricted sample likewise produced three subtypes, with subtype assignments differing from the full-sample solution for fewer than 2% of participants.(8)We performed an ablation study to investigate the impact of the multi-term objective function on generalization performance. As shown in Table G in S1 Text, our findings reveal that omitting any individual loss component results in a significant degradation in performance, highlighting the critical contribution of each term to reconciling sparsity with predictive accuracy while enhancing generalization.

Table 7: The effects of age, gender, head motion on ten-fold cross validation performance on different models in the principal dataset.

Table 8: The effects of brain atlas on ten-fold cross validation performance on different models in the principal dataset.

Table 9: Effects of harmonization for the leave-one-site-out analysis.

4. Discussion

The current study could largely promote the recent attempts of MDD redefining and subtyping according to neural mechanisms beyond observable symptoms by providing a novel ensemble hybrid deep-learning framework. Deep-learning approach serves as an extension of traditional machine-learning approach [54] and can automatically learn a latent high-level information from raw input data, making this method ideally suited to studying complex, subtle and scattered brain patterns [55,56]. However, the potential advantages of deep-learning methods have not been fully exploited in neuroimaging domains. Functional connectivity profiles represent nuanced structures of functionally-connected brain networks that were overlooked by previous studies that simply flattened functional connectivity profiles into a vector. To tackle this problem, we adopted GNN which represented functional connectivity profiles as graphs, thus could reserve the inherent structure of functional networks. In addition, although some previous attempts have been made to employ deep learning studies in discriminating patients with mental disorders and healthy controls [57–60], these models are based on ‘black-box’ algorithms which prevent the identification of underlying diagnostic decisions [61,62]. Our model was self-explaining classifier which furnished identification of the structure of subgraphs that were crucial for diagnostic decisions. In order to obtain reliable and reproducible subtypes of MDD, we developed an unsupervised data-driven method which could automatically identify number of MDD subtypes.

Our classifiers achieved an average generalization accuracy of 73% for leave-one-site-out cross validation with REST-meta-MDD dataset collected from 17 independent sites. Moreover, our models achieved the highest generalization accuracy in SRPBS dataset using the REST-meta-MDD as the training data compared other baselines. Recently, there has been increasing attention on improving the generalization of the classifier. It is an emerging consensus that the generalization of a machine learning framework should be first proved before it can be considered as practical in clinical applications to ensure its reproducibility [16,18]. The validation of the generalization for a classifier was even considered as “a bare minimal requirement” for its clinical application [18]. More importantly, the generalization should be tested with independent datasets from multiple sites [16]. Frameworks that showed high generalization on the diagnosis of ASD and Alzheimer’s disease have been established [61,63,64]. As for MDD, pioneering work by Yamashita and colleagues has attempted to develop an MDD classifier using logistic regression. They achieved an average accuracy of 66% with an external dataset [16]. In the current study, we used a leave-one-site-out cross validation in REST-meta-MDD and used Chinese-population-based data to predict Japanese-population-based data, which comprehensively tested the generalization of our algorithm. Across evaluation settings, our model showed the best generalization among baselines, supporting reproducibility across sites and countries. However, with an overall generalization accuracy of 73%, the results are encouraging but not yet sufficient for clinical deployment. The classifier should therefore be viewed as a research tool, pending further prospective validation, calibration and clinical utility testing.

Another major contribution of the current study is that we consistently identified three depression subtypes across two datasets. It is well known that MDD is not a unitary disease but a heterogeneous disorder since the patients present varied symptoms and respond divergently to treatment [5], suggesting that different biological mechanisms underlie different subtypes. The importance of identifying diagnostic biomarkers for MDD subtypes has long been recognized, but effective solutions remain unclear. Few attempts have been made while mixed results were reported. Drysdale and co-authors proposed a framework to identify MDD subtypes according to resting-state functional connectivity profiles based on canonical correlation analysis (CCA) and hierarchical clustering approaches. They found four neurophysiological subtypes using canonical correlation analysis (CCA) and hierarchical clustering approaches [5]. However, another recent study only identified 2 depression subtypes with CCA and K-means clustering algorithm [65]. It remains unaddressed whether consistent and reproducible MDD subtypes could be derived from different datasets in order to provide clinically applicable biomarkers. To test this hypothesis, we adopt the data-driven CFPN algorithm which automatically obtain the number of MDD subtypes. Intriguingly, we consistently found three MDD subtypes across two datasets. In addition, the functional connectivity signatures of the three subtypes were also shown to be highly reproducible. Patients of subtype1 mainly demonstrated impairments in a brain system encompassing DMN network. Patients of subtype 2 were associated with deficits in the frontostriatal network, while the third subtype were mainly associated with the Insula-Cingulum-Temporal network. Our findings were in line with previous study by Drysdale [5], where subtype 3 and 4 showed abnormal connectivity within frontostriatal network and subtype1 and 2 were associated with impairments in cingulate areas. In addition, similar deficits in DMN and subcortical network were observed in subtype2 from previous study by Wang et al. [65]. These results suggested that patients with MDD as a group share some traits and common features of subtypes, which have potential to obtain biomarkers used for clinical guidance. However, these subtypes demonstrated overlapping without clear distinction unlike our depression subtypes in two previous studies. The current study thus extended previous studies and showed that highly consistent and reproducible MDD subtypes could be obtained with two large multisite datasets. Our findings could promote the development of personalized diagnostic and treatment strategy.

The subtype-specific connectivity motifs identified by our framework may be viewed as candidate circuit level biomarkers with potential relevance to precision treatment in major depressive disorder. Because the edge level explanations are derived directly from the predictive model, these signatures provide an interpretable mapping between latent embeddings and biologically meaningful neural circuits, rather than remaining as opaque model internal features. Building on this, the three reproducible subtypes suggest a hypothesis generating framework for individualized neuromodulation target selection. For the subtype characterized by default mode network and cerebellar coupling, cerebellar stimulation may represent a plausible candidate target. For the subtype marked by prominent insula, cingulate, and temporal involvement, targeting the insula or nodes within the salience network may be preferentially considered. For the frontostriatal subtype, the circuit profile is more consistent with conventional dorsolateral prefrontal cortex targeting, suggesting alignment with standard DLPFC based stimulation protocols. Importantly, these translational implications are hypothesis generating and require prospective interventional trials to determine whether subtype guided targeting improves treatment response. In parallel, EH-BrainGNN should be regarded as a decision support tool rather than a replacement for DSM-5 based assessment. Clinical judgment and symptom ratings would remain primary, whereas the model could provide an objective neuroimaging derived risk estimate together with interpretable circuit evidence to support clinical review. Before any real-world deployment, rigorous prospective multisite validation, calibration, and appropriate governance and regulatory oversight will be essential.

One limitation of the current study is that we could not evaluate whether different neurobiological subtypes show differential treatment response, as both REST-meta-MDD and SRPBS provide only baseline clinical and neuroimaging measures. Future prospective studies with longitudinal follow-up are needed to test whether subtype assignment predicts response to rTMS or pharmacotherapy, and to determine whether subtype informed targeting and stimulation parameters improve outcomes beyond one size fits all protocols. Several methodological uncertainties should also be considered. Our results may be sensitive to preprocessing choices, including nuisance regression (for example head motion), parcellation, and graph construction thresholds, which can influence functional connectivity estimates and downstream classification and subtyping. Moreover, scanner and protocol heterogeneity across sites may introduce residual confounding that is mitigated but not eliminated by leave one site out validation and cross cohort transfer. Finally, alternative model designs and hyperparameter settings could yield different performance and subtype structure, motivating broader benchmarking and prospective validation in independent cohorts. In addition, unmeasured factors not captured in available metadata, including cultural or environmental influences, may contribute to cross population differences; thus, claims of biological equivalence across cohorts should be interpreted cautiously.

Supporting information

S1 TextSupplementary Tables.(DOCX)

S1 FigSample Selection of REST-meta-MDD consortium.From 2428 subjects, 1604 subjects were selected through above criteria.(TIF)

S2 FigGeneration procedure of graph embeddings and t-SNE plot of FC network and graph embeddings.(A) FC networks from REST-meta-MDD cohort of 1604 participants were used as inputs and a two-dimensional plot was generated using the t-SNE, where the dark green represented the patients with MDD and the gray represented healthy controls. (B) Procedure of obtaining the graph embeddings from node embeddings in a single participant is depicted. Total_latent_dim = the number of nodes + the number of hidden units × (the number of GNN layers -1). (C) Permutation distribution of the estimate using the trained EH-BrainGNN model (repetition times: 1000) are shown, where x- and y-labels represent the generalization rate and probability density. GR_0_ denotes the generation rate gained by the EH-BrainGNN model trained on the real class labels. (D) Graph embeddings that served as inputs were embedded in a two-dimensional plot generated using the t-SNE for the two classes (MDD and HC). (E) The kernel distributions of graph embeddings of the MDD and HC populations are depicted.(TIF)

S3 FigSchematic diagram of the procedure for leave-one-site-out analysis.(TIF)

S4 FigThe stability of the subtype-specific top-20 edges.(A) the stability of the subtype-specific top-20 edges across bootstrap resamples quantified by the Jaccard index. (B) the bootstrap selection frequency of the reference top-20 edges for each subtype.(TIF)

S5 FigThe scatter plots of predicted vs observed HAMD scores.(TIF)

S6 FigCohen’s $[eqn]$ heatmaps for the raw FC and the ComBat-adjusted FC.(TIF)

S7 FigThe impact of graph construction threshold on 10-fold cross-validation.(TIF)

S8 FigThe impact of GNN layers and hidden dimensions on 10-fold cross-validation performance.(TIF)

S9 FigThe impact of GNN layers and hidden dimensions on generalization performance in the independent validation dataset using the principal dataset.(TIF)

S10 FigLearning curve analysis.(TIF)

S11 FigThe impact of medication status on subtype analysis.(A-B) CFDP decision graph and spatial distribution of subtypes with medication included as a covariate. (C-D) CFDP decision graph and spatial distribution of subtypes with medication-naïve participants.(TIF)

Bibliography62

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Ferrari AJ, Charlson FJ, Norman RE, Patten SB, Freedman G, Murray CJL, et al. Burden of depressive disorders by country, sex, age, and year: findings from the global burden of disease study 2010. P Lo S Med. 2013;10(11):e 1001547. doi: 10.1371/journal.pmed.1001547 24223526 PMC 3818162 · doi ↗ · pubmed ↗
2GBD 2016 DAL Ys and HALE Collaborators. Global, regional, and national disability-adjusted life-years (DAL Ys) for 333 diseases and injuries and healthy life expectancy (HALE) for 195 countries and territories, 1990-2016: a systematic analysis for the Global Burden of Disease Study 2016. Lancet. 2017;390(10100):1260–344. doi: 10.1016/S 0140-6736(17)32130-X 28919118 PMC 5605707 · doi ↗ · pubmed ↗
3Otte C, Gold SM, Penninx BW, Pariante CM, Etkin A, Fava M, et al. Major depressive disorder. Nat Rev Dis Primers. 2016;2:16065. doi: 10.1038/nrdp.2016.65 27629598 · doi ↗ · pubmed ↗
4Belmaker RH, Agam G. Major depressive disorder. N Engl J Med. 2008;358(1):55–68. doi: 10.1056/NEJ Mra 073096 18172175 · doi ↗ · pubmed ↗
5Drysdale AT, Grosenick L, Downar J, Dunlop K, Mansouri F, Meng Y, et al. Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nat Med. 2017;23(1):28–38. doi: 10.1038/nm.4246 27918562 PMC 5624035 · doi ↗ · pubmed ↗
6Goodkind M, Eickhoff SB, Oathes DJ, Jiang Y, Chang A, Jones-Hagata LB, et al. Identification of a common neurobiological substrate for mental illness. JAMA Psychiatry. 2015;72(4):305–15. doi: 10.1001/jamapsychiatry.2014.2206 25651064 PMC 4791058 · doi ↗ · pubmed ↗
7Jacobi F, Wittchen H-U, Holting C, Höfler M, Pfister H, Müller N, et al. Prevalence, co-morbidity and correlates of mental disorders in the general population: results from the German Health Interview and Examination Survey (GHS). Psychol Med. 2004;34(4):597–611. doi: 10.1017/S 0033291703001399 15099415 · doi ↗ · pubmed ↗
8Cross-Disorder Group of the Psychiatric Genomics Consortium, Lee SH, Ripke S, Neale BM, Faraone SV, Purcell SM, et al. Genetic relationship between five psychiatric disorders estimated from genome-wide SN Ps. Nat Genet. 2013;45(9):984–94. doi: 10.1038/ng.2711 23933821 PMC 3800159 · doi ↗ · pubmed ↗