Fast graph convolutional models incorporating matrix factorization for predicting microbe-disease associations
Qingwen Wu, Sujuan Tang

TL;DR
This paper introduces a new computational method to predict which microbes are associated with specific diseases, using graph convolution and matrix factorization techniques.
Contribution
The novel contribution is the integration of fast graph convolution and matrix factorization for improved microbe-disease association prediction.
Findings
FGCNMF outperforms existing state-of-the-art methods on benchmark datasets.
The method uses enhanced node embeddings from a combined microbe-disease network for accurate predictions.
Abstract
Revealing the relationship between microbe and disease is of great significance to the diagnosis, treatment, and prevention of disease. To overcome the expensive cost and trial-and-error settings, a series of in-silico methods have been proposed to predict microbe-disease association. However, the predictive performance of the current methods is modest. In this paper, we propose a new computational method based on Fast Graph Convolutional and Matrix Factorization, called FGCNMF, which addresses microbe-disease association prediction as a binary classification task by learning embedding representation of nodes on a microbe-disease network. We integrate background information from both microbe and disease spaces into the same global network framework, and use the randomized Singular Value Decomposition algorithm to obtain high-quality initial embedding representations of node. Then, Fast…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7- —Clinical Research Support Program of the Affiliated Hospital of Jining Medical University (Clinical Medical college)
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBioinformatics and Genomic Networks · Gut microbiota and health · Machine Learning in Healthcare
Introduction
An intricate, co-evolutionary relationship exists between humans and a vast consortium of resident microorganisms, primarily bacteria. These microbes colonize nearly every part of the human body, with particularly dense populations found in the gastrointestinal tract and on the skin^1^, forming complex ecosystems critical to host physiology.
The skin’s microbial community, for instance, serves as a protective barrier. It is largely composed of beneficial commensals that actively inhibit pathogenic invasion while also playing a crucial role in calibrating the host’s immune responses^2^. This homeostatic balance, however, can be disturbed by a variety of endogenous and exogenous pressures, leading to pathological states^3^. Clinical examples include the association of Staphylococcus aureus with atopic dermatitis and Streptococcus pyogenes with guttate psoriasis^2^.
In the gastrointestinal tract, the microbiota’s influence on human health is even more profound. These microbes are essential for metabolic functions, such as converting dietary fiber into short-chain fatty acids, and for the proper development and functioning of the immune system^4^. Consequently, a disruption of this gut microbial community, a condition known as dysbiosis, is linked to immunological dysfunction and disease pathogenesis. Such disturbances are known to exacerbate risk factors for a range of cardiometabolic disorders, including diabetes, obesity, and atherosclerosis^5^. Specific microbial shifts, such as the depletion of the genus Odoribacter, have been identified as potential biomarkers for distinct pathologies like inflammatory bowel disease (IBD)^6^ and non-alcoholic fatty liver disease.
Unlocking the intricate roles of the human microbiome is paramount for advancing human health, offering pathways to better diagnostics, treatments, and preventative strategies^7^. However, progress in this area is hampered by the current scarcity of validated microbe-disease associations. Traditional discovery methods, which rely on direct biological and clinical experimentation, are inherently slow, costly, and labor-intensive, creating a significant bottleneck in research. To circumvent these challenges, the field has witnessed a surge in the development of computational, or in-silico, methods. These approaches provide a powerful and efficient framework for predicting novel microbe-disease links, and they can be broadly classified into two dominant paradigms^8^: models based on network algorithms and those that employ machine learning techniques.
The first major category, network-based models^9–12^, operates by representing microbes, diseases, and their known relationships as nodes and edges within a complex heterogeneous network. The core task is then to infer missing edges, representing undiscovered associations, by analyzing the network’s topological properties. A variety of strategies have been proposed to tackle this. Some approaches, such as KATZHMDA^13^, quantify the likelihood of an association by calculating the similarity between nodes based on the number and length of paths connecting them. Other methods leverage techniques like random walks (e.g., NTSHMDA^14^ or network consistency projection^15^ (e.g., NCPHMDA^9^, HMDA-Pred^16^ to propagate information across the network and identify promising candidate associations. Despite their ingenuity, a fundamental limitation plagues many of these models: their strong reliance on a pre-existing network of known interactions. Consequently, their predictive performance is often compromised when dealing with new diseases or microbes that have no previously documented links, rendering them ineffective for “orphan” node predictions.
The second prominent strategy leverages machine learning^17–19^, typically framing the prediction task as a binary classification problem. These models excel at learning from complex data by extracting or constructing discriminative features for microbe-disease pairs. A diverse range of architectures has been explored. For instance, deep learning approaches, such as the three-layer neural network in BPNNHMDA^20^ or the deep matrix factorization model in DMFMDA^21^, aim to learn low-dimensional, dense vector representations (embeddings) directly from the data. Other methods, like RNMFMDA^22^, employ sophisticated matrix factorization techniques combined with clever sampling strategies, such as positive-unlabeled learning, to address the common challenge of lacking confirmed negative examples. While these methods have certainly advanced the field, they often face their own hurdles, including challenges in fully capturing the complex topological characteristics of the underlying biological network, which can limit their predictive accuracy and efficiency. To overcome these limitations, more recent research has embraced advanced Graph Convolutional networks (GCN) architectures. Models like GraphSAGE have introduced sophisticated inductive learning capabilities, while Graph Transformers have adapted the highly successful attention mechanism for graph data, enabling the capture of complex global dependencies. On the frontier, sophisticated frameworks combining contrastive learning with graph diffusion are being explored^23^, aiming to enhance representation robustness, though often at the cost of increased model complexity and computational overhead.
To address the aforementioned limitations, this paper introduces FGCNMF, a novel and highly efficient computational framework that synergistically combines Fast Graph Convolutional networks with Matrix Factorization. The novelty of our approach lies in its unique two-stage embedding pipeline that decouples global structure learning from local neighborhood refinement, leading to significant gains in both speed and accuracy. Our key contributions are threefold:
- An efficient, non-iterative embedding initialization. We bypass traditional end-to-end GCN training by first using Randomized Singular Value Decomposition (SVD) to rapidly generate a high-quality global embedding for each node.
- A lightweight, parameter-free refinement stage. We then enhance these initial embeddings using simple, fast spatial graph convolutions to effectively integrate local neighborhood information without the computational overhead of complex aggregators or attention mechanisms.
- A superior balance of performance and efficiency. The resulting framework is not only an order of magnitude faster than contemporary GCN baselines but also achieves state-of-the-art predictive performance on benchmark datasets.
The FGCNMF workflow unfolds in three key stages: First, the initial global node embeddings are generated via Randomized SVD. Second, these embeddings are enhanced through spatial convolutions. Finally, the refined embeddings are used as features for a downstream classifier to predict microbe-disease associations. Through rigorous cross-validation on benchmark datasets and compelling case studies, we demonstrate that FGCNMF achieves state-of-the-art performance, significantly outperforming existing models and establishing it as a powerful and practical tool for accelerating microbiome research.
Materials and methods
Dataset
The datasets used in this study were downloaded from previous work^24^. We obtained a dataset HM with 450 known microbe-disease associations and a dataset DB with 4351 known microbe-disease associations (see Table 1). These known microbe-disease associations come from HMDAD^25^ database and Disbiome database^26^, respectively. Furthermore, microbe functional similarity was calculated by the Kamneva method^27^, and disease functional similarity was calculated based on the functional associations between disease-related genes. For convenience, we denoted the known microbe-disease associations as MD, microbe similarity as SM, and disease similarity as SD, respectively.
Table 1. The statistical information of microbe-disease datasets.DatasetMicrobeDiseaseLinksSparsenessHM292394500.0395DB105221843510.0189
Microbe-disease network
Similarity information between microbes (or diseases) contains a large amount of topological information. However, a large amount of effective neighbor information is included in the most similar neighbors, and the low-similarity neighbors have little effect on the model. Therefore, for each microbe in SM, we find its top-n similar neighbors and set edges to connect it with its neighbors. In this way, we obtain a new similarity matrix SM by
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:SM(i,j)=\left\{\begin{array}{c}\:1,\:\:\:if\:j\in\:top\:n\:neighbors\:of\:i\\\:\:0,\:\:\:\:\:\:\:\:\:\:\:\:others\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\end{array}\right.$$\end{document}.
Similarly, we can obtain the new similarity matri × SD of disease.
Based on the new similarity matrix and known association, we construct microbe-disease network \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:G\:=\:(V,\:E)$$\end{document} . Each node \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\:i\in\:V$$\end{document} represents microbe or disease and each edge \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:(i,\:j)\in\:E\:$$\end{document} represents relationship between nodes \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:i$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:j$$\end{document} . The adjacency matrix \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:H$$\end{document} of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:G$$\end{document} is described as follows
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:H\:=\left[\begin{array}{cc}SM&\:MD\\\:{MD}^{T}&\:SD\end{array}\right]$$\end{document}.
where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{MD}^{T}$$\end{document} is the transpose of the matrix \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:MD$$\end{document} . The degree matrix \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:D$$\end{document} of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:G$$\end{document} is a diagonal matrix with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{D}_{ii}={\sum\:}_{j}{H}_{ij}$$\end{document} .
Prediction model FGCNMF
We propose a novel prediction model FGCNMF which can precisely predict microbe-disease associations. FGCNMF including three steps (Fig. 1): (1) learning the initial embedding vector of graph nodes using Matrix Factorization, (2) enhancing node embeddings via spatial convolution, and (3) using machine learning algorithm to predict associations between unknown node pairs.
Fig. 1. The workflow of FGCNMF.
Given the network G, graph representation learning algorithm will learn a mapping function \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\:f:\:V\to\:{F}^{d}$$\end{document} that projects each node to a d-dimensional space. It uses a low-dimensional, dense vector to represent the nodes in the graph. This vector represents the structure of the graph, the more adjacent nodes shared by two nodes, the closer the two corresponding vectors are. The biggest advantage of graph representation learning is that the obtained vector representation can be input to any machine learning model to solve specific problems.
Inspired by proNE^28^, we formulates graph embedding as sparse matrix factorization to efficiently achieve initial node representations. Specifically, for node \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:i$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:j$$\end{document} , to representing their similarity in graph G, a similarity matrix C was defined as
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{C}_{i,j}=\left\{\begin{array}{c}log\:{p}_{i,j}-{log}{N}_{E,\:j},\:\:\:\:\:H\left(i,j\right)=1\\\:0,\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:H\left(i,j\right)=0\end{array}\right.$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{p}_{i,j}=\frac{{H}_{ij}}{{D}_{ii}}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{N}_{E,\:j}$$\end{document} is the negative samples associated with node j and defined as
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{N}_{E,\:j}\propto\:{\sum\:}_{i:(i,j)\in\:E}{p}_{i,j}$$\end{document}Randomized Singular Value Decomposition is applied to achieve fast embedding learning. Assuming
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:C\approx\:{\text{Q}\text{Q}}^{T}C$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{Z=Q}^{T}C$$\end{document}Z can be decomposed efficiently by the standard SVD, that is
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:Z=S\varSigma\:{V}^{T}$$\end{document}Therefore,
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:C\approx\:Q{Q}^{T}C=QS\varSigma\:{V}^{T}$$\end{document}and get the initial node representation
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{F}^{\left(0\right)}=QS{\varSigma\:}^{\frac{1}{2}}$$\end{document}\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:Q\:$$\end{document} can be obtained by QR decomposition
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:CU=QR$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\:U\sim\mathcal{N}(0,\:1/d)\:$$\end{document} is Gaussian random matrix.
However, embeddings learned by sparse matrix factorization capture only local structural information. To further integrate global structure information of graph, spatial convolution is used to enhance initial embeddings. Inspired by LightGCN^29^, the node representation \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:F$$\end{document} is updated by
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{F}^{\left(1\right)}=\widehat{H}{F}^{\left(0\right)}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{F}^{\left(l\right)}=\widehat{H}{F}^{(l-1)}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:F=\frac{1}{l}{(F}^{\left(1\right)}+\dots\:+{F}^{\left(l\right)}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{\widehat{H}=D}^{-\frac{1}{2}}H{D}^{-\frac{1}{2}}$$\end{document} , l represents the number of convolutional layers. The final embedding \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:F$$\end{document} combines the localized smoothing and global clustering information.
Due to the main time cost lies in the SVD of Z and the QR decomposition of C, it is very efficient to learn the embedding of nodes for microbe-disease networks. Then, we used machine learning algorithm as classifiers to predict potential microbe-disease associations.
Results
Experimental settings
To assess the predictive performance of our proposed FGCNMF model, we employed a rigorous 5-fold cross-validation (5-CV) framework. The complete set of known microbe-disease associations was randomly partitioned into five equal-sized subsets. For each fold, the procedure was as follows to ensure a fair evaluation and prevent any data leakage: (1) A single subset of associations was held out as the test set. (2) The remaining four subsets were used as the training set to construct the microbe-disease network. The links from the test set were completely excluded from this network graph. (3) The entire embedding generation process, including both the SVD-based initialization and the spatial graph convolutions, was performed solely on this training graph. (4) The resulting node embeddings were then used to train the final machine learning classifier. (5) Finally, the trained model was evaluated on the held-out test set.
This entire 5-CV procedure was repeated 30 times with different random seeds, and the final results were averaged to mitigate any bias arising from data partitioning. The model’s efficacy was quantified using two standard metrics: the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area Under the Precision-Recall Curve (AUPR).
The task of predicting microbe-disease associations is framed as a binary classification problem, which requires both positive and negative samples for training the classifier and evaluating its performance. The set of all known microbe-disease associations, as described in Section "Dataset", constitute the positive samples in our study. These represent the ground-truth links that the model aims to identify. Identifying confirmed non-associations between microbes and diseases is a significant challenge, as the absence of a recorded link does not definitively prove non-association. Therefore, following common practice in link prediction research, we treat all microbe-disease pairs without a known association in our datasets as unlabeled samples. From this large pool of unlabeled pairs, we construct the negative sample set. To create a balanced dataset for our binary classifier, we randomly select a number of these unlabeled pairs equal to the number of known positive associations. This 1:1 sampling ratio of positive to negative samples is maintained throughout our 5-CV procedure. For each fold, the training set consists of 80% of the positive samples and an equal number of randomly selected negative samples. The remaining 20% of positive samples and an equal number of other unlabeled pairs form the test set, from which performance metrics are calculated. This balanced approach ensures a fair and robust evaluation of the model’s predictive capabilities.
To benchmark FGCNMF’s performance, we conducted a comparative analysis against four state-of-the-art methods. These baselines included: GATMDA^24^, a model utilizing graph attention networks and inductive matrix completion; BRWMDA^30^, which performs a bi-random walk on the microbe and disease similarity networks; GRNMFHMDA^31^, a framework based on graph regularized non-negative matrix factorization for simultaneous prediction across all diseases; and KATZHMDA^13^, a foundational method that predicts associations using the KATZ centrality measure.
Parameter studies
There are three key parameters in this study, the number of top similar neighbors n, the embedding dimension d, and the number of convolutional layers l. Next, we analyze the impact of different parameter settings on the FGCNMF model. To test the sensitivity of the proposed model to these parameters, we obtained the AUC and AUPR values by varying these parameters at the 5- CV setting.
The first key hyperparameter in our model is top n, which is used in the construction of the microbe-disease network. This parameter controls the sparsity of the similarity networks by filtering out weaker, potentially noisy connections and retaining only the edges to the n most similar neighbors for each node. To investigate the effect of this parameter, we varied n across the values [3, 5, 10, 15, 20] and evaluated the model’s performance. The results are presented in Fig. 2. For the HM dataset, we observe that the model achieves its peak performance for both AUC and AUPR when n is set to 5. For the DB dataset, the optimal performance is reached when n is 3. The performance on the DB dataset shows a steady decline as n increases, suggesting that for this larger and sparser network, including more neighbors beyond the top 3 introduces more noise than useful signal, thereby degrading performance. The HM dataset benefits from a slightly larger neighborhood, but performance also begins to decline as n becomes too large. This highlights the importance of tuning this parameter to strike the right balance between information retention and noise filtering. Based on these findings, we set n = 5 for the HM dataset and n = 3 for the DB dataset in all other experiments conducted in this study.
Fig. 2. Performance comparison of top n.
Second, we chose \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\:l\:$$\end{document} from 1, 2, 3, 4, 5, and tested it on dataset HM and DB. Figure 3 describes the changes of average AUC and AUPR scores. We can find that when \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:l$$\end{document} reaches a certain value, the performance of the model plateaus. The reason may be that the output of each convolutional layer is used as part of the final embedding, which reduces the risk of node over-smooth and improves the robustness of the model. Therefore, for the HM dataset, we set \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:l$$\end{document} to be 3, and for the DB dataset, set \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:l$$\end{document} be 2.
Fig. 3. Performance comparison of layer depths in GCN.
Third, we set d = 8, 16, 32, 64, 128, 256 in this experiment. The results are shown in Fig. 4. For dataset HM, the best results were obtained with the d is 32, and for the dataset DB, the best results were obtained with the d is 64.
Fig. 4. Performance comparison of different embedding size.
Performance of classifiers
For a machine learning problem, data and features determine the upper limit of machine learning, and models and algorithms only approach this upper limit. In this section, we investigate the effect of machine learning-based classifiers (MLC) on the performance of microbe-disease prediction. Three boosting algorithms, extreme gradient boosting (XGBoost), gradient boosting decision tree (GBDT) and adaptive boosting (Adaboost), two bagging algorithm, extremely randomized trees (ET) and random forest (RF), and two classic machine learning algorithms SVM and logistic regression (LR), are used to predict the association between microbes and diseases.
Based on the same parameter setting, we trained seven independent classifiers to predict the association between microbes and diseases. The comparison results are shown in Figs. 5 and 6. It can be seen from the experimental results that the ET algorithm performs better on HM, and the XGBoost algorithm performs better on DB.
Fig. 5. Performance comparison of different classifiers onHM.
Fig. 6. Performance comparison of different classifiers on DB.
The experimental results in Figs. 5 and 6 show that the ET and XGBoost classifiers consistently outperform other methods on the HM and DB datasets, respectively. We selected ET as the primary classifier for our final model due to its robustness and efficiency. ET, a bagging-based ensemble method, builds multiple decision trees from the training data and averages their predictions. Unlike boosting algorithms (e.g., XGBoost, GBDT), which build trees sequentially to correct the errors of prior trees, ET introduces a higher level of randomness by selecting random split points for features. This increased randomness often helps to reduce model variance and prevent overfitting, which is particularly beneficial when dealing with potentially noisy biological data and high-dimensional embedding features.
While boosting methods can sometimes achieve slightly higher performance by aggressively minimizing bias, they can be more sensitive to hyperparameter tuning and noise. Given that ET provided highly competitive and stable performance across both benchmark datasets, we concluded it offered the best balance of accuracy, robustness, and generalizability for the microbe-disease prediction task.
Ablation study
To rigorously assess the contribution of each component within our proposed FGCNMF pipeline, we conducted an ablation study. The primary goal was to disentangle the performance gains attributable to our novel embedding method from the effects of the downstream classifier. To achieve this, we compared our final FGCNMF embeddings against seven other feature representation methods: LINE, SDNE, GraRep, DeepWalk, Node2Vec, HOPE and Matrix Factorization only (MF-only).
We generated embeddings for all nodes in the HM and DB datasets using these eight methods. Subsequently, we used these embeddings as input features to train two top-performing classifiers from our earlier analysis: ET and XGBoost. By keeping the classifiers constant and varying the embedding source, we can directly attribute performance differences to the quality of the embeddings themselves.
The results of this ablation study, presented in Table 2, clearly demonstrate the superiority of our complete FGCNMF embedding framework. Across both datasets and for both classifiers, the FGCNMF embeddings consistently yielded the highest AUC and AUPR scores. This confirms that the primary driver of our model’s state-of-the-art performance is the high-quality feature representation learned by the FGCNMF pipeline, rather than a coincidental synergy with a single classifier.
Table 2. Ablation study results comparing different embedding methods on the HM and DB datasets using ET and XGBoost classifiers.DataEmbedding modelClassifierAUCAUPRHMFGCNMF (Ours)ET 0.9471 ± 0.0075
0.9546 ± 0.0077 XGBoost0.9287 ± 0.01570.9306 ± 0.0159LINEET0.9110 ± 0.01540.9207 ± 0.0108XGBoost0.9061 ± 0.02190.9053 ± 0.0214SDNEET0.7884 ± 0.06090.7770 ± 0.0344XGBoost0.8279 ± 0.03590.8488 ± 0.0325GraRepET0.8790 ± 0.03350.8746 ± 0.0415XGBoost0.8723 ± 0.02140.8709 ± 0.0190DeepWalkET0.9399 ± 0.02260.9499 ± 0.0149XGBoost0.9210 ± 0.01780.9314 ± 0.0139Node2VecET0.8287 ± 0.02970.8401 ± 0.0280XGBoost0.8402 ± 0.02350.8544 ± 0.0215HOPEET0.9279 ± 0.01030.9306 ± 0.0069XGBoost0.9117 ± 0.02180.9059 ± 0.0291MF-onlyET0.9215 ± 0.00940.9188 ± 0.0102XGBoost0.9034 ± 0.01100.8714 ± 0.0160DBFGCNMF (Ours)ET0.9340 ± 0.00400.9395 ± 0.0060XGBoost 0.9467 ± 0.0033
0.9481 ± 0.0034 LINEET0.8740 ± 0.00690.8748 ± 0.0091XGBoost0.9098 ± 0.00660.9108 ± 0.0080SDNEET0.7369 ± 0.01190.7206 ± 0.0139XGBoost0.7833 ± 0.00830.7893 ± 0.0105GraRepET0.8185 ± 0.01150.8008 ± 0.0151XGBoost0.8638 ± 0.00900.8560 ± 0.0116DeepWalkET0.9446 ± 0.00380.9465 ± 0.0040XGBoost0.9456 ± 0.00370.9477 ± 0.0044Node2VecET0.7985 ± 0.01130.7778 ± 0.0134XGBoost0.8361 ± 0.00970.8316 ± 0.0109HOPEET0.9339 ± 0.00420.9345 ± 0.0054XGBoost0.9309 ± 0.00470.9225 ± 0.0057MF-onlyET0.9103 ± 0.00510.9087 ± 0.0059XGBoost0.9248 ± 0.00460.9211 ± 0.0048The best-performing results are in bold.
Computational efficiency analysis
A core design goal of FGCNMF was to achieve high predictive accuracy with exceptional computational efficiency. To substantiate the “Fast” claim in our model’s name, we provide both a theoretical complexity analysis and a comprehensive empirical runtime comparison.
FGCNMF’s efficiency stems from its two-stage, non-iterative design. The use of Randomized SVD for matrix factorization has a complexity of approximately O(N×d^2^+|E|)), where N is the number of nodes, d is the embedding dimension and |E| is the number of edges. This is a highly efficient one-shot calculation compared to iterative methods. The enhancement step involves l sparse-dense matrix multiplications, with a complexity of O(l×|E|×d). This is a lightweight, parameter-free operation.
To provide practical evidence of this efficiency, we conducted a runtime comparison on the large DB dataset. We measured the total time required to complete a single fold of the 5-fold cross-validation process. For methods requiring a separate classifier (e.g., classic embedding methods), this time includes both embedding generation and XGBoost training. For end-to-end GCNs, this is the total training time for a fixed 100 epochs. All experiments were performed on the same hardware with the following specifications: an Intel(R) Xeon(R) Gold 5418Y CPU @ 3.80 GHz, 512 GB of RAM, and an NVIDIA L40 GPU.
The results in Table 3 are conclusive. FGCNMF is dramatically faster than all other evaluated methods. It completes the task in just over one second. This represents a 2x to 8x speedup over classic graph embedding techniques and an even more significant 50x to 60x speedup compared to modern, complex GCN architectures. This outstanding computational efficiency, combined with its state-of-the-art predictive performance, firmly establishes FGCNMF as a powerful and highly practical tool for large-scale microbe-disease association prediction.
Table 3. Comparison of training time for a single cross-validation fold on the DB dataset.ModelTraining time(s)FGCNMF (Ours) 1.13 LINE + Xgboost4.41SDNE + Xgboost2.29HOPE + Xgboost3.88DeepWalk + Xgboost2.94Node2Vec + Xgboost8.76GraRep + Xgboost2.71GAT70.39GraphSAGE60.34GraphTransformer71.07HGCLWithDiffusion63.38The best-performing results are in bold.
Performance comparison with LightGCN
LightGCN^29^ is a widely used, simplified graph neural network that excels at learning node embeddings for recommendation tasks. It is conceptually similar to our approach in that it uses graph convolutions to refine representations. To provide a direct and meaningful comparison, we constructed two baselines based on it:
- LightGCN: This refers to the standard, end-to-end LightGCN model trained with its native Bayesian Personalized Ranking (BPR) loss function to directly output a prediction score for a given node pair.
- LightGCN + MLC: This baseline was designed to mirror our FGCNMF pipeline to provide direct comparison of the embedding quality. First, we used a trained LightGCN model to generate final node embeddings. Then, these static embeddings were used as input features for a downstream classifier. Here, MLC stands for Machine Learning-based Classifier, and for this baseline, we used the same Extra-Trees classifier employed in our FGCNMF model.
This comparison allows us to differentiate the performance of our fast, non-iterative embedding strategy (FGCNMF) from a standard, iteratively trained GCN embedding strategy (LightGCN + MLC), while also comparing against an end-to-end GCN framework (LightGCN). The results of the 5-fold cross-validation are shown in Fig. 7. The data clearly demonstrates that our FGCNMF model outperforms both LightGCN-based approaches, highlighting the effectiveness of our specialized feature extraction and prediction pipeline.
Fig. 7. Performance comparison with LightGCN.
Performance comparison with advanced GCN models
To rigorously contextualize FGCNMF’s performance within the contemporary landscape of graph learning, we benchmarked it against several state-of-the-art GCN models. The selected baselines represent key advancements in the field:
- GraphSAGE: A prominent inductive GCN framework that aggregates features from a node’s local neighborhood.
- GraphTransformer: A powerful architecture employing self-attention mechanisms, allowing nodes to weigh the importance of all other nodes in the graph.
- HGCLWithDiffusion: A cutting-edge model that integrates heterogeneous graph contrastive learning with graph diffusion to learn robust representations.
Based on the torch_geometric(v 2.7.0) library, we implemented and optimized each baseline for the microbe-disease prediction task. The comparative results under the 5-fold cross-validation setting are presented in Table 4.
Table 4. Performance comparison with advanced GCN models.DataGCN ModelAUCAUPRHMFGCNMF(Ours)0.9471 ± 0.00750.9546 ± 0.0077GraphSAGE 0.9449 ± 0.0124 0.9514 ± 0.0098GraphTransformer0.9535 ± 0.0067 0.9573 ± 0.0062 HGCLWithDiffusion0.8910 ± 0.01920.9093 ± 0.0132DBFGCNMF(Ours) 0.9467 ± 0.0033
0.9481 ± 0.0034 GraphSAGE0.9073 ± 0.00660.9073 ± 0.0069GraphTransformer0.9328 ± 0.00500.9296 ± 0.0057HGCLWithDiffusion0.7975 ± 0.01140.8187 ± 0.0100The best-performing results are in bold.
The results in Table 4 reveal several key insights. On the smaller HM dataset, the performance is highly competitive, with the GraphTransformer model achieving the best AUPR score, while our FGCNMF model is a very close second and outperforms GraphSAGE. However, on the larger and more challenging DB dataset, FGCNMF clearly establishes state-of-the-art performance, surpassing all advanced GCNs, including the GraphTransformer, by a significant margin in both AUC and AUPR.
Notably, the complex HGCLWithDiffusion model did not generalize well to this task, suggesting that its architecture may be highly specialized for other problem domains or requires larger-scale data. The outstanding performance of FGCNMF, particularly on the DB dataset, is significant. It demonstrates that our streamlined and efficient hybrid approach, combining fast matrix factorization for a global view with simple spatial convolutions for local refinement, is a more effective strategy for this task than more complex, parameter-heavy models. This highlights the value of FGCNMF as a practical framework that achieves an exceptional balance between predictive power and model simplicity.
Performance comparison with baseline
The comparison results with the baseline method are shown in Table 5. We conduct FGCNMF for 30 times on 5-fold CV to reduce the impact caused by sample division, and the mean values of AUPR achieves 95.46% on HM, and 94.81% on DB, which are both obvious higher than those of the comparison methods.
Table 5. The 5-CV prediction performance of different models.DatasetModelAUCAUPRHMFGCNMF (Ours) 0.9471 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0075
0.9546 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0077 GATMDA^24^ 0.9554 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0184
0.9334 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0417 BRWMDA^30^0.8916 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.00290.9064 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0152KATZHMDA^13^0.8703 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.01990.8807 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0167GRNMFHMDA^31^0.8806 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.01560.8914 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0162DBFGCNMF (Ours) 0.9467 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0033
0.9481 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0034 GATMDA^24^ 0.9307 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0079
0.9211 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0088 BRWMDA^30^0.8266 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.00310.8031 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0041KATZHMDA^13^0.6779 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.01410.6785 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0163GRNMFHMDA^31^0.8609 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.00470.8669 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0060The best and second-best performing results are highlighted in bold and italic, respectively.
A closer analysis of the results in Table 5 reveals important insights into the interplay between model architecture and dataset characteristics. On the smaller and denser HM dataset, GATMDA achieves a slightly higher AUC score than FGCNMF. This can be attributed to GATMDA’s core Graph Attention Network (GAT) mechanism, which learns to assign variable importance weights to neighboring nodes. In a dense graph like HM, where local neighborhoods can be complex, this attention mechanism is highly effective at identifying the most critical connections. Our model’s simpler, unweighted convolutions, while much faster, may not capture these fine-grained local importance signals as effectively.
However, it is crucial to note two points. First, on the same HM dataset, FGCNMF achieves a significantly higher AUPR (0.9546 vs. 0.9334), a metric often more relevant for imbalanced biological data, suggesting superior performance in identifying true positive associations. Second, on the larger and sparser DB dataset, FGCNMF’s efficient and robust architecture demonstrates its strength, outperforming all baselines, including GATMDA, on both AUC and AUPR. This suggests that our approach scales more effectively and is better suited for larger, more challenging networks.
This analysis provides a clear avenue for future improvement, integrating a lightweight or simplified attention mechanism into the FGCNMF framework could potentially combine the strengths of both approaches, leading to a model with enhanced performance on dense networks while retaining the speed and scalability that make it so effective on larger ones.
Performance on external dataset
To verify the generalizability of the FGCNMF architecture, we applied our entire prediction pipeline to three external microbe-drug association datasets (see Table 6). This involves constructing the network, generating embeddings, and training a classifier from scratch for each new dataset. For these validation experiments, we consistently used the Extra-Trees (ET) algorithm as the downstream classifier. This methodological choice is crucial for a fair evaluation. Our analysis in Section "Performance of Classifiers" showed that while boosting models like XGBoost can achieve peak performance on a specific dataset (e.g., DB), the ET classifier provided highly competitive and stable performance across both of our primary benchmarks. By fixing the classifier to a robust and high-performing option like ET, we ensure that the results on these new datasets are a direct reflection of the FGCNMF embedding framework’s ability to generate powerful predictive features, rather than the outcome of dataset-specific classifier tuning.
Table 6. The statistical information of microbe-drug datasets.DatasetMicrobeDrugLinksSparsenessMDAD173137324700.0104aBiofilm140172028840.0120DrugVirus951759330.0561
In this experiment, the depth of the convolutional layer is set to 2, and the classifier adopts the ET algorithm. Referring to the number of samples in the dataset, the embedding size is set to 64 for the first two datasets and 32 for the last dataset. Table 7 shows the results of FGCNMF and other baselines on these three datasets. Overall, FGCNMF achieves SOTA performance on the first two datasets. The area under the precision-recall curve (AUPR) of FGCNMF have reached to 97.42% and 97.86%, respectively, which are significantly superior to the previous method. The AUC and AUPR of the third dataset are 89.04% and 88.82%, which are 0.82% and 1.56% lower than the optimal GCNMDA^14^, respectively.
Table 7. The summary of model performance on microbe-drug datasets.DatasetModelAUCAUPRMDADFGCNMF (Ours) 0.9667 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0025
0.9742 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0017 GCNMDA^14^ 0.9423 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0105
0.9376 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0114 IMCMDA^32^0.7466 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.01020.7773 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0113WNN_GIP^33^0.8721 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.01620.8922 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0137GCMDR^34^0.8485 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.00620.8509 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0040aBiofilmFGCNMF (Ours) 0.9715 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0052
0.9786 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0025 GCNMDA^14^ 0.9517 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0033
0.9488 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0031 IMCMDA^32^0.7750 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.00960.8572 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0049WNN_GIP^33^0.9019 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.01870.9408 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0132GCMDR340.8772 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.00760.8847 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0061DrugVirusFGCNMF (Ours) 0.8904 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0086
0.8882 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0098 GCNMDA^14^ 0.8986 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0305
0.9038 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0372 IMCMDA^32^0.6235 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.02450.6962 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0302WNN_GIP^33^0.8002 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.01920.8436 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0183GCMDR340.8243 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.01680.8206 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\pm\:$$\end{document} 0.0141The best and second-best performing results are highlighted in bold and italic, respectively.
Case studies for novel prediction validation
To demonstrate the practical utility and hypothesis-generation capabilities of FGCNMF, we conducted case studies to assess its ability to predict novel, scientifically plausible associations for two well-known diseases: asthma and inflammatory bowel disease (IBD). This analysis was designed to simulate a real-world discovery scenario where the goal is to identify new research targets. The methodology was as follows:
- The FGCNMF model was first trained on the complete HM dataset, leveraging all 450 known microbe-disease associations to learn a comprehensive model of these relationships.
- The trained model was then used to predict association scores for both asthma and IBD against every microbe in the dataset.
- For each disease, the predicted microbes were ranked by their association scores. Crucially, we then filtered out any associations that were already present in our original training set of 450 known links. This left us with a ranked list of purely novel predictions.
- We selected the top 20 novel candidate microbes for each disease and performed a comprehensive literature search to determine if these data-driven hypotheses were supported by independent biological or clinical evidence not captured in our original dataset.
Asthma is a heterogeneous inflammatory disease with a well-established, albeit complex, link to the microbiome^35^. Our model successfully generated a list of high-confidence novel predictions. Upon validating the top 20 candidates (as detailed in Table 8), our literature review confirmed that 16 of them have been independently reported in scientific publications as being associated with asthma. This 80% confirmation rate for novel predictions strongly underscores our model’s ability to act as an effective hypothesis generation engine, capable of identifying clinically relevant microbes that are missing from existing curated databases.
Similarly, IBD serves as a quintessential example of a condition driven by gut microbiome dysbiosis^36^, making it an ideal test case^37^. When applied to IBD, FGCNMF’s ability to generate valid novel hypotheses was even more pronounced. After validating the top 20 predicted novel microbes (Table 9), we found that 17 of them, representing an 85% confirmation rate, are indeed recognized in published research as being associated with IBD. This result powerfully demonstrates the robustness of FGCNMF and solidifies its potential as a valuable computational tool for prioritizing novel targets in microbiome-centric disease research.
Table 8. The top 20 asthma-related candidate microbes.RankMicrobeEvidenceRankMicrobeEvidence1Clostridium difficilePMID:25,974,30111FirmicutesPMID:23,265,8592BurkholderiaPMID:24,451,91012FusobacteriumPMID: 28,486,9333ActinobacteriaPMID:23,265,85913Escherichia coliPMID:29,161,8044LactobacillusPMID:20,592,92014Dietzia marisUnconfirmed5Propionibacterium acnesPMID: 30,185,22615PseudomonasPMID:27,076,5846EnterococcusPMID:29,788,02716Clostridium leptumPMID:29,445,2577Staphylococcus epidermidisPMID: 29,569,13417StreptococcusPMID:17,950,5028PropionibacteriumPMID: 13,268,97018VerrucomicrobiaceaeUnconfirmed9Staphylococcus aureusPMID:12,743,58219BacteroidaceaePMID:28,947,02910Stenotrophomonas maltophiliaUnconfirmed20LysobacterUnconfirmed
Table 9. The top 20 IBD-related candidate microbes.RankMicrobeEvidenceRankMicrobeEvidence1BacteroidetesPMID:25,307,76511CorynebacteriumUnconfirmed2FirmicutesPMID:25,307,76512Staphylococcus aureusUnconfirmed3StaphylococcusPMID: 27,239,10713BacteroidaceaePMID:17,897,8844PrevotellaPMID:25,307,76514ParabacteroidesPMID:25,307,7655Clostridium difficileUnconfirmed15AlistipesPMID:28,877,0446PorphyromonadaceaePMID:29,573,23716Oxalobacter formigenesPMID:15,610,3157RikenellaceaePMID: 31,708,89017EnterobacteriaceaePMID:24,629,3448ClostridiaPMID: 31,142,85518VeillonellaPMID:28,842,6409CitrobacterPMID: 30,342,28219AcinetobacterPMID:22,720,09410RuminococcaceaePMID: 31,379,79720Clostridium coccoidesPMID:19,235,886
Conclusion
In this study, we introduced FGCNMF, a new graph embedding framework that integrates fast matrix factorization with spatial graph convolutions to predict microbe-disease associations. By efficiently capturing both global network structure and local neighborhood information, FGCNMF achieves state-of-the-art performance on benchmark datasets, even outperforming more complex GCN architectures. The results underscore the effectiveness of our lightweight and practical approach for this important bioinformatics task.
However, we recognize that our model has certain limitations that offer clear directions for future improvement, particularly concerning its scalability and adaptability.
One of the primary limitations of the current FGCNMF framework is its transductive nature. The model learns specific embeddings for the nodes present in the training graph. If a new microbe or disease is discovered, its inclusion requires rebuilding the entire heterogeneous network, recalculating the similarity matrix, and retraining the entire model from scratch. This process can be computationally expensive and limits the model’s utility in dynamic, real-world scenarios where biological databases are continuously updated.
To address this challenge, future work will focus on extending FGCNMF to an inductive framework capable of handling new entities efficiently. We propose two primary strategies:
- Inductive inference for new nodes. A semi-inductive approach can be developed to accommodate new nodes without full retraining. When a new microbe m_new is introduced, we can first calculate its similarity to all existing microbes in the database. Its initial embedding, F(0)_new, can then be approximated as a weighted average of the embeddings of its k most similar neighbors:
where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:m\_i$$\end{document} are the known neighbors and sim() is the similarity function. Following this initialization, a few steps of localized graph convolution can be applied to refine this embedding based on its new connections, without updating the embeddings of the rest of the graph. This would offer a fast and efficient method for generating predictions for new entities.
- (2)Transition to a fully inductive framework. A more powerful long-term solution is to redesign the model to be fully inductive from the start. This would involve incorporating rich initial node features beyond graph structure, such as genomic sequences or functional annotations for microbes and clinical symptom profiles or gene expression data for diseases. By training a GCN (like GraphSAGE or GAT) to learn a function that maps these features to the final embedding space, the model would be inherently capable of generating embeddings for entirely unseen nodes, thus providing true generalization and eliminating the need for retraining.
We believe that pursuing these directions will significantly enhance the practical applicability of our work. There is still ample room for improvement in microbe-disease association prediction, and adapting powerful graph representation learning algorithms for dynamic and evolving biological data remains a critical and exciting challenge.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Zhao, Y., Wang, C. C. & Chen, X. Microbes and complex diseases: from experimental results to computational models. Brief. Bioinform. 10.1093/bib/bbaa 158 (2021).10.1093/bib/bbaa 15832766753 · doi ↗ · pubmed ↗
- 2Yi, H. C. et al. Learning representations to predict intermolecular interactions on large-scale heterogeneous molecular association network. i Science 23, 101261, (2020). 10.1016/j.isci.2020.10126110.1016/j.isci.2020.101261 PMC 731723032580123 · doi ↗ · pubmed ↗
- 3Long, Y., Luo, J., Zhang, Y. & Xia, Y. Predicting human microbe-disease associations via graph attention networks with inductive matrix completion. Brief. Bioinform. 10.1093/bib/bbaa 146 (2021).10.1093/bib/bbaa 14632725163 · doi ↗ · pubmed ↗
- 4Janssens, Y. et al. Disbiome database: Linking the microbiome to disease. BMC Microbiol.10.1186/s 12866-018-1197-5 (2018).10.1186/s 12866-018-1197-5PMC 598739129866037 · doi ↗ · pubmed ↗
- 5Zhang, J., Dong, Y., Wang, Y., Tang, J. & Ding, M. In Twenty-Eighth International Joint Conference on Artificial Intelligence IJCAI-19 (2019).
- 6He, X. et al. Association for computing machinery, virtual event, China. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval 639–648 (2020).
- 7Fu, X. et al. Derived habitats of indoor microbes are associated with asthma symptoms in Chinese university dormitories. Environ. Res.10.1016/j.envres.2020.110501 (2020).10.1016/j.envres.2020.11050133221308 · doi ↗ · pubmed ↗
- 8Li, K. et al. Bacteroides Thetaiotaomicron relieves colon inflammation by activating Aryl hydrocarbon receptor and modulating CD 4(+)T cell homeostasis. Int. Immunopharmacol.10.1016/j.intimp.2020.107183 (2020).10.1016/j.intimp.2020.10718333229197 · doi ↗ · pubmed ↗
