Methodological and statistical concerns in MINERVA microbiome-disease knowledge graph
Salvatore Chirumbolo

To the Editor,
Langarica et al. used MINERVA, an AI platform based on knowledge graphs, to map microbe-disease links, enabling better research, analysis, and clinical decision-making [1].
MINERVA's core functionality relies on constructing a knowledge graph (KG) G = (V, E), where nodes V represent microbes and diseases, and edges E ⊆ V × R × V represent directed relationships whose type, drawn from the relation set R, is positive (promotive), negative (inhibitory), or unrelated. The edges are weighted by a scoring function derived from multiple publications.
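The graph formalism above can be made concrete with a minimal sketch. The class name `Edge`, the helper `build_graph`, the placeholder entity names, and the weights are all hypothetical and are not taken from MINERVA; the sketch only shows how typed, weighted edges between a microbe and a disease might be stored, and how contradictory labels for the same pair can coexist.

```python
from collections import defaultdict
from dataclasses import dataclass

# Relation label set R as described in the letter.
RELATIONS = ("positive", "negative", "unrelated")


@dataclass(frozen=True)
class Edge:
    """One directed, typed, weighted edge (m, r, d). Hypothetical structure."""
    microbe: str
    relation: str
    disease: str
    weight: float  # placeholder for a score aggregated over publications


def build_graph(edges):
    """Index edges by (microbe, disease) pair and reject unknown relation labels."""
    graph = defaultdict(list)
    for e in edges:
        if e.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {e.relation}")
        graph[(e.microbe, e.disease)].append(e)
    return graph


# Placeholder entities and weights, invented for illustration only.
edges = [
    Edge("Microbe A", "positive", "Disease X", 2.5),
    Edge("Microbe A", "negative", "Disease X", 0.8),
    Edge("Microbe B", "unrelated", "Disease X", 1.0),
]
graph = build_graph(edges)

# Pairs that carry more than one relation label, i.e. conflicting evidence.
conflicting = [pair for pair, es in graph.items()
               if len({e.relation for e in es}) > 1]
print(conflicting)
```

Keeping every labeled edge per pair, rather than collapsing to a single label, makes conflicting evidence visible before any weighting is applied.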
The paper emphasizes the platform's reliance on open-access data from PubMed and PubMed Central, which introduces a significant selection bias. By excluding paywalled articles, potentially high-quality research published in leading journals is systematically omitted. This leads to a skewed knowledge base that may overrepresent studies from regions, institutions, or fields where open-access publishing is more prevalent. Furthermore, the exclusion of grey literature and non-indexed datasets omits potentially valuable, albeit unconventional, sources of microbiome-disease insights. This selection process limits the diversity and scope of associations represented in the knowledge graph.
While this graph formalism is common in biomedical informatics, MINERVA's methodology raises several issues in its instantiation of V, E, and the edge-weighting scheme. The first is confounding and bias in edge construction.
Each edge $e_{md}$ = (m, r, d), connecting a microbe m ∈ M to a disease d ∈ D with relationship type r ∈ {positive, negative, unrelated}, is derived from sentence-level relation extraction. This process is effectively a classification function

$$ f(S) = r, \quad r \in R, $$

where S is a sentence containing both entities, and R is the relation label space. Classification is performed using a fine-tuned Large Language Model (LLM), with the following risk: each classification is made in isolation, ignoring document- or corpus-level evidence. The local context assumption, namely that the relation can be inferred from the sentence S alone, is invalid in biomedical literature, where entities often require co-reference resolution and discourse-level interpretation to infer accurate relationships. For example, negations or hedging (e.g., “may”, “possibly”) alter semantic interpretation but are not explicitly modeled. Therefore, the assumption that each sentence-level prediction can be treated as an independent Bernoulli trial for relation classification is flawed: sentences drawn from the same document or corpus are not independent observations.
This inflates type I errors (false positives) in the graph.
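The effect of the independence assumption can be illustrated with a small calculation. The aggregation rule used here (a noisy-OR over sentence-level probabilities) is an assumption chosen for illustration and is not claimed to be MINERVA's rule; the probabilities are invented. The sketch compares the confidence obtained when five sentences are treated as independent trials with the confidence warranted if all five merely restate one finding.

```python
import math


def noisy_or(probs):
    """Probability that at least one prediction is correct, ASSUMING independence."""
    return 1.0 - math.prod(1.0 - p for p in probs)


# Five sentences, each with an (invented) 0.3 probability of a true positive link.
sentence_probs = [0.3] * 5

# Treated as five independent Bernoulli trials: 1 - 0.7**5.
independent = noisy_or(sentence_probs)

# If all five sentences restate the same finding, they carry the evidence of one.
fully_dependent = max(sentence_probs)

print(f"assuming independence: {independent:.3f}")
print(f"fully dependent case:  {fully_dependent:.3f}")
```

Real corpora sit between these two extremes, so any aggregate built on the independence assumption would tend to overstate confidence in an edge.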
Although the paper by Langarica et al. deserves attention [1], it is also affected by bias from data selection.
Let $D_{Pub} \subset D_{True}$ be the subset of all microbiome-disease publications ($D_{True}$) that is available in open-access repositories. If the true distribution of associations is p(r, m, d), MINERVA approximates it with $\hat{p}(r, m, d)$, estimated using only $D_{Pub}$. This introduces sampling bias: $D_{Pub}$ is not an i.i.d. sample from $D_{True}$, hence in general

$$ \mathbb{E}[\hat{p}(r, m, d)] \neq p(r, m, d), $$

indicating that MINERVA’s inferences are not representative of the true distribution of microbiome-disease relationships.
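A short arithmetic sketch shows how restricting the corpus to open-access papers can shift an estimated association rate. All counts and fractions below are hypothetical and chosen only to make the arithmetic easy to follow; they are not data from MINERVA or from the literature.

```python
def positive_rate(strata):
    """Pooled fraction of positive associations over (n_papers, positive_fraction) strata."""
    total = sum(n for n, _ in strata)
    return sum(n * frac for n, frac in strata) / total


# Hypothetical corpus: (number of papers, fraction reporting a positive association).
open_access = (400, 0.70)
paywalled = (600, 0.40)

true_rate = positive_rate([open_access, paywalled])  # whole literature
observed_rate = positive_rate([open_access])         # open-access subset only
bias = observed_rate - true_rate

print(f"true rate:     {true_rate:.2f}")
print(f"observed rate: {observed_rate:.2f}")
print(f"bias:          {bias:+.2f}")
```

The estimate is unbiased only when both strata report positive associations at the same rate, which is exactly what a non-random selection mechanism does not guarantee.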
MINERVA employs a weighting scheme $w_{md}$ computed from the journal impact factor $F_p$ of each publication p supporting the edge. This conflates publication prestige with experimental validity and relies on the unverified assumption that citation-level metadata correlates with relationship reliability.
Furthermore, high-impact journals are subject to publication bias, often favoring novel or positive results, which skews $w_{md}$ toward overconfidence in certain associations and introduces systematic bias.
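The concern about prestige-based weighting can be illustrated with a toy calculation. The functional form below (a sum of impact factor times the direction of the reported finding) is an assumed stand-in, since the exact definition of the weight is not reproduced here, and the impact factors are invented.

```python
def weighted_score(evidence):
    """ASSUMED form: sum of impact factor times direction (+1 supports, -1 contradicts)."""
    return sum(impact * direction for impact, direction in evidence)


def vote_score(evidence):
    """Unweighted count: each publication contributes its direction once."""
    return sum(direction for _, direction in evidence)


# One paper in a journal with impact factor 30 supports the association;
# five papers in journals with impact factor 3 contradict it (invented values).
evidence = [(30.0, +1)] + [(3.0, -1)] * 5

print("impact-weighted score:", weighted_score(evidence))
print("unweighted vote:      ", vote_score(evidence))
```

Under this assumed form, a single high-impact report outweighs five independent contradicting studies, which is the overconfidence described above.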
Regarding the evaluation metrics of the relation prediction model, the link prediction model is a Graph Neural Network (GNN) with an F1 score of ~71%, implying that its predictions have a non-negligible error margin. Let y ∈ {0,1} be the ground truth label and $\hat{y}$ the model prediction for an unobserved relation (m, d). Then, taking 1 − F1 as an approximate error rate:

$$ P(\hat{y} \neq y) \approx 0.29. $$
This corresponds to roughly 29% misclassification, which, given the biological cost of false associations, renders the output suitable only for low-stakes exploratory analyses, not for clinical decision support. Moreover, the link prediction model uses Node2Vec embeddings, which rely on the graph’s structural proximity to infer similarity. However, in bioinformatics, structural similarity in a graph does not always imply functional or causal similarity. The embedding function $\phi: V \to \mathbb{R}^d$ may preserve walk-based similarities but not biochemical or clinical relevance; for nodes u, v ∈ V:

$$ \phi(u) \approx \phi(v) \;\not\Rightarrow\; \text{shared biological mechanism}. $$
This undermines interpretability and scientific validity.
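The reading of a ~71% F1 score as a 29% error rate is an approximation, and a short sketch shows why. F1 depends on true positives, false positives, and false negatives but not on true negatives, so the misclassification rate that accompanies a given F1 varies with class balance. The confusion-matrix counts below are hypothetical.

```python
def f1_and_error(tp, fp, fn, tn):
    """Return (F1 score, misclassification rate) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    error = (fp + fn) / (tp + fp + fn + tn)
    return f1, error


# Same tp, fp, fn in both cases, so the F1 score is identical; only tn differs.
balanced = f1_and_error(tp=71, fp=29, fn=29, tn=71)
imbalanced = f1_and_error(tp=71, fp=29, fn=29, tn=871)

print(f"balanced:   F1={balanced[0]:.2f}, error={balanced[1]:.3f}")
print(f"imbalanced: F1={imbalanced[0]:.2f}, error={imbalanced[1]:.3f}")
```

In link prediction, where non-edges typically far outnumber edges, the overall error rate can be well below 1 − F1, while the share of wrong calls among predicted links (1 − precision) stays the more relevant figure for false associations.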
Moreover, there is confounding in the risk score calculation.
The disease risk score is defined as a summation over microbial taxa. This simplistic summation assumes independence among microbial taxa, which violates ecological co-dependence. In microbial ecology, abundances are compositional (i.e., constrained by a total sum); for the relative abundances $x_1, \dots, x_n$ of n taxa:

$$ \sum_{i=1}^{n} x_i = 1. $$
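Hence, increases in one taxon inherently imply decreases in others (compositional bias). Standard summative metrics on untransformed relative abundance data yield spurious associations, violating the principles of compositional data analysis (CoDA). Proper risk scoring would require log-ratio transformations such as the centered log-ratio:

$$ \mathrm{clr}(x_i) = \log \frac{x_i}{g(x)}, $$

where

$$ g(x) = \left( \prod_{j=1}^{n} x_j \right)^{1/n} $$

is the geometric mean of the relative abundances $x_1, \dots, x_n$; without such a transformation, inferred disease risk profiles are statistically unsound.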
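The compositional effect and the centered log-ratio (CLR) transformation can be sketched in a few lines. The counts are invented, and the function names `closure` and `clr` are illustrative; the sketch assumes strictly positive abundances, whereas real data with zeros would first need a pseudocount or another zero-handling strategy.

```python
import math


def closure(counts):
    """Convert absolute counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(composition):
    """Centered log-ratio: log of each part over the geometric mean.

    Requires strictly positive parts.
    """
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Only taxon 0 changes in absolute terms between the two samples.
before = closure([100, 100, 100])
after = closure([400, 100, 100])

# Relative abundance of taxon 1 falls although its absolute count is unchanged.
print(f"taxon 1 relative abundance: {before[1]:.3f} -> {after[1]:.3f}")

# The log-ratio between taxa 1 and 2 is unaffected by the change in taxon 0.
ratio_before = clr(before)[1] - clr(before)[2]
ratio_after = clr(after)[1] - clr(after)[2]
print(f"clr difference (taxon 1 vs 2): {ratio_before:.3f} -> {ratio_after:.3f}")
```

A summation over raw relative abundances would register the apparent drop in taxon 1 as a real signal, whereas the log-ratio between unchanged taxa stays constant.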
Finally, there is no correction for multiple hypothesis testing.
The graph contains tens of thousands of edges, and multiple hypothesis tests are implicitly performed when evaluating microbe-disease associations [1]. However, the paper makes no mention of controlling the false discovery rate (FDR) with procedures such as Benjamini-Hochberg. This omission leads to inflated type I errors; for m independent tests at significance level α:

$$ P(\text{at least one false positive}) = 1 - (1 - \alpha)^m. $$
Without FDR correction, even high-confidence relations may be spurious in aggregate.
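The scale of the problem, and the kind of correction being asked for, can be sketched as follows. The p-values are invented for illustration, and the function names are hypothetical; the family-wise error formula assumes independent tests, and the procedure shown is the standard Benjamini-Hochberg step-up rule.

```python
def family_wise_error(alpha, m):
    """Probability of at least one false positive among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def benjamini_hochberg(pvalues, q=0.05):
    """Return booleans marking hypotheses rejected at FDR level q (step-up rule)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


# Invented p-values for eight microbe-disease associations.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

naive = sum(p <= 0.05 for p in pvalues)
corrected = sum(benjamini_hochberg(pvalues, q=0.05))

print(f"P(at least one false positive), 100 tests: {family_wise_error(0.05, 100):.3f}")
print(f"significant without correction: {naive}")
print(f"significant after BH:           {corrected}")
```

With tens of thousands of edges, the gap between uncorrected and FDR-controlled results would be expected to be far larger than in this eight-test example.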
In summary, although MINERVA leverages advanced machine learning techniques and offers a user-friendly knowledge interface, its foundational assumptions (sentence-level independence, journal prestige as a proxy for quality, absence of CoDA correction, and lack of multiple testing adjustment) are flawed from a bioinformatics and statistical perspective. To improve scientific validity, the platform must incorporate global textual context, compositional data transformations, rigorous FDR control, and probabilistic modeling of relation uncertainty.
Key Points
- MINERVA uses only open-access data, introducing strong selection bias.
- Sentence-level AI extraction ignores context, inflating false positives.
- Journal impact factor wrongly used as proxy for experimental validity.
- Risk score ignores compositional bias; lacks log-ratio data correction.
- No FDR control; model error undermines reliability.
