KG2ML: integrating knowledge graphs and positive unlabeled learning for identifying disease-associated genes
Praveen Kumar, Vincent T. Metzger, Swastika T. Purushotham, Priyansh Kedia, Cristian G. Bologa, Christophe G. Lambert, Jeremy J. Yang

TL;DR
KG2ML is a new machine learning pipeline that combines knowledge graphs and positive-unlabeled (PU) learning to identify previously undocumented disease-associated genes.
Contribution
KG2ML introduces a novel machine learning pipeline integrating knowledge graphs and PU learning to uncover hidden disease-gene associations.
Findings
For several diseases, 14 of the 15 top-ranked genes identified by KG2ML lacked prior associations in DDKG but were supported by literature and TINX evidence.
Incorporating PULSCAR-imputed genes improved XGBoost classification performance, showing PU learning's effectiveness in uncovering hidden gene-disease relationships.
Expert evaluations of top imputed genes across 12 diseases confirmed the pipeline's potential to reveal missing associations in knowledge graphs.
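The imputation idea behind these findings can be illustrated with a minimal sketch. It is not PULSCAR and not the authors' pipeline: it swaps in a much simpler positive-unlabeled heuristic that ranks unlabeled genes by cosine similarity to the centroid of the known positives. The gene names, the two-dimensional embeddings, and the function `impute_positives` are all hypothetical and exist only for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def impute_positives(embeddings, known_positive, top_k):
    """Rank unlabeled genes by similarity to the positive centroid.

    Assumption: a simplified stand-in for a PU learner, not PULSCAR.
    Returns the top_k unlabeled genes, most similar first.
    """
    center = centroid([embeddings[g] for g in known_positive])
    unlabeled = [g for g in embeddings if g not in known_positive]
    scored = sorted(
        ((cosine(embeddings[g], center), g) for g in unlabeled),
        reverse=True,
    )
    return [gene for _, gene in scored[:top_k]]

# Hypothetical toy embeddings (e.g. derived from a knowledge graph).
embeddings = {
    "GENE_A": [1.0, 0.1],    # known disease-associated
    "GENE_B": [0.9, 0.2],    # known disease-associated
    "GENE_C": [0.95, 0.15],  # unlabeled, close to the positives
    "GENE_D": [0.1, 1.0],    # unlabeled
    "GENE_E": [-0.5, 0.8],   # unlabeled
}
known_positive = {"GENE_A", "GENE_B"}

imputed = impute_positives(embeddings, known_positive, top_k=1)
print(imputed)
```

In a pipeline of the kind summarized here, the imputed genes would be added to the positive set before training a downstream classifier (XGBoost in the paper), and the change in classification performance would indicate whether the imputation helped.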
Abstract
Biomedical knowledge graphs (KGs), such as the Data Distillery Knowledge Graph (DDKG), capture known relationships among entities (e.g., genes, diseases, proteins), providing valuable insights for research. However, these relationships are typically derived from prior studies, leaving potential unknown associations unexplored. Identifying such unknown associations, including previously unknown disease-associated genes, remains a critical challenge in bioinformatics and is crucial for advancing biomedical knowledge. Traditional methods, such as linkage analysis and genome-wide association studies (GWAS), can be time-consuming and resource-intensive. This highlights the need for efficient computational approaches to identify or predict new genes using known disease-gene associations. Recently, network-based methods and KGs, enhanced by advances in machine learning (ML) frameworks, have…
Taxonomy
Topics: Bioinformatics and Genomic Networks · Genetic Associations and Epidemiology · Advanced Graph Neural Networks
