# Using machine learning to predict and analyze complex trait diseases: Lessons from a simple abstract model

**Authors:** Eden Maimon, Ori Bondi, John Moult, Ron Unger

PMC: PMC12928469 · DOI: 10.1371/journal.pone.0342490 · PLOS One · 2026-02-23

## TL;DR

This paper uses abstract, non-additive disease models and simulated variant data to test how well machine learning predicts susceptibility to complex trait diseases, and what a trained neural network can reveal about the underlying disease structure.

## Contribution

The novel contribution is the use of abstract, non-additive disease models to study how disease structure affects predictability and to gain biological insights from neural network analysis.

## Key findings

- Diseases with a more complex underlying structure were predicted better than those with a simpler structure.
- Risk prediction was more accurate for diseases with lower prevalence.
- Post-analysis of a trained neural network with t-SNE provided biological insights into the underlying disease structure.

## Abstract

The ability to predict individual genetic susceptibility to a complex trait disease is a major challenge in modern medicine. One approach to addressing this challenge utilizes an additive combination of contributions from a large number of single nucleotide polymorphisms (SNPs), with weights derived from Genome-Wide Association Studies (GWAS). While this approach is somewhat successful in predicting whether an individual is likely to develop a specific disease, it does not explain why a person is likely to become sick. Here, we designed and utilized abstract disease models to investigate the relationship between disease structure, susceptibility, and predictability. The model consists of a set of interacting pathways, each including several nodes representing loci at which genetic variants can alter the function of the corresponding proteins. Due to the introduction of thresholds for pathway functionality, and the interplay between the pathways, this model is inherently non-additive. We use this “toy model” together with simulated variant data to examine the effect of changing various properties, some of which cannot be easily controlled in a “real-world” scenario. As expected, larger sample sizes improved performance; the omission of some contributing variants from the dataset was associated with a significant decrease in performance, whereas adding irrelevant variants had little effect. Surprisingly, diseases with a more complex underlying structure were better predicted than those with a simpler structure. In addition, risk prediction was more accurate for diseases with lower prevalence. The algorithm was robust to a reasonable percentage of false negative disease assignments. The largest decrease in performance occurred when two diseases with different genetic etiologies were classified as a single pathology, a situation that often occurs in clinical practice and apparently confuses the neural network algorithm. Finally, we show that a post-analysis of a neural network using t-SNE can provide biological insights into the underlying disease structure.
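
To make the non-additivity described in the abstract concrete, here is a minimal Python sketch of a threshold-based pathway model. It is not the authors' implementation: the function names, the binary variant encoding, the variant frequency, and both threshold values are assumptions chosen for illustration, and the "interplay between the pathways" is reduced to a simple count of broken pathways.

```python
import random

# Illustrative toy model only; all names and parameter values are assumptions,
# not taken from the paper.

def simulate_genotype(n_pathways, nodes_per_pathway, variant_freq, rng):
    """One individual: a list of pathways, each a list of 0/1 flags
    (1 = a function-altering variant at that locus)."""
    return [
        [1 if rng.random() < variant_freq else 0 for _ in range(nodes_per_pathway)]
        for _ in range(n_pathways)
    ]

def broken_pathways(genotype, node_threshold):
    """Count pathways whose number of altered nodes reaches the threshold."""
    return sum(1 for pathway in genotype if sum(pathway) >= node_threshold)

def has_disease(genotype, node_threshold, pathway_threshold):
    """Assumed disease rule: enough pathways must be broken at once."""
    return broken_pathways(genotype, node_threshold) >= pathway_threshold

def additive_score(genotype):
    """Unweighted additive burden: total number of variants carried."""
    return sum(sum(pathway) for pathway in genotype)

# Two hand-built individuals with the same total variant count.
concentrated = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
scattered = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]

print(additive_score(concentrated), has_disease(concentrated, 2, 2))  # 4 True
print(additive_score(scattered), has_disease(scattered, 2, 2))        # 4 False

# A small simulated cohort; prevalence depends on the assumed parameters.
rng = random.Random(0)
cohort = [simulate_genotype(3, 4, 0.2, rng) for _ in range(1000)]
prevalence = sum(has_disease(g, 2, 2) for g in cohort) / len(cohort)
print(f"simulated prevalence: {prevalence:.3f}")
```

Both hand-built individuals carry four variants, so an unweighted additive score cannot tell them apart, yet only the one whose variants cluster within pathways crosses the disease rule. This is the kind of structure a purely additive predictor would be expected to miss.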

## Full-text entities

- **Genes:** PRPH2 (peripherin 2) [NCBI Gene 5961] {aka AOFMD, AVMD, CACD2, DS, MDBS1, RDS}, ROM1 (retinal outer segment membrane protein 1) [NCBI Gene 6094] {aka ROM, ROSP1, RP7, TSPAN23}, EREG (epiregulin) [NCBI Gene 2069] {aka EPR, ER, Ep}, FGFR3 (fibroblast growth factor receptor 3) [NCBI Gene 2261] {aka ACH, CD333, CEK2, HSFGFR3EX, JTK4}, ERBB2 (erb-b2 receptor tyrosine kinase 2) [NCBI Gene 2064] {aka CD340, HER-2, HER-2/neu, HER2, MLN 19, MLN-19}, NOD2 (nucleotide binding oligomerization domain containing 2) [NCBI Gene 64127] {aka ACUG, BLAU, BLAUS, CARD15, CD, CLR16.3}
- **Diseases:** Autism (MESH:D001321), cancer (MESH:D009369), Retinitis Pigmentosa type 59 (OMIM:613861), Alzheimer's Disease (MESH:D000544), Parkinson's Disease (MESH:D010300), Digenic diseases (MESH:D004194), Mendelian diseases (MESH:D030342), thrombophilia (MESH:D019851), CD (MESH:D003424), CVD (MESH:D002318), viral infection (MESH:D014777), Achondroplasia (MESH:D000130), DLBCL (MESH:D016403), ABC (MESH:D016393), inflammatory bowel disease (MESH:D015212), breast cancer (MESH:D001943), Type 2 Diabetes (MESH:D003924), lymphoma (MESH:D008223)
- **Chemicals:** Mg (MESH:D008274), Th (MESH:D013910)
- **Species:** Homo sapiens (human, species) [taxon 9606], Bacteria Latreille et al. 1825 (Bacteria stick insect, genus) [taxon 629395]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12928469/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12928469/full.md

## References

49 references — full list in the complete paper: https://tomesphere.com/paper/PMC12928469/full.md

---
Source: https://tomesphere.com/paper/PMC12928469