# Evaluating genetic-based disease prediction approaches through simulation

**Authors:** Max Shpak, Eric Parfitt, Soroush Mahmoudiandehkordi, Mehdi Maadooliat, Steven J. Schrodi

PMC · DOI: 10.1007/s00439-025-02798-y · 2026-01-21

## TL;DR

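This study uses Monte Carlo simulations of case-control genotypes to compare how well logistic regression, naïve Bayes, random forest, and neural network classifiers predict disease phenotypes, using AUC as the measure of performance.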

## Contribution

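The study introduces a systematic simulation framework for evaluating genetic-based disease prediction models, in which risk alleles are parameterized by effect strength and by mode of inheritance (additive, dominant, recessive, and combinations thereof).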

## Key findings

- Random forest models had the highest AUC across the range of inheritance parameters, followed by logistic regression and naïve Bayes; the feedforward multilayer neural network had lower AUC.
- AUC has a monotonic, curvilinear relationship with the difference in polygenic risk score (PRS) between cases and controls, as predicted analytically from odds-risk and liability threshold models.
- The odds-risk model estimates the AUC-PRS association accurately when risk effects are small, while the liability threshold model performs better when risk alleles have strong effects.
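
One way to see why the AUC-PRS relationship is curvilinear is the standard binormal approximation. This is an illustrative assumption and not necessarily the derivation used in the paper: it supposes that PRS is roughly normal in cases and in controls, with means $\mu_1$, $\mu_0$ and variances $\sigma_1^2$, $\sigma_0^2$ (symbols introduced here for illustration).

$$
\mathrm{AUC} \approx \Phi\!\left(\frac{\mu_1 - \mu_0}{\sqrt{\sigma_1^2 + \sigma_0^2}}\right)
$$

Here $\Phi$ is the standard normal cumulative distribution function. Under this approximation AUC equals 0.5 when the mean PRS difference is zero, increases monotonically with the difference, and flattens as it approaches 1. For example, a standardized difference of 0.5 gives $\Phi(0.5) \approx 0.69$, a difference of 1 gives $\Phi(1) \approx 0.84$, and a difference of 2 gives $\Phi(2) \approx 0.98$.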

## Abstract

Common diseases exhibit substantial heritability, and GWAS of these diseases have revealed hundreds of thousands of high-frequency disease susceptibility variants throughout the genome. These studies offer the prospect of using genomic data to improve disease prediction and diagnosis; however, the relative performance of different predictive modeling approaches is not well-characterized. To investigate this systematically, we constructed a Monte Carlo simulation generating model genomes with 500 SNPs carrying risk alleles that are parameterized by the strength of their effects and by different modes of inheritance: additive, dominant, recessive, and combinations thereof. After generating genotypes for cases and controls, several machine learning classifiers (logistic regression, naïve Bayes, random forests, and neural networks, with and without feature selection) were applied to predict disease phenotypes from genotypes. Each classifier’s error rates were evaluated and compared using AUC. We found that random forest models were the most accurate predictors of disease over the range of inheritance parameters, followed by logistic regression and naïve Bayes, while the feedforward multilayer neural network model had lower AUC. We also investigated the association of AUC with the difference in polygenic risk score (PRS) between disease and control samples by comparing AUC in the simulations to the values predicted from the PRS distributions, finding a monotonic, curvilinear relationship as predicted analytically from odds-risk and liability threshold models. Our results also show that with small risk effects, the odds-risk model provided an accurate estimate of the AUC-PRS association, while a liability threshold model performed better when risk alleles had strong effects.
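
The simulation idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: every SNP shares one risk-allele frequency (0.3) and one per-dose log odds ratio (0.1), genotypes follow Hardy-Weinberg proportions, disease status is drawn from a logistic (odds-risk style) model, and the true PRS is used directly as the predictor with no classifier trained. The function names (`encode`, `simulate`, `empirical_auc`, `binormal_auc`) and all parameter values are hypothetical. The sketch compares the empirical AUC with a binormal prediction computed from the case and control PRS distributions, namely the standard normal CDF of the mean PRS difference divided by the square root of the summed variances.

```python
import math
import random
from statistics import NormalDist, mean, variance


def encode(genotype, mode):
    """Map a 0/1/2 risk-allele count to a risk dose under an inheritance mode."""
    if mode == "additive":
        return genotype
    if mode == "dominant":
        return 1 if genotype >= 1 else 0
    if mode == "recessive":
        return 1 if genotype == 2 else 0
    raise ValueError(f"unknown mode: {mode}")


def simulate(n_people=2000, n_snps=500, freq=0.3, log_or=0.1,
             mode="additive", prevalence_at_mean=0.2, seed=1):
    """Return (PRS list, 0/1 disease labels). All defaults are assumptions."""
    rng = random.Random(seed)
    # Hardy-Weinberg cumulative thresholds for genotypes 0 and 1.
    p_zero = (1 - freq) ** 2
    p_zero_or_one = p_zero + 2 * freq * (1 - freq)
    dose_of = [encode(g, mode) for g in (0, 1, 2)]
    scores = []
    for _ in range(n_people):
        dose = 0
        for _ in range(n_snps):
            u = rng.random()
            g = 0 if u < p_zero else (1 if u < p_zero_or_one else 2)
            dose += dose_of[g]
        scores.append(log_or * dose)  # PRS = sum of per-SNP log odds
    # Centre the PRS so a person with the mean score has the chosen risk.
    centre = mean(scores)
    intercept = math.log(prevalence_at_mean / (1 - prevalence_at_mean))
    labels = []
    for s in scores:
        risk = 1 / (1 + math.exp(-(intercept + s - centre)))
        labels.append(1 if rng.random() < risk else 0)
    return scores, labels


def empirical_auc(scores, labels):
    """AUC as the Mann-Whitney statistic; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_case * (n_case + 1) / 2) / (n_case * n_ctrl)


def binormal_auc(scores, labels):
    """AUC predicted from case/control PRS means and variances (normal approx.)."""
    case = [s for s, y in zip(scores, labels) if y == 1]
    ctrl = [s for s, y in zip(scores, labels) if y == 0]
    delta = mean(case) - mean(ctrl)
    return NormalDist().cdf(delta / math.sqrt(variance(case) + variance(ctrl)))


scores, labels = simulate()
print(round(empirical_auc(scores, labels), 3),
      round(binormal_auc(scores, labels), 3))
```

Passing `mode="dominant"` or `mode="recessive"` to `simulate`, or changing `log_or`, gives a rough feel for how the inheritance mode and effect size shift both the empirical AUC and its prediction from the PRS distributions.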

The online version contains supplementary material available at 10.1007/s00439-025-02798-y.

## Full-text entities

- **Diseases:** Schizophrenia (MESH:D012559), behavioral disorders (MESH:D001523), age-related macular degeneration (MESH:D008268), non-familial Alzheimer's (MESH:C536596), Type II diabetes (MESH:D003924), rheumatoid arthritis (MESH:D001172), asthma (MESH:D001249), hereditary cancers (MESH:D009386)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12823641/full.md

---
Source: https://tomesphere.com/paper/PMC12823641