# Exploring Random Forest in Genetic Risk Score Construction

**Authors:** Vaishnavi Venkat, Kaylyn Clark, X. Jessie Jeng, Tsung‐Chieh Yao, Hui‐Ju Tsai, Tzu‐Pin Lu, Tzu‐Hung Hsiao, Ching‐Heng Lin, Shannon Holloway, Cathrine Hoyo, Shin‐Yi Chou, Hui Wang, Wan‐Ping Lee, Li‐San Wang, Jung‐Ying Tzeng

PMC · DOI: 10.1002/gepi.70022 · 2025-10-25

## TL;DR

This paper explores random forest models for building genetic risk scores that capture complex, nonlinear genetic interactions better than traditional additive methods.

## Contribution

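The study introduces two random forest-based genetic risk score strategies: ctRF, which optimizes linkage disequilibrium (LD) clumping and p-value thresholds within the random forest, and wRF, which adjusts each SNP's chance of inclusion in tree nodes according to its association strength. Both aim to boost random forest performance and to use base data information when it is available.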
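To make the clumping-and-thresholding idea behind ctRF concrete, the following is a minimal sketch of how candidate SNP sets could be generated over a grid of p-value and LD (r²) thresholds. It is not the authors' implementation: the function names, the greedy clumping rule without a distance window, and the toy inputs are all assumptions for illustration, and the step of fitting a random forest on each candidate set and picking the best one is left out.

```python
def clump_and_threshold(p_values, r2, p_threshold, r2_threshold):
    """Keep SNPs with p <= p_threshold, then greedily clump by LD.

    Hypothetical helper: SNPs are visited from smallest to largest p-value,
    and a SNP is kept only if its r^2 with every already-kept SNP is below
    r2_threshold. No physical-distance window is applied (a simplification).
    """
    passing = [i for i, p in enumerate(p_values) if p <= p_threshold]
    passing.sort(key=lambda i: p_values[i])
    kept = []
    for i in passing:
        if all(r2[i][j] < r2_threshold for j in kept):
            kept.append(i)
    return sorted(kept)


def candidate_grid(p_values, r2, p_thresholds, r2_thresholds):
    """Build one candidate SNP set per (p-value, r^2) threshold pair."""
    return {
        (pt, rt): clump_and_threshold(p_values, r2, pt, rt)
        for pt in p_thresholds
        for rt in r2_thresholds
    }


# Toy base-data p-values and pairwise r^2 (made-up numbers).
p_values = [1e-8, 1e-6, 0.03, 0.4]
r2 = [
    [1.0, 0.9, 0.05, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.05, 0.1, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
]
grid = candidate_grid(p_values, r2, p_thresholds=[0.05, 1.0], r2_thresholds=[0.5, 0.95])
```

In a fuller pipeline, each entry of `grid` would define the feature set for one random forest fit, and the threshold pair with the best validation or out-of-bag performance would be retained.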

## Key findings

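- In simulations and real-data applications (Alzheimer's disease, body mass index, and atopy), ctRF consistently outperforms other random forest-based methods and classical additive models when traits have complex genetic architectures.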
- Incorporating informative base data into random forest-based genetic risk scores enhances predictive accuracy.
- Random forest-based genetic risk scores effectively capture nonlinear genetic interactions in complex traits.
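One way base data can enter a random forest is through the chance that each SNP is offered as a split candidate at a tree node, which is the idea behind wRF. The sketch below is an assumption-laden illustration, not the paper's algorithm: it supposes sampling weights proportional to −log10(p) from the base data and draws `mtry` distinct SNPs per node. The function names and the weighting formula are hypothetical.

```python
import math
import random


def association_weights(p_values, floor=1e-300):
    """Turn base-data p-values into sampling weights that sum to 1.

    Assumed weighting: proportional to -log10(p). Falls back to uniform
    weights when every p-value equals 1 (no association signal).
    """
    if not p_values:
        return []
    raw = [-math.log10(min(1.0, max(p, floor))) for p in p_values]
    total = sum(raw)
    if total <= 0:
        return [1.0 / len(p_values)] * len(p_values)
    return [r / total for r in raw]


def draw_node_candidates(p_values, mtry, rng):
    """Draw up to mtry distinct SNP indices, favouring strong associations."""
    weights = association_weights(p_values)
    remaining = list(range(len(p_values)))
    chosen = []
    while remaining and len(chosen) < mtry:
        w = [weights[i] for i in remaining]
        if sum(w) <= 0:
            # Only zero-weight SNPs are left: pick uniformly among them.
            pick = rng.choice(remaining)
        else:
            pick = rng.choices(remaining, weights=w, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


p_values = [1e-10, 1e-4, 0.01, 0.5, 1.0]  # toy base-data p-values
weights = association_weights(p_values)
candidates = draw_node_candidates(p_values, mtry=3, rng=random.Random(42))
```

A standard random forest corresponds to the uniform case, where every SNP has the same chance of being a split candidate.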

## Abstract

Genetic risk scores (GRS) are crucial tools for estimating an individual's genetic liability to various traits and diseases, computed as a weighted sum of trait‐associated allele counts. Traditionally, GRS models assume additive, linear effects of risk variants. However, complex traits often involve nonadditive interactions, such as epistasis, which are not captured by these conventional methods. In this study, we investigate the use of random forest (RF) models as a model‐free approach for constructing GRS, leveraging RF's capacity to capture complex, nonlinear interactions among genetic variants. Specifically, we introduce two new RF‐based GRS strategies to boost RF performance and to incorporate base data information if available, including (1) ctRF, which optimizes linkage disequilibrium (LD) clumping and p‐value thresholds within RF; and (2) wRF, which adjusts the chance of SNP inclusion in tree nodes based on their association strength. Through simulation studies and real data applications of Alzheimer's disease, body mass index, and atopy, we find that ctRF consistently outperforms other RF‐based methods and classical additive models when traits exhibit complex genetic architectures. Additionally, incorporating informative base data into RF‐GRS construction can enhance predictive accuracy. Our findings suggest that RF‐based GRS can effectively capture intricate genetic interactions, and offer a robust alternative to traditional GRS methods, especially for complex traits with nonlinear genetic effects.
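For reference, the classical additive GRS described at the start of the abstract is a weighted sum of trait-associated allele counts. A minimal sketch, with made-up genotypes and effect-size weights and a hypothetical function name:

```python
def additive_grs(genotypes, weights):
    """Classical additive GRS: per-individual weighted sum of allele counts.

    genotypes: one list of allele counts (0, 1, or 2) per individual.
    weights:   one effect-size weight per SNP, e.g. from base data.
    """
    scores = []
    for person in genotypes:
        if len(person) != len(weights):
            raise ValueError("each genotype row needs one count per weight")
        scores.append(sum(w * g for w, g in zip(weights, person)))
    return scores


# Three individuals, three SNPs (illustrative values only).
genotypes = [
    [0, 1, 2],
    [2, 0, 1],
    [1, 1, 0],
]
weights = [0.5, -0.25, 1.0]
scores = additive_grs(genotypes, weights)
```

Because each SNP contributes independently and linearly, this form cannot represent interactions between variants, which is the gap the RF-based strategies are meant to address.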

## Linked entities

- **Diseases:** Alzheimer's disease (MONDO:0004975)

## Full-text entities

- **Diseases:** Alzheimer's disease (MESH:D000544)

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12553327/full.md

---
Source: https://tomesphere.com/paper/PMC12553327