# Simplifying causal gene identification in GWAS loci

**Authors:** Marijn Schipper, Jacob Ulirsch, Danielle Posthuma, Stephan Ripke, Karl Heilbron

PMC · DOI: 10.1371/journal.pgen.1012079 · PLOS Genetics · 2026-03-17

## TL;DR

This paper introduces CALDERA, a gene prioritization tool for identifying likely causal genes in GWAS loci. It achieves state-of-the-art performance while using a simpler model and fewer input features than existing methods.

## Contribution

CALDERA is a novel gene prioritization tool that simplifies causal gene identification in GWAS loci by using a logistic regression model with just four input features.
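
As a rough illustration of what scoring genes with a logistic regression involves, the sketch below ranks the genes in one locus by predicted probability. It is not CALDERA's implementation: this summary does not name the four features or give fitted coefficients, so the feature names, weights, intercept, and gene labels here are all hypothetical placeholders.

```python
import math

# Hypothetical feature names and weights, for illustration only; this summary
# does not list CALDERA's four features or its fitted coefficients.
WEIGHTS = {"feature_a": 1.2, "feature_b": 0.8, "feature_c": 0.5, "feature_d": -0.3}
INTERCEPT = -2.0


def causal_probability(features):
    """Logistic regression: sigmoid of the intercept plus the weighted feature sum."""
    z = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def rank_locus(genes):
    """Return (gene, probability) pairs for one locus, highest probability first."""
    scored = [(gene, causal_probability(feats)) for gene, feats in genes.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy locus with made-up genes and feature values.
locus = {
    "GENE_1": {"feature_a": 1, "feature_b": 1, "feature_c": 1, "feature_d": 0},
    "GENE_2": {"feature_a": 0, "feature_b": 1, "feature_c": 0, "feature_d": 1},
    "GENE_3": {"feature_a": 0, "feature_b": 0, "feature_c": 0, "feature_d": 0},
}

ranking = rank_locus(locus)
for gene, prob in ranking:
    print(f"{gene}: {prob:.3f}")
```

Because each coefficient maps directly to a change in log-odds, a model of this form is easy to inspect, which is the interpretability argument the paper makes against black-box alternatives.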

## Key findings

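- On a truth set of causal genes in 200 GWAS loci, a simple logistic regression model performed as well as a more complex XGBoost model.
- Many commonly used features (e.g., expression quantitative trait locus colocalization and Mendelian randomization) could be removed without meaningfully affecting performance.
- In independent benchmarking datasets of resolved GWAS loci, CALDERA achieved state-of-the-art performance compared with FLAMES, L2G, and cS2G.
- Applying CALDERA to 93 UK Biobank traits predicted 11,956 putative causal genes, potentially resolving up to 52% of loci.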

## Abstract

Genome-wide association studies (GWAS) help to identify disease-linked genetic variants, but pinpointing the most likely causal genes in GWAS loci remains challenging. Existing GWAS gene prioritization tools are powerful but often use complex black box models trained on datasets containing biases. Here, we used a data-driven approach to construct a truth set of causal genes in 200 GWAS loci. We found that a simple logistic regression model performed as well as a more complex XGBoost model, and that many commonly used gene prioritization features could be removed without meaningfully affecting performance (e.g., expression quantitative trait locus colocalization and Mendelian randomization). We present CALDERA, a gene prioritization tool that uses a logistic regression model with just four input features. In independent benchmarking datasets of resolved GWAS loci, CALDERA achieved state-of-the-art performance in comparison with other methods (FLAMES, L2G, and cS2G). CALDERA outputs causal gene probabilities for all genes in a given GWAS locus, and we show that these probabilities are well-calibrated. Applying CALDERA to 93 UK Biobank traits, we predicted 11,956 putative causal genes, potentially resolving up to 52% of loci. Overall, CALDERA provides a powerful solution for prioritizing potentially causal genes in GWAS loci that minimizes the data processing required to construct input features and generates an easily interpretable output score.

Genome-wide association studies are a type of genetic study that has identified many genetic mutations involved in disease. However, in most cases the genes affected by these mutations are unknown. To predict these “effector genes”, we introduce a new tool called CALDERA. We show that existing tools may be unnecessarily complex: CALDERA achieves state-of-the-art prediction of known effector genes despite using a simpler machine learning model and far fewer input variables. By applying CALDERA to the results of genetic studies for 93 different traits and diseases, we were able to predict the likely effector gene for up to 52% of all trait- and disease-associated mutations, for a total of 11,956 likely effector genes. By accurately linking mutations to genes, we gain a better understanding of disease biology and uncover potential opportunities to treat disease by manipulating the function of these genes.
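
The abstract's claim that the output probabilities are well-calibrated means that, among genes given a probability near p, roughly a fraction p should turn out to be causal. One common way to check this is a binned reliability table, sketched below. The function name, the equal-width binning, and the toy data are assumptions made for illustration; this summary does not describe the calibration analysis the authors actually ran.

```python
def calibration_table(probs, labels, n_bins=5):
    """Bin predictions into equal-width bins and compare predicted vs observed.

    Returns one (mean predicted probability, observed causal fraction, count)
    tuple per non-empty bin, in order of increasing probability.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    table = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table.append((mean_pred, observed, len(members)))
    return table


# Invented toy data: predicted probabilities and 0/1 causal labels.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

table = calibration_table(probs, labels)
for mean_pred, observed, count in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

For a well-calibrated model, the predicted and observed columns track each other closely across bins; large gaps indicate over- or under-confident scores.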

## Full-text entities

- **Genes:** Runx1 (runt related transcription factor 1) [NCBI Gene 12394] {aka AML1, CBF-alpha-2, Cbfa2, Pebp2a2, Pebpa2b}, Bmp5 (bone morphogenetic protein 5) [NCBI Gene 12160] {aka se}, Tpp1 (tripeptidyl peptidase I) [NCBI Gene 12751] {aka Cln2, LPIC, TPP-1, TPP-I}, Nox4 (NADPH oxidase 4) [NCBI Gene 50490], Cux1 (cut-like homeobox 1) [NCBI Gene 13047] {aka CDP, Cutl1, Cux, Cux-1}, Slc30a8 (solute carrier family 30 (zinc transporter), member 8) [NCBI Gene 239436] {aka C820002P14Rik, ZnT-8, ZnT8}, Setbp1 (SET binding protein 1) [NCBI Gene 240427] {aka C130092E12, Seb, mKIAA0437}, Adamts17 (ADAM metallopeptidase with thrombospondin type 1 motif 17) [NCBI Gene 233332], Tgfbr2 (transforming growth factor, beta receptor II) [NCBI Gene 21813] {aka 1110020H15Rik, DNIIR, RIIDN, TBR-II, TbetaR-II, TbetaRII}, Sfmbt1 (Scm-like with four mbt domains 1) [NCBI Gene 54650] {aka 4930442N21Rik, 9330180L21Rik, Sfmbt, Smr}, IGF1 (insulin like growth factor 1) [NCBI Gene 3479] {aka IGF, IGF-I, IGFI, MGF}, Pde4d (phosphodiesterase 4D, cAMP specific) [NCBI Gene 238871] {aka 9630011N22Rik, Dpde3}, Fgf10 (fibroblast growth factor 10) [NCBI Gene 14165] {aka AEY17, Fgf-10, Fgf5a, Gsfaey17}, Cblb (Cbl proto-oncogene b) [NCBI Gene 208650] {aka Cbl-b}, Glis3 (GLIS family zinc finger 3) [NCBI Gene 226075] {aka 4833409N03Rik, E330013K21Rik}, Zbtb20 (zinc finger and BTB domain containing 20) [NCBI Gene 56490] {aka 1300017A20Rik, 7330412A13Rik, A930017C21Rik, D16Wsu73e, DPZF, HOF}, Lama2 (laminin, alpha 2) [NCBI Gene 16773] {aka 5830440B04, dy, mKIAA4087, mer, merosin}, Pot1a (protection of telomeres 1A) [NCBI Gene 101185] {aka 1500031H18Rik, Pot1}, Wfs1 (wolframin ER transmembrane glycoprotein) [NCBI Gene 22393] {aka wolframin}, Tbx3 (T-box 3) [NCBI Gene 21386] {aka D5Ertd189e}, Ank (progressive ankylosis) [NCBI Gene 11732] {aka Ankh, D15Ertd221e, mKIAA1581}, Npr3 (natriuretic peptide receptor 3) [NCBI Gene 18162] {aka ANP-C, ANPR-C, EF-2, NPR-C, lgj, stri}, Pcsk5 (proprotein convertase subtilisin/kexin type 5) [NCBI Gene 18552] {aka PC5, PC6, SPC6, b2b1549Clo, b2b585Clo}, Pappa2 (pappalysin 2) [NCBI Gene 23850] {aka PAPP-A2, PLAC3, Pappe}, Igf1 (insulin-like growth factor 1) [NCBI Gene 16000] {aka C730016P09Rik, Igf-1, Igf-I}, Pip (prolactin induced protein) [NCBI Gene 18716] {aka GCDFP-15, GP17, SMGP}, Syk (spleen tyrosine kinase) [NCBI Gene 20963] {aka Sykb}
- **Diseases:** Mendelian disorder (MESH:D025861), purpura (MESH:D011693), breast cancer (MESH:D001943), hematological neoplasms (MESH:D019337), short stature (MESH:D006130), thrombocytopenia (MESH:D013921), post- (MESH:D000094025), skeletal defects (MESH:C567306), type 2 diabetes (MESH:D003924), hindlimb hypoplasia (MESH:D000080344), decline in kidney function (MESH:D007680), type 1 diabetes (MESH:D003922), Parkinson's disease (MESH:D010300)
- **Chemicals:** urate (MESH:D014527), bilirubin (MESH:D001663), calcium (MESH:D002118), testosterone (MESH:D013739)
- **Species:** Rattus norvegicus (brown rat, species) [taxon 10116], Mus musculus (house mouse, species) [taxon 10090], Danio rerio (leopard danio, species) [taxon 7955], Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** PCHi-C — Trichoplusia ni (Cabbage looper), Spontaneously immortalized cell line (CVCL_C190)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13012518/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13012518/full.md

## References

62 references — full list in the complete paper: https://tomesphere.com/paper/PMC13012518/full.md

---
Source: https://tomesphere.com/paper/PMC13012518