# PheCode-guided multi-modal topic modeling of electronic health records improves disease incidence prediction and GWAS discovery from UK Biobank

**Authors:** Ziqi Yang, Ziyang Song, Shadi Zabad, Marc-André Legault, Yue Li

PMC · DOI: 10.1093/bib/bbag030 · Briefings in Bioinformatics · 2026-02-02

## TL;DR

MixEHR-SAGE, a PheCode-guided multi-modal topic model, integrates diagnoses, procedures, and medications from electronic health records to improve incident disease prediction and GWAS discovery in UK Biobank.

## Contribution

MixEHR-SAGE is a PheCode-guided multi-modal topic model that combines expert-informed priors with probabilistic inference to enhance phenotyping from large-scale EHRs, integrating diagnoses, procedures, and medications.
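
As a rough illustration of what an expert-informed prior can look like in a guided topic model, the sketch below seeds a Dirichlet prior over a toy multi-modal vocabulary and applies a standard conjugate update. This is not the MixEHR-SAGE algorithm: the vocabulary, the `seed_codes` mapping, the `base` and `boost` pseudo-counts, and the counts are all hypothetical values chosen for the example.

```python
import numpy as np

# Hypothetical toy vocabulary mixing diagnosis (icd:) and medication (atc:) codes.
vocab = ["icd:E11", "atc:A10BB09", "icd:C92", "icd:J45"]

# Hypothetical expert mapping: codes assumed to define each guided topic.
seed_codes = {
    "T2D": ["icd:E11", "atc:A10BB09"],
    "leukemia": ["icd:C92"],
}

def seeded_prior(vocab, seeds, base=0.1, boost=5.0):
    """Dirichlet pseudo-counts: `base` everywhere, `base + boost` on seed codes."""
    alpha = np.full(len(vocab), base)
    for code in seeds:
        alpha[vocab.index(code)] += boost
    return alpha

def posterior_mean(alpha, counts):
    """Posterior mean of a multinomial under a Dirichlet prior (conjugate update)."""
    post = alpha + counts
    return post / post.sum()

# One prior per guided topic (assumed parameterisation, for illustration only).
priors = {topic: seeded_prior(vocab, seeds) for topic, seeds in seed_codes.items()}

# Made-up counts of codes attributed to the T2D topic.
t2d_counts = np.array([30.0, 12.0, 0.0, 3.0])
t2d_topic = posterior_mean(priors["T2D"], t2d_counts)

for code, prob in zip(vocab, t2d_topic):
    print(f"{code}: {prob:.3f}")
```

In this toy setup the seed codes start with most of the prior mass, so the topic stays anchored to its expert-defined meaning while observed counts can still pull in related codes.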

## Key findings

- MixEHR-SAGE identified over 1000 interpretable phenotype topics from UK Biobank data.
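- Applied to 350 000 individuals with high-quality genetic data, MixEHR-SAGE-derived risk scores accurately predicted incident type 2 diabetes (T2D) and leukemia diagnoses.
- Genome-wide association studies using these continuous risk scores uncovered novel disease-associated loci, including PPP1R15A for T2D and JMJD6/SRSF2 for leukemia, that were missed by traditional binary case definitions.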

## Abstract

Phenome-wide association studies rely on disease definitions derived from diagnostic codes, often failing to leverage the full richness of electronic health records (EHR). We present MixEHR-SAGE, a PheCode-guided multi-modal topic model that integrates diagnoses, procedures, and medications to enhance phenotyping from large-scale EHRs. By combining expert-informed priors with probabilistic inference, MixEHR-SAGE identifies over 1000 interpretable phenotype topics from UK Biobank data. Applied to 350 000 individuals with high-quality genetic data, MixEHR-SAGE-derived risk scores accurately predict incident type 2 diabetes (T2D) and leukemia diagnoses. Subsequent genome-wide association studies using these continuous risk scores uncovered novel disease-associated loci, including PPP1R15A for T2D and JMJD6/SRSF2 for leukemia, that were missed by traditional binary case definitions. These results highlight the potential of probabilistic phenotyping from multi-modal EHRs to improve genetic discovery. The MixEHR-SAGE software is publicly available at: https://github.com/li-lab-mcgill/MixEHR-SAGE.
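
One way to see why a continuous risk score might recover loci that a binary case definition misses is a small simulation. The sketch below is a toy example under stated assumptions, not the paper's analysis: the allele frequency, effect size, sample size, and 5% case threshold are invented, the score is assumed to track an underlying liability, and both phenotypes are tested with a simple linear regression statistic (a real GWAS of a binary trait would typically use logistic regression).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000  # hypothetical sample size

# Simulated variant: allele dosage 0/1/2 with an assumed allele frequency of 0.3.
genotype = rng.binomial(2, 0.3, size=n).astype(float)

# Assumed continuous risk score: small additive genetic effect plus noise.
risk_score = 0.1 * genotype + rng.normal(size=n)

# Binary case definition: top 5% of the score labelled as cases.
case = (risk_score > np.quantile(risk_score, 0.95)).astype(float)

def assoc_z(x, y):
    """t-statistic of the slope in a simple linear regression of y on x."""
    r = np.corrcoef(x, y)[0, 1]
    return r * np.sqrt((len(x) - 2) / (1.0 - r**2))

z_continuous = assoc_z(genotype, risk_score)
z_binary = assoc_z(genotype, case)

print(f"continuous phenotype statistic: {z_continuous:.2f}")
print(f"binary phenotype statistic:     {z_binary:.2f}")
```

Thresholding discards the variation within cases and within controls, so in this simulation the same variant yields a weaker association statistic for the binary phenotype than for the continuous one.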

## Linked entities

- **Genes:** PPP1R15A (protein phosphatase 1 regulatory subunit 15A) [NCBI Gene 23645], JMJD6 (jumonji domain containing 6, arginine demethylase and lysine hydroxylase) [NCBI Gene 23210], SRSF2 (serine and arginine rich splicing factor 2) [NCBI Gene 6427]
- **Diseases:** type 2 diabetes (MONDO:0005148), leukemia (MONDO:0004355)

## Full-text entities

- **Genes:** PPP1R15A (protein phosphatase 1 regulatory subunit 15A) [NCBI Gene 23645] {aka GADD34}, MFSD11 (major facilitator superfamily domain containing 11) [NCBI Gene 79157] {aka ET}, ASXL1 (ASXL transcriptional regulator 1) [NCBI Gene 171023] {aka BOPS, MDS}, HNF1B (HNF1 homeobox B) [NCBI Gene 6928] {aka ADTKD3, FJHN, HNF-1-beta, HNF-1B, HNF1beta, HNF2}, INS (insulin) [NCBI Gene 3630] {aka IDDM, IDDM1, IDDM2, ILPR, IRDN, MODY10}, IDH2 (isocitrate dehydrogenase (NADP(+)) 2) [NCBI Gene 3418] {aka D2HGA2, ICD-M, IDH, IDH-2, IDHM, IDP}, DNMT3A (DNA methyltransferase 3 alpha) [NCBI Gene 1788] {aka DNMT3A2, HESJAS, M.HsaIIIA, TBRS}, TET2 (tet methylcytosine dioxygenase 2) [NCBI Gene 54790] {aka IMD75, KIAA1546, MDS}, CDKAL1 (CDKAL1 threonylcarbamoyladenosine tRNA methylthiotransferase) [NCBI Gene 54901], JMJD6 (jumonji domain containing 6, arginine demethylase and lysine hydroxylase) [NCBI Gene 23210] {aka PSR, PTDSR, PTDSR1}, METTL23 (methyltransferase 23, arginine) [NCBI Gene 124512] {aka C17orf95, MRT44}, MXRA7 (matrix remodeling associated 7) [NCBI Gene 439921], SRSF2 (serine and arginine rich splicing factor 2) [NCBI Gene 6427] {aka PR264, SC-35, SC35, SFRS2, SFRS2A, SRp30b}
- **Diseases:** Unspecified diabetes mellitus (MESH:D003920), sex chromosome aneuploidy (MESH:D025064), T1D (MESH:D003922), agranulocytosis (MESH:D000380), hematologic malignancies (MESH:D019337), cholelithiasis (MESH:D002769), chronic diseases (MESH:D002908), myeloid leukemia (MESH:D007951), Chronic Myeloid Leukemia (MESH:D015464), Alzheimer disease (MESH:D000544), MIMIC-III (MESH:C537189), AML (MESH:D015470), cardiovascular conditions (MESH:D002318), diabetic retinopathy (MESH:D003930), Delirium (MESH:D003693), MDS (MESH:D009190), CAD (MESH:D003324), retinal lesions (MESH:D012164), ICD (MESH:D008310), chronic lymphocytic leukemia (MESH:D015451), lesion (MESH:D009059), Parkinson's disease (MESH:D010300), hypercholesterolemia (MESH:D006937), acute promyelocytic leukemia (MESH:D015473), metabolic disorders (MESH:D008659), inflammatory (MESH:D007249), vision loss (MESH:D014786), microvascular complications (OMIM:603933), infection (MESH:D007239), leukemic transformation (MESH:D002472), nonepithelial cancer of skin (MESH:D012878), Lymphoid Leukemia (MESH:D007945), hematologic abnormalities (MESH:D006402), cataracts (MESH:D002386), essential hypertension (MESH:D000075222), neurodegenerative disorders (MESH:D019636), Non-insulin-dependent diabetes mellitus (MESH:D003924), Dementia (MESH:D003704), Leukemia (MESH:D007938), uterine leiomyoma (OMIM:150699), chronic myelomonocytic leukemia (MESH:D015477), cancer (MESH:D009369), retina (MESH:D019572), bleeding (MESH:D006470), asthma (MESH:D001249), peripheral nerve disorders (MESH:D010523), lymphoid neoplasms (MESH:D008223), diabetic kidney disease (MESH:D003928)
- **Chemicals:** phenoxymethylpenicillin (MESH:D010404), blood sugar (MESH:D001786), gliclazide (MESH:D005907), diclofenac (MESH:D004008), metformin (MESH:D008687), A10BB09 (-), pioglitazone (MESH:D000077205), imatinib (MESH:D000068877)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Mutations:** A10A, rs2058215, rs9897202, rs610308

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12862981/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12862981/full.md

## References

61 references — full list in the complete paper: https://tomesphere.com/paper/PMC12862981/full.md

---
Source: https://tomesphere.com/paper/PMC12862981