# Unsupervised learning reveals novel disease-associated proteins in high-dimensional human proteomic data

**Authors:** Elvis Bernard, Yiling Wang, Manlin Chen, Shunqing Xu

PMC · DOI: 10.1038/s41598-026-41385-7 · Scientific Reports · 2026-02-22

## TL;DR

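This paper introduces DIRAM/COD, an unsupervised framework that combines dimensionality reduction with unsupervised learning, and applies it to UK Biobank plasma proteomics (2,923 proteins, 52,691 participants) to find disease-associated proteins.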

## Contribution

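The DIRAM/COD (Dimensionality Reduction with Avoidance of Missing/COmmunity Detection) framework combines dimensionality reduction with unsupervised learning, making unsupervised analysis of high-dimensional proteomic data tractable where brute-force analysis would not be.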

## Key findings

- Confirmed well-established biomarkers, including UBE2L6 for hypertension and LRCH4 for leukemia.
- Identified IGF2BP3, a protein previously linked to intestinal barrier function, as a novel candidate associated with celiac disease.
- Identified several other proteins not previously associated with these diseases.

## Abstract

Modern advancements in precision medicine have led to the generation of vast proteomic datasets, capturing the concentrations of thousands of proteins across tens of thousands of participants. These datasets are traditionally processed using supervised learning methods due to their relative simplicity to implement and assess the output. However, this approach can sometimes overlook subtle patterns that might offer deeper insights. In contrast, unsupervised learning, while capable of revealing hidden relationships, struggles with the challenge of high dimensionality, meaning that brute-force analysis could take millennia to complete. In this study, we developed the Dimensionality Reduction with Avoidance of Missing/COmmunity Detection (DIRAM/COD) framework to address this problem by combining dimensionality reduction techniques with unsupervised learning to analyze the massive proteomic dataset of the UK Biobank, which includes the concentrations of 2,923 plasma proteins from 52,691 participants. By applying this novel approach, we not only confirmed well-established biomarkers for diseases such as hypertension (UBE2L6) and leukemia (LRCH4) but also identified novel protein candidates. For instance, we identified IGF2BP3 in connection with celiac disease, a protein previously linked to intestinal barrier function, along with several other proteins not yet associated with these diseases. This approach opens up exciting possibilities for future research and may pave the way for the discovery of new biomarkers and therapeutic targets.

The online version contains supplementary material available at 10.1038/s41598-026-41385-7.
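
The abstract describes grouping proteins without disease labels while coping with missing measurements. The toy sketch below illustrates that general idea only and does not reproduce DIRAM/COD, whose details are in the full paper. Everything in it is an assumption made for illustration: the protein names `P1` to `P4`, the values, the 0.8 threshold, the use of pairwise-complete Pearson correlation to sidestep missing values, and connected components as the simplest possible stand-in for a community detection algorithm.

```python
from itertools import combinations
from math import sqrt


def pearson_pairwise_complete(x, y):
    """Pearson r using only positions where both values are present (not None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return 0.0  # too few shared observations to estimate a correlation
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_a * var_b)


def protein_communities(columns, threshold=0.8):
    """Link proteins with |r| >= threshold, then return connected components.

    Connected components are a deliberately simple stand-in for community
    detection; the threshold is an arbitrary illustrative value.
    """
    names = list(columns)
    neighbours = {name: set() for name in names}
    for p, q in combinations(names, 2):
        if abs(pearson_pairwise_complete(columns[p], columns[q])) >= threshold:
            neighbours[p].add(q)
            neighbours[q].add(p)

    seen, communities = set(), []
    for start in names:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:  # depth-first walk over the correlation graph
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(neighbours[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


# Hypothetical toy data: four proteins in six participants; None marks a missing value.
toy = {
    "P1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "P2": [2.1, 3.9, None, 8.2, 9.8, 12.1],
    "P3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
    "P4": [4.9, 1.2, 4.1, None, 6.2, 2.9],
}
communities = protein_communities(toy)
print(communities)
```

On this toy input the sketch groups `P1` with `P2` and `P3` with `P4`, because each pair tracks closely over the participants they share. Comparing every pair of columns scales quadratically with the number of proteins, so a real analysis at UK Biobank scale would likely rely on an established dimensionality reduction step and a dedicated community detection algorithm rather than this exhaustive comparison.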

## Linked entities

- **Genes:** UBE2L6 (ubiquitin conjugating enzyme E2 L6) [NCBI Gene 9246], LRCH4 (leucine rich repeats and calponin homology domain containing 4) [NCBI Gene 4034], IGF2BP3 (insulin like growth factor 2 mRNA binding protein 3) [NCBI Gene 10643]
- **Diseases:** leukemia (MONDO:0004355), celiac disease (MONDO:0005130)

## Full-text entities

- **Genes:** UBE2L6 (ubiquitin conjugating enzyme E2 L6) [NCBI Gene 9246] {aka RIG-B, UBCH8}, IGF2BP3 (insulin like growth factor 2 mRNA binding protein 3) [NCBI Gene 10643] {aka CT98, IMP-3, IMP3, KOC, KOC1, VICKZ3}, LRCH4 (leucine rich repeats and calponin homology domain containing 4) [NCBI Gene 4034] {aka LRN, LRRN1, LRRN4, PP14183}
- **Diseases:** celiac disease (MESH:D002446), leukemia (MESH:D007938), hypertension (MESH:D006973)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13022487/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13022487/full.md

## References

7 references — full list in the complete paper: https://tomesphere.com/paper/PMC13022487/full.md

---
Source: https://tomesphere.com/paper/PMC13022487