# Illuminating the Druggable Human Proteome with an AI Protein Profiling Platform

**Authors:** Guy W. Dayhoff, Daniel Kortzak, Ruibin Liu, Mingzhe Shen, Zhong-Yin Zhang, Jana Shen

PMC · DOI: 10.21203/rs.3.rs-7667948/v1 · Research Square · 2025-10-03

## TL;DR

A new AI platform called AiPP predicts ligand interaction sites directly from protein sequence; applied to the entire human proteome, it identifies ligandable sites in proteins that ABPP left undetected or unliganded.

## Contribution

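AiPP is a multimodal AI platform that combines evolutionary-scale protein large language models with two harmonized training sets (cysteine ligandability from ABPP studies and reversible binding from co-crystal structures) to predict and characterize ligandable sites directly from protein sequence.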

## Key findings

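- Although trained exclusively on ABPP data, AiPP recovers 80% (Top-1) of cysteine liganding events from co-crystal structures, with 84% AUPRC and 89% AUROC.
- AiPP identifies dynamic, ligandable pockets in “undruggable” transcription factors and predicts active-site and allosteric cysteines in protein tyrosine phosphatases that ABPP did not detect.
- Applied to the entire human proteome, AiPP identifies ligandable sites in proteins undetected or unliganded by ABPP, including an allosteric site in MC3R, a therapeutic target for eating disorders and obesity.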
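
To make the three reported metrics concrete, here is a minimal sketch of how Top-1 recovery, AUPRC (as average precision), and AUROC could be computed from per-cysteine scores. The toy data, the function names, and in particular the Top-1 definition (a protein counts as recovered when its highest-scoring cysteine is a known liganded one) are assumptions for illustration; this summary does not state how the paper defines or computes them.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve (score ties are not merged)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


def top1_recovery(records):
    """Assumed definition: fraction of proteins whose top-scoring cysteine is liganded.

    records: iterable of (protein_id, score, label) tuples, one per cysteine.
    """
    best = {}
    for pid, score, label in records:
        if pid not in best or score > best[pid][0]:
            best[pid] = (score, label)
    return sum(label for _, label in best.values()) / len(best)


# Hypothetical toy data: three proteins, seven cysteines.
records = [
    ("P1", 0.9, 1), ("P1", 0.2, 0), ("P1", 0.4, 0),
    ("P2", 0.3, 1), ("P2", 0.7, 0),
    ("P3", 0.8, 1), ("P3", 0.1, 0),
]
scores = [s for _, s, _ in records]
labels = [y for _, _, y in records]

print("AUROC:", auroc(scores, labels))
print("AUPRC:", average_precision(scores, labels))
print("Top-1:", top1_recovery(records))
```

AUROC and AUPRC pool all cysteines into one ranking, whereas Top-1 is evaluated per protein, so a model can score well on one and poorly on the other.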

## Abstract

Creating a ligandable atlas for the proteome would transform our understanding of protein functions and accelerate therapeutic discovery; however, proteomic approaches are constrained by insufficient proteome coverage and data heterogeneity, while existing machine learning (ML) models have limited power due to structural dependencies and heterogeneous experimental labels. Here we developed AiPP, a multimodal AI platform that predicts and characterizes ligand interaction sites directly from protein sequence. AiPP is powered by evolutionary-scale protein large language models (LLMs) and leverages two harmonized ML training sets derived from new databases comprising cysteine ligandability from activity-based protein profiling (ABPP) studies and reversible binding evidenced by co-crystal structures. We developed an LLM-representation-based clustering framework to interrogate, reconcile, and augment experimental labels in both databases. Two complementary protocols were implemented to iteratively expand the training data while improving model performance. Although trained exclusively on ABPP data, AiPP recovers 80% (Top-1) of cysteine liganding events from co-crystal structures, with 84% AUPRC and 89% AUROC. AiPP recapitulates consistently and heterogeneously liganded cysteines across cancer cell lines and reliably identifies dynamic, ligandable pockets in “undruggable” transcription factors. Remarkably, AiPP accurately predicts active-site and allosteric cysteines in protein tyrosine phosphatases that were undetected by ABPP. Finally, we applied AiPP to the entire human proteome, identifying ligandable sites in proteins that were undetected or unliganded by ABPP, including an allosteric site in MC3R, which is a therapeutic target for the treatment of eating disorders and obesity.
This proteome-wide covalent ligandability atlas (version 1.0) is anticipated to guide future development of chemical probes and pharmaceutical modulators, particularly for understudied proteins and currently undruggable targets. The LLM-based approach for interrogating large-scale heterogeneous data is broadly applicable to protein research and to the development of proteomics-derived ML models for diverse applications.
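
The idea of clustering LLM representations to interrogate and reconcile experimental labels can be sketched as follows. This is a hypothetical illustration, not the authors' framework: the greedy cosine-similarity clustering, the `threshold` value, the majority-vote rule, and the toy 2-D vectors standing in for protein LLM embeddings are all assumptions, since this summary does not describe the actual algorithm.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_clusters(embeddings, threshold=0.9):
    """Assign each vector to the first cluster whose seed is similar enough, else open a new one."""
    seeds, assignment = [], []
    for vec in embeddings:
        for cid, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                assignment.append(cid)
                break
        else:  # no existing seed matched
            seeds.append(vec)
            assignment.append(len(seeds) - 1)
    return assignment


def audit_labels(assignment, labels):
    """Per cluster, report the majority label and the sites whose label disagrees with it."""
    clusters = {}
    for idx, cid in enumerate(assignment):
        clusters.setdefault(cid, []).append(idx)
    report = {}
    for cid, members in clusters.items():
        positives = sum(labels[i] for i in members)
        majority = 1 if positives * 2 >= len(members) else 0
        conflicts = [i for i in members if labels[i] != majority]
        report[cid] = {"majority": majority, "conflicts": conflicts}
    return report


# Hypothetical toy data: five cysteine sites with 2-D stand-in embeddings
# and binary experimental labels (1 = liganded, 0 = not liganded).
embeddings = [(1.0, 0.0), (0.99, 0.05), (0.98, 0.1), (0.0, 1.0), (0.05, 0.99)]
labels = [1, 1, 0, 0, 0]

assignment = greedy_clusters(embeddings)
report = audit_labels(assignment, labels)
print(assignment)
print(report)
```

In this toy run, site 2 sits in a cluster whose other members are labeled liganded, so it is flagged as a candidate for review; whether such a site is relabeled, dropped, or kept is a policy choice that the sketch leaves open.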

## Linked entities

- **Proteins:** MC3R (melanocortin 3 receptor)
- **Diseases:** eating disorder (MONDO:0005451), obesity (MONDO:0011122)

## Full-text entities

- **Genes:** MC3R (melanocortin 3 receptor) [NCBI Gene 4159] {aka BMIQ9, MC3, MC3-R, OB20, OQTL}
- **Diseases:** cancer (MESH:D009369), eating disorder (MESH:D001068), obesity (MESH:D009765)
- **Chemicals:** cysteine (MESH:D003545)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12622158/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12622158/full.md

## References

69 references — full list in the complete paper: https://tomesphere.com/paper/PMC12622158/full.md

---
Source: https://tomesphere.com/paper/PMC12622158