# Generalizable compound protein interaction prediction with a model incorporating protein structure aware and compound property aware language model representations

**Authors:** Yiming Zhang, Ryuichiro Ishitani, Mizuki Takemoto, Atsuhiro Tomita

PMC · DOI: 10.1038/s42004-025-01844-0 · 2025-12-19

## TL;DR

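The paper introduces GenSPARC, a deep learning model for compound–protein interaction (CPI) prediction. It combines structure-aware protein representations, derived from AlphaFold2 predictions and FoldSeek's three-dimensional interaction alphabet, with compound features from graph convolutional networks and a pretrained chemical language model.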

## Contribution

GenSPARC combines structure-aware protein representations with property-aware compound representations to improve the accuracy and generalizability of CPI prediction.

## Key findings

- GenSPARC demonstrates strong generalizability across challenging CPI data splits.
- The model achieves competitive results in virtual screening tasks.
- Structure-aware and multimodal representations enhance interaction modeling.
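
One common way to build a challenging CPI split is to hold out whole proteins, so that no test protein is ever seen during training. The sketch below illustrates that idea only; the record layout, the function name `cold_protein_split`, and the toy data are assumptions made for illustration, and the exact split protocols used in the paper are not described in this summary.

```python
import random
from typing import List, Tuple

# Hypothetical record layout: (compound_id, protein_id, label)
Pair = Tuple[str, str, float]


def cold_protein_split(
    pairs: List[Pair], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Hold out whole proteins so that test proteins never occur in training."""
    proteins = sorted({protein for _, protein, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    # Always hold out at least one protein.
    n_test = max(1, int(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])
    train = [rec for rec in pairs if rec[1] not in test_proteins]
    test = [rec for rec in pairs if rec[1] in test_proteins]
    return train, test


# Toy interaction records (made up for this example).
pairs = [
    ("c1", "pA", 1.0), ("c2", "pA", 0.0),
    ("c1", "pB", 0.0), ("c3", "pB", 1.0),
    ("c2", "pC", 1.0), ("c3", "pD", 0.0),
    ("c4", "pE", 1.0),
]
train, test = cold_protein_split(pairs, test_fraction=0.4, seed=0)
```

The same pattern applies to holding out compounds: group by `compound_id` instead of `protein_id`. A random split over pairs, by contrast, lets the same protein appear on both sides and tends to overstate generalization.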

## Abstract

Compound–protein interaction (CPI) prediction plays a crucial role in drug discovery by aiding the identification of binding and affinities between small molecules and proteins. Current deep learning models rely heavily on sequence-based representations and suffer from a lack of labeled data, which restricts their accuracy and generalizability. To overcome these challenges, we propose GenSPARC (a model with Generalized Structure- and Property-Aware Representations of protein and chemical language models for CPI prediction), a deep learning model that leverages structure-aware protein representations derived from AlphaFold2 predictions and FoldSeek’s three-dimensional interaction alphabet. Compound features were extracted using graph convolutional networks and a pretrained chemical language model, thereby ensuring comprehensive multimodal representation. An attention mechanism further enhanced interaction modeling by capturing intricate binding patterns. GenSPARC was validated successfully with multiple CPI benchmark datasets, demonstrating strong generalizability across challenging data splits and competitive results in virtual screening tasks. Therefore, GenSPARC will substantially advance artificial intelligence-driven drug discovery.
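
To make "structure-aware protein representation" concrete: FoldSeek describes the local 3D environment of each residue with one letter from a 20-state structural alphabet, so a protein can be written as two aligned strings, one of amino acids and one of structural states. A language model can then read one combined token per residue. The sketch below shows such a pairing; the function name, the lowercase convention for structural letters, and the example strings are assumptions for illustration, not GenSPARC's confirmed tokenization.

```python
from typing import List

# The 20 standard amino-acid letters.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# Assumption: the 20 structural states are written with the same letters,
# lowercased here only to keep the two alphabets visually distinct.
STRUCTURE_STATES = set("acdefghiklmnpqrstvwy")


def structure_aware_tokens(aa_seq: str, struct_seq: str) -> List[str]:
    """Pair each residue with its structural state, one token per position."""
    if len(aa_seq) != len(struct_seq):
        raise ValueError("sequence and structure strings must have equal length")
    tokens = []
    for aa, state in zip(aa_seq.upper(), struct_seq.lower()):
        if aa not in AMINO_ACIDS or state not in STRUCTURE_STATES:
            raise ValueError(f"unexpected symbol pair: {aa!r}, {state!r}")
        tokens.append(aa + state)
    return tokens


# Made-up four-residue example; real structural strings come from a 3D model.
tokens = structure_aware_tokens("MKTV", "dvpl")
vocabulary_size = len(AMINO_ACIDS) * len(STRUCTURE_STATES)
```

With this pairing the vocabulary grows from 20 residue types to 400 residue-plus-structure tokens, which is what lets a sequence model distinguish the same amino acid in different structural contexts.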

Compound–protein interaction prediction is essential for drug discovery, yet current models struggle with accuracy due to reliance on sequence-based data and limited labeled datasets. Here, the authors introduce GenSPARC, a deep learning model utilizing structure-aware protein representations and advanced multimodal molecular representations, achieving generalizability in virtual screening.
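
The abstract mentions an attention mechanism for interaction modeling but does not specify its form here. As a generic illustration, the sketch below implements plain scaled dot-product cross-attention in which each compound atom embedding attends over protein residue embeddings. The function names, the two-dimensional toy embeddings, and the choice of atoms as queries are all assumptions, not the architecture reported in the paper.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """Scaled dot-product attention: each query mixes the values by similarity to the keys."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Weighted sum of value vectors, computed per output channel.
        mixed = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(mixed)
        weights.append(w)
    return outputs, weights


# Toy embeddings (made up): 2 compound atoms, 3 protein residues.
atom_embeddings = [[1.0, 0.0], [0.0, 1.0]]
residue_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = cross_attention(atom_embeddings, residue_embeddings, residue_embeddings)
```

Each row of `weights` sums to 1 and can be read as how strongly one atom attends to each residue, which is the sense in which attention can highlight candidate binding patterns.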

## Full-text entities

- **Genes:** F2 (coagulation factor II, thrombin) [NCBI Gene 2147] {aka PT, RPRGL2, THPH1}, MMP12 (matrix metallopeptidase 12) [NCBI Gene 4321] {aka HME, ME, MME, MMP-12}, PSC (Cholangitis, primary sclerosing) [NCBI Gene 100653366]
- **Chemicals:** amino acid (MESH:D000596)
- **Benchmark datasets:** DUD-E, KIBA

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12823658/full.md

---
Source: https://tomesphere.com/paper/PMC12823658