# Predicting protein–carbohydrate binding sites: a deep learning approach integrating protein language model embeddings and structural features

**Authors:** Md Muhaiminul Islam Nafi, M Saifur Rahman

PMC · DOI: 10.1093/bib/bbag008 · 2026-01-29

## TL;DR

This paper introduces DeepCPBSite, a deep learning model that predicts where proteins bind to carbohydrates, using language model embeddings and structural features to improve accuracy.

## Contribution

The main contribution is DeepCPBSite, an ensemble deep learning model that integrates protein language model embeddings and structural features to predict non-covalent protein–carbohydrate binding sites.

## Key findings

- DeepCPBSite achieved 78.7% balanced accuracy and 59.6% sensitivity on the TS53 set.
- It outperformed the second-best competitor, DeepGlycanSite, by 1.16% in balanced accuracy and 2.94% in sensitivity.
- Its F1, MCC, and AUPR scores improved on other state-of-the-art methods by 3.77%–47.6%, 3.84%–32.7%, and 8.18%–60.21%, respectively.
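
For readers less familiar with the metrics named above, the following is a minimal sketch of how balanced accuracy, sensitivity, F1, and MCC are conventionally computed from residue-level confusion-matrix counts. The function name `binding_site_metrics` and the example counts are hypothetical and are not taken from the paper. AUPR is omitted because it requires predicted scores rather than counts.

```python
from math import sqrt


def binding_site_metrics(tp, fp, tn, fn):
    """Compute residue-level metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on binding residues
    specificity = ratio(tn, tn + fp)   # recall on non-binding residues
    precision = ratio(tp, tp + fp)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sensitivity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts for illustration only, not results from the paper.
metrics = binding_site_metrics(tp=60, fp=20, tn=880, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy averages the recall of both classes, which is why it is often reported alongside sensitivity when binding residues are far rarer than non-binding ones.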

## Abstract

Protein–carbohydrate interactions play an important role in many biological processes and functions, such as inflammation, signal transduction, and cell adhesion. In this paper, we build a deep learning model to predict non-covalent protein–carbohydrate binding sites. Experimental approaches for identifying these sites are expensive, so computational tools are necessary. We explored several sequence-based and structural features and also leveraged protein language model embeddings. We analyzed different architectures and selected the most suitable deep learning architecture for our final prediction model, DeepCPBSite. DeepCPBSite is an ensemble that combines three separate models, each built on the ResNet+FNN architecture and each using a different approach (random undersampling, weighted oversampling, or class-weighted loss). We built separate datasets from three sources: RCSB, UniProt, and CASP. We also compared structural features extracted from structures predicted by AlphaFold and ESMFold in the context of our prediction task. We employed three feature selection techniques and performed a SHAP (SHapley Additive exPlanations) analysis of the structural features after categorizing the proteins by organism. DeepCPBSite achieved 78.7% balanced accuracy and 59.6% sensitivity on the TS53 set, outperforming the second-best competitor, DeepGlycanSite, by 1.16% and 2.94%, respectively. Additionally, its F1, MCC, and AUPR scores outperformed other state-of-the-art methods, with improvements ranging from 3.77%–47.6%, 3.84%–32.7%, and 8.18%–60.21%, respectively.
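
The three approaches named in the abstract are common ways of handling class imbalance, and the following minimal sketch illustrates the general idea of each. It is not the authors' implementation. All function names are hypothetical, the oversampling is expressed as inverse-frequency sampling weights (one common interpretation), and the ensemble step assumes simple probability averaging, which the abstract does not specify.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Keep all positives and an equal-sized random subset of negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(pos), len(neg)))
    return sorted(pos + keep_neg)


def oversample_weights(labels):
    """Per-residue sampling weights; each class gets equal total weight."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def class_weights(labels):
    """Inverse-frequency class weights for a class-weighted loss."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def ensemble_average(prob_lists, threshold=0.5):
    """Average per-residue probabilities across models (assumed rule)."""
    avg = [sum(ps) / len(ps) for ps in zip(*prob_lists)]
    return [int(p >= threshold) for p in avg]


# Toy labels: 1 = binding residue, 0 = non-binding residue.
labels = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(random_undersample(labels))   # indices of a balanced subset
print(class_weights(labels))        # larger weight for the rare class
# Draw a resampled training set using the oversampling weights.
rng = random.Random(0)
print(rng.choices(range(len(labels)), weights=oversample_weights(labels), k=10))
# Combine hypothetical outputs of three models for two residues.
print(ensemble_average([[0.9, 0.2], [0.6, 0.4], [0.3, 0.3]]))
```

Each strategy trades off differently: undersampling discards negative examples, oversampling repeats positive ones, and a class-weighted loss keeps the data unchanged while penalizing errors on the rare class more heavily. Combining models trained under each is one way to balance these effects.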

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12853128/full.md

---
Source: https://tomesphere.com/paper/PMC12853128