# Machine learning models for prediction of (Pro)cathepsin–glycosaminoglycan binding free energies based on molecular structure

**Authors:** Krzysztof K. Bojarski, Patrick K. Quoika, Martin Zacharias

PMC · DOI: 10.1016/j.csbj.2025.11.059 · Computational and Structural Biotechnology Journal · 2025-12-08

## TL;DR

This paper trains machine learning models to predict MM-GBSA binding free energies of (pro)cathepsin–glycosaminoglycan complexes from structural and energetic descriptors, enabling rapid screening for favorable binding regions and supporting structure-guided design.

## Contribution

The study introduces a machine learning framework using structural and energetic descriptors to predict binding free energies in (pro)cathepsin–GAG complexes.
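For orientation, the two energy models named in this summary are usually written as follows. These are the textbook forms, given here as an assumption; the paper's exact definitions (for example, whether an entropy term is included, or how the LIE components enter as features) may differ.

```latex
% MM-GBSA binding free energy (standard form; entropy term often omitted)
\Delta G_{\mathrm{bind}} = \langle G_{\mathrm{complex}} \rangle
                         - \langle G_{\mathrm{receptor}} \rangle
                         - \langle G_{\mathrm{ligand}} \rangle,
\qquad
G = E_{\mathrm{MM}} + G_{\mathrm{GB}} + G_{\mathrm{SA}} \; (- TS)

% Linear Interaction Energy (standard form); \alpha, \beta, \gamma are empirical parameters
\Delta G_{\mathrm{bind}}^{\mathrm{LIE}} \approx
    \alpha \, \Delta \langle V^{\mathrm{vdW}} \rangle
  + \beta  \, \Delta \langle V^{\mathrm{el}} \rangle
  + \gamma
```

In this reading, the "LIE components" used as descriptors would be the van der Waals and electrostatic interaction-energy averages, while the MM-GBSA value is the regression target.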

## Key findings

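- A fully connected neural network achieved the highest accuracy in predicting binding free energies (R² = 0.7124 ± 0.0089; MAE = 5.2033 ± 0.0876 kcal/mol), with gradient-boosting models performing comparably.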
- Incorporating Linear Interaction Energy components significantly improved model performance.
- Approximately 17,000 data points were sufficient for stable model performance.

## Abstract

Cathepsins are papain-like proteolytic enzymes localized in lysosomes and the extracellular matrix, where they participate in diverse physiological and pathological processes. They are synthesized as inactive precursors (procathepsins) containing a propeptide domain that blocks access to the active site. The activity of (pro)cathepsins can be modulated by glycosaminoglycans (GAGs), which are negatively charged, sulfated polysaccharides. This study aimed to develop machine learning (ML) models to predict MM-GBSA binding free energies in (pro)cathepsin–GAG complexes. Molecular dynamics simulations were performed using the ff14SB/GLYCAM06j force field for six (pro)cathepsins and six GAGs, representing four periodic states and six binding poses. Structural and energetic descriptors derived from these simulations were used as input features for eight ML algorithms, including ElasticNet, Linear Regression, LinearSVR (with RBFSampler), LightGBM, Histogram Gradient Boosting, Fully Connected Neural Network (FCNN), and Random Forest. The FCNN yielded the most accurate predictions (R² = 0.7124 ± 0.0089; MAE = 5.2033 ± 0.0876 kcal/mol), with gradient-boosting models performing comparably. Optimal FCNN performance was achieved with a minimal architecture (no hidden layers, dropout rate 0.01, ReLU activation). Incorporating Linear Interaction Energy (LIE) components significantly improved prediction accuracy, and approximately 17,000 data points were sufficient for stable model performance. Overall, this study provides a proof of concept for using ML to estimate binding free energies in protein–GAG systems and establishes a foundation for generalizable, structure-based predictors applicable to a broad range of biomolecular complexes. Beyond predictive accuracy, this approach enables rapid screening of MM-GBSA interactions, facilitating the identification of favorable binding regions and accelerating structure-guided design efforts.

## Full-text entities

- **Genes:** CTSS (cathepsin S) [NCBI Gene 1520]
- **Chemicals:** MM-GBSA (-), polysaccharides (MESH:D011134), GAG (MESH:D006025)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12771359/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12771359/full.md

## References

66 references — full list in the complete paper: https://tomesphere.com/paper/PMC12771359/full.md

---
Source: https://tomesphere.com/paper/PMC12771359