# Evaluating data partitioning strategies for accurate prediction of protein-ligand binding free energy changes in mutated proteins

**Authors:** Liangxu Xie, Guoming Bao, Dawei Zhang, Lei Xu, Xiaojun Xu, Shan Chang

PMC · DOI: 10.1016/j.csbj.2025.10.020 · Computational and Structural Biotechnology Journal · 2025-10-14

## TL;DR

This paper evaluates how different data partitioning methods affect predictions of protein-ligand binding energy changes due to mutations and introduces a new framework to improve accuracy.

## Contribution

The novel contribution is the anchor-query partitioning framework that improves prediction accuracy using limited reference data.

## Key findings

- Random partitioning leads to inflated performance estimates compared to UniProt-based splitting.
- The proposed anchor-query framework improves prediction accuracy with minimal reference data.
- ML/DL models show high correlation under random partitioning but lower accuracy under UniProt-based partitioning.
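
The difference between the two partitioning schemes can be made concrete with a small sketch. It is an illustration only, not the authors' pipeline: the record layout, the `uniprot` key, the placeholder protein IDs, and the helper names are all assumptions. Random splitting shuffles individual records, so mutations of the same protein can land on both sides. Group-wise splitting holds out whole proteins.

```python
import random
from collections import defaultdict


def random_split(records, test_fraction=0.2, seed=0):
    """Shuffle individual records, ignoring which protein they belong to."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def uniprot_split(records, test_fraction=0.2, seed=0):
    """Hold out whole proteins: all records of one ID stay on one side."""
    by_protein = defaultdict(list)
    for rec in records:
        by_protein[rec["uniprot"]].append(rec)
    ids = sorted(by_protein)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [r for r in records if r["uniprot"] not in test_ids]
    held_out = [r for r in records if r["uniprot"] in test_ids]
    return train, held_out


def shared_proteins(train, held_out):
    """Protein IDs that appear on both sides of a split (leakage check)."""
    return {r["uniprot"] for r in train} & {r["uniprot"] for r in held_out}


# Toy data with hypothetical placeholder IDs (not real UniProt accessions).
records = [
    {"uniprot": pid, "mutation": f"M{i}", "ddg": 0.1 * i}
    for pid in ("PROT_A", "PROT_B", "PROT_C", "PROT_D", "PROT_E")
    for i in range(4)
]

train, held_out = uniprot_split(records)
print(shared_proteins(train, held_out))  # empty set: no protein is shared
```

Calling `shared_proteins` on the output of `random_split` usually returns a non-empty set, and that overlap is the kind of leakage that can inflate performance estimates.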

## Abstract

Accurate prediction of the relative free energy of protein-ligand binding, especially regarding protein mutations, is vital for drug design and interpreting drug resistance. However, machine learning (ML) / deep learning (DL) methods often struggle with generalization because of the dataset partitioning strategy. Random data partitioning potentially produces spuriously high correlations that inflate performance estimates. UniProt-based splitting preserves data independence but yields lower prediction accuracy. In this study, we first evaluate six distinct ML/DL models on the MdrDB database using two data partitioning methods. Protein sequences are embedded using the ESM-2 protein large language model, integrating wild-type and mutant features. Although all models show high predictive correlations (Pearson coefficients up to 0.70) under random partitioning, their performance declines with UniProt-based partitioning. To address this issue, we propose a query-anchor pairwise learning framework, utilizing known states as anchor points for predicting unknown query states. The proposed method is validated across three systems, revealing that even a small amount of reference data can significantly enhance prediction accuracy. This enhancement suggests that leveraging known states as anchor points allows for more precise prediction of unknown query states.

- Evaluated impact of different data partitioning strategies on predicting mutation-induced changes in binding free energy.
- UniProt-based partitioning reduces model prediction accuracy, highlighting potential overestimation from conventional methods.
- Proposed an anchor-query partitioning framework, leveraging limited reference data to improve predictive generalization.
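
The anchor-query idea of using known states as reference points for unknown ones can be sketched as pairwise difference learning. This is a minimal illustration under strong assumptions, not the paper's architecture: each sample carries a single scalar feature `x` and a label `y`, the pairwise model is a least-squares slope through the origin, and all function names and toy numbers are hypothetical. The model learns label differences between pairs, and a query is then predicted by offsetting each anchor's known label with the predicted difference.

```python
from itertools import permutations
from statistics import mean


def make_pairs(samples):
    """Build ordered (query, anchor) pairs as (feature diff, label diff)."""
    pairs = []
    for query, anchor in permutations(samples, 2):
        pairs.append((query["x"] - anchor["x"], query["y"] - anchor["y"]))
    return pairs


def fit_delta_slope(pairs):
    """Least-squares slope through the origin for label diff vs feature diff.

    Assumes at least two samples with distinct x, so the denominator is > 0.
    """
    num = sum(dx * dy for dx, dy in pairs)
    den = sum(dx * dx for dx, _ in pairs)
    return num / den


def predict_query(query_x, anchors, slope):
    """Average over anchors: known anchor label + predicted difference."""
    estimates = [a["y"] + slope * (query_x - a["x"]) for a in anchors]
    return mean(estimates)


# Toy training system: y = 2x - 3 (the offset -3 is system-specific).
train_system = [
    {"x": 0.0, "y": -3.0},
    {"x": 1.0, "y": -1.0},
    {"x": 3.0, "y": 3.0},
]
# Unseen system with a different offset (y = 2x + 1) and two known anchors.
new_system_anchors = [{"x": 0.0, "y": 1.0}, {"x": 2.0, "y": 5.0}]

slope = fit_delta_slope(make_pairs(train_system))
prediction = predict_query(4.0, new_system_anchors, slope)
print(prediction)
```

In this toy setup the system-specific offset cancels in every pairwise difference, so a model trained on one system transfers to another once a few anchors supply the baseline. That is one plausible reading of why a small amount of reference data helps, offered here as an assumption rather than as the authors' stated mechanism.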

## Full-text entities

- **Genes:** ABL1 (ABL proto-oncogene 1, non-receptor tyrosine kinase) [NCBI Gene 25] {aka ABL, BCR-ABL, CHDSKM, JTK7, bcr/abl, c-ABL}, F3 (coagulation factor III, tissue factor) [NCBI Gene 2152] {aka CD142, TF, TFA}, EGFR (epidermal growth factor receptor) [NCBI Gene 1956] {aka ERBB, ERBB1, ERRP, HER1, NISBD2, NNCIS}, ATM (ATM serine/threonine kinase) [NCBI Gene 472] {aka AT1, ATA, ATC, ATD, ATDC, ATE}, CASP16P (caspase 16, pseudogene) [NCBI Gene 197350] {aka CASP16}, RET (ret proto-oncogene) [NCBI Gene 5979] {aka CDHF12, CDHR16, HSCR1, MEN2A, MEN2B, MTC1}, TP53 (tumor protein p53) [NCBI Gene 7157] {aka BCC7, BMFS5, LFS1, P53, TRP53}
- **Diseases:** tumor (MESH:D009369)
- **Chemicals:** Gefitinib (MESH:D000077156), Gleevec (MESH:D000068877)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Mutations:** T315I, E255K, A-15 A, G719S, T790M, L858R

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12569818/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12569818/full.md

## References

81 references — full list in the complete paper: https://tomesphere.com/paper/PMC12569818/full.md

---
Source: https://tomesphere.com/paper/PMC12569818