# Graph-Based Deep Learning Models for Predicting pKa Values of Protein-Ionizable Residues via Physically Inspired Feature Engineering

**Authors:** Ziyu Song, Ruixuan Wang, Xun Jiao, Zuyi Huang

PMC · DOI: 10.1021/acs.jcim.5c01681 · 2026-01-22

## TL;DR

This paper combines graph-based deep learning with physics-inspired features to predict the pKa values of ionizable protein residues more accurately than PROPKA3.5.1, a task relevant to drug discovery and protein engineering.

## Contribution

The study proposes a framework that combines molecular dynamics simulations with graph-based deep learning models to improve pKa prediction accuracy.

## Key findings

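- Three graph-based models outperformed PROPKA3.5.1 in predicting pKa values for four residue types: aspartic acid, glutamic acid, lysine, and histidine.
- The graph attention network model showed both high accuracy and strong generalizability when benchmarked against several recently published machine learning models.
- Feature importance analysis revealed biophysically meaningful patterns related to residue pKa values, most notably the complexity of the local microenvironment and the atomic geometric arrangement within the protein structure.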
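
To make the "graph attention network" named in the findings above more concrete, here is a minimal sketch of one generic single-head graph attention layer in plain Python. It is not the paper's architecture: the toy graph, the two-dimensional node features, the weights `W` and `a`, and the function names are all hypothetical, chosen only to show how attention weights over neighboring nodes are computed and used to aggregate features. The final nonlinearity that usually follows aggregation is omitted for brevity.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU as used for the raw attention scores."""
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(features, neighbors, W, a):
    """One single-head graph attention layer.

    features:  list of node feature vectors
    neighbors: dict mapping node index -> list of neighbor indices
               (each list includes the node itself)
    W:         weight matrix, shape (out_dim, in_dim)
    a:         attention vector, length 2 * out_dim
    Returns (new node features, attention weights per node).
    """
    z = [matvec(W, h) for h in features]  # linear projection of every node
    out, attn = [], []
    for i in range(len(features)):
        nbrs = neighbors[i]
        # Raw score for each neighbor j: LeakyReLU(a . [z_i || z_j])
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])))
            for j in nbrs
        ]
        # Numerically stable softmax over the neighborhood of node i
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # Attention-weighted sum of the projected neighbor features
        out.append([
            sum(al * z[j][k] for al, j in zip(alpha, nbrs))
            for k in range(len(z[i]))
        ])
        attn.append(alpha)
    return out, attn


# Hypothetical toy graph: 3 nodes, 2 features per node (values are made up).
features = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
W = [[0.4, -0.2], [0.1, 0.3]]  # hypothetical projection weights
a = [0.5, -0.5, 0.25, 0.75]    # hypothetical attention vector

out, attn = gat_layer(features, neighbors, W, a)
for i, (vec, weights) in enumerate(zip(out, attn)):
    print(i, [round(v, 3) for v in vec], [round(w, 3) for w in weights])
```

Each node's attention weights form a probability distribution over its neighbors, which is what lets such a layer weight nearby atoms or residues unequally. In a trained model `W` and `a` would be learned rather than fixed by hand.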

## Abstract

The pKa value of a protein-ionizable residue reflects its potency to donate a proton at a given pH value, which is essential for understanding a wide range of biological activity. Therefore, the accurate prediction of pKa values of protein residues is crucial for understanding enzymatic activity and protein–ligand binding, which are fundamental to drug discovery. Despite significant time and resources being invested to develop computational methods for protein residue pKa prediction, the accuracy of existing tools, such as the widely used PROPKA, remains limited. In this study, an integrated framework that fuses molecular dynamics simulations and deep learning models is proposed to improve the predictive accuracy of pKa values for ionizable residues. Specifically, we employ high-throughput molecular modeling using the AMOEBA polarized force field to construct a protein structure data set enriched with atomic electrostatics and other physics-inspired features. Using the experimentally determined pKa values from the PKAD-2 data set, we trained three graph-based neural network models. All three models demonstrated substantial improvements in prediction accuracy across four ionizable residue types, aspartic acid, glutamic acid, lysine, and histidine, when compared to PROPKA3.5.1, with the graph attention networks-based model exhibiting both high accuracy and strong generalizability when benchmarking against several recently published machine learning models. Beyond these improvements in predictive performance, feature importance analysis of the best-performing models revealed physically meaningful patterns of the descriptive features that aligned with the underlying biophysical principles governing protein residue pKa values, most notably, the complexity of the local microenvironment and the atomic geometric arrangement within the protein structure. Together, the trained pKa models and the curated dipole moment-enhanced data set based on a polarizable FF offer a valuable resource for the research community, with potential applications in early-stage drug target identification and protein engineering.
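
The abstract's opening definition ties a residue's pKa to its protonation behavior at a given pH. As a small illustration, the standard Henderson-Hasselbalch relation (textbook acid-base chemistry, not a method from the paper) gives the deprotonated fraction as 1 / (1 + 10^(pKa − pH)). The function names and the example pKa of 4.0 below are illustrative assumptions, not values reported in the study.

```python
def deprotonated_fraction(pka: float, ph: float) -> float:
    """Fraction of a titratable group in its deprotonated form.

    Follows from Henderson-Hasselbalch: pH = pKa + log10([A-]/[HA]),
    so [A-] / ([A-] + [HA]) = 1 / (1 + 10 ** (pKa - pH)).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def protonated_fraction(pka: float, ph: float) -> float:
    """Fraction of the group that still carries the titratable proton."""
    return 1.0 - deprotonated_fraction(pka, ph)


# Illustrative only: a hypothetical acidic side chain with pKa = 4.0.
example_pka = 4.0
for ph in (2.0, 4.0, 7.0):
    frac = deprotonated_fraction(example_pka, ph)
    print(f"pH {ph:.1f}: deprotonated fraction = {frac:.3f}")
```

When pH equals pKa the group is exactly half deprotonated, and each pH unit away from the pKa shifts the ratio of the two forms tenfold. This is why a prediction error of about one pKa unit can change the assigned charge state of a residue near physiological pH.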

## Full-text entities

- **Chemicals:** aspartic acid (MESH:D001224)

## Figures

27 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12892328/full.md

---
Source: https://tomesphere.com/paper/PMC12892328