# Protein Electrostatic Properties are Fine-Tuned Through Evolution

**Authors:** Mingzhe Shen, Guy W. Dayhoff, Jana Shen

PMC · DOI: 10.21203/rs.3.rs-6471091/v1 · Research Square · 2025-04-28

## TL;DR

This paper shows that protein pKa values, and with them protein electrostatic properties, can be predicted accurately from primary sequence alone using KaML-ESM, a machine learning model built on ESM protein language models.

## Contribution

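KaML-ESM is a model pretrained on a synthetic pKa dataset that uses evolutionary representations from ESM protein language models to predict pKa values from sequence alone, without structural input. The authors also provide KaML, a sequence-based end-to-end ML platform for mapping protein electrostatic landscapes.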

## Key findings

- KaML-ESM achieves RMSEs approaching the experimental precision limit of ~0.5 pH units for Asp, Glu, His, and Lys residues.
- Cys prediction errors are reduced to 1.1 pH units, with further improvement expected as the training dataset expands.
- The model's performance was further validated through external evaluations, including a proteome-wide analysis of protein pKa values.
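
The RMSE figures in the findings are reported per residue type, in pH units. A minimal sketch of how such a metric could be computed, assuming the predictions are available as `(residue_type, predicted_pKa, experimental_pKa)` tuples. The function name `rmse_by_residue` and the sample numbers are hypothetical and are not taken from the paper or from the KaML code:

```python
import math
from collections import defaultdict


def rmse_by_residue(records):
    """Return {residue_type: RMSE} for (residue_type, predicted, experimental) records."""
    squared_errors = defaultdict(list)
    for residue, predicted, experimental in records:
        squared_errors[residue].append((predicted - experimental) ** 2)
    # RMSE is the square root of the mean squared error within each residue type.
    return {
        residue: math.sqrt(sum(errors) / len(errors))
        for residue, errors in squared_errors.items()
    }


# Hypothetical pKa values for illustration only (not data from the paper).
sample = [
    ("ASP", 3.5, 4.0),
    ("ASP", 4.5, 4.0),
    ("CYS", 9.0, 8.0),
]
result = rmse_by_residue(sample)
```

Grouping by residue type before averaging matters here because the summary reports different error levels for different residues (about 0.5 pH units for Asp, Glu, His, and Lys versus 1.1 for Cys); a single pooled RMSE would hide that difference.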

## Abstract

Protein ionization states provide electrostatic forces to modulate protein structure, stability, solubility, and function. Until now, predicting ionization states and understanding protein electrostatics have relied on structural information. Here we demonstrate that primary sequence alone enables remarkably accurate pKa predictions through KaML-ESM, a model pretrained on a synthetic pKa dataset that leverages evolutionary representations from large-scale protein language models (ESMs). The KaML-ESM model achieves RMSEs approaching the experimental precision limit of ~0.5 pH units for Asp, Glu, His, and Lys residues, while reducing Cys prediction errors to 1.1 units – with further improvement expected as the training dataset expands. The state-of-the-art performance of KaML-ESM was further validated through external evaluations, including a proteome-wide analysis of protein pKa values. Our results support the notion that protein sequence encodes not only structure and function but also electrostatic properties, which may have been co-optimized through evolution. Lastly, we provide KaML, a sequence-based end-to-end ML platform that enables researchers to map protein electrostatic landscapes, facilitating applications ranging from drug design and protein engineering to molecular simulations.
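
The abstract ties pKa values to ionization states. As a rough illustration of that link, the sketch below applies the Henderson-Hasselbalch relation to turn a residue pKa into a protonated fraction and a mean charge at a given pH. It assumes a single, non-interacting titratable site, which is a simplification; the function names are hypothetical and this is not the KaML platform's API:

```python
def protonated_fraction(pka, ph):
    """Henderson-Hasselbalch: fraction of a titratable site that is protonated at this pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def mean_charge(pka, ph, is_acid):
    """Average charge of one site.

    Acidic sites (Asp, Glu, Cys) are neutral when protonated and -1 when deprotonated.
    Basic sites (His, Lys) are +1 when protonated and neutral when deprotonated.
    """
    fraction = protonated_fraction(pka, ph)
    return -(1.0 - fraction) if is_acid else fraction


# Illustrative pKa values only (assumptions, not predictions from the paper).
half_titrated = protonated_fraction(7.0, 7.0)       # pH equals pKa
acid_charge = mean_charge(4.0, 4.0, is_acid=True)    # acidic site at its pKa
base_charge = mean_charge(10.0, 7.0, is_acid=False)  # basic site well below its pKa
```

Because the protonated fraction depends on `10 ** (ph - pka)`, a pKa error of 1 unit changes the protonated-to-deprotonated ratio by a factor of 10, which is why the reported RMSEs are quoted against the experimental precision limit.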

## Full-text entities

- **Chemicals:** Cys (MESH:D003545), Glu (MESH:D018698), Asp (MESH:D001224), Lys (MESH:D008239)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12060968/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12060968/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC12060968/full.md

---
Source: https://tomesphere.com/paper/PMC12060968