Protein Electrostatic Properties are Fine-Tuned Through Evolution

Mingzhe Shen; Guy W. Dayhoff; Jana Shen

PMC · DOI:10.21203/rs.3.rs-6471091/v1·April 28, 2025

Protein Electrostatic Properties are Fine-Tuned Through Evolution

Mingzhe Shen, Guy W. Dayhoff, Jana Shen

PDF

Open Access

TL;DR

This paper shows that protein electrostatic properties can be accurately predicted from their sequence using a machine learning model called KaML-ESM.

Contribution

KaML-ESM is a novel model that uses protein language models to predict pKa values from sequence alone with high accuracy.

Findings

01

KaML-ESM achieves RMSEs near the experimental precision limit for Asp, Glu, His, and Lys residues.

02

Cys prediction errors are reduced to 1.1 pH units with potential for further improvement.

03

The model's performance was validated through proteome-wide analysis and external evaluations.

Abstract

Protein ionization states provide electrostatic forces to modulate protein structure, stability, solubility, and function. Until now, predicting ionization states and understanding protein electrostatics have relied on structural information. Here we demonstrate that primary sequence alone enables remarkably accurate pKa predictions through KaML-ESM, a model pretrained on a synthetic pKa dataset that leverages evolutionary representations from large-scale protein language models ESMs. The KaML-ESM model achieves RMSEs approaching the experimental precision limit of ~0.5 pH units for Asp, Glu, His, and Lys residues, while reducing Cys prediction errors to 1.1 units – with further improvement expected as the training dataset expands. The state-of-the-art performance of KaML-ESM was further validated through external evaluations, including a proteome-wide analysis of protein pKa values.…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Chemicals4

Cys Glu Asp Lys

Figures4

Click any figure to enlarge with its caption.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsProtein Structure and Dynamics · RNA and protein synthesis mechanisms · Mass Spectrometry Techniques and Applications

Full text

Introduction

Protein structure and function are encoded in its amino acid sequence. Since ionization states play important roles in protein functions, we hypothesized that they can be predicted from the protein sequence alone. Emerging protein large language models (pLLMs) demonstrate powerful performance in predicting protein structures and functions through masked learning of protein sequences evolved over hundreds of millions of years.^1–4^ In a recent publication, the latest evolutionary scale model 3 (ESM3) was able to generate (without supervised learning) a fluorescent protein with a sequence identity of only 58% from known fluorescent proteins.^4^ This is because the representations emerging within the pLLMs reflect the biological structure and function of proteins and improve with scale, e.g., ESM3 is trained with 2.78 billion protein sequences.^4^

We posited that residue-level representations learned by pLLMs such as ESMs encode information about ionization states of protein sidechains and pKa shifts that occur when residues transition from solution to protein environment, which are often large in magnitude for functional sites. To test this, we developed sequence-based pKa prediction models, where per-token (i.e., residuespecific) embeddings extracted from specific layers of an ESM model were used as inputs to a multilayer perceptron (MLP; a feed-forward neural network with fully connected neurons) for predicting residue-specific pKa values. The MLP is trained on the experimental pKa database PKAD-3^5^ which is an expansion and refinement of the widely used PKAD-2 database.^6^ We named the new models KaML-ESMs, as the training protocol is derived from our most recent structure-based KaML (pKa Machine Learning) models, especially KaML-CBTree which achieved the state-of-the-art (SOTA) prediction accuracies for all five amino acids, Asp, Glu, His, Lys, and Tyr.^5^

The sequence-based KaML-ESM models establish a new SOTA in pKa predictions, pushing the accuracy boundary to near the experimental precision (about 0.5 pH units) for Asp, Glu, His and Lys, while reducing the average pKa error of Cys to 1.1 pH units. External validation using newly curated experimental data confirms predictive performance. These results suggest that protein sequence encodes not only structure and function but also electrostatic properties, which may have been co-optimized through evolution. We developed an end-to-end pKa sequence-based platform KaML and performed proteome-wide pKa predictions for proteins identified in chemical proteomic experiments to further validate model performance and platform efficiency. We expect KaML to enable a wide range of applications, from drug design and protein engineering to molecular dynamics simulations.

Results and Discussion

ESM-learned representations can differentiate between identical amino acids with distinct pKa shifts.

We first tested if the residue-specific representations extracted from the pLLM ESM2^3^ (trained with ~65 million unique sequences and 650 million parameters, see Supplemental methods) could differentiate between identical titratable amino acids exhibiting distinct pKa shifts from the solution values. To do so, we employed t-distributed stochastic neighbor embedding (t-SNE) algorithm^7,8^ to generate two-dimensional visualization of pairwise similarities between identical amino acids with experimental pKa values from the PKAD-3 database^5^ (Fig. 1). Significant positive pKa shifts of Asp and Glu form the most prominent clusters. Considering that carboxylic acids with upshifted pKa’s are enriched in functional sites, this analysis confirms that residue-level evolutionary conservation and functional properties encoded by the ESM2 embeddings are indeed linked to pKa shifts. Clustering of positive and negative pKa shifts is also observed for His and Lys, while patterns appear less distinct for Cys and are absent for Tyr likely due to limited training data (60 Cys and 39 Tyr pKa values).

Pretraining combined with distinct acid and base models boosts the performance of KaML-ESM models for pKa predictions.

Encouraged by the t-SNE analysis, we proceeded to build a KaML-ESM model, in which ESM serves as a foundation model to generate residue embeddings from protein sequences which are then fed to an MLP trained on the experimental pKa shifts from the PKAD-3 database^5^ (Supplemental methods). Since PKAD-3 is small (1,167 pKa’s of 330 Asp, 382 Glu, 219 His, 60 Cys, 39 Tyr, and 137 Lys in 247 unique proteins), we conducted model pretraining using a synthetic dataset comprised of the pKa shifts of 29,457 residues in 9,945 proteins predicted by the KaML-CBTree model, which demonstrated SOTA performance previously^5^ (details see Supplemental Methods and Fig. S1). The pretrained model was then fine-tuned on PKAD-3.^5^

Due to the distinctive mechanisms of pKa shifts for acidic (Asp, Glu, Cys, Tyr) and basic (His and Lys) residues,^5^ we reasoned that training separate acid and base models is more appropriate. To evaluate the contributions from model pretraining (PT) and separation of acid/base models (AB), we trained and tested four KaML-ESM2 models: (1) no PT and no AB (baseline model); (2) PT only; (3) AB only; and (4) PT and AB. Compared to the baseline model, applying PT decreased the hold-out test root-mean-square error (RMSE) from 0.93 to 0.89, and applying AB decreased the test RMSE from 0.93 to 0.76 (Supplemental Table S1). Moreover, the combination of PT and AB demonstrates a synergistic effect, reducing RMSE from 0.93 to 0.73 (Supplemental Table S1). Therefore, the remainder of the work will focus on models trained using both strategies.

Representation learning rates vary across amino acids.

Current pLLMs such as ESMs are built on the Transformer architecture,^9^ which processes inputs through a series of blocks that alternate self-attention with feed-forward connections.^1^ Consequently, pLLMs enable each residue to allocate “heightened attention” (increased attention weights) to important residues independently, regardless of their distances in the protein sequence. One earlier study analyzed pLLMs and found that different aspects of protein evolutionary features such as structures and functions are learned across different layers of the transformer, with deeper layers attending to residue contact relationships.^10^ Our recent development of structure-based KaML models demonstrated that protein pKa values can be accurately predicted from the local structural environment.^5^ Considering that protein contact maps reflect the local environment, we asked if there is a particular ESM2 layer that offers the most accurate representation of protein ionization states. To test this, we extracted residue embeddings from the final 50% of layers (17–33) and evaluated their effectiveness by training dedicated models using each layer’s embeddings and examining the overall and amino acid-specific RMSEs. To reduce computational cost, layer evaluation was conducted without model pretraining unless otherwise noted.

Interestingly, the overall test RMSE of the model does not decrease monotonically as learning progresses through the transformer layers; instead, it exhibits multiple local minima (Fig. 2, blue curve). This is due to the different rates of representation learning for different amino acids. The model RMSEs for Cys, His, Lys pKa’s decrease to the lowest value in the final layer (33), while the RMSEs for Asp, Glu, and Tyr reach minima at layers 31, 31, and 30, respectively (Supplemental Table S2). To confirm this pattern, we examined the last few layers by including model pretraining (Supplemental Table S2). The trend is similar, with layer 31 giving the lowest overall RMSE while amino acid-specific RMSEs reach minima at different layers. Since layer 31 embeddings yield the lowest overall test RMSE (0.68), we focus our subsequent discussion on this model and refer to it as KaML-ESM2.

Influence of the ESM parameter scale and capacity of learning emergent structures.

We asked whether ESM’s parameter scale and architectural design influence its representation learning capabilities for protein ionization states. To address these questions, we trained KaML-ESM models using the embeddings from ESM2_15B.^3^ and the latest ESM Cambrian (ESMC, 6B parameters),^11^ which predicts emergent structures with significantly higher precision than ESM2 (even those with larger parameter scales) due to the use of a different architecture and orders of magnitude larger protein sequence space (2.78 billion sequences).^4,11^

For ESM2 15B, the models trained with the final four layers (45–48) give similar performances, with layer 47 (the second last) achieving the lowest RMSE of 0.73 (Supplemental Table S3). In contrast to ESM2 650M and ESM2 15B, the model trained with the final layer (layer 80) of ESMC gives the lowest RMSE of 0.70 (Fig. 2 and Supplemental Table S4). Interestingly, using the final 17 layers of ESMC, the overall RMSE steadily decreases with progressively deeper layers (Supplemental Table S4), suggesting that the model’s capacity to learn protein electrostatic properties may not have reached saturation. The overall RMSEs of KaML-ESMC are consistently lower than the corresponding RMSEs of KaML-ESM2 (with 640M or 15B parameters), suggesting that the enhanced capacity of learning emergent structures^4,11^ plays a more important role in accurate prediction of electrostatic properties than raw parameter scale. It is also noteworthy that the amino acid-specific RMSEs decrease steadily toward the final layer, suggesting that the representation learning rates across amino acids are more uniform compared to ESM2. Since the best KaML-ESM2_15B model gives a higher RMSE, we drop the model in the following discussion. We then retrained KaML-ESMC using the representations of the final layer (80) by including pretraining. We refer to it as KaML-ESMC hereafter.

KaML-ESM2a and KaML-ESMCb establish a new SOTA benchmark for predicting pKa’s and protonation states.

We compared the RMSE, Pearson’s correlation co-efficient (PCC), and maximum error (MAXE) of the predicted pKa’s of acidic and basic residues by KaML-ESMs and structure-based KaML-CBTree and the empirical PROPKA3 method (Table 1). For clarity, we added a suffix a or b to denote the model type, i.e., KaML-ESM2a/KaML-ESMCa for acidic and KaML-ESM2b/KaML-ESMCb for basic pKa predictions.

In identical 20 hold-out tests, both acid and base KaML models outperform the previous SOTA ML pKa predictor KaML-CBTree, which substantially surpasses the widely-used PROPKA3 method (Table 1). Interestingly, KaML-ESM2 and KaML-ESMC demonstrate complementary strengths: KaML-ESM2a excels at predicting acidic residue pKa’s (RMSE=0.67; PCC=0.91), while KaML-ESMCb achieves superior performance for basic residues (RMSE=0.57; PCC=0.96).

We also examined the protonation-state prediction metrics, precision, recall, and critical error rates. Following our previous work,^5^ the continuous pKa values are discretized into three classes based on the protonation probability (Prob) at pH 7: protonated (Prob >0.75, pKa <6.52), deprotonated (Prob <0.25, pKa >7.48), and titrating (0.25≤ Prob ≤0.75, 6.52≤ pKa ≤7.48). According to all classification metrics, both KaML-ESMs outperform KaML-CBTree and PROPKA3 (Table 1). Consistent with the pKa regression metrics, KaML-ESM2a delivers the highest recall, precision, and lowest critical error rates (CERs) when classifying protonation states of acidic residues, while KaML-ESMCb provides the best classification metrics for basic residues (Table 1).

When evaluating individual amino acid pKa and protonation state predictions, KaML-ESM2a establishes a new SOTA for Asp, Glu, and Cys, while KaML-ESMCb a new SOTA for His and Lys (Table 2, Supplemental Fig. S4 and Fig. S5). An exception is Tyr, for which KaML-CBTree remains the SOTA performer, which is unsurprising given the decision tree’s effectiveness when trained on the extremely small dataset of just 39 Tyr pKa’S.

KaML-ESMs offer the most significant improvement for predicting Cys and His pKa’s and protonation states.

The most significant improvement over the previous SOTA KaML-CBTree (KaML-CBT) is for Cys and His. Comparing KaML-ESM2a and KaML-CBTa, the RMSE of Cys pKa’s is reduced by 0.4 units and the CER is reduced by 25% (Table 2). KaML-ESM2a achieves the precision and recall of 89% and 88% in predicting Cys^−^, as compared to 73% and 76% by KaML-CBTa, respectively. This level of performance in predicting deprotonated cysteines, which are highly nucleophilic and frequent linkage sites for targeted covalent inhibitors,^13^ positions KaML-ESM2a as a valuable tool for rational covalent drug design.

The second most significant improvement is for His. The solution pKa of His is ~6.5,^14,15^ which is close to the cytosolic pH 7.1. This means that any small errors in pKa prediction may lead to a critical error (predicting protonated as deprotonated state or vice versa). Along with a 0.24-unit decrease in RMSE when comparing KaML-ESMCb to KaML-CBTb, the CER is reduced by threefold and the recall for His^+^ increased from 0.37 to 0.90 with the precision of 0.92 (Table 2 and Supplemental Fig. S4 and Fig. S5). This remarkable improvement suggests that KaML-ESMCb can be used to improve fidelity of molecular dynamics (MD) simulations, which typically set histidines in the neutral state.

KaML-ESM predictions for Asp, Glu, His, and Lys approach the experimental precision.

An earlier study that analyzed NMR titration data from different laboratories suggested that the minimum average error of pKa estimates is roughly 0.5 units.^16^ Using this knowledge as a guide and noting the convergence in RMSE (~0.6) for Glu between the predictions by KaML-ESM2a, KaML-CBTa (and KaML-ESMa), we suggest that the models have reached an accuracy threshold approaching the experimental measurement uncertainty. Similarly, KaML-ESM2a appears to approach the performance ceiling for Asp, while KaML-ESMCb appears to approach the performance ceiling for His and Lys. Hereafter, we refer to the combined KaML-ESM2a and ESMCb model as KaML-ESM.

External evaluation of KaML-ESM confirms the SOTA performance.

To provide a production KaML-ESM model to the community, the models were retrained using the complete dataset (Supplemental methods). The production KaML-ESM was further evaluated on a newly collected experimental dataset composed of pKa values of 55 residues (39 His, 3 Cys, and 13 Lys) from 16 proteins, which are not in PKAD-3^5,6^ (Supplemental Fig. S6). Examining individual amino acids, KaML-ESM gives RMSE of 0.52 for His, 0.60 for Cys, and 0.47 for Lys, as compared to the respective RMSEs of 0.57, 0.87, 0.97 with KaML-CBTree. The enhanced performance stems from eliminating KaML-CBTree’s systematic tendency to overestimate pKa downshifts of Cys and Lys residues. Lastly, no critical errors are found in the KaML-ESM predictions.

Developing the KaML platform for pKa predictions and visualization.

To provide an accessible tool for the scientific community, we developed a sequence-based pKa prediction platform, utilizing ESM2^3^ and ESMC^11^ as foundation models for downstream sequence-based pKa predictions and the most recent ESM3^4^ to generate protein structures for visualization and optional structure-based pKa predictions (Fig. 3 left, Supplemental Methods). To make KaML-ESM broadly usable, we provide both a command-line interface and an online browser-based GUI (https://kaml.computchem.org) that support input via protein sequence, UniProt ID, PDB ID, or user-provided PDB file, along with an ESM Forge API token to initiate predictions (Supplemental Methods and Fig. S9).

Fig. 3 illustrates the pKa prediction results for an example protein, cyclin-dependent kinase 5 (CDK5), which regulates the mammalian central nervous system.^17^ The prediction process takes about 35 seconds on the command line or web interface. KaML-ESM predicts that Cys83 and Cys157 have down-shifted pKa values, i.e., both are highly reactive, consistent with the finding that they are modified by S-nitrosylation events in neurodevelopmental and neurodegenerative processes.^18^ Our previous work showed that highly reactive cysteines adjacent to binding pockets can serve as covalent linkage sites for targeted covalent inhibitors.^13,19^ Consistent with its reactivity, Cys157 has been identified as ligandable in chemoproteomic experiments,^20–22^ which offers an exciting opportunity for disrupting the interface between CDK5 and p25, as aberrant formation of this complex leads to CDK5 hyperactivation, contributing to tau hyperphosphorylation and neurodegeneration.^17,23^

Proteome-wide predictions further validate the accuracy for Asp/Glu/His/Lys while highlighting improvement opportunities for Cys/Tyr.

To further validate the model performance and demonstrate the high-throughput capability of KaML-ESM, we made sequence-based pKa predictions for proteins identified in chemoproteomic activity-based protein profiling (ABPP) experiments across various cell lines.^20–22,24–30^ A total of 509,837 pKa values of Asp, Glu, His, Cys, Tyr, and Lys residues in proteins expressed by 3,892 unique genes with sequence length < 1022 were predicted (Fig. 4). Note, the majority of these proteins do not have experimental structures. Remarkably, the pKa ranges of Asp, Glu, His, and Lys reflect the experimental distributions of PKAD-3^5^ and the mode/mean pKa values align within 0.1 units from their solution pKa values (Fig. 4), despite this not being an explicit model constraint – supporting our hypothesis that KaML-ESM predictions approach experimental precision.

In contrast, compared to the solution values, the mode/mean of the predicted pKa’s of Cys is lower by 0.6–0.9 units, while that of Tyr is higher by 1.0–1.2 units. These deviations are consistent with the higher prediction errors for Cys and Tyr (RMSEs of 1.1 and 1.2, respectively) and suggest the presence of systematic errors, which are attributed to the limited training data (60 Cys and 39 Tyr pKa’S).

Recent related work based on pLLMs.

Prior to submission, we became aware of an ESM-derived pKa prediction model pKAML,^31^ which utilizes the concatenated vectors comprising the embeddings from a pLLM (e.g., ESM2) and the predicted protein and peptide isoelectric points. pKAML differs from KaML-ESM in many aspects. pKAML is a combined acid and base model directly trained (i.e., no pretraining) on experimental pKa shifts from a subset of the PKAD-2 database^6^ (significantly smaller than PKAD-3^5^ used in this work) and evaluated on a single hold-out test. Among the evaluated pLLMs, pKAML based on ESM2 35M gives the best performance, achieving an overall RMSE of 0.9 in the hold-out test. In comparison, KaML-ESM gives an overall RMSE of 0.65±0.10 in 20 hold-out tests. When evaluated on our external test data, pKAML gives an RMSE of 0.67, compared to the RMSE of 0.49 given by KaML-ESM (Supplemental Fig. S6).

Concluding Discussion

Protein ionization states provide electrostatic forces to modulate protein structure, stability, solubility, and function. In the past, prediction and interpretation of ionization states have relied on structure-based approaches, including physics-based calculations,^32,33^ empirical methods,^12^ and ML models.^5,32,34–36^ Our work establishes that primary sequence alone enables remarkably accurate pKa predictions. The KaML-ESM model achieves RMSEs approaching experimental precision limits (~0.5 pH units) for Asp, Glu, His, and Lys residues, while reducing Cys prediction errors to 1.1 units – with further improvement expected as the training dataset expands. These results support the notion that protein sequence encodes not only structure and function but also precise electrostatic properties, which may have been co-optimized through evolution.

Beyond improving pKa prediction accuracy for Cys and Tyr through the use of larger training dataset, the KaML platform can be refined and expanded. For example, our analysis revealed distinct representation learning patterns across amino acids. While learning saturates at layer 31 for carboxylic acids, the RMSE for Cys continues to decrease, reaching 1.0 in the ESM2 model’s final layer (33). This suggests the potential to develop more refined models that leverage amino acid-specific layer embeddings for improved predictive performance. While KaML-CBTree is currently used to predict conformational state-dependent pKa’s incorporating sequence information and making use of other architectures may improve model performance. Finally, we envisage the integration of KaML with constant pH MD simulation^33^ to model the dynamic interplay between protonation state changes and conformational transitions critical to biological functions. Such an integrated approach would further advance our understanding of how electrostatic remodeling drives protein functions, e.g., in proton-coupled gating of ion channels and activation of membrane transporters.

Bibliography36

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Rives A.; Meier J.; Sercu T.; Goyal S.; Lin Z.; Liu J.; Guo D.; Ott M.; Zitnick C. L.; Ma J.; Fergus R. Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. Proc. Natl. Acad. Sci. U.S.A. 2021, 118, e 2016239118.10.1073/pnas.2016239118 PMC 805394333876751 · doi ↗ · pubmed ↗
2Elnaggar A.; Heinzinger M.; Dallago C.; Rehawi G.; Wang Y.; Jones L.; Gibbs T.; Feher T.; Angerer C.; Steinegger M.; Bhowmik D.; Rost B. Prot Trans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7112–7127.34232869 10.1109/TPAMI.2021.3095381 · doi ↗ · pubmed ↗
3Lin Z.; Akin H.; Rao R.; Hie B.; Zhu Z.; Lu W.; Smetanin N.; Verkuil R.; Kabeli O.; Shmueli Y.; Fazel-Zarandi M.; Sercu T.; Candido S.; Rives A. Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science 2023, 379, 1123–1130.36927031 10.1126/science.ade 2574 · doi ↗ · pubmed ↗
4Hayes T.; Rao R.; Akin H.; Sofroniew N. J.; Oktay D.; Lin Z.; Verkuil R.; Tran V. Q.; Deaton J.; Wiggert M.; Badkundri R.; Shafkat I.; Gong J.; Derry A.; Molina R. S.; Thomas N.; Khan Y. A.; Mishra C.; Kim C.; Bartie L. J.; Nemeth M.; Hsu P. D.; Sercu T.; Candido S.; Rives A. Simulating 500 Million Years of Evolution with a Language Model. Science 2025, 387, 850–858.39818825 10.1126/science.ads 0018 · doi ↗ · pubmed ↗
5Shen M.; Kortzak D.; Ambrozak S.; Bhatnagar S.; Buchanan I.; Liu R.; Shen J. Ka M Ls for Predicting Protein p Ka Values and Ionization States: Are Trees All You Need? J. Chem. Theory Comput. 2025, 21, 1446–1458.39882632 10.1021/acs.jctc.4c 01602 PMC 12323819 · doi ↗ · pubmed ↗
6Ancona N.; Bastola A.; Alexov E. PKAD-2: New Entries and Expansion of Functionalities of the Database of Experimentally Measured p Ka’s of Proteins. J. Comput. Biophys. Chem. 2023, 1–10.10.1142/s 2737416523500230 PMC 1037350037520074 · doi ↗ · pubmed ↗
7Hinton G. E.; Roweis S. Stochastic Neighbor Embedding. Adv Neur Inf. Process Sys. 2002.
8Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V.; Vanderplas J.; Passos A.; Cournapeau D.; Brucher M.; Perrot M.; DuchesnayÉ. Scikit-Learn: Machine Learning in Python. J. Machine Learn. Res. 2011, 12, 2825–2830.