# A High-Performance and Interpretable pKa Prediction Framework Integrating Count-Based Fingerprints and Ensemble Learning

**Authors:** Hui Shen, Yongquan He, Juefeng Deng, Xiaoying Li, Chenqiang Yang, Dingren Ma, Dehua Xia, Haiying Yu

PMC · DOI: 10.3390/molecules31060961 · Molecules · 2026-03-12

## TL;DR

A new machine learning framework improves pKa prediction by using count-based fingerprints and ensemble learning, offering both high accuracy and interpretability.

## Contribution

The novel integration of count-based Morgan fingerprints with ensemble learning provides a more accurate and chemically interpretable pKa prediction model.
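To make the binary versus count distinction concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline and not a real Morgan algorithm. The fragment labels, the 64-position vector length, and the helper names (`bucket`, `count_fingerprint`, `binary_fingerprint`) are all assumptions made for illustration. A real workflow would derive atom environments with a cheminformatics toolkit.

```python
import hashlib
from collections import Counter

N_BITS = 64  # toy length (assumption); real fingerprints are usually much longer


def bucket(fragment: str, n_bits: int = N_BITS) -> int:
    """Map a fragment label to a stable vector index.

    hashlib is used instead of the built-in hash() so the index is the
    same on every run.
    """
    digest = hashlib.sha256(fragment.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def count_fingerprint(fragments, n_bits: int = N_BITS):
    """Each position stores how many times its fragments occur."""
    vec = [0] * n_bits
    for fragment, n in Counter(fragments).items():
        vec[bucket(fragment, n_bits)] += n
    return vec


def binary_fingerprint(fragments, n_bits: int = N_BITS):
    """Each position stores only presence (1) or absence (0)."""
    return [1 if c > 0 else 0 for c in count_fingerprint(fragments, n_bits)]


# Hypothetical, hand-written fragment lists (not produced by a real Morgan
# algorithm): a mono-acid and a di-acid that share the same fragment types.
mono_acid = ["C(=O)O", "CH2", "CH2"]
di_acid = ["C(=O)O", "C(=O)O", "CH2", "CH2"]

same_binary = binary_fingerprint(mono_acid) == binary_fingerprint(di_acid)
same_count = count_fingerprint(mono_acid) == count_fingerprint(di_acid)
print("binary vectors identical:", same_binary)
print("count vectors identical:", same_count)
```

In this toy case the two binary vectors coincide because both molecules contain the same set of fragment types, while the count vectors differ in the number of carboxyl fragments. That multiplicity is the kind of information the summary credits for the improvement.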

## Key findings

- Count-based fingerprints outperformed traditional binary fingerprints in pKa prediction.
- CatBoost with SHAP-RFE achieved a test-set R² of 0.890 and an RMSE of 1.026 using only 81 features.
- On an independent external dataset of 6876 acidic compounds, the 390 compounds inside the applicability domain were predicted with R² = 0.890 and RMSE = 0.942, indicating strong generalizability.
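For readers less familiar with the two reported metrics, the sketch below computes them from their standard definitions: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The pKa values are made-up illustrative numbers, not data from the paper, and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up pKa values for illustration only (not from the paper).
measured = [2.0, 4.0, 6.0]
predicted = [2.5, 4.0, 5.5]

print("RMSE:", round(rmse(measured, predicted), 3))
print("R2:", round(r_squared(measured, predicted), 4))
```

RMSE is expressed in the same units as the target (here, pKa units), whereas R² is dimensionless. Reporting both therefore gives an absolute error scale alongside the fraction of variance explained.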

## Abstract

The acid dissociation constant (pKa) is a fundamental parameter governing the environmental fate of organic compounds. Accurate pKa prediction remains challenging, as traditional binary Morgan fingerprints (B-MF) fail to capture stoichiometric information critical for modeling substituent effects. This study developed an interpretable machine learning framework for pKa prediction by integrating count-based Morgan fingerprints (C-MF) with ensemble algorithms. In a systematic comparison across four algorithms (CatBoost, XGBoost, GBDT, RF), C-MF consistently outperformed B-MF due to its ability to quantify functional group multiplicity. Subsequent SHAP-based recursive feature elimination (SHAP-RFE) optimized the model, identifying CatBoost with only 81 features as the optimal architecture, achieving a test-set R² of 0.890 and RMSE of 1.026. SHAP analysis revealed that the model’s decisions are driven by chemically intuitive features, forming a hierarchical framework where primary ionizable sites set the baseline pKa and electronic modifiers fine-tune it. The applicability domain, defined using the ADSAL method, yielded high-confidence predictions (R² = 0.926). In external validation on an independent open-source dataset of 6876 acidic compounds, ADSAL applicability domain characterization placed 390 compounds within the domain, and the model predicted their pKa accurately (R² = 0.890, RMSE = 0.942), further confirming its strong generalizability. This work provides a robust and generalizable tool for high-performance pKa prediction, with significant potential for applications in environmental risk assessment.
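The general shape of an importance-driven recursive feature elimination loop can be sketched as follows. This summary does not give the authors' exact SHAP-RFE procedure, so everything here is an assumption: the function `recursive_elimination`, its parameters, and the toy `importance_fn` and `score_fn` are hypothetical stand-ins. In a setting like the paper's, the importance callback would presumably return mean absolute SHAP values from a refitted model and the score callback a validation metric.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def recursive_elimination(
    features: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    score_fn: Callable[[List[str]], float],
    drop_per_round: int = 1,
    min_features: int = 1,
) -> Tuple[List[str], float]:
    """Repeatedly drop the least important features and keep the best subset.

    importance_fn and score_fn are re-evaluated on every round, mirroring
    the refit-and-rerank idea behind recursive feature elimination.
    """
    current = list(features)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) - drop_per_round >= min_features:
        importance = importance_fn(current)
        # Rank ascending so the weakest features come first.
        ranked = sorted(current, key=lambda name: importance[name])
        dropped = set(ranked[:drop_per_round])
        current = [name for name in current if name not in dropped]
        score = score_fn(current)
        if score >= best_score:  # on ties, prefer the smaller subset
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy stand-ins (assumptions): fixed weights play the role of importances,
# and the score rewards informative features minus a small size penalty.
WEIGHTS = {"f_acid": 0.6, "f_halogen": 0.3, "f_noise1": 0.0, "f_noise2": 0.0}


def toy_importance(subset: List[str]) -> Dict[str, float]:
    return {name: WEIGHTS[name] for name in subset}


def toy_score(subset: List[str]) -> float:
    return sum(WEIGHTS[name] for name in subset) - 0.01 * len(subset)


best, best_score = recursive_elimination(list(WEIGHTS), toy_importance, toy_score)
print(best, round(best_score, 2))
```

The loop keeps the best-scoring subset seen rather than the last one. That is one plausible way a reduced feature set, such as the 81 features reported here, could be selected without continuing past the point where the score degrades.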

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13029067/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13029067/full.md

## References

46 references — full list in the complete paper: https://tomesphere.com/paper/PMC13029067/full.md

---
Source: https://tomesphere.com/paper/PMC13029067