# Application of machine learning models in pharmaceutical engineering for prediction of pharmaceuticals solubility in supercritical solvent: study on phenytoin solubility

**Authors:** Shengnan Yu, Yang Chen, Weidong Qiang

PMC · DOI: 10.3389/fchem.2026.1775080 · Frontiers in Chemistry · 2026-03-13

## TL;DR

This study uses machine learning to predict CO2 density and the solubility of phenytoin in supercritical carbon dioxide. A bagging ensemble of polynomial regression models (BAG + PR) performed best, reaching R² scores of 0.9949 for density and 0.97833 for solubility.

## Contribution

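The paper presents bagging combined with polynomial regression, tuned with the Bat Algorithm, as a novel approach to predicting drug solubility in supercritical solvents, and shows that it reaches high accuracy for phenytoin in supercritical CO2.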

## Key findings

- The BAG + PR model achieved an R² score of 0.9949 for CO2 density and 0.97833 for phenytoin solubility.
- BAG + PR had the lowest RMSE, AARD%, and Maximum Error compared to BAG + KNN and BAG + GR models.
- Bagging with polynomial regression outperformed other ensemble combinations in prediction accuracy and precision.
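
The metrics named above can be sketched in a few lines. This is a minimal illustration using common textbook definitions, not the paper's own code: the summary does not give the exact formulas, so the AARD% form below (mean of absolute relative errors, times 100) is an assumption, and the sample values are made up purely to exercise the functions.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def aard_percent(y_true, y_pred):
    """Average absolute relative deviation in percent (assumed definition).

    Requires every true value to be non-zero.
    """
    n = len(y_true)
    return 100.0 / n * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))

def max_error(y_true, y_pred):
    """Largest absolute deviation between a true and a predicted value."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))

# Made-up values for illustration only; these are not data from the paper.
y_true = [1.0, 2.0, 4.0]
y_pred = [1.1, 1.9, 4.0]
metrics = {
    "R2": r2_score(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "AARD%": aard_percent(y_true, y_pred),
    "MaxError": max_error(y_true, y_pred),
}
```

A higher R² and lower RMSE, AARD%, and maximum error all point the same way, which is the pattern reported for BAG + PR relative to BAG + KNN and BAG + GR.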

## Abstract

This research investigates the predictive performance of ensemble learning models, specifically Bagging, when combined with weak models including Polynomial Regression (PR), K-Nearest Neighbors (KNN), and Gamma Regression (GR) to estimate drug solubility in supercritical carbon dioxide as the solvent. The models were trained and optimized using the Bat Algorithm (BA). The objective was to accurately predict two important properties: CO2 density and the solubility of phenytoin in it. The bagging technique was applied to combine the predictions of multiple weak models, enhancing overall performance. The results demonstrated remarkable predictive capabilities of the Bagging model with Polynomial Regression (BAG + PR) for both CO2 density and drug solubility. It achieved a high R² score of 0.9949 for CO2 density and 0.97833 for solubility. The BAG + PR model also exhibited the lowest Root Mean Square Error (RMSE), indicating superior accuracy in predictions. Moreover, it exhibited the lowest Average Absolute Relative Deviation (AARD%) and Maximum Error, further validating its effectiveness in accurately capturing the relationships among the variables. Comparatively, the BAG + KNN and BAG + GR models also performed well but fell short of the BAG + PR model. While they showed respectable R² scores, their RMSE values were higher, suggesting larger prediction errors. The AARD% and Maximum Error metrics were also higher for these models, indicating less precise and more variable predictions.
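
To make the bagging idea in the abstract concrete, the following is a minimal sketch of bootstrap aggregation over polynomial regression models, written with NumPy only. It is not the authors' implementation. Everything specific here is an assumption: the two scaled inputs (imagined as temperature and pressure), the polynomial degree of 2, the ensemble size of 25, and the synthetic target. Bat Algorithm tuning is omitted, and the data are generated, not taken from the paper.

```python
import numpy as np

def poly2_features(X):
    """Degree-2 polynomial expansion of two input columns, plus a bias term."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

def fit_bagged_poly(X, y, n_estimators=25, seed=0):
    """Fit one least-squares polynomial model per bootstrap resample."""
    rng = np.random.default_rng(seed)
    phi = poly2_features(X)
    coefs = []
    for _ in range(n_estimators):
        # Bootstrap: draw len(X) row indices with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        w, *_ = np.linalg.lstsq(phi[idx], y[idx], rcond=None)
        coefs.append(w)
    return np.array(coefs)  # shape: (n_estimators, n_features)

def predict_bagged_poly(coefs, X):
    """Aggregate by averaging the predictions of all fitted models."""
    return (poly2_features(X) @ coefs.T).mean(axis=1)

# Synthetic data for illustration only (hypothetical, not from the paper):
# two inputs scaled to [0, 1] and a smooth target with a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(40, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] * X[:, 1] + rng.normal(0.0, 0.01, 40)

coefs = fit_bagged_poly(X, y)
y_hat = predict_bagged_poly(coefs, X)

# Training-set R² of the aggregated prediction.
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

Averaging over resampled fits mainly reduces the variance of the individual models. In practice a library ensemble such as scikit-learn's `BaggingRegressor` wrapped around a polynomial pipeline would be the more usual route, and a held-out test set would be needed for an honest accuracy estimate.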

## Linked entities

- **Chemicals:** phenytoin (PubChem CID 1775)

## Full-text entities

- **Chemicals:** CO2 (MESH:D002245), BAG (-), phenytoin (MESH:D010672)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13021901/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13021901/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/PMC13021901/full.md

---
Source: https://tomesphere.com/paper/PMC13021901