# Multi-fidelity graph neural networks for predicting toluene/water partition coefficients

**Authors:** Thomas Nevolianis, Jan G. Rittig, Alexander Mitsos, Kai Leonhard

PMC · DOI: 10.1186/s13321-025-01057-6 · Journal of Cheminformatics · 2025-08-08

## TL;DR

This paper applies multi-fidelity learning with graph neural networks to predict toluene/water partition coefficients when experimental data are scarce, supplementing about 250 experimental values with roughly 9000 COSMO-RS values.

## Contribution

The novel approach combines low-fidelity quantum chemical data with high-fidelity experimental data using multi-target learning to improve prediction accuracy.
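As a rough illustration of how multi-target learning can use both fidelity levels at once, the sketch below computes a masked two-target loss. One target is the low-fidelity COSMO-RS value and the other is the high-fidelity experimental value, which is missing for most molecules. The function name, the masking scheme, the task weights, and all numbers are assumptions made for illustration; they are not taken from the paper, and the actual model is a graph neural network rather than the fixed predictions used here.

```python
def masked_multi_target_mse(predictions, targets, weights=(1.0, 1.0)):
    """Weighted sum of per-target mean squared errors, skipping missing labels.

    predictions: list of (low_fidelity_pred, high_fidelity_pred) tuples
    targets:     list of (low_fidelity_label, high_fidelity_label) tuples,
                 where a missing label is None
    weights:     hypothetical per-target weights (an assumption, not from the paper)
    """
    total = 0.0
    for task, weight in enumerate(weights):
        # Keep only the molecules that actually have a label for this target.
        squared_errors = [
            (pred[task] - label[task]) ** 2
            for pred, label in zip(predictions, targets)
            if label[task] is not None
        ]
        if squared_errors:
            total += weight * sum(squared_errors) / len(squared_errors)
    return total


# Hypothetical logP values: three molecules with low-fidelity labels,
# only one of which also has an experimental (high-fidelity) label.
example_predictions = [(1.0, 1.2), (2.0, 2.1), (0.5, 0.4)]
example_targets = [(1.5, None), (2.0, 2.6), (0.0, None)]
loss = masked_multi_target_mse(example_predictions, example_targets)
```

Masking lets every molecule contribute to the low-fidelity term while only the experimentally measured ones contribute to the high-fidelity term, so the small experimental set is not diluted by placeholder values.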

## Key findings

- Multi-target learning achieves an RMSE of 0.44 logP units on the EXT-Zamora dataset, compared with 0.63 logP units for single-task models.
- The method shows reasonable performance on more complex molecules with an RMSE of 1.02 logP units on the EXT-SAMPL9 dataset.
- Combining COSMO-RS data with experimental data improves model accuracy and applicability.
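The RMSE values above are root-mean-square errors between predicted and measured logP. A minimal sketch of that metric follows; the function name and the example values are hypothetical and are not data from the paper.

```python
import math


def rmse(predicted, measured):
    """Root-mean-square error between two equal-length sequences of logP values."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values, for illustration only.
predicted_logp = [1.0, 2.0, 3.0]
measured_logp = [1.5, 2.0, 2.5]
error = rmse(predicted_logp, measured_logp)
```

Because the errors are squared before averaging, a few large misses on difficult molecules raise the RMSE more than many small ones.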

## Abstract

Accurate prediction of toluene/water partition coefficients of neutral species is crucial in drug discovery and separation processes; however, data-driven modeling of these coefficients remains challenging due to limited available experimental data. To address the limitation of available data, we apply multi-fidelity learning approaches leveraging a quantum chemical dataset (low fidelity) of approximately 9000 entries generated by COSMO-RS and an experimental dataset (high fidelity) of about 250 entries collected from the literature. We explore the transfer learning, feature-augmented learning, and multi-target learning approaches in combination with graph neural networks, validating them on two external datasets: one with molecules similar to training data (EXT-Zamora) and one with more challenging molecules (EXT-SAMPL9). Our results show that multi-target learning significantly improves predictive accuracy, achieving a root-mean-square error of 0.44 logP units for the EXT-Zamora dataset, compared to a root-mean-square error of 0.63 logP units for single-task models. For the EXT-SAMPL9 dataset, multi-target learning achieves a root-mean-square error of 1.02 logP units, indicating reasonable performance even for more complex molecular structures. These findings highlight the potential of multi-fidelity learning approaches that leverage quantum chemical data to improve toluene/water partition coefficient predictions and address challenges posed by limited experimental data. We expect the methods used to be applicable beyond toluene/water partition coefficients.

The online version contains supplementary material available at 10.1186/s13321-025-01057-6.

We investigate the benefits of transfer learning, feature-augmented learning, and multi-target learning approaches in combination with graph neural networks for the prediction of toluene/water partition coefficients. We show that combining a large amount of inexpensive data from the semi-empirical COSMO-RS model with a small set of high-fidelity experimental data through multi-target learning efficiently yields machine learning models with broad applicability and root-mean-square errors of 0.44 to 1.02 log units in the partition coefficient, depending on the test set.


## Linked entities

- **Chemicals:** toluene (PubChem CID 1140)

## Full-text entities

- **Chemicals:** toluene (MESH:D014050), water (MESH:D014867)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12333204/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12333204/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/PMC12333204/full.md

---
Source: https://tomesphere.com/paper/PMC12333204