# Out-of-distribution evaluation of active learning pipelines for molecular property prediction

**Authors:** Tianzhixi Yin, Peiyuan Gao, Gihan Panapitiya, Emily G. Saldanha

PMC · DOI: 10.1039/d5ra08055j · RSC Advances · 2026-01-23

## TL;DR

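This paper evaluates active learning for molecular property prediction on out-of-distribution (OOD) data, showing that uncertainty-guided selection based on evidential deep learning outperforms random sampling when predicting solvation energy for unseen molecules.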

## Contribution

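The study introduces an active learning framework that uses prediction uncertainties from evidential deep learning (EDL) to choose which OOD molecules to add to the training set, and evaluates how well the resulting models generalize to OOD molecular data.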
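To make "prediction uncertainties derived from EDL" concrete, here is a minimal sketch of how uncertainty is commonly read off an evidential regression head. It assumes the Normal-Inverse-Gamma parameterization used in deep evidential regression, with outputs `gamma`, `nu`, `alpha`, and `beta`; the summary does not state which EDL formulation the paper uses, so the class and function names below are illustrative, not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class NIGParams:
    """Hypothetical container for one molecule's evidential outputs."""
    gamma: float  # predicted mean of the target (e.g. solvation energy)
    nu: float     # virtual observation count for the mean, must be > 0
    alpha: float  # shape parameter, must be > 1 for finite variance
    beta: float   # scale parameter, must be > 0


def evidential_uncertainty(p: NIGParams) -> tuple[float, float, float]:
    """Return (prediction, aleatoric variance, epistemic variance).

    Assumes the Normal-Inverse-Gamma form:
      aleatoric = E[sigma^2] = beta / (alpha - 1)
      epistemic = Var[mu]    = beta / (nu * (alpha - 1))
    """
    if p.nu <= 0 or p.alpha <= 1 or p.beta <= 0:
        raise ValueError("require nu > 0, alpha > 1, beta > 0")
    aleatoric = p.beta / (p.alpha - 1.0)
    epistemic = p.beta / (p.nu * (p.alpha - 1.0))
    return p.gamma, aleatoric, epistemic


# Example with made-up parameter values (not data from the paper).
pred, alea, epis = evidential_uncertainty(NIGParams(-3.0, 2.0, 3.0, 4.0))
print(pred, alea, epis)
```

Under this assumption, an acquisition function could rank candidate molecules by the epistemic term, which shrinks as the model accumulates evidence (larger `nu`) for a region of chemical space.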

## Key findings

- Evidential deep learning-based active learning outperforms random sampling in predicting solvation energy for out-of-distribution molecules.
- Active learning improves generalization to unseen chemical space compared to random sampling.
- The similarity between training and test datasets significantly affects active learning performance.
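The last finding depends on some measure of how close the held-out molecules are to the training set. The summary does not say which measure the authors use; one common choice, sketched here as an assumption, is the Tanimoto similarity between fingerprint bit sets, averaged over each test molecule's nearest training neighbor. Fingerprints are represented as plain sets of "on" bit indices so the example stays dependency-free, and both function names are hypothetical.

```python
from typing import FrozenSet, Sequence

Fingerprint = FrozenSet[int]  # indices of the bits set in a molecular fingerprint


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def mean_nearest_neighbor_similarity(
    train: Sequence[Fingerprint], test: Sequence[Fingerprint]
) -> float:
    """Average, over test molecules, of the similarity to the closest training molecule."""
    if not train or not test:
        raise ValueError("train and test must be non-empty")
    return sum(max(tanimoto(t, s) for s in train) for t in test) / len(test)


# Toy fingerprints (made up for illustration, not real molecules).
train_fps = [frozenset({1, 2, 3}), frozenset({4, 5})]
test_fps = [frozenset({1, 2}), frozenset({6})]
print(mean_nearest_neighbor_similarity(train_fps, test_fps))
```

A value near 1 means the test set is nearly in-distribution; a value near 0 means it occupies chemical space the training set barely covers.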

## Abstract

Active learning (AL) has been widely applied as a strategy to reduce the data requirements of training machine learning models. Such a strategy can be especially valuable in fields where data collection is costly or time-consuming, as is the case for molecular property data. In this study, we evaluate AL for molecular property prediction, focusing on the performance on out-of-distribution (OOD) data. This OOD evaluation framework mimics the scenario found in real-world applications but is understudied in the prior literature. In our study, we focus on the prediction of solvation energy from molecular structure and develop an AL framework based on prediction uncertainties derived from Evidential Deep Learning (EDL). We started by training our model on an in-distribution training dataset and progressively augmented it with molecules from an OOD dataset sampled from PubChem, selected either randomly or using the AL strategy. We further examined the generalization capabilities of AL by beginning with a subset of the in-distribution dataset, intentionally chosen to reduce initial diversity. Our results indicate that EDL-based AL demonstrates an advantage over random sampling. To further understand the behavior of the AL algorithm, we analyzed how the similarity between the training dataset and the held-out dataset affects AL performance, and how the types of molecules selected by random sampling and by AL differ in distribution.

Evidential deep learning–driven active learning beats random sampling for OOD solvation free-energy prediction, delivering lower test RMSE at the same labeling budget and improving generalization to unseen chemical space.
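The procedure described in the abstract (start from an in-distribution training set, then repeatedly add OOD pool molecules chosen either at random or by model uncertainty) can be sketched as the loop below. This is a sketch under stated assumptions, not the authors' implementation: the batch size, round count, `oracle` labeling function, and `uncertainty` callable are all hypothetical, and the toy uncertainty (distance to the nearest labeled point in a 1-D feature space) merely stands in for an EDL model retrained each round.

```python
import random
from typing import Callable, List, Sequence, Tuple

Labeled = Tuple[float, float]  # (feature, label); a 1-D stand-in for a molecule


def select_batch(scores: Sequence[float], k: int, strategy: str,
                 rng: random.Random) -> List[int]:
    """Pick k pool indices, either uniformly at random or by highest uncertainty."""
    if strategy == "random":
        return rng.sample(range(len(scores)), k)
    if strategy == "uncertainty":
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    raise ValueError(f"unknown strategy: {strategy}")


def active_learning_loop(
    train: Sequence[Labeled],
    pool: Sequence[float],
    oracle: Callable[[float], float],
    uncertainty: Callable[[Sequence[Labeled], float], float],
    rounds: int,
    batch_size: int,
    strategy: str,
    seed: int = 0,
) -> List[Labeled]:
    """Grow the training set from the pool over several acquisition rounds."""
    rng = random.Random(seed)
    train, pool = list(train), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        k = min(batch_size, len(pool))
        # Scores are recomputed against the current training set each round,
        # standing in for retraining the model before each acquisition.
        scores = [uncertainty(train, x) for x in pool]
        picked = select_batch(scores, k, strategy, rng)
        for i in sorted(picked, reverse=True):  # pop from the end to keep indices valid
            x = pool.pop(i)
            train.append((x, oracle(x)))  # "label" the selected point
    return train


def distance_uncertainty(train: Sequence[Labeled], x: float) -> float:
    """Toy uncertainty: distance from x to the nearest labeled feature."""
    return min(abs(x - f) for f, _ in train)


result = active_learning_loop(
    train=[(0.0, 0.0)], pool=[0.1, 5.0, 0.2, 9.0], oracle=lambda x: 2.0 * x,
    uncertainty=distance_uncertainty, rounds=1, batch_size=2, strategy="uncertainty",
)
print(result)
```

Running the same loop with `strategy="random"` gives the baseline the paper compares against; the comparison of interest is test error on held-out OOD data at equal labeling budgets.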

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12829442/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12829442/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC12829442/full.md

---
Source: https://tomesphere.com/paper/PMC12829442