# Controlled exploration of chemical space by machine learning of coarse-grained representations

**Authors:** Christian Hoffmann, Roberto Menichetti, Kiran H. Kanekal, Tristan Bereau

arXiv: 1905.01897 · 2019-09-11

## TL;DR

This paper combines importance sampling (a Markov chain Monte Carlo scheme across compounds) with a machine-learning model built on a coarse-grained representation to explore chemical space systematically. The result is a larger set of compounds with reliable property predictions and a database of drug-membrane insertion free energies for 1.3 million compounds.

## Contribution

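The paper pairs MCMC sampling across compounds with an ML model trained on a coarse-grained representation. Compounds are selected according to the target property rather than arbitrarily, and linear relationships between transfer free energies indicate the region of chemical space where predictions remain reliable.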

## Key findings

- The ML boosting step increased the number of compounds covered by a factor of 2 to 10.
- The ML model correctly recovers linear relationships between transfer free energies.
- Created a database of 1.3 million compounds with predicted drug-membrane insertion energies.
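
One way to read the second finding is as a consistency check: if two transfer free energies are linearly related on reference data, ML predictions that stray far from that line are suspect. The sketch below is only an illustration of that idea, not the authors' procedure. The function names, the tolerance `tol`, and all numbers are hypothetical, and the data are synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_deviations(xs, ys, slope, intercept, tol):
    """Return True for each point whose residual from the line exceeds tol."""
    return [abs(y - (slope * x + intercept)) > tol for x, y in zip(xs, ys)]


# Synthetic "reference" free energies (arbitrary units, not from the paper),
# constructed to be exactly linear: y = 2x + 1.
ref_x = [-4.0, -2.0, 0.0, 2.0, 4.0]
ref_y = [2.0 * x + 1.0 for x in ref_x]
slope, intercept = fit_line(ref_x, ref_y)

# Synthetic "predicted" pairs; the last one breaks the linear relation.
pred_x = [-3.0, 1.0, 3.0]
pred_y = [-5.0, 3.1, 10.0]
flags = flag_deviations(pred_x, pred_y, slope, intercept, tol=0.5)
```

Here only the third synthetic prediction is flagged. The appropriate tolerance in a real setting would have to come from the data, and is left as an open assumption in this sketch.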

## Abstract

The size of chemical compound space is too large to be probed exhaustively. This leads high-throughput protocols to drastically subsample and results in sparse and non-uniform datasets. Rather than arbitrarily selecting compounds, we systematically explore chemical space according to the target property of interest. We first perform importance sampling by introducing a Markov chain Monte Carlo scheme across compounds. We then train an ML model on the sampled data to expand the region of chemical space probed. Our boosting procedure enhances the number of compounds by a factor of 2 to 10, enabled by the ML model's coarse-grained representation, which both simplifies the structure-property relationship and reduces the size of chemical space. The ML model correctly recovers linear relationships between transfer free energies. These linear relationships correspond to features that are global to the dataset, marking the region of chemical space up to which predictions are reliable, a more robust alternative to the predictive variance. Bridging coarse-grained simulations with ML gives rise to an unprecedented database of drug-membrane insertion free energies for 1.3 million compounds.
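
The importance-sampling step can be pictured as a Metropolis walk in which each state is a compound rather than a molecular configuration. The following is a minimal sketch under stated assumptions, not the scheme used in the paper. The bead alphabet, the additive `toy_free_energy` (a stand-in for an expensive coarse-grained simulation), the single-bead swap move, and the value of `beta` are all hypothetical.

```python
import math
import random

# Hypothetical bead types with made-up per-bead contributions
# (arbitrary units); these are NOT values from the paper.
BEAD_ENERGY = {"A": -1.0, "B": -0.3, "C": 0.4, "D": 1.2}
BEADS = sorted(BEAD_ENERGY)


def toy_free_energy(compound):
    """Stand-in for a simulation: additive toy property of a bead tuple."""
    return sum(BEAD_ENERGY[b] for b in compound)


def propose(compound, rng):
    """Symmetric move: swap one randomly chosen bead for a different type."""
    i = rng.randrange(len(compound))
    new_bead = rng.choice([b for b in BEADS if b != compound[i]])
    return compound[:i] + (new_bead,) + compound[i + 1:]


def metropolis_over_compounds(start, n_steps, beta, seed=0):
    """Sample compounds with probability proportional to exp(-beta * G)."""
    rng = random.Random(seed)
    current, g_current = start, toy_free_energy(start)
    visited = {current: g_current}
    for _ in range(n_steps):
        candidate = propose(current, rng)
        g_candidate = toy_free_energy(candidate)
        # Every evaluated compound is kept as potential training data.
        visited[candidate] = g_candidate
        # Metropolis acceptance criterion for a symmetric proposal.
        delta = g_candidate - g_current
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            current, g_current = candidate, g_candidate
    return current, visited


final, visited = metropolis_over_compounds(
    ("C", "D", "D"), n_steps=500, beta=2.0
)
```

The `visited` dictionary plays the role of the sampled dataset on which an ML model would then be trained. In this toy the walk is biased toward low values of the property, which is one possible choice of target distribution among many.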

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.01897/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1905.01897/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/1905.01897/full.md

---
Source: https://tomesphere.com/paper/1905.01897