# Representative Random Sampling of Chemical Space

**Authors:** Diego J. Monterrubio-Chanca, Guido Falk von Rudorff

arXiv: 2508.20609 · 2025-08-29

## TL;DR

This paper introduces a method for unbiased, representative random sampling of chemical space, enabling estimation of the total number of molecules and assessing the representativeness of existing chemical databases without enumerating all molecules.

## Contribution

The work presents a sampling technique for chemical space that is unbiased, efficient, and applicable to any molecule that can be represented as a graph, along with a way to estimate the number of molecules in a custom chemical space.

## Key findings

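- The method runs efficiently even for molecules of 30 atoms.
- It enables estimation of the number of molecules in any custom chemical space, without enumerating them.
- It is used to estimate how representative current databases are of their underlying chemical space, and to establish a necessary criterion for a lower bound on the size a database needs to be representative.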

## Abstract

The overwhelming majority of molecules remains unexplored. This is mostly due to the sheer number of them, which prohibits any enumeration of chemical space, the set of all such molecules. In practice, only subsets of chemical space are considered, but those subsets exhibit substantial bias, prohibiting data-driven characterization of chemical space itself. In this work, we provide a method to produce unbiased representative random samples of chemical space without enumeration of constituent molecules and to estimate the number of molecules in any custom chemical space. The approach is applicable to molecules which can be represented as a graph and runs efficiently even for molecules of 30 atoms. We use it to estimate the representativeness of current databases with respect to their underlying chemical space and to establish a necessary criterion for a lower bound of database sizes to be representative of an underlying chemical space.
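
To illustrate why an unbiased sampler is useful for judging a database, here is a minimal Python sketch of a generic Monte Carlo argument. It is not the authors' method, and the paper's sampler for molecular graphs is not reproduced here. The sketch assumes a toy space of integers standing in for molecules, a hypothetical `sample_uniform` callable that draws uniformly from that space, and a database that is a known subset of it. Under those assumptions, the fraction of uniform draws found in the database estimates its coverage, and dividing the database size by that fraction gives a rough estimate of the size of the space.

```python
import random


def estimate_coverage(sample_uniform, database, n_samples, rng):
    """Fraction of uniform draws from the space that also appear in the database."""
    hits = sum(1 for _ in range(n_samples) if sample_uniform(rng) in database)
    return hits / n_samples


def estimate_space_size(database_size, coverage):
    """Invert coverage = |database| / |space|; undefined if no draw hit the database."""
    if coverage == 0:
        raise ValueError("no sampled item was found in the database")
    return database_size / coverage


# Toy stand-in for a chemical space (assumption): the integers 0..SPACE_SIZE-1.
SPACE_SIZE = 100_000
rng = random.Random(0)

# A deliberately biased "database": it only contains the small items.
database = set(range(20_000))

# Hypothetical uniform sampler over the toy space.
coverage = estimate_coverage(lambda r: r.randrange(SPACE_SIZE), database, 20_000, rng)
size_estimate = estimate_space_size(len(database), coverage)

print(f"estimated coverage:   {coverage:.3f}")
print(f"estimated space size: {size_estimate:.0f}")
```

The estimate is only as good as the sampler: if the draws were biased toward the region the database covers, the coverage would be overestimated and the space size underestimated, which is why unbiased sampling matters for this kind of assessment.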

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20609/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20609/full.md

## References

83 references — full list in the complete paper: https://tomesphere.com/paper/2508.20609/full.md

---
Source: https://tomesphere.com/paper/2508.20609