# Structural Bias in Three-Dimensional Autoregressive Generative Machine Learning of Organic Molecules

**Authors:** Zsuzsanna Koczor-Benda, Joe Gilkes, Francesco Bartucca, Abdulla Al-Fekaiki, Reinhard J. Maurer

PMC · DOI: 10.1021/acs.jcim.5c00665 · Journal of Chemical Information and Modeling · 2025-06-25

## TL;DR

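This paper shows that G-SchNet, an autoregressive generative model for 3D organic molecules, produces molecules whose chemical composition deviates systematically from the training data, which in turn shifts the distributions of electronic properties such as the HOMO–LUMO gap.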

## Contribution

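The study identifies a structural bias in molecules generated by G-SchNet that is largely unaffected by hyperparameters or the training data set distribution, and shows how this bias distorts the sampled chemical space and the resulting electronic property distributions.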

## Key findings

- On average, generated molecules are less saturated and contain more heteroatoms than the training data.
- Purely aliphatic molecules are mostly absent from the generated set.
- These chemical differences shift the distributions of electronic properties such as the HOMO–LUMO gap.
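
To make "less saturated" and "more heteroatoms" concrete, the sketch below computes two simple composition descriptors from a molecular formula: the degree of unsaturation (the standard formula (2C + 2 + N − H − X)/2, with X counting halogens) and the fraction of heavy atoms that are not carbon. This is only an illustration. The function names are hypothetical, and the descriptors the authors actually used are not given in this summary.

```python
import re

# Halogens count like hydrogen in the degree-of-unsaturation formula.
HALOGENS = {"F", "Cl", "Br", "I"}

def parse_formula(formula):
    """Parse a simple formula such as 'C6H6' into a dict of element counts."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts

def degree_of_unsaturation(counts):
    """Rings plus pi bonds: (2C + 2 + N - H - X) / 2; O and S do not contribute."""
    c = counts.get("C", 0)
    h = counts.get("H", 0)
    n = counts.get("N", 0)
    x = sum(counts.get(el, 0) for el in HALOGENS)
    return (2 * c + 2 + n - h - x) / 2

def heteroatom_fraction(counts):
    """Fraction of heavy (non-hydrogen) atoms that are not carbon."""
    heavy = sum(v for k, v in counts.items() if k != "H")
    hetero = sum(v for k, v in counts.items() if k not in ("C", "H"))
    return hetero / heavy if heavy else 0.0

# Hexane (saturated, aliphatic), benzene, and pyridine as reference points.
for name, formula in [("hexane", "C6H14"), ("benzene", "C6H6"), ("pyridine", "C5H5N")]:
    counts = parse_formula(formula)
    print(name, degree_of_unsaturation(counts), round(heteroatom_fraction(counts), 3))
```

Averaging such descriptors over a training set and a generated set would give a first, coarse view of the kind of shift described above. A structure-aware toolkit would be needed to detect functional groups or aliphatic character properly.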

## Abstract

A range of generative machine learning models for the design of
novel molecules and materials have been proposed in recent years.
Models that can generate three-dimensional structures are particularly
suitable for quantum chemistry workflows, enabling direct property
prediction. The performance of generative models is typically assessed
based on their ability to produce novel, valid, and unique molecules.
However, equally important is their ability to learn the prevalence
of functional groups and certain chemical moieties in the underlying
training data, that is, to faithfully reproduce the chemical space
spanned by the training data. Here, we investigate the ability of
the autoregressive generative machine learning model G-SchNet to reproduce
the chemical space and property distributions of training data sets
composed of large, functional organic molecules. We assess the elemental
composition, size- and bond-length distributions, as well as the functional
group and chemical space distribution of training and generated molecules.
By principal component analysis of the chemical space, we find that
the model leads to a biased generation that is largely unaffected
by the choice of hyperparameters or the training data set distribution,
producing molecules that are, on average, less saturated and contain
more heteroatoms. Purely aliphatic molecules are mostly absent in
the generation. We further investigate generation with functional
group constraints and based on composite data sets, which can help
to partially remedy the model generation bias. Decision tree models
can recognize the generation bias in the models and discriminate between
training and generated data, revealing key chemical differences between
the two sets. The chemical differences we find affect the distributions
of electronic properties such as the HOMO–LUMO gap, which is
a common target for functional molecule design.
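
The idea of using a decision tree to tell training molecules from generated ones can be illustrated with a depth-one tree (a decision stump) that searches for the single descriptor threshold with the lowest Gini impurity. The sketch below is a minimal, assumption-laden example: the descriptor columns and all numbers are invented for illustration and are not data from the paper, and the authors' actual features and tree implementation are not described in this summary.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_stump(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best single split."""
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

# Toy, invented descriptors (NOT data from the paper).
# Columns: [heteroatom fraction, unsaturation per heavy atom]
rows = [
    [0.10, 0.20], [0.15, 0.25], [0.05, 0.10], [0.20, 0.30],  # label 0: "training"
    [0.30, 0.45], [0.35, 0.50], [0.25, 0.40], [0.40, 0.55],  # label 1: "generated"
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

feature, threshold, impurity = best_stump(rows, labels)
print("split on feature", feature, "at", round(threshold, 3), "impurity", impurity)
```

If a shallow tree separates the two sets well, the features it splits on point to the chemical differences between them, which is the interpretive use the abstract describes. A full analysis would use a deeper tree and held-out data to check that the separation generalizes.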

## Full-text entities

- **Chemicals:** Organic (-)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12264931/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12264931/full.md

## References

41 references — full list in the complete paper: https://tomesphere.com/paper/PMC12264931/full.md

---
Source: https://tomesphere.com/paper/PMC12264931