# Data augmentation in a triple transformer loop retrosynthesis model

**Authors:** Yves Grandjean, David Kreutter, Jean-Louis Reymond

PMC · DOI: 10.1039/d5dd00465a · Digital Discovery · 2026-01-21

## TL;DR

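This paper reduces reaction-type bias in the USPTO reaction dataset by generating fictive reactions from retrosynthesis templates and validating them with a triple transformer loop (TTL). A single-step retrosynthesis model trained on the resulting data outperforms one trained on USPTO reactions only.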

## Contribution

A data augmentation approach that applies USPTO retrosynthesis templates to generate fictive reactions and validates them with the triple transformer loop (TTL), producing a template-equilibrated reaction dataset.

## Key findings

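- Generated 27.5 million validated fictive reactions from USPTO templates, with up to 5000 reactions per template.
- A single-step retrosynthesis transformer trained on a template-equilibrated subset of 1 097 374 fictive reactions outperformed the corresponding model trained on USPTO reactions only.
- The fictive reactions cover the chemical space of the original USPTO dataset while reducing its skew towards a few over-represented reaction types.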

## Abstract

Reactions in the US Patent Office (USPTO) dataset are biased towards a few over-represented reaction types, which potentially limits their usefulness for computer-assisted synthesis planning (CASP). To obtain an equilibrated dataset, we applied retrosynthesis templates to USPTO molecules as products (P) to generate starting materials (SM). We then used transformer T2 from our recently reported triple transformer loop (TTL) retrosynthesis model to predict reagents (R) for the SM → P reaction. Finally, we validated each reaction by requiring TTL transformer T3 to predict P from SM + R with high confidence (>95%). We generated up to 5000 reactions per template, resulting in 27.5 million validated fictive reactions covering the chemical space of the original USPTO dataset. To exemplify the use of this dataset, we demonstrate that a single-step retrosynthesis transformer model trained on a template-equilibrated subset of 1 097 374 fictive reactions outperforms the corresponding model trained on USPTO reactions only.
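The generate-and-validate procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `apply_template`, `predict_reagents` (standing in for T2) and `predict_product` (standing in for T3) are hypothetical callables supplied by the caller, and comparing the predicted product to P by string equality is an assumption about how a match is decided. Only the 95% threshold and the cap of 5000 reactions per template come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

CONFIDENCE_THRESHOLD = 0.95  # from the abstract: >95% confidence required
MAX_PER_TEMPLATE = 5000      # from the abstract: up to 5000 reactions per template


@dataclass(frozen=True)
class Reaction:
    starting_materials: str  # SM
    reagents: str            # R
    product: str             # P
    template_id: str


def generate_validated(
    template_id: str,
    products: Iterable[str],
    # Hypothetical: (template_id, P) -> candidate SM strings
    apply_template: Callable[[str, str], List[str]],
    # Hypothetical stand-in for T2: (SM, P) -> R
    predict_reagents: Callable[[str, str], str],
    # Hypothetical stand-in for T3: (SM, R) -> (predicted P, confidence)
    predict_product: Callable[[str, str], Tuple[str, float]],
    max_per_template: int = MAX_PER_TEMPLATE,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Iterator[Reaction]:
    """Yield fictive reactions for one template that pass the T3 check."""
    kept = 0
    for product in products:
        for sm in apply_template(template_id, product):
            reagents = predict_reagents(sm, product)
            predicted, confidence = predict_product(sm, reagents)
            # Assumption: a reaction is valid when T3 recovers P exactly
            # and its confidence exceeds the threshold.
            if predicted == product and confidence > threshold:
                yield Reaction(sm, reagents, product, template_id)
                kept += 1
                if kept >= max_per_template:
                    return
```

Passing the three models in as callables keeps the sketch independent of any particular cheminformatics or transformer library; in practice they would wrap a template engine and the trained T2 and T3 models.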

To mitigate bias in the USPTO dataset, we generated fictive reactions from USPTO templates and validated them with a triple transformer loop. Retrosynthesis models trained on this data outperform those trained on USPTO alone.
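This summary does not state how the template-equilibrated subset of 1 097 374 reactions was drawn. One plausible reading, sketched below purely as an assumption, is to cap the number of reactions sampled per template so that no template dominates. The function name `equilibrated_subset` and the `per_template` cap are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def equilibrated_subset(
    reactions: Iterable[Tuple[str, str]],  # (template_id, reaction string) pairs
    per_template: int,                     # hypothetical cap, not from the paper
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Sample at most `per_template` reactions for each template."""
    by_template: Dict[str, List[str]] = defaultdict(list)
    for template_id, rxn in reactions:
        by_template[template_id].append(rxn)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset: List[Tuple[str, str]] = []
    for template_id in sorted(by_template):
        pool = by_template[template_id]
        k = min(per_template, len(pool))
        # Sample without replacement; rare templates keep all their reactions.
        subset.extend((template_id, rxn) for rxn in rng.sample(pool, k))
    return subset
```

With a cap like this, frequent templates are down-sampled while rare ones are kept in full, which flattens the distribution over templates without discarding rare reaction types.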

## Full-text entities

- **Chemicals:** SM → P

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12878001/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12878001/full.md

## References

46 references — full list in the complete paper: https://tomesphere.com/paper/PMC12878001/full.md

---
Source: https://tomesphere.com/paper/PMC12878001