# Transformer Learning in Sequence‐Based Drug Design Depends on Compound Memorization and Similarity of Sequence‐Compound Pairs

**Authors:** Jürgen Bajorath

PMC · DOI: 10.1002/minf.70016 · Molecular Informatics · 2026-01-08

## TL;DR

This paper shows that transformer chemical language models for sequence‐based compound design rely on compound memorization and on similarity between training and test data rather than on learning specific chemical or biological information.

## Contribution

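Using systematic control calculations with modified protein sequences and sequence‐compound pairs, the study shows that encoder‐decoder transformer chemical language models (CLMs) for drug design depend on memorization and statistical correlations rather than on learned sequence‐specific information.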

## Key findings

- Compound reproducibility depends on similarity between training and test data.
- Specific sequence information is not learned by transformer CLMs.
- Predictions are driven by memorization and statistical correlations.
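One way to make the first and third findings concrete is to measure how often a model regenerates known test compounds, and to split that rate by whether each compound also occurred in the training data. The sketch below is an illustration only, not the paper's protocol: the function name `reproducibility` and the exact-string-match criterion are assumptions, and it presumes all compounds are already written as canonical SMILES so that string equality means identical structures.

```python
def reproducibility(generated, test_compounds, training_compounds):
    """Fraction of known test compounds that appear among generated compounds.

    Hypothetical helper: compounds are compared as strings, so inputs are
    assumed to be canonicalized (e.g. canonical SMILES) beforehand.
    """
    generated = set(generated)
    training = set(training_compounds)
    # Remove duplicate test compounds while keeping their order.
    test = list(dict.fromkeys(test_compounds))

    # Split test compounds by whether they were also present in training.
    seen = [c for c in test if c in training]
    unseen = [c for c in test if c not in training]

    def rate(compounds):
        # None signals an empty subset rather than a rate of zero.
        if not compounds:
            return None
        return sum(c in generated for c in compounds) / len(compounds)

    return {
        "overall": rate(test),
        "seen_in_training": rate(seen),
        "unseen": rate(unseen),
    }
```

A large gap between the `seen_in_training` and `unseen` rates would be consistent with memorization, although on its own it does not separate memorization from high similarity between training and test compounds.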

## Abstract

Chemical language models (CLMs), particularly encoder‐decoder transformers, have advanced generative molecular design. Transformer CLMs are able to learn a variety of molecular mappings for compound design that can be conditioned using context‐dependent rules. However, their black‐box nature complicates the interpretation of predictions. Current analysis methods mostly focus on attention weights of token relationships or attention flow in encoder and decoder modules and cannot explain predictions at the molecular level. Sequence‐based compound design was used as a model system to investigate transformer learning characteristics through systematic control calculations involving modifications of protein sequences and sequence‐compound pairs. The analysis revealed that compound reproducibility depended on similarity relationships between training and test data and on compound memorization, while specific sequence information was not learned. These findings indicate that predictions of transformer CLMs are driven by memorization effects and statistical correlations rather than by learning specific chemical or biological information. Understanding this learning behavior aids in avoiding over‐interpretation of model outputs and informs the appropriate application of transformer‐based CLMs in molecular design.
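The abstract mentions control calculations that modify protein sequences and sequence‐compound pairs but does not specify them in this summary. The sketch below shows two generic controls of that kind, purely as an assumed illustration: shuffling the residues within each sequence (which keeps composition and length but destroys residue order) and permuting which compound is paired with which sequence. All function names are hypothetical and are not taken from the paper.

```python
import random


def shuffle_sequence(sequence, rng):
    """Return the sequence with residues in random order (same composition)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def shuffled_sequence_control(pairs, rng):
    """Keep each compound but replace its sequence with a shuffled version."""
    return [(shuffle_sequence(seq, rng), compound) for seq, compound in pairs]


def permuted_pair_control(pairs, rng):
    """Keep sequences in place but reassign compounds to them at random."""
    sequences = [seq for seq, _ in pairs]
    compounds = [compound for _, compound in pairs]
    rng.shuffle(compounds)
    return list(zip(sequences, compounds))


# Toy data for illustration only: (protein sequence, compound SMILES) pairs.
pairs = [("MKTAYIAK", "CCO"), ("GSHMLEDP", "CCN"), ("MVLSPADK", "c1ccccc1")]
rng = random.Random(0)  # fixed seed so a control run is repeatable
sequence_control = shuffled_sequence_control(pairs, rng)
pair_control = permuted_pair_control(pairs, rng)
```

If a model trained on such control data reproduces test compounds about as well as one trained on the original pairs, that would suggest the sequence input carries little learned specific information, which is the kind of conclusion the abstract reports.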

Transformer model. Shown is a schematic representation of an encoder‐decoder transformer trained for protein sequence‐based compound design. © 2026 WILEY‐VCH GmbH

## Full-text entities

- **Genes:** GPR160 (G protein-coupled receptor 160) [NCBI Gene 26996] {aka GPCR1, GPCR150}, VN1R17P (vomeronasal 1 receptor 17 pseudogene) [NCBI Gene 441931] {aka GPCR}
- **Chemicals:** MT (-), ATP (MESH:D000255)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12782052/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12782052/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/PMC12782052/full.md

---
Source: https://tomesphere.com/paper/PMC12782052