# A novel transformer-based platform for the prediction and design of biosynthetic gene clusters for (un)natural products

**Authors:** Tomoki Kawano, Taro Shiraishi, Tomohisa Kuzuyama, Maiko Umemura, Boyang Ji

PMC · DOI: 10.1371/journal.pcbi.1013181 · PLOS Computational Biology · 2026-02-23

## TL;DR

This paper introduces a transformer-based AI model that predicts and designs biosynthetic gene clusters, enabling the discovery of new natural and unnatural bioactive compounds.

## Contribution

A novel transformer-based framework for modeling and predicting biosynthetic gene clusters using language-like modeling of protein domains.

## Key findings

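- On 2,492 experimentally validated BGCs from the MIBiG database, more than 50% of true domains were ranked first by the model and over 75% fell within the top 10 candidates.
- The genome-trained model predicted domains absent from both the original cyclooctatin BGC and the BGC-trained model's prediction; heterologous expression of one of them in Streptomyces albus, together with the cyclooctatin biosynthetic genes, yielded an unknown cyclooctatin derivative.
- Classification accuracies exceeded 70% for major compound classes, including polyketides (PKs) and terpenes.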

## Abstract

Biosynthetic gene clusters (BGCs), comprising sets of functionally related genes responsible for synthesizing complex natural products, are a rich source of bioactive compounds with pharmaceutical potential. Here, we present a transformer-based framework that models functional domains as linguistic units to capture and predict their positional relationships within genomes. Using a RoBERTa architecture, we trained models on four progressively broader datasets: bacterial BGCs, Actinomycetes genomes, bacterial genomes, and bacterial plus fungal genomes. Evaluation using 2,492 experimentally-validated BGCs from the MIBiG database showed that more than 50% of true domains were ranked first and over 75% within the top 10 candidates. Our models also achieved classification accuracies exceeding 70% for major compound classes including polyketides (PKs) and terpenes. To explore model-guided BGC design, we compared predictions from the BGC-trained and genome-trained models using the BGC for the bacterial diterpenoid cyclooctatin as a case study. The genome-trained model uniquely predicted several domains absent from both the original BGC and the prediction by the BGC-trained model. Heterologous expression of one of those predicted domains in Streptomyces albus, together with the biosynthetic genes for cyclooctatin, yielded an unknown cyclooctatin derivative. This framework not only provides a novel BGC prediction method using machine learning but also facilitates rational design of artificial BGCs. Future integration of transcriptomic, protein structural, and phylogenetic data will enhance the models’ predictive and generative capabilities, supporting accelerated discovery and engineering of natural products.
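The "ranked first" and "within the top 10" figures in the abstract are top-k ranking metrics. The sketch below shows one way such a metric could be computed. It is not the authors' evaluation code: the domain names (`dom_A`, `dom_B`, `dom_C`), the scores, and the helper functions `rank_of_true` and `top_k_accuracy` are hypothetical, and ties are broken alphabetically as an arbitrary assumption.

```python
from typing import Dict, List, Tuple


def rank_of_true(scores: Dict[str, float], true_domain: str) -> int:
    """Return the 1-based rank of true_domain among candidates sorted by descending score.

    Ties are broken alphabetically (an assumption made for determinism).
    """
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered.index(true_domain) + 1


def top_k_accuracy(examples: List[Tuple[Dict[str, float], str]], k: int) -> float:
    """Fraction of examples whose true domain is ranked within the top k."""
    ranks = [rank_of_true(scores, truth) for scores, truth in examples]
    return sum(r <= k for r in ranks) / len(ranks)


# Hypothetical model scores for three masked positions, each paired with the true domain.
examples = [
    ({"dom_A": 0.7, "dom_B": 0.2, "dom_C": 0.1}, "dom_A"),  # true domain ranked 1st
    ({"dom_A": 0.5, "dom_B": 0.3, "dom_C": 0.2}, "dom_C"),  # true domain ranked 3rd
    ({"dom_A": 0.1, "dom_B": 0.6, "dom_C": 0.3}, "dom_C"),  # true domain ranked 2nd
]

top1 = top_k_accuracy(examples, k=1)
top2 = top_k_accuracy(examples, k=2)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```

With a real model, each score dictionary would cover the full domain vocabulary and `k=10` would correspond to the top-10 figure reported above.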

BGCs encode diverse natural products, including antibiotics and anticancer agents. Identifying and designing BGCs in microbial genomes is crucial for discovering new bioactive compounds. In this study, we developed a transformer-based deep learning model that treats protein domains as language-like tokens and learns how they are arranged in genomes. By training on both known BGCs and whole genomes, the model successfully predicts biologically plausible combinations of domains, including those absent in known BGCs. We experimentally validated one such prediction by expressing a newly identified gene alongside known cyclooctatin biosynthetic genes, confirming the production of an unknown cyclooctatin derivative. Our results demonstrate how language models can uncover hidden biosynthetic potential and offer a promising new AI tool for natural product discovery and synthetic biology.
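To make the idea of domains as language-like tokens concrete, here is a minimal sketch of how masked prediction examples could be built from an ordered list of domains. It rests on assumptions: this summary does not describe the paper's tokenization or masking scheme, so the `<mask>` string, the one-position-at-a-time masking, the `masked_examples` helper, and the `dom_*` names are all illustrative.

```python
from typing import List, Tuple

MASK = "<mask>"  # RoBERTa-style mask string; the paper's actual vocabulary is not given here


def masked_examples(domains: List[str]) -> List[Tuple[List[str], int, str]]:
    """Build one example per position: (sequence with that position masked, position, true token).

    The input list is not modified.
    """
    examples = []
    for i, true_token in enumerate(domains):
        masked = domains[:i] + [MASK] + domains[i + 1:]
        examples.append((masked, i, true_token))
    return examples


# Hypothetical cluster: domains listed in their genomic order.
cluster = ["dom_A", "dom_B", "dom_C", "dom_D"]

for seq, pos, truth in masked_examples(cluster):
    print(pos, truth, " ".join(seq))
```

A trained masked language model would score every vocabulary entry for the masked slot. Those scores are what a ranking evaluation, or a design step that proposes candidate domains for a cluster, would consume.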

## Linked entities

- **Species:** Streptomyces albus (taxon 1888)

## Full-text entities

- **Diseases:** Type I PKS (MESH:D006969)
- **Chemicals:** formate (MESH:C030544), tacrolimus (MESH:D016559), penicillin (MESH:D010406), cyclosporine (MESH:D016572), GGDP (MESH:C002963), C20 diterpene (-), Cyclooctatin (MESH:C078947), ethyl acetate (MESH:C007650), methanol (MESH:D000432), Cyanobactins (MESH:C000627612), FAD (MESH:D005182), acetonitrile (MESH:C032159), saccharide (MESH:D002241), acetone (MESH:D000096), resorcinol (MESH:C031389), paclitaxel (MESH:D017239), polyethylene glycol (MESH:D011092), terpene (MESH:D013729), H2O (MESH:D014867), diterpenoid (MESH:D004224), PKs (MESH:D061065), H (MESH:D006859), alkaloid (MESH:D000470)
- **Species:** Streptomyces sp. (species) [taxon 1931], Homo sapiens (human, species) [taxon 9606], Anabaena sp. (species) [taxon 1167], Streptomyces albus (species) [taxon 1888], Streptomyces melanosporofaciens (species) [taxon 67327], Spirulina (suborder) [taxon 551299]
- **Mutations:** X500R

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12956122/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12956122/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/PMC12956122/full.md

---
Source: https://tomesphere.com/paper/PMC12956122