# The DNA dialect: a comprehensive guide to pretrained genomic language models

**Authors:** Marcell Veiner, Fran Supek

PMC · DOI: 10.1038/s44320-025-00184-4 · Molecular Systems Biology · 2026-01-19

## TL;DR

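This review traces the rapid evolution of genomic language models (gLMs), examines their components from training data curation to architecture, and surveys their benchmarks, applications, and unresolved pitfalls in genomics.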

## Contribution

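It provides a component-by-component guide to gLMs, analyzes current trends such as increasing model complexity, reviews major benchmarking efforts, and discusses the requirements for making gLMs practically useful for genomic research.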

## Key findings

- gLMs have started appearing in genomics in large numbers, and the present trend is toward increasing model complexity.
- Major benchmarking efforts suggest that no single model dominates; task-specific design and pretraining data often outweigh general model scale or architecture.
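- Applications ranging from genome annotation to DNA sequence generation showcase the potential of gLMs, but their use also highlights gaps and pitfalls that remain unresolved.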

## Abstract

Following their success in natural language processing and protein biology, pretrained large language models have started appearing in genomics in large numbers. These genomic language models (gLMs), trained on diverse DNA and RNA sequences, promise improved performance on a variety of downstream prediction and understanding tasks. In this review, we trace the rapid evolution of gLMs, analyze current trends, and offer an overview of their application in genomic research. We investigate each gLM component in detail, from training data curation to the architecture, and highlight the present trends of increasing model complexity. We review major benchmarking efforts, suggesting that no single model dominates, and that task-specific design and pretraining data often outweigh general model scale or architecture. In addition, we discuss requirements for making gLMs practically useful for genomic research. While several applications, ranging from genome annotation to DNA sequence generation, showcase the potential of gLMs, their use highlights gaps and pitfalls that remain unresolved. This guide aims to equip researchers with a grounded understanding of gLM capabilities, limitations, and best practices for their effective use in genomics.

This Review provides a guide to help researchers gain a clear understanding of the capabilities, limitations, and best practices of genomic language models for their effective use in genomics.

## Full-text entities

- **Genes:** ELAVL1 (ELAV like RNA binding protein 1) [NCBI Gene 1994] {aka ELAV1, HUR, Hua, MelG}, PTEN (phosphatase and tensin homolog) [NCBI Gene 5728] {aka 10q23del, BZS, CWS1, DEC, GLM2, MHAM}, CST9 (cystatin 9) [NCBI Gene 128822] {aka CLM, CTES7A}, SNAP91 (synaptosome associated protein 91) [NCBI Gene 9892] {aka AP180, CALM}, F3 (coagulation factor III, tissue factor) [NCBI Gene 2152] {aka CD142, TF, TFA}, SRSF1 (serine and arginine rich splicing factor 1) [NCBI Gene 6426] {aka ASF, NEDFBA, SF2, SF2p33, SFRS1, SRp30a}, PLXNB1 (plexin B1) [NCBI Gene 5364] {aka PLEXIN-B1, PLXN5, SEP}, NINL (ninein like) [NCBI Gene 22981] {aka NLP}
- **Diseases:** LLM (MESH:D007806), CL (MESH:D007859), GLM (MESH:D005910), MLM (MESH:D059468)
- **Chemicals:** m6A (MESH:C005955), AlphaGenome (-)
- **Species:** Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049], Bacteriophage sp. (species) [taxon 38018], Escherichia coli (E. coli, species) [taxon 562], Mus musculus (house mouse, species) [taxon 10090], Saccharomyces cerevisiae (baker's yeast, species) [taxon 4932], Homo sapiens (human, species) [taxon 9606]
- **Mutations:** T2T
- **Cell lines:** SK-N-SH — Homo sapiens (Human), Neuroblastoma, Cancer cell line (CVCL_0531), K562 — Homo sapiens (Human), Blast phase chronic myelogenous leukemia, BCR-ABL1 positive, Cancer cell line (CVCL_0004), HepG2 — Homo sapiens (Human), Hepatoblastoma, Cancer cell line (CVCL_0027)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12953581/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12953581/full.md

## References

13 references — full list in the complete paper: https://tomesphere.com/paper/PMC12953581/full.md

---
Source: https://tomesphere.com/paper/PMC12953581