Personalized gene expression prediction in the era of deep learning: a review

Viksar Dubey; Li Shen

PMC · DOI:10.1093/bib/bbag022·January 30, 2026

Personalized gene expression prediction in the era of deep learning: a review

Viksar Dubey, Li Shen

PDF

Open Access

TL;DR

This paper reviews how deep learning models can predict gene expression from genomic sequences, but highlights challenges in adapting them to personal genomic data.

Contribution

The paper provides a comparative analysis of deep learning and linear models for personalized gene expression prediction, emphasizing current limitations and novel fine-tuning strategies.

Findings

01

Deep learning models trained on reference genomes struggle with personal genomic data.

02

Linear models often outperform deep learning models in cross-individual gene expression prediction.

03

Fine-tuning strategies and genomic language models are emerging solutions to improve personalization.

Abstract

Predicting gene expression from genomic sequences is a central goal in computational genomics. Recent advances have demonstrated that deep learning models trained on large-scale epigenomic datasets hold significant promise for this task. However, their success heavily depends on how they are applied: most models are trained exclusively on a reference genome, limiting their ability to capture individual-specific genetic variation. Consequently, while these models perform well on reference genomes, they often struggle when applied to personal genomic data. This review discusses recent efforts to overcome these limitations and explores methods aimed at improving the prediction of personalized gene expression. In particular, we compare the performance of deep learning models with traditional expression quantitative trait loci-based linear approaches, examining novel fine-tuning strategies,…

Figures1

Click any figure to enlarge with its caption.

A generic deep learning-based S2F model with an encoder-decoder architecture. The model uses DNA sequences as input and can predict transcriptomic coverage at high resolution as output. The encoder consists of stacked convolution layers followed by stacked transformer layers. A convolution layer is illustrated as series of triangles that represent convolutional filters and squares that represent the activations. A transformer layer is illustrated as overlapping vectors of query (Q), key (K) and value (V) at each position, followed by a feed forward layer to produce the embeddings for the sequence. The output of the final transformer layer is a bottleneck layer that represents the final embeddings for the sequence. In some designs, the final embeddings are directly used to predict the sequence’s functional outcomes. While in others, the bottleneck layer is followed by up-sampling to increase the resolution and 1 x 1 convolution to reduce the number of channels before output

Tables4

Table 1. Summary of S2F models that use genome sequence as input to predict functional outcomes. All models were evaluated across genomic regions

Model	Year	Modeling approach	Context window	Resolution	Performance highlights
DeepSEA [19]	2015	Three-layer CNN	1 kb	1 kb	Median AUC 0.86–0.96 for various chromatin features
ExPecto [20]	2018	Two-stage model: CNN + exponential basis linear model	40 kb around TSS	200 bp	0.82 Spearman correlation on tissue-specific expression prediction
Basenji2 [21]	2020	Deep residual CNN (dilated conv layers)	131,072 bp	128 bp	Pearson \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $ R = 0.81 $\end{document} for gene expression (CAGE signal at TSS)
Xpresso [22]	2020	Shallow CNN on promoter sequence (+ mRNA stability features)	10.5-kb promoter region	Gene-level output	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $ R^{2} = 0.59 $\end{document} (human) and 0.71 (mouse)
Enformer [23]	2021	Hybrid CNN–Transformer	196 kb	128 bp	Pearson \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $ R = 0.85 $\end{document} for gene expression (CAGE across genes)
Borzoi [24]	2025	CNN + Transformer (Enformer-based architecture with upsampling for fine resolution)	524 kb	32 bp	Pearson \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $ R = 0.87 $\end{document} for gene-level expression
AlphaGenome [26]	2025	Deep CNN + Transformers (Borzoi-like architecture with sequence parallelization)	1 Mb	1 bp	State-of-the-art performance (e.g. 17.4% improvement in gene expression prediction versus Borzoi)

Table 2. Spearman rank correlations for different gene expression prediction tasks on the Geuvadis consortium dataset

Task	Input	Target	Enformer	Basenji2	ExPecto	Xpresso	PrediXcan
Reference correlation	Reference sequence	Median expression	0.57	0.52	0.53	0.33	–
Across genes	Individual sequence	Individual expression	0.55	0.51	0.52	0.32	–
Across individuals	Individual sequence	Individual expression	0.045	0.037	0.034	0.007	0.25

Table 3. List of datasets with paired whole genome sequencing and gene expression data

Dataset name	Tissue/Cell type	N donors
GTEx [14]	54 tissue types such as uterus, ovary, adipose, stomach, and lung	838
ROSMAP [29]	Post-mortem brain tissues, with the major region being the dorsolateral prefrontal cortex (DLPFC)	1894 for DLPFC
Geuvadis [28]	Lymphoblastoid cell lines	462

Table 4. Overview of the gLMs used for personal gene expression prediction

Model name	Sequence corpus	Context	Pretraining	Architecture
Nucleotide Transformer [36]	Human reference, 3200 diverse human genomes, and 850 genomes of various species	6000 bp	MLM	Transformer
Nucleotide Transformer v2 [36]	Same as NT	12 kb	MLM	Transformer
DNABERT-2 [35]	Human reference and 135 species: 32.5 billion nucleotides total	10 kb	MLM	Transformer
UKBioBert [52]	Synthetic sequences created by inserting 13 million SNPs from UKBioBank into human reference	100 kb	MLM	Transformer
Caduceus [45]	Human reference genome	1 Mb	AR	SSM
Evo 2 [37]	OpenGenome2: a database of 8.84 trillion nucleotides of curated DNA from diverse species	1 Mb	AR	SSM

Equations5

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& \hat{y}=X \beta+\epsilon\end{align*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& \min_{\beta} { \frac{1}{2} ||X \beta - y||_{2}^ 2 + \alpha \rho ||\beta||_{1} + \frac{\alpha(1-\rho)}{2} ||\beta||_{2}^ 2}\end{align*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& \hat{y} = f(g(X))\end{align*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*} & \mathbf{h}_{t+1} = \mathbf{A} \mathbf{h}_{t} + \mathbf{B} x_{t}, \quad y_{t} = \mathbf{C} \mathbf{h}_{t}, \end{align*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*} & y_{t} = \sum_{k=0}^{t} K_{k} \, x_{t-k}, \quad \mathrm{where}\ K_{k} = \mathbf{C} \mathbf{A}^{k} \mathbf{B}. \end{align*}\end{document}

Keywords

gene expressiondeep learningpersonalized genomicsgenetic variantsgenome AI

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGene expression and cancer classification · Machine Learning in Bioinformatics · Genomics and Chromatin Dynamics

Full text

Introduction

Predicting gene expression has become a central objective in computational genomics over the past decade [1]. Accurate prediction and analysis of gene expression enable researchers to better understand the role of regulatory variants in gene function and activation, as well as the broader relationship between genotype and phenotype [2]. By linking genetic variants to quantitative changes in gene expression, researchers can uncover how mutations contribute to disease mechanisms in relevant cell types and facilitate the development of personalized medical treatments tailored to an individual’s genetic profile.

With growing interest in gene expression prediction, machine learning (ML) has emerged as the leading framework for modeling gene regulatory activity. Traditional ML approaches—such as regularized linear regressions—have been widely used to “impute” gene expression from predefined genetic variants [3–7]. These models, however, are limited by their relatively simple architectures, which constrain their ability to generalize to unseen variants and to capture the complex, nonlinear cis-regulatory relationships. They are typically trained using expression quantitative trait loci (eQTLs)—including single nucleotide variants (SNVs) and small insertions or deletions (indels)—that are statistically associated with variation in gene expression.

In recent years, deep learning has emerged as a promising alternative to addressing the limitations of classical ML models [8]. Deep learning models can be trained directly on raw genomic sequences and can make predictions from arbitrary input sequences, allowing them to generalize to previously unseen variants. Their ability to capture complex regulatory relationships is supported by their success in other fields, such as computer vision [9] and natural language processing [10]. While deep learning models excel at predicting gene expression across genes within an individual, recent evaluations have shown that linear models often outperform deep learning approaches in predicting gene expression across individuals [11, 12]. This counterintuitive result may stem from the robustness of linear models in capturing the genetic basis of expression variation under noisy data conditions, despite their simpler structures.

The goal of this review is to provide an overview of ML approaches developed for personal-level sequence-to-expression prediction, with a particular focus on recent advances in deep learning. We will highlight strategies that have been employed to address the current limitations of deep learning models and explore the emergence of genomic language models and their potential applications to this problem.

Classic machine learning methods

Classic ML methods, such as linear models [3–7] and random forest [13], have long been used to predict gene expression in the field of eQTL. These models are trained on SNVs and small indels, rather than raw genomic sequences, for the prediction of gene expression. The genetic variants that are associated with expression are known as eQTLs. One of the first, and still widely used, linear models is PrediXcan [3]. Although nonlinear models such as random forest [13] have been used for eQTL analysis, we find linear models to be much more dominant in the field due to their robust performance and easy interpretation. Therefore, this review will focus on linear models and their effectiveness compared to deep learning models.

To use classic ML methods, the features need to be constructed from predefined genetic variants; thereby, the input to a model is $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ X=[X_{1},X_{2},...X_{m}] $\end{document}$ where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ X_{i} $\end{document}$ represents a SNV or indel that is stored in a variant call format file. When high-coverage whole genome sequencing is used for genotyping, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ X_{i}\in {0,1} $\end{document}$ if the genotype is phased or $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ X_{i}\in {0,1,2} $\end{document}$ if the genotype is unphased. Otherwise, the genotype can be imputed to improve the quality of the data, leading to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ X_{i}\in [0,1] $\end{document}$ . Using genotypes and paired tissue-level expression data from individuals found in a reference transcriptomic database like GTEx [14], PrediXcan builds elastic net models for each gene based on the following formula:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& \hat{y}=X \beta+\epsilon\end{align*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \hat{y} $\end{document}$ is the estimated expression of a given gene, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \beta $\end{document}$ is the learned weight vector that represents how each SNV/indel affects gene expression, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \epsilon $\end{document}$ is the error term. For brevity, the bias term is omitted in this review. The elastic net model fitting aims to solve the following optimization objective:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& \min_{\beta} { \frac{1}{2} ||X \beta - y||_{2}^ 2 + \alpha \rho ||\beta||_{1} + \frac{\alpha(1-\rho)}{2} ||\beta||_{2}^ 2}\end{align*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ represents the penalty strength and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \rho $\end{document}$ represents the balance between $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ L1 $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ L2 $\end{document}$ terms. While $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ L1 $\end{document}$ offers desirable feature sparsity, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ L2 $\end{document}$ provides stability in learning $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \beta $\end{document}$ when the input is perturbed.

The PrediXcan program produces a collection of per-gene elastic net models after training. These models typically focus on cis-eQTLs located near the transcriptional start sites (TSSs) ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ <1 $\end{document}$ Mb) of the genes they regulate—rather than trans-eQTLs, which act across chromosomes or from distant regions ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ >1 $\end{document}$ Mb). Although trans-eQTLs have been found to have effects on gene expression, the effect size is often much smaller than that of cis-eQTLs [3]. The trained PrediXcan models can be applied to genome-wide association study (GWAS) data, which may only include genotypes but not the respective gene expression values. The models then impute the gene expression based on the GWAS data. Finally, the imputed gene expressions are tested for association with the phenotypes of interest [3].

The idea of imputing gene expression using an additive model of genetic variants has been explored extensively in the field of eQTL analysis. Later works include: incorporating GWAS summary statistics as input [4]; extending the statistical framework to predict cross-tissue expression [5]; using Bayes methods to allow multiple tissues to share variance structure [6]; and exploring various penalty terms to improve accuracy [7]. However, PrediXcan remains one of the most popular tools.

A main differentiating factor between linear and deep learning models is how the genome sequences are represented as the model’s input. In a linear model, each feature represents a predefined SNV or indel that is found in the training data. Identifying these genetic variants often involves an elaborate process of variant calling, quality assessment, and filtering. This binary-like ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ X_{i}\in {0,1} $\end{document}$ or $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ [0,1] $\end{document}$ ) vector representation of the genome sequence restricts its ability to generalize to previously unseen genetic variants, such as rare variants or variants derived from another cohort with a different genetic background.

Another limitation of linear models is the inability to incorporate nonlinear relationships between genotype and phenotype. If $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ X_{i} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ X_{j} $\end{document}$ are two variants that are located in two interacting DNA regulatory elements, there might be a synergistic effect of the pair $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {X_{i},X_{j}} $\end{document}$ on gene expression. If the transcriptional machinery involves many regulatory elements in a concerted fashion, a linear model will be incapable to learn such complicated relationships.

Deep learning-based sequence-to-function models

Recently entering the scope of computational genomics and gene expression prediction are deep learning-based sequence-to-function (S2F) models. Deep learning models can automatically learn hierarchical features directly from the data, enabling them to accept raw genome sequences as inputs, and thereby substantially increasing the flexibility of input representation. The availability of large-scale epigenomic and transcriptomic databases, such as ENCODE [15], Roadmap Epigenomics [16], 4DN [17], and GTEx [14], has enabled the development of genome AI models to predict a myriad of functional outcomes, including histone modifications, transcription factor binding, DNA accessibility, and gene expression across various tissues and cell types.

Introduction of S2F models

Before introducing S2F models, it is essential to outline how raw genomic information is formatted for computational analysis. Deep learning architectures typically require numerical inputs; and therefore, genomic sequences must be converted into standardized representations. A common approach is one-hot encoding, in which a sequence of length $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ s $\end{document}$ is transformed into a binary matrix $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ X \in {0,1}^{s \times 4} $\end{document}$ . Ambiguous IUPAC nucleotide symbols (https://iupac.org/) can also be incorporated using probabilistic encodings, such as representing $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ N $\end{document}$ as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ [0.25, 0.25, 0.25, 0.25] $\end{document}$ . An S2F model can be described as the sequential application of a genome sequence encoder followed by a decoder:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& \hat{y} = f(g(X))\end{align*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ g $\end{document}$ is a sequence encoder which produces a representation (i.e. embedding) of the sequence; $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ f $\end{document}$ is a decoder that uses the embedding to predict the various functional outcomes for different tissues and cell types; $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \hat{y}\in \mathbb{R}^{t\times d} $\end{document}$ represents the output data tracks across $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ t $\end{document}$ bins and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ d $\end{document}$ molecular phenotypes. The decoder $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ f $\end{document}$ can be a simple linear layer or a sequence of stacked up-sampling and 1 × 1 convolution or up-convolution layers. Often times, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ t < s, $\end{document}$ which indicates the reduction in resolution by a deep learning model thereby each bin represents the average value of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ s/t $\end{document}$ base-pairs. Depending on the design of decoder $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ f $\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ t $\end{document}$ can also be the same as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ s $\end{document}$ , which means no loss in resolution nevertheless incurs more computation in the decoder. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ d $\end{document}$ is typically in the order of thousands to represent different epigenomic events and gene expression across hundreds of tissues and cell types. The function $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ g $\end{document}$ can be a neural network with a sophisticated architecture to reflect the complicated nature of transcriptional regulation. For example, 1D convolution layers are used to detect transcription factor binding motifs and their combinations; transformer layers [18] are used to detect long-range interactions among DNA regulatory elements. Figure 1 illustrates the idea of an S2F model. In principle, these models can accept arbitrary genomic sequences as input, with the maximum feasible sequence length $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ s $\end{document}$ limited only by available computational resources. Because the training and evaluation sequences need not overlap, the models are capable of processing inputs that contain novel genetic variants not observed during training.

A generic deep learning-based S2F model with an encoder-decoder architecture. The model uses DNA sequences as input and can predict transcriptomic coverage at high resolution as output. The encoder consists of stacked convolution layers followed by stacked transformer layers. A convolution layer is illustrated as series of triangles that represent convolutional filters and squares that represent the activations. A transformer layer is illustrated as overlapping vectors of query (Q), key (K) and value (V) at each position, followed by a feed forward layer to produce the embeddings for the sequence. The output of the final transformer layer is a bottleneck layer that represents the final embeddings for the sequence. In some designs, the final embeddings are directly used to predict the sequence’s functional outcomes. While in others, the bottleneck layer is followed by up-sampling to increase the resolution and 1 x 1 convolution to reduce the number of channels before output

Convolution and transformer layers are the key neural network elements that compose an S2F model for sequence-to-expression prediction. A convolution or transformer layer is controlled by different parameters and can be enhanced by techniques such as skip connection. These design choices can greatly affect a model’s performance and computational cost. Here are some of the most important factors in model design:

Convolution: A convolutional layer applies a set of learnable filters across local regions of the input to extract informative features. Because the same filter is applied at every position, convolutional layers efficiently capture local patterns while providing a degree of translational invariance. In most genomics applications, kernel size and number of filters are the two most influential design parameters. A larger kernel captures a broader spatial context (i.e. expands the receptive field), but it increases the number of weights and the computational cost linearly. Each filter acts as a specialized feature detector, so increasing the number of filters generally improves the model’s capacity to learn diverse patterns; however, doing so also raises the computational cost and can increase the risk of overfitting. Dilation: Dilation introduces gaps between the elements of a convolutional kernel, expanding its effective receptive field without increasing computational cost. However, large dilation rates can cause the model to miss fine-scale details and may introduce “gridding” artifacts. Transformer: A transformer layer is built around self-attention, a mechanism that allows each position in a sequence to weight and integrate information from all other positions. This enables transformers to model long-range dependencies more effectively than convolutional or recurrent architectures. The most influential parameters include the embedding dimension, the number of attention heads, and the feed-forward network (FFN) width. Increasing the embedding dimension or FFN width expands the model’s representational capacity, but it also increases computational cost and memory usage, with the self-attention operation scaling quadratically as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ O(L^{2}) $\end{document}$ . The number of attention heads controls how many distinct relation patterns the model can attend to in parallel; while more heads can enrich feature learning but add parameters and runtime. Up-sampling and 1 x 1 convolution: Up-sampling layers are used in the decoder to progressively restore spatial or sequential resolution after the encoder compresses it through pooling or strided convolutions. Up-sampling can be performed in several ways—including interpolation, up-convolution (i.e. transposed convolution), and the simple repetition of values—to increase resolution before refining the features with subsequent convolutions. A “1 $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \times $\end{document}$ 1” convolution is a conventional term originating from 2D image processing; in genomics applications, the analogous operation is a 1D convolution with a kernel size of 1, which mixes information across channels without changing the sequence length. This operation is commonly applied when merging up-sampled decoder features with encoder skip connections, allowing for flexible channel adjustment and efficient feature fusion. Together, up-sampling and 1D “1 $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \times $\end{document}$ 1” convolutions enable the network to recover fine-grained structure while maintaining computational efficiency. Skip connection: Skip connections directly pass features from earlier layers to later layers, bypassing intermediate operations. In encoder–decoder architectures, they connect encoder feature maps to corresponding decoder layers at matching resolutions. This provides the decoder with high-resolution, fine-grained information that would otherwise be lost through pooling or downsampling, improving localization and detail recovery. More generally, skip connections help gradients flow more easily during training, mitigating vanishing gradient issues and enabling deeper networks to learn effectively.

Figure 1 illustrates an encoder–decoder architecture used for an S2F model. Although higher resolution and longer context windows are desirable for capturing fine-grained regulatory signals, they substantially increase computational demands and model complexity, which can in turn raise the risk of overfitting. Designing an S2F model for sequence-to-expression prediction therefore becomes a careful balance between efficiency and representational power. Choosing appropriate convolutional and transformer parameters—such as kernel size, number of filters, embedding dimension, and attention heads—plays a central role in navigating this tradeoff and tailoring the model to the available data and computational budget.

Overview of selected S2F models

DeepSEA [19] is a pioneering model that first demonstrated the ability of deep learning to predict an array of functional genomic annotations, including transcription factor binding, DNase I hypersensitivity, and histone modifications, solely from a DNA sequence. Utilizing a three-layer convolutional neural network (CNN) trained on 1000 bp inputs, DeepSEA established the S2F prediction paradigm. Importantly, while revolutionary, this initial architecture was a small-scale network by modern standards, outputting only one prediction value per chromatin feature for the entire 1000 bp region, thus lacking explicit spatial information within the input. Despite this limitation, DeepSEA served as a crucial proof-of-concept, laying the foundation for quantifying the impact of noncoding variants on chromatin features and launching the field of S2F prediction.

Subsequent S2F models introduced architectural and design variations to manage the inherent trade-offs among output feature richness, resolution, context length, and computational efficiency. For instance, ExPecto [20] built upon the DeepSEA foundation by significantly enhancing the network’s capacity: it doubled the number of convolutional layers, increased the number of predicted tissues and cell types, and expanded the direct input context window to 2000 bp. Critically, ExPecto introduced a spatial feature integration layer that aggregates predictions within a 40-kb TSS-centered context to ultimately predict gene expression in specific tissues. While these improvements substantially enhanced model capability, particularly for expression prediction, the approach was still constrained by a relatively short effective context window (40 kb) for gene expression and a medium resolution (200 bp) for chromatin features.

Basenji2 [21] expanded the input context dramatically to 131 kb, using deep stacks of dilated convolutions to enlarge the model’s receptive field and capture long-range regulatory interactions across genomic regions. This broader context substantially improved predictive performance, with Basenji2 achieving a Pearson correlation of 0.81 for gene expression prediction. However, the expanded architecture also increased the number of trainable parameters, resulting in a larger memory footprint and greater computational cost. Moreover, although dilated convolutions extend the range of dependencies that can be modeled, they inherently specialize in detecting local patterns and lack explicit global integration mechanisms. As a result, Basenji2 is still limited in its ability to capture ultra-long-range chromatin interactions and genome-wide regulatory context.

Xpresso [22] is a lightweight sequence-to-expression model designed to predict steady-state mRNA abundance directly from a promoter sequence alone. It combines a shallow 2-layer CNN with hand-engineered features reflecting mRNA stability—such as GC content, UTR length, and coding exon density—to capture both transcriptional and post-transcriptional determinants of gene expression. By integrating these biologically informed features in a 10.5-kb window with a compact CNN architecture, Xpresso achieved strong performance with far fewer parameters than deep regulatory models, explaining up to 60%–70% of the variance in mRNA levels across human and mouse genes. Its simplicity also makes it computationally efficient and easily interpretable, enabling insights into promoter architecture and sequence determinants of expression. However, Xpresso’s reliance on the promoter sequence and a fixed set of stability features limits its ability to model the rich landscape of distal regulation, chromatin state, enhancer–promoter interactions, and context-specific regulatory logic. As a result, while it performs well for genes whose expression is largely promoter-driven, it is less effective for genes governed by long-range or chromatin-mediated regulatory mechanisms, highlighting the need for architectures capable of integrating a broader genomic context.

A central challenge that emerges from these models is the difficulty of capturing long-range regulatory interactions that underlie gene expression. While ExPecto made early steps toward addressing this by introducing a spatial integration layer that aggregates predicted chromatin features across a 40-kb window around each TSS, this mechanism remains relatively coarse and limited in its ability to model the complex, nonlinear interplay among distal enhancers, promoters, and chromatin architecture. Basenji2 pushed the field further by dramatically expanding the input context and employing dilated convolutions to propagate information across 131 kb. However, dilated convolutions are fundamentally local operators: although they extend the receptive field, they lack explicit mechanisms for global interaction modeling and struggle to integrate signals across hundreds of kilobases or megabase scales. Together, these limitations highlight a major bottleneck in sequence-to-expression modeling—effectively learning regulatory interactions that span vast genomic distances—and motivate the development of architectures capable of true long-range reasoning, such as transformer-based models.

Enformer [23] represents a major breakthrough in sequence-to-function modeling, introducing a hybrid architecture that combines convolutional layers with transformer blocks to capture long-range regulatory interactions through self-attention. Whereas convolutional models are fundamentally limited by their finite receptive fields, the transformer layer enables Enformer to integrate information across the entire input window and model interactions between distant genomic elements. This directly addresses a key limitation of earlier S2F models, whose top-layer receptive fields remained modest—approximately $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \pm 20 $\end{document}$ kb for ExPecto, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \pm 27.5 $\end{document}$ kb for Basenji2, and only 7 kb upstream and 3.5 kb downstream for Xpresso. In contrast, Enformer achieves a receptive field spanning $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \pm 98.3 $\end{document}$ kb, allowing it to “see” roughly 84% of high-confidence enhancer–gene pairs, compared with just 47% for the previous state-of-the-art Basenji2. This expanded context yields substantial improvements in predictive accuracy, increasing the Pearson correlation for CAGE-based gene expression prediction to 0.85.

Building on Enformer, Borzoi [24] further extends the context window to 524 kb while increasing the spatial resolution to 32 bp using a U-Net–like architecture [25]. Unlike Enformer, which applies linear layers at the bottleneck to generate predictions, Borzoi incorporates a decoder that progressively up-samples feature maps to recover high-resolution chromatin profiles. The recently introduced AlphaGenome [26] adopts a similar encoder–decoder design but expands the receptive field to 1 Mb and achieves single–base-pair resolution. In both Borzoi and AlphaGenome, U-Net–style decoders employ 1 $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \times \ $\end{document}$ 1 convolutions to modulate channel depth during up-sampling and use skip connections to fuse encoder and decoder representations while preserving spatial detail. These advances greatly enhance the complexity of functional outputs and improve predictive accuracy; however, they come with substantial computational costs—e.g. AlphaGenome training required hundreds of TPUv3 chips and dozens of NVIDIA H100 GPUs.

Table 1 summarizes the models discussed above, illustrating the progression from DeepSEA to AlphaGenome as context windows expand and spatial resolution increases. In S2F architectures, model design inevitably involves negotiating trade-offs among four primary factors: the breadth of output features representing tissue- and cell type-specific functional readouts; the context length required to capture long-range regulatory interactions; the spatial resolution necessary for variant-level and mechanistic interpretation; and the computational cost, which escalates rapidly with model capacity.

Despite these architectural advances, all existing deep learning models are trained on the reference genome, using annotations sourced from reference databases such as ENCODE [15] and FANTOM [27]. Consequently, it remains unclear to what extent the regulatory mechanisms learned from this single genomic background can generalize to explain personal variation in gene expression.

Using S2F models for personal gene expression prediction

The high accuracy of S2F models in predicting gene expression and other molecular phenotypes across genomic regions can be attributed to a neural network’s ability to learn the transcriptional regulation language using genome sequence alone. This has raised the question whether they can be used to interpret genetic variations between individuals. A recent study [12] compares four models—ExPecto, Xpresso, Basenji2, and Enformer on the Geuvadis consortium [28] dataset to address this question. The key benchmark statistics of the study are summarized in Table 2. They found that these models achieved relatively high correlation values when tested across genes in a single individual. This can be attributed to their ability to learn generalizable cis-regulatory rules from sequence, such as transcription factor binding sites, which remain largely consistent across genes within an individual genome. However, the effects of individual-specific genetic variants on gene expression remain elusive, even when the same gene is tested. Because these models are not trained on individual-level genetic variation, they learn cis-regulatory rules although do not account for how specific genetic differences in individual genomes impact gene expression. As a result, all deep learning models achieved drastically lower correlation values during cross-individual testing: even the highest performing Enformer achieved a Spearman correlation of only 0.045 on the Geuvadis consortium dataset. While PrediXcan when trained and evaluated on the same dataset can achieve a Spearman correlation of approximately 0.25.

A similar study with the same goal draws the same conclusion on a different dataset [11]. They found that PrediXcan effectively modeled gene expression across individuals, significantly predicting 921 of 13,397 protein-coding genes with a mean Pearson correlation of 0.26 when tested on the ROSMAP [29] dataset, while Enformer only predicted 162 genes and obtained a mean Pearson correlation of only 0.02. This benchmark study also indicates that Enformer may over-rely on short-range DNA motifs, focusing primarily on promoter sequences near the TSS rather than on distant enhancers. Furthermore, Enformer mis-attributed the effects of variants, assigning too much importance to the wrong SNVs, leading to mis-predictions. Finally, many SNVs flagged by Enformer as important did not affect gene expression in real eQTL datasets. These unsupported driver SNVs contributed to Enformer’s lack of effectiveness, and even when Enformer identified potential driver variants, it often mis-predicted the direction of their effect on gene expression. Enformer’s problems with directionality may stem from its limited understanding of cis-regulatory variants. If an enhancer normally activates a gene yet is mutated in a way that disrupts activation, then gene expression should decrease. However, if Enformer ignores the enhancer and instead focuses on a promoter SNV that has little impact on gene expression, it may inaccurately infer that the expression should increase.

Several factors likely contribute to the inability of S2F models to predict personalized gene expression. First, all of the S2F models discussed above are trained exclusively on the reference genome, which increases the risk of overfitting and limits their ability to generalize to individual genomes. Second, personal genomes often contain rare variants and structural variants that are absent from the reference genome, further reducing model accuracy. Third, S2F models still struggle to capture long-range regulatory interactions, even when transformer architectures are used. For example, a recent study [30] reported that Enformer failed to predict the functional effects of enhancers located far from the TSS; its ability to explain expression variance for enhancer perturbations 30 kb away from the TSS was nearly negligible. It remains unclear whether this limitation arises from the model’s difficulty in recognizing regulatory elements, identifying enhancer–promoter pairings, or both.

Fine-tuning S2F models

Due to these shortcomings, researchers have been attempting to finetune S2F models and develop variant-aware sequence-to-expression models to improve cross-individual predictions. Due to the high cost of sample collection and sequencing, datasets with paired whole genome sequencing and RNA-seq are scarce and limited in sample size. Table 3 lists the datasets for the evaluation and finetuning of personal expression prediction models reviewed in this study.

Performer [31] is one of the first studies to finetune the Enformer model on paired WGS and RNA-seq samples for personal gene expression prediction. Focusing on the “whole blood” tissue from the GTEx database [14], Performer selected 300 genes that span a range of cis-heritability and used the individual genome sequences from 670 subjects to train and evaluate Enformer for the tissue-specific gene expression prediction. To improve the model’s ability to capture the variation across individuals, Performer’s loss function includes a component that looks at the pairwise difference between each pair of individuals within a batch. For a pair of individuals $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {i,j} $\end{document}$ : if the difference between observed values is $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ (o_{i}-o_{j}) $\end{document}$ and predicted values is $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ (p_{i}-p_{j}) $\end{document}$ , Performer minimizes the difference between the two thereby $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ [(o_{i}-o_{j})-(p_{i}-p_{j})]^{2} $\end{document}$ for all pairs $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {i,j} $\end{document}$ are included in its loss. While Enformer’s predictions often show little or even negative correlations across the GTEx individuals, the Performer corrects this for most genes, showing performance that are as good as or even better than those of elastic nets for 84% of the genes. They also trained Performer to predict DLPFC expression in the ROSMAP dataset with 742 individuals and made predictions for 205 GTEx individuals with brain cortex expression available, and observed good cross-cohort performance.

The Performer study attempts to train a model for 300 genes jointly to predict personal gene expression, however does not find any improvement against a single-gene model. When the multi-gene Performer model is used on held-out genes whose individual-level expression was not used to train the model, it can explain only the same amount of variability as the untrained Enformer model. These experiments suggest that the transcriptional regulatory mechanisms of cross-gene and cross-individual contexts speak fundamentally different languages, with little overlap between them. The results provide additional evidence to explain Enformer’s poor performance in predicting personal expression in the studies mentioned above [11, 12]. Additional experiments include using finetuned Borzoi for personal gene expression prediction although find it to perform similarly to Performer, indicating little improvement from a different model architecture.

In a separate study [32], researchers draw the same conclusion when they finetune Enformer on a lymphoblastoid cell line gene expression dataset with 421 individuals from the Geuvadis consortium. The finetuned deep learning model performs similarly to a linear model trained on the same data. Additional training strategies are tested. To avoid a phenomenon known as catastrophic forgetting [33], they jointly train Enformer on both the novel personalized expression dataset and its original training dataset. To augment the genetic variation seen by the model beyond the relatively small personalized expression dataset, they use a Massively Parallel Reporter Assay (MPRA) data in addition to the personal expression dataset. However, neither training strategy leads to significant improvement. The Geuvadis data are based on individuals with European ancestry [28]. To further evaluate the models, they are used to make predictions on a population with Yoruban ancestry. Pearson correlations drop by more than 0.1 for all models tested, indicating the difficulty in generalizing predictive models to another group of individuals with a different genetic background.

Genomic language models

Inspired by advances in natural language processing (NLP), recent efforts have developed genomic language models (gLMs) that treat DNA sequences as structured text, applying transformer-based architectures or state space models (SSMs) to learn rich contextualized representations of genomic information [34]. Models such as DNABERT-2 [35], Nucleotide Transformer (NT) [36], and Evo 2 [37] are pretrained on large-scale, unlabeled genomic corpora using self-supervised objectives such as masked language modeling (MLM) or autoregressive (AR) language modeling. These models capture the statistical structure and “syntax” of the genome, including regulatory motifs, sequence composition, and long-range dependencies. Unlike supervised models trained for specific tasks, gLMs produce general-purpose embeddings that can be fine-tuned or directly applied to a broad range of downstream applications, including promoter classification, enhancer activity prediction, transcription factor binding site identification, and non-coding variant effect inference. Their ability to transfer learned representations across tasks, cell types, and even species positions them as foundation models for scalable, data-efficient genome interpretation, marking a significant shift toward universal pretraining in genomics. Creating a gLM involves several key tasks that parallel those in NLP nevertheless require genomic-specific adaptations. They are described in the following.

Corpus construction

The first step in developing a gLM involves constructing a suitable training corpus. This typically entails selecting one or more reference genomes—such as human, mouse, or a multi-species compendium—to provide a broad and representative sampling of genomic content. Depending on the model’s intended application, the corpus may include only specific regions (e.g. promoters or introns) or encompass the entire genome, including coding, non-coding, and repetitive elements. These genomes are then segmented into fixed-length sequence windows (e.g. 1 kb to 1 Mb), which serve as input samples for model training.

Tokenization

Unlike natural language, DNA consists of a small four-letter alphabet (A, C, G, T). While representing individual nucleotides as tokens has been explored, most approaches employ specialized tokenization strategies to capture more expressive and informative representations of sequence content. Three widely used strategies are:

k-mer tokenization: This is the most common approach in gLMs. A DNA sequence is parsed into fixed-length subsequences (k-mers), such as 3-mers or 6-mers. Models like DNABERT [38] use overlapping k-mers (stride = 1), preserving high-resolution positional information and capturing local motifs. In contrast, models like the NT employ non-overlapping k-mers (stride = k) to reduce the number of tokens per input, enabling efficient training over long genomic contexts (e.g. kilobases). The choice of k introduces a key trade-off: larger k captures longer sequence motifs yet exponentially increases the vocabulary size (e.g. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ 4^{6} = 4,096 $\end{document}$ ), which raises memory and computational costs. At the same time, using a larger k reduces the total number of tokens per input sequence, which can improve scalability by lowering the quadratic cost of self-attention. Selecting an optimal k thus balances resolution, vocabulary complexity, and sequence length. Byte pair encoding (BPE) and subword tokenization: Originally developed for NLP, BPE, and other subword-based tokenization schemes have recently been adapted to genomics to create more flexible and data-driven vocabularies. These methods iteratively merge frequent character or k-mer pairs to form longer, biologically meaningful tokens without fixing k in advance. This can result in smaller, more compact vocabularies and greater expressiveness, especially for handling diverse or repetitive sequence patterns. While not as widely used as k-mer tokenization, BPE has been explored in models such as DNABERT-2 and GROVER [39], offering a promising alternative for capturing variable-length sequence motifs. Nucleotide tokenization: It treats each base (A, C, G, T) as an individual token rather than grouping them into k-mers. This approach preserves the full base-level resolution of the genome, however substantially increases sequence length, posing challenges for transformer architectures with quadratic scaling in sequence length. Consequently, nucleotide-level tokenization is more commonly employed in SSMs with linear-time complexity, which can efficiently process very long genomic sequences while retaining nucleotide-level detail.

Pretraining objective

Pretraining gLMs relies on self-supervised learning objectives to extract meaningful representations from unlabeled DNA sequences. Two primary formulations are widely used:

Masked language modeling: MLM is a bidirectional objective, originally introduced in BERT [40], where the model is trained to predict a randomly masked subset of tokens from their surrounding context. Let a DNA sequence be represented as a tokenized input $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathbf{x} = (x_{1}, x_{2}, \dots , x_{T}) $\end{document}$ , and let $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathcal{M} \subset {1, \dots , T} $\end{document}$ denote the indices of tokens that are masked. The objective is to minimize the expected negative log-likelihood over the masked positions:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*} & \mathcal{L}_{\mathrm{MLM}} = \mathbb{E}_{\mathcal{M}} \left[ - \sum_{i \in \mathcal{M}} \log P(x_{i} \mid \mathbf{x}_{\backslash \mathcal{M}}) \right], \end{align*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathbf{x}{\backslash \mathcal{M}} $\end{document}$ denotes the input sequence with the masked tokens replaced (e.g. with a [MASK] token). This formulation enables the model to learn from both upstream and downstream context, making it well-suited for genomic sequences where regulatory signals are often non-directional. MLM is used in models such as DNABERT, DNABERT-2, and NT. One limitation of MLM is the pretrain–finetune mismatch, since masked tokens may not appear during downstream inference.Autoregressive language modeling: In contrast, AR models are trained to predict the next token in a sequence using only the left context. For a sequence $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathbf{x} = (x{1}, x_{2}, \dots , x_{T}) $\end{document}$ , the objective is to minimize the negative log-likelihood of each token conditioned on its preceding tokens:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*} & \mathcal{L}_{\mathrm{AR}} = - \sum_{t=1}^{T} \log P(x_{t} \mid x_{1}, x_{2}, \dots, x_{t-1}). \end{align*}\end{document}

This unidirectional formulation is used in models inspired by GPTs [41–43] and is especially suitable for generative tasks, such as genomic sequence synthesis. AR models avoid the use of artificial masking tokens and preserve the full input distribution during training. Recent gLMs such as Evo 2 and DNAGPT [44] employ AR objectives to learn from long, contiguous DNA sequences.Although AR models naturally lack bidirectional context, recent work has begun to address this limitation. For example, Caduceus [45] introduces a novel bidirectional technique by applying a model to both a sequence and its flipped version and then join information from the forward and backward copies. These innovations blur the line between traditional AR and MLM approaches, enabling richer contextual learning without sacrificing generative capability. While AR models may still be less intuitive for tasks requiring fully bidirectional context, they show strong promise for modeling long-range dependencies and supporting both discriminative and generative applications in genome-scale settings.

Model architecture

Earlier works used recurrent neural networks (RNNs) [46] and CNNs [47] to capture local and short-range dependencies in genomic sequences. Most gLMs today adopt transformer-based architectures because of their ability to model long-range interactions through self-attention [18]. However, transformers scale quadratically with sequence length ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathcal{O}(L^{2}) $\end{document}$ ), making them computationally prohibitive for modeling megabase-scale genomic contexts. Recent developments [48] have introduced SSMs as an efficient alternative for modeling long genomic sequences when scalability is essential.

SSMs replace pairwise attention with linear recurrent updates governed by continuous or discrete dynamical systems. In a simplified discrete-time formulation, an SSM maintains a hidden state $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathbf{h}{t} \in \mathbb{R}^{N} $\end{document}$ that evolves linearly with the input sequence $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ x{t} $\end{document}$ :

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*} & \mathbf{h}_{t+1} = \mathbf{A} \mathbf{h}_{t} + \mathbf{B} x_{t}, \quad y_{t} = \mathbf{C} \mathbf{h}_{t}, \end{align*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathbf{A} $\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathbf{B} $\end{document}$ , and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathbf{C} $\end{document}$ are learnable parameters that define the system’s dynamics, input projection, and output mapping, respectively. Because this recurrence is linear in both $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathbf{h}{t} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ x{t} $\end{document}$ , the entire sequence can be expressed as a convolutional operation:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*} & y_{t} = \sum_{k=0}^{t} K_{k} \, x_{t-k}, \quad \mathrm{where}\ K_{k} = \mathbf{C} \mathbf{A}^{k} \mathbf{B}. \end{align*}\end{document}

This kernel formulation allows efficient computation using fast Fourier transforms, yielding linear time and memory scaling ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathcal{O}(L) $\end{document}$ ) with respect to sequence length. Importantly, the recurrence matrix $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathbf{A} $\end{document}$ defines a long-range memory kernel whose eigenstructure determines how information decays or persists over time. Properly parameterized (e.g. via diagonalization [49] or HiPPO initialization [50]), $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathbf{A} $\end{document}$ can retain stable, slowly decaying modes, allowing SSMs to preserve information across hundreds of thousands of tokens—a key advantage over conventional RNNs.

Intuitively, SSMs behave like learnable filters that integrate information over arbitrary ranges, combining the efficiency of convolutions with the sequential structure of dynamical systems. This makes them well-suited for genomic modeling, where distal enhancer–promoter or chromatin-level interactions may span hundreds of kilobases. Recent genomic architectures such as HyenaDNA [48] and Evo 2 [37] leverage these properties to pretrain over megabase-scale inputs, achieving both long-range expressivity and computational tractability.

Overview of selected gLMs for personal gene expression prediction

Since 2020, the field has experienced a rapid expansion in the development of gLMs. Although a comprehensive review of these models is beyond the scope of this study, readers are referred to recent reviews [51] that provide in-depth discussions of gLM advances. Table 4 lists the gLMs that are reviewed in this study.

NT is one of the first large-scale gLMs, which is trained on 850 genomes across multiple species and 3200 human genomes. It utilized MLM on 6-kb DNA sequence segments that were segregated into 1000 non-overlapping 6-mer tokens. In extensive evaluations across 18 downstream genomic tasks—ranging from promoter prediction to splice site classification and epigenetic state inference—the NT consistently outperformed existing models. In a recent study [53], the authors employed NT to generate embeddings from 4-kb regions centered on the TSS for 290 GTEx participants, aiming to exploit its ability to encode inter-individual genetic variation. These embeddings were then kept fixed while additional transformer layers were trained for personalized gene expression prediction. However, this approach still fell short when compared to linear models, with the cross-individual median Pearson correlation being 0.02 less than that of Elastic Net. This shows that the pretraining of a transformer network on thousands of genomes, and the additional training of transformer layers on a small-size personal expression dataset failed to capture the individual transcriptional regulation. Furthermore, its performance deteriorated significantly as gene length increases-showing a negative correlation ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ P = 1.9 \times 10^{-6} $\end{document}$ ), which indicates the difficulty to learn long-range interactions for personal expression prediction even with a relatively small context window. Finally, its sequence input length of 4 kb is notably lower than that of Enformer (197 kb), potentially preventing the modeling of regulatory effects that could influence gene expression. NT-v2 is the second generation of gLM in the NT family, focusing on more efficient network architecture that reduces the number of parameters. Consequently, it increases the context window from 6 to 12 kb. In a recent study [54] that compares several gLMs’ performance in personal gene expression using the Geuvadis data, NT-v2 was found to be similar in performance to Evo 2 although underperformed another gLM called Caduceus. Furthermore, it still performed worse than Elastic Net.

DNABERT-2 is another notable gLM that builds on the original DNABERT by introducing several techniques to improve computational efficiency and effectiveness, and shows stronger empirical performance on genomic tasks. Unlike its predecessor, DNABERT-2 uses BPE as tokenization and attention with linear bias for positional embedding, and is pretrained on a broader genomic corpus. In direct comparisons with the NT, DNABERT-2 demonstrates comparable or superior performance on several core tasks despite being significantly smaller. In a recent study [52] designed to extend the DNABERT-2 framework, the authors introduced UKBioBert, a model pretrained on synthetic genomic sequences generated by incorporating 13 million genetic variants—derived from approximately 300,000 individuals—into the reference genome. This variant-informed pretraining strategy enables the model to better capture patterns of individual genetic variation within an MLM framework. The embeddings produced by UKBioBert were then integrated with the embeddings from Enformer and finetuned for personal expression prediction. The resulting model, known as UKBioFormer, fuses Enformer’s sequence-to-expression features with UKBioBert’s variant-aware sequence representations into a single architecture. By doing this, UKBioFormer enhances Enformer, outperforming Performer in 63.3% of genes with good predictability (Pearson correlation > 0.6 in Europeans) during cross-cohort testing (European to African American). Additionally, in predicting the direction of eQTLs, a task Enformer greatly struggles with, UKBioFormer achieved an accuracy of >70%, while Enformer only achieved an accuracy of 53% and Performer achieves an accuracy of 68%. However, like other deep learning models, UKBioFormer still struggles to generalize well to genes it was not fine-tuned on.

Caduceus and Evo 2 represent two recent SSM architectures that extend the capabilities of gLMs to megabase-scale contexts. Both models are descendants of the broader SSM family; however, differ in their architectural foundations and biological inductive biases. Caduceus builds on the Mamba framework [55], a selective SSM designed for long-sequence processing with near-linear complexity ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathcal{O}(L \log L) $\end{document}$ ). What makes Caduceus unique among gLMs is its incorporation of domain-specific inductive biases. It extends the Mamba block into two specialized variants: BiMamba, which introduces bi-directional processing to capture both upstream and downstream dependencies, and MambaDNA, which enforces reverse-complement (RC) equivariance to respect the double-stranded symmetry of DNA. These adaptations enable Caduceus to model sequence context in both directions and to generalize robustly across complementary strands—properties critical for genomic tasks such as variant effect prediction and motif recognition. Evo 2 adopts a different SSM-inspired design, based on the StripedHyena 2 architecture [56], the first multi-hybrid model built upon input-dependent convolutions to efficiently capture short-, medium-, and long-range dependencies across sequences. Evo 2 leverages this framework to train autoregressively on over 9 trillion nucleotides from all domains of life including Bacteria, Archaea, Eukaryota, and Viruses, enabling single-nucleotide resolution modeling at genome-scale context lengths.

Thus, while both models share the efficiency and long-context advantages of the SSM family, Caduceus emphasizes biologically grounded inductive biases—such as bi-directionality and RC symmetry—whereas Evo 2 focuses on large-scale sequence corpus, species diversity, and multi-operator flexibility for universal genomic modeling. In a recent benchmark [54] comparing multiple deep learning models for personalized gene expression prediction, Caduceus not only outperformed all gLMs, including Evo 2, yet also surpassed S2F architectures, achieving a mean Spearman correlation of 0.20 compared to 0.11 for the best S2F model. Moreover, across a subset of 46 genes deemed unpredictable by Elastic Net, Caduceus achieved a mean Spearman correlation of 0.11. Notably, these results were obtained despite Caduceus being pretrained solely on the human reference genome, whereas Evo 2 was trained on a massive cross-species corpus. This contrast underscores that architectural innovations inspired by biological principles can yield substantial improvements in modeling and interpreting transcriptional regulation from genomic sequences.

Discussion and future directions

The development of genome AI models using large-scale reference genomic and epigenomic databases has opened a new avenue for researchers to study a decades-old problem: the relationship between human genetic variants and health and disease. However, the direct application of these models to personal genomes has failed to achieve reliable predictive accuracy, yielding low or even negative correlations across certain genes. One possible explanation is that the sequences governing transcriptional regulation among genes encode mechanisms fundamentally different from those of genetic variants among individuals. To address this, researchers finetuned the models on datasets containing paired genome and gene expression data from cohorts of individuals. The resulting models showed improvements over the original versions, reversing most of the negative correlations, however did not outperform linear models significantly. This may be due to the relatively small size of the finetuning datasets—which typically include only hundreds of individuals—being insufficient for deep learning models to generalize well to unseen samples. It is also possible that novel network architectures and loss functions are needed for more effective training in this specific context.

The introduction of gLMs has opened the door to new possibilities. Because the pretraining relies solely on genome sequences, a gLM can leverage the vast number of personal genomes that have been sequenced and deposited into public databases. This circumvents the need for paired genomic and epigenomic data from each individual to train a supervised model—data that are often prohibitively expensive and time-consuming to collect. Ideally, such pretrained models require only a small number of additional samples for downstream tasks, thereby significantly improving data efficiency. In the field of large language models (LLMs), the scaling law [57] consistently holds true: more training data, larger models, and greater computational power tend to yield better performance. Today’s state-of-the-art LLMs required enormous computational resources to train, with estimated costs exceeding 100 million US dollars [58]. It is reasonable to speculate that a similar scaling law may apply to gLMs. However, current leading models such as NT were trained on only thousands of personal genomes. Meanwhile, UKBioBERT was trained on synthetically mutated sequences rather than actual personal genomes due to UK Biobank’s privacy regulations. Evo 2 is heading towards that direction by utilizing a sequence corpus with nearly nine trillion nucleotides. These gLMs are still nowhere near the scale of LLMs like GPT-4 [59] or Grok 4 [60]. Whether a GPT-like gLM holds the key to unlocking personal expression prediction remains an open question.

Looking ahead, the next generation of genome AI models will depend critically on the creation of larger and more diverse individual-level datasets. Current models are predominantly trained on reference genomes or limited population panels, which constrains their ability to capture the full spectrum of human regulatory and genetic variation. Building datasets that couple genomic sequences with transcriptomic, epigenomic, proteomic, and phenotypic measurements across diverse ancestries, tissues, and environmental contexts will be essential to advance from generic to individualized genome modeling.

Beyond data collection, there is an urgent need for methodological breakthroughs that go beyond scaling and instead embrace the complexity of biology. The strong performance of Caduceus on the Geuvadis dataset highlights the importance of incorporating biological insights into model design. This finding suggests that merely increasing model size or training data may yield diminishing returns unless existing biological knowledge is leveraged to guide models toward extracting meaningful regulatory signals from genomic sequences. Future models should therefore pursue architectures, objectives, and pretraining strategies that explicitly encode principles such as allelic heterogeneity, epistasis, and context-specific regulation. Multi-omics integration represents a powerful avenue toward this goal: by learning joint representations across DNA, RNA, and chromatin modalities, models could more directly link sequence variation to molecular and cellular phenotypes. The General Expression Transformer study [61] demonstrated that integrating DNA accessibility with sequence information enabled the model to accurately predict gene expression in previously unseen cell types. Such integrative frameworks will be critical for capturing the complete cascade from genotype to phenotype.

Another key frontier is causal inference [62]. Most current models identify correlations between sequence patterns and molecular phenotypes, however cannot distinguish cause from association. Embedding causal reasoning—through structural and graphical causal models [63, 64], and experimental perturbations [65, 66]—may enable models to predict the functional impact of variants more reliably and to generate experimentally testable hypotheses about regulatory mechanisms.

Interpretability remains an equally pressing challenge [67]. As genomic models continue to grow in scale and complexity, understanding what they learn becomes critical for their scientific utility. Developing feature attribution, feature interaction, and transparency models that map learned representations back to biological features (e.g. motifs, enhancers, or pathways) will allow these models to serve not merely as predictors but as instruments for biological discovery.

In addition, synthetic data may offer a promising path to overcome the intrinsic limitations of natural genomes. As a recent study [68] has noted, the human genome is relatively small and repetitive, limiting the diversity of regulatory configurations available for model training. MPRAs using synthetic DNA could expand the accessible regulatory landscape, enabling systematic exploration of cis-regulatory grammars. However, how to co-train models on both synthetic and real data remains an open methodological question that will require innovations in domain adaptation and representation alignment.

Finally, sustained progress in this field will hinge on collaboration and resource sharing. Establishing shared benchmarks, promoting cross-institutional data exchange, and pooling computational resources will ensure transparency, reproducibility, and accessibility. Collaborative consortia that unite biologists, computer scientists, and engineers will be vital for translating advances in genomic AI into mechanistic understanding and biomedical impact.

Key Points

Deep learning models achieve high accuracy when predicting molecular phenotypes on reference genomes but fail to robustly predict personalized gene expression across individuals.
Despite their simpler architecture, linear models are more reliable than deep learning models for cross-individual gene expression prediction.
Fine-tuned deep learning models show promise for predicting gene expression directly from raw genomic sequences, eliminating the need for predefined biomarkers.
Genomic language models, which require only sequence-based pretraining, have emerged as a promising alternative for encoding the transcriptional effects of genetic variation.

Bibliography68

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Schadt EE, Björkegren JL. New: network-enabled wisdom in biology, medicine, and health care. Sci Transl Med 2012; 4:115rv 1–115rv 1.10.1126/scitranslmed.300213222218693 · doi ↗ · pubmed ↗
2Albert FW, Kruglyak L. The role of regulatory variation in complex traits and disease. Nat Rev Genet 2015; 16:197–212. 10.1038/nrg 389125707927 · doi ↗ · pubmed ↗
3Gamazon ER, Wheeler HE, Shah KP et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet 2015; 47:1091–8. 10.1038/ng.336726258848 PMC 4552594 · doi ↗ · pubmed ↗
4Gusev A, Ko A, Shi H et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet 2016; 48:245–52. 10.1038/ng.350626854917 PMC 4767558 · doi ↗ · pubmed ↗
5Hu Y, Li M, Lu Qet al. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat Genet 2019; 51:568–76. 10.1038/s 41588-019-0345-730804563 PMC 6788740 · doi ↗ · pubmed ↗
6Urbut SM, Wang G, Carbonetto Pet al. Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nat Genet 2019;51:187–95. 10.1038/s 41588-018-0268-830478440 PMC 6309609 · doi ↗ · pubmed ↗
7Zhang Z, Bae YE, Bradley J Ret al. Summit: an integrative approach for better transcriptomic data imputation improves causal gene identification. Nat Commun 2022; 13:6336. 10.1038/s 41467-022-34016-y 36284135 PMC 9593997 · doi ↗ · pubmed ↗
8Barbadilla-Martínez L, Klaassen N, van Steensel Bet al. Predicting gene expression from DNA sequence using deep learning models. Nat Rev Genet 2025;26:666–80. 10.1038/s 41576-025-00841-240360798 · doi ↗ · pubmed ↗