Zero-shot benchmarking of RNA language models in structural, functional, and evolutionary learning

He Wang; Yikun Zhang; Jie Chen; Jian Zhan; Yaoqi Zhou

PMC · DOI:10.1093/bib/bbag098·March 6, 2026

Zero-shot benchmarking of RNA language models in structural, functional, and evolutionary learning

He Wang, Yikun Zhang, Jie Chen, Jian Zhan, Yaoqi Zhou

PDF

Open Access

TL;DR

This paper evaluates 21 RNA language models to understand their strengths and weaknesses in capturing RNA structure, function, and evolution without fine-tuning.

Contribution

A standardized zero-shot benchmark for RNA language models across structural, functional, and evolutionary tasks.

Findings

01

RNA-specific pretraining is essential for capturing structural information.

02

Evolutionary signals from multiple sequence alignments improve model performance.

03

Model scaling alone does not guarantee better performance; architecture and objectives matter.

Abstract

RNA language models (LMs) are increasingly applied to RNA structure and function analysis, yet their intrinsic representational capacities remain poorly characterized. Here, we present a standardized zero-shot evaluation of 21 RNA LMs, with representative DNA LMs included as reference controls. Three complementary tasks—attention-based RNA secondary structure prediction, embedding-based RNA classification, and mutational fitness estimation from sequence likelihoods—are evaluated without downstream fine-tuning. Our results reveal substantial variability across models and clear trade-offs between structural, functional, and evolutionary representations. RNA-specific, noncoding RNA-enriched pretraining is crucial for capturing structural information, while evolutionary signals from multiple sequence alignments substantially boost performance. Although model scaling yields gains,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Chemicals2

LucaOne nucleotides

Diseases1

LMs

Figures4

Click any figure to enlarge with its caption.

Schematic overview of a BERT-based RNA language model, highlighting three core outputs—attention weights (att.), embedding representations (emb.), and token-level logits (logit). These outputs can be exploited for structural inference, functional prediction, and mutational fitness estimation, respectively.

Zero-shot RNA secondary structure prediction by DNA LMs (D), Class III (unified, RIII), II (mRNA-related, RII), and I (ncRNA-enriched, RI) RNA LMs. Performance was evaluated using (a) MCC and (b) F1 score on the test set (TS), and (c) MCC and (d) F1 score on the hard test set (TS-Hard). For models with multiple released versions, the version achieving the highest median MCC was shown. Models within each category were ordered by increasing median MCC.

Zero-shot RNA classification based on embedding similarity on (a) the RfamSample and (b) ArchiveII-Nr datasets. Results were shown for DNA LMs (D), Class III (unified, RIII), II (mRNA-related, RII), and I (ncRNA-enriched, RI) RNA LMs. Performance was measured by the OR between similarity distributions, reported as 1 – OR (better performance for a larger value). For the models with multiple versions, the version with the highest mean 1 − OR was displayed. Models within each category were ordered by increasing mean 1 − OR.

Zero-shot RNA fitness prediction using token-level logits. (a) Overall performance of DNA LMs (D), Class III (unified, RIII), II (mRNA-related, RII), and I (ncRNA-enriched, RI) RNA LMs, measured by the absolute value of the Spearman correlation coefficient (SR) between predicted and experimental mutational effects. For models with multiple versions, only the version with the highest median SR was shown. Models within each category were ordered by increasing median SR. (b) Median SR across assays grouped by RNA type, illustrating RNA class–dependent performance variability.

Tables2

Table 1. Overview of representative RNA LMs, grouped by pretraining data characteristics, with published models highlighted in bold.

Class	Model	Year	Reference	Parameters	Pretraining dataset	Input modality	Tokenization	Architecture	Layers	Heads	Embedding dim.	Max input length
I	RNA-FM	2022	[26, 51]	100M	RNAcentral [19] (~23M Seq.)	Seq.	base	BERT	12	20	640	1024
	RNABERT	2022	[50]	0.5M	Human seq. from RNAcentral (~762K Seq.)	Seq.	base	BERT	6	12	120	440
	Uni-RNA (L8)	2023	[27]	21M	RNAcentral and nt [72] and GWH [73] (~1B Seq.)	Seq.	base	BERT	8	8	512	1024
	Uni-RNA (L12)			71M					12	12	768
	Uni-RNA (L16)			168M					16	16	1280
	AIDO.RNA (650 M)	2024	[42]	648M	RNAcentral (~41.5M Seq.)	Seq.	base	BERT	33	20	1280	1024
	AIDO.RNA (1.6B)	2024	[42]	1.6B	RNAcentral (~41.5M Seq.)	Seq.	base	BERT	32	32	2048	1024
	DGRNA	2024	[55]	108M	MARS (~1.2B) [52]	Seq.	base	Mamba2	12	-	768	2048
	GenerRNA	2024	[56]	304M	RNAcentral [19] (~16.09M Seq.)	Seq.	BPE	GPT	24	16	1280	1024
	RFamLlama (base)	2024	[57]	31M	Rfam 14.10 (~676K)	Seq.	base	Llama	8	32	512	-
	RFamLlama (large)	2024	[57]	89M	Rfam 14.10 (~676K)	Seq.	base	Llama	10	32	768	-
	RNAErnie	2024	[31]	86M	RNAcentral (~23M Seq.)	Seq.	base	BERT	12	12	768	512
	RNA-km	2024	[32]	152M	RNAcentral (~23M Seq.)	Seq.	base	BERT	12	16	1024	512
	RNA-MSM	2024	[29]	96M	MSA (~4000 Rfam [74])	MSA	base	MSA-Transformer	10	12	768	1024
	BiRNA-BERT	2025	[43]	116M	RNAcentral and RefSeq [72] (~36.5M Seq.)	Seq.	base or BPE	BERT	12	12	768	1024 (base)
	ERNIE-RNA	2025	[28]	86M	RNAcentral (~20.4M Seq.)	Seq.	base	BERT	12	12	768	1024
	LAMAR (2k)	2025	[75]	86M	RNAcentral and RefSeq and NCBI Genome (~15M Seq.)	Seq.	base	BERT	12	12	768	2048
	LAMAR (4k)	2025	[75]	86M	RNAcentral and RefSeq and NCBI Genome (~15M Seq.)	Seq.	base	BERT	12	12	768	4096
	ProtRNA	2025	[40]	651 M	nonredundant RNAcentral (~6M Seq.)	Seq.	base	BERT	33	20	1280	512
	RiNALMo (micro)	2025	[30]	33M	RNAcentral and nt and Rfam and Ensembl [76] (~36M Seq.)	Seq.	base	BERT	12	20	480	1024
	RiNALMo (mega)			148M					30	20	640
	RiNALMo (giga)			651M					33	20	1280
II	3UTRBERT	2023	[54]	86M	3′UTR in human mRNA transcripts from GENCODE [77] (~20K Seq.)	Seq.	k-mer	BERT	12	12	768	512
	MP-RNA	2024	[41]	186M	OneKP initiative [78] (~54.2B Bases)	Seq. and SS	base	BERT	32	30	720	1024
	PlantRNA-FM	2024	[53]	34M	OneKP initiative (~54.2B Bases)	Seq. and SS	base	BERT	12	20	480	1026
	SpliceBERT	2024	[60]	20M	72 vertebrate genomes from UCSC [79] (~2M Seq.)	Seq.	base	BERT	6	16	512	1024
	UTR_LM	2024	[80]	1M	Ensembl-derived 5′UTRs and curated datasets from previous studies [81, 82] (~2.3M Seq.)	Seq. and SS and MFE	base	BERT	6	16	128	1022
III	LucaOne	2025	[58]	1.6B	RNA/Refseq (~136M Seq.)	Seq.	base	BERT	20	40	2560	1280

Table 2. Summary of downstream tasks used for evaluating RNA LM performance.

Task	Task type	Representative LMs	Benchmark datasets
Structural:
Solvent accessibility	Regression	RNA-MSM	Author-curated solvent accessibility labels computed from RNA 3D chain structures using the POPS package [83] with a probe radius of 1.4 Å [29]
Secondary structure	Binary	RNA-FM, RNA-MSM, Uni-RNA, RiNALMo, ERNIE-RNA, RNAErnie, ProtRNA, MP-RNA, AIDO.RNA	RNAStralign [84]; ArchiveII [85]; bpRNA-1m [86]; RNA 3D structure datasets curated in SPOT-RNA and SPOT-RNA2 [87, 88]
3D contact	Binary	RNA-FM, ERNIE-RNA	RNAcontact benchmark datasets [61], including nonredundant RNA 3D structures curated from Leontis and Zirbel [89]
3D torsion angle	Regression	BiRNA-BERT	RNA 3D structure datasets curated in SPOT-RNA-1d [90]
RNA structure alignment	Alignment/Similarity	RNABERT	RNA pairwise alignments from BRAliBase2.1 k2 [91]
Functional:
Classification	Binary/Multiclass	RNABERT, BiRNA-BERT, RiNALMo, RNAErnie, LucaOne, AIDO.RNA	ArchiveII [85, 92]; Rfam-based classification benchmarks [74]
Splice sites	Binary	SpliceBERT, RiNALMo, BiRNA-BERT, AIDO.RNA, DGRNA	Author-curated splice-site sequence datasets reported in Spliceator [93]
Modification sites	Multilabel	Uni-RNA, AIDO.RNA, 3UTRBERT	MultiRM benchmark datasets covering 12 RNA modification types [94]; author-curated human m6A modification datasets reported in [54, 95]
RNA-protein binding	Binary	RNA-FM, ERNIE-RNA, ProtRNA, LucaOne, DGRNA, 3UTRBERT	Author-curated RNA–protein interaction datasets reported in PrismNet [62]; ncRPI-LGAT benchmark datasets [96]; author-curated eCLIP-RBP datasets [54]
RNA–RNA binding	Binary	RNAErnie, BiRNA-BERT, DGRNA	Author-curated miRNA-mRNA interactions datasets reported in DeepMirTar [97]; author-curated miRNA-lncRNA interaction datasets reported in PmliPred [98]
mRNA subcellular localizations	Multilabel	3UTRBERT	DM3Loc benchmark datasets [99]; RNA localization datasets curated from RNALocate [100]
Mean ribosome load	Regression	RNA-FM, Uni-RNA, RiNALMo, ERNIE-RNA, UTR_LM, ProtRNA, AIDO.RNA, DGRNA	Author-curated 5′UTR datasets with MRL annotations reported in Optimus 5-prime [81]
mRNA exp. level and trans. efficiency	Regression/Binary	UTR_LM, AIDO.RNA, DGRNA, PlantRNA-FM	Author-curated 5′UTR datasets from human muscle tissue, PC3 prostate cancer cells, and HEK293T cells with corresponding mRNA EL and TE values [80, 82]; author-curated 5′UTR datasets from Arabidopsis and rice transcriptomes with corresponding TE values [53]
RNA genic region annotation	Multilabel	PlantRNA-FM	Author-curated genic region annotation datasets (5′UTR, CDS, and 3′UTR) derived from Phytozome [53, 101]
IRES identification	Binary	UTR_LM	Author-curated IRES identification datasets comprising 46,774 mRNAs collected from multiple public databases [74, 102–105]
RNA generation	Generative/Sequence generation	GenerRNA, RFamLlama	Rfam-based benchmarks [74]
Fitness prediction	Regression	RFamLlama	Author-curated deep mutational scanning (DMS) datasets [57]

Funding3

—National Key R&D Program of China10.13039/501100012166
—Shenzhen Medical Research Funds in China
—Natural Science Foundation of China10.13039/501100001809

Keywords

RNA language modelszero-shot evaluationRNA secondary structureRNA classificationmutational fitness prediction

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsRNA and protein synthesis mechanisms · Machine Learning in Bioinformatics · Origins and Evolution of Life

Full text

Introduction

RNA molecules play central roles in virtually all aspects of cellular regulation, ranging from gene expression control and catalysis to molecular recognition and therapeutic intervention [1–13]. Besides messenger RNAs (mRNAs), diverse classes of noncoding RNAs (ncRNAs)—including ribosomal RNAs, transfer RNAs, microRNAs, long ncRNAs, ribozymes, and regulatory untranslated regions—are now recognized as key functional entities whose activities are intimately linked to their sequence-encoded structural and evolutionary constraints [14–18]. Despite the rapid accumulation of RNA sequence data, experimentally determined RNA structures and functional annotations remain scarce [19, 20], resulting in an expanding gap between sequence availability and biological understanding.

In parallel, large-scale language models (LMs) have transformed natural language processing by learning contextual representations from massive unlabeled corpora. Transformer-based architectures [21], such as BERT [22], GPT [23, 24], and T5 [25], capture long-range dependencies through self-attention and have demonstrated remarkable transferability across downstream tasks. The conceptual analogy between natural language and biological sequences—both composed of discrete tokens governed by context-dependent rules—has motivated the extension of language modeling approaches to proteins, DNA, and RNA [26–43].

Protein language models (PLMs) have experienced particularly rapid development and provide an important precedent for the RNA field. Early foundational models, such as ESM-1b [35] and ProtT5 [37], demonstrated that structural and functional information can emerge from unsupervised training on large protein sequence databases. Subsequent scaling and architectural refinements led to more multimodal PLMs, such as ProtCLIP [44], ProteinGPT [45], and ESM-3 [46], which integrate sequence information with functional or structural signals, enabling improved zero-shot inference and broadening the scope of protein representation learning. Meanwhile, systematic evaluations such as ProteinGym [47] and PEER [48] have highlighted that model architecture, data composition, and evaluation protocols critically influence apparent performance, underscoring the necessity of standardized and task-agnostic benchmarking across different PLMs [49].

Inspired by the success of PLMs, a growing number of RNA LMs have emerged in recent years. Early models, including RNABERT [50] and RNA-FM [26, 51], adopted BERT-style masked language modeling and were pretrained primarily on ncRNA databases. More recent efforts expanded along multiple directions: (i) scaling model size and training data (e.g. Uni-RNA [27], AIDO.RNA [42], and RiNALMo [30]), including the use of the master database of all possible RNA sequences (MARS) [52]; (ii) incorporating structural or motif-level inductive biases during pretraining (e.g. RNAErnie [31], ERNIE-RNA [28], MP-RNA [41], and PlantRNA-FM [53]); (iii) introducing alternative masking strategies (e.g. RNAErnie [31], RNA-km [32]); (iv) adopting alternative positional encoding schemes (e.g. RNA-km [32]); (v) leveraging evolutionary information through multiple sequence alignments (e.g. RNA-MSM [29]); (vi) adopting k-mer or byte pair encoding (BPE) tokenization instead of single-nucleotide representations (e.g. 3UTRBERT [54], BiRNA-BERT [43]); (vii) employing novel pretraining strategies (e.g. ProtRNA [40]); (viii) exploring alternative network architectures (e.g. Mamba2-based DGRNA [55], GPT-based GenerRNA [56], Llama-based RFamLlama [57]); and (ix) developing unified or cross-modal models capable of processing RNA together with DNA and proteins (e.g. LucaOne [58]). These models have been applied to a wide range of downstream tasks, such as RNA secondary structure prediction [26–28, 30–32, 40–43], RNA family and type classification [26, 59], splicing and modification sites analysis [27, 42, 60], RNA–protein/RNA interaction prediction [61, 62], and mutational fitness estimation [57].

Despite the rapid progress of RNA LMs, several fundamental limitations remain in the current landscape of RNA LMs. First, unlike the protein domain—where standardized benchmarks and systematic comparative studies have clarified the strengths and limitations of different model architectures—RNA LMs were typically evaluated on heterogeneous downstream tasks using disparate datasets, fine-tuning strategies, and performance metrics. This fragmentation hampers cross-study comparison and obscures which aspects of RNA biology are intrinsically captured by different models prior to task-specific supervision. Second, although DNA LMs are sometimes adopted as baselines in RNA-related studies, there has been no comprehensive assessment of whether RNA-specific pretraining provides measurable advantages over generic nucleotide LMs [63, 64]. Moreover, in contrast to the large impact of PLMs on protein structure prediction, most notably exemplified by the emergence of AlphaFold [65] and subsequent development of single-sequence-based methods (e.g. ESMfold [66], OmegaFold [67]), it remains unclear whether RNA LMs can lead to a comparable advance in RNA structural modeling; notably, the recent CASP16 assessment failed to reveal significant advancement [68]. Third, RNA LMs differ substantially in network architecture, sequence encoding schemes, and pretraining strategies, yet the influence of these design choices on performance across structurally versus functionally oriented tasks remains poorly understood. Finally, the diversity of RNA training corpora—ranging from ncRNA-focused databases to mRNA-centric transcriptomes and unified multiomics datasets—raises the question of how pretraining scope affects the balance between structural and functional generalization.

A limited number of recent studies have begun to benchmark and evaluate RNA and DNA LMs on specific classes of tasks. These efforts include comprehensive evaluations of six LMs for RNA secondary structure prediction [69], large-scale benchmarks for RNA fitness and structure prediction (e.g. RNAGym, evaluating three RNA LMs [63]), nucleotide foundation model benchmarks focused on fitness prediction (e.g. NABENCH, evaluating ten RNA LMs [70]), and broader task collections for RNA modeling such as BEACON [71] (evaluating seven RNA LMs) and RNA-Scope [64] (evaluating six RNA LMs). While these efforts have provided valuable task-specific insights and established important datasets and evaluation protocols, they typically focus on a single task category or application domain, evaluate a limited subset of RNA LMs, or rely on task-specific fine-tuning or supervised training. Consequently, a comprehensive and standardized analysis that systematically compares a broad spectrum of RNA LMs under identical zero-shot evaluation protocols—while explicitly disentangling structural and functional signals and extensively examining information derived directly from unmodified model outputs—remains lacking.

Here, we present a comprehensive and standardized evaluation of RNA LMs. Using representative DNA LMs as reference controls, we assess 21 RNA LMs across three complementary zero-shot tasks: attention-based RNA secondary structure prediction, embedding-based RNA classification, and RNA fitness prediction from mutational likelihoods. By focusing exclusively on zero-shot settings, we aim to disentangle the intrinsic representational capacity of each model from the effects of downstream fine-tuning. This study provides a unified benchmark for RNA LMs, clarifies trade-offs between structural and functional learning, and highlights key challenges and opportunities for the next generation of RNA foundation models.

RNA language models and downstream task landscape

Collection and categorization of RNA language models

To enable a systematic and representative comparison, we first curated a comprehensive collection of RNA LMs from recent literature and publicly available resources over the past 3 years (see Methods for details). In total, 21 RNA LMs were identified within the search period (Table 1). Based on their pretraining data, these models were categorized into three classes. Class I models were trained primarily on diverse RNA datasets, particularly containing large amounts of ncRNAs. Class II models trained exclusively on mRNA-related sequences. Class III models comprised unified or multimodal architectures jointly trained on RNA, DNA, and/or protein sequences. Given the rapid evolution of the field, some recently proposed models or preprints lacking stable public checkpoints may not be included, despite extensive efforts to ensure comprehensive coverage.

Despite growing interest in alternative architectures, most of the existing RNA LMs adopt the BERT paradigm (17 out of 21 in Table 1) and rely on masked language modeling as their primary pretraining objective. This architectural dominance mirrors early developments in protein language modeling and reflects the flexibility of encoder-based models for learning contextual sequence representations without task-specific supervision. Consequently, performance differences among RNA LMs arise largely from variations in model scale, training data composition, tokenization schemes, pretraining strategies, and the incorporation of auxiliary inductive biases, rather than from fundamentally distinct network backbones.

The canonical network for BERT-based LMs is illustrated in Fig. 1, and a more detailed generalized description is provided in the Supplementary Information. Here, we briefly highlight three model outputs that are particularly relevant for downstream analysis. First, attention weights encode pairwise relationships between sequence positions and have been associated with long-range dependencies and structural organization in biological sequences [26, 30, 35, 37, 51]. Second, last hidden state representations (embeddings) provide contextualized token- or sequence-level representations that form the basis for clustering, classification, and functional or structural analyses [28, 31, 49]. Third, token-level logits, which reflect the model's predicted probability distribution over sequence variants, can be interpreted as proxies for evolutionary constraints and mutational preferences [47, 63]. These three types of outputs—attention weights, embeddings, and logits—are also available in RNA LMs based on alternative architectures, including Mamba2-, GPT-, Llama-, and MSA-Transformer-based models [29, 55–57], and together enable systematic probing of RNA LM behavior in a zero-shot setting.

Schematic overview of a BERT-based RNA language model, highlighting three core outputs—attention weights (att.), embedding representations (emb.), and token-level logits (logit). These outputs can be exploited for structural inference, functional prediction, and mutational fitness estimation, respectively.

Downstream task categories and dataset landscape

Corresponding to the diversity of RNA biological questions, RNA LMs have been evaluated across a wide spectrum of downstream tasks that can be broadly categorized into structural and functional property prediction (Table 2). Structural tasks primarily focus on RNA conformational features and include solvent accessibility prediction [29], secondary structure prediction [26–30, 40–43], three-dimensional (3D) contact and torsion angle prediction [26, 28, 43], and RNA structural alignment [50]. Functional tasks aim to capture regulatory and biological roles of RNA molecules and encompass RNA family or type classification [26, 59], splice sites and modification sites prediction [42, 60], RNA–protein and RNA–RNA interactions prediction [61, 62], mRNA subcellular localizations [54], mean ribosome load (MRL) prediction [26, 30, 40, 42, 80, 81], mRNA expression level (EL) and translation efficiency (TE) prediction [42, 80, 82], genic region annotation [53], internal ribosome entry sites (IRESs) identification [80, 81, 106], RNA generation and mutational fitness prediction [57].

Despite their breadth, the use of downstream tasks for evaluating RNA LMs is subject to several fundamental limitations. First, a substantial fraction of evaluations requires task-specific fine-tuning or additional prediction heads, confounding the intrinsic representational capacity of the pretrained models with downstream supervision. Second, data bias and redundancy are prevalent in widely used RNA resources, such as RNAcentral [19] and Rfam [74], where a small number of highly conserved RNA families (e.g. rRNAs and tRNAs) dominate the dataset. Although sequence-level deduplication tools (e.g. MMseqs2 [107] or CD-HIT-EST [108]) are often applied, structurally similar but sequence-divergent RNAs may still lead to information leakage. Third, the quality of evaluation data remains a concern, as many RNA secondary structure benchmarks (e.g. RNAStralign [84], ArchiveII [85, 92], and bpRNA [86]) are derived largely from comparative analysis or predictive algorithms rather than experimentally resolved 3D structures, which is considered the gold standard for revealing all (canonical or noncanonical) base pairs within RNA sequences.

To address these issues, we restrict our analysis to downstream tasks that can be evaluated in a strictly zero-shot setting and that probe complementary aspects of RNA biology. Specifically, RNA secondary structure prediction interrogates structural signals encoded in attention patterns and contextual representations, embedding-based RNA classification evaluates functional and categorical information captured by last hidden state representations, and RNA fitness prediction based on mutational likelihoods leverages token-level logits to quantify probabilistic constraints on sequence variation. In constructing evaluation datasets, we emphasize experimentally derived data wherever possible and explicitly control for data redundancy and imbalance. For secondary structure prediction, we curated a dataset from experimentally resolved PDB structures and applied both sequence- and structure-level redundancy filtering. For RNA classification, positive and negative samples were balanced across RNA families or types. For RNA fitness prediction, we integrated experimental mutational data spanning multiple RNA classes. Together, this evaluation framework aims to provide a more principled, unbiased, and interpretable evaluation of RNA LMs under unified zero-shot conditions.

Data and methods

RNA language models included in this study

RNA LMs were collected through systematic searches of peer-reviewed literature and preprint repositories, including Google Scholar, bioRxiv, and arXiv, complemented by citation tracing and curated community resources. Models were included if they (i) were published or publicly available within the past 3 years (before 1 August 2025), (ii) were based on language-model architectures, and (iii) provided open-source pretrained weights enabling direct extraction of model representations.

A total of 21 RNA LMs meeting these criteria were identified, covering general-purpose RNA foundation models, task-specific RNA LMs, and unified cross-molecular LMs. Model characteristics—including pretraining data sources, parameter counts, tokenization strategies, architectural backbones, and details of model availability and download sources—were summarized in Table 1 and Supplementary Table 1. This collection was designed to be representative of the current RNA LM landscape while remaining sufficiently diverse to enable comparative analysis across architectures and training paradigms.

Datasets

RNA secondary structure

For zero-shot evaluation of RNA secondary structure inference, we relied exclusively on base-pair annotations extracted from experimentally resolved 3D RNA structures, as high-resolution 3D data provide the most reliable ground truth at the base-pair level. We adopted a previously curated dataset comprising a validation set (VL1, 40 PDBs) and an independent test set (TS, 70 PDBs) [29]. Redundancy was removed at both sequence and structural levels, with sequence identity capped below 80% and structural similarity (TM-score) below 0.45 between validation and test sets. In addition, overlap in Rfam family annotations between VL1 and TS was eliminated using Infernal-based homology detection [109]. This resulted in a strictly nonhomologous subset (TS-Hard, 15 PDBs), which was used to assess robustness under stringent homology exclusion.

RNA classification

Functional similarity was evaluated using two complementary RNA classification benchmarks that operate at different levels of granularity. For RNA family–level classification, we constructed the RfamSample dataset based on Rfam 14.0 seed alignments [74]. After internal deduplication within each family (sequence identity below 80%), 10 sequences (length ≤ 1024) were randomly sampled from each family containing more than 10 members, yielding a total of 561 families. This dataset was used to examine the ability of RNA LMs to distinguish within-family from cross-family sequence pairs. For RNA type–level classification, we employed the ArchiveII [85] dataset after excluding excessively long group II introns. To reduce redundancy, MMseqs2 [107] was applied with a maximum sequence identity threshold of 80%, resulting in a nonredundant subset (ArchiveII-Nr) containing 1273 RNAs (Supplementary Table 2). This dataset was chosen to probe whether RNA LM embeddings can capture functional similarity beyond close sequence homology.

RNA DMS assays

Zero-shot RNA fitness prediction was evaluated using experimental mutational datasets from the RNAGym benchmark [63]. We selected 31 ncRNA deep mutational scanning (DMS) assays with wild-type sequence lengths not exceeding 440 nucleotides, ensuring compatibility with all evaluated LMs. These assays span multiple RNA classes and provide quantitative measurements of mutational effects suitable for logit-based evaluation.

Zero-shot evaluation

All evaluations were conducted in a strictly zero-shot setting, without any task-specific fine-tuning or additional supervised training.

Zero-shot secondary structure prediction

For each RNA LM, attention maps were extracted from all head–layer combinations. Attention matrices were symmetrized, corrected using average product correction (APC), and transformed into base-pairing probability matrices via sigmoid scaling. A single optimal head–layer and decision threshold was selected on the validation set (VL1) by maximizing the F1 score and then applied unchanged to the test sets (TS and TS-Hard). Model performance was quantified using Matthews correlation coefficient (MCC) and F1 score. This procedure ensures that performance reflects intrinsic structural information encoded during pretraining rather than after task-specific optimization.

Zero-shot RNA classification by embedding similarity

To assess whether sequence embeddings encode functional similarity, sequence embeddings were extracted and converted into fixed-length representations using Fourier transform-based dimensionality reduction (FFT). Cosine similarity distributions were computed for homologous and nonhomologous RNA sequence pairs sampled from RfamSample and ArchiveII-Nr datasets. Model performance was quantified using (i) MCC, (ii) area under the receiver operating characteristic curve (AUC), and (iii) the overlap ratio (OR) between similarity distributions, with lower OR indicating stronger discriminative capacity. In addition to FFT-based embeddings, the dedicated classification token (CLS) embedding was employed as a complementary representation to assess method dependence [30].

Zero-shot RNA fitness prediction via mutational likelihoods

RNA fitness prediction was performed using token-level logits to estimate the likelihood of sequence variants, following established zero-shot evaluation protocols [47, 63]. Two scoring schemes were considered: wild-type conditioned log-likelihood ratios (WT-LLR) and pseudo-log-likelihood differences (PLL-D), capturing local and global mutational effects, respectively. For autoregressive models and k-mer/BPE–based models, only PLL-D were computed. Logit normalization was performed using either log-softmax or softmax transformations, depending on model output characteristics, to ensure comparability across different models. Model performance was quantified using (i) MCC, (ii) AUC, and (iii) the absolute value of the Spearman correlation coefficient (SR) between predicted and experimental mutational effects.

Statistical analysis

All reported results were obtained without downstream fine-tuning. Performance metrics represent averages over multiple independent random samplings where applicable. Supplementary Table 3 summarizes the availability of attention weights, embedding representations, and token-level logits across evaluated RNA and DNA LMs, which determined the inclusion of specific LMs in each zero-shot task.

Additional methodological details are provided in the Supplementary Information.

Results

Structural signals encoded by attention: zero-shot RNA secondary structure prediction

We first examined whether RNA LMs capture RNA secondary structure information through attention-based analysis. Following prior work [29], attention maps extracted from RNA LMs were used to perform zero-shot RNA secondary structure prediction. A single optimal head–layer combination and decision threshold were selected on a validation set (VL1) and subsequently applied unchanged to an independent test set (TS), which consists of 70 RNAs with experimentally determined 3D structures. The DNA LM DNABERT [38] was included as a reference control.

As summarized in Supplementary Table 4 and Fig. 2a and b, performance evaluated by MCC and F1 score revealed that the DNABERT and several Class II RNA LMs trained primarily on mRNA-related sequences—including 3UTRBERT, SpliceBERT, and UTR_LM—exhibited negligible secondary structure prediction capability, underscoring the importance of ncRNA-rich pretraining corpora for capturing RNA structural information. Notable exceptions were MP-RNA and PlantRNA-FM, which achieved relatively strong performance despite limited ncRNA content in their training data. Their effectiveness can be attributed to the explicit incorporation of RNA secondary structure information, derived from ViennaRNA-based predictions, during pretraining and inference [41, 53, 110]. The unified multiomics model LucaOne (Class III RNA LM) showed moderate performance but consistently underperformed most Class I RNA LMs, suggesting that modality-agnostic pretraining may dilute RNA-specific structural inductive biases.

Zero-shot RNA secondary structure prediction by DNA LMs (D), Class III (unified, RIII), II (mRNA-related, RII), and I (ncRNA-enriched, RI) RNA LMs. Performance was evaluated using (a) MCC and (b) F1 score on the test set (TS), and (c) MCC and (d) F1 score on the hard test set (TS-Hard). For models with multiple released versions, the version achieving the highest median MCC was shown. Models within each category were ordered by increasing median MCC.

Among Class I RNA LMs, RNA-MSM achieved the highest median F1 score and MCC, outperforming AIDO.RNA (1.6B) by 9.0% and 6.6%, respectively, despite being trained on fewer than 4000 RNA families. This result highlights the advantage of incorporating evolutionary information via multiple sequence alignments and emphasizes the intrinsic difficulty of RNA LMs for inferring RNA secondary structure from single sequences alone. Clear scaling effects were observed across several model families: larger models consistently outperformed their smaller counterparts, including RiNALMo (giga versus mega), Uni-RNA (L16 versus L12), and AIDO.RNA (1.6B versus 650M).

Essentially the same ranking was observed on the more stringent TS-Hard dataset, which shares no RNA families with the validation set (VL1). The Spearman correlation coefficient between median MCC values on TS and TS-Hard was 0.99, indicating strong ranking consistency. As shown in Supplementary Table 5 and Fig. 2c and d, the top-performing models on TS—MP-RNA (#3), AIDO.RNA (1.6B) (#2), and RNA-MSM (#1)—maintained their relative ordering on TS-Hard, supporting the robustness of the comparative analysis despite the limited size of the hard test set.

Functional signals in embedding representations: zero-shot RNA classification

To evaluate whether RNA LM embeddings encode functional and categorical information, we performed zero-shot RNA classification based on embedding similarity. A total of 100,000 homologous and 100,000 nonhomologous RNA sequence pairs were sampled from the RfamSample dataset, and cosine similarity between sequence-level embeddings was computed using a Fourier transform–based dimensionality reduction approach (details in Methods). DNA LMs (e.g. DNABERT [38], DNABERT-2 [39], and NT [33]) were included as controls. Model performance was quantified using MCC, AUC, and the OR. For consistency with MCC and AUC, the 1 – OR was reported, such that higher values indicate better performance.

As shown in Supplementary Table 6 and Fig. 3a, most RNA and DNA LMs were able to distinguish homologous from nonhomologous sequence pairs, indicating that unsupervised pretraining captures family-level RNA properties to some extent. Overall, Class I RNA LMs outperformed DNA LMs and Class II RNA LMs on the RfamSample dataset, whereas the unified model LucaOne showed mid-range performance. RNA-FM achieved the strongest discriminative performance, with a mean value of 1 − OR = 0.878 ± 0.001. This was followed by ProtRNA. In contrast, non-BERT-based models (e.g. DGRNA, GenerRNA, and RFamLlama) produced embeddings with limited discriminative power in the absence of task-specific prompts.

Zero-shot RNA classification based on embedding similarity on (a) the RfamSample and (b) ArchiveII-Nr datasets. Results were shown for DNA LMs (D), Class III (unified, RIII), II (mRNA-related, RII), and I (ncRNA-enriched, RI) RNA LMs. Performance was measured by the OR between similarity distributions, reported as 1 – OR (better performance for a larger value). For the models with multiple versions, the version with the highest mean 1 − OR was displayed. Models within each category were ordered by increasing mean 1 − OR.

Classification results based on F1, MCC, and AUC (Supplementary Table 6) were highly consistent with those based on 1 − OR, with Spearman correlation coefficients of 0.95, 0.94, and 0.89, respectively. This consistency indicates that the OR provides a robust summary metric for embedding-based classification performance.

We further evaluated model performance on the ArchiveII-Nr dataset, where homology was defined at the RNA type level rather than the family level. As shown in Supplementary Table 7 and Fig. 3b, overall performance decreased across all models, reflecting increased sequence diversity and weaker homology signals. Nevertheless, model rankings remained largely consistent with those observed on RfamSample (Spearman correlation = 0.88). ProtRNA was now the best, followed by RNA-FM. Notably, increased model size did not systematically improve performance in this task; in several cases, smaller models outperformed larger ones, suggesting that embedding quality does not scale monotonically with parameter count.

Complementary analyses using the dedicated classification token (CLS) embedding [30] yielded highly similar results (Supplementary Tables 8 and 9). The Spearman correlation between FFT-based and CLS-based rankings reached up to 0.95 for mean 1 − OR on both datasets, further supporting the robustness of the observed trends. In addition, high correlations in model rankings were also observed between 1 − OR and MCC (Spearman correlation = 0.92 in RfamSample and 0.94 in ArchiveII-Nr), as well as between 1 − OR and AUC (Spearman correlation = 0.93 in RfamSample and 0.94 in ArchiveII-Nr), when classification performance was evaluated using CLS-based embeddings.

Probabilistic signals in token-level logits: zero-shot RNA fitness prediction

Finally, we examined whether RNA LMs encode evolutionary and functional constraints through token-level logits by evaluating zero-shot RNA fitness prediction. Experimental mutational assays from the RNAGym benchmark [63], covering multiple classes of ncRNAs, were used for evaluation, with DNA LMs serving as reference controls. Model performance was quantified using MCC, AUC, and the absolute value of the Spearman correlation coefficient (SR).

Consistent with observations in ProteinGym [47] and RNAGym [63], the choice of scoring function had a pronounced impact on performance. As shown in Supplementary Tables 10 and 11, WT-LLRs substantially outperformed PLL-Ds across most RNA LMs and were used by nearly all top-performing models. This result highlights the importance of wild-type–anchored scoring for zero-shot RNA mutation effect prediction.

Overall performance patterns differed markedly from those observed in structural and classification tasks. As summarized in Fig. 4a (best results for each LM, using either the WT-LLR or PLL-D method), DNA LMs exhibited limited predictive power across all metrics. Models incorporating explicit evolutionary modeling or extended context lengths (e.g. EVO1.5) showed modest improvements but still underperformed RNA-specialized models. Unified models (LucaOne family) consistently outperformed DNA LMs, except for the EVO series, but remained inferior to many RNA-focused models.

Zero-shot RNA fitness prediction using token-level logits. (a) Overall performance of DNA LMs (D), Class III (unified, RIII), II (mRNA-related, RII), and I (ncRNA-enriched, RI) RNA LMs, measured by the absolute value of the Spearman correlation coefficient (SR) between predicted and experimental mutational effects. For models with multiple versions, only the version with the highest median SR was shown. Models within each category were ordered by increasing median SR. (b) Median SR across assays grouped by RNA type, illustrating RNA class–dependent performance variability.

The best performance was achieved by several Class I and Class II RNA LMs, including MP-RNA, AIDO.RNA, RNA-FM, SpliceBERT, and DGRNA, with median MCC values of 0.14–0.18, median AUC of approximately 0.60, and median SR values exceeding 0.20. MP-RNA and SpliceBERT show the strongest performance based on median (0.215) and mean (0.268) SR values, respectively. Nevertheless, absolute performance remained limited across all models (mean MCC < 0.20, AUC < 0.64, SR < 0.27), indicating that zero-shot RNA fitness prediction remains a challenging task. Model rankings based on MCC, AUC, and SR were highly consistent, with Spearman correlations of 0.99 between SR and the other metrics at the mean level.

Despite low overall performance, prediction accuracy varied substantially across RNA types. As shown in Fig. 4b, models with strong overall performance (e.g. SpliceBERT, DGRNA, EVO1.5, etc.) derived much of their predictive power from tRNA assays, where SR values exceeded 0.4, whereas performance on other RNA classes remained comparatively weak.

Discussion

Over the past decade, interest in RNA biology—particularly ncRNAs—has expanded rapidly due to their diverse regulatory and functional roles [1, 2, 8, 9]. Despite the central role of structure in determining RNA function, experimentally resolved RNA structures account for less than 0.1% of known sequences [111]. Inspired by the success of PLMs in capturing structure–function relationships from sequence alone, RNA LMs have emerged as a promising framework for large-scale RNA analysis [26, 27, 29, 30]. Here, we performed unified zero-shot evaluations of 21 RNA LMs to assess how effectively they encode structural, functional, and evolutionary information.

Consistent with prior observations, the limited nucleotide alphabet and weak sequence conservation of RNAs make structure inference from single sequences intrinsically challenging [26, 29, 112]. This difficulty is underscored by the strong performance of the MSA-based RNA-MSM model, which outperforms single-sequence RNA LMs despite being pretrained on a comparatively small number of RNA families. Although model scaling contributes to improved performance, it is not clear if simply expanding the number of parameters in existing RNA LMs would resolve the performance gap between MSA-based and single-sequence-based LMs.

A key finding of this study is the pronounced trade-off among structural, functional, and evolutionary representations. Models that perform well in zero-shot secondary structure prediction (e.g. RNA-km and MP-RNA) may exhibit weaker performance in embedding-based RNA classification or logit-based fitness prediction and vice versa. These results indicate that different biological signals are not automatically aligned in current RNA LMs and that pretraining strategies favoring one aspect may compromise others.

We further observe that unified or multiomics LMs do not consistently outperform RNA-specialized models on RNA-centric tasks. While such models may be advantageous for cross-modal applications, RNA-specific inductive biases appear critical for accurately capturing RNA structural and functional properties.

Together, these findings highlight substantial opportunities for improving RNA LMs. Although continued scaling is likely to yield incremental gains, more substantial progress will require advances in training objectives, data design, and inductive bias integration. Improving structural representations without degrading functional and evolutionary signals remains a central challenge. One possible solution is to develop domain-specific LMs that may offer some advantages for specific applications. Moreover, most current models treat RNA as composed solely of four canonical nucleotides, despite the widespread presence of chemical modifications that influence RNA folding and function. Incorporating RNA modification information therefore represents an important direction for future model development.

In summary, our results demonstrate that existing RNA LMs encode heterogeneous and task-dependent biological signals. A clearer understanding of how model architecture, training data, and objectives shape structural, functional, and evolutionary representations will be essential for developing next-generation RNA LMs that more faithfully reflect all aspects of RNA biology.

Key Points

RNA language models are evaluated using a strictly zero-shot framework, enabling assessment of intrinsic representational capacity independent of downstream fine-tuning.
Three zero-shot tasks—secondary structure inference, RNA classification, and mutational fitness prediction—are selected to explicitly correspond to attention weights, embeddings, and token-level logits, respectively.
Current RNA language models encode heterogeneous and task-dependent biological signals, with clear trade-offs between structural, functional, and evolutionary representations.
This work provides a standardized benchmark and evaluation strategy to facilitate more interpretable and reproducible comparisons of RNA language models.

Supplementary Material

bbag098_Supplymentary_Information_20260209

Bibliography112

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Chaudhary N, Weissman D, Whitehead KA. m RNA vaccines for infectious diseases: principles, delivery and clinical translation. Nat Rev Drug Discov 2021;20:817–38. 10.1038/s 41573-021-00283-5.34433919 PMC 8386155 · doi ↗ · pubmed ↗
2Rohner E, Yang R, Foo KS. et al. Unlocking the promise of m RNA therapeutics. Nat Biotechnol 2022;40:1586–600. 10.1038/s 41587-022-01491-z.36329321 · doi ↗ · pubmed ↗
3Curreri A, Sankholkar D, Mitragotri S. et al. RNA therapeutics in the clinic. Bioeng Transl Med 2023;8:e 10374. 10.1002/btm 2.10374.36684099 PMC 9842029 · doi ↗ · pubmed ↗
4Gayet RV, Ilia K, Razavi S. et al. Autocatalytic base editing for RNA-responsive translational control. Nat Commun 2023;14:1339. 10.1038/s 41467-023-36851-z.36906659 PMC 10008589 · doi ↗ · pubmed ↗
5Dykstra PB, Kaplan M, Smolke CD. Engineering synthetic RNA devices for cell control. Nat Rev Genet 2022;23:215–28. 10.1038/s 41576-021-00436-7.34983970 PMC 9554294 · doi ↗ · pubmed ↗
6Li B, Niu Y, Ji W. et al. Strategies for the CRISPR-based therapeutics. Trends Pharmacol Sci 2020;41:55–65. 10.1016/j.tips.2019.11.006.31862124 PMC 10082448 · doi ↗ · pubmed ↗
7Bora RS, Gupta D, Mukkur TKS. et al. RNA interference therapeutics for cancer: challenges and opportunities (review). Mol Med Rep 2012;6:9–15. 10.3892/mmr.2012.871.22576734 · doi ↗ · pubmed ↗
8Childs-Disney JL, Yang X, Gibaut QMR. et al. Targeting RNA structures with small molecules. Nat Rev Drug Discov 2022;21:736–62. 10.1038/s 41573-022-00521-4.35941229 PMC 9360655 · doi ↗ · pubmed ↗