Computational understanding of non-coding RNA pairwise interactions
Marco Nicolini, Federico Stacchietti, Elena Casiraghi, Giorgio Valentini

TL;DR
This paper introduces CUPID, a deep learning tool that predicts interactions between non-coding RNAs using only their sequence data.
Contribution
CUPID is the first deep learning framework that generalizes across various types of non-coding RNAs to predict pairwise interactions.
Findings
CUPID uses RNA sequence embeddings and a classifier to predict ncRNA interactions without thermodynamic models.
The model generalizes across different ncRNA types like long non-coding, circular, and micro-RNAs.
CUPID provides a scalable approach for studying RNA-based regulatory networks.
Abstract
Non-coding RNAs (ncRNAs) govern a vast network of regulatory interactions within the cells, yet their pairwise relationships remain largely uncharted due to the complexity of RNA structure and the limits of current experimental methods. We present CUPID (Computational Understanding of Pairwise Interactions in ncRNA Data), a deep learning framework that predicts ncRNA-ncRNA interactions directly from primary sequence information. CUPID uses embeddings from a pre-trained RNA language model combined with a feed-forward classifier to identify patterns linked to molecular pairing. This approach avoids reliance on thermodynamic models or manual feature design and, unlike previously proposed models, is able to generalize across different types of ncRNAs, including long non-coding, circular, micro-, and small nuclear RNAs. By learning the hidden rules that govern RNA recognition, CUPID provides…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8|
|
|
|
|---|---|---|
| Random baseline | 0.5 | 0.047 |
| IntaRNA | 0.544 | 0.055 |
|
| ||
| No-Data-aug | 0.658 | 0.078 |
| Data-aug-Max | 0.810 | 0.147 |
| Data-aug-concat | 0.862 | 0.222 |
| Data-aug-Avg |
|
|
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRNA and protein synthesis mechanisms · Machine Learning in Bioinformatics · RNA Research and Splicing
Introduction
1
Understanding RNA-RNA interactions is critical for deciphering the regulatory circuits that orchestrate gene expression, RNA processing, and signal transduction. Non-coding RNAs (ncRNAs), despite lacking protein-coding potential, play pivotal roles in these processes (Ali et al., 2021). However, experimental mapping of ncRNA interactions remains challenging due to the limitations of existing experimental and computational techniques (Lorenzi et al., 2021).
Methods such as Minimum Free Energy (MFE) calculations and accessibility-based models have been usually applied to predict RNA-RNA interactions. Tools like IntaRNA (Mann et al., 2017) estimate the interaction energy as ΔGtotal = ΔGduplex+ΔGaccessibility, where the first term quantifies the energy released upon hybridization, and the second accounts for the cost of rendering binding regions accessible. Benchmark studies have demonstrated that accessibility-based algorithms can effectively differentiate native interactions from background noise (Umu and Gardner, 2017), yet these approaches rely on predefined parameters and simplified energy models. In parallel, experimental techniques such as RNA Antisense Purification (RAP-RNA) offer validation but remain limited by their high cost and labor intensity (Engreitz et al., 2014).
Advances in machine learning and graph-based modeling for biological data, including recent work on explainability and diffusion-based attention mechanisms, have motivated a surge of learning-driven approaches for predicting interactions across diverse molecular systems (Gliozzo et al., 2025; Cetin and Sefer, 2025; Sefer, 2025).
Machine learning methods, such as convolutional neural networks, deep forests and graph neural networks (Alipanahi et al., 2015; Tian et al., 2021; Wei et al., 2022) have been applied to RNA-protein interaction prediction, while graph-based approaches embed heterogeneous networks of ncRNAs and diseases using multigraph contrastive learning (Sun et al., 2025) or apply random-walk based graph representation learning techniques to predict non coding RNA interactions (Torgano et al., 2025).
While effective, these methods often rely on predefined feature extraction, graph structures, or supervised training, limiting their adaptability to novel ncRNA sequences.
In contrast, LLMs can directly learn from large corpora of proteins or RNA data (Valentini et al., 2023; Zhao et al., 2024; Shen et al., 2024; Nicolini et al., 2025a), capturing intricate interaction motifs beyond predefined energy models or graph-based constraints. Unlike thermodynamic models, which impose simplifying assumptions, LLMs infer interaction likelihoods from latent structural patterns, offering a flexible, data-driven approach. In particular, transformer-based foundation models can generate biologically meaningful representations directly from raw sequences, by exploiting large RNA sequence corpora (Sapoval et al., 2022; Chen et al., 2022; Yu et al., 2024). More in general several deep learning methods have been proposed to predict specific ncRNA interactions, using rna2vec pre-training and deep feature mining (Yu et al., 2022) or conditional random fields and graph convolutional networks (Wang et al., 2022), heterogeneous graph neural networks (Li et al., 2025) and convolutional neural networks combined with a Transformer Encoder (Yang et al., 2025) for the prediction of miRNA-lncRNA interactions.
We also recently proposed a deep neural network trained on embedded representations of a subset of ncRNAs obtained from the RNA-FM language model (Shen et al., 2024), achieving state-of-the-art results for predicting miRNA interactions with other ncRNA molecules (Nicolini et al., 2025b). However, our proposed model, like other models recently proposed in the literature (Li et al., 2025; Yang et al., 2025), is only able to predict specific ncRNA interactions (e.g., interactions with miRNAs). Furthermore, due to limitations on the maximum allowed sequence length of the underlying RNA-FM transformer, it can only process sequences shorter than approximately 1,000 nucleotides, thus limiting the model's application to relatively long ncRNAs (e.g., lncRNAs).
To overcome these limitations, we propose a novel Transformed-based deep learning model, that, differently from previous models proposed in literature, is able to predict a large range of ncRNA interactions, including long non-coding RNA (lncRNA), circular RNA (circRNA), microRNA (miRNA), small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), Small Cajal body-specific RNAs (scaRNAs), small cytoplasmic RNAs (scRNA), and other types of ncRNAs. Moreover, by adopting GenerRNA (Zhao et al., 2024) to encode RNA sequences, our model can process full-length ncRNA sequences (up to 4,096 nucleotides) without truncation, thus significantly enlarging the set of ncRNAs that can be processed by the model.
We hypothesize that LLM-based contextual embeddings provide a rich representation for ncRNA interaction prediction, circumventing the limitations of manual feature engineering or predefined structural graphs. We reasoned that GenerRNA (Zhao et al., 2024), pretrained on a large corpus of ncRNA sequences using a masked language modeling objective, can capture long-range interactions of ncRNA molecules, thus facilitating downstream tasks such as ncRNA interaction prediction.
Our CUPID model (Computational Understanding of Pairwise Interactions in ncRNA Data), predicts ncRNA interactions using only sequence information. CUPID extracts embeddings from a pre-trained ncRNA language model and feeds a dense feed-forward neural network (FFNN) to automatically learn intricate sequence interaction features. This design circumvents the need for explicit thermodynamic parameterization and manually engineered features, offering a scalable and efficient alternative for uncovering novel regulatory interactions (Fabbri et al., 2019).
Methods
2
Dataset
2.1
Our dataset comprises a subset of multispecies ncRNA interaction pairs from RNA-KG (Cavalleri et al., 2024).1
The RNA-KG integrates physical and functional interactions between different types of ncRNAs, and their relationships with other biomolecules (genes and proteins) and chemicals, as well as with biomedical concepts coded in the Gene Ontology (Aleksander et al., 2023), the Human Phenotype Ontology (Gargano et al., 2024), Mondo (Vasilevsky et al., 2025), and other bio-medical ontologies related to the “RNA world.”
In particular, we extracted RNA–RNA edges from RNA-KG by selecting only relations annotated as interacts-with. In RNA-KG, interacts-with denotes experimentally supported physical RNA–RNA interactions, and we therefore excluded other relation types encoding functional associations (e.g., regulatory links, co-expression, or disease associations). The interacts-with edges integrated in RNA-KG originate from multiple underlying curated interaction databases. Figure 1 presents an overview of the main RNA entities and their relationships available in the the RNA-KG. Readers may refer to the RNA-KG reference (Cavalleri et al., 2024) for the complete list of contributing sources and evidence provenance.
Simplified representation of the RNA-KG meta-graph, focused on ncRNAs and their interactions.
We filtered the dataset to retain only sequences that fit within the GenerRNA (Zhao et al., 2024)'s token limit (approximately 4,096 nucleotides), since Byte Pair Encoding (BPE) compresses raw nucleotide sequences, allowing longer sequences to fit within the model's constraints. After applying this length filter, the dataset contains:
101, 088 interaction pairs (down from an initial 130, 310 pairs).11, 212 unique sequences (selected from 19, 624 potential sequences) belonging to 9 different RNA molecule types: long non-coding RNA (lncRNA), circular RNA (circRNA), microRNA (miRNA), small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), Small Cajal body-specific RNAs (scaRNAs), small cytoplasmic RNAs (scRNA), not (better) classified non coding RNA molecules (ncRNA), and pseudo RNA.2
In the following, we denote the set of length-filtered molecules as
where the type of each molecule is given by ϕ(s), i.e. represents a mapping of a ncRNA sequence to its ncRNA type , e.g. miRNA, lncRNA or any other ncRNA type.
The identity of an interaction pair is solely determined by its constituent molecules, regardless of order; that is,
The type of an interaction (si, sj) with si≠sj and is determined by the types of the ncRNA si and sj theirselves, regardless of their order:
For instance, possible types of ncRNA interactions are miRNA-lncRNA or miRNA-circRNA. Assuming that interacting ncRNA pairs of different types exhibit distinct specificities that the model should learn, we reasoned that types with negligible sample sizes might introduce noise rather than valuable information. Therefore, the set of interaction pairs used in this work is obtained by further filtering the dataset of interacting pairs to remove interacting pair types represented by fewer than 100 samples, resulting in 10,644 unique sequences composing 99,841 interacting pairs. Figure 2 shows the distribution of the different types of ncRNA interactions.
Distribution of ncRNA interactions pairs in the filtered interaction set. Rows: first (left) molecule type; Columns: right molecule type.
Data augmentation
2.2
To address the issues due to the limited cardinality of the available training data, especially for specific types of ncRNA interactions (e.g., snRNA-miRNA or miRNA-circRNA), we employ a data augmentation strategy that effectively increases the dataset size by a factor of 4. For each original training instance represented as a pair of interacting ncRNA (si, sj) we generate three additional augmented instances:
Molecule Order Reversal: Swap the order of the molecules: (sj, si).Sequence Flipping: Reverse the nucleotide order in both molecules (denoted by the superscript F): .Combined Augmentation: Reverse both the molecule order and the nucleotide sequences: .
Thus, if the original dataset contains N instances, the augmented dataset becomes: Naug = 4N (Supplementary Figure S1). This augmentation introduces invariance to both the order and orientation of sequences, thereby enabling the model to better capture the underlying biological patterns and improving its robustness against input variability.
In order to avoid leakage between training and test sets, data augmentation is performed after splitting the dataset.
Negative examples generation
2.3
In our dataset, only positive non-coding RNA-RNA interactions are explicitly provided, and they occur with varying frequencies.
To effectively train CUPID , we generated negative examples for each interaction pair type by matching the frequency distribution of the positive interactions. Specifically, negative examples were generated under the assumption that any pair of ncRNA sequences drawn from the set of unique sequences that is not observed as a positive interaction constitutes a possible negative instance.
Let be the set of unique ncRNA sequences present in the dataset. Denote by
the set of all positive ncRNA-ncRNA interactions. Then, the set of all possible ncRNA pairs is given by (excluding self-interactions).
The set of potential negatives is defined as:
Negative sampling procedure To generate the negative samples for each interacting pair type, we corrupt its tuples. In other words, given a positive pair (si, sj) with type (ϕ(si), ϕ(sj)), we keep the first molecule si fixed and sample such that:
In this way we avoid generating negatives between ncRNA types that never interact (e.g., scaRNA and lncRNA).
Because we generate negatives for each positive pair (si, sj) by corrupting the right molecule while keeping the same type pair (ϕ(si), ϕ(sj)), the negative set preserves the interaction type-pair distribution of the positives in expectation (and approximately in practice, up to rejection of candidates already present as positives or previously sampled negatives).
For each positive edge, we selected n negative edges, in order to control the imbalance between positive and negative edges in the testing phase (we set n = 20 in our experiments).
Negative sampling algorithm The negative sampling algorithm is detailed in Algorithm 1. In our implementation, we set n = 20. Note that, since the condition at line 5 of the algorithm cannot be always guaranteed, it is likely that the number of negatives n ≤ 20. In our experiments we set n = 20.
Algorithm 1Negative sampling algorithm.
Model architecture
2.4
The overall CUPID architecture
2.4.1
Our model follows a two-stage pipeline, as illustrated in Figure 3. It first extracts ncRNA sequence embeddings using a pre-trained ncRNA Language Model (GenerRNA; Zhao et al., 2024) and then processes these embeddings through a Feed-Forward Neural Network (FFNN) to predict interaction probabilities.
High-level CUPID architecture schema.
The GenerRNA architecture mimics the GPT-2-medium model (Radford et al., 2019), and is composed of 24 stacked transformer-decoder layers, each incorporating a self-attention mechanism that models pairwise interactions among all positions in its input sequence. GenerRNA uses a context window of 1,024 tokens, corresponding to input sequences with a length of approximately 4,096 nucleotides coded through byte pair encoding (Sennrich et al., 2016). Note that this maximum length permits the encoding of large RNA molecules. This decoder-only Transformer architecture operates in an autoregressive manner, predicting the subsequent token given the previous ones. Both the input and output of the model are represented as tokens, which are encoded and decoded by a trained tokenizer using byte pair encoding. A special token (EOS) is used to delimit sequences, indicating the start and end of each sequence.
Each transformer block is fed with a input of size L×H, thus allowing to process RNA sequences having up to L tokens, each one represented through a H-dimensional real vector, with L = H = 1, 024, and outputs a latent representation with the same dimensionality for each input token. For each input sequence, the block employs a multi-head self-attention mechanism with 16 attention heads. This is followed by an “Add & Norm” sub-block, which applies residual addition and layer normalization. Subsequently, a feed-forward sub-layer expands the hidden states from 1,024 to 4,096 dimensions, applies a non-linear activation (ReLU), and then projects them back to 1,024 dimensions. Another “Add & Norm” sub-block is applied after the feed-forward network, and finally, the block produces an output matrix X∈ℝ^L×H^. A schematic diagram of this block is reported in the Figure 4.
High-level architecture of a GenerRNA block.
Pooling techniques
2.4.2
The i^th^ row of matrix X is a latent representation of the i^th^ token. To obtain a fixed-length embedding for the entire sequence, we tested two types of pooling over the sequence (i.e., across the H tokens), as well as their concatenation:
Average (Avg) Pooling: obtained as the mean of the embeddings of all the tokens: Maximum (Max) Pooling: Compute the element-wise maximum over all token embeddings: , where cmax is the columnwise max operator, and xij are the elements of the X embedding matrix.Concatenation of [Avg, Max]: Combine both pooled representations into a single embedding vector:
These pooling strategies are schematically depicted in Figure 5.
Pooling embedding strategies.
The embedded representation of a candidate interacting ncRNA pair is composed by the concatenation of the embeddings of the two interacting molecules.
The classification unit
2.4.3
To predict the interaction we used the pooled embeddings of the RNA sequences as input to a dense Feed Forward Neural Network (FFNN), having the following architecture:
Input layer dimension: 1,024 for Avg and Max-pooling embedding strategies, 2,048 when the embedding of the input molecule is obtained by concatenating the embeddings obtained by AVG and Max pooling,Hidden layers: 4 hidden layers with 1,024 neurons each and ReLU activation function,Output layer: 1 neuron with sigmoid activation function.
To train the network we applied the following hyper-parameters:Learning Rate: η = 5 × 10^−4^ with a linear warm-up phase of 4 epochs, followed by cosine decay. Epochs: 50 epochs with early stopping (patience of 10 epochs). The model with the best validation loss is selected (e.g., if the lowest validation loss is observed at epoch 35, then early stopping is triggered at epoch 45, and the model from epoch 35 is used).Batch Size: 512. Dropout Rate: 0.2. Optimizer: Adam. Loss Function: Binary Cross-Entropy.
Training and validation loss curves were monitored over epochs to assess model convergence and to avoid potential overfitting by early stopping.
Mini-batch balancing
2.4.4
Due to the imbalance in our dataset we adopted a training strategy designed to prevent the model from learning predominantly from the negatives. To address this, we constructed mini-batches that contain a controlled mix of positive and negative examples. Recall that our training set is composed of the set of positive interaction pairs, , and the set of negative interaction pairs , with , as detailed in Section 2.3. Each mini-batch B of size m is formed by randomly selecting mp positive examples (using a uniform distribution with replacement) and mn negative examples (using a uniform distribution without replacement). The ratio of negatives within each mini-batch is defined by
Here, r can vary between 0 and 1. A value of r = 0.7 implies that 70% of the mini-batch consists of negatives. The choice to sample positives with replacement is driven by their limited number, ensuring sufficient representation even in large batches, whereas sampling negatives without replacement allows for a broader coverage of these more abundant examples.
Experimental evaluation
2.5
Data preparation and splitting
2.5.1
In all our experiments the negative examples were sampled according to the relative frequency of the interacting pair types according to the procedure described in Section 2.3 using a negative:positive ratio equal to 20:1.
The dataset was partitioned into stratified training and test sets (train:test ratio = 90:10). The training set was further split into a stratified set for training (80% of interaction pairs) and the remaining 20% for validation. The validation set was used for early stopping and for tuning the classification threshold via maximization of the Matthews correlation coefficient (MCC; Matthews, 1975) on the validation data.
Evaluation metrics
2.5.2
To comprehensively assess model's performance, we computed a range of evaluation metrics, encompassing both threshold-dependent and threshold-independent measures. Specifically, we first evaluated standard binary classification metrics, including accuracy, balanced accuracy (to account for class imbalance), precision, recall, F1 score, AUROC (Area Under the Receiver Operating Characteristic Curve), and AUPRC (Area Under the Precision-Recall Curve). In addition to these overall metrics, we conducted a stratified analysis based on interacting pair types, computing the aforementioned measures separately for each pair type.
Let yi∈{0, 1} be the ground-truth label and the predicted probability for sample i. Given a decision threshold t, we define and the confusion matrix counts:
Threshold-dependent metrics are then computed as:
In our work the threshold t is chosen on the validation set by maximizing the MCC coefficient, which provides a balanced single-score summary that incorporates TP, TN, FP, and FN, and is therefore less sensitive than accuracy to class imbalance:
Threshold-independent metrics summarize performance across all thresholds. The ROC curve plots TPR(t) = Recall(t) versus FPR(t) = FP(t)/(FP(t)+TN(t)), and AUROC is the area under this curve. The precision–recall curve plots Precision(t) versus Recall(t), and AUPRC is its area; under strong class imbalance, AUPRC is often more informative than AUROC, with a random baseline equal to the positive prevalence .
Training hyper-parameters and baselines for comparison
2.5.3
The hyper-parameters and configurations used for training the FFNN are reported in Section 2.4.3. Moreover, training and validation loss curves were monitored over epochs to assess model convergence and to avoid potential overfitting by early stopping.
Hyperparameter selection was performed in preliminary experiments on a reduced subset of the training/validation interaction pairs using a grid-search strategy. We varied the number of hidden layers in {2, 4, 6}, the dropout rate in {0.1, 0.2}, and the batch size in {16, 512, 1, 024}. For each configuration, models were trained using the same optimization settings described in Section 2.4.3, and the final model was selected as the configuration that maximized validation AUPRC. No hyperparameters were tuned on the test set.
Besides the random classifier, whose expected performance are AUROC = 0.5 and AUPRC = 0.047, we employed the IntaRNA method (Mann et al., 2017) as a baseline for comparison. IntaRNA estimates interaction energy. While interaction energy can be thresholded to obtain binary predictions, which enable the computation of accuracy, balanced accuracy, precision, recall, and F1 scores, we opted to limit the comparison to AUROC and AUPRC. These metrics provide a more robust and threshold-independent evaluation of predictive performance, ensuring a fair comparison across models.
Results
3
We assessed the contribution of the data-augmentation strategy and the pooling operation used to obtain molecule-level embeddings. Table 1 summarizes AUROC and AUPRC results across all configurations, including a baseline random classifier, IntaRNA and CUPID models. For CUPID we compared results obtained with (Data-aug) and without (No-Data-aug) data augmentation, considering different pooling techniques, i.e., concatenation (concat), maximum (Max), and average (Avg) pooling.
Random baselines
3.1
With a random classifier we can expect an AUROC = 0.5, while the estimated baseline AUPRC is:
where N+ is the number of positive samples, and N− is the number of negative samples. Given the 1:20 ratio of positive to negative samples, the AUPRC baseline in the performed experiments is:
Our top-performing model achieves an AUPRC of 0.364, corresponding to a 7.65-fold improvement (0.364/0.0476). This margin quantifies the difficulty of the task: the extreme class imbalance renders precision–recall a stringent metric, and the observed gains indicate that the model extracts interaction-relevant information that is well above chance expectations.
IntaRNA results
3.2
Figure 6 reports IntaRNA performance on the augmented test set. In this setting, IntaRNA shows limited predictive power. Its scoring function relies on thermodynamic and accessibility components (e.g., hybridization energy and site accessibility), and in our experiments we used the default parameterization. Given the heterogeneity of ncRNA classes and sequence lengths in our benchmark, improved performance would likely require careful, class-specific calibration of both energy- and accessibility-related settings. Moreover, while IntaRNA is a general thermodynamics- and accessibility-based framework and is not inherently tied to a specific organism, it was originally introduced and most extensively evaluated in bacterial sRNA–mRNA interaction settings; consequently, when applied to heterogeneous ncRNA–ncRNA interactions (including long lncRNAs and diverse eukaryotic classes), its default parameterization may be suboptimal without additional tuning.
Results for IntaRNA results with augmented test set. (a) Distribution of predicted probabilities for negative and positive interactions; (b) AUPRC; (c) AUROC.
CUPID results
3.3
Table 1 compares all CUPID configurations. We first evaluated a CUPID model without augmentation, using concatenation of average and max pooling. Figure 7 shows the results obtained without data augmentation and with concatenated average-max pooling. The overall AUPRC results on the test set are relatively low (Figure 7c), even if a certain learning is witnessed by the AUROC largely above 0.5 (Figure 7f), and by the distribution of the predicted interaction probabilities for negative and positive examples (Figure 7c), with probabilities for positives relatively higher with respect to negatives. Nevertheless, the relatively flat trend of the training loss reveals a certain difficulty of the model to learn the data. This is reflected also in the confusion matrix where most of negative examples (70%) are misclassified (Figure 7e) and in the degradation of the AUPRC performance between validation (Figure 7a) and test (Figure 7d) data. By looking at specific ncRNA interactions, for certain interaction types (e.g., snRNA-snoRNA) we obtained good results across the different metrics, but for several ncRNA interactions (e.g., miRNA-lncRNA, miRNA-miRNA, and lncRNA-snoRNA) we achieved poor results, with AUPRC below 0.1 (Figure 7g). Summarizing Figure 7, shows that with this setting CUPID can provide a certain discrimination between positive and negative interactions (Figure 7c), but its precision–recall and ROC curves indicate a limited separation between positive and negative examples (Figures 7d, f).
CUPID results with concatenated pooling, and without using augmented data. (a) Overall precision recall curve on the validation set including all the type of ncRNA interactions; (b) training and validation loss across epochs; (c) distribution of the CUPID predicted probabilities on negative and positive examples on the test set; (d) overall precision recall curve on the test set including all the type of ncRNA interactions; (e) confusion matrix on the test set; (f) ROC curve on the test set including all the type of ncRNA interactions; (g) CUPID results on the test set across different types on ncRNA interactions (rows) for different types of metrics (columns).
Introducing data augmentation consistently improves performance (Table 1). Figure 8 shows the results obtained with data augmentation and average pooling. The AUPRC is more than 4 times larger than without data augmentation (Figure 8d and Table 1). Enlarging the size of training data by data augmentation allows the model to better learn the training data, as witnessed by the training loss that continues to decrease across epochs (Figure 8b). This results in a clear separation between the scores predicted for positive and negative examples—note that the probabilities predicted for negatives are compressed toward zero while for most positives are largely above 0.7 (even if with several outliers for both positive and negative examples, Figure 8b). The confusion matrix also confirms that the model with augmented data can better predict negative examples (Figure 8e); AUPRC (Figure 8d) significantly improves, and AUROC is larger than 0.9 (Figure 8f). Analyzing results for each specific ncRNA interaction, we can observe a significant improvement across all the considered metrics, with AUROC in most cases larger than 0.9, except for circRNA-miRNA, miRNA-scRNA, miRNA-snRNA, and miRNA-scaRNA (even if for these two last ncRNA interactions values are close to 0.9 (Figure 8g).
CUPID results with average pooling and using augmented data. (a) Overall precision recall curve on the validation set including all the type of ncRNA interactions; (b) training and validation loss across epochs; (c) distribution of the CUPID predicted probabilities on negative and positive examples on the test set; (d) overall precision recall curve on the test set including all the type of ncRNA interactions; (e) confusion matrix on the test set; (f) ROC curve on the test set including all the type of ncRNA interactions; (g) CUPID results on the test set across different types on ncRNA interactions (rows) for different types of metrics (columns).
These results confirm that data augmentation is crucial to improve results for two main reasons: at first the model has training data enough to better generalize; second, improves generalization leveraging molecule order and orientation, two symmetries that are not guaranteed to be learned from limited training data. Augmentation effectively enforces these invariances, reducing overfitting to sequence presentation and mitigating the scarcity of positive examples.
Pooling strategy has a direct impact on the stability of the molecule-level embedding. Average pooling–yielding a smoothed representation over the full sequence—achieves the highest AUROC and AUPRC (Figure 8) compared to max pooling (Supplementary Figure S2) and concatenation pooling (Supplementary Figure S3). This indicates that interaction-relevant information is not confined to a small set of token embeddings but arises from distributed features along the sequence. Max pooling, in contrast, appears sensitive to local outliers and overly compresses positional variability, while concatenation does not provide additional benefits once augmentation is introduced. The results suggest that, for ncRNA interaction prediction, the aggregate signal across nucleotides is more informative than isolated high-activation sites.
Discussion
4
The results shown in this work demonstrate that RNA sequence-only inference can recover interaction signals across diverse ncRNA classes. The best-performing configuration reaches AUROC values above 0.9 on the test set, despite operating without structural, evolutionary, or thermodynamic information. This suggests that pretrained RNA language models encode latent features associated with intermolecular recognition. These features may reflect statistical regularities of pairing propensities and local compositional biases captured during pretraining, even in the absence of explicit structural supervision.
From a methodological standpoint, two contributions appear essential. First, the augmentation scheme addresses symmetries inherent to the problem. Because interacting RNAs can be presented in either order, and because sequence orientation can vary, enforcing invariance to these transformations is critical for robust generalization. Data augmentation also increases the number of examples available for training, thus improving the generalization performance of the model. Second, average pooling provides stable embeddings for ncRNA sequences. For molecules such as lncRNAs—whose functional elements are dispersed and whose lengths vary over orders of magnitude—summarizing the full sequence avoids overemphasis on isolated positions and instead captures global contextual tendencies. Moreover, to our knowledge, CUPID is the first model able to predict a large set of ncRNA interactions, and in principle can be applied to predict any ncRNA interaction.
The limitations observed for IntaRNA highlight the difference between energy-based and representation-based approaches. Thermodynamic models rely on explicit structural motifs and accessibility assumptions, which may not generalize to long, structured, or poorly conserved ncRNAs. In contrast, CUPID does not attempt to reconstruct secondary structure but leverages contextual sequence statistics learned from large corpora. These complementary perspectives suggest potential synergies: coupling language-model embeddings with coarse structural predictions could refine the discrimination between spurious and functionally relevant pairing events.
Despite these promising results, we note that the resources used to train CUPID are limited in size and exhibits a strong imbalance across interaction types. Although our type-constrained negative sampling preserves the empirical distribution of interaction types, rare types remain challenging; they can yield higher-variance estimates and may prevent the model from learning robust type-specific patterns. Accordingly, we emphasize AUPRC in our per-type analyses, as it is generally more informative than AUROC under severe class imbalance. Future work will benefit from larger and more balanced interaction resources, and could further improve stability on underrepresented classes via targeted strategies such as class-aware reweighting, resampling, or cost-sensitive objectives.
As larger ncRNA catalogs become available through resources such as RNAcentral (Sweeney et al., 2020), and as experimental protocols expand the coverage of ncRNA–ncRNA interactions, the training regime of models like CUPID can be scaled accordingly. Future developments may integrate longer receptive fields, explicit cross-attention between molecules, or joint fine-tuning on experimentally resolved interactomes. These extensions could help reveal constraints underlying ncRNA recognition and improve the resolution of regulatory maps in eukaryotic transcriptomes.
In addition, while our study focuses on a resource-efficient paradigm that leverages pre-trained RNA language models with a lightweight interaction-specific prediction head, it would be interesting to complement our analysis with baselines that train a long-context Transformer from scratch. We did not include such a baseline here because, under the current supervision regime (approximately 10^5^ interaction pairs after filtering), end-to-end training from random initialization may be difficult to optimize and may not yield generalizable representations. As larger and more diverse labeled interaction resources become available, systematic comparisons between pre-trained and from-scratch Transformer encoders will become increasingly informative.
A similar consideration holds when considering studies substituting RNA-LM models with several Transformer-based nucleotide language models. While these models could, in principle, be considered as alternative backbones for RNA sequence embeddings [e.g., models pre-trained predominantly on DNA such as Nucleotide Transformer, which has been reported to transfer RNA-related signals (Dalla-Torre et al., 2025)], we selected GenerRNA because it is pre-trained specifically on RNA sequences, provides a long-context representation and it is expected to better capture RNA-class-specific features. We therefore expect RNA-specialized pre-training to yield representations that are more directly tailored to RNA sequence regularities than more generic DNA-pre-trained alternatives, even when the latter can capture some RNA features. In this work, we focused on characterizing the proposed interaction-prediction pipeline using a single RNA-specialized backbone, including ablations on augmentation and pooling. As larger and more diverse interaction resources become available, it will be important to benchmark GenerRNA in a zero-shot setting against more general nucleotide Transformers, and to evaluate both backbones also after task-specific fine-tuning.
In summary, the results show that CUPID provides a scalable sequence-based framework for ncRNA–ncRNA interaction prediction, achieving AUROC larger than 0.9 for several types on ncRNA interactions. Its performance, robustness to class heterogeneity, and limited dependence on domain-specific priors make it suitable for large-scale in silico screening and for guiding targeted experimental profiling of ncRNA regulatory networks.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Aleksander S. A. Balhoff J. Carbon S. Cherry J. M. Drabkin H. J. Ebert D. . (2023). The gene ontology knowledgebase in 2023. Genetics 224:iyad 031. doi: 10.1093/genetics/iyad 03136866529 PMC 10158837 · doi ↗ · pubmed ↗
- 2Ali S. A. Peffers M. J. Ormseth M. J. Jurisica I. Kapoor M. (2021). The non-coding RNA interactome in joint health and disease. Nat. Rev. Rheumatol. 17, 692–705. doi: 10.1038/s 41584-021-00687-y 34588660 · doi ↗ · pubmed ↗
- 3Alipanahi B. Delong A. Weirauch M. T. Frey B. J. (2015). Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838. doi: 10.1038/nbt.330026213851 · doi ↗ · pubmed ↗
- 4Cavalleri E. Cabri A. Soto-Gomez M. Bonfitto S. Perlasca P. Gliozzo J. . (2024). An ontology-based knowledge graph for representing interactions involving RNA molecules. Sci. Data 11:906. doi: 10.1038/s 41597-024-03673-739174566 PMC 11341713 · doi ↗ · pubmed ↗
- 5Cetin S. Sefer E. (2025). A graphlet-based explanation generator for graph neural networks over biological datasets. Curr. Bioinform. 20, 840–851. doi: 10.2174/0115748936355418250114104026 · doi ↗
- 6Chen J. Hu Z. Sun S. Tan Q. Wang Y. Yu Q. . (2022). Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions. ar Xiv preprint ar Xiv:2204.00300.
- 7Dalla-Torre H. Gonzalez L. Mendoza-Revilla J. Lopez Carranza N. Grzywaczewski A. H. Oteri F. . (2025). Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat. Methods 22, 287–297. doi: 10.1038/s 41592-024-02523-z 39609566 PMC 11810778 · doi ↗ · pubmed ↗
- 8Engreitz J. M. Sirokman K. Mc Donel P. Shishkin A. A. Surka C. Russell P. . (2014). RNA-RNA interactions enable specific targeting of noncoding RN As to nascent pre-m RN As and chromatin sites. Cell 159, 188–199. doi: 10.1016/j.cell.2014.08.01825259926 PMC 4177037 · doi ↗ · pubmed ↗
