All-at-once RNA folding with 3D motif prediction framed by evolutionary information

Aayush Karan; Elena Rivas

PMC · DOI:10.21203/rs.3.rs-5664139/v1 · March 26, 2025

Open Access

TL;DR

This paper introduces a new method for predicting RNA 3D structures by combining evolutionary data with motif prediction in a single process.

Contribution

The novel contribution is a probabilistic grammar, CaCoFold-R3D, that predicts RNA 3D motifs and secondary structure simultaneously using evolutionary information.

Findings

01

CaCoFold-R3D reliably identifies canonical helices and non-Watson-Crick motifs using covariation in RNA alignments.

02

The method can predict over fifty known RNA motifs in any non-helical loop region, including complex junctions.

03

CaCoFold-R3D is shown to be a fast and customizable alternative for predicting RNA 3D structures.

Abstract

Structural RNAs exhibit a vast array of recurrent short 3D elements involving non-Watson-Crick interactions that help arrange canonical double helices into tertiary structures. We present CaCoFold-R3D, a probabilistic grammar that predicts these RNA 3D motifs (also termed modules) jointly with RNA secondary structure over a sequence or alignment. CaCoFold-R3D uses evolutionary information present in an RNA alignment to reliably identify canonical helices (including pseudoknots) by covariation. We further introduce the R3D grammars, which also exploit helix covariation that constrains the positioning of the mostly non-covarying RNA 3D motifs. Our method runs predictions over an almost-exhaustive list of over fifty known RNA motifs (everything). Motifs can appear in any non-helical loop region (including 3-way, 4-way and higher junctions) (everywhere). All structural motifs as well as the…

Taxonomy

Topics: RNA and protein synthesis mechanisms · RNA modifications and cancer · RNA Research and Splicing

Full text

Introduction

Many noncoding RNAs (ncRNAs) play essential roles in cellular processes by means of conserved 3D structures [23]. Accurately determining the 3D structure of an RNA is a window into inferring its molecular mechanism of action.

RNA structure is hierarchical. Canonical base pairs (cis-Watson-Crick A:U, G:C and G:U wobble pairs) stack together as double helices and pseudoknots, forming the secondary structure. Critical loops and junctions connect these helices and arrange them into a 3D structure. These non-helical linker regions, called RNA 3D motifs [38] or modules [13], have been extensively studied in the literature [75, 81, 30, 80, 36, 37, 2, 48, 73, 25, 49, 31, 27, 16, 20, 15] for their importance in accurately characterizing full RNA structure. RNA 3D motifs have recurrent properties: they are typically short; they include recurrent patterns of non-Watson-Crick base pairs resulting in complex and distinctive 3D architectures; and often they also display conserved sequence patterns. Their structural properties are usually independent of the helices they are connected to; thus, identifying 3D motifs alongside secondary structure provides important additive clues that guide the assembly of a full RNA structure from its sequence.

RNA 3D motifs (modules) are inherently difficult to detect due to their short size (often between 4 and 20 nucleotides), sequence variability within motif types, and their sheer variety (more than 30 well-categorized motifs have been identified in RNA crystal structures [2] from the PDB [5]). They can also be discontinuous in linear sequence, and they can appear in internal loops or junctions where the fragments composing the motif are hundreds of nucleotides apart. Important efforts have been made to extract RNA 3D motifs from crystal structures, and to create databases of RNA 3D motifs, such as: RAG [21, 83], FR3D Motif library [67], RNA FRABASE [54], RNA 3D Motif Atlas [53], RNA Bricks [9], CaRNAval [55], LORA [6], D-ORB [17], and ARTEM [3]. Based on this knowledge, several methods have been developed to predict RNA 3D motifs from sequence, such as RMDetect [13], JAR3D [84], RMfam [50], and BayesPairing2 [66].

However, these methods are not fully integrated with secondary structure prediction. Several methods [13, 84, 66, 42] are indirectly guided by secondary structures predicted by standard thermodynamic methods [41, 58]. But because those thermodynamic methods cannot incorporate similar parameters for the 3D motifs, the prediction of motifs cannot be integrated with that of canonical base pairs. In fact, the inputs required can be quite demanding: e.g., [84] requires that the loop regions to be tested for the presence of motifs are provided, while [66] trains over annotated motifs in one family for prediction, getting the most competitive results only when the train and test family are the same. Furthermore, previous techniques [13, 22, 84] are computationally expensive, making independent predictions for one motif at a time. This also restricts the diversity of motifs that can be predicted, which is often limited to hairpin and internal loop motifs [13, 84].

Here, we introduce CaCoFold-R3D, a computationally fast probabilistic model that jointly predicts the RNA 3D motifs and the secondary structure present in a structural RNA. CaCoFold-R3D is grounded on the power of covariation in alignments as inputs. While covariation is not prominent in RNA 3D motifs, the covariation found in canonical helices constrains the space where these 3D motifs can occur, and R-scape’s covariation analysis [61] assigns statistical significance as to whether its predictions are evolutionarily conserved RNA structures [60]. Methods such as RMDetect [13] and BayesPairing2 [66] also use alignments, but they do not provide statistical significance for their predictions.

Another important feature of CaCoFold-R3D is the exclusive use of probabilistic modeling, which naturally facilitates the integration of the prediction of RNA 3D motifs with that of the RNA secondary structure. Several existing methods use probabilistic modeling of RNA 3D motifs, but they do not integrate those with the predictions of canonical base pairs [13, 22, 84]. CaCoFold-R3D deploys an array of stochastic context-free grammars (SCFGs), to model the structural architecture, and profile hidden Markov models (HMMs), to model sequence homology, that incorporate a large variety of motifs. Accounting for sequence variability, we predict over a total of 96 motifs that can be present in any loop region, including hairpins, bulges, internal loops, and multiloops. In addition, the CaCoFold-R3D grammar is designed to generate not just individual sequences but probabilistic sequences representing the columns of an alignment. This important feature allows the modeling of sequence variations within the motif.

CaCoFold-R3D serves as a structural paradigm for a new class of probabilistic RNA folding algorithms that directly integrates the prediction of multiple RNA 3D motifs with that of canonical helices, as well as triplets and other long-range interactions, all of that constrained by the covariation found in the input alignments.

Results

CaCoFold-R3D: Prediction of RNA 3D motifs constrained by covariation

Figure 1 describes the overall CaCoFold-R3D method. The input is a sequence or alignment, and the output is an RNA structure that includes RNA 3D motifs, canonical helices (both nested and pseudoknotted), as well as other tertiary base pairing interactions, provided that they have covariation evidence.

From an alignment, R-scape identifies a set of positive base pairs that significantly covary above phylogenetic expectation and a set of negative pairs that are not expected to form because their variability is not reflective of them being base paired [61, 62]. We have previously shown that the accuracy of RNA structure prediction improves significantly by using covariation information as prediction constraints [59]. Crucially, though, CaCoFold-R3D not only uses covariation to constrain secondary structure prediction, but also uses the covariation-bound secondary structure to constrain the location of RNA 3D motifs via an integrated stochastic context-free grammar (SCFG).

Specifically, CaCoFold-R3D splits the covarying pairs into layers, each with the maximum number of nested pairs, until all positive pairs have been taken into account. The first layer includes the maximal number of covarying nested base pairs, and is folded into the main secondary structure. The rest of the layers are expected to identify pseudoknotted canonical helices and other tertiary base pair interactions, provided that they have covariation support. CaCoFold-R3D introduces a novel SCFG called RBGJ3J4-R3D to describe the first layer, where the main structure is predicted. RBGJ3J4-R3D jointly infers the collection of nested canonical helices along with the RNA 3D motifs found within the loop regions (Figure 1) via a maximum probability parsing facilitated by dynamic programming.
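The layering step can be illustrated with a small sketch. It is not the CaCoFold implementation: it assumes that pairs are given as (i, j) position tuples, that a position is used at most once within a layer, and that a Nussinov-style recursion is an acceptable stand-in for whatever algorithm the software uses. The function names `max_nested_subset` and `split_into_layers` are hypothetical.

```python
from functools import lru_cache

def max_nested_subset(pairs):
    """Return a largest subset of `pairs` with no crossings and no shared positions.

    Illustrative Nussinov-style recursion over the pair endpoints only;
    it is recursive, so it is meant for small toy inputs.
    """
    pairs = sorted({(min(a, b), max(a, b)) for a, b in pairs})
    coords = sorted({pos for pair in pairs for pos in pair})
    index = {pos: k for k, pos in enumerate(coords)}
    partners = {}
    for i, j in pairs:
        partners.setdefault(index[i], []).append(index[j])

    @lru_cache(maxsize=None)
    def best(lo, hi):
        if lo >= hi:
            return ()
        # Option 1: leave endpoint `lo` unpaired in this layer.
        choice = best(lo + 1, hi)
        # Option 2: pair `lo` with one of its partners k, then solve inside and outside.
        for k in partners.get(lo, []):
            if k <= hi:
                cand = ((lo, k),) + best(lo + 1, k - 1) + best(k + 1, hi)
                if len(cand) > len(choice):
                    choice = cand
        return choice

    return [(coords[a], coords[b]) for a, b in best(0, len(coords) - 1)]

def split_into_layers(pairs):
    """Peel off maximal nested layers until every pair has been assigned."""
    remaining = {(min(a, b), max(a, b)) for a, b in pairs}
    layers = []
    while remaining:
        layer = max_nested_subset(remaining)
        layers.append(sorted(layer))
        remaining -= set(layer)
    return layers

# Toy example: three nested pairs plus two pairs that cross them (a pseudoknot).
layers = split_into_layers([(1, 20), (2, 19), (3, 18), (10, 30), (11, 29)])
print(layers)
```

In this toy input the first layer collects the three nested pairs and the second layer collects the two crossing pairs, mirroring the role of the first layer as the main nested structure.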

CaCoFold-R3D has a collection of uniquely defining properties: (1) the method can handle most types of motifs occurring in hairpin loops, internal loops, or multiloops, (2) all motifs are predicted at once and under one unique probabilistic model, and (3) the model can fold entire alignments, taking into account RNA 3D motif sequence variability even within a given structural RNA family.

RBGJ3J4-R3D: Joint prediction of nested helices and 3D motifs with one single SCFG

The RBGJ3J4-R3D model described in Figure 2 is an SCFG that simultaneously infers the secondary structure of nested canonical helices as well as the RNA 3D motifs present in any of the loop regions. It combines a grammar called RBGJ3J4 (Methods and supplemental Figure S1) with a library of RNA 3D motif grammars called R3D, described in the next section. RBGJ3J4 is unique in that it has specific descriptions for 3-way and 4-way junctions, which are the most frequent multiloops found in RNA structures. These junctions form many different RNA 3D motifs present in important RNA molecules, such as the hammerhead 3-way junction [26] and the four-way junction of the hepatitis C virus IRES [45].

RBGJ3J4-R3D creates specific R3D grammar models (i.e. grammar non-terminals) for each of the different loop motifs. To incorporate these motif non-terminals into the RBGJ3J4 grammar, we simply add the motif SCFGs as additional productions along with a generic loop motif (Figure 2). Motif designs are added for six classes of loops: hairpin (HL), bulge (BL), and internal loops (IL) as well as 3-way (J3) and 4-way (J4) junctions, and general branch motifs (BS) that can appear in any branch of any higher order multiloop.

Training.

The parameters of the RBGJ3J4 grammar (Figure 2a) have been trained by maximum likelihood using TORNADO [63] on a large and diverse set of known RNA structures and sequences.

Regarding the RBGJ3J4-R3D parameterization of the R3D motif states (Figure 2b), the RBGJ3J4 probability of a given loop type is distributed between the generic loop state and the whole R3D motif class, which gets assigned a fraction of it. Those fractions, set by human curation, are (0.4, 0.4, 0.5, 0.2, 0.2, 0.2) for the HL, BL, IL, J3, J4, and BS motif types respectively. Then, applying the maximum entropy principle, all specific R3D motifs in one class are given the same probability of occurring. For instance, the probability of forming a generic hairpin loop in the trained RBGJ3J4 is 0.3475; thus RBGJ3J4-R3D assigns 0.2085 to the generic hairpin loop, and 0.1390 is distributed equally over all defined hairpin loop motifs (15 in the current implementation), so each HL motif gets assigned a probability of 0.0093. These parameters could be trained by maximum likelihood from datasets of RNA structures annotated with the 3D motifs.
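The hairpin loop arithmetic above can be reproduced with a few lines. This is only a worked check of the numbers quoted in the text (loop probability 0.3475, HL fraction 0.4, 15 HL motifs); the function name `split_loop_probability` is hypothetical and not part of the software.

```python
def split_loop_probability(p_loop, motif_fraction, n_motifs):
    """Split a loop-type probability between the generic loop and a motif class.

    p_loop: probability of the loop type in the trained base grammar.
    motif_fraction: curated fraction of p_loop given to the whole motif class.
    n_motifs: number of motifs in the class; each gets an equal share.
    """
    p_class = p_loop * motif_fraction
    p_generic = p_loop - p_class
    p_each = p_class / n_motifs if n_motifs else 0.0
    return p_generic, p_class, p_each

# Values quoted in the text for hairpin loops (HL).
p_generic, p_class, p_each = split_loop_probability(0.3475, 0.4, 15)
print(round(p_generic, 4), round(p_class, 4), round(p_each, 4))
```

The three rounded values correspond to the generic hairpin loop, the whole HL motif class, and each individual HL motif.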

Next we describe the specific R3D models for all six different loop classes.

R3D: Six architectures to describe 3D motifs in all types of RNA loops

Now we introduce the R3D grammars, which incorporate an arbitrary number of 3D motifs in any arbitrary loop region into the folding grammar. Integrating the R3D grammars with the RBGJ3J4 grammar gives one SCFG jointly modeling both secondary structure and motifs (Figure 2).

The key insight behind the R3D grammars is to realize that RNA 3D motifs have a structural component determined by the set of (mostly conserved) non-Watson-Crick pairs that characterize the motif, and also a sequence-based component as many 3D motifs also conserve particular residue identities. The R3D grammars describe the structural component of a motif using profile SCFGs specific for each type of motif (Figure 3a–3f), and the sequence component with customized profile hidden Markov models that allow for sequence variability (Figure 3g).

The key that makes the R3D grammar affordable is that, unlike other methods such as RMDetect [13] or BayesPairing2 [66], R3D does not attempt to model each of the actual non-Watson-Crick base pairs individually (which can be quite complicated and non-nested). R3D instead models groups of residues that are correlated because of their underlying non-Watson-Crick base pairing. This induces a segmentation of a motif into continuous subsequences (modeled by profile HMMs) involved in specific correlations (modeled by the SCFGs). This decomposition allows one model to describe all motifs of one given type, giving rise to a generalized R3D grammar per motif type (Figure 3). The SCFG states can generate multi-residue strings using specific profile HMMs. For each motif, the residues that constitute each segment depend on the consensus sequences of the motif.

We consider six different types of structural motifs, based on whether they occur in hairpin loops (HL), bulge loops (BL), or internal loops (IL), as well as in 3-way junctions (J3), 4-way junctions (J4), or branch segments (BS) that can occur in any junction. Each of the six general R3D SCFG models in Figure 3a–3f has a particular SCFG architecture describing the interactions present in each motif. We now detail the segmentation method per type of motif, along with the corresponding grammar rules.

Hairpin Loop motifs.

Hairpin loop (HL) 3D motifs include both residues paired through non-Watson-Crick interactions and unpaired ones. For instance, the GNRA tetraloop [28] is a frequent hairpin loop motif in which the first G base forms two non-Watson-Crick interactions with the R and A bases, which provides extra thermodynamic stability to the tetraloop [28]. The GNRA R3D SCFG models the correlated occurrence of the G with the R and A bases (but it does not model the type of base pairing involved in that correlation), as well as the unpaired N residue (Figure 3a).

R3D defines a generic HL 3D motif as an arbitrary number of left/right correlated segments and a final loop segment of residues not correlated elsewhere. Figure 3a shows the general model. The R3D-HL motif assigns a profile HMM to the loop sequence, as well as to each of the left and right segments, which consist of the contiguous subsequences that pair through non-Watson-Crick interactions.

Bulge Loop motifs.

R3D bulge loop (BL) motifs are described in Figure 3b, and they have similar properties to the HL motifs. Notice that a BL motif can appear in a left or right bulge, depending on which of the two ends of the motif is continuous and which one connects to the rest of the structure. Figure 3b shows only one of the two possibilities (called variants) for the BL motif. We generalize the concept of motif variants in the following sections and supplemental Figure S2.

Internal Loop motifs.

For an internal loop (IL) motif (Figure 3c), R3D assumes the presence of two loop regions with an inner stem and an outer stem region, which are emitted in a correlated fashion by the SCFG. As with the HL motifs, the actual sequences in the loops and in the left/right inner and outer stem segments (all of which can be potentially empty) are modeled by profile HMMs.

For instance, the K-turn (or Kink turn) is a common internal loop motif featuring two G-A hydrogen bonded Sugar-Hoogsteen edge interactions that help induce an axial bend [31]. The K-turn R3D SCFG models these two correlated interactions. The internal loop portion of the K-turn has three unpaired nucleotides with consensus RNN, so the R3D grammar adds a profile HMM for the right bulge RNN sequence and treats the left bulge as empty (Figure 3c).

3-way and 4-way junction motifs.

The R3D SCFG for a 3-way junction (J3) motif includes three sequence segments that are emitted in a correlated fashion (Figure 3d). For instance, for the Hammerhead ribozyme, while R3D does not model the noncanonical base pairs that occur within the junction, it does model the correlated emission of all three segments, which include the base-paired residues as well as those that are not paired but are part of the motif. Similarly, for a 4-way junction (J4) motif, four arbitrary sequence segments are considered simultaneously (Figure 3e). As seen in the case of the HCV IRES 4-way junction, the correlated segments may include no nucleotides, thus indicating helix coaxial stacking [77].

Other multiloop motifs.

The R3D grammar also introduces branch segment (BS) motifs, which are sequence motifs that can appear in any multiloop branch, as described in Figure 3f. These motifs may describe particular protein-binding motifs such as the CsrA binding motifs of the CsrB RNA [40], as well as components of higher-order loop motifs. For instance, the Loop E that appears in a 3-way junction of the Glutamine riboswitch [57] is interrupted by a one-base-pair pseudoknot, and R3D is able to model it with two BS motifs.

The sequence-motif profile HMMs.

Each interacting partner or loop in an RNA 3D motif normally consists of a conserved sequence with some variability. R3D models those sequence segments as short profile hidden Markov models (HMMs), described in Figure 3g. Each profile HMM has a consensus sequence, and by allowing mutations, insertions and deletions, it is able to accommodate sequence variability and to identify motif instances that differ somewhat from the consensus. The states of the profile HMM emit on transition, not on state. Motifs with sequence segments without residues, such as those occurring in multiloops bounded by coaxially stacked helices, are also possible. We model empty segments with a profile HMM to allow for the possibility of insertions relative to consensus.

Parameterization.

Each profile HMM is parameterized so that it generates sequences that on average slightly exceed the length of the motif (adding 0.1 per consensus position), up to a maximum of 1.5 extra residues per motif on average. The emission probability distribution over residues for each motif position is determined by the given consensus. Other residues not in the position consensus are allowed with a small probability of 10⁻⁴. Given a reliable database of motif examples on alignments, the segment HMM parameters could be trained by maximum likelihood.
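A minimal sketch of this parameterization follows. Several details are assumptions made for illustration and are not stated in the text: consensus positions are written with IUPAC codes, the small probability applies to each non-consensus residue separately, and the residues allowed by the consensus share the remaining probability equally. The names `emission_distribution` and `expected_extra_length` are hypothetical.

```python
# IUPAC nucleotide codes over the RNA alphabet.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "M": "AC", "K": "GU", "S": "CG", "W": "AU",
    "H": "ACU", "B": "CGU", "V": "ACG", "D": "AGU", "N": "ACGU",
}
EPSILON = 1e-4  # probability of a residue outside the position consensus

def emission_distribution(consensus_symbol):
    """Emission probabilities for one consensus position (assumed equal split)."""
    allowed = IUPAC[consensus_symbol]
    n_other = 4 - len(allowed)
    p_allowed = (1.0 - EPSILON * n_other) / len(allowed)
    return {b: (p_allowed if b in allowed else EPSILON) for b in "ACGU"}

def expected_extra_length(n_consensus_positions):
    """Average extra length: 0.1 per consensus position, capped at 1.5."""
    return min(0.1 * n_consensus_positions, 1.5)

# Example: the four consensus positions of a GNRA tetraloop.
gnra = [emission_distribution(s) for s in "GNRA"]
print(gnra[2])                      # position R favors A and G
print(expected_extra_length(4))     # short motif, below the cap
print(expected_extra_length(20))    # long motif, capped
```

Under these assumptions each position's distribution sums to one, and the cap on extra length only takes effect for motifs with more than 15 consensus positions.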

Motif variants.

All RNA 3D motifs except HL motifs are bound by more than one helix, thus allowing different topological variants depending on which 5′/3′ ends are selected to integrate the motif into the rest of the structure. Bulge and Internal loop motifs have two variants, and 3-way and 4-way junctions have three and four variants respectively. For instance, the two variants of any BL motif correspond to a left and right bulge motif respectively. Supplemental Figure S2 describes all motif variants with their SCFG rules. For any 3D motif entry in the R3D descriptor file, CaCoFold-R3D internally models all possible variants of the motif.

R3D-prototype: The importance of framing 3D motifs by evolutionary information

One of the keys to our approach is that the CaCoFold-R3D method bounds the search of RNA 3D motifs to the segments of the RNA molecule enclosed by helical regions with covariation support. This is important because, due to the small size of the motifs, their associated models have low information content and would otherwise produce a large number of false positives.

To initially test the effect of adding covariation information into the prediction of RNA 3D motifs, we implemented an R3D-prototype that simultaneously produces a secondary structure and models two 3D motifs: the GNRA tetraloop (a hairpin motif) and the K-turn (an internal loop motif). This prototype uses a version of the RBG grammar (Figure S1a) that produces structural predictions directly on RNA sequences, and implements two R3D grammars (also on sequences) modeling GNRA loops and K-turns. The prototype uses this RBG-R3D grammar, and for each RNA sequence predicts a maximum probability secondary structure including GNRA loops and K-turn motifs.

For each Rfam family, sequences are selected at random from their seed alignments, and covarying base pairs are extracted from the Rfam seed alignments. To test the effect of adding covariation, the R3D-prototype predictions can be constrained by covariation information provided externally, or alternatively it can be used without any covariation constraints. We record both the sensitivity, defined as the percentage of true motifs successfully detected, and the average number of false positives per prediction. We perform this analysis both including and excluding covariation information to demonstrate the effectiveness of the model.
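The two reported quantities can be written down as a short sketch. The representation of a motif call as a (motif name, position) tuple and the exact-match criterion are assumptions made for illustration; the text does not specify how a predicted motif is matched to a true one, and the data below are invented toy values, not results.

```python
def motif_metrics(predictions, truths):
    """Sensitivity (%) and mean false positives per prediction.

    predictions, truths: lists with one set of motif calls per sequence.
    A call counts as detected only if it appears identically in both sets
    (an assumed matching rule).
    """
    detected = sum(len(p & t) for p, t in zip(predictions, truths))
    total_true = sum(len(t) for t in truths)
    false_pos = sum(len(p - t) for p, t in zip(predictions, truths))
    sensitivity = 100.0 * detected / total_true if total_true else float("nan")
    return sensitivity, false_pos / len(predictions)

# Toy data: two sequences, each with one true K-turn.
truths = [{("K-turn", 40)}, {("K-turn", 40)}]
predictions = [{("K-turn", 40), ("GNRA", 12)}, set()]
sensitivity, fp_per_prediction = motif_metrics(predictions, truths)
print(sensitivity, fp_per_prediction)
```

In the toy data one of two true motifs is found and one spurious GNRA call is made across two predictions.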

In Table 1, we present results from applying the R3D-prototype to structural RNAs from different Rfam families [50]. As positives, we tested the U3 small nuclear RNA and the spliceosomal U4 RNA which include two and one K-turns respectively [78], and the 5S rRNA which contains a GNRA tetraloop [52]. The U3 and U4 RNAs also serve as negative tests for the GNRA tetraloop, and 5S rRNA as negative for the K-turn. For an independent control, we selected the 6S RNA and the Ribosome modulation factor (RMF) RNA, which lack either of the tested motifs.

As hypothesized, adding covariation information vastly improves motif prediction accuracy despite the lack of covariation within the motifs themselves. Overall sensitivity on the detection of GNRA tetraloops and K-turns in the three positive RNAs increases after adding covariation from 84% to 95% (Table 1). Adding covariation also significantly reduces false positives for K-turn detection to levels similar to those of the GNRA tetraloop.

To further test the efficacy of our method, we applied it to four K-turns recently identified in bacterial RNAs via structural prediction and X-ray crystallography [29]. The performance of our method on these alignments corroborates the high level of accuracy and low false positivity as demonstrated before (Table 1). This R3D-prototype shows that our approach is a reliable predictor of confirmed motif structure. We moved on to making a full implementation of the RBGJ3J4-R3D model, named CaCoFold-R3D, that incorporates a large collection of RNA 3D motifs found recurrently in RNA structures [37, 2, 27] and operates on alignments.

R3D SCFG profiles of over fifty recurrent RNA 3D motifs

The presented version of the CaCoFold-R3D grammar integrates together R3D models for 51 different RNA 3D motif architectures which have been observed in structured RNAs [21, 67, 54, 53, 9, 55, 6]. The R3D descriptor describing the 51 motifs is provided in Figure S3. The total number of motifs implemented by CaCoFold-R3D after considering all motif variants (supplemental Figure S2) is 96.

Figure 4 includes a representation of 20 (out of 51) motifs included in this implementation. The full list of motifs can be found in Figure S3, which also provides the descriptive notation used in our input files to represent the motifs in our models (supplemental Figure S2). The method is customizable by simply changing the input file with the representations of new motifs to be considered.

Figure 4 also includes for each of the 20 motifs a positive example of an Rfam family documented to have the motif, accompanied by a detail of the CaCoFold-R3D full structural prediction correctly detecting the motif. It is worth noting that in the majority of cases, the Rfam 3D motif is bounded by helices that show some level of covariation, further supporting our key design feature of informing motif detection with the evolutionary conservation of secondary structure helices that arrange into a 3D structure.

Results on Rfam alignments

We ran CaCoFold-R3D on all Rfam seed alignments. Figure 5 reports additional examples of full structure predictions with representative 3D motifs that have been reported in the literature. For instance, CaCoFold-R3D finds the K-turns in the alignments of the U3 snoRNA [43, 82], U4 snRNA [76], and the four other new K-turns [29] that we used in the R3D-prototype, as well as the K-turn in the SAM riboswitch [47].

We also observe the Loop E motif in the 5S rRNA [11], the two G-bulge motifs in the T-box riboswitch [32], the J4a/4b 3D motif of the Magnesium riboswitch [14], and the T-loop motif in the TPP riboswitch as well as its characteristic 3-way junction [70]. Two interesting cases are the CsrB RNA that binds to the CsrA protein [40], for which we identify 12 binding motifs, and the Glutamine riboswitch, where two R3D branch segment (BS) motifs allow us to identify a confirmed Loop E motif occurring in a multiloop involving a pseudoknot instead of in an internal loop [57]. For the Metazoan SRP, CaCoFold-R3D identifies several of its characterized motifs (domain IV, C-loop, K-turn, U-turn, GNRAs) [68, 4, 24] (Figure 5). The collection of all Rfam predicted structures is provided in the supplemental material.

With regard to the distribution of detected 3D motifs, we observe that the GNRA tetraloop is the most frequently observed motif, followed by the K-turn. Most motifs of any other kind have between 10–50 instances in Rfam (supplemental Table S1). Because the CaCoFold-R3D predictions integrate the covariation information observed in the alignment base pairs, we use the covariation observed in the helices bounding the 3D motifs in order to assess our confidence in the predictions. Overall, we detect a total of 2,124 motifs, of which 1,460 have covariation support, defined here as a motif for which at least one of its bounding helices has one or more covarying base pairs. 591 of the Rfam families include 3D motifs with covariation support. For the two largest RNA structures SSU and LSU rRNA, we find 45 supported 3D motifs for eukaryotic SSU, and 62 for the eukaryotic LSU rRNA (see Table S1). The list of motifs detected for each Rfam family is also provided in the supplemental materials.

As a control, we obtain predictions for negative alignments obtained from the Rfam alignments by permuting the residues in each column (position) independently from each other. As a result of the shuffling, the covariation signal in the input alignment is altered, but the base composition of each position remains unchanged, thus retaining the sequence signature of any potential motif. For these control alignments, we obtain 121 motifs supported by covariation, which, compared to the 1,460 motifs obtained for the Rfam alignments, indicates an estimated 9% false discovery rate in our predictions. Notice that the control alignments report 733 helices out of 14,146 with at least one covarying base pair. Since R-scape [61, 59] reports pairs with a significance E-value cutoff of 0.05, this number (733) is in good agreement with the expected average number of helices with at least one covarying pair under the null hypothesis (0.05 × 14,146 = 707.3).
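The column-wise shuffling used for the control alignments can be sketched as follows. This is an illustration of the idea rather than the procedure used to produce the reported numbers; it assumes the alignment is a list of equal-length strings, and the function name `shuffle_columns` and the toy alignment are hypothetical.

```python
import random
from collections import Counter

def shuffle_columns(alignment, seed=0):
    """Permute the residues of every alignment column independently.

    alignment: list of equal-length strings (one per sequence).
    Each column keeps its residue composition, but the pairing of residues
    across columns within a sequence is scrambled.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*alignment)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(row) for row in zip(*columns)]

# Toy alignment in which columns 0 and 5 covary (G:C, A:U, C:G, U:A).
alignment = ["GAAAAC", "AAAAAU", "CAAAAG", "UAAAAA"]
shuffled = shuffle_columns(alignment, seed=1)

def column_counts(aln):
    return [Counter(col) for col in zip(*aln)]

print(shuffled)
print(column_counts(alignment) == column_counts(shuffled))
```

Because every column is permuted on its own, per-position composition is preserved by construction, which is why the sequence signature of a motif survives in the control while pairwise covariation generally does not.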

A new 3-way junction motif with high representation

As an example of the power of CaCoFold-R3D as a tool to discover new motifs, we turn to a loop in the Group II intron RNA for which Rfam describes a generic left bulge (Figure 6a). From the CaCoFold analysis of the Rfam seed alignment, we inferred that this is actually a 3-way junction that is very conserved in sequence and exquisitely framed by covariation in all three closing helices (Figure 6b). Two of the closing helices are adjacent, and the third one is just a lone base pair. A Group II crystal structure [12] confirms the coaxial stacking of the two adjacent helices, as well as the lone pair; it also reports one non-Watson-Crick base pair within the 3-way junction (Figure 6c).

We created an R3D grammar for this novel J3 motif (Figure 6d), and introduced it into the model. We were surprised to find that this seems to be a recurrent motif also found in other structural RNAs. In fact, our analysis shows that it is the most frequent 3-way junction observed in Rfam as well as one of the top five most frequent motifs (Table S1). In Figure 6e, we show examples of other J3-groupII instances found in the CaCoFold-R3D structures for other Rfam families.

Time performance

CaCoFold-R3D is fast. On an Apple M3 Max (128 GB), 98% of the Rfam families (4079/4178) take less than 60 seconds to run CaCoFold-R3D end-to-end, and 95% of families take less than 30 seconds. For the small and large subunits (SSU and LSU) of the rRNA, the two longest structured RNAs, it takes 32 minutes to analyze the eukaryotic SSU alignment (length 1,978 and 90 sequences) and 2.9 hours for the eukaryotic LSU rRNA alignment (length 3,680 and 88 sequences).

Moreover, while other methods have to run a different search for each motif and for each sequence and also calculate a secondary structure separately [13, 66], CaCoFold-R3D directly runs all 96 motifs together with the secondary structure in a single-shot prediction, and reports a consensus structure including 3D motifs for the alignment. The all-at-once RBGJ3J4-R3D CYK prediction algorithm scales as L³ × M for an alignment (or sequence) of length L, where M is the total number of nonterminals, including both those for the RBGJ3J4 grammar (12) and those for the R3D grammars (96 in the tested implementation). However, because of the covariation constraints, we expect this to be worst-case behavior.
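As a rough illustration of this scaling, the sketch below evaluates the worst-case cost expression for the two rRNA alignments mentioned above and compares the ratio with the ratio of the reported run times. It is a back-of-the-envelope check only: it ignores alignment depth, constant factors, and the pruning due to covariation constraints, and the function name `cyk_cost` is hypothetical.

```python
def cyk_cost(length, n_nonterminals):
    """Worst-case cost model from the text: L^3 x M (arbitrary units)."""
    return length ** 3 * n_nonterminals

# M as described in the text: 12 RBGJ3J4 nonterminals plus 96 R3D nonterminals.
M = 12 + 96

ssu_cost = cyk_cost(1978, M)   # eukaryotic SSU rRNA alignment length
lsu_cost = cyk_cost(3680, M)   # eukaryotic LSU rRNA alignment length

predicted_ratio = lsu_cost / ssu_cost
observed_ratio = (2.9 * 60) / 32   # 2.9 hours versus 32 minutes

print(f"cubic-model ratio LSU/SSU: {predicted_ratio:.2f}")
print(f"reported run-time ratio:   {observed_ratio:.2f}")
```

Since M is the same for both alignments, the modeled ratio depends only on the cube of the length ratio.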

Discussion

CaCoFold-R3D combines several unique features that make the prediction of RNA 3D motifs accurate, fully integrated with secondary structure, and annotated with their expected reliability. The R3D grammar abstracts the different 3D motifs into six generalized designs, which makes it possible to incorporate an arbitrary number and variety of motifs; we provide results using a total of 96 motifs (everything). The RBGJ3J4 grammar specifies all possible loops in an RNA molecule, allowing motif detection in any possible location within a sequence (everywhere). CaCoFold-R3D is fully probabilistic, so one can compute the joint probability of all structural motifs together with all nested helices, pseudoknots and triplets (all at once). Because our method is framed by the evolutionary information contained in the alignment, it provides information on predictive confidence as a function of the number of significant covarying base pairs extracted from the input alignment. CaCoFold-R3D is also computationally fast: we are able to present full predictions for all Rfam families, including the ribosomal RNA. Because it is customizable, it is a tool to investigate novel 3D motifs, and we present one new and frequent 3-way junction motif. These results demonstrate that the R3D grammar coupled with covariation information offers an accurate and reliable prediction paradigm for identifying crucial 3D motifs in structural RNA sequences.

CaCoFold-R3D predictions for the Rfam RNA families will be used to provide more complete inputs for the training of deep learning methods for RNA 3D structure prediction [85]. Methods that predict RNA 3D structure such as AlphaFold3 [1] and RoseTTAFold [33], which already use the Rfam data to inform their inputs, will benefit from the comprehensive information that CaCoFold-R3D provides on the recurrent 3D motifs prevalent in RNA 3D structures.

Supplementary Material

Bibliography

1. Abramson J., Adler J., Dunger J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630:493–500, 2024. doi:10.1038/s41586-024-07487-w. PMCID: PMC11168924. PMID: 38718835.
2. Batey R. T., Rambo R. P., and Doudna J. A. Tertiary motifs in RNA structure and folding. Angew Chem Int Ed Engl, 38:2326–2343, 1999. doi:10.1002/(sici)1521-3773(19990816)38:16<2326::aid-anie2326>3.0.co;2-3. PMID: 10458781.
3. Baulin E. F., Bohdan D. R., Kowalski D., Serwatka M., Swierczynska J., Zyra Z., and Bujnicki J. M. ARTEM: a method for RNA and DNA tertiary motif identification with backbone permutations, and its example application to kink-turn-like motifs. bioRxiv, 2024. doi:10.1101/2024.05.31.596898. PMCID: PMC12306022. PMID: 40721818.
4. Becker M. M., Lapouge K., Segnitz B., Wild K., and Sinning I. Structures of human SRP72 complexes provide insights into SRP RNA remodeling and ribosome interaction. NAR, 45:470–481, 2016. doi:10.1093/nar/gkw1124. PMCID: PMC5224484. PMID: 27899666.
5. Berman H. M., Westbrook J., Feng Z., Gilliland G., Bhat T. N., Weissig H., Shindyalov I. N., and Bourne P. E. The Protein Data Bank. Nucleic Acids Research, 28:235–242, 2000. doi:10.1093/nar/28.1.235. PMCID: PMC102472. PMID: 10592235.
6. Bohdan D. R., Voronina V. V., Bujnicki J. M., and Baulin E. F. A comprehensive survey of long-range tertiary interactions and motifs in non-coding RNA structures. RNA, 51, 2023. doi:10.1093/nar/gkad605. PMCID: PMC10484739. PMID: 37471030.
7. Chan C. W., Chetnanim B., and Mondragón A. Structure and function of the T-loop structural motif in noncoding RNAs. Wiley Interdiscip Rev RNA, 4:507–522, 2013. doi:10.1002/wrna.1175. PMCID: PMC3748142. PMID: 23754657.
8. Cheong C., Varani G., and Tinoco I. Jr. Solution structure of an unusually stable RNA hairpin, 5’GGAC(UUCG)GUCC. Nature, 346:680–682, 1990. doi:10.1038/346680a0. PMID: 1696688.