RNA secondary structures: from ab initio prediction to better compression, and back

Evarista Onokpasa; Sebastian Wild; Prudence W. H. Wong

arXiv:2302.11669·q-bio.BM·February 24, 2023


TL;DR

This paper leverages biological knowledge and stochastic models to enhance RNA secondary structure prediction and compression, demonstrating that compression ratios can evaluate model quality effectively.

Contribution

It introduces a novel approach combining stochastic context-free grammars with compression techniques to improve RNA structure prediction and evaluation.

Findings

01 Improved compression ratios with expert stochastic models.

02 Compression ratio correlates with prediction quality.

03 Grammar features significantly impact compression performance.


Equations4

S \Rightarrow L S \Rightarrow [( G] S [) C] S \Rightarrow [( G] L S [) C] S \Rightarrow [( G] [\textbullet A] S [) C] S \Rightarrow [( G] [\textbullet A] ε [) C] S \Rightarrow [( G] [\textbullet A] [) C] ε = [( G] [\textbullet A] [) C],

S \Rightarrow L S \Rightarrow [( G] S [) C] S \Rightarrow [( G] L S [) C] S \Rightarrow [( G] [\textbullet A] S [) C] S \Rightarrow [( G] [\textbullet A] ε [) C] S \Rightarrow [( G] [\textbullet A] [) C] ε = [( G] [\textbullet A] [) C],

S \to L S, L \to [( G] S [) C], S \to L S, L \to [\textbullet A], S \to ε, S \to ε .

S \to L S, L \to [( G] S [) C], S \to L S, L \to [\textbullet A], S \to ε, S \to ε .


Code & Models

Repositories

evita35/joint-rna-compression (official)


Taxonomy

Topics: RNA and protein synthesis mechanisms · Genomics and Phylogenetic Studies · Natural Language Processing Techniques

Full text


RNA secondary structures: from ab initio prediction to better compression, and back

Evarista Onokpasa∗, Sebastian Wild∗, Prudence W. H. Wong∗

∗ University of Liverpool, UK, {evarista.onokpasa, sebastian.wild, pwong}@liverpool.ac.uk


Abstract

In this paper, we use the biological domain knowledge incorporated into stochastic models for ab initio RNA secondary-structure prediction to improve the state of the art in joint compression of RNA sequence and structure data (Liu et al., BMC Bioinformatics, 2008). Moreover, we show that, conversely, compression ratio can serve as a cheap and robust proxy for comparing the prediction quality of different stochastic models, which may help guide the search for better RNA structure prediction models.

Our results build on expert stochastic context-free grammar models of RNA secondary structures (Dowell & Eddy, BMC Bioinformatics, 2004; Nebel & Scheid, Theory in Biosciences, 2011) combined with different (static and adaptive) models for rule probabilities and arithmetic coding. We provide a prototype implementation and an extensive empirical evaluation, where we illustrate how grammar features and probability models affect compression ratios.

1 Introduction

In this article, we explore the interplay and potential symbiosis between data compression and probabilistic methods for predicting the folding structure of (non-coding) RNA molecules. Ribonucleic acid (RNA) is a bio-polymer that serves various roles in the coding, decoding, expression and regulation of genes in cells. An RNA molecule consists of a chain of nucleotides each having a base attached to it (either adenine (A), cytosine (C), guanine (G), or uracil (U)); this string of bases forms the sequence of the molecule. Unlike the related DNA, RNA is usually single-stranded and forms spatial structures by folding onto itself (similar to proteins), with complementary bases forming stabilizing hydrogen bonds. The set of (indices of the) bases that form such pairs is the secondary structure of the molecule; it can be encoded by the dot-bracket notation (see Figure 1; a formal definition is given in Section 2).

The secondary structure is instrumental for the biological function of non-coding RNA molecules and of great interest to biologists. Much research has hence been devoted to computationally predicting the secondary structure from a known RNA sequence (ab initio RNA secondary-structure prediction) [4, 9, 26], including human swarm intelligence [15], and it remains an active research area [22, 7, 23]. We explore areas around RNA secondary structures where innovations in compression methods are central for further progress.

Better RNA Compression

Our first goal is to use the domain knowledge on RNA foldings incorporated into secondary-structure prediction models for improved methods for the joint compression of the sequence and secondary structure of RNA molecules. With biological databases ever increasing, compressed representations become desirable. In the case of databases for non-coding RNA sequences with known secondary structures, the data volume has long remained manageable, but growth is now accelerating: for example, RNA Central [2] now aggregates over 25 million trusted secondary structures 8 years after its first release; 1.8 million of these come from the Rfam database [11], collected over its 20 years of existence.

The need for space-efficient representations of joint RNA sequence and secondary structure databases was identified by Liu et al. in 2008 [16]. Their algorithm RNACompress, based on a stochastic context-free grammar (SCFG, defined below), has been recognized as an early application of ideas from grammar-based compression in the data-compression community [17, 12]. As we demonstrate in this article, substantially better compression ratios can be achieved than Liu et al. report; interestingly, this is possible by carefully extending their very method to a general framework of SCFG-based compression. Improvements are then realized by applying this framework to tried and tested grammars from the RNA secondary structure prediction literature [3, 20] (as well as further, orthogonal refinements).

Apart from the practical utility of less space, compression methods are of direct interest in bioinformatics as a way to upper bound the Kolmogorov complexity [13] of a dataset, and hence its inherent information content [8]. For example, in the context of RNA sequences, one can ask how much additional information is contained in the secondary structure of the RNA when the sequence is known.

Compression as a proxy for predictive power

Our second and main goal is to test our hypothesis that for comparing probabilistic models for RNA secondary structures, compression ratio can serve as a proxy for prediction quality in RNA secondary-structure prediction. Advances in next-generation sequencing allow determining the sequence of many molecules at scale, whereas secondary structures need to be determined by much more expensive techniques like X-ray crystallography [26]. A much cheaper and faster alternative is to computationally predict the structure from a known sequence. The state-of-the-art approaches either build on a chemical model of the molecules and try to identify a structure with minimal free energy or use a machine-learning approach. Both can formally be described by stochastic context-free grammars (see Section 2).

RNA secondary-structure prediction plays a vital role in studying the biological function of RNA molecules and for designing artificial RNA sequences, and so numerous software packages implement different algorithms for this task. Comparing their prediction quality is a delicate undertaking, because no definitive similarity metric is known to judge how close the predicted secondary structure is to an experimentally determined one [18]. Indeed, the method of choice in the literature to compare structure prediction is solely based on individual base pairs [18, 3, 21, 20]: One compares the sensitivity and positive predictive value (PPV) of different approaches (defined in Section 2).

We will use the compressed size (in bits per base) of the reference structure under the trained stochastic model as a more direct means to compare how well different models capture RNA folding behavior. This compressed size effectively reflects the log-likelihood of the reference structure and hence has a natural interpretation as the information content that the model assigns to the RNA structure.

This has several advantages over sensitivity/PPV: (a) It directly evaluates the quality of the model, separating it from the method to produce a (single) predicted secondary structure. There are different options to predict a structure; one can use the most likely structure, or a consensus structure containing the most likely individual pairs, or return a sample of several nearly optimal structures. No choice clearly dominates the others, but they affect the sensitivity and PPV scores. (b) Log-likelihood is a single natural metric derived from first principles of information theory; it does not need trade-offs or further parameters.

Contributions

Our contributions are as follows. First, we improve the compression ratio achieved for joint RNA sequence and structure data by 45% over the state of the art, Liu et al.’s RNACompress [16]; compared to the general-purpose compressor paq8l (http://mattmahoney.net/dc/#paq), we see a 70% improvement. The improvement over RNACompress is the combined result of several refinements, but a 30% reduction in compressed size is observed when keeping everything but the used SCFG constant. This clearly shows the relevance of the grammar and the validity of our approach to employ structure-prediction grammars. The proposal and implementation of the more sophisticated grammars (such as the one based on [20]) is hence a useful contribution. Second, we demonstrate that compression ratio can be used as a robust predictor of how well a grammar will perform for ab initio secondary-structure prediction. To our knowledge, this is the first such attempt to identify suitable probabilistic models for RNA structure prediction that is not based on comparing predicted structures to a benchmark dataset. Finally, we reproduce and confirm the computational study of [3] with an independent implementation and additional modifications to their grammars.

Related Work

Liu et al. [16] proposed RNACompress in 2008; we discuss their methodology in detail in Section 3. Naganuma et al. [19] explore a related method of SCFG compression closer to grammar-based compression using straight-line programs. They create a stochastic grammar from the text to compress with a variation of the RePair heuristic [14]. For a broader context of grammar-based compression, see the recent survey of Kieffer and Yang [12]. Friemel [6] also targets the joint RNA compression problem, but using a different approach. He encodes RNA structures as labeled trees with each node representing a nucleotide and the branches representing the bonds; unpaired bases yield unary nodes. Friemel’s algorithm RNAContract contracts sequences of unary nodes (similar to compact tries) or a sequence of multiple nested brackets in the dot-bracket notation. After the node contraction the algorithm encodes the contracted node tree using Huffman coding.

Outline

The rest of this paper is structured as follows. Section 2 collects required concepts. Section 3 explains the grammar-based compression of RNA. Then we report on our two studies: Section 4 discusses the compression achieved with various grammars and Section 5 explores the connection between compressed size and prediction quality. We conclude in Section 6 with future work. In the appendix, we give details on the comparison with a general-purpose compressor (Appendix A), list the precise grammars we used (Appendix B), and investigate further differences between our approach and [16] (Appendix C). Further details, all datasets and code to produce the figures in this article are available online as supplementary material: https://www.wild-inter.net/publications/onokpasa-wild-wong-2023; the code is available on GitHub: https://github.com/evita35/joint-rna-compression.

2 Preliminaries

Dot-bracket notation

An RNA sequence is a string of bases A, C, G, U. Stable hydrogen bonds are possible between A and U resp. C and G (the Watson-Crick pairs) and to a lesser extent also between G and U. RNA secondary structures can be represented by the dot-bracket notation [10]: a well-nested string over $\{\texttt{\textbullet},\texttt{(},\texttt{)}\}$ where a base pair is denoted by matching parentheses ( ) and an unpaired base by •; see Figure 1 for an example. (Footnote 1: As is often done in the area, we do not consider structures with pseudoknots in this paper.) We use “RNA” as an abbreviation for “a pair of an RNA sequence and its secondary structure”.
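The well-nestedness condition can be made concrete with a stack. The following is a minimal sketch, not taken from the paper's implementation; the function name `base_pairs` and the choice to accept both `.` and `•` for unpaired bases are assumptions made for illustration.

```python
def base_pairs(structure: str) -> set[tuple[int, int]]:
    """Return the (i, j) index pairs encoded by a pseudoknot-free dot-bracket string."""
    stack: list[int] = []          # positions of '(' that are still unmatched
    pairs: set[tuple[int, int]] = set()
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.add((stack.pop(), j))
        elif symbol not in ".•":   # assumption: both spellings of "unpaired" are accepted
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


print(sorted(base_pairs("((..)).")))
```

The returned set is exactly the "set of indices of paired bases" that the text calls the secondary structure.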

SCFG

Dot-bracket strings can be generated by a context-free grammar (CFG). A CFG is a tuple $(N,T,R,S)$ where $N$ and $T$ are finite sets of nonterminals and terminals, respectively, $R\subseteq N\times(N\cup T)^{*}$ is a finite set of production rules, and $S\in N$ is the start symbol. A rule in $R$ is written as $A\rightarrow\alpha$. A stochastic context-free grammar (SCFG) is a tuple $G=(N,T,R,S,W)$ such that $(N,T,R,S)$ is a CFG and $W:R\rightarrow[0,1]$ is a function satisfying $\sum_{\alpha:(A\rightarrow\alpha)\in R}W(A\rightarrow\alpha)=1$ for all $A\in N$. That is, for every $A\in N$, $W$ represents a probability distribution over the set of rules with left-hand side $A$.
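The normalization condition on $W$ is easy to check mechanically. The sketch below is only an illustration: the representation of rules as `(lhs, rhs)` tuples, the helper name `is_proper`, and the toy weights are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from math import isclose

# A rule is (left-hand side, right-hand side); W maps each rule to its probability.
Rule = tuple[str, tuple[str, ...]]


def is_proper(weights: dict[Rule, float], nonterminals: set[str]) -> bool:
    """Check that W sums to 1 over the rules of every left-hand side."""
    totals: dict[str, float] = defaultdict(float)
    for (lhs, _rhs), w in weights.items():
        if not 0.0 <= w <= 1.0:
            return False
        totals[lhs] += w
    return all(isclose(totals[a], 1.0) for a in nonterminals)


# Hypothetical toy weights for the dot-bracket grammar S -> ( S ) S | • S | ε
toy = {
    ("S", ("(", "S", ")", "S")): 0.25,
    ("S", ("•", "S")): 0.5,
    ("S", ()): 0.25,
}
print(is_proper(toy, {"S"}))
```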

Earley Parser

The Earley Parsing algorithm [5] is able to process any SCFG and efficiently determine whether a string belongs to the language of the grammar. We use the Earley parser implementations by [25, 27] when comparing various SCFGs since it does not require a rigid normal form for grammars.

RNA secondary-structure prediction

A stochastic context-free grammar can be used for RNA secondary-structure prediction where terminals correspond to bases and the leftmost derivation of an RNA sequence encodes a secondary structure of the sequence. The used SCFGs allow many different derivations (and hence secondary structures) for a given sequence and the rule probabilities induce a probability distribution over those. Using a classical machine-learning approach, the rule probabilities are chosen as maximum likelihood parameters w. r. t. a given training dataset (with known secondary structures). For predicting/inferring the (unknown) secondary structure of a new RNA sequence, a probabilistic parser determines the maximum-likelihood derivation (Viterbi parse) of the RNA sequence in the SCFG, which encodes the most likely secondary structure (under the given probabilistic model).
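When the training derivations are known, the maximum-likelihood rule probabilities are the relative frequencies of each rule among the rules with the same left-hand side. A minimal sketch of this counting step follows; the function name and the toy derivations (written with the compact terminals `a`, `â`) are made up for illustration and are not the paper's training data.

```python
from collections import Counter


def train_rule_probabilities(derivations):
    """Relative-frequency (maximum-likelihood) estimate of rule probabilities.

    Each derivation is a list of (lhs, rhs) rules in leftmost order.
    """
    rule_counts = Counter(rule for d in derivations for rule in d)
    lhs_counts = Counter()
    for (lhs, _rhs), c in rule_counts.items():
        lhs_counts[lhs] += c
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}


# Two hypothetical derivations in a grammar with S -> L S | ε and L -> a | a S â
toy_derivations = [
    [("S", "L S"), ("L", "a"), ("S", "")],
    [("S", "L S"), ("L", "a S â"), ("S", "L S"), ("L", "a"), ("S", ""), ("S", "")],
]
probs = train_rule_probabilities(toy_derivations)
print(probs)
```

Rules that never occur in the training derivations receive no entry here; a practical implementation would probably need some form of smoothing, which this sketch omits.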

We measure the quality of prediction by sensitivity and positive predictive value (PPV): the fraction of correctly predicted base pairs among all pairs in the reference structure resp. all pairs in the predicted structure. More formally, let $\mathit{TP}$ , $\mathit{TN}$ , $\mathit{FP}$ , $\mathit{FN}$ be the number of base pairs that are true positives, true negatives, false positives, and false negatives, respectively. Then $\text{Sensitivity}=\frac{\mathit{TP}}{\mathit{TP}+\mathit{FN}}$ and $\text{PPV}=\frac{\mathit{TP}}{\mathit{TP}+\mathit{FP}}$ .
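Both measures can be computed directly from the two sets of base pairs. The sketch below assumes that structures are given as sets of `(i, j)` index pairs; returning 0.0 for an empty denominator is a convention chosen here, not one stated in the paper.

```python
def sensitivity_ppv(reference: set, predicted: set) -> tuple[float, float]:
    """Sensitivity and PPV of a predicted set of base pairs against a reference set."""
    tp = len(reference & predicted)   # pairs present in both structures
    fn = len(reference - predicted)   # reference pairs that were missed
    fp = len(predicted - reference)   # predicted pairs not in the reference
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


reference = {(0, 9), (1, 8), (2, 7)}
predicted = {(0, 9), (1, 8), (3, 6)}
print(sensitivity_ppv(reference, predicted))
```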

3 RNA compression using stochastic context-free grammars

We now show how to jointly compress an RNA sequence and secondary structure using a SCFG $G$. This method has been used by Liu et al. [16] on a fixed grammar; we generalize it here to arbitrary grammars $G$ and rule-probability models. The terminals of $G$ are pairs of characters, e. g., $\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{(}}$ for base A in the RNA sequence and ( in the (dot-bracket representation of the) secondary structure. (Footnote 2: Liu et al. use 2 grammars instead, one for the sequence and one for the secondary structure; the two descriptions are equivalent.)

To encode an RNA, we determine the sequence of rules in a leftmost derivation of the RNA and then encode this sequence of rules with a standard code driven by a model for the rule probabilities; Liu et al. use a fixed Huffman code, whereas we employ arithmetic coding [28].

We illustrate the process on the RNA sequence $\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{(}}\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{\textbullet}}\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{)}}$ with the grammar of Liu et al.: $G_{L}=(N,T,R,S)$ has $N=\{S,L\}$, $T=\{\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{(}},\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{(}},\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{(}},\genfrac{[}{]}{0.0pt}{1}{\texttt{U}}{\texttt{(}},\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{)}},\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{)}},\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{)}},\genfrac{[}{]}{0.0pt}{1}{\texttt{U}}{\texttt{)}},\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{\textbullet}},\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{\textbullet}},\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{\textbullet}},\genfrac{[}{]}{0.0pt}{1}{\texttt{U}}{\texttt{\textbullet}}\}$, and rules $R$ shown in Table 1.

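The (unique) leftmost derivation using the grammar is as follows:

$S\Rightarrow LS\Rightarrow\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{(}}S\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{)}}S\Rightarrow\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{(}}LS\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{)}}S\Rightarrow\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{(}}\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{\textbullet}}S\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{)}}S\Rightarrow\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{(}}\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{\textbullet}}\varepsilon\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{)}}S\Rightarrow\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{(}}\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{\textbullet}}\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{)}}\varepsilon=\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{(}}\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{\textbullet}}\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{)}},$

where the sequence of applied production rules is

$S\to LS,\quad L\to\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{(}}S\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{)}},\quad S\to LS,\quad L\to\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{\textbullet}},\quad S\to\varepsilon,\quad S\to\varepsilon.$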

Since we always replace the leftmost nonterminal, the next nonterminal to replace is known inductively, and we can reconstruct the leftmost derivation from only the (index of the) used right-hand sides: $1,4,1,7,2,2$, using the order of rules in Table 1 (the $4$ indicates that the second used rule, which we know expands $L$, is the 4th rule with left-hand side $L$, i. e., $L\to\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{(}}S\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{)}}$). Now suppose we have the (static) rule probabilities for $R$ from Table 1 and we use arithmetic coding to store the right-hand sides. We obtain the corresponding sequence of intervals from the rules, $[0.00,0.65),[0.30,0.35),[0.00,0.65),[0.50,0.60),[0.65,1.00),[0.65,1.00)$, which we encode using arithmetic coding to obtain the final binary codeword: 0011010100100.
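The interval narrowing behind this codeword can be replayed with exact rational arithmetic. The sketch below only uses the six intervals and the codeword quoted above; the helper names `narrow` and `codeword_interval` are illustrative, and the check (the codeword's dyadic interval lies inside the final interval) is one common way to state decodability, not necessarily how the prototype emits its bits.

```python
from fractions import Fraction as F

# Sub-intervals of [0, 1) for the six applied rules, as listed in the text.
intervals = [
    (F("0.00"), F("0.65")),  # S -> L S
    (F("0.30"), F("0.35")),  # L -> [G(] S [C)]
    (F("0.00"), F("0.65")),  # S -> L S
    (F("0.50"), F("0.60")),  # L -> [A•]
    (F("0.65"), F("1.00")),  # S -> ε
    (F("0.65"), F("1.00")),  # S -> ε
]


def narrow(intervals):
    """Successively restrict [0, 1) to each rule's sub-interval."""
    low, width = F(0), F(1)
    for lo, hi in intervals:
        low += width * lo
        width *= hi - lo
    return low, low + width


def codeword_interval(bits: str):
    """All real numbers whose binary expansion starts with 0.<bits>."""
    value = F(int(bits, 2), 2 ** len(bits))
    return value, value + F(1, 2 ** len(bits))


low, high = narrow(intervals)
c_low, c_high = codeword_interval("0011010100100")
print(float(low), float(high))
print(low <= c_low and c_high <= high)
```

Every number starting with the 13 bits 0011010100100 falls into the final interval, so a decoder that knows the grammar and the rule probabilities can retrace the six rules from these bits.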

The example above (and [16]) uses a static rule-probability model, usually obtained from a training dataset with known structures by counting how often each rule is used in the dataset derivations. With arithmetic coding, we can easily replace this by an adaptive rule-probability model, where rule probabilities are computed as relative frequencies in the prefix encoded so far (starting with some initial value for counters, typically 1). This entirely avoids the need for a second pass or a training dataset, as well as storing the rule probabilities. For long inputs, the adaptive model converges to the sequence-specific relative rule frequencies; we hence also include the semi-adaptive model where rule counts are determined for the given sequence in a first pass. Unless one also stores the rule counts, this model does not allow decoding, but indicates the limiting behavior of the adaptive model.
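The difference between the static and the adaptive model can be seen from the ideal code length, i.e., the sum of $-\log_2$ of the probability each rule has at the moment it is coded; an arithmetic coder comes close to this value up to a small overhead. The following sketch is an illustration under that idealization. The function names are hypothetical, and the static probabilities are read off the interval widths of the worked example (Table 1 itself is not reproduced here).

```python
from collections import Counter
from math import log2


def static_bits(rule_seq, prob):
    """Ideal code length (bits) of a rule sequence under fixed rule probabilities."""
    return -sum(log2(prob[rule]) for rule in rule_seq)


def adaptive_bits(rule_seq, rules_by_lhs, initial=1):
    """Ideal code length when probabilities are relative frequencies of the
    prefix coded so far; every counter starts at `initial`."""
    counts = {lhs: Counter({rhs: initial for rhs in rhss})
              for lhs, rhss in rules_by_lhs.items()}
    bits = 0.0
    for lhs, rhs in rule_seq:
        total = sum(counts[lhs].values())
        bits -= log2(counts[lhs][rhs] / total)
        counts[lhs][rhs] += 1  # the decoder can mirror this update
    return bits


# Probabilities taken from the interval widths of the worked example.
prob = {("S", "L S"): 0.65, ("S", ""): 0.35,
        ("L", "[G(] S [C)]"): 0.05, ("L", "[A•]"): 0.10}
example = [("S", "L S"), ("L", "[G(] S [C)]"), ("S", "L S"),
           ("L", "[A•]"), ("S", ""), ("S", "")]
print(round(static_bits(example, prob), 2))
```

The static value for the worked example is a little under 12 bits, which is consistent with the 13-bit codeword given there.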

4 Joint compression of RNA sequence and secondary structure

To investigate the effectiveness of different parameters, we have developed a generic prototype implementation in Java that allows us to combine arbitrary SCFGs, rule-probability models, and final encoders (Huffman or arithmetic coding). We use an existing open-source Earley Parser implementation [25] for obtaining a parse tree (given a SCFG and an RNA with sequence and structure). (Footnote 3: This parser has been reported to yield incorrect results for certain inputs; for the compression experiment, we could confirm that it works correctly on all our inputs and grammars.)

Apart from $G_{L}$ from [16], we use the structure-prediction grammars from [3] and [20]. Since non-canonical bonds are regularly found in experimentally determined secondary structures, all our grammars come in two versions: one that only allows the Watson-Crick and “G-U wobble” pairs, and one that allows all 16 pairs. The difference for compression is small: while most RNA structures do contain non-canonical bonds, most contain only very few of them.

For the compression-quality study, we use the “friemel” dataset, consisting of 17 000 ribosomal RNAs from [1] where ambiguously sequenced bases, non-canonical base pairs and pseudoknots have been removed [6]. Each RNA in the dataset is stored in a text file, using the dot-bracket notation. 24 RNAs contained empty hairpin loops; since 2 grammars from [3] exclude these, we replaced the innermost pair by two unpaired bases; for the evaluation, we exclude these 24 RNAs.

Figure 2 shows the compression quality of different grammars, normalized to the (average) number of bits per base in the RNA. It is striking that the current state-of-the-art method from the literature, Liu et al.’s RNACompress [16], performs much worse than all the structure-prediction grammars (for all rule-probability models), indicating that these grammars indeed incorporate effective domain knowledge about RNA structures. Also note that a simplistic encoding of the RNA sequence alone would use 2 bits/base; the most sophisticated grammars come very close to that for the joint encoding of sequence and structure: 2.21 bits/base on average for the grammar of Nebel and Scheid [20]. The large grammars $G_{2}$, $G_{7}$, and $G_{8}$ [3] (those with “stacking parameters”) and the huge grammar by Nebel and Scheid [20] perform overall best. But some much smaller grammars like $G_{6}$ come very close, despite having a factor of 10 fewer parameters. This shows that it is the structure of the grammar, not merely the number of parameters of the model, that improves compression of RNA secondary structures.

5 Compression ratio vs. prediction quality

We have seen that the choice of the grammar heavily influences the compression quality of our generic joint RNA compressor. In this section, we take a closer look at this grammar dependence from the perspective of both compression and secondary-structure prediction. For that, we reproduced the classic study of Dowell and Eddy [3] comparing several hand-crafted SCFGs for their ability to correctly infer RNA secondary structures given only the RNA sequence as input. Due to the bugs from [25], we here used the probabilistic Earley parser from [27]. We use the original datasets from [3] (available at http://eddylab.org/software/conus/): The “benchmark” dataset was used in [3] to compare the prediction quality of SCFGs, whose rule probabilities have been trained on their “mixed80” dataset; see [3] for further details. Both datasets contain many non-canonical bonds and 8 RNAs contain empty hairpin loops; we again eliminated the latter. Mixed80 contains numerous ambiguous bases; these were randomly replaced with a compatible base.
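The two preprocessing steps mentioned here can be sketched as follows. This is a guess at a plausible procedure, not the authors' code: it assumes that ambiguous bases use the standard IUPAC nucleotide codes, that replacements are drawn uniformly at random, and that unpaired bases are written as `.` in the input files. The function names are hypothetical.

```python
import random

# Standard IUPAC ambiguity codes, written with U instead of T for RNA.
IUPAC = {
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def resolve_ambiguous(sequence: str, rng: random.Random) -> str:
    """Replace each ambiguous base by a random compatible base (assumed uniform)."""
    return "".join(rng.choice(IUPAC[b]) if b in IUPAC else b
                   for b in sequence.upper())


def open_empty_hairpins(structure: str, unpaired: str = ".") -> str:
    """Turn the innermost pair of every empty hairpin '()' into two unpaired bases."""
    return structure.replace("()", unpaired * 2)


rng = random.Random(0)  # fixed seed so that the replacement is reproducible
print(resolve_ambiguous("GANNCRU", rng))
print(open_empty_hairpins("((()))"))
```

A single pass of `replace` suffices because turning `()` into two unpaired symbols cannot create a new adjacent `()`.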

Figure 3 shows the results of comparing for each grammar how well it compresses the benchmark dataset of RNAs and how well it predicts secondary structures of this set (using the setup and parameters as in [3]). Taking into account the variability across different RNAs within the dataset, a clear and strong negative correlation is visible between compressed size and prediction quality; in particular, there is a clearly distinct cluster of grammars that simultaneously give the best compression and the best prediction.

At least for the grammars from [3], this shows that one can use compressed size as a more rigidly defined and robust proxy for secondary-structure prediction quality.

Figure 4 takes a closer look at the correlation on a per-RNA level. Even there, a correlation remains visible; in particular very accurately predicted structures are also well compressed. The right panel in Figure 4 shows that compressed size for different grammars is very strongly correlated; pictures for other grammar pairs are similar (excluding the poorly performing $G_{1}$, $G_{4}$, and $G_{5}$). Note that despite the strong correlation at RNA level, there is a significant difference in the (mean) compression ratio between different grammars. This might indicate that there are intrinsically more and less “surprising” RNA secondary structures (knowing only the RNA sequence).

6 Conclusion

In this paper, we demonstrated how domain knowledge of RNA secondary structures encapsulated in stochastic context-free grammars for structure prediction can be used to obtain the best single-RNA compression ratios known for this type of data. Moreover, we showed promising first evidence for the utility of compression ability as a cheap and robust proxy for prediction quality for RNA secondary-structure prediction.

This work opens up several enticing avenues for future research. Using compression ability as a simpler guide, we are working on an approach to discover new promising models for secondary-structure prediction. It would be interesting to investigate whether the robust correlation between prediction quality and compressed size continues to hold for large grammars with many parameters; here prediction could suffer due to overfitting issues, whereas compression might continue to see improvements from additional parameters. Since many natural RNA secondary structures contain “pseudoknots”, a principled approach for compressing such structures would be interesting. If the compression-prediction correlation can be demonstrated in this domain as well, the lack of reliable free-energy models for pseudoknotted RNA structures and the relative lack of high-fidelity training data would make compression ability of even greater value in the search for better prediction models.

Appendix A Comparison with general purpose compressors

To compare the compression quality of our approach with state-of-the-art generic compressors, we use the paq8l tool (http://mattmahoney.net/dc/#paq). We compressed each individual RNA text file (with sequence in the first line and the secondary structure as dot-bracket string in the second line) in the friemel-modified dataset using paq8l -8 (the setting for best compression) and summed up the file sizes of all compressed RNAs.

The uncompressed size of friemel-modified is 39 284 962 bytes and all RNAs combined have 19 357 501 bases (2 bytes per base, one for sequence, one for structure, plus a small amount of metadata overhead). paq8l compressed this to 9 146 548 bytes. Converting this total compressed size to bits and dividing by the total number of bases in the dataset yields an average of 3.78 bits per base. This is 70% more than the 2.211 bits per base that our compressor with $G_{S}$ achieves (using a static rule-probability model).

It is not unexpected that a general purpose tool like paq8l does not come anywhere close to the compression of a domain-aware model; however, it is a bit surprising that paq8l uses substantially more space than the local first order empirical entropy: All first lines of the files have letters in $\{\texttt{A},\texttt{C},\texttt{G},\texttt{U}\}$, and thus a local entropy of at most 2 bits per character. For the second line, we only have $\{\texttt{(},\texttt{)},\texttt{\textbullet}\}$, and hence at most $\lg(3)\approx 1.58$ bits per character. Exploiting this local entropy would result in 3.58 bits per base.
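The figures in this appendix can be re-derived from the quoted byte and base counts. The short calculation below uses only numbers stated above; the variable names are chosen for illustration.

```python
from math import log2

uncompressed_bytes = 39_284_962   # friemel-modified, all text files
compressed_bytes = 9_146_548      # total size after paq8l -8
total_bases = 19_357_501
grammar_bits_per_base = 2.211     # G_S with static rule probabilities

raw_bytes_per_base = uncompressed_bytes / total_bases
paq_bits_per_base = 8 * compressed_bytes / total_bases
local_entropy_bound = 2 + log2(3)  # 2 bits for the base, lg 3 for the structure symbol
ratio = paq_bits_per_base / grammar_bits_per_base

print(f"raw: {raw_bytes_per_base:.2f} bytes/base")
print(f"paq8l: {paq_bits_per_base:.2f} bits/base")
print(f"local entropy bound: {local_entropy_bound:.2f} bits/base")
print(f"paq8l / G_S: {ratio:.2f}")
```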

Appendix B Grammars

Here, we list the used grammars; we use the compact notation from [3], where we only give $a$ and $\hat{a}$ as terminals instead of the pairs introduced in Section 2. The actual RNA grammars would have 4 rules for each rule with a single “$a$”; instead of $A\to\alpha a\beta$, we would actually have $A\to\alpha\genfrac{[}{]}{0.0pt}{1}{\texttt{A}}{\texttt{\textbullet}}\beta$, $A\to\alpha\genfrac{[}{]}{0.0pt}{1}{\texttt{C}}{\texttt{\textbullet}}\beta$, $A\to\alpha\genfrac{[}{]}{0.0pt}{1}{\texttt{G}}{\texttt{\textbullet}}\beta$, and $A\to\alpha\genfrac{[}{]}{0.0pt}{1}{\texttt{U}}{\texttt{\textbullet}}\beta$; similarly, each rule with an “$a\hat{a}$” pair actually stands for 6 rules resp. 16 rules if we allow non-canonical base pairs. For the stacking grammars, nonterminals $B^{a\hat{a}}$ are shorthand notation for 6 resp. 16 different nonterminals, which “remember” an enclosing pair. If there are several occurrences of the same $a\hat{a}$ pair within one rule, these must be replaced consistently (with the same bases in all occurrences).
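The expansion of the compact notation can be illustrated with a small generator. This is a sketch under assumptions: terminals are modelled as `(base, structure symbol)` tuples, the six canonical pairs are taken to be the Watson-Crick pairs plus the G-U wobble pairs, and the function names are hypothetical rather than taken from the prototype.

```python
from itertools import product

BASES = "ACGU"
# Watson-Crick pairs plus the G-U wobble pairs (6 ordered pairs in total).
CANONICAL = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def unpaired_rules(lhs):
    """Expand the compact rule  lhs -> a  into one rule per base."""
    return [(lhs, ((b, "•"),)) for b in BASES]


def pair_rules(lhs, inner, canonical_only=True):
    """Expand the compact rule  lhs -> a inner â  into one rule per allowed base pair."""
    rules = []
    for x, y in product(BASES, repeat=2):
        if canonical_only and (x, y) not in CANONICAL:
            continue
        rules.append((lhs, ((x, "("), inner, (y, ")"))))
    return rules


print(len(unpaired_rules("U")))                       # rules for  U -> a
print(len(pair_rules("B", "S")))                      # rules for  B -> a S â, canonical only
print(len(pair_rules("B", "S", canonical_only=False)))
```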

Our parsers require grammars to be free of $\varepsilon$ -rules, so we eliminated these in all grammars.

Moreover, the fast stochastic parser used for the prediction study requires a slightly more restrictive form: the grammars are not allowed to have left-recursive rules, and the nonterminals must be ordered, so that $B$ comes before $A$ whenever one can derive $B\alpha$ from $A$ . We only use the unambiguous grammars $G_{3},\ldots,G_{8}$ for the prediction study, so we directly give those grammars in the required form.

Grammar $G_{L^{\prime}}$ (LiuGrammar)

The first grammar is $G_{L^{\prime}}$ from Liu et al. [16] where we eliminate $\varepsilon$ -rules.

$T\to a$ (4 rules)

$T\to aS\hat{a}$ (16 rules)

$S\to T\mid TS$

Grammar $G_{1}$ (DowellGrammar1Bound)

Next, $G_{1},\ldots,G_{8}$ are the grammars taken from Dowell and Eddy [3].

$U\to a$

$B\to aS\hat{a}$

$C\to B\mid U$

$X\to UX\mid SX\mid U\mid S$

$S\to C\mid CX\mid US\mid USX$

Grammar $G_{2}$ (DowellGrammar2Bound)

$U\to a$

$P^{a\hat{a}}\to aP^{a\hat{a}}\hat{a}\mid S$

$S\to aP^{a\hat{a}}\hat{a}\mid U\mid US\mid SU\mid SS$

Grammar $G_{3}$ (DowellGrammar3Bound)

$U\to a$

$B\to aS\hat{a}$

$L\to B\mid UL$

$R\to U\mid UR$

$S\to B\mid UL\mid RU\mid LS\mid U$

Grammar $G_{4}$ (DowellGrammar4Bound)

$U\to a$

$B\to aS\hat{a}$

$C\to B\mid U$

$D\to C\mid CD$

$Q\to B\mid BD$

$S\to U\mid US\mid Q$

Grammar $G_{5}$ (DowellGrammar5Bound)

$U\to a$

$B\to aS\hat{a}$

$S\to U\mid B\mid US\mid BS$

Grammar $G_{6}$ (DowellGrammar6Bound)

$U\to a$

$B\to aM\hat{a}$

$T\to B\mid U$

$M\to B\mid TS\mid T$

$S\to TS\mid T$

An alternative version does not have the rule $M\to T$ ; that grammar then disallows hairpins of length one, i. e., ‘(•)‘.

Grammar $G_{7}$ (DowellGrammar7Bound)

$U\to a$

$B\to aV^{a\hat{a}}\hat{a}$ (16 rules)

$B^{b\hat{b}}\to aV^{a\hat{a}}\hat{a}$ ( $16\cdot 16$ rules)

$L\to B\mid UL$

$M\to UM\mid U$

$T\to U\mid UL\mid MU\mid LS$

$V^{a\hat{a}}\to B^{a\hat{a}}\mid T$ ( $16\cdot 2$ rules)

$S\to B\mid UL\mid MU\mid U\mid LS$

Grammar $G_{8}$ (DowellGrammar8Bound)

$U\to a$

$B\to aV^{a\hat{a}}\hat{a}$

$B^{b\hat{b}}\to aV^{a\hat{a}}\hat{a}$

$C\to U\mid B$

$D\to C\mid CD$

$E\to B\mid BD$

$N\to U\mid E\mid US\mid EU\mid EB$

$V^{a\hat{a}}\to B^{a\hat{a}}\mid N$

$S\to U\mid E\mid US$

Grammar $G_{S}$ (SchulzGrammar)

The grammar $G_{S}$ is taken from [20]; see also [24, Def. A.1.2]; we have made the modifications described below to make the grammar more suitable for compression.

Since we have to expand every occurrence of $a\hat{a}$ on the right-hand side into 6 (or even 16) rules in our RNA grammars, we replaced “ $aL\hat{a}$ ” in several right-hand sides with a nonterminal that expands to $aL\hat{a}$ (the nonterminal $A$ when we start a new stem, and the new nonterminal $I$ when we continue after an interior loop or bulge). This reduces the number of parameters, and hence the expressive power, slightly, but it keeps the grammar substantially smaller.

$p_{0}^{\prime}:S^{\prime}\rightarrow S,$

$p_{1}^{\prime}:S\rightarrow A,\quad p_{2}^{\prime}:S\rightarrow AC,\quad p_{3}^{\prime}:S\rightarrow TA,\quad p_{4}^{\prime}:S\rightarrow TAC,$

$p_{5}^{\prime}:T\rightarrow A,\quad p_{6}^{\prime}:T\rightarrow AC,\quad p_{7}^{\prime}:T\rightarrow TA,\quad p_{8}^{\prime}:T\rightarrow TAC,$

$p_{9}^{\prime}:T\rightarrow C,$

$p_{10}^{\prime}:C\rightarrow X^{C},\quad p_{11}^{\prime}:C\rightarrow CX^{C},$

$p_{12}^{\prime}:A\rightarrow aL\hat{a},$

$p_{13}^{\prime}:L\rightarrow aL\hat{a},\quad p_{14}^{\prime}:L\rightarrow M,\quad p_{15}^{\prime}:L\rightarrow P,\quad p_{16}^{\prime}:L\rightarrow Q,$

$p_{17}^{\prime}:L\rightarrow R,\quad p_{18}^{\prime}:L\rightarrow F,\quad p_{19}^{\prime}:L\rightarrow G,$

$p_{20}^{\prime}:G\rightarrow Ia,\quad p_{21}^{\prime}:G\rightarrow IX^{B}X^{B},\quad p_{22}^{\prime}:G\rightarrow IBX^{B}X^{B},$

$p_{23}^{\prime}:G\rightarrow aI,\quad p_{24}^{\prime}:G\rightarrow X^{B}X^{B}I,\quad p_{25}^{\prime}:G\rightarrow X^{B}X^{B}BI,$

$p_{26}^{\prime}:B\rightarrow X^{B},\quad p_{27}^{\prime}:B\rightarrow BX^{B},$

$p_{28}^{\prime}:F\rightarrow X^{F}X^{F}X^{F},\quad p_{29}^{\prime}:F\rightarrow X^{F}X^{F}X^{F}X^{F},\quad p_{30}^{\prime}:F\rightarrow X^{F}X^{F}X^{F}X^{F}X^{F},$

$p_{31}^{\prime}:F\rightarrow X^{F}X^{F}X^{F}X^{F}X^{F}H,$

$p_{32}^{\prime}:H\rightarrow X^{H},\quad p_{33}^{\prime}:H\rightarrow HX^{H},$

$p_{34}^{\prime}:P\rightarrow aIa,\quad p_{35}^{\prime}:P\rightarrow X^{I}IX^{I}X^{I},\quad p_{36}^{\prime}:P\rightarrow X^{I}X^{I}IX^{I},\quad p_{37}^{\prime}:P\rightarrow X^{I}X^{I}IX^{I}X^{I},$

$p_{38}^{\prime}:Q\rightarrow X^{I}X^{I}IX^{I}X^{I}X^{I},\quad p_{39}^{\prime}:Q\rightarrow X^{I}X^{I}IKX^{I}X^{I}X^{I},\quad p_{40}^{\prime}:Q\rightarrow X^{I}X^{I}X^{I}IX^{I}X^{I},\quad p_{41}^{\prime}:Q\rightarrow X^{I}X^{I}X^{I}JIX^{I}X^{I},$

$p_{42}^{\prime}:Q\rightarrow X^{I}X^{I}X^{I}IKX^{I}X^{I},\quad p_{43}^{\prime}:Q\rightarrow X^{I}X^{I}X^{I}JIKX^{I}X^{I},$

$p_{44}^{\prime}:R\rightarrow X^{I}IX^{I}X^{I}X^{I},\quad p_{45}^{\prime}:R\rightarrow X^{I}IKX^{I}X^{I}X^{I},\quad p_{46}^{\prime}:R\rightarrow X^{I}X^{I}X^{I}IX^{I},\quad p_{47}^{\prime}:R\rightarrow X^{I}X^{I}X^{I}JIX^{I},$

$p_{48}^{\prime}:J\rightarrow X^{I},\quad p_{49}^{\prime}:J\rightarrow JX^{I},$

$p_{50}^{\prime}:K\rightarrow X^{I},\quad p_{51}^{\prime}:K\rightarrow KX^{I},$

$p_{52}^{\prime}:M\rightarrow AA,\quad p_{53}^{\prime}:M\rightarrow UAA,\quad p_{54}^{\prime}:M\rightarrow AUA,\quad p_{55}^{\prime}:M\rightarrow AAN,$

$p_{56}^{\prime}:M\rightarrow UAUA,\quad p_{57}^{\prime}:M\rightarrow UAAN,\quad p_{58}^{\prime}:M\rightarrow AUAN,\quad p_{59}^{\prime}:M\rightarrow UAUAN,$

$p_{60}^{\prime}:N\rightarrow A,\quad p_{61}^{\prime}:N\rightarrow UA,\quad p_{62}^{\prime}:N\rightarrow AN,\quad p_{63}^{\prime}:N\rightarrow UAN,$

$p_{64}^{\prime}:N\rightarrow U,$

$p_{65}^{\prime}:U\rightarrow X^{U},\quad p_{66}^{\prime}:U\rightarrow UX^{U}$

We add the following rules:

$F\to X^{F}$ , $F\to X^{F}X^{F}$ (allow length 1 and 2 in hairpins)

$I\to aL\hat{a}$ (new nonterminal for use inside bulges/interior loops)

$S\to C$ (allow completely unpaired sequences)

Rules for all unpaired nonterminals:

$X^{B}\to a$ , $X^{C}\to a$ , $X^{F}\to a$ , $X^{H}\to a$ , $X^{I}\to a$ , $X^{U}\to a$

Appendix C Further results

This appendix reports on some further results that were left out of the main text due to space constraints in the proceedings version.

C.1 Huffman coding vs. Arithmetic coding

We here compare the influence of the coding step on compression ratio in isolation. For that, we modify Liu et al.’s RNACompress [16] to use arithmetic coding instead of a Huffman code, leaving everything else unchanged, and compare the results.

We were not able to obtain the original implementation of RNACompress or the datasets from Liu et al. [16]. We hence re-implemented RNACompress and, instead of the dataset from [16], used the friemel-modified dataset of 17 000 RNA samples originally taken from [1]. Some of the RNAs in Friemel’s dataset have non-canonical bonds (these are less stable secondary bonds). Since Liu et al. do not allow non-canonical bonds in their tool, we removed these from Friemel’s dataset, i.e., we replaced the opening **(** and closing **)** parentheses of non-canonical bonds with the unpaired symbol **•** in the positions where non-canonical bonds appeared. Afterwards, only the stable bonds (Watson-Crick and G–U wobble bonds) were left in all samples of the dataset, which we call friemel-modified.
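The preprocessing step described above can be sketched in Python as follows. This is a minimal illustration, assuming uppercase RNA letters, a balanced dot-bracket string, and the ASCII dot `.` in place of • for unpaired positions; the function name `drop_noncanonical` is hypothetical and not taken from the prototype implementation.

```python
# Watson-Crick pairs plus G-U wobble pairs, in both orientations.
CANONICAL = {"AU", "UA", "CG", "GC", "GU", "UG"}

def drop_noncanonical(seq, db):
    """Replace the parentheses of every non-canonical pair by dots.

    seq: RNA sequence over A, C, G, U
    db:  dot-bracket structure of the same length, assumed balanced
    """
    out = list(db)
    stack = []                       # positions of currently open "("
    for i, c in enumerate(db):
        if c == "(":
            stack.append(i)
        elif c == ")":
            j = stack.pop()          # partner of the base at position i
            if seq[j] + seq[i] not in CANONICAL:
                out[j] = out[i] = "."
    return "".join(out)

# Outer G-C pair is kept; the inner A-G pair is non-canonical and removed.
print(drop_noncanonical("GAAAAGC", "((...))"))
```

Only the two positions of an offending pair change; all other bonds and the sequence itself are left untouched.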

Unsurprisingly, arithmetic coding produced better compression results than Huffman coding, but the difference between the means is only 2.7%. Figure 5 shows the distribution of compressed size over the RNAs; while arithmetic coding has only a moderate impact on the mean compressed size, it helps a lot to bring down the right tail. The scatterplot in Figure 6 further shows that arithmetic coding (with this fixed static model) is indeed doing better on almost all RNAs, and that the effect is bigger for those RNAs that are compressed worse.
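The general reason why arithmetic coding can beat a Huffman code for the same static model can be seen in a toy computation. A Huffman code spends a whole number of bits, at least one, on every symbol, whereas arithmetic coding approaches the entropy of the model. The Python sketch below uses an arbitrary skewed three-symbol distribution chosen for illustration; it is not the rule distribution of RNACompress, and the helper name `huffman_lengths` is hypothetical.

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword length (in bits) of each symbol in a binary Huffman code."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:            # merged symbols move one level deeper
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

probs = [0.9, 0.05, 0.05]            # toy distribution, assumed
lengths = huffman_lengths(probs)
huffman_bits = sum(p * l for p, l in zip(probs, lengths))
entropy_bits = -sum(p * math.log2(p) for p in probs)

print(lengths)                       # codeword lengths per symbol
print(huffman_bits, entropy_bits)    # expected bits per symbol
```

For this distribution the Huffman code needs 1.1 bits per symbol on average, while the entropy is below 0.6 bits. For distributions whose probabilities are all powers of two, the two quantities coincide.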

C.2 Nullable Grammar vs. Non-Nullable Grammar

Liu et al. [16] originally use the following grammar (in the notation from Appendix B):

$G_{L}$

$L\to aS\hat{a}\mid a$

$S\to LS\mid\varepsilon$

For general parsers, $\varepsilon$ -rules are often inconvenient; we therefore modified this grammar to $G_{L^{\prime}}$ shown in Appendix B. This transformation makes the probabilistic model slightly richer and so will help compression, but it does not change the nature of the grammar; the structure of leftmost derivations of strings remains (almost) the same. (We here ignore the fact that the empty string is no longer in the language of grammar $G_{L^{\prime}}$ , while it was derivable in $G_{L}$ . For RNA compression, this is not relevant.) We manually implemented a parser for the original $G_{L}$ grammar and compared the compression outcome. As Figure 5 shows, this very moderate enrichment of the probabilistic model has a larger impact than moving from Huffman to arithmetic coding. The scatter plot in Figure 6 (right) shows that, again, we never do worse with $G_{L^{\prime}}$ than with $G_{L}$ , but that this time the biggest savings occur for the (much larger number of) RNAs that are compressed well.

Bibliography


[1] J. J. Cannone et al. The Comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3, 2002.
[2] R. Consortium. RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Research, 49(D1):D212–D220, 2020.
[3] R. D. Dowell and S. R. Eddy. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5(1):1–14, 2004.
[4] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
[5] J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13, 1970.
[6] J. Friemel. Contraction-Based Compression of RNA Secondary Structures. BSc dissertation, Universität Bielefeld, 2020.
[7] L. Fu, Y. Cao, J. Wu, Q. Peng, Q. Nie, and X. Xie. UFold: fast and accurate RNA secondary structure prediction with deep learning. Nucleic Acids Research, 50(3):e14–e14, 2021.
[8] R. Giancarlo, D. Scaturro, and F. Utro. Textual data compression in computational biology: a synopsis. Bioinformatics, 25, 2009.
