TL;DR
This paper introduces Rpair, a method that preprocesses large datasets with hashing to improve the efficiency of applying grammar-based compression schemes like RePair, supported by theoretical bounds and practical experiments.
Contribution
It presents a novel preprocessing algorithm using context-triggered hashing to facilitate faster grammar-based compression and provides theoretical and empirical validation.
Findings
Preprocessing with hashing improves compression speed.
The method approximates LZ77 parsing effectively.
Experimental results show competitiveness with existing approaches.
| File | Size | RePair Ratio | RePair Time | RePair Spc | BigRePair Ratio | BigRePair Time | BigRePair Spc | SOLCA Ratio | SOLCA Time | SOLCA Spc | BigSOLCA Ratio | BigSOLCA Time | BigSOLCA Spc |
| c50 | 2.75 | 0.80% | 1832 | 3842 | 0.91% | 66.40 | 454.7 | 1.35% | 244.1 | 107.4 | 1.54% | 103.6 | 182.9 |
| c100 | 5.51 | 0.30% | 7311 | 3155 | 0.48% | 62.13 | 246.4 | 0.77% | 236.4 | 53.67 | 0.86% | 94.03 | 128.8 |
| c250 | 13.8 | | | | 0.23% | 59.95 | 119.8 | 0.40% | 239.0 | 29.78 | 0.44% | 86.39 | 95.00 |
| c500 | 27.5 | | | | 0.14% | 59.97 | 118.0 | 0.28% | 237.4 | 17.05 | 0.30% | 85.12 | 84.72 |
| c1000 | 55.1 | | | | 0.10% | 60.95 | 117.3 | 0.22% | 237.3 | 13.56 | 0.23% | 86.13 | 78.82 |
| s815 | 3.75 | 1.72% | 8478 | 3726 | 1.93% | 90.87 | 2254 | 3.01% | 317.7 | 161.0 | 3.50% | 143.2 | 291.0 |
| s2073 | 9.72 | | | | 2.01% | 95.86 | 1055 | 3.01% | 370.9 | 153.1 | 3.53% | 157.3 | 285.5 |
| s4570 | 22.0 | | | | 2.61% | 244.1 | 534.2 | 3.57% | 480.6 | 154.4 | 4.24% | 185.6 | 334.7 |
| s11264 | 53.1 | | | | 1.51% | 2605 | 294.2 | 2.20% | 620.2 | 92.60 | 2.61% | 157.3 | 206.4 |
Rpair: Rescaling RePair with Rsync
Travis Gagie (1, 2)
Tomohiro I (3)
Giovanni Manzini (4)
Gonzalo Navarro (1, 5)
Hiroshi Sakamoto (3)
Yoshimasa Takabatake (3)
(1) CeBiB (Center for Biotechnology and Bioengineering), Chile
(2) Faculty of Computer Science, Dalhousie University, Canada
(3) Department of Artificial Intelligence, Kyushu Institute of Technology, Fukuoka, Japan
(4) Department of Science and Technological Innovation, University of Eastern Piedmont, Alessandria, Italy
(5) Department of Computer Science, University of Chile, Santiago, Chile
Partially funded with Basal Funds FB0001, Conicyt, Chile.
Abstract
Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice.
1 Introduction
Dictionary compression has proved to be an effective tool to exploit the repetitiveness that most of the fastest-growing datasets feature [23]. Lempel-Ziv (LZ77 for short) [22, 31] stands out as the most popular and effective compression method for repetitive texts. Further, it can be run in linear time and even in external memory [18]. LZ77 has the important drawback, however, that accessing random positions of the compressed text essentially requires decompressing it from the beginning. Therefore, it is not suitable to be used as a compressed data structure that represents the text in little space while simulating direct access to it. Grammar compression [19] is an alternative that offers better guarantees in this sense. The aim is to build a small context-free grammar (or Straight-Line Program, SLP) that generates (only) the text. The smallest SLP generating a text is always larger than its LZ77 parse, but only by a logarithmic factor that is rarely reached in practice. With an SLP we can access any text substring with only an additive logarithmic time penalty [3, 5], which has led to the development of various self-indexes building on SLPs [4, 9, 12, 13, 15, 24]. In addition, many other richer queries on sequences have been supported by associating summary information with the nonterminals of the SLP [1, 2, 5, 7, 10, 11].
Although finding the smallest SLP for a text is NP-complete [8, 26], there are several grammar construction algorithms that guarantee at most a logarithmic blowup on the LZ77 parse [8, 16, 17, 26, 27]. In practice, however, they are sharply outperformed by RePair [21], a heuristic that runs in linear time and obtains grammars of size very close to that of the LZ77 parse in most cases. This has made RePair the compressor of choice to build grammar-based compressed data structures [1, 7, 10, 11]. A serious problem with RePair, however, is that, despite running in linear time and space, in practice the constant is high and it can be built only on inputs that are about one tenth of the available memory. This significantly hampers its applicability on large datasets.
In this paper we introduce a scalable SLP compression algorithm that obtains space very close to that of RePair and can be applied on very large inputs. We prove a constant-approximation factor with respect to any SLP construction algorithm to which our technique is applied. Our experimental results show that we can compress a very repetitive 50GB text in less than an hour, using less than 650MB of RAM and obtaining very competitive compression ratios.
2 Preliminaries
For the sake of brevity, we assume the reader is familiar with SLPs, LZ77, and the links between the two. To prove theoretical bounds for our approach, we consider a variant of LZ77 over a text $S$ in which if $S[i..j]$ is a phrase then either $i = j$ and $S[i]$ is the first occurrence of a distinct character, or $S[i..j]$ occurs in $S[1..j-1]$ and $S[i..j+1]$ does not occur in $S[1..j]$. We refer to this variant as LZSS due to its similarity to Storer and Szymanski’s version of LZ77 [28], even though they allow substrings to be stored as raw text and we do not.
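The LZSS variant defined above can be illustrated with a naive quadratic-time sketch. The function name `lzss_parse` and the use of 0-based Python strings are assumptions made for this example only; it is meant to make the definition concrete, not to stand in for an efficient LZ77 implementation.

```python
def lzss_parse(s: str) -> list[str]:
    """Greedy LZSS-style parse (illustrative, quadratic time).

    Each phrase is either the first occurrence of a character or the
    longest prefix of the remaining text that also occurs starting at an
    earlier position (the earlier occurrence may overlap the phrase).
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        length = 0
        # Extend while s[i:i+length+1] occurs entirely inside
        # s[0:i+length], i.e. it has an occurrence starting before i.
        while i + length < n and s.find(s[i:i + length + 1], 0, i + length) != -1:
            length += 1
        if length == 0:
            length = 1  # first occurrence of a distinct character
        phrases.append(s[i:i + length])
        i += length
    return phrases
```

For instance, under this sketch the string `abababab` is parsed into the two single characters `a` and `b` followed by one long self-overlapping copy.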
The best-known algorithm for building SLPs is probably RePair [21], for which there are many implementations (see [14] and references therein). It works by repeatedly finding the most common pair of symbols and replacing them with a new non-terminal. Although it is not known to have a good worst-case approximation ratio with respect to the size of the LZ77 parse, in practice it outperforms other constructions. RePair uses linear time and space, but the coefficient in the space bound is quite large and so the standard implementations are practical only on small inputs. A more recent and more space-economical alternative to RePair is SOLCA [29], which we consider in Section 5.
Context-triggered piecewise hashing (CTPH) is a technique for parsing strings into blocks such that long repeated substrings are parsed the same way (except possibly at the beginning or end of the substrings). The name CTPH seems to be due to Kornblum [20] but the ideas go back to Tridgell’s Rsync [30] and Spamsum (https://www.samba.org/ftp/unpacked/junkcode/spamsum/README):
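The pair-replacement loop of RePair can be sketched as follows. This is a deliberately naive version that recounts all pairs in every round, so it does not achieve the linear time of the actual algorithm [21]; the names `repair` and `expand` and the tuple encoding of non-terminals are assumptions of this sketch.

```python
from collections import Counter


def repair(text: str):
    """Naive RePair: repeatedly replace the most frequent adjacent pair.

    Returns (sequence, rules), where rules maps each new non-terminal to
    the pair of symbols it expands to. Overlapping occurrences (as in
    "aaa") are counted naively, which affects only the choice of pair.
    """
    seq = list(text)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        nt = ("N", len(rules))  # tuples cannot clash with input characters
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)  # replace one left-to-right occurrence
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules) -> str:
    """Expand a terminal or non-terminal back into the text it generates."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return symbol
```

Concatenating `expand(x, rules)` over the returned sequence reproduces the input, which is a convenient sanity check when experimenting with the sketch.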
“The core of the spamsum algorithm is a rolling hash similar to the rolling hash used in ‘rsync’. The rolling hash is used to produce a series of ’reset points’ in the plaintext that depend only on the immediate context (with a default context width of seven characters) and not on the earlier or later parts of the plaintext.”
Specifically, in this paper we choose a rolling hash function, a threshold $p$ and a window size $w$, run a sliding window of fixed size $w$ over the text $S$ and end the current block whenever the window contains a triggering substring, which is a substring of length $w$ whose hash is congruent to 0 modulo $p$. When we end a block, we shift the window ahead $w$ characters so all the blocks are disjoint and form a parse, which we call the Rsync parse. We call the set of distinct blocks the Rsync dictionary: if the input text contains many repetitions, we expect the dictionary to be much smaller than the text.
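The parsing rule just described can be sketched in a few lines. The polynomial (Rabin-Karp style) hash, the constants `BASE` and `MOD`, the default window size and threshold, and the name `rsync_parse` are all assumptions of this sketch; the paper does not fix a particular rolling hash here.

```python
def rsync_parse(text: bytes, w: int = 4, p: int = 8):
    """Split text into blocks that end at triggering substrings.

    A window of w bytes slides over the text; when its hash is 0 modulo p
    the current block ends at the end of the window and the window jumps
    ahead w positions. Returns (dictionary, parse): the distinct blocks in
    order of first appearance, and the text as a list of block IDs.
    """
    BASE, MOD = 257, (1 << 61) - 1       # hypothetical hash parameters
    top = pow(BASE, w - 1, MOD)          # weight of the byte leaving the window
    n = len(text)
    blocks, start, i, h = [], 0, 0, None
    while i + w <= n:
        if h is None:                    # (re)compute the hash from scratch
            h = 0
            for c in text[i:i + w]:
                h = (h * BASE + c) % MOD
        if h % p == 0:                   # triggering substring: end the block
            blocks.append(text[start:i + w])
            start = i = i + w
            h = None
        else:                            # roll the window one byte forward
            if i + w < n:
                h = ((h - text[i] * top) * BASE + text[i + w]) % MOD
            i += 1
    if start < n:
        blocks.append(text[start:])      # trailing block without a trigger
    ids, parse = {}, []
    for b in blocks:
        parse.append(ids.setdefault(b, len(ids)))
    return list(ids), parse
```

Because a block boundary depends only on the content of the window, two copies of a long repeated substring are cut at the same places, which is why the dictionary tends to stay small on repetitive inputs.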
3 Algorithms
Given a string $S$, we can use Rsync parsing to help build an SLP for $S$ with Algorithm 1 (“Rpair”). The final SLP can be viewed as first generating the parse, then replacing each block ID in the parse by the sublist of non-terminals that generate each block, and finally replacing the sublists by the blocks themselves.
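The layered view of the final SLP can be made concrete with a toy example. The dictionary, parse and rule names below are hand-picked and hypothetical, and rules are stored as lists of arbitrary length for brevity rather than as the binary rules of a real SLP.

```python
def expand(symbol, rules):
    """Expand a symbol into the string it generates."""
    if symbol not in rules:
        return symbol                    # terminal character
    return "".join(expand(child, rules) for child in rules[symbol])


# Hypothetical Rsync output for the text "abcabcxyzabc".
dictionary = ["abc", "xyz"]              # distinct blocks
parse = [0, 0, 1, 0]                     # the text as a sequence of block IDs

# Bottom layer: one non-terminal per block, generating that block.
rules = {f"B{k}": list(block) for k, block in enumerate(dictionary)}

# Top layer: a small grammar for the parse, with each block ID already
# replaced by the non-terminal that generates the block.
rules["X"] = ["B0", "B0"]
rules["S"] = ["X", "B1", "B0"]

text = expand("S", rules)
```

Expanding the start symbol first yields the parse, then the block non-terminals, and finally the blocks themselves, mirroring the three stages described above.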
Since each separator character appears only once in the dictionary $D$ and its parse tree, any non-terminal whose expansion includes a separator character also appears only once and is deleted. Since the parse tree of an SLP is binary and each non-terminal we delete appears only once, the number of distinct non-terminals we delete is at least the length of the list of non-terminals at the roots of the maximal remaining subtrees of the parse tree, minus one. Therefore, creating rules to generate the sublists does not cause the number of distinct non-terminals to grow to more than the number in the original SLP for $D$, plus one.
Algorithm 1 works with any algorithm for building SLPs for the dictionary $D$ and the parse $P$. In Section 4 we show that, if we choose an algorithm that builds SLPs for $D$ and $P$ at most an $\alpha$-factor larger than their LZ77 parses, then we obtain an SLP an $O(\alpha)$-factor larger than the LZ77 parse of $S$. In the process we will refer to Algorithm 2 (“Rparse”), which produces an LZSS-like parse of $S$ but is intended only to simplify our analysis of Algorithm 1 (not to compete with cutting-edge LZ-based compressors). By “LZSS-like” we mean a parse in which each phrase is either a single character that has not occurred before, or a copy of an earlier substring. We note in passing that, if the parse in Step 3 is still too big for a normal construction, then we can apply Algorithm 1 to it. We will show in the full version of this paper that, if we recurse only a constant number of times, then we worsen our compression bounds by only a constant factor.
4 Analysis
The main advantage of using Rsync parsing to preprocess $S$ is that Rsync parsing is quite easy to parallelize, apply over streamed data, or apply in external memory. The resulting dictionary and parse may be significantly smaller than $S$, making it easier to apply grammar-based compression. In the full version of this paper we will analyze how much time and workspace Algorithms 1 and 2 use in terms of the total size of the dictionary and parse, but for now we are mainly concerned with the quality of the compression.
Let $b$ be the number of distinct blocks in the Rsync parse of $S$, and let $z$ be the number of phrases in the LZ77 parse of $S$. The first block is obviously the first occurrence of that substring and if $S[i..j]$ is the first occurrence of another block, then $S[i-w..j]$ (i.e., the block extended backward to include the previous triggering substring, of length $w$) is the first occurrence of that substring. Since the first occurrence of any non-empty substring overlaps or ends at a phrase boundary in the LZ77 parse, we can charge $S[i..j]$ to such a boundary in $S[i-w..j]$. Since blocks have length at least $w$ and overlap by only $w$ characters when extended backwards, each boundary has the first occurrences of at most two blocks charged to it, so $b = O(z)$.
In Step 5 of Algorithm 2, we discard $O(z)$ of the phrases in the LZSS parses of the dictionary $D$ and the parse $P$ when mapping to the phrases in the LZSS-like parse of $S$. Therefore, by showing that the number of phrases in the LZSS-like parse of $S$ is $O(z)$, we show that the total number of phrases in the LZSS parses of $D$ and $P$ is also $O(z)$, so the total number of phrases in their LZ77 parses is as well.
Due to space constraints, the proofs of the results below are in Appendix 0.A.
Lemma 1
If the $t$-th phrase in the LZSS parse of $S$ is $S[j..k]$ then the $5t$-th phrase resulting from Algorithm 2, if it exists, ends at or after $S[k]$.
We note that we can quite easily reduce the five in Lemma 1, at the cost of complicating our algorithm slightly, but this is not a priority for us right now and we leave it for the full version of this paper.
Corollary 1
Algorithm 2 yields an LZSS-like parse of $S$ with at most five times as many phrases as its LZSS parse.
Theorem 4.1
Algorithm 2 yields an LZSS-like parse of $S$ with $O(z)$ phrases.
Corollary 2
The LZ77 parses of $D$ and $P$ have $O(z)$ phrases.
Let $A$ be any algorithm that builds an SLP at most an $\alpha$-factor larger than the LZ77 parse of its input. For example, with Rytter’s construction [26] we have $\alpha = O(\log n)$ on inputs of length $n$.
By Corollary 2, applying $A$ to $D$ (Step 2b in Algorithm 1) yields an SLP for $D$ with $O(\alpha z)$ rules. As explained in Section 3, Steps 2c to 2g then increase the number of rules by at most one while modifying the SLP such that, for each block in the dictionary, there is a non-terminal whose expansion is that block.
Similarly, applying $A$ to $P$ (Step 3) yields an SLP for $P$ with $O(\alpha z)$ rules. Replacing the terminals in the SLP by the non-terminals generating the blocks and then combining the two SLPs (Steps 4 and 5) yields an SLP for $S$ with $O(\alpha z)$ rules. This gives us our main result of this section:
Theorem 4.2
Using $A$ in Steps 2b and 3 of Algorithm 1 yields an SLP for $S$ with $O(\alpha z)$ rules.
5 Experiments
We use two genome collections in our experiments: c$k$ consists of $k$ concatenated copies of the human chromosome chr19, of about 59MB each; s$k$ consists of $k$ concatenated salmonella genomes, of widely different sizes.
We compare two grammar compressors: RePair [21] produces the best known compression ratios but uses a lot of main memory space, whereas SOLCA [29] aims at optimizing main memory usage. Their versions combined with prefix-free parsing are BigRePair and BigSOLCA. RePair could be run only on the smaller collections. Appendix 0.B gives more details on the experimental setup.
Table 1 shows the results in terms of compression ratio, time, and space in RAM. On the more repetitive chr19 genomes, BigRePair is clearly the best choice for large files. It loses to RePair in compression ratio, but RePair took 11 hours just to process 5.5GB, so it is not an option for larger files. BigRePair, by contrast, processed 55GB in less than an hour using 650MB of RAM. Similarly, SOLCA obtains better compression than BigSOLCA but takes more time, though the latter uses more space. The comparison between the two compressors shows that BigRePair performs better than both SOLCA and BigSOLCA in both compression ratio (reaching nearly half the compressed size of SOLCA on the largest files) and time (roughly two thirds of the time of BigSOLCA, according to Table 1). Still, SOLCA uses much less space: it compresses 55GB in 3.6 hours, but using less than 75MB.
The results start similarly on the less compressible salmonella collection, but they reach an important turning point. The time of BigRePair on chr19 was stable around 1GB per minute, but on salmonella it is not: when moving from 10GB to 20GB of input data, the time per processed GB of BigRePair jumps by a factor of 2.5, and when moving from 20GB to 50GB it jumps by a factor of more than 10. To process the largest 53GB file, BigRePair requires more than 38 hours and over 15 GB of RAM. SOLCA, instead, handles this file in nearly 9 hours and less than 5 GB, and BigSOLCA in less than 2.5 hours and 11 GB, being the fastest. What happens is that, because the collection is less compressible, the output of the prefix-free parse is still too large for RePair, which slows down drastically as soon as it cannot fit its structures in main memory. The much lower memory footprint of SOLCA, instead, pays off on these large and less compressible files, though its compression ratio is worse than that of BigRePair.
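The running times quoted in the text can be cross-checked against Table 1 with a little arithmetic. The sketch below assumes that the Size column is in GB and the Time column is in seconds per GB; these units are inferred from the prose and are not stated alongside the table, and the helper name `total_hours` is hypothetical.

```python
def total_hours(size_gb: float, seconds_per_gb: float) -> float:
    """Total running time in hours, assuming Time is given in seconds per GB."""
    return size_gb * seconds_per_gb / 3600.0


# (Size, Time) pairs copied from Table 1.
runs = {
    "RePair on c100": (5.51, 7311),
    "BigRePair on c1000": (55.1, 60.95),
    "SOLCA on c1000": (55.1, 237.3),
    "BigRePair on s11264": (53.1, 2605),
    "SOLCA on s11264": (53.1, 620.2),
    "BigSOLCA on s11264": (53.1, 157.3),
}

for name, (size, rate) in runs.items():
    print(f"{name}: {total_hours(size, rate):.1f} h")
```

Under that assumption the products land close to the figures in the text, for example about 11 hours for RePair on the 5.5GB file and a little over 38 hours for BigRePair on the largest salmonella file.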
Appendix 0.A Omitted Proofs
0.A.1 Proof of Lemma 1
Proof
Let the $t$-th phrase in the LZSS parse be $S[j..k]$. Our claim is trivially true for $t = 1$, since the first phrases in both parses are the single character $S[1]$, so let $t$ be greater than 1 and assume our claim is true for $t - 1$, meaning the $5(t-1)$st phrase in our parse ends at $S[k']$ with $k' \geq j - 1$. If $k' \geq k$ then our claim is also trivially true for $t$, so assume $k' < k$. We must show that our parse divides $S[k'+1..k]$ into at most five phrases, in order to prove our claim for $t$.
First suppose that does not completely contain a triggering substring, so it overlaps at most two blocks. (It can overlap two blocks without containing a triggering substring if and only if a prefix of length less than lies in one block and the rest lies in the next block.) Let be ’s source and let , so in the LZSS parse is copied from . Since does not completely contain a triggering substring either, it too overlaps at most two blocks.
Without loss of generality (since the other cases are easier), assume and each overlap two blocks and they are split differently: lies in one block and lies in the next, and lies in one block and in the next, with . Assume also that , since the other case is symmetric. Since is completely contained in a block and occurs earlier completely contained in a block, as , our parse does not divide it. Similarly, since and are each completely contained in a block and occur earlier each completely contained in a block, as and , respectively, our parse does not divide them. Therefore, our parse divides into at most three phrases.
Now suppose the first and last triggering substrings completely contained in are and (possibly with ). By the arguments above, our parse divides into at most three phrases. Since is a sequence of complete blocks that have occurred earlier (in ), our parse does not divide it unless is a complete block that has occurred before as a complete block, in which case it may divide once between and . Since is completely contained in a block and occurs earlier completely contained in a block (in ), our parse does not divide it. Therefore, our parse divides into at most five phrases. ∎
0.A.2 Proof of Corollary 1
Proof
If the LZSS parse of $S[1..n]$ has $t$ phrases then the $t$-th phrase ends at $S[n]$ so, by Lemma 1, Algorithm 2 yields a parse with at most $5t$ phrases. ∎
0.A.3 Proof of Theorem 4.1
Proof
It is well known that the LZSS parse of $S$ has at most twice as many phrases as its LZ77 parse (since dividing each LZ77 phrase into a prefix with an earlier occurrence and a mismatch character yields an LZSS-like parse with at most twice as many phrases, and the LZSS parse has the fewest phrases of any LZSS-like parse). Therefore, by Corollary 1, Algorithm 2 yields a parse with at most $10z$ phrases, where $z$ is the number of phrases in the LZ77 parse of $S$. ∎
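The two bounds combined in this proof can be written as a single chain. Here $z$ is the number of LZ77 phrases as in Section 4, while $z_{\mathrm{SS}}$ (the number of phrases in the LZSS parse) and $z_{\mathrm{R}}$ (the number of phrases produced by Algorithm 2) are symbols introduced only for this illustration.

```latex
z_{\mathrm{R}} \;\le\; 5\, z_{\mathrm{SS}} \quad \text{(Corollary 1)},
\qquad
z_{\mathrm{SS}} \;\le\; 2z \quad \text{(split each LZ77 phrase in two)}
\quad \Longrightarrow \quad
z_{\mathrm{R}} \;\le\; 5 \cdot 2z \;=\; 10z \;=\; O(z).
```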
0.A.4 Proof of Corollary 2
Proof
Immediate, from Theorem 4.1, the fact that the LZ77 parse is no larger than the LZSS parse, and inspection of Algorithm 1. ∎
Appendix 0.B Experimental setup
Our experiments ran on an Intel(R) i7-4770 @ 3.40 GHz machine with 32 GB memory.
The chr19 collection was downloaded from the 1000 Genomes Project. Each chr19 sequence was derived by using the bcftools consensus tool to combine the haplotype-specific (maternal or paternal) variant calls for an individual with the chr19 sequence in the GRCh37 human reference. The salmonella genomes were downloaded from NCBI (BioProject PRJNA183844) and preprocessed by assembling each individual sample with IDBA-UD [25], setting kMaxShortSequence to 1024 per public advice from the author to accommodate the longer paired-end reads that modern sequencers produce. More details of the collections are available in previous work [6, Sec. 4].
For RePair we use Navarro’s implementation for large files, at http://www.dcc.uchile.cl/gnavarro/software/repair.tgz, letting it use 10GB of main memory, whereas the implementation of SOLCA is at https://github.com/tkbtkysms/solca. To measure their compression ratios in a uniform way, we consider the following encodings of their output: if RePair produces (binary) rules and an initial rule of length , we account bits to encode the topology of the pruned parse tree (where the nonterminal ids become the preorder of their internal node in this tree) and bits to encode the leaves of the tree and the initial rule. SOLCA is similar, with .
Our code is available at https://gitlab.com/manzai/bigrepair.
- [1] A. Abeliuk, R. Cánovas, and G. Navarro. Practical compressed suffix trees. Algorithms, 6(2):319–351, 2013.
- [2] H. Bannai, T. Gagie, and T. I. Online LZ77 parsing and matching statistics with RLBWTs. In CPM, pages 7:1–7:12, 2018.
- [3] D. Belazzougui, S. J. Puglisi, and Y. Tabei. Access, rank, select in grammar-compressed strings. In ESA, pages 142–154, 2015.
- [4] P. Bille, M. B. Ettienne, I. L. Gørtz, and H. W. Vildhøj. Time-space trade-offs for Lempel-Ziv compressed indexing. In CPM, pages 16:1–16:17, 2017.
- [5] P. Bille, G. M. Landau, R. Raman, K. Sadakane, S. S. Rao, and O. Weimann. Random access to grammar-compressed strings and trees. SIAM J. Comput., 44(3):513–539, 2015.
- [6] C. Boucher, T. Gagie, A. Kuhnle, and G. Manzini. Prefix-free parsing for building big BWTs. In WABI, pages 2:1–2:16, 2018.
- [7] N. Brisaboa, A. Gómez-Brandón, G. Navarro, and J. Paramá. Gract: A grammar-based compressed index for trajectory data. Inf. Sci., 483:106–135, 2019.
- [8] M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar problem. IEEE Trans. Inf. Theory, 51(7):2554–2576, 2005.
