LZRR: LZ77 Parsing with Right Reference
Takaaki Nishimoto, Yasuo Tabei

TL;DR
This paper introduces LZRR, a novel bidirectional parsing method that guarantees fewer phrases than LZ77, achieving approximately 5% better compression on benchmark strings.
Contribution
LZRR is the first practical bidirectional parsing method with theoretical guarantees of smaller phrase counts than LZ77.
Findings
LZRR reduces phrase count by about 5% compared to LZ77.
LZRR guarantees smaller phrase count theoretically.
Experimental results confirm improved compression ratios.
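
Table 1: Number of phrases for each method. The last column is the ratio of LZRR phrases to LZ77 phrases.

| String | String length | LZ77 | LEX | LZRR | LZRR/LZ77 |
|---|---|---|---|---|---|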
| fib41 | 267,914,296 | 22 | 4 | 5 | 0.227 |
| rs.13 | 216,747,218 | 52 | 40 | 51 | 0.981 |
| tm29 | 268,435,456 | 56 | 43 | 31 | 0.554 |
| dblp.xml.00001.1 | 104,857,600 | 59,385 | 58,537 | 55,127 | 0.928 |
| dblp.xml.00001.2 | 104,857,600 | 59,556 | 60,220 | 55,122 | 0.926 |
| dblp.xml.0001.1 | 104,857,600 | 78,167 | 82,879 | 73,584 | 0.941 |
| dblp.xml.0001.2 | 104,857,600 | 78,158 | 99,467 | 73,583 | 0.941 |
| sources.001.2 | 104,857,600 | 294,994 | 466,074 | 287,411 | 0.974 |
| dna.001.1 | 104,857,600 | 308,355 | 307,329 | 295,354 | 0.958 |
| proteins.001.1 | 104,857,600 | 355,268 | 364,024 | 337,711 | 0.951 |
| english.001.2 | 104,857,600 | 335,815 | 487,586 | 324,282 | 0.966 |
| einstein.de.txt | 92,758,441 | 34,287 | 37,719 | 31,798 | 0.927 |
| einstein.en.txt | 467,626,544 | 89,437 | 96,487 | 83,368 | 0.932 |
| world_leaders | 46,968,181 | 175,670 | 179,503 | 165,626 | 0.943 |
| influenza | 154,808,555 | 769,286 | 764,634 | 714,320 | 0.929 |
| kernel | 257,961,616 | 793,915 | 794,058 | 741,556 | 0.934 |
| cere | 461,286,644 | 1,695,631 | 1,649,448 | 1,597,657 | 0.942 |
| coreutils | 205,281,778 | 1,441,384 | 1,439,918 | 1,359,606 | 0.943 |
| Escherichia_Coli | 112,689,515 | 2,078,512 | 2,014,012 | 1,961,296 | 0.944 |
| para | 429,265,758 | 2,332,657 | 2,238,362 | 2,200,802 | 0.943 |
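
Table 2: Execution time and memory consumption for each method on a subset of the benchmark strings.

| String | String length | LZ77 time [sec] | LEX time [sec] | LZRR time [sec] | LZ77 memory [MB] | LEX memory [MB] | LZRR memory [MB] |
|---|---|---|---|---|---|---|---|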
| einstein.de.txt | 92,758,441 | 24 | 16 | 27 | 2,266 | 2,266 | 3,808 |
| einstein.en.txt | 467,626,544 | 130 | 85 | 147 | 11,418 | 11,418 | 19,196 |
| world_leaders | 46,968,181 | 8 | 5 | 16 | 1,148 | 1,148 | 1,939 |
| influenza | 154,808,555 | 42 | 27 | 51 | 3,781 | 3,781 | 6,351 |
| kernel | 257,961,616 | 71 | 47 | 88 | 6,299 | 6,299 | 10,602 |
| cere | 461,286,644 | 131 | 90 | 500 | 11,263 | 11,263 | 18,925 |
| coreutils | 205,281,778 | 56 | 37 | 68 | 5,013 | 5,013 | 8,453 |
| Escherichia_Coli | 112,689,515 | 32 | 22 | 46 | 2,752 | 2,752 | 4,632 |
| para | 429,265,758 | 125 | 85 | 203 | 10,481 | 10,481 | 17,609 |

Execution time and memory consumption for each method on all benchmark strings.

| String | String length | LZ77 time [sec] | LEX time [sec] | LZRR time [sec] | LZ77 memory [MB] | LEX memory [MB] | LZRR memory [MB] |
|---|---|---|---|---|---|---|---|
| fib41 | 267,914,296 | 99 | 74 | 113 | 6,542 | 6,542 | 11,978 |
| rs.13 | 216,747,218 | 79 | 59 | 110 | 5,292 | 5,293 | 9,654 |
| tm29 | 268,435,456 | 108 | 81 | 142 | 6,554 | 6,555 | 11,797 |
| dblp.xml.00001.1 | 104,857,600 | 30 | 21 | 42 | 2,561 | 2,561 | 4,308 |
| dblp.xml.00001.2 | 104,857,600 | 30 | 20 | 41 | 2,561 | 2,561 | 4,305 |
| dblp.xml.0001.1 | 104,857,600 | 30 | 20 | 42 | 2,561 | 2,561 | 4,303 |
| dblp.xml.0001.2 | 104,857,600 | 30 | 20 | 41 | 2,561 | 2,561 | 4,303 |
| sources.001.2 | 104,857,600 | 28 | 19 | 41 | 2,561 | 2,561 | 4,302 |
| dna.001.1 | 104,857,600 | 30 | 20 | 41 | 2,561 | 2,561 | 4,302 |
| proteins.001.1 | 104,857,600 | 31 | 21 | 42 | 2,561 | 2,561 | 4,302 |
| english.001.2 | 104,857,600 | 30 | 21 | 42 | 2,561 | 2,561 | 4,302 |
| einstein.de.txt | 92,758,441 | 24 | 16 | 27 | 2,266 | 2,266 | 3,808 |
| einstein.en.txt | 467,626,544 | 130 | 85 | 147 | 11,418 | 11,418 | 19,196 |
| world_leaders | 46,968,181 | 8 | 5 | 16 | 1,148 | 1,148 | 1,939 |
| influenza | 154,808,555 | 42 | 27 | 51 | 3,781 | 3,781 | 6,351 |
| kernel | 257,961,616 | 71 | 47 | 88 | 6,299 | 6,299 | 10,602 |
| cere | 461,286,644 | 131 | 90 | 500 | 11,263 | 11,263 | 18,925 |
| coreutils | 205,281,778 | 56 | 37 | 68 | 5,013 | 5,013 | 8,453 |
| Escherichia_Coli | 112,689,515 | 32 | 22 | 46 | 2,752 | 2,752 | 4,632 |
| para | 429,265,758 | 125 | 85 | 203 | 10,481 | 10,481 | 17,609 |
LZRR: LZ77 Parsing with Right Reference
Takaaki Nishimoto∗ and Yasuo Tabei∗
∗RIKEN Center for Advanced Intelligence Project
{takaaki.nishimoto,yasuo.tabei}@riken.jp
Abstract
Lossless data compression has been widely studied in computer science. One of the most widely used lossless data compression methods is Lempel-Ziv 77 (LZ77) parsing, which achieves a high compression ratio. Bidirectional (a.k.a. macro) parsing is a lossless data compression method that computes a sequence of phrases, each copied from another substring (target phrase) located either to the left or to the right in the input string. Gagie et al. (LATIN 2018) recently showed that a large gap exists between the number of phrases in the smallest bidirectional parse of a given string and the number of LZ77 phrases. In addition, finding the smallest bidirectional parse of a given text is NP-complete. Several variants of bidirectional parsing have been proposed thus far, but no prior bidirectional parsing is guaranteed to produce no more phrases than LZ77 for every string. In this paper, we present the first practical bidirectional parsing, named LZ77 parsing with right reference (LZRR), in which the number of LZRR phrases is theoretically guaranteed to be no larger than the number of LZ77 phrases. Experimental results using benchmark strings show that the number of LZRR phrases is approximately five percent smaller than that of LZ77 phrases.
1 Introduction
Lossless data compression has been widely studied in computer science. One of the most widely used lossless data compression methods is Lempel-Ziv 77 (LZ77) parsing [8], which compresses a given string by computing a sequence of phrases, each copied from the longest matching substring to its left in the input string. LZ77 parsing has a long research history, with the first paper on it published in 1976 [8]. Many extensions of LZ77 have since been proposed (e.g., [3, 7, 10]), and LZ77 parsing achieves the smallest compression ratio among them.
Bidirectional (a.k.a. macro) parsing [11] is a lossless data compression method that computes a sequence of phrases, each copied from another substring (target phrase) located either to the left or to the right in the input string. Each set of LZ77 phrases is convertible into a set of bidirectional phrases, so the number of phrases in the smallest bidirectional parsing is no larger than that of LZ77 phrases. Gagie et al. [2] recently showed that the number $z$ of LZ77 phrases representing an input string of length $n$ can be tightly bounded by the smallest number $b$ of bidirectional phrases representing the same string as $z = O(b \log(n/b))$, which suggests that a large gap exists between $z$ and $b$. In addition, finding the smallest bidirectional parse of a given text is NP-complete [11]. Thus, an important open challenge is to develop a polynomial-time bidirectional parsing such that the number of bidirectional phrases is smaller than that of LZ77 phrases.
Several variants of bidirectional parsing have been proposed thus far. Lex-parsing [9] is a bidirectional parsing that computes a sequence of bidirectional phrases that each occurred previously on a suffix array of a string. The number of phrases in the lex-parsing is bounded by [2]. Although the lex-parsing is effective for most benchmark strings (i.e., phrases is very close to ) in practice, it can fail to compress some strings (i.e., is much larger than ) [9]. Lcpcomp [1] and a bidirectional parsing using Burrows-Wheeler transform (BWT) [2] have also been proposed, and they never have fewer phrases than lex-parse [9]. Kempa and Prezza proposed a parsing algorithm for computing the bidirectional parse of an input string for a given string attractor of the string [6]. The number of the bidirectional phrases is bounded by , where is the size of the string attractor. Let be the size of the smallest string attractor for a given string. Then holds [6]. In addition, finding the smallest string attractor of a given string is also NP-complete [6]. In summary, no prior bidirectional parsing achieves high compression that is smaller than that of LZ77 phrasing for any string.
In this paper, we present the first practical bidirectional parsing, named LZ77 parsing with right reference (LZRR), in which the number of LZRR phrases is never larger than the number of LZ77 phrases. LZRR is a polynomial-time algorithm that greedily computes phrases from a string in left-to-right order, as LZ77 does. The main difference between LZRR and LZ77 is how each phrase is chosen. Whereas LZ77 parsing chooses the longest substring occurring previously as a phrase, LZRR parsing also considers subsequent occurrences (i.e., it chooses the longest substring occurring previously or subsequently as a phrase). For this reason, the number of LZRR phrases is theoretically guaranteed to be no more than that of LZ77 phrases. Experimental results using benchmark datasets show that the number of LZRR phrases is approximately five percent smaller than that of LZ77 phrases.
2 Preliminaries
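Let $\Sigma$ be an ordered alphabet of size $\sigma$, $T$ be a string of length $n$ over $\Sigma$, and $|T|$ be the length of $T$. Let $T[i]$ be the $i$-th character of $T$ and $T[i..j]$ be the substring of $T$ that begins at position $i$ and ends at position $j$. $T[i..]$ denotes the suffix of $T$ beginning at position $i$, i.e., $T[i..] = T[i..n]$. Let $T^R$ be the reversed string of $T$, i.e., $T^R = T[n] T[n-1] \cdots T[1]$.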
denotes all the occurrence positions of string in string , i.e., . Let be the length of the longest common prefix (LCP) of and . For two strings and , represents that is lexicographically smaller than , we write . Similarly, for a string , represents that the LCP of and is equal to or longer than that of and . For example, .
Our model of computation is a unit-cost word RAM with a machine word size of $\Omega(\log n)$ bits. We evaluate the space complexity in terms of the number of machine words. A bitwise evaluation of space complexity can be obtained with a $\log n$ multiplicative factor.
2.1 Arrays
Suffix array , inverse suffix array , LCP array , longest previous factor array , and sorted suffix array are integer arrays of length for a string , respectively. is the permutation of such that holds. is the permutation of such that holds for any . and for . stores the length of the longest prefix of occurring previously; that is and , where returns the maximal element of a given set. is the sorted starting positions of suffixes in decreasing order for the length of the LCP with . Formally, for an integer , is a permutation of such that . is not unique when there exist two positions and such that .
For , , , , , and .
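The arrays defined in this subsection can be made concrete with a brute-force sketch. The Python below is illustrative only: it uses 0-based positions (the paper's definitions are 1-based), it builds each array naively rather than with the linear-time constructions of [4, 5], and all function names are hypothetical.

```python
def lcp_len(x, y):
    """Length of the longest common prefix of strings x and y."""
    k = 0
    while k < len(x) and k < len(y) and x[k] == y[k]:
        k += 1
    return k


def suffix_array(t):
    """Starting positions of the suffixes of t in lexicographic order."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def inverse_suffix_array(sa):
    """isa[p] is the rank of the suffix starting at p, so sa[isa[p]] == p."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def lcp_array(t, sa):
    """lcp[r] is the LCP of the suffixes at ranks r-1 and r (0 for r == 0)."""
    return [0 if r == 0 else lcp_len(t[sa[r - 1]:], t[sa[r]:])
            for r in range(len(sa))]


def lpf_array(t):
    """lpf[i] is the longest prefix of t[i:] that also starts at some j < i."""
    return [max((lcp_len(t[i:], t[j:]) for j in range(i)), default=0)
            for i in range(len(t))]
```

For the string `banana`, the suffix array is `[5, 3, 1, 0, 4, 2]` and the longest previous factor array is `[0, 0, 0, 3, 2, 1]`: for instance, the suffix `ana` at position 3 already occurs at position 1.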
2.2 Union-find data structure
Union-find is a data structure for disjoint sets and supports the following operations for disjoint set : , , . adds element into and returns the integer where is the cardinality of . merges two sets containing and , respectively; it adds a new set into ; it removes and from . The returns the id of the set containing in . The union-find data structure performs , , operations in time, while using space [12], where and are the numbers of and operations, respectively, and is the inverse of the -th row of Ackermann function.
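As a point of reference, a textbook union-find with path halving and union by size can be sketched as follows. This is a generic implementation, not the specific variant of [12]; the method names and the 0-based ids returned by `make_set` are assumptions made for this illustration.

```python
class UnionFind:
    """Disjoint sets over the integers 0, 1, 2, ... in creation order."""

    def __init__(self):
        self.parent = []
        self.size = []

    def make_set(self):
        """Add a new singleton element and return its integer id."""
        new_id = len(self.parent)
        self.parent.append(new_id)
        self.size.append(1)
        return new_id

    def find(self, x):
        """Return the id of the set containing x (with path halving)."""
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        """Merge the sets containing x and y; return the merged set's id."""
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx          # attach the smaller tree under the larger
        self.size[rx] += self.size[ry]
        return rx
```

Two elements are in the same set exactly when `find` returns the same id for both, which is the test LZRR later uses to detect reference loops.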
2.3 Bidirectional phrases and partial bidirectional phrases
Bidirectional phrases (BP) [11] of string is a partition of as substrings (phrases) such that each is (i) either copied from another substring (target phrase) with , which can overlap , or (ii) an explicit character (character phrase), i.e., . Target phrase is denoted as a pair of the reference position and the length of . The substring is called the reference string of .
The original string can be recovered from BP by referring to a finite number of phrases from each in . If an infinite loop of phrases referred from any exists, the original string cannot be recovered from . If can be recovered from , is said to be a valid BP of ; otherwise, is said to be invalid BP of .
The value of the phrase reached from position in iterations of references is formally defined as . For , if is a character phrase, ; otherwise , where is the integer such that holds for . For , we define as follows:
[TABLE]
If holds, then there are no infinite loops of references containing . Therefore, is valid BP of if has no infinite loops of references, i.e., holds for all .
For example, let and be BPs of . Then is valid since and . On the other hand, is invalid since .
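The validity condition can be checked mechanically by following references from every position until a character phrase is reached. The sketch below is a hypothetical illustration, not the paper's procedure: it assumes 0-based positions, represents a character phrase as a one-character string and a target phrase as a pair `(ref, length)`, and reports an invalid parse by returning `None`.

```python
def decode_bp(phrases):
    """Decode a bidirectional parse; return None if it is invalid.

    Each phrase is either a one-character string (character phrase) or a
    pair (ref, length) meaning "copy `length` characters starting at `ref`".
    """
    ref, chars = [], []
    for ph in phrases:
        if isinstance(ph, str):
            ref.append(None)          # explicit character: nothing to follow
            chars.append(ph)
        else:
            start, length = ph
            for k in range(length):
                ref.append(start + k)  # position copies from start + k
                chars.append(None)
    n = len(ref)
    out = []
    for p in range(n):
        q, steps = p, 0
        while ref[q] is not None:
            q = ref[q]
            steps += 1
            # More than n hops means a loop; an out-of-range hop is malformed.
            if steps > n or not 0 <= q < n:
                return None
        out.append(chars[q])
    return "".join(out)
```

With the string `abab`, the parses `['a', 'b', (0, 2)]` (left reference) and `[(2, 2), 'a', 'b']` (right reference) both decode correctly, whereas `[(2, 2), (0, 2)]` is invalid because the two phrases refer to each other.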
LZ77 phrases [8] of string are a specialization of BP and defined as the bidirectional phrases that are all selected from previously seen substrings. Since there is no infinite loops of references on phrases, LZ77 phrases of are always valid BP of . Formally, let of be valid BP of such that for each .
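For comparison with LZRR, a greedy LZ77 parse restricted to left references can be sketched in a few lines. This is a quadratic-time illustration under assumed conventions (0-based positions, overlapping copies allowed, a character phrase emitted when no earlier occurrence exists), not the linear-time implementation used in the experiments; the function name is hypothetical.

```python
def lz77_parse(text):
    """Greedy LZ77 parse: each phrase is the longest match starting earlier."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_ref = 0, -1
        for j in range(i):                      # only positions to the left
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1                     # the copy may overlap position i
            if length > best_len:
                best_len, best_ref = length, j
        if best_len == 0:
            phrases.append(text[i])             # new character: explicit phrase
            i += 1
        else:
            phrases.append((best_ref, best_len))
            i += best_len
    return phrases
```

For example, `lz77_parse("abababab")` yields `['a', 'b', (0, 6)]`: two character phrases followed by one self-overlapping copy.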
LZRR parsing gradually builds the valid BP from the start position of in the left-to-right order. A subsequence of the valid BP is called partial bidirectional phrases (PBP) and is defined as a BP for a prefix of that can be copied from any substring of , i.e., for all for a target phrase , which avoids a self copy.
The concatenation of such PBP and every character phrase referred from can recover the prefix of with a finite number of references. Such PBP are called valid PBP, and other PBP are called invalid PBP. Formally, let be the concatenation of PBP and the remaining character phrases equivalent to suffix . is valid if is valid; otherwise is invalid. For example, let be a PBP of . Then .
The original string of a PBP can be recovered by iteratively referring to phrases starting from each target phrase in a finite number of times until the character phrase is found. Thus, the position of each character phrase can be seen as the source for positions of target/character phrases. Formally, for a PBP and position on , returns source of in , i.e., position satisfying either (i) and for an integer or (ii) and . For the above example, the source of the position is the position in since , and .
3 LZRR
A key idea of LZRR parsing is to compute the valid BP from an input text by gradually computing the valid PBP from the head of in the left-to-right order. LZRR parsing computes whole LZRR phrases initialized as zero phrase for an input string in two steps: (i) it computes candidates of the reference positions of the longest valid phrase following the current LZRR phrase; and (ii) it computes the valid (possibly character) phrase with the maximum length among extensions starting from those candidates. Steps (i) and (ii) are iterated until whole LZRR phrases are computed.
LZRR parsing uses two major functions of LP and LF for steps (i) and (ii), respectively. Given a valid PBP of , LP function returns the longest valid phrase following , i.e., the longest phrase such that is a valid PBP of . Given a valid PBP of and reference position , LF function returns the length of the longest valid phrase having reference position and following , i.e., ) where is the starting position of the phrase following . LZRR parsing computes LZRR phrases as the valid BP of where is the first LZRR phrases for each and is the number of LZRR phrases of . The LZRR phrases of are not unique.
For example, let be the first LZRR phrase of . . LZRR parsing chooses phrase or as the next one.
This paper shows the following two theorems.
Theorem 1.
For a given string , LZRR parsing computes in time using working space.
Theorem 2.
* holds.*
The LZRR parsing algorithm is presented in Section 3. Theorems 1 and 2 are shown in Section 4.
3.1 LP algorithm
A straightforward computation of is to compute reference position such that and then compute , which results in LZRR phrase . This method takes time even if can be computed in constant time for each position . Instead, we reduce the computation time of LF functions by leveraging the following fact: the length of the longest valid phrase of starting position and reference position is not larger than that of the LCP of and . This fact suggests that after we find a phrase of length , we do not need to compute LF functions for any reference position such that the LCP of and is not longer than . For an efficient computation, we sort reference positions in descending order with respect to the length of the LCP for and maintain those positions in the sorted suffix array of . Then, we omit computing LF functions of reference positions on for the left-most position on such that the longest valid phrase starting at a reference position in is at least as long as the LCP of and . This is because exists on . Thus, the following lemma holds.
Lemma 3.
Let and be the left-most position on the such that holds. Then holds and contains .
Proof.
See Appendix. ∎
Algorithm 1 shows the algorithm for computing function and computes each LF function from the head of . When Algorithm 1 finds , it returns the current longest valid phrase.
3.2 LF algorithm
algorithm finds the longest valid target phrase with reference position and following the PBP of by gradually extending the target phrase of length until it cannot find any reference string copying the target phrase. When PBP for and the target phrase is computed one-by-one, it can include an infinite loop of references by a mutual reference of phrases. This is because PBP as a target phrase can be copied from the left and right reference strings. This can happen when for computing the extension the position of a target phrase in and can be mutually reached with a finite number of references. The algorithm avoids such cases by using the union-find data structure built from PBP .
Each disjoint set in the union-find data structure includes string positions with the same source (character phrase) for PBP . The union-find data structure is initialized as disjoint sets that all contain the unique position of the input string of length . If the union-find data structure for PBP for and the target phrase exists, the data structure for can be updated by operation.
The infinite loops of references can be detected using the find operation in the union-find data structure. When is a valid and the extension of starting position next to the PBP and reference position is computed, if is equal to if and only if infinite loops of references exist. algorithm checks this condition each time. Formally, the following corollary holds.
Corollary 4.
Let be a valid PBP and be a PBP for an integer , and be disjoint sets on such that each set consists of all positions of the same source for a PBP , where is and is the starting position of the last target phrase (i.e., ) in . (1) If holds on , then is valid. Otherwise is invalid. (2) is equal to the set created by on .
Algorithm 2 shows the algorithm for computing function using Corollary 4 and the algorithm stated previously. Thus, we can compute the length of the longest valid target phrase with reference position and following the PBP by union and find operations on the given union-find data structure for .
Note that we need to modify Algorithm 2 for algorithm. This is because algorithms in our algorithm need the same union-find data structure determined by the PBP . On the other hand, the given union-find data structure is changed by union operations in Algorithm 2. By modifying Algorithm 2 using an additional union-find data structure, we can compute without updating the given union-find data structure. Formally, the following lemma holds.
Lemma 5.
Given the union-find data structure for , we can compute in working space by operations on and union and find operations on an additional union-find data structure for disjoint sets. is disposed after is computed.
Proof.
See Appendix. ∎
3.3 Computation of
Since algorithm for each uses the union-find data structure for disjoint sets of the current LZRR phrases (i.e., ), we update the union-find data structure when the -th LZRR phrase is selected. This needs at most operations by Corollary 4.
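The three ingredients of this section (candidate ordering by LCP, loop detection with union-find, and committing the chosen phrase) can be combined into a small end-to-end sketch. It is a simplified illustration rather than the paper's Algorithms 1 to 3: positions are 0-based, LCP values are computed by direct comparison instead of from the suffix-array-based structure, each LF call works on a scratch copy of the sets instead of an auxiliary union-find, and ties between equally long phrases are broken arbitrarily. All names are hypothetical.

```python
def find(parent, x):
    """Representative of the set containing x (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def lcp_len(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    k = 0
    while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def lf(text, parent, i, j):
    """Longest valid phrase starting at i that copies from j (0 if none)."""
    trial = parent[:]   # scratch copy so the committed sets stay untouched
    length = 0
    while (i + length < len(text) and j + length < len(text)
           and text[i + length] == text[j + length]):
        a, b = find(trial, i + length), find(trial, j + length)
        if a == b:      # j+length already resolves to i+length: reference loop
            break
        trial[a] = b    # i+length now shares the source of j+length
        length += 1
    return length


def lp(text, parent, i):
    """Longest valid phrase at i over all reference positions j != i."""
    cands = sorted((j for j in range(len(text)) if j != i),
                   key=lambda j: lcp_len(text, i, j), reverse=True)
    best_ref, best_len = -1, 0
    for j in cands:
        if lcp_len(text, i, j) <= best_len:
            break       # LF(i, j) never exceeds the LCP, so we can stop here
        length = lf(text, parent, i, j)
        if length > best_len:
            best_ref, best_len = j, length
    return best_ref, best_len


def lzrr_sketch(text):
    """Greedy left-to-right parse allowing left and right references."""
    parent = list(range(len(text)))   # every position starts as its own source
    phrases, i = [], 0
    while i < len(text):
        ref, length = lp(text, parent, i)
        if length == 0:
            phrases.append(text[i])   # no valid copy: character phrase
            i += 1
        else:
            for k in range(length):   # commit the unions checked by lf
                a = find(parent, i + k)
                b = find(parent, ref + k)
                parent[a] = b
            phrases.append((ref, length))
            i += length
    return phrases
```

On `abab` this sketch returns `[(2, 2), 'a', 'b']`: the first phrase copies from the right, after which copying positions 2 and 3 back from positions 0 and 1 is rejected because it would create a loop.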
4 Theoretical analysis
4.1 The proof of Theorem 1
We show that the working space of LZRR parsing is space. LZRR parsing needs two data structures: (1) the union-find data structures for algorithm and (2) the data structure to compute the sequence for algorithm, where is in algorithm and is the starting position of -th LZRR phrase.
We can compute in time in an online manner using arrays of , and for two integers and (See Appendix).
$\mathsf{SA}$, $\mathsf{ISA}$, and $\mathsf{LCP}$ of a given string $T$ can be constructed in $O(n)$ time and $O(n)$ working space [4, 5]. Therefore, the second data structure can be constructed in $O(n)$ time and $O(n)$ space, and the LZRR parsing algorithm runs in $O(n)$ working space.
Next, we show that the running time of LZRR parsing is . Let be the sequence of operations on disjoint-sets executed by LZRR parsing and be the sequence of , where is the number of phrases in . Then the running time is the sum of the computation time for executing and computing , and the preprocessing time of , and , which is .
We show that can be computed in time. For an integer , can be computed in time. This is because holds since has as a prefix for all , where is the string represented by the -th LZRR phrase. Thus, since . Hence can be computed in time using the above online algorithm.
We show that is performed in time. holds because performs union and find operations for . Since and for all , holds. Therefore, is performed in time by union-find data structures.
As a result, we can compute in time and working space.
4.2 The proof of Theorem 2
We define two BPs and for Theorem 2 and show three formulas: (1), (2), and (3) . Theorem 2 clearly holds in (1), (2), and (3), i.e., . The detailed proofs are in Appendix.
The proof of . parses greedily in the right-to-left order such that each phrase is the longest substring occurring previously (left) in .
A key idea of this proof is that if chooses a substring as an phrase, then there exists an LZ phrase starting at a position on the phrase and including the ending position of the phrase. This is because the phrase occurs previously in and the LZ phrase is the longest substring occurring previously in . Since the fact holds for every phrase, holds. Conversely, if chooses a substring as an LZ phrase, then there exists an phrase starting at a position on the LZ phrase and including the starting position of the LZ phrase. This is because the LZ phrase occurs previously in and the phrase is the longest substring occurring previously in . Since this fact holds for every LZ phrase, holds. Therefore, holds.
The proof of . parses in the left-to-right order such that each phrase is the longest substring occurring subsequently in .
A key idea of this proof is that if can choose a substring at a position as an LZOR phrase then also can choose the substring as an LZRR phrase. This is because candidate phrases with right reference positions are always valid phrases in LZRR parsing. Since the fact holds for every position on , holds.
The proof of . Parsing a string in the left-to-right order using the longest substring occurring subsequently in the string is equal to parsing the reversed string in the right-to-left order using the longest substring occurring previously in the reversed string. Thus, holds.
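The symmetry used in this last step can be checked on small inputs. The sketch below assumes 0-based positions and brute-force matching; it records only phrase lengths, emits a length-1 phrase when no other occurrence exists, and uses hypothetical function names.

```python
def lzor_lengths(t):
    """Left-to-right parse; each phrase is the longest prefix of t[i:]
    that also starts at some later position j > i."""
    n, i, out = len(t), 0, []
    while i < n:
        best = 0
        for j in range(i + 1, n):
            length = 0
            while j + length < n and t[i + length] == t[j + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)        # no later occurrence: single character
        out.append(step)
        i += step
    return out


def right_to_left_lengths(s):
    """Right-to-left parse; each phrase is the longest substring ending at e
    that also ends at some earlier position f < e."""
    e, out = len(s) - 1, []
    while e >= 0:
        best = 0
        for f in range(e):
            length = 0
            while f - length >= 0 and s[e - length] == s[f - length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)        # no earlier occurrence: single character
        out.append(step)
        e -= step
    return out
```

Position `i` of `t` corresponds to position `len(t) - 1 - i` of the reversed string, so `lzor_lengths(t)` and `right_to_left_lengths(t[::-1])` produce the same sequence of phrase lengths, for example `[2, 1, 1]` for `abab`.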
5 Experiments
In this section, we demonstrate the effectiveness of LZRR parsing with benchmark strings. We used two types of strings of pseudo-real and real repetitive collections in the Pizza & Chili corpus downloadable from http://pizzachili.dcc.uchile.cl. We compared our LZRR parsing with LZ77 parsing and lex-parse. We used execution time, memory, and number of phrases as evaluation measures for each method. The C++ programming language was used for implementing all the parsing algorithms. The implementations used in this experiment are available at https://github.com/TNishimoto/lzrr. LZ77 and lex-parse were implemented in the standard manner and work in time and space linear to string length using , and arrays. For each method, we computed two sets of phrases for original string and reverse string , respectively, and we took the set with the smaller number of phrases. We denote numbers of phrases as , , and for parsing algorithms of LZ77, lex-parse (LEX), and LZRR, respectively. We performed all the experiments on one core of a quad-core Intel(R) Xeon(R) E5-2680 v2 (2.80 GHz) CPU with 256 GB of memory.
5.1 Results
Table 1 shows the number of phrases for each method. The number of LZRR phrases was smaller than that of LZ77 phrases for all benchmark strings. Specifically, the number of LZRR phrases was approximately five percent smaller than that of LZ77 for all the strings except for fib41, rs.13, and tm29. The number of LZRR phrases was smaller than that of lex-parse phrases for most of the strings.
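The ratio column of Table 1 and the corresponding percentage reduction follow directly from the phrase counts. The short check below uses six rows copied from Table 1; the helper names are hypothetical.

```python
# (string, LZ77 phrases, LZRR phrases, ratio reported in Table 1)
ROWS = [
    ("fib41", 22, 5, 0.227),
    ("rs.13", 52, 51, 0.981),
    ("tm29", 56, 31, 0.554),
    ("einstein.de.txt", 34287, 31798, 0.927),
    ("kernel", 793915, 741556, 0.934),
    ("cere", 1695631, 1597657, 0.942),
]


def reduction_percent(lz77, lzrr):
    """Percentage by which the LZRR phrase count is below the LZ77 count."""
    return 100.0 * (1.0 - lzrr / lz77)


def ratios_match(rows, tol=5e-4):
    """True if every reported ratio equals LZRR/LZ77 to three decimals."""
    return all(abs(lzrr / lz77 - ratio) < tol for _, lz77, lzrr, ratio in rows)
```

For instance, `reduction_percent(34287, 31798)` is roughly 7.3 for einstein.de.txt, while rs.13 improves by only about 2 percent and fib41 by more than 75 percent, which is why the three artificial strings are reported separately from the five-percent figure.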
Table 2 shows execution time and memory on limited benchmark strings for each method. The table for all the strings is presented in Appendix. Although our LZRR parsing needs time, the execution time was at most four times slower than that of LZ77 parsing. This is because the number of while-loops in Algorithm 1 is much smaller than in practice. The memory for LZRR parsing was at most two times larger than that for LZ77 parsing. This is because the proposed algorithm needs the data structure for along with , and arrays.
6 Conclusions
We presented a new bidirectional parsing algorithm named Lempel-Ziv 77 parsing with right reference (LZRR). The number of LZRR phrases is theoretically guaranteed to be no larger than that of LZ77. Experimental results using benchmark strings showed LZRR parsing works in practice. An interesting line of future work is to devise the LZRR parsing algorithm working in time or a compressed space.
Acknowledgments. We would like to thank Simon J. Puglisi for bringing some related work [1, 2] to our attention.
Appendix A: The proof of Lemma 5
To compute without changing the union-find data structure for , we create an additional union-find data structure and we emulate find operations on using union-find data structures and for . Since union operations are performed on , is not changed in algorithm.
A key idea is that sources on the last phrase are only changed by extending . See Figure 1. The left figure represents sources of positions on target phrases. The right figure represents the change of sources by appending new target phrase to the target phrases. The new target phrase changes only sources on the phrase and these sources are determined by the phrase. This suggests that sources not on the target phrase can be computed using , and the other sources can be computed using and the additional union-find data structure that manages sources of positions on the phrase . In addition, disjoint sets managed by can be updated by union operations as Corollary 4.
Formally, let be the set of positions on the phrase and sources of those positions (i.e., ) and let be disjoint sets on such that each set consists of all positions of the same source, where is the position following . Let be the operation on disjoint-sets that adds into if does not contain . Then the following lemma and corollary hold.
Lemma 6.
For a position , if holds, then holds. Otherwise, holds, where .
Proof.
is a character phrase on for each position . If the source of on is not a position in , then does not reach any position in . When a phrase is appended into , the source is changed if and only if the character phrase on is changed. Thus, the source of is not changed by appending into , i.e., .
Otherwise, the source is in on and has a source on because is not a character phrase on . Since the source of is that of on , Lemma 6 holds.
∎
Corollary 7.
(1) For an integer , there exists a set that contains two positions and . (2) can be created by performing and operations on .
We compute the source of a given position on by find queries on and using Lemma 6 and Corollary 7. Note that we need to compute the position on the character phrase in a given set to obtain the source of a given position. For this reason, we use the position on a character phrase as the id of the set that contains the position. We can maintain such id using an additional array of length with the same time complexity, where is the cardinality of disjoint-sets.
We also note that we need to convert integers in to consecutive integers. This is because disjoint sets of are on consecutive integers since creates the element . Thus, we use an array of size , where stores the integer in that corresponds to if ; otherwise . This array also enables us to emulate operations. Since the size of is , we reuse during the LZRR parsing algorithm, and the algorithm creates the array in advance. It takes time and space. can be initialized in time, where is the number of positive integers in .
Algorithm 3 shows the modified algorithm for computing function using Lemma 6 and Corollary 7. Algorithm 3 computes by union and find operations and does not perform union operations on , where is the length of the longest valid phrase following with reference position . As a result, Lemma 5 holds.
Note that Algorithms 2 and 3 can fail if there exists an invalid PBP for an integer . If such an integer exists, then algorithms return and fail. However, such cases do not occur because we cannot remove infinite loops of references from an invalid PBP by appending phrases into the PBP.
Appendix B: Computing and
We show that we can compute and for a given and in time using and arrays.
We use the known fact that holds for two integers . When stores the permutation of containing for some integer , can store or by the above fact, where and are integers such that . Then is also the permutation of a subarray of containing . Thus, we compute by using the above observation.
We compute using and , where and . Since and , can be computed in constant time. In addition, we can appropriately update the four parameters in constant time for . Therefore, we can compute and in time and constant working space using a simple algorithm. ∎
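The online enumeration described in this appendix can be sketched as a two-sided walk over the suffix array that starts at the rank of the current suffix and always steps toward the neighbour with the larger LCP. The code is an illustration under assumptions: 0-based arrays, `lcp[r]` defined as the LCP of the suffixes at ranks `r - 1` and `r`, naive array construction for self-containment, and hypothetical function names.

```python
def build_arrays(t):
    """Naively build the suffix array, its inverse, and the LCP array of t."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    isa = [0] * len(t)
    for rank, pos in enumerate(sa):
        isa[pos] = rank

    def lcp_len(i, j):
        k = 0
        while i + k < len(t) and j + k < len(t) and t[i + k] == t[j + k]:
            k += 1
        return k

    lcp = [0 if r == 0 else lcp_len(sa[r - 1], sa[r]) for r in range(len(sa))]
    return sa, isa, lcp


def by_decreasing_lcp(sa, isa, lcp, i):
    """Yield (position, lcp with suffix i) in non-increasing LCP order."""
    n = len(sa)
    lo, hi = isa[i] - 1, isa[i] + 1           # next rank to visit on each side
    lo_l = lcp[isa[i]] if lo >= 0 else -1     # LCP of suffix i and rank lo
    hi_l = lcp[hi] if hi < n else -1          # LCP of suffix i and rank hi
    while lo >= 0 or hi < n:
        if lo >= 0 and (hi >= n or lo_l >= hi_l):
            yield sa[lo], lo_l
            # Moving one rank further left can only shrink the running minimum.
            lo_l = min(lo_l, lcp[lo]) if lo > 0 else -1
            lo -= 1
        else:
            yield sa[hi], hi_l
            hi += 1
            hi_l = min(hi_l, lcp[hi]) if hi < n else -1
```

Each step needs only the two boundary ranks and the two running minima, so the next candidate is produced in constant time, matching the four parameters mentioned above. For `banana` and position 1 (`anana`), the walk yields `(3, 3)`, `(5, 1)`, and then the three positions with LCP 0.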
Appendix C: The proof of the upper bound of LZRR phrases
We show three formulas using injective functions; for two BPs and of , if there exists an injective function that maps phrases in into distinct phrases in , then holds. In the remaining section, let and (resp. and ) be starting and ending positions of -th phrase in (resp. ).
The proof of . parses greedily in the right-to-left order such that each phrase is the longest substring occurring previously (left) in . Formally, let be the integer array of length such that stores the length of the longest substring of ending at position and occurring on for all , i.e., . Then is the valid BP of such that for all , the starting position of is , where . Figure 2 illustrates examples of and .
For and , we define the function that returns the integer such that contains (i.e., holds). is injective if the starting position of each phrase is not larger than that of the LZ phrase containing the ending position of the phrase, i.e., holds for all . This is because no LZ phrases contain two ending positions in phrases, i.e., no integer exists such that holds if holds for all .
We show using the substring starting at the starting position of and ending at the ending position of , i.e., . When holds, holds because is a suffix of and is a prefix of . Thus, we show always holds. If occurs in previously on , then because chooses the longest substring ending at position and occurring on . Otherwise, and hold since is a new character, i.e., . Therefore, holds for all .
Similarly, holds by constructing the injective function that returns the integer such that contains for a given .
The proof of . parses in the left-to-right order such that each phrase is the longest substring occurring subsequently in . Formally, let be the integer array of length such that stores the length of the longest substring of starting at position and occurring on for all , i.e., ). Then is the BP of such that for all , the starting position of is and . Figure 2 illustrates an example of .
For and , let be the function that returns the integer such that contains . Then is injective if holds for all . We use the following lemma.
Lemma 8.
Let be a valid PBP of . Then is also valid for any right target phrase , i.e., and hold, where .
Proof.
are represented character phrases on since has not been parsed. This means that for any . Therefore is valid by Corollary 4. ∎
holds if holds for all . Recall that function returns the valid longest bidirectional phrase. The array and Lemma 8 suggest that the length of the phrase of starting at position is at least . Thus holds for all , is injective, and hence holds.
The proof of . holds clearly because holds for all , where is the array of .
Appendix D: The proof of Lemma 3
Proof.
Recall that holds. On the other hand, holds for all because represents the length of the common prefix of and . Therefore, holds, which means at least one position exists such that in . ∎
Appendix E: Experiments
References
- [1] Dinklage, P., Fischer, J., Köppl, D., Löbel, M., Sadakane, K.: Compression with the tudocomp framework. In: Proceedings of SEA. pp. 13:1–13:22 (2017)
- [2] Gagie, T., Navarro, G., Prezza, N.: On the approximation ratio of Lempel-Ziv parsing. In: Proceedings of LATIN. pp. 490–503 (2018)
- [3] Jez, A.: A really simple approximation of smallest grammar. Theor. Comput. Sci. 616, 141–150 (2016)
- [4] Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proceedings of ICALP. pp. 943–955 (2003)
- [5] Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of CPM. pp. 181–192 (2001)
- [6] Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of STOC. pp. 827–840 (2018)
- [7] Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: Proceedings of DCC. pp. 239–248 (2010)
- [8] Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Transactions on Information Theory 22(1), 75–81 (1976)
