LZRR: LZ77 Parsing with Right Reference

Takaaki Nishimoto; Yasuo Tabei

arXiv:1812.04261·cs.DS·December 12, 2018

TL;DR

This paper introduces LZRR, a novel bidirectional parsing method that guarantees fewer phrases than LZ77, achieving approximately 5% better compression on benchmark strings.

Contribution

LZRR is the first practical bidirectional parsing method with theoretical guarantees of smaller phrase counts than LZ77.

Findings

1. LZRR reduces phrase count by about 5% compared to LZ77.
2. LZRR guarantees smaller phrase count theoretically.
3. Experimental results confirm improved compression ratios.

Tables

Table 1: The number of phrases for each method. The smallest number of phrases for each string is depicted in bold.

String	String length	$|LZ77|$	$|LEX|$	$|LZRR|$	$|LZRR|/|LZ77|$
fib41	267,914,296	22	4	5	0.227
rs.13	216,747,218	52	40	51	0.981
tm29	268,435,456	56	43	31	0.554
dblp.xml.00001.1	104,857,600	59,385	58,537	55,127	0.928
dblp.xml.00001.2	104,857,600	59,556	60,220	55,122	0.926
dblp.xml.0001.1	104,857,600	78,167	82,879	73,584	0.941
dblp.xml.0001.2	104,857,600	78,158	99,467	73,583	0.941
sources.001.2	104,857,600	294,994	466,074	287,411	0.974
dna.001.1	104,857,600	308,355	307,329	295,354	0.958
proteins.001.1	104,857,600	355,268	364,024	337,711	0.951
english.001.2	104,857,600	335,815	487,586	324,282	0.966
einstein.de.txt	92,758,441	34,287	37,719	31,798	0.927
einstein.en.txt	467,626,544	89,437	96,487	83,368	0.932
world_leaders	46,968,181	175,670	179,503	165,626	0.943
influenza	154,808,555	769,286	764,634	714,320	0.929
kernel	257,961,616	793,915	794,058	741,556	0.934
cere	461,286,644	1,695,631	1,649,448	1,597,657	0.942
coreutils	205,281,778	1,441,384	1,439,918	1,359,606	0.943
Escherichia_Coli	112,689,515	2,078,512	2,014,012	1,961,296	0.944
para	429,265,758	2,332,657	2,238,362	2,200,802	0.943

Table 2: The execution time and memory for each method.

		Execution time [sec]			Memory consumption [MB]
String	String length	LZ77	LEX	LZRR	LZ77	LEX	LZRR
einstein.de.txt	92,758,441	24	16	27	2,266	2,266	3,808
einstein.en.txt	467,626,544	130	85	147	11,418	11,418	19,196
world_leaders	46,968,181	8	5	16	1,148	1,148	1,939
influenza	154,808,555	42	27	51	3,781	3,781	6,351
kernel	257,961,616	71	47	88	6,299	6,299	10,602
cere	461,286,644	131	90	500	11,263	11,263	18,925
coreutils	205,281,778	56	37	68	5,013	5,013	8,453
Escherichia_Coli	112,689,515	32	22	46	2,752	2,752	4,632
para	429,265,758	125	85	203	10,481	10,481	17,609

Table 3: The full version of Table 2.

		Execution time [sec]			Memory consumption [MB]
String	String length	LZ77	LEX	LZRR	LZ77	LEX	LZRR
fib41	267,914,296	99	74	113	6,542	6,542	11,978
rs.13	216,747,218	79	59	110	5,292	5,293	9,654
tm29	268,435,456	108	81	142	6,554	6,555	11,797
dblp.xml.00001.1	104,857,600	30	21	42	2,561	2,561	4,308
dblp.xml.00001.2	104,857,600	30	20	41	2,561	2,561	4,305
dblp.xml.0001.1	104,857,600	30	20	42	2,561	2,561	4,303
dblp.xml.0001.2	104,857,600	30	20	41	2,561	2,561	4,303
sources.001.2	104,857,600	28	19	41	2,561	2,561	4,302
dna.001.1	104,857,600	30	20	41	2,561	2,561	4,302
proteins.001.1	104,857,600	31	21	42	2,561	2,561	4,302
english.001.2	104,857,600	30	21	42	2,561	2,561	4,302
einstein.de.txt	92,758,441	24	16	27	2,266	2,266	3,808
einstein.en.txt	467,626,544	130	85	147	11,418	11,418	19,196
world_leaders	46,968,181	8	5	16	1,148	1,148	1,939
influenza	154,808,555	42	27	51	3,781	3,781	6,351
kernel	257,961,616	71	47	88	6,299	6,299	10,602
cere	461,286,644	131	90	500	11,263	11,263	18,925
coreutils	205,281,778	56	37	68	5,013	5,013	8,453
Escherichia_Coli	112,689,515	32	22	46	2,752	2,752	4,632
para	429,265,758	125	85	203	10,481	10,481	17,609

Equations

$$g^{k}(x)=\begin{cases}g^{k-1}(x)&\text{if }g^{k-1}(x)\in\Sigma,\\ g^{0}(g^{k-1}(x))&\text{otherwise}.\end{cases}$$

Full text

Takaaki Nishimoto∗ and Yasuo Tabei∗

∗RIKEN Center for Advanced Intelligence Project

{takaaki.nishimoto,yasuo.tabei}@riken.jp

Abstract

Lossless data compression has been widely studied in computer science. One of the most widely used lossless compression methods is Lempel-Ziv 77 (LZ77) parsing, which achieves a high compression ratio. Bidirectional (a.k.a. macro) parsing is a lossless compression method that computes a sequence of phrases, each copied from another substring (target phrase) located either to the left or to the right in the input string. Gagie et al. (LATIN 2018) recently showed that a large gap exists between the number of phrases in the smallest bidirectional parse of a given string and the number of LZ77 phrases. In addition, finding the smallest bidirectional parse of a given text is NP-complete. Several variants of bidirectional parsing have been proposed thus far, but no prior bidirectional parsing is guaranteed to produce no more phrases than LZ77 for every string. In this paper, we present the first practical bidirectional parsing, named LZ77 parsing with right reference (LZRR), in which the number of LZRR phrases is theoretically guaranteed to be no larger than the number of LZ77 phrases. Experimental results using benchmark strings show that the number of LZRR phrases is approximately five percent smaller than that of LZ77 phrases.

1 Introduction

Lossless data compression has been widely studied in computer science. One of the most widely used lossless compression methods is Lempel-Ziv 77 (LZ77) parsing [8], which compresses a given string by computing a sequence of phrases, each copied from the longest matching substring to its left in the input string. LZ77 parsing has a long research history, with the first paper on it published in 1976 [8]. Many extensions of LZ77 have since been proposed (e.g., [7, 3, 10]), and LZ77 parsing achieves the smallest compression ratio among them.

Bidirectional (a.k.a. macro) parsing [11] is a lossless compression method that computes a sequence of phrases, each copied from another substring (target phrase) located either to the left or to the right in the input string. Each set of LZ77 phrases is convertible into a set of bidirectional phrases, so the number of phrases in the smallest bidirectional parsing is no more than that of LZ77 phrases. Gagie et al. [2] recently showed that the number of LZ77 phrases $z$ representing an input string of length $n$ is tightly bounded in terms of the smallest number of bidirectional phrases $b^{*}$ representing the same string as $z=O(b^{*}\log(n/b^{*}))$ , which suggests that a large gap can exist between $b^{*}$ and $z$ . In addition, finding the smallest bidirectional parse of a given text is NP-complete [11]. Thus, an important open challenge is to develop a polynomial-time bidirectional parsing whose number of phrases is no larger than that of LZ77 phrases.

Several variants of bidirectional parsing have been proposed thus far. Lex-parsing [9] is a bidirectional parsing that computes a sequence of bidirectional phrases that each occurred previously on a suffix array of a string. The number of phrases $v$ in the lex-parsing is bounded by $v=O(b^{*}\log(n/b^{*}))$ [2]. Although the lex-parsing is effective for most benchmark strings in practice (i.e., $v$ is very close to $z$ ), it can fail to compress some strings (i.e., $v$ is much larger than $z$ ) [9]. Lcpcomp [1] and a bidirectional parsing using the Burrows-Wheeler transform (BWT) [2] have also been proposed, and they never have fewer phrases than lex-parse [9]. Kempa and Prezza proposed a parsing algorithm for computing the bidirectional parse of an input string for a given string attractor of the string [6]. The number of the bidirectional phrases is bounded by $O(\gamma\log(n/\gamma))$ , where $\gamma$ is the size of the string attractor. Let $\gamma^{*}$ be the size of the smallest string attractor for a given string. Then $b^{*}=O(\gamma^{*}\log(n/\gamma^{*}))$ holds [6]. In addition, finding the smallest string attractor of a given string is also NP-complete [6]. In summary, no prior bidirectional parsing is guaranteed to produce no more phrases than LZ77 for every string.

In this paper, we present the first practical bidirectional parsing, named LZ77 parsing with right reference (LZRR), in which the number of LZRR phrases is never larger than the number of LZ77 phrases and is noticeably smaller in practice. LZRR is a polynomial-time algorithm that greedily computes phrases from a string in left-to-right order, as LZ77 does. The main difference between LZRR and LZ77 is the way they compute their phrases. Whereas LZ77 parsing chooses the longest substring occurring previously as a phrase, LZRR parsing uses not only previous occurrences of each phrase but also subsequent occurrences (i.e., it chooses the longest substring occurring previously or subsequently as a phrase). For this reason, the number of LZRR phrases is theoretically guaranteed to be no more than that of LZ77 phrases. Experimental results using benchmark datasets show that the number of LZRR phrases is approximately five percent smaller than that of LZ77 phrases.

2 Preliminaries

Let $\Sigma$ be an ordered alphabet of size $\sigma$ , $T$ be a string of length $n$ over $\Sigma$ and $|T|$ be the length of $T$ . Let $T[i]$ be the $i$ -th character of $T$ and $T[i..j]$ be the substring of $T$ that begins at position $i$ and ends at position $j$ . $T[i..]$ denotes the suffix of $T$ beginning at position $i$ , i.e., $T[i..n]$ . Let $T^{R}$ be the reversed string of $T$ , i.e., $T^{R}=T[n]T[n-1]\cdots T[1]$ .

$\mathit{Occ}(T,s)$ denotes all the occurrence positions of string $s$ in string $T$ , i.e., $\mathit{Occ}(T,s)=\{i\mid s=T[i..i+|s|-1],1\leq i\leq n-|s|+1\}$ . Let $\mathsf{lcp}(i,j)$ be the length of the longest common prefix (LCP) of $T[i..]$ and $T[j..]$ . For two strings $x$ and $y$ , we write $x\prec y$ if $x$ is lexicographically smaller than $y$ . Similarly, for a string $z$ , we write $x\preceq_{z}y$ if the LCP of $x$ and $z$ is equal to or longer than that of $y$ and $z$ . For example, $aab\preceq_{aac}ab$ , since $aab$ shares a prefix of length $2$ with $aac$ whereas $ab$ shares a prefix of length $1$ .

Our model of computation is a unit-cost word RAM with a machine word size of $\Omega(\log_{2}n)$ bits. We evaluate the space complexity in terms of the number of machine words. A bitwise evaluation of space complexity can be obtained with a $\log_{2}n$ multiplicative factor.

2.1 Arrays

Suffix array $\mathsf{SA}$ , inverse suffix array $\mathsf{ISA}$ , LCP array $\mathsf{LCP}$ , longest previous factor array $\mathsf{LPF}$ , and sorted suffix array $\mathsf{SA}_{k}$ are integer arrays of length $n$ for a string $T$ . $\mathsf{SA}$ is the permutation of $[1..n]$ such that $T[\mathsf{SA}[1]..]\prec\cdots\prec T[\mathsf{SA}[n]..]$ holds. $\mathsf{ISA}$ is the permutation of $[1..n]$ such that $\mathsf{SA}[\mathsf{ISA}[i]]=i$ holds for any $i\in\{1,2,\ldots,n\}$ . $\mathsf{LCP}[1]=0$ and $\mathsf{LCP}[i]=\mathsf{lcp}(\mathsf{SA}[i],\mathsf{SA}[i-1])$ for $i\in\{2,3,\ldots,n\}$ . $\mathsf{LPF}[i]$ stores the length of the longest prefix of $T[i..]$ occurring previously; that is, $\mathsf{LPF}[1]=0$ and $\mathsf{LPF}[i]=\max\{\mathsf{lcp}(i,p)\mid p\in\{1..i-1\}\}$ , where $\max$ returns the maximal element of a given set. $\mathsf{SA}_{k}$ lists the starting positions of suffixes in decreasing order of the length of their LCP with $T[k..]$ . Formally, for an integer $k\in\{1,2,\ldots,n\}$ , $\mathsf{SA}_{k}$ is a permutation of $[1..n]$ such that $T[\mathsf{SA}_{k}[1]..]\preceq_{T[k..]}\cdots\preceq_{T[k..]}T[\mathsf{SA}_{k}[n]..]$ . $\mathsf{SA}_{k}$ is not unique when there exist two positions $i$ and $j$ such that $\mathsf{lcp}(k,i)=\mathsf{lcp}(k,j)$ .

For $T=abababaabb$ , $\mathsf{SA}=7,5,3,1,8,10,6,4,2,9$ , $\mathsf{ISA}=4,9,3,8,2,7,1,5,10,6$ , $\mathsf{LCP}=0,1,3,5,2,0,1,2,4,1$ , $\mathsf{LPF}=0,0,5,4,3,2,1,2,1,1$ , and $\mathsf{SA}_{1}=1,3,5,8,7,10,6,4,2,9$ .
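The array definitions above can be checked on the running example with a brute-force sketch. It is not the linear-time construction cited later in the paper: it sorts suffixes directly and compares characters one by one. The function names (`lcp`, `build_arrays`, `sorted_suffix_array`) are hypothetical, positions are 1-indexed as in the text, and ties in $\mathsf{SA}_{k}$ are broken by suffix-array rank, which is only one of the valid orders.

```python
def lcp(T, i, j):
    """Length of the longest common prefix of T[i..] and T[j..] (1-indexed)."""
    n, k = len(T), 0
    while i + k <= n and j + k <= n and T[i + k - 1] == T[j + k - 1]:
        k += 1
    return k


def build_arrays(T):
    """Return SA, ISA, LCP and LPF as Python lists (entry x-1 holds the value for x)."""
    n = len(T)
    # SA: starting positions of suffixes in lexicographic order.
    SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])
    # ISA: rank of each suffix, so that SA[ISA[i]] = i.
    ISA = [0] * n
    for rank, pos in enumerate(SA, start=1):
        ISA[pos - 1] = rank
    # LCP[1] = 0, LCP[r] = lcp(SA[r], SA[r-1]) for r >= 2.
    LCP = [0] + [lcp(T, SA[r], SA[r - 1]) for r in range(1, n)]
    # LPF[i]: longest prefix of T[i..] that also starts at some p < i.
    LPF = [max((lcp(T, i, p) for p in range(1, i)), default=0)
           for i in range(1, n + 1)]
    return SA, ISA, LCP, LPF


def sorted_suffix_array(T, k, ISA):
    """SA_k: positions in decreasing order of lcp(k, j); ties broken by SA rank (an assumption)."""
    n = len(T)
    return sorted(range(1, n + 1), key=lambda j: (-lcp(T, k, j), ISA[j - 1]))


T = "abababaabb"
SA, ISA, LCP, LPF = build_arrays(T)
SA_1 = sorted_suffix_array(T, 1, ISA)
```

With this tie-breaking rule, the five arrays computed for $T=abababaabb$ coincide with the values listed in the example.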

2.2 Union-find data structure

Union-find is a data structure for disjoint sets and supports the following operations on a family of disjoint sets $\mathcal{D}$ : $\mathit{MakeSet}$ , $\mathit{Union}$ , $\mathit{Find}$ . $\mathit{MakeSet}$ adds the new singleton set $\{m+1\}$ into $\mathcal{D}$ and returns the integer $m+1$ , where $m$ is the number of elements currently contained in the sets of $\mathcal{D}$ . $\mathit{Union}(x,y)$ merges two sets $X,Y\in\mathcal{D}$ containing $x$ and $y$ , respectively; it adds a new set $X\cup Y$ into $\mathcal{D}$ and removes $X$ and $Y$ from $\mathcal{D}$ . $\mathit{Find}(x)$ returns the id of the set containing $x$ in $\mathcal{D}$ . The union-find data structure performs $\mathit{MakeSet}$ , $\mathit{Union}$ , and $\mathit{Find}$ operations in $O(m+p+q\alpha_{p+q}(p))$ total time, while using $O(m)$ space [12], where $p$ and $q$ are the numbers of $\mathit{Union}$ and $\mathit{Find}$ operations, respectively, and $\alpha_{k}$ is the inverse of the $k$ -th row of the Ackermann function.
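A minimal sketch of the three operations follows. The class and method names are hypothetical, and the set id returned by `find` is taken to be the root element of the set, which is an assumption of this sketch. It uses union by size with path compression, a standard combination that gives near-constant amortized cost per operation; it is not necessarily the variant analyzed in [12].

```python
class UnionFind:
    """Disjoint sets over the integers 1, 2, ..., m."""

    def __init__(self):
        self.parent = []  # parent[x - 1] is the parent of element x
        self.size = []    # size[r - 1] is the set size when r is a root

    def make_set(self):
        """Add the singleton {m + 1} and return m + 1."""
        x = len(self.parent) + 1
        self.parent.append(x)
        self.size.append(1)
        return x

    def find(self, x):
        """Return the root of the set containing x, compressing the path."""
        root = x
        while self.parent[root - 1] != root:
            root = self.parent[root - 1]
        while self.parent[x - 1] != root:
            nxt = self.parent[x - 1]
            self.parent[x - 1] = root
            x = nxt
        return root

    def union(self, x, y):
        """Merge the sets containing x and y; return the id of the merged set."""
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx
        if self.size[rx - 1] < self.size[ry - 1]:
            rx, ry = ry, rx
        self.parent[ry - 1] = rx            # attach the smaller tree under the larger
        self.size[rx - 1] += self.size[ry - 1]
        return rx


uf = UnionFind()
ids = [uf.make_set() for _ in range(5)]  # elements 1..5, each in its own set
uf.union(1, 3)
uf.union(3, 5)
same = uf.find(1) == uf.find(5)          # 1 and 5 now share a set
```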

2.3 Bidirectional phrases and partial bidirectional phrases

Bidirectional phrases (BP) [11] of string $T$ is a partition of $T$ as substrings (phrases) $B=f_{1},f_{2},\ldots,f_{b}$ such that each $f_{i}=T[s_{i}..s_{i}+\ell-1]$ is (i) either copied from another substring $T[t_{i}..t_{i}+\ell-1]$ (target phrase) with $s_{i}\neq t_{i}$ , which can overlap $T[s_{i}..s_{i}+\ell-1]$ , or (ii) an explicit character (character phrase), i.e., $f_{i}=T[s_{i}]$ . Target phrase $f_{i}$ is denoted as a pair $\langle t_{i},|f_{i}|\rangle$ of the reference position $t_{i}$ and the length $|f_{i}|$ of $f_{i}$ . The substring $T[t_{i}..t_{i}+|f_{i}|-1]$ is called the reference string of $f_{i}$ .

The original string $T$ can be recovered from BP $B$ if, from every position in each $f_{i}$ in $B$ , a character phrase is reached after following a finite number of references. If an infinite loop of references starting from some $f_{i}$ exists, the original string $T$ cannot be recovered from $B$ . If $T$ can be recovered from $B$ , $B$ is said to be a valid BP of $T$ ; otherwise, $B$ is said to be an invalid BP of $T$ .

The value of the phrase reached from position $x$ in $k$ iterations of references is formally defined as $g^{k}:\{1,\ldots,n\}\rightarrow\{1,\ldots,n\}\cup\Sigma$ . For $k=0$ , if $T[x]$ is a character phrase, $g^{0}(x)=T[x]$ ; otherwise $g^{0}(x)=t_{p}+(x-s_{p})$ , where $p$ is the integer such that $s_{p}\leq x<s_{p+1}$ holds for $s_{b+1}=n+1$ . For $k\geq 1$ , we define $g^{k}(x)$ as follows:

$$g^{k}(x)=\begin{cases}g^{k-1}(x)&\text{if }g^{k-1}(x)\in\Sigma,\\ g^{0}(g^{k-1}(x))&\text{otherwise}.\end{cases}$$

If $g^{n}(x)\in\Sigma$ holds, then there are no infinite loops of references containing $x$ . Therefore, $B$ is a valid BP of $T$ if $B$ has no infinite loops of references, i.e., $g^{n}(x)\in\Sigma$ holds for all $x$ .

For example, let $B=\langle 3,2\rangle,a,b,\langle 2,3\rangle$ and $B^{\prime}=\langle 3,2\rangle,\langle 1,2\rangle,b,a,b$ be BPs of $T=ababbab$ . Then $B$ is valid since $g^{1}(1),\ldots,g^{1}(7)=a,b,a,b,4,a,b$ and $g^{2}(5)=b$ . On the other hand, $B^{\prime}$ is invalid since $g^{7}(1),\ldots,g^{7}(7)=1,2,3,4,b,a,b$ .
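The function $g^{k}$ and the validity test can be traced with a short sketch. The representation is an assumption made for illustration: a BP is a Python list whose items are either a one-character string (character phrase) or a tuple `(t, length)` (target phrase with 1-indexed reference position `t`). The helper names are hypothetical.

```python
def g0_table(phrases):
    """g^0 for every position: a character, or the 1-indexed position it refers to."""
    g0 = []
    for f in phrases:
        if isinstance(f, str):
            g0.append(f)                          # character phrase
        else:
            t, length = f
            g0.extend(t + d for d in range(length))  # position s+d refers to t+d
    return g0


def g(phrases, k):
    """Return [g^k(1), ..., g^k(n)] following the recursive definition."""
    g0 = g0_table(phrases)
    values = list(g0)
    for _ in range(k):
        # Characters stay fixed; positions are followed by one more reference.
        values = [v if isinstance(v, str) else g0[v - 1] for v in values]
    return values


def is_valid(phrases):
    """A BP is valid if g^n(x) is a character for every position x."""
    n = len(g0_table(phrases))
    return all(isinstance(v, str) for v in g(phrases, n))


def decode(phrases):
    """Recover the string from a valid BP."""
    n = len(g0_table(phrases))
    return "".join(g(phrases, n))


B = [(3, 2), "a", "b", (2, 3)]
B_prime = [(3, 2), (1, 2), "b", "a", "b"]
```

On the two parses above, `g(B, 1)` gives `a, b, a, b, 4, a, b` and `decode(B)` gives `ababbab`, while `g(B_prime, 7)` still contains the positions `1, 2, 3, 4`, so `is_valid(B_prime)` is false.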

LZ77 phrases [8] of string $T$ are a specialization of BP and are defined as the bidirectional phrases that are all selected from previously seen substrings. Since there are no infinite loops of references among such phrases, LZ77 phrases of $T$ are always a valid BP of $T$ . Formally, let $\mathsf{LZ}(T)=f_{1},f_{2},\ldots,f_{z}$ be the valid BP of $T$ such that $|f_{i}|=\max\{1,\mathsf{LPF}[s_{i}]\}$ for each $i\in\{1,\ldots,z\}$ .
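The rule $|f_{i}|=\max\{1,\mathsf{LPF}[s_{i}]\}$ can be sketched directly. This version computes each $\mathsf{LPF}$ value by brute force, so it runs in polynomial rather than linear time; the function names are hypothetical.

```python
def lpf(T, i):
    """LPF[i]: length of the longest prefix of T[i..] starting at some p < i (1-indexed)."""
    n, best = len(T), 0
    for p in range(1, i):
        k = 0
        # p < i, so p + k never runs past the end before i + k does.
        while i + k <= n and T[p + k - 1] == T[i + k - 1]:
            k += 1
        best = max(best, k)
    return best


def lz77_factors(T):
    """Greedy left-to-right LZ77 factorization; returns the phrases as strings."""
    factors, s = [], 1
    while s <= len(T):
        length = max(1, lpf(T, s))   # a character phrase when nothing occurs earlier
        factors.append(T[s - 1:s - 1 + length])
        s += length
    return factors


factors = lz77_factors("abababaabb")
```

For $T=abababaabb$ the phrases start at positions $1,2,3,8,10$ with $\mathsf{LPF}$ values $0,0,5,2,1$, which gives the five phrases `a`, `b`, `ababa`, `ab`, `b`.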

LZRR parsing gradually builds the valid BP from the start position of $T$ in the left-to-right order. A subsequence of the valid BP is called partial bidirectional phrases (PBP) and is defined as a BP $P=f_{1},f_{2},\ldots,f_{k}$ for a prefix of $T$ that can be copied from any substring of $T$ , i.e., $t_{i}\in\{1,\ldots,n\}\setminus\{s_{i}\}$ for all $i\in\{1,\ldots,k\}$ for a target phrase $f_{i}$ , which avoids a self copy.

The concatenation of such a PBP $P$ and every character phrase referred from $P$ can recover the prefix of $T$ with a finite number of references. Such PBP are called valid PBP, and other PBP are called invalid PBP. Formally, let $B_{P}=P\cdot f^{\prime}_{1},f^{\prime}_{2},\ldots,f^{\prime}_{k^{\prime}}$ be the concatenation of PBP $P$ and the remaining character phrases $f^{\prime}_{1},f^{\prime}_{2},\ldots,f^{\prime}_{k^{\prime}}$ equivalent to suffix $T[(n-k^{\prime}+1)..]$ . $P$ is valid if $B_{P}$ is valid; otherwise $P$ is invalid. For example, let $P=\langle 3,2\rangle,\langle 6,2\rangle$ be a PBP of $T=ababbab$ . Then $B_{P}=\langle 3,2\rangle,\langle 6,2\rangle,b,a,b$ .

The original string of a PBP can be recovered by following references from each position of a target phrase until a character phrase is reached, which takes a finite number of steps. Thus, the position of each character phrase can be seen as the source for positions of target/character phrases. Formally, for a PBP $P$ and position $x\in\{1,\ldots,n\}$ on $T$ , $\mathit{source}(P,x)$ returns source $y\in\mathcal{N}$ of $x$ in $B_{P}$ , i.e., position $y$ satisfying either (i) $g^{k}(x)=y$ and $g^{k+1}(x)\in\Sigma$ for an integer $k$ or (ii) $x=y$ and $g^{0}(x)\in\Sigma$ . For the above example, the source of the position $1$ is the position $6$ in $P$ since $g^{0}(1)=3$ , $g^{1}(1)=6$ and $g^{2}(1)=a$ .
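A sketch of $\mathit{source}(P,x)$ under an assumed representation: a PBP is a list of tuples `(t, length)` or one-character strings, and every position of $T$ not covered by $P$ is treated as a character phrase, as in $B_{P}$. The function name and the `None` return for an invalid PBP are choices of this sketch.

```python
def source(T, P, x):
    """Follow references from position x in B_P until a character-phrase position is reached."""
    ref = {}   # ref[pos] = position that pos copies from
    pos = 1
    for f in P:
        if isinstance(f, str):
            pos += 1                      # character phrase inside P
        else:
            t, length = f
            for d in range(length):
                ref[pos + d] = t + d
            pos += length
    # Positions pos..n are the remaining character phrases of B_P (no entry in ref).
    y = x
    for _ in range(len(T) + 1):
        if y not in ref:
            return y                      # y holds an explicit character
        y = ref[y]
    return None                           # an infinite loop: P is invalid


T = "ababbab"
P = [(3, 2), (6, 2)]
src = source(T, P, 1)   # 1 -> 3 -> 6, and position 6 is a character phrase
```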

3 LZRR

A key idea of LZRR parsing is to compute a valid BP from an input text $T$ by gradually computing valid PBPs from the head of $T$ in left-to-right order. Starting from an empty sequence of phrases, LZRR parsing computes the LZRR phrases of an input string by repeating two steps: (i) it computes candidates of the reference positions of the longest valid phrase following the current LZRR phrases; and (ii) it computes the valid (possibly character) phrase with the maximum length among extensions starting from those candidates. Steps (i) and (ii) are iterated until all LZRR phrases are computed.

LZRR parsing uses two main functions, LP and LF, for steps (i) and (ii), respectively. Given a valid PBP $P$ of $T$ , LP function $\mathsf{LP}(P)$ returns the longest valid phrase following $P$ , i.e., the longest phrase $f$ such that $P\cdot f$ is a valid PBP of $T$ . Given a valid PBP $P$ of $T$ and reference position $j\in\{1,\ldots,n\}$ , LF function $\mathsf{LF}(P,j)$ returns the length of the longest valid phrase having reference position $j$ and following $P$ , i.e., $\mathsf{LF}(P,j)=\max(\{0\}\cup\{\ell\mid\ell\in\{1,2,\ldots,\mathsf{lcp}(i,j)\},P\cdot\langle j,\ell\rangle\mbox{ is valid}\})$ , where $i$ is the starting position of the phrase following $P$ . LZRR parsing computes LZRR phrases as the valid BP $\mathsf{LZRR}(T)=\mathsf{LP}(P_{0}),\ldots,\mathsf{LP}(P_{b-1})$ of $T$ , where $P_{p}$ is the first $p$ LZRR phrases for each $p\in\{0,1,\ldots,b\}$ and $b$ is the number of LZRR phrases of $T$ . The LZRR phrases of $T$ are not unique.

For example, let $P_{1}=\langle 3,5\rangle$ be the first LZRR phrase of $T=abababaababa$ . $\mathsf{LF}(P_{1},1),$ $\ldots,\mathsf{LF}(P_{1},12)=0,0,0,0,0,0,0,0,2,0,2,0$ . LZRR parsing chooses phrase $\langle 9,2\rangle$ or $\langle 11,2\rangle$ as the next one.

This paper shows the following two theorems.

Theorem 1.

For a given string $T$ , LZRR parsing computes $\mathsf{LZRR}(T)$ in $O(n^{2}\alpha_{n^{2}}(n^{2}))$ time using $O(n)$ working space.

Theorem 2.

$|\mathsf{LZRR}(T)|\leq|\mathsf{LZ}(T^{R})|$ holds.

The LZRR parsing algorithm is presented in Section 3. Theorems 1 and 2 are shown in Section 4.

3.1 $\mathsf{LP}$ algorithm

A straightforward computation of $\mathsf{LP}(P)$ is to compute reference position $j_{\mathit{max}}$ such that $\mathsf{LF}(P,j_{\mathit{max}})=\max\{\mathsf{LF}(P,1),\ldots,\mathsf{LF}(P,n)\}$ and then compute $\ell_{\mathit{max}}=\mathsf{LF}(P,j_{\mathit{max}})$ , which results in LZRR phrase $\langle j_{\mathit{max}},\ell_{\mathit{max}}\rangle$ . This method takes $\Omega(n)$ time even if $\mathsf{LF}(P,j)$ can be computed in constant time for each position $j$ . Instead, we reduce the number of LF computations by leveraging the following fact: the length of the longest valid phrase with starting position $i$ and reference position $j$ is not larger than the LCP of $T[i..]$ and $T[j..]$ . This fact implies that after we find a phrase of length $\ell^{\prime}$ , we do not need to compute LF functions for any reference position $j$ such that the LCP of $T[i..]$ and $T[j..]$ is not longer than $\ell^{\prime}$ . For an efficient computation, we sort reference positions in descending order with respect to the length of the LCP with $T[i..]$ and maintain those positions in the sorted suffix array $\mathsf{SA}_{i}$ of $i$ . Then, we omit computing LF functions of reference positions on $\mathsf{SA}_{i}[k_{\mathit{max}}+1..]$ for the left-most position $k_{\mathit{max}}$ on $\mathsf{SA}_{i}$ such that the longest valid phrase with a reference position in $\mathsf{SA}_{i}[1..k_{\mathit{max}}]$ is at least as long as the LCP of $T[i..]$ and $T[\mathsf{SA}_{i}[k_{\mathit{max}}]..]$ . This is because $j_{\mathit{max}}$ exists on $\mathsf{SA}_{i}[1..k_{\mathit{max}}]$ . Thus, the following lemma holds.

Lemma 3.

Let $\ell_{k}=\max\{\mathsf{LF}(P,\mathsf{SA}_{i}[1]),\ldots,\mathsf{LF}(P,\mathsf{SA}_{i}[k])\}$ and $k_{\mathit{max}}$ be the left-most position on the $\mathsf{SA}_{i}$ such that $\ell_{k_{\mathit{max}}}\geq\mathsf{lcp}(i,\mathsf{SA}_{i}[k_{\mathit{max}}])$ holds. Then $\ell_{\mathit{max}}=\ell_{k_{\mathit{max}}}$ holds and $\mathsf{SA}_{i}[..k_{\mathit{max}}]$ contains $j_{\mathit{max}}$ .

Proof.

See Appendix. ∎

Algorithm 1 shows the algorithm for computing $\mathsf{LP}(P)$ function and computes each LF function from the head of $\mathsf{SA}_{i}$ . When Algorithm 1 finds $k_{\mathit{max}}$ , it returns the current longest valid phrase.
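The stopping rule of Lemma 3 can be sketched as follows. This is an illustration of the pruning idea, not a reproduction of Algorithm 1: the candidate order is obtained by sorting on brute-force LCP values (ties broken by position, an assumption), and the LF function is passed in as a callable `lf`, a hypothetical parameter. The usage feeds in the LF values listed in the example above for $P_{1}=\langle 3,5\rangle$ and $T=abababaababa$.

```python
def lcp(T, i, j):
    """Length of the longest common prefix of T[i..] and T[j..] (1-indexed)."""
    n, k = len(T), 0
    while i + k <= n and j + k <= n and T[i + k - 1] == T[j + k - 1]:
        k += 1
    return k


def longest_phrase(T, i, lf):
    """Scan SA_i from the front and stop at the first k with l_k >= lcp(i, SA_i[k])."""
    n = len(T)
    order = sorted(range(1, n + 1), key=lambda j: (-lcp(T, i, j), j))  # one valid SA_i
    best_j, best_len = None, 0
    for j in order:
        ell = lf(j)
        if ell > best_len:
            best_j, best_len = j, ell
        if best_len >= lcp(T, i, j):
            break   # no later candidate can yield a longer valid phrase
    return best_j, best_len


T = "abababaababa"
paper_lf = [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0]   # LF(P_1, 1..12) from the example
evaluated = []


def lf(j):
    evaluated.append(j)
    return paper_lf[j - 1]


phrase = longest_phrase(T, 6, lf)
```

Here the scan evaluates only the reference positions $6,2,4,9$ before stopping with the phrase $\langle 9,2\rangle$; a return value of `(None, 0)` would indicate that the next phrase is a character phrase.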

3.2 $\mathsf{LF}$ algorithm

The $\mathsf{LF}$ algorithm $\mathsf{LF}(P,j)$ finds the longest valid target phrase with reference position $j$ that follows the PBP $P$ of $T$ by extending the target phrase one character at a time, starting from length $1$ , until the target phrase can no longer be extended. When the PBP $P\cdot\langle j,\ell\rangle$ is built in this way, it can come to include an infinite loop of references through a mutual reference of phrases, because a target phrase in a PBP can be copied from a reference string on either its left or its right. Such a loop arises when a position of a target phrase in $P$ and a position of $\langle j,\ell\rangle$ can each be reached from the other with a finite number of references. The $\mathsf{LF}$ algorithm avoids such cases by using the union-find data structure built from PBP $P$ .

Each disjoint set in the union-find data structure includes the string positions with the same source (character phrase) for PBP $P$ . The union-find data structure is initialized as $n$ disjoint sets, each containing a single distinct position of the input string of length $n$ . If the union-find data structure for PBP $P\cdot\langle j,\ell-1\rangle$ exists, it can be updated into the data structure for $P\cdot\langle j,\ell\rangle$ by the $\mathit{Union}(i+\ell-1,j+\ell-1)$ operation.

The infinite loops of references can be detected using the find operation in the union-find data structure. When $P\cdot\langle j,\ell-1\rangle$ is valid and the phrase is extended by the starting position $i+\ell-1$ next to the PBP with reference position $j+\ell-1$ , an infinite loop of references exists if and only if $\mathit{Find}(i+\ell-1)$ is equal to $\mathit{Find}(j+\ell-1)$ . The $\mathsf{LF}$ algorithm checks this condition each time. Formally, the following corollary holds.

Corollary 4.

Let $Q=P\cdot\langle j,\ell\rangle$ be a valid PBP and $Q^{\prime}=P\cdot\langle j,\ell+1\rangle$ be a PBP for an integer $\ell\in\{0,\ldots,\mathsf{lcp}(i,j)-1\}$ , and $\mathcal{D}_{P}$ be disjoint sets on $\{1,\ldots,n\}$ such that each set consists of all positions of the same source for a PBP $P$ , where $P\cdot\langle j,0\rangle$ is $P$ and $i$ is the starting position of the last target phrase (i.e., $\langle j,\ell\rangle$ ) in $Q$ . (1) If $\mathit{Find}(i+\ell)\not=\mathit{Find}(j+\ell)$ holds on $\mathcal{D}_{Q}$ , then $Q^{\prime}$ is valid. Otherwise $Q^{\prime}$ is invalid. (2) $\mathcal{D}_{Q^{\prime}}$ is equal to the set created by $\mathit{Union}(i+\ell,j+\ell)$ on $\mathcal{D}_{Q}$ .

Algorithm 2 shows the algorithm for computing $\mathsf{LF}(P,j)$ function using Corollary 4 and the algorithm stated previously. Thus, we can compute the length $\ell_{\mathit{max}}$ of the longest valid target phrase with reference position $j$ and following the PBP $P$ by $O(\ell_{\mathit{max}})$ union and find operations on the given union-find data structure for $\mathcal{D}_{P}$ .

Note that we need to modify Algorithm 2 for $\mathsf{LP}(P)$ algorithm. This is because $\mathsf{LF}$ algorithms in our $\mathsf{LP}(P)$ algorithm need the same union-find data structure determined by the PBP $P$ . On the other hand, the given union-find data structure is changed by union operations in Algorithm 2. By modifying Algorithm 2 using an additional union-find data structure, we can compute $\mathsf{LF}(P,j)$ without updating the given union-find data structure. Formally, the following lemma holds.

Lemma 5.

Given the union-find data structure $L$ for $\mathcal{D}_{P}$ , we can compute $\mathsf{LF}(P,j)$ in $O(n)$ working space by $O(\ell_{\mathit{max}})$ $\mathit{Find}$ operations on $L$ and $O(\ell_{\mathit{max}})$ union and find operations on an additional union-find data structure $L^{\prime}$ for $O(\ell_{\mathit{max}})$ disjoint sets. $L^{\prime}$ is disposed after $\mathsf{LF}(P,j)$ is computed.

Proof.

See Appendix. ∎
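The check in Corollary 4 can be sketched as follows. This is not the authors' Algorithm 2 and does not meet the bounds of Lemma 5: to leave the sets for $P$ untouched it simply copies the whole parent array on every call, which costs $O(n)$ per call, whereas the lemma uses a small auxiliary structure $L^{\prime}$. All names (`source_sets`, `lf`, the list-based union-find) are hypothetical.

```python
def lcp(T, i, j):
    """Length of the longest common prefix of T[i..] and T[j..] (1-indexed)."""
    n, k = len(T), 0
    while i + k <= n and j + k <= n and T[i + k - 1] == T[j + k - 1]:
        k += 1
    return k


def find(parent, x):
    """Root of x's set, with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def source_sets(n, P):
    """Sets D_P: positions with the same source under PBP P. Returns (parent, next start)."""
    parent = list(range(n + 1))          # index 0 is unused
    pos = 1
    for f in P:
        if isinstance(f, str):
            pos += 1                     # character phrase
            continue
        t, length = f
        for d in range(length):
            parent[find(parent, pos + d)] = find(parent, t + d)
        pos += length
    return parent, pos


def lf(T, parent, i, j):
    """LF(P, j): extend <j, l> while Find(i + l) != Find(j + l), as in Corollary 4."""
    scratch = parent[:]                  # work on a copy so D_P is not modified
    ell, limit = 0, lcp(T, i, j)
    while ell < limit:
        a, b = find(scratch, i + ell), find(scratch, j + ell)
        if a == b:
            break                        # the extension would close a loop of references
        scratch[a] = b                   # Union(i + l, j + l)
        ell += 1
    return ell


T = "abababaababa"
parent, i = source_sets(len(T), [(3, 5)])          # P_1 = <3,5>, so i = 6
values = [lf(T, parent, i, j) for j in range(1, len(T) + 1)]
```

For $P_{1}=\langle 3,5\rangle$ the sets are $\{1,3,5,7\}$ and $\{2,4,6\}$ plus singletons, so reference positions $2$ and $4$ are rejected immediately, and `values` reproduces the sequence $0,0,0,0,0,0,0,0,2,0,2,0$ from the example in Section 3.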

3.3 Computation of $\mathsf{LZRR}(T)$

Since the $\mathsf{LF}(P_{p},j)$ algorithm for each $p\in\{0,\ldots,b-1\}$ uses the union-find data structure for disjoint sets of the current LZRR phrases (i.e., $P_{p}$ ), we update the union-find data structure when the $(p+1)$ -th LZRR phrase $\mathsf{LP}(P_{p})$ is selected. This needs at most $|\mathsf{LP}(P_{p})|$ $\mathit{Union}$ operations by Corollary 4.
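Putting the pieces together, the whole parse can be sketched as a simple driver. This is a deliberately naive illustration, not the authors' implementation: it tries every reference position for every phrase (no $\mathsf{SA}_{i}$ pruning), copies the parent array per candidate, and breaks ties by the smallest reference position. All function names are hypothetical. Phrases are returned as tuples `(t, length)` or one-character strings.

```python
def lcp(T, i, j):
    n, k = len(T), 0
    while i + k <= n and j + k <= n and T[i + k - 1] == T[j + k - 1]:
        k += 1
    return k


def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]    # path halving
        x = parent[x]
    return x


def longest_valid(T, parent, i, j):
    """LF for the current sets: extend while the two positions have different sources."""
    scratch, ell, limit = parent[:], 0, lcp(T, i, j)
    while ell < limit:
        a, b = find(scratch, i + ell), find(scratch, j + ell)
        if a == b:
            break
        scratch[a] = b
        ell += 1
    return ell


def lzrr_parse(T):
    n = len(T)
    parent = list(range(n + 1))          # every position starts as its own source
    phrases, i = [], 1
    while i <= n:
        best_j, best_len = 0, 0
        for j in range(1, n + 1):
            ell = longest_valid(T, parent, i, j)
            if ell > best_len:
                best_j, best_len = j, ell
        if best_len == 0:
            phrases.append(T[i - 1])     # character phrase
            i += 1
        else:
            for d in range(best_len):    # commit the phrase: one Union per position
                parent[find(parent, i + d)] = find(parent, best_j + d)
            phrases.append((best_j, best_len))
            i += best_len
    return phrases


def decode(phrases):
    """Rebuild the string by following references down to character phrases."""
    ref, char, pos = {}, {}, 1
    for f in phrases:
        if isinstance(f, str):
            char[pos] = f
            pos += 1
        else:
            t, length = f
            for d in range(length):
                ref[pos + d] = t + d
            pos += length
    out = []
    for x in range(1, pos):
        while x in ref:
            x = ref[x]
        out.append(char[x])
    return "".join(out)


parse = lzrr_parse("abababaababa")
```

With this tie-breaking rule the first phrase of $T=abababaababa$ is $\langle 3,5\rangle$, matching $P_{1}$ in the earlier example, and `decode(parse)` returns the input string.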

4 Theoretical analysis

4.1 The proof of Theorem 1

We show that the working space of LZRR parsing is $O(n)$ space. LZRR parsing needs two data structures: (1) the union-find data structures for the $\mathsf{LF}$ algorithm and (2) the data structure to compute the sequence $W_{p}=\mathsf{SA}_{s_{p}}[1],\mathsf{lcp}(s_{p},\mathsf{SA}_{s_{p}}[1]),\ldots,\mathsf{SA}_{s_{p}}[k_{p}],\mathsf{lcp}(s_{p},\mathsf{SA}_{s_{p}}[k_{p}])$ for the $\mathsf{LP}(P_{p-1})$ algorithm, where $k_{p}$ is $k_{\mathit{max}}$ in the $\mathsf{LP}(P_{p-1})$ algorithm and $s_{p}$ is the starting position of the $p$ -th LZRR phrase.

We can compute $W_{p}[1..k]$ in $O(k)$ time in an online manner using arrays of $\mathsf{SA},\mathsf{ISA}$ , and $\mathsf{LCP}$ for two integers $p$ and $k$ (See Appendix).

$\mathsf{SA}$ , $\mathsf{ISA}$ , and $\mathsf{LCP}$ of a given string $T$ can be constructed in $O(n)$ time and working space [4, 5]. Therefore, the second data structure can be constructed in $O(n)$ time and space, and the LZRR parsing algorithm runs in $O(n)$ working space.

Next, we show that the running time of LZRR parsing is $O(n^{2}\alpha_{n^{2}}(n^{2}))$ . Let $G$ be the sequence of operations on disjoint sets executed by LZRR parsing and $W$ be the sequence of $W_{1}\cdots W_{b}$ , where $b$ is the number of phrases in $\mathsf{LZRR}(T)$ . Then the running time is the sum of the computation time for executing $G$ and computing $W$ , and the preprocessing time of $\mathsf{SA}$ , $\mathsf{ISA}$ and $\mathsf{LCP}$ , which is $O(n)$ .

We show that $W$ can be computed in $O(n^{2})$ time. For an integer $p\in\{1,\ldots,b\}$ , $W_{p}$ can be computed in $O(k_{p})=O(n)$ time. This is because $k_{p}=|\mathit{Occ}(T,f_{p})|\leq n$ holds since $T[\mathsf{SA}_{i}[y]..]$ has $f_{p}$ as a prefix for all $y\in\{1,\ldots,k_{p}\}$ , where $f_{p}$ is the string represented by the $p$ -th LZRR phrase. Thus, $|W|=O(n^{2})$ since $b\leq n$ . Hence $W$ can be computed in $O(n^{2})$ time using the above online algorithm.

We show that $G$ is performed in $O(n^{2}\alpha_{n^{2}}(n^{2}))$ time. $|G|=O(\sum_{p=1}^{b}(|f_{p}|\times k_{p}))$ holds because $\mathsf{LP}(P_{p})$ performs $O(k_{p+1}\times|f_{p+1}|)$ union and find operations for $p\in\{0,\ldots,b-1\}$ . Since $|f_{1}|+\cdots+|f_{b}|=n$ and $k_{p}=|\mathit{Occ}(T,f_{p})|\leq n$ for all $p$ , $|G|=O(n^{2})$ holds. Therefore, $G$ is performed in $O(n^{2}\alpha_{n^{2}}(n^{2}))$ time by union-find data structures.

As a result, we can compute $\mathsf{LZRR}(T)$ in $O(n^{2}\alpha_{n^{2}}(n^{2}))$ time and $O(n)$ working space.

4.2 The proof of Theorem 2

We define two BPs $\mathsf{LZ^{\prime}}(T)$ and $\mathsf{LZOR}(T)$ for Theorem 2 and show three formulas: (1) $|\mathsf{LZ^{\prime}}(T)|=|\mathsf{LZ}(T)|$ , (2) $|\mathsf{LZRR}(T)|\leq|\mathsf{LZOR}(T)|$ , and (3) $|\mathsf{LZOR}(T)|=|\mathsf{LZ}^{\prime}(T^{R})|$ . Theorem 2 follows from (2), (3), and (1) applied to $T^{R}$ , i.e., $|\mathsf{LZRR}(T)|\leq|\mathsf{LZOR}(T)|=|\mathsf{LZ^{\prime}}(T^{R})|=|\mathsf{LZ}(T^{R})|$ . The detailed proofs are in the Appendix.

The proof of $|\mathsf{LZ^{\prime}}(T)|=|\mathsf{LZ}(T)|$ . $\mathsf{LZ^{\prime}}(T)=f_{1},\ldots,f_{k}$ parses greedily $T$ in the right-to-left order such that each phrase is the longest substring occurring previously (left) in $T$ .

A key idea of this proof is that if $\mathsf{LZ^{\prime}}(T)$ chooses a substring as an $\mathsf{LZ^{\prime}}$ phrase, then there exists an LZ phrase starting at a position on the $\mathsf{LZ^{\prime}}$ phrase and including the ending position of the $\mathsf{LZ^{\prime}}$ phrase. This is because the $\mathsf{LZ^{\prime}}$ phrase occurs previously in $T$ and the LZ phrase is the longest substring occurring previously in $T$ . Since the fact holds for every $\mathsf{LZ^{\prime}}$ phrase, $|\mathsf{LZ}(T)|\leq|\mathsf{LZ^{\prime}}(T)|$ holds. Conversely, if $\mathsf{LZ}(T)$ chooses a substring as an LZ phrase, then there exists an $\mathsf{LZ^{\prime}}$ phrase starting at a position on the LZ phrase and including the starting position of the LZ phrase. This is because the LZ phrase occurs previously in $T$ and the $\mathsf{LZ^{\prime}}$ phrase is the longest substring occurring previously in $T$ . Since this fact holds for every LZ phrase, $|\mathsf{LZ^{\prime}}(T)|\leq|\mathsf{LZ}(T)|$ holds. Therefore, $|\mathsf{LZ^{\prime}}(T)|=|\mathsf{LZ}(T)|$ holds.

The proof of $|\mathsf{LZRR}(T)|\leq|\mathsf{LZOR}(T)|$ . $\mathsf{LZOR}(T)=f_{1},\ldots,f_{k}$ parses $T$ in the left-to-right order such that each phrase is the longest substring occurring subsequently in $T$ .

A key idea of this proof is that if $\mathsf{LZOR}(T)$ can choose a substring at a position as an LZOR phrase, then $\mathsf{LZRR}(T)$ can also choose the substring as an LZRR phrase. This is because candidate phrases with right reference positions are always valid phrases in LZRR parsing. Since this fact holds for every position on $T$ , $|\mathsf{LZRR}(T)|\leq|\mathsf{LZOR}(T)|$ holds.

The proof of $|\mathsf{LZOR}(T)|=|\mathsf{LZ}^{\prime}(T^{R})|$ . Parsing a string in the left-to-right order using the longest substring occurring subsequently in the string is equal to parsing the reversed string in the right-to-left order using the longest substring occurring previously in the reversed string. Thus, $|\mathsf{LZOR}(T)|=|\mathsf{LZ^{\prime}}(T^{R})|$ holds.

5 Experiments

In this section, we demonstrate the effectiveness of LZRR parsing with benchmark strings. We used strings from the pseudo-real and real repetitive collections of the Pizza & Chili corpus, downloadable from http://pizzachili.dcc.uchile.cl. We compared our LZRR parsing with LZ77 parsing and lex-parse. We used execution time, memory, and number of phrases as evaluation measures for each method. All the parsing algorithms were implemented in C++. The implementations used in this experiment are available at https://github.com/TNishimoto/lzrr. LZ77 and lex-parse were implemented in the standard manner and work in time and space linear in the string length using $\mathsf{SA},\mathsf{ISA}$ , and $\mathsf{LCP}$ arrays. For each method, we computed two sets of phrases, one for the original string $T$ and one for the reversed string $T^{R}$ , and we took the set with the smaller number of phrases. We denote the numbers of phrases by $|LZ77|$ , $|LEX|$ , and $|LZRR|$ for LZ77, lex-parse (LEX), and LZRR, respectively. We performed all the experiments on one core of a quad-core Intel(R) Xeon(R) E5-2680 v2 (2.80 GHz) CPU with 256 GB of memory.

5.1 Results

Table 1 shows the number of phrases for each method. The number of LZRR phrases was smaller than that of LZ77 phrases for all benchmark strings. Specifically, the number of LZRR phrases was approximately five percent smaller than that of LZ77 for all the strings except for fib41, rs.13, and tm29. The number of LZRR phrases was smaller than that of lex-parse phrases for most of the strings.

Table 2 shows execution time and memory on limited benchmark strings for each method. The table for all the strings is presented in Appendix E. Although our LZRR parsing needs $O(n^{2}\alpha_{n^{2}}(n^{2}))$ time, LZRR parsing was at most four times slower than LZ77 parsing. This is because the number of while-loop iterations in Algorithm 1 is much smaller than $n$ in practice. The memory for LZRR parsing was at most two times larger than that for LZ77 parsing. This is because the proposed algorithm needs the data structure for $\mathsf{LF}$ along with $\mathsf{SA},\mathsf{ISA}$ , and $\mathsf{LCP}$ arrays.

6 Conclusions

We presented a new bidirectional parsing algorithm named Lempel-Ziv 77 parsing with right reference (LZRR). The number of LZRR phrases is theoretically guaranteed to be no larger than that of LZ77. Experimental results using benchmark strings showed that LZRR parsing works in practice. An interesting line of future work is to devise an LZRR parsing algorithm working in $o(n^{2}\alpha_{n^{2}}(n^{2}))$ time or in compressed space.

Acknowledgments. We would like to thank Simon J. Puglisi for notifying us of some related work [1, 2].

Appendix A: The proof of Lemma 5

To compute $\mathsf{LF}(P,j)$ without changing the union-find data structure $L$ for $\mathcal{D}_{P}$ , we create an additional union-find data structure $L^{\prime}$ and we emulate find operations on $\mathcal{D}_{P\cdot\langle j,\ell\rangle}$ using union-find data structures $L$ and $L^{\prime}$ for $\mathcal{D}_{P}$ . Since union operations are performed only on $L^{\prime}$ , $L$ is not changed in the $\mathsf{LF}$ algorithm.

A key idea is that extending $P\cdot\langle j,\ell\rangle$ changes only the sources on the last phrase $\langle j,\ell\rangle$ . See Figure 1. The left figure represents sources of positions on target phrases. The right figure represents the change of sources by appending the new target phrase $\langle j,\ell\rangle$ to the target phrases. The new target phrase changes only sources on the phrase, and these sources are determined by the phrase. This suggests that sources not on the target phrase $\langle j,\ell\rangle$ can be computed using $L$ , and the other sources can be computed using $L$ and the additional union-find data structure $L^{\prime}$ that manages sources of positions on the phrase $\langle j,\ell\rangle$ . In addition, disjoint sets managed by $L^{\prime}$ can be updated by union operations as in Corollary 4.

Formally, let $U(P\cdot\langle j,\ell\rangle)$ be the set of positions on the phrase $\langle j,\ell\rangle$ and sources of those positions (i.e., $U(P\cdot\langle j,\ell\rangle)=\{i,\ldots,i+\ell-1\}\cup\{\mathit{source}(P\cdot\langle j,\ell\rangle,x)\mid x\in\{i,\ldots,i+\ell-1\}\}$ ) and let $\mathcal{D^{\prime}}_{P\cdot\langle j,\ell\rangle}$ be disjoint sets on $U(P\cdot\langle j,\ell\rangle)$ such that each set consists of all positions of the same source, where $i$ is the position following $P$ . Let $\mathit{MakeSet}^{\prime}(x)$ be the operation on disjoint-sets $D$ that adds $\{x\}$ into $D$ if $D$ does not contain $x$ . Then the following lemma and corollary hold.

Lemma 6.

For a position $x\in\{1,2,\ldots,n\}$ , if $\mathit{source}(P,x)\not\in\{i,i+1,\ldots,i+\ell-1\}$ holds, then $\mathit{source}(P,x)=\mathit{source}(P\cdot\langle j,\ell\rangle,x)$ holds. Otherwise, $\mathit{source}(P\cdot\langle j,\ell\rangle,x^{\prime})=\mathit{source}(P\cdot\langle j,\ell\rangle,x)$ holds, where $x^{\prime}=\mathit{source}(P,x)$ .

Proof.

$T[r]$ is a character phrase on $B_{P}$ for each position $r\in\{i,i+1,\ldots,i+\ell-1\}$ . If the source $x^{\prime}$ of $x$ on $B_{P}$ is not a position in $\{i,i+1,\ldots,i+\ell-1\}$ , then $x$ does not reach any position in $\{i,\ldots,i+\ell-1\}$ . When a phrase is appended into $P$ , the source $x^{\prime}$ is changed if and only if the character phrase on $x^{\prime}$ is changed. Thus, the source of $x$ is not changed by appending $\langle j,\ell\rangle$ into $P$ , i.e., $\mathit{source}(P,x)=\mathit{source}(P\cdot\langle j,\ell\rangle,x)$ .

Otherwise, the source $x^{\prime}$ is in $\{i,\ldots,i+\ell-1\}$ on $B_{P}$ and $x^{\prime}$ has a source $x^{\prime\prime}$ on $B_{P\cdot\langle j,\ell\rangle}$ because $T[x^{\prime}]$ is not a character phrase on $B_{P\cdot\langle j,\ell\rangle}$ . Since the source of $x^{\prime}$ is that of $x$ on $B_{P\cdot\langle j,\ell\rangle}$ , Lemma 6 holds.

∎

Corollary 7.

(1) For an integer $x\in\{i,\ldots,i+\ell-1\}$ , there exists a set $X\in\mathcal{D^{\prime}}_{P\cdot\langle j,\ell\rangle}$ that contains two positions $x$ and $\mathit{source}(P\cdot\langle j,\ell\rangle,x)$ . (2) $\mathcal{D^{\prime}}_{P\cdot\langle j,\ell+1\rangle}$ can be created by performing $O(1)$ $\mathit{Union}$ and $\mathit{MakeSet}^{\prime}$ operations on $\mathcal{D^{\prime}}_{P\cdot\langle j,\ell\rangle}$ .

We compute the source of a given position on $\{1,\ldots,n\}$ by $O(1)$ find queries on $\mathcal{D}_{P}$ and $\mathcal{D^{\prime}}_{P\cdot\langle j,\ell\rangle}$ using Lemma 6 and Corollary 7. Note that we need to compute the position on the character phrase in a given set to obtain the source of a given position. For this reason, we use the position on a character phrase as the id of the set that contains the position. We can maintain such ids using an additional array of length $m$ with the same time complexity, where $m$ is the cardinality of the disjoint sets.

We also note that we need to convert integers in $\mathcal{D^{\prime}}_{P\cdot\langle j,\ell\rangle}$ to consecutive integers. This is because disjoint sets of $L^{\prime}$ are on consecutive integers since $\mathit{MakeSet}$ creates the element $\{m+1\}$ . Thus, we use an array $W$ of size $n$ , where $W[x]$ stores the integer in $L^{\prime}$ that corresponds to $x$ if $x\in U(P\cdot\langle j,\ell\rangle)$ ; otherwise $W[x]=-1$ . This array also enables us to emulate $\mathit{MakeSet}^{\prime}$ operations. Since the size of $W$ is $n$ , we reuse $W$ during the LZRR parsing algorithm, and the algorithm creates the array in advance. It takes $O(n)$ time and space. $W$ can be initialized in $O(m)$ time, where $m$ is the number of positive integers in $W$ .
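The role of the array $W$ can be illustrated with a small sketch. It is not the paper's implementation: the class names `UnionFind` and `PositionSets`, the method names, and the `touched` list used to reset $W$ are assumptions introduced here, and indices are 0-based, so $\mathit{MakeSet}$ creates element $m$ rather than $m+1$.

```python
class UnionFind:
    """Union-find on consecutive integers 0, 1, 2, ... (0-based ids)."""

    def __init__(self):
        self.parent = []
        self.size = []

    def make_set(self):
        # Creates the next consecutive element and returns its id.
        self.parent.append(len(self.parent))
        self.size.append(1)
        return len(self.parent) - 1

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:  # union by size
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


class PositionSets:
    """Disjoint sets over sparse text positions, mapped through W."""

    def __init__(self, n):
        self.w = [-1] * n      # W[x] = id of x in the union-find, or -1
        self.touched = []      # positions with W[x] != -1, for a fast reset
        self.uf = UnionFind()

    def make_set_prime(self, x):
        # MakeSet': add {x} only if x is not yet contained in any set.
        if self.w[x] == -1:
            self.w[x] = self.uf.make_set()
            self.touched.append(x)

    def union(self, x, y):
        self.make_set_prime(x)
        self.make_set_prime(y)
        self.uf.union(self.w[x], self.w[y])

    def same(self, x, y):
        if self.w[x] == -1 or self.w[y] == -1:
            return False
        return self.uf.find(self.w[x]) == self.uf.find(self.w[y])

    def reset(self):
        # Clears only the used entries, so the cost is O(m), not O(n).
        for x in self.touched:
            self.w[x] = -1
        self.touched.clear()
        self.uf = UnionFind()
```

Allocating `w` once and clearing only the touched entries mirrors the reuse of $W$ across calls described above.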

Algorithm 3 shows the modified algorithm for computing $\mathsf{LF}(P,j)$ function using Lemma 6 and Corollary 7. Algorithm 3 computes $\mathsf{LF}(P,j)$ by $O(\ell_{\mathit{max}})$ union and find operations and does not perform union operations on $\mathcal{D}_{P}$ , where $\ell_{\mathit{max}}$ is the length of the longest valid phrase following $P$ with reference position $j$ . As a result, Lemma 5 holds.

Note that Algorithms 2 and 3 can fail if there exists an invalid PBP $P\cdot\langle j,\ell^{\prime}\rangle$ for an integer $\ell^{\prime}\in\{1,2,\ldots,\ell_{\mathit{max}}\}$ . If such an integer exists, then the algorithms return $\ell^{\prime}-1$ and fail. However, such cases do not occur because we cannot remove infinite loops of references from an invalid PBP by appending phrases into the PBP.

Appendix B: Computing $\mathsf{SA}_{k}[1..\ell]$ and $\mathsf{lcp}(k,\mathsf{SA}_{k}[1]),\ldots,\mathsf{lcp}(k,\mathsf{SA}_{k}[\ell])$

We show that we can compute $\mathsf{SA}_{k}[1..\ell]$ and $\mathsf{lcp}(k,\mathsf{SA}_{k}[1]),\ldots,\mathsf{lcp}(k,\mathsf{SA}_{k}[\ell])$ for a given $k$ and $\ell$ in $O(\ell)$ time using $T$ and $\mathsf{SA},\mathsf{ISA},\mathsf{LCP}$ arrays.

We use the known fact that $\mathsf{lcp}(\mathsf{SA}[i],\mathsf{SA}[j])=\min\{\mathsf{LCP}[i+1],\ldots,\mathsf{LCP}[j]\}$ holds for two integers $1\leq i<j\leq n$ . When $\mathsf{SA}_{k}[1..\ell^{\prime}]$ stores a permutation of $\mathsf{SA}[i^{\prime}..j^{\prime}]$ containing $k$ for some integer $\ell^{\prime}$ , $\mathsf{SA}_{k}[\ell^{\prime}+1]$ can be chosen as $\mathsf{SA}[i^{\prime}-1]$ or $\mathsf{SA}[j^{\prime}+1]$ by the above fact, where $i^{\prime}$ and $j^{\prime}$ are integers such that $j^{\prime}-i^{\prime}+1=\ell^{\prime}$ . Then $\mathsf{SA}_{k}[1..\ell^{\prime}+1]$ is also a permutation of a subarray of $\mathsf{SA}$ containing $k$ . Thus, we compute $\mathsf{SA}_{k}[1..\ell]$ by repeating this step.

We compute $\mathsf{SA}_{k}[\ell^{\prime}+1]$ using $i^{\prime},j^{\prime},p$ and $q$ , where $p=\mathsf{lcp}(k,\mathsf{SA}[i^{\prime}])$ and $q=\mathsf{lcp}(k,\mathsf{SA}[j^{\prime}])$ . Since $\mathsf{lcp}(k,\mathsf{SA}[i^{\prime}-1])=\min\{\mathsf{LCP}[i^{\prime}],p\}$ and $\mathsf{lcp}(k,\mathsf{SA}[j^{\prime}+1])=\min\{\mathsf{LCP}[j^{\prime}+1],q\}$ , $\mathsf{SA}_{k}[\ell^{\prime}+1]$ can be computed in constant time. In addition, we can appropriately update the four parameters in constant time for $\mathsf{SA}_{k}[\ell^{\prime}+2]$ . Therefore, we can compute $\mathsf{SA}_{k}[1..\ell]$ and $\mathsf{lcp}(k,\mathsf{SA}_{k}[1]),\ldots,\mathsf{lcp}(k,\mathsf{SA}_{k}[\ell])$ in $O(\ell)$ time and constant working space using a simple algorithm. ∎
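The expansion described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: indices are 0-based, $\mathsf{SA}_{k}$ is assumed to start with $k$ itself, the arrays are built naively instead of in linear time, and the names `build_arrays` and `nearest_by_lcp` are hypothetical.

```python
def lcp_len(t, a, b):
    """Length of the longest common prefix of t[a:] and t[b:]."""
    l = 0
    while a + l < len(t) and b + l < len(t) and t[a + l] == t[b + l]:
        l += 1
    return l

def build_arrays(t):
    """Naive SA, ISA and LCP, where lcp[r] refers to sa[r-1] and sa[r]."""
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])
    isa = [0] * n
    for r, p in enumerate(sa):
        isa[p] = r
    lcp = [0] * n
    for r in range(1, n):
        lcp[r] = lcp_len(t, sa[r - 1], sa[r])
    return sa, isa, lcp

def nearest_by_lcp(t, sa, isa, lcp, k, count):
    """First `count` entries of SA_k together with their lcp values."""
    n = len(t)
    lo = hi = isa[k]   # current window SA[lo..hi], the paper's i' and j'
    p = q = n - k      # p = lcp(k, SA[lo]), q = lcp(k, SA[hi])
    order, lcps = [k], [n - k]
    while len(order) < count:
        # Candidate lcp values of the two neighbours of the window.
        left = min(lcp[lo], p) if lo > 0 else -1
        right = min(lcp[hi + 1], q) if hi + 1 < n else -1
        if left < 0 and right < 0:
            break      # the window already covers the whole suffix array
        if left >= right:
            lo, p = lo - 1, left
            order.append(sa[lo])
            lcps.append(p)
        else:
            hi, q = hi + 1, right
            order.append(sa[hi])
            lcps.append(q)
    return order, lcps
```

For `t = "banana"` and `k = 1` (the suffix `anana`), requesting four entries yields the positions `[1, 3, 5, 0]` with lcp values `[5, 3, 1, 0]`. Each step of the loop reads a constant number of array cells, matching the $O(\ell)$ bound.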

Appendix C: The proof of the upper bound of LZRR phrases

We show three formulas using injective functions; for two BPs $F=f_{1},f_{2},\ldots,f_{k}$ and $F^{\prime}=f^{\prime}_{1},f^{\prime}_{2},\ldots,f^{\prime}_{k^{\prime}}$ of $T$ , if there exists an injective function $w$ that maps phrases in $F$ into distinct phrases in $F^{\prime}$ , then $k\leq k^{\prime}$ holds. In the remainder of this section, let $s_{x}$ and $e_{x}$ (resp. $s^{\prime}_{x}$ and $e^{\prime}_{x}$ ) be the starting and ending positions of the $x$ -th phrase in $F$ (resp. $F^{\prime}$ ).

The proof of $|\mathsf{LZ^{\prime}}(T)|=|\mathsf{LZ}(T)|$ . $\mathsf{LZ^{\prime}}(T)=f_{1},f_{2},\ldots,f_{k}$ greedily parses $T$ in the right-to-left order such that each phrase is the longest substring occurring previously (to the left) in $T$ . Formally, let $\mathsf{LPF}^{\prime}$ be the integer array of length $n$ such that $\mathsf{LPF}^{\prime}[i]$ stores the length of the longest substring of $T$ ending at position $i$ and occurring on $T[1..i-1]$ for all $i\in\{1,2,\ldots,n\}$ , i.e., $\mathsf{LPF}^{\prime}[i]=\max(\{0\}\cup\{\ell\mid\ell\in[1..i],|\mathit{Occ}(T[..i-1],T[i-\ell+1..i])|>0\})$ . Then $\mathsf{LZ^{\prime}}(T)=f_{1},\ldots,f_{k}$ is the valid BP of $T$ such that for all $x\in\{1,2,\ldots,k\}$ , the starting position $s_{x}$ of $f_{x}$ is $s_{x+1}-\max\{1,\mathsf{LPF}^{\prime}[s_{x+1}-1]\}$ , where $s_{k+1}=n+1$ . Figure 2 illustrates examples of $\mathsf{LZ}(T)$ and $\mathsf{LZ^{\prime}}(T)$ .

For $\mathsf{LZ}(T)=f_{1},f_{2},\ldots,f_{k}$ and $\mathsf{LZ^{\prime}}(T)=f^{\prime}_{1},f^{\prime}_{2},\ldots,f^{\prime}_{k^{\prime}}$ , we define the function $w(x^{\prime})$ that returns the integer $x$ such that $f_{x}$ contains $e^{\prime}_{x^{\prime}}$ (i.e., $s_{x}\leq e^{\prime}_{x^{\prime}}\leq e_{x}$ holds). $w$ is injective if the starting position of each $\mathsf{LZ^{\prime}}$ phrase is not larger than that of the LZ phrase containing the ending position of the $\mathsf{LZ^{\prime}}$ phrase, i.e., $s^{\prime}_{x^{\prime}}\leq s_{w(x^{\prime})}$ holds for all $x^{\prime}\in\{1,2,\ldots,k^{\prime}-1\}$ . This is because no LZ phrase then contains the ending positions of two $\mathsf{LZ^{\prime}}$ phrases, i.e., no integer $y^{\prime}$ exists such that $w(y^{\prime})=w(y^{\prime}+1)$ holds if $s^{\prime}_{x^{\prime}}\leq s_{w(x^{\prime})}$ holds for all $x^{\prime}$ .

We show $s^{\prime}_{x^{\prime}}\leq s_{w(x^{\prime})}$ using the substring $S$ starting at the starting position of $f_{w(x^{\prime})}$ and ending at the ending position of $f^{\prime}_{x^{\prime}}$ , i.e., $S=T[s_{w(x^{\prime})}..e^{\prime}_{x^{\prime}}]$ . When $|f^{\prime}_{x^{\prime}}|\geq|S|$ holds, $s^{\prime}_{x^{\prime}}\leq s_{w(x^{\prime})}$ holds because $S$ is a suffix of $f^{\prime}_{x^{\prime}}$ and $S$ is a prefix of $f_{w(x^{\prime})}$ . Thus, we show that $|f^{\prime}_{x^{\prime}}|\geq|S|$ always holds. If $S$ occurs previously in $T$ , then $|f^{\prime}_{x^{\prime}}|\geq|S|$ because $\mathsf{LZ^{\prime}}$ chooses the longest substring ending at position $e^{\prime}_{x^{\prime}}$ and occurring on $T[1..e^{\prime}_{x^{\prime}}-1]$ . Otherwise, $|S|=1$ and $S=f^{\prime}_{x^{\prime}}=f_{w(x^{\prime})}$ hold since $S$ is a new character, i.e., $\mathit{Occ}(T[1..e^{\prime}_{x^{\prime}}-1],T[e^{\prime}_{x^{\prime}}])=\emptyset$ . Therefore, $s^{\prime}_{x^{\prime}}\leq s_{w(x^{\prime})}$ holds for all $x^{\prime}\in\{1,2,\ldots,k^{\prime}-1\}$ .

Similarly, $|\mathsf{LZ^{\prime}}(T)|\geq|\mathsf{LZ}(T)|$ holds by constructing the injective function that returns the integer $x^{\prime}$ such that $f^{\prime}_{x^{\prime}}$ contains $s_{x}$ for a given $x$ .

The proof of $|\mathsf{LZRR}(T)|\leq|\mathsf{LZOR}(T)|$ . $\mathsf{LZOR}(T)=f_{1},f_{2},\ldots,f_{k}$ parses $T$ in the left-to-right order such that each phrase is the longest substring occurring subsequently in $T$ . Formally, let $\mathsf{LNF}$ be the integer array of length $n$ such that $\mathsf{LNF}[i]$ stores the length of the longest substring of $T$ starting at position $i$ and occurring on $T[i+1..]$ for all $i\in\{1,2,\ldots,n\}$ , i.e., $\mathsf{LNF}[i]=\max(\{0\}\cup\{\ell\mid\ell\in[1..n-i+1],|\mathit{Occ}(T[i+1..],T[i..i+\ell-1])|>0\})$ . Then $\mathsf{LZOR}(T)=f_{1},f_{2},\ldots,f_{k}$ is the BP of $T$ such that for all $x\in\{2,3,\ldots,k\}$ , the starting position $s_{x}$ of $f_{x}$ is $s_{x-1}+\max\{1,\mathsf{LNF}[s_{x-1}]\}$ and $s_{1}=1$ . Figure 2 illustrates an example of $\mathsf{LZOR}(T)$ .
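As a concrete reading of this definition, the following brute-force sketch computes $\mathsf{LNF}$ and the resulting $\mathsf{LZOR}$ phrases. It uses 0-based indices and quadratic-or-worse string searches purely for clarity; the function names `lnf_array` and `lzor_parse` are assumptions made for this example.

```python
def lnf_array(t):
    """lnf[i] = length of the longest prefix of t[i:] occurring in t[i+1:]."""
    n = len(t)
    lnf = [0] * n
    for i in range(n):
        length = 0
        # If a prefix occurs in t[i+1:], so does every shorter prefix,
        # hence extending one character at a time finds the maximum.
        while i + length < n and t[i:i + length + 1] in t[i + 1:]:
            length += 1
        lnf[i] = length
    return lnf

def lzor_parse(t):
    """Phrases of the left-to-right parse with phrase length max(1, LNF)."""
    lnf = lnf_array(t)
    phrases, s = [], 0
    while s < len(t):
        length = max(1, lnf[s])
        phrases.append(t[s:s + length])
        s += length
    return phrases
```

For instance, `lzor_parse("abab")` gives `["ab", "a", "b"]`, and `lzor_parse("aaaa")` gives `["aaa", "a"]` because the occurrence to the right may overlap the phrase itself.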

For $\mathsf{LZRR}(T)=f_{1},f_{2},\ldots,f_{k}$ and $\mathsf{LZOR}(T)=f^{\prime}_{1},f^{\prime}_{2},\ldots,f^{\prime}_{k^{\prime}}$ , let $w(x)$ be the function that returns the integer $x^{\prime}$ such that $f^{\prime}_{x^{\prime}}$ contains $s_{x}$ . Then $w$ is injective if $w(x)<w(x+1)$ holds for all $x\in\{1,2,\ldots,k-1\}$ . We use the following lemma.

Lemma 8.

Let $P=f_{1},\ldots,f_{b}$ be a valid PBP of $T$ . Then $P\cdot\langle j,\ell\rangle$ is also valid for any right target phrase $\langle j,\ell\rangle$ , i.e., $T[i..i+\ell-1]=T[j..j+\ell-1]$ and $j>i$ hold, where $i=|f_{1}\cdots f_{b}|+1$ .

Proof.

$T[i..n]$ is represented by character phrases on $B_{P}$ since $T[i..n]$ has not been parsed. This means that $\mathit{source}(P,i+\ell^{\prime}-1)\not=\mathit{source}(P,j+\ell^{\prime}-1)$ for any $\ell^{\prime}\in\{1,2,\ldots,\ell\}$ . Therefore $P\cdot\langle j,\ell^{\prime}\rangle$ is valid by Corollary 4. ∎

$w(x)<w(x+1)$ holds if $\max\{1,\mathsf{LNF}[s_{x}]\}\leq|f_{x}|$ holds for all $x\in\{1,2,\ldots,k\}$ . Recall that the $\mathsf{LP}$ function returns the longest valid bidirectional phrase. The $\mathsf{LNF}$ array and Lemma 8 imply that the length of the phrase of $\mathsf{LZRR}(T)$ starting at position $i$ is at least $\max\{1,\mathsf{LNF}[i]\}$ . Thus $w(x)<w(x+1)$ holds for all $x$ , $w$ is injective, and hence $|\mathsf{LZRR}(T)|\leq|\mathsf{LZOR}(T)|$ holds.

The proof of $|\mathsf{LZOR}(T)|=|\mathsf{LZ^{\prime}}(T^{R})|$ . $|\mathsf{LZOR}(T)|=|\mathsf{LZ}^{\prime}(T^{R})|$ holds clearly because $\mathsf{LPF}^{\prime}[x]=\mathsf{LNF}_{T^{R}}[n-x+1]$ holds for all $x\in\{1,2,\ldots,n\}$ , where $\mathsf{LNF}_{T^{R}}$ is the $\mathsf{LNF}$ array of $T^{R}$ .
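The array identity can be checked exhaustively on short strings. The sketch below is a brute-force illustration with 0-based indices, so the identity becomes `lpf_prime[i] == lnf_rev[n - 1 - i]`; the function names are hypothetical and nothing here reflects the linear-time implementation.

```python
from itertools import product

def lpf_prime_array(t):
    """out[i] = length of the longest substring ending at i found in t[:i]."""
    out = []
    for i in range(len(t)):
        length = 0
        # Candidate of length + 1 characters ending at position i.
        while length <= i and t[i - length:i + 1] in t[:i]:
            length += 1
        out.append(length)
    return out

def lnf_array(t):
    """out[i] = length of the longest prefix of t[i:] found in t[i+1:]."""
    n = len(t)
    out = []
    for i in range(n):
        length = 0
        while i + length < n and t[i:i + length + 1] in t[i + 1:]:
            length += 1
        out.append(length)
    return out

def reversal_identity_holds(t):
    """Check LPF'[x] = LNF_{T^R}[n-x+1] for every position (0-based here)."""
    n = len(t)
    lpf = lpf_prime_array(t)
    lnf_rev = lnf_array(t[::-1])
    return all(lpf[i] == lnf_rev[n - 1 - i] for i in range(n))

def identity_holds_up_to(max_len):
    """Run the check on every binary string of length <= max_len."""
    return all(
        reversal_identity_holds("".join(p))
        for n in range(1, max_len + 1)
        for p in product("ab", repeat=n)
    )
```

On `abab`, `lpf_prime_array` returns `[0, 0, 1, 2]`, while `lnf_array("baba")` returns `[2, 1, 0, 0]`, which is the same sequence read backwards.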

Appendix D: The proof of Lemma 3

Proof.

Recall that $\mathsf{lcp}(i,\mathsf{SA}_{i}[1])\geq\cdots\geq\mathsf{lcp}(i,\mathsf{SA}_{i}[n])$ holds. On the other hand, $\mathsf{LF}(P,j)\leq\mathsf{lcp}(i,j)$ holds for all $j\in\{1,2,\ldots,n\}$ because $\mathsf{LF}(P,j)$ represents the length of the common prefix of $T[i..]$ and $T[j..]$ . Therefore, $\ell_{\mathit{max}}=\ell_{n}=\ell_{k_{\mathit{max}}}$ holds, which means at least one position $j^{\prime}$ exists such that $\mathsf{LF}(P,j^{\prime})=\ell_{\mathit{max}}$ in $\mathsf{SA}_{i}[..k_{\mathit{max}}]$ . ∎

Appendix E: Experiments

Bibliography

[1] Dinklage, P., Fischer, J., Köppl, D., Löbel, M., Sadakane, K.: Compression with the tudocomp framework. In: Proceedings of SEA. pp. 13:1–13:22 (2017)
[2] Gagie, T., Navarro, G., Prezza, N.: On the approximation ratio of Lempel-Ziv parsing. In: Proceedings of LATIN. pp. 490–503 (2018)
[3] Jez, A.: A really simple approximation of smallest grammar. Theor. Comput. Sci. 616, 141–150 (2016)
[4] Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proceedings of ICALP. pp. 943–955 (2003)
[5] Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of CPM. pp. 181–192 (2001)
[6] Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of STOC. pp. 827–840 (2018)
[7] Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: Proceedings of DCC. pp. 239–248 (2010)
[8] Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Transactions on Information Theory 22(1), 75–81 (1976)
