Re-Pair In Small Space

Dominik Köppl; Tomohiro I; Isamu Furuya; Yoshimasa Takabatake; Kensuke Sakai; Keisuke Goto

arXiv:1908.04933·cs.DS·November 19, 2019


TL;DR

This paper introduces a space-efficient algorithm for computing Re-Pair grammar compression on large datasets, reducing memory usage while maintaining effective compression rates.

Contribution

It presents a novel algorithm that computes Re-Pair in significantly less space, supporting large-scale data processing and recovery of original input.

Findings

01

Achieves Re-Pair computation in near-quadratic time with reduced space complexity.

02

Supports recovery of original text within the same time as computation.

03

Provides variants for parallel and external memory models.

Abstract

Re-Pair is a grammar compression scheme with favorably good compression rates. The computation of Re-Pair comes with the cost of maintaining large frequency tables, which makes it hard to compute Re-Pair on large scale data sets. As a solution for this problem we present, given a text of length $n$ whose characters are drawn from an integer alphabet, an $\mathcal{O}(n^{2})\cap\mathcal{O}(n^{2}\lg\log_{\tau}n\lg\lg\lg n/\log_{\tau}n)$ time algorithm computing Re-Pair in $n\lg\max(n,\tau)$ bits of space including the text space, where $\tau$ is the number of terminals and non-terminals. The algorithm works in the restore model, supporting the recovery of the original input in the time for the Re-Pair computation with $\mathcal{O}(\lg n)$ additional bits of working space. We give variants of our solution working in parallel or in the external memory model.

Tables

Table 1: Experimental evaluation of our implementation described in Sect. 2.5. Table entries are running times in seconds. The last line is the benchmark on the unary string $\texttt{aa}\cdots\texttt{a}$.

$rmPreRun (X)$
Operation	Example
$X$	11100110
$\neg X$	00011001
$1 ≪ (1 + msb (\neg X))$	00100000
$(1 ≪ (1 + msb (\neg X))) - 1$	00011111
$((1 ≪ (1 + msb (\neg X))) - 1) & X$	00000110
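
The bit operations listed in the table can be traced with a few lines of Python. This is a minimal sketch, assuming an 8-bit word as in the example column; the function names `msb` and `rm_pre_run`, the `width` parameter, and the treatment of an all-ones word are illustrative assumptions and not taken from the paper.

```python
def msb(x: int) -> int:
    """Index of the most significant set bit, counted from 0 at the right; x must be > 0."""
    return x.bit_length() - 1


def rm_pre_run(x: int, width: int = 8) -> int:
    """Clear the run of leading 1-bits of x, viewed as a word of `width` bits."""
    word_mask = (1 << width) - 1
    neg = ~x & word_mask                 # bitwise negation restricted to the word
    if neg == 0:                         # assumption: an all-ones word becomes 0
        return 0
    keep = (1 << (1 + msb(neg))) - 1     # ones below and at the first 0-bit of x
    return keep & x


print(format(rm_pre_run(0b11100110), "08b"))  # 00000110
```

The intermediate values `neg` and `keep` correspond to the rows $\neg X$ and $(1 ≪ (1 + msb (\neg X))) - 1$ of the table.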

Table 2: Characteristics of our data sets. The number of turns and rounds are given for each of the prefix sizes 128, 256, 512, and 1024 KiB of the respective data sets. The number of turns reflecting the number of non-terminals is given in units of thousands. The turns of the unary string $\texttt{aa}\cdots\texttt{a}$ are in plain units (not divided by thousand).


Equations

With $\beta$, we have

$\alpha f_{k+1}=\alpha f_{k}\max\left(1+\frac{2}{\alpha\beta f_{k}},\,1+\frac{1}{2\alpha\beta}-\frac{1}{2\alpha\beta f_{k}}\right)\geq\alpha f_{k}\left(1+\frac{2}{5\alpha\beta}\right)=:\gamma_{i}\alpha f_{k}$ with $\gamma_{i}:=1+2/(5\alpha\beta)$,

$\mathcal{O}\left(\sum_{k=0}^{\mathcal{O}(\lg n)}\frac{n-f_{k}}{f_{k}}n\lg f_{k}\right)=\mathcal{O}\left(n^{2}\sum_{k}^{\lg n}\frac{k}{\gamma^{k}}\right)=\mathcal{O}(n^{2})$ time in total.

$X[i]=\begin{cases}2^{\lceil\lg\sigma\rceil}-1&\text{if }S[i]=\texttt{c},\\0&\text{otherwise},\end{cases}$

$\mathcal{O}\left(\sum_{k=0}^{\mathcal{O}(\lg n)}\min\left(\frac{n-f_{k}}{f_{k}}n\lg f_{k},\frac{(n-f_{k})^{2}\lg\lg\lg n}{\log_{\tau}n}\right)\right)=\mathcal{O}\left(n^{2}\sum_{k=0}^{\lg n}\min\left(\frac{k}{\gamma^{k}},\frac{\lg\lg\lg n}{\log_{\tau}n}\right)\right)=\mathcal{O}\left(\frac{n^{2}\lg\log_{\tau}n\lg\lg\lg n}{\log_{\tau}n}\right)$ time in total,

$\mathcal{O}\left(\sum_{k=0}^{\mathcal{O}(\lg n)}\frac{n-f_{k}}{f_{k}}\frac{n}{p}\lg^{2}f_{k}\right)=\mathcal{O}\left(\frac{n^{2}}{p}\sum_{k}^{\lg n}\frac{k^{2}}{\gamma^{k}}\right)=\mathcal{O}\left(\frac{n^{2}}{p}\right)$ time in total.


Code & Models

Repositories

koeppl/repair-inplace (official)


Full text

Re-Pair in Small Space

Dominik Köppl

Tomohiro I

Isamu Furuya

Yoshimasa Takabatake

Kensuke Sakai

Keisuke Goto

Abstract

Re-Pair is a grammar compression scheme with favorably good compression rates. The computation of Re-Pair comes with the cost of maintaining large frequency tables, which makes it hard to compute Re-Pair on large scale data sets. As a solution for this problem we present, given a text of length $n$ whose characters are drawn from an integer alphabet, an $\mathcal{O}(n^{2})\cap\mathcal{O}(n^{2}\lg\log_{\tau}n\lg\lg\lg n/\log_{\tau}n)$ time algorithm computing Re-Pair in $n\lceil\lg\max(n,\tau)\rceil$ bits of working space including the text space, where $\tau$ is the number of terminals and non-terminals. The algorithm works in the restore model, supporting the recovery of the original input in the time for the Re-Pair computation with $\mathcal{O}(\lg n)$ additional bits of working space. We give variants of our solution working in parallel or in the external memory model.

Keywords: Grammar Compression, Re-Pair, Computation in Small Space

1 Introduction

Re-Pair [21] is a grammar deriving a single string. It is computed by replacing the most frequent bigram in this string with a new non-terminal, recursing until no bigram occurs more than once. Despite this simple-looking description, both the merits and the computational complexity of Re-Pair are intriguing. As a matter of fact, Re-Pair is currently one of the most well-understood grammar schemes.

Besides the seminal work of Larsson and Moffat [21], there are a couple of articles devoted to the compression aspects of Re-Pair: Given a text $T$ of length $n$ whose characters are drawn from an integer alphabet of size $\sigma$ , the output of Re-Pair applied to $T$ is at most $2nH_{k}(T)+\mathop{}\mathopen{}o\mathopen{}(n\lg\sigma)$ bits with $k=\mathop{}\mathopen{}o\mathopen{}(\log_{\sigma}n)$ when represented naively as a list of character pairs [25], where $H_{k}$ denotes the empirical entropy of the $k$ -th order. Using the encoding of Kieffer and Yang [19], Ochoa and Navarro [26] could improve the output size to at most $nH_{k}(T)+\mathop{}\mathopen{}o\mathopen{}(n\lg\sigma)$ bits. Other encodings were recently studied by Ganczorz [14]. Since Re-Pair is a so-called irreducible grammar, its grammar size, i.e., the sum of the symbols on the right hand of all rules, is upper bounded by $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n/\log_{\sigma}n)$ [19, Lemma 2], which matches the information-theoretic lower bound on the size of a grammar for a string of length $n$ . Comparing this size with the size of the smallest grammar, its approximation ratio has $\mathop{}\mathopen{}\mathcal{O}\mathopen{}((n/\lg n)^{2/3})$ as an upper bound [8] and $\mathop{}\mathopen{}\mathup{\Omega}\mathopen{}(\lg n/\lg\lg n)$ as a lower bound [2].

On the practical side, Yoshida and Kida [33] presented an efficient fixed-length code for compressing the Re-Pair grammar. Although conceived as a grammar for compressing texts, Re-Pair has been successfully applied for compressing trees [23], matrices [30], or images [11].

For different settings or for better compression rates, there is a great interest in modifications to Re-Pair. Charikar et al. [8, Sect. G] give an easy variation to improve the size of the grammar. Sekine et al. [28] provide an adaptive variant whose algorithm divides the input into blocks, and processes each block based on the rules obtained from the grammars of its preceding blocks. Subsequently, Masaki and Kida [24] gave an online algorithm producing a grammar mimicking Re-Pair. Ganczorz and Jez [15] modified the Re-Pair grammar by disfavoring the replacement of bigrams that cross Lempel-Ziv-77 (LZ77) [34] factorization borders, which allowed the authors to achieve practically smaller grammar sizes. Recently, Furuya et al. [13] presented a variant, called MR-Re-Pair, in which a most frequent maximal repeat is replaced instead of a most frequent bigram.

1.1 Related Work

Although Re-Pair is a well received grammar, there is not much literature found on how to compute Re-Pair efficiently. In this article, we focus on the problem to compute the grammar with an algorithm working in text space, forming a bridge between the domain of in-place string algorithms and the domain of Re-Pair computing algorithms. We briefly review some prominent achievements in both domains:

In-Place String Algorithms.

For the LZ77 factorization, Kärkkäinen et al. [18] present an algorithm computing this factorization with $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n/d)$ words on top of the input space in $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(dn)$ time for a variable $d\geq 1$ , achieving $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(1)$ words with $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n^{2})$ time. For the suffix sorting problem, Goto [16] gave an algorithm to compute the suffix array with $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(\lg n)$ bits on top of the output in $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n)$ time if each character of the alphabet is present in the text. This condition got improved to alphabet sizes of at most $n$ by Li et al. [22]. Finally, Crochemore et al. [9] showed how to transform a text into its Burrows-Wheeler transform by using $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(\lg n)$ of additional bits. Due to da Louza et al. [10], this algorithm got extended to compute simultaneously the LCP array with $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(\lg n)$ bits of additional working space.

Re-Pair Computation.

Re-Pair is a grammar proposed by Larsson and Moffat [21], who gave an algorithm computing it in expected linear time with $5n+4\sigma^{2}+4\sigma^{\prime}+\sqrt{n}$ words of working space, where $\sigma^{\prime}$ is the number of non-terminals (produced by Re-Pair). This space requirement got improved by Bille et al. [5], who presented a linear time algorithm taking $(1+\epsilon)n+\sqrt{n}$ words on top of the rewriteable text space for a constant $\epsilon$ with $0<\epsilon\leq 1$ . Subsequently, they improved their algorithm in [4] to include the text space within the $(1+\epsilon)n+\sqrt{n}$ words of working space. However, they assume that the alphabet size $\sigma$ is constant and $\left\lceil\lg\sigma\right\rceil\leq w/2$ , where $w$ is the machine word size. They also provide a solution for $\epsilon=0$ running in expected linear time. Recently, Sakai et al. [27] showed how to convert an arbitrary grammar (representing a text) into the Re-Pair grammar in compressed space, i.e., without decompressing the text. Combined with a grammar compression that can process the text in compressed space in a streaming fashion, this result leads to the first Re-Pair computation in compressed space.

Our Contribution.

In this article, we propose an algorithm that computes the Re-Pair grammar in $\mathcal{O}(n^{2})\cap\mathcal{O}(n^{2}\lg\log_{\tau}n\lg\lg\lg n/\log_{\tau}n)$ time (cf. Thm. 2.3 and Thm. 3.1) with $\max((n/c)\lg n,n\lceil\lg\tau\rceil)+\mathcal{O}(\lg n)$ bits of working space including the text space, where $\tau$ is the number of terminals and non-terminals. Given that the characters of the text are drawn from a large integer alphabet with size $\sigma=\Omega(n)$, the algorithm works in-place. This is the first non-trivial in-place algorithm, as a trivial approach on a text $T$ of length $n$ would compute the most frequent bigram in $\Theta(n^{2})$ time by computing the frequency of each bigram $T[i]T[i+1]$ for every integer $i$ with $1\leq i\leq n-1$, keeping only the most frequent bigram in memory. This sums up to $\mathcal{O}(n^{3})$ total time, and can be $\Theta(n^{3})$ for some texts since there can be $\Theta(n)$ different bigrams considered for replacement by Re-Pair. To achieve our goal of $\mathcal{O}(n^{2})$ total time, we first provide a trade-off algorithm (cf. Lemma 2.2) finding the $d$ most frequent bigrams in $\mathcal{O}(n^{2}\lg d/d)$ time for a trade-off parameter $d$. We subsequently run this algorithm for increasing values of $d$, and show that we need to run it $\mathcal{O}(\lg n)$ times, which gives us $\mathcal{O}(n^{2})$ time if $d$ increases sufficiently fast. Our major tools are appropriate text partitioning, elementary scans, and sorting steps, which we visualize in Sect. 2.4 by an example, and practically evaluate in Sect. 2.5. When $\tau=o(n)$, a different approach using word-packing and bit-parallel techniques becomes attractive, leading to an $\mathcal{O}(n^{2}\lg\log_{\tau}n\lg\lg\lg n/\log_{\tau}n)$ time algorithm, which we explain in Sect. 3. Our algorithm can be parallelized (Sect. 5), used in external memory (Sect. 6), or adapted to compute the MR-Re-Pair grammar in small space (Sect. 4). Finally, in Sect. 7 we study several heuristics that make the algorithm faster on specific texts.

1.2 Preliminaries

We use the word RAM model with a word size of $\Omega(\lg n)$ for an integer $n\geq 1$. We work in the restore model [7], in which algorithms are allowed to overwrite the input, as long as they can restore the input to its original form.

Strings.

Let $T$ be a text of length $n$ whose characters are drawn from an integer alphabet $\Sigma$ of size $\sigma=n^{\mathcal{O}(1)}$. A bigram is an element of $\Sigma^{2}$. The frequency of a bigram $B$ in $T$ is the number of non-overlapping occurrences of $B$ in $T$, which is at most $|T|/2$.
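
The restriction to non-overlapping occurrences only matters for bigrams of two equal characters. The following is a minimal Python sketch of this frequency definition, assuming a greedy left-to-right scan; the function name `bigram_frequency` is our own and not from the paper.

```python
def bigram_frequency(text: str, bigram: tuple) -> int:
    """Number of non-overlapping occurrences of `bigram` in `text` (greedy scan)."""
    count = 0
    i = 0
    while i + 1 < len(text):
        if (text[i], text[i + 1]) == bigram:
            count += 1
            i += 2      # skip both positions so that counted occurrences do not overlap
        else:
            i += 1
    return count


print(bigram_frequency("aaaa", ("a", "a")))  # 2
print(bigram_frequency("aaa", ("a", "a")))   # 1
```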

Re-Pair.

We reformulate the recursive description in the introduction by dividing a Re-Pair construction algorithm into turns. Stipulating that $T_{i}$ is the text after the $i$ -th turn with $i\geq 1$ and $T_{0}:=T\in\Sigma_{0}^{+}$ with $\Sigma_{0}:=\Sigma$ , Re-Pair replaces one of the most frequent bigrams (ties are broken arbitrarily) in $T_{i-1}$ with a non-terminal in the $i$ -th turn. Given this bigram is $\texttt{bc}\in\Sigma^{2}_{i-1}$ , Re-Pair replaces all occurrences of bc with a new non-terminal $X_{i}$ in $T_{i-1}$ , and sets $\Sigma_{i}:=\Sigma_{i-1}\cup\{X_{i}\}$ with $\sigma_{i}:=|\Sigma_{i}|$ to produce $T_{i}\in\Sigma_{i}^{+}$ . Since $\left|T_{i}\right|\leq\left|T_{i-1}\right|-2$ , Re-Pair terminates after $m<n/2$ turns such that $T_{m}\in\Sigma_{m}^{+}$ contains no bigram occurring more than once.
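
For orientation, the turns described above can be written as a naive reference implementation. This is a sketch only: it uses a dictionary of all bigram frequencies and fresh lists in every turn, so it does not reflect the small-space algorithm developed in this article. The names `count_bigrams`, `replace`, `repair`, and `expand`, and the representation of the non-terminal $X_{i}$ as the tuple `("X", i)`, are illustrative assumptions.

```python
from collections import Counter


def count_bigrams(seq):
    """Frequencies (non-overlapping occurrences) of all bigrams in seq."""
    counts = Counter()
    last_end = {}   # bigram -> last position covered by a counted occurrence
    for i in range(len(seq) - 1):
        b = (seq[i], seq[i + 1])
        if last_end.get(b, -1) < i:     # does not overlap the previously counted one
            counts[b] += 1
            last_end[b] = i + 1
    return counts


def replace(seq, bigram, symbol):
    """Replace the occurrences of bigram from left to right with symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == bigram:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Return (T_m, rules); rules[i - 1] is the right-hand side of ("X", i)."""
    seq, rules = list(text), []
    while True:
        counts = count_bigrams(seq)
        if not counts:
            break
        bigram, freq = max(counts.items(), key=lambda kv: kv[1])  # ties: arbitrary
        if freq < 2:
            break
        rules.append(bigram)
        seq = replace(seq, bigram, ("X", len(rules)))
    return seq, rules


def expand(symbol, rules):
    """Derive the string represented by a terminal or non-terminal."""
    if isinstance(symbol, tuple):
        left, right = rules[symbol[1] - 1]
        return expand(left, rules) + expand(right, rules)
    return symbol


seq, rules = repair("abracadabra")
print("".join(expand(s, rules) for s in seq))  # abracadabra
```

Since ties are broken arbitrarily, the concrete rules may differ between implementations; only the expanded text is determined.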

2 Sequential Algorithm

A major task for producing the Re-Pair grammar is to count the frequencies of the most frequent bigrams. Our workhorse for this task is the frequency table. A frequency table in $T_{i}$ of length $f$ stores pairs of the form $(\texttt{bc},x)$, where bc is a bigram and $x$ is the frequency of bc in $T_{i}$. Writing $n_{i}:=|T_{i}|$, it uses $f\lceil\lg(\sigma_{i}^{2}n_{i}/2)\rceil$ bits of space since an entry stores a bigram consisting of two characters from $\Sigma_{i}$ and its respective frequency, which can be at most $n_{i}/2$. Throughout this paper, we use an elementary in-place sorting algorithm like heapsort:

Lemma 2.1 ([32]).

An array of length $n$ can be sorted in-place in $\mathcal{O}(n\lg n)$ time.
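
As a small arithmetic aid for the space bound of a frequency table stated before Lemma 2.1, the number of bits $f\lceil\lg(\sigma_{i}^{2}n_{i}/2)\rceil$ can be evaluated with integer arithmetic. This is a sketch; the helper names `entry_bits` and `table_bits` and the sample values are our own.

```python
def entry_bits(sigma: int, n: int) -> int:
    """ceil(lg(sigma^2 * n / 2)): bits for one entry (bigram plus frequency)."""
    m = sigma * sigma * n
    assert m >= 2
    # ceil(lg(m / 2)) = ceil(lg m) - 1, and (m - 1).bit_length() equals ceil(lg m)
    return (m - 1).bit_length() - 1


def table_bits(f: int, sigma: int, n: int) -> int:
    """Bits used by a frequency table with f entries."""
    return f * entry_bits(sigma, n)


print(table_bits(10, 4, 1024))  # 130
```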

2.1 Trade-Off Computation

By embracing the frequency tables, we present a solution with a trade-off parameter:

Lemma 2.2.

Given an integer $d$ with $d\geq 1$, we can compute the frequencies of the $d$ most frequent bigrams in a text of length $n$ whose characters are drawn from an alphabet of size $\sigma$ in $\mathcal{O}(\max(n,d)n\lg d/d)$ time using $2d\lceil\lg(\sigma^{2}n/2)\rceil+\mathcal{O}(\lg n)$ bits.

Proof.

Our idea is to partition the set of all bigrams appearing in $T$ into $\left\lceil n/d\right\rceil$ subsets, compute the frequencies for each subset, and finally merge these frequencies. In detail, we partition the text $T=S_{1}\cdots S_{\left\lceil n/d\right\rceil}$ into $\left\lceil n/d\right\rceil$ substrings such that each substring has length $d$ (the last one has a length of at most $d$ ). Subsequently, we extend $S_{j}$ to the left (only if $j>1$ ) and to the right (only if $j<\left\lceil n/d\right\rceil$ ) such that $S_{j}$ and $S_{j+1}$ overlap by one text position, for $1\leq j<\left\lceil n/d\right\rceil$ . By doing so, we take the bigram on the border of two adjacent substrings $S_{j}$ and $S_{j+1}$ for each $j<\left\lceil n/d\right\rceil$ into account. Next, we create two frequency tables $F$ and $F^{\prime}$ , each of length $d$ for storing the frequencies of $d$ bigrams. With $F$ and $F^{\prime}$ , we process each of the $n/d$ substrings $S_{j}$ as follows: Let us fix an integer $j$ with $1\leq j\leq\left\lceil n/d\right\rceil$ . We first put all bigrams of $S_{j}$ into $F^{\prime}$ in lexicographic order. We can perform this within the space of $F^{\prime}$ in $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(d\lg d)$ time since there are at most $d$ different bigrams in $S_{j}$ . We compute the frequencies of all these bigrams in the complete text $T$ in $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n\lg d)$ time by scanning the text from left to right while locating a bigram in $F^{\prime}$ in $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(\lg d)$ time with a binary search. Subsequently, we interpret $F$ and $F^{\prime}$ as one large frequency table, sort it with respect to the frequencies while discarding duplicates such that $F$ stores the $d$ most frequent bigrams in $T[1..jd]$ . This sorting step can be done in $\mathop{}\mathopen{}\mathcal{O}\mathopen{}(d\lg d)$ time. Finally, we clear $F^{\prime}$ and are done with $S_{j}$ . 
After the final merge step, we obtain the $d$ most frequent bigrams of $T$ stored in $F$ .

Since each of the $\mathcal{O}(n/d)$ merge steps takes $\mathcal{O}(d\lg d+n\lg d)$ time, we need $\mathcal{O}(\max(d,n)\cdot(n\lg d)/d)$ time. For $d\geq n$, we can build a large frequency table and perform one scan to count the frequencies of all bigrams in $T$. This scan and the final sorting with respect to the counted frequencies can be done in $\mathcal{O}(n\lg n)$ time. ∎
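
The structure of the proof (chunks of $d$ bigrams, one text scan with binary search per chunk, and a merge that keeps the $d$ best entries) can be sketched in Python. This is an illustration under simplifying assumptions: it uses ordinary Python lists and a dictionary for the merge rather than two bit-packed tables of length $d$, and the helper array `last_end` for skipping overlapping occurrences is our own addition. The name `top_d_bigrams` is hypothetical.

```python
from bisect import bisect_left


def top_d_bigrams(text, d):
    """Return up to d pairs (bigram, frequency) with the highest frequencies in text."""
    n = len(text)
    F = []                                         # current best entries, at most d
    for start in range(0, n - 1, d):
        # F': the distinct bigrams starting in this chunk, in lexicographic order
        chunk = sorted({(text[i], text[i + 1])
                        for i in range(start, min(start + d, n - 1))})
        freq = [0] * len(chunk)
        last_end = [-1] * len(chunk)
        for i in range(n - 1):                     # one scan over the complete text
            b = (text[i], text[i + 1])
            pos = bisect_left(chunk, b)            # binary search in F'
            if pos < len(chunk) and chunk[pos] == b and last_end[pos] < i:
                freq[pos] += 1
                last_end[pos] = i + 1
        merged = dict(F)
        merged.update(zip(chunk, freq))            # a duplicate carries the same frequency
        F = sorted(merged.items(), key=lambda kv: -kv[1])[:d]
    return F


print(top_d_bigrams("abababcdcd", 2)[0])  # (('a', 'b'), 3)
```

Each chunk triggers one scan of the whole text, which is where the $n/d$ factor in the running time comes from; a larger $d$ means fewer scans but larger tables.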

2.2 Algorithmic Ideas

With Lemma 2.2, we can compute $T_{m}$ in $\mathcal{O}(mn^{2}\lg d/d)$ time with additional $2d\lceil\lg(\sigma_{m}^{2}n/2)\rceil$ bits of working space on top of the text for a parameter $d$ with $1\leq d\leq n$. In the following, we present an $\mathcal{O}(n^{2})$ time algorithm that needs $\max((n/c)\lg n,n\lceil\lg\sigma_{m}\rceil)+\mathcal{O}(\lg n)$ bits of working space, where the text space is included as a rewriteable part in the working space and $c\geq 1$ is a constant. In this model, we assume that we can enlarge the text $T_{i}$ from $n_{i}\lceil\lg\sigma_{i}\rceil$ bits to $n_{i}\lceil\lg\sigma_{i+1}\rceil$ bits without additional extra memory. Our main idea is to store a growing frequency table using the space freed up by replacing bigrams with non-terminals. In detail, we maintain a frequency table $F$ in $T_{i}$ of length $f_{k}$ for a growing variable $f_{k}$, which is set to $f_{0}:=\mathcal{O}(1)$ in the beginning. The table $F$ takes $f_{k}\lceil\lg(\sigma_{i}^{2}n/2)\rceil$ bits, which is $\mathcal{O}(\lg(\sigma^{2}n))=\mathcal{O}(\lg n)$ bits for $k=0$. When we want to query it for a most frequent bigram, we linearly scan $F$ in $\mathcal{O}(f_{k})=\mathcal{O}(n)$ time, which is not a problem since (a) the number of queries is $m\leq n$, and (b) we aim for $\mathcal{O}(n^{2})$ overall running time. A consequence is that there is no need to sort the bigrams in $F$ according to their frequencies, which simplifies the following discussion.

Frequency Table $F$ .

With Lemma 2.2, we can compute $F$ in $\mathcal{O}(n\max(n,f_{k})\lg f_{k}/f_{k})$ time. Instead of recomputing $F$ for every turn $i$, we want to recompute it only when it no longer stores a most frequent bigram. However, it is not clear a priori when this happens, as replacing a most frequent bigram during a turn (a) removes this entry in $F$ and (b) can reduce the frequencies of other bigrams in $F$, making them possibly less frequent than other bigrams not tracked by $F$. Hence, the variable $i$ for the $i$-th turn (creating the $i$-th non-terminal) and the variable $k$ for recomputing the frequency table $F$ the $(k+1)$-st time are only loosely connected. We group together all turns with the same $f_{k}$ and call this group the $k$-th round of the algorithm. At the beginning of each round, we enlarge $f_{k}$ and create a new $F$ with a capacity for $f_{k}$ bigrams. Since a recomputation of $F$ is expensive, we want to end a round only if $F$ is no longer useful, i.e., when we can no longer guarantee that $F$ stores a most frequent bigram. To achieve our claimed time bounds, we want to assign all $m$ turns to $\mathcal{O}(\lg n)$ different rounds, which can only be done if $f_{k}$ grows sufficiently fast.

Algorithm Outline.

Given we are at the beginning of the $k$-th round and the $i$-th turn, we compute the frequency table $F$ storing $f_{k}$ bigrams, and additionally keep the lowest frequency of $F$ as a threshold $t$, which is treated as a constant during this round. During the computation of the $i$-th turn, we replace the most frequent bigram (say, $\texttt{bc}\in\Sigma_{i}^{2}$) in the text $T_{i}$ with a non-terminal $X_{i+1}$ to produce $T_{i+1}$. Thereafter, we remove bc from $F$, update those frequencies in $F$ which got decreased by the replacement of bc with $X_{i+1}$, and add each bigram containing the new character $X_{i+1}$ into $F$ if its frequency is at least $t$. Whenever a frequency in $F$ drops below $t$, we discard it. If $F$ becomes empty, we move to the $(k+1)$-st round, and create a new $F$ for storing $f_{k+1}$ frequencies. Otherwise ($F$ still stores an entry), we can be sure that $F$ stores a most frequent bigram. In both cases, we recurse with the $(i+1)$-st turn by selecting the bigram with the highest frequency stored in $F$. We describe in the following how we update $F$ and how large $f_{k+1}$ is at least.

2.3 Algorithmic Details

Suppose that we are in the $k$ -th round and in the $i$ -th turn. Let $t$ be the lowest frequency in $F$ computed at the beginning of the $k$ -th round. We keep $t$ as a constant threshold for the invariant that all frequencies in $F$ are at least $t$ during the $k$ -th round. With this threshold we can assure in the following that $F$ is either empty or stores a most frequent bigram.

Now suppose that the most frequent bigram of $T_{i}$ is $\texttt{bc}\in\Sigma_{i}^{2}$ , which is stored in $F$ . To produce $T_{i+1}$ (and hence advancing to the $(i+1)$ -st turn), we enlarge the space of $T_{i}$ from $n_{i}\left\lceil\lg\sigma_{i}\right\rceil$ to $n_{i}\left\lceil\lg\sigma_{i+1}\right\rceil$ , and replace all occurrences of bc in $T_{i}$ with a new non-terminal $X_{i+1}$ . Subsequently, we would like to take the next bigram of $F$ . For that, however, we need to update the stored frequencies in $F$ . To see this necessity, suppose that there is an occurrence of abcd with two characters $\texttt{a},\texttt{d}\in\Sigma_{i}$ in $T_{i}$ . By replacing bc with $X_{i+1}$ ,

(a)

the frequencies of ab and cd decrease by one (for the border case a = b = c, resp. b = c = d, there is no need to decrement the frequency of ab, resp. cd), and

(b)

the frequencies of $\texttt{a}X_{i+1}$ and $X_{i+1}\texttt{d}$ increase by one.

Updating $F$

We can take care of the former changes (a) by decreasing the respective bigram in $F$ (in case that it is present). If the frequency of this bigram drops below the threshold $t$, we remove it from $F$ as there may be bigrams with a higher frequency that are not present in $F$. To cope with the latter changes (b), we track the characters adjacent to $X_{i+1}$ after the replacement, count their numbers, and add their respective bigrams to $F$ if their frequencies are sufficiently high. In detail, suppose that we have substituted bc with $X_{i+1}$ exactly $h$ times. Consequently, with the new text $T_{i+1}$ we have additionally $h\lg\sigma_{i+1}$ bits of free space (the free space is consecutive after shifting all characters to the left), which we call $D$ in the following. Subsequently, we scan the text and put the characters of $\Sigma_{i+1}$ appearing to the left of each of the $h$ occurrences of $X_{i+1}$ into $D$. After sorting the characters in $D$ lexicographically, we can count the frequency of $\texttt{a}X_{i+1}$ for each character $\texttt{a}\in\Sigma_{i+1}$ preceding an occurrence of $X_{i+1}$ in the text $T_{i+1}$ by scanning $D$ linearly. If the obtained frequency of such a bigram $\texttt{a}X_{i+1}$ is at least as high as the threshold $t$, we insert $\texttt{a}X_{i+1}$ into $F$, and subsequently discard a bigram with the currently lowest frequency in $F$ if the size of $F$ has become $f_{k}+1$. In case that we visit a run of $X_{i+1}$'s during the creation of $D$, we must take care of not counting the overlapping occurrences of $X_{i+1}X_{i+1}$. Finally, we can count analogously the occurrences of $X_{i+1}\texttt{d}$ for all characters $\texttt{d}\in\Sigma_{i}$ succeeding an occurrence of $X_{i+1}$.
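
The counting of the bigrams $\texttt{a}X_{i+1}$ by collecting, sorting, and scanning can be sketched as follows. This is an illustration under assumptions: symbols are plain integers, the list `D` stands in for the freed text space, and a run of $r$ copies of $X_{i+1}$ is taken to contribute $\lfloor r/2\rfloor$ non-overlapping occurrences of $X_{i+1}X_{i+1}$. The function name `left_neighbor_frequencies` is our own.

```python
def left_neighbor_frequencies(seq, x):
    """Frequencies of all bigrams (a, x) in seq, for the new symbol x."""
    D = []                                   # stands in for the freed space D
    i = 0
    while i < len(seq):
        if seq[i] != x:
            i += 1
            continue
        j = i
        while j < len(seq) and seq[j] == x:  # seq[i:j] is a maximal run of x
            j += 1
        if i > 0:
            D.append(seq[i - 1])             # the character to the left of the run
        D.extend([x] * ((j - i) // 2))       # non-overlapping occurrences of (x, x)
        i = j
    D.sort()
    freqs = []
    for a in D:                              # linear scan over the sorted D
        if freqs and freqs[-1][0] == a:
            freqs[-1][1] += 1
        else:
            freqs.append([a, 1])
    return [((a, x), c) for a, c in freqs]


print(left_neighbor_frequencies([1, 9, 2, 1, 9, 9, 9, 3, 9], 9))
# [((1, 9), 2), ((3, 9), 1), ((9, 9), 1)]
```

Only the entries whose frequency reaches the threshold $t$ would then be considered for insertion into $F$.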

Capacity of $F$

After the above procedure we have updated the frequencies of $F$ . When $F$ becomes empty, we end the $k$ -th round and continue with the ( $k+1$ )-st round by creating a new frequency table $F$ with capacity $f_{k+1}$ . In what follows, we (a) analyze in detail when $F$ becomes empty (as this determines the sizes $f_{k}$ and $f_{k+1}$ ), and (b) show that we can compensate the number of discarded bigrams with an enlargement of $F$ ’s capacity from $f_{k}$ bigrams to $f_{k+1}$ bigrams for the sake of our aimed total running time: If the frequency of bc in $T_{i}$ is $x$ , then we can reduce at most $2x$ frequencies of other bigrams. Since a bigram must occur at least twice in $T_{i}$ to be present in $F$ , the frequency of bc has to be at least $\max(2,(f_{k}-1)/2)$ for discarding all bigrams of $F$ , and each replacement of bc with $X_{i+1}$ frees up $\left\lceil\lg\sigma_{i+1}\right\rceil$ bits of the text.

Suppose that we have enough space available for storing the frequencies of $\alpha f_{k}$ bigrams, where $\alpha$ is a constant (depending on $\sigma_{i}$ and $n_{i}$ ) such that $F$ and the working space of Lemma 2.2 with $d=f_{k}$ can be stored within this space. Let $\delta:=\lg(\sigma^{2}_{i+1}n_{i}/2)$ be the number of bits needed to store one entry in $F$ , and let $\beta:=\min(\delta/\lg\sigma_{i+1},c\delta/\lg n)$ be the minimum number of characters that need to be freed to store one frequency in this space. To understand the value of $\beta$ , we look at the arguments of the minimum function in the definition of $\beta$ and simultaneously at the maximum function in our aimed working space of $\max(n\left\lceil\lg\sigma_{m}\right\rceil,(n/c)\lg n)+\mathop{}\mathopen{}\mathcal{O}\mathopen{}(\lg n)$ bits (cf. Thm. 2.3):

•

The first item in this maximum function allows us to spend $\lg\sigma_{i+1}$ bits for each freed character such that we obtain space for one additional entry in $F$ after freeing $\delta/\lg\sigma_{i+1}$ characters.

•

The second item allows us to use $\lg n$ additional bits after freeing up $c$ characters. Hence, after freeing up $c\delta/\lg n$ characters, we have space to store one additional entry in $F$. (This additional treatment helps us to let $f_{k}$ grow sufficiently fast in the first steps to save our $\mathcal{O}(n^{2})$ time bound, as for sufficiently small alphabets and large text sizes, $\lg(\sigma^{2}n/2)/\lg\sigma=\mathcal{O}(\lg n)$, which means that we might run the first $\mathcal{O}(\lg n)$ turns with $f_{k}=\mathcal{O}(1)$, and therefore already spend $\mathcal{O}(n^{2}\lg n)$ time.)

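With $\beta$, we have

$\alpha f_{k+1}=\alpha f_{k}\max\left(1+\frac{2}{\alpha\beta f_{k}},\,1+\frac{1}{2\alpha\beta}-\frac{1}{2\alpha\beta f_{k}}\right)\geq\alpha f_{k}\left(1+\frac{2}{5\alpha\beta}\right)=:\gamma_{i}\alpha f_{k}$ with $\gamma_{i}:=1+2/(5\alpha\beta)$,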

where we used the equivalence $1+2/(\alpha\beta f_{k})=1+1/(2\alpha\beta)-1/(2\alpha\beta f_{k})\Leftrightarrow 5=f_{k}$ to estimate the two arguments of the maximum function.
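
The lower bound $1+2/(5\alpha\beta)$ on the maximum of the two arguments $1+2/(\alpha\beta f_{k})$ and $1+1/(2\alpha\beta)-1/(2\alpha\beta f_{k})$ can be checked with a short case distinction on $f_{k}$. The following is a worked sketch; the regrouping of the second argument is our own rewriting of the terms quoted above.

```latex
% Case f_k <= 5: the first argument of the maximum is large enough.
1 + \frac{2}{\alpha\beta f_k} \;\geq\; 1 + \frac{2}{5\alpha\beta}

% Case f_k >= 5: the second argument is large enough, since 1 - 1/f_k >= 4/5.
1 + \frac{1}{2\alpha\beta} - \frac{1}{2\alpha\beta f_k}
  \;=\; 1 + \frac{1}{2\alpha\beta}\left(1 - \frac{1}{f_k}\right)
  \;\geq\; 1 + \frac{1}{2\alpha\beta}\cdot\frac{4}{5}
  \;=\; 1 + \frac{2}{5\alpha\beta}
```

Both arguments coincide exactly at $f_{k}=5$, which is why this value separates the two cases.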

Since we let $f_{k}$ grow by a factor of at least $\gamma:=\min_{1\leq i\leq n}\gamma_{i}>1$ for each recomputation of $F$, $f_{k}=\Omega(\gamma^{k})$, and therefore $f_{k}=\Theta(n)$ after $k=\mathcal{O}(\lg n)$ steps. Consequently, after reaching $k=\mathcal{O}(\lg n)$, we can iterate the above procedure a constant number of times to compute the non-terminals of the remaining bigrams occurring at least twice.
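
The logarithmic number of rounds can be illustrated numerically by iterating the growth factor $\max(1+2/(\alpha\beta f_{k}),\,1+1/(2\alpha\beta)-1/(2\alpha\beta f_{k}))$ until $f_{k}$ reaches $n$. This is a sketch under assumptions: $\alpha$ and $\beta$ are treated as fixed constants although they depend on the turn, $f_{k}$ is a float rather than an integer, and the sample values are hypothetical and not taken from the paper.

```python
import math


def rounds_until_full(n, alpha, beta, f0=1.0):
    """Count how often f must grow by the guaranteed factor until f >= n."""
    f, k = f0, 0
    while f < n:
        growth = max(1 + 2 / (alpha * beta * f),
                     1 + 1 / (2 * alpha * beta) - 1 / (2 * alpha * beta * f))
        f *= growth
        k += 1
    return k


n, alpha, beta = 10**6, 2.0, 3.0          # hypothetical values for illustration
gamma = 1 + 2 / (5 * alpha * beta)        # guaranteed minimum growth per round
bound = math.ceil(math.log(n) / math.log(gamma))
print(rounds_until_full(n, alpha, beta), "<=", bound)
```

The printed count never exceeds $\lceil\log_{\gamma}n\rceil$ because every iteration multiplies $f$ by at least $\gamma$.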

Time Analysis

In total, we compute $F$ $\mathcal{O}(\lg n)$ times with Lemma 2.2. For the $k$-th time, we run the algorithm of Lemma 2.2 with $d=f_{k}$ on a text of length at most $n-f_{k}$ in $\mathcal{O}(n(n-f_{k})\cdot\lg f_{k}/f_{k})$ time with $f_{k}\leq n$. Summing this up yields

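$\mathcal{O}\left(\sum_{k=0}^{\mathcal{O}(\lg n)}\frac{n-f_{k}}{f_{k}}n\lg f_{k}\right)=\mathcal{O}\left(n^{2}\sum_{k}^{\lg n}\frac{k}{\gamma^{k}}\right)=\mathcal{O}(n^{2})$ time in total.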
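
One way to see why the per-round costs $\mathcal{O}(n(n-f_{k})\lg f_{k}/f_{k})$ add up to $\mathcal{O}(n^{2})$ is the following sketch. It assumes $f_{k}\geq c'\gamma^{k}$ for a constant $c'>0$ (our notation for the constant hidden in $f_{k}=\Omega(\gamma^{k})$), treats $\gamma>1$ as a constant, and uses that $x\mapsto\lg x/x$ is decreasing for $x\geq e$. We write $k+1$ instead of $k$ in the numerator so that the term for $k=0$ is covered as well.

```latex
\frac{n-f_k}{f_k}\, n \lg f_k
  \;\leq\; n^2\,\frac{\lg f_k}{f_k}
  \;=\; \mathcal{O}\!\left(n^2\,\frac{k+1}{\gamma^k}\right),
\qquad
\sum_{k \geq 0} \frac{k+1}{\gamma^k}
  \;=\; \frac{\gamma^2}{(\gamma-1)^2}
  \;=\; \mathcal{O}(1).
```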

In the $i$-th turn, we update $F$ by decreasing the frequencies of the bigrams affected by the substitution of the most frequent bigram bc with $X_{i+1}$. For decreasing such a frequency, we look up its respective bigram with a linear scan in $F$, which takes $f_{k}=\mathcal{O}(n)$ time. However, since this decrease is accompanied by a replacement of an occurrence of bc, we obtain $\mathcal{O}(n^{2})$ total time by charging each text position with $\mathcal{O}(n)$ time for a linear search in $F$. With the same argument, we can bound the total time for sorting the characters in $D$ by $\mathcal{O}(n^{2})$: Since we spend $\mathcal{O}(h\lg h)$ time on sorting $h$ characters preceding or succeeding a replaced character, and $\mathcal{O}(f_{k})=\mathcal{O}(n)$ time on swapping a sufficiently frequent new bigram composed of $X_{i+1}$ and a character of $\Sigma_{i+1}$ with a bigram with the lowest frequency in $F$, we charge each text position again with $\mathcal{O}(n)$ time. Putting all time bounds together leads to the main result of this article:

Theorem 2.3.

We can compute Re-Pair on a string of length $n$ in $\mathcal{O}(n^{2})$ time with $\max((n/c)\lg n,n\lceil\lg\sigma_{m}\rceil)+\mathcal{O}(\lg n)$ bits of working space including the text space, where $c\geq 1$ is a fixed constant, and $\sigma_{m}$ is the number of terminal and non-terminal symbols.
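The linear-scan update of $F$ that the time analysis charges to each text position can be sketched as follows. This is a minimal illustration under our own assumptions: the `Entry` layout, the use of `std::vector`, and the function name `decrement` are hypothetical, and the sketch ignores the packed in-text representation of $F$.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot of the frequency table F (hypothetical layout).
struct Entry {
    std::uint32_t left, right;  // the two symbols of the bigram
    std::size_t frequency;      // its current frequency
};

// Decrements the frequency of the bigram (left, right) if F tracks it.
// The lookup is a plain linear scan, hence O(|F|) = O(n) time per call.
// Returns false if the bigram is not stored in F (nothing to update).
bool decrement(std::vector<Entry>& F, std::uint32_t left, std::uint32_t right) {
    for (Entry& e : F) {
        if (e.left == left && e.right == right) {
            if (e.frequency > 0) --e.frequency;
            return true;
        }
    }
    return false;
}
```

Since every call is triggered by the replacement of one bigram occurrence, at most $\mathcal{O}(n)$ calls of $\mathcal{O}(n)$ time each are made, which is the charging argument above.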

Output

Finally, we show that we can store the computed grammar in text space. More precisely, we want to store the grammar in an auxiliary array $A$ packed at the end of the working space such that the entry $A[i]$ stores the right-hand side of the non-terminal $X_{i}$, which is a bigram. Thus the non-terminals are represented implicitly as indices of the array $A$. We therefore need to subtract $2\lg\sigma_{i}$ bits of space from our working space $\alpha f_{k}$ after the $i$-th turn. By adjusting $\alpha$ in the above equations, we can deal with this additional space requirement as long as the frequencies of the replaced bigrams are at least three (we charge two occurrences for growing the space of $A$).
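To illustrate how an array of right-hand sides suffices to represent the grammar, the following C++ sketch expands a compressed text back to terminals. The encoding is an assumption made for this example only: symbols below `sigma` are terminals, and the symbol `sigma + i` denotes the non-terminal whose right-hand side is `rules[i]` (0-based). The sketch uses ordinary vectors and an explicit stack instead of the packed in-text layout described above.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Symbol = std::uint32_t;
using Rule = std::pair<Symbol, Symbol>;  // right-hand side of one non-terminal

// Expands `text` into its terminal string.
// Assumption: symbols < sigma are terminals; symbol sigma + i is the
// non-terminal with right-hand side rules[i], so no names are stored.
std::vector<Symbol> expand(const std::vector<Symbol>& text,
                           const std::vector<Rule>& rules, Symbol sigma) {
    std::vector<Symbol> out;
    // Reverse copy, so that the leftmost symbol is on top of the stack.
    std::vector<Symbol> stack(text.rbegin(), text.rend());
    while (!stack.empty()) {
        const Symbol s = stack.back();
        stack.pop_back();
        if (s < sigma) {
            out.push_back(s);
            continue;
        }
        assert(s - sigma < rules.size());
        const Rule& r = rules[s - sigma];
        stack.push_back(r.second);  // pushed first, expanded last
        stack.push_back(r.first);
    }
    return out;
}
```

For instance, with terminals a, b, c encoded as 0, 1, 2 and the rules $X_{1}\rightarrow\texttt{ab}$ (symbol 3) and $X_{2}\rightarrow\texttt{c}X_{1}$ (symbol 4), the text $X_{2}\,\texttt{a}\,X_{2}$ expands to cabacab.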

When only bigrams with frequencies of at most two remain, we switch to a simpler algorithm, discarding the idea of maintaining the frequency table $F$: Suppose that we work with the text $T_{i}$. Let $k$ be a text position, which is $1$ in the beginning, but will be incremented in the following turns while holding the invariant that $T_{i}[1..k]$ does not contain a bigram of frequency two. We scan $T_{i}[k..n]$ linearly from left to right and check, for each text position $j$, whether the bigram $T_{i}[j]T_{i}[j+1]$ has another occurrence $T_{i}[j^{\prime}]T_{i}[j^{\prime}+1]=T_{i}[j]T_{i}[j+1]$ with $j^{\prime}>j+1$, and if so,

(a) append $T_{i}[j]T_{i}[j+1]$ to $A$,

(b) replace $T_{i}[j]T_{i}[j+1]$ and $T_{i}[j^{\prime}]T_{i}[j^{\prime}+1]$ with a new non-terminal $X_{i+1}$ to transform $T_{i}$ to $T_{i+1}$, and

(c) recurse on $T_{i+1}$ with $k:=j$ until no bigram with frequency two is left.

The position $k$, which we never decrement, helps us to skip over all text positions starting with bigrams with a frequency of one. Thus, the algorithm spends $\mathcal{O}(n)$ time for each such text position, and $\mathcal{O}(n)$ time for each bigram with frequency two. Since there are at most $n$ such bigrams, the overall running time of this algorithm is $\mathcal{O}(n^{2})$.
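Steps (a) to (c) can be sketched in C++ as follows. This is an illustration under stated assumptions, not the in-place procedure itself: the text is a `std::vector` that shrinks via `erase`, the rules are appended to a separate vector standing in for $A$, positions are 0-based, and `next` (a hypothetical parameter) is the first unused symbol value. The sketch presumes that no bigram occurs more than twice.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Symbol = std::uint32_t;
using Rule = std::pair<Symbol, Symbol>;

// Replaces every bigram of frequency two by a fresh non-terminal.
// Assumption: no bigram of t occurs more than twice.
void replace_frequency_two(std::vector<Symbol>& t, std::vector<Rule>& rules,
                           Symbol next) {
    std::size_t k = 0;  // never decremented; t[0..k] holds no repeated bigram
    while (k + 1 < t.size()) {
        bool replaced = false;
        // Search a second, non-overlapping occurrence of t[k] t[k+1].
        for (std::size_t j = k + 2; j + 1 < t.size(); ++j) {
            if (t[j] != t[k] || t[j + 1] != t[k + 1]) continue;
            rules.emplace_back(t[k], t[k + 1]);  // step (a)
            // Step (b): handle the right occurrence first so that the
            // index k of the left occurrence stays valid.
            t[j] = next;
            t.erase(t.begin() + static_cast<std::ptrdiff_t>(j + 1));
            t[k] = next;
            t.erase(t.begin() + static_cast<std::ptrdiff_t>(k + 1));
            ++next;
            replaced = true;
            break;
        }
        if (!replaced) ++k;  // step (c): otherwise continue at the same k
    }
}
```

On abcabc (encoded as 0, 1, 2, 0, 1, 2 with `next = 3`), the sketch first creates $X\rightarrow\texttt{ab}$, then $Y\rightarrow X\texttt{c}$, and ends with the text $YY$.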

Remark 2.4 (Pointer Machine Model).

Refraining from the usage of complicated algorithms, our algorithm consists only of elementary sorting and scanning steps. This allows us to run our algorithm on a pointer machine, yielding the same time bound of $\mathcal{O}(n^{2})$. For the space bounds, we assume that the text is given in $n$ words, where a word is large enough to store an element of $\Sigma_{m}$ or a text position.

2.4 Step-by-Step Execution

Here, we present an exemplary execution of the first turn (of the first round) on the input $T=\texttt{cabaacabcabaacaaabcab}$ . We visualize each step of this turn as a row in Fig. 1. A detailed description of each row follows:

Row 1:

Suppose that we have computed $F$, which has a constant number of entries (in the later turns, when the size $f_{k}$ becomes larger, $F$ will be put in the text space). The highest frequency is five, achieved by $\mathtt{ab}$ and $\mathtt{ca}$. The lowest frequency represented in $F$ is three, which becomes the threshold for a bigram to be present in $F$ such that bigrams whose frequencies drop below this threshold are removed from $F$. This threshold is a constant for all later turns until $F$ is rebuilt (in the following round). During Turn 1, the algorithm proceeds now as follows:

Row 2:

Choose $\mathtt{ab}$ as a bigram to replace with a new non-terminal $X_{1}$ (break ties arbitrarily). Replace every occurrence of $\mathtt{ab}$ with $X_{1}$ while decrementing frequencies in $F$ according to the neighboring characters of the replaced occurrence.

Row 3:

Remove from $F$ every bigram whose frequency falls below the threshold. Obtain space for $D$ by aligning the compressed text $T_{1}$ . (The process of Row 2 and Row 3 can be done simultaneously.)

Row 4:

Scan the text and copy each character preceding an occurrence of $X_{1}$ in $T_{1}$ to $D$ .

Row 5:

Sort characters in $D$ lexicographically.

Row 6:

Insert new bigrams (consisting of a character of $D$ and $X_{1}$ ) whose frequencies are at least as large as the threshold.

Row 7:

Scan the text again and copy each character succeeding an occurrence of $X_{1}$ in $T_{1}$ to $D$ (symmetric to Row 4).

Row 8:

Sort all characters in $D$ lexicographically (symmetric to Row 5).

Row 9:

Insert new bigrams whose frequencies are at least as large as the threshold (symmetric to Row 6).
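The rows above can be replayed on the example text with the following C++ sketch. It is a simplified illustration under our own assumptions: it uses `std::map` and separate strings instead of the text space, represents $X_{1}$ by the character `'X'`, counts non-overlapping occurrences greedily from left to right, and skips neighbors that are themselves $X_{1}$ (the bigram $X_{1}X_{1}$ would need separate care). All function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Bigram = std::pair<char, char>;

// Row 1: frequencies of all bigrams, counting non-overlapping occurrences.
std::map<Bigram, std::size_t> bigram_frequencies(const std::string& t) {
    std::map<Bigram, std::size_t> freq, last;  // last: start of last counted occurrence
    for (std::size_t i = 0; i + 1 < t.size(); ++i) {
        const Bigram b{t[i], t[i + 1]};
        const auto it = last.find(b);
        if (it != last.end() && it->second + 1 == i) continue;  // overlap, e.g. in aaa
        ++freq[b];
        last[b] = i;
    }
    return freq;
}

// Row 2: replace every occurrence of the bigram (b, c) by x, left to right.
std::string replace_bigram(const std::string& t, char b, char c, char x) {
    std::string out;
    for (std::size_t i = 0; i < t.size(); ++i) {
        if (i + 1 < t.size() && t[i] == b && t[i + 1] == c) {
            out.push_back(x);
            ++i;  // skip the second character of the replaced bigram
        } else {
            out.push_back(t[i]);
        }
    }
    return out;
}

// Rows 4-6 (preceding = true) and Rows 7-9 (preceding = false): collect the
// neighbors of x in D, sort D, and report every neighbor whose number of
// occurrences reaches the threshold.
std::vector<std::pair<char, std::size_t>> frequent_neighbors(
        const std::string& t, char x, std::size_t threshold, bool preceding) {
    std::string d;
    for (std::size_t i = 0; i < t.size(); ++i) {
        if (t[i] != x) continue;
        if (preceding && i > 0 && t[i - 1] != x) d.push_back(t[i - 1]);
        if (!preceding && i + 1 < t.size() && t[i + 1] != x) d.push_back(t[i + 1]);
    }
    std::sort(d.begin(), d.end());
    std::vector<std::pair<char, std::size_t>> result;
    for (std::size_t i = 0; i < d.size();) {
        std::size_t j = i;
        while (j < d.size() && d[j] == d[i]) ++j;  // run of equal characters
        if (j - i >= threshold) result.emplace_back(d[i], j - i);
        i = j;
    }
    return result;
}
```

On $T=\texttt{cabaacabcabaacaaabcab}$, the sketch reports frequency five for $\mathtt{ab}$ and $\mathtt{ca}$ and three for $\mathtt{aa}$. Replacing $\mathtt{ab}$ gives the text cXaacXcXaacaaXcX, in which only the new bigram $\mathtt{c}X_{1}$ (frequency four) reaches the threshold of three.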

2.5 Implementation

We provide a simplified implementation in C++17 at https://github.com/koeppl/repair-inplace. The simplification is that we (a) fix the bit width of the text space to 16 bits, and (b) assume that $\Sigma$ is the byte alphabet. We further skip the step increasing the bit width from $\lg\sigma_{i}$ to $\lg\sigma_{i+1}$. This means that the program works as long as the characters of $\Sigma_{m}$ fit into 16 bits. The benchmark, whose results are displayed in Table 1, was conducted on a Mac Pro Server with an Intel Xeon CPU X5670 clocked at 2.93 GHz running Arch Linux. The implementation was compiled with gcc-8.2.1 in the highest optimization mode -O3. Looking at Table 1, we can see that the running time is super-linear in the input size on all text instances, which we obtained from the Pizza&Chili corpus (http://pizzachili.dcc.uchile.cl/). A separate table gives some characteristics of the data sets used. We see that the number of rounds is the number of turns plus one for every unary string $\texttt{a}^{2^{k}}$ with an integer $k\geq 1$, since the text contains only one bigram with a frequency larger than two in each round. Replacing this bigram in the text makes $F$ empty such that the algorithm recomputes $F$ after each turn. Note that the number of rounds can drop while scaling the prefix length based on the choice of the bigrams stored in $F$.

Bibliography

[1] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116–1127, 1988.
[2] H. Bannai, M. Hirayama, D. Hucke, S. Inenaga, A. Jez, M. Lohrey, and C. P. Reh. The smallest grammar problem revisited. arXiv 1908.06428, 2019.
[3] K. E. Batcher. Sorting networks and their applications. In Proc. AFIPS, volume 32 of AFIPS Conference Proceedings, pages 307–314, 1968.
[4] P. Bille, I. L. Gørtz, and N. Prezza. Practical and effective Re-Pair compression. arXiv 1704.08558, 2017a.
[5] P. Bille, I. L. Gørtz, and N. Prezza. Space-efficient Re-Pair compression. In Proc. DCC, pages 171–180, 2017b.
[6] R. S. Boyer and J. S. Moore. MJRTY: A fast majority vote algorithm. In Automated Reasoning: Essays in Honor of Woody Bledsoe, Automated Reasoning Series, pages 105–118, 1991.
[7] T. M. Chan, J. I. Munro, and V. Raman. Selection and sorting in the “restore” model. ACM Trans. Algorithms, 14(2):11:1–11:18, 2018.
[8] M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar problem. IEEE Trans. Information Theory, 51(7):2554–2576, 2005.
