Direct Linear Time Construction of Parameterized Suffix and LCP Arrays for Constant Alphabets

Noriki Fujisato; Yuto Nakashima; Shunsuke Inenaga; Hideo Bannai; Masayuki Takeda

arXiv:1906.00563·cs.DS·June 4, 2019


Open Access

TL;DR

This paper introduces the first worst-case linear time algorithm for directly constructing parameterized suffix and LCP arrays for constant alphabets, improving efficiency over previous methods that were slower or required additional structures.

Contribution

It presents a novel linear time algorithm for directly computing parameterized suffix and LCP arrays for constant alphabets, eliminating the need for prior suffix tree construction.

Findings

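1. The algorithm runs in O(nπ) time and O(n) words of space.

2. It is the first worst-case linear time algorithm for this problem when the number of distinct parameterized symbols is constant.

3. It applies to strings over a static alphabet Σ and a parameterized alphabet Π.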

Abstract

We present the first worst-case linear time algorithm that directly computes the parameterized suffix and LCP arrays for constant sized alphabets. Previous algorithms either required quadratic time or the parameterized suffix tree to be built first. More formally, for a string over static alphabet $Σ$ and parameterized alphabet $Π$ , our algorithm runs in $O (nπ)$ time and $O (n)$ words of space, where $π$ is the number of distinct symbols of $Π$ in the string.

Equations

$$\mathit{prev}(x)[i]=\begin{cases}x[i] & \text{if } x[i]\in\Sigma,\\ 0 & \text{if } x[i]\in\Pi \text{ and } x[i]\neq x[j] \text{ for any } 1\leq j<i,\\ i-j & \text{if } x[i]\in\Pi,\ x[i]=x[j] \text{ and } x[i]\neq x[k] \text{ for any } j<k<i.\end{cases}$$

$$\mathit{prev}(x[i..n])[k]=\begin{cases}0 & \text{if } x[k^{\prime}]\in\Pi \text{ and } \mathit{prev}(x)[k^{\prime}]\geq k,\\ \mathit{prev}(x)[k^{\prime}] & \text{otherwise.}\end{cases}$$

$$\mathit{prev}(x[i..n])0=B_{i,1}\cdots B_{i,z_{i}+1}$$

$$\mathit{prev}(x[i_{1}..n])\prec\mathit{prev}(x[i_{2}..n])\Leftrightarrow C_{i_{1}}\prec C_{i_{2}}$$

$$\mathit{fwd}(x)[i]=\begin{cases}x[i] & \text{if } x[i]\in\Sigma,\\ \infty & \text{if } x[i]\in\Pi \text{ and } x[i]\neq x[j] \text{ for any } i<j\leq n,\\ j-i & \text{if } x[i]\in\Pi,\ x[i]=x[j] \text{ and } x[i]\neq x[k] \text{ for any } i<k<j.\end{cases}$$


Taxonomy

Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · DNA and Biological Computing

Full text

Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan

{noriki.fujisato,yuto.nakashima,inenaga,bannai,takeda}@inf.kyushu-u.ac.jp

Direct Linear Time Construction of Parameterized Suffix and LCP Arrays for Constant Alphabets

Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai (ORCID 0000-0002-6856-5185), Masayuki Takeda

Abstract

We present the first worst-case linear time algorithm that directly computes the parameterized suffix and LCP arrays for constant sized alphabets. Previous algorithms either required quadratic time or the parameterized suffix tree to be built first. More formally, for a string over static alphabet $\Sigma$ and parameterized alphabet $\Pi$ , our algorithm runs in $O(n\pi)$ time and $O(n)$ words of space, where $\pi$ is the number of distinct symbols of $\Pi$ in the string.

Keywords:

parameterized pattern matching, parameterized suffix array, parameterized LCP array

1 Introduction

Parameterized pattern matching is one of the well studied “non-standard” pattern matching problems which was initiated by Baker [1], in an application to find duplicated code where variable names may be renamed. In the parameterized matching problem, we consider strings over an alphabet partitioned into two sets: the parameterized alphabet $\Pi$ and the static alphabet $\Sigma$ . Two strings $x,y\in(\Pi\cup\Sigma)^{*}$ of length $n$ are said to parameterized match (p-match), if one can be obtained from the other with a bijective mapping over symbols of $\Pi$ , i.e., there exists a bijection $\phi:\Pi\rightarrow\Pi$ such that for all $1\leq i\leq n$ , $x[i]=y[i]$ if $x[i]\in\Sigma$ , and $\phi(x[i])=y[i]$ if $x[i]\in\Pi$ . For example, if $\Pi=\{\mathtt{x},\mathtt{y},\mathtt{z}\}$ and $\Sigma=\{\mathtt{A},\mathtt{B},\mathtt{C}\}$ , strings $\mathtt{xxAzxByzBCzy}$ and $\mathtt{yyAxyBzxBCxz}$ p-match, since we can choose $\phi(\mathtt{x})=\mathtt{y},\phi(\mathtt{y})=\mathtt{z},$ and $\phi(\mathtt{z})=\mathtt{x}$ , while strings $\mathtt{xyAzzByxBCz}$ and $\mathtt{yyAzxByxBCy}$ do not p-match, since there is no such bijection on $\Pi$ . As parameterized matching captures the “structure” of the string, it has also been extended to RNA structural matching [16].

Baker introduced the so-called prev encoding of a p-string, which maps each symbol of the p-string that is in $\Pi$ to the distance to its previous occurrence (or $0$ if it is the first occurrence), and showed that two p-strings p-match if and only if their prev encodings are equivalent. For example, the prev encodings for p-strings $\mathtt{xxAzxByzBCzy}$ and $\mathtt{yyAxyBzxBCxz}$ are both $(0,1,\mathtt{A},0,3,\mathtt{B},0,4,\mathtt{B},\mathtt{C},3,5)$ . Thus, the parameterized matching problem amounts to efficiently comparing the prev encodings of the p-strings.

Using the prev encoding allows for the development of data structures that mimic those of standard strings. The central difficulty, in contrast with standard strings, is in coping with the following property of prev encodings; a substring of a prev encoding is not necessarily equivalent to the prev encoding of the corresponding substring.

Nevertheless, several data structures and algorithms have so far successfully been developed. Baker proposed the parameterized suffix tree (PST), an analogue of the suffix tree for standard strings [17], and showed that for a string of length $n$ , it could be built in $O(n|\Pi|)$ time and $O(n)$ words of space [2]. Using the PST for $T$ , all occurrences of substrings of $T$ that parameterized match a given pattern $P$ can be computed in $O(|P|\log(|\Pi|+|\Sigma|)+occ)$ time, where $occ$ is the number of occurrences of the pattern in the text. Kosaraju [15] further improved the running time of construction to $O(n\log(|\Pi|+|\Sigma|))$ . Furthermore, Shibuya [16] proposed an on-line algorithm for constructing the PST that runs in the same time bounds.

Deguchi et al. [5] proposed the parameterized suffix array (PSA). Given the PST of a string, the PSA can be constructed in linear time, but as in the case for standard strings, the direct construction of PSAs has been a topic of interest. Deguchi et al. [5] showed a linear time algorithm for the special case of $|\Pi|=2$ and $\Sigma=\emptyset$ . I et al. [11] proposed a lightweight and practically efficient algorithm for larger $\Pi$ , but the worst-case time was still quadratic in $n$ . Beal and Adjeroh [4] proposed an algorithm based on arithmetic coding that runs in $O(n)$ time on average. Furthermore, they claimed a worst-case running time of $o(n^{2})$ . However, the proved upper bound is $O(n^{2}(\frac{\log(n-\log^{1+\varepsilon}n)}{\log^{1+\varepsilon}n}))$ for a very small $\varepsilon>0$ (Corollary 27 of [4]), so it is only slightly better than quadratic.

In this paper, we break the worst-case quadratic time barrier considerably, and present the first worst-case linear time algorithm for constructing the parameterized suffix and LCP arrays of a given p-string, when the number of distinct parameterized symbols in the string is constant. Namely, our algorithm runs in $O(n\pi)$ time and $O(n)$ words of space, where $\pi$ is the number of distinct symbols of $\Pi$ in the string.

Several other indices for parameterized pattern matching have been proposed. Diptarama et al. [6] and Fujisato et al. [8] proposed the parameterized position heap (PPH), an analogue of the position heap for standard strings [7], and showed that it could be built in $O(n\log(|\Sigma|+|\Pi|))$ time and $O(n)$ words of space. Using the PPH for $T$ , all occurrences of substrings of $T$ that parameterized match a given pattern $P$ can be computed in $O(|P|(|\Pi|+\log(|\Pi|+|\Sigma|))+occ)$ time, where $occ$ is the number of occurrences of the pattern in the text. Parameterized BWTs have been proposed in [10]. Also, a parameterized text index with one wildcard was proposed in [9].

2 Preliminaries

For any set $A$ of symbols, $A^{*}$ denotes the set of strings over the alphabet $A$ . Let $|x|$ denote the length of a string $x$ . The empty string is denoted by $\varepsilon$ . For any string $w\in A^{*}$ , if $w=xyz$ for some (possibly empty) $x,y,z\in A^{*}$ , $x,y,z$ are respectively called a prefix, substring, suffix of $w$ . When $x,y,z\neq w$ , they are respectively called a proper prefix, substring, and suffix of $w$ . For any integer $1\leq i\leq|x|$ , $x[i]$ denotes the $i$ th symbol in $x$ , and for any $1\leq i\leq j\leq|x|$ , $x[i..j]=x[i]\cdots x[j]$ . Let $\prec$ denote a total order on $A$ , as well as the lexicographic order it induces. For two strings $x,y\in A^{*}$ , $x\prec y$ if and only if $x$ is a proper prefix of $y$ , or there is some position $1\leq k\leq\min\{|x|,|y|\}$ such that $x[1..k-1]=y[1..k-1]$ and $x[k]\prec y[k]$ .

Let $\Pi$ and $\Sigma$ denote disjoint sets of symbols. $\Pi$ is called the parameterized alphabet, and $\Sigma$ is called the static alphabet. A string in $(\Pi\cup\Sigma)^{*}$ is sometimes called a p-string. Two p-strings $x,y\in(\Pi\cup\Sigma)^{*}$ of equal length are said to parameterized match, denoted $x\approx y$ , if there exists a bijection $\phi:\Pi\rightarrow\Pi$ , such that for all $1\leq i\leq|x|$ , $x[i]=y[i]$ if $x[i]\in\Sigma$ , and $\phi(x[i])=y[i]$ if $x[i]\in\Pi$ .
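The definition of a parameterized match can be checked directly by building the bijection symbol by symbol. The following is a minimal sketch, not taken from the paper: the function name `p_match` is hypothetical, and it assumes (as in the paper's examples) that lowercase letters belong to $\Pi$ and all other characters to $\Sigma$. A partial injective map on the symbols that occur can always be extended to a bijection on a finite $\Pi$, so checking injectivity on the occurring symbols suffices.

```python
def p_match(x, y, is_param=str.islower):
    """Return True if x and y parameterized-match under some bijection on Pi.

    Assumption for illustration: lowercase characters are parameter symbols,
    everything else is a static symbol.
    """
    if len(x) != len(y):
        return False
    to_y, to_x = {}, {}                  # partial bijection and its inverse
    for a, b in zip(x, y):
        if is_param(a) != is_param(b):
            return False                 # a parameter symbol never matches a static one
        if not is_param(a):
            if a != b:
                return False             # static symbols must agree exactly
            continue
        # setdefault stores the first pairing and returns the stored one later on
        if to_y.setdefault(a, b) != b or to_x.setdefault(b, a) != a:
            return False
    return True


print(p_match("xxAzxByzBCzy", "yyAxyBzxBCxz"))  # True
print(p_match("xyAzzByxBCz", "yyAzxByxBCy"))    # False
```

The two calls use the example strings from the introduction.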

The prev encoding of a p-string $x$ of length $n$ is the string $\mathit{prev}(x)$ over the alphabet $\Sigma\cup\{0,\ldots,n-1\}$ defined as follows:

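$$\mathit{prev}(x)[i]=\begin{cases}x[i] & \text{if } x[i]\in\Sigma,\\ 0 & \text{if } x[i]\in\Pi \text{ and } x[i]\neq x[j] \text{ for any } 1\leq j<i,\\ i-j & \text{if } x[i]\in\Pi,\ x[i]=x[j] \text{ and } x[i]\neq x[k] \text{ for any } j<k<i.\end{cases}$$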

For example, if $\Pi=\{\mathtt{s},\mathtt{t},\mathtt{u}\}$ , $\Sigma=\{\mathtt{A}\}$ and p-string $x=\mathtt{ssuAAstuAst}$ , then $\mathit{prev}(x)=(0,1,0,\mathtt{A},\mathtt{A},4,0,5,\mathtt{A},4,4)$ . Baker showed that $x\approx y$ if and only if $\mathit{prev}(x)=\mathit{prev}(y)$ [3]. We assume that $\Pi$ and $\Sigma$ are disjoint integer alphabets, where $\Pi=\{0,\ldots,n^{c_{1}}\}$ for some constant $c_{1}\geq 1$ and $\Sigma=\{n^{c_{1}}+1,\ldots,n^{c_{2}}\}$ for some constant $c_{2}\geq 1$ . This way, we can distinguish whether a symbol of a given prev encoding belongs to $\Sigma$ or not. Also, given a p-string $x$ of length $n$ , we can compute $\mathit{prev}(x)$ in $O(n)$ time and space, by sorting the pairs $(x[i],i)$ using radix sort, followed by a simple scan of the result.
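As an illustration of the prev encoding, the sketch below computes it in one left-to-right pass. It is not the paper's procedure: it remembers the last position of each parameter symbol in a dictionary instead of using the radix sort mentioned above, and the name `prev_encoding` and the convention that lowercase letters are in $\Pi$ are assumptions made for the example.

```python
def prev_encoding(x, is_param=str.islower):
    """prev encoding of x as a list; positions and distances are 1-based.

    Assumption for illustration: lowercase characters are parameter symbols.
    """
    last = {}                            # most recent position of each parameter symbol
    out = []
    for i, c in enumerate(x, start=1):
        if is_param(c):
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)                # static symbols are kept unchanged
    return out


print(prev_encoding("ssuAAstuAst"))
# [0, 1, 0, 'A', 'A', 4, 0, 5, 'A', 4, 4]
print(prev_encoding("xxAzxByzBCzy") == prev_encoding("yyAxyBzxBCxz"))  # True
```

The second call reflects Baker's characterization: the two strings p-match, so their encodings coincide.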

The following are the data structures that we consider in this paper.

Definition 1 (Parameterized Suffix Array [5])

The parameterized suffix array of a p-string $x$ of length $n$ , is an array $\mathit{PSA}[1..n]$ of integers such that $\mathit{PSA}[i]=j$ if and only if $\mathit{prev}(x[j..n])$ is the $i$ th lexicographically smallest string in $\{\mathit{prev}(x[i..n])\mid i=1,\ldots,n\}$ .

Definition 2 (Parameterized LCP Array [5])

The parameterized LCP array of a p-string $x$ of length $n$ , is an array $\mathit{pLCP}[1..n]$ of integers such that $\mathit{pLCP}[1]=0$ , and $\mathit{pLCP}[i]$ , for any $i\in\{2,\ldots,n\}$ , is the length of the longest common prefix of $\mathit{prev}(x[\mathit{PSA}[i-1]..n])$ and $\mathit{prev}(x[\mathit{PSA}[i]..n])$ .
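Both definitions can be turned into a straightforward quadratic-time reference implementation, which is useful for checking faster constructions on small inputs. This sketch is a naive baseline, not the linear-time algorithm of this paper. The names are hypothetical, lowercase letters are assumed to be in $\Pi$, and static symbols are assumed to sort after all integers of the encoding, in line with the integer alphabet assumption above.

```python
def prev_encoding(x, is_param=str.islower):
    """prev encoding with 1-based distances; static symbols stay as characters."""
    last, out = {}, []
    for i, c in enumerate(x, start=1):
        if is_param(c):
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def sym_key(v):
    # Assumed order: integers of the encoding sort before static symbols.
    return (1, v) if isinstance(v, str) else (0, v)


def naive_psa_plcp(x):
    """Return (PSA, pLCP) as lists of 1-based suffix positions and LCP lengths."""
    n = len(x)
    enc = {i: [sym_key(v) for v in prev_encoding(x[i - 1:])]
           for i in range(1, n + 1)}
    psa = sorted(enc, key=enc.get)       # sort suffix positions by their prev encodings
    plcp = [0]
    for a, b in zip(psa, psa[1:]):
        length = 0
        for u, v in zip(enc[a], enc[b]):
            if u != v:
                break
            length += 1
        plcp.append(length)
    return psa, plcp


psa, plcp = naive_psa_plcp("stssAtssAs")
print(psa)   # [10, 6, 2, 1, 3, 7, 4, 8, 9, 5]
print(plcp)  # [0, 1, 4, 2, 1, 3, 1, 2, 0, 2]
```

The printed arrays depend on the assumed symbol order; a different order between integers and static symbols would give a different, equally valid, pair of arrays.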

The difficulty when dealing with the prev encodings of suffixes of a string is that they are not necessarily suffixes of the prev encoding of the string. It is important to notice, however, that given the prev encoding $\mathit{prev}(x)$ of the whole string $x$ , any specific value of the prev encoding of an arbitrary suffix of $x$ can be retrieved in constant time, i.e., for any $1\leq i\leq n$ and $1\leq k\leq n-i+1$ ,

$$\mathit{prev}(x[i..n])[k]=\begin{cases}0 & \text{if } x[k^{\prime}]\in\Pi \text{ and } \mathit{prev}(x)[k^{\prime}]\geq k,\\ \mathit{prev}(x)[k^{\prime}] & \text{otherwise,}\end{cases}$$

where $k^{\prime}=i+k-1$ . The critical problem for suffix sorting is that even if two prev encodings $\mathit{prev}(x[i..n])$ and $\mathit{prev}(x[j..n])$ share a common prefix and satisfy $\mathit{prev}(x[i..n])\prec\mathit{prev}(x[j..n])$ , it may still be that $\mathit{prev}(x[j+1..n])\prec\mathit{prev}(x[i+1..n])$ .

Fig. 1 shows an example of $\mathit{PSA}$ and $\mathit{pLCP}$ for the string $\mathtt{stssAtssAs}$ . For example, we have that $\mathit{prev}(x[6..10])\prec\mathit{prev}(x[1..10])$ , which share a common prefix of length $2$ , yet $\mathit{prev}(x[2..10])\prec\mathit{prev}(x[7..10])$ .
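The constant-time retrieval of a single value of $\mathit{prev}(x[i..n])$ from $\mathit{prev}(x)$ can be sketched as follows. With $k^{\prime}=i+k-1$ , the previous occurrence of $x[k^{\prime}]$ lies before position $i$ exactly when $\mathit{prev}(x)[k^{\prime}]\geq k$ , in which case the value becomes $0$ . The function names and the lowercase-is-$\Pi$ convention are assumptions made for illustration.

```python
def prev_encoding(x, is_param=str.islower):
    """prev encoding with 1-based distances; static symbols stay as characters."""
    last, out = {}, []
    for i, c in enumerate(x, start=1):
        if is_param(c):
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def suffix_prev_value(x, prev_x, i, k, is_param=str.islower):
    """k-th value (1-based) of prev(x[i..n]), read from prev(x) in O(1) time."""
    kp = i + k - 1                       # global position k'
    v = prev_x[kp - 1]
    # A distance of at least k points to an occurrence before position i,
    # which is outside the suffix, so the symbol is a first occurrence there.
    if is_param(x[kp - 1]) and v >= k:
        return 0
    return v


x = "stssAtssAs"
prev_x = prev_encoding(x)
print([suffix_prev_value(x, prev_x, 3, k) for k in range(1, 9)])
# [0, 1, 'A', 0, 3, 1, 'A', 2]
```

The printed list equals the prev encoding computed from scratch for the suffix $x[3..10]$ .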

3 Algorithms

In this section we describe our algorithms for constructing the parameterized suffix and LCP arrays. First, we mention a simple observation below.

From the definition of $\mathit{prev}(x)$ , we have that $\mathit{prev}(x)[i]=0$ for some position $i$ if and only if $i$ is the first occurrence of symbol $x[i]\in\Pi$ . Therefore, the following observation can be made.

Observation 1

For any p-string $x$ , the prev encoding $\mathit{prev}(x^{\prime})$ of any substring $x^{\prime}$ of $x$ contains at most $\pi$ positions that are $0$ 's, where $\pi$ is the number of distinct symbols of $\Pi$ in $x$ .

3.1 $\mathit{PSA}$ Construction

Based on this observation, we can see that the prev encoding of each suffix $x[i..n]$ can be partitioned into $z_{i}+1\leq\pi+1$ blocks, where $z_{i}$ is the number of $0$ 's in $\mathit{prev}(x[i..n])$ , and the $j$ th block is the substring of $\mathit{prev}(x[i..n])$ that ends at the $j$ th $0$ in $\mathit{prev}(x[i..n])$ for $j=1,\ldots,z_{i}$ , and the (possibly empty) remaining suffix for $j=z_{i}+1$ . For technical reasons, we will append $0$ to the last block as well. That is, we can write

$$\mathit{prev}(x[i..n])0=B_{i,1}\cdots B_{i,z_{i}+1}$$

where $B_{i,j}$ denotes the $j$ th block of $\mathit{prev}(x[i..n])$ . Furthermore, for each $j$ , let $B_{j}$ denote the set of all $j$ th blocks for all $i=1,\ldots,n$ , and let $C_{i,j}$ denote the lexicographic rank of $B_{i,j}$ in $B_{j}$ . Finally, let $C_{i}$ denote the string over the alphabet $\{1,\ldots,n\}$ obtained by renaming each block $B_{i,j}$ of the string $\mathit{prev}(x[i..n])0$ with its lexicographic rank $C_{i,j}$ . More formally,

$$C_{i}=C_{i,1}\cdots C_{i,z_{i}+1}$$
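The block decomposition can be illustrated with a short sketch that appends a $0$ to the prev encoding of a suffix and cuts after every $0$ . It is a naive illustration with hypothetical names (`prev_encoding`, `blocks`), assuming lowercase letters are in $\Pi$ . It also lets one observe the bound of at most $\pi+1$ blocks per suffix.

```python
def prev_encoding(x, is_param=str.islower):
    """prev encoding with 1-based distances; static symbols stay as characters."""
    last, out = {}, []
    for i, c in enumerate(x, start=1):
        if is_param(c):
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def blocks(enc):
    """Split enc followed by an appended 0 into blocks, each ending at a 0."""
    out, cur = [], []
    for v in enc + [0]:
        cur.append(v)
        if v == 0:                       # a 0 closes the current block
            out.append(tuple(cur))
            cur = []
    return out


x = "stssAtssAs"
print(blocks(prev_encoding(x[2:])))      # blocks of the suffix x[3..10]
# [(0,), (1, 'A', 0), (3, 1, 'A', 2, 0)]
print(max(len(blocks(prev_encoding(x[i:]))) for i in range(len(x))))  # 3
```

Here $\pi=2$ (symbols s and t), so no suffix has more than three blocks.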

Lemma 1

For any $1\leq i_{1},i_{2}\leq n$ ,

$$\mathit{prev}(x[i_{1}..n])\prec\mathit{prev}(x[i_{2}..n])\Leftrightarrow C_{i_{1}}\prec C_{i_{2}}$$

Proof

Notice that $0$ is the smallest symbol in the two strings, so

$$\mathit{prev}(x[i_{1}..n])\prec\mathit{prev}(x[i_{2}..n])\Leftrightarrow\mathit{prev}(x[i_{1}..n])0\prec\mathit{prev}(x[i_{2}..n])0\Leftrightarrow B_{i_{1},1}\cdots B_{i_{1},z_{i_{1}}+1}\prec B_{i_{2},1}\cdots B_{i_{2},z_{i_{2}}+1}.$$

Also notice that since any block must end with a $0$ , if two blocks are not identical, it holds that one cannot be a prefix of the other. Thus, if $B_{i_{1},1}\cdots B_{i_{1},z_{i_{1}}+1}\prec B_{i_{2},1}\cdots B_{i_{2},z_{i_{2}}+1}$ , this implies that there is some block $k$ such that $B_{i_{1},j}=B_{i_{2},j}$ , for all $1\leq j<k$ , and $B_{i_{1},k}\prec B_{i_{2},k}$ , where $B_{i_{1},k}$ is not a prefix of $B_{i_{2},k}$ . By definition, $B_{i_{1},k}\preceq B_{i_{2},k}\Leftrightarrow C_{i_{1},k}\leq C_{i_{2},k}$ . Therefore, we have $B_{i_{1},1}\cdots B_{i_{1},z_{i_{1}}+1}\prec B_{i_{2},1}\cdots B_{i_{2},z_{i_{2}}+1}\Leftrightarrow C_{i_{1}}\prec C_{i_{2}}$ . ∎

From Lemma 1, the problem of lexicographically sorting the set of strings $\{\mathit{prev}(x[1..n]),\ldots,\mathit{prev}(x[n..n])\}$ reduces to the problem of lexicographically sorting the set of strings $\{C_{1},\ldots,C_{n}\}$ . The latter can be done in $O(n\pi)$ time using radix sort, since the strings are over the alphabet $\{1,\ldots,n\}$ and the total length of the strings is at most $n\pi$ .
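The reduction of Lemma 1 can be checked on small inputs with the sketch below, which names every block by its rank among the blocks with the same index and then sorts the resulting rank strings. It is a naive quadratic illustration: ranks are obtained by sorting the blocks explicitly and the final sort uses Python's built-in `sorted` rather than radix sort. All names are hypothetical, lowercase letters are assumed to be in $\Pi$ , and static symbols are assumed to sort after integers.

```python
def prev_encoding(x, is_param=str.islower):
    """prev encoding with 1-based distances; static symbols stay as characters."""
    last, out = {}, []
    for i, c in enumerate(x, start=1):
        if is_param(c):
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def sym_key(v):
    # Assumed order: integers of the encoding sort before static symbols.
    return (1, v) if isinstance(v, str) else (0, v)


def blocks(enc):
    """Split enc followed by an appended 0 into blocks, each ending at a 0."""
    out, cur = [], []
    for v in enc + [0]:
        cur.append(v)
        if v == 0:
            out.append(tuple(cur))
            cur = []
    return out


def rank_strings(x):
    """Map each suffix position i (1-based) to its rank string C_i."""
    n = len(x)
    B = {i: blocks(prev_encoding(x[i - 1:])) for i in range(1, n + 1)}
    C = {i: [] for i in B}
    for j in range(max(len(b) for b in B.values())):
        # distinct j-th blocks, sorted lexicographically, define the ranks
        column = sorted({b[j] for b in B.values() if j < len(b)},
                        key=lambda blk: [sym_key(v) for v in blk])
        rank = {blk: r for r, blk in enumerate(column, start=1)}
        for i, b in B.items():
            if j < len(b):
                C[i].append(rank[b[j]])
    return C


def psa_via_ranks(x):
    C = rank_strings(x)
    return sorted(C, key=C.get)          # sorting the C_i yields the PSA (Lemma 1)


print(rank_strings("stssAtssAs")[3])     # [1, 2, 4]
print(psa_via_ranks("stssAtssAs"))       # [10, 6, 2, 1, 3, 7, 4, 8, 9, 5]
```

Sorting the suffixes directly by their prev encodings under the same symbol order gives the same array.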

What remains is to compute $C_{i,j}$ for all $i,j$ in the same time bound. A problem is that the total length of all $B_{i,j}$ is $\Theta(n^{2})$ , so we cannot afford to naively process all of them.

Denote by $b_{i,j}$ and $e_{i,j}$ the beginning and end positions of $B_{i,j}$ with respect to their (global) position in $x$ . Note that for any $1\leq i\leq n$ , we have $b_{i,1}=i$ , and $b_{i,j}=e_{i,j-1}+1$ for all $2\leq j\leq z_{i}+1$ . Our algorithm depends on the following simple yet crucial lemma.

Lemma 2

For any $1<i\leq n$ and $1\leq j\leq z_{i}+1$ , we have that either

1. $b_{i,j}=e_{i-1,j}+1$ , or

2. $b_{i,j}\geq b_{i-1,j}$ , $e_{i,j}=e_{i-1,j}$ , and $B_{i,j}$ is a suffix of $B_{i-1,j}$

holds.

Proof

If $x[i-1]\in\Sigma$ , then $\mathit{prev}(x[i..n])$ is a suffix of $\mathit{prev}(x[i-1..n])$ , i.e., $\mathit{prev}(x[i..n])=\mathit{prev}(x[i-1..n])[2..n-i+2]$ and $\mathit{prev}(x[i-1..n])[1]\neq 0$ . Thus, $B_{i,1}$ is a suffix of $B_{i-1,1}$ , and $B_{i,j}=B_{i-1,j}$ for all $2\leq j\leq z_{i}$ , so the second case of the claim holds.

If $x[i-1]\in\Pi$ , the values in $\mathit{prev}(x[i..n])$ are equivalent to the corresponding values of $\mathit{prev}(x[i-1..n])[2..n-i+2]$ , except possibly at some (global) position $k\geq i$ when there is a second occurrence of the symbol $x[i-1]$ at $x[k]$ which becomes the first occurrence in $x[i..n]$ . (In other words, the value corresponding to $x[k]$ in $\mathit{prev}(x[i-1..n])$ is $k-i+1$ .) Since there is no previous occurrence of $x[i-1]$ in $x[i-1..n]$ , $\mathit{prev}(x[i-1..n])[1]=0$ . The situation is depicted in Fig. 2.

Let $B_{i-1,j^{\prime}}$ be the block of $\mathit{prev}(x[i-1..n])$ that contains (global) position $k$ . Because, as mentioned previously, $\mathit{prev}(x[i..n])$ and $\mathit{prev}(x[i-1..n])[2..n-i+2]$ are equivalent except for the value corresponding to (global) position $k$ , the block structure of $\mathit{prev}(x[i-1..n])$ is preserved in $\mathit{prev}(x[i..n])$ , except that (1) the first block $B_{i-1,1}$ disappears, and (2) the block $B_{i-1,j^{\prime}}$ is split into two blocks, corresponding to $B_{i,j^{\prime}-1}$ and $B_{i,j^{\prime}}$ . Therefore, the first case of the claim is satisfied for $1\leq j<j^{\prime}$ , since $b_{i,j}=b_{i-1,j+1}=e_{i-1,j}+1$ for any $1\leq j<j^{\prime}$ . Also, we can see that the second case of the claim is satisfied for $j^{\prime}\leq j\leq z_{i}$ , since $B_{i,j^{\prime}}$ is a suffix of $B_{i-1,j^{\prime}}$ , and $B_{i,j}=B_{i-1,j}$ for $j^{\prime}<j\leq z_{i}$ .

Finally, the case when such $k$ does not exist can be considered to be included above by simply assuming we are looking at a prefix of a longer string and $k>|x|,j^{\prime}>z_{i}$ , since the prev encoding is preserved for prefixes, i.e., the prev encoding of a prefix of any p-string $y$ is equivalent to the corresponding prefix of the prev encoding of $y$ . Thus, the lemma holds. ∎
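The dichotomy of Lemma 2 can be observed experimentally with the following sketch, which computes the global begin and end positions of all blocks naively and reports which case applies to a given $B_{i,j}$ . The helper names are hypothetical, lowercase letters are assumed to be in $\Pi$ , and position $n+1$ stands for the appended $0$ .

```python
def prev_encoding(x, is_param=str.islower):
    """prev encoding with 1-based distances; static symbols stay as characters."""
    last, out = {}, []
    for i, c in enumerate(x, start=1):
        if is_param(c):
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def blocks_with_bounds(x, i):
    """List of (b, e, block) for prev(x[i..n])0, with global 1-based b and e."""
    enc = prev_encoding(x[i - 1:]) + [0]
    out, start = [], 0
    for off, v in enumerate(enc):
        if v == 0:                       # a 0 closes the block that began at `start`
            out.append((i + start, i + off, tuple(enc[start:off + 1])))
            start = off + 1
    return out


def lemma2_case(x, i, j):
    """Return 1 or 2 for the case of Lemma 2 that B_{i,j} satisfies, or 0 for neither."""
    pb, pe, pblk = blocks_with_bounds(x, i - 1)[j - 1]
    b, e, blk = blocks_with_bounds(x, i)[j - 1]
    if b == pe + 1:
        return 1
    if b >= pb and e == pe and pblk[-len(blk):] == blk:
        return 2
    return 0


x = "stssAtssAs"
print(lemma2_case(x, 2, 1), lemma2_case(x, 2, 3), lemma2_case(x, 4, 2))  # 1 2 2
```

For $i=2$ the symbol $x[1]=\mathtt{s}$ reoccurs at $k=3$ , which lies in the third block of $\mathit{prev}(x[1..n])0$ , so the first two blocks fall under the first case and the third under the second case.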

Lemma 2 implies that if we fix some $j$ , we can represent $B_{i,j}$ for all $i$ , as suffixes (in the standard sense) of strings of total length $O(n)$ .

Corollary 1

For any $j$ , there exists a set of strings $S_{j}$ with total length $n+1$ over the alphabet $\Sigma\cup\{0,\ldots,n-1\}$ such that $B_{i,j}$ is a suffix of some string in $S_{j}$ for all $i\in\{1,\ldots,n\}$ .

Proof

We include $B_{i,j}$ in $S_{j}$ , if $i=1$ , or, if $i>1$ and $B_{i,j}$ satisfies the first case of Lemma 2. Since the first case implies that the (global) positions $[b_{i-1,j}..e_{i-1,j}]$ and $[b_{i,j}..e_{i,j}]$ are disjoint, the total length of strings in $S_{j}$ is at most $n+1$ (including the $0$ appended to $B_{i,z_{i}+1}$ ). On the other hand, if $B_{i,j}$ satisfies the second case, it is a suffix of an already included string. ∎

Thus, computing $C_{i,j}$ for all $i$ can be done by computing the generalized suffix array for the set $S_{j}$ . This can be done in $O(n)$ time given $S_{j}$ [13, 14, 12] and thus, for all $j$ , the total is $O(n\pi)$ time.
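To make Corollary 1 concrete, the sketch below builds $S_{j}$ by the inclusion rule of the proof and reports its total length. It recomputes the blocks of every suffix from scratch, so it takes quadratic time and only illustrates the statement, not the linear-time procedure. Names are hypothetical and lowercase letters are assumed to be in $\Pi$ .

```python
def prev_encoding(x, is_param=str.islower):
    """prev encoding with 1-based distances; static symbols stay as characters."""
    last, out = {}, []
    for i, c in enumerate(x, start=1):
        if is_param(c):
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def blocks_with_bounds(x, i):
    """List of (b, e, block) for prev(x[i..n])0, with global 1-based b and e."""
    enc = prev_encoding(x[i - 1:]) + [0]
    out, start = [], 0
    for off, v in enumerate(enc):
        if v == 0:
            out.append((i + start, i + off, tuple(enc[start:off + 1])))
            start = off + 1
    return out


def build_S(x, j):
    """S_j: B_{1,j} plus every B_{i,j} that satisfies the first case of Lemma 2."""
    S, prev_end = [], None
    for i in range(1, len(x) + 1):
        bl = blocks_with_bounds(x, i)
        if j > len(bl):
            break                        # the number of blocks never grows with i
        b, e, blk = bl[j - 1]
        if i == 1 or b == prev_end + 1:  # first case: starts right after B_{i-1,j}
            S.append(blk)
        prev_end = e
    return S


x = "stssAtssAs"
print([sum(len(s) for s in build_S(x, j)) for j in (1, 2, 3)])  # [10, 10, 9]
```

Each total stays within the bound $n+1=11$ , and every $B_{i,j}$ is a suffix of some string in the corresponding $S_{j}$ .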

Theorem 3.1

The parameterized suffix array of a p-string of length $n$ can be computed in $O(n\pi)$ time and $O(n)$ space.

Proof

We compute a forward encoding of $x$ , analogous to the prev encoding, defined as follows:

$$\mathit{fwd}(x)[i]=\begin{cases}x[i] & \text{if } x[i]\in\Sigma,\\ \infty & \text{if } x[i]\in\Pi \text{ and } x[i]\neq x[j] \text{ for any } i<j\leq n,\\ j-i & \text{if } x[i]\in\Pi,\ x[i]=x[j] \text{ and } x[i]\neq x[k] \text{ for any } i<k<j.\end{cases}$$

This is done once, and can be computed in $O(n)$ time. Next, for any fixed $j$ , we show how to compute the set $S_{j}$ in linear time. This is done by using $\mathit{fwd}$ and Lemma 2. We can first scan $\mathit{prev}(x)$ to obtain $B_{1,j}$ . Suppose for some $i\geq 2$ , we know the beginning and end positions $b_{i-1,j}$ , $e_{i-1,j}$ of $B_{i-1,j}$ . Notice that when $x[i-1]\in\Pi$ , the (global) position $k$ in the proof of Lemma 2 is $i-1+\mathit{fwd}(x)[i-1]$ . Based on this value, we know that if $k<b_{i-1,j}$ , then $B_{i,j}=B_{i-1,j}$ , and if $b_{i-1,j}\leq k\leq e_{i-1,j}$ , then $B_{i,j}$ is a suffix of $B_{i-1,j}$ ; both situations correspond to the second case of Lemma 2. When $k>e_{i-1,j}$ , this corresponds to the first case of Lemma 2, so we scan $\mathit{prev}(x[i..n])$ starting from the position corresponding to the global position $b_{i,j}=e_{i-1,j}+1$ (i.e., position $e_{i-1,j}-i+2$ in $\mathit{prev}(x[i..n])$ ) until we find the first $0$ , which gives us $B_{i,j}$ , which we include in $S_{j}$ . Since we only scan each position once, the total time for computing $S_{j}$ is $O(n)$ .
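The forward encoding and the position $k$ it yields can be sketched as follows. With 1-based positions, $\mathit{fwd}(x)[i-1]$ is the distance from position $i-1$ to the next occurrence of the same parameter symbol, so that occurrence sits at global position $i-1+\mathit{fwd}(x)[i-1]$ . The function names are hypothetical and lowercase letters are assumed to be in $\Pi$ . The sketch covers only this ingredient, not the full computation of $S_{j}$ .

```python
import math


def fwd_encoding(x, is_param=str.islower):
    """Forward encoding: distance to the next occurrence, or infinity if none."""
    nxt, out = {}, [None] * len(x)
    for i in range(len(x), 0, -1):       # right-to-left over 1-based positions
        c = x[i - 1]
        if is_param(c):
            out[i - 1] = nxt[c] - i if c in nxt else math.inf
            nxt[c] = i
        else:
            out[i - 1] = c               # static symbols are kept unchanged
    return out


def changed_position(x, i, is_param=str.islower):
    """Global position k where prev(x[i..n]) differs from prev(x[i-1..n])[2..].

    Returns None when x[i-1] is static or does not occur again.
    """
    f = fwd_encoding(x, is_param)[i - 2]   # fwd(x)[i-1] in the 1-based notation
    if not is_param(x[i - 2]) or f == math.inf:
        return None
    return i - 1 + f


x = "stssAtssAs"
print(fwd_encoding(x))         # [2, 4, 1, 3, 'A', inf, 1, 2, 'A', inf]
print(changed_position(x, 2))  # 3
print(changed_position(x, 7))  # None
```

Recomputing `fwd_encoding` inside `changed_position` keeps the sketch short; a real implementation would compute the encoding once and index into it.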

The time complexity follows from the arguments above for sorting the strings $C_{i}$ by radix sort. Since a single step of the radix sort requires only the values $C_{i,j}$ for a fixed $j$ and all $1\leq i\leq n$ , and since the total length of $S_{j}$ is $O(n)$ by Corollary 1, the space complexity is $O(n)$ . ∎

3.2 $\mathit{pLCP}$ Construction

Given $\mathit{PSA}$ , we can construct $\mathit{pLCP}$ as follows in $O(n\pi)$ time and $O(n)$ space. We recompute $S_{j}$ for $j=1,\ldots,\pi$ , and each time process it for LCE queries, so that the longest common prefix between $B_{i_{1},j}$ and $B_{i_{2},j}$ for some $1\leq i_{1},i_{2}\leq n$ can be computed in constant time. This can be done in time linear in the total length of $S_{j}$ , so in $O(n\pi)$ total time for all $j$ . We compute the longest common prefix between each adjacent suffix in $\mathit{PSA}$ block by block. Since each block takes constant time, and there are $O(\pi)$ blocks for each suffix, the total is $O(n\pi)$ time for all entries of the $\mathit{pLCP}$ array. The space complexity is $O(n)$ since, as for the case of $\mathit{PSA}$ construction, we only process the $j$ th block at each step.
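The block-by-block computation of $\mathit{pLCP}$ can be sketched as below. The constant-time LCE structure over $S_{j}$ is replaced here by a naive character-by-character comparison, and all blocks are materialized at once, so the sketch shows only the control flow and does not meet the stated time or space bounds. Because the appended $0$ is not part of either suffix, the accumulated length is capped by the length of the shorter suffix. All names are hypothetical, lowercase letters are assumed to be in $\Pi$ , and static symbols are assumed to sort after integers.

```python
def prev_encoding(x, is_param=str.islower):
    """prev encoding with 1-based distances; static symbols stay as characters."""
    last, out = {}, []
    for i, c in enumerate(x, start=1):
        if is_param(c):
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def blocks(enc):
    """Split enc followed by an appended 0 into blocks, each ending at a 0."""
    out, cur = [], []
    for v in enc + [0]:
        cur.append(v)
        if v == 0:
            out.append(tuple(cur))
            cur = []
    return out


def lce(u, v):
    """Naive stand-in for a constant-time LCE query on two blocks."""
    length = 0
    while length < min(len(u), len(v)) and u[length] == v[length]:
        length += 1
    return length


def plcp_blockwise(x, psa):
    n = len(x)
    B = {i: blocks(prev_encoding(x[i - 1:])) for i in psa}
    plcp = [0]
    for a, b in zip(psa, psa[1:]):
        total = 0
        for blk_a, blk_b in zip(B[a], B[b]):
            common = lce(blk_a, blk_b)
            total += common
            if common < len(blk_a) or common < len(blk_b):
                break                    # first pair of blocks that differ
        # cap at the shorter suffix: the appended 0 does not count
        plcp.append(min(total, n - max(a, b) + 1))
    return plcp


x = "stssAtssAs"
key = lambda i: [(1, v) if isinstance(v, str) else (0, v) for v in prev_encoding(x[i - 1:])]
psa = sorted(range(1, len(x) + 1), key=key)   # naive PSA, for the demonstration only
print(plcp_blockwise(x, psa))                 # [0, 1, 4, 2, 1, 3, 1, 2, 0, 2]
```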

Bibliography

[1] Baker, B.S.: A program for identifying duplicated code. Computing Science and Statistics 24, 49–57 (1992)
[2] Baker, B.S.: Parameterized pattern matching: Algorithms and applications. J. Comput. Syst. Sci. 52(1), 28–42 (1996). https://doi.org/10.1006/jcss.1996.0003
[3] Baker, B.S.: Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. Comput. 26(5), 1343–1362 (1997). https://doi.org/10.1137/S0097539793246707
[4] Beal, R., Adjeroh, D.A.: p-suffix sorting as arithmetic coding. J. Discrete Algorithms 16, 151–169 (2012). https://doi.org/10.1016/j.jda.2012.05.001
[5] Deguchi, S., Higashijima, F., Bannai, H., Inenaga, S., Takeda, M.: Parameterized suffix arrays for binary strings. In: Holub, J., Zdárek, J. (eds.) Proceedings of the Prague Stringology Conference 2008, Prague, Czech Republic, September 1-3, 2008. pp. 84–94. Prague Stringology Club, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague (2008), http://www.stringology.org/event/2008/p08.html
[6] Diptarama, Katsura, T., Otomo, Y., Narisawa, K., Shinohara, A.: Position heaps for parameterized strings. In: Kärkkäinen, J., Radoszewski, J., Rytter, W. (eds.) 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, July 4-6, 2017, Warsaw, Poland. LIPIcs, vol. 78, pp. 8:1–8:13. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2017). https://doi.org/10.4230/LIPIcs.CPM.2017.8
[7] Ehrenfeucht, A., McConnell, R.M., Osheim, N., Woo, S.W.: Position heaps: A simple and dynamic text indexing data structure. Journal of Discrete Algorithms 9(1), 100–121 (2011). https://doi.org/10.1016/j.jda.2010.12.001, http://www.sciencedirect.com/science/article/pii/S1570866710000535, 20th Anniversary Edition of the Annual Symposium on Combinatorial Pattern Matching (CPM 2009)
[8] Fujisato, N., Nakashima, Y., Inenaga, S., Bannai, H., Takeda, M.: Right-to-left online construction of parameterized position heaps. CoRR abs/1808.01071 (2018), http://arxiv.org/abs/1808.01071
