An Extension of Linear-size Suffix Tries for Parameterized Strings

Katsuhito Nakashima; Diptarama Hendrian; Ryo Yoshinaka; Ayumi Shinohara

arXiv:1902.00216·cs.DS·September 5, 2019


TL;DR

This paper introduces PLSTs, a new linear-size indexing structure for parameterized strings, enabling efficient pattern matching and improved space efficiency on repetitive data.

Contribution

It generalizes linear-size suffix tries to parameterized strings, providing a linear-time pattern matching algorithm and demonstrating space efficiency improvements.

Findings

01 PLSTs enable linear-time parameterized pattern matching.

02 PLSTs are more space-efficient than parameterized suffix trees on repetitive strings.

03 The structure combines features of suffix tries and suffix trees for parameterized strings.

Abstract

In this paper, we propose a new indexing structure for parameterized strings which we call PLSTs, by generalizing linear-size suffix tries for ordinary strings. Two parameterized strings are said to match if there is a bijection on the symbol set that makes the two coincide. PLSTs are applicable to the parameterized pattern matching problem, which is to decide whether the input parameterized text has a substring that matches the input parameterized pattern. The size of PLSTs is linear in the text size, with which our algorithm solves the parameterized pattern matching problem in linear time in the pattern size. PLSTs can be seen as a compacted version of parameterized suffix tries and a combination of linear-size suffix tries and parameterized suffix trees. We experimentally show that PLSTs are more space efficient than parameterized suffix trees for highly repetitive strings.

Tables

Table 1: The numbers of nodes of PLSTs for different sorts of strings

Random strings
length $n$	constant string, Type 1	constant string, Type 2	p-string, Type 1	p-string, Type 2

10	16.98	6.04	16.93	5.23
20	35.66	12.78	35.72	12.27
40	74.58	27.25	74.53	26.22
80	153.61	56.82	153.48	56.04
160	312.37	115.55	312.45	115.24
320	631.40	234.55	631.27	235.32
640	1270.34	477.29	1270.47	475.34
1280	2549.35	956.18	2549.39	957.03
2560	5108.37	1923.62	5108.48	1922.97
5120	10227.48	3845.35	10227.29	3853.97
10240	20466.49	7710.50	20466.14	7704.25

Table 2: The numbers of nodes of PLSTs for Thue-Morse and Period-doubling strings

Thue-Morse strings
length $n$	constant string, Type 1	constant string, Type 2,3	p-string, Type 1	p-string, Type 2,3

17	28	10	29	6
33	56	14	57	8
65	112	18	113	10
129	224	22	225	12
257	448	26	449	14
513	896	30	897	16
1025	1792	34	1793	18
2049	3584	38	3585	20
4097	7168	42	7169	22
8193	14336	46	14337	24

Equations

$$\mathsf{prev}(w)[i]=\begin{cases}w[i] & \text{if } w[i]\in\Sigma,\\ 0 & \text{if } w[i]\in\Pi \text{ and } w[i]\neq w[j] \text{ for } 1\leq j<i,\\ i-k & \text{if } w[i]\in\Pi \text{ and } k=\max\{j\mid w[j]=w[i] \text{ and } 1\leq j<i\}.\end{cases}$$

$$\langle u\rangle_{k}[i]=\begin{cases}0 & \text{if } u[i]\in{\cal N} \text{ and } u[i]\geq i-k+1,\\ u[i] & \text{otherwise.}\end{cases}$$

$$\mathsf{Re}(v)=\begin{cases}i-|u| & \text{if there exists } i \text{ such that } v[i]=i-1 \text{ and } |u|<i\leq|v|,\\ 0 & \text{otherwise.}\end{cases}$$

$$p^{\prime}[i]=\begin{cases}0 & \text{if } i=\mathsf{Re}(v),\\ p[i] & \text{otherwise.}\end{cases}$$

$$R=\{\,(u,v)\in V_{1}\times V_{2}\mid \text{there is } i \text{ s.t. } v=\mathsf{sl}^{i}(u) \text{ and } \mathsf{sl}^{j}(u)\notin V_{2} \text{ for all } j<i\,\}$$

$$T_{n}=\mathtt{x}_{1}\mathtt{a}_{1}\dots\mathtt{x}_{n}\mathtt{a}_{n}\mathtt{x}_{1}\mathtt{a}_{1}\dots\mathtt{x}_{n}\mathtt{a}_{n}\mathtt{y}_{1}\mathtt{a}_{1}\dots\mathtt{y}_{n}\mathtt{a}_{n}\mathtt{z}\texttt{\$}$$

$$w_{i}=0a_{i}0a_{i+1}\dots 0a_{n}0a_{1}\dots 0a_{i-1}\in\mathsf{PrevSub}(T_{n})$$

$$Fib_{1}=b,\quad Fib_{2}=a,\quad Fib_{k}=Fib_{k-1}+Fib_{k-2} \text{ for } k>2.$$

$$\sigma(a)=ab,\quad \sigma(b)=ba$$

$$\sigma(a)=ab,\quad \sigma(b)=aa$$


Full text


Katsuhito Nakashima

Graduate School of Information Sciences, Tohoku University, Japan

Diptarama Hendrian

Graduate School of Information Sciences, Tohoku University, Japan

Ryo Yoshinaka

Graduate School of Information Sciences, Tohoku University, Japan

Ayumi Shinohara

Graduate School of Information Sciences, Tohoku University, Japan


1 Introduction

The pattern matching problem is to check whether a pattern string occurs in a text string or not. To solve the pattern matching problem efficiently, numerous text indexing structures have been proposed. Suffix trees are the most widely used data structures and have many applications, including several variants of pattern matching problems [5, 11]. They can be seen as a compacted type of suffix tries, where two branching nodes that have no other branching nodes between them in a suffix trie are directly connected in the suffix tree. The new edges have a reference to an interval of the text so that the original path label of the suffix trie can be recovered. Recently, Crochemore et al. [6] proposed a new indexing structure, called a linear-size suffix trie (LST), which is another compacted variant of a suffix trie. Like a suffix tree, an LST replaces paths consisting only of non-branching nodes by edges, but unlike suffix trees, the original path labels are recovered by referring to other edge labels in the LST itself. LSTs use less memory space than suffix trees for indexing the same highly repetitive strings [takagi2017linear]. LSTs may be used as an alternative to suffix trees for various applications beyond pattern matching, such as computing the longest common substrings.

On the other hand, different types of pattern matching have been proposed and intensively studied. The variant this paper is concerned with is the parameterized pattern matching problem, introduced by Baker [3]. Considering two disjoint sets of symbols $\Sigma$ and $\Pi$ , we call a string over $\Sigma\cup\Pi$ a parameterized string (p-string). In the parameterized pattern matching problem, given p-strings ${T}$ and ${P}$ , we must check whether ${T}$ has a substring that can be transformed into ${P}$ by applying a one-to-one function that renames symbols in $\Pi$ . Parameterized pattern matching is motivated by applications in software maintenance [1, 3], plagiarism detection [9], the analysis of gene structure [14], and so on. As with the basic string matching problem, several indexing structures that support parameterized pattern matching have been proposed, such as parameterized suffix trees [3], structural suffix trees [14], parameterized suffix arrays [7, 12], and parameterized position heaps [8, 10].

In this paper, we propose a new indexing structure for p-strings, which we call PLST. A PLST is a tree structure that combines a linear-size suffix trie and a parameterized suffix tree for prev-encoded [3] suffixes of a p-string. We show that the size of a PLST is $O(n)$ and give an algorithm that, given a pattern and a PLST, solves the parameterized pattern matching problem in $O(m)$ time, where $n$ is the length of the text and $m$ is the length of the pattern. Furthermore, we experimentally show that PLSTs are more space efficient than parameterized suffix trees for highly repetitive strings such as Fibonacci strings.

2 Preliminaries

2.1 Basic definitions and notation

We denote the set of all non-negative integers by ${\cal N}$ . Let $\Delta$ be an alphabet. For a string ${w}={xyz}\in\Delta^{*}$ , ${x}$ , ${y}$ , and ${z}$ are called prefix, substring, and suffix of ${w}$ , respectively. The length of ${w}$ is denoted by $|{w}|$ and the $i$ -th symbol of ${w}$ is denoted by ${w}[i]$ for $1\leq i\leq|{w}|$ . The substring of ${w}$ that begins at position $i$ and ends at position $j$ is denoted by ${w}[i:j]$ for $1\leq i\leq j\leq|{w}|$ . For convenience, we abbreviate ${w}[1:i]$ to ${w}[:i]$ and ${w}[i:|w|]$ to ${w}[i:]$ for $1\leq i\leq|{w}|$ . The empty string is denoted by $\varepsilon$ , that is $|\varepsilon|=0$ . Moreover, let ${w}[i:j]=\varepsilon$ if $i>j$ . For a string $u$ and an extension $uv$ , we write ${\mathsf{str}({u},{uv}})=v$ .

Throughout this paper, we fix two alphabets $\Sigma$ and $\Pi$ . We call elements of $\Sigma$ constant symbols and those of $\Pi$ parameter symbols. An element of $\Sigma^{*}$ is called a constant string and that of $(\Sigma\cup\Pi)^{*}$ is called a parameterized string, or p-string for short. We assume that the sizes of $\Sigma$ and $\Pi$ are constant.

Given two p-strings $w_{1}$ and $w_{2}$ of length $n$ , $w_{1}$ and $w_{2}$ are a parameterized match (p-match), denoted by $w_{1}\approx w_{2}$ , if there is a bijection $f$ on $\Sigma\cup\Pi$ such that $f(a)=a$ for any $a\in\Sigma$ and $f(w_{1}[i])=w_{2}[i]$ for all $1\leq i\leq n$ [3]. We can determine whether $w_{1}\approx w_{2}$ or not by using an encoding called prev-encoding defined as follows.

Definition 1 (Prev-encoding [3]).

For a p-string $w$ of length $n$ over $\Sigma\cup\Pi$ , the prev-encoding for $w$ , denoted by $\mathsf{prev}({w})$ , is defined to be a string over $\Sigma\cup{\cal N}$ of length $n$ such that for each $1\leq i\leq n$ ,

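$$\mathsf{prev}(w)[i]=\begin{cases}w[i] & \text{if } w[i]\in\Sigma,\\ 0 & \text{if } w[i]\in\Pi \text{ and } w[i]\neq w[j] \text{ for } 1\leq j<i,\\ i-k & \text{if } w[i]\in\Pi \text{ and } k=\max\{j\mid w[j]=w[i] \text{ and } 1\leq j<i\}.\end{cases}$$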

We call strings over $\Sigma\cup{\cal N}$ pv-strings.

For any p-strings $w_{1}$ and $w_{2}$ , $w_{1}\approx w_{2}$ if and only if $\mathsf{prev}({w_{1}})=\mathsf{prev}({w_{2}})$ . For example, given $\Sigma=\{{\tt a,b}\}$ and $\Pi=\{{\tt u,v,x,y}\}$ , $s_{1}={\tt uvvvauuvb}$ and $s_{2}={\tt xyyyaxxyb}$ are p-matches by $f$ such that $f(\mathtt{u})=\mathtt{x}$ and $f(\mathtt{v})=\mathtt{y}$ , where $\mathsf{prev}({s_{1}})=\mathsf{prev}({s_{2}})=0011{\tt a}514{\tt b}$ .
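The prev-encoding of Definition 1 can be computed in a single left-to-right pass. The following Python sketch is an illustration only: the function name `prev_encode`, the representation of pv-strings as tuples mixing characters and integers, and the explicit `constants` argument standing for $\Sigma$ are assumptions made here, not notation from the paper.

```python
def prev_encode(w, constants):
    """Prev-encoding of the p-string w; symbols in `constants` are kept as-is."""
    last = {}   # last 1-indexed position at which each parameter symbol was seen
    out = []
    for i, c in enumerate(w, start=1):
        if c in constants:
            out.append(c)                 # constant symbol: copied unchanged
        else:
            # parameter symbol: 0 on first occurrence, else distance to previous one
            out.append(i - last[c] if c in last else 0)
            last[c] = i
    return tuple(out)


SIGMA = {"a", "b"}
s1, s2 = "uvvvauuvb", "xyyyaxxyb"
print(prev_encode(s1, SIGMA))   # (0, 0, 1, 1, 'a', 5, 1, 4, 'b')
print(prev_encode(s1, SIGMA) == prev_encode(s2, SIGMA))   # True
```

The two strings of the example p-match exactly because their encodings coincide.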

We define parameterized pattern matching as follows.

Definition 2 (Parameterized pattern matching [3]).

Given two p-strings, text $T$ and pattern $P$ , decide whether $T$ has a substring that p-matches $P$ .

For example, considering a text $T={\tt auvaubuavbv}$ and a pattern $P={\tt xayby}$ over $\Sigma=\{{\tt a,b}\}$ and $\Pi=\{{\tt u,v,x,y}\}$ , $T$ has two substrings $T[3:7]={\tt vaubu}$ and $T[7:11]={\tt uavbv}$ that p-match $P$ .
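As a baseline for Definition 2, one can compare the prev-encoding of the pattern with that of every length-$|P|$ window of the text. The sketch below is a naive quadratic-time check written for illustration, not the index-based algorithm developed in this paper; the names `prev_encode` and `p_match_positions` are hypothetical.

```python
def prev_encode(w, constants):
    """Prev-encoding of the p-string w; symbols in `constants` are kept as-is."""
    last, out = {}, []
    for i, c in enumerate(w, start=1):
        if c in constants:
            out.append(c)
        else:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
    return tuple(out)


def p_match_positions(text, pattern, constants):
    """1-indexed start positions of the substrings of text that p-match pattern."""
    m = len(pattern)
    target = prev_encode(pattern, constants)
    return [i + 1
            for i in range(len(text) - m + 1)
            if prev_encode(text[i:i + m], constants) == target]


print(p_match_positions("auvaubuavbv", "xayby", {"a", "b"}))   # [3, 7]
```

The reported positions correspond to $T[3:7]$ and $T[7:11]$ in the example.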

Throughout this paper, we assume that a text $T$ ends with a sentinel symbol $\texttt{\$}\in\Sigma$ , which occurs nowhere else in $T$ .

2.2 Suffix tries, suffix trees, and linear-size suffix tries

This subsection briefly reviews tree structures for indexing all the substrings of a constant string $T\in\Sigma^{*}$ .

The suffix trie $\mathsf{STrie}(T)$ is a tree with nodes corresponding to all the substrings of $T$ . Figure 1 (a) shows an example of a suffix trie. Throughout this paper, we identify a node with its corresponding string for explanatory convenience. Note that each node does not explicitly remember its corresponding string. For each nonempty substring $ua$ of $T$ where $a\in\Sigma$ , we have an edge from $u$ to $ua$ labeled with $a$ . By reading the labels on the path from the root to a node $u$ , one can obtain the string $u$ to which the node corresponds. Likewise, the path label from the node $u$ to a descendant $uv$ is ${\mathsf{str}({u},{uv}})=v$ for $u,v\in\Sigma^{*}$ . Since there are $\Theta(|T|^{2})$ substrings of $T$ , the size of $\mathsf{STrie}(T)$ is $\Theta(|T|^{2})$ .

The suffix tree $\mathsf{STree}(T)$ is a tree obtained from $\mathsf{STrie}(T)$ by removing all non-branching internal nodes and replacing each path with no branching nodes by a single edge whose label refers to a corresponding interval of the text $T$ . That is, the label on the edge $(u,v)$ is a pair $(i,j)$ such that $T[i:j]={\mathsf{str}({u},{v}})$ . Since there are at most $O(|T|)$ branching nodes, the size of $\mathsf{STree}(T)$ is $\Theta(|T|)$ .

An important auxiliary map on nodes is called suffix links, denoted by $\mathsf{SL}$ , which is defined by $\mathsf{SL}(aw)=w$ for each node $aw$ with $a\in\Sigma$ and $w\in\Sigma^{*}$ .

The linear-size suffix trie (LST) [6] $\mathsf{LST}(T)$ of a string $T$ is another compact variant of a suffix trie (see Figure 1 (b)). An LST suppresses (most) non-branching nodes and replaces paths with edges like a suffix tree, but the labels of those new edges do not refer to intervals of the input text. Each edge $(u,v)$ retains only the first symbol ${\mathsf{str}({u},{v}})[1]$ of the original path label ${\mathsf{str}({u},{v}})$ . To recover the original label ${\mathsf{str}({u},{v}})$ , we refer to another edge or a path in the LST itself by following a suffix link, using the fact that ${\mathsf{str}({u},{v}})={\mathsf{str}({\mathsf{SL}(u)},{\mathsf{SL}(v)}})$ . The reference may be recursive, but eventually one can regain the original path label by collecting those retained symbols. For this purpose, $\mathsf{LST}(T)$ keeps some non-branching internal nodes from $\mathsf{STrie}(T)$ and thus it may have more nodes than $\mathsf{STree}(T)$ , but its size is still linear in $|T|$ . The nodes of $\mathsf{LST}(T)$ consist of those of $\mathsf{STree}(T)$ and the non-branching nodes whose suffix links point to a branching node. We call the former Type 1 and the latter Type 2. Each edge $(u,v)$ has a 1-bit flag that tells whether $|v|-|u|=1$ . If this is the case, one knows the complete label ${\mathsf{str}({u},{v}})={\mathsf{str}({u},{v}})[1]$ . Otherwise, one needs to follow the suffix link to regain the other symbols. If we had only Type 1 nodes, then for some edge $(u,v)$ there might be a branching node between $\mathsf{SL}(u)$ and $\mathsf{SL}(v)$ , which would make it difficult to regain the original path label. With Type 2 nodes, there is no branching node between $\mathsf{SL}(u)$ and $\mathsf{SL}(v)$ for every edge $(u,v)$ , so it is enough to go straight down from $\mathsf{SL}(u)$ to regain the original path label.

2.3 Parameterized suffix tries and parameterized suffix trees

For a p-string $T\in(\Sigma\cup\Pi)^{*}$ , a prev-encoded substring (pv-substring) of $T$ is the prev-encoding $\mathsf{prev}({w})$ of a substring $w$ of $T$ . The set of pv-substrings of $T$ is denoted by $\mathsf{PrevSub}(T)$ .

A parameterized suffix trie of $T$ , denoted by $\mathsf{PSTrie}(T)$ , is the trie that represents all the pv-substrings of $T$ . The size of $\mathsf{PSTrie}(T)$ is $\Theta(|T|^{2})$ .

For a pv-string $u\in(\Sigma\cup{\cal N})^{*}$ , the $k$ -re-encoding for $u$ , denoted by $\langle u\rangle_{k}$ , is defined to be the pv-string of length $|u|$ such that for each $1\leq i\leq|u|$ ,

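$$\langle u\rangle_{k}[i]=\begin{cases}0 & \text{if } u[i]\in{\cal N} \text{ and } u[i]\geq i-k+1,\\ u[i] & \text{otherwise.}\end{cases}$$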

When $k=1$ , we omit $k$ . We then have $\langle\mathsf{prev}({w})[i:j]\rangle=\mathsf{prev}({w[i:j]})$ for any p-string $w\in(\Sigma\cup\Pi)^{*}$ and $i,j\leq|w|$ .
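The identity $\langle\mathsf{prev}({w})[i:j]\rangle=\mathsf{prev}({w[i:j]})$ can be checked mechanically on a small example. The Python sketch below is only an illustration: `prev_encode` and `reencode` are hypothetical helper names, pv-strings are modelled as tuples mixing characters and integers, and the constant alphabet is passed in explicitly.

```python
def prev_encode(w, constants):
    """Prev-encoding of the p-string w; symbols in `constants` are kept as-is."""
    last, out = {}, []
    for i, c in enumerate(w, start=1):
        if c in constants:
            out.append(c)
        else:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
    return tuple(out)


def reencode(u, k=1):
    """k-re-encoding <u>_k: integers whose target lies before position k become 0."""
    return tuple(0 if isinstance(x, int) and x >= i - k + 1 else x
                 for i, x in enumerate(u, start=1))


SIGMA = {"a", "b"}
w = "uvvvauuvb"
pw = prev_encode(w, SIGMA)

# Compare <prev(w)[i:j]> with prev(w[i:j]) for every 1 <= i <= j <= |w|.
ok = all(reencode(pw[i - 1:j]) == prev_encode(w[i - 1:j], SIGMA)
         for i in range(1, len(w) + 1)
         for j in range(i, len(w) + 1))
print(ok)                      # True
print(reencode(pw[5:8]))       # (0, 1, 0), the encoding of "uuv"
```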

Usually suffix links are defined on nodes of suffix trees, but it is convenient to have “implicit suffix links” on all nodes except the root of $\mathsf{STrie}(T)$ , i.e., all the nonempty substrings of $T$ , as well. For a nonempty pv-string $u\in(\Sigma\cup{\cal N})^{+}$ , let $\mathsf{sl}(u)$ denote the re-encoding $\langle u[2:]\rangle$ of the string obtained by deleting the first symbol. This operation on strings will define real suffix links in indexing structures for parameterized strings based on parameterized suffix tries. Differently from constant strings, $u\in\mathsf{PrevSub}(T)$ does not necessarily imply $u[2:]\in\mathsf{PrevSub}(T)$ . What we actually have is $\mathsf{sl}(u)=\langle u[2:]\rangle\in\mathsf{PrevSub}(T)$ .
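To see why the re-encoding in $\mathsf{sl}$ is needed, consider the small text $\mathtt{xax\$}$ with $\mathtt{x}\in\Pi$, chosen here purely as an illustration. The sketch below enumerates $\mathsf{PrevSub}(T)$ by brute force; the helper names `prev_sub` and `sl` are assumptions of this example.

```python
def prev_encode(w, constants):
    """Prev-encoding of the p-string w; symbols in `constants` are kept as-is."""
    last, out = {}, []
    for i, c in enumerate(w, start=1):
        if c in constants:
            out.append(c)
        else:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
    return tuple(out)


def reencode(u):
    """Re-encoding <u> (k = 1): integers pointing before position 1 become 0."""
    return tuple(0 if isinstance(x, int) and x >= i else x
                 for i, x in enumerate(u, start=1))


def sl(u):
    """Implicit suffix link: drop the first symbol, then re-encode."""
    return reencode(u[1:])


def prev_sub(T, constants):
    """All prev-encoded nonempty substrings of T (brute force)."""
    n = len(T)
    return {prev_encode(T[i:j], constants)
            for i in range(n) for j in range(i + 1, n + 1)}


S = prev_sub("xax$", {"a", "$"})
u = (0, "a", 2)              # prev("xax")
print(u in S)                # True
print(u[1:] in S)            # False: ('a', 2) is not a pv-substring
print(sl(u), sl(u) in S)     # ('a', 0) True
```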

A parameterized suffix tree (p-suffix tree) [3] of $T$ , denoted by $\mathsf{PSTree}(T)$ , is a compacted variant of the parameterized suffix trie. Figure 2 shows an example of a p-suffix tree. Like the suffix tree for a constant string over $\Sigma$ , $\mathsf{PSTree}(T)$ is obtained from $\mathsf{PSTrie}(T)$ by removing non-branching internal nodes and giving each edge as a label a reference to some interval of the prev-encoded text $\mathsf{prev}({T})$ . The reference is represented by a triple $(i,j,k)$ of a text start position, end position, and suffix number, which refers to the pv-string $\langle\mathsf{prev}({T})[k:]\rangle[i:j]$ .

3 PLSTs

We now introduce our indexing tree structures for p-strings, which we call PLSTs, based on LSTs and p-suffix trees reviewed in Sections 2.2 and 2.3. There are two difficulties in extending LSTs to deal with p-strings. Figure 3(a) shows the LST-like structure obtained from $\mathsf{PSTrie}(T)$ in the same way as $\mathsf{LST}(T)$ is obtained from $\mathsf{STrie}(T)$ . We want to know ${\mathsf{str}({u},{v}})$ for an edge $(u,v)$ by “reduction by suffix links”, but

1. it is not necessarily the case that ${\mathsf{str}({u},{v}})={\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ , and

2. there can be a branching node $u$ of $\mathsf{PSTrie}(T)$ such that $\mathsf{sl}(u)$ is not branching.

An example edge $(u,v)$ exhibiting the first difficulty consists of $u=00$ and $v=00\mathtt{b}3\texttt{\$}$ , where ${\mathsf{str}({u},{v}})=\texttt{b}3\texttt{\$}$ but ${\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})=\texttt{b}0\texttt{\$}$ . This is caused by the fact that $\mathsf{sl}(u)=\langle u[2:]\rangle$ rather than $\mathsf{sl}(u)=u[2:]$ . Then, the path label ${\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ referenced by the suffix link may not give exactly what we want. We solve this problem by giving the node $v$ a “re-encoding sign” with which one can recover ${\mathsf{str}({u},{v}})$ from ${\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ . An example node for the second case is $00\mathtt{ab}$ . This is a branching node but $\langle\mathsf{sl}(00\mathtt{ab})\rangle=0\mathtt{ab}$ does not appear as a node. To handle this case, we simply refer to the corresponding interval of the original text $T$ by keeping the necessary subsequence, where, as we will observe in experiments, the necessary subsequence tends to be rather small. Our proposed structure PLST is shown in Figure 3(b). In what follows we explain PLSTs.

3.1 Definition and properties of PLSTs

Let $U=\mathsf{PrevSub}(T)$ be the set of nodes of $\mathsf{PSTrie}(T)$ . The set $V$ of nodes of the PLST $\mathsf{PLST}(T)$ for $T$ is a subset of $U$ , which is partitioned as $V=V_{1}\cup V_{2}\subseteq U$ . Nodes in $V_{i}$ are called Type $i$ for $i=1,2$ . The definition of Type 1 and Type 2 nodes follows the one for original LSTs [6].

1. A node $u\in U$ is Type 1 if $u$ is a leaf or a branching node in $\mathsf{PSTrie}(T)$ .

2. A node $u\in U$ is Type 2 if $u\notin V_{1}$ and $\mathsf{sl}(u)\in V_{1}$ .
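The two node types can be computed directly from the definition on a small text. The following Python sketch is a brute-force illustration that materializes the whole parameterized suffix trie (quadratically many nodes), so it is not a practical construction; `classify` and the other helper names are hypothetical, and the root is represented by the empty tuple.

```python
def prev_encode(w, constants):
    """Prev-encoding of the p-string w; symbols in `constants` are kept as-is."""
    last, out = {}, []
    for i, c in enumerate(w, start=1):
        if c in constants:
            out.append(c)
        else:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
    return tuple(out)


def sl(u):
    """Implicit suffix link: drop the first symbol, then re-encode (k = 1)."""
    return tuple(0 if isinstance(x, int) and x >= i else x
                 for i, x in enumerate(u[1:], start=1))


def classify(T, constants):
    """Return (V1, V2): Type 1 and Type 2 nodes of the PSTrie of T."""
    n = len(T)
    U = {()} | {prev_encode(T[i:j], constants)
                for i in range(n) for j in range(i + 1, n + 1)}
    children = {u: 0 for u in U}
    for v in U:
        if v:
            children[v[:-1]] += 1          # the parent of v is v without its last symbol
    V1 = {u for u in U if children[u] != 1}               # leaves and branching nodes
    V2 = {u for u in U if u and u not in V1 and sl(u) in V1}
    return V1, V2


V1, V2 = classify("xax$", {"a", "$"})
print(len(V1), sorted(V2, key=len))   # 6 [('a',), ('a', 0)]
```

For this text the Type 1 nodes are the root, the branching node $0$, and the four leaves; the non-branching nodes $0\mathtt{a}$ and $0\mathtt{a}2$ belong to neither type and are suppressed.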

Edges of $\mathsf{PLST}(T)$ are trivially determined: we have $(u,uv)\in V\times V$ as an edge if and only if $v\neq\varepsilon$ and there is no proper nonempty prefix $v^{\prime}$ of $v$ such that $uv^{\prime}\in V$ . We will show in Section 3.3 that $|V|\in O(|T|)$ . We say that $u\in V$ is good if $\mathsf{sl}(u)\in V$ , and $u\in V$ is bad otherwise. Note that any $u\in V_{2}$ is good by the definition of $V_{2}$ , and that the root $\varepsilon$ is bad.

To obtain ${\mathsf{str}({u},{v}})$ for an edge $(u,v)$ , if $|v|-|u|=1$ , we simply read the edge label ${\mathsf{str}({u},{v}})[1]$ like an LST. Otherwise, if both $u$ and $v$ are good, we basically use the technique of “reduction by suffix links”. An important observation is that the equation ${\mathsf{str}({u},{v}})={\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ , which was a key property to regain the original label in (non-parameterized) LSTs, does not necessarily hold for PLSTs. Figure 5 shows an example, where ${\mathsf{str}({u},{v}})={\tt cb}40\neq{\tt cb}00={\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ ; the third symbol $4$ in ${\mathsf{str}({u},{v}})$ is re-encoded to $0$ in ${\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ , because the first symbol $v[1]=0$ of $v$ , which is referenced by the symbol $4$ , is cut out in $\mathsf{sl}(v)$ . Fortunately, the possible difference between ${\mathsf{str}({u},{v}})$ and ${\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ is limited.

Observation 1.

Any prev-encoded substring $v$ of text $T$ has at most one position $i$ such that $v[i]=i-1$ . For such a position $i$ , we have $\mathsf{sl}(v)[i-1]=0$ and for any $j\in\{2,\dots,|v|\}\setminus\{i\}$ , $\mathsf{sl}(v)[j-1]=v[j]$ . Thus, such a position is unique in ${\mathsf{str}({u},{v}})$ for each edge $(u,v)$ in $\mathsf{PLST}(T)$ .
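Observation 1 can be confirmed by brute force on a concrete text. The sketch below reads the condition $v[i]=i-1$ as applying to positions $i\geq 2$, so that $v[i]$ is a positive integer pointing at the first symbol; this reading, like the helper names, is an assumption of the example rather than something stated in the text.

```python
def prev_encode(w, constants):
    """Prev-encoding of the p-string w; symbols in `constants` are kept as-is."""
    last, out = {}, []
    for i, c in enumerate(w, start=1):
        if c in constants:
            out.append(c)
        else:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
    return tuple(out)


def sl(u):
    """Implicit suffix link: drop the first symbol, then re-encode (k = 1)."""
    return tuple(0 if isinstance(x, int) and x >= i else x
                 for i, x in enumerate(u[1:], start=1))


def check_observation(T, constants):
    """Check Observation 1 for every prev-encoded substring of T."""
    n = len(T)
    for a in range(n):
        for b in range(a + 1, n + 1):
            v = prev_encode(T[a:b], constants)
            # positions i >= 2 whose value points at the first symbol of v
            hits = [i for i in range(2, len(v) + 1) if v[i - 1] == i - 1]
            if len(hits) > 1:
                return False
            s = sl(v)
            for j in range(2, len(v) + 1):
                expected = 0 if j in hits else v[j - 1]
                if s[j - 2] != expected:    # sl(v)[j-1] in 1-indexed notation
                    return False
    return True


print(check_observation("auvaubuavbv$", {"a", "b", "$"}))   # True
```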

For each edge $(u,v)$ , we associate an integer named re-encoding sign, so that we can regain ${\mathsf{str}({u},{v}})$ from ${\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ as follows.

Definition 3 (Re-encoding sign).

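For each node $v\in V$ , let $u$ be the parent of $v$ . We define the re-encoding sign of $v$ by

$$\mathsf{Re}(v)=\begin{cases}i-|u| & \text{if there exists } i \text{ such that } v[i]=i-1 \text{ and } |u|<i\leq|v|,\\ 0 & \text{otherwise.}\end{cases}$$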

The re-encoding sign $\mathsf{Re}(v)$ is well-defined by Observation 1. Figure 5 shows an example of re-encoding signs. The next lemma immediately follows from Observation 1 and Definition 3.

Lemma 1.

Let $(u,v)$ be an edge in $\mathsf{PLST}(T)$ such that both $u$ and $v$ are good. Then for any $i\in\{1,\dots,|{\mathsf{str}({u},{v}})|\}\setminus\{\mathsf{Re}(v)\}$ , ${\mathsf{str}({u},{v}})[i]={\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})[i]$ . If $\mathsf{Re}(v)\geq 1$ , then ${\mathsf{str}({u},{v}})[\mathsf{Re}(v)]=|u|+\mathsf{Re}(v)-1$ and ${\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})[\mathsf{Re}(v)]=0$ .

Lemma 1 tells how to recover ${\mathsf{str}({u},{v}})$ from ${\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ using the re-encoding sign at $v$ and the depth $|u|$ of $u$ . Note that the depth $|u|$ is the depth of $u$ in parameterized suffix tries, not the number of nodes from the root to $u$ in PLSTs.
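The recovery step described by Lemma 1 amounts to overwriting at most one symbol. The Python sketch below illustrates it on a hypothetical pair $u=0\mathtt{a}$, $v=0\mathtt{acb}40$, chosen so that the labels agree with the values ${\tt cb}40$ and ${\tt cb}00$ quoted for Figure 5; the concrete nodes and the function names `re_sign` and `recover_label` are assumptions, since the figure itself is not reproduced here.

```python
def sl(u):
    """Implicit suffix link: drop the first symbol, then re-encode (k = 1)."""
    return tuple(0 if isinstance(x, int) and x >= i else x
                 for i, x in enumerate(u[1:], start=1))


def re_sign(u, v):
    """Re-encoding sign Re(v) of v with parent u (0 if no position qualifies)."""
    for i in range(max(len(u) + 1, 2), len(v) + 1):     # |u| < i <= |v|
        if isinstance(v[i - 1], int) and v[i - 1] == i - 1:
            return i - len(u)
    return 0


def recover_label(u, v, reduced):
    """Rebuild str(u, v) from reduced = str(sl(u), sl(v)) using Lemma 1."""
    label = list(reduced)
    r = re_sign(u, v)
    if r >= 1:
        label[r - 1] = len(u) + r - 1       # undo the re-encoding at position Re(v)
    return tuple(label)


u = (0, "a")
v = (0, "a", "c", "b", 4, 0)
reduced = sl(v)[len(u) - 1:]                # str(sl(u), sl(v))
print(reduced)                              # ('c', 'b', 0, 0)
print(re_sign(u, v))                        # 3
print(recover_label(u, v, reduced))         # ('c', 'b', 4, 0)
```

In an actual PLST the sign $\mathsf{Re}(v)$ and the depth $|u|$ are stored at the nodes, so the recovery does not need $v$ itself; the sketch recomputes the sign from $v$ only to stay self-contained.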

If either $u$ or $v$ is bad in an edge $(u,v)$ with $|v|-|u|\geq 2$ , we give up “reduction by suffix links” and simply label the edge with the reference to the corresponding substring of the original text $T$ , like p-suffix trees. However, differently from p-suffix trees, not every part of the original text is referenced by an edge in our case. We keep only the subsequence $T^{\prime}$ of $T$ obtained by removing parts that are not referred to. We label an edge $(u,v)$ connected to a bad node with an integer triple $(i,j,k)$ such that ${\mathsf{str}({u},{v}})=\langle T^{\prime}\rangle_{k}[i:j]$ .

In summary, $\mathsf{PLST}(T)$ consists of three kinds of nodes: good Type 1, bad Type 1, and Type 2 (all good). If $u\in V$ is a good node, $u$ stores its depth, suffix link and re-encoding sign, i.e., the triple $(|u|,\mathsf{SL}(u),\mathsf{Re}(u))$ , where $\mathsf{SL}(u)=\mathsf{sl}(u)$ . Here we use the notation $\mathsf{SL}(u)$ to emphasize that the suffix link $\mathsf{SL}(u)$ is a pointer to the node corresponding to the string $\mathsf{sl}(u)$ rather than the string itself. Therefore, it requires only constant memory space. If $u\in V$ is bad, $u$ does not have a suffix link, i.e., $u$ has the triple $(|u|,\mathsf{null},\mathsf{Re}(u))$ . Each edge $(u,v)$ has either a label character or a triple; if both $u$ and $v$ are good or $|{\mathsf{str}({u},{v}})|=1$ , the edge label is ${\mathsf{str}({u},{v}})[1]$ . Otherwise, the edge label is a triple $(i,j,k)$ such that ${\mathsf{str}({u},{v}})=\langle T^{\prime}\rangle_{k}[i:j]$ . If some bad nodes appear in $\mathsf{PLST}(T)$ , we need the subsequence $T^{\prime}$ of $\mathsf{prev}({T})$ to recover the labels of edges connecting the bad nodes. Otherwise, we do not need any text.

We remark that another idea to overcome the problem of the absence of $\mathsf{sl}(u)$ in a PLST for a node $u$ might be to add $\mathsf{sl}^{i}(u)$ to $V$ for all $i=1,\dots,|u|$ so that $V$ is closed under $\mathsf{sl}$ , where $\mathsf{sl}^{i}(u)=\mathsf{sl}(\mathsf{sl}^{i-1}(u))$ and $\mathsf{sl}^{0}(u)=u$ . However, there exists a series of texts $T_{n}=\mathtt{x}_{1}\mathtt{a}_{1}\dots\mathtt{x}_{n}\mathtt{a}_{n}\mathtt{x}_{1}\mathtt{a}_{1}\dots\mathtt{x}_{n}\mathtt{a}_{n}\mathtt{y}_{1}\mathtt{a}_{1}\dots\mathtt{y}_{n}\mathtt{a}_{n}\mathtt{z}\texttt{\$}$ where $\mathtt{x}_{i},\mathtt{y}_{i},\mathtt{z}\in\Pi$ and $\mathtt{a}_{i}\in\Sigma$ for each $i$ , for which the number of those additional nodes will be $\Omega(|T_{n}|^{2})$ . Thus, the size of the index structure cannot be kept linear.

3.2 Parameterized pattern matching with PLSTs

This subsection presents our algorithm for solving the parameterized pattern matching problem as an application of PLSTs. The function P-Match of Algorithm 1 takes a prev-encoded string $p$ and a node $u$ in $\mathsf{PLST}(T)$ and checks whether there is $v\in\mathsf{PrevSub}(T)$ such that $p={\mathsf{str}({u},{v}})$ . If it is the case, it returns the least extension ${v^{\prime}}$ of $v$ such that $v^{\prime}\in V$ . In other words, $p$ is a prefix of ${\mathsf{str}({u},{v^{\prime}}})$ , where $v^{\prime}$ should be $v$ itself if $v\in V$ . Otherwise, it returns $\mathsf{null}$ .

For an input pair $(p,u)$ , if $p=\varepsilon$ , then P-Match returns $u$ , as required. Otherwise, it first tries to regain ${\mathsf{str}({u},{v}})$ for the $p[1]$ -child $v$ of $u$ , if $u$ has such a child. First, suppose $|p|\geq|v|-|u|=l$ . We would like to know whether $p[1:l]={\mathsf{str}({u},{v}})$ . If $l=1$ , it means that we have already confirmed that $p[1:l]={\mathsf{str}({u},{v}})$ . Then we just go down to $v$ and recursively call $\textsc{P-Match}(p[2:],v)$ . If $l\geq 2$ and either $u$ or $v$ is bad, we refer to $T^{\prime}$ and check if $p[1:l]={\mathsf{str}({u},{v}})$ as with matching in a p-suffix tree. If $l\geq 2$ and both $u$ and $v$ are good, we cannot know from the edge $(u,v)$ itself what ${\mathsf{str}({u},{v}})$ is except for its first symbol ${\mathsf{str}({u},{v}})[1]=p[1]$ . To recover the whole ${\mathsf{str}({u},{v}})$ , we use the suffix link of $u$ . Since $u$ is good, $\mathsf{SL}(u)$ is defined. If $\mathsf{Re}(v)=0$ , we have ${\mathsf{str}({u},{v}})={\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ by Lemma 1, and we simply call $\textsc{P-Match}(p[1:l],\mathsf{SL}(u))$ . Otherwise, we have $p[1:l]={\mathsf{str}({u},{v}})$ if and only if $p[\mathsf{Re}(v)]=|u|+\mathsf{Re}(v)-1$ and $p^{\prime}[1:l]={\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ , where

$$p^{\prime}[i]=\begin{cases}0 & \text{if } i=\mathsf{Re}(v),\\ p[i] & \text{otherwise}\end{cases}$$

for $i=1,\dots,|p|$ . Thus, the recursive call of $\textsc{P-Match}(p^{\prime}[1:l],\mathsf{SL}(u))$ returns $\mathsf{null}$ iff $p[1:l]\neq{\mathsf{str}({u},{v}})$ . If $\textsc{P-Match}(p^{\prime}[1:l],\mathsf{SL}(u))$ returns a node, then $p[1:l]={\mathsf{str}({u},{v}})$ and thus we continue matching by calling $\textsc{P-Match}(p[l+1:],v)$ .

The above discussion remains valid, with equality replaced by the prefix relation, when $|p|\leq|v|-|u|$ . If $\mathsf{Re}(v)=0$ or $\mathsf{Re}(v)>|p|$ , then $p$ is a prefix of ${\mathsf{str}({u},{v}})$ iff $p$ is a prefix of ${\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ . Otherwise, $p$ is a prefix of ${\mathsf{str}({u},{v}})$ iff $p[\mathsf{Re}(v)]=|u|+\mathsf{Re}(v)-1$ and $p^{\prime}$ is a prefix of ${\mathsf{str}({\mathsf{sl}(u)},{\mathsf{sl}(v)}})$ . Thus the recursion is justified. If $\textsc{P-Match}(p^{\prime}[1:l],\mathsf{SL}(u))$ returns a node, $p$ is a prefix of ${\mathsf{str}({u},{v}})$ and we call $\textsc{P-Match}(\varepsilon,v)$ , which returns $v$ .
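The pattern-side step used before each recursive call, namely testing $p[\mathsf{Re}(v)]$ and forming $p^{\prime}$, can be isolated as a small function. The Python sketch below is not Algorithm 1 itself: `reduce_pattern` is a hypothetical helper that takes the depth $|u|$ and the sign $\mathsf{Re}(v)$ as plain integers and returns either $p^{\prime}$ or `None` when the check already rules out a match.

```python
def reduce_pattern(p, u_len, re_v):
    """Return p' for the call on SL(u), or None if p cannot match str(u, v).

    p     : pattern fragment as a tuple of characters and integers
    u_len : depth |u| of the upper node of the edge (u, v)
    re_v  : re-encoding sign Re(v) stored at v
    """
    if re_v == 0 or re_v > len(p):
        return tuple(p)                      # nothing to undo within p
    if p[re_v - 1] != u_len + re_v - 1:      # p[Re(v)] must equal |u| + Re(v) - 1
        return None
    return tuple(p[:re_v - 1]) + (0,) + tuple(p[re_v:])


# Values matching the labels cb40 and cb00 quoted for Figure 5 (|u| = 2, Re(v) = 3).
print(reduce_pattern(("c", "b", 4, 0), 2, 3))   # ('c', 'b', 0, 0)
print(reduce_pattern(("c", "b", 3, 0), 2, 3))   # None
print(reduce_pattern(("c", "b"), 2, 3))         # ('c', 'b')
```

The third call shows the short-pattern case $\mathsf{Re}(v)>|p|$, in which $p$ is passed on unchanged.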

Proposition 1.

We can decide whether $T$ has a substring that p-matches $P$ using Algorithm 1.

The time complexity of Algorithm 1 is not linear as it is. Suppose that $\textsc{P-Match}(p,u)$ is called. It can be the case that $|v|-|u|\geq|p|\geq 2$ and either $\mathsf{Re}(v)=0$ or $\mathsf{Re}(v)>|p|$ , where $v=\mathsf{child}(u,p[1])$ . In this case, the algorithm simply calls $\textsc{P-Match}(p,\mathsf{SL}(u))$ , where the first argument has not changed from the preceding call. Such recursion may be repeated many times, and the resulting time complexity is not linear. The same difficulty and a solution have already been discussed by Crochemore et al. [6] for LSTs. Following them, we introduce fast links as follows, which allow us to skip recursions that always preserve the first argument.

Definition 4 (Fast link).

For each edge $(u,v)\in V\times V$ such that $|v|-|u|>1$ and both $u$ , $v$ are good, the fast link for $(u,v)$ is defined to be $\mathsf{FL}(u,v)=\mathsf{SL}^{k}(u)$ where $k\geq 1$ is the smallest integer satisfying either $|v_{k}|<|v|-k$ or $0<\mathsf{Re}(v_{k})$ , where $v_{k}=\mathsf{child}(\mathsf{SL}^{k}(u),a)$ for $a={\mathsf{str}({u},{v}})[1]$ .

Algorithm 1 will run in linear time by replacing $\mathsf{SL}(u)$ in Line 1 by $\mathsf{FL}(u,v)$ . If $|v_{k}|<|v|-k=|\mathsf{sl}^{k}(v)|$ , the node $v_{k}$ occurs between $\mathsf{sl}^{k}(u)$ and $\mathsf{sl}^{k}(v)$ . Then, $\textsc{P-Match}(p,\mathsf{SL}^{k}(u))$ will call $\textsc{P-Match}(p[1:|v_{k}|-|\mathsf{SL}^{k}(u)|],\mathsf{SL}^{k+1}(u))$ . When $0<\mathsf{Re}(v_{k})$ , we change the $\mathsf{Re}(v_{k})$ -th symbol of $p$ , which must be a positive integer, to $0$ . Therefore, the number of fast links we follow is bounded by $2|p|$ . Figure 5 shows how to p-match ${\mathsf{str}({u},{v}})$ and $p={\tt a}04{\tt b}$ using fast links. We know that $p[1]={\mathsf{str}({u},{v}})[1]={\tt a}$ . After following the fast link (1), we check whether $p[\mathsf{Re}(\mathsf{sl}^{2}(v))]=4$ and rewrite the value of $p[\mathsf{Re}(\mathsf{sl}^{2}(v))]$ to $0$ . After using (2), we check whether $p[3]=0$ . In this way, we learn that $p$ matches ${\mathsf{str}({u},{v}})$ .

Theorem 1.

Given $\mathsf{PLST}(T)$ and a pattern $P$ of length $m$ , we can decide whether $T$ has a substring that p-matches $P$ in $O(m)$ time.

3.3 The size of PLSTs

We now show that the size of $\mathsf{PLST}(T)$ is linear with respect to the length $n$ of a text $T$ . First, we show a linear upper bound on the number of nodes of $\mathsf{PLST}(T)$ . The nodes of Type 1 appear in the p-suffix tree, so they are at most $2n$ [3]. It is enough to show that the number of nodes of Type 2 is linearly bounded as well.

Lemma 2.

The number of Type 2 nodes in $\mathsf{PLST}(T)$ is smaller than $2n$ .

Proof.

Let us consider an implicit suffix link chain in $\mathsf{PSTrie}(T)$ starting from $w=\mathsf{prev}({T[:k]})$ with $1\leq k<n$ , i.e., $(w,\mathsf{sl}(w),\mathsf{sl}^{2}(w),\dots,\mathsf{sl}^{|w|}(w))$ . $\mathsf{PSTrie}(T)$ has $n-1$ such chains and every internal node of $\mathsf{PLST}(T)$ appears in at least one chain. If a chain has two distinct Type 2 nodes $\mathsf{sl}^{i}(w)$ and $\mathsf{sl}^{j}(w)$ with $i<j$ , since $\mathsf{sl}^{i+1}(w)$ is Type 1 by definition, one can always find a Type 1 node between them.

Define a binary relation $R$ between $V_{1}$ and $V_{2}$ by

[TABLE]

and let $R_{2}=\{\,v\in V_{2}\mid(u,v)\in R\text{ for some }u\in V_{1}\,\}$ . Since $R$ is a partial function from branching nodes to Type 2 nodes, we have $|R_{2}|\leq n$ . By the above argument on a chain, each chain has at most one Type 2 node $v\in V_{2}$ such that $v\notin R_{2}$ . Since there are $n-1$ chains, we have $|V_{2}\setminus R_{2}|<n$ . All in all, $|V_{2}|=|R_{2}|+|V_{2}\setminus R_{2}|<n+n=2n$ . ∎

The number of edges and their labels, as well as the number of suffix links, depth and re-encoding sign for nodes, is asymptotically bounded above by the number of nodes in $\mathsf{PLST}(T)$ . $T^{\prime}$ is a subsequence of $\mathsf{prev}({T})$ , thus its length is $O(n)$ . Therefore, the size of $\mathsf{PLST}(T)$ is $O(n)$ .

Theorem 2.

Given a p-string $T$ of length $n$ , the size of $\mathsf{PLST}(T)$ is $O(n)$ .
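As a quick check of the constants involved, the two node bounds from this section can be combined as follows, writing $V_{1}$ and $V_{2}$ for the sets of Type 1 and Type 2 nodes and assuming that every node is of exactly one of the two types:

$$|V_{1}|+|V_{2}| < 2n+2n = 4n .$$

Here $|V_{1}|\leq 2n$ is the p-suffix tree bound and $|V_{2}|<2n$ is Lemma 2.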

4 Experiments

We performed comparative experiments on the number of nodes of PLSTs and p-suffix trees for four kinds of text strings of varying length. The text strings we used are random strings over a constant alphabet $\Sigma$ with $|\Sigma|=2$ and those over a parameter alphabet $\Pi$ with $|\Pi|=2$ , and Fibonacci strings over $\Sigma$ with $|\Sigma|=2$ and those over $\Pi$ with $|\Pi|=2$ . PLSTs for constant strings are of course identical to LSTs. For random strings, we measured the average number of nodes for 100 strings of each length $n=10,\dots,10240$ . For Fibonacci strings, we measured the number of nodes for each of the 11th through 22nd Fibonacci strings. The results of our experiments are shown in Table 1. Recall that p-suffix trees consist of Type 1 nodes, while PLSTs have Type 2 nodes in addition. For random strings, we can see that the number of Type 2 nodes is close to the text length. On the other hand, for Fibonacci strings, PLSTs have few Type 2 nodes. In these experiments, since no bad node appeared except the root, PLSTs did not need any text, that is, $T^{\prime}=\epsilon$ .
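To see on toy inputs how the choice between constant and parameter alphabets affects node counts, one can enumerate all prev-encoded substrings of a short text and count the branching nodes of the resulting trie. The sketch below does only that: it is a cubic-time brute force, it counts branching nodes alone (no leaves, no Type 2 nodes, no end marker), and it is not the measurement procedure behind Table 1. All function names are assumptions.

```python
def prev_encode(w, params):
    """Prev-encoding of w; symbols in `params` are parameters, others constants."""
    last = {}
    out = []
    for i, c in enumerate(w):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return tuple(out)


def count_branching_nodes(text, params):
    """Count trie nodes with at least two children, where the trie holds the
    prev-encodings of all substrings of `text`."""
    children = {}
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            code = prev_encode(text[i:j + 1], params)
            # A prefix of a prev-encoding is the prev-encoding of the prefix,
            # so the parent of `code` in the trie is `code[:-1]`.
            children.setdefault(code[:-1], set()).add(code[-1])
    return sum(1 for kids in children.values() if len(kids) >= 2)
```

For the text `"aab"`, treating both symbols as constants gives two branching nodes, while treating both as parameters gives one, because the substrings `"a"` and `"b"` then share the encoding `(0,)`.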

Because the size of each node is the same in a p-suffix tree and a PLST, the difference in memory usage between the two data structures comes down to the memory needed for $\mathsf{prev}({T})$ in a p-suffix tree versus the memory needed for the Type 2 nodes (and $T^{\prime}$ if necessary) in a PLST. The experimental results suggest that PLSTs use less memory than p-suffix trees for indexing highly repetitive strings such as Fibonacci strings.

5 Conclusion and future work

In this paper, we presented an indexing structure called a PLST for the parameterized pattern matching problem. Given a p-string $T$ of length $n$ , the size of the PLST for ${T}$ is $O(n)$ . We presented an algorithm that solves the problem in $O(m)$ time, where $m$ is the length of the pattern. We experimentally showed that PLSTs are more space-efficient than p-suffix trees for indexing highly repetitive strings such as Fibonacci strings.

For PLSTs to be useful in various applications, such as computing longest common substrings, an efficient algorithm for constructing PLSTs is required, as is available for LSTs [onlineLST]. Furthermore, the ideas developed in this paper may be useful for generalizing L-CDAWGs [takagi2017linear] to a data structure for parameterized strings.

Appendix A Appendix

A.1 The implicit suffix link closure of branching nodes is too big

We show that the total number of nodes of the form $\mathsf{sl}^{j}(u)\in\mathsf{PrevSub}(T)$ for some $u\in V_{1}$ cannot be linearly bounded by $|T|$ . Let us consider a text

[TABLE]

where $\mathtt{x}_{i},\mathtt{y}_{i},\mathtt{z}\in\Pi$ and $\mathtt{a}_{i}\in\Sigma$ for each $i$ . Note that $|T_{n}|\in O(n)$ . Here

[TABLE]

is a Type 1 node, since $w_{i}0,w_{i}(2n)\in\mathsf{PrevSub}(T_{n})$ . Then the set $\{\,\mathsf{sl}^{j}(w_{i})\mid 1\leq i\leq n,\,0\leq j<2n\,\}$ has $2n^{2}$ elements. Therefore, we cannot keep our indexing structure in linear size. Figure 6 illustrates the case of $n=3$ , where twelve additional nodes are created.

A.2 Other experiments

We performed comparative experiments on the numbers of nodes of PLSTs and p-suffix trees for texts other than random and Fibonacci strings. The results of our experiments for Thue-Morse strings and Period-doubling strings are shown in Table 2. For Thue-Morse strings and Period-doubling strings, our data structure has only a limited number of additional nodes. The Fibonacci strings, Thue-Morse strings and Period-doubling strings are defined as follows.

The $k$ -th Fibonacci string $\mathit{Fib}_{k}$ is defined by the following recurrence:

[TABLE]

The $k$ -th Thue-Morse string can be obtained by applying the following homomorphism $\sigma$ to $\tt{a}$ $k$ times:

[TABLE]

The $k$ -th Period-doubling string can be obtained by applying the following homomorphism $\sigma$ to $\tt{a}$ $k$ times:

[TABLE]
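The three families of test strings can be generated with a few lines of code. The sketch below assumes the commonly used conventions $\mathit{Fib}_{1}={\tt b}$, $\mathit{Fib}_{2}={\tt a}$, $\mathit{Fib}_{k}=\mathit{Fib}_{k-1}\mathit{Fib}_{k-2}$, the Thue-Morse morphism ${\tt a}\mapsto{\tt ab}$, ${\tt b}\mapsto{\tt ba}$, and the period-doubling morphism ${\tt a}\mapsto{\tt ab}$, ${\tt b}\mapsto{\tt aa}$. These base cases and images are assumptions of this example and should be checked against the definitions used in the experiments; the function and constant names are likewise illustrative.

```python
def fibonacci_string(k):
    """k-th Fibonacci string under the assumed convention
    Fib_1 = "b", Fib_2 = "a", Fib_k = Fib_{k-1} + Fib_{k-2}."""
    if k < 1:
        raise ValueError("k must be at least 1")
    prev, cur = "b", "a"            # Fib_1, Fib_2
    if k == 1:
        return prev
    for _ in range(k - 2):
        prev, cur = cur, cur + prev
    return cur


def iterate_morphism(rules, seed, k):
    """Apply the homomorphism given by `rules` to `seed` k times."""
    s = seed
    for _ in range(k):
        s = "".join(rules[c] for c in s)
    return s


# Assumed standard morphisms (not taken from the paper's displayed formulas).
THUE_MORSE = {"a": "ab", "b": "ba"}
PERIOD_DOUBLING = {"a": "ab", "b": "aa"}
```

With these conventions, `fibonacci_string(5)` is `"abaab"` and `iterate_morphism(THUE_MORSE, "a", 3)` is `"abbabaab"`. Both morphisms double the length at each step, whereas Fibonacci strings grow by the Fibonacci recurrence.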

Bibliography

[1] Brenda S. Baker. A program for identifying duplicated code. Computing Science and Statistics, 24:49–57, 1992.
[2] Brenda S. Baker. A theory of parameterized pattern matching: algorithms and applications. In Proc. 25th annual ACM symposium on Theory of computing, pages 71–80, 1993. doi:10.1145/167088.167115.
[3] Brenda S. Baker. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences, 52(1):28–42, 1996. doi:10.1006/jcss.1996.0003.
[4] Brenda S. Baker. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM Journal on Computing, 26(5):1343–1362, 1997.
[5] M. Crochemore and W. Rytter. Jewels of Stringology: Text Algorithms. World Scientific, 2003.
[6] Maxime Crochemore, Chiara Epifanio, Roberto Grossi, and Filippo Mignosi. Linear-size suffix tries. Theoretical Computer Science, 638:171–178, 2016.
[7] Satoshi Deguchi, Fumihito Higashijima, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Parameterized suffix arrays for binary strings. In Proceedings of the Prague Stringology Conference 2008, pages 84–94, Czech Technical University in Prague, Czech Republic, 2008.
[8] Diptarama, Takashi Katsura, Yuhei Otomo, Kazuyuki Narisawa, and Ayumi Shinohara. Position heaps for parameterized strings. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017), pages 8:1–8:13, 2017.
