Direct Linear Time Construction of Parameterized Suffix and LCP Arrays for Constant Alphabets
Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

TL;DR
This paper introduces the first worst-case linear time algorithm for directly constructing parameterized suffix and LCP arrays for constant alphabets, improving efficiency over previous methods that were slower or required additional structures.
Contribution
It presents a novel linear time algorithm for directly computing parameterized suffix and LCP arrays for constant alphabets, eliminating the need for prior suffix tree construction.
Findings
Algorithm runs in O(nπ) time and O(n) space.
First worst-case linear time algorithm for this problem.
Applicable to strings over static and parameterized alphabets.
Abstract
We present the first worst-case linear time algorithm that directly computes the parameterized suffix and LCP arrays for constant sized alphabets. Previous algorithms either required quadratic time or the parameterized suffix tree to be built first. More formally, for a length-$n$ string over static alphabet $\Sigma$ and parameterized alphabet $\Pi$, our algorithm runs in $O(n\pi)$ time and $O(n)$ words of space, where $\pi$ is the number of distinct symbols of $\Pi$ in the string.
Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan
{noriki.fujisato,yuto.nakashima,inenaga,bannai,takeda}@inf.kyushu-u.ac.jp
Keywords:
parameterized pattern matching, parameterized suffix array, parameterized LCP array
1 Introduction
Parameterized pattern matching is one of the well studied “non-standard” pattern matching problems. It was initiated by Baker [1], in an application to find duplicated code where variable names may be renamed. In the parameterized matching problem, we consider strings over an alphabet partitioned into two sets: the parameterized alphabet $\Pi$ and the static alphabet $\Sigma$. Two strings $x, y \in (\Pi \cup \Sigma)^*$ of length $n$ are said to parameterized match (p-match), if one can be obtained from the other with a bijective mapping over symbols of $\Pi$, i.e., there exists a bijection $\phi: \Pi \to \Pi$ such that for all $1 \le i \le n$, $x[i] = y[i]$ if $x[i] \in \Sigma$, and $\phi(x[i]) = y[i]$ if $x[i] \in \Pi$. For example, if $\Pi = \{\mathtt{s}, \mathtt{t}\}$ and $\Sigma = \{\mathtt{a}\}$, strings $\mathtt{ssat}$ and $\mathtt{ttas}$ p-match, since we can choose $\phi(\mathtt{s}) = \mathtt{t}$ and $\phi(\mathtt{t}) = \mathtt{s}$, while strings $\mathtt{stas}$ and $\mathtt{ttas}$ do not p-match, since there is no such bijection on $\Pi$. As parameterized matching captures the “structure” of the string, it has also been extended to RNA structural matching [16].
Baker introduced the so-called prev encoding of a p-string, which maps each symbol of the p-string that is in $\Pi$ to the distance to its previous occurrence (or $0$ if it is the first occurrence), and showed that two p-strings p-match if and only if their prev encodings are equivalent. For example, the prev encodings for p-strings $\mathtt{ssat}$ and $\mathtt{ttas}$ are both $01\mathtt{a}0$. Thus, the parameterized matching problem amounts to efficiently comparing the prev encodings of the p-strings.
Using the prev encoding allows for the development of data structures that mimic those of standard strings. The central difficulty, in contrast with standard strings, is in coping with the following property of prev encodings; a substring of a prev encoding is not necessarily equivalent to the prev encoding of the corresponding substring.
Nevertheless, several data structures and algorithms have so far successfully been developed. Baker proposed the parameterized suffix tree (PST), an analogue of the suffix tree for standard strings [17], and showed how to build it for a string of length $n$ [2]. Using the PST of a text, all occurrences of substrings of the text that parameterized match a given pattern can be reported efficiently. Kosaraju [15] further improved the running time of construction. Furthermore, Shibuya [16] proposed an on-line algorithm for constructing the PST that runs in the same time bounds.
Deguchi et al. [5] proposed the parameterized suffix array (PSA). Given the PST of a string, the PSA can be constructed in linear time, but as in the case for standard strings, the direct construction of PSAs has been a topic of interest. Deguchi et al. [5] showed a linear time algorithm for the special case of binary strings. I et al. [11] proposed a lightweight and practically efficient algorithm for larger alphabets, but the worst-case time was still quadratic in $n$. Beal and Adjeroh [4] proposed an algorithm based on arithmetic coding and analyzed its average-case running time. Furthermore, they claimed a bound on the worst-case running time. However, the upper bound actually proved (Corollary 27 of [4]) is only slightly better than quadratic.
In this paper, we break the worst-case quadratic time barrier considerably, and present the first worst-case linear time algorithm for constructing the parameterized suffix and LCP arrays of a given p-string, when the number of distinct parameterized symbols in the string is constant. Namely, our algorithm runs in $O(n\pi)$ time and $O(n)$ words of space, where $\pi$ is the number of distinct symbols of $\Pi$ in the string.
Several other indices for parameterized pattern matching have been proposed. Diptarama et al. [6] and Fujisato et al. [8] proposed the parameterized position heap (PPH), an analogue of the position heap for standard strings [7], and showed how to build it and how to use it to report all occurrences of substrings of a text that parameterized match a given pattern. Parameterized BWTs have been proposed in [10]. Also, a parameterized text index with one wildcard was proposed in [9].
2 Preliminaries
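For any set $A$ of symbols, $A^*$ denotes the set of strings over the alphabet $A$. Let $|x|$ denote the length of a string $x$. The empty string is denoted by $\varepsilon$. For any string $x \in A^*$, if $x = uvw$ for some (possibly empty) $u, v, w \in A^*$, then $u, v, w$ are respectively called a prefix, substring, suffix of $x$. When $u, v, w \neq x$, they are respectively called a proper prefix, substring, and suffix of $x$. For any integer $1 \le i \le |x|$, $x[i]$ denotes the $i$th symbol in $x$, and for any $1 \le i \le j \le |x|$, $x[i..j] = x[i] \cdots x[j]$. Let $\prec$ denote a total order on $A$, as well as the lexicographic order it induces. For two strings $x, y \in A^*$, $x \prec y$ if and only if $x$ is a proper prefix of $y$, or there is some position $1 \le k \le \min\{|x|, |y|\}$ such that $x[1..k-1] = y[1..k-1]$ and $x[k] \prec y[k]$.
Let $\Pi$ and $\Sigma$ denote disjoint sets of symbols. $\Pi$ is called the parameterized alphabet, and $\Sigma$ is called the static alphabet. A string in $(\Pi \cup \Sigma)^*$ is sometimes called a p-string. Two p-strings $x, y$ of equal length are said to parameterized match, denoted $x \approx y$, if there exists a bijection $\phi: \Pi \to \Pi$, such that for all $1 \le i \le |x|$, $x[i] = y[i]$ if $x[i] \in \Sigma$, and $\phi(x[i]) = y[i]$ if $x[i] \in \Pi$.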
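The prev encoding $\mathit{prev}(p)$ of a p-string $p$ of length $n$ is the string of length $n$ over the alphabet $\Sigma \cup \{0, \ldots, n-1\}$ defined as follows:
$$\mathit{prev}(p)[i] = \begin{cases} p[i] & \text{if } p[i] \in \Sigma, \\ 0 & \text{if } p[i] \in \Pi \text{ and } p[i] \neq p[j] \text{ for any } 1 \le j < i, \\ i - j & \text{if } p[i] \in \Pi,\ p[i] = p[j] \text{ for some } 1 \le j < i, \text{ and } p[i] \neq p[k] \text{ for any } j < k < i. \end{cases}$$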
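For example, if $\Pi = \{\mathtt{s}, \mathtt{t}\}$, $\Sigma = \{\mathtt{a}\}$, and p-string $p = \mathtt{sasts}$, then $\mathit{prev}(p) = 0\mathtt{a}202$. Baker showed that $x \approx y$ if and only if $\mathit{prev}(x) = \mathit{prev}(y)$ [3]. We assume that $\Pi$ and $\Sigma$ are disjoint integer alphabets, where $\Pi = \{0, \ldots, n^{c_1}\}$ for some constant $c_1 \ge 1$ and $\Sigma = \{n^{c_1}+1, \ldots, n^{c_2}\}$ for some constant $c_2 > c_1$. This way, we can distinguish whether a symbol of a given prev encoding belongs to $\Sigma$ or not. Also, given a p-string $p$ of length $n$, we can compute $\mathit{prev}(p)$ in $O(n)$ time and space, by sorting the pairs $(p[i], i)$ using radix sort, followed by a simple scan of the result.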
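The prev encoding and the resulting p-match test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: p-strings are Python strings, the parameterized alphabet is passed as a set `params` (a hypothetical argument name), and a dictionary of last positions replaces the radix sort mentioned above, so the linear bound holds in expectation rather than in the worst case.

```python
def prev_encoding(p, params):
    """Return prev(p) as a list: static symbols are kept as-is, parameter
    symbols become the distance to their previous occurrence (0 if none)."""
    last = {}   # last seen 1-based position of each parameter symbol
    out = []
    for i, c in enumerate(p, start=1):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def p_match(x, y, params):
    """Two p-strings p-match iff their prev encodings are equal (Baker)."""
    return len(x) == len(y) and prev_encoding(x, params) == prev_encoding(y, params)
```

For instance, with `params = {"x", "y"}`, the strings `"xxay"` and `"yyax"` share the encoding `[0, 1, "a", 0]` and therefore p-match.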
The following are the data structures that we consider in this paper.
Definition 1 (Parameterized Suffix Array [5])
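The parameterized suffix array of a p-string $p$ of length $n$ is an array $\mathit{PSA}[1..n]$ of integers such that $\mathit{PSA}[i] = j$ if and only if $\mathit{prev}(p[j..n])$ is the $i$th lexicographically smallest string in $\{\mathit{prev}(p[k..n]) \mid 1 \le k \le n\}$.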
Definition 2 (Parameterized LCP Array [5])
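The parameterized LCP array of a p-string $p$ of length $n$ is an array $\mathit{pLCP}[1..n]$ of integers such that $\mathit{pLCP}[1] = 0$, and $\mathit{pLCP}[i]$, for any $2 \le i \le n$, is the length of the longest common prefix between $\mathit{prev}(p[\mathit{PSA}[i-1]..n])$ and $\mathit{prev}(p[\mathit{PSA}[i]..n])$.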
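As a reference point for the two definitions, the following brute-force Python sketch builds both arrays by encoding every suffix separately and sorting. It takes quadratic time or worse and is not the algorithm of this paper. It assumes that prev values (integers) sort before static symbols, which is modelled here by tagging each entry; the function names are hypothetical.

```python
def prev_encoding(p, params):
    last, out = {}, []
    for i, c in enumerate(p):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def sort_key(enc):
    # Assumed symbol order: integers (prev values) before static symbols.
    return [(0, v) if isinstance(v, int) else (1, v) for v in enc]


def naive_psa_plcp(p, params):
    """Return (PSA, pLCP) as Python lists holding 1-based suffix positions."""
    n = len(p)
    encs = {i: sort_key(prev_encoding(p[i - 1:], params)) for i in range(1, n + 1)}
    psa = sorted(range(1, n + 1), key=lambda i: encs[i])
    plcp = [0] * n
    for r in range(1, n):
        a, b = encs[psa[r - 1]], encs[psa[r]]
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        plcp[r] = length
    return psa, plcp
```

Such a routine is mainly useful as an oracle when testing a faster construction on small inputs.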
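The difficulty when dealing with the prev encodings of suffixes of a string is that they are not necessarily the suffixes of the prev encoding of the string. It is important to notice, however, that given the prev encoding $\mathit{prev}(p)$ of the whole string $p$, any specific value of the prev encoding of an arbitrary suffix of $p$ can be retrieved in constant time, i.e., for any $1 \le i \le n$ and $1 \le j \le n - i + 1$,
$$\mathit{prev}(p[i..n])[j] = \begin{cases} 0 & \text{if } \mathit{prev}(p)[k] \notin \Sigma \text{ and } \mathit{prev}(p)[k] \ge j, \\ \mathit{prev}(p)[k] & \text{otherwise,} \end{cases}$$
where $k = i + j - 1$. The critical problem for suffix sorting is that even if two prev encodings $\mathit{prev}(p[i..n])$ and $\mathit{prev}(p[j..n])$ share a common prefix and satisfy $\mathit{prev}(p[i..n]) \prec \mathit{prev}(p[j..n])$, it may still be that $\mathit{prev}(p[i+1..n]) \succ \mathit{prev}(p[j+1..n])$.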
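The constant-time retrieval rule can be checked with a small Python sketch. It assumes the list representation used for illustration here (integers for prev values, one-character strings for static symbols) and 1-based arguments `i` and `j`; `suffix_prev_at` is a hypothetical helper name.

```python
def prev_encoding(p, params):
    last, out = {}, []
    for i, c in enumerate(p):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def suffix_prev_at(prev_p, i, j):
    """Value at 1-based position j of prev(p[i..n]), read off prev(p) alone."""
    v = prev_p[i + j - 2]          # prev(p)[k] with k = i + j - 1
    if isinstance(v, int) and v >= j:
        return 0                   # the previous occurrence lies before position i
    return v
```

A distance of at least `j` points to a position before the start of the suffix, which is exactly when the symbol becomes a first occurrence in the suffix.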
Fig. 1 shows an example of and for the string . For example, we have that , which share a common prefix of length , yet .
3 Algorithms
In this section we describe our algorithms for constructing the parameterized suffix and LCP arrays. First, we mention a simple observation below.
From the definition of $\mathit{prev}$, we have that $\mathit{prev}(p)[i] = 0$ for some position $i$ if and only if $i$ is the first occurrence of symbol $p[i] \in \Pi$. Therefore, the following observation can be made.
Observation 1
For any p-string $p$, the prev encoding of any substring of $p$ contains at most $\pi$ positions that are $0$'s, where $\pi$ is the number of distinct symbols of $\Pi$ in $p$.
3.1 PSA Construction
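Based on this observation, we can see that the prev encoding of each suffix $p[i..n]$ can be partitioned into $z_i + 1$ blocks, where $z_i \le \pi$ is the number of $0$'s in $\mathit{prev}(p[i..n])$, the $j$th block is the substring of $\mathit{prev}(p[i..n])$ that ends at the $j$th $0$ in $\mathit{prev}(p[i..n])$ for $1 \le j \le z_i$, and the (possibly empty) remaining suffix for $j = z_i + 1$. For technical reasons, we will append $0$ to the last block as well. That is, we can write
$$\mathit{prev}(p[i..n])\,0 = B_{i,1} B_{i,2} \cdots B_{i,z_i+1},$$
where $B_{i,j}$ denotes the $j$th block of $\mathit{prev}(p[i..n])0$. Furthermore, for each $1 \le j \le \pi + 1$, let $\mathcal{B}_j$ denote the set of all $j$th blocks $B_{i,j}$ for all $1 \le i \le n$, and let $r_{i,j}$ denote the lexicographic rank of $B_{i,j}$ in $\mathcal{B}_j$. Finally, let $C_i$ denote the string over the alphabet $\{1, \ldots, n\}$ obtained by renaming each block $B_{i,j}$ of the string $\mathit{prev}(p[i..n])0$ with its lexicographic rank $r_{i,j}$. More formally,
$$C_i = r_{i,1} r_{i,2} \cdots r_{i,z_i+1}.$$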
Lemma 1
For any $1 \le i, i' \le n$,
$$\mathit{prev}(p[i..n]) \prec \mathit{prev}(p[i'..n]) \iff C_i \prec C_{i'}.$$
Proof
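Notice that $0$ is the smallest symbol in the two strings, so
$$\mathit{prev}(p[i..n]) \prec \mathit{prev}(p[i'..n]) \iff \mathit{prev}(p[i..n])0 \prec \mathit{prev}(p[i'..n])0.$$
Also notice that since any block must end with a $0$, if two blocks are not identical, it holds that one cannot be a prefix of the other. Thus, if $\mathit{prev}(p[i..n])0 \prec \mathit{prev}(p[i'..n])0$, this implies that there is some block index $k$ such that $B_{i,j} = B_{i',j}$ for all $1 \le j < k$, and $B_{i,k} \prec B_{i',k}$, where $B_{i,k}$ is not a prefix of $B_{i',k}$. By definition, $r_{i,j} = r_{i',j}$ for all $1 \le j < k$ and $r_{i,k} < r_{i',k}$. Therefore, we have $C_i \prec C_{i'}$. ∎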
From Lemma 1, the problem of lexicographically sorting the set of strings $\{\mathit{prev}(p[i..n]) \mid 1 \le i \le n\}$ reduces to the problem of lexicographically sorting the set of strings $\{C_i \mid 1 \le i \le n\}$. The latter can be done in $O(n\pi)$ time using radix sort, since the strings are over the alphabet $\{1, \ldots, n\}$ and the total length of the strings is at most $n(\pi + 1)$.
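The reduction of Lemma 1 can be illustrated with the following Python sketch, which cuts each encoded suffix into blocks ending at a 0, renames blocks by their rank among all blocks with the same index, and sorts the renamed strings. It is deliberately naive: blocks are materialised explicitly and ranked with Python's built-in sort, so it runs in quadratic time or worse, unlike the construction described in this section. The tagging in `as_key` encodes the assumption that prev values sort before static symbols, and all function names are hypothetical.

```python
def prev_encoding(p, params):
    last, out = {}, []
    for i, c in enumerate(p):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def as_key(seq):
    # Assumed symbol order: integers (prev values) before static symbols.
    return tuple((0, v) if isinstance(v, int) else (1, v) for v in seq)


def blocks(enc):
    """Split enc followed by an appended 0 into blocks that each end at a 0."""
    out, cur = [], []
    for v in enc + [0]:
        cur.append(v)
        if isinstance(v, int) and v == 0:
            out.append(as_key(cur))
            cur = []
    return out


def renamed_strings(p, params):
    """Return the rank strings, one per suffix, as lists of 1-based ranks."""
    per_suffix = [blocks(prev_encoding(p[i:], params)) for i in range(len(p))]
    width = max(len(b) for b in per_suffix)
    tables = []
    for j in range(width):
        distinct = sorted({b[j] for b in per_suffix if j < len(b)})
        tables.append({blk: r + 1 for r, blk in enumerate(distinct)})
    return [[tables[j][blk] for j, blk in enumerate(b)] for b in per_suffix]


def psa_from_blocks(p, params):
    c = renamed_strings(p, params)
    return sorted(range(1, len(p) + 1), key=lambda i: c[i - 1])


def psa_naive(p, params):
    return sorted(range(1, len(p) + 1),
                  key=lambda i: as_key(prev_encoding(p[i - 1:], params)))
```

On small inputs, `psa_from_blocks` and `psa_naive` can be compared directly to see that sorting the renamed strings yields the same order as sorting the prev encodings themselves.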
What remains is to compute $r_{i,j}$ for all $i, j$ in the same time bound. A problem is that the total length of all $B_{i,j}$ is $\Theta(n^2)$, so we cannot afford to naively process all of them.
Denote by $b_{i,j}$ and $e_{i,j}$ the beginning and end positions of $B_{i,j}$ with respect to their (global) position in $p$. Note that for any $i$, we have $b_{i,1} = i$, and $b_{i,j+1} = e_{i,j} + 1$ for all $1 \le j \le z_i$. Our algorithm depends on the following simple yet crucial lemma.
Lemma 2
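For any $1 \le i < n$ and $1 \le j \le z_{i+1} + 1$, we have that either
1. $e_{i,j} < b_{i+1,j}$, or,
2. $b_{i,j} \le b_{i+1,j}$, $e_{i,j} = e_{i+1,j}$, and $B_{i+1,j}$ is a suffix of $B_{i,j}$
holds.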
Proof
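If $p[i] \in \Sigma$, then $\mathit{prev}(p[i+1..n])$ is a suffix of $\mathit{prev}(p[i..n])$, i.e., $b_{i+1,1} = b_{i,1} + 1$ and $e_{i+1,1} = e_{i,1}$. Thus, $B_{i+1,1}$ is a suffix of $B_{i,1}$, and $B_{i+1,j} = B_{i,j}$ for all $j \ge 2$, and the second case of the claim holds.
If $p[i] \in \Pi$, the values in $\mathit{prev}(p[i+1..n])$ are equivalent to the corresponding values of $\mathit{prev}(p[i..n])$, except possibly at some (global) position $k$ when there is a second occurrence of the symbol $p[i]$ at $k$, which becomes the first occurrence in $p[i+1..n]$. (In other words, the value corresponding to $p[k]$ in $\mathit{prev}(p[i..n])$ is $k - i$.) Since there is no previous occurrence of $p[k]$ in $p[i+1..n]$, $\mathit{prev}(p[i+1..n])[k - i] = 0$. The situation is depicted in Fig. 2.
Let $B_{i,j'}$ be the block of $\mathit{prev}(p[i..n])$ that contains (global) position $k$. Because, as mentioned previously, $\mathit{prev}(p[i..n])$ and $\mathit{prev}(p[i+1..n])$ are equivalent except for the value corresponding to (global) position $k$, the block structure of $\mathit{prev}(p[i..n])$ is preserved in $\mathit{prev}(p[i+1..n])$, except that (1) the first block $B_{i,1}$ disappears, and (2) the block $B_{i,j'}$ is split into two blocks, corresponding to $B_{i+1,j'-1}$ and $B_{i+1,j'}$. Therefore, the first case of the claim is satisfied for $j < j'$, since $b_{i+1,j} = b_{i,j+1} > e_{i,j}$ for any $j < j'$. Also, we can see that the second case of the claim is satisfied for $j \ge j'$, since $B_{i+1,j'}$ is a suffix of $B_{i,j'}$, and $B_{i+1,j} = B_{i,j}$ for $j > j'$.
Finally, the case when such $k$ does not exist can be considered to be included above by simply assuming we are looking at a prefix of a longer string and $k > n$, since the prev encoding is preserved for prefixes, i.e., the prev encoding of a prefix of any p-string $p$ is equivalent to the corresponding prefix of the prev encoding of $p$. Thus, the lemma holds. ∎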
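The dichotomy between consecutive suffixes can be checked by brute force on small inputs. The Python sketch below computes, for suffixes $i$ and $i+1$, the global begin and end positions of every block (treating the appended 0 as sitting at position $n+1$) and tests that, for each block index present in both, either the block of suffix $i+1$ starts after the block of suffix $i$ ends, or both end at the same position and the later block is a suffix of the earlier one. The function names are hypothetical, and the check is an illustration under this reading of the two cases rather than part of the algorithm.

```python
def prev_encoding(p, params):
    last, out = {}, []
    for i, c in enumerate(p):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def block_bounds(p, params, i):
    """Global 1-based (begin, end) of each block of prev(p[i..n]) followed by 0."""
    n, seen, bounds, start = len(p), set(), [], i
    for g in range(i, n + 1):
        c = p[g - 1]
        if c in params and c not in seen:   # first occurrence in the suffix: a 0
            seen.add(c)
            bounds.append((start, g))
            start = g + 1
    bounds.append((start, n + 1))           # last block, closed by the appended 0
    return bounds


def check_dichotomy(p, params):
    n = len(p)
    for i in range(1, n):
        enc_a = prev_encoding(p[i - 1:], params) + [0]
        enc_b = prev_encoding(p[i:], params) + [0]
        pairs = zip(block_bounds(p, params, i), block_bounds(p, params, i + 1))
        for (b1, e1), (b2, e2) in pairs:
            blk_a = enc_a[b1 - i: e1 - i + 1]
            blk_b = enc_b[b2 - i - 1: e2 - i]
            case1 = e1 < b2
            case2 = (b1 <= b2 and e1 == e2
                     and blk_a[len(blk_a) - len(blk_b):] == blk_b)
            if not (case1 or case2):
                return False
    return True
```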
Lemma 2 implies that if we fix some $j$, we can represent $B_{i,j}$ for all $i$ as suffixes (in the standard sense) of strings of total length $O(n)$.
Corollary 1
For any $1 \le j \le \pi + 1$, there exists a set $S_j$ of strings with total length $O(n)$ over the alphabet $\Sigma \cup \{0, \ldots, n-1\}$ such that $B_{i,j}$ is a suffix of some string in $S_j$ for all $i$.
Proof
We include $B_{i,j}$ in $S_j$ if $i = 1$, or if $i > 1$ and $B_{i,j}$ satisfies the first case of Lemma 2 with respect to $B_{i-1,j}$. Since the first case implies that the (global) positions $[b_{i-1,j}..e_{i-1,j}]$ and $[b_{i,j}..e_{i,j}]$ are disjoint, the total length of strings in $S_j$ is at most $n + 1$ (including the $0$ appended to the last block). On the other hand, if $B_{i,j}$ satisfies the second case, it is a suffix of an already included string. ∎
Thus, computing $r_{i,j}$ for all $i$ can be done by computing the generalized suffix array for the set $S_j$. This can be done in $O(n)$ time given $S_j$ [13, 14, 12], and thus, for all $1 \le j \le \pi + 1$, the total is $O(n\pi)$ time.
Theorem 3.1
The parameterized suffix array of a p-string of length $n$ can be computed in $O(n\pi)$ time and $O(n)$ space.
Proof
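We compute a forward encoding $\mathit{fwd}(p)$ of $p$, analogous to the prev encoding, defined as follows:
$$\mathit{fwd}(p)[i] = \begin{cases} p[i] & \text{if } p[i] \in \Sigma, \\ \infty & \text{if } p[i] \in \Pi \text{ and } p[i] \neq p[j] \text{ for any } i < j \le n, \\ j - i & \text{if } p[i] \in \Pi,\ p[i] = p[j] \text{ for some } i < j \le n, \text{ and } p[i] \neq p[k] \text{ for any } i < k < j. \end{cases}$$
This is done once, and can be computed in $O(n)$ time. Next, for any fixed $j$, we show how to compute the set $S_j$ in linear time. This is done by using $\mathit{fwd}(p)$ and Lemma 2. We can first scan $\mathit{prev}(p)$ to obtain $B_{1,j}$. Suppose for some $i \ge 1$, we know the beginning and end positions $b_{i,j}$, $e_{i,j}$ of $B_{i,j}$. Notice that when $p[i] \in \Pi$, $k$ in the proof of Lemma 2 is $i + \mathit{fwd}(p)[i]$. Based on this value, we know that if $k \le e_{i,j}$, then $e_{i+1,j} = e_{i,j}$ and $B_{i+1,j}$ is a suffix of $B_{i,j}$, which corresponds to the second case of Lemma 2. When $k > e_{i,j}$, this corresponds to the first case of Lemma 2, so we scan $\mathit{prev}(p[i+1..n])$ starting from the position corresponding to the global position $e_{i,j} + 1$ (i.e., position $e_{i,j} - i + 1$ in $\mathit{prev}(p[i+1..n])$) until we find the first $0$, which gives us $B_{i+1,j}$, which we include in $S_j$. Since we only scan each position once, the total time for computing $S_j$ is $O(n)$.
The time complexity follows from arguments for sorting $C_i$ based on radix sort. Since, for a single step of the radix sort, we only require the values $r_{i,j}$ for a fixed $j$ and all $i$, and from Corollary 1, the space complexity is $O(n)$. ∎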
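The incremental scan used in this proof can be sketched in Python as follows. For a fixed block index `j`, the hypothetical function `jth_block_bounds` derives the global begin and end positions of the `j`th block of each suffix from those of the previous suffix, using a forward encoding to decide whether the block keeps its end position or has to be re-scanned. The representation is an assumption made for illustration: integers for prev and forward distances, `math.inf` when a symbol has no later occurrence, one-character strings for static symbols, and the appended 0 placed at position $n+1$. Only the block boundaries are produced; building $S_j$ and its generalized suffix array is not shown.

```python
import math


def prev_encoding(p, params):
    last, out = {}, []
    for i, c in enumerate(p):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def fwd_encoding(p, params):
    """Distance to the next occurrence of each parameter symbol (inf if none)."""
    nxt, out = {}, [None] * len(p)
    for i in range(len(p) - 1, -1, -1):
        c = p[i]
        if c in params:
            out[i] = nxt[c] - i if c in nxt else math.inf
            nxt[c] = i
        else:
            out[i] = c
    return out


def jth_block_bounds(p, params, j):
    """Global 1-based (begin, end) of the j-th block for suffixes 1, 2, ...,
    stopping at the first suffix that has fewer than j blocks."""
    n = len(p)
    prev_p, fwd_p = prev_encoding(p, params), fwd_encoding(p, params)

    def is_zero(i, g):
        # Is the value at global position g of prev(p[i..n]) followed by 0 a 0?
        if g == n + 1:
            return True                      # the appended 0
        v = prev_p[g - 1]
        return isinstance(v, int) and (v == 0 or v >= g - i + 1)

    def scan(i, b):
        e = b
        while not is_zero(i, e):
            e += 1
        return e

    b, e = 1, scan(1, 1)                     # block 1 of the first suffix
    for _ in range(j - 1):                   # advance to block j of the first suffix
        b = e + 1
        if b > n + 1:
            return []
        e = scan(1, b)
    bounds = [(b, e)]
    for i in range(1, n):                    # derive suffix i + 1 from suffix i
        c = p[i - 1]
        if c not in params:
            b = max(b, i + 1)                # same end; block may lose its first symbol
        elif i + fwd_p[i - 1] <= e:
            b = max(b, i + fwd_p[i - 1] + 1) # same end; block is a suffix of the old one
        else:
            b = e + 1                        # block moves to the right: re-scan
            if b > n + 1:
                break
            e = scan(i + 1, b)
        bounds.append((b, e))
    return bounds
```

Because each re-scan starts right after the previous end position, the end pointer only moves forward, which is what keeps the work linear per block index in this sketch.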
3.2 pLCP Construction
Given $\mathit{PSA}$, we can construct $\mathit{pLCP}$ as follows in $O(n\pi)$ time and $O(n)$ space. We recompute $S_j$ for $j = 1, \ldots, \pi + 1$, and each time process it for LCE queries, so that the longest common prefix between $B_{i,j}$ and $B_{i',j}$ for any $i, i'$ can be computed in constant time. This can be done in time linear in the total length of $S_j$, so in $O(n\pi)$ total time for all $j$. We compute the longest common prefix between each pair of adjacent suffixes in $\mathit{PSA}$ block by block. Since each block takes constant time, and there are $O(\pi)$ blocks for each suffix, the total is $O(n\pi)$ time for all entries of the $\mathit{pLCP}$ array. The space complexity is $O(n)$ since, as for the case of $\mathit{PSA}$ construction, we only process the $j$th block at each step.
References
[1] Baker, B.S.: A program for identifying duplicated code. Computing Science and Statistics 24, 49–57 (1992)
[2] Baker, B.S.: Parameterized pattern matching: Algorithms and applications. J. Comput. Syst. Sci. 52(1), 28–42 (1996). https://doi.org/10.1006/jcss.1996.0003
[3] Baker, B.S.: Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. Comput. 26(5), 1343–1362 (1997). https://doi.org/10.1137/S0097539793246707
[4] Beal, R., Adjeroh, D.A.: p-suffix sorting as arithmetic coding. J. Discrete Algorithms 16, 151–169 (2012). https://doi.org/10.1016/j.jda.2012.05.001
[5] Deguchi, S., Higashijima, F., Bannai, H., Inenaga, S., Takeda, M.: Parameterized suffix arrays for binary strings. In: Holub, J., Zdárek, J. (eds.) Proceedings of the Prague Stringology Conference 2008, Prague, Czech Republic, September 1-3, 2008. pp. 84–94. Prague Stringology Club, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague (2008), http://www.stringology.org/event/2008/p08.html
[6] Diptarama, Katsura, T., Otomo, Y., Narisawa, K., Shinohara, A.: Position heaps for parameterized strings. In: Kärkkäinen, J., Radoszewski, J., Rytter, W. (eds.) 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, July 4-6, 2017, Warsaw, Poland. LIPIcs, vol. 78, pp. 8:1–8:13. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2017). https://doi.org/10.4230/LIPIcs.CPM.2017.8
[7] Ehrenfeucht, A., McConnell, R.M., Osheim, N., Woo, S.W.: Position heaps: A simple and dynamic text indexing data structure. Journal of Discrete Algorithms 9(1), 100–121 (2011). https://doi.org/10.1016/j.jda.2010.12.001, http://www.sciencedirect.com/science/article/pii/S1570866710000535, 20th Anniversary Edition of the Annual Symposium on Combinatorial Pattern Matching (CPM 2009)
[8] Fujisato, N., Nakashima, Y., Inenaga, S., Bannai, H., Takeda, M.: Right-to-left online construction of parameterized position heaps. CoRR abs/1808.01071 (2018), http://arxiv.org/abs/1808.01071
