Indexing Weighted Sequences: Neat and Efficient
Carl Barton, Tomasz Kociumaka, Chang Liu, Solon P. Pissis, and Jakub Radoszewski

TL;DR
This paper introduces a simple, efficient indexing method for weighted sequences that enables fast pattern matching queries, improving upon previous approaches in complexity and simplicity.
Contribution
A novel, straightforward construction of a weighted sequence index that matches the best query time and reduces complexity compared to prior work.
Findings
Constructed an $O(nz)$-sized index for weighted sequences.
Achieved optimal $O(m+Occ)$ query time.
Improved space and complexity over previous methods.
Carl Barton
European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
Tomasz Kociumaka
Institute of Informatics, University of Warsaw, Warsaw, Poland
[kociumaka,jrad]@mimuw.edu.pl
Chang Liu
Department of Informatics, King’s College London, London, UK
[chang.2.liu,solon.pissis]@kcl.ac.uk
Solon P. Pissis
Department of Informatics, King’s College London, London, UK
[chang.2.liu,solon.pissis]@kcl.ac.uk
Jakub Radoszewski
Institute of Informatics, University of Warsaw, Warsaw, Poland
[kociumaka,jrad]@mimuw.edu.pl
Abstract
In a weighted sequence, for every position of the sequence and every letter of the alphabet a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example, in molecular biology where they are known under the name of Position-Weight Matrices. Given a probability threshold $\frac{1}{z}$, we say that a string $P$ of length $m$ occurs in a weighted sequence $X$ at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+m-1$ in $X$ is at least $\frac{1}{z}$. In this article, we consider an indexing variant of the problem, in which we are to preprocess a weighted sequence to answer multiple pattern matching queries. We present an $O(nz)$-time construction of an $O(nz)$-sized index for a weighted sequence of length $n$ over a constant-sized alphabet that answers pattern matching queries in optimal $O(m+Occ)$ time, where $Occ$ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of $\lfloor z\rfloor$ special strings that carries the information about all the strings that occur in the weighted sequence with a sufficient probability. We obtain a weighted index with the same complexities as in the most efficient previously known index by Barton et al. [3], but our construction is significantly simpler. The most complex algorithmic tool required in the basic form of our index is the suffix tree which we use to develop a new, more straightforward index for the so-called property matching problem. We provide an implementation of our data structure. Our construction allows us also to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. [6] and an improvement of the space complexity of their general index.
1 Introduction
We consider a type of uncertain sequence called a weighted sequence. In a weighted sequence every position contains a subset of the alphabet and every letter of the alphabet is associated with a probability of occurrence such that the sum of probabilities at each position equals 1.
Weighted sequences are common in a wide range of applications: (i) data obtained from imprecise sensor measurements; (ii) flexible sequence modelling, such as binding profiles of DNA sequences; (iii) observations that are private and thus sequences of observations may have artificial uncertainty introduced deliberately (see [1] for a survey). Pattern matching (or substring matching) is a core operation in a wide variety of applications including genome assembly, computer virus detection, database search and short read alignment. Many of the applications of pattern matching generalise immediately to the weighted case as much of this data is more commonly uncertain (e.g. reads with quality scores) than certain. In particular probabilistic databases have been a very active area of research in recent years; see e.g. [8]. A common assumption in practice is that the alphabet of weighted sequences is constant since the most commonly studied alphabet is $\Sigma=\{\mathtt{A},\mathtt{C},\mathtt{G},\mathtt{T}\}$.
In the Weighted Pattern Matching (WPM) problem we are given a string $P$ called a pattern, a weighted sequence $X$ called a text, both over an alphabet $\Sigma$, and a threshold probability $\frac{1}{z}$. The task is to find all positions $i$ in $X$ where the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in $X$ is at least $\frac{1}{z}$. Each such position is called an occurrence of the pattern; we also say that the fragment and the pattern match.
In this article, we consider the indexing (or off-line) version of the WPM problem, called Weighted Indexing. Here we are given a text being a weighted sequence and we are asked to construct a data structure (called an index) to provide efficient operations for answering WPM queries related to the text. We also consider other variants of the indexing problem. In the Approximate Weighted Indexing problem, given a pattern $P$ and a threshold $\frac{1}{z'}$, we are to report all occurrences of the pattern with probability at least $\frac{1}{z'}$ but we may also report additional occurrences with probability at least $\frac{1}{z'}-\epsilon$, for a pre-selected value of $\epsilon>0$. In the Generalised Weighted Indexing problem we are to construct a data structure that allows for WPM queries to be answered for any threshold $\frac{1}{z'}$ with $z'\le z$.
A problem that is known to be closely related to the Weighted Indexing problem is Property Indexing. In this problem, we are given a string $S$ called the text and a hereditary property $\Pi$, which is a family of integer intervals contained in $\{1,\ldots,|S|\}$ (hereditary means that it is closed under subintervals). Our goal is to preprocess the text so that, for a query string $P$, we can report all occurrences of $P$ in $S$ which, interpreted as intervals, belong to $\Pi$. The property can be represented in $O(n)$ space using an array $\pi[1..n]$ such that the longest interval starting at position $i$ is $\{i,\ldots,\pi[i]\}$.
In each of the indexing problems, we denote the length of the text by $n$, the length of a query pattern by $m$, and the number of occurrences of the pattern in the text by $Occ$.
1.1 Previous Results
An -time solution for the Weighted Pattern Matching problem based on the Fast Fourier Transform was proposed in [7, 19]. Recently, an -time solution using the suffix array and lookahead scoring was presented in [15]. The average case complexity of the WPM problem has also been studied and a number of fast algorithms have been shown with both linear [4] and sub-linear on average algorithms being presented [5].
The Weighted Indexing problem was first considered by Iliopoulos et al. [12], who introduced a data structure called weighted suffix tree allowing optimal -time queries. The construction time and size of that data structure was, however, .
Amir et al. [2] reduced the Weighted Indexing problem to the Property Indexing problem in a text of length . For the latter, they proposed a solution with preprocessing time and optimal query time. Later it was shown that the Property Indexing problem can be solved in linear time; see [13, 14] (see also [16]). This led to a solution to the Weighted Indexing problem with index size and construction time , preserving optimal query time.
These results were recently improved by some of the authors in [3], where they proposed an $O(nz)$-sized data structure for the Weighted Indexing problem that can be constructed also in $O(nz)$ time. The query time is still $O(m+Occ)$. The authors proposed several applications of their index.
Biswas et al. [6] presented a data structure that solves the Approximate Weighted Indexing problem in space (with construction time) with -time queries; here denotes the number of occurrences reported. They also proposed a data structure for the Generalised Weighted Indexing problem with space and query time. The construction time is not mentioned, but a direct construction of their index works in time. Moreover, they also consider the problem of document listing for weighted sequences.
1.2 Our Contribution
We present a new $O(nz)$-time construction of an $O(nz)$-sized data structure for the Weighted Indexing problem that answers queries in optimal $O(m+Occ)$ time. Our index is based on a novel observation that one can always construct a family of $\lfloor z\rfloor$ special strings of length $n$ that carries all the information about all the strings that occur in the weighted sequence. This yields a significantly simpler construction than in the previous index [3] while preserving all of its applications. As a by-product, we obtain an optimal solution to the Property Indexing problem that avoids complex tools used in the previous solutions [2, 13, 14, 16]. We provide a proof-of-concept implementation of our index that was validated for correctness and efficiency. We also discuss an even simpler randomised construction with worse space complexity and construction time of the index.
Our approach lets us significantly improve upon the variants of the weighted index proposed in [6]. In the Approximate Weighted Indexing problem, we obtain space and construction time, preserving the query time. We also improve the space usage in the Generalised Weighted Indexing problem to , also in the document listing variant.
1.3 Comparison of Our Techniques with the Previous Work
Two main building blocks of our weighted index are a construction of a family of special strings with properties and a solution to the Property Indexing problem.
The family of strings that we construct has the same set of patterns occurring at each position as the weighted text $X$ and, moreover, the number of occurrences of each pattern at each position is a good estimate of the probability of its occurrence at this position in $X$. The former property is used in the construction of a weighted index and the latter in the construction of an approximate weighted index. The existence of this family is not immediate. However, its proof is not involved and we design an $O(nz)$-time elementary construction algorithm based on tries (also known as radix trees). In the end we show that a simple generation of a number of strings according to the probability distribution implied by the weighted sequence with high probability yields a family of strings that also describes well the set of patterns in $X$. However, the number of strings that one needs to generate is much larger. Excluding the previous, exponential-size index of Iliopoulos et al. [12], previous work includes the $O(nz^2\log z)$-space index of Amir et al. [2] and the $O(nz)$-space index by Barton et al. [3]. Amir et al. [2] show that, after a small modification of the weighted sequence, the set of maximal string patterns that occur in it has a total length $O(nz^2\log z)$. Barton et al. [3] show a representation of this set as a trie and apply Shibuya’s algorithm for suffix tree of a trie construction [20].
In our solution to the Property Indexing problem we construct a data structure called property suffix tree being the suffix tree in which the nodes corresponding to factors that do not belong to the property are trimmed. The algorithm makes only several traversals of the suffix tree and uses an amortisation argument similar to the one from Ukkonen’s suffix tree construction [21]. Very similar data structures were constructed by Amir et al. [2] and Kopelowitz [16]. Amir et al. [2] use a heavy machinery of weighted ancestor queries and a fancy algorithm to mark the properties on edges of the suffix tree. Kopelowitz [16] designs an algorithm for a dynamic setting, but also mentions its static application. He uses amortisation ideas similar to ours, but his construction is more involved due to its generality and also utilises less basic longest common extension queries (i.e., range minimum queries). The solution to the Property Indexing problem that was developed by Iliopoulos et al. [13] and clarified by Juan et al. [14] constructs a different data structure that, in a sense, shifts the hardness of the problem from the construction to the queries. It also requires range minimum queries.
Our techniques enable us immediately to answer decision queries of a weighted index. To answer counting and reporting queries in optimal time, we require coloured range counting and reporting data structures in the property suffix tree that were already used for this purpose by Barton et al. [3]. In our solution to the Approximate Weighted Indexing, we need to augment the property suffix tree with a data structure for top- document retrieval queries. The same type of queries were used in the previous solution by Biswas et al. [6], however, not as a black box. Moreover, they also use the less efficient reduction of [2] which caused their data structure to use space, assuming that in each query. Finally, we improve the space complexity of the generalised weighted index of Biswas et al. [6] by plugging in our construction of special strings.
1.4 Structure of the Paper
In Section 3 we present a combinatorial construction of the special family of strings. An efficient implementation of the construction of this family based on tries is proposed in Section 4. In Section 5 the new optimal solution for the Property Indexing problem is presented. Using the construction and the property index, we obtain our weighted index in Section 6 and, with the aid of an auxiliary tool, an approximate weighted index in Section 7. Alternative randomised constructions of the two indexes with worse parameters are discussed in Section 8. Our improvement to the Generalised Weighted Index and our C++ implementation are briefly discussed in Section 9.
2 Preliminaries
2.1 Strings and Property Indexing
A string $S$ over an alphabet $\Sigma$ is a finite sequence of letters from $\Sigma$. By $|S|$ we denote the length of $S$ and by $S[i]$, for $1\le i\le |S|$, we denote the $i$-th letter of $S$. By $S[i..j]$ we denote the string $S[i]\cdots S[j]$ called a factor of $S$ (if $i>j$, then the factor is an empty string). A factor is called a prefix if $i=1$ and a suffix if $j=|S|$. We say that a string $P$ occurs at position $i$ in $S$ if $P=S[i..i+|P|-1]$.
A property $\Pi$ of $S$ is a hereditary collection of integer intervals contained in $\{1,\ldots,n\}$. For simplicity, we represent every property $\Pi$ with an array $\pi[1..n]$ such that the longest interval starting at position $i$ is $\{i,\ldots,\pi[i]\}$. Observe that $\pi$ can be an arbitrary array satisfying $\pi[i]\in\{i-1,\ldots,n\}$ and $\pi[1]\le\pi[2]\le\cdots\le\pi[n]$. For a string $P$, by $Occ_\pi(P,S)$ we denote the set of occurrences $i$ of $P$ in $S$ such that $i+|P|-1\le\pi[i]$. These notions lead us to the statement of the following problem.
Problem 1 (Property Indexing).
Input: A string $S$ of length $n$ over an alphabet $\Sigma$ and an array $\pi$ representing a property $\Pi$.
Queries: For a given pattern string $P$ of length $m$, compute $|Occ_\pi(P,S)|$ or report all elements of $Occ_\pi(P,S)$.
Let us consider an indexed family $\mathcal{S}=(S_j,\pi_j)_{j=1}^{k}$ of strings $S_j$ with properties $\pi_j$. For a string $P$ and an index $i$, by
$$\mathrm{Count}_{\mathcal{S}}(P,i)=\left|\{\,j\in\{1,\ldots,k\} : i\in Occ_{\pi_j}(P,S_j)\,\}\right|$$
we denote the total number of occurrences of $P$ at the position $i$ in the strings $S_j$ that respect the properties.
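The two definitions above can be sketched by brute force. The following is only an illustrative sketch, not the index itself: the function names `property_occurrences` and `count`, and the example text and property array, are assumptions introduced here. Positions are 1-based, as in the text, and `pi` is stored as a 0-based Python list.

```python
def property_occurrences(S, pi, P):
    """1-based positions i where P occurs in S and ends at or before pi[i]."""
    m = len(P)
    return [i for i in range(1, len(S) + 1)
            if i + m - 1 <= pi[i - 1] and S[i - 1:i - 1 + m] == P]


def count(family, P, i):
    """Number of pairs (S_j, pi_j) in which P occurs at position i respecting pi_j."""
    return sum(1 for S, pi in family if i in property_occurrences(S, pi, P))


# Hypothetical example: pi is non-decreasing and pi[i] >= i - 1.
S = "abaab"
pi = [2, 3, 3, 5, 5]
print(property_occurrences(S, pi, "ab"))   # occurrences of "ab" allowed by pi
print(property_occurrences(S, pi, "aab"))  # "aab" at position 3 would end after pi[3]
```

The scan costs time proportional to $n\cdot m$ per query; it only serves to make the role of the array $\pi$ concrete.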
2.2 Weighted Sequences and Weighted Indexing
A weighted sequence $X=X[1]\cdots X[n]$ of length $|X|=n$ over an alphabet $\Sigma$ is a sequence of sets of pairs of the form $X[i]=\{(c,\pi^{(X)}_i(c)) : c\in\Sigma\}$. Here, $\pi^{(X)}_i(c)$ is the occurrence probability of the letter $c$ at the position $i\in\{1,\ldots,n\}$. These values are non-negative and sum up to 1 for a given $i$. An example of a weighted sequence is shown in Table 1.
The probability of matching of a string $P$ at position $i$ of a weighted sequence $X$ equals
$$\mathcal{P}_X(P,i)=\prod_{j=1}^{|P|}\pi^{(X)}_{i+j-1}(P[j]).$$
We say that a string $P$ occurs in $X$ at position $i$ if $\mathcal{P}_X(P,i)\ge\frac{1}{z}$. We also say that $P$ is a solid factor of $X$ (starting, occurring) at position $i$. By $Occ_{\frac{1}{z}}(P,X)$ we denote the set of all positions where $P$ occurs in $X$. The main problem in scope can be formulated as follows.
Problem 2 (Weighted Indexing).
Input: A weighted sequence $X$ of length $n$ over an alphabet $\Sigma$ and a threshold $\frac{1}{z}$.
Queries: For a given pattern string $P$ of length $m$, check if $Occ_{\frac{1}{z}}(P,X)\ne\emptyset$ (decision query), compute $|Occ_{\frac{1}{z}}(P,X)|$ (counting query), or report all elements of $Occ_{\frac{1}{z}}(P,X)$ (reporting query).
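To make the query semantics concrete, here is a minimal brute-force sketch of a reporting query. It is not the index described in this paper: the names `match_probability` and `occurrences` and the toy sequence `X` are assumptions made for illustration, and exact rational arithmetic (`fractions.Fraction`) stands in for the log-probability model.

```python
from fractions import Fraction


def match_probability(X, P, i):
    """Probability that P matches X at 1-based position i (0 if P overruns X)."""
    if i < 1 or i + len(P) - 1 > len(X):
        return Fraction(0)
    prob = Fraction(1)
    for j, c in enumerate(P):
        prob *= X[i - 1 + j].get(c, Fraction(0))
    return prob


def occurrences(X, P, z):
    """All 1-based positions where P occurs with probability at least 1/z."""
    return [i for i in range(1, len(X) - len(P) + 2)
            if match_probability(X, P, i) >= Fraction(1, z)]


# Hypothetical weighted sequence of length 3 over the alphabet {a, b}.
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
]
print(occurrences(X, "aa", 2))
print(occurrences(X, "ab", 4))
```

Each query takes time proportional to $n\cdot m$ here, which is exactly the cost that an index is meant to avoid.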
Our model of computations.
We assume the word-RAM model with word size $w=\Omega(\log n+\log z)$. We consider the log-probability model of representations of weighted sequences in which probabilities can be multiplied exactly in $O(1)$ time. We further assume that $\sigma=|\Sigma|=O(1)$; under this assumption a weighted sequence of length $n$ has a representation using $O(n)$ space.
3 Existence of an Equivalent Family of Strings
In the definition below, we formalise the property of a string family that we aim to construct.
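Definition 1.
We say that an indexed family $\mathcal{S}=(S_j,\pi_j)_{j=1}^{\lfloor z\rfloor}$ containing $\lfloor z\rfloor$ strings $S_j$ of length $n$ is a $z$-estimation of a weighted sequence $X$ of length $n$ if and only if, for every string $P$ and position $i\in\{1,\ldots,n\}$, $\mathrm{Count}_{\mathcal{S}}(P,i)=\lfloor\mathcal{P}_X(P,i)\cdot z\rfloor$.
Note that a $z$-estimation of a weighted sequence $X$ carries the information about all solid factors of $X$: a string $P$ occurs in $X$ at position $i$ if and only if it occurs at position $i$ in at least one of the strings $S_j$ respecting its property $\pi_j$. This observation will be used in the construction of our weighted index. Moreover, the value $\mathrm{Count}_{\mathcal{S}}(P,i)$ provides a good estimation of the probability $\mathcal{P}_X(P,i)$:
$$\frac{\mathrm{Count}_{\mathcal{S}}(P,i)}{z}\;\le\;\mathcal{P}_X(P,i)\;<\;\frac{\mathrm{Count}_{\mathcal{S}}(P,i)+1}{z}.$$
This will let us design an approximate weighted index. An example of a $z$-estimation is shown in Table 2.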
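The defining condition can be checked exhaustively on toy inputs. The sketch below assumes the reading of Definition 1 in which the count of occurrences respecting the properties must equal $\lfloor z\cdot p\rfloor$, where $p$ is the matching probability; the helper names and the two-position example are hypothetical, and the enumeration over all strings is exponential, so this is a sanity check rather than an algorithm.

```python
from fractions import Fraction
from itertools import product
from math import floor


def match_probability(X, P, i):
    """Probability that P matches X at 1-based position i (0 if P overruns X)."""
    if i + len(P) - 1 > len(X):
        return Fraction(0)
    prob = Fraction(1)
    for j, c in enumerate(P):
        prob *= X[i - 1 + j].get(c, Fraction(0))
    return prob


def count(family, P, i):
    """Number of strings S_j in which P occurs at position i and ends by pi_j[i]."""
    m = len(P)
    return sum(1 for S, pi in family
               if i + m - 1 <= pi[i - 1] and S[i - 1:i - 1 + m] == P)


def is_z_estimation(family, X, z, alphabet):
    """Brute-force check of the counting condition for all strings of length <= n."""
    n = len(X)
    for length in range(n + 1):
        for letters in product(alphabet, repeat=length):
            P = "".join(letters)
            for i in range(1, n + 1):
                if count(family, P, i) != floor(match_probability(X, P, i) * z):
                    return False
    return True


# Hypothetical example: two uniform positions over {a, b} and z = 2.
half = Fraction(1, 2)
X = [{"a": half, "b": half}, {"a": half, "b": half}]
good = [("aa", [1, 2]), ("bb", [1, 2])]   # each letter is solid, no pair of letters is
bad = [("aa", [2, 2]), ("bb", [1, 2])]    # wrongly admits "aa" at position 1
print(is_z_estimation(good, X, 2, "ab"), is_z_estimation(bad, X, 2, "ab"))
```

Including the empty pattern in the enumeration also verifies that the family has exactly $\lfloor z\rfloor$ members.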
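Below, we prove existence of a $z$-estimation. An efficient construction is deferred to the next section.
For a fixed weighted sequence $X$ of length $n$ and a threshold $\frac{1}{z}$, we can use compact notation:
$$\mathcal{P}_i(P)=\mathcal{P}_X(P,i),\qquad \mathrm{Count}_i(P)=\lfloor z\cdot\mathcal{P}_i(P)\rfloor$$
for $i\in\{1,\ldots,n+1\}$. We start with an equivalent characterisation of $z$-estimations of $X$.
Observation 1.
A family $(S_j,\pi_j)_{j=1}^{\lfloor z\rfloor}$ is a $z$-estimation of $X$ if and only if for each position $i$, every string $P$ is a prefix of exactly $\mathrm{Count}_i(P)$ strings $S_j[i..\pi_j[i]]$.
Next, we prove that this condition uniquely defines the multiset $\mathcal{S}_i=\{S_j[i..\pi_j[i]] : 1\le j\le\lfloor z\rfloor\}$.
Lemma 1.
There exists a unique multiset $\mathcal{S}_i$ such that each string $P$ is a prefix of exactly $\mathrm{Count}_i(P)$ strings in $\mathcal{S}_i$.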
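Proof.
Consider a multiset $\mathcal{S}_i$ satisfying the required condition and an arbitrary string $P$. For each $c\in\Sigma$, there are $\mathrm{Count}_i(Pc)$ strings in $\mathcal{S}_i$ in which the prefix $P$ is followed by the character $c$. In the remaining strings in $\mathcal{S}_i$ with prefix $P$, this prefix is not followed by any letter. Thus, the multiplicity of $P$ in $\mathcal{S}_i$ must be $\mathrm{Count}_i(P)-\sum_{c\in\Sigma}\mathrm{Count}_i(Pc)$. This implies uniqueness of $\mathcal{S}_i$.
Observe that $\mathrm{Count}_i(P)\ge\sum_{c\in\Sigma}\mathrm{Count}_i(Pc)$, because $\mathcal{P}_i(P)\ge\sum_{c\in\Sigma}\mathcal{P}_i(Pc)$ and the function $x\mapsto\lfloor x\rfloor$ is superadditive. Consequently, we may define a multiset $\mathcal{S}_i$ using values $\mathrm{Count}_i(P)-\sum_{c\in\Sigma}\mathrm{Count}_i(Pc)$ as multiplicities. It remains to prove that this multiset satisfies the required condition. For this, we consider strings $P$ in the order of decreasing lengths. The base case is trivial because strings $P$ longer than $n-i+1$ satisfy $\mathrm{Count}_i(P)=0$. The inductive hypothesis yields that, for each $c\in\Sigma$, the string $Pc$ is a prefix of $\mathrm{Count}_i(Pc)$ strings in $\mathcal{S}_i$. Consequently, the string $P$ is a prefix of $\bigl(\mathrm{Count}_i(P)-\sum_{c\in\Sigma}\mathrm{Count}_i(Pc)\bigr)+\sum_{c\in\Sigma}\mathrm{Count}_i(Pc)=\mathrm{Count}_i(P)$ strings in $\mathcal{S}_i$, as claimed. ∎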
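The multiplicity rule used in the proof (the floor-count of a string minus the floor-counts of its one-letter extensions) can be evaluated directly by a depth-first traversal of the solid factors. The sketch below is an illustration under that reading, with hypothetical names (`solid_factor_multiset`, the toy sequence `X`) and exact rational arithmetic; it is not the efficient trie-based procedure of Section 4.

```python
from fractions import Fraction
from math import floor


def solid_factor_multiset(X, z, i, alphabet):
    """Sorted list of strings, with multiplicities, forming the multiset at position i."""
    result = []

    def visit(P, prob):
        total = floor(prob * z)            # floor-count of P itself
        in_children = 0
        if i + len(P) <= len(X):           # P can still be extended inside X
            for c in alphabet:
                p = prob * X[i - 1 + len(P)].get(c, Fraction(0))
                c_count = floor(p * z)
                if c_count >= 1:           # only solid extensions are visited
                    in_children += c_count
                    visit(P + c, p)
        result.extend([P] * (total - in_children))

    visit("", Fraction(1))
    return sorted(result)


# Hypothetical weighted sequence of length 3 over {a, b}, with z = 4.
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
]
for i in range(1, 5):
    print(i, solid_factor_multiset(X, 4, i, "ab"))
```

For every position the returned list has exactly $\lfloor z\rfloor$ elements, and for position $n+1$ it consists of empty strings only.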
Observe that in a $z$-estimation, $S_j[i..\pi_j[i]]$ can be obtained from $S_j[i+1..\pi_j[i+1]]$ by inserting a leading character and dropping some number of trailing characters. This statement holds only if $\pi_j[i]\ge i$; otherwise $S_j[i..\pi_j[i]]=\varepsilon$. The relation between these strings can be formalised as follows:
Definition 2.
We say that $P$ is compatible with $P'$ if $P=\varepsilon$ or $P=cQ$ for some character $c\in\Sigma$ and a prefix $Q$ of $P'$.
Thus, if a $z$-estimation exists, it yields a perfect matching between $\mathcal{S}_{i+1}$ and $\mathcal{S}_i$ such that the matched strings are compatible. We prove that such a matching exists unconditionally. For an example, see Table 3.
Lemma 2.
For every $i\in\{1,\ldots,n\}$, there is a one-to-one correspondence from $\mathcal{S}_{i+1}$ into $\mathcal{S}_i$ such that each $P'\in\mathcal{S}_{i+1}$ is matched with a compatible $P\in\mathcal{S}_i$.
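Proof.
We greedily transform each $P'\in\mathcal{S}_{i+1}$ into the longest compatible $P\in\mathcal{S}_i$ which is still unmatched. If no compatible $P$ is available, we leave $P'$ unmatched. We will show that all strings are actually matched at the end of this process. Since $|\mathcal{S}_i|=|\mathcal{S}_{i+1}|=\lfloor z\rfloor$, it suffices to prove that no $P\in\mathcal{S}_i$ is left unmatched.
An empty string is compatible with every $P'$, so it cannot be left unmatched. Thus, suppose that $P=cQ$, for some $c\in\Sigma$ and string $Q$, is left unmatched. Let us denote by $\mathcal{A}\subseteq\mathcal{S}_{i+1}$ the multiset containing all strings compatible with $P$, i.e., starting with $Q$. We further define $\mathcal{B}\subseteq\mathcal{S}_i$ as the multiset containing all strings that start with $c'Q$ for some $c'\in\Sigma$. The construction procedure guarantees that each $P'\in\mathcal{A}$ has been matched to a compatible $P''$ satisfying $|P''|\ge|P|$; such $P''$ must belong to the multiset $\mathcal{B}$.
Observe that $|\mathcal{A}|=\mathrm{Count}_{i+1}(Q)\ge\sum_{c'\in\Sigma}\mathrm{Count}_i(c'Q)=|\mathcal{B}|$ because $\mathcal{P}_{i+1}(Q)=\sum_{c'\in\Sigma}\mathcal{P}_i(c'Q)$ and the function $x\mapsto\lfloor x\rfloor$ is superadditive. Consequently, each $P''\in\mathcal{B}$ must be matched to some $P'\in\mathcal{A}$. Since $P\in\mathcal{B}$ is unmatched, we obtain a contradiction. ∎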
Due to Lemma 2, we can index the strings $P^{(i)}_j\in\mathcal{S}_i$ so that we have $\lfloor z\rfloor$ chains $P^{(n+1)}_j,P^{(n)}_j,\ldots,P^{(1)}_j$ with compatible subsequent strings. It is easy to transform each such chain to a string $S_j$ with property $\pi_j$ so that $S_j[i..\pi_j[i]]=P^{(i)}_j$. The value $S_j[i]$ is not specified if $P^{(i)}_j=\varepsilon$; in this case, we may set $S_j[i]$ to an arbitrary character. The resulting family clearly satisfies the characterisation of Observation 1, which completes the proof of the following result.
Theorem 1.
Each weighted sequence has a $z$-estimation.
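The existential argument can be turned into a direct, deliberately naive procedure: compute the multiset for every position and match greedily from right to left. The sketch below follows that reading of Lemmas 1 and 2; the function names and the toy sequence are assumptions, it spends quadratic time in $\lfloor z\rfloor$ per position, and it should not be mistaken for the trie-based algorithm of Section 4.

```python
from fractions import Fraction
from math import floor


def solid_factor_multiset(X, z, i, alphabet):
    """Strings at position i, each repeated count(P) - sum of count(Pc) times."""
    result = []

    def visit(P, prob):
        total, in_children = floor(prob * z), 0
        if i + len(P) <= len(X):
            for c in alphabet:
                p = prob * X[i - 1 + len(P)].get(c, Fraction(0))
                if floor(p * z) >= 1:
                    in_children += floor(p * z)
                    visit(P + c, p)
        result.extend([P] * (total - in_children))

    visit("", Fraction(1))
    return sorted(result)


def build_z_estimation(X, z, alphabet):
    """Greedy right-to-left matching; returns a list of (S_j, pi_j) pairs."""
    n, k = len(X), floor(z)
    current = [""] * k                         # strings at position i + 1
    letters = [[alphabet[0]] * n for _ in range(k)]
    pi = [[0] * n for _ in range(k)]
    for i in range(n, 0, -1):
        pool = solid_factor_multiset(X, z, i, alphabet)
        pool.sort(key=len, reverse=True)       # longest candidates first
        for j in range(k):
            # First (hence longest) unmatched string compatible with current[j].
            idx = next(t for t, P in enumerate(pool)
                       if P == "" or current[j].startswith(P[1:]))
            P = pool.pop(idx)
            if P:
                letters[j][i - 1] = P[0]       # otherwise keep an arbitrary letter
            pi[j][i - 1] = i + len(P) - 1
            current[j] = P
    return [("".join(letters[j]), pi[j]) for j in range(k)]


# Hypothetical weighted sequence of length 3 over {a, b}, with z = 4.
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
]
for S, p in build_z_estimation(X, 4, "ab"):
    print(S, p)
```

The call to `next` would raise `StopIteration` if some string had no compatible partner; under Lemma 2 this should not happen for a valid weighted sequence.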
4 Efficient Implementation
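In this section we describe an algorithm which, given a weighted sequence $X$ of length $n$ and threshold $\frac{1}{z}$, constructs a $z$-estimation of $X$ in $O(nz)$ time.
At a high level, we follow the existential construction of Section 3. We start with $\mathcal{S}_{n+1}$, which consists of $\lfloor z\rfloor$ copies of $\varepsilon$, and we iterate over positions $i=n,\ldots,1$ transforming $\mathcal{S}_{i+1}$ to $\mathcal{S}_i$ so that each $P'_j\in\mathcal{S}_{i+1}$ is replaced with a compatible string $P_j\in\mathcal{S}_i$. We simultaneously build the $z$-estimation $(S_j,\pi_j)_{j=1}^{\lfloor z\rfloor}$. More precisely, we set $\pi_j[i]$ to $i+|P_j|-1$ and $S_j[i]$ to the leading character of $P_j$, or an arbitrary character if $P_j=\varepsilon$.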
Each transformation simulates the procedure provided in the proof of Lemma 2. However, our implementation uses solid factor tries in order to achieve amortised running time.
4.1 Solid Factor Tries
Recall that a trie is a rooted tree in which each node represents a string; the string corresponding to node $v$, called the label of $v$, is denoted $L(v)$. The root has label $\varepsilon$, and the parent of a node $v$ with $L(v)=Pc$ for $c\in\Sigma$ is the node $u$ with $L(u)=P$; the edge from $u$ to $v$ is labelled with $c$. Observe that the family of solid factors occurring at position $i$ (i.e., strings $P$ such that $\mathcal{P}_i(P)\ge\frac{1}{z}$) is closed with respect to prefixes. Thus, we can define a solid factor trie $T_i$ whose nodes represent these factors.
We store $\mathcal{S}_i$ using tokens in $T_i$: each $P_j\in\mathcal{S}_i$ is represented by a token (with identifier $j$) located at the node $v$ with $L(v)=P_j$. For each token $j$, we store the node $v$ with $L(v)=P_j$ and the probability $\mathcal{P}_i(P_j)$. Observe that the number of tokens at the node $v$ is $\mathrm{Count}_i(L(v))-\sum_{c\in\Sigma}\mathrm{Count}_i(L(v)c)$ and the number of tokens in the subtree rooted at $v$ is $\mathrm{Count}_i(L(v))$. To simplify notation, we denote $\mathcal{P}_i(v)=\mathcal{P}_i(L(v))$ and $\mathrm{Count}_i(v)=\mathrm{Count}_i(L(v))$. We have the following simple observation; see also Figure 1.
Observation 2.
The trie $T_i$ contains $\lfloor z\rfloor$ tokens in total and every leaf $v$ contains $\mathrm{Count}_i(v)\ge 1$ tokens.
4.2 Transformation Algorithm
For each index $i$, we transform the solid factor trie $T_{i+1}$ to $T_i$ and move the tokens so that $\mathcal{S}_{i+1}$ is transformed to $\mathcal{S}_i$.
Before we describe the implementation, let us formulate a relation between $T_i$ and $T_{i+1}$.
Observation 3.
If $v\in T_i$ has a non-empty label, $L(v)=cP$, for some $c\in\Sigma$, then $T_{i+1}$ contains a node $v'$ with label $P$.
Consequently, each non-root node $v\in T_i$ has a corresponding node $v'\in T_{i+1}$. In our construction algorithm, we sometimes reuse $v'$ as $v$; otherwise, we create $v$ as a copy of $v'$. More precisely, we distinguish a heavy letter $h$ maximising probability $\pi^{(X)}_i(c)$ over $c\in\Sigma$. We reuse $v'$ if $L(v)$ starts with $h$ and create a copy of $v'$ otherwise.
This approach is implemented as follows. First, we create the root of and attach to the new root using an edge with label . The resulting subtree, denoted , contains all tokens present in and may contain nodes with (we piggyback trimming them to the last phase when tokens are moved). Next, we consider all the remaining letters . For each such letter we shall build a subtree representing solid factors occurring at position and starting with character . We simultaneously build and traverse : we construct the children of a node while visiting for the first time. While at node with , we maintain the probability and a pointer to the corresponding node such that . To construct the children of , we simply compute for each . Moreover, we determine and place token requests at node , announcing that tokens are needed at .
Finally, we move the tokens and trim the redundant nodes of . We process the tokens in an arbitrary order. Consider a token located at node of with (the token used to represent ). We traverse the path from towards the root of maintaining the probability at the currently visited node . First, we check if there is any token request at . If so, we comply with the request, remove it, and terminate the traversal. Otherwise, we compute using the probability. If contains less than already processed tokens, we place our token at and terminate the traversal. Otherwise, we proceed to the parent of . If is a leaf and does not contain any (processed or unprocessed) tokens, we remove from . If the traversal reaches the root of , we place the token unconditionally at the root. Figure 2 illustrates this procedure on an example.
4.2.1 Correctness
We shall prove that the procedure described above correctly computes and . Due to 3, the trie contains all the necessary nodes. We only need to prove that no redundant nodes (with ) are left in . Suppose that is the deepest such node; clearly, it must be a leaf of . We did not place the token at because . On the other hand, tokens were present in all leaves of , so the subtree of in initially contained a token. Let us consider the moment of moving the last token in this subtree. If the token travelled further to the parent of , we would have removed . Thus, the token must have been placed at a node complying with a token request at . However, in that case we have , because is the heavy character. This contradiction concludes the proof.
Hence, we proceed to proving that the final configuration of tokens represents . For this, we observe that our algorithm simulates the greedy procedure in the proof of Lemma 2. In other words, we shall prove that we transformed to the longest compatible element of which was still unmatched when we processed token . Suppose that there was an unmatched string longer than . Let and observe that, when processing token , we visited the node with . If , then we would have less than processed tokens at . Otherwise, there must have been a token request at . For either event we would not have proceeded to the parent of . This contradiction concludes the proof.
4.2.2 Running Time Analysis
It remains to show that the total running time of the transformations is . In a single iteration, processing the -th token, i.e., transforming to , we visited at most nodes of and deleted some of them. Across all iterations this is per token and in total. The remaining operations (construction of subtrees ) take time per created node. The final tree has nodes and the overall number of deleted nodes is . Hence, the total number of created nodes is also .
This concludes the proof that the running time is $O(nz)$. Hence, we achieve the main goal of this section.
Theorem 2.
For a weighted sequence of length $n$ over a constant-sized alphabet, one can construct a $z$-estimation in $O(nz)$ time.
5 Property Indexing Made Simple
Every known solution to the Property Indexing problem makes use of suffix trees; ours is no exception. Below we recall the basics on suffix trees.
5.1 Suffix Trees
The suffix tree $T$ of a non-empty string $S$ of length $n$ is a compact trie representing all suffixes of $S$. The nodes of the trie which become nodes of the suffix tree (i.e., branching nodes, terminal nodes, and the root) are called explicit nodes, while the other nodes are called implicit. The edges out-going from a node are labelled with their first letters and can be stored, e.g., in a list.
Each edge of the suffix tree can be viewed as an upward maximal path of implicit nodes starting with an explicit node. Moreover, each node belongs to a unique path of that kind. Then, each node of the trie can be represented in the suffix tree by the edge it belongs to and an index within the corresponding path. We use $L(v)$ to denote the path-label of a node $v$, i.e., the concatenation of the edge labels along the path from the root to $v$. The terminal node corresponding to suffix $S[i..n]$ is marked with the index $i$. Each string $P$ occurring in $S$ is uniquely represented by either an explicit or an implicit node of $T$, called the locus of $P$. The suffix link of a node $v$ with path-label $L(v)=cP$ is a pointer to the node path-labelled $P$, where $c\in\Sigma$ is a single letter and $P$ is a string. The suffix link of every non-root explicit node leads to an explicit node of $T$.
The suffix tree of a string of length $n$ even over an integer alphabet can be constructed in $O(n)$ time [9].
5.2 Property Suffix Tree Construction
In analogy to the suffix tree, given a string $S$ with property $\Pi$ represented by an array $\pi$, we define the property suffix tree of $S$ as the compact trie representing strings $S[i..\pi[i]]$ for $i\in\{1,\ldots,n\}$. Each terminal node $v$ stores a list containing all indices $i$ such that $S[i..\pi[i]]$ is the path-label of $v$. This way, $Occ_\pi(P,S)$ can be retrieved by locating the locus of $P$ and writing down indices in lists for all descendants of the locus.
For a given string $S$, we construct the property suffix tree with respect to property $\Pi$ from the suffix tree of $S$. This process is implemented in three steps. First, for each index $i$ we determine the locus of $S[i..\pi[i]]$. Next, we make all these loci explicit to create new terminal nodes. Finally, we remove nodes which should no longer exist in the tree or no longer be explicit.
Our approach to the first phase is similar to Ukkonen’s suffix tree construction [21]. We are to determine the locus $v_i$ of $S[i..\pi[i]]$. For this, we shall traverse the suffix tree starting from an explicit node $u_i$ guaranteed to be an ancestor of $v_i$. We obtain $u_i$ by following the suffix link of the nearest explicit ancestor of $v_{i-1}$ ($v_{i-1}$ itself if it is explicit). If $i=1$ or the explicit ancestor of $v_{i-1}$ is the root, we simply set $u_i$ as the root. Since $\pi[i-1]\le\pi[i]$ for $i>1$, $u_i$ is indeed an ancestor of $v_i$. Therefore, we can progress down the edges in the suffix tree from $u_i$, keeping track of the current depth until the desired depth $\pi[i]-i+1$ is reached. We know that $S[i..\pi[i]]$ exists in the tree, so it suffices to read only the first letters of each traversed edge.
This procedure results in the sequence of loci . Let us analyse its time complexity. In the -th iteration we traverse: one edge to reach , then several edges a node whose suffix link is , and finally at most one edge to reach . Hence, the number of edges traversed in this iteration is at most , which gives overall.
The remaining steps of the algorithm are performed as follows. We sort the loci by the path label length and group them based on the edge where they are located. This lets us appropriately subdivide each edge and compute the lists for the new terminal nodes. Finally, we trim the tree: we traverse the tree bottom-up and remove or dissolve nodes which should no longer be explicit. These steps clearly work in $O(n)$ time.
Theorem 3.
For a string $S$ and property $\Pi$ represented with a table $\pi$, the property suffix tree can be computed in $O(n)$ time. Moreover, this data structure can answer property indexing queries in $O(m)$ time (counting) or $O(m+Occ)$ time (reporting).
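The query interface of the property suffix tree can be mimicked with an uncompacted trie of the truncated suffixes $S[i..\pi[i]]$. The sketch below is only an illustration of what the structure stores and how a reporting query walks it: it is built naively in quadratic time, does not use suffix links, and the names `build_property_trie` and `report` as well as the example input are assumptions.

```python
def build_property_trie(S, pi):
    """Uncompacted trie of the truncated suffixes S[i..pi[i]] (1-based, inclusive)."""
    root = {"children": {}, "ends": []}
    for i in range(1, len(S) + 1):
        node = root
        for c in S[i - 1:pi[i - 1]]:
            node = node["children"].setdefault(c, {"children": {}, "ends": []})
        node["ends"].append(i)             # terminal node remembers its start index
    return root


def report(root, P):
    """All positions i such that P is a prefix of S[i..pi[i]]."""
    node = root
    for c in P:
        node = node["children"].get(c)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:                           # collect lists in the subtree of the locus
        v = stack.pop()
        found.extend(v["ends"])
        stack.extend(v["children"].values())
    return sorted(found)


# Hypothetical text and property array.
root = build_property_trie("abaab", [2, 3, 3, 5, 5])
print(report(root, "ab"), report(root, "a"), report(root, "aab"))
```

The compact, linear-size version replaces each unary path by a single edge, which is where the suffix tree of $S$ comes in.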
6 Weighted Index
Let us first describe our data structure for the Weighted Indexing problem. For a weighted sequence and a threshold , we construct a -estimation of , concatenate all the strings and shift the properties so that a single string with property is obtained. Our weighted index is the property suffix tree of and . In the property suffix tree, each terminal node is labelled by the list of all the occurrences of the corresponding string in respecting its property. We shift these indices so that they describe the indices within the respective strings .
The space complexity of the index is obviously $O(nz)$, where $n$ is the length of $X$. Theorems 2 and 3 show that the data structure can be constructed in $O(nz)$ time. The resulting weighted index is very similar to the one constructed in [3], even though the construction algorithm is very different.
By Definition 1, a string occurs at position in if and only if it occurs at this position in at least one of the strings. Thus, to check if , it suffices to traverse down the property suffix tree and check if it contains a node corresponding to . This search takes time, where . The two remaining types of operations—counting and reporting—require finding distinct positions in the labels of the terminals in the subtree of . They can be implemented after additional preprocessing for the colour set size [11] and coloured range listing problem [17]; details can be found in [3]. We obtain the same complexities as in Theorem 16 from [3].
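The occurrence criterion above can be illustrated with a naive check that scans the family of strings directly instead of descending the property suffix tree. This is only a sketch under stated assumptions: the family is given as hypothetical `(string, pi)` pairs, positions are 0-based, and `pi[i]` is assumed to be the exclusive end of the longest factor starting at `i` that satisfies the property. The names `occurs_at` and `report_occurrences` are illustrative and do not come from the paper or its implementation.

```python
def occurs_at(family, i, pattern):
    """Return True if `pattern` occurs at 0-based position `i` in at least one
    string of the family while respecting that string's property."""
    m = len(pattern)
    for s, pi in family:
        # Assumption: pi[i] is the exclusive end of the longest factor of s
        # that starts at i and satisfies the property.
        if i < len(s) and s[i:i + m] == pattern and i + m <= pi[i]:
            return True
    return False


def report_occurrences(family, pattern):
    """List the distinct positions at which `pattern` occurs (naive scan)."""
    n = len(family[0][0]) if family else 0
    return [i for i in range(n) if occurs_at(family, i, pattern)]


# Hypothetical two-string family over the alphabet {a, b}.
family = [
    ("abab", [2, 3, 4, 4]),
    ("abba", [4, 3, 3, 4]),
]
positions = report_occurrences(family, "ab")
```

The scan takes time proportional to the total length of the family per query, so it only serves as a specification of what the index returns, not as a replacement for it.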
Theorem 4.
For a weighted sequence of length $n$ over a constant-sized alphabet and a threshold $\frac{1}{z}$, there is a weighted index of size $O(nz)$ that can be constructed in $O(nz)$ time and answers decision and counting queries in $O(m)$ time and reporting queries in $O(m+Occ)$ time.
Other applications of the weighted index mentioned in [3] include -time computation of the weighted prefix table and of all covers of a weighted sequence. Our weighted index can be used in both.
7 Approximate Weighted Index
Now let us proceed to the solution of the Approximate Weighted Indexing problem. We are to answer queries for a pattern and a probability threshold and are allowed to report occurrences with probability , for a given value of . Let us recall that [6] solve this problem in space (with construction time) with -time queries, assuming that holds in all queries. Our techniques lead to a substantial improvement over the complexities of this index.
Assume that the query is for a pattern and a threshold . If , then the query is trivial as all the positions in can be reported. Henceforth, we assume that .
Let us consider a -estimation for the weighted sequence with . Let . By Definition 1, we can return position as an occurrence of based on whether ; this is shown in the following lemma.
Lemma 3.
If , then . If , then .
Proof.
Assume that . Then
[TABLE]
Now assume that . As , this concludes that , which is equivalent to . ∎
Thus our approximate weighted index for is the weighted index for constructed for . To obtain the desired accuracy, it suffices to find the node in the property suffix tree that corresponds to and report all positions in such that there are at least leaves in the subtree of labelled with the position . Let us show that this can be done by augmenting the weighted index by a data structure for (top-) document retrieval.
A version of the document retrieval problem (see Section 4.1 in [18]) can be stated operationally as follows. We are given a compact trie with leaves, each leaf labelled with a document number being a positive integer up to . (Usually is a suffix tree of a collection of documents.) Given a pattern , let be the locus of . Our goal is to report subsequent documents whose numbers occur most frequently in the leaves of the subtree of until the process of reporting is interrupted. In [18] a data structure of size is shown that, given the node , reports top-scoring documents in time. The construction time of the data structure is .
We can augment our property suffix tree with this data structure with the document numbers being the labels of terminals (we can create a separate leaf for each label). This gives . To find the documents with at least occurrences, we compute by doubling the smallest such that the last of the top documents reported has less than occurrences. The number of documents reported in the last step of the doubling search will be at most and the total number will not exceed .
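The doubling search described above can be sketched as follows. The sketch assumes a hypothetical oracle `top_k(k)` that returns up to `k` pairs `(document, count)` in non-increasing order of count for the subtree of the located node; in the text this role is played by the document retrieval structure of [18], which is replaced here by a simple counter-based stand-in for illustration only.

```python
from collections import Counter


def make_top_k(leaf_labels):
    """Stand-in oracle: ranks documents by frequency among the given leaf labels."""
    ranked = Counter(leaf_labels).most_common()  # sorted by decreasing count
    return lambda k: ranked[:k]


def documents_with_min_count(top_k, threshold):
    """Report all documents with at least `threshold` occurrences by doubling k."""
    k = 1
    while True:
        top = top_k(k)
        # Stop once the oracle runs out of documents or the last reported
        # document falls below the threshold.
        if len(top) < k or top[-1][1] < threshold:
            return [doc for doc, count in top if count >= threshold]
        k *= 2


# Hypothetical leaf labels in the subtree of the located node.
oracle = make_top_k([3, 3, 3, 1, 1, 2, 5, 5, 5, 5])
frequent = documents_with_min_count(oracle, 2)
```

Because `k` doubles in every round, the number of documents requested in the last round is at most twice the number of documents that qualify plus one, which is what keeps the total work proportional to the output size up to a constant factor.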
Theorem 5.
For a weighted sequence of length over a constant-sized alphabet and parameter , the Approximate Weighted Indexing problem can be solved in space with -time queries. The construction time is .
8 Randomised Construction with Greater Space Usage
A symbol of a weighted sequence can be interpreted as a probability distribution on , and the whole sequence can be interpreted as a product distribution on strings of length over . In this setting, if , i.e., is a random string with distribution , then, for any position and string , we have . This interpretation can be used to provide a randomised construction of families of strings with properties equivalent to the weighted sequence in a certain sense, weaker than the one used in Definition 1.
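As a small illustration of this interpretation, the probability that a string occurs at a given position is the product of the corresponding letter probabilities. The sketch below assumes a weighted sequence stored as a list of dictionaries mapping letters to exact `Fraction` probabilities, uses 0-based positions, and uses hypothetical names (`occurrence_probability`, `occurs`); none of this is taken from the paper's implementation.

```python
from fractions import Fraction


def occurrence_probability(X, i, P):
    """Probability that P occurs at 0-based position i of the weighted sequence X."""
    if i + len(P) > len(X):
        return Fraction(0)
    prob = Fraction(1)
    for offset, letter in enumerate(P):
        # Letters missing from a position are treated as having probability 0.
        prob *= X[i + offset].get(letter, Fraction(0))
    return prob


def occurs(X, i, P, z):
    """True if P occurs at position i with probability at least 1/z."""
    return occurrence_probability(X, i, P) * z >= 1


# Hypothetical weighted sequence of length 3 over the alphabet {a, b}.
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
]
p = occurrence_probability(X, 0, "aab")
```

Exact rational arithmetic is used here to avoid rounding issues when comparing against the threshold; a practical implementation would more likely work with logarithms or floating-point numbers.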
Lemma 4.
There is a randomised algorithm which, given a weighted sequence of length and a threshold parameter , in time constructs a family of strings with properties such that if and only if . It succeeds with high probability ( for arbitrarily large constant ).
Proof.
We randomly sample strings . Formally, these are independent random variables with distribution . The properties are specified so that is the longest prefix of with .
This way, implies . On the other hand, if , then, since , we have:
[TABLE]
There are at most pairs satisfying (this is the bound for the sum of lengths of all strings in the sets from Section 3). Consequently, the resulting family has the required property with probability at least . ∎
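The sampling step of this proof can be sketched as follows. The number of samples is left as a parameter `num_samples`, since the bound used in the lemma is not reproduced here; the representation of the weighted sequence (a list of letter-to-probability dictionaries), the 0-based exclusive-end convention for the property table, and all function names are assumptions made for illustration. The property computation is deliberately naive (quadratic time) and does not reflect the running time claimed in the lemma.

```python
import random
from fractions import Fraction


def sample_string(X, rng):
    """Draw one string from the product distribution defined by X."""
    letters = []
    for dist in X:
        symbols = sorted(dist)
        weights = [float(dist[c]) for c in symbols]
        letters.append(rng.choices(symbols, weights=weights)[0])
    return "".join(letters)


def longest_solid_prefix_ends(X, S, z):
    """pi[i] = exclusive end of the longest prefix of S[i:] whose
    probability in X is at least 1/z (naive quadratic computation)."""
    n = len(S)
    pi = []
    for i in range(n):
        prob = Fraction(1)
        end = i
        while end < n:
            prob *= X[end].get(S[end], 0)
            if prob * z < 1:
                break
            end += 1
        pi.append(end)
    return pi


def sample_family(X, z, num_samples, seed=0):
    """Sample `num_samples` strings and attach a property table to each."""
    rng = random.Random(seed)
    family = []
    for _ in range(num_samples):
        S = sample_string(X, rng)
        family.append((S, longest_solid_prefix_ends(X, S, z)))
    return family
```

A family produced this way can be passed to the same index construction as a deterministic one; only the guarantee changes, since it holds with high probability rather than always.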
We can directly use the same methods as in Section 6 to construct a weighted index from the family of strings constructed in Lemma 4. The space complexity of the resulting index is worse than the one in Theorem 4 by a factor of and the construction is randomised.
Corollary 6.
There is a data structure of size for the Weighted Indexing problem which answers queries in optimal time. It can be constructed using a randomised -time algorithm which returns a valid weighted index with high probability.
The same type of construction can be used to obtain an approximate weighted index. To this end, we need a stronger equivalence property of a string family and a greater number of sampled strings to satisfy this property.
Lemma 5.
There is a randomised algorithm which, given a weighted sequence of length and a parameter , in time constructs a family of strings with properties such that for every position and string . It succeeds with high probability ( for arbitrarily large constant ).
Proof.
We randomly sample strings . The properties satisfy that is the longest prefix of such that .
Observe that if , then . On the other hand, if , then . Consequently, Hoeffding’s inequality [10] implies
[TABLE]
There are at most such pairs , so the family satisfies the required condition with probability at least , as claimed. ∎
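For reference, the standard form of Hoeffding's inequality for bounded independent variables is recalled below. This is the textbook statement, not necessarily the exact form in which it is applied in the proof above; the symbols $Y_j$, $k$, and $t$ are introduced here only for illustration. In the setting of the proof, $Y_j$ would naturally be the indicator of the event that the pattern occurs at the given position in the $j$-th sampled string.

```latex
% Hoeffding's inequality (standard form).
% Assumption: Y_1, \dots, Y_k are independent random variables with values in [0,1].
\[
  \Pr\!\left[\,\left|\frac{1}{k}\sum_{j=1}^{k} Y_j
      - \mathbb{E}\!\left[\frac{1}{k}\sum_{j=1}^{k} Y_j\right]\right| \ge t\,\right]
  \;\le\; 2\exp\!\left(-2kt^{2}\right)
  \qquad \text{for every } t > 0 .
\]
```

Choosing $t$ proportional to the allowed error and $k$ logarithmic in the number of relevant pairs makes the right-hand side small enough for a union bound over all such pairs.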
We can use this family of strings to construct an approximate weighted index using top- document retrieval just as in Section 7. We arrive at the following construction with space complexity greater than the one from Theorem 5 by a factor of (and has a randomised construction).
Corollary 7.
There is a data structure of size which solves the Approximate Weighted Indexing problem with -time queries. It can be constructed using a randomised -time algorithm which returns a valid approximate weighted index with high probability.
9 Conclusions
In this article we present an efficient index for Weighted Pattern Matching along with new combinatorial insights into the nature of weighted sequences. We have produced an implementation of the index (see https://bitbucket.org/kociumaka/weighted_index) that we have validated for correctness and efficiency against known weighted pattern matching algorithms [15, 4, 5]. Our implementation supports decision, counting, and reporting variants of queries; however, only decision operations were implemented in worst-case optimal time.
Let us mention that our results can be extended to integer alphabets , i.e., , without influencing the space and construction time. We have omitted the description of this extension and preferred to focus on the basic case of a constant-sized alphabet that is also most relevant in practice.
Finally, our ideas can be used to improve the solution for the Generalised Weighted Indexing problem from [6]. They use a notion of special weighted sequences in which each position contains at most one letter with a positive probability. (In this case the assumption that the probabilities sum up to 1 at each position is waived.) In [6] the input weighted sequence is transformed using the reduction of [2] into a special weighted sequence of length that preserves the set of maximal solid factors. In the special weighted sequence, a query for a pattern under the probability threshold is answered in time.
Our -estimation can be transformed into a special weighted sequence of length that also preserves the set of solid factors. We simply concatenate the strings, taking the letter probabilities from the respective positions in , and split the concatenated parts with a zero-probability position. This gives a more space-efficient reduction that can be used in the data structure of [6].
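The reduction described above can be sketched as follows. The sketch assumes that the weighted sequence is a list of letter-to-probability dictionaries and that the strings of the estimation all have the same length as the sequence; the separator symbol `#` and the function names are hypothetical and only serve the illustration.

```python
from fractions import Fraction

# Hypothetical zero-probability position used to split the concatenated strings.
SEPARATOR = ("#", Fraction(0))


def to_special_weighted_sequence(X, strings):
    """Concatenate the strings, attaching to each letter its probability at the
    respective position of X, with a zero-probability separator in between."""
    special = []
    for index, S in enumerate(strings):
        if index > 0:
            special.append(SEPARATOR)
        for pos, letter in enumerate(S):
            special.append((letter, Fraction(X[pos].get(letter, 0))))
    return special


def factor_probability(special, start, length):
    """Product of the probabilities of `length` consecutive positions."""
    prob = Fraction(1)
    for _, p in special[start:start + length]:
        prob *= p
    return prob
```

Since every separator has probability zero, any factor that spans two of the concatenated strings has probability zero and therefore can never be reported as solid.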
Corollary 8.
For a weighted sequence of length over an integer alphabet, the Generalised Weighted Indexing problem can be solved with -time queries with an index of size .
Acknowledgement
We thank an anonymous referee of the previous version of the paper for the idea of a simple randomised construction. We also thank Tsvi Kopelowitz for bringing our attention to the multitude of existing solutions to the Property Indexing problem.
- [1] Charu C. Aggarwal and Philip S. Yu. A survey of uncertain data algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, 21(5):609–623, 2009.
- [2] Amihood Amir, Eran Chencinski, Costas S. Iliopoulos, Tsvi Kopelowitz, and Hui Zhang. Property matching and weighted matching. Theoretical Computer Science, 395(2-3):298–310, April 2008.
- [3] Carl Barton, Tomasz Kociumaka, Solon P. Pissis, and Jakub Radoszewski. Efficient index for weighted sequences. In Roberto Grossi and Moshe Lewenstein, editors, Combinatorial Pattern Matching, CPM 2016, volume 54 of LIPIcs, pages 4:1–4:13, Dagstuhl, Germany, 2016. Schloss Dagstuhl–Leibniz-Zentrum für Informatik.
- [4] Carl Barton, Chang Liu, and Solon P. Pissis. Fast average-case pattern matching on weighted sequences, 2015.
- [5] Carl Barton, Chang Liu, and Solon P. Pissis. On-line pattern matching on uncertain sequences and applications. In T.-H. Hubert Chan, Minming Li, and Lusheng Wang, editors, Combinatorial Optimization and Applications, COCOA 2016, volume 10043 of LNCS, pages 547–562. Springer, 2016.
- [6] Sudip Biswas, Manish Patil, Sharma V. Thankachan, and Rahul Shah. Probabilistic threshold indexing for uncertain strings. In Evaggelia Pitoura, Sofian Maabout, Georgia Koutrika, Amélie Marian, Letizia Tanca, Ioana Manolescu, and Kostas Stefanidis, editors, 19th International Conference on Extending Database Technology, EDBT 2016, pages 401–412. OpenProceedings.org, 2016.
- [7] Manolis Christodoulakis, Costas S. Iliopoulos, Laurent Mouchard, and Kostas Tsichlas. Pattern matching on weighted sequences. In Algorithms and Computational Methods for Biochemical and Evolutionary Networks, CompBioNets 2004, KCL publications, 2004.
- [8] Nilesh N. Dalvi, Christopher Ré, and Dan Suciu. Probabilistic databases: diamonds in the dirt. Communications of the ACM, 52(7):86–94, 2009.
