Indexing Weighted Sequences: Neat and Efficient

Carl Barton, Tomasz Kociumaka, Chang Liu, Solon P. Pissis, and Jakub Radoszewski

arXiv:1704.07625·cs.DS·August 28, 2017

TL;DR

This paper introduces a simple, efficient index for weighted sequences that answers pattern matching queries in optimal time. It matches the complexities of the best previous index with a significantly simpler construction.

Contribution

A novel, straightforward construction of an index for weighted sequences that matches the size, construction time, and query time of the best previous index while being significantly simpler.

Findings

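1. Constructed an $\mathcal{O}(nz)$-sized index for weighted sequences in $\mathcal{O}(nz)$ time.

2. Achieved optimal $\mathcal{O}(m+\mathit{Occ})$ query time.

3. Matched the complexities of the best previous index with a simpler construction, and improved the space of the approximate and generalised variants.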

Abstract

In a weighted sequence, for every position of the sequence and every letter of the alphabet a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example, in molecular biology where they are known under the name of Position-Weight Matrices. Given a probability threshold $\frac{1}{z}$, we say that a string $P$ of length $m$ occurs in a weighted sequence $X$ at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+m-1$ in $X$ is at least $\frac{1}{z}$. In this article, we consider an indexing variant of the problem, in which we are to preprocess a weighted sequence to answer multiple pattern matching queries. We present an $\mathcal{O}(nz)$-time construction of an $\mathcal{O}(nz)$-sized index for a weighted sequence of length $n$ over a constant-sized alphabet…

Tables

Table 1: A weighted sequence $X$ of length 6 over $\Sigma=\{\mathtt{A},\mathtt{B}\}$.

$i$	1	2	3	4	5	6
$p_{i}^{(X)} (𝙰)$	1	$\frac{1}{2}$	$\frac{3}{4}$	$\frac{4}{5}$	$\frac{1}{2}$	$\frac{1}{4}$
$p_{i}^{(X)} (𝙱)$	0	$\frac{1}{2}$	$\frac{1}{4}$	$\frac{1}{5}$	$\frac{1}{2}$	$\frac{3}{4}$

Table 2: To the left: a 4-estimation $\SS$ of the weighted sequence $X$ from Table 1. To the right: all the strings that occur at position $i=3$ in $X$ together with the probabilities of occurrence in $X$ and occurrences in $\SS$.

$i$	1	2	3	4	5	6
$S_{1} [i]$	$𝙰$	$𝙰$	$𝙰$	$𝙰$	$𝙰$	$𝙰$
$π_{1} [i]$	2	2	3	4	5	6
$S_{2} [i]$	$𝙰$	$𝙰$	$𝙰$	$𝙰$	$𝙰$	$𝙱$
$π_{2} [i]$	4	4	5	6	6	6
$S_{3} [i]$	$𝙰$	$𝙱$	$𝙰$	$𝙰$	$𝙱$	$𝙱$
$π_{3} [i]$	4	4	5	6	6	6
$S_{4} [i]$	$𝙰$	$𝙱$	$𝙱$	$𝙱$	$𝙱$	$𝙱$
$π_{4} [i]$	2	2	3	3	5	6

Table 3: The sets $\mathcal{M}_{i}$ for the weighted sequence $X$ from Table 1 with $z=4$. Perfect matchings of compatible strings between $\mathcal{M}_{i}$ and $\mathcal{M}_{i+1}$ are marked. The first letters of the strings form the 4-estimation from Table 2 and the length of the $j$-th string in $\mathcal{M}_{i}$ corresponds to $\pi_{j}[i]-i+1$.

$ℳ_{1}$		$ℳ_{2}$		$ℳ_{3}$		$ℳ_{4}$		$ℳ_{5}$		$ℳ_{6}$
$\underline{𝙰} 𝙰$	—	$\underline{𝙰}$	—	$\underline{𝙰}$	—	$\underline{𝙰}$	—	$\underline{𝙰}$	—	$\underline{𝙰}$
$\underline{𝙰} 𝙰𝙰𝙰$	—	$\underline{𝙰} 𝙰𝙰$	—	$\underline{𝙰} 𝙰𝙰$	—	$\underline{𝙰} 𝙰𝙱$	—	$\underline{𝙰} 𝙱$	—	$\underline{𝙱}$
$\underline{𝙰} 𝙱𝙰𝙰$	—	$\underline{𝙱} 𝙰𝙰$	—	$\underline{𝙰} 𝙰𝙱$	—	$\underline{𝙰} 𝙱𝙱$	—	$\underline{𝙱} 𝙱$	—	$\underline{𝙱}$
$\underline{𝙰} 𝙱$	—	$\underline{𝙱}$	—	$\underline{𝙱}$	—	$ε$	—	$\underline{𝙱}$	—	$\underline{𝙱}$

Equations

$$\mathit{Count}_{\SS}(P,i)=\left|\{j\,:\,i\in\mathit{Occ}_{\pi_{j}}(P,S_{j})\}\right|$$

$$\mathit{Prob}_{X}(P,i)=\prod_{j=1}^{|P|}p^{(X)}_{i+j-1}(P[j]).$$

$$\frac{1}{z}\mathit{Count}_{\SS}(P,i)\leq\mathit{Prob}_{X}(P,i)<\frac{1}{z}\mathit{Count}_{\SS}(P,i)+\frac{1}{z}.$$

$$t_{i}(P)=\left\lfloor\mathit{Prob}_{X}(P,i)\,z\right\rfloor\quad\text{and}\quad m_{i}(P)=t_{i}(P)-\sum_{c\in\Sigma}t_{i}(Pc)$$

$$\mathit{Prob}_{X}(P,i)\geq\frac{1}{z}\mathit{Count}_{\SS}(P,i)\geq\frac{1}{z}\left\lfloor\frac{z}{z^{\prime}}\right\rfloor\geq\frac{1}{z}\left(\frac{z}{z^{\prime}}-1\right)=\frac{1}{z^{\prime}}-\epsilon.$$

$$\mathbb{P}\left[\mathit{Count}_{\SS}(P,i)=0\right]=(1-\mathit{Prob}_{X}(P,i))^{k}\leq e^{-k\,\mathit{Prob}_{X}(P,i)}\leq e^{-(c+2)\ln(nz)}=\frac{1}{(nz)^{c+2}}.$$

$$\mathbb{P}\left[\left|\mathit{Prob}_{X}(P,i)-\frac{1}{k}\mathit{Count}_{\SS}(P,i)\right|>\epsilon\right]\leq 2e^{-\epsilon^{2}k}=2e^{-(c+2)\ln\frac{n}{\epsilon}}\leq 2\left(\frac{\epsilon}{n}\right)^{c+2}.$$

Full text

Indexing Weighted Sequences: Neat and Efficient

Carl Barton

European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK

Tomasz Kociumaka

Institute of Informatics, University of Warsaw, Warsaw, Poland

[kociumaka,jrad]@mimuw.edu.pl

Chang Liu

Department of Informatics, King’s College London, London, UK

[chang.2.liu,solon.pissis]@kcl.ac.uk

Solon P. Pissis

Department of Informatics, King’s College London, London, UK

[chang.2.liu,solon.pissis]@kcl.ac.uk

Jakub Radoszewski

Institute of Informatics, University of Warsaw, Warsaw, Poland

[kociumaka,jrad]@mimuw.edu.pl

Abstract

In a weighted sequence, for every position of the sequence and every letter of the alphabet a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example, in molecular biology where they are known under the name of Position-Weight Matrices. Given a probability threshold $\frac{1}{z}$ , we say that a string $P$ of length $m$ occurs in a weighted sequence $X$ at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+m-1$ in $X$ is at least $\frac{1}{z}$ . In this article, we consider an indexing variant of the problem, in which we are to preprocess a weighted sequence to answer multiple pattern matching queries. We present an $\mathcal{O}(nz)$ -time construction of an $\mathcal{O}(nz)$ -sized index for a weighted sequence of length $n$ over a constant-sized alphabet that answers pattern matching queries in optimal, $\mathcal{O}(m+\mathit{Occ})$ time, where $\mathit{Occ}$ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of $\lfloor z\rfloor$ special strings that carries the information about all the strings that occur in the weighted sequence with a sufficient probability. We obtain a weighted index with the same complexities as in the most efficient previously known index by Barton et al. [3], but our construction is significantly simpler. The most complex algorithmic tool required in the basic form of our index is the suffix tree which we use to develop a new, more straightforward index for the so-called property matching problem. We provide an implementation of our data structure. Our construction allows us also to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. [6] and an improvement of the space complexity of their general index.

1 Introduction

We consider a type of uncertain sequence called a weighted sequence. In a weighted sequence every position contains a subset of the alphabet and every letter of the alphabet is associated with a probability of occurrence such that the sum of probabilities at each position equals 1.

Weighted sequences are common in a wide range of applications: (i) data with imprecise sensor measurements; (ii) flexible sequence modelling, such as binding profiles of DNA sequences; (iii) observations that are private and thus sequences of observations may have artificial uncertainty introduced deliberately (see [1] for a survey). Pattern matching (or substring matching) is a core operation in a wide variety of applications including genome assembly, computer virus detection, database search and short read alignment. Many of the applications of pattern matching generalise immediately to the weighted case as much of this data is more commonly uncertain (e.g. reads with quality scores) than certain. In particular probabilistic databases have been a very active area of research in recent years; see e.g. [8]. A common assumption in practice is that the alphabet of weighted sequences is constant since the most commonly studied alphabet is $\Sigma=\{\mathtt{A},\mathtt{C},\mathtt{G},\mathtt{T}\}$ .

In the Weighted Pattern Matching (WPM) problem we are given a string $P$ called a pattern, a weighted sequence $X$ called a text, both over an alphabet $\Sigma$ , and a threshold probability $\frac{1}{z}$ . The task is to find all positions $i$ in $X$ where the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in $X$ is at least $\frac{1}{z}$ . Each such position is called an occurrence of the pattern; we also say that the fragment and the pattern match.

In this article, we consider the indexing (or off-line) version of the WPM problem, called Weighted Indexing. Here we are given a text being a weighted sequence and we are asked to construct a data structure (called an index) to provide efficient operations for answering WPM queries related to the text. We also consider other variants of the indexing problem. In the Approximate Weighted Indexing problem, given a pattern and a threshold $z^{\prime}$ , we are to report all occurrences of the pattern with probability at least $\frac{1}{z^{\prime}}$ but we may also report additional occurrences with probability $\frac{1}{z^{\prime}}-\epsilon$ , for a pre-selected value of $\epsilon>0$ . In the Generalised Weighted Indexing problem we are to construct a data structure that allows for WPM queries to be answered for any threshold $z^{\prime}$ with $z^{\prime}\leq z$ .

A problem that is known to be closely related to the Weighted Indexing problem is Property Indexing. In this problem, we are given a string $S$ called the text and a hereditary property $\Pi$ , which is a family of integer intervals contained in $\{1,\ldots,|S|\}$ (hereditary means that it is closed under subintervals). Our goal is to preprocess the text so that, for a query string $P$ , we can report all occurrences of $P$ in $S$ which, interpreted as intervals, belong to $\Pi$ . The property $\Pi$ can be represented in $\mathcal{O}(|S|)$ space using an array $\pi[1\mathinner{.\,.}|S|]$ such that the longest interval starting at position $i$ is $\{i,\ldots,\pi[i]\}$ .

In each of the indexing problems, we denote the length of the text by $n$ , the length of a query pattern by $m$ , and the number of occurrences of the pattern in the text by $\mathit{Occ}$ .

1.1 Previous Results

An $\mathcal{O}(n\log m)$ -time solution for the Weighted Pattern Matching problem based on the Fast Fourier Transform was proposed in [7, 19]. Recently, an $\mathcal{O}(n\log z)$ -time solution using the suffix array and lookahead scoring was presented in [15]. The average-case complexity of the WPM problem has also been studied, and several fast algorithms have been presented, both linear [4] and sub-linear [5] on average.

The Weighted Indexing problem was first considered by Iliopoulos et al. [12], who introduced a data structure called weighted suffix tree allowing optimal $\mathcal{O}(m+\mathit{Occ})$ -time queries. The construction time and size of that data structure were, however, $\mathcal{O}(n|\Sigma|^{z\log z})$ .

Amir et al. [2] reduced the Weighted Indexing problem to the Property Indexing problem in a text of length $\mathcal{O}(nz^{2}\log z)$ . For the latter, they proposed a solution with $\mathcal{O}(n\log\log n)$ preprocessing time and optimal $\mathcal{O}(m+\mathit{Occ})$ query time. Later it was shown that the Property Indexing problem can be solved in linear time; see [13, 14] (see also [16]). This led to a solution to the Weighted Indexing problem with index size and construction time $\mathcal{O}(nz^{2}\log z)$ , preserving optimal query time.

These results were recently improved by some of the authors in [3], where they proposed an $\mathcal{O}(nz)$ -sized data structure for the Weighted Indexing problem that can be constructed also in $\mathcal{O}(nz)$ time. The query time is still $O(m+\mathit{Occ})$ . The authors proposed several applications of their index.

Biswas et al. [6] presented a data structure that solves the Approximate Weighted Indexing problem in $\mathcal{O}(\frac{1}{\epsilon}nz^{2})$ space (with $\Omega(\frac{1}{\epsilon}n^{2}z^{2})$ construction time) with $\mathcal{O}(m+\mathit{Occ})$ -time queries; here $\mathit{Occ}$ denotes the number of occurrences reported. They also proposed a data structure for the Generalised Weighted Indexing problem with $\mathcal{O}(nz^{2}\log z)$ space and $\mathcal{O}(m+m\cdot\mathit{Occ})$ query time. The construction time is not mentioned, but a direct construction of their index works in $\Omega(n^{2}z^{2})$ time. Moreover, they also consider the problem of document listing for weighted sequences.

1.2 Our Contribution

We present a new $\mathcal{O}(nz)$ -time construction of an $\mathcal{O}(nz)$ -sized data structure for the Weighted Indexing problem that answers queries in optimal $\mathcal{O}(m+\mathit{Occ})$ time. Our index is based on a novel observation that one can always construct a family of $\lfloor z\rfloor$ special strings of length $n$ that carries all the information about all the strings that occur in the weighted sequence. This yields a significantly simpler construction than in the previous index [3] preserving all of its applications. As a by-product, we obtain an optimal solution to the Property Indexing problem that avoids complex tools used in the previous solutions [2, 13, 14, 16]. We provide a proof-of-concept implementation of our index that was validated for correctness and efficiency. We also discuss an even simpler randomised construction with worse space complexity and construction time of the index.

Our approach lets us significantly improve upon the variants of the weighted index proposed in [6]. In the Approximate Weighted Indexing problem, we obtain $\mathcal{O}(\frac{n}{\epsilon})$ space and $\mathcal{O}(\frac{n}{\epsilon}\log{\frac{n}{\epsilon}})$ construction time, preserving the query time. We also improve the space usage in the Generalised Weighted Indexing problem to $\mathcal{O}(nz)$ , also in the document listing variant.

1.3 Comparison of Our Techniques with the Previous Work

Two main building blocks of our weighted index are a construction of a family of $\left\lfloor z\right\rfloor$ special strings with properties and a solution to the Property Indexing problem.

The family of strings that we construct has the same set of patterns occurring at each position as the weighted text $X$ and, moreover, the number of occurrences of each pattern at each position is a good estimate of the probability of its occurrence at this position in $X$ . The former property is used in the construction of a weighted index and the latter in the construction of an approximate weighted index. The existence of this family is not immediate. However, its proof is not involved, and we design an $\mathcal{O}(nz)$ -time elementary construction algorithm based on tries (also known as radix trees). In the end we show that a simple generation of a number of strings according to the probability distribution implied by the weighted sequence with high probability yields a family of strings that also well describes the set of patterns in $X$ . However, the number of strings that one needs to generate is much larger. Excluding the previous, exponential-size index of Iliopoulos et al. [12], previous work includes the $\mathcal{O}(nz^{2}\log z)$ -space index of Amir et al. [2] and $\mathcal{O}(nz)$ -space index by Barton et al. [3]. Amir et al. [2] show that, after a small modification of the weighted sequence, the set of maximal string patterns that occur in it has a total length $\mathcal{O}(nz^{2}\log z)$ . Barton et al. [3] show a representation of this set as a trie and apply Shibuya’s algorithm for suffix tree of a trie construction [20].

In our solution to the Property Indexing problem we construct a data structure called property suffix tree being the suffix tree in which the nodes corresponding to factors that do not belong to the property are trimmed. The algorithm makes only several traversals of the suffix tree and uses an amortisation argument similar to the one from Ukkonen’s suffix tree construction [21]. Very similar data structures were constructed by Amir et al. [2] and Kopelowitz [16]. Amir et al. [2] use a heavy machinery of weighted ancestor queries and a fancy algorithm to mark the properties on edges of the suffix tree. Kopelowitz [16] designs an algorithm for a dynamic setting, but also mentions its static application. He uses amortisation ideas similar to ours, but his construction is more involved due to its generality and also utilises less basic longest common extension queries (i.e., range minimum queries). The solution to the Property Indexing problem that was developed by Iliopoulos et al. [13] and clarified by Juan et al. [14] constructs a different data structure that, in a sense, shifts the hardness of the problem from the construction to the queries. It also requires range minimum queries.

Our techniques enable us immediately to answer decision queries of a weighted index. To answer counting and reporting queries in optimal time, we require coloured range counting and reporting data structures in the property suffix tree that were already used for this purpose by Barton et al. [3]. In our solution to the Approximate Weighted Indexing, we need to augment the property suffix tree with a data structure for top- $k$ document retrieval queries. The same type of queries were used in the previous solution by Biswas et al. [6], however, not as a black box. Moreover, they also use the less efficient reduction of [2] which caused their data structure to use $\mathcal{O}(\frac{1}{\epsilon}nz^{2})$ space, assuming that $z^{\prime}\leq z$ in each query. Finally, we improve the space complexity of the generalised weighted index of Biswas et al. [6] by plugging in our construction of $\left\lfloor z\right\rfloor$ special strings.

1.4 Structure of the Paper

In Section 3 we present a combinatorial construction of the special family of $\left\lfloor z\right\rfloor$ strings. An efficient implementation of the construction of this family based on tries is proposed in Section 4. In Section 5 the new optimal solution for the Property Indexing problem is presented. Using the construction and the property index, we obtain our weighted index in Section 6 and, with the aid of an auxiliary tool, an approximate weighted index in Section 7. Alternative randomised constructions of the two indexes with worse parameters are discussed in Section 8. Our improvement to the Generalised Weighted Index and our C++ implementation are briefly discussed in Section 9.

2 Preliminaries

2.1 Strings and Property Indexing

A string $S$ over an alphabet $\Sigma$ is a finite sequence of letters from $\Sigma$ . By $n=|S|$ we denote the length of $S$ and by $S[i]$ , for $1\leq i\leq n$ , we denote the $i$ -th letter of $S$ . By $S[i{\mathinner{.\,.}}j]$ we denote the string $S[i]\ldots S[j]$ called a factor of $S$ (if $i>j$ , then the factor is an empty string). A factor is called a prefix if $i=1$ and a suffix if $j=n$ . We say that a string $P$ occurs at position $i$ in $S$ if $P=S[i\mathinner{.\,.}i+|P|-1]$ .

A property $\Pi$ of $S$ is a hereditary collection of integer intervals contained in $\{1,\ldots,n\}$ . For simplicity, we represent every property $\Pi$ with an array $\pi[1\mathinner{.\,.}|S|]$ such that the longest interval $I\in\Pi$ starting at position $i$ is $\{i,\ldots,\pi[i]\}$ . Observe that $\pi$ can be an arbitrary array satisfying $\pi[i]\in\{i-1,\ldots,n\}$ and $\pi[1]\leq\pi[2]\leq\cdots\leq\pi[n]$ . For a string $P$ , by $\mathit{Occ}_{\pi}(P,S)$ we denote the set of occurrences $i$ of $P$ in $S$ such that $i+|P|-1\leq\pi[i]$ . These notions lead us to the statement of the following problem.

Problem 1 (Property Indexing).

Input: A string $S$ of length $n$ over an alphabet $\Sigma$ and an array $\pi$ representing a property $\Pi$ .

Queries: For a given pattern string $P$ of length $m$ , compute $|\mathit{Occ}_{\pi}(P,S)|$ or report all elements of $\mathit{Occ}_{\pi}(P,S)$ .
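
To make the query semantics concrete, the following minimal Python sketch computes $\mathit{Occ}_{\pi}(P,S)$ by brute force. It only illustrates the definition and is not an index: the function name `occ_pi` and the list representation of $\pi$ are assumptions made for this example, and each query takes $\mathcal{O}(nm)$ time.

```python
def occ_pi(pattern, text, pi):
    """Return the 1-based positions i at which `pattern` occurs in `text`
    and the occurrence respects the property, i.e. i + |pattern| - 1 <= pi[i].

    `pi` is a Python list holding pi[1..n] (so pi[i] is stored at index i-1).
    The pattern is assumed to be non-empty.
    """
    m, n = len(pattern), len(text)
    result = []
    for i in range(1, n - m + 2):              # candidate 1-based start positions
        occurs = text[i - 1:i - 1 + m] == pattern
        if occurs and i + m - 1 <= pi[i - 1]:
            result.append(i)
    return result


# S_4 and pi_4 from Table 2.
print(occ_pi("B", "ABBBBB", [2, 2, 3, 3, 5, 6]))   # [2, 3, 5, 6]
```

Position 4 is excluded in the example because $\pi_{4}[4]=3<4$, even though the letter at that position matches.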

Let us consider an indexed family $\SS=(S_{j},\pi_{j})_{j=1}^{k}$ of strings $S_{j}$ with properties $\pi_{j}$ . For a string $P$ and an index $i$ , by

$$\mathit{Count}_{\SS}(P,i)=\left|\{j\,:\,i\in\mathit{Occ}_{\pi_{j}}(P,S_{j})\}\right|$$

we denote the total number of occurrences of $P$ at the position $i$ in the strings $S_{1},\ldots,S_{k}$ that respect the properties.

2.2 Weighted Sequences and Weighted Indexing

A weighted sequence $X=x_{1}x_{2}\ldots x_{n}$ of length $|X|=n$ over an alphabet $\Sigma$ is a sequence of sets of pairs of the form $x_{i}=\{(c,\ p^{(X)}_{i}(c))\ :\ c\in\Sigma\}$ . Here, $p^{(X)}_{i}(c)$ is the occurrence probability of the letter $c$ at the position $i\in\{1,\ldots,n\}$ . These values are non-negative and sum up to 1 for a given $i$ . An example of a weighted sequence is shown in Table 1.

The probability of matching of a string $P$ at position $i$ of a weighted sequence $X$ equals

$$\mathit{Prob}_{X}(P,i)=\prod_{j=1}^{|P|}p^{(X)}_{i+j-1}(P[j]).$$

We say that a string $P$ occurs in $X$ at position $i$ if $\mathit{Prob}_{X}(P,i)\geq\frac{1}{z}$ . We also say that $P$ is a solid factor of $X$ (starting, occurring) at position $i$ . By $\mathit{Occ}_{\frac{1}{z}}(P,X)$ we denote the set of all positions where $P$ occurs in $X$ . The main problem in scope can be formulated as follows.
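
As a small illustration of these definitions, the sketch below evaluates $\mathit{Prob}_{X}(P,i)$ and lists $\mathit{Occ}_{\frac{1}{z}}(P,X)$ by direct scanning for the weighted sequence of Table 1. The names `prob` and `occurrences` and the list-of-dictionaries encoding of $X$ are assumptions of this example. Exact rationals stand in for the exact arithmetic assumed by the model of computation.

```python
from fractions import Fraction as F

# Table 1: X[i-1][c] stores p_i^{(X)}(c).
X = [
    {"A": F(1),    "B": F(0)},
    {"A": F(1, 2), "B": F(1, 2)},
    {"A": F(3, 4), "B": F(1, 4)},
    {"A": F(4, 5), "B": F(1, 5)},
    {"A": F(1, 2), "B": F(1, 2)},
    {"A": F(1, 4), "B": F(3, 4)},
]


def prob(X, P, i):
    """Prob_X(P, i) for a 1-based position i; 0 if P does not fit in X."""
    if i + len(P) - 1 > len(X):
        return F(0)
    result = F(1)
    for j, c in enumerate(P):                  # j = 0 .. |P|-1
        result *= X[i - 1 + j][c]
    return result


def occurrences(X, P, z):
    """All positions i with Prob_X(P, i) >= 1/z, found by brute force."""
    return [i for i in range(1, len(X) - len(P) + 2) if prob(X, P, i) * z >= 1]


print(prob(X, "AAB", 3))          # 3/10
print(occurrences(X, "AAB", 4))   # [3, 4]
```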

Problem 2 (Weighted Indexing).

Input: A weighted sequence $X$ of length $n$ over an alphabet $\Sigma$ and a threshold $\frac{1}{z}$ .

Queries: For a given pattern string $P$ of length $m$ , check if $\mathit{Occ}_{\frac{1}{z}}(P,X)\neq\emptyset$ (decision query), compute $|\mathit{Occ}_{\frac{1}{z}}(P,X)|$ (counting query), or report all elements of $\mathit{Occ}_{\frac{1}{z}}(P,X)$ (reporting query).

Our model of computations.

We assume the word-RAM model with word size $w=\Omega(\log(nz))$ . We consider the log-probability model of representations of weighted sequences in which probabilities can be multiplied exactly in $\mathcal{O}(1)$ time. We further assume that $|\Sigma|=\mathcal{O}(1)$ ; under this assumption a weighted sequence of length $n$ has a representation using $\mathcal{O}(n)$ space.

3 Existence of an Equivalent Family of Strings

In the definition below, we formalise the property of a string family that we aim to construct.

Definition 1.

We say that an indexed family $\SS=(S_{j},\pi_{j})_{j=1}^{\left\lfloor z\right\rfloor}$ containing strings $S_{j}$ of length $n$ is a $z$ -estimation of a weighted sequence $X$ of length $n$ if and only if, for every string $P$ and position $i\in\{1,\ldots,n\}$ , $\mathit{Count}_{\SS}(P,i)=\left\lfloor\mathit{Prob}_{X}(P,i)z\right\rfloor$ .

Note that a $z$ -estimation $\SS$ of a weighted sequence $X$ carries the information about all solid factors of $X$ : a string $P$ occurs in $X$ at position $i$ if and only if it occurs at position $i$ in at least one of the strings $S_{j}$ respecting its property $\pi_{j}$ . This observation will be used in the construction of our weighted index. Moreover, the value $\mathit{Count}_{\SS}(P,i)$ provides a good estimation of the probability $\mathit{Prob}_{X}(P,i)$ :

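$$\frac{1}{z}\mathit{Count}_{\SS}(P,i)\leq\mathit{Prob}_{X}(P,i)<\frac{1}{z}\mathit{Count}_{\SS}(P,i)+\frac{1}{z}.$$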

This will let us design an approximate weighted index. An example of a $z$ -estimation is shown in Table 2.
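
Definition 1 can be checked mechanically on small inputs. The sketch below tests, by exhaustive enumeration of all strings $P$, whether the family of Table 2 is a 4-estimation of the weighted sequence of Table 1. The helper names (`prob`, `count`, `is_z_estimation`) are assumptions of this example, and the exponential enumeration is meant only for toy instances.

```python
from fractions import Fraction as F
from itertools import product
from math import floor

# Table 1: X[i-1][c] stores p_i^{(X)}(c).
X = [
    {"A": F(1),    "B": F(0)},
    {"A": F(1, 2), "B": F(1, 2)},
    {"A": F(3, 4), "B": F(1, 4)},
    {"A": F(4, 5), "B": F(1, 5)},
    {"A": F(1, 2), "B": F(1, 2)},
    {"A": F(1, 4), "B": F(3, 4)},
]

# Table 2: pairs (S_j, pi_j), with pi_j[i] stored at index i-1.
FAMILY = [
    ("AAAAAA", [2, 2, 3, 4, 5, 6]),
    ("AAAAAB", [4, 4, 5, 6, 6, 6]),
    ("ABAABB", [4, 4, 5, 6, 6, 6]),
    ("ABBBBB", [2, 2, 3, 3, 5, 6]),
]


def prob(X, P, i):
    """Prob_X(P, i) for a string P that fits into X at position i."""
    result = F(1)
    for j, c in enumerate(P):
        result *= X[i - 1 + j][c]
    return result


def count(family, P, i):
    """Count(P, i): strings S_j with a property-respecting occurrence of P at i."""
    end = i + len(P) - 1
    return sum(1 for S, pi in family if end <= pi[i - 1] and S[i - 1:end] == P)


def is_z_estimation(X, family, z, alphabet="AB"):
    n = len(X)
    for i in range(1, n + 1):
        for length in range(n - i + 2):        # lengths 0 .. n-i+1
            for letters in product(alphabet, repeat=length):
                P = "".join(letters)
                if count(family, P, i) != floor(prob(X, P, i) * z):
                    return False
    return True


print(is_z_estimation(X, FAMILY, 4))   # True
```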

Below, we prove existence of a $z$ -estimation. An efficient construction is deferred to the next section.

For a fixed weighted sequence $X$ of length $n$ and a threshold $z$ , we can use compact notation:

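$$t_{i}(P)=\left\lfloor\mathit{Prob}_{X}(P,i)\,z\right\rfloor\quad\text{and}\quad m_{i}(P)=t_{i}(P)-\sum_{c\in\Sigma}t_{i}(Pc)$$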

for $i=1,\ldots,n$ . We start with an equivalent characterisation of $z$ -estimations of $X$ .

Observation 1.

A family $\SS=(S_{j},\pi_{j})_{j=1}^{\left\lfloor z\right\rfloor}$ is a $z$ -estimation of $X$ if and only if for each position $i$ , every string $P$ is a prefix of exactly $t_{i}(P)$ strings $S_{j}[i\mathinner{.\,.}\pi_{j}[i]]$ .

Next, we prove that this condition uniquely defines the multiset $\{S_{j}[i\mathinner{.\,.}\pi_{j}[i]]:1\leq j\leq\left\lfloor z\right\rfloor\}$ .

Lemma 1.

There exists a unique multiset $\mathcal{M}_{i}$ such that each string $P$ is a prefix of exactly $t_{i}(P)$ strings in $\mathcal{M}_{i}$ .

Proof.

Consider a multiset $\mathcal{M}_{i}$ satisfying the required condition and an arbitrary string $P$ . For each $c\in\Sigma$ , there are $t_{i}(Pc)$ strings in $\mathcal{M}_{i}$ in which the prefix $P$ is followed by the character $c$ . In the remaining $t_{i}(P)-\sum_{c\in\Sigma}t_{i}(Pc)$ strings in $\mathcal{M}_{i}$ , the prefix $P$ is not followed by any letter. Thus, the multiplicity of $P$ in $\mathcal{M}_{i}$ must be $m_{i}(P)$ . This implies uniqueness of $\mathcal{M}_{i}$ .

Observe that $t_{i}(P)\geq\sum_{c\in\Sigma}t_{i}(Pc)$ , because $\mathit{Prob}_{X}(P,i)\geq\sum_{c\in\Sigma}\mathit{Prob}_{X}(Pc,i)$ and the function $x\mapsto\left\lfloor xz\right\rfloor$ is superadditive. Consequently, we may define a multiset $\mathcal{M}_{i}$ using values $m_{i}(P)$ as multiplicities. It remains to prove that this multiset satisfies the required condition. For this, we consider strings $P$ in the order of decreasing lengths. The base case is trivial because strings $P$ longer than $X$ satisfy $\mathit{Prob}_{X}(P,i)=0$ . The inductive hypothesis yields that, for each $c\in\Sigma$ , the string $Pc$ is a prefix of $t_{i}(Pc)$ strings in $\mathcal{M}_{i}$ . Consequently, the string $P$ is a prefix of $m_{i}(P)+\sum_{c\in\Sigma}t_{i}(Pc)=t_{i}(P)$ strings in $\mathcal{M}_{i}$ , as claimed. ∎
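
The proof is constructive, and the multiset $\mathcal{M}_{i}$ can be computed directly from the values $t_{i}(P)$ and $m_{i}(P)$. The following sketch does so by a depth-first enumeration of the solid factors at position $i$. The function name `multiset_M` and the input encoding are assumptions of this example, and the sketch makes no attempt at the amortised bounds discussed later.

```python
from fractions import Fraction as F
from math import floor

# Table 1: X[i-1][c] stores p_i^{(X)}(c).
X = [
    {"A": F(1),    "B": F(0)},
    {"A": F(1, 2), "B": F(1, 2)},
    {"A": F(3, 4), "B": F(1, 4)},
    {"A": F(4, 5), "B": F(1, 5)},
    {"A": F(1, 2), "B": F(1, 2)},
    {"A": F(1, 4), "B": F(3, 4)},
]


def multiset_M(X, i, z, alphabet="AB"):
    """Return M_i as a sorted list in which P appears m_i(P) times."""
    n = len(X)
    result = []

    def visit(P, p):                           # p = Prob_X(P, i)
        extensions = []
        pos = i + len(P)                       # 1-based position of the next letter
        if pos <= n:
            for c in alphabet:
                q = p * X[pos - 1][c]
                if floor(q * z) > 0:           # t_i(Pc) > 0, so Pc is a solid factor
                    extensions.append((P + c, q))
        # m_i(P) = t_i(P) - sum over c of t_i(Pc)
        m = floor(p * z) - sum(floor(q * z) for _, q in extensions)
        result.extend([P] * m)
        for Pc, q in extensions:
            visit(Pc, q)

    visit("", F(1))
    return sorted(result)


print(multiset_M(X, 3, 4))   # ['A', 'AAA', 'AAB', 'B']
```

The output for $i=3$ and $z=4$ agrees with the column $\mathcal{M}_{3}$ of Table 3.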

Observe that in a $z$ -estimation, $S_{j}[i\mathinner{.\,.}\pi_{j}[i]]$ can be obtained from $S_{j}[i+1\mathinner{.\,.}\pi_{j}[i+1]]$ by inserting a leading character and dropping some number of trailing characters. This statement holds provided that $\pi_{j}[i]\geq i$ ; otherwise $S_{j}[i\mathinner{.\,.}\pi_{j}[i]]=\varepsilon$ . The relation between these strings can be formalised as follows:

Definition 2.

We say that $P\in\mathcal{M}_{i}$ is compatible with $Q\in\mathcal{M}_{i+1}$ if $P=\varepsilon$ or $P=cQ^{\prime}$ for some character $c\in\Sigma$ and a prefix $Q^{\prime}$ of $Q$ .

Thus, if a $z$ -estimation exists, it yields a perfect matching between $\mathcal{M}_{i+1}$ and $\mathcal{M}_{i}$ such that the matched strings are compatible. We prove that such a matching exists unconditionally. For an example, see Table 3.

Lemma 2.

For every $1\leq i\leq n-1$ , there is a one-to-one correspondence from $\mathcal{M}_{i+1}$ into $\mathcal{M}_{i}$ such that each $Q\in\mathcal{M}_{i+1}$ is matched with a compatible $P\in\mathcal{M}_{i}$ .

Proof.

We greedily transform each $Q\in\mathcal{M}_{i+1}$ into the longest compatible $P\in\mathcal{M}_{i}$ which is still unmatched. If no compatible $P\in\mathcal{M}_{i}$ is available, we leave $Q$ unmatched. We will show that all strings $Q\in\mathcal{M}_{i+1}$ are actually matched at the end of this process. Since $|\mathcal{M}_{i}|=t_{i}(\varepsilon)=\left\lfloor z\right\rfloor=t_{i+1}(\varepsilon)=|\mathcal{M}_{i+1}|$ , it suffices to prove that no $P\in\mathcal{M}_{i}$ is left unmatched.

An empty string $P\in\mathcal{M}_{i}$ is compatible with every $Q\in\mathcal{M}_{i+1}$ , so it cannot be left unmatched. Thus, suppose that $P=cQ^{\prime}\in\mathcal{M}_{i}$ , for some $c\in\Sigma$ and string $Q^{\prime}$ , is left unmatched. Let us denote by $\mathcal{R}$ the multiset containing all strings $Q\in\mathcal{M}_{i+1}$ compatible with $P$ , i.e., starting with $Q^{\prime}$ . We further define $\mathcal{L}$ as the multiset containing all strings $P^{\prime}\in\mathcal{M}_{i}$ that start with $c^{\prime}Q^{\prime}$ for some $c^{\prime}\in\Sigma$ . The construction procedure guarantees that each $Q\in\mathcal{R}$ has been matched to a compatible $P^{\prime}$ satisfying $|P^{\prime}|\geq|P|$ ; such $P^{\prime}$ must belong to the multiset $\mathcal{L}$ .

Observe that $|\mathcal{L}|=\sum_{c^{\prime}\in\Sigma}t_{i}(c^{\prime}Q^{\prime})\leq t_{i+1}(Q^{\prime})=|\mathcal{R}|$ because $\mathit{Prob}_{X}(Q^{\prime},i+1)\geq\sum_{c^{\prime}\in\Sigma}\mathit{Prob}_{X}(c^{\prime}Q^{\prime},i)$ and the function $x\mapsto\left\lfloor xz\right\rfloor$ is superadditive. Consequently, each $P^{\prime}\in\mathcal{L}$ must be matched to some $Q\in\mathcal{R}$ . Since $P\in\mathcal{L}$ is unmatched, we obtain a contradiction. ∎
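
The greedy procedure from the proof can be sketched as follows. The names `compatible` and `greedy_match` are assumptions of this example, the multisets are passed as plain Python lists, and the quadratic scan is chosen for clarity rather than efficiency.

```python
def compatible(P, Q):
    """P in M_i is compatible with Q in M_{i+1} if P is empty, or
    P = cQ' for a letter c and a prefix Q' of Q (Definition 2)."""
    return P == "" or Q.startswith(P[1:])


def greedy_match(M_next, M_cur):
    """Match every Q in M_next (= M_{i+1}) with the longest compatible
    P in M_cur (= M_i) that is still unmatched; return the pairs (P, Q)."""
    free = sorted(M_cur, key=len, reverse=True)    # longest candidates first
    pairs = []
    for Q in M_next:
        for idx, P in enumerate(free):
            if compatible(P, Q):
                pairs.append((free.pop(idx), Q))
                break
        else:
            # Lemma 2 rules this out when the inputs really are M_i and M_{i+1}.
            raise ValueError("no compatible string left for " + repr(Q))
    return pairs


# M_3 and M_4 from Table 3 (z = 4).
M3 = ["A", "AAA", "AAB", "B"]
M4 = ["A", "AAB", "ABB", ""]
print(greedy_match(M4, M3))
# [('A', 'A'), ('AAA', 'AAB'), ('AAB', 'ABB'), ('B', '')]
```

On this input the result coincides with the matching marked between $\mathcal{M}_{3}$ and $\mathcal{M}_{4}$ in Table 3; in general, ties between equally long candidates may be broken differently.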

Due to Lemma 2, we can index the strings $\mathcal{M}_{i}=\{P_{j,i}:1\leq j\leq\left\lfloor z\right\rfloor\}$ so that we have $\left\lfloor z\right\rfloor$ chains $P_{j,1},\ldots,P_{j,n},P_{j,n+1}=\varepsilon$ with compatible subsequent strings. It is easy to transform each such chain to a string $S_{j}$ with property $\pi_{j}$ so that $S_{j}[i\mathinner{.\,.}\pi_{j}[i]]=P_{j,i}$ . The value $S_{j}[i]$ is not specified if $P_{j,i}=\varepsilon$ ; in this case, we may set $S_{j}[i]$ to an arbitrary character. The resulting family $\SS=(S_{j},\pi_{j})_{j=1}^{\left\lfloor z\right\rfloor}$ clearly satisfies the characterisation of Observation 1, which completes the proof of the following result.

Theorem 1.

Each weighted sequence $X$ has a $z$ -estimation.

4 Efficient Implementation

In this section we describe an algorithm which, given a weighted sequence $X$ of length $n$ and threshold $z$ , constructs a $z$ -estimation of $X$ in $\mathcal{O}(nz)$ time.

At a high level, we follow the existential construction of Section 3. We start with $\mathcal{M}_{n+1}$ , which consists of $\left\lfloor z\right\rfloor$ copies of $\varepsilon$ , and we iterate over positions $i=n,\ldots,1$ transforming $\mathcal{M}_{i+1}$ to $\mathcal{M}_{i}$ so that each $P_{j,i+1}\in\mathcal{M}_{i+1}$ is replaced with a compatible string $P_{j,i}\in\mathcal{M}_{i}$ . We simultaneously build the $z$ -estimation $\SS=(S_{j},\pi_{j})_{j=1}^{\left\lfloor z\right\rfloor}$ . More precisely, we set $\pi_{j}[i]$ to $i+|P_{j,i}|-1$ and $S_{j}[i]$ to the leading character of $P_{j,i}$ , or an arbitrary character if $P_{j,i}=\varepsilon$ .
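
The high-level scheme can be prototyped without tries. The sketch below is a deliberately naive illustration, not the $\mathcal{O}(nz)$-time algorithm of this section: it recomputes each $\mathcal{M}_{i}$ from scratch and matches strings by linear scans. The names `multiset_M` and `z_estimation` are assumptions of this example, and the output depends on how ties are broken, so it need not coincide with Table 2.

```python
from fractions import Fraction as F
from math import floor

# Table 1: X[i-1][c] stores p_i^{(X)}(c).
X = [
    {"A": F(1),    "B": F(0)},
    {"A": F(1, 2), "B": F(1, 2)},
    {"A": F(3, 4), "B": F(1, 4)},
    {"A": F(4, 5), "B": F(1, 5)},
    {"A": F(1, 2), "B": F(1, 2)},
    {"A": F(1, 4), "B": F(3, 4)},
]


def multiset_M(X, i, z, alphabet):
    """M_i: every solid factor P at position i, repeated m_i(P) times."""
    out, stack = [], [("", F(1))]
    while stack:
        P, p = stack.pop()
        pos = i + len(P)                       # 1-based position of the next letter
        kids = []
        if pos <= len(X):
            kids = [(P + c, p * X[pos - 1][c]) for c in alphabet]
            kids = [(Pc, q) for Pc, q in kids if floor(q * z) > 0]
        out += [P] * (floor(p * z) - sum(floor(q * z) for _, q in kids))
        stack += kids
    return out


def z_estimation(X, z, alphabet="AB"):
    n, k = len(X), floor(z)
    S = [[alphabet[0]] * n for _ in range(k)]  # arbitrary filler letter
    pi = [[0] * n for _ in range(k)]
    chain = [""] * k                           # P_{j,n+1} = empty string
    for i in range(n, 0, -1):
        free = sorted(multiset_M(X, i, z, alphabet), key=len, reverse=True)
        for j in range(k):
            # Longest unmatched string of M_i compatible with P_{j,i+1}.
            idx = next(t for t, P in enumerate(free)
                       if P == "" or chain[j].startswith(P[1:]))
            P = free.pop(idx)
            chain[j] = P
            pi[j][i - 1] = i + len(P) - 1
            if P:
                S[j][i - 1] = P[0]
    return [("".join(s), p) for s, p in zip(S, pi)]


family = z_estimation(X, 4)
for S_j, pi_j in family:
    print(S_j, pi_j)
```

By Observation 1, the result can be validated by comparing, for every $i$, the multiset of strings $S_{j}[i\mathinner{.\,.}\pi_{j}[i]]$ with $\mathcal{M}_{i}$.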

Each transformation simulates the procedure provided in the proof of Lemma 2. However, our implementation uses solid factor tries in order to achieve $\mathcal{O}(z)$ amortised running time.

4.1 Solid Factor Tries

Recall that a trie is a rooted tree in which each node represents a string; the string corresponding to node $u$ , called the label of $u$ , is denoted $\mathsf{L}(u)$ . The root has label $\varepsilon$ , and the parent of a node $u$ with $\mathsf{L}(u)=Pc$ for $c\in\Sigma$ is the node $v$ with $\mathsf{L}(v)=P$ ; the edge from $P$ to $Pc$ is labelled with $c$ . Observe that the family of solid factors occurring at position $i$ (i.e., strings $P$ such that $t_{i}(P)>0$ ) is closed with respect to prefixes. Thus, we can define a solid factor trie $T_{i}$ whose nodes represent these factors.

We store $\mathcal{M}_{i}$ using tokens in $T_{i}$ : each $P_{j,i}\in\mathcal{M}_{i}$ is represented by a token (with identifier $j$ ) located at the node $u\in T_{i}$ with $\mathsf{L}(u)=P_{j,i}$ . For each token $j$ , we store the node $u\in T_{i}$ with $\mathsf{L}(u)=P_{j,i}$ and the probability $\mathit{Prob}_{X}(P_{j,i},i)$ . Observe that the number of tokens at the node $u$ is $m_{i}(\mathsf{L}(u))$ and the number of tokens in the subtree rooted at $u$ is $t_{i}(\mathsf{L}(u))$ . To simplify notation, we denote $m_{i}(u)=m_{i}(\mathsf{L}(u))$ and $t_{i}(u)=t_{i}(\mathsf{L}(u))$ . We have the following simple observation; see also Figure 1.

Observation 2.

The trie $T_{i}$ contains $\left\lfloor z\right\rfloor$ tokens in total and every leaf contains at least one token.
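Observation 2 can be checked by brute force on a small input. The sketch below assumes that $t_{i}(P)=\left\lfloor\mathit{Prob}_{X}(P,i)z\right\rfloor$ and $m_{i}(P)=t_{i}(P)-\sum_{c}t_{i}(Pc)$, which is consistent with the token counts stated above; the function name `solid_factor_trie`, the dictionary representation, and the 0-based positions are choices made for this example only. The sample data follows Table 1, with $p_{i}(\mathtt{B})=1-p_{i}(\mathtt{A})$.

```python
from fractions import Fraction
from math import floor

# Table 1 lists p_i(A); p_i(B) = 1 - p_i(A) follows from the binary alphabet.
A_PROBS = [Fraction(1), Fraction(1, 2), Fraction(3, 4),
           Fraction(4, 5), Fraction(1, 2), Fraction(1, 4)]
X_EXAMPLE = [{"A": a, "B": 1 - a} for a in A_PROBS]


def solid_factor_trie(X, i, z):
    """Map each label of T_i (0-based position i) to the pair (t_i, m_i).

    Assumption of this sketch: t_i(P) = floor(Prob_X(P, i) * z) and
    m_i(P) = t_i(P) minus the sum of t_i over the children of P.
    """
    nodes = {}
    stack = [("", Fraction(1))]
    while stack:
        label, prob = stack.pop()
        pos = i + len(label)
        children = []
        if pos < len(X):
            # Keep only the extensions that are still solid factors.
            children = [(label + c, prob * p) for c, p in X[pos].items()
                        if floor(prob * p * z) > 0]
        t = floor(prob * z)
        m = t - sum(floor(q * z) for _, q in children)
        nodes[label] = (t, m)
        stack.extend(children)
    return nodes


trie = solid_factor_trie(X_EXAMPLE, 0, 4)
leaves = [u for u in trie if not any(u + c in trie for c in "AB")]
```

Summing the second components over all nodes gives the total number of tokens, and the leaves are the nodes without a child in the dictionary.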

4.2 Transformation Algorithm

For each index $i$ , we transform the solid factor trie $T_{i+1}$ to $T_{i}$ and move the tokens so that $\mathcal{M}_{i+1}$ is transformed to $\mathcal{M}_{i}$ .

Before we describe the implementation, let us formulate a relation between $T_{i}$ and $T_{i+1}$ .

Observation 3.

If $u\in T_{i}$ has a non-empty label, $\mathsf{L}(u)=cP$ , for some $c\in\Sigma$ , then $T_{i+1}$ contains a node $v$ with label $\mathsf{L}(v)=P$ .

Consequently, each non-root node $u\in T_{i}$ has a corresponding node $v\in T_{i+1}$ . In our construction algorithm, we sometimes reuse $v$ as $u$ ; otherwise, we create $u$ as a copy of $v$ . More precisely, we distinguish a heavy letter $h\in\Sigma$ maximising probability $p^{(X)}_{i}(c)$ over $c\in\Sigma$ . We reuse $v$ if $\mathsf{L}(u)$ starts with $h$ and create a copy of $v$ otherwise.

This approach is implemented as follows. First, we create the root of $T_{i}$ and attach $T_{i+1}$ to the new root using an edge with label $h$ . The resulting subtree, denoted $T_{i,h}$ , contains all tokens present in $T_{i+1}$ and may contain nodes $v$ with $t_{i}(v)=0$ (we piggyback trimming them to the last phase when tokens are moved). Next, we consider all the remaining letters $c\in\Sigma\setminus\{h\}$ . For each such letter we shall build a subtree $T_{i,c}$ representing solid factors occurring at position $i$ and starting with character $c$ . We simultaneously build and traverse $T_{i,c}$ : we construct the children of a node $u$ while visiting $u$ for the first time. While at node $u$ with $\mathsf{L}(u)=cP$ , we maintain the probability $\mathit{Prob}_{X}(cP,i)$ and a pointer to the corresponding node $v\in T_{i,h}$ such that $\mathsf{L}(v)=hP$ . To construct the children of $u$ , we simply compute $t_{i}(cPc^{\prime})$ for each $c^{\prime}\in\Sigma$ . Moreover, we determine $m_{i}(cP)$ and place $m_{i}(cP)$ token requests at node $v$ , announcing that $m_{i}(cP)$ tokens are needed at $u$ .

Finally, we move the tokens and trim the redundant nodes of $T_{i,h}$ . We process the tokens in an arbitrary order. Consider a token located at node $v$ of $T_{i,h}$ with $\mathsf{L}(v)=hQ$ (the token used to represent $Q\in\mathcal{M}_{i+1}$ ). We traverse the path from $v$ towards the root of $T_{i}$ maintaining the probability $\mathit{Prob}_{X}(\mathsf{L}(v^{\prime}),i)$ at the currently visited node $v^{\prime}$ . First, we check if there is any token request at $v^{\prime}$ . If so, we comply with the request, remove it, and terminate the traversal. Otherwise, we compute $m_{i}(v^{\prime})$ using the probability. If $v^{\prime}$ contains fewer than $m_{i}(v^{\prime})$ already processed tokens, we place our token at $v^{\prime}$ and terminate the traversal. Otherwise, we proceed to the parent of $v^{\prime}$ . If $v^{\prime}$ is a leaf and does not contain any (processed or unprocessed) tokens, we remove $v^{\prime}$ from $T_{i,h}$ . If the traversal reaches the root of $T_{i}$ , we place the token unconditionally at the root. Figure 2 illustrates this procedure on an example.

4.2.1 Correctness

We shall prove that the procedure described above correctly computes $T_{i}$ and $\mathcal{M}_{i}$ . Due to Observation 3, the trie $T_{i}$ contains all the necessary nodes. We only need to prove that no redundant nodes $v$ (with $t_{i}(v)=0$ ) are left in $T_{i,h}$ . Suppose that $v$ is the deepest such node; clearly, it must be a leaf of $T_{i,h}$ . We did not place a token at $v$ because $m_{i}(v)\leq t_{i}(v)=0$ . On the other hand, tokens were present in all leaves of $T_{i+1}$ , so the subtree of $v$ in $T_{i,h}$ initially contained a token. Let us consider the moment of moving the last token in this subtree. If the token travelled further to the parent of $v$ , we would have removed $v$ . Thus, the token must have been placed at a node $u$ complying with a token request at $u$ . However, in that case we have $t_{i}(v)\geq t_{i}(u)\geq m_{i}(u)>0$ , because $h$ is the heavy character. This contradiction concludes the proof.

Hence, we proceed to proving that the final configuration of tokens represents $\mathcal{M}_{i}$ . For this, we observe that our algorithm simulates the greedy procedure in the proof of Lemma 2. In other words, we shall prove that we transformed $P_{j,i+1}\in\mathcal{M}_{i+1}$ to the longest compatible element of $\mathcal{M}_{i}$ which was still unmatched when we processed token $j$ . Suppose that there was an unmatched string $P^{\prime}\in\mathcal{M}_{i}$ longer than $P_{j,i}$ . Let $P^{\prime}=cQ^{\prime}$ and observe that, when processing token $j$ , we visited the node $v^{\prime}$ with $\mathsf{L}(v^{\prime})=hQ^{\prime}$ . If $c=h$ , then we would have fewer than $m_{i}(v^{\prime})$ processed tokens at $v^{\prime}$ . Otherwise, there must have been a token request at $v^{\prime}$ . For either event we would not have proceeded to the parent of $v^{\prime}$ . This contradiction concludes the proof.
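The greedy procedure that the trie-based algorithm simulates can also be written naively, without tries, which is convenient for testing on small inputs. The sketch below is not the $\mathcal{O}(nz)$-time algorithm of this section. It assumes that a string $cQ^{\prime}$ (or $\varepsilon$) is compatible with $Q$ whenever $Q^{\prime}$ is a prefix of $Q$, and that $\mathcal{M}_{i}$ contains each solid factor $P$ with multiplicity $m_{i}(P)$; all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import floor

A_PROBS = [Fraction(1), Fraction(1, 2), Fraction(3, 4),
           Fraction(4, 5), Fraction(1, 2), Fraction(1, 4)]
X_EXAMPLE = [{"A": a, "B": 1 - a} for a in A_PROBS]  # Table 1, p(B) = 1 - p(A)


def prob(X, P, i):
    """Prob_X(P, i) for a 0-based position i; 0 if P does not fit."""
    if i + len(P) > len(X):
        return Fraction(0)
    result = Fraction(1)
    for offset, c in enumerate(P):
        result *= X[i + offset].get(c, 0)
    return result


def multiset_M(X, i, z, alphabet):
    """Strings of M_i, each solid factor P repeated m_i(P) times (assumed definition)."""
    strings, stack = [], [""]
    while stack:
        P = stack.pop()
        children = [P + c for c in alphabet if floor(prob(X, P + c, i) * z) > 0]
        m = floor(prob(X, P, i) * z) - sum(floor(prob(X, Q, i) * z) for Q in children)
        strings.extend([P] * m)
        stack.extend(children)
    return strings


def naive_z_estimation(X, z, alphabet):
    n, k = len(X), floor(z)
    current = [""] * k                       # P_{j,n+1} is the empty string
    S = [[alphabet[0]] * n for _ in range(k)]
    pi = [[0] * n for _ in range(k)]
    for i in range(n - 1, -1, -1):
        unmatched = multiset_M(X, i, z, alphabet)
        for j in range(k):
            Q = current[j]
            # Longest unmatched string cQ' (or "") such that Q' is a prefix of Q.
            best = max((P for P in unmatched if Q.startswith(P[1:])), key=len)
            unmatched.remove(best)
            current[j] = best
            if best:
                S[j][i] = best[0]
            pi[j][i] = i + len(best) - 1     # 0-based, inclusive
    return ["".join(row) for row in S], pi


def count(S, pi, P, i):
    """Count_S(P, i): strings in which P occurs at i within the property."""
    return sum(1 for s, p in zip(S, pi)
               if p[i] >= i + len(P) - 1 and s[i:i + len(P)] == P)


def is_z_estimation(X, S, pi, z, alphabet):
    n = len(X)
    return all(count(S, pi, "".join(w), i) == floor(prob(X, "".join(w), i) * z)
               for i in range(n)
               for length in range(1, n - i + 1)
               for w in product(alphabet, repeat=length))
```

Calling `naive_z_estimation(X_EXAMPLE, 4, "AB")` and passing the result to `is_z_estimation` compares every count with $\left\lfloor\mathit{Prob}_{X}(P,i)z\right\rfloor$, which is the property used in the proof of Lemma 3.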

4.2.2 Running Time Analysis

It remains to show that the total running time of the $n$ transformations is $\mathcal{O}(nz)$ . In a single iteration, processing the $j$ -th token, i.e., transforming $P_{j,i+1}$ to $P_{j,i}$ , we visited at most $1+|P_{j,i+1}|-|P_{j,i}|$ nodes of $T_{i,h}$ and deleted some of them. Across all iterations this is $\mathcal{O}(n)$ per token and $\mathcal{O}(nz)$ in total. The remaining operations (construction of subtrees $T_{i,c}$ ) take $\mathcal{O}(1)$ time per created node. The final tree $T_{1}$ has $\mathcal{O}(nz)$ nodes and the overall number of deleted nodes is $\mathcal{O}(nz)$ . Hence, the total number of created nodes is also $\mathcal{O}(nz)$ .

This concludes the proof that the running time is $\mathcal{O}(nz)$ . Hence, we achieve the main goal of this section.

Theorem 2.

For a weighted sequence $X$ of length $n$ over a constant-sized alphabet, one can construct a $z$ -estimation in $\mathcal{O}(nz)$ time.

5 Property Indexing Made Simple

Every known solution to the Property Indexing problem makes use of suffix trees; ours is no exception. Below we recall the basics on suffix trees.

5.1 Suffix Trees

The suffix tree $T$ of a non-empty string $S$ of length $n$ is a compact trie representing all suffixes of $S$ . The nodes of the trie which become nodes of the suffix tree (i.e., branching nodes, terminal nodes, and the root) are called explicit nodes, while the other nodes are called implicit. The edges out-going from a node are labelled with their first letters and can be stored, e.g., in a list.

Each edge of the suffix tree can be viewed as an upward maximal path of implicit nodes starting with an explicit node. Moreover, each node belongs to a unique path of that kind. Then, each node of the trie can be represented in the suffix tree by the edge it belongs to and an index within the corresponding path. We use $\mathsf{L}(v)$ to denote the path-label of a node $v$ , i.e., the concatenation of the edge labels along the path from the root to $v$ . The terminal node corresponding to suffix $S[i\mathinner{.\,.}n]$ is marked with the index $i$ . Each string $P$ occurring in $S$ is uniquely represented by either an explicit or an implicit node of $T$ , called the locus of $P$ . The suffix link of a node $v$ with path-label $\mathsf{L}(v)=cP$ is a pointer to the node path-labelled $P$ , where $c\in\Sigma$ is a single letter and $P$ is a string. The suffix link of every non-root explicit $v$ leads to an explicit node of $T$ .

The suffix tree of a string of length $n$ , even over an integer alphabet, can be constructed in $\mathcal{O}(n)$ time [9].

5.2 Property Suffix Tree Construction

In analogy to the suffix tree, given a string $S$ with property $\Pi$ represented by an array $\pi$ , we define the property suffix tree of $(S,\pi)$ as the compact trie representing strings $S[i\mathinner{.\,.}\pi[i]]$ . Each terminal node $v$ stores a list $L_{v}$ containing all indices $i$ such that $S[i\mathinner{.\,.}\pi[i]]$ is the path-label of $v$ . This way, $\mathit{Occ}_{\pi}(P,S)$ can be retrieved by locating the locus of $P$ and writing down indices in lists $L_{v}$ for all descendants of the locus.
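To make the definition concrete, the following sketch builds an uncompacted trie of the strings $S[i\mathinner{.\,.}\pi[i]]$ and answers reporting queries by collecting the lists $L_{v}$ below the locus. It takes quadratic time and space in the worst case and is meant only as a reference for small inputs, not as the linear-time construction described next. The names `build_property_trie` and `occurrences`, the dictionary layout, and the 0-based inclusive $\pi$ are assumptions of this example.

```python
def build_property_trie(S, pi):
    """Uncompacted trie of S[i..pi[i]] (0-based, inclusive pi)."""
    root = {"children": {}, "indices": []}
    for i in range(len(S)):
        node = root
        for c in S[i:pi[i] + 1]:
            node = node["children"].setdefault(c, {"children": {}, "indices": []})
        node["indices"].append(i)  # the list L_v of the terminal node
    return root


def occurrences(root, P):
    """Indices i such that P is a prefix of S[i..pi[i]]."""
    node = root
    for c in P:
        node = node["children"].get(c)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:  # collect L_v over all descendants of the locus
        u = stack.pop()
        found.extend(u["indices"])
        stack.extend(u["children"].values())
    return sorted(found)


root = build_property_trie("ABAAB", [2, 2, 2, 2, 4])
```

In this example the string `"AAB"` occurs in $S$ at index 2 but is not reported, because the property truncates the suffix starting there to a single letter.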

For a given string $S$ , we construct the property suffix tree with respect to property $\Pi$ from the suffix tree of $S$ . This process is implemented in three steps. First, for each index $i$ we determine the locus $v_{i}$ of $S[i\mathinner{.\,.}\pi[i]]$ . Next, we make all these loci explicit to create new terminal nodes. Finally, we remove nodes which should no longer exist in the tree or no longer be explicit.

Our approach to the first phase is similar to Ukkonen’s suffix tree construction [21]. We are to determine the locus $v_{i}$ of $S[i\mathinner{.\,.}\pi[i]]$ . For this, we shall traverse the suffix tree starting from an explicit node $u_{i}$ guaranteed to be an ancestor of $v_{i}$ . We obtain $u_{i}$ by following the suffix link of the nearest explicit ancestor of $v_{i-1}$ ( $v_{i-1}$ itself if it is explicit). If $i=1$ or the explicit ancestor of $v_{i-1}$ is the root, we simply set $u_{i}$ as the root. Since $\pi[i]\geq\pi[i-1]$ for $i>1$ , $u_{i}$ is indeed an ancestor of $v_{i}$ . Therefore, we can progress down the edges in the suffix tree from $u_{i}$ , keeping track of the current depth until the desired depth is reached. We know that $v_{i}$ exists in the tree, so it suffices to read only the first letters of each traversed edge.

This procedure results in the sequence of loci $(v_{i})_{i=1}^{n}$ . Let us analyse its time complexity. In the $i$ -th iteration we traverse: one edge to reach $u_{i}$ , then several edges to reach a node $w$ whose suffix link is $u_{i+1}$ , and finally at most one edge to reach $v_{i}$ . Since $|\mathsf{L}(w)|=1+|\mathsf{L}(u_{i+1})|$ , the number of edges traversed in this iteration is at most $2+|\mathsf{L}(w)|-|\mathsf{L}(u_{i})|\leq 3+|\mathsf{L}(u_{i+1})|-|\mathsf{L}(u_{i})|$ , which gives $\mathcal{O}(n)$ overall because the sum telescopes.

The remaining steps of the algorithm are performed as follows. We sort the loci $v_{i}$ by the path label length $\pi[i]-i+1$ and group them based on the edge where they are located. This lets us appropriately subdivide each edge and compute the lists $L_{v}$ for the new terminal nodes. Finally, we trim the tree: we traverse the tree bottom-up and remove or dissolve nodes which should no longer be explicit. These steps clearly work in $\mathcal{O}(n)$ time.

Theorem 3.

For a string $S$ and property $\Pi$ represented with a table $\pi$ , the property suffix tree can be computed in $\mathcal{O}(n)$ time. Moreover, this data structure can answer property indexing queries in $\mathcal{O}(|P|)$ time (counting) or $\mathcal{O}(|P|+|\mathit{Occ}_{\pi}(P,S)|)$ time (reporting).

6 Weighted Index

Let us first describe our data structure for the Weighted Indexing problem. For a weighted sequence $X$ and a threshold $z$ , we construct a $z$ -estimation $\SS=(S_{j},\pi_{j})_{j=1}^{\left\lfloor z\right\rfloor}$ of $X$ , concatenate all the strings and shift the properties so that a single string $S$ with property $\pi$ is obtained. Our weighted index is the property suffix tree of $S$ and $\pi$ . In the property suffix tree, each terminal node is labelled by the list of all the occurrences of the corresponding string in $S$ respecting its property. We shift these indices so that they describe the indices within the respective strings $S_{j}$ .
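The concatenation and the two index shifts can be sketched as follows. The function name `concatenate_family` and the `origin` table that maps a position of $S$ back to a pair (string number, position in that string) are assumptions introduced for this illustration; positions and properties are 0-based and inclusive.

```python
def concatenate_family(family):
    """Merge [(S_1, pi_1), ..., (S_k, pi_k)] into a single (S, pi).

    Every pi_j[i] stays inside S_j, so after shifting by the offset of S_j
    no property-respecting occurrence crosses a boundary between strings.
    """
    parts, pi, origin = [], [], []
    offset = 0
    for j, (s, p) in enumerate(family):
        parts.append(s)
        pi.extend(offset + end for end in p)         # shift the property
        origin.extend((j, i) for i in range(len(s)))  # shift indices back
        offset += len(s)
    return "".join(parts), pi, origin


S, pi, origin = concatenate_family([
    ("ABAAB", [2, 2, 2, 2, 4]),
    ("AABAB", [1, 2, 2, 4, 4]),
])
```

Since all strings $S_{j}$ have length $n$, the position in $X$ could equally be recovered as the index in $S$ modulo $n$; the explicit table is used here only for readability.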

The space complexity of the index is obviously $\mathcal{O}(nz)$ , where $n$ is the length of $X$ . Theorems 2 and 3 show that the data structure can be constructed in $\mathcal{O}(nz)$ time. The resulting weighted index is very similar to the one constructed in [3], even though the construction algorithm is very different.

By Definition 1, a string $P$ occurs at position $i$ in $X$ if and only if it occurs at this position in at least one of the strings. Thus, to check if $\mathit{Occ}_{\frac{1}{z}}(P,X)\neq\emptyset$ , it suffices to traverse down the property suffix tree and check if it contains a node $v$ corresponding to $P$ . This search takes $\mathcal{O}(m)$ time, where $m=|P|$ . The two remaining types of operations—counting and reporting—require finding distinct positions in the labels of the terminals in the subtree of $v$ . They can be implemented after additional preprocessing for the colour set size [11] and coloured range listing problem [17]; details can be found in [3]. We obtain the same complexities as in Theorem 16 from [3].

Theorem 4.

For a weighted sequence $X$ of length $n$ over a constant-sized alphabet and a threshold $z$ , there is a weighted index of $\mathcal{O}(nz)$ size that can be constructed in $\mathcal{O}(nz)$ time and answers decision and counting queries in $\mathcal{O}(m)$ time and reporting queries in $\mathcal{O}(m+|\mathit{Occ}_{\frac{1}{z}}(P,X)|)$ time.

Other applications of the weighted index mentioned in [3] include $\mathcal{O}(nz)$ -time computation of the weighted prefix table and of all covers of a weighted sequence. Our weighted index can be used in both.

7 Approximate Weighted Index

Now let us proceed to the solution of the Approximate Weighted Indexing problem. We are to answer queries for a pattern $P$ and a probability threshold $\frac{1}{z^{\prime}}$ and are allowed to report occurrences with probability $\geq\frac{1}{z^{\prime}}-\epsilon$ , for a given value of $\epsilon>0$ . Let us recall that [6] solve this problem in $\mathcal{O}(\frac{1}{\epsilon}nz^{2})$ space (with $\Omega(\frac{1}{\epsilon}n^{2}z^{2})$ construction time) with $\mathcal{O}(m+|\mathit{Occ}_{\frac{1}{z^{\prime}}-\epsilon}(P,X)|)$ -time queries, assuming that $z^{\prime}\leq z$ holds in all queries. Our techniques lead to a substantial improvement over the complexities of this index.

Assume that the query is for a pattern $P$ and a threshold $\frac{1}{z^{\prime}}$ . If $\frac{1}{z^{\prime}}<\epsilon$ , then the query is trivial as all the positions in $X$ can be reported. Henceforth, we assume that $\frac{1}{z^{\prime}}\geq\epsilon$ .

Let us consider a $z$ -estimation $\SS$ for the weighted sequence with $z=\frac{1}{\epsilon}$ . Let $\ell=\left\lfloor\frac{z}{z^{\prime}}\right\rfloor$ . By Definition 1, we can return position $i$ as an occurrence of $P$ based on whether $\mathit{Count}_{\SS}(P,i)\geq\ell$ ; this is shown in the following lemma.

Lemma 3.

If $\mathit{Count}_{\SS}(P,i)\geq\ell$ , then $\mathit{Prob}_{X}(P,i)\geq\frac{1}{z^{\prime}}-\epsilon$ . If $\mathit{Count}_{\SS}(P,i)<\ell$ , then $\mathit{Prob}_{X}(P,i)<\frac{1}{z^{\prime}}$ .

Proof.

Assume that $\mathit{Count}_{\SS}(P,i)\geq\ell$ . Then

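$$\mathit{Prob}_{X}(P,i)\geq\tfrac{1}{z}\left\lfloor\mathit{Prob}_{X}(P,i)z\right\rfloor=\tfrac{1}{z}\mathit{Count}_{\SS}(P,i)\geq\tfrac{\ell}{z}=\tfrac{1}{z}\left\lfloor\tfrac{z}{z^{\prime}}\right\rfloor>\tfrac{1}{z}\left(\tfrac{z}{z^{\prime}}-1\right)=\tfrac{1}{z^{\prime}}-\tfrac{1}{z}=\tfrac{1}{z^{\prime}}-\epsilon.$$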

Now assume that $\mathit{Count}_{\SS}(P,i)<\ell$ . As $\mathit{Count}_{\SS}(P,i)=\left\lfloor\mathit{Prob}_{X}(P,i)z\right\rfloor$ and $\ell$ is an integer, we conclude that $\mathit{Prob}_{X}(P,i)z<\ell$ , which is equivalent to $\mathit{Prob}_{X}(P,i)<\tfrac{\ell}{z}=\tfrac{1}{z}\left\lfloor\tfrac{z}{z^{\prime}}\right\rfloor\leq\tfrac{1}{z^{\prime}}$ . ∎
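Both implications of Lemma 3 can be checked numerically with exact rational arithmetic. The sketch below relies on the identity $\mathit{Count}_{\SS}(P,i)=\left\lfloor\mathit{Prob}_{X}(P,i)z\right\rfloor$ used in the proof; the function name `reported` is hypothetical, and the thresholds are assumed to satisfy $z^{\prime}\leq z$.

```python
from fractions import Fraction
from math import floor


def reported(prob, z, z_prime):
    """Decide whether a position is reported, given Prob_X(P, i) = prob.

    Uses Count = floor(prob * z) for a z-estimation and the cut-off
    ell = floor(z / z_prime); here epsilon = 1 / z.
    """
    ell = floor(Fraction(z) / z_prime)
    count = floor(prob * z)
    return count >= ell


def lemma3_holds(prob, z, z_prime):
    if reported(prob, z, z_prime):
        return prob >= Fraction(1) / z_prime - Fraction(1) / z
    return prob < Fraction(1) / z_prime
```

For instance, with $z=10$ and $z^{\prime}=3$ we get $\ell=3$, so a factor of probability $\frac{3}{10}$ is reported although it lies slightly below $\frac{1}{3}$, within the allowed slack $\epsilon=\frac{1}{10}$.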

Thus our approximate weighted index for $X$ is the weighted index for $X$ constructed for $z=\frac{1}{\epsilon}$ . To obtain the desired accuracy, it suffices to find the node $v$ in the property suffix tree that corresponds to $P$ and report all positions $i$ in $X$ such that there are at least $\left\lfloor\frac{z}{z^{\prime}}\right\rfloor$ leaves in the subtree of $v$ labelled with the position $i$ . Let us show that this can be done by augmenting the weighted index by a data structure for (top- $k$ ) document retrieval.

A version of the document retrieval problem (see Section 4.1 in [18]) can be stated operationally as follows. We are given a compact trie $T$ with $N$ leaves, each leaf labelled with a document number being a positive integer up to $N$ . (Usually $T$ is a suffix tree of a collection of documents.) Given a pattern $P$ , let $v$ be the locus of $P$ . Our goal is to report subsequent documents whose numbers occur most frequently in the leaves of the subtree of $v$ until the process of reporting is interrupted. In [18] a data structure of size $\mathcal{O}(N)$ is shown that, given the node $v$ , reports $k$ top-scoring documents in $\mathcal{O}(k)$ time. The construction time of the data structure is $\mathcal{O}(N\log N)$ .

We can augment our property suffix tree with this data structure with the document numbers being the labels of terminals (we can create a separate leaf for each label). This gives $N=\mathcal{O}(nz)=\mathcal{O}(\frac{n}{\epsilon})$ . To find the documents with at least $\ell$ occurrences, we compute by doubling the smallest $k$ such that the last of the top $k$ documents reported has fewer than $\ell$ occurrences. The number of documents reported in the last step of the doubling search will be at most $2|\mathit{Occ}_{\frac{1}{z^{\prime}}-\epsilon}(P,X)|$ and the total number will not exceed $4|\mathit{Occ}_{\frac{1}{z^{\prime}}-\epsilon}(P,X)|$ .
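The doubling search can be sketched against an abstract top-$k$ oracle. In the sketch below the oracle is a stand-in built from `collections.Counter.most_common` over the leaf labels below the locus; it does not implement the data structure of [18], and the names `report_frequent` and `make_oracle` are hypothetical. The threshold `ell` is assumed to be at least 1.

```python
from collections import Counter


def make_oracle(leaf_labels):
    """Stand-in for top-k document retrieval below a fixed node v."""
    frequencies = Counter(leaf_labels)
    return lambda k: frequencies.most_common(k)  # [(document, frequency), ...]


def report_frequent(top_k, ell):
    """Documents with at least ell occurrences, found by doubling k."""
    k = 1
    while True:
        batch = top_k(k)
        # Stop once the oracle runs out of documents or the last reported
        # document is already below the threshold.
        if len(batch) < k or batch[-1][1] < ell:
            return [doc for doc, freq in batch if freq >= ell]
        k *= 2


oracle = make_oracle([3, 3, 3, 1, 1, 2, 5, 5, 5, 5])
frequent = report_frequent(oracle, 2)
```

Each round doubles $k$, so the work of the final round dominates the total, in line with the factor-4 bound stated above.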

Theorem 5.

For a weighted sequence of length $n$ over a constant-sized alphabet and parameter $\epsilon>0$ , the Approximate Weighted Indexing problem can be solved in $\mathcal{O}(\frac{n}{\epsilon})$ space with $\mathcal{O}(m+|\mathit{Occ}_{\frac{1}{z^{\prime}}-\epsilon}(P,X)|)$ -time queries. The construction time is $\mathcal{O}(\frac{n}{\epsilon}\log\frac{n}{\epsilon})$ .

8 Randomised Construction with Greater Space Usage

A symbol $X[i]$ of a weighted sequence $X$ can be interpreted as a probability distribution on $\Sigma$ , and the whole sequence $X$ can be interpreted as a product distribution on strings of length $n$ over $\Sigma$ . In this setting, if $S\sim X$ , i.e., $S$ is a random string with distribution $X$ , then, for any position $i$ and string $P$ , we have $\mathbb{P}[S[i\mathinner{.\,.}i+|P|-1]=P]=\mathit{Prob}_{X}(P,i)$ . This interpretation can be used to provide a randomised construction of families $\SS$ of strings with properties equivalent to the weighted sequence $X$ in a certain sense, weaker than the one used in Definition 1.

Lemma 4.

There is a randomised algorithm which, given a weighted sequence $X$ of length $n$ and a threshold parameter $z$ , in $\mathcal{O}(nz\log(nz))$ time constructs a family $\SS$ of $k=\mathcal{O}(z\log(nz))$ strings $S_{j}$ with properties $\pi_{j}$ such that $\mathit{Count}_{\SS}(P,i)>0$ if and only if $\mathit{Prob}_{X}(P,i)\geq\frac{1}{z}$ . It succeeds with high probability ( $1-\frac{1}{(nz)^{c}}$ for arbitrarily large constant $c$ ).

Proof.

We randomly sample $k=\left\lceil(c+2)z\ln(nz)\right\rceil$ strings $S_{1},\ldots,S_{k}$ . Formally, these are independent random variables with distribution $X$ . The properties $\pi_{j}$ are specified so that $S_{j}[i\mathinner{.\,.}\pi_{j}[i]]$ is the longest prefix of $S_{j}[i\mathinner{.\,.}n]$ with $\mathit{Prob}_{X}(S_{j}[i\mathinner{.\,.}\pi_{j}[i]],i)\geq\frac{1}{z}$ .

This way, $\mathit{Count}_{\SS}(P,i)>0$ implies $\mathit{Prob}_{X}(P,i)\geq\frac{1}{z}$ . On the other hand, if $\mathit{Prob}_{X}(P,i)\geq\frac{1}{z}$ , then, since $\mathbb{P}[S_{j}[i\mathinner{.\,.}i+|P|-1]\neq P]=1-\mathit{Prob}_{X}(P,i)$ , we have:

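$$\mathbb{P}[\mathit{Count}_{\SS}(P,i)=0]=(1-\mathit{Prob}_{X}(P,i))^{k}\leq\left(1-\tfrac{1}{z}\right)^{k}\leq\exp\left(-\tfrac{k}{z}\right)\leq\exp(-(c+2)\ln(nz))=\tfrac{1}{(nz)^{c+2}}.$$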

There are at most $n^{2}z$ pairs $(P,i)$ satisfying $\mathit{Prob}_{X}(P,i)\geq\frac{1}{z}$ (this is the bound for the sum of lengths of all strings in the sets $\mathcal{M}_{i}$ from Section 3). Consequently, the resulting family has the required property with probability at least $1-\frac{n^{2}z}{(nz)^{c+2}}\geq 1-\frac{1}{(nz)^{c}}$ . ∎
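The sampling step of this proof can be sketched as follows. The property is computed naively here, by recomputing probabilities of successive prefixes, so the sketch does not match the running time claimed in Lemma 4. All function names, the default constant `c=1`, and the fixed seed are assumptions of this example; the sample sequence follows Table 1 with $p_{i}(\mathtt{B})=1-p_{i}(\mathtt{A})$.

```python
import math
import random
from fractions import Fraction

A_PROBS = [Fraction(1), Fraction(1, 2), Fraction(3, 4),
           Fraction(4, 5), Fraction(1, 2), Fraction(1, 4)]
X_EXAMPLE = [{"A": a, "B": 1 - a} for a in A_PROBS]


def factor_probability(X, S, i, j):
    """Product of X[t][S[t]] for t = i..j (0-based, inclusive); 1 if empty."""
    result = Fraction(1)
    for t in range(i, j + 1):
        result *= X[t][S[t]]
    return result


def sample_string(X, rng):
    """Draw S ~ X: one independent letter per position."""
    return "".join(
        rng.choices(list(dist), weights=[float(p) for p in dist.values()])[0]
        for dist in X
    )


def longest_solid_prefixes(X, S, z):
    """pi[i] = end of the longest prefix of S[i..] with probability >= 1/z."""
    n, pi = len(S), []
    for i in range(n):
        end = i - 1  # the empty prefix always qualifies
        while end + 1 < n and factor_probability(X, S, i, end + 1) * z >= 1:
            end += 1
        pi.append(end)
    return pi


def sample_family(X, z, c=1, seed=0):
    rng = random.Random(seed)
    k = math.ceil((c + 2) * z * math.log(len(X) * z))
    family = []
    for _ in range(k):
        S = sample_string(X, rng)
        family.append((S, longest_solid_prefixes(X, S, z)))
    return family


family = sample_family(X_EXAMPLE, 4)
```

By construction every reported factor has probability at least $\frac{1}{z}$; the converse direction holds only with high probability, as the union bound above shows.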

We can directly use the same methods as in Section 6 to construct a weighted index from the family of strings constructed in Lemma 4. The space complexity of the resulting index is worse than the one in Theorem 4 by a factor of $\log(nz)$ and the construction is randomised.

Corollary 6.

There is a data structure of size $\mathcal{O}(nz\log(nz))$ for the Weighted Indexing problem which answers queries in optimal time. It can be constructed using a randomised $\mathcal{O}(nz\log(nz))$ -time algorithm which returns a valid weighted index with high probability.

The same type of construction can be used to obtain an approximate weighted index. To this end, we need a stronger equivalence property of a string family and a greater number of sampled strings to satisfy this property.

Lemma 5.

There is a randomised algorithm which, given a weighted sequence $X$ of length $n$ and a parameter $\epsilon$ , in $\mathcal{O}(\frac{n}{\epsilon^{2}}\log(\frac{n}{\epsilon}))$ time constructs a family $\SS$ of $k=\mathcal{O}(\frac{1}{\epsilon^{2}}\log(\frac{n}{\epsilon}))$ strings $S_{j}$ with properties $\pi_{j}$ such that $|\mathit{Prob}_{X}(P,i)-\frac{1}{k}\mathit{Count}_{\SS}(P,i)|<\epsilon$ for every position $i$ and string $P$ . It succeeds with high probability ( $1-(\frac{\epsilon}{n})^{c}$ for arbitrarily large constant $c$ ).

Proof.

We randomly sample $k=\left\lceil(c+2)\frac{1}{\epsilon^{2}}\ln\frac{n}{\epsilon}\right\rceil$ strings $S_{1},\ldots,S_{k}$ . The properties $\pi_{j}$ satisfy that $S_{j}[i\mathinner{.\,.}\pi_{j}[i]]$ is the longest prefix of $S_{j}[i\mathinner{.\,.}n]$ such that $\mathit{Prob}_{X}(S_{j}[i\mathinner{.\,.}\pi_{j}[i]],i)\geq\epsilon$ .

Observe that if $\mathit{Prob}_{X}(P,i)<\epsilon$ , then $\mathit{Count}_{\SS}(P,i)=0$ . On the other hand, if $\mathit{Prob}_{X}(P,i)\geq\epsilon$ , then $\mathit{Count}_{\SS}(P,i)\sim\mathrm{Bin}(k,\mathit{Prob}_{X}(P,i))$ . Consequently, Hoeffding’s inequality [10] implies

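$$\mathbb{P}\left[\left|\mathit{Prob}_{X}(P,i)-\tfrac{1}{k}\mathit{Count}_{\SS}(P,i)\right|\geq\epsilon\right]\leq 2\exp(-2\epsilon^{2}k)\leq 2\left(\tfrac{\epsilon}{n}\right)^{2(c+2)}.$$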

There are at most $\frac{n^{2}}{\epsilon}$ such pairs $(P,i)$ , so the family $\SS$ satisfies the required condition with probability at least $1-(\frac{\epsilon}{n})^{c}$ , as claimed. ∎

We can use this family of strings to construct an approximate weighted index using top- $k$ document retrieval just as in Section 7. We arrive at the following result, whose space complexity is greater than the one from Theorem 5 by a factor of $\frac{1}{\epsilon}\log\frac{n}{\epsilon}$ and whose construction is randomised.

Corollary 7.

There is a data structure of size $\mathcal{O}(\frac{n}{\epsilon^{2}}\log\frac{n}{\epsilon})$ which solves the Approximate Weighted Indexing problem with $\mathcal{O}(m+|\mathit{Occ}_{\frac{1}{z^{\prime}}-\epsilon}(P,X)|)$ -time queries. It can be constructed using a randomised $\mathcal{O}(\frac{n}{\epsilon^{2}}\log^{2}\frac{n}{\epsilon})$ -time algorithm which returns a valid approximate weighted index with high probability.

9 Conclusions

In this article we present an efficient index for Weighted Pattern Matching along with new combinatorial insights into the nature of weighted sequences. We have produced an implementation of the index (see https://bitbucket.org/kociumaka/weighted_index) that we have validated for correctness and efficiency against known weighted pattern matching algorithms [15, 4, 5]. Our implementation supports decision, counting, and reporting variants of queries; however, only decision operations were implemented in worst-case optimal time.

Let us mention that our results can be extended to integer alphabets $\Sigma$ , i.e., $\Sigma\subseteq\{1,\dots,n^{\mathcal{O}(1)}\}$ , without influencing the space and construction time. We have omitted the description of this extension and preferred to focus on the basic case of a constant-sized alphabet that is also most relevant in practice.

Finally, our ideas can be used to improve the solution for the Generalised Weighted Indexing problem from [6]. They use a notion of special weighted sequences in which each position contains at most one letter with a positive probability. (In this case the assumption that the probabilities sum up to 1 at each position is waived.) In [6] the input weighted sequence is transformed using the reduction of [2] into a special weighted sequence of length $\mathcal{O}(nz^{2}\log z)$ that preserves the set of maximal solid factors. In the special weighted sequence, a query for a pattern $P$ under the probability threshold $\frac{1}{z^{\prime}}$ is answered in $\mathcal{O}(m+m\cdot|\mathit{Occ}_{\frac{1}{z^{\prime}}}(P,X)|)$ time.

Our $z$ -estimation $\SS$ can be transformed into a special weighted sequence of length $\mathcal{O}(nz)$ that also preserves the set of solid factors. We simply concatenate the strings, taking the letter probabilities from the respective positions in $X$ , and split the concatenated parts with a zero-probability position. This gives a more space-efficient reduction that can be used in the data structure of [6].
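The reduction can be sketched as follows, under the assumption that a special weighted sequence is represented as a list of (letter, probability) pairs and that a separator is encoded as `(None, 0)`. Both the representation and the function name `to_special_weighted_sequence` are choices made for this illustration and are not taken from [6].

```python
from fractions import Fraction


def to_special_weighted_sequence(X, strings):
    """Concatenate the strings S_j, attaching letter probabilities from X.

    Position i of every S_j receives the probability of its letter at
    position i of X; consecutive strings are split by one zero-probability
    position so that no solid factor spans two strings.
    """
    special = []
    for j, s in enumerate(strings):
        if j > 0:
            special.append((None, 0))  # zero-probability separator
        special.extend((c, X[i].get(c, 0)) for i, c in enumerate(s))
    return special


X_small = [
    {"A": Fraction(1, 2), "B": Fraction(1, 2)},
    {"A": Fraction(1, 4), "B": Fraction(3, 4)},
]
special = to_special_weighted_sequence(X_small, ["AB", "BB"])
```

For $\left\lfloor z\right\rfloor$ strings of length $n$ the result has $\left\lfloor z\right\rfloor(n+1)-1$ positions, which matches the $\mathcal{O}(nz)$ length stated above.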

Corollary 8.

For a weighted sequence of length $n$ over an integer alphabet, the Generalised Weighted Indexing problem can be solved with $\mathcal{O}(m+m\cdot|\mathit{Occ}_{\frac{1}{z^{\prime}}}(P,X)|)$ -time queries with an index of size $\mathcal{O}(nz)$ .

Acknowledgement

We thank an anonymous referee of the previous version of the paper for the idea of a simple randomised construction. We also thank Tsvi Kopelowitz for bringing our attention to the multitude of existing solutions to the Property Indexing problem.

Bibliography

[1] Charu C. Aggarwal and Philip S. Yu. A survey of uncertain data algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, 21(5):609–623, 2009.
[2] Amihood Amir, Eran Chencinski, Costas S. Iliopoulos, Tsvi Kopelowitz, and Hui Zhang. Property matching and weighted matching. Theoretical Computer Science, 395(2-3):298–310, April 2008.
[3] Carl Barton, Tomasz Kociumaka, Solon P. Pissis, and Jakub Radoszewski. Efficient index for weighted sequences. In Roberto Grossi and Moshe Lewenstein, editors, Combinatorial Pattern Matching, CPM 2016, volume 54 of LIPIcs, pages 4:1–4:13, Dagstuhl, Germany, 2016. Schloss Dagstuhl–Leibniz-Zentrum für Informatik.
[4] Carl Barton, Chang Liu, and Solon P. Pissis. Fast average-case pattern matching on weighted sequences, 2015.
[5] Carl Barton, Chang Liu, and Solon P. Pissis. On-line pattern matching on uncertain sequences and applications. In T.-H. Hubert Chan, Minming Li, and Lusheng Wang, editors, Combinatorial Optimization and Applications, COCOA 2016, volume 10043 of LNCS, pages 547–562. Springer, 2016.
[6] Sudip Biswas, Manish Patil, Sharma V. Thankachan, and Rahul Shah. Probabilistic threshold indexing for uncertain strings. In Evaggelia Pitoura, Sofian Maabout, Georgia Koutrika, Amélie Marian, Letizia Tanca, Ioana Manolescu, and Kostas Stefanidis, editors, 19th International Conference on Extending Database Technology, EDBT 2016, pages 401–412. OpenProceedings.org, 2016.
[7] Manolis Christodoulakis, Costas S. Iliopoulos, Laurent Mouchard, and Kostas Tsichlas. Pattern matching on weighted sequences. In Algorithms and Computational Methods for Biochemical and Evolutionary Networks, CompBioNets 2004, KCL publications, 2004.
[8] Nilesh N. Dalvi, Christopher Ré, and Dan Suciu. Probabilistic databases: diamonds in the dirt. Communications of the ACM, 52(7):86–94, 2009.
