$L_p$ Pattern Matching in a Stream
Tatiana Starikovskaya, Michal Svagerka, Przemysław Uznański

TL;DR
This paper develops new streaming algorithms for approximate pattern matching under various $L_p$ distances, significantly improving space efficiency for large-scale, noisy data such as biological sequences.
Contribution
It introduces a suite of streaming algorithms for $L_p$ pattern matching with improved space complexity, extending previous work to broader $L_p$ norms and approximation guarantees.
Findings
Achieved $\tilde{O}(\frac{1}{\varepsilon^2}\sqrt{n})$ space algorithms for $L_p$ distances with $0 < p \leq 1$.
Extended streaming pattern matching algorithms to $L_1$, $L_2$, and general $L_p$ norms.
Significantly improved space efficiency over previous algorithms for large-scale, noisy data.
$L_p$ Pattern Matching in a Stream
Tatiana Starikovskaya, DIENS, École normale supérieure, PSL Research University, France. This work was partially funded by the grant ANR-19-CE48-0016 from the French National Research Agency (ANR).
Michal Svagerka, ETH Zürich, Switzerland
Przemysław Uznański, Institute of Computer Science, University of Wrocław, Poland. Supported by Polish National Science Centre grant 2019/33/B/ST6/00298.
Abstract
We consider the problem of computing distance between a pattern of length $n$ and all $n$-length subwords of a text in the streaming model.
In the streaming setting, only the Hamming distance ($L_0$) has been studied. It is known that computing the exact Hamming distance between a pattern and a streaming text requires $\Omega(n)$ space (folklore). Therefore, to develop sublinear-space solutions, one must relax their requirements. One possibility to do so is to compute only the distances bounded by a threshold $k$, see [SODA'19, Clifford, Kociumaka, Porat] and references therein. The motivation for this variant of this problem is that we are interested in subwords of the text that are similar to the pattern, i.e. in subwords such that the distance between them and the pattern is relatively small.
On the other hand, the main application of the streaming setting is processing large-scale data, such as biological data. Recent advances in hardware technology allow generating such data at a very high speed, but unfortunately, the produced data may contain about 10% of noise [Biol. Direct.'07, Klebanov and Yakovlev]. To analyse such data, it is not sufficient to consider small distances only. A possible workaround for this issue is the $(1\pm\varepsilon)$-approximation. This line of research was initiated in [ICALP'16, Clifford and Starikovskaya] who gave a $(1\pm\varepsilon)$-approximation algorithm with space $\tilde{O}(\varepsilon^{-5}\sqrt{n})$.
In this work, we show a suite of new streaming algorithms for computing the Hamming, $L_1$, $L_2$ and general $L_p$ ($0 < p < 2$) distances between the pattern and the text. Our results significantly extend over the previous result in this setting. In particular, for the Hamming distance and for the $L_p$ distance when $0 < p \leq 1$ we show a streaming algorithm that uses $\tilde{O}(\varepsilon^{-2}\sqrt{n})$ space for polynomial-size alphabets.
1 Introduction
In the problem of pattern matching, we are given a pattern $P$ of length $n$ and a text $T$ and must find all occurrences of $P$ in $T$. A particularly relevant variant of this fundamental question is approximate pattern matching, where the goal is to find all subwords of the text that are similar to the pattern. This can be restated in the following way: given a pattern $P$, a text $T$, and a distance function, compute the distance between $P$ and every $n$-length subword of $T$. A very natural similarity measure for words is the Hamming distance. Furthermore, if both $P$ and $T$ are over an integer alphabet, one can consider the Manhattan distance or the Euclidean distance.
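Definition 1.1 (Hamming, Manhattan and Euclidean distances).
For a vector $X = x_1 x_2 \ldots x_n$, its Hamming norm is defined as $\lVert X\rVert_H = |\{i : x_i \neq 0\}|$, its Manhattan norm is defined as $\lVert X\rVert_1 = \sum_i |x_i|$, and its Euclidean norm is defined as $\lVert X\rVert_2 = \sqrt{\sum_i x_i^2}$. For two words $X$ and $Y$ of equal length, their Hamming distance is defined as $\lVert X-Y\rVert_H$, their Manhattan distance as $\lVert X-Y\rVert_1$, and their Euclidean distance as $\lVert X-Y\rVert_2$.
Those distance functions naturally generalize to the so-called $L_p$ distances, where $p$ is the exponent.
Definition 1.2 ($p$'th moment, $p$'th norm).
For a vector $X = x_1 x_2 \ldots x_n$ and $p \geq 0$, its $p$'th moment is defined as $\lVert X\rVert_p^p = \sum_i |x_i|^p$ (with the convention $0^0 = 0$), and for $p > 0$ its $L_p$ norm is defined as $\lVert X\rVert_p = \left(\sum_i |x_i|^p\right)^{1/p}$. For two words $X$ and $Y$ considered as vectors, the $p$'th moment of their difference is $\lVert X-Y\rVert_p^p$ and their $L_p$ distance is defined as $\lVert X-Y\rVert_p$.
In other words, the Manhattan distance is the $L_1$ distance, the Euclidean distance is the $L_2$ distance, and the Hamming distance can be considered as the $L_0$ distance.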
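As a small illustration of these definitions, the following Python sketch computes the Hamming distance, the $p$'th moment and the $L_p$ distance of two equal-length integer words. The function names and the sample words are chosen here for illustration only.

```python
def hamming(x, y):
    """Number of positions where two equal-length words differ."""
    assert len(x) == len(y)
    return sum(1 for a, b in zip(x, y) if a != b)


def moment(x, y, p):
    """p'th moment of the difference: sum_i |x_i - y_i|^p, for p > 0."""
    assert len(x) == len(y) and p > 0
    return sum(abs(a - b) ** p for a, b in zip(x, y))


def lp_distance(x, y, p):
    """L_p distance: the p'th moment raised to the power 1/p."""
    return moment(x, y, p) ** (1.0 / p)


X = [3, 1, 4, 1, 5]
Y = [2, 1, 7, 1, 5]
manhattan = lp_distance(X, Y, 1)   # |3-2| + |4-7|
euclidean = lp_distance(X, Y, 2)   # sqrt(1 + 9)
```

For very small $p$ the $p$'th moment of the difference approaches the Hamming distance, which is one way to read the statement that the Hamming distance behaves like the $L_0$ distance.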
Below we assume that the length of the text is $2n$, as any algorithm on a text of larger length can be reduced to repeated application of an algorithm that runs on texts of length $2n$. This is done by splitting the text into blocks of length $2n$ which overlap by $n$ characters.
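The reduction to short texts can be sketched as follows, assuming blocks of length $2n$ that start every $n$ characters. The helper names are hypothetical; the point is only that every $n$-length window of the text lies entirely inside at least one block.

```python
def overlapping_blocks(text, n):
    """Yield (start, block) pairs: blocks of length 2n that overlap by n."""
    start = 0
    while start < len(text):
        yield start, text[start:start + 2 * n]
        start += n


def windows_via_blocks(text, n):
    """Collect every n-length window by looking inside the blocks only."""
    found = {}
    for start, block in overlapping_blocks(text, n):
        for j in range(len(block) - n + 1):
            found[start + j] = block[j:j + n]
    return found
```

A window may be seen in two consecutive blocks, so a real algorithm would report each alignment once; the dictionary above simply overwrites duplicates.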
Offline setting.
For the Hamming distance, the problem has been extensively studied in the offline setting, where we assume random access to the input. The first algorithm, for a constant-size alphabet, was shown by Fischer and Paterson [22]. The algorithm uses $O(n \log n)$ time and in substance computes the Boolean convolution of two vectors a constant number of times. This was later extended to polynomial-size alphabets in [1, 34]. With a somewhat similar approach, the same complexity can be achieved for the $L_1$ distance [13]. Later, in [35, 36] the authors proved that these problems must have equal (up to polylogarithmic factors) complexities by showing reductions from the Hamming to the $L_1$ distance and back.
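The convolution idea can be sketched in a few lines of Python: for every character, correlate the indicator vector of the text with the indicator vector of the pattern and add up the matches. This is an illustrative sketch rather than the algorithm of [22]; in particular, the `correlate` helper below is a naive quadratic stand-in for the FFT-based convolution that gives the near-linear running time.

```python
def correlate(a, b):
    """out[i] = sum_j a[i + j] * b[j]; an FFT would compute this much faster."""
    n, m = len(a), len(b)
    return [sum(a[i + j] * b[j] for j in range(m)) for i in range(n - m + 1)]


def hamming_all_alignments(text, pattern):
    """Hamming distance between the pattern and every window of the text."""
    m = len(pattern)
    matches = [0] * (len(text) - m + 1)
    for c in set(pattern):
        # One correlation per character of the alphabet used by the pattern.
        t_ind = [1 if x == c else 0 for x in text]
        p_ind = [1 if x == c else 0 for x in pattern]
        for i, v in enumerate(correlate(t_ind, p_ind)):
            matches[i] += v
    return [m - v for v in matches]
```

The number of correlations grows with the alphabet size, which is why this direct approach is attractive mainly for constant-size alphabets.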
To improve the complexity for large alphabets, the natural next step was to study approximation algorithms. Until very recently, the fastest $(1\pm\varepsilon)$-approximation algorithm for computing the Hamming distances was by Karloff [30]. The algorithm combines random projections from an arbitrary alphabet to the binary one and Boolean convolution to solve the problem in time. In a breakthrough paper Kopelowitz and Porat [32] gave a new approximation algorithm improving the time complexity to , which was later significantly simplified [33]. Using a similar technique, Gawrychowski and Uznański [24] showed an approximation algorithm for computing the $L_1$ distance in (randomized) time, later made deterministic in time in [40]. Using similar techniques, the authors of [40] gave a -time $(1\pm\varepsilon)$-approximation algorithm for $L_p$ distances for any constant positive $p$. (Footnote 1: Across the paper we use $\tilde{O}$ to indicate that we are suppressing poly-log(n) factors.)
Streaming setting.
In the streaming setting, we assume that the pattern and the text arrive as streams, one character at a time (the pattern arrives before the text). The main objective is to design algorithms that use as little space as possible, and we must account for all the space used by the algorithm, including the space required to store the input, in full or in part. It is also often the case that the text arrives at a very high speed and we must be able to process it faster than it arrives to fulfil the space guarantees, preferably, in real time. To this aim, the time complexity of streaming algorithms is defined as the worst-case amount of time spent on processing one character of the text, i.e. per arrival.
In the streaming setting, only the Hamming distance ($L_0$) has been studied. It is known that computing the Hamming distance between a pattern and a streaming text exactly requires $\Omega(n)$ space, even for the binary alphabet and with a small probability error allowed, which can be shown by a straightforward reduction to communication complexity (folklore).
Therefore, to develop sublinear-space solutions, one must relax their requirements. One possibility to do so is to compute only the distances bounded by a threshold $k$. This variant of the problem is often referred to as the $k$-mismatch problem. The $k$-mismatch problem has been extensively studied in the literature [15, 16, 26, 39], with this line of work reaching memory complexity and time per input character. The motivation for this variant of the problem is that we are interested in subwords of the text that are similar to the pattern, in other words, the distance between the pattern and the text should be relatively small. On the other hand, the main application of the streaming setting is processing large-scale data, such as biological data. To decrease the cost of generating such data, new hardware approaches have recently been developed. They have become widely used due to cost efficiency, but unfortunately, the produced data may contain about 10% of noise [31]. To analyse such data, it is not sufficient to consider small distances only, and a possible workaround for this issue is $(1\pm\varepsilon)$-approximation. This line of research was initiated by Clifford and Starikovskaya [17] who gave a $(1\pm\varepsilon)$-approximation algorithm with space $\tilde{O}(\varepsilon^{-5}\sqrt{n})$ that uses time per arriving character of the text.
Independently and in parallel with this work, the authors of [12] showed a $(1\pm\varepsilon)$-approximation streaming algorithm for the $k$-mismatch problem that uses space. For a special case of , they show how to reduce the space further to . Compared to our solution, their algorithm has worse time complexity of per arrival, and more importantly, it is not obvious whether it can be generalised to other norms as it uses a very different set of techniques.
Sliding window.
The problem of computing the $L_p$ distance between $P$ and every $n$-length subword of $T$ in the streaming setting resembles the problem of maintaining the $L_p$ norm of an $n$-length suffix of a streaming text, also referred to as sliding window. In fact, the latter is a simplification of the former, with setting $P = 0^n$. There is an extensive line of work on maintaining the norm of a sliding window, refer to [4, 5, 6, 7, 8, 19] and references therein. The main message is that the norm of a sliding window can be maintained efficiently, e.g. for the $L_p$ norms can be maintained $(1\pm\varepsilon)$-approximately in space . However, those results do not translate to our case: in the sliding window, one can easily isolate "heavy hitters", that is updates with a significant contribution to the output. In our case, the contribution of an update depends on its relative position to the pattern, and one can easily construct instances where a contribution of a position in the text changes drastically relative to its alignment with the pattern, which necessitates a significantly different approach.
1.1 Our results
In this work, we show a suite of new streaming algorithms for computing the Hamming, $L_1$, $L_2$ and general $L_p$ ($0 < p < 2$) distances between the pattern and the text. Our results significantly improve and extend the results of [17].
Theorem 1.3.
Given a pattern $P$ of length $n$ and a text $T$ over an alphabet , where , there is a streaming algorithm that computes a $(1\pm\varepsilon)$-approximation of the $L_p$ distance between $P$ and every $n$-length subword of $T$ correctly w.h.p.
1. in space, and time per arrival when $p = 0$ (Hamming distance);
2. in space and time per arrival when $p = 1$ (Manhattan distance);
3. in space and time per arrival when ;
4. in space and time per arrival when ;
5. in space and time per arrival when ;
6. in space and time per arrival for .
We also improve and extend the space lower bound of [17], who showed that any streaming algorithm that computes a $(1\pm\varepsilon)$-approximation of the Hamming distance between a pattern and a streaming text must use bits for all such that for some constant (condition inherited from [28]). We show the following result:
Lemma 1.4.
Let and . Any $(1\pm\varepsilon)$-approximation algorithm that computes the $L_p$ distance between a pattern and a streaming text for each alignment, must use bits of space.
Proof.
Let us first show the lower bound for $p = 0$, i.e., for Hamming distance. We show the lower bound by reduction to a two-party communication complexity problem called GAP-Hamming-distance. In this problem, the two parties, Alice and Bob, are given two binary words of length and a parameter , . Alice sends Bob a message, and Bob's task is to output one if the Hamming distance between his and Alice's word is larger than , and zero if it is at most . Otherwise, he can output "don't know". By Proposition 4.4 [10], the communication complexity of this problem is .
We can now show a space lower bound for any $(1\pm\varepsilon)$-approximate algorithm for computing the Hamming distance between the pattern and the text by a standard reduction. Suppose that there is an algorithm that uses bits of space. Let be Alice's word, Bob's word. After reading , the algorithm stores all the information about it in bits of space. We construct the communication protocol as follows: Alice sends the information about to Bob. Using it, Bob can continue running the algorithm and compute the $(1\pm\varepsilon)$-approximation of the Hamming distance between and . We have thus developed a communication protocol with complexity , a contradiction.
We can now show the lower bound for . We immediately obtain a space lower bound for any $(1\pm\varepsilon)$-approximate algorithm for computing the $p$'th moment between the pattern and the text at every alignment. Indeed, on binary words the $p$'th moment is equal to the Hamming distance for all $p$. The lower bound for the $L_p$ distance follows by Observation 1.5. ∎
1.2 Techniques
At a very high level, the structure of all algorithms presented in this paper is similar to that of [17] (in fact, such an approach in a similar context was also used independently in [18]). We process the text by blocks of length . To compute an approximation of the $L_p$ distance / the $p$'th moment at a particular alignment, we divide the pattern into two parts: a prefix of length aligned with a suffix of some block of the text, and the remaining suffix (see Fig. 1). We compute an approximation of the $L_p$ distance / the $p$'th moment for both of the parts and sum them up to obtain the final answer. Our main contribution is a set of new tools that allows computing the approximations efficiently.
To be able to compute the approximation of the $L_p$ distance / the $p$'th moment between the prefix and the corresponding block of the text, we compute, while reading each block of the text, its compact lossy description that we refer to as prefix encoding. The prefix encoding captures the relation between the read block and the prefix of the pattern of length . To compute the $L_p$ distance / the $p$'th moment between the suffix and the text, we will use suffix sketches. For each position of the text, the suffix sketch describes the subword of the text where is the smallest integer such that (see Fig. 1).
For the Hamming distance, we define the prefix encodings in Section 2.1 and the suffix sketches in Section 3.1. Our Hamming prefix encoding introduces a novel use of a known technique called subsampling. The prefix encodings are used to approximate the distance between any suffix of one word and the prefix of another word of the same length. In brief, the idea is to replace each character of the two words by the don't care character '?', a special character that matches any other character of the alphabet. We repeat the process a logarithmic number of times to create a logarithmic number of pairs of "subsamples". For each pair, we find the longest suffix of one subsample that matches the prefix of the second subsample up to at most mismatches. We then show that this information can be used to approximate the Hamming distance between any suffix-prefix pair. Similar techniques were used in [3, 20, 23, 25, 29, 38] for estimating the Hamming norm in streams. The crucial difference with our approach is that we must be able to compute the Hamming norm of any suffix-prefix pair of the two words, and we must be able to do it efficiently. As for the suffix sketches, for the binary alphabet we use the sketches introduced in [17]. We then show a reduction from arbitrary alphabets to the binary alphabet, which improves the space consumption of Hamming suffix sketches by a factor of .
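The split into a prefix part and a suffix part relies only on the additivity of the $p$'th moment over positions. The sketch below makes this explicit with exact computations in place of the prefix encodings and suffix sketches; the block length `b` is a free parameter here, and the function names are hypothetical.

```python
def moment(x, y, p):
    """p'th moment of the difference of two equal-length integer words."""
    return sum(abs(a - b) ** p for a, b in zip(x, y))


def split_alignment(i, n, b):
    """Length of the pattern prefix that ends at the first block boundary at or after i."""
    k = (-i) % b  # distance from position i to the next multiple of b
    return min(k, n)


def moment_by_parts(pattern, text, i, b, p):
    """Moment at alignment i, computed as prefix part + suffix part."""
    n = len(pattern)
    k = split_alignment(i, n, b)
    # Prefix of the pattern against the tail of the current block.
    prefix_part = moment(pattern[:k], text[i:i + k], p)
    # Remaining suffix of the pattern against text starting at a block boundary.
    suffix_part = moment(pattern[k:], text[i + k:i + n], p)
    return prefix_part + suffix_part
```

In the streaming algorithms each of the two parts is replaced by an approximation, and since both parts are non-negative, the sum of two $(1\pm\varepsilon)$-approximations is again a $(1\pm\varepsilon)$-approximation of the total.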
We can solve the problem of $L_1$ (Manhattan distance) pattern matching by replacing each character of the pattern and of the stream with its unary encoding and running the solution for the Hamming distance. However, this would introduce a multiplicative factor equal to the size of the alphabet to the time complexity. We show efficient randomised reductions from the Manhattan to Hamming distance that allow simulating the solution for the Hamming distance without a significant overhead. In particular, to design the prefix encodings we use random shifting and rounding, while for the suffix sketches we use range-summable hash functions [9]. We show the Manhattan prefix encodings in Section 2.2 and the Manhattan suffix sketches in Section 3.2.
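The unary-encoding reduction is easy to check on a small example. In the sketch below, a character $a$ is encoded as $a$ ones followed by zeros up to a fixed length `sigma` (the symbol `sigma` for the largest character and the function names are assumptions made for illustration); the Hamming distance of the encodings then equals the Manhattan distance of the original words.

```python
def unary(a, sigma):
    """Encode a in [0, sigma] as a ones followed by sigma - a zeros."""
    return [1] * a + [0] * (sigma - a)


def encode_word(word, sigma):
    """Concatenate the unary encodings of all characters."""
    out = []
    for a in word:
        out.extend(unary(a, sigma))
    return out


def hamming(x, y):
    return sum(1 for a, b in zip(x, y) if a != b)


U = [3, 0, 5]
V = [1, 4, 5]
encoded_distance = hamming(encode_word(U, 5), encode_word(V, 5))
```

The encoded words are `sigma` times longer than the originals, which is the overhead that the randomised reductions are designed to avoid.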
For generic $L_p$ distances, , we discuss the prefix encodings in Section 2.4 and the suffix sketches in Section 3.3. Our approach to prefix encodings is rather involved. In the case of , we construct a novel embedding from $L_p$ space into the Hamming space, which might be of independent interest. While the target dimension of the Hamming space is large, we construct the embedding in such a way that each value is mapped into a compressible sequence of form for some small value of , and where values of are constant across all input values. Such compressed representation allows us to efficiently apply the subsampling framework and reduce the problem to the Hamming distance case. For , we identify a logarithmic number of anchor suffixes, and partition each of them into words of roughly even contribution to the $L_p$ distance. We then use the partition to decode prefix-suffix distance queries for arbitrary length queries. Such construction is a generalization and improvement of the approach presented in [17]. For suffix sketches, we simply use the $p$-stable distributions [27].
Finally, we combine the prefix encodings and the suffix sketches to prove Theorem 1.3 in Section 4. To simplify the notation, from now on we write $x \stackrel{\varepsilon}{=} y$ to denote that $x$ approximates $y$ within a $(1\pm\varepsilon)$ factor. We will also use the fact that for we can speak of approximating the $p$'th moment of differences between the pattern and the $n$-length substrings of the text and the $L_p$ distances between the pattern and the $n$-length substrings of the text interchangeably, as this changes the complexities by a constant factor only:
Observation 1.5.
For any constant and , there is a constant such that finding a approximation of the $p$'th moment of a vector suffices for $(1\pm\varepsilon)$-approximating its $p$'th norm, and finding a approximation of its $p$'th norm suffices for $(1\pm\varepsilon)$-approximating its $p$'th moment.
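One direction of this observation can be checked numerically. The sketch below assumes the (not necessarily tight) choice of approximating the moment within a factor $1 \pm c\varepsilon$ with $c = \min(p, 1)/2$, a constant picked here for illustration, and measures the worst relative error of the norm recovered from such a moment estimate.

```python
def norm_from_moment(m, p):
    """Recover the p'th norm from the p'th moment."""
    return m ** (1.0 / p)


def moment_accuracy_needed(eps, p):
    """Illustrative choice eps' = c * eps with c = min(p, 1) / 2 (an assumption)."""
    return min(p, 1.0) / 2.0 * eps


def worst_case_norm_error(true_moment, eps, p):
    """Largest relative norm error over moment estimates within 1 +/- eps'."""
    eps_m = moment_accuracy_needed(eps, p)
    true_norm = norm_from_moment(true_moment, p)
    lo = norm_from_moment((1 - eps_m) * true_moment, p)
    hi = norm_from_moment((1 + eps_m) * true_moment, p)
    return max(abs(lo - true_norm), abs(hi - true_norm)) / true_norm
```

Since $x \mapsto x^{1/p}$ is monotone, the two endpoints of the moment interval are the worst cases, so checking them is enough.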
2 Prefix encodings
In this section we present a solution to the following problem. Imagine we have a block of text and a prefix of the pattern . We want to find a compressed representation (encoding) of so that the following is possible: given any , the compressed representation of , and (explicitly), we can approximate , where is a suffix of and is a prefix of .
We start by presenting a solution to the Hamming distance case, which is a basis to our solution for all other norms for .
2.1 Hamming ($L_0$) distance
Recall that '?' is the don't care character, a special character that matches any other character of the alphabet.
Definition 2.1 (Hamming subsampling).
Consider a word of length . Let and let be a function drawn at random from a pairwise independent family. For , we define the -th level Hamming subsample of , , as follows:
[TABLE]
In particular, .
Fix an integer large enough. For two words , consider the following estimation procedure:
Algorithm 2.2.
1. Denote to be the Hamming distance between and and let . (Footnote 2: We emphasize that contains don't care characters, so the Hamming distance is defined as the number of pairs of characters of and that do not match.)
2. Output $Z_f$ as an estimate of $\lVert U-V\rVert_H$.
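A minimal Python sketch of subsampling-based estimation follows. It assumes the common form of the technique: at level $i$ a position survives with probability $2^{-i}$ and is otherwise replaced by the don't care character, and the estimate is the distance at the first level where it drops below a threshold, scaled by $2^i$. The hash, the threshold parameter and all names are assumptions for illustration and need not match Definition 2.1 and Algorithm 2.2 exactly.

```python
import random

DONT_CARE = "?"


def make_hash(n, max_level, seed=0):
    """Stand-in hash (assumption): one random integer in [0, 2^max_level) per position.
    The text asks for a pairwise independent family instead."""
    rng = random.Random(seed)
    return [rng.randrange(2 ** max_level) for _ in range(n)]


def subsample(word, h, level):
    """Keep position j iff the low `level` bits of h[j] are zero (probability 2^-level)."""
    return [c if h[j] % (2 ** level) == 0 else DONT_CARE for j, c in enumerate(word)]


def hamming_dc(x, y):
    """Hamming distance where the don't care character matches everything."""
    return sum(1 for a, b in zip(x, y)
               if a != b and a != DONT_CARE and b != DONT_CARE)


def estimate_hamming(u, v, h, max_level, threshold):
    """Scale up the distance at the first level where it is at most the threshold."""
    for level in range(max_level + 1):
        d = hamming_dc(subsample(u, h, level), subsample(v, h, level))
        if d <= threshold or level == max_level:
            return (2 ** level) * d
```

Both words are subsampled with the same hash, so a surviving position is compared against a surviving position. When the true distance is already below the threshold, level 0 is used and the answer is exact.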
The following lemma is a rephrasing of a known result regarding subsampling in estimation of the Hamming norm (cf. [3, Theorem 3], or [25, Theorem 2]).
Lemma 2.3.
For $Z_f$ as in Algorithm 2.2 there is $Z_f \stackrel{\varepsilon}{=} \lVert U-V\rVert_{H}$ with probability at least .
Proof.
Denote . Consider a fixed value . Let be binary variables indicating existence of a mismatch between and at positions , so that . We observe that and therefore , because each of the positions with mismatch between and generates a mismatch between and with probability .
Furthermore, as the function in Definition 2.1 is drawn from a pairwise independent family, there is . Let . By Chebyshev's inequality, we have
[TABLE]
We estimate . Assume w.l.o.g. that . Observe that , which implies, for , . By Equation 1, there is
[TABLE]
It follows that . Hence, we obtain
[TABLE]
It follows that we can choose large enough so that $Z_f \stackrel{\varepsilon}{=} \lVert U-V\rVert_{H}$ with probability . ∎
Since the subsampling is performed independently for each position, one can use subsampling to approximate the Hamming distance between any suffix of and any prefix of of equal lengths in a similar fashion.
We are now ready to define the Hamming prefix encoding of a block. For brevity, let and (the same for all ). Furthermore, given two words of equal length, define the mismatch information .
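The mismatch information of two equal-length words is commonly taken to be the list of positions where they differ, together with the two differing characters. The sketch below assumes that reading (the function names are hypothetical) and shows why it suffices to restore one word from the other.

```python
def mismatch_info(x, y):
    """List of (position, character of x, character of y) at the mismatches."""
    return [(j, a, b) for j, (a, b) in enumerate(zip(x, y)) if a != b]


def restore(y, info):
    """Rebuild x from y and the mismatch information of (x, y)."""
    x = list(y)
    for j, a, _ in info:
        x[j] = a
    return x


X = list("banana")
Y = list("bandan")
info = mismatch_info(X, Y)
```

The size of the mismatch information is proportional to the Hamming distance of the two words, so it is compact whenever the words are close.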
Definition 2.4.
Consider a -length block of the text . For each , let be the maximal integer such that the Hamming distance between and is at most . We define the Hamming prefix encoding of to be a tuple of pairs .
Note that the prefix encoding of uses space. We can compute it efficiently:
Lemma 2.5.
Assume constant-time random access to . Given a -length block of the text , its Hamming prefix encoding can be computed in time.
Proof.
To compute the encoding, we use the algorithm of [14]. Formally, for each we create a word by appending don't care characters to the subsample . The algorithm of [14] can be used to find all -length subwords of that match with up to mismatches; moreover, for each of these subwords the algorithm outputs the mismatch information. We take the leftmost subword only, which corresponds to because of the don't care characters. In total, our algorithm uses time. ∎
We now show how to compute the Hamming distance between any -length suffix of and any -length prefix of given and the Hamming prefix encoding of a block .
Lemma 2.6.
Given the prefix encoding of a -length block of the text , there is an algorithm that computes, for any , a $(1\pm\varepsilon)$-approximation of the Hamming distance between the -length suffix of and the -length prefix of in time.
Proof.
Denote to be the Hamming distance between and . We compute the smallest such that in the following way. For each , we use to restore . We then append with don't care characters and run the algorithm of [14] for the resulting text and the pattern. This allows computing for all , and if , then by definition. In total, the algorithm takes time. ∎
2.2 Manhattan ($L_1$) distance
Recall a word morphism , . Our goal in this section is to simulate implicitly procedures from Lemma 2.5 and Lemma 2.6 on words and without introducing any significant overhead.
Definition 2.7 (Manhattan scaling).
Consider a word of length . Let and let be a function drawn at random from a -wise independent family. For , we define the -th level Manhattan subsample of , , as a word of length such that . In particular, .
Fix an integer large enough. For words , consider for all , and the following estimation procedure:
Algorithm 2.8.
1. Denote and let .
2. Output $Z_f$ as an estimate of $\lVert U-V\rVert_1$.
Lemma 2.9.
For $Z_f$ as in Algorithm 2.8 there is $Z_f \stackrel{\varepsilon}{=} \lVert U-V\rVert_{1}$ with probability .
Proof.
Take some position and denote for short and and . There is $|a-b| \in \{\lfloor |c| \rfloor, \lceil |c| \rceil\}$ and $\mathbb{E}[|a-b|] = |c|$. Since $|a-b| - \lfloor |c| \rfloor$ is a $0/1$ variable, there is $\mathrm{Var}[|a-b|] = \mathrm{Var}[|a-b| - \lfloor |c| \rfloor] \leq \mathbb{E}[|a-b| - \lfloor |c| \rfloor] \leq \mathbb{E}[|a-b|]$. Summing for all values of , we reach that
[TABLE]
Since we have reached an identical variance bound, the proof follows step-by-step the proof of Lemma 2.3. ∎
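The behaviour used in this proof can be reproduced with a small experiment on random shifting and rounding. The sketch assumes that scaling by `scale` maps a character $x$ to $\lfloor x/\mathrm{scale} + r \rfloor$ for a shift $r \in [0, 1)$ shared by both words at that position; this exact form, and the names below, are assumptions for illustration. Exact rational arithmetic avoids floating-point noise.

```python
import math
from fractions import Fraction


def shift_round(x, scale, r):
    """floor(x / scale + r) for a shared random shift r in [0, 1)."""
    return math.floor(Fraction(x, scale) + r)


def rounded_gaps(x, y, scale, steps):
    """|a - b| for every shift r = k / steps, k = 0, ..., steps - 1."""
    gaps = []
    for k in range(steps):
        r = Fraction(k, steps)
        gaps.append(abs(shift_round(x, scale, r) - shift_round(y, scale, r)))
    return gaps


gaps = rounded_gaps(13, 6, 4, 1000)   # here c = (13 - 6) / 4 = 7/4
average_gap = Fraction(sum(gaps), len(gaps))
```

In this example the rounded gap only takes the values $\lfloor 7/4 \rfloor = 1$ and $\lceil 7/4 \rceil = 2$, and its average over the evenly spaced shifts equals $7/4$, matching the two properties the variance bound relies on.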
To approximate the Manhattan distance between any suffix of and any prefix of of equal lengths, we define the encoding similar to the Hamming distance case. Specifically, we still use the mismatch information, building on the fact that for any two words and from the mismatch information the exact value of can be found. We define as before, but change the definition of slightly. Intuitively, we define to be the -length prefix of subsampled in a synchronized way with . Formally, .
Definition 2.10.
Consider a -length block of the text . For each , let be the maximal integer such that the Manhattan distance between and is at most . We define the Manhattan prefix encoding of to be a tuple of pairs .
Note that the prefix encoding of uses space.
Lemma 2.11.
Assume constant-time random access to . Given a -length block of the text , its Manhattan prefix encoding can be computed in time and space.
Proof.
Let . For each and we compare and character by character in time to find and the corresponding mismatch information. The claim follows. ∎
Lemma 2.12.
Given the prefix encoding of a -length block of the text , there is an algorithm that computes, for all , a $(1\pm\varepsilon)$-approximation of the Manhattan distance between the -length suffix of and the -length prefix of in time.
Proof.
Denote . We compute the smallest such that in the following way. For each , we use to restore . If , the Manhattan distance between and is at least . Otherwise, we compare and character by character to compute the Manhattan distance in time. The claim follows. ∎
2.3 Generic () distance for
Our goal is to construct a morphism (parametrised by ) acting as a randomized embedding of into the Hamming distance. The intuition behind our approach is as follows. Let be a sequence of real numbers picked independently and u.a.r. Define a sequence of values
[TABLE]
and for a character consider sequence of characters where (similarly, a character defines a sequence ). Now consider two characters such that for some integer and a random variable . There is
[TABLE]
We thus see that an idealized morphism of the form would have the property that on words of length . But there are the following issues: (i) characters are mapped into infinite length words, (ii) number of repetitions of characters () is fractional, (iii) we cannot guarantee that character distance is always of form and (iv) the distance is preserved only in expectation. We show how to overcome these issues to achieve the following result:
Theorem 2.13.
Given and there is a word morphism $\varphi$ such that:
1. when , when and when ;
2. values of and do not depend on ;
3. there exists a constant $\alpha$ such that for any two words $U, V$ of length at most , we have $\lVert U-V\rVert_{p}^{p} \stackrel{\varepsilon}{=} \alpha\cdot\lVert\varphi(U)-\varphi(V)\rVert_{H}$ with probability at least ;
4. it is enough for the randomness to be realized by a hash function from a -independent hash function family for some , which can be generated from a bits size seed.
Proof.
We will consider three cases: , , and .
Case . Our plan is to build upon the scheme highlighted earlier in this section. Specifically, we preserve the values of .
Consider a pair of characters . First, note that is an increasing function of . From this and Equation 2 we obtain that $\mathbb{E}[x] \stackrel{\varepsilon}{=} |c-c'|^{p}\left(\frac{(1+\varepsilon)^{p}}{(1+\varepsilon)^{p}-1}+\frac{1}{(1+\varepsilon)^{1-p}-1}\right)\varepsilon^{-1}$ for all values of .
Second, fix and observe that truncating the sum after the -th term introduces an additional factor to the approximation, since for we have
[TABLE]
We also round down to the nearest integer, which introduces an additional relative error, since . Finally, we set . We then have
To guarantee that the equality holds with probability at least and not just in expectation, we repeat the scheme several times, with independent random seeds. That is, consider morphisms and define a morphism with property:
[TABLE]
Assume w.l.o.g. that . We proceed to bound
[TABLE]
We set for the claim to hold via Chebyshev's inequality. The error probability coming from Chebyshev's inequality can be made an arbitrarily small constant by fixing the constant factor in to be large enough. We finally set .
Case . Note that for such that we have $|x|^{p} \stackrel{\varepsilon}{=} |x|^{p'}$ for all . We can therefore reduce this case to . However, we have to take into account that the asymptotic growth of hides dependency on for , hence for .
Case . The proof follows the steps of the case . We first bound the variance:
[TABLE]
We set , so that by Chebyshev's inequality, the probability of failing to obtain $\lVert U-V\rVert_{p}^{p} \stackrel{\varepsilon}{=} \alpha\cdot\lVert\varphi(U)-\varphi(V)\rVert_{H}$ is an arbitrarily small constant (by setting to be large enough).
Randomness. The only source of randomness in the description are the values picked u.a.r. and independently. We note that the values can be picked instead as finite-precision floating-point numbers. Since all the values we are working with are bounded by , it is enough to set precision accordingly. We also observe that our concentration argument involves only Chebyshev's inequality and thus only the variance and the expected value, so it suffices to require that are -wise independent. ∎
We now describe how to use the morphism to approximate the distances in a small space. To design an efficient algorithm, we take advantage of the fact that has a compressed representation of size comparable with the length of (at least when ).
Definition 2.14 ( scaling).
Consider a word of length . Let be a function drawn at random from a -wise independent family, where . For , we define the -th level subsample of ,
[TABLE]
In particular, .
Consider two words of form and . Fix an integer large enough and consider for all , where .
Algorithm 2.15.
1. Denote and let .
2. Output $Z_f$ as an estimate of $\lVert S-Q\rVert_H$.
Lemma 2.16.
For $Z_f$ as in Algorithm 2.15 there is $Z_f \stackrel{\varepsilon}{=} \lVert S-Q\rVert_{H}$ with probability .
Proof.
Consider a fixed subsampling level . For simplicity, let and . Define a random variable to be the contribution of of to the Hamming distance , i.e.
[TABLE]
Since and , we have and
[TABLE]
Summing over all values of , we reach and . These bounds are identical to that of Lemma 2.3 and we can proceed in a similar fashion to obtain the claim. ∎
We are now ready to define prefix encodings. Consider a -length block of the text and define ( is defined as in Theorem 2.13). Also, define to be the -length prefix of subsampled in a synchronized way with .
Definition 2.17.
Consider a -length block of the text . For each , where , let be the maximal integer such that the Hamming distance between and is at most . We define the prefix encoding of to be a tuple of pairs .
The prefix encoding of uses space.
Lemma 2.18**.**
Assume constant-time random access to . Given a -length block of the text , its prefix encoding can be computed in time and space.
Proof.
For each and , we compute the Hamming distance between and in time using the compressed representation to find and the corresponding mismatch information. The claim follows. ∎
Lemma 2.19**.**
Given the prefix encoding of a -length block of the text , there is an algorithm that computes, for all , a -approximation of the distance between the -length suffix of and the -length prefix of in time and space.
Proof.
Denote . We compute the smallest such that in the following way. For each , we use to restore . If , the Hamming distance between and is at least . Otherwise, we compare and to compute the Hamming distance in time. The claim follows. ∎
2.4 Generic () distance for .
For , we use a scheme similar to the one developed in [17] for the Hamming distance, but adapt it to generic distances. In particular, we plug in a standard tool used in this situation, the -stable distribution. We additionally have to adapt the scheme slightly, taking into account that the norm is sub-additive under concatenation when .
Definition 2.20** (-stable distribution [41]).**
For a parameter , we say that a distribution is -stable if for all and random variables drawn independently from , the variable is distributed as , where is a random variable with distribution .
Consider a word , and let be independent random variables drawn from a -stable distribution with expected value . By Definition 2.20, we have . The -stable distributions exist for all , and a random variable from a -stable distribution can be generated using the formula [11, 41], where is uniform on and is uniform on .
However, to design an efficient sketching scheme that allows one to approximate the norm with high probability, three technicalities must be overcome. First, one must show that concentrates well. Second, the formula above assumes infinite precision of computation. Finally, one cannot use fully independent random variables as above, as this would require too much space. To overcome these issues, Indyk [27] combined -stable distributions and pseudorandom generators for bounded space computation [37]. We restate the final result of Indyk below, in the form that will be convenient for us later.
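To make the generation step concrete, the following minimal sketch samples a symmetric $p$-stable variable. The formula is not legible in this copy of the text, so the code assumes the standard Chambers-Mallows-Stuck transform with $\theta$ uniform on $[-\pi/2, \pi/2]$ and $r$ uniform on $(0,1)$. The function names and the median-based $L_1$ estimator are illustrative choices, and the sketch uses fully independent samples and floating-point arithmetic, i.e. it ignores exactly the technicalities that Indyk's construction handles.

```python
import math
import random
import statistics


def p_stable_sample(p, theta, r):
    """Chambers-Mallows-Stuck transform (assumed form of the formula).

    theta is assumed uniform on [-pi/2, pi/2], r uniform on (0, 1).
    """
    a = math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
    b = (math.cos(theta * (1.0 - p)) / math.log(1.0 / r)) ** ((1.0 - p) / p)
    return a * b


def draw(p, rng):
    """Draw one p-stable sample from a random.Random instance."""
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    r = rng.random()
    while r == 0.0:  # avoid log(1/0)
        r = rng.random()
    return p_stable_sample(p, theta, r)


def l1_estimate(x, k, rng):
    """Estimate ||x||_1 as the median of k absolute 1-stable projections."""
    projections = [abs(sum(xi * draw(1.0, rng) for xi in x)) for _ in range(k)]
    return statistics.median(projections)
```

For $p = 1$ the transform reduces to $\tan\theta$ (a Cauchy variable), and for $p = 2$ it reduces to $2\sin\theta\sqrt{\ln(1/r)}$, a Gaussian in Box-Muller form. The median works as an estimator for $p = 1$ because the median of the absolute value of a standard Cauchy variable equals 1.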
Theorem 2.21** (cf. Theorem 2, Theorem 4 [27]).**
For any , there is a non-uniform streaming algorithm that maintains a sketch of a word of length over an alphabet of size such that:
when a new character of arrives, the sketch can be updated in time;
the algorithm and the sketch use bits of space.
Given the sketches of two words of length , one can estimate up to a factor with probability at least in time .
We now proceed to building the prefix encoding by using and the landmarking technique.
Definition 2.22** ( prefix encoding).**
Let . Consider a word of length on the alphabet of size . Define . For , let be the leftmost position such that the $p$'th moment of the difference between and , i.e. , is at most .
Further, divide into blocks such that each block is either a single character, or the $p$'th moment of the difference between each block and the corresponding subword of is at most . Let be the block borders. We choose from left to right, and each position is chosen to be the rightmost possible.
The prefix encoding of is defined to contain sorted lists of the positions and , characters , and sketches for -approximating the $p$'th norm of , for all and as in Observation 1.5, see also Theorem 2.21.
The encoding takes bits of space. We now show that given the prefix encoding of a block of the text of length , one can compute a -approximation of the distance between any prefix of the pattern and the corresponding suffix of .
Lemma 2.23**.**
Let . For any two vectors of equal length, $\Big|\lVert X+Y\rVert_{p}^{p}-\lVert X\rVert_{p}^{p}\Big|=\mathcal{O}\big(\lVert Y\rVert_{p}^{p}+\lVert Y\rVert_{p}\cdot\lVert X\rVert_{p}^{p-1}\big)$.
Proof.
Consider . If , then by Taylor expansion, . If , then . Thus for any real values, we have
[TABLE]
Denote and . There is
[TABLE]
Pick so that . By Hölder's inequality:
[TABLE]
∎
Lemma 2.24**.**
Let . Given the prefix encoding of a block of the text of length , one can find a -approximation of the $p$'th moment of the difference between any prefix of the pattern and the corresponding suffix of in time .
Proof.
Let be the position that is closest to from the left, and (see Fig. 2). We can find , in time by iterating over the sorted lists.
The position divides into two parts, and . Denote and the respective subwords of they are aligned with (see Fig. 2). Let and . Then , being the value we need to approximate, is equal to .
We can find $m^{\prime}_{2}\stackrel{\varepsilon}{=}m_{2}$ using the sketches for and in time . Furthermore, if , then we can compute exactly as we store . Otherwise, we consider the subword of the pattern . Denote and use it as our estimation of .
Since , by definition, , and . By Lemma 2.23 with and ,
[TABLE]
and finally . ∎
Lemma 2.25**.**
Let . The prefix encoding of a -length block of the text can be computed in time and space .
Proof.
For , we naively compute the distance between the suffix of and the prefix of in time. We then find the positions . For each , we can find the positions in time and compute the sketches in time by Theorem 2.21. ∎
3 Suffix sketches
In this section, we give the definitions and explain how we maintain the suffix sketches for each of the distances.
3.1 Hamming distance
We first recall Euclidean suffix sketches as presented in [17]. In fact, we will not use them for the Euclidean distance as for it we can use the generic solution of Section 3.3, but they will serve as a foundation of Hamming suffix sketches.
All sketches presented in this section are correct with constant probability, which can be amplified to for arbitrarily small by a standard method of repeating sketching independently times and taking the median of the estimates.
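The amplification step mentioned above can be sketched as follows. The helper `noisy` is a hypothetical stand-in for a sketch-based estimator that is accurate with constant probability; its parameters (`truth`, `eps`, `success`) are assumptions made for the illustration and do not come from the paper.

```python
import random
import statistics


def amplify(estimator, repetitions):
    """Median of independent runs of a constant-probability estimator."""
    return statistics.median([estimator() for _ in range(repetitions)])


def noisy(rng, truth=100.0, eps=0.1, success=0.75):
    """Hypothetical estimator: within (1 +/- eps) of truth w.p. `success`."""
    if rng.random() < success:
        return truth * (1.0 + rng.uniform(-eps, eps))
    return truth * rng.uniform(0.0, 10.0)  # arbitrary bad answer
```

The median is off only if at least half of the runs fail, which by a Chernoff bound happens with probability exponentially small in the number of repetitions.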
Lemma 3.1** (Euclidean sketches [2]).**
Let be a random matrix of size filled with 4-wise independent random variables, for chosen big enough. For a vector we have $\frac{1}{\sqrt{d}}\lVert MX\rVert_{2}\stackrel{\varepsilon}{=}\lVert X\rVert_{2}$ with constant probability , taken over all possible choices of . We say that a vector of dimension is a Euclidean sketch of .
Definition 3.2** (Euclidean suffix sketches [17]).**
Consider a word of length . We define its Euclidean suffix sketch as follows.
Let be the block length. Let be a random matrix of size filled with 4-wise independent random variables and let be 4-wise independent random coefficients with values as well. We define a matrix of size such that .
Let be a word of length obtained from by appending an appropriate number of zeroes. The Euclidean suffix sketch of is defined as , where is considered as a vector.
Observe that the matrix does not need to be accessed explicitly. Indeed, from it follows that the Euclidean suffix sketch can be computed by first sketching each block of using the matrix , and then taking a linear combination of the sketches of the blocks (using the random coefficients ).
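The observation can be checked on a small example. The explicit form of the large matrix is not legible in this copy, so the sketch below assumes it is the horizontal concatenation of the copies $\alpha_i M$, one per block; the function names are ours. Both functions treat a short last block as zero-padded.

```python
import random


def sketch_block(M, block):
    """Multiply the d x b matrix M by one block (shorter blocks are zero-padded)."""
    return [sum(m * x for m, x in zip(row, block)) for row in M]


def suffix_sketch_blockwise(M, alphas, X, b):
    """Sketch each block with M, then combine the block sketches with alphas."""
    out = [0] * len(M)
    for start in range(0, len(X), b):
        block_sketch = sketch_block(M, X[start:start + b])
        alpha = alphas[start // b]
        out = [o + alpha * v for o, v in zip(out, block_sketch)]
    return out


def suffix_sketch_explicit(M, alphas, X):
    """Same sketch via the explicit matrix [alpha_1 M | alpha_2 M | ...] (assumed form)."""
    big = [[alpha * m for alpha in alphas for m in row] for row in M]
    return [sum(m * x for m, x in zip(row, X)) for row in big]
```

The blockwise variant never materialises the large matrix and only needs the current block sketch and the running linear combination.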
Lemma 3.3** ([17]).**
Selecting gives $\frac{1}{\sqrt{d}}\lVert\mathsf{eSketch}(X)\rVert_{2}\stackrel{\varepsilon}{=}\lVert X\rVert_{2}$ with probability at least (taken over all possible choices of ).
By linearity of sketches, we obtain $\lVert X-Y\rVert_{2}\stackrel{\varepsilon}{=}\frac{1}{\sqrt{d}}\lVert\mathsf{eSketch}(X)-\mathsf{eSketch}(Y)\rVert_{2}$ with probability at least as well.
We now define Hamming suffix sketches. First note that for binary words there is , and therefore in the case of the binary alphabet we can use the Euclidean suffix sketches. We will now show how to reduce the case of arbitrary polynomial-size alphabets to the case of the binary alphabet.
To this end, [17] used a random mapping of Karloff [30] as a black-box reduction, which led to sketches of size . We now show a more careful reduction to avoid this overhead and to achieve dependency in total. Consider a word morphism defined on alphabet as , (and acting on words by concatenating the images of each character of the input word). Note that , thus using the Euclidean suffix sketches on top of and allows computation of the respective Hamming distance. Formally,
Definition 3.4** (Hamming suffix sketches [17]).**
Consider a word of length on the alphabet of size . We define its Hamming suffix sketch as follows.
Let be the block length, be a random matrix of size filled with 4-wise independent random variables, and be 4-wise independent random coefficients with values as well. We define a matrix of size such that .
Let be a word of length obtained from by appending an appropriate number of zeroes. The Hamming suffix sketch of is defined as , where is considered as a vector.
Lemma 3.5**.**
Selecting gives $\frac{1}{2d}\lVert\mathsf{hSketch}(X)\rVert_{2}^{2}\stackrel{\varepsilon}{=}\lVert X\rVert_{H}$ with probability at least (taken over all possible choices of ).
Proof.
Follows immediately as a corollary of Lemma 3.3 and the properties of the embedding . In more detail, the following holds with probability at least :
[TABLE]
∎
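The embedding property used in Lemma 3.5 can be illustrated on a toy input. The definition of the morphism is not legible in this copy, so the sketch assumes a one-hot encoding of each character (a single 1 in a block of $\sigma$ bits), which is consistent with the factor $\frac{1}{2d}$ in Lemma 3.5: every mismatch contributes exactly 2 to the squared Euclidean distance of the images.

```python
def one_hot(word, sigma):
    """Map each character a in {0, ..., sigma-1} to a 0/1 block with a single 1."""
    out = []
    for a in word:
        out.extend(1 if j == a else 0 for j in range(sigma))
    return out


def hamming(x, y):
    """Number of mismatching positions of two equal-length words."""
    return sum(a != b for a, b in zip(x, y))


def squared_l2(u, v):
    """Squared Euclidean distance of two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))
```

Each image has only one non-zero entry per character, which is the sparsity exploited by the streaming algorithm below.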
As are sparse, there is an efficient streaming algorithm for maintaining the Hamming suffix sketches of a text:
Lemma 3.6**.**
Given a text , there is a streaming algorithm that for every position outputs the Hamming suffix sketch of a word , where is the largest integer such that . The algorithm takes space and time per character.
Proof.
We fix the matrix and the random coefficients from Definition 3.4. We do not store and explicitly, but generate them using two hash functions drawn at random from a -wise independent family. For example, to generate we can consider a family of polynomials , with parameters chosen u.a.r. from the prime field for , and can be generated in a similar fashion. This way, we need to store only random bits that define the coefficients of two polynomials to generate and .
We process the text by blocks of length . For each block we compute its sketch using the matrix . That is, at the beginning of each block we initialize its sketch as a zero vector of length . When a new character of a block arrives, we compute and add to the sketch, which takes time. We store the sketch of until the block and use it to compute the suffix sketches for the positions in this block.
Consider now a block . We first compute the suffix sketch for the position , which is the position preceding the block . The suffix sketch for it is simply a linear combination of the sketches of the blocks with coefficients . Since each sketch is a vector of length , we can compute the linear combination in time. To make this computation time-efficient, we start it positions before position arrives, and de-amortise the computation over these positions. This way, we use only time per character.
Now, using the suffix sketch for the position , we can compute the suffix sketches for all positions in the block one-by-one, using only time per character: When a new character arrives, we add to the suffix sketch to update it.
Note that at any time we store sketches of the blocks, so the algorithm uses space in total. ∎
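The hash-based generation of the matrix entries and coefficients in the proof above can be sketched as follows. A polynomial of degree $k-1$ with coefficients chosen u.a.r. from a prime field gives a $k$-wise independent family. The specific prime, the mapping of field elements to signs via the low bit, and the function names are assumptions made for the illustration.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime; assumed to exceed every index we hash


def eval_poly(coeffs, x, prime=PRIME):
    """Evaluate c_0 + c_1 x + ... + c_{k-1} x^{k-1} modulo prime (Horner's rule)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % prime
    return acc


def random_sign_generator(k, rng):
    """Return g(x) in {-1, +1} backed by a random polynomial with k coefficients."""
    coeffs = [rng.randrange(PRIME) for _ in range(k)]

    def g(x):
        # The low bit is very slightly biased because PRIME is odd.
        return 1 if eval_poly(coeffs, x) & 1 else -1

    return g
```

Only the $k$ coefficients are stored; an entry in row $i$ and column $j$ of the matrix can then be regenerated on demand as `g(i * b + j)` for a block length `b`.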
3.2 Manhattan () distance
To show efficient suffix sketches for the Manhattan distance, we consider a word morphism , . Note that , thus using the Hamming suffix sketches on top of and allows computation of the respective Manhattan distance.
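As an illustration of this reduction, the sketch below assumes the morphism is the unary encoding that maps a character $a$ to $a$ ones followed by $\sigma - a$ zeroes (the definition is not legible in this copy). Under this assumption the Hamming distance of the images equals the Manhattan distance of the original words.

```python
def unary(word, sigma):
    """Map each character a in {0, ..., sigma} to a ones followed by sigma - a zeroes."""
    out = []
    for a in word:
        out.extend([1] * a + [0] * (sigma - a))
    return out


def hamming(x, y):
    """Number of mismatching positions of two equal-length words."""
    return sum(a != b for a, b in zip(x, y))


def manhattan(x, y):
    """L_1 distance of two equal-length integer words."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

Note that the image of a single character has up to $\sigma$ non-zero entries, which is the source of the extra factor discussed next.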
However, if we apply the morphism straightforwardly, we will have to pay an extra factor per character to compute the Manhattan suffix sketches. To improve the running time, we will use range-summable hash functions. Range-summable hash functions were introduced by Feigenbaum et al. [21], and later their construction was improved by Calderbank et al. [9].
Definition 3.7** (cf. [9]).**
A family of hash functions (here is the argument and is the seed) is called -independent, range-summable if it satisfies the following properties for any :
(-independent)* for all distinct and all ,*
[TABLE]
(range-summable)* there exists a function such that given a pair of integers , and a seed , the value can be computed in time polynomial in .* (Footnote 3: In [9], the function was defined to take values in . We can change the range of values to by taking while preserving the properties.)
Corollary 3.8** (cf. Theorem 3.1 [9]).**
There is a -independent, range-summable family of hash functions with a random seed of length such that any range-sum can be computed in time.
Observation 3.9**.**
For a word , let . Let be as in Corollary 3.8 with . Then .
Thus, we see that range-summable hash functions can be used to efficiently simulate .
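To see what a range-sum oracle buys us, consider one row of $\pm 1$ hash values and a character $a$ encoded in unary. The inner product of the row with the encoding is the sum of the first $a$ hash values, i.e. a range sum. The sketch below only illustrates this identity: it answers range sums from a prefix-sum table of linear size, which is not the construction of [9] (that construction answers the same queries from a short seed). All names are ours.

```python
import random


def row_times_unary(row, a):
    """Inner product of a +/-1 row with the unary encoding 1^a 0^(sigma - a)."""
    encoding = [1] * a + [0] * (len(row) - a)
    return sum(h * u for h, u in zip(row, encoding))


def prefix_sums(row):
    """Table s with s[i] = row[0] + ... + row[i-1]."""
    sums = [0]
    for h in row:
        sums.append(sums[-1] + h)
    return sums


def range_sum(sums, lo, hi):
    """Sum of row[lo:hi] read off the prefix-sum table."""
    return sums[hi] - sums[lo]
```

Replacing the table lookup by a range-summable hash function removes the need to touch every position of the unary encoding.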
Definition 3.10** (Manhattan suffix sketches).**
Let be a word of length . We define its Manhattan suffix sketch as follows.
Let be the block length. Let be as in Corollary 3.8 with . Let be a random matrix of size filled with 4-wise independent random variables, such that and let be 4-wise independent random coefficients with values as well. We define a matrix of size such that .
Let be a word of length obtained from by appending an appropriate number of zeroes. The Manhattan suffix sketch of is defined as , where is considered as a vector.
Lemma 3.11**.**
Selecting gives $\frac{1}{d}\lVert\mathsf{mSketch}(X)\rVert_{2}^{2}\stackrel{\varepsilon}{=}\lVert X\rVert_{1}$ with probability at least (taken over all possible choices of and ).
Proof.
Follows immediately as a corollary of Lemma 3.3 and the properties of the embedding . In more detail, the following holds with probability at least :
[TABLE]
∎
Lemma 3.12**.**
Given a text , there is a streaming algorithm that for every position outputs the Manhattan suffix sketch of a word , where is the smallest integer such that . The algorithm takes space, and time per character.
Proof.
The proof mirrors the proof of Lemma 3.6, and we describe the key elements. We fix the random coefficients and the hash function from Definition 3.10. As previously, we do not store the coefficients explicitly, but generate them using a hash function drawn at random from a -wise independent family. The matrix is already defined by , with the following parameters: it requires bits of seed, and range-sum queries are answered in time .
In the sketching of blocks, we proceed in the same manner, except that when a new character of a block arrives, we compute and add to the sketch, which takes time ( times slower than the corresponding step in Lemma 3.6).
Consider now a block . When a new character arrives, we update the suffix sketch by adding to it.
All of the operations are time slower than the corresponding steps in Lemma 3.6, and the memory complexity is increased by the seed size term ( and terms get absorbed). ∎
3.3 Generic () distance for .
For generic distances, we use the approach of [27] based on -stable distributions.
Corollary 3.13**.**
Given a text , there is a streaming algorithm that for every position outputs the suffix sketch of a word , where is the smallest integer such that . The algorithm takes bits of space and time per character.
Proof.
We start a new instance of the sketching algorithm of Theorem 2.21 at every block border and continue running it for the next blocks. At each moment, there are active instances of the algorithm. The bounds follow. ∎
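The bookkeeping in this proof can be sketched as follows. `ToySum` is a hypothetical stand-in for the sketch of Theorem 2.21 (it simply accumulates the characters), and the parameters `block_len` and `lifetime_blocks` are placeholders for the block length and the number of blocks an instance stays alive, whose actual values are not legible in this copy.

```python
from collections import deque


class ToySum:
    """Hypothetical stand-in for a streaming sketch: accumulates the characters."""

    def __init__(self):
        self.value = 0

    def update(self, ch):
        self.value += ch


class RollingSketches:
    """Start a fresh sketch at every block border, retire the oldest ones."""

    def __init__(self, block_len, lifetime_blocks, new_sketch):
        self.block_len = block_len
        self.lifetime_blocks = lifetime_blocks
        self.new_sketch = new_sketch
        self.active = deque()
        self.pos = 0

    def push(self, ch):
        if self.pos % self.block_len == 0:  # block border
            self.active.append(self.new_sketch())
            if len(self.active) > self.lifetime_blocks:
                self.active.popleft()
        for sketch in self.active:  # every active instance sees the character
            sketch.update(ch)
        self.pos += 1
```

Each character costs one update per active instance, so both space and time per character scale with the number of active instances.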
4 Proof of Theorem 1.3
Recall the structure of the algorithms. During the preprocessing, we compute the suffix sketches of suffixes of . During the main stage, the text is processed by blocks of length . To compute an approximation of the distance / the $p$'th moment at a particular alignment, we divide the pattern into two parts: a prefix of length at most , and the remaining suffix. We compute an approximation of the distance / the $p$'th moment for both of the parts and sum them up to obtain the final answer. To compute an approximation of the distance / the $p$'th moment between the prefix and the corresponding block of the text, we compute, while reading each block of the text, its prefix encoding, and to compute an approximation of the distance / the $p$'th moment between the suffix and the text, we use the suffix sketches.
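The decomposition can be illustrated with exact arithmetic. In the sketch below both parts are computed exactly, whereas the algorithms obtain the prefix part from a prefix encoding and the suffix part from suffix sketches; the function names and the convention that the split happens at the first block border at or after the start of the alignment are assumptions. The sketch relies only on the fact that the Hamming distance and the $p$'th moment are additive under concatenation.

```python
def moment(u, v, p):
    """p'th moment of the difference of two equal-length words."""
    return sum(abs(a - c) ** p for a, c in zip(u, v))


def split_moment(text, pattern, start, block_len, p):
    """p'th moment at alignment `start`, split at the next block border."""
    n = len(pattern)
    border = -(-start // block_len) * block_len  # first border at or after start
    border = min(border, start + n)
    k = border - start  # length of the pattern prefix, less than block_len
    prefix_part = moment(text[start:border], pattern[:k], p)
    suffix_part = moment(text[border:start + n], pattern[k:], p)
    return prefix_part + suffix_part
```

Because the suffix part always starts at a block border, the text side of it is a run of whole blocks plus a partial block, which is exactly the shape maintained by the suffix sketches of Section 3.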
Hamming () distance. When we receive a new block of the text, we compute its Hamming prefix encoding using the algorithm of Lemma 2.5 in space. We de-amortize the computation over the subsequent block and spend time per character. We store the resulting encoding for the next blocks. In total, the encodings require space. The Hamming suffix sketches of occupy space. The algorithm of Lemma 3.6 that computes the suffix sketches takes space and time per character. Consider a block starting with position . To compute the Hamming distances between -length subwords that end in this block and the pattern, we apply the following approach. First, while reading the block preceding the current one, we decode the Hamming prefix encoding of the block that starts at position using Lemma 2.6. We de-amortize the algorithm to spend time per character. Hence, at the position , we know the -approximation between the prefixes of the pattern and the corresponding subwords of the text. At each position, we can compute the Hamming distance between the corresponding suffix of the pattern and the text in time using the Hamming suffix sketch. By taking , we obtain the claim.
Manhattan () distance. We proceed analogously to the Hamming distance case. The Manhattan prefix encoding of each block is computed using Lemma 2.11, in time per character. We store the resulting encoding for the next blocks, giving in total space. The Manhattan suffix sketches of occupy space. The algorithm of Lemma 3.12 takes space and time per character. For decoding the prefix encoding we use Lemma 2.12, spending time per character. Once again we take , and assume w.l.o.g. (as otherwise we can use a naive algorithm with space and time per character).
Generic () distance for . The prefix encodings of the blocks are computed using Lemma 2.18, using time per character. We store the resulting encodings for the next blocks, giving in total space. The suffix sketches of occupy space. The algorithm of Corollary 3.13 computes the suffix sketches for the text in space and time per character. For decoding the prefix encoding we use Lemma 2.19, spending time per character. We take , and substitute according to Theorem 2.13.
Generic () distance for . Note that for we can use a naive algorithm, that is, store itself in space. The update takes constant time, and computing the norm takes time, which is better than the guarantees of the theorem for such values of . For , the algorithm of Lemma 2.25 computes the prefix encodings of the blocks in space and time per character. The encodings occupy space. The suffix sketches of occupy space. The algorithm of Corollary 3.13 computes the suffix sketches for the text in space and time per character. Taking and assuming w.l.o.g. , we obtain the claim.
5 Conclusion
We pose several open questions. The first is whether the time complexity for can be improved to not involve any dependency on . For this we need a better technique than bounding the variance of the embedding into the Hamming distance: in our technique, the tail gets "too heavy". Another pressing question is whether for all values of we could improve upon time per character. We also remark that it seems unlikely that an embedding to Hamming space could be used to reduce space complexity for : does not admit the triangle inequality while the Hamming distance does, and the distance is not additive with respect to concatenation, while the Hamming distance is.
References
- [1] Karl R. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039–1051, 1987.
- [2] Dimitris Achlioptas. Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, 2003. doi:10.1016/S0022-0000(03)00025-4.
- [3] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In RANDOM 2002, pages 1–10. doi:10.1007/3-540-45726-7_1.
- [4] Vladimir Braverman, Ran Gelles, and Rafail Ostrovsky. How to catch $L_2$-heavy-hitters on sliding windows. Theor. Comput. Sci., 554:82–94, 2014.
- [5] Vladimir Braverman and Rafail Ostrovsky. Smooth histograms for sliding windows. In FOCS 2007, pages 283–293.
- [6] Vladimir Braverman and Rafail Ostrovsky. Effective computations on sliding windows. SIAM J. Comput., 39(6):2113–2131, 2010.
- [7] Vladimir Braverman, Rafail Ostrovsky, and Alan Roytman. Zero-one laws for sliding windows and universal sketches. In APPROX-RANDOM 2015, pages 573–590.
- [8] Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. Optimal sampling from sliding windows. J. Comput. Syst. Sci., 78(1):260–272, 2012.
