$L_p$ Pattern Matching in a Stream

Tatiana Starikovskaya; Michal Svagerka; Przemysław Uznański

arXiv:1907.04405·cs.DS·November 10, 2020

TL;DR

This paper develops new streaming algorithms for approximate pattern matching under various $L_p$ distances, significantly improving space efficiency for large-scale, noisy data such as biological sequences.

Contribution

It introduces a suite of streaming algorithms for $L_p$ pattern matching with improved space complexity, extending previous work to broader $L_p$ norms and approximation guarantees.

Findings

01

Achieved $\tilde{O}(\frac{1}{\varepsilon^{2}}\sqrt{n})$ space algorithms for $L_p$ distances with $0 < p \leq 1$.

02

Extended streaming pattern matching algorithms to $L_1$, $L_2$, and general $L_p$ norms.

03

Significantly improved space efficiency over previous algorithms for large-scale, noisy data.

Abstract

We consider the problem of computing distance between a pattern of length $n$ and all $n$ -length subwords of a text in the streaming model. In the streaming setting, only the Hamming distance ( $L_{0}$ ) has been studied. It is known that computing the exact Hamming distance between a pattern and a streaming text requires $\Omega(n)$ space (folklore). Therefore, to develop sublinear-space solutions, one must relax their requirements. One possibility to do so is to compute only the distances bounded by a threshold $k$ , see [SODA'19, Clifford, Kociumaka, Porat] and references therein. The motivation for this variant of this problem is that we are interested in subwords of the text that are similar to the pattern, i.e. in subwords such that the distance between them and the pattern is relatively small. On the other hand, the main application of the streaming setting is processing large-scale…

Equations

$$\mathsf{hSub}_{r}(U)[i]=\begin{cases}U[i],&\text{if the }r\text{ lowest bits of }h(i)\text{ are all }0;\\ ?,&\text{otherwise.}\end{cases}$$

$$\Pr\left[|Z_{r}-m|\geq 4\sqrt{m2^{c+1}}\right]=\Pr\left[|X_{r}-m/2^{r}|\geq 2^{2+(c+1-r)/2}\sqrt{m/2^{r}}\right]\leq 1/2^{4+(c+1-r)}$$

$$\Pr[X_{c+1}>k]\leq\Pr\left[X_{c+1}\geq m/2^{c+1}+4\sqrt{m/2^{c+1}}\right]\leq 1/16.$$

$$\begin{aligned}
\Pr\left[|Z_{f}-m|\geq 4\sqrt{2/k}\cdot m\right]
&\leq\Pr[f>c+1]+\sum_{r=0}^{c+1}\Pr\left[|Z_{f}-m|\geq 4\sqrt{m2^{c+1}}\mbox{ and }f=r\right]\\
&\leq\Pr[f>c+1]+\sum_{r=0}^{c+1}\Pr\left[|Z_{r}-m|\geq 4\sqrt{m2^{c+1}}\right]\\
&\leq 1/16+\sum_{r=1}^{c+1}1/2^{4+(c+1-r)}<1/4.
\end{aligned}$$

$$\mathrm{Var}[X_{r}]=\mathrm{Var}\left[\lVert\mathsf{mSub}_{r}(U)-\mathsf{mSub}_{r}(V)\rVert_{1}\right]\leq\mathbb{E}\left[\lVert\mathsf{mSub}_{r}(U)-\mathsf{mSub}_{r}(V)\rVert_{1}\right]=\mathbb{E}[X_{r}].$$

$$d_{i}=\begin{cases}\varepsilon^{-1}\cdot(1+\varepsilon)^{pi}&\text{when }i>0\\ \varepsilon^{-1}\cdot\frac{(1+\varepsilon)^{p}}{(1+\varepsilon)^{p}-1}&\text{when }i=0\end{cases}$$

$$\begin{aligned}
\mathbb{E}[x]
&=\varepsilon^{-1}\frac{(1+\varepsilon)^{p}}{(1+\varepsilon)^{p}-1}+\varepsilon^{-1}\sum_{i=1}^{\ell}\left((1+\varepsilon)^{p}\right)^{i}+\varepsilon^{-1}|c-c^{\prime}|\sum_{i=\ell+1}^{\infty}\left((1+\varepsilon)^{p-1}\right)^{i}\\
&=\varepsilon^{-1}\frac{\left((1+\varepsilon)^{\ell+1}\right)^{p}}{(1+\varepsilon)^{p}-1}+|c-c^{\prime}|\varepsilon^{-1}\frac{\left((1+\varepsilon)^{\ell+1}\right)^{p-1}}{1-(1+\varepsilon)^{p-1}}\\
&=|c-c^{\prime}|^{p}\left(\frac{(1+\varepsilon)^{p}}{(1+\varepsilon)^{p}-1}+\frac{1}{(1+\varepsilon)^{1-p}-1}\right)\varepsilon^{-1}\approx\varepsilon^{-2}|c-c^{\prime}|^{p}\frac{1}{p(1-p)}.
\end{aligned}$$

$$\sum_{i\geq q}d_{i}\cdot\frac{|c-c^{\prime}|}{(1+\varepsilon)^{i}}=\varepsilon^{-1}|c-c^{\prime}|\frac{\left((1+\varepsilon)^{q}\right)^{p-1}}{1-(1+\varepsilon)^{p-1}}\leq\frac{\varepsilon^{-1}\sigma}{\left(1-(1+\varepsilon)^{p-1}\right)\sigma\varepsilon^{-3}}=\Theta(\varepsilon).$$

$$\mathbb{E}\left[\lVert\varphi(c)-\varphi(c^{\prime})\rVert_{H}\right]=s\cdot\mathbb{E}\left[\lVert\varphi_{i}(c)-\varphi_{i}(c^{\prime})\rVert_{H}\right]=s\cdot\Theta\left(\varepsilon^{-2}|c-c^{\prime}|^{p}\frac{1}{p(1-p)}\right).$$

$$\begin{aligned}
\mathrm{Var}\left[\lVert\varphi(c)-\varphi(c^{\prime})\rVert_{H}\right]
&\leq s\cdot\sum_{i=\ell+1}^{q}\varepsilon^{-2}\left((1+\varepsilon)^{2p}\right)^{i}\frac{|c-c^{\prime}|}{(1+\varepsilon)^{i}}\\
&\leq s\cdot\varepsilon^{-2}|c-c^{\prime}|\sum_{i=\ell+1}^{\infty}\left((1+\varepsilon)^{2p-1}\right)^{i}\\
&\leq s\cdot\varepsilon^{-2}|c-c^{\prime}|^{2p}\frac{(1+\varepsilon)^{2p-1}}{1-(1+\varepsilon)^{2p-1}}\\
&=s\cdot O\left(|c-c^{\prime}|^{2p}\varepsilon^{-3}\frac{1}{1-2p}\right).
\end{aligned}$$

$$\begin{aligned}
\mathrm{Var}\left[\lVert\varphi(c)-\varphi(c^{\prime})\rVert_{H}\right]
&=s\cdot\varepsilon^{-2}|c-c^{\prime}|\sum_{i=\ell+1}^{q}\left((1+\varepsilon)^{2p-1}\right)^{i}\\
&=s\cdot O\left(\varepsilon^{-3}|c-c^{\prime}|\left((1+\varepsilon)^{q}\right)^{2p-1}\right)\\
&=s\cdot O\left(\varepsilon^{-3}|c-c^{\prime}|\sigma^{\frac{2p-1}{1-p}}/\varepsilon^{3\cdot\frac{2p-1}{1-p}}\right).
\end{aligned}$$

$$\mathsf{Sub}_{r}(S)=(s_{1})^{\left\lfloor\frac{e_{1}+(h(1)\bmod 2^{r})}{2^{r}}\right\rfloor}(s_{2})^{\left\lfloor\frac{e_{2}+(h(2)\bmod 2^{r})}{2^{r}}\right\rfloor}\dots(s_{m})^{\left\lfloor\frac{e_{m}+(h(m)\bmod 2^{r})}{2^{r}}\right\rfloor}$$

$$x_{i}=\lVert s_{i}^{e_{i}^{\prime}}-q_{i}^{e_{i}^{\prime}}\rVert_{H}=e_{i}^{\prime}\cdot\lVert s_{i}-q_{i}\rVert_{H}.$$

$$\mathrm{Var}[x_{i}]=\mathrm{Var}\left[x_{i}-\lfloor e_{i}/2^{r}\rfloor\right]\leq\mathbb{E}\left[x_{i}-\lfloor e_{i}/2^{r}\rfloor\right]\leq\mathbb{E}[x_{i}].$$

$$|x+y|^{p}=|x|^{p}+O\left(|y|^{p}+|y|\cdot|x|^{p-1}\right).$$

$$\lVert X+Y\rVert_{p}^{p}=\sum_{i}|x_{i}+y_{i}|^{p}=\sum_{i}|x_{i}|^{p}\pm O\left(\sum_{i}|y_{i}|^{p}+\sum_{i}|y_{i}||x_{i}|^{p-1}\right).$$

$$\sum_{i}|y_{i}||x_{i}|^{p-1}\leq\left(\sum_{i}|y_{i}|^{p}\right)^{1/p}\left(\sum_{i}|x_{i}|^{(p-1)q}\right)^{1/q}=\lVert Y\rVert_{p}\lVert X\rVert_{p}^{p-1}.$$

$$|m_{1}^{\prime}-m_{1}|=O\left(\lVert B_{1}-\tilde{P}\rVert_{p}^{p}+\lVert B_{1}-\tilde{P}\rVert_{p}\lVert P_{1}-B_{1}\rVert_{p}^{p-1}\right)=O\left(\varepsilon^{p}2^{k}+\varepsilon(2^{k})^{\frac{1}{p}}(2^{k})^{\frac{p-1}{p}}\right)=O(\varepsilon 2^{k})$$

$$\frac{1}{2d}\cdot\lVert\mathsf{hSketch}(X)\rVert_{2}^{2}=\frac{1}{2d}\lVert M\mu(X^{\prime})\rVert_{2}^{2}=\frac{1}{2d}\lVert M\mu(X)\rVert_{2}^{2}=\frac{1}{2d}\lVert\mathsf{eSketch}(\mu(X))\rVert_{2}^{2}\stackrel{\varepsilon}{=}\frac{1}{2}\lVert\mu(X)\rVert_{2}^{2}=\lVert X\rVert_{H}.$$

$$\Pr_{\xi\in\{0,1\}^{s}}\left[h(x_{1};\xi)=b_{1}\land\dots\land h(x_{k};\xi)=b_{k}\right]=2^{-k}$$

$$\frac{1}{d}\cdot\lVert\mathsf{mSketch}(X)\rVert_{2}^{2}=\frac{1}{d}\lVert M\nu(X^{\prime})\rVert_{2}^{2}=\frac{1}{d}\lVert M\nu(X)\rVert_{2}^{2}=\frac{1}{d}\lVert\mathsf{eSketch}(\nu(X))\rVert_{2}^{2}\stackrel{\varepsilon}{=}\lVert\nu(X)\rVert_{2}^{2}=\lVert X\rVert_{1}.$$

Full text

$L_{p}$ Pattern Matching in a Stream

Tatiana Starikovskaya (This work was partially funded by the grant ANR-19-CE48-0016 from the French National Research Agency (ANR).)

DIENS, École normale supérieure, PSL Research University, France

Michal Svagerka

ETH Zürich, Switzerland

Przemysław Uznański (Supported by Polish National Science Centre grant 2019/33/B/ST6/00298.)

Institute of Computer Science, University of Wrocław, Poland

Abstract

We consider the problem of computing distance between a pattern of length $n$ and all $n$ -length subwords of a text in the streaming model.

In the streaming setting, only the Hamming distance ( $L_{0}$ ) has been studied. It is known that computing the exact Hamming distance between a pattern and a streaming text requires $\Omega(n)$ space (folklore). Therefore, to develop sublinear-space solutions, one must relax their requirements. One possibility to do so is to compute only the distances bounded by a threshold $k$ , see [SODA’19, Clifford, Kociumaka, Porat] and references therein. The motivation for this variant of this problem is that we are interested in subwords of the text that are similar to the pattern, i.e. in subwords such that the distance between them and the pattern is relatively small.

On the other hand, the main application of the streaming setting is processing large-scale data, such as biological data. Recent advances in hardware technology allow generating such data at a very high speed, but unfortunately, the produced data may contain about 10% of noise [Biol. Direct.’07, Klebanov and Yakovlev]. To analyse such data, it is not sufficient to consider small distances only. A possible workaround for this issue is the $(1\pm\varepsilon)$ -approximation. This line of research was initiated in [ICALP’16, Clifford and Starikovskaya] who gave a $(1\pm\varepsilon)$ -approximation algorithm with space $\widetilde{\mathcal{O}}(\varepsilon^{-5}\sqrt{n})$ .

In this work, we show a suite of new streaming algorithms for computing the Hamming, $L_{1}$ , $L_{2}$ and general $L_{p}$ ( $0<p<2$ ) distances between the pattern and the text. Our results significantly extend the previous result in this setting. In particular, for the Hamming distance and for the $L_{p}$ distance when $0<p\leq 1$ we show a streaming algorithm that uses $\widetilde{\mathcal{O}}(\varepsilon^{-2}\sqrt{n})$ space for polynomial-size alphabets.

1 Introduction

In the problem of pattern matching, we are given a pattern $P$ of length $n$ and a text $T$ and must find all occurrences of $P$ in $T$ . A particularly relevant variant of this fundamental question is approximate pattern matching, where the goal is to find all subwords of the text that are similar to the pattern. This can be restated in the following way: given a pattern $P$ , a text $T$ , and a distance function, compute the distance between $P$ and every $n$ -length subword of $T$ . A very natural similarity measure for words is the Hamming distance. Furthermore, if both $P$ and $T$ are over an integer alphabet $\Sigma$ , one can consider the Manhattan distance or the Euclidean distance.

Definition 1.1 (Hamming, Manhattan and Euclidean distances).

For a vector $U=u_{1}u_{2}\ldots u_{n}$ , its Hamming norm is defined as $\lVert U\rVert_{H}=|\{i:u_{i}\not=0\}|$ , Manhattan norm is defined as $\lVert U\rVert_{1}=\sum_{i}|u_{i}|$ and Euclidean norm is defined as $\lVert U\rVert_{2}=\left(\sum_{i}u_{i}^{2}\right)^{1/2}$ . For two words $V=v_{1}v_{2}\ldots v_{n}$ and $W=w_{1}w_{2}\ldots w_{n}$ , their Hamming distance is defined as $\lVert V-W\rVert_{H}$ , their Manhattan distance as $\lVert V-W\rVert_{1}$ , and their Euclidean distance as $\lVert V-W\rVert_{2}$ .

These distance functions naturally generalize to the so-called $L_{p}$ distances, where $p>0$ is the exponent.

Definition 1.2 ( $p$ ’th moment, $p$ ’th norm).

For a vector $U=u_{1}u_{2}\ldots u_{n}$ and $p\geq 0$ , its $p$ ’th moment $F_{p}$ is defined as $F_{p}(U)=\sum_{i}|u_{i}|^{p}$ , and for $p>0$ its $L_{p}$ norm is defined as $\lVert U\rVert_{p}=F_{p}(U)^{1/p}=\left(\sum_{i}|u_{i}|^{p}\right)^{1/p}$ . For two words $V=v_{1}v_{2}\ldots v_{n}$ and $W=w_{1}w_{2}\ldots w_{n}$ considered as vectors, the $p$ ’th moment of their difference is $F_{p}(V-W)$ and their $L_{p}$ distance is defined as $\lVert V-W\rVert_{p}=F_{p}(V-W)^{1/p}=\left(\sum_{i}|v_{i}-w_{i}|^{p}\right)^{1/p}$ .
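Definitions 1.1 and 1.2 can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names `moment`, `lp_norm` and `lp_distance` are not from the paper, and the case $p=0$ is handled separately because Python evaluates `0 ** 0` as `1`, whereas $F_{0}$ must count only the non-zero entries.

```python
from typing import Sequence


def moment(u: Sequence[int], p: float) -> float:
    """p-th moment F_p(U) = sum_i |u_i|^p; for p = 0 it counts non-zero entries."""
    if p == 0:
        return float(sum(1 for x in u if x != 0))
    return float(sum(abs(x) ** p for x in u))


def lp_norm(u: Sequence[int], p: float) -> float:
    """L_p norm ||U||_p = F_p(U)^(1/p), defined for p > 0 only."""
    if p <= 0:
        raise ValueError("the L_p norm is only defined for p > 0")
    return moment(u, p) ** (1.0 / p)


def lp_distance(v: Sequence[int], w: Sequence[int], p: float) -> float:
    """Distance between equal-length words; p = 0 gives the Hamming distance."""
    if len(v) != len(w):
        raise ValueError("words must have equal length")
    diff = [a - b for a, b in zip(v, w)]
    return moment(diff, 0) if p == 0 else lp_norm(diff, p)
```

For example, `lp_distance([1, 2, 3], [1, 5, 7], p)` evaluates the Hamming, Manhattan and Euclidean distances for `p = 0, 1, 2` respectively. On binary vectors, `moment(u, p)` coincides with `moment(u, 0)` for every `p > 0`.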

In other words, the Manhattan distance is the $L_{1}$ distance, the Euclidean distance is the $L_{2}$ distance, and the Hamming distance can be considered as the $L_{0}$ distance.

Below we assume that the length of the text is $2n$ , as any algorithm on a text of larger length can be reduced to repeated application of an algorithm that runs on texts of length $2n$ . This is done by splitting the text into blocks of length $2n$ which overlap by $n$ characters.
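The reduction to texts of length $2n$ can be sketched as follows. The helper name `overlapping_blocks` and the 0-based offsets are assumptions made for illustration; the sketch works offline on a list, whereas a streaming algorithm would run two staggered instances instead of slicing.

```python
from typing import Iterator, Sequence, Tuple


def overlapping_blocks(text: Sequence[int], n: int) -> Iterator[Tuple[int, Sequence[int]]]:
    """Yield (offset, block) pairs; blocks have length at most 2n and start every n characters."""
    if n <= 0:
        raise ValueError("n must be positive")
    # The last block starts at the largest multiple of n that still leaves room
    # for one full n-length subword.
    for start in range(0, len(text) - n + 1, n):
        yield start, text[start:start + 2 * n]


def covering_block(i: int, n: int) -> int:
    """Offset of the block that fully contains the n-length subword starting at i."""
    return n * (i // n)
```

Consecutive blocks overlap by $n$ characters, so every $n$-length subword of the text lies entirely inside the block returned by `covering_block`.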

Offline setting.

For the Hamming distance, the problem has been extensively studied in the offline setting, where we assume random access to the input. The first algorithm, for a constant-size alphabet, was shown by Fischer and Paterson [22]. The algorithm uses $\mathcal{O}(n\log n)$ time and in substance computes the Boolean convolution of two vectors a constant number of times. This was later extended to polynomial-size alphabets in [1, 34]. With a somewhat similar approach, the same complexity was achieved for the $L_{1}$ distance in [13]. Later, in [35, 36] the authors proved that these problems must have equal (up to polylogarithmic factors) complexities by showing reductions from the Hamming to the $L_{1}$ distance and back.

To improve the complexity for large alphabets, the natural next step was to study approximation algorithms. Until very recently, the fastest $(1\pm\varepsilon)$ -approximation algorithm for computing the Hamming distances was by Karloff [30]. The algorithm combines random projections from an arbitrary alphabet to the binary one and Boolean convolution to solve the problem in $\mathcal{O}(\varepsilon^{-2}n\log^{3}n)$ time. In a breakthrough paper Kopelowitz and Porat [32] gave a new approximation algorithm improving the time complexity to $\mathcal{O}(\varepsilon^{-1}n\log^{3}{n}\log{\varepsilon^{-1}})$ , which was later significantly simplified [33]. Using a similar technique, Gawrychowski and Uznański [24] showed an approximation algorithm for computing the $L_{1}$ distance in $\mathcal{O}(\varepsilon^{-1}n\log^{4}n)$ (randomized) time, later made deterministic in time $\mathcal{O}(\varepsilon^{-1}n\log^{2}n)$ in [40]. Using similar techniques, the authors of [40] gave an $\widetilde{\mathcal{O}}(\varepsilon^{-1}n)$ -time $(1+\varepsilon)$ -approximation algorithm for $L_{p}$ distances for any constant positive $p$ . (Footnote 1: Throughout the paper we use $\widetilde{\mathcal{O}}$ to indicate that we are suppressing poly-log(n) factors.)

Streaming setting.

In the streaming setting, we assume that the pattern and the text arrive as streams, one character at a time (the pattern arrives before the text). The main objective is to design algorithms that use as little space as possible, and we must account for all the space used by the algorithm, including the space required to store the input, in full or in part. It is also often the case that the text arrives at a very high speed and we must be able to process it faster than it arrives to fulfil the space guarantees, preferably, in real time. To this aim, the time complexity of streaming algorithms is defined as the worst-case amount of time spent on processing one character of the text, i.e. per arrival.

In the streaming setting, only the Hamming distance ( $L_{0}$ ) has been studied. It is known that computing the Hamming distance between a pattern and a streaming text exactly requires $\Omega(n)$ space, even for the binary alphabet and even when a small error probability is allowed, which can be shown by a straightforward reduction to communication complexity (folklore).

Therefore, to develop sublinear-space solutions, one must relax their requirements. One possibility to do so is to compute only the distances bounded by a threshold $k$ . This variant of the problem is often referred to as the $k$ -mismatch problem. The $k$ -mismatch problem has been extensively studied in the literature [15, 16, 26, 39], with this line of work reaching $\widetilde{\mathcal{O}}(k)$ memory complexity and $\widetilde{\mathcal{O}}(\sqrt{k})$ time per input character. The motivation for this variant of the problem is that we are interested in subwords of the text that are similar to the pattern, in other words, the distance between the pattern and the text should be relatively small. On the other hand, the main application of the streaming setting is processing large-scale data, such as biological data. To decrease the cost of generating such data, recently new hardware approaches have been developed. They have become widely used due to cost efficiency, but unfortunately, the produced data may contain about 10% of noise [31]. To analyse such data, it is not sufficient to consider small distances only, and a possible workaround for this issue is $(1\pm\varepsilon)$ -approximation. This line of research was initiated by Clifford and Starikovskaya [17] who gave a $(1\pm\varepsilon)$ -approximation algorithm with space $\widetilde{\mathcal{O}}(\varepsilon^{-5}\sqrt{n})$ that uses $\widetilde{\mathcal{O}}(\varepsilon^{-4})$ time per arriving character of the text.

Independently and in parallel with this work, the authors of [12] showed a $(1\pm\varepsilon)$ -approximation streaming algorithm for the $k$ -mismatch problem that uses $\widetilde{\mathcal{O}}(\varepsilon^{-2}\sqrt{k})$ space. For the special case of $k=n$ , they show how to reduce the space further to $\widetilde{\mathcal{O}}(\varepsilon^{-1.5}\sqrt{n})$ . Compared to our solution, their algorithm has a worse time complexity of $\widetilde{\mathcal{O}}(\varepsilon^{-3})$ per arrival, and more importantly, it is not obvious whether it can be generalised to other $L_{p}$ norms as it uses a very different set of techniques.

Sliding window.

The problem of computing distance between $P$ and every $n$ -length subword of $T$ in the streaming setting resembles the problem of maintaining the $L_{p}$ norm of an $n$ -length suffix of a streaming text, also referred to as sliding window. In fact, the latter is a simplification of the former, obtained by setting $P=[0,0,\ldots,0]$ . There is an extensive line of work on maintaining the $L_{p}$ norm of a sliding window, refer to [4, 5, 6, 7, 8, 19] and references therein. The main message is that the norm of a sliding window can be maintained efficiently, e.g. for $1\leq p\leq 2$ the $L_{p}$ norms can be maintained $(1\pm\varepsilon)$ -approximately in space $\widetilde{\mathcal{O}}(\varepsilon^{-1})$ . However, those results do not translate to our case: in the sliding window, one can easily isolate “heavy hitters”, that is updates with a significant contribution to the output. In our case, the contribution of an update depends on its relative position to the pattern, and one can easily construct instances where the contribution of a position in the text changes drastically relative to its alignment with the pattern, which necessitates a significantly different approach.

1.1 Our results

In this work, we show a suite of new streaming algorithms for computing the Hamming, $L_{1}$ , $L_{2}$ and general $L_{p}$ ( $0<p\leq 2$ ) distances between the pattern and the text. Our results significantly improve and extend the results of [17].

Theorem 1.3.

Given a pattern $P$ of length $n$ and a text $T$ over an alphabet $\Sigma=[1,2,\ldots,\sigma]$ , where $\sigma=n^{\mathcal{O}(1)}$ , there is a streaming algorithm that computes a $(1\pm\varepsilon)$ -approximation of the $L_{p}$ distance between $P$ and every $n$ -length subword of $T$ correctly w.h.p.

1. in $\widetilde{\mathcal{O}}(\varepsilon^{-2}\sqrt{n}+\log\sigma)$ space, and $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ time per arrival when $p=0$ (Hamming distance);

2. in $\widetilde{\mathcal{O}}(\varepsilon^{-2}\sqrt{n}+\log^{2}\sigma)$ space and $\widetilde{\mathcal{O}}(\sqrt{n}\log\sigma)$ time per arrival when $p=1$ (Manhattan distance);

3. in $\widetilde{\mathcal{O}}(\varepsilon^{-2}\sqrt{n}+\log^{2}\sigma)$ space and $\widetilde{\mathcal{O}}(\varepsilon^{-2}\sqrt{n})$ time per arrival when $0<p<1/2$ ;

4. in $\widetilde{\mathcal{O}}(\varepsilon^{-2}\sqrt{n}+\log^{2}\sigma)$ space and $\widetilde{\mathcal{O}}(\varepsilon^{-3}\sqrt{n})$ time per arrival when $p=1/2$ ;

5. in $\widetilde{\mathcal{O}}(\varepsilon^{-2}\sqrt{n}+\log^{2}\sigma)$ space and $\widetilde{\mathcal{O}}(\sigma^{\frac{2p-1}{1-p}}\sqrt{n}/\varepsilon^{2+3\cdot\frac{2p-1}{1-p}})$ time per arrival when $1/2<p<1$ ;

6. in $\widetilde{\mathcal{O}}(\varepsilon^{-2-p/2}\sqrt{n}\log^{2}\sigma)$ space and $\mathcal{O}(\varepsilon^{-p/2}\sqrt{n}+\varepsilon^{-2}\log\sigma)$ time per arrival for $1<p\leq 2$ .

We also improve and extend the space lower bound of [17], who showed that any streaming algorithm that computes a $(1\pm\varepsilon)$ -approximation of the Hamming distance between a pattern and a streaming text must use $\Omega(\varepsilon^{-2}\log^{2}n)$ bits for all $\varepsilon$ such that $1/\varepsilon<n^{1/2-\gamma}$ for some constant $\gamma$ (condition inherited from [28]). We show the following result:

Lemma 1.4.

Let $2\leq 1/\varepsilon<n$ and $0\leq p\leq 2$ . Any $(1\pm\varepsilon)$ -approximation algorithm that computes the $L_{p}$ distance between a pattern and a streaming text for each alignment, must use $\Omega(\min(1/\varepsilon^{2},n))$ bits of space.

Proof.

Let us first show the lower bound for $p=0$ , i.e., for Hamming distance. We show the lower bound by reduction to a two-party communication complexity problem called GAP-Hamming-distance. In this problem, the two parties, Alice and Bob are given two binary words of length $n$ and a parameter $g=\varepsilon n$ , $1\leq g\leq n/2$ . Alice sends Bob a message, and Bob’s task is to output $1$ if the Hamming distance between his and Alice’s word is larger than $n/2+g$ , and zero if it is at most $n/2-g$ . Otherwise, he can output “don’t know”. By Proposition 4.4 [10], the communication complexity of this problem is $\Omega(\min\{1/\varepsilon^{2},n\})$ .

We can now show a space lower bound for any $(1\pm\varepsilon)$ -approximate algorithm for computing the Hamming distance between the pattern and the text by a standard reduction. Suppose that for $2\leq 1/\varepsilon\leq n$ there is an algorithm that uses $o(\min\{1/\varepsilon^{2},n\})$ bits of space. Let $P$ be Alice’s word, $T$ Bob’s word. After reading $P$ , the algorithm stores all the information about it in $o(\min\{1/\varepsilon^{2},n\})$ bits of space. We construct the communication protocol as follows: Alice sends the information about $P$ to Bob. Using it, Bob can continue running the algorithm and compute the approximation of the Hamming distance between $P$ and $T$ . We have thus developed a communication protocol with complexity $o(\min\{1/\varepsilon^{2},n\})$ , a contradiction.

We can now show the lower bound for $0<p\leq 2$ . We immediately obtain a space lower bound for any $(1\pm\varepsilon)$ -approximate algorithm for computing the $p$ ’th moment between the pattern and the text at every alignment. Indeed, on binary words the $p$ ’th moment is equal to the Hamming distance for all $0<p\leq 2$ . The lower bound for the $L_{p}$ distance follows by Observation 1.5. ∎

1.2 Techniques

At a very high level, the structure of all algorithms presented in this paper is similar to that of [17] (in fact, such an approach in a similar context was also used independently in [18]). We process the text in blocks of length $b\approx\sqrt{n}$ . To compute an approximation of the distance / the $p$ ’th moment at a particular alignment, we divide the pattern into two parts: a prefix of length $\leq b$ aligned with a suffix of some block of the text, and the remaining suffix (see Fig. 1). We compute an approximation of the distance / the $p$ ’th moment for both of the parts and sum them up to obtain the final answer. Our main contribution is a set of new tools that allows computing the approximations efficiently.

To be able to compute the approximation of the distance / the $p$ ’th moment between the prefix and the corresponding block of the text, we compute, while reading each block of the text, its compact lossy description that we refer to as prefix encoding. The prefix encoding captures the relation between the read block and the prefix of the pattern of length $b$ . To compute the distance / the $p$ ’th moment between the suffix and the text, we will use suffix sketches. For each position $i$ of the text, the suffix sketch describes the subword $T[b\cdot k+1,i]$ of the text where $k$ is the smallest integer such that $i-b\cdot k\leq n$ (see Fig. 1).
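The prefix/suffix decomposition can be illustrated with an exact, non-streaming computation. The sketch below is not the paper's algorithm: it stores everything explicitly and uses no encodings or sketches, and the names `split_alignment` and `moment_by_parts` as well as the 0-based alignment index `i` are assumptions. It only shows that the $p$ ’th moment splits additively at a block boundary; the same is not true of the $L_{p}$ norm itself.

```python
from typing import Sequence, Tuple


def moment_of_difference(v: Sequence[int], w: Sequence[int], p: float) -> float:
    """F_p(V - W); for p = 0 this counts mismatching positions."""
    if p == 0:
        return float(sum(1 for a, b in zip(v, w) if a != b))
    return float(sum(abs(a - b) ** p for a, b in zip(v, w)))


def split_alignment(i: int, n: int, b: int) -> Tuple[int, int]:
    """Return (d, n - d), where d < b pattern characters lie before the next block boundary."""
    d = min((-i) % b, n)
    return d, n - d


def moment_by_parts(pattern: Sequence[int], text: Sequence[int], i: int, b: int, p: float) -> float:
    """Moment between the pattern and text[i:i+n], computed as prefix part plus suffix part."""
    n = len(pattern)
    d, _ = split_alignment(i, n, b)
    # Prefix of the pattern against the tail of the block that contains position i.
    prefix_part = moment_of_difference(pattern[:d], text[i:i + d], p)
    # Remaining suffix of the pattern against text that starts at a block boundary.
    suffix_part = moment_of_difference(pattern[d:], text[i + d:i + n], p)
    return prefix_part + suffix_part
```

In the streaming algorithms the two summands are replaced by approximations obtained from the prefix encoding and the suffix sketch.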

For the Hamming distance, we define the prefix encodings in Section 2.1 and the suffix sketches in Section 3.1. Our Hamming prefix encoding introduces a novel use of a known technique called subsampling. The prefix encodings are used to approximate the distance between any suffix of one word and the prefix of another word of the same length. In brief, the idea is to replace a random subset of the characters of the two words by the don’t care character “?”, a special character that matches any other character of the alphabet. We repeat the process a logarithmic number of times to create a logarithmic number of pairs of “subsamples”. For each pair, we find the longest suffix of one subsample that matches the prefix of the second subsample up to at most $\Theta(1/\varepsilon^{2})$ mismatches. We then show that this information can be used to approximate the Hamming distance between any suffix-prefix pair. Similar techniques were used in [3, 20, 23, 25, 29, 38] for estimating the Hamming norm in streams. The crucial difference with our approach is that we must be able to compute the Hamming norm of any suffix-prefix pair of the two words, and we must be able to do it efficiently. As for the suffix sketches, for the binary alphabet we use the sketches introduced in [17]. We then show a reduction from arbitrary alphabets to the binary alphabet, which improves the space consumption of Hamming suffix sketches by a factor of $1/\varepsilon^{2}$ .

We can solve the problem of $L_{1}$ (Manhattan distance) pattern matching by replacing each character of the pattern and of the stream with its unary encoding and running the solution for the Hamming distance. However, this would introduce a multiplicative factor of $\sigma$ (the size of the alphabet) to the time complexity. We show efficient randomised reductions from the Manhattan to Hamming distance that allow simulating the solution for the Hamming distance without a significant overhead. In particular, to design the prefix encodings we use random shifting and rounding, while for the suffix sketches we use range-summable hash functions [9]. We show the Manhattan prefix encodings in Section 2.2 and the Manhattan suffix sketches in Section 3.2.
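The naive unary reduction mentioned at the start of this paragraph is easy to write down. The following sketch (with assumed helper names `unary` and `manhattan_via_unary`) shows why it is correct and why it is expensive: each character expands to $\sigma$ bits, which is the multiplicative overhead that the randomised reductions avoid.

```python
from typing import List, Sequence


def unary(c: int, sigma: int) -> List[int]:
    """Unary encoding of c in [0, sigma]: c ones followed by sigma - c zeros."""
    if not 0 <= c <= sigma:
        raise ValueError("character outside the alphabet")
    return [1] * c + [0] * (sigma - c)


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Number of positions where two equal-length words differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def manhattan_via_unary(v: Sequence[int], w: Sequence[int], sigma: int) -> int:
    """L_1 distance computed as the Hamming distance of the unary encodings."""
    enc_v = [bit for c in v for bit in unary(c, sigma)]
    enc_w = [bit for c in w for bit in unary(c, sigma)]
    # The encodings of a and b differ exactly on positions min(a, b)+1 .. max(a, b).
    return hamming(enc_v, enc_w)
```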

For generic $L_{p}$ distances, $0<p\leq 2$ , we discuss the prefix encodings in Section 2.4 and the suffix sketches in Section 3.3. Our approach to $L_{p}$ prefix encodings is rather involved. In the case of $0<p<1$ , we construct a novel embedding from $L_{p}^{p}$ space into the Hamming space, which might be of independent interest. While the target dimension of the Hamming space is large, we construct the embedding in such a way that each value is mapped into a compressible sequence of the form $c_{1}^{d_{1}}\ldots c_{t}^{d_{t}}$ for some small value of $t$ , and where the values of $d_{1},\ldots,d_{t}$ are constant across all input values. Such a compressed representation allows us to efficiently apply the subsampling framework and reduce the problem to the Hamming distance case. For $1<p\leq 2$ , we identify a logarithmic number of anchor suffixes, and partition each of them into $\varepsilon^{-p}$ words of roughly even contribution to the distance. We then use the partition to decode prefix-suffix distance queries for arbitrary length queries. Such a construction is a generalization and improvement of the approach presented in [17]. For suffix sketches, we simply use the $p$ -stable distributions [27].

Finally, we combine the prefix encodings and the suffix sketches to prove Theorem 1.3 in Section 4. To simplify the notation, we use $x\stackrel{\varepsilon}{=}y$ to denote $(1-\varepsilon)y\leq x\leq(1+\varepsilon)y$ from now on. We will also use the fact that for $p>0$ we can speak of approximating the $p$ ’th moment of differences between the pattern and the $n$ -length substrings of the text and the $L_{p}$ distances between the pattern and the $n$ -length substrings of the text interchangeably, as this changes the complexities by a constant factor only:

Observation 1.5.

For any constant $p>0$ and $\varepsilon<1/2$ , there is a constant $C_{p}$ such that finding a $(1\pm C_{p}\cdot p\varepsilon)$ approximation of the $p$ ’th moment of a vector suffices for $(1\pm\varepsilon)$ -approximating its $p$ ’th norm, and finding a $(1\pm C_{p}\cdot\varepsilon/p)$ approximation of its $p$ ’th norm suffices for $(1\pm\varepsilon)$ -approximating its $p$ ’th moment.
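A first-order calculation, given here as a hedged sketch rather than as the authors' proof, indicates where the factors $p$ and $1/p$ come from. The symbols $\delta$ (a relative error), $F^{\prime}$ (an approximate moment) and $N^{\prime}$ (an approximate norm) are introduced only for this illustration.

$$F^{\prime}=(1\pm\delta)F_{p}(U)\;\Longrightarrow\;(F^{\prime})^{1/p}=(1\pm\delta)^{1/p}\lVert U\rVert_{p}=\left(1\pm\tfrac{\delta}{p}+O_{p}(\delta^{2})\right)\lVert U\rVert_{p},$$

$$N^{\prime}=(1\pm\delta)\lVert U\rVert_{p}\;\Longrightarrow\;(N^{\prime})^{p}=(1\pm\delta)^{p}F_{p}(U)=\left(1\pm p\delta+O_{p}(\delta^{2})\right)F_{p}(U).$$

Thus a relative error of order $p\varepsilon$ in the moment gives a relative error of order $\varepsilon$ in the norm, and a relative error of order $\varepsilon/p$ in the norm gives a relative error of order $\varepsilon$ in the moment, with the second-order terms absorbed into the constant $C_{p}$ .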

2 Prefix encodings

In this section we present a solution to the following problem. Imagine we have a block of text $T^{\prime}[1,b]=T[i+1,i+b]$ and a prefix of the pattern $P^{\prime}=P[1,b]$ . We want to find a compressed representation (encoding) of $T^{\prime}$ so that the following is possible: given any $1\leq d\leq b$ , the compressed representation of $T^{\prime}$ , and $P^{\prime}$ (explicitly), we can compute a $(1\pm\varepsilon)$ -approximation of $\lVert T^{\prime\prime}-P^{\prime\prime}\rVert_{p}$ , where $T^{\prime\prime}=T^{\prime}[b-d+1,b]$ is a suffix of $T^{\prime}$ and $P^{\prime\prime}=P^{\prime}[1,d]$ is a prefix of $P^{\prime}$ .

We start by presenting a solution to the Hamming distance case, which is the basis of our solution for all other $L_{p}$ norms for $0<p\leq 2$ .

2.1 Hamming ( $L_{0}$ ) distance

Recall that “?” is the don’t care character, a special character that matches any other character of the alphabet.

Definition 2.1 (Hamming subsampling).

Consider a word $U$ of length $n$ . Let $q=\lceil 3\log n\rceil$ and let $h(i):[n]\rightarrow\{0,1\}^{q}$ be a function drawn at random from a pairwise independent family. For $r=0,\ldots,q$ , we define the $r$ -th level Hamming subsample of $U$ , $\mathsf{hSub}_{r}(U)$ , as follows:

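$$\mathsf{hSub}_{r}(U)[i]=\begin{cases}U[i],&\text{if the }r\text{ lowest bits of }h(i)\text{ are all }0;\\ ?,&\text{otherwise.}\end{cases}$$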

In particular, $\mathsf{hSub}_{0}(U)=U$ .

Fix an integer $k=\Theta(1/\varepsilon^{2})$ large enough. For two words $U,V$ , consider the following estimation procedure:

Algorithm 2.2.

1. Denote $X_{r}$ to be the Hamming distance between $\mathsf{hSub}_{r}(U)$ and $V$ and let $f=\min\{i:X_{i}\leq k\}$ . (Footnote 2: We emphasize that $\mathsf{hSub}_{r}(U)$ contains don’t care characters, so the Hamming distance is defined as the number of pairs of characters of $\mathsf{hSub}_{r}(U)$ and $V$ that do not match.)

2. Output $Z_{f}=2^{f}\cdot X_{f}$ as an estimate of $\lVert U-V\rVert_{H}$ .
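Definition 2.1 and Algorithm 2.2 can be sketched in Python as follows. Several choices here are assumptions and not taken from the paper: the hash is the affine map $i\mapsto Ai\oplus c$ over GF(2), a standard pairwise independent family; the logarithm in $q$ is taken base 2; `None` stands in for the don’t care character; and the words are stored explicitly, so this is an offline illustration with no space savings.

```python
import math
import random
from typing import List, Optional, Sequence

DONT_CARE = None  # stands in for the don't care character "?"


class PairwiseHash:
    """h(i) = A*i xor c over GF(2) with q output bits (assumed family, pairwise independent)."""

    def __init__(self, q: int, rng: random.Random, in_bits: int = 64) -> None:
        self.masks = [rng.getrandbits(in_bits) for _ in range(q)]
        self.offsets = [rng.getrandbits(1) for _ in range(q)]

    def __call__(self, i: int) -> int:
        value = 0
        for j, (mask, offset) in enumerate(zip(self.masks, self.offsets)):
            # Output bit j is the parity of the masked input bits, flipped by the offset.
            bit = (bin(i & mask).count("1") + offset) & 1
            value |= bit << j
        return value


def h_sub(u: Sequence[int], r: int, h: PairwiseHash) -> List[Optional[int]]:
    """Level-r subsample: keep U[i] iff the r lowest bits of h(i) are all 0 (i is 1-based)."""
    low_bits = (1 << r) - 1
    return [c if (h(i) & low_bits) == 0 else DONT_CARE for i, c in enumerate(u, start=1)]


def mismatches(u: Sequence[Optional[int]], v: Sequence[int]) -> int:
    """Number of positions that do not match; a don't care matches every character."""
    return sum(1 for a, b in zip(u, v) if a is not DONT_CARE and a != b)


def estimate_hamming(u: Sequence[int], v: Sequence[int], k: int, h: PairwiseHash, q: int) -> int:
    """Algorithm 2.2: return Z_f = 2^f * X_f for the first level f with X_f <= k."""
    for r in range(q + 1):
        x_r = mismatches(h_sub(u, r, h), v)
        if x_r <= k:
            return (1 << r) * x_r
    raise RuntimeError("no level has at most k mismatches")
```

A typical call sets `q = math.ceil(3 * math.log2(len(u)))`, draws `h = PairwiseHash(q, random.Random())` and picks `k` proportional to $1/\varepsilon^{2}$ . When the true distance is at most `k`, level 0 already qualifies and the estimate is exact.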

The following lemma is a rephrasing of a known result regarding subsampling in estimation of the Hamming norm (cf. [3, Theorem 3], or [25, Theorem 2]).

Lemma 2.3.

For $Z_{f}$ as in Algorithm 2.2, we have $Z_{f}\stackrel{\varepsilon}{=}\lVert U-V\rVert_{H}$ with probability at least $3/4$ .

Proof.

Denote $m=\lVert U-V\rVert_{H}$ . Consider a fixed value $r$ . Let $I_{1},I_{2},\ldots,I_{n}$ be binary variables indicating existence of a mismatch between $\mathsf{hSub}_{r}(U)$ and $V$ at positions $1,\ldots,n$ , so that $X_{r}=\sum_{j}I_{j}$ . We observe that $\mathbb{E}\left[X_{r}\right]=m/2^{r}$ and therefore $\mathbb{E}\left[Z_{r}\right]=m$ , because each of the $m$ positions with mismatch between $U$ and $V$ generates a mismatch between $\mathsf{hSub}_{r}(U)$ and $V$ with probability $1/2^{r}$ .

Furthermore, as the function $h$ in Definition 2.1 is drawn from a pairwise independent family, there is $\mathrm{Var}\left[X_{r}\right]=\sum_{j}\mathrm{Var}\left[I_{j}\right]\leq\sum_{j}\mathbb{E}\left[(I_{j})^{2}\right]=\sum_{j}\mathbb{E}\left[I_{j}\right]=\mathbb{E}\left[X_{r}\right]=m/2^{r}$ . Let $c=\min\{i:\mathbb{E}\left[X_{i}\right]\leq k\}=\lceil\log_{2}\left(\frac{m}{k}\right)\rceil$ . By Chebyshev’s inequality, we have

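$$\Pr\left[\left|X_{r}-\frac{m}{2^{r}}\right|\geq 4\sqrt{\frac{m}{2^{r}}}\right]\leq\frac{1}{16}.\qquad(1)$$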

We estimate $\Pr[f>c+1]=\Pr[X_{c+1}>k]$ . Assume w.l.o.g. that $k\geq 32$ . Observe that $m/2^{c}\leq k$ , which implies, for $k\geq 32$ , $m/2^{c+1}+4\sqrt{m/2^{c+1}}\leq k/2+4\sqrt{k/2}\leq k$ . By Equation 1, there is

$$\Pr\left[X_{c+1}\geq\frac{m}{2^{c+1}}+4\sqrt{\frac{m}{2^{c+1}}}\right]\leq\frac{1}{16}.$$

It follows that $\Pr[f>c+1]=\Pr[X_{c+1}>k]\leq 1/16$ . Hence, we obtain

[TABLE]

It follows that we can choose $k=\Theta(1/\varepsilon^{2})$ large enough so that $Z_{f}\stackrel{{\scriptstyle\mathclap{{\mbox{$ \varepsilon $}}}}}{{=}}\lVert U-V\rVert_{H}$ with probability $\geq 3/4$ . ∎

Since the subsampling is performed independently for each position, one can use subsampling to approximate the Hamming distance between any suffix of $B$ and any prefix of $P$ of equal lengths in a similar fashion.

We are now ready to define the Hamming prefix encoding of a block. For brevity, let $B_{r}^{j}=\mathsf{hSub}_{r}(B)[b-j+1,b]$ and $P_{r}^{j}=P[1,j]$ (the same for all $r$ ). Furthermore, given two words $U,V$ of equal length, define the mismatch information $\mathsf{MI}(U,V)=\{(i,U[i],V[i]):U[i]\mbox{ does not match }V[i]\}$ .
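The mismatch information can be illustrated with a short Python sketch. The function names, the use of `"?"` for the don't care character, and the helper `restore` (which rebuilds $U$ from $V$ and $\mathsf{MI}(U,V)$ up to don't care positions) are assumptions made for this example.

```python
def matches(a, b):
    """Two characters match if they are equal or either is the don't care character."""
    return a == "?" or b == "?" or a == b


def mismatch_info(U, V):
    """MI(U, V): triples (i, U[i], V[i]) over 1-based positions where U[i] does not match V[i]."""
    assert len(U) == len(V)
    return [(i, u, v) for i, (u, v) in enumerate(zip(U, V), start=1)
            if not matches(u, v)]


def restore(V, mi):
    """Rebuild U from V and MI(U, V); don't care positions of U come back as V's characters."""
    U = list(V)
    for i, u, _v in mi:
        U[i - 1] = u
    return U
```

The size of the list is exactly the Hamming distance, so an encoding that stores it only for words at distance at most $k$ holds at most $k$ triples per level.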

Definition 2.4.

Consider a $b$ -length block $B$ of the text $T$ . For each $0\leq r\leq\lceil 3\log n\rceil$ , let $j^{\ast}(r)$ be the maximal integer such that the Hamming distance between $B_{r}^{j^{\ast}(r)}$ and $P_{r}^{j^{\ast}(r)}$ is at most $k=\Theta(\varepsilon^{-2})$ . We define the Hamming prefix encoding of $B$ to be a tuple of pairs $j^{\ast}(r),\mathsf{MI}(B_{r}^{j^{\ast}(r)},P_{r}^{j^{\ast}(r)})$ .

Note that the prefix encoding of $B$ uses $\mathcal{O}(k\log n)=\mathcal{O}(\varepsilon^{-2}\log n)$ space. We can compute it efficiently:

Lemma 2.5.

Assume constant-time random access to $P[1,b]$ . Given a $b$ -length block $B$ of the text $T$ , its Hamming prefix encoding can be computed in $\widetilde{\mathcal{O}}(kb)=\widetilde{\mathcal{O}}(b\varepsilon^{-2})$ time.

Proof.

To compute the encoding, we use the algorithm of [14]. Formally, for each $r$ we create a word $T^{\prime}$ by appending $b$ don’t care characters to the subsample $\mathsf{hSub}_{r}(B)$ . The algorithm of [14] finds all $b$ -length subwords of $T^{\prime}$ that match $P[1,b]$ with up to $k$ mismatches and, for each of these subwords, outputs the mismatch information. We take only the leftmost such subword, which corresponds to $j^{\ast}(r)$ because of the don’t care characters. In total, our algorithm uses $\widetilde{\mathcal{O}}(kb)=\widetilde{\mathcal{O}}(\varepsilon^{-2}b)$ time. ∎

We now show how to compute the Hamming distance between any $j$ -length suffix of $B$ and any $j$ -length prefix of $P$ given $P[1,b]$ and the Hamming prefix encoding of a block $B$ .

Lemma 2.6.

Given the prefix encoding of a $b$ -length block $B$ of the text $T$ , there is an algorithm that computes, for any $j=1,\ldots,b$ , a $(1+\varepsilon)$ -approximation of the Hamming distance between the $j$ -length suffix of $B$ and the $j$ -length prefix of $P$ in $\widetilde{\mathcal{O}}(kb)=\widetilde{\mathcal{O}}(b\varepsilon^{-2})$ time.

Proof.

Let $X_{r}$ denote the Hamming distance between $P_{r}^{j}$ and $B_{r}^{j}$ . We compute the smallest $f$ such that $X_{f}\leq k$ in the following way. For each $r$ , we use $\mathsf{MI}(B_{r}^{j^{\ast}(r)},P_{r}^{j^{\ast}(r)})$ to restore $B_{r}^{j^{\ast}(r)}$ . We then append $b$ don’t care characters to $P_{r}^{j^{\ast}(r)}$ and run the algorithm of [14] for the resulting text and the pattern. This allows us to compute $X_{r}$ for all $j\leq j^{\ast}(r)$ , and if $j>j^{\ast}(r)$ , then $X_{r}>k$ by the maximality of $j^{\ast}(r)$ . In total, the algorithm takes $\widetilde{\mathcal{O}}(kb)=\widetilde{\mathcal{O}}(\varepsilon^{-2}b)$ time. ∎

2.2 Manhattan ( $L_{1}$ ) distance

Recall a word morphism $\nu:\Sigma\to\{0,1\}^{\sigma}$ , $\nu(a)=1^{a}0^{\sigma-a}$ . Our goal in this section is to simulate implicitly procedures from Lemma 2.5 and Lemma 2.6 on words $\nu(B)$ and $\nu(T)$ without introducing any significant overhead.
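The reason the morphism $\nu$ is useful is that it turns the Manhattan distance into a Hamming distance. The Python sketch below checks this on a small example; it assumes characters are integers in $\{0,\ldots,\sigma\}$, and the function names are chosen for illustration.

```python
def nu(a, sigma):
    """Unary image 1^a 0^(sigma - a) of a character a in {0, ..., sigma}."""
    return [1] * a + [0] * (sigma - a)


def nu_word(U, sigma):
    """Apply nu character by character and concatenate the images."""
    return [bit for a in U for bit in nu(a, sigma)]


def hamming(X, Y):
    """Number of positions where two equal-length words differ."""
    return sum(1 for x, y in zip(X, Y) if x != y)


def manhattan(U, V):
    """L_1 distance between two equal-length integer words."""
    return sum(abs(u - v) for u, v in zip(U, V))
```

The images $\nu(a)$ and $\nu(a')$ differ in exactly $|a-a'|$ positions, which is why the two distances coincide. Writing $\nu(U)$ out explicitly costs a factor $\sigma$ in length, which the rest of this section avoids.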

Definition 2.7 (Manhattan scaling).

Consider a word $U$ of length $n$ . Let $q=\lceil 3\log n\sigma\rceil$ and let $h:[n]\rightarrow 2^{q}$ be a function drawn at random from a $4$ -wise independent family. For $r=0,\ldots,q$ , we define the $r$ -th level Manhattan subsample of $U$ , $\mathsf{mSub}_{r}(U)$ , as a word of length $n$ such that $\mathsf{mSub}_{r}(U)[i]=\left\lfloor\frac{U[i]+(h(i)\bmod 2^{r})}{2^{r}}\right\rfloor$ . In particular, $\mathsf{mSub}_{0}(U)=U$ .
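Definition 2.7 can be sketched in Python as follows. The hash is realized here as a random cubic polynomial over the prime field of size $2^{61}-1$, reduced to $q$ bits; this concrete family and the names `make_4wise_hash` and `m_sub` are assumptions made for illustration.

```python
import random

P61 = (1 << 61) - 1  # prime modulus for the polynomial hash (an assumption)


def make_4wise_hash(q, rng):
    """Random cubic polynomial over F_p, reduced to q bits (approximately 4-wise independent)."""
    coeffs = [rng.randrange(P61) for _ in range(4)]

    def h(i):
        value = 0
        for c in coeffs:  # Horner's rule
            value = (value * i + c) % P61
        return value % (1 << q)

    return h


def m_sub(U, r, h):
    """r-th level Manhattan subsample: floor((U[i] + (h(i) mod 2^r)) / 2^r), 1-based i."""
    return [(u + h(i) % (1 << r)) // (1 << r) for i, u in enumerate(U, start=1)]
```

Because $U$ and $V$ are shifted by the same offset $h(i)\bmod 2^{r}$ at position $i$, the difference of the subsampled characters is $|U[i]-V[i]|/2^{r}$ rounded either down or up.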

Fix an integer $k=\Theta(1/\varepsilon^{2})$ large enough. For words $U,V$ , consider $\mathsf{mSub}_{r}(U),\mathsf{mSub}_{r}(V)$ for all $r=0,\ldots,q$ , and the following estimation procedure:

Algorithm 2.8.

1. Denote $X_{r}=\lVert\mathsf{mSub}_{r}(U)-\mathsf{mSub}_{r}(V)\rVert_{1}$ and let $f=\min\{i:X_{i}\leq k\}$ .

2. Output $Z_{f}=2^{f}\cdot X_{f}$ as an estimate of $\lVert U-V\rVert_{1}$ .

Lemma 2.9.

For $Z_{f}$ as in Algorithm 2.8, we have $Z_{f}\stackrel{\varepsilon}{=}\lVert U-V\rVert_{1}$ with probability $\geq 3/4$ .

Proof.

Take some position $i$ and denote for short $a=\mathsf{mSub}_{r}(U)[i]$ and $b=\mathsf{mSub}_{r}(V)[i]$ and $c=\frac{U[i]-V[i]}{2^{r}}$ . There is $|a-b|\in\left\{\big{\lfloor}|c|\big{\rfloor},\big{\lceil}|c|\big{\rceil}\right\}$ and $\mathbb{E}\left[|a-b|\right]=|c|.$ Since $|a-b|-\big{\lfloor}|c|\big{\rfloor}$ is a $0/1$ variable, there is $\mathrm{Var}\left[|a-b|\right]=\mathrm{Var}\left[\left(|a-b|-\big{\lfloor}|c|\big{\rfloor}\right)\right]\leq\mathbb{E}\left[\left(|a-b|-\big{\lfloor}|c|\big{\rfloor}\right)\right]\leq\mathbb{E}\left[|a-b|\right]$ . Summing for all values of $i$ , we reach that

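$$\mathbb{E}\left[X_{r}\right]=\frac{\lVert U-V\rVert_{1}}{2^{r}}\quad\text{and}\quad\mathrm{Var}\left[X_{r}\right]\leq\mathbb{E}\left[X_{r}\right].$$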

Since we have reached an identical variance bound, the proof follows step-by-step the proof of Lemma 2.3. ∎

To approximate the Manhattan distance between any suffix of $B$ and any prefix of $P$ of equal lengths, we define the encoding similar to the Hamming distance case. Specifically, we still use the mismatch information, building on the fact that for any two words $\lVert U-V\rVert_{H}\leq\lVert U-V\rVert_{1}$ and from the mismatch information the exact value of $\lVert U-V\rVert_{1}$ can be found. We define $B_{r}^{j}=\mathsf{mSub}_{r}(B)[b-j+1,b]$ as before, but change the definition of $P_{r}^{j}$ slightly. Intuitively, we define $P_{r}^{j}$ to be the $j$ -length prefix of $P$ subsampled in a synchronized way with $B_{r}^{j}$ . Formally, $P_{r}^{j}[i]=\left\lfloor\frac{P[i]+(h(b-j+i)\bmod 2^{r})}{2^{r}}\right\rfloor$ .

Definition 2.10.

Consider a $b$ -length block $B$ of the text $T$ . For each $0\leq r\leq\lceil 3\log n\sigma\rceil$ , let $j^{\ast}(r)$ be the maximal integer such that the Manhattan distance between $B_{r}^{j^{\ast}(r)}$ and $P_{r}^{j^{\ast}(r)}$ is at most $k=\Theta(\varepsilon^{-2})$ . We define the Manhattan prefix encoding of $B$ to be a tuple of pairs $j^{\ast}(r),\mathsf{MI}(B_{r}^{j^{\ast}(r)},P_{r}^{j^{\ast}(r)})$ .

Note that the prefix encoding of $B$ uses $\mathcal{O}(k\log n\sigma)=\mathcal{O}(\varepsilon^{-2}\log n)$ space.

Lemma 2.11.

Assume constant-time random access to $P[1,b]$ . Given a $b$ -length block $B$ of the text $T$ , its Manhattan prefix encoding can be computed in $\widetilde{\mathcal{O}}(b^{2})$ time and $\widetilde{\mathcal{O}}(b)$ space.

Proof.

Let $q=\lceil 3\log n\sigma\rceil$ . For each $r=0,\ldots,q$ and $j=1,\ldots,b$ we compare $B_{r}^{j}$ and $P_{r}^{j}$ character by character in $\mathcal{O}(b)$ time to find $j^{\ast}(r)$ and the corresponding mismatch information. The claim follows. ∎

Lemma 2.12.

Given the prefix encoding of a $b$ -length block $B$ of the text $T$ , there is an algorithm that computes, for all $j=1,\ldots,b$ , a $(1\pm\varepsilon)$ -approximation of the Manhattan distance between the $j$ -length suffix of $B$ and the $j$ -length prefix of $P$ in $\widetilde{\mathcal{O}}(b^{2})$ time.

Proof.

Denote $X_{r}=\lVert P_{r}^{j}-B_{r}^{j}\rVert_{1}$ . We compute the smallest $f$ such that $X_{f}\leq k$ in the following way. For each $r$ , we use $\mathsf{MI}(B_{r}^{j^{\ast}(r)},P_{r}^{j^{\ast}(r)})$ to restore $B_{r}^{j^{\ast}(r)}$ . If $j>j^{\ast}(r)$ , the Manhattan distance between $P_{r}^{j}$ and $B_{r}^{j}$ is greater than $k$ by the maximality of $j^{\ast}(r)$ . Otherwise, we compare $P_{r}^{j}$ and $B_{r}^{j}$ character by character to compute the Manhattan distance in $\mathcal{O}(b)$ time. The claim follows. ∎

2.3 Generic ( $L_{p}$ ) distance for $0<p<1$

Our goal is to construct a morphism (parametrised by $p$ ) acting as a randomized embedding of $(L_{p})^{p}$ into the Hamming distance. The intuition behind our approach is as follows. Let $r_{0},r_{1},\ldots\in[0,1]$ be a sequence of real numbers picked independently and u.a.r. Define a sequence of values

[TABLE]

and for a character $c\in\Sigma$ consider sequence of characters $c_{0},c_{1},\ldots$ where $c_{i}=\lfloor\frac{c}{(1+\varepsilon)^{i}}+r_{i}\rfloor$ (similarly, a character $c^{\prime}$ defines a sequence $c^{\prime}_{0},c^{\prime}_{1},\ldots$ ). Now consider two characters $c,c^{\prime}\in\Sigma$ such that $|c-c^{\prime}|=(1+\varepsilon)^{\ell}$ for some integer $\ell$ and a random variable $x=\sum_{i=0}^{\infty}d_{i}\cdot\lVert c_{i}-c^{\prime}_{i}\rVert_{H}$ . There is

[TABLE]

We thus see that an idealized morphism of the form $\varphi:c\to c_{0}^{d_{0}}c_{1}^{d_{1}}\ldots$ would have the property that $\lVert U-V\rVert_{p}^{p}\sim\lVert\varphi(U)-\varphi(V)\rVert_{H}$ on words of length $n$ . However, there are the following issues: (i) characters are mapped into words of infinite length, (ii) the number of repetitions of characters ( $d_{i}$ ) is fractional, (iii) we cannot guarantee that the character distance is always of the form $(1+\varepsilon)^{i}$ , and (iv) the distance is preserved only in expectation. We show how to overcome these issues to achieve the following result:

Theorem 2.13.

Given $0<p<1$ and $\varepsilon>0$ there is a word morphism $\varphi:c\in\Sigma\to c_{0}^{d_{0}}c_{1}^{d_{1}}\ldots c_{t-1}^{d_{t-1}}$ such that:

1. $t=\widetilde{\mathcal{O}}(\varepsilon^{-2})$ when $0<p<1/2$ , $t=\widetilde{\mathcal{O}}(\varepsilon^{-3})$ when $p=1/2$ , and $t=\widetilde{\mathcal{O}}(\sigma^{\frac{2p-1}{1-p}}/\varepsilon^{2+3\cdot\frac{2p-1}{1-p}})$ when $1/2<p<1$ ;

2. the values of $t$ and $d_{0},\ldots,d_{t-1}$ do not depend on $c$ ;

3. there exists a constant $\alpha=\alpha(p,\varepsilon)$ such that for any two words $U,V$ of length at most $n$ , we have $\lVert U-V\rVert_{p}^{p}\stackrel{\varepsilon}{=}\alpha\cdot\lVert\varphi(U)-\varphi(V)\rVert_{H}$ with probability at least $9/10$ ;

4. it is enough for the randomness to be realized by a hash function $r:[t]\to[D]$ from a $4$ -independent hash function family for some $D=\textrm{poly}(n\sigma\varepsilon^{-1})$ , which can be generated from a seed of $\widetilde{\mathcal{O}}(\log\sigma)$ bits.

Proof.

We will consider three cases: $0<p<1/2$ , $p=1/2$ , and $1/2<p<1$ .

Case $0<p<1/2$ . Our plan is to build upon the scheme highlighted earlier in this section. Specifically, we preserve the values of $c_{i}$ .

Consider a pair of characters $c,c^{\prime}$ . First, note that $\mathbb{E}\left[x\right]$ is an increasing function of $|c-c^{\prime}|$ . From this and Equation 2 we obtain that $\mathbb{E}\left[x\right]\stackrel{{\scriptstyle\mathclap{{\mbox{$ \varepsilon $}}}}}{{=}}|c-c^{\prime}|^{p}\left(\frac{(1+\varepsilon)^{p}}{(1+\varepsilon)^{p}-1}+\frac{1}{(1+\varepsilon)^{1-p}-1}\right)\varepsilon^{-1}$ for all values of $|c-c^{\prime}|$ .

Second, fix $q=\lceil\frac{1}{1-p}\log_{1+\varepsilon}(\sigma\varepsilon^{-3})\rceil$ and observe that truncating the sum after the $(q-1)$ -th term introduces an additional factor $1\pm\Theta(\varepsilon)$ to the approximation, since for $c\not=c^{\prime}$ we have

[TABLE]

We also round $d_{i}$ down to the nearest integer, which introduces an additional $1\pm\Theta(\varepsilon)$ relative error, since $\forall_{i}d_{i}\geq\varepsilon^{-1}$ . Finally, we set $\varphi(c)=c_{0}^{d_{0}}\ldots c_{q-1}^{d_{q-1}}$ . We then have $\mathbb{E}\left[\lVert\varphi(c)-\varphi(c^{\prime})\rVert_{H}\right]=\Theta(\varepsilon^{-2}|c-c^{\prime}|^{p}\frac{1}{p(1-p)}).$

To guarantee that the equality holds with probability at least $9/10$ and not just in expectation, we repeat the scheme several times, with independent random seeds. That is, consider morphisms $\varphi_{1}(c),\varphi_{2}(c),\ldots,\varphi_{s}(c)$ and define a morphism $\varphi(c)=\varphi_{1}(c)\varphi_{2}(c)\ldots\varphi_{s}(c)$ with property:

[TABLE]

Assume w.l.o.g. that $(1+\varepsilon)^{\ell-1}<|c-c^{\prime}|\leq(1+\varepsilon)^{\ell}$ . We proceed to bound

[TABLE]

We set $s=\Theta(\frac{|c-c^{\prime}|^{2p}\varepsilon^{-3}(p(p-1))^{2}}{\varepsilon^{2}(|c-c^{\prime}|^{p}\varepsilon^{-2})^{2}(1-2p)})=\mathcal{O}(\varepsilon^{-1}\frac{1}{1-2p})$ for the claim to hold via Chebyshev’s inequality. The error probability coming from Chebyshev’s inequality can be made arbitrarily small constant by fixing the constant factor in $s$ to be large enough. We finally set $t=sq$ .

Case $p=1/2$ . Note that for $p,p^{\prime}$ such that $|p-p^{\prime}|\leq\log_{\sigma}(1+\varepsilon)$ we have $|x|^{p}\stackrel{{\scriptstyle\mathclap{{\mbox{$ \varepsilon $}}}}}{{=}}|x|^{p^{\prime}}$ for all $-\sigma\leq x\leq\sigma$ . We can therefore reduce this case to $p=1/2-\log_{\sigma}(1+\varepsilon)$ . However, we have to take into account that the asymptotic growth of $t$ hides $1/(1-2p)$ dependency on $p$ for $0<p<1/2$ , hence $t=\widetilde{\mathcal{O}}(\varepsilon^{-3})$ for $p=1/2$ .
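The approximation step used in this reduction can be spelled out as follows. This is a short worked derivation under the assumption that $x$ is a nonzero integer (as is the case for differences of characters) and that $p^{\prime}\leq p\leq p^{\prime}+\log_{\sigma}(1+\varepsilon)$ ; for $x=0$ both sides are zero.

```latex
% Assumption: x is an integer with 1 <= |x| <= sigma, and 0 <= p - p' <= log_sigma(1 + eps).
\[
  |x|^{p} = |x|^{p'} \cdot |x|^{p-p'},
  \qquad
  1 \le |x|^{p-p'} \le \sigma^{\log_{\sigma}(1+\varepsilon)} = 1+\varepsilon ,
\]
\[
  \text{hence}\quad |x|^{p'} \le |x|^{p} \le (1+\varepsilon)\,|x|^{p'} .
\]
```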

Case $1/2<p<1$ . The proof follows the steps of the case $0<p<1/2$ . We first bound the variance:

[TABLE]

We set $s=\Theta\left(\frac{\varepsilon^{-3}|c-c^{\prime}|\sigma^{\frac{2p-1}{1-p}}/\varepsilon^{3\cdot\frac{2p-1}{1-p}}}{\varepsilon^{-2}|c-c^{\prime}|^{2p}}\right)=\mathcal{O}(\sigma^{\frac{2p-1}{1-p}}/\varepsilon^{1+3\cdot\frac{2p-1}{1-p}})$ , so that by Chebyshev’s inequality, the probability that $\lVert U-V\rVert_{p}^{p}\stackrel{\varepsilon}{=}\alpha\cdot\lVert\varphi(U)-\varphi(V)\rVert_{H}$ fails to hold is an arbitrarily small constant (by fixing the constant factor in $s$ to be large enough).

Randomness. The only source of randomness in the description is the values $r_{i}\in[0,1]$ , picked u.a.r. and independently. We note that the values $r_{i}$ can instead be picked as finite-precision floating-point numbers. Since all the values we are working with are bounded by $\textrm{poly}(n\sigma\varepsilon^{-1})$ , it is enough to set the precision accordingly. We also observe that our concentration argument involves only Chebyshev’s inequality and thus only the variance and the expected value, so it suffices to require that the $r_{i}$ are $4$ -wise independent. ∎

We now describe how to use the morphism $\varphi$ to approximate the $L_{p}$ distances in a small space. To design an efficient algorithm, we take advantage of the fact that $\varphi(U)$ has a compressed representation of size comparable with the length of $U$ (at least when $p\leq 1/2$ ).

Definition 2.14 ( $L_{p}$ scaling).

Consider a word $S=s_{1}^{e_{1}}s_{2}^{e_{2}}\ldots s_{m}^{e_{m}}$ of length $m^{\prime}=\sum_{i}e_{i}$ . Let $h:[m]\to 2^{q}$ be a function drawn at random from a $4$ -wise independent family, where $q=\lceil 3\log m^{\prime}\rceil$ . For $r=0,\ldots,q$ , we define the $r$ -th level subsample of $S$ ,

[TABLE]

In particular, $\mathsf{Sub}_{0}(S)=S$ .

Consider two words $S,Q$ of the form $S=s_{1}^{e_{1}}\ldots s_{m}^{e_{m}}$ and $Q=q_{1}^{e_{1}}\ldots q_{m}^{e_{m}}$ . Fix an integer $k=\Theta(1/\varepsilon^{2})$ large enough and consider $\mathsf{Sub}_{r}(S),\mathsf{Sub}_{r}(Q)$ for all $r=0,1,\ldots,\lceil 3\log m^{\prime}\rceil$ , where $m^{\prime}=\sum_{i}e_{i}$ .

Algorithm 2.15.

1. Denote $X_{r}=\lVert\mathsf{Sub}_{r}(S)-\mathsf{Sub}_{r}(Q)\rVert_{H}$ and let $f=\min\{i:X_{i}\leq k\}$ .

2. Output $Z_{f}=2^{f}\cdot X_{f}$ as an estimate of $\lVert S-Q\rVert_{H}$ .

Lemma 2.16.

For $Z_{f}$ as in Algorithm 2.15, we have $Z_{f}\stackrel{\varepsilon}{=}\lVert S-Q\rVert_{H}$ with probability $\geq 3/4$ .

Proof.

Consider a fixed subsampling level $r$ . For simplicity, let $\mathsf{Sub}_{r}(S)=s_{1}^{e_{1}^{\prime}}s_{2}^{e_{2}^{\prime}}\ldots s_{m}^{e_{m}^{\prime}}$ and $\mathsf{Sub}_{r}(Q)=q_{1}^{e_{1}^{\prime}}q_{2}^{e_{2}^{\prime}}\ldots q_{m}^{e_{m}^{\prime}}$ . Define a random variable $x_{i}$ to be the contribution of $s_{i}^{e^{\prime}_{i}},q_{i}^{e^{\prime}_{i}}$ to the Hamming distance $X_{r}$ , i.e.

[TABLE]

Since $e^{\prime}_{i}\in\{\lceil e_{i}/2^{r}\rceil,\lfloor e_{i}/2^{r}\rfloor\}$ and $\mathbb{E}\left[e^{\prime}_{i}\right]=e_{i}/2^{r}$ , we have $\mathbb{E}\left[x_{i}\right]=e_{i}\cdot\lVert s_{i}-q_{i}\rVert_{H}$ and

[TABLE]

Summing over all values of $i$ , we reach $\mathbb{E}\left[X_{r}\right]=\lVert S-Q\rVert_{H}$ and $\mathrm{Var}\left[X_{r}\right]\leq\mathbb{E}\left[X_{r}\right]$ . These bounds are identical to those of Lemma 2.3, and we can proceed in a similar fashion to obtain the claim. ∎

We are now ready to define $L_{p}$ prefix encodings. Consider a $b$ -length block $B$ of the text and define $B_{r}^{j}=\mathsf{Sub}_{r}(\varphi(B))[(b-j)t+1,bt]$ ( $t$ is defined as in Theorem 2.13). Also, define $P_{r}^{j}$ to be the $(tj)$ -length prefix of $\varphi(P)$ subsampled in a synchronized way with $B_{r}^{j}$ .

Definition 2.17.

Consider a $b$ -length block $B$ of the text $T$ . For each $r=0,\ldots,\lceil 3\log n^{\prime}\rceil$ , where $n^{\prime}=|\varphi(B)|$ , let $j^{*}(r)$ be the maximal integer such that the Hamming distance between $B_{r}^{j^{*}(r)}$ and $P_{r}^{j^{*}(r)}$ is at most $k=\Theta(\varepsilon^{-2})$ . We define the $L_{p}$ prefix encoding of $B$ to be a tuple of pairs $j^{*}(r),\mathsf{MI}(B_{r}^{j^{\ast}(r)},P_{r}^{j^{\ast}(r)})$ .

The $L_{p}$ prefix encoding of $B$ uses $\mathcal{O}(k\log n^{\prime})=\mathcal{O}(\varepsilon^{-2}\log(n\sigma\varepsilon^{-1}))$ space.

Lemma 2.18.

Assume constant-time random access to $P[1,b]$ . Given a $b$ -length block $B$ of the text $T$ , its $L_{p}$ prefix encoding can be computed in $\mathcal{O}(b^{2}\cdot t\log{n\sigma\varepsilon^{-1}})$ time and $\mathcal{O}(b+\varepsilon^{-2}\log{n\sigma\varepsilon^{-1}})$ space.

Proof.

For each $r=0,\ldots,\lceil 3\log n^{\prime}\rceil$ and $j=1,\ldots,b$ , we compute the Hamming distance between $B_{r}^{j}$ and $P_{r}^{j}$ in $\mathcal{O}(bt)$ time using the compressed representation to find $j^{\ast}(r)$ and the corresponding mismatch information. The claim follows. ∎

Lemma 2.19.

Given the $L_{p}$ prefix encoding of a $b$ -length block $B$ of the text $T$ , there is an algorithm that computes, for all $j=1,\ldots,b$ , a $(1\pm\varepsilon)$ -approximation of the $L_{p}$ distance between the $j$ -length suffix of $B$ and the $j$ -length prefix of $P$ in $\widetilde{\mathcal{O}}(b^{2}\cdot t\log{n\sigma\varepsilon^{-1}})$ time and $\mathcal{O}(b+\varepsilon^{-2}\log{n\sigma\varepsilon^{-1}})$ space.

Proof.

Denote $X_{r}=\lVert P_{r}^{j}-B_{r}^{j}\rVert_{H}$ . We compute the smallest $f$ such that $X_{f}\leq k$ in the following way. For each $r$ , we use $\mathsf{MI}(B_{r}^{j^{\ast}(r)},P_{r}^{j^{\ast}(r)})$ to restore $B_{r}^{j^{\ast}(r)}$ . If $j>j^{\ast}(r)$ , the Hamming distance between $P_{r}^{j}$ and $B_{r}^{j}$ is at least $k$ . Otherwise, we compare $P_{r}^{j}$ and $B_{r}^{j}$ to compute the Hamming distance in $\mathcal{O}(bt)$ time. The claim follows. ∎

2.4 Generic ( $L_{p}$ ) distance for $1<p\leq 2$

For $1<p\leq 2$ , we use a scheme similar to the one developed in [17] for the Hamming distance, but adapt it to generic $L_{p}$ distances. In particular, we plug in a standard tool used in this situation, the $p$ -stable distribution. We additionally have to adapt the scheme slightly, taking into account that the $L_{p}$ norm is sub-additive under concatenation when $p>1$ .

Definition 2.20 ( $p$ -stable distribution [41]).

For a parameter $p>0$ , we say that a distribution $\mathcal{D}$ is $p$ -stable if for all $a,b\in\mathbb{R}$ and random variables $X,Y$ drawn independently from $\mathcal{D}$ , the variable $aX+bY$ is distributed as $\left(|a|^{p}+|b|^{p}\right)^{1/p}Z$ , where $Z$ is a random variable with distribution $\mathcal{D}$ .

Consider a word $X=x_{1}x_{2}\ldots x_{n}$ , and let $\alpha_{1},\alpha_{2},\ldots,\alpha_{n}$ be independent random variables drawn from a $p$ -stable distribution $\mathcal{D}$ with expected value $\mu_{\mathcal{D}}$ . By Definition 2.20, we have $\mathbb{E}\left[\sum_{i}\alpha_{i}x_{i}\right]/\mu_{\mathcal{D}}=\lVert X\rVert_{p}$ . The $p$ -stable distributions exist for all $0<p\leq 2$ , and a random variable $X$ from a $p$ -stable distribution can be generated using the formula $X=\frac{\sin(p\Theta)}{\cos^{1/p}(\Theta)}\left(\frac{\cos(\Theta(1-p))}{\ln(1/r)}\right)^{(1-p)/p}$ [11, 41], where $\Theta$ is uniform on $[-\pi/2,\pi/2]$ and $r$ is uniform on $[0,1]$ .
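The generation formula quoted above can be evaluated directly. The Python sketch below is a floating-point illustration only: the function names are ours, and the rejection of boundary values of $\Theta$ and $r$ (where the formula divides by zero) is an assumption added for numerical safety.

```python
import math
import random


def p_stable_from_uniforms(p, theta, r):
    """Evaluate the formula for theta in (-pi/2, pi/2) and r in (0, 1)."""
    return (math.sin(p * theta) / math.cos(theta) ** (1 / p)
            * (math.cos(theta * (1 - p)) / math.log(1 / r)) ** ((1 - p) / p))


def sample_p_stable(p, rng):
    """Draw one variate, rejecting boundary values that would divide by zero."""
    while True:
        theta = rng.uniform(-math.pi / 2, math.pi / 2)
        r = rng.random()
        if 0.0 < r < 1.0 and abs(theta) < math.pi / 2:
            return p_stable_from_uniforms(p, theta, r)
```

Two sanity checks follow from the formula itself: for $p=1$ the second factor equals $1$ and the variate is $\tan\Theta$ (a Cauchy variable), and for $p=2$ the expression simplifies to $2\sin\Theta\sqrt{\ln(1/r)}$ .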

However, to design an efficient sketching scheme that approximates the $L_{p}$ norm with high probability, three technicalities must be overcome. First, one must show that $\sum_{i}\alpha_{i}x_{i}$ concentrates well. Second, the formula above assumes infinite precision of computation. Finally, one cannot use fully independent random variables $\alpha_{i}$ as above, as this would require too much space. To overcome these issues, Indyk [27] combined $p$ -stable distributions and pseudorandom generators for bounded space computation [37]. We restate the final result of Indyk below, in a form that will be convenient for us later.

Theorem 2.21 (cf. Theorem 2, Theorem 4 [27]).

For any $0<p\leq 2$ , there is a non-uniform streaming algorithm that maintains a sketch $\mathsf{Sketch_{p}}(S)$ of a word $S$ of length $n$ over an alphabet of size $\sigma$ such that:

1. when a new character of $S$ arrives, the sketch can be updated in $\mathcal{O}(\varepsilon^{-2}\log(n/\varepsilon))$ time;

2. the algorithm and the sketch use $\mathcal{O}(\varepsilon^{-2}\log(\sigma n/\varepsilon)\log(n/\varepsilon))$ bits of space.

Given the sketches $\mathsf{Sketch_{p}}(X),\mathsf{Sketch_{p}}(Y)$ of two words $X,Y$ of length $n$ , one can estimate $\lVert X-Y\rVert_{p}$ up to a factor $1\pm\varepsilon$ with probability at least $9/10$ in time $\widetilde{\mathcal{O}}(1/\varepsilon^{2})$ .

We now proceed to building the $L_{p}$ prefix encoding by using $\mathsf{Sketch_{p}}$ and the landmarking technique.

Definition 2.22 ( $L_{p}$ prefix encoding).

Let $1<p\leq 2$ . Consider a word $S$ of length $b$ on the alphabet of size $\sigma$ . Define $q_{0}=b$ . For $k=0,\ldots,\lceil\log b\sigma^{p}\rceil$ , let $q_{k}\leq q_{k-1}$ be the leftmost position such that the $p$ ’th moment of the difference between $S[q_{k},b]$ and $P[1,b-q_{k}+1]$ , i.e. $\lVert S[q_{k},b]-P[1,b-q_{k}+1]\rVert_{p}^{p}$ , is at most $2^{k}$ .

Further, divide $S[q_{k},b]$ into $\Theta(1/\varepsilon^{p})$ blocks such that each block is either a single character, or the $p$ ’th moment of the difference between the block and the corresponding subword of $P[1,b-q_{k}+1]$ is at most $\varepsilon^{p}\cdot 2^{k}$ . Let $q_{k}=q_{k}^{0}\leq q_{k}^{1}\leq\ldots\leq q_{k}^{\ell_{k}}=b$ be the block borders. We choose $q_{k}^{1},q_{k}^{2},\ldots,q_{k}^{\ell_{k}}$ from left to right, and each position $q_{k}^{i}$ is chosen to be the rightmost possible.

The $L_{p}$ prefix encoding of $S$ is defined to contain sorted lists of the positions $q_{k}$ and $q_{k}^{i}$ , characters $S[q_{k}^{i}]$ , and sketches for $(1\pm C_{p}\cdot\varepsilon/p)$ -approximating the $p$ ’th norm of $S[q_{k}^{j},b]$ , for all $k,j$ and $C_{p}$ as in Observation 1.5, see also Theorem 2.21.
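The landmark positions $q_{k}$ can be found by brute force, as the following Python sketch shows. It assumes integer characters, 1-based positions, a caller-supplied upper bound `max_k` on $k$ , and returns `None` for thresholds that no suffix satisfies; these conventions and the function names are ours. The block borders $q_{k}^{i}$ and the sketches are not computed here.

```python
def suffix_moment(S, P, q, p):
    """p-th moment of S[q..b] - P[1..b-q+1], with a 1-based position q."""
    b = len(S)
    return sum(abs(S[q - 1 + i] - P[i]) ** p for i in range(b - q + 1))


def landmarks(S, P, p, max_k):
    """For k = 0..max_k, the leftmost q whose suffix moment is at most 2^k (or None)."""
    b = len(S)
    moments = {q: suffix_moment(S, P, q, p) for q in range(1, b + 1)}
    result = []
    for k in range(max_k + 1):
        candidates = [q for q in range(1, b + 1) if moments[q] <= 2 ** k]
        result.append(min(candidates) if candidates else None)
    return result
```

Since the thresholds $2^{k}$ are increasing, the leftmost qualifying position can only move left as $k$ grows, in line with the requirement $q_{k}\leq q_{k-1}$ .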

The encoding takes $\widetilde{\mathcal{O}}(\varepsilon^{-2-p}\log\sigma\log(\sigma n/\varepsilon)\log(n/\varepsilon))$ bits of space. We now show that given the $L_{p}$ prefix encoding of a block $B$ of the text of length $b$ , one can compute a $(1\pm\varepsilon)$ -approximation of the $L_{p}$ distance between any prefix $P[1,b-j+1]$ of the pattern $P$ and the corresponding suffix $B[j,b]$ of $B$ .

Lemma 2.23.

Let $1<p\leq 2$ . For any two vectors $X,Y$ of equal length, $\Big{|}\lVert X+Y\rVert_{p}^{p}-\lVert X\rVert_{p}^{p}\Big{|}=\mathcal{O}(\lVert Y\rVert_{p}^{p}+\lVert Y\rVert_{p}\cdot\lVert X\rVert_{p}^{p-1})$ .

Proof.

Consider $x,y\in\mathbb{R}$ . If $|x|\geq|y|$ , then by Taylor expansion, $|x+y|^{p}=|x|^{p}(1+y/|x|)^{p}=|x|^{p}(1+\mathcal{O}(|y/x|))=|x|^{p}\pm\mathcal{O}(|y||x|^{p-1})$ . If $|x|<|y|$ , then $|x+y|^{p}=\mathcal{O}(|y|^{p})$ . Thus for any real values, we have

[TABLE]

Denote $X=[x_{1},x_{2},\ldots,x_{n}]^{T}$ and $Y=[y_{1},y_{2},\ldots,y_{n}]^{T}$ . There is

[TABLE]

Pick $q=p/(p-1)$ so that $1/p+1/q=1$ . By Hölder’s inequality:

[TABLE]

∎
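For reference, the bound of Lemma 2.23 can also be obtained with an explicit constant. The derivation below is an alternative route via the mean value theorem rather than the Taylor expansion used in the proof; it shows that the constant hidden in the $\mathcal{O}(\cdot)$ can be taken to be $p\leq 2$ .

```latex
% Scalars x, y and 1 < p <= 2. Mean value theorem for t -> |t|^p, then
% subadditivity of t -> t^{p-1} on [0, infinity), valid because 0 < p-1 <= 1.
\[
  \bigl|\,|x+y|^{p}-|x|^{p}\bigr|
  \le p\,|y|\,\bigl(|x|+|y|\bigr)^{p-1}
  \le p\,\bigl(|y|^{p}+|y|\,|x|^{p-1}\bigr).
\]
% Summing over coordinates and applying Hoelder with exponents p and q = p/(p-1),
% so that (p-1)q = p and p/q = p-1:
\[
  \Bigl|\lVert X+Y\rVert_{p}^{p}-\lVert X\rVert_{p}^{p}\Bigr|
  \le p\Bigl(\lVert Y\rVert_{p}^{p}+\sum_{i}|y_{i}|\,|x_{i}|^{p-1}\Bigr)
  \le p\Bigl(\lVert Y\rVert_{p}^{p}+\lVert Y\rVert_{p}\cdot\lVert X\rVert_{p}^{p-1}\Bigr).
\]
```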

Lemma 2.24.

Let $1<p\leq 2$ . Given the $L_{p}$ prefix encoding of a block $B$ of the text $T$ of length $b$ , one can find $(1\pm\varepsilon)$ -approximation of the $p$ ’th moment of the difference between any prefix $P[1,b-j+1]$ of the pattern $P$ and the corresponding suffix $B[j,b]$ of $B$ in time $\widetilde{\mathcal{O}}(\varepsilon^{-2}+\log\sigma)$ .

Proof.

Let $q_{k}$ be the position that is closest to $j$ from the left, and let $i$ be such that $q_{k}^{i}\leq j<q_{k}^{i+1}$ (see Fig. 2). We can find $q_{k}$ , $q_{k}^{i},q_{k}^{i+1}$ in time $\mathcal{O}(\log(b\sigma^{p})+1/\varepsilon^{p})$ by iterating over the sorted lists.

The position $q_{k}^{i+1}$ divides $P[1,b-j+1]$ into two parts, $P_{1}$ and $P_{2}$ . Denote $B_{1}$ and $B_{2}$ the respective subwords of $B$ they are aligned with (see Fig. 2). Let $m_{1}=F_{p}(P_{1}-B_{1})$ and $m_{2}=F_{p}(P_{2}-B_{2})$ . Then $m=F_{p}(P[1,b-j+1]-B[j,b])$ , being the value we need to approximate, is equal to $m_{1}+m_{2}$ .

We can find $m^{\prime}_{2}\stackrel{{\scriptstyle\mathclap{{\mbox{$ \varepsilon $}}}}}{{=}}m_{2}$ using the sketches for $B_{2}=B[q_{k}^{i+1},b]$ and $P_{2}$ in time $\widetilde{\mathcal{O}}(1/\varepsilon^{2})$ . Furthermore, if $q_{k}^{i}=q_{k}^{i+1}-1$ , then we can compute $m_{1}$ exactly as we store $B[q_{k}^{i}]$ . Otherwise, we consider the subword $\tilde{P}=P[j-q_{k}+1,q_{k}^{i+1}-q_{k}+1]$ of the pattern $P$ . Denote $m^{\prime}_{1}=F_{p}(P_{1}-\tilde{P})$ and use it as our estimation of $m_{1}$ .

Since $1\leq p\leq 2$ , by definition, $F_{p}(B_{1}-\tilde{P})\leq\varepsilon^{p}\cdot 2^{k-1}$ , and $F_{p}(P_{1}-B_{1})\leq 2^{k}$ . By Lemma 2.23 with $X=P_{1}-B_{1}$ and $Y=B_{1}-\tilde{P}$ ,

[TABLE]

and finally $|(m_{1}+m_{2})-(m^{\prime}_{1}+m^{\prime}_{2})|\leq\mathcal{O}(\varepsilon m)+\varepsilon m_{2}=\mathcal{O}(\varepsilon m)$ . ∎

Lemma 2.25.

Let $1<p\leq 2$ . The $L_{p}$ prefix encoding of a $b$ -length block $B$ of the text can be computed in time $\tilde{\mathcal{O}}(b^{2}+\varepsilon^{-2}b\log\sigma)$ and space $\tilde{\mathcal{O}}(b+\varepsilon^{-2-p}\log^{2}\sigma)$ .

Proof.

For $j=1,\ldots,b$ , we naively compute the $L_{p}$ distance between the suffix of $B$ and the prefix of $P$ in $\mathcal{O}(b)$ time. We then find the positions $q_{k}$ . For each $k=0,\ldots,\lceil\log b\sigma^{p}\rceil$ , we can find the positions $q_{k}^{i}$ in $\mathcal{O}(b)$ time and compute the sketches in $\tilde{\mathcal{O}}(\varepsilon^{-2}b)$ time by Theorem 2.21. ∎

3 Suffix sketches

In this section, we give the definitions and explain how we maintain the suffix sketches for each of the distances.

3.1 Hamming distance

We first recall Euclidean suffix sketches as presented in [17]. We will not use them for the Euclidean distance itself, since that case is covered by the generic solution of Section 3.3, but they serve as the foundation of the Hamming suffix sketches.

All sketches presented in this section are correct with constant probability, which can be amplified to $1-\delta$ for arbitrarily small $\delta$ by a standard method of repeating sketching independently $\Theta(\log\delta^{-1})$ times and taking the median of the estimates.

Lemma 3.1 (Euclidean sketches [2]).

Let $M$ be a random matrix of size $d\times n$ filled with 4-wise independent random $\pm 1$ variables, for $d=\Theta(\varepsilon^{-2})$ chosen large enough. For a vector $X\in\mathbb{R}^{n}$ we have $\frac{1}{\sqrt{d}}\lVert MX\rVert_{2}\stackrel{\varepsilon}{=}\lVert X\rVert_{2}$ with probability at least $9/10$ , taken over all possible choices of $M$ . We say that the vector $MX$ of dimension $d$ is a Euclidean sketch of $X$ .

Definition 3.2 (Euclidean suffix sketches [17]).

Consider a word $X$ of length $n$ . We define its Euclidean suffix sketch as follows.

Let $b$ be the block length. Let $\mathcal{R}$ be a random matrix of size $d\times b$ filled with 4-wise independent random $\pm 1$ variables and let $\alpha_{1},\ldots,\alpha_{\lceil n/b\rceil}$ be 4-wise independent random coefficients with values $\pm 1$ as well. We define a matrix $M$ of size $d\times n$ such that $M_{i,jb+k}=\alpha_{j}\cdot\mathcal{R}_{i,k}$ .

Let $X^{\prime}$ be a word of length $\lceil n/b\rceil\cdot b$ obtained from $X$ by appending an appropriate number of zeroes. The Euclidean suffix sketch of $X$ is defined as $\mathsf{eSketch}(X)=MX^{\prime}$ , where $X^{\prime}$ is considered as a vector.

Observe that the matrix $M$ does not need to be accessed explicitly. Indeed, from $MX^{\prime}=\sum_{i}\alpha_{i}\cdot\mathcal{R}\cdot\begin{bmatrix}X^{\prime}[bi],\ \ldots,\ X^{\prime}[bi+b-1]\end{bmatrix}^{T}$ it follows that the Euclidean suffix sketch can be computed by first sketching each block of $X^{\prime}$ using the matrix $\mathcal{R}$ , and then taking a linear combination of the sketches of the blocks (using the random $\pm 1$ coefficients $\alpha_{i}$ ).
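The block-wise computation can be checked against the explicit matrix $M$ with a short Python sketch. Here the signs are drawn fully independently from a seeded generator, which is a simplifying assumption (the definition only requires 4-wise independence), blocks are indexed from $0$ , and the function names are ours.

```python
import random


def random_signs(rows, cols, rng):
    """A rows x cols matrix of independent +1/-1 entries (stronger than 4-wise independence)."""
    return [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]


def e_sketch(X, b, R, alpha):
    """Block-wise sketch: sum over blocks j of alpha[j] * (R applied to block j)."""
    padded = list(X) + [0] * (-len(X) % b)  # append zeroes up to a multiple of b
    sketch = [0] * len(R)
    for j in range(len(padded) // b):
        block = padded[j * b:(j + 1) * b]
        for i in range(len(R)):
            sketch[i] += alpha[j] * sum(R[i][k] * block[k] for k in range(b))
    return sketch


def explicit_sketch(X, b, R, alpha):
    """The same sketch via the explicit matrix M[i][j*b + k] = alpha[j] * R[i][k]."""
    padded = list(X) + [0] * (-len(X) % b)
    n = len(padded)
    M = [[alpha[c // b] * R[i][c % b] for c in range(n)] for i in range(len(R))]
    return [sum(M[i][c] * padded[c] for c in range(n)) for i in range(len(R))]
```

The block-wise version touches only the $d\times b$ matrix $\mathcal{R}$ and one coefficient per block, which is what makes the sketch maintainable in a stream.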

Lemma 3.3 ([17]).

Selecting $d=\Theta(\varepsilon^{-2})$ gives $\frac{1}{\sqrt{d}}\lVert\mathsf{eSketch}(X)\rVert_{2}\stackrel{{\scriptstyle\mathclap{{\mbox{$ \varepsilon $}}}}}{{=}}\lVert X\rVert_{2}$ with probability at least $9/10$ (taken over all possible choices of $\mathcal{R},\alpha_{i}$ ).

By linearity of sketches, we obtain $\lVert X-Y\rVert_{2}\stackrel{{\scriptstyle\mathclap{{\mbox{$ \varepsilon $}}}}}{{=}}\frac{1}{\sqrt{d}}\lVert\mathsf{eSketch}(X)-\mathsf{eSketch}(Y)\rVert_{2}$ with probability at least $9/10$ as well.

We now define Hamming suffix sketches. First note that for binary words $X,Y$ there is $\mathsf{Ham}(X,Y)=\lVert X-Y\rVert_{2}$ , and therefore in the case of the binary alphabet we can use the Euclidean suffix sketches. We will now show how to reduce the case of arbitrary polynomial-size alphabets to the case of the binary alphabet.

To this end, [17] used a random mapping of Karloff [30] as a black-box reduction, which led to sketches of size $\sim\varepsilon^{-4}$ . We now show a more careful reduction that avoids this overhead and achieves dependency $\varepsilon^{-2}$ in total. Consider a word morphism defined on the alphabet as $\mu:\Sigma\to\{0,1\}^{\sigma}$ , $\mu(a)=0^{a}10^{\sigma-a-1}$ (and acting on words by concatenating the images of the characters of the input word). Note that $\lVert\mu(X)-\mu(Y)\rVert_{2}^{2}=2\cdot\lVert X-Y\rVert_{H}$ , thus using the Euclidean suffix sketches on top of $\mu(X)$ and $\mu(Y)$ allows computation of the respective Hamming distance. Formally,

Definition 3.4 (Hamming suffix sketches [17]).

Consider a word $X$ of length $n$ on the alphabet of size $\sigma$ . We define its Hamming suffix sketch as follows.

Let $b$ be the block length, $\mathcal{R}$ be a random matrix of size $d\times\sigma b$ filled with 4-wise independent random $\pm 1$ variables, and $\alpha_{1},\ldots,\alpha_{\lceil n/b\rceil}$ be 4-wise independent random coefficients with values $\pm 1$ as well. We define a matrix $M$ of size $d\times\sigma n$ such that $M_{i,\sigma jb+k}=\alpha_{j}\cdot\mathcal{R}_{i,k}$ .

Let $X^{\prime}$ be a word of length $\lceil n/b\rceil\cdot b$ obtained from $X$ by appending an appropriate number of zeroes. The Hamming suffix sketch of $X$ is defined as $\mathsf{hSketch}(X)=M\mu(X^{\prime})$ , where $\mu(X^{\prime})$ is considered as a vector.
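
As a small sanity check of the embedding $\mu$ used in this definition, the following sketch encodes two words over the alphabet $\{0,\ldots,\sigma-1\}$ and compares the squared Euclidean distance of the images with twice the Hamming distance. The helper names `mu`, `hamming` and `sq_l2` are illustrative, and characters are assumed to be integers.

```python
def mu(word, sigma):
    """One-hot encode each character a in {0..sigma-1} as 0^a 1 0^(sigma-a-1)."""
    out = []
    for a in word:
        out.extend([0] * a + [1] + [0] * (sigma - a - 1))
    return out

def hamming(X, Y):
    """Number of positions where two equal-length words differ."""
    return sum(x != y for x, y in zip(X, Y))

def sq_l2(U, V):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(U, V))

X = [0, 2, 1, 3, 3]
Y = [0, 1, 1, 0, 3]
sigma = 4
lhs = sq_l2(mu(X, sigma), mu(Y, sigma))   # squared L2 distance of the images
rhs = 2 * hamming(X, Y)                   # twice the Hamming distance
```

Each mismatching position contributes exactly two differing coordinates (one per one-hot vector), which is where the factor $2$ comes from.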

Lemma 3.5.

Selecting $d=\Theta(\varepsilon^{-2})$ gives $\frac{1}{2d}\lVert\mathsf{hSketch}(X)\rVert_{2}^{2}\stackrel{\varepsilon}{=}\lVert X\rVert_{H}$ with probability at least $9/10$ (taken over all possible choices of $\mathcal{R},\alpha_{i}$).

Proof.

Follows immediately as a corollary of Lemma 3.3 and the properties of the embedding $\mu$ . In more detail, the following holds with probability at least $9/10$ :

[TABLE]

∎

As $\mu(X),\mu(Y)$ are sparse, there is an efficient streaming algorithm for maintaining the Hamming suffix sketches of a text:

Lemma 3.6.

Given a text $T$, there is a streaming algorithm that for every position $i$ outputs the Hamming suffix sketch of a word $T[b\cdot k+1,i]$, where $k$ is the smallest integer such that $i-b\cdot k\leq n$. The algorithm takes $\mathcal{O}(dn/b+\log d\sigma n)$ space and $\mathcal{O}(d(1+n/b^{2}))$ time per character.

Proof.

We fix the matrix $\mathcal{R}$ and the random coefficients $\alpha_{1},\ldots,\alpha_{\lceil n/b\rceil}$ from Definition 3.4. We do not store $\mathcal{R}$ and $\alpha_{i}$ explicitly, but generate them using two hash functions drawn at random from a $4$-wise independent family. For example, to generate $\mathcal{R}$ we can consider a family of polynomials $2((ax^{3}+bx^{2}+cx+d\bmod p)\bmod 2)-1$, with coefficients $a,b,c,d$ (not to be confused with the block length $b$ and the sketch dimension $d$) chosen uniformly at random from the prime field $\mathbb{F}_{p}$ for $p\geq d\sigma b$, and $\alpha_{i}$ can be generated in a similar fashion. This way, we need to store only $\mathcal{O}(\log(d\sigma b)+\log(n/b))=\mathcal{O}(\log d\sigma n)$ random bits that define the coefficients of the two polynomials generating $\mathcal{R}$ and $\alpha_{i}$.

We process the text $T$ by blocks $B_{1},B_{2},\ldots$ of length $b$ . For each block $B_{k}$ we compute its sketch using the matrix $\mathcal{R}$ . That is, at the beginning of each block we initialize its sketch as a zero vector of length $d$ . When a new character $T[i]$ of a block $B_{k}$ arrives, we compute and add $[M[1,i\cdot b\sigma+T[i]],M[2,i\cdot b\sigma+T[i]],\ldots,M[d,i\cdot b\sigma+T[i]]]^{T}$ to the sketch, which takes $\mathcal{O}(d)$ time. We store the sketch of $B_{k}$ until the block $B_{k+\lceil n/b\rceil}$ and use it to compute the suffix sketches for the positions in this block.

Consider now a block $B_{k+\lceil n/b\rceil}$ . We first compute the suffix sketch for the position $b\cdot(k+\lceil n/b\rceil)$ , which is the position preceding the block $B_{k+\lceil n/b\rceil}$ . The suffix sketch for it is simply a linear combination of the sketches of the blocks $B_{k+\lceil n/b\rceil-1},B_{k+\lceil n/b\rceil-2},\ldots,B_{k}$ with coefficients $\alpha_{1},\ldots,\alpha_{\lceil n/b\rceil-1}$ . Since each sketch is a vector of length $d$ , we can compute the linear combination in $\mathcal{O}(dn/b)$ time. To make this computation time-efficient, we start it $b$ positions before position $b\cdot(k+\lceil n/b\rceil)$ arrives, and de-amortise the computation over these $b$ positions. This way, we use only $\mathcal{O}(dn/b^{2})$ time per character.

Now, using the suffix sketch for the position $b\cdot(k+\lceil n/b\rceil)$ , we can compute the suffix sketches for all positions in the block $B_{k+\lceil n/b\rceil}$ one-by-one, using only $\mathcal{O}(d)$ time per character: When a new character $T[i]$ arrives, we add $[\alpha_{\lceil n/b\rceil}M[1,i\cdot b\sigma+T[i]],\alpha_{\lceil n/b\rceil}M[2,i\cdot b\sigma+T[i]],\ldots,\alpha_{\lceil n/b\rceil}M[d,i\cdot b\sigma+T[i]]]^{T}$ to the suffix sketch to update it.

Note that at any time we store $\mathcal{O}(n/b)$ sketches of the blocks, so the algorithm uses $\mathcal{O}(dn/b+\log d\sigma n)$ space in total. ∎
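
The two ingredients of this proof, signs generated on demand from a random cubic polynomial and an $\mathcal{O}(d)$ update per character, can be illustrated as follows. This is only a sketch under stated assumptions: the prime `P`, the column layout `pos * sigma + ch` inside a block, the row offset `i * sigma * b`, and all function names are choices made for the example rather than details fixed by the paper.

```python
import random

P = 2**31 - 1  # a Mersenne prime, assumed to exceed d * sigma * b

def make_sign_hash(rng):
    """Random degree-3 polynomial over F_P, mapped to {-1, +1} as in the proof."""
    c3, c2, c1, c0 = (rng.randrange(P) for _ in range(4))
    def sign(x):
        # Horner evaluation of c3*x^3 + c2*x^2 + c1*x + c0 (mod P), then parity.
        return 2 * ((((c3 * x + c2) * x + c1) * x + c0) % P % 2) - 1
    return sign

def update_block_sketch(sketch, sign, pos, ch, b, sigma):
    """Add the single column of R selected by character ch at offset pos: O(d)."""
    col = pos * sigma + ch
    for i in range(len(sketch)):
        sketch[i] += sign(i * sigma * b + col)

def dense_block_sketch(sign, block, d, b, sigma):
    """Reference: multiply the explicit d x (sigma*b) matrix R by mu(block)."""
    vec = [0] * (sigma * b)
    for pos, ch in enumerate(block):
        vec[pos * sigma + ch] = 1
    return [sum(sign(i * sigma * b + c) * vec[c] for c in range(sigma * b))
            for i in range(d)]

rng = random.Random(1)
sign = make_sign_hash(rng)
d, b, sigma = 5, 4, 6
block = [3, 0, 5, 2]
sketch = [0] * d                      # zero vector at the start of the block
for pos, ch in enumerate(block):
    update_block_sketch(sketch, sign, pos, ch, b, sigma)
```

Because $\mu$ maps every character to a vector with a single non-zero entry, the streaming update touches one column of $\mathcal{R}$ per character, while the dense reference pays $\mathcal{O}(d\sigma b)$ per block.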

3.2 Manhattan ( $L_{1}$ ) distance

To show efficient suffix sketches for the Manhattan distance, we consider a word morphism $\nu:\Sigma\to\{0,1\}^{\sigma}$ , $\nu(a)=1^{a}0^{\sigma-a}$ . Note that $\lVert\nu(X)-\nu(Y)\rVert_{2}^{2}=\lVert\nu(X)-\nu(Y)\rVert_{H}=\lVert X-Y\rVert_{1}$ , thus using the Hamming suffix sketches on top of $\nu(X)$ and $\nu(Y)$ allows computation of the respective Manhattan distance.
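
The chain of equalities above can be verified directly on a small example. The sketch below assumes integer characters in $\{0,\ldots,\sigma\}$; the helper name `nu` is illustrative.

```python
def nu(word, sigma):
    """Unary encode each character a in {0..sigma} as 1^a 0^(sigma-a)."""
    out = []
    for a in word:
        out.extend([1] * a + [0] * (sigma - a))
    return out

X = [3, 0, 4, 1]
Y = [1, 2, 4, 4]
sigma = 4
nx, ny = nu(X, sigma), nu(Y, sigma)
sq_l2 = sum((u - v) ** 2 for u, v in zip(nx, ny))   # squared Euclidean distance
ham = sum(u != v for u, v in zip(nx, ny))           # Hamming distance of images
l1 = sum(abs(x - y) for x, y in zip(X, Y))          # Manhattan distance of words
```

Note that this direct encoding writes $\sigma$ symbols for every input character.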

However, if we apply the morphism straightforwardly, we will have to pay an extra $\sigma$ factor per character to compute the Manhattan suffix sketches. To improve the running time, we will use range-summable hash functions. Range-summable hash functions were introduced by Feigenbaum et al. [21], and later their construction was improved by Calderbank et al. [9].

Definition 3.7 (cf. [9]).

A family $\mathcal{H}$ of hash functions $h(x;\xi):[t]\times\{0,1\}^{s}\rightarrow\{-1,1\}$ (here $x$ is the argument and $\xi$ is the seed) is called $k$ -independent, range-summable if it satisfies the following properties for any $h\in\mathcal{H}$ :

1. ($k$-independent) for all distinct $0\leq x_{1},\ldots,x_{k}<t$ and all $b_{1},\ldots,b_{k}\in\{-1,+1\}$, we have $\Pr_{\xi}\left[h(x_{1};\xi)=b_{1}\wedge\ldots\wedge h(x_{k};\xi)=b_{k}\right]=2^{-k}$;

2. (range-summable) there exists a function $g$ such that given a pair of integers $0\leq\alpha,\beta\leq\sigma$, and a seed $\xi$, the value $g(\alpha,\beta;\xi)=\sum_{\alpha\leq x<\beta}h(x;\xi)$ can be computed in time polynomial in $\log t$. (In [9], the function $h$ was defined to take values in $\{0,1\}$. We can change the range of values to $\{-1,+1\}$ by taking $h^{\prime}=1-2h$ while preserving the properties.)

Corollary 3.8 (cf. Theorem 3.1 [9]).

There is a $4$ -independent, range-summable family of hash functions $h(x;\xi):[t]\times\{0,1\}^{s}\rightarrow\{-1,+1\}$ with a random seed $\xi$ of length $s=\mathcal{O}(\log^{2}t)$ such that any range-sum $g(\alpha,\beta;\xi)$ can be computed in $\mathcal{O}(\log^{3}t)$ time.

Observation 3.9.

For a word $X=x_{1}x_{2}\ldots x_{n}$ , let $Y=\nu(X)=y_{1}y_{2}\ldots y_{n\sigma}$ . Let $h,g$ be as in Corollary 3.8 with $t=n\sigma$ . Then $\sum_{i=1}^{n}g(i\sigma,i\sigma+x_{i};\xi)=\sum_{i=1}^{n\sigma}y_{i}h(i;\xi)$ .

Thus, we see that range-summable hash functions can be used to efficiently simulate $\nu$ .
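
Observation 3.9 holds for any $\pm 1$ function $h$, which makes it easy to check numerically. The sketch below uses 0-based indices and a toy hash; both `h` and the naive linear-time `g` are stand-ins chosen for illustration and do not reproduce the polylogarithmic-time construction of [9].

```python
def h(x, seed):
    """Toy +-1 hash (an illustrative stand-in, not the family of [9])."""
    return 1 if ((x + seed) * 2654435761 >> 11) & 1 else -1

def g(lo, hi, seed):
    """Range sum of h over [lo, hi); computed naively in O(hi - lo) time here."""
    return sum(h(x, seed) for x in range(lo, hi))

def nu(word, sigma):
    """Unary encode each character a in {0..sigma} as 1^a 0^(sigma-a)."""
    out = []
    for a in word:
        out.extend([1] * a + [0] * (sigma - a))
    return out

X = [2, 0, 5, 3]
sigma, seed = 5, 42
Y = nu(X, sigma)

# One range-sum query per character of X ...
lhs = sum(g(i * sigma, i * sigma + x, seed) for i, x in enumerate(X))
# ... equals the inner product of nu(X) with the hash values.
rhs = sum(y * h(j, seed) for j, y in enumerate(Y))
```

The left-hand side issues one query per character of $X$, whereas the right-hand side reads all $n\sigma$ symbols of $\nu(X)$; with a genuinely range-summable family each query costs time polylogarithmic in $t$.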

Definition 3.10 (Manhattan suffix sketches).

Let $X$ be a word of length $n$. We define its Manhattan suffix sketch as follows.

Let $b$ be the block length. Let $h,g$ be as in Corollary 3.8 with $t=bd\sigma$. Let $\mathcal{R}$ be a random matrix of size $d\times\sigma b$ filled with 4-wise independent random $\pm 1$ variables, such that $\mathcal{R}_{i,k}=h(ib\sigma+k;\xi)$, and let $\alpha_{1},\ldots,\alpha_{\lceil n/b\rceil}$ be 4-wise independent random coefficients with values $\pm 1$ as well. We define a matrix $M$ of size $d\times\sigma n$ such that $M_{i,\sigma jb+k}=\alpha_{j}\cdot\mathcal{R}_{i,k}=\alpha_{j}\cdot h(ib\sigma+k;\xi)$.

Let $X^{\prime}$ be a word of length $\lceil n/b\rceil\cdot b$ obtained from $X$ by appending an appropriate number of zeroes. The Manhattan suffix sketch of $X$ is defined as $\mathsf{mSketch}(X)=M\nu(X^{\prime})$ , where $\nu(X^{\prime})$ is considered as a vector.

Lemma 3.11.

Selecting $d=\Theta(1/\varepsilon^{2})$ gives $\frac{1}{d}\lVert\mathsf{mSketch}(X)\rVert_{2}^{2}\stackrel{\varepsilon}{=}\lVert X\rVert_{1}$ with probability at least $9/10$ (taken over all possible choices of $\alpha_{i}$ and $\xi$).

Proof.

Follows immediately as a corollary of Lemma 3.3 and the properties of the embedding $\nu$ . In more detail, the following holds with probability at least $9/10$ :

[TABLE]

∎

Lemma 3.12.

Given a text $T$ , there is a streaming algorithm that for every position $i$ outputs the Manhattan suffix sketch of a word $T[b\cdot k+1,i]$ , where $k$ is the smallest integer such that $i-b\cdot k\leq n$ . The algorithm takes $\mathcal{O}(d\cdot(n/b)+\log^{2}\sigma)$ space, and $\mathcal{O}(d(1+n/b^{2})\cdot\log^{3}(bd\sigma))$ time per character.

Proof.

The proof mirrors the proof of Lemma 3.6, and we describe the key elements. We fix the random coefficients $\alpha_{1},\ldots,\alpha_{\lceil n/b\rceil}$ and the hash function $h$ from Definition 3.10. As previously, we do not store the coefficients $\alpha_{i}$ explicitly, but generate them using a hash function drawn at random from a $4$ -wise independent family. The matrix $\mathcal{R}$ is already defined by $h$ , with the following parameters: it requires $\mathcal{O}(\log^{2}(bd\sigma))$ bits of seed, and range-sum queries are answered in time $\mathcal{O}(\log^{3}(bd\sigma))$ .

In the sketching of blocks, we proceed in the same manner, except that when a new character $T[i]$ of a block $B_{k}$ arrives, we compute and add $\sum_{0\leq j<T[i]}[M[1,i\cdot b\sigma+j],\ldots,M[d,i\cdot b\sigma+j]]^{T}=\alpha_{i}\cdot[g(b\sigma,b\sigma+T[i];\xi),g(2b\sigma,2b\sigma+T[i];\xi),\ldots,g(d\cdot b\sigma,d\cdot b\sigma+T[i];\xi)]^{T}$ to the sketch, which takes $\mathcal{O}(d\cdot\log^{3}(bd\sigma))$ time (a factor of $\log^{3}(bd\sigma)$ slower than the corresponding step in Lemma 3.6).

Consider now a block $B_{k+\lceil n/b\rceil}$ . When a new character $T[i]$ arrives, we update the suffix sketch by adding $\alpha_{\lceil n/b\rceil}\cdot[g(b\sigma,b\sigma+T[i];\xi),g(2b\sigma,2b\sigma+T[i];\xi),\ldots,g(d\cdot b\sigma,d\cdot b\sigma+T[i];\xi)]^{T}$ to it.

All of the operations are a factor of $\mathcal{O}(\log^{3}(bd\sigma))$ slower than the corresponding steps in Lemma 3.6, and the space complexity increases by the seed size, which is $\mathcal{O}(\log^{2}(bd\sigma))$ (the $\log^{2}b$ and $\log^{2}d$ terms get absorbed). ∎

3.3 Generic ( $L_{p}$ ) distance for $0<p\leq 2$ .

For generic $L_{p}$ distances, we use the approach of [27] based on $p$ -stable distributions.

Corollary 3.13.

Given a text $T$ , there is a streaming algorithm that for every position $i$ outputs the $L_{p}$ suffix sketch of a word $T[b\cdot k+1,i]$ , where $k$ is the smallest integer such that $i-b\cdot k\leq n$ . The algorithm takes $\mathcal{O}(\varepsilon^{-2}(n/b)\cdot\log(\sigma n/\varepsilon)\log(n/\varepsilon))$ bits of space and $\mathcal{O}(\varepsilon^{-2}(n/b)\log(n))$ time per character.

Proof.

We start a new instance of the sketching algorithm of Theorem 2.21 at every block border and continue running it for the next $\lceil n/b\rceil$ blocks. At each moment, there are $\mathcal{O}(n/b)$ active instances of the algorithm. The bounds follow. ∎
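
The bookkeeping in this proof can be sketched as follows. The code is a hedged illustration only: each "instance" is a stand-in that accumulates the exact $p$'th moment of the characters it has seen, whereas the algorithm of Theorem 2.21 would maintain a small sketch; positions are 1-based, $b$ is assumed to divide $n$, and `run_instances` is a hypothetical name.

```python
import math
from collections import deque

def run_instances(text, n, b, p):
    """Yield (i, start, moment, active_count) for each 1-based position i.

    Each instance is a stand-in for the sketching algorithm of Theorem 2.21:
    it stores the exact p-th moment of its input instead of a sketch.
    """
    keep = math.ceil(n / b)            # number of blocks an instance stays alive
    active = deque()                   # entries: [start_position, moment]
    for i, ch in enumerate(text, start=1):
        if (i - 1) % b == 0:           # position i opens a new block
            active.append([i, 0.0])
            if len(active) > keep:     # the oldest instance has run its course
                active.popleft()
        for inst in active:            # feed the character to every instance
            inst[1] += abs(ch) ** p
        start, moment = active[0]      # oldest instance: word T[start, i]
        yield i, start, moment, len(active)

text = [1, 4, 0, 2, 3, 1, 5, 2, 2]
n, b, p = 6, 2, 1.5
results = list(run_instances(text, n, b, p))
```

At every position the oldest live instance corresponds to the word $T[b\cdot k+1,i]$ with the smallest admissible $k$, and at most $\lceil n/b\rceil$ instances are alive at once, which is the source of the $n/b$ factor in the bounds.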

4 Proof of Theorem 1.3

Recall the structure of the algorithms. During the preprocessing, we compute the suffix sketches of suffixes $P[1,n],P[2,n],\ldots,P[b,n]$ of $P$ . During the main stage, the text is processed by blocks of length $b$ . To compute an approximation of the distance / the $p$ ’th moment at a particular alignment, we divide the pattern into two parts: a prefix of length at most $b$ , and the remaining suffix. We compute an approximation of the distance / the $p$ ’th moment for both of the parts and sum them up to obtain the final answer. To compute an approximation of the distance / the $p$ ’th moment between the prefix and the corresponding block of the text, we compute, while reading each block of the text, its prefix encoding, and to compute an approximation of the distance / the $p$ ’th moment between the suffix and the text, we use the suffix sketches.

Hamming ( $L_{0}$ ) distance. When we receive a new block of the text, we compute its Hamming prefix encoding using the algorithm of Lemma 2.5 in $\mathcal{O}(b)$ space. We de-amortize the computation over the subsequent block and spend $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ time per character. We store the resulting encoding for the next $\mathcal{O}(n/b)$ blocks. In total, the encodings require $\widetilde{\mathcal{O}}(\varepsilon^{-2}n/b)$ space. The Hamming suffix sketches of $P[1,n],P[2,n],\ldots,P[b,n]$ occupy $\mathcal{O}(\varepsilon^{-2}b)$ space. The algorithm of Lemma 3.6 that computes the suffix sketches takes $\mathcal{O}(\varepsilon^{-2}n/b+\log(\varepsilon^{-2}\sigma n))$ space and $\mathcal{O}(\varepsilon^{-2}(1+n/b^{2}))$ time per character. Consider a block starting with position $p$. To compute the Hamming distances between the pattern and the $n$-length subwords that end in this block, we apply the following approach. First, while reading the block preceding the current one, we decode the Hamming prefix encoding of the block that starts at position $p-n$ using Lemma 2.6. We de-amortize the algorithm to spend $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ time per character. Hence, at each position $i$ of the current block, we know a $(1\pm\varepsilon)$-approximation of the Hamming distance between the prefixes of the pattern and the corresponding subwords of the text. At each position, we can approximate the Hamming distance between the corresponding suffix of the pattern and the text in $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ time using the Hamming suffix sketch. By taking $b=\sqrt{n}$, we obtain the claim.

Manhattan ( $L_{1}$ ) distance. We proceed analogously to the Hamming distance case. The Manhattan prefix encoding of each block is computed using Lemma 2.11, in $\widetilde{\mathcal{O}}(b)$ time per character. We store the resulting encoding for the next $\mathcal{O}(n/b)$ blocks, giving in total $\widetilde{\mathcal{O}}(\varepsilon^{-2}n/b)$ space. The Manhattan suffix sketches of $P[1,n],P[2,n],\ldots,P[b,n]$ occupy $\mathcal{O}(\varepsilon^{-2}b)$ space. The algorithm of Lemma 3.12 takes $\widetilde{\mathcal{O}}(\varepsilon^{-2}(b+n/b)+\log^{2}\sigma)$ space and $\widetilde{\mathcal{O}}(\varepsilon^{-2}(1+n/b^{2}))$ time per character. For decoding the prefix encoding we use Lemma 2.12, spending $\widetilde{\mathcal{O}}(b)$ time per character. Once again we take $b=\sqrt{n}$, and assume w.l.o.g. $\varepsilon^{-1}\leq\sqrt{n}$ (as otherwise we can use a naive algorithm with $\mathcal{O}(n)$ space and $\mathcal{O}(n)$ time per character).

Generic ( $L_{p}$ ) distance for $0<p<1$. The $L_{p}$ prefix encodings of the blocks are computed using Lemma 2.18, using $\widetilde{\mathcal{O}}(t\cdot b)$ time per character. We store the resulting encodings for the next $\mathcal{O}(n/b)$ blocks, giving in total $\widetilde{\mathcal{O}}(\varepsilon^{-2}n/b)$ space. The $L_{p}$ suffix sketches of $P[1,n],P[2,n],\ldots,P[b,n]$ occupy $\widetilde{\mathcal{O}}(\varepsilon^{-2}b\log\sigma)$ space. The algorithm of Corollary 3.13 computes the $L_{p}$ suffix sketches for the text in $\widetilde{\mathcal{O}}(\varepsilon^{-2}(n/b)\log\sigma)$ space and $\widetilde{\mathcal{O}}(\varepsilon^{-2}n/b)$ time per character. For decoding the prefix encoding we use Lemma 2.19, spending $\widetilde{\mathcal{O}}(t\cdot b)$ time per character. We take $b=\sqrt{n}$, and substitute $t$ according to Theorem 2.13.

Generic ( $L_{p}$ ) distance for $1<p<2$. Note that for $\varepsilon<1/n$ we can use a naive algorithm that stores $S$ itself in $\mathcal{O}(n)$ space. The update takes constant time, and computing the $L_{p}$ norm takes $\mathcal{O}(n)$ time, which is better than the guarantees of the theorem for such values of $\varepsilon$. For $\varepsilon\geq 1/n$, the algorithm of Lemma 2.25 computes the $L_{p}$ prefix encodings of the blocks in $\widetilde{\mathcal{O}}(b+\varepsilon^{-2-p}\log^{2}\sigma)$ space and $\widetilde{\mathcal{O}}(b+\varepsilon^{-2}\log\sigma)$ time per character. The encodings occupy $\widetilde{\mathcal{O}}(\varepsilon^{-2-p}(n/b)\log^{2}\sigma)$ space. The $L_{p}$ suffix sketches of $P[1,n],P[2,n],\ldots,P[b,n]$ occupy $\widetilde{\mathcal{O}}(\varepsilon^{-2}b\log\sigma)$ space. The algorithm of Corollary 3.13 computes the $L_{p}$ suffix sketches for the text in $\widetilde{\mathcal{O}}(\varepsilon^{-2}(n/b)\log\sigma)$ space and $\widetilde{\mathcal{O}}(\varepsilon^{-2}n/b)$ time per character. Taking $b=\varepsilon^{-p/2}\sqrt{n}$ and assuming w.l.o.g. $\varepsilon^{-1}<\sqrt{n}$, we obtain the claim.
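
The choices of the block length $b$ in the cases above balance a term that grows with $b$ (the suffix sketches of the pattern) against a term that shrinks with $b$ (the stored prefix encodings). The sketch below checks this numerically under simplifying assumptions: logarithmic factors and the common $\varepsilon^{-2}$ factor are dropped, so the cost is modelled as $w\cdot n/b+b$ with $w=1$ for the cases using $b=\sqrt{n}$ and $w=\varepsilon^{-p}$ for $1<p<2$. The function name and the concrete values of $n$, $\varepsilon$ and $p$ are illustrative.

```python
import math

def best_block_length(n, weight=1.0):
    """Integer b in [1, n] minimising weight * n / b + b (logs and eps^-2 dropped)."""
    return min(range(1, n + 1), key=lambda b: weight * n / b + b)

# Cases with cost proportional to n/b + b: the minimiser is sqrt(n).
n = 10_000
b_hamming = best_block_length(n)

# Case 1 < p < 2 with cost proportional to eps^-p * n/b + b.
eps, p = 0.25, 1.5
n2 = 5_000
b_lp = best_block_length(n2, weight=eps ** (-p))
predicted = eps ** (-p / 2) * math.sqrt(n2)   # the value of b taken in the text
```

In both cases the brute-force minimiser agrees with the closed form obtained by equating the two terms, namely $b=\sqrt{wn}$.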

5 Conclusion

We pose several open questions. The first is whether the time complexity for $1/2<p<1$ can be improved so that it does not depend on $\sigma$. This would require a better technique than bounding the variance of the embedding into the Hamming distance: in our technique, the tail gets "too heavy". Another pressing question is whether for all values of $p>0$ we could improve upon $\sqrt{n}$ time per character. We also remark that it seems unlikely that an embedding into the Hamming space could be used to reduce the space complexity for $p>1$: $L_{p}^{p}$ does not admit the triangle inequality while the Hamming distance does, and the $L_{p}$ distance is not additive with respect to concatenation, while the Hamming distance is.

Bibliography


[1] Karl R. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039–1051, 1987.
[2] Dimitris Achlioptas. Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, 2003. doi:10.1016/S0022-0000(03)00025-4.
[3] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In RANDOM 2002, pages 1–10. doi:10.1007/3-540-45726-7_1.
[4] Vladimir Braverman, Ran Gelles, and Rafail Ostrovsky. How to catch $L_{2}$-heavy-hitters on sliding windows. Theor. Comput. Sci., 554:82–94, 2014.
[5] Vladimir Braverman and Rafail Ostrovsky. Smooth histograms for sliding windows. In FOCS 2007, pages 283–293.
[6] Vladimir Braverman and Rafail Ostrovsky. Effective computations on sliding windows. SIAM J. Comput., 39(6):2113–2131, 2010.
[7] Vladimir Braverman, Rafail Ostrovsky, and Alan Roytman. Zero-one laws for sliding windows and universal sketches. In APPROX-RANDOM 2015, pages 573–590.
[8] Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. Optimal sampling from sliding windows. J. Comput. Syst. Sci., 78(1):260–272, 2012.
