Constant factor approximations to edit distance on far input pairs in nearly linear time

Michal Koucký; Michael E. Saks

arXiv:1904.05459·cs.DS·May 10, 2019

TL;DR

This paper presents a randomized algorithm that approximates the edit distance between two strings in nearly linear time, providing constant factor guarantees for inputs with sufficiently large edit distance.

Contribution

It introduces a nearly linear time randomized algorithm that achieves a constant factor approximation of edit distance on far input pairs, that is, pairs whose edit distance is at least $n^{1-\zeta}$.

Findings

01. Runs in time $O(n^{1+1/T})$ for any $T \geq 1$

02. Provides a constant factor approximation when edit distance is large

03. Achieves high probability guarantees for the approximation

Abstract

For any $T \geq 1$, there are constants $R = R(T) \geq 1$ and $\zeta = \zeta(T) > 0$ and a randomized algorithm that takes as input an integer $n$ and two strings $x, y$ of length at most $n$, runs in time $O(n^{1 + \frac{1}{T}})$, and outputs an upper bound $U$ on the edit distance $ED(x, y)$ that, with high probability, satisfies $U \leq R(ED(x, y) + n^{1 - \zeta})$. In particular, on any input with $ED(x, y) \geq n^{1 - \zeta}$ the algorithm outputs a constant factor approximation with high probability. A similar result has been proven independently by Brakensiek and Rubinstein (2019).

Full text

Michal Koucký. Computer Science Institute of Charles University, Malostranské náměstí 25, 118 00 Praha 1, Czech Republic. The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013)/ERC Grant Agreement no. 616787. Partially supported by the Grant Agency of the Czech Republic under the grant agreement no. 19-27871X.

Michael Saks. Department of Mathematics, Rutgers University, Piscataway, NJ, USA. Supported in part by Simons Foundation under award 332622.

Abstract

For any $T\geq 1$, there are constants $R=R(T)\geq 1$ and $\zeta=\zeta(T)>0$ and a randomized algorithm that takes as input an integer $n$ and two strings $x,y$ of length at most $n$, runs in time $O(n^{1+\frac{1}{T}})$, and outputs an upper bound $U$ on the edit distance $d_{\textrm{edit}}(x,y)$ that, with high probability, satisfies $U\leq R(d_{\textrm{edit}}(x,y)+n^{1-\zeta})$. In particular, on any input with $d_{\textrm{edit}}(x,y)\geq n^{1-\zeta}$ the algorithm outputs a constant factor approximation with high probability. A similar result has been proven independently by Brakensiek and Rubinstein [14].

1 Introduction

The edit distance (or Levenshtein distance) [21] between strings $x,y$, denoted by $d_{\textrm{edit}}(x,y)$, is the minimum number of character insertions, deletions, and substitutions needed to convert $x$ into $y$. It was recently shown independently that edit distance can be approximated within a constant factor in truly subquadratic time in the quantum computation model [12, 13] and in the classical model [16, 17]. The running time of the classical algorithm obtained in [16, 17] is $\widetilde{O}(n^{12/7})$, which was improved by Andoni [4] to $O(n^{3/2+\epsilon})$.

This raises the natural question: what is the best possible running time of a classical constant factor approximation algorithm? We make progress on this problem by developing a nearly linear time algorithm that gives a constant factor approximation when restricted to inputs whose edit distance is not too small:

Theorem 1.1.

For every $T\geq 1$ there are constants $\zeta=\zeta(T)$ and $R=R(T)$ and a randomized algorithm $\textrm{\bf{FAST-ED-UB}}^{T}$ that takes as input an integer $n$ and two strings $x$ and $y$ , with $|x|,|y|\leq n$ , over an (arbitrary) alphabet $\Sigma$ , and runs in time $\widetilde{O}(n^{1+\frac{1}{T}})$ and outputs an upper bound $U$ on $d_{\textrm{edit}}(x,y)$ , such that with probability at least $1-1/n$ , $U\leq R\cdot(d_{\textrm{edit}}(x,y)+n^{1-\zeta})$ .

In particular, on any input $x,y$ with $d_{\textrm{edit}}(x,y)\geq n^{1-\zeta}$ the algorithm gives a constant factor approximation. The additive $n^{1-\zeta}$ term arises from some technical limitations in our algorithm and analysis, but since known algorithms for the exact edit distance problem run faster on instances $x,y$ with small edit distance ([20, 24]), we expect that it should be possible to extend our result to give a nearly linear time constant factor approximation algorithm for all ranges of edit distance.

Brakensiek and Rubinstein [14] independently obtained essentially the same theorem. While both our work and theirs build on the techniques of [16, 17, 12, 13], the algorithms have quite different structure.

Other prior work (quoted from [16]). Edit distance can be evaluated exactly in quadratic time via dynamic programming (Wagner and Fischer [25]). Masek and Paterson [22] obtained the first (slightly) sub-quadratic $O(n^{2}/\log n)$ time algorithm, and the current asymptotically fastest algorithm (Grabowski [19]) runs in time $O(n^{2}\log\log n/\log^{2}n)$. Backurs and Indyk [8] showed that a truly sub-quadratic algorithm ($O(n^{2-\delta})$ for some $\delta>0$) would imply a $2^{(1-\gamma)n}$ time algorithm for CNF-satisfiability, contradicting the Strong Exponential Time Hypothesis (SETH). Abboud et al. [3] showed that even shaving an arbitrarily large polylog factor from $n^{2}$ would have the plausible, but apparently hard-to-prove, consequence that NEXP does not have non-uniform ${NC}^{1}$ circuits. For further “barrier” results, see [2, 15].

There is a long line of work on approximating edit distance. The exact $O(n+k^{2})$ time algorithm (where $k$ is the edit distance of the input) of Landau et al. [20] yields a linear time $\sqrt{n}$ -factor approximation. This approximation factor was improved, first to $n^{3/7}$ [9], then to $n^{1/3+o(1)}$ [11] and later to $2^{\widetilde{O}(\sqrt{\log n})}$ [7], all with slightly superlinear runtime. Batu et al. [10] provided an $O(n^{1-\alpha})$ -approximation algorithm with runtime $O(n^{\max\{\frac{\alpha}{2},2\alpha-1\}})$ . The strongest result of this type is the $(\log n)^{O(1/\epsilon)}$ factor approximation (for every $\epsilon>0$ ) with running time $n^{1+\epsilon}$ of Andoni et al. [5]. Abboud and Backurs [1] showed that a truly sub-quadratic deterministic time $1+o(1)$ -factor approximation algorithm for edit distance would imply new circuit lower bounds.

Andoni and Nguyen [6] found a randomized algorithm that approximates Ulam distance of two permutations of $\{1,\ldots,n\}$ (edit distance with only insertions and deletions) within a (large) constant factor in time $\widetilde{O}(\sqrt{n}+n/k)$ , where $k$ is the Ulam distance of the input; this was improved by Naumovitz et al. [23] to a $(1+\varepsilon)$ -factor approximation (for any $\varepsilon>0$ ) with similar runtime.

Reduction to a Gap-Algorithm. For simplicity we will assume that the bound $n$ on the length $\max(|x|,|y|)$ is a power of 2 and $|x|=|y|=n$ . It is easy to reduce the general case to this case: on input $x^{\prime},y^{\prime}$ , let $n$ be the least power of 2 that is at least $\max(|x^{\prime}|,|y^{\prime}|)$ and pad both $x^{\prime}$ and $y^{\prime}$ using a single new symbol to obtain strings $x$ , $y$ of length $n$ . It is easy to verify that $d_{\textrm{edit}}(x^{\prime},y^{\prime})\leq d_{\textrm{edit}}(x,y)\leq 2d_{\textrm{edit}}(x^{\prime},y^{\prime})$ , and so it suffices to approximate $d_{\textrm{edit}}(x,y)$ .
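The padding step can be illustrated with a short sketch. It assumes Python strings as inputs and uses the hypothetical helper names `edit_distance` and `pad_to_power_of_two`; the padding character `"\0"` is an assumption and must not occur in the inputs. The quadratic dynamic program is only there to check the claimed inequality on a small example.

```python
def edit_distance(a, b):
    """Classical quadratic dynamic program for Levenshtein distance."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1,                           # delete a[i-1]
                         cur[j - 1] + 1,                        # insert b[j-1]
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute or match
        prev = cur
    return prev[-1]


def pad_to_power_of_two(x, y, pad="\0"):
    """Pad x and y with a fresh symbol to the least power of 2 >= max(|x|, |y|)."""
    longest = max(len(x), len(y), 1)
    n = 1
    while n < longest:
        n *= 2
    return n, x + pad * (n - len(x)), y + pad * (n - len(y))


n, xp, yp = pad_to_power_of_two("kitten", "sitting")
d_original = edit_distance("kitten", "sitting")
d_padded = edit_distance(xp, yp)
# The padded distance is sandwiched between d and 2d, as stated in the text.
assert d_original <= d_padded <= 2 * d_original
```

Here both padded strings have length 8, so any algorithm that assumes $|x|=|y|=n$ with $n$ a power of 2 can be run on them.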

Following a common paradigm for approximation algorithms, our approximation algorithm is built by reducing to a gap algorithm. In this paper, we consider randomized gap algorithms for edit distance. These algorithms take as input $(n,\theta,\delta;x,y)$ where $n$ is an integral power of 2, $x$ and $y$ are strings of length $n$ , $\theta\in(0,1]$ is a nonnegative power of 1/2 and $\delta\in(0,1)$ . The triple $(n,\theta,\delta)$ are referred to as the input parameters and $x,y$ as the input strings. We say that the algorithm has quality $Q$ with respect to $(n,\theta,\delta)$ provided that for all strings $x,y$ of length $n$ :

Gap Algorithm Soundness.

If $d_{\textrm{edit}}(x,y)>Q\theta n$ then the algorithm returns reject.

Gap Algorithm Completeness.

If $d_{\textrm{edit}}(x,y)\leq\theta n$ then the algorithm returns accept with probability at least $1-\delta$ .

We say that the algorithm satisfies $\textrm{gap-condition}(T,\zeta,Q)$, where $T,Q\geq 1$ and $\zeta\geq 0$, provided that for $n$ a power of 2, and for all $\theta\geq n^{-\zeta}$:

• The algorithm has quality $Q$ with respect to $(n,\theta,\delta)$.

• The running time of the algorithm on any input $(n,\theta,\delta;x,y)$ is $\widetilde{O}(n^{1+1/T}\log(1/\delta))$ with probability 1. Here $\widetilde{O}$ hides powers of $\log(n)$ whose exponent may depend on $T$.
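To make the gap-algorithm interface concrete, the sketch below uses the exact quadratic dynamic program as a deterministic gap algorithm of quality $Q=1$, then halves $\theta$ to turn accept/reject answers into an upper bound. The function names and the halving loop are illustrative assumptions; this is one natural reduction that follows from soundness, and not necessarily the construction given in Section 5.

```python
def edit_distance(a, b):
    """Classical quadratic dynamic program for Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[-1]


def exact_gap(n, theta, x, y):
    """Quality-1 gap algorithm: accept iff d_edit(x, y) <= theta * n."""
    return "accept" if edit_distance(x, y) <= theta * n else "reject"


def upper_bound_from_gap(gap, n, x, y, quality, min_theta):
    """Hypothetical reduction: try theta = 1, 1/2, 1/4, ... down to min_theta.

    Soundness says d_edit > quality * theta * n forces "reject", so every
    "accept" certifies d_edit <= quality * theta * n.
    """
    best = n            # trivial bound for two strings of length n
    theta = 1.0
    while theta >= min_theta:
        if gap(n, theta, x, y) == "accept":
            best = min(best, quality * theta * n)
        theta /= 2
    return best


x, y = "abcdefgh", "abcdxfgh"
bound = upper_bound_from_gap(exact_gap, 8, x, y, quality=1, min_theta=1 / 8)
assert bound >= edit_distance(x, y)
```

With a randomized gap algorithm of quality $Q$, the same loop would return a valid upper bound by soundness, while completeness controls how far above $d_{\textrm{edit}}(x,y)$ it can be.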

We will prove:

Theorem 1.2.

For every $T\geq 1$ there are constants $\zeta=\zeta(T)>0$ and $Q=Q(T)\geq 1$ , and a gap-algorithm $\textrm{\bf{GAP-ED}}^{T}$ that satisfies $\textrm{gap-condition}(T,\zeta,Q)$ .

In Section 5 we present the (routine) construction of the algorithm $\textrm{\bf{FAST-ED-UB}}^{T}$ from $\textrm{\bf{GAP-ED}}^{T}$ , which proves Theorem 1.1. The focus of the paper is on proving Theorem 1.2.

1.1 Speed-up routines

Our algorithm, like that of [16], is built from a core speed-up algorithm having access to an existing "slow" gap algorithm. The speed-up algorithm produces a faster gap algorithm, with worse (but still constant) approximation quality, while making queries to the slow algorithm on pairs of "short" substrings. Given such a speed-up algorithm, one can build up a sequence of increasingly faster gap algorithms $A_{0},A_{1},\ldots,$ where $A_{0}$ is just the quadratic exact edit distance algorithm, and $A_{j}$ is obtained by using the core speed-up algorithm with $A_{j-1}$ playing the role of the "slow" algorithm. If the core speed-up algorithm involves some free parameters that may be optimized for best performance, this optimization can be done separately for each $A_{j}$.

The core speed-up algorithm designed in [16] gives an algorithm $A_{1}$ that has running time $\widetilde{O}(n^{12/7})$. The algorithms $A_{j}$ are successively faster, but do not get below $n^{\phi}$ where $\phi=1.61\ldots$. The core speed-up algorithm we design in this paper gives a sequence of gap algorithms where the exponent of $n$ in the running time converges to 1.

2 Preliminaries

Many definitions and routine claims are adapted (with some modifications) from [17]. The edit distance of strings $u,v$ is denoted $d_{\textrm{edit}}(u,v)$ and the normalized edit distance of $u,v$ , denoted $\Delta_{\textrm{edit}}(u,v)$ is defined to be $d_{\textrm{edit}}(u,v)/|u|$ .

Throughout the paper $x$ , $y$ denote two input strings of length $n$ , where $n$ is a power of 2, and $z$ denotes the concatenation $xy$ .

Intervals, decompositions, aligned intervals, and $\delta$-aligned intervals. We consider intervals in $\{0,\ldots,2n\}$, which are, as usual, subsets consisting of consecutive integers. The width of interval $I$, $\mu(I)$, is equal to $\max(I)-\min(I)=|I|-1$. Most intervals we consider have width a power of 2. An interval of width $w$ is a $w$-interval. Intervals index substrings of $z$, where $z_{I}$ denotes the substring indexed by the set $I\setminus\{\min(I)\}$ (note that $z_{\min(I)}$ is not part of $z_{I}$). In particular, $z=z_{\{0,\ldots,2n\}}$, $x=z_{\{0,\ldots,n\}}$ and $y=z_{\{n,\ldots,2n\}}$.

A decomposition of an interval $I$ is a sequence $I_{1},\ldots,I_{k}$ of intervals with $\min(I_{1})=\min(I)$ , $\max(I_{k})=\max(I)$ and $\min(I_{j+1})=\max(I_{j})$ for $j\in\{1,\ldots,k-1\}$ . Note that $z_{I_{1}},\ldots,z_{I_{k}}$ partitions the string $z_{I}$ .

Let $w$ be a power of 2 that is at most $n$, and let $\delta$ be a power of 2 that is at most 1. An interval $I$ of width $w$ is aligned if $\min(I)$ is a multiple of $w$ (and consequently $\max(I)$ is also a multiple of $w$). The interval is $\delta$-aligned if $\min(I)$ is a multiple of $\max(\delta w,1)$ (and consequently so is $\max(I)$). In particular a 1-aligned interval is aligned. We define:

• $\textrm{Intervals}(w)$ is the set of aligned intervals of width $w$ that are subsets of $\{0,\dots,n\}$.

• $\textrm{Intervals}(w,\delta)$ is the set of $\delta$-aligned intervals of width $w$ that are subsets of $\{0,\dots,2n\}$.

• For an interval $I$, $\textrm{Intervals}(w;I)=\{I^{\prime}\in\textrm{Intervals}(w):I^{\prime}\subseteq I\}$, and $\textrm{Intervals}(w,\delta;I)=\{I^{\prime}\in\textrm{Intervals}(w,\delta):I^{\prime}\subseteq I\}$.

Since $n$ and $w$ are powers of 2, $\textrm{Intervals}(w)$ is a decomposition of $\{0,\ldots,n\}$ . When we use the notation $\textrm{Intervals}(w;I)$ , $I$ will be an aligned interval of width a power of 2, so that $\textrm{Intervals}(w;I)$ is a decomposition of $I$ .
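The interval families above can be enumerated directly. The following sketch represents an interval by the pair `(min, max)` and uses hypothetical function names; it assumes $n$, $w$ and $1/\delta$ are powers of 2, as in the text, and that $z$ is stored as a 0-indexed Python string.

```python
def aligned_intervals(n, w):
    """Intervals(w): aligned width-w intervals inside {0,...,n}, as (min, max)."""
    return [(a, a + w) for a in range(0, n, w)]


def delta_aligned_intervals(n, w, delta):
    """Intervals(w, delta): delta-aligned width-w intervals inside {0,...,2n}."""
    step = max(int(delta * w), 1)     # min(I) must be a multiple of max(delta*w, 1)
    return [(a, a + w) for a in range(0, 2 * n - w + 1, step)]


def substring(z, interval):
    """z_I: the characters indexed by I minus {min(I)} (1-indexed in the text)."""
    lo, hi = interval
    return z[lo:hi]


n = 8
x, y = "abcdefgh", "ijklmnop"
z = x + y
# Intervals(4) decomposes {0,...,8}: consecutive intervals share an endpoint.
decomposition = aligned_intervals(n, 4)
assert decomposition == [(0, 4), (4, 8)]
assert "".join(substring(z, I) for I in decomposition) == x
```

Smaller values of $\delta$ give a denser family: with $w=4$, there are 4 intervals for $\delta=1$ and 7 for $\delta=1/2$ inside $\{0,\ldots,16\}$.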

The grid $\{0,\ldots,n\}\times\{0,\ldots,2n\}$, boxes and stacks. Consider the grid $\{0,\ldots,n\}\times\{0,\ldots,2n\}$ lying in the coordinate plane. For $S\subseteq\{0,\ldots,n\}\times\{0,\ldots,2n\}$, the horizontal projection $\pi_{H}(S)$ is the set of first coordinates of elements of $S$, and the vertical projection of $S$, $\pi_{V}(S)$, is the set of second coordinates.

A box is a set $I\times J\subseteq\{0,\ldots,n\}\times\{0,\ldots,2n\}$ for intervals $I,J$ , and it represents the pair $x_{I},z_{J}$ of substrings. Since $I\subseteq\{0,\ldots,n\}$ , $z_{I}=x_{I}$ . Note that if $J\subseteq\{0,\ldots,n\}$ then $z_{I},z_{J}$ is a pair of substrings of $x$ and if $J\subseteq\{n,\ldots,2n\}$ , it is a pair (substring of $x$ , substring of $y$ ). $I\times J$ is a $w$ -box if $\mu(I)=\mu(J)=w$ . The lower left hand corner is $(\min(I),\min(J))$ and the upper right hand corner is $(\max(I),\max(J))$ . Note that $\pi_{H}(I\times J)=I$ and $\pi_{V}(I\times J)=J$ . Box $I\times J$ is horizontally aligned if $I$ is aligned, and it is vertically $\delta$ -aligned or simply $\delta$ -aligned if $J$ is $\delta$ -aligned; we have no need to refer to horizontally $\delta$ -aligned boxes. Box $I\times J$ is square if $\mu(I)=\mu(J)$ .

A stack is a set of boxes all having the same horizontal projection. For interval $I$ and set of intervals $\mathcal{J}$ , $I\times\mathcal{J}$ is the stack $\{I\times J:J\in\mathcal{J}\}$ .

Grid graphs. The grid graph of $z$ , $G_{z}$ , is a directed graph with edge costs, having vertex set $\{0,\ldots,n\}\times\{0,\ldots,2n\}$ and all edges of the form $(i-1,j)\to(i,j)$ ( $H$ -edges), $(i,j-1)\to(i,j)$ ( $V$ -edges) and $(i-1,j-1)\to(i,j)$ ( $D$ -edges). Every H-edge and V-edge costs 1, and a D-edge has cost 1 if $z_{i}\neq z_{j}$ and 0 otherwise. $G_{z}$ is acyclic, with edges moving "up and to the right". A directed path $\tau$ joins a pair of vertices ${\textrm{source}}(\tau)$ and ${\textrm{sink}}(\tau)$ with ${\textrm{source}}(\tau)\leq{\textrm{sink}}(\tau)$ . The box spanned by $\tau$ is the unique minimal box $I\times J$ that contains $\tau$ ; this is equal to $\pi_{H}(\tau)\times\pi_{V}(\tau)$ . We say $\tau$ traverses $I\times J$ if $I\times J$ is the box spanned by $\tau$ , which is equivalent to ${\textrm{source}}(\tau)=(\min(I),\min(J))$ and ${\textrm{sink}}(\tau)=(\max(I),\max(J))$ . A traversal of $I\times J$ is any path that traverses $I\times J$ .

For $I\subseteq\pi_{H}(\tau)$ , let $\tau_{I}$ denote the minimal subpath of $\tau$ whose horizontal projection is $I$ .

Cost and normalized cost. The cost of a directed path $\tau$ , $\textrm{cost}(\tau)$ is the sum of the edge costs, and the normalized cost is $\textrm{ncost}(\tau)=\frac{\textrm{cost}(\tau)}{\mu(\pi_{H}(\tau))}$ . The cost of box $I\times J$ , $\textrm{cost}(I\times J)$ , is the min-cost of a traversal of $I\times J$ and $\textrm{ncost}(I\times J)=\frac{1}{\mu(I)}\textrm{cost}(I\times J)$ .

It is well known (and easy to see) that for any box $I\times J$, a traversal of $I\times J$ corresponds to an alignment from $a=z_{I}$ to $b=z_{J}$, i.e. a set of character deletions, insertions and substitutions that changes $a$ to $b$, where an H-edge $(i-1,j)\to(i,j)$ corresponds to "delete $a_{i}$", a V-edge $(i,j-1)\to(i,j)$ corresponds to "insert $b_{j}$ between $a_{i}$ and $a_{i+1}$" and a D-edge $(i-1,j-1)\to(i,j)$ corresponds to "replace $a_{i}$ by $b_{j}$", unless they are already equal. Thus:

Proposition 2.1.

The cost of an alignment corresponding to path $\tau$ is $\textrm{cost}(\tau)$ . Thus for any $I,J\subseteq\{0,\ldots,2n\}$ , $d_{\textrm{edit}}(z_{I},z_{J})=\textrm{cost}(I\times J)$ . In particular $d_{\textrm{edit}}(x,y)=\textrm{cost}(\{0,\ldots,n\}\times\{n,\ldots,2n\})$ .
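The correspondence in Proposition 2.1 can be checked on small inputs with a direct shortest-path computation in the grid graph. The sketch below is a plain quadratic dynamic program over the vertices of a box, with the hypothetical name `box_cost`; intervals are `(min, max)` pairs and $z$ is a 0-indexed Python string, so the character $z_{i}$ of the text is `z[i - 1]`.

```python
def box_cost(z, I, J):
    """Minimum cost of a traversal of the box I x J in the grid graph G_z."""
    (i0, i1), (j0, j1) = I, J
    # col[j - j0] = cheapest path from (i0, j0) to (i, j) for the current i.
    col = [j - j0 for j in range(j0, j1 + 1)]      # column i0: V-edges only
    for i in range(i0 + 1, i1 + 1):
        new = [col[0] + 1]                          # H-edge into (i, j0)
        for j in range(j0 + 1, j1 + 1):
            diagonal = col[j - j0 - 1] + (z[i - 1] != z[j - 1])  # D-edge
            new.append(min(col[j - j0] + 1,         # H-edge from (i-1, j)
                           new[-1] + 1,             # V-edge from (i, j-1)
                           diagonal))
        col = new
    return col[-1]


x, y = "abcd", "abed"
n = len(x)
z = x + y
# cost({0..n} x {n..2n}) is the edit distance between x and y.
full = box_cost(z, (0, n), (n, 2 * n))
# A smaller box compares the substrings "bc" (of x) and "bed" (of y).
part = box_cost(z, (1, 3), (5, 8))
```

In this example `full` is 1 (one substitution) and `part` is 2 (one substitution and one insertion).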

Displacement of a box relative to a path or box. The following easy fact (noted in [16]) relates the cost of two boxes having the same horizontal projection:

Proposition 2.2.

For intervals $I,J,J^{\prime}\subseteq\{0,\ldots,n\}$ , $|\textrm{cost}(I\times J)-\textrm{cost}(I\times J^{\prime})|\leq|J\Delta J^{\prime}|$ , where $\Delta$ denotes symmetric difference.

Let $\tau$ be a path whose horizontal projection includes $I$ . The displacement of the square box $I\times J$ with respect to $\tau$ , $\textrm{disp}(I\times J,\tau)$ is the smallest $K$ such that $(\min(I),\min(J))$ is within $K$ vertical units of ${\textrm{source}}(\tau_{I})$ and $(\max(I),\max(J))$ is within $K$ vertical units of ${\textrm{sink}}(\tau_{I})$ .

We make a few easy observations.

Proposition 2.3.

Let $\tau$ be a path whose horizontal projection includes $I$ and let $I\times J$ be a box. Then $\textrm{cost}(I\times J)\leq\textrm{cost}(\tau_{I})+2\textrm{disp}(I\times J,\tau)$ .

Proof.

Let $J^{\prime}$ be the vertical projection of $\tau_{I}$ . Then: $\textrm{cost}(I\times J)\leq\textrm{cost}(I\times J^{\prime})+|J\Delta J^{\prime}|\leq\textrm{cost}(\tau_{I})+|J\Delta J^{\prime}|\leq\textrm{cost}(\tau_{I})+2\textrm{disp}(I\times J,\tau)$ . ∎

The following fact (which is essentially the same as Proposition 3.4 of [17]) says that every path $\tau$ with projection $I^{\prime}$ can be approximately covered by a $\delta$ -aligned box whose cost is close to $\textrm{cost}(\tau)$ and whose displacement from $\tau$ is small:

Proposition 2.4.

Let $I^{\prime}$ and $J$ be intervals and suppose $\delta\in(0,1]$ . Let $\tau$ be a path lying inside of $I^{\prime}\times J$ whose horizontal projection is $I^{\prime}$ . There is a $\delta$ -aligned interval $J^{\prime}$ of width $\mu(I^{\prime})$ so that $\textrm{disp}(I^{\prime}\times J^{\prime},\tau_{I^{\prime}})\leq\delta\mu(I^{\prime})+\textrm{cost}(\tau_{I^{\prime}})$ and $\textrm{ncost}(I^{\prime}\times J^{\prime})\leq 2\textrm{ncost}(\tau_{I^{\prime}})+\delta$ .

Proof.

Let $J$ be the vertical projection of $\tau_{I^{\prime}}$ . If $\mu(J)\geq\mu(I^{\prime})$ then let $\hat{J}$ be the interval of width $\mu(I^{\prime})$ with $\min(\hat{J})=\min(J)$ . Otherwise let $\hat{J}$ be any interval of width $\mu(I^{\prime})$ that contains $J$ .

The box $I^{\prime}\times\hat{J}$ has displacement at most $\textrm{cost}(\tau_{I^{\prime}})$ from $\tau_{I^{\prime}}$, and has cost at most $2\textrm{cost}(\tau_{I^{\prime}})$. Finally, let $J^{\prime}$ be obtained by shifting $\hat{J}$ up or down to the closest $\delta$-aligned interval. Since $\delta$-aligned intervals of width $\mu(I^{\prime})$ start at multiples of $\max(\delta\mu(I^{\prime}),1)$, this shift is at most $\delta\mu(I^{\prime})/2$ units. This increases both the displacement and the cost by at most $\delta\mu(I^{\prime})$. ∎

The diagonal of a square box $I\times J$ is the diagonal path joining $(\min(I),\min(J))$ to $(\max(I),\max(J))$ . Let $I\times J$ and $I^{\prime}\times J^{\prime}$ be square boxes with $I^{\prime}\subseteq I$ . The displacement of $I^{\prime}\times J^{\prime}$ with respect to $I\times J$ , $\textrm{disp}(I^{\prime}\times J^{\prime},I\times J)$ is the displacement of $I^{\prime}\times J^{\prime}$ with respect to the diagonal of $I\times J$ , which is just the number of vertical units one needs to shift $I^{\prime}\times J^{\prime}$ so that its diagonal is a subpath of the diagonal of $I\times J$ .

Proposition 2.5.

Suppose $\tau$ traverses the square box $I\times J$ of width $w$ . Then every point of $\tau$ is within vertical distance $\textrm{cost}(\tau)/2$ of the diagonal of $I\times J$ .

Proof.

Consider a point of $\tau$ expressed as $P=(\min(I)+u,\min(J)+v)$ . Then $\tau$ can be split into two parts $\tau_{1}$ , ending at $P$ and $\tau_{2}$ starting at $P$ . Then $\textrm{cost}(\tau)=\textrm{cost}(\tau_{1})+\textrm{cost}(\tau_{2})\geq 2|v-u|$ which is twice the vertical distance of $P$ to the diagonal of $I\times J$ . ∎

Weighted boxes and stacks, certified boxes and stacks, shortcut graphs.

A weighted box is a pair $(I\times J,\kappa)$ where $\kappa\geq 0$ . If $\textrm{ncost}(I\times J)\leq\kappa$ we say that $(I\times J,\kappa)$ is a certified box. A weighted stack $(I\times\mathcal{J},\kappa)$ is a pair where $I\times\mathcal{J}$ is a stack and $\kappa\geq 0$ . We associate $(I\times\mathcal{J},\kappa)$ with the set $\{(I\times J,\kappa):J\in\mathcal{J}\}$ . If every box in $(I\times\mathcal{J},\kappa)$ is certified, we call it a certified stack.

Let $\widetilde{G}$ be the digraph on $\{0,\ldots,n\}\times\{0,\ldots,2n\}$ with arc set $\{(i,j)\rightarrow(i^{\prime},j^{\prime}):i\leq i^{\prime},j\leq j^{\prime},(i,j)\neq(i^{\prime},j^{\prime})\}$. The edges with $i<i^{\prime}$ and $j<j^{\prime}$ are called shortcuts. Associated to any weighted box $(I\times J,\kappa)$ there is a weighted shortcut edge $(\min(I),\min(J))\rightarrow(\max(I),\max(J))$ with weight $\kappa\mu(I)$. Given a set $\mathcal{R}$ of weighted boxes, we define the weighted shortcut graph $\widetilde{G}(\mathcal{R})$ to be the weighted directed graph consisting of all H-edges and V-edges with weight 1, and all of the shortcut edges corresponding to the boxes in $\mathcal{R}$. For a box $I\times J$, let $\textrm{cost}_{\mathcal{R}}(I\times J)$ denote the minimum cost of a traversal of $I\times J$ in $\widetilde{G}(\mathcal{R})$.

If every box in $\mathcal{R}$ is certified we say that $\widetilde{G}(\mathcal{R})$ is a certified shortcut graph. A certified shortcut graph $\widetilde{G}(\mathcal{R})$ provides upper bounds on the edit distance. We omit the proof of the following easy fact:

Proposition 2.6.

Let $\mathcal{R}$ be a set of certified boxes. For any box $I\times J$ , $d_{\textrm{edit}}(z_{I},z_{J})\leq\textrm{cost}_{\mathcal{R}}(I\times J)$ .
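The quantity $\textrm{cost}_{\mathcal{R}}(I\times J)$ can be computed by a simple dynamic program over the vertices of the box. The sketch below is a naive version that visits every vertex, which is an illustrative assumption and not the efficient procedure used by the algorithm; the name `shortcut_cost` and the representation of a weighted box as `((minI, maxI), (minJ, maxJ), kappa)` are hypothetical.

```python
def shortcut_cost(I, J, boxes):
    """cost_R(I x J): cheapest traversal using unit H/V-edges and shortcut edges."""
    (i0, i1), (j0, j1) = I, J
    # Group shortcut edges by their endpoint (max(I'), max(J')).
    incoming = {}
    for (a0, a1), (b0, b1), kappa in boxes:
        if i0 <= a0 and a1 <= i1 and j0 <= b0 and b1 <= j1:
            weight = kappa * (a1 - a0)          # kappa * mu(I')
            incoming.setdefault((a1, b1), []).append(((a0, b0), weight))
    dist = {}
    for i in range(i0, i1 + 1):
        for j in range(j0, j1 + 1):
            if (i, j) == (i0, j0):
                dist[i, j] = 0
                continue
            best = float("inf")
            if i > i0:
                best = min(best, dist[i - 1, j] + 1)    # H-edge
            if j > j0:
                best = min(best, dist[i, j - 1] + 1)    # V-edge
            for source, weight in incoming.get((i, j), []):
                best = min(best, dist[source] + weight)  # shortcut edge
            dist[i, j] = best
    return dist[i1, j1]


I, J = (0, 4), (4, 8)
no_boxes = shortcut_cost(I, J, [])
two_boxes = shortcut_cost(I, J, [((0, 2), (4, 6), 0.0), ((2, 4), (6, 8), 0.5)])
shifted = shortcut_cost(I, J, [((0, 2), (5, 7), 0.0)])
```

Without boxes the only traversal uses 4 H-edges and 4 V-edges (cost 8). Two certified boxes along the diagonal give cost $0+0.5\cdot 2=1$, and a single box shifted up by one unit gives cost 4, since the path pays for the vertical shift and for the part of the box not covered.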

3 The core speed-up algorithm of [16]

As discussed in Section 1.1, the main ingredient in [16] is a core speed-up algorithm that has access to a slow edit distance approximation algorithm and uses it to build a faster approximation algorithm. We review the main ideas of the core speed-up algorithm in [16], which provides the starting point for ours. To simplify the description we assume that the slow edit distance algorithm is just the quadratic exact edit distance algorithm. In their work, they reduce to the case $\theta>n^{-1/5}$ and build a subquadratic time algorithm for the gap-problem where $\theta\geq n^{-1/5}$ . The algorithm operates in two phases. The discovery phase generates a set $\mathcal{Q}$ of certified boxes. In the shortest path phase the algorithm evaluates the cost of $(\{0,\ldots,n\}\times\{n,\ldots,2n\})$ in the shortcut graph $\widetilde{G}(\mathcal{R})$ where $\mathcal{R}$ is a set of certified boxes obtained by a minor modification of $\mathcal{Q}$ . Proposition 2.6 implies that this is an upper bound on $d_{\textrm{edit}}(x,y)$ . The main work is to define the discovery phase to ensure that this upper bound is not too much bigger than the true value. The shortest path phase is implemented by a straightforward variant of dynamic programming.

The discovery phase is defined in terms of parameters $w_{1}<d<w_{2}$ , which are powers of 2 that are, respectively, approximately $n^{1/7}$ , $n^{2/7}$ and $n^{3/7}$ . The set $\mathcal{Q}$ consists of certified $w_{1}$ -boxes and certified $w_{2}$ -boxes, and satisfies with high probability: for every horizontally aligned $w_{2}$ -box $I\times J$ , $\textrm{cost}_{\mathcal{R}}(I\times J)\leq C\cdot[\textrm{cost}(I\times J)+\theta w_{2}]$ for some constant $C$ . It is not difficult to show that this implies that the upper bound on $d_{\textrm{edit}}(x,y)$ output by the shortest path inference phase will be at most $C\cdot[d_{\textrm{edit}}(x,y)+\theta n]$ , which is enough to solve the gap-problem.

The algorithm generates boxes of width $w_{1}$ iteratively for $i=0,\ldots,\log(1/\theta)$, with ${\varepsilon(i)}=2^{-i}$. For each horizontally aligned $w_{1}$-interval $I$, let $\mathcal{N}_{{\varepsilon(i)}}(I)$ be the set of $w_{1}$-intervals $J$ that are ${\varepsilon(i+3)}$-aligned and satisfy $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$. Iteration $i$ starts by classifying each of the $n/w_{1}$ aligned $w_{1}$-intervals as dense or sparse, subject to the requirement that every $I$ with $|\mathcal{N}_{{\varepsilon(i)}}(I)|\geq 2d$ is classified as dense, and every $I$ with $|\mathcal{N}_{{\varepsilon(i)}}(I)|\leq d/2$ is classified as sparse. This classification of $I$ is done, with high probability, by sampling $J$ at a rate $\log(n)/d$ and calling $I$ dense (resp. sparse) if at least (resp. at most) $\log(n)$ of the sampled intervals are within normalized distance ${\varepsilon(i)}$ of $I$.

Next, for each dense interval $I$, a set $\mathcal{J}(I)$ of ${\varepsilon(i+3)}$-aligned $w_{1}$-intervals $J$ is constructed such that $\textrm{ncost}(I\times J)\leq 5{\varepsilon(i)}$ and $\mathcal{N}_{{\varepsilon(i)}}(I)\subseteq\mathcal{J}(I)$. For any given $I$ we can construct $\mathcal{J}(I)$ by computing its edit distance with every ${\varepsilon(i)}/8$-aligned interval, in time $O(nw_{1}/{\varepsilon(i)})$. If we do this for all $n/w_{1}$ aligned intervals the time is $\Theta(n^{2}/{\varepsilon(i)})$, but the restriction to dense intervals allows a savings of a factor of ${\varepsilon(i)}d$, as follows. Initialize $\mathcal{D}$ to be the set of dense aligned $w_{1}$-intervals. While $\mathcal{D}\neq\emptyset$, choose $I\in\mathcal{D}$ (the pivot for the current round), construct $\mathcal{X}=\mathcal{N}_{2{\varepsilon(i)}}(I)$ and $\mathcal{Y}=\mathcal{N}_{3{\varepsilon(i)}}(I)$, and certify all boxes $(I^{\prime}\times J^{\prime},5{\varepsilon(i)})$ for $I^{\prime}\in\mathcal{X}$ and $J^{\prime}\in\mathcal{Y}$. Delete $\mathcal{X}$ from $\mathcal{D}$ and continue. The number of pivots is thus only $O(n/(w_{1}{\varepsilon(i)}d))$, since the sets $\mathcal{N}_{{\varepsilon(i)}}(I)$ are of size at least $d$ and are disjoint for different pivots.
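The dense/sparse test described above can be sketched as a sampling routine. Everything below is an illustrative assumption: the name `classify_interval`, the predicate `is_close` (standing for $\textrm{ncost}(I\times J)\leq\varepsilon(i)$, which a real implementation would evaluate by an edit distance computation), and the use of the natural logarithm, since the text does not fix the base.

```python
import math
import random


def classify_interval(I, candidates, is_close, d, n, rng):
    """Call I dense if at least log(n) sampled candidates are close to it.

    Each candidate J is kept independently with probability log(n)/d, so an
    interval with about d close candidates yields about log(n) hits.
    """
    threshold = math.log(n)
    rate = min(1.0, threshold / d)
    hits = 0
    for J in candidates:
        if rng.random() < rate and is_close(I, J):
            hits += 1
    return "dense" if hits >= threshold else "sparse"


rng = random.Random(0)
candidates = list(range(10))
# With d = 2 and n = 16 the sampling rate is capped at 1, so every candidate
# is inspected and the outcome is deterministic.
many_close = classify_interval(0, candidates, lambda I, J: J < 5, 2, 16, rng)
few_close = classify_interval(0, candidates, lambda I, J: J < 1, 2, 16, rng)
```

The point of sampling is that the number of distance evaluations per interval drops from the number of candidates to roughly a $\log(n)/d$ fraction of it, at the price of a classification that is only guaranteed for intervals with at least $2d$ or at most $d/2$ close candidates.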

The rest of the discovery phase constructs a (relatively small) set of $w_{2}$-boxes. For each horizontally aligned $w_{2}$-interval $I^{\prime}$, the $w_{1}$-subintervals of $I^{\prime}$ that were declared sparse (over all iterations of $i$) are used to select a small subset $\mathcal{J}^{\prime}(I^{\prime})$ of the $w_{2}$-intervals, and we certify each box $I^{\prime}\times J^{\prime}$ for $J^{\prime}\in\mathcal{J}^{\prime}(I^{\prime})$ by computing its edit distance exactly. The set $\mathcal{J}^{\prime}(I^{\prime})$ is obtained as follows: For each $i\in\{0,\ldots,\log(1/\theta)\}$, select a $\textrm{polylog}(n)$-size subset $\mathcal{S}_{i}(I^{\prime})$ of the subintervals of $I^{\prime}$ that were declared sparse in iteration $i$, and for each $I^{\prime\prime}\in\mathcal{S}_{i}(I^{\prime})$ exactly compute $\textrm{cost}(I^{\prime\prime}\times J)$ for all ${\varepsilon(i+3)}$-aligned intervals $J$ to determine $\mathcal{N}_{{\varepsilon(i)}}(I^{\prime\prime})$ (which has size at most $2d$). For each box $I^{\prime\prime}\times J$ with $J\in\mathcal{N}_{{\varepsilon(i)}}(I^{\prime\prime})$, let $J^{\prime}$ be the unique $w_{2}$-interval such that the diagonal of $I^{\prime\prime}\times J$ is a subset of the diagonal of $I^{\prime}\times J^{\prime}$ and add $J^{\prime}$ to $\mathcal{J}^{\prime}(I^{\prime})$. The size of $\mathcal{J}^{\prime}(I^{\prime})$ is $\widetilde{O}(d)$ and so the total cost of evaluating the edit distance of boxes $I^{\prime}\times J^{\prime}$ for $I^{\prime}\in\textrm{Intervals}(w_{2};\{0,\ldots,n\})$ and $J^{\prime}\in\mathcal{J}^{\prime}(I^{\prime})$ is $\widetilde{O}(ndw_{2})$.

The parameters $w_{1},d,w_{2}$ are chosen to minimize the running time, which gives $\widetilde{O}(n^{12/7})$ . The key claim in [16] is that for every horizontally aligned $w_{2}$ -box $I\times J$ , the boxes from the discovery phase imply an upper bound on $\textrm{ncost}(I\times J)$ that is at most $C\cdot\textrm{ncost}(I\times J)+C^{\prime}\theta$ , which is sufficient for the shortcut phase to succeed. The claim is proved by showing that if the set of certified $w_{1}$ -boxes does not imply a sufficiently good upper bound on $\textrm{ncost}(I\times J)$ , then with high probability, one of the $w_{2}$ -boxes $I\times J^{\prime}$ constructed in the second part of the discovery phase is within a small vertical shift of $I\times J$ , and therefore can be used in the inference phase to imply a good upper bound on $\textrm{cost}(I\times J)$ .

4 The new core speed-up algorithm

The main new ingredient of the new core speed-up algorithm presented here is the replacement of the pair $w_{1}<w_{2}$ of widths from [16] by a hierarchy $w_{1}<\cdots<w_{k}$ of widths. While the idea of such an extension is natural, it is not a priori clear how to extend the ideas of [16] to such a hierarchy. Our new algorithm proceeds in $k$ iterations. During iteration $j$ the algorithm builds a data structure that supports approximate distance queries between substrings of width $w_{j}$ . Each successive data structure recursively uses the data structure from the previous iterations. Iteration $j$ is accomplished by a suitable variant of the algorithm from [16].

The algorithm of [16] splits neatly into a discovery phase and an inference phase. In the new algorithm, each iteration begins with an inference phase (using boxes discovered in the previous phase) followed by a discovery phase.

Here is our main speed-up theorem.

Theorem 4.1.

Suppose that SLOW-GED is a gap algorithm for edit distance satisfying $\textrm{gap-condition}(T^{\prime},\zeta^{\prime},Q^{\prime})$ where $T^{\prime}\geq 1$ , $\zeta^{\prime}>0$ and $Q^{\prime}\geq 1$ . There is an algorithm FAST-GED (using SLOW-GED as a subroutine) that satisfies $\textrm{gap-condition}(T,\zeta,Q)$ with $T=T^{\prime}+1/6$ where $\zeta>0$ and $Q\geq 1$ are suitably chosen (depending only on $T^{\prime}$ , $\zeta^{\prime}$ and $Q^{\prime}$ ).

Applying this theorem inductively with $A_{0}$ being the exact edit distance algorithm, we get a sequence of algorithms $A_{j}$ where $A_{j}$ satisfies $\textrm{gap-condition}(1+j/6,\zeta_{j},Q_{j})$ for suitable constants $\zeta_{j}>0$ and $Q_{j}$ , and taking $j=6(T-1)$ gives Theorem 1.2.

The proof of Theorem 4.1 is the heart of the paper. We describe the algorithm in the following order:

1. The parameters used by the algorithm (Section 4.1).

2. The overall architecture, including data objects, of the algorithm (Section 4.2).

3. Some basic functions used in the algorithm (Section 4.3).

4. The mechanics of the algorithm (Section 4.4).

5. The use of randomness in the algorithm (Section 4.5).

6. The properties enforced by the algorithm (Sections 4.6 and 4.7).

7. The proof that FAST-GED satisfies the gap-algorithm Soundness and Completeness requirements (Section 4.8).

8. The running time analysis in terms of the parameters (Section 4.9).

9. The choice of parameters that attain the run time claims for FAST-GED (Section 4.10).

10. Tying up the proof of Theorem 4.1 (Section 4.11).

4.1 The algorithm parameters

Recall that a gap-algorithm takes as input $(n,\theta,\delta;x,y)$ where $n$ is a power of 2 and $|x|=|y|=n$ , and $\theta\in(0,1]$ is a power of 1/2.

In our description of the algorithm, we fix the input parameter $\delta$ in the algorithm FAST-GED to $\delta=1/2$ . For $\delta<1/2$ , we execute the algorithm with $\delta=1/2$ independently $r=\lceil\log_{2}(1/\delta)\rceil$ times, and reject only if every run returns reject. This compound algorithm will reject every input $x,y$ such that $d_{\textrm{edit}}(x,y)\geq Q\theta n$ , since every run will reject. The probability that the compound algorithm incorrectly returns reject on input with $d_{\textrm{edit}}(x,y)\leq\theta n$ is at most $(1/2)^{r}\leq\delta$ , as required.
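The amplification step above can be sketched as follows. This is a minimal illustration, not the paper's code: `run_once` is a hypothetical zero-argument callable standing in for one execution of FAST-GED with $\delta=1/2$ , and the names `num_runs` and `amplified` are introduced here only for the example.

```python
import math


def num_runs(delta):
    """Number of independent runs r = ceil(log2(1/delta)), at least 1."""
    return max(1, math.ceil(math.log2(1.0 / delta)))


def amplified(run_once, delta):
    """Reject only if every one of the r runs rejects.

    run_once is a hypothetical stand-in for one run with delta = 1/2;
    it must return the string "accept" or "reject".
    """
    for _ in range(num_runs(delta)):
        if run_once() == "accept":
            # A single accepting run is enough to accept.
            return "accept"
    return "reject"
```

An input that every run rejects is still rejected, while a close input is wrongly rejected only if all $r$ runs err, which is the $(1/2)^{r}\leq\delta$ bound stated in the text.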

Second, we fix the value of $\delta$ for all calls of SLOW-GED within FAST-GED, to $\delta=n^{-12}$ where $n$ is the length of the global input to FAST-GED. Since the number of calls to SLOW-GED will be bounded above (easily) by $\widetilde{O}(n^{2})$ , a union bound implies that the probability that every call to SLOW-GED is correct is at least $1-n^{-8}$ .

The algorithm FAST-GED takes as input $n,\theta;x,y$ where $n$ is a power of 2, $x$ and $y$ are strings of length $n$ and the gap parameter $\theta\in(0,1]$ is a power of 1/2. The algorithm sets $z$ to be the concatenation of $xy$ and treats $z$ as a global variable.

The number of iterations (levels) of FAST-GED is a parameter $k$ . For each $j\in\{1,\ldots,k+1\}$ , there is a width parameter $w_{j}$ and for each $j\in\{0,\ldots,k\}$ , there is a density parameter $d_{j}$ . We denote by $\lfloor\cdot\rfloor_{2}$ the largest power of two that is at most the argument. These parameters are integer powers of 2 satisfying:

[TABLE]

Furthermore, for $1\leq j\leq k$ :

[TABLE]

These parameters will be chosen in Section 4.10 to optimize the time analysis. For now we note a technical assumption, that will be verified in Section 4.10, that is needed in the analysis. For $1\leq j\leq k$ :

[TABLE]

For each $j\in\{0,\ldots,k\}$ , there are quality parameters $q_{j}$ that satisfy the recurrence:

[TABLE]

The quality of the final approximation is $Q=2^{q_{k}+6}$ .

We also define, for integers $i$ , ${\varepsilon(i)}=2^{-i}$ . In most cases, $i\in\{0,\ldots,\log(1/\theta)\}$ so $1\geq{\varepsilon(i)}\geq\theta$ .
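As a small sketch of the notation, the helpers below compute ${\varepsilon(i)}$ , the rounding $\lfloor\cdot\rfloor_{2}$ , and the index range $\{0,\ldots,\log(1/\theta)\}$ . The function names are ours, exact rationals are used to avoid floating-point error, and $\theta$ is assumed to be passed as a `Fraction` that is a power of 1/2.

```python
from fractions import Fraction


def eps(i):
    """epsilon(i) = 2^{-i} for any integer i (negative i gives values above 1)."""
    return Fraction(2) ** (-i)


def floor_pow2(m):
    """Largest power of two that is at most the positive integer m."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    return 1 << (m.bit_length() - 1)


def scales(theta):
    """The indices i in {0, ..., log2(1/theta)} for theta a power of 1/2."""
    top = (1 / Fraction(theta)).numerator.bit_length() - 1
    return range(top + 1)
```

For every index produced by `scales(theta)` the value `eps(i)` lies between $\theta$ and 1, matching the remark above.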

There is a constant $c_{0}$ used in the definition of the procedure ProcessDense. (See Section 4.5.)

4.2 The architecture of the algorithm, and the neighborhood data structure

FAST-GED consists of $k$ iterations (levels), and a final post-processing step. During iteration $j$ , the algorithm examines pairs $\langle i;I\times J\rangle$ , called candidates, where $i\in\{0,\ldots,\log(1/\theta)\}$ , $I\in\textrm{Intervals}(w_{j})$ and $J\in\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ . (Hence, a candidate is any $\langle i;I\times J\rangle$ that satisfies some weak consistency requirements.) The pair $I\times J$ is called a level $j$ box and $\langle i;I\times J\rangle$ is a level $j$ candidate. Iteration $j$ implicitly classifies all level $j$ -candidates as close or far. This classification satisfies:

•

If $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ then $\langle i;I\times J\rangle$ is classified as close.

•

If $\textrm{ncost}(I\times J)>{\varepsilon(i-q_{j-1}-6)}$ then $\langle i;I\times J\rangle$ is classified as far.

If ${\varepsilon(i)}<\textrm{ncost}(I\times J)\leq{\varepsilon(i-q_{j-1}-6)}$ then $\langle i;I\times J\rangle$ may be classified as either close or far.
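The three cases of the classification requirement can be made concrete with the following sketch. It only restates the constraint: it returns the labels that a valid classification may assign, given the true normalized cost. The function name `permitted_labels` and the argument `q_prev` (standing for $q_{j-1}$ ) are our own naming.

```python
from fractions import Fraction


def eps(i):
    """epsilon(i) = 2^{-i}."""
    return Fraction(2) ** (-i)


def permitted_labels(ncost, i, q_prev):
    """Labels allowed for a level-j candidate <i; I x J> with the given ncost.

    q_prev plays the role of q_{j-1}.
    """
    if ncost <= eps(i):
        return {"close"}            # must be classified as close
    if ncost > eps(i - q_prev - 6):
        return {"far"}              # must be classified as far
    return {"close", "far"}         # intermediate range: either label is valid
```

For example, with $i=10$ and $q_{j-1}=1$ the thresholds are $2^{-10}$ and $2^{-3}$ , so a candidate with normalized cost $1/64$ may receive either label.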

This implicit classification is accomplished by a data structure, called the neighborhood data structure. The data structure implements a query EnumerateClose which takes as input $(j,I\times\mathcal{J},i)$ where:

•

$j\in\{1,\ldots,k\}$ is the level,

•

$I\times\mathcal{J}$ is a stack satisfying $I\in\textrm{Intervals}(w_{j})$ and $\mathcal{J}\subseteq\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ ,

•

$i\in\{0,\ldots,\log(1/\theta)\}$ ,

and returns the set of $J\in\mathcal{J}$ for which $\langle i;I\times J\rangle$ is close. In particular, $\textrm{EnumerateClose}(j,I\times\{J\},i)$ returns $\{J\}$ if $\langle i;I\times J\rangle$ is close and returns $\emptyset$ otherwise. The pair $\langle i;I\times\mathcal{J}\rangle$ is called a level $j$ candidate stack.

The queries with level parameter $j$ are the level $j$ queries. Initially the data structure is unable to answer any queries. During iteration $j$ the algorithm constructs the part of the data structure that determines the classification of level $j$ candidates as close or far, and thereby enables level $j$ queries.

At the start of iteration $j$ , queries up to level $j-1$ have been enabled. To enable $\textrm{EnumerateClose}(j,\cdot)$ the algorithm constructs families of sets for each $I\in\textrm{Intervals}(w_{j})$ and each $i\in\{0,\ldots,\log(1/\theta)\}$ as follows:

•

A subset of $\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ denoted ${\bf\mathcal{B}^{\textrm{below}}}(j,I,i)$ .

•

A subset of $\textrm{Intervals}(w_{j-1};I)$ denoted ${\bf\textrm{SparseSample}}(j,I,i)$ .

The query $\textrm{EnumerateClose}(j,\cdot)$ uses these sets, as well as calls to $\textrm{EnumerateClose}(j-1,\cdot)$ . Thus the level $j$ neighborhood data structure consists of all of the sets $\mathcal{B}^{\textrm{below}}(j^{\prime},\cdot)$ and $\textrm{SparseSample}(j^{\prime},\cdot)$ for $1\leq j^{\prime}\leq j$ .

During iteration $j$ , subroutines Preprocess and ProcessDense are called with parameter $j$ . The purpose of $\textrm{Preprocess}(j)$ is to create the sets $\mathcal{B}^{\textrm{below}}(j,\cdot)$ and $\textrm{SparseSample}(j,\cdot)$ . The construction of these sets involves some random choices, which affect the close/far classification; but once the choices are made the close/far classification is fixed. The creation of these sets activates $\textrm{EnumerateClose}(j,\cdot)$ . While the data structure grows during each iteration to enable higher level queries, once $\textrm{EnumerateClose}(j,\cdot)$ is enabled, the portion of the data structure used to handle level $j$ queries is static.

The other procedure in iteration $j$ of FAST-GED() is $\textrm{ProcessDense}(j)$ . $\textrm{ProcessDense}(j)$ creates the following sets for each $i\in\{0,\ldots,\log(1/\theta)\}$ :

•

$\textrm{Sparse}(j,i)\subseteq\textrm{Intervals}(w_{j})$ .

•

For each $I\not\in\textrm{Sparse}(j,i)$ , a subset of $\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ denoted $\mathcal{B}^{\textrm{dense}}(j,I,i)$ .

•

A set $\mathcal{R}(j)$ of weighted boxes (which we will prove are all certified).

The sets $\mathcal{B}^{\textrm{dense}}(j,\cdot)$ are local variables within $\textrm{ProcessDense}(j)$ , used to create $\mathcal{R}(j)$ .

The set $\mathcal{R}(j)$ and $\textrm{Sparse}(j,\cdot)$ are global variables but, with the exception of the final iteration $j=k$ , they are used only in $\textrm{Preprocess}(j+1)$ , and then never used again. Following iteration $k$ , the set $\mathcal{R}(k)$ is used in the post-processing step to generate the final output which is $\textrm{cost}_{R(k)}(\{0,\ldots,n\}\times\{n,\ldots,2n\})$ .

4.3 Elementary primitives

We describe some elementary functions used within the algorithm.

The function Round. $\textrm{Round}(J,\epsilon)$ where $J$ is an interval and $\epsilon\leq 1$ is a power of 2, is equal to the $\epsilon$ -aligned interval $J^{\prime}$ of width $\mu(J)$ obtained by shifting $J$ down (decreasing its two endpoints) at most $\epsilon\mu(J)-1$ units.
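A possible rendering of Round is sketched below. The definition of alignment comes from Section 2 and is not restated here, so the sketch assumes that an $\epsilon$ -aligned interval of width $w$ is one whose left endpoint is a multiple of $\max(1,\epsilon w)$ ; intervals are represented as pairs `(lo, hi)` with width `hi - lo` . Both choices are assumptions of this example.

```python
from fractions import Fraction


def round_down(J, eps):
    """Shift J = (lo, hi) down to the nearest eps-aligned interval of the same width.

    Assumption: eps-aligned means lo is a multiple of max(1, eps * width).
    """
    lo, hi = J
    width = hi - lo
    step = max(1, int(eps * width))
    shift = lo % step                # between 0 and step - 1
    return (lo - shift, hi - shift)
```

Under this reading the shift is at most $\epsilon\mu(J)-1$ , as in the definition above.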

The function ZoomIn. Recall the definition of displacement in Section 2. The function ZoomIn takes as input a box $I\times J$ , and a subinterval $I^{\prime}$ of $I$ and some additional parameters, and outputs a set of suitably aligned intervals $J^{\prime}$ of width $\mu(I^{\prime})$ so that each box $I^{\prime}\times J^{\prime}$ has small displacement from $I\times J$ . More precisely, for a box $I\times J$ , a subinterval $I^{\prime}\subseteq I$ , and $0\leq i^{\prime}\leq i\leq\log(1/\theta)$ , $\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$ is the set of all ${\varepsilon(i^{\prime}+3)}$ -aligned intervals $J^{\prime}\subseteq J$ of width $\mu(I^{\prime})$ , for which the displacement of $I^{\prime}\times J^{\prime}$ from $I\times J$ is at most $2{\varepsilon(i)}\mu(I)$ .

Proposition 4.2.

Let $I$ be an interval of width $w$ and $I^{\prime}\subseteq I$ of width $w^{\prime}$ a divisor of $w$ . Let $i^{\prime}\leq i\in\{0,\ldots,\log(1/\theta)\}$ .

1. For $J$ of width $w$ , the set $\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$ has size at most $1+32{\varepsilon(i-i^{\prime})}w/w^{\prime}$ .

2. Let $I^{\prime}\times J^{\prime}$ be a box. The number of ${\varepsilon(i+3)}$ -aligned width- $w$ intervals $J$ such that $J^{\prime}\in\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$ is at most 33.

Proof.

Set $\Delta=\min(I^{\prime})-\min(I)$ . If $J^{\prime}\in\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$ then $|\min(J^{\prime})-\Delta-\min(J)|\leq 2{\varepsilon(i)}w$ .

Proof of (1). Holding $J$ fixed, we have $\min(J^{\prime})\in[\min(J)+\Delta-2{\varepsilon(i)}w,\min(J)+\Delta+2{\varepsilon(i)}w]$ . This is an interval of width $4{\varepsilon(i)}w$ , and the number of ${\varepsilon(i^{\prime}+3)}$ -aligned intervals of width $w^{\prime}$ that start in this interval is at most $1+32{\varepsilon(i-i^{\prime})}w/w^{\prime}$ .

Proof of (2). Holding $J^{\prime}$ fixed, we have $\min(J)\in[\min(J^{\prime})-\Delta-2{\varepsilon(i)}w,\min(J^{\prime})-\Delta+2{\varepsilon(i)}w]$ . This is an interval of width $4{\varepsilon(i)}w$ , and the number of ${\varepsilon(i+3)}$ -aligned intervals of width $w$ that start in this interval is at most 33. ∎
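The two counts in the proof can be checked by brute force on small parameters. The sketch below assumes, as in the proof, that aligned intervals start at multiples of $\epsilon$ times their width, and it requires parameters for which the window length and both step sizes are positive integers. The helper names are ours.

```python
from fractions import Fraction


def eps(i):
    """epsilon(i) = 2^{-i}."""
    return Fraction(2) ** (-i)


def max_starts(window, step):
    """Largest number of multiples of `step` in a closed integer window of
    length `window`, maximized over the position of the window."""
    return max(
        sum(1 for s in range(a, a + window + 1) if s % step == 0)
        for a in range(step)
    )


def check_prop_4_2(w, w_sub, i, i_sub):
    """Return (part 1 holds, part 2 holds) for widths w, w_sub and scales i_sub <= i."""
    window = int(4 * eps(i) * w)               # the window of width 4 eps(i) w
    step1 = int(eps(i_sub + 3) * w_sub)        # spacing of candidate J' starts
    step2 = int(eps(i + 3) * w)                # spacing of candidate J starts
    bound1 = 1 + 32 * eps(i - i_sub) * Fraction(w, w_sub)
    return max_starts(window, step1) <= bound1, max_starts(window, step2) <= 33
```

With $w=1024$ , $w^{\prime}=64$ , $i=3$ and $i^{\prime}=1$ the window has length 512 and the two spacings are 4 and 16, so the maximal counts are 129 and 33, which meet both bounds with equality.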

Calling $\textrm{ZoomIn}(j,I\times\mathcal{J},i,I^{\prime},i^{\prime})$ with a stack $I\times\mathcal{J}$ returns the union of results $\bigcup_{J\in\mathcal{J}}\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$ .

The function InducedBoxes. This is a function that takes as input a set of weighted square boxes $\mathcal{Q}$ and outputs a collection of weighted boxes induced by $\mathcal{Q}$ . For an interval $J$ , and $t\leq\mu(J)/2$ let $J/[t]$ denote the interval $[\min(J)+t,\max(J)-t]$ . For each $(I\times J,\kappa)$ in $\mathcal{Q}$ , $\textrm{InducedBoxes}(\mathcal{Q})$ includes $(I\times J,\kappa)$ together with boxes of the form $(I\times J/[2^{i}],\kappa+\frac{2^{i+1}}{\mu(I)})$ for $i\in\{0,\ldots,\log(\mu(J))-1\}$ .

Proposition 4.3.

If all boxes of $\mathcal{Q}$ are certified boxes then so are all boxes of $\textrm{InducedBoxes}(\mathcal{Q})$ .

Proof.

Note that $|J\Delta(J/[2^{i}])|=2^{i+1}$ and apply Proposition 2.2. ∎
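The definition of InducedBoxes translates directly into a short routine. In this sketch intervals are pairs `(lo, hi)` with $\mu$ equal to `hi - lo` , a weighted box is a triple `(I, J, kappa)` , and widths are assumed to be powers of two; this representation is our choice for illustration.

```python
from fractions import Fraction


def mu(interval):
    """Width of an interval given as (lo, hi)."""
    lo, hi = interval
    return hi - lo


def shrink(J, t):
    """J/[t] = [min(J) + t, max(J) - t]."""
    lo, hi = J
    return (lo + t, hi - t)


def induced_boxes(Q):
    """Each (I, J, kappa) in Q, plus (I, J/[2^i], kappa + 2^{i+1}/mu(I))
    for i = 0, ..., log2(mu(J)) - 1."""
    out = []
    for I, J, kappa in Q:
        out.append((I, J, kappa))
        levels = mu(J).bit_length() - 1      # log2(mu(J)) for a power of two
        for i in range(levels):
            penalty = Fraction(2 ** (i + 1), mu(I))
            out.append((I, shrink(J, 2 ** i), kappa + penalty))
    return out
```

For a single box with $I=[0,8]$ , $J=[16,24]$ and $\kappa=1/4$ , the routine returns four boxes with weights $1/4$ , $1/2$ , $3/4$ and $5/4$ .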

The function APM (Approximate pattern match). Recall from Section 2 that $\textrm{cost}_{\mathcal{R}}(I\times J)$ is the length of the min-cost traversal of $I\times J$ in the shortcut graph $\widetilde{G}(\mathcal{R})$ . APM takes as input a stack $I\times\mathcal{J}$ , $\kappa>0$ and a set $\mathcal{R}$ of certified boxes, and outputs a subset $\mathcal{S}$ of $\mathcal{J}$ that satisfies:

Completeness of APM.

For all $J\in\mathcal{J}$ satisfying $\textrm{cost}_{\mathcal{R}}(I\times J)\leq\kappa\mu(I)$ , $J\in\mathcal{S}$ .

Soundness of APM.

For all $J\in\mathcal{J}$ satisfying $\textrm{cost}(I\times J)>2\kappa\mu(I)$ , $J\not\in\mathcal{S}$ .

The running time is $\widetilde{O}(\mu(I)+|\mathcal{J}|+|\mathcal{R}|)$ . (Notice the subtle distinction between $\textrm{cost}_{\mathcal{R}}$ and cost in Soundness and Completeness.) The implementation, described in Section 6, is a customized variant of dynamic programming that closely follows [17, 18].

4.4 The mechanics of the algorithm

We are now ready to present the pseudocode for FAST-GED and the three main subroutines: Preprocess, ProcessDense, and EnumerateClose.

The algorithm FAST-GED. This algorithm takes as input an integer $n$ which is a power of $2$ , $\theta\in(0,1]$ a power of 1/2, and two strings $x,y$ of length $n$ , and returns accept or reject. (Recall that the error parameter $\delta$ is fixed to 1/2.) The algorithm consists of iterations indexed by $j\in\{1,\ldots,k\}$ . $\textrm{Preprocess}(j)$ creates the sets $\mathcal{B}^{\textrm{below}}(j,I,i)$ and $\textrm{SparseSample}(j,I,i)$ that enable the level $j$ queries $\textrm{EnumerateClose}(j,\cdot)$ . $\textrm{ProcessDense}(j)$ creates sets $\mathcal{R}(j)$ and $\textrm{Sparse}(j,i)$ needed for $\textrm{Preprocess}(j+1)$ .

The subroutine Preprocess. On input $j$ , the sets $\textrm{Sparse}(j-1,i)$ and $\mathcal{R}(j-1)$ created by $\textrm{ProcessDense}(j-1)$ are used to produce the sets $\mathcal{B}^{\textrm{below}}(j,I,i)$ and $\textrm{SparseSample}(j,I,i)$ for $I\in\textrm{Intervals}(w_{j})$ and $i\in\{0,\ldots,\log(1/\theta)\}$ . To begin, the set of weighted $w_{j-1}$ -boxes $\mathcal{R}(j-1)$ is partitioned into sets $\mathcal{R}(j-1,I)$ , with $I^{\prime}\times J^{\prime}$ assigned to $\mathcal{R}(j-1,I)$ for $I^{\prime}\subseteq I$ . For each $i$ and $I$ :

1. The set $\textrm{Sparse}(j-1,i)\subseteq\textrm{Intervals}(w_{j-1})$ was produced by $\textrm{ProcessDense}(j-1)$ . $\textrm{SparseSample}(j,I,i)=\emptyset$ if $\textrm{Sparse}(j-1,i)$ contains no subintervals of $I$ , and otherwise is an independent random sample (multiset) of size $\log(n)^{\theta(1)}$ selected from the subintervals of $I$ belonging to $\textrm{Sparse}(j-1,i)$ .

2. Run APM with input stack $I\times\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ and $\mathcal{R}(j-1,I)$ to determine the set of intervals $J\in\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ that are suitably close to $I$ in the shortcut graph $\widetilde{G}(\mathcal{R}(j-1,I))$ .

The subroutine EnumerateClose. The creation of $\mathcal{B}^{\textrm{below}}(j,I,i)$ and $\textrm{SparseSample}(j,I,i)$ by Preprocess enables the query $\textrm{EnumerateClose}(j,\cdot)$ , which implicitly classifies all level $j$ candidates $\langle i;I\times J\rangle$ as close or far subject to:

Completeness of EnumerateClose.

If $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ then with high probability $\langle i;I\times J\rangle$ is close.

Soundness of EnumerateClose.

If $\textrm{ncost}(I\times J)>{\varepsilon(i-q_{j-1}-6)}$ then $\langle i;I\times J\rangle$ is far.

$\textrm{EnumerateClose}(j,\cdot)$ takes a stack $I\times\mathcal{J}$ and $i\in\{0,\ldots,\log(1/\theta)\}$ with $I\in\textrm{Intervals}(w_{j})$ and $\mathcal{J}\subseteq\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ and returns $\{J\in\mathcal{J}:\langle i;I\times J\rangle\text{ is close}\}$ . $\mathcal{S}$ accumulates the set of intervals to be output. For $j=1$ , $\textrm{SLOW-GED}(z_{I},z_{J},\epsilon)$ is run for each $J\in\mathcal{J}$ and $\mathcal{S}$ is the set of accepted $J$ . For $j>1$ , $\mathcal{S}$ is the union of two sets. The first is $\mathcal{B}^{\textrm{below}}(j,I,i)\cap\mathcal{J}$ found by $\textrm{Preprocess}(j)$ . The second is obtained by identifying (as described below) a small subset $\mathcal{K}\subseteq\mathcal{J}$ , testing each $J\in\mathcal{K}$ using SLOW-GED, and adding $J$ to $\mathcal{S}$ if $z_{J}$ is suitably close to $z_{I}$ . To identify $\mathcal{K}$ , for each $(I^{\prime},i^{\prime})\in\textrm{SparseSample}(j,I,i)\times\{0,\ldots,i\}$ use ZoomIn to identify the set $\mathcal{J}^{\prime}$ of $J^{\prime}\in\textrm{Intervals}(w_{j-1},{\varepsilon(i^{\prime}+3)})$ such that $I^{\prime}\times J^{\prime}$ has displacement at most $2{\varepsilon(i)}\mu(I)$ from $I\times J$ . Recursively use $\textrm{EnumerateClose}(j-1,I^{\prime}\times\mathcal{J}^{\prime},i^{\prime})$ to select $\mathcal{S}^{\prime}=\{J^{\prime}\in\mathcal{J}^{\prime}:\langle i^{\prime};I^{\prime}\times J^{\prime}\rangle\text{ is }{\bf\textsc{close}}\}$ . $\mathcal{K}$ consists of those $J$ for which $I\times J$ has small displacement from $I^{\prime}\times J^{\prime}$ for some $J^{\prime}\in\mathcal{S}^{\prime}$ .

The loops on $i^{\prime},I^{\prime}$ (lines 11-21) produce $\mathcal{K}\subseteq\mathcal{J}$ . For each $J\in\mathcal{K}$ , SLOW-GED is run on $z_{I},z_{J}$ . The loop on $I^{\prime}$ is over $\textrm{SparseSample}(j,I,i^{\prime})$ . The subset $\mathcal{K}$ of $\mathcal{J}$ depends on the random sample $\textrm{SparseSample}(j,I,i^{\prime})$ of $\textrm{Sparse}(j-1,i^{\prime})\cap\textrm{Intervals}(w_{j-1};I)$ . The following definitions highlight this dependence.

•

For $\langle i;I\times J\rangle$ , let $I^{\prime}\in\textrm{Intervals}(w_{j-1};I)$ and $i^{\prime}\in\{0,\ldots,i\}$ . The pair $(I^{\prime},i^{\prime})$ is a marker for the candidate $\langle i;I\times J\rangle$ if $I^{\prime}\in\textrm{Sparse}(j-1,i^{\prime})$ and there is some $J^{\prime}\in\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$ such that $\langle i^{\prime};I^{\prime}\times J^{\prime}\rangle$ is classified as close. (We call it a marker as in genomics, where a short DNA sequence identifies a gene. Similarly here, a marker for $z_{I}$ is its substring $z_{I^{\prime}}$ which is relatively rare in $z$ , i.e., $I^{\prime}$ belongs to $\textrm{Sparse}(j-1,i^{\prime})$ .) When lines (13-18) are executed for a marker $(I^{\prime},i^{\prime})$ , $J$ is added to $\mathcal{K}$ in line 17. Ideally, $\mathcal{K}$ will consist of all intervals $J$ identifiable by their markers.

•

$\mathcal{M}(j,I\times J,i,i^{\prime})=\{I^{\prime}\in\textrm{Sparse}(j-1,i^{\prime})\cap\textrm{Intervals}(w_{j-1};I):(I^{\prime},i^{\prime})\text{ is a marker for }\langle i;I\times J\rangle\}$ . We will be interested in situations when for some $i^{\prime}\leq i$ there are many markers, namely, $|\mathcal{M}(j,I\times J,i,i^{\prime})|\geq\frac{1}{3}|\textrm{Sparse}(j-1,i^{\prime})\cap\textrm{Intervals}(w_{j-1};I)|$ , so that with high probability $\textrm{SparseSample}(j,I,i^{\prime})$ will contain a marker that will identify $J$ .

The procedure ProcessDense. This takes as input a level number $j$ . The procedure corresponds closely to the procedure Dense Strip Removal in [17].

For each $i\in\{0,\ldots,\log(1/\theta)\}$ the procedure builds a set $\textrm{Sparse}(j,i)\subseteq\textrm{Intervals}(w_{j})$ and also builds sets $\mathcal{B}^{\textrm{dense}}(j,I,i)\subseteq\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ for every $I\in\textrm{Intervals}(w_{j})\setminus\textrm{Sparse}(j,i)$ . This is done by processing the intervals of $\textrm{Intervals}(w_{j})$ ; when interval $I$ is processed it is either assigned to $\textrm{Sparse}(j,i)$ or the set $\mathcal{B}^{\textrm{dense}}(j,I,i)$ is constructed. We keep track of a subset $\mathcal{T}\subseteq\textrm{Intervals}(w_{j})$ of unprocessed intervals. This set is initialized to $\textrm{Intervals}(w_{j})$ and the iteration ends when $\mathcal{T}=\emptyset$ . We proceed in rounds. In a round we select an arbitrary $I$ from $\mathcal{T}$ . We perform a test (see "Testing potential pivots in ProcessDense" in Section 4.5) to decide whether to put it in $\textrm{Sparse}(j,i)$ . If $I$ is not placed in $\textrm{Sparse}(j,i)$ then $I$ is designated the pivot for that round. We then call EnumerateClose on the stack $I\times\mathcal{T}$ (with suitable parameters) to determine the subset $\mathcal{X}$ of $\textrm{Intervals}(w_{j})$ . Next we call EnumerateClose on the stack $I\times\textrm{Intervals}(w_{j},\kappa)$ (for a suitable $\kappa\geq{\varepsilon(i)}$ ) to determine $\mathcal{Y}^{\prime}\subseteq\textrm{Intervals}(w_{j},\kappa)$ , and we let $\mathcal{Y}$ be the set of intervals from $\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ which round to an interval in $\mathcal{Y}^{\prime}$ . We then define $\mathcal{B}^{\textrm{dense}}(j,I^{\prime},i)=\mathcal{Y}$ for all $I^{\prime}\in\mathcal{X}$ , and remove $\mathcal{X}$ from $\mathcal{T}$ , to complete the round.

The parameters used in the above calls are expressed in terms of $h_{1}$ and $h_{2}$ introduced in the pseudocode. The particular choice of $h_{1}$ and $h_{2}$ is motivated by both the correctness analysis and the time analysis (Section 4.9).
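The round structure of ProcessDense for a fixed $i$ can be sketched as below. This is only an outline of the control flow described above, not the paper's pseudocode: the three callables `looks_sparse` , `close_sources` and `close_targets` are hypothetical stand-ins for the pivot test and the two EnumerateClose calls (whose actual parameters involve $h_{1}$ and $h_{2}$ ), and we assume that the pivot always belongs to its own set $\mathcal{X}$ so that every round makes progress.

```python
def process_dense_rounds(intervals, looks_sparse, close_sources, close_targets):
    """Outline of the rounds of ProcessDense for one value of i.

    looks_sparse(I)        -> bool: hypothetical stand-in for the pivot test.
    close_sources(I, T)    -> iterable: stand-in for the call that yields X within T.
    close_targets(I)       -> iterable: stand-in for the call that yields Y.
    """
    unprocessed = set(intervals)     # the set T of the text
    sparse = set()                   # plays the role of Sparse(j, i)
    b_dense = {}                     # maps I' to B^dense(j, I', i)
    while unprocessed:
        I = min(unprocessed)         # any element may be chosen; min is reproducible
        if looks_sparse(I):
            sparse.add(I)
            unprocessed.discard(I)
            continue
        # I is the pivot of this round (assumption: the pivot is in its own X).
        X = set(close_sources(I, frozenset(unprocessed))) | {I}
        Y = frozenset(close_targets(I))
        for I_prime in X:
            b_dense[I_prime] = Y
        unprocessed -= X
    return sparse, b_dense
```

Each interval ends up either in `sparse` or as a key of `b_dense` , mirroring the dichotomy between $\textrm{Sparse}(j,i)$ and the sets $\mathcal{B}^{\textrm{dense}}(j,I,i)$ .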

In the sequel, we will need the following definition and observation.

Approved Candidate. A candidate $\langle i;I\times J\rangle$ is said to be approved if $I\not\in\textrm{Sparse}(j,i)$ and $J\in\mathcal{B}^{\textrm{dense}}(j,I,i)$ . Note that the boxes in $\mathcal{Q}(j)$ are in one-to-one correspondence with the approved candidates, with $(I\times J,{\varepsilon(i-q_{j})})\in\mathcal{Q}(j)$ if and only if $\langle i;I\times J\rangle$ is approved. All candidates of the form $\langle i;I\times J\rangle$ are approved for $i\leq q_{j}$ .

Proposition 4.4.

At level $k$ , the sets $\textrm{Sparse}(k,i)$ are empty for all $i\in\{0,\dots,\log(1/\theta)\}$ .

Proof.

Since $d_{k}=1$ , the set $\mathcal{S}$ created in line (10) is all of $\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ which, in particular, includes $I$ . The set returned by EnumerateClose in line (11) includes $I$ , so the if condition fails and $I$ is not added to $\textrm{Sparse}(k,i)$ . ∎

4.5 The use of randomization

Randomization is used in three parts of the algorithm: in the subroutine SLOW-GED, in the construction of SparseSample during Preprocess, and in ProcessDense each time we test a selected $I\in\mathcal{T}$ to decide whether it is a pivot. We discuss each of these uses below.

The subroutine SLOW-GED. SLOW-GED takes calling parameters $(n^{\prime},\theta^{\prime},\delta^{\prime};x^{\prime},y^{\prime})$ . By our assumption $\delta^{\prime}$ is fixed to $n^{-12}$ for all calls. The gap-soundness and completeness conditions for SLOW-GED guarantee that if $\Delta_{\textrm{edit}}(x^{\prime},y^{\prime})>Q^{\prime}\theta^{\prime}n^{\prime}$ then SLOW-GED returns reject, and if $\Delta_{\textrm{edit}}(x^{\prime},y^{\prime})\leq\theta^{\prime}n^{\prime}$ then SLOW-GED returns accept with probability at least $1-n^{-12}$ . Say that an execution of $\textrm{SLOW-GED}(n^{\prime},\theta^{\prime},n^{-12};x^{\prime},y^{\prime})$ fails if $\Delta_{\textrm{edit}}(x^{\prime},y^{\prime})\leq\theta^{\prime}n^{\prime}$ and SLOW-GED returns reject. We will introduce an event $\mathbf{SG}$ that no call to SLOW-GED fails.

To simplify the analysis, we make the following assumption: when we run FAST-GED we pregenerate a single string $B_{\mathrm{SG}}$ of $b$ random bits where $b$ is an upper bound on the number of random bits used in any call to SLOW-GED. In every call to SLOW-GED we use (a prefix of) $B_{\mathrm{SG}}$ to provide the random bits for the call. This makes all calls to SLOW-GED deterministic, and also ensures that if the algorithm makes multiple calls to SLOW-GED with the same input parameters then all such calls yield the same output.

Reusing random bits for different calls of SLOW-GED makes these calls dependent, but this is irrelevant to the analysis. The proof of correctness relies only on the fact that the event $\mathbf{SG}$ holds.

We now upper bound the probability that there is a call that does not succeed. Every possible input tuple $(n^{\prime},\theta^{\prime},n^{-12};x^{\prime},y^{\prime})$ for SLOW-GED satisfies that $n^{\prime}$ is a power of 2 with $n^{\prime}<n$ , $x^{\prime},y^{\prime}$ are substrings of $z=xy$ of length $n^{\prime}$ , and $\theta^{\prime}$ is an integral power of 1/2. We may assume that $\theta^{\prime}\geq 1/n$ since for $\theta^{\prime}<1/n$ we may assume that SLOW-GED is the deterministic algorithm that returns accept if $x^{\prime}=y^{\prime}$ and reject otherwise. Let $\mathbf{SG}$ denote the event that for all possible choices of input parameters $(n^{\prime},x^{\prime},y^{\prime},\theta^{\prime})$ with $\theta^{\prime}\geq 1/n$ , the choice of random bits succeeds.

The number of possible choices of input parameters for which randomness is used is at most $4n^{2}\log^{2}(n)$ . (There are at most $\log(n)$ ways to choose $n^{\prime}$ , and to choose $\theta^{\prime}$ , and at most $2n$ ways to choose the starting location of $x^{\prime}$ and of $y^{\prime}$ .) Thus by a union bound, the probability that $\mathbf{SG}$ does not hold is at most $n^{-8}$ .
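Spelled out, the union bound is the following short calculation. Reading $\log$ as base 2 and using that $n$ is a power of 2 are assumptions made here to verify the last inequality.

```latex
\Pr\bigl[\overline{\mathbf{SG}}\bigr]
  \;\le\; 4n^{2}\log^{2}(n)\cdot n^{-12}
  \;=\; \frac{4\log^{2}(n)}{n^{2}}\cdot n^{-8}
  \;\le\; n^{-8},
\qquad \text{since } 2\log_{2}(n)\le n \text{ for every power of two } n .
```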

The construction of SparseSample. $\textrm{SparseSample}(j,I,i)$ is a random sample of $\textrm{Sparse}(j-1,i)$ generated during Preprocess $(j)$ . What we want from this sample is that for each $i^{\prime}\in\{0,\ldots,i\}$ , if a nontrivial fraction of $\textrm{Sparse}(j-1,i)$ belongs to the set of markers $\mathcal{M}(j,I\times J,i,i^{\prime})$ then $\textrm{SparseSample}(j,I,i)$ should include a member of $\mathcal{M}(j,I\times J,i,i^{\prime})$ . (Note: for the purposes of this discussion, the exact technical definition of $\mathcal{M}(j,I\times J,i,i^{\prime})$ is unimportant; we only need that for each $j,I,J,i,i^{\prime}$ , $\mathcal{M}(j,I\times J,i,i^{\prime})$ and $\textrm{Sparse}(j-1,i)$ are completely determined after iteration $j-1$ , and $\mathcal{M}(j,I\times J,i,i^{\prime})\subseteq\textrm{Sparse}(j-1,i)$ .) Formally, we say that $\textrm{SparseSample}(j,I,i)$ fails for $J\in\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ and $i^{\prime}\in\{0,\ldots,i\}$ if $|\textrm{Sparse}(j-1,i)\cap\textrm{Intervals}(w_{j-1};I)|>0$ , $|\mathcal{M}(j,I\times J,i,i^{\prime})|\geq|\textrm{Sparse}(j-1,i)\cap\textrm{Intervals}(w_{j-1};I)|/3$ and $\textrm{SparseSample}(j,I,i)\cap\mathcal{M}(j,I\times J,i,i^{\prime})=\emptyset$ . Since $\mathcal{M}(j,I\times J,i,i^{\prime})$ is completely determined by the end of iteration $j-1$ , and $\textrm{SparseSample}(j,I,i)$ is an independent sample of $30\log n$ elements from $\textrm{Sparse}(j,I,i)$ selected during iteration $j$ , the probability that $\textrm{SparseSample}(j,I,i)$ fails for $J,i^{\prime}$ is at most $(1-1/3)^{30\log n}\leq n^{-10}$ . There are at most $n$ pairs $J,i^{\prime}$ so the probability that $\textrm{SparseSample}(j,I,i)$ fails for some $J,i^{\prime}$ is at most $n^{-9}$ . There are at most $n$ triples $j,I,i$ so the probability that some $\textrm{SparseSample}(j,I,i)$ fails is at most $n^{-8}$ .
We denote by $B_{\mathrm{SS}}^{j}$ the random bits that are used at iteration $j$ to generate the samples from $\textrm{Sparse}(j,I,i)$ for all $I$ and $i$ .
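The sampling step and the failure bound can be illustrated numerically. The sketch assumes that $\log$ is base 2, that the sample is drawn uniformly with replacement (a multiset), and that the sample size is rounded up to an integer; the function names are ours.

```python
import math
import random


def sample_size(n):
    """30 * log2(n) sample elements, rounded up (assumption: log is base 2)."""
    return math.ceil(30 * math.log2(n))


def draw_sparse_sample(sparse_subintervals, n, rng):
    """Uniform sample with replacement from the sparse subintervals of I.

    Returns the empty list when there are no sparse subintervals.
    """
    pool = list(sparse_subintervals)
    if not pool:
        return []
    return rng.choices(pool, k=sample_size(n))


def miss_probability_bound(n):
    """(1 - 1/3)^(30 log2 n): chance that every draw misses a set
    containing at least a third of the pool."""
    return (2.0 / 3.0) ** (30 * math.log2(n))
```

Since $(2/3)^{30\log_{2}n}=n^{30\log_{2}(2/3)}$ and $30\log_{2}(2/3)<-17$ , the bound $n^{-10}$ holds with room to spare under this reading.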

Testing potential pivots in ProcessDense. During the while loop for $I\in\mathcal{T}$ of ProcessDense, we make a random selection of a set $\mathcal{S}$ , and this choice affects whether $I$ is assigned to $\textrm{Sparse}(j,i)$ or becomes a pivot. The constant $c_{0}$ in line (10) is chosen below to satisfy certain technical conditions. We denote by $B_{\mathrm{PD}}^{j}$ the random bits used at iteration $j$ to generate sets $\mathcal{S}$ where we make the simplifying assumption that there is a designated block of bits for each possible $I\in\textrm{Intervals}(w_{j})$ and $i$ to select the corresponding $\mathcal{S}$ . (Some of the blocks might be unused.)

There are two bad events that depend on the choice of $\mathcal{S}$ :

$|\textrm{EnumerateClose}(j,I\times\textrm{Intervals}(w_{j},{\varepsilon(i+3)}),i)|<d_{j}/2$ and $I$ is not assigned to $\textrm{Sparse}(j,i)$ . 2. 2.

$|\textrm{EnumerateClose}(j,I\times\textrm{Intervals}(w_{j},{\varepsilon(i+3)}),i)|>2d_{j}$ and $I$ is assigned to $\textrm{Sparse}(j,i)$ .

For both of the bad events, we observe that (i) for any input $(j,I\times\mathcal{J},i)$ , $\textrm{EnumerateClose}(j,I\times\mathcal{J},i)$ returns the stack of candidates $\langle i;I\times J\rangle$ that are classified as close among $I\times\mathcal{J}$ , and (ii) the classification of level $j$ candidates as close or far is completely deterministic given the random bits $B_{\mathrm{SG}}$ for SLOW-GED, and the random bits $B_{\mathrm{SS}}^{\leq j}$ and $B_{\mathrm{PD}}^{\leq j-1}$ for the first $j-1$ iterations and $\textrm{Preprocess}(j)$ . Thus, for the random sample $\mathcal{S}$ of $\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ , where each interval is placed in $\mathcal{S}$ independently with probability $p$ , $\frac{1}{p}|\textrm{EnumerateClose}(j,I\times\mathcal{S},i)|$ is an estimate of $|\textrm{EnumerateClose}(j,I\times\textrm{Intervals}(w_{j},{\varepsilon(i+3)}),i)|$ , and the bad events can only occur if this estimate is sufficiently inaccurate. For suitably large $c_{0}$ , a simple Chernoff-Hoeffding bound shows that for each $(I,i)$ the probability of a bad event is at most $n^{-10}$ , and summing over the at most $O(n)$ such pairs, the probability of a bad event is at most $n^{-9}$ . We say ProcessDense has successful sampling if no such bad event occurs.
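The estimator and the two bad events can be sketched as follows. Here `is_close` is a hypothetical predicate standing in for the fixed close/far classification of the candidates $\langle i;I\times J\rangle$ , and the function names are ours; the actual sampling probability and threshold (which involve $c_{0}$ ) are set in the pseudocode and are not reproduced.

```python
import random


def estimate_close_count(candidates, is_close, p, rng):
    """Keep each candidate independently with probability p, count the close
    ones in the sample, and rescale by 1/p."""
    sample = [J for J in candidates if rng.random() < p]
    hits = sum(1 for J in sample if is_close(J))
    return hits / p


def is_bad_event(true_count, d, assigned_sparse):
    """The two bad events: fewer than d/2 close intervals but not declared
    sparse, or more than 2d close intervals but declared sparse."""
    if true_count < d / 2 and not assigned_sparse:
        return True
    if true_count > 2 * d and assigned_sparse:
        return True
    return False
```

With $p=1$ the sample is the whole set and the estimate is exact, which is the situation used in the proof of Proposition 4.4.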

Successful randomization. An execution of FAST-GED has successful randomization if all calls to SLOW-GED are correct, all calls to SparseSample are successful, and ProcessDense has successful sampling. We denote the event of successful randomization by $\mathbf{SR}$ . By the above, $\Pr[\mathbf{SR}]\geq 1-1/n^{7}$ .
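The final bound combines the three failure probabilities derived above; the short calculation below makes the step explicit, under the assumption that $n\geq 3$ .

```latex
\Pr\bigl[\overline{\mathbf{SR}}\bigr]
  \;\le\; n^{-8} + n^{-8} + n^{-9}
  \;\le\; 3n^{-8}
  \;\le\; n^{-7}
\qquad \text{for } n\ge 3 .
```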

4.6 The properties enforced by FAST-GED.

In this section we state and prove a theorem giving the main properties enforced by FAST-GED. By hypothesis, SLOW-GED is a gap algorithm for edit distance satisfying $\textrm{gap-condition}(T^{\prime},\zeta^{\prime},Q^{\prime})$ . We want to show that FAST-GED satisfies $\textrm{gap-condition}(T,\zeta,Q)$ with $T=T^{\prime}+1/6$ and suitably chosen $\zeta>0$ and $Q\geq 1$ (depending only on $T^{\prime}$ , $\zeta^{\prime}$ and $Q^{\prime}$ ). As in the discussion in Section 4.2, we say that the level $j$ candidate $\langle i;I\times J\rangle$ is classified as close if $\textrm{EnumerateClose}(j,I\times\{J\},i)$ returns $\{J\}$ and is classified as far if $\textrm{EnumerateClose}(j,I\times\{J\},i)$ returns $\emptyset$ .

Theorem 4.5.

Assume that SLOW-GED is a gap algorithm for edit distance satisfying $\textrm{gap-condition}(T^{\prime},\zeta^{\prime},Q^{\prime})$ . Consider a run of FAST-GED on input $(n,\theta,1/2;x,y)$ where $n^{-\zeta^{\prime}}\leq\theta\leq 1$ and $|x|=|y|=n$ , that meets the conditions for successful randomization.

For all $j\in\{1,\ldots,k\}$ , $i\in\{0,\ldots,\log(1/\theta)\}$ , $I\in\textrm{Intervals}(w_{j})$ , $J\in\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ , $\mathcal{J}\subseteq\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ :

Soundness of $\mathcal{B}^{\textrm{below}}$ .

If $J\in\mathcal{B}^{\textrm{below}}(j,I,i)$ then $\textrm{ncost}(I\times J)\leq{\varepsilon(i-q_{j-1}-6)}$ .

Completeness of $\mathcal{B}^{\textrm{below}}$ .

If $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ then (i) $J\in\mathcal{B}^{\textrm{below}}(j,I,i)$ or (ii) there exists an $i^{\prime}\leq i$ such that $|\mathcal{M}(j,I\times J,i,i^{\prime})|>\frac{1}{3}|\textrm{Intervals}(w_{j-1};I)\cap\textrm{Sparse}(j-1,i^{\prime})|$ .

Consistency of EnumerateClose.

$J\in\textrm{EnumerateClose}(j,I\times\mathcal{J},i)$ if and only if $J\in\textrm{EnumerateClose}(j,I\times\{J\},i)$ . If $J\in\textrm{EnumerateClose}(j,I\times\{J\},i)$ then $\langle i;I\times J\rangle$ is classified as close.

Soundness of EnumerateClose.

If $\langle i;I\times J\rangle$ is classified as close then $\textrm{ncost}(I\times J)\leq{\varepsilon(i-q_{j-1}-6)}$ .

Completeness of EnumerateClose.

If $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ then $\langle i;I\times J\rangle$ is classified as close.

Validity of Sparse.

$I\in\textrm{Sparse}(j,i)$ implies that $\textrm{EnumerateClose}(j,I\times\textrm{Intervals}(w_{j},{\varepsilon(i+3)}),i)$ has size at most $2d_{j}$ .

Soundness of $\mathcal{B}^{\textrm{dense}}$ .

If $J\in\mathcal{B}^{\textrm{dense}}(j,I,i)$ then $\textrm{ncost}(I\times J)\leq{\varepsilon(i-q_{j})}$ .

Completeness of $\mathcal{B}^{\textrm{dense}}$ .

If $I\not\in\textrm{Sparse}(j,i)$ and $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ then $J\in\mathcal{B}^{\textrm{dense}}(j,I,i)$ .

Soundness of $\mathcal{R}(j)$ .

Every box in $\mathcal{R}(j)$ is correctly certified, i.e., $(I\times J,\kappa)\in\mathcal{R}(j)$ implies $\textrm{ncost}(I\times J)\leq\kappa$ .

Completeness of $\mathcal{Q}(j)$ .

If $I\not\in\textrm{Sparse}(j,i)$ and $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ then $(I\times J,\min(1,{\varepsilon(i-q_{j})}))\in\mathcal{Q}(j)$

The proof of this theorem is by induction on $j$ . For fixed $j$ when we prove a property we assume the properties listed above it hold. With the exception of the Completeness of $\mathcal{B}^{\textrm{below}}$ , which we defer to the next subsection, the proofs are straightforward.

Proof of Soundness of $\mathcal{B}^{\textrm{below}}$ . For $j=1$ , the requirement is vacuously satisfied. Suppose $j>1$ . By Soundness of $\mathcal{R}(j-1)$ , every box in $\mathcal{R}(j-1,I)$ is correctly certified. If $J\in\mathcal{B}^{\textrm{below}}(j,I,i)$ , then the pseudocode implies that $J\in\textrm{APM}(j,I\times\textrm{Intervals}(w_{j},{\varepsilon(i+3)}),{\varepsilon(i-q_{j-1}-5)},\mathcal{R}(j-1,I))$ . By the soundness of APM, the inclusion of $J$ in the output of this call implies that $\textrm{ncost}(I\times J)\leq 2{\varepsilon(i-q_{j-1}-5)}={\varepsilon(i-q_{j-1}-6)}$ .

Proof of Completeness of $\mathcal{B}^{\textrm{below}}$ . See Section 4.7.

Proof of Consistency of EnumerateClose. We must show that whether $J\in\textrm{EnumerateClose}(j,I\times\mathcal{J},i)$ does not depend on $\mathcal{J}\setminus\{J\}$ . In the case $j=1$ , $J\in\textrm{EnumerateClose}(j,I\times\mathcal{J},i)$ if and only if $\textrm{SLOW-GED}(z_{I},z_{J},{\varepsilon(i)})$ returns accept, which does not depend on $\mathcal{J}\setminus\{J\}$ . Assume $j>1$ . From the pseudocode of EnumerateClose, $J\in\textrm{EnumerateClose}(j,I\times\mathcal{J},i)$ if and only if (i) $J\in\mathcal{B}^{\textrm{below}}(j,I,i)$ or (ii) $J\in\mathcal{K}$ and $\textrm{SLOW-GED}(z_{I},z_{J},{\varepsilon(i)})$ returns accept. Neither condition (i) nor the outcome of $\textrm{SLOW-GED}(z_{I},z_{J},{\varepsilon(i)})$ depends on $\mathcal{J}\setminus\{J\}$ . It remains to show that whether $J\in\mathcal{K}$ is also independent of $\mathcal{J}\setminus\{J\}$ . Now $J\in\mathcal{K}$ if and only if there exist $i^{\prime}\in\{0,\ldots,i\}$ , $I^{\prime}\in\textrm{SparseSample}(j,I,i^{\prime})$ and $J^{\prime}\in\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$ such that $J^{\prime}\in\mathcal{S}^{\prime}$ . The set $\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$ does not depend on $\mathcal{J}\setminus\{J\}$ . For $J^{\prime}\in\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$ we must have $J^{\prime}\in\mathcal{J}^{\prime}$ , and therefore by the Consistency of EnumerateClose at level $j-1$ , $J^{\prime}\in\textrm{EnumerateClose}(j-1,I^{\prime}\times\mathcal{J}^{\prime},i^{\prime})$ if and only if $J^{\prime}\in\textrm{EnumerateClose}(j-1,I^{\prime}\times\{J^{\prime}\},i^{\prime})$ . Hence membership of $J^{\prime}$ in $\mathcal{S}^{\prime}$ , and therefore of $J$ in $\mathcal{K}$ , does not depend on $\mathcal{J}\setminus\{J\}$ .

Proof of Soundness of EnumerateClose. $\langle i;I\times J\rangle$ is classified as close means that $J\in\textrm{EnumerateClose}(j,I\times\{J\},i)$ . Now for this to happen either (i) $\textrm{SLOW-GED}(z_{I},z_{J},{\varepsilon(i)})$ returns accept, or (ii) $J\in\mathcal{B}^{\textrm{below}}(j,I,i)$ . If (i) holds then the guarantee on SLOW-GED implies $\textrm{ncost}(I\times J)\leq Q^{\prime}{\varepsilon(i)}\leq{\varepsilon(i-q_{j-1}-6)}$ , since $\log(Q^{\prime})=q_{0}\leq q_{j-1}$ for all $j\geq 1$ . If (ii) holds then the result follows from the Soundness of $\mathcal{B}^{\textrm{below}}$ .

Proof of Completeness of EnumerateClose. Suppose $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ . By the Completeness of $\mathcal{B}^{\textrm{below}}$ , we have (i) $J\in\mathcal{B}^{\textrm{below}}(j,I,i)$ or (ii) $\textrm{Sparse}(j-1,i)\cap\textrm{Intervals}(w_{j-1};I)\neq\emptyset$ and there exists an $i^{*}\leq i$ so that $|\mathcal{M}(j,I\times J,i,i^{*})|\geq\frac{1}{3}|\textrm{Intervals}(w_{j-1};I)\cap\textrm{Sparse}(j-1,i^{*})|$ . If (i) holds, then the definition of EnumerateClose immediately gives $J\in\textrm{EnumerateClose}(j,I\times\{J\},i)$ . If (ii) holds, then the success condition for $\textrm{SparseSample}(j,I,i)$ (from Section 4.5) implies that there is an $I^{*}\in\textrm{SparseSample}(j,I,i^{*})$ such that $(I^{*},i^{*})$ is a marker for $\langle i;I\times J\rangle$ . During the execution of $\textrm{EnumerateClose}(j,I\times J,i)$ , when $i^{*}$ is selected in line (11) and $I^{*}$ in line (12), by the definition of marker, $J$ is added to $\mathcal{K}$ in line (17). The correctness of SLOW-GED implies that $\textrm{SLOW-GED}(I\times J,{\varepsilon(i)})$ will accept in line (23) and so $J$ will be added to $\mathcal{S}$ .

Proof of Validity of Sparse. This follows immediately from the assumption that ProcessDense has successful sampling.

Proof of Soundness of $\mathcal{B}^{\textrm{dense}}$ . For $i\leq q_{j}$ the claim is trivial so we assume $i-q_{j}>0$ . Suppose $J\in\mathcal{B}^{\textrm{dense}}(j,I,i)$ . $\mathcal{B}^{\textrm{dense}}(j,I,i)$ was defined during iteration $i$ of the main loop (1-34) of $\textrm{ProcessDense}(j)$ , during one of the iterations of the while loop (8-23). Let $I^{*}$ be the pivot during that iteration. Then $I\in\textrm{EnumerateClose}(j,I^{*}\times\textrm{Intervals}(w_{j}),h_{1})$ and $J^{\prime}\in\textrm{EnumerateClose}(j,I^{*}\times\textrm{Intervals}(w_{j},{\varepsilon(h_{2}+3)}),h_{2})$ , for $J^{\prime}=\textrm{Round}(J,{\varepsilon(h_{2}+3)})$ . By the Soundness of EnumerateClose, $\textrm{ncost}(I^{*}\times I)\leq{\varepsilon(h_{1}-q_{j-1}-6)}$ and $\textrm{ncost}(I^{*}\times J^{\prime})\leq{\varepsilon(h_{2}-q_{j-1}-6)}$ . By the triangle inequality and Proposition 2.2, we have $\textrm{ncost}(I\times J)\leq{\varepsilon(h_{1}-q_{j-1}-6)}+{\varepsilon(h_{2}-q_{j-1}-6)}+{\varepsilon(h_{2}+3)}\leq 2{\varepsilon(h_{2}-q_{j-1}-6)}={\varepsilon(h_{2}-q_{j-1}-7)}={\varepsilon(i-3q_{j-1}-21)}={\varepsilon(i-q_{j})}$ .
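The last equality in the chain above uses $q_{j}=3q_{j-1}+21$ . Assuming that recurrence, together with $q_{0}=\log(Q^{\prime})$ from the proof of Soundness of EnumerateClose, the following Python sketch tabulates the loss parameters. The function names are ours and any sample value of $q_{0}$ is arbitrary.

```python
def q_values(q0, k):
    """Return [q_0, ..., q_k] under the recurrence q_j = 3 * q_{j-1} + 21."""
    qs = [q0]
    for _ in range(k):
        qs.append(3 * qs[-1] + 21)
    return qs


def closed_form(q0, j):
    """Closed form of the same recurrence: q_j = 3^j * (q_0 + 21/2) - 21/2.

    The numerator 3^j * (2 * q0 + 21) - 21 is always even, so integer
    division is exact.
    """
    return (3 ** j * (2 * q0 + 21) - 21) // 2
```

Under this recurrence $q_{j}$ grows like $3^{j}$ , so the loss incurred per level depends only on the level and on $Q^{\prime}$ , not on $n$ .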

Proof of Completeness of $\mathcal{B}^{\textrm{dense}}$ . Suppose $I\not\in\textrm{Sparse}(j,i)$ and $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ . Since $I\not\in\textrm{Sparse}(j,i)$ during iteration $i$ of the main loop (1), there is an iteration of the while loop (8-22) of $\textrm{ProcessDense}(j)$ where $I$ was removed from $\mathcal{T}$ . Let $I^{*}$ be the pivot for that iteration. Since $I$ was removed from $\mathcal{T}$ , $I\in\mathcal{X}$ during this iteration, so $I\in\textrm{EnumerateClose}(j,I^{*}\times\textrm{Intervals}(w_{j}),h_{1})$ and by the Soundness of EnumerateClose $\textrm{ncost}(I^{*}\times I)\leq{\varepsilon(h_{1}-q_{j-1}-6)}$ . Let $J^{\prime}=\textrm{Round}(J,{\varepsilon(h_{2}+3)})$ . It suffices to show that $J\in\mathcal{Y}$ for this same iteration, which would follow from $J^{\prime}\in\textrm{EnumerateClose}(j,I^{*}\times\textrm{Intervals}(w_{j},{\varepsilon(h_{2}+3)}),h_{2})$ . By the Completeness of EnumerateClose it suffices to show that $\textrm{ncost}(I^{*}\times J^{\prime})\leq{\varepsilon(h_{2})}$ . By the triangle inequality and Proposition 2.2, $\textrm{ncost}(I^{*}\times J^{\prime})\leq\textrm{ncost}(I^{*}\times I)+\textrm{ncost}(I\times J^{\prime})\leq\textrm{ncost}(I^{*}\times I)+\textrm{ncost}(I\times J)+{\varepsilon(h_{2}+3)}\leq{\varepsilon(h_{1}-q_{j-1}-6)}+{\varepsilon(i)}+{\varepsilon(h_{2}+3)}\leq{\varepsilon(h_{2})}/2+{\varepsilon(h_{2})}/4+{\varepsilon(h_{2})}/8\leq{\varepsilon(h_{2})}$ as required.

Proof of Soundness of $\mathcal{R}(j)$ . By Proposition 4.3, it suffices that every box in $\mathcal{Q}(j)$ is correctly certified. In line (26) of $\textrm{ProcessDense}(j)$ , $(I\times J,{\varepsilon(i-q_{j})})\in\mathcal{Q}(j)$ only if $J\in\mathcal{B}^{\textrm{dense}}(j,I,i)$ which is correctly certified by the Soundness of $\mathcal{B}^{\textrm{dense}}$ .

Proof of Completeness of $\mathcal{Q}(j)$ . Suppose $I\not\in\textrm{Sparse}(j,i)$ and $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ . By the Completeness of $\mathcal{B}^{\textrm{dense}}$ , $J\in\mathcal{B}^{\textrm{dense}}(j,I,i)$ and so the definition of $\mathcal{Q}(j)$ implies that $(I\times J,{\varepsilon(i-q_{j})})\in\mathcal{Q}(j)$ .

4.7 Proof of Completeness of $\mathcal{B}^{\textrm{below}}$

Here we finish the proof of Theorem 4.5, by establishing the final property, whose proof is significantly more involved than that of the others. The proof is based on ideas from [17].

Consider a candidate $\langle i;I\times J\rangle$ with $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ . We assume condition (ii) fails and deduce $\textrm{cost}_{\mathcal{R}(j-1,I)}(I\times J)\leq{\varepsilon(i-q_{j-1}-5)}w_{j}$ . By the definition of Preprocess and the Completeness of APM, this immediately implies condition $J\in\mathcal{B}^{\textrm{below}}(j,I,i)$ , which is condition (i).

Fix a minimum cost traversal $\tau$ of $I\times J$ . The proof proceeds via the following steps.

Step 1.

For each $I^{\prime}\in\textrm{Intervals}(w_{j-1};I)$ we specify a candidate $\langle t(I^{\prime});I^{\prime}\times\hat{J}(I^{\prime})\rangle$ , which is approved in the sense defined in the description of ProcessDense in Section 4.4. (The collection of boxes $\{I^{\prime}\times\hat{J}(I^{\prime}):I^{\prime}\in\textrm{Intervals}(w_{j-1};I)\}$ should be thought of as approximately covering $\tau$ .)

Step 2.

We upper bound $\textrm{cost}_{\mathcal{R}(j-1,I)}(I\times J)$ as a constant times $\sum_{I^{\prime}}{\varepsilon(t(I^{\prime}))}w_{j-1}$ plus $8{\varepsilon(i)}w_{j}$ .

Step 3.

We show that if (ii) fails, then $\sum_{I^{\prime}}{\varepsilon(t(I^{\prime}))}w_{j-1}$ can be upper bounded by a constant multiple of ${\varepsilon(i)}w_{j}$ .

Step 4.

This gives that $\textrm{cost}_{\mathcal{R}(j-1,I)}(I\times J)$ is at most a constant multiple of ${\varepsilon(i)}w_{j}$ .

Step 1. Specifying $\langle t(I^{\prime});I^{\prime}\times\hat{J}(I^{\prime})\rangle$ for each $I^{\prime}$ . Consider a pair $(I^{\prime},i^{\prime})$ where $i^{\prime}\in\{0,\ldots,i\}$ and $I^{\prime}\in\textrm{Intervals}(w_{j-1};I)$ . Proposition 2.4 implies there is a level $j-1$ candidate $\langle i^{\prime};I^{\prime}\times J^{\prime}\rangle$ such that $\textrm{ncost}(I^{\prime}\times J^{\prime})\leq 2\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(i^{\prime}+3)}$ and $\textrm{disp}(I^{\prime}\times J^{\prime},\tau_{I^{\prime}})\leq\textrm{cost}(\tau_{I^{\prime}})+{\varepsilon(i^{\prime}+3)}w_{j-1}$ . Select such an interval $J^{\prime}$ and denote it by $J_{i^{\prime}}(I^{\prime})$ (keeping the dependence on $\tau$ implicit).

For each $I^{\prime}$ let us define $t(I^{\prime})$ to be the largest index $h\leq i$ for which the candidate $\langle h;I^{\prime}\times J_{h}(I^{\prime})\rangle$ is approved, that is, $I^{\prime}\not\in\textrm{Sparse}(j-1,h)$ and $J_{h}(I^{\prime})\in\mathcal{B}^{\textrm{dense}}(j-1,I^{\prime},h)$ . Let $\hat{J}(I^{\prime})=J_{t(I^{\prime})}(I^{\prime})$ . We record the important properties:

Proposition 4.6.

For each $I^{\prime}\in\textrm{Intervals}(w_{j-1};I)$ :

1. The box $I^{\prime}\times\hat{J}(I^{\prime})$ satisfies $\textrm{ncost}(I^{\prime}\times\hat{J}(I^{\prime}))\leq 2\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(t(I^{\prime})+3)}$ and $\textrm{disp}(I^{\prime}\times\hat{J}(I^{\prime}),\tau_{I^{\prime}})\leq\textrm{cost}(\tau_{I^{\prime}})+{\varepsilon(t(I^{\prime})+3)}w_{j-1}$ .

2. The candidate $\langle t(I^{\prime});I^{\prime}\times\hat{J}(I^{\prime})\rangle$ is approved, and hence $(I^{\prime}\times\hat{J}(I^{\prime}),{\varepsilon(t(I^{\prime})-q_{j-1})})\in\mathcal{Q}(j-1)$ .

3. For any $i^{\prime}\in\{t(I^{\prime})+1,\dots,i\}$ either $I^{\prime}\in\textrm{Sparse}(j-1,i^{\prime})$ or $I^{\prime}\not\in\textrm{Sparse}(j-1,i^{\prime})$ and $J_{i^{\prime}}(I^{\prime})\not\in\mathcal{B}^{\textrm{dense}}(j-1,I^{\prime},i^{\prime})$ .

Proof.

The first two properties follow immediately from the definitions of $t(I^{\prime})$ and $\hat{J}(I^{\prime})$ . For the third property, the maximality of $t(I^{\prime})$ implies that for $i^{\prime}\in\{t(I^{\prime})+1,\dots,i\}$ , $\langle i^{\prime};I^{\prime}\times J_{i^{\prime}}(I^{\prime})\rangle$ is not approved, and the result follows from the definition of approved. ∎

Step 2. Upper bound on $\textrm{cost}_{\mathcal{R}(j-1,I)}(I\times J)$ .

Proposition 4.7.

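$$\textrm{cost}_{\mathcal{R}(j-1,I)}(I\times J)\leq 8{\varepsilon(i)}w_{j}+\sum_{I^{\prime}\in\textrm{Intervals}(w_{j-1};I)}{\varepsilon(t(I^{\prime})-q_{j-1}-1)}w_{j-1}.$$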

(This is closely related to Lemma 4.1 of [17] and the proof is similar.)

Proof.

We transform the path $\tau$ in $G_{z}$ to a path $\tau^{\prime}$ in the shortcut graph $\widetilde{G}(\mathcal{R}(j-1,I))$ (see Section 2) and control the increase in cost. Let $I_{1},\ldots,I_{m}$ be the intervals of $\textrm{Intervals}(w_{j-1};I)$ in order, and for $h\in[m]$ , let $i_{h}=t(I_{h})$ and $J_{h}=\widehat{J}(I_{h})$ . Let $\delta_{h}$ be the smallest power of 2 such that $\delta_{h}w_{j-1}\geq\textrm{disp}(I_{h}\times J_{h},\tau_{I_{h}})$ . By Proposition 4.6, $\delta_{h}\leq 2\textrm{ncost}(\tau_{I_{h}})+2{\varepsilon(i_{h}+3)}$ , and $(I_{h}\times J_{h},{\varepsilon(i_{h}-q_{j-1})})\in\mathcal{Q}(j-1)$ . Let $L=\{h\in[m]:\delta_{h}<1/2\}$ . For $h\in L$ , let $J_{h}^{\prime}=J_{h}/[\delta_{h}w_{j-1}]$ (the interval obtained by removing the first and last $\delta_{h}w_{j-1}$ indices from $J_{h}$ ). The certified box $(I_{h}\times J_{h}^{\prime},{\varepsilon(i_{h}-q_{j-1})}+2\delta_{h})$ belongs to $\mathcal{R}(j-1)$ , and since $I_{h}\subseteq I$ , it also belongs to $\mathcal{R}(j-1,I)$ . Let $e_{h}=e_{I_{h},J_{h}^{\prime}}$ be the shortcut edge with cost $({\varepsilon(i_{h}-q_{j-1})}+2\delta_{h})w_{j-1}$ . We claim (1) there is a source-sink path $\tau^{\prime}$ in $\widetilde{G}(\mathcal{R}(j-1,I))$ that consists of $\{e_{i}:i\in L\}$ , plus a collection $\{H_{i}:i\in[m]\setminus L\}$ where $H_{i}$ is a horizontal path whose projection to the $x$ -axis is $I_{i}$ , plus a collection of (possibly empty) vertical paths $V_{0},V_{1},\ldots,V_{m}$ where the $x$ -coordinate of $V_{i}$ for $i>0$ is $\max(I_{i})$ and 0 for $V_{0}$ , and (2) $\textrm{cost}(\tau^{\prime})$ satisfies the bound of the lemma.

For the first claim, for $h\in[m]$ , let $p_{h}=(i_{h},j_{h})$ be the first point in $\tau_{I_{h}}$ and define $p_{m+1}$ to be the final point of $\tau$ . We will define $\tau^{\prime}$ to pass through all of the $p_{h}$ . Let $J_{h}^{*}$ be the vertical projection of $\tau_{I_{h}}$ so that $\tau_{I_{h}}$ traverses $I_{h}\times J_{h}^{*}$ . The choice of $\delta_{h}$ implies that for $h\in L$ , $J_{h}^{\prime}\subseteq J_{h}^{*}$ . Define the portion $\tau^{\prime}_{h}$ between $p_{h}$ and $p_{h+1}$ as follows. If $h\in L$ , climb vertically from $p_{h}$ to $(i_{h},\min(J^{\prime}_{h}))$ , traverse $e_{I_{h},J^{\prime}_{h}}$ , and then climb vertically to $p_{h+1}$ . If $h\not\in L$ , move horizontally from $p_{h}$ to $(i_{h+1},j_{h})$ and then climb vertically to $p_{h+1}$ .

For the second claim, we upper bound $\textrm{cost}(\tau^{\prime})$ . For $h\in L$ , $e_{I_{h},J^{\prime}_{h}}$ has cost at most $({\varepsilon(i_{h}-q_{j-1})}+2\delta_{h})w_{j-1}$ , and for $h\not\in L$ , the horizontal path that projects to $I_{h}$ costs $w_{j-1}\leq 2\delta_{h}w_{j-1}$ ; the total cost of shortcut and horizontal edges is at most $\sum_{h}({\varepsilon(i_{h}-q_{j-1})}+2\delta_{h})w_{j-1}$ . The cost of vertical edges is $\sum_{h\in L}(w_{j-1}-\mu(J^{\prime}_{h}))+\sum_{h\not\in L}w_{j-1}=\sum_{h\in L}2\delta_{h}w_{j-1}+\sum_{h\not\in L}w_{j-1}\leq\sum_{h}2\delta_{h}w_{j-1}$ .

The combined cost of all edges is at most

[TABLE]

which implies the desired bound. ∎
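As an illustration of the bookkeeping in the proof above, the following Python sketch computes $\delta_{h}$ (the smallest power of 2 with $\delta_{h}w_{j-1}\geq\textrm{disp}$ ), trims $J_{h}$ to $J_{h}^{\prime}$ , and evaluates the vertical slack $w_{j-1}-\mu(J_{h}^{\prime})=2\delta_{h}w_{j-1}$ . Representing an interval as a pair `(lo, hi)` with measure `hi - lo`, restricting to positive displacement, and all helper names are assumptions made for illustration.

```python
from fractions import Fraction


def smallest_pow2_cover(disp, w):
    """Smallest delta = 2^e (e any integer) with delta * w >= disp, for disp > 0."""
    if disp <= 0:
        raise ValueError("this sketch assumes a positive displacement")
    delta = Fraction(1)
    while delta * w >= 2 * disp:   # halving delta would still cover disp
        delta /= 2
    while delta * w < disp:        # delta is too small, so double it
        delta *= 2
    return delta


def trim(J, delta, w):
    """Remove the first and last delta * w indices from J = (lo, hi)."""
    lo, hi = J
    cut = delta * w
    return (lo + cut, hi - cut)


def vertical_slack(J_trimmed, w):
    """w - mu(J'), the vertical distance not covered by the shortcut edge."""
    lo, hi = J_trimmed
    return w - (hi - lo)
```

Only indices with $\delta_{h}<1/2$ (the set $L$ ) are trimmed, which keeps the trimmed interval non-empty.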

Step 3. Implication of failure of condition (ii). We now use the failure of (ii) to obtain an upper bound on the righthand side of Proposition 4.7.

For $i^{\prime}\leq i$ , let $\mathcal{M}_{i^{\prime}}=\mathcal{M}(j,I\times J,i,i^{\prime})$ and $\mathcal{S}_{i^{\prime}}$ represent the set $\textrm{Sparse}(j-1,i^{\prime})\cap\textrm{Intervals}(w_{j-1};I)$ . Let $\mathcal{I}^{\prime}=\textrm{Intervals}(w_{j-1};I)$ .

The failure of condition (ii) implies:

[TABLE]


Multiplying (3) by ${\varepsilon(i^{\prime})}$ and summing on $i^{\prime}$ yields:

[TABLE]

Switching the sums:

[TABLE]

To reduce this further, we need the following sufficient condition for $I^{\prime}\in\mathcal{M}_{i^{\prime}}$ .

Proposition 4.8.

Suppose the candidate $\langle i;I\times J\rangle$ satisfies $\textrm{ncost}(I\times J)\leq{\varepsilon(i)}$ and $\tau$ is a min-cost traversal of $I\times J$ . Let $(I^{\prime},i^{\prime})$ be a pair such that $I^{\prime}\in\textrm{Intervals}(w_{j-1};I)$ and $i^{\prime}\in\{0,\ldots,i\}$ .

1. If ${\varepsilon(i^{\prime})}\geq 2\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(i+3)}$ then $\textrm{ncost}(I^{\prime}\times J_{i^{\prime}}(I^{\prime}))\leq{\varepsilon(i^{\prime})}$ .

2. If ${\varepsilon(i^{\prime})}\geq 2\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(i+3)}$ and $I^{\prime}\in\textrm{Sparse}(j-1,i^{\prime})$ then $(I^{\prime},i^{\prime})$ is a marker for $\langle i;I\times J\rangle$ .

Proof.

For the first part, by the choice of $J_{i^{\prime}}(I^{\prime})$ , we have $\textrm{ncost}(I^{\prime}\times J_{i^{\prime}}(I^{\prime}))\leq 2\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(i+3)}$ and by the hypothesis of the Proposition, this is at most ${\varepsilon(i^{\prime})}$ .

For the second part, by Completeness of $\textrm{EnumerateClose}(j-1,\cdot)$ and the first part, $\langle i^{\prime};I^{\prime}\times J_{i^{\prime}}(I^{\prime})\rangle$ is classified as close. So we just have to show that $J_{i^{\prime}}(I^{\prime})\in\textrm{ZoomIn}(j,I\times\{J\},i,I^{\prime},i^{\prime})$ . It suffices that $\textrm{disp}(I^{\prime}\times J_{i^{\prime}}(I^{\prime}),I\times J)\leq 2{\varepsilon(i)}w_{j}$ . To bound $\textrm{disp}(I^{\prime}\times J_{i^{\prime}}(I^{\prime}),I\times J)$ it suffices to bound the vertical distance from the point $(\min(I^{\prime}),\min(J_{i^{\prime}}(I^{\prime})))$ to the diagonal of $I\times J$ . Let $(p,q)$ be the initial point of $\tau_{I^{\prime}}$ . By the definition of $J_{i^{\prime}}(I^{\prime})$ , the vertical distance from $(\min(I^{\prime}),\min(J_{i^{\prime}}(I^{\prime})))$ to $(p,q)$ is at most $\textrm{cost}(\tau_{I^{\prime}})+{\varepsilon(i^{\prime}+3)}w_{j-1}\leq\textrm{cost}(\tau)+w_{j-1}$ . By Proposition 2.5 the vertical distance from $(p,q)$ to the diagonal of $I\times J$ is at most $\textrm{cost}(\tau)/2$ . So $\textrm{disp}(I^{\prime}\times J_{i^{\prime}}(I^{\prime}),I\times J)\leq\frac{3}{2}\textrm{cost}(\tau)+w_{j-1}$ . By hypothesis, $\textrm{cost}(\tau)\leq{\varepsilon(i)}w_{j}$ , and by assumption (2) in Section 4.1, $w_{j-1}\leq\frac{\theta}{2}w_{j}$ , and so $\textrm{disp}(I^{\prime}\times J_{i^{\prime}}(I^{\prime}),I\times J)\leq(\frac{3}{2}{\varepsilon(i)}+\frac{1}{2}\theta)w_{j}\leq 2{\varepsilon(i)}w_{j}$ , as required. ∎

Let $\mathcal{G}(I)=\{I^{\prime}\in\textrm{Intervals}(w_{j-1};I):t(I^{\prime})<i\;\&\;{\varepsilon(t(I^{\prime})+1)}\geq 2\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(i+3)}\}$ . We claim that for each $I^{\prime}\in\mathcal{G}(I)$ , $I^{\prime}\in\textrm{Sparse}(j-1,t(I^{\prime})+1)$ . If it were not, then by Part 3 of Proposition 4.6, $J_{t(I^{\prime})+1}(I^{\prime})\not\in\mathcal{B}^{\textrm{dense}}(j-1,I^{\prime},t(I^{\prime})+1)$ . But by Part 1 of Proposition 4.8 this would contradict the Completeness of $\mathcal{B}^{\textrm{dense}}(j-1,I^{\prime},t(I^{\prime})+1)$ . Hence, for each $I^{\prime}\in\mathcal{G}(I)$ , $(I^{\prime},t(I^{\prime})+1)$ is a marker.

We will combine Proposition 4.8 with inequality (4). The sum on the lefthand side of (4) includes all pairs $(I^{\prime},t(I^{\prime})+1)$ where $I^{\prime}\in\mathcal{G}(I)$ and so is bounded below by $\sum_{I^{\prime}\in\mathcal{G}(I)}{\varepsilon(t(I^{\prime})+1)}$ . To upper bound the righthand sum of (4), we look at the inner sum corresponding to a given $I^{\prime}\in\textrm{Intervals}(w_{j-1};I)$ . This is a sum of ${\varepsilon(i^{\prime})}$ over those $i^{\prime}$ such that $I^{\prime}$ in $\textrm{Sparse}(j-1,i^{\prime})$ and $I^{\prime}$ not in $\mathcal{M}_{i^{\prime}}$ .

We claim that if $i^{\prime}$ contributes to this sum then

$${\varepsilon(i^{\prime})}<2\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(i+3)}.\qquad(5)$$

To see this note that if ${\varepsilon(i^{\prime})}\geq 2\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(i+3)}$ then Part 2 of Proposition 4.8 implies that $I^{\prime}\not\in\mathcal{S}_{i^{\prime}}\setminus\mathcal{M}_{i^{\prime}}$ , so $i^{\prime}$ is not included in the sum.

Now in the case that $I^{\prime}\in\mathcal{G}(I)$ then (5) implies that ${\varepsilon(i^{\prime})}<{\varepsilon(t(I^{\prime})+1)}$ and so ${\varepsilon(i^{\prime})}\leq{\varepsilon(t(I^{\prime})+2)}$ . Summing over all such $i^{\prime}$ , the geometric series is at most ${\varepsilon(t(I^{\prime})+1)}$ .

For $I^{\prime}\not\in\mathcal{G}(I)$ , let $v(I^{\prime})$ be the least $i^{\prime}$ that contributes to the sum. So the sum is at most $2{\varepsilon(v(I^{\prime}))}$ , and by (5) this is at most $4\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(i+2)}$ .

Thus (4) implies:

[TABLE]

Multiplying the inequality by $2$ and subtracting $\sum_{I^{\prime}\in\mathcal{G}(I)}{\varepsilon(t(I^{\prime})+1)}$ from both sides gives:

[TABLE]

Now add $\sum_{I^{\prime}\not\in\mathcal{G}(I)}{\varepsilon(t(I^{\prime})+1)}$ to both sides:

[TABLE]

For the first sum on the right, $I^{\prime}\not\in\mathcal{G}(I)$ implies either ${\varepsilon(t(I^{\prime})+1)}={\varepsilon(i+1)}$ or ${\varepsilon(t(I^{\prime})+1)}<2\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(i+3)}$ , so each term can be bounded in both cases by $2\textrm{ncost}(\tau_{I^{\prime}})+{\varepsilon(i+1)}$ . Thus we get:

[TABLE]

Step 4. Combining the bounds. Combining the previous bound with the bound of Proposition 4.7 gives:

[TABLE]

as required to establish the Completeness of $\mathcal{B}^{\textrm{below}}$ .
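Step 3 relies on two geometric-series estimates for the scale ${\varepsilon(\cdot)}$ : the terms ${\varepsilon(i^{\prime})}$ with $i^{\prime}\geq t(I^{\prime})+2$ sum to at most ${\varepsilon(t(I^{\prime})+1)}$ , and the terms with $i^{\prime}\geq v(I^{\prime})$ sum to at most $2{\varepsilon(v(I^{\prime}))}$ . The Python sketch below checks both numerically, assuming ${\varepsilon(i)}=2^{-i}$ (consistent with the identity $2{\varepsilon(i)}={\varepsilon(i-1)}$ used in the proofs above). The function names are ours.

```python
from fractions import Fraction


def eps(i):
    """Assumed scale: eps(i) = 2^(-i), so that 2 * eps(i) == eps(i - 1)."""
    return Fraction(1, 2 ** i) if i >= 0 else Fraction(2 ** (-i))


def tail_sum(start, stop):
    """Sum of eps(i') over i' in {start, ..., stop}."""
    return sum((eps(i) for i in range(start, stop + 1)), Fraction(0))


def check_bounds(t, v, i):
    """Return the two inequalities used in Step 3 as booleans."""
    first = tail_sum(t + 2, i) <= eps(t + 1)
    second = tail_sum(v, i) <= 2 * eps(v)
    return first, second
```

Exact rational arithmetic is used so that the comparisons are not affected by floating-point rounding.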

4.8 Correctness of FAST-GED

We now complete the proof that the output of FAST-GED gives a constant factor approximation to edit distance with high probability. As in Theorem 4.5 we assume that SLOW-GED is a gap algorithm for edit distance satisfying $\textrm{gap-condition}(T^{\prime},\zeta^{\prime},Q^{\prime})$ . Consider a run of FAST-GED on input $(n,\theta,1/2;x,y)$ where $n^{-\zeta^{\prime}}\leq\theta\leq 1$ and $|x|=|y|=n$ . The conclusion of the theorem has a quality parameter $Q$ which we set to $2^{q_{k}+6}$ . We must prove that FAST-GED satisfies the Soundness and Completeness properties for gap algorithms from Section 1.

The final post-processing step is a call to $\textrm{APM}(\{0,\ldots,n\}\times\{\{n,\ldots,2n\}\},\theta 2^{q_{k}+5},\mathcal{R}_{k})$ , and the algorithm returns accept or reject according to the output of this call. We will apply the Soundness and Completeness of $\mathcal{B}^{\textrm{below}}$ (with $j=k+1$ ) by reinterpreting this final step as asking whether $\{n,\ldots,2n\}\in\mathcal{B}^{\textrm{below}}(k+1,\{0,\ldots,n\},\log(1/\theta)-q_{k}-6)$ (where $w_{k+1}=n$ ). The Soundness and Completeness of $\mathcal{B}^{\textrm{below}}$ extend (with no change) to this case. Thus if the algorithm returns accept, then $\textrm{ncost}(\{0,\ldots,n\},\{n,\ldots,2n\})\leq\theta 2^{q_{k}+6}=\theta Q$ , and the gap-algorithm satisfies Soundness. For Completeness, assume $\Delta_{\textrm{edit}}(x,y)\leq\theta$ . Writing $I=\{0,\ldots,n\}$ and $J=\{n,\ldots,2n\}$ , the Completeness of $\mathcal{B}^{\textrm{below}}$ gives that (i) $J\in\mathcal{B}^{\textrm{below}}(k+1,\{0,\ldots,n\},\log(1/\theta)-q_{k}-5)$ or (ii) there exists an $i^{\prime}\leq i$ such that $|\mathcal{M}(k+1,I\times J,i,i^{\prime})|>\frac{1}{3}|\textrm{Intervals}(w_{k};I)\cap\textrm{Sparse}(k,i)|$ . Since $d_{k}=1$ , Proposition 4.4 implies all sets $\textrm{Sparse}(k,i)$ are empty, so the sets $\mathcal{M}(k+1,I\times J,i,i^{\prime})$ are also empty, whereas (ii) requires one of them to be non-empty. Hence (ii) cannot hold, so (i) holds, which implies that FAST-GED accepts, and so Completeness holds.

4.9 Time analysis

In this subsection, we upper bound the expected running time of FAST-GED conditioned on the event $\mathbf{SR}$ of successful randomization, in terms of the algorithm parameters $w_{1},\ldots,w_{k}$ , and $d_{0},\ldots,d_{k}$ . These parameters will be optimized in the next subsection.

Theorem 4.9.

Suppose that SLOW-GED is a gap algorithm for edit distance satisfying $\textrm{gap-condition}(T^{\prime},\zeta^{\prime},Q^{\prime})$ . For $\theta\geq n^{-\zeta^{\prime}}$ the expected running time of $\textrm{FAST-GED}(n,\theta,1/2;x,y)$ conditioned on $\mathbf{SR}$ is upper-bounded by:

[TABLE]

The above theorem is not quite sufficient for our purposes since it gives only an expected upper bound on the running time of the algorithm, while we want an absolute upper bound. We can replace the expected upper bound by an absolute upper bound by the following routine modification of FAST-GED. On a given input, use the above theorem to determine a number $\tau^{*}$ which is at least six times the expected upper bound on running time given by the above theorem. Then the probability that FAST-GED takes more than $\tau^{*}$ steps is at most $1/6$ . So we run FAST-GED but terminate with reject if it reaches $\tau^{*}$ steps. This converts the expected running time to an absolute bound on running time, but now the completeness error (the probability of false rejection) is increased from 1/2 to 2/3. But by running this algorithm twice and accepting if either run accepts we restore the completeness error to below 1/2.
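The error accounting in this modification can be checked directly. The Python sketch below assumes only Markov's inequality, a union bound over false rejection and timeout, and independence of the repeated runs; the function names are ours.

```python
from fractions import Fraction


def truncation_failure_bound(factor):
    """Markov: Pr[time > factor * E[time]] <= 1 / factor."""
    return Fraction(1, factor)


def completeness_error_after_truncation(base_error, factor):
    """Union bound over a false reject and a timeout at factor * E[time]."""
    return base_error + truncation_failure_bound(factor)


def error_after_repeats(single_run_error, runs):
    """Reject only if all of `runs` independent runs reject."""
    return single_run_error ** runs
```

With a base completeness error of $1/2$ and a cutoff at six times the expected time, a single truncated run falsely rejects with probability at most $1/2+1/6=2/3$ , and two independent runs both falsely reject with probability at most $(2/3)^{2}=4/9<1/2$ .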

Combining the above theorem with this modification gives an algorithm satisfying the correctness properties proved for FAST-GED and having an absolute upper bound on time given as in the above theorem.

We now proceed to the proof of Theorem 4.9.

Proof.

Recall from Section 4.5 that successful randomization means: (1) All calls to SLOW-GED return correct answers, (2) All calls to SparseSample are successful and (3) ProcessDense has successful sampling.

Recall that $B_{\mathrm{SG}}$ is the sequence of random bits pregenerated for the calls to SLOW-GED (as described in Section 4.5). For $j\in[k]$ , $B_{\mathrm{SS}}^{j}$ are the random bits generated to select SparseSample’s in $\textrm{Preprocess}(j)$ at iteration $j$ of the algorithm, and $B_{\mathrm{PD}}^{j}$ are the random bits generated to select sets $\mathcal{S}$ in $\textrm{ProcessDense}(j)$ at iteration $j$ (also as described in Section 4.5). Let $B^{\leq j}$ denote the random bits $B_{\mathrm{SG}},B_{\mathrm{SS}}^{1},B_{\mathrm{PD}}^{1},\ldots,B_{\mathrm{SS}}^{j},B_{\mathrm{PD}}^{j}$ . We introduce the following events:

$\mathbf{SG}$

All calls to SLOW-GED return correct answers.

$\mathbf{SS}(j)$

All calls to SparseSample during iteration $j$ are successful.

$\mathbf{SS}(\leq j)$

All calls to SparseSample through the end of iteration $j$ are successful.

$\mathbf{PD}(j)$

ProcessDense has successful sampling during iteration $j$ .

$\mathbf{PD}(\leq j)$

ProcessDense has successful sampling through the end of iteration $j$ .

$\mathbf{SR}$

Successful randomization, i.e. $\mathbf{SG}\wedge\mathbf{SS}(\leq k)\wedge\mathbf{PD}(\leq k)$ .

We will argue that the expected running time of FAST-GED conditioned on $\mathbf{SR}$ is bounded by (9). In the bound, the outer sum on $j$ corresponds to iterations of FAST-GED. We will show that the cost of iteration $j$ is bounded by the inner sum. When we analyze iteration $j$ we fix the randomness $B^{\leq j}$ in such a way that $\mathbf{SG}\wedge\mathbf{SS}(\leq j-1)\wedge\mathbf{PD}(\leq j-1)$ holds. The cost of iteration $j$ is bounded conditioned on these fixed random bits and subject to requirement $\mathbf{SS}(j)\wedge\mathbf{PD}(j)$ .

As a first step, we need a bound on the running time for EnumerateClose. Recall that fixing the random bits $B^{\leq j-1}$ and $B_{\mathrm{SS}}^{j}$ makes $\textrm{EnumerateClose}(j,\cdot)$ run deterministically. In the lemma below, we condition on $(B^{\leq j-1},B_{\mathrm{SS}}^{j})=\beta^{*j}$ and consider the expected time of $\textrm{EnumerateClose}(j,I\times\mathcal{J},i)$ where $\mathcal{J}$ is a set of intervals chosen according to any distribution (possibly depending on $\beta^{*j}$ ) in which no set appears in $\mathcal{J}$ with probability more than some fixed bound $p$ .

Lemma 4.10.

Let $p\in[0,1]$ , $j\in[k]$ , $I\in\textrm{Intervals}(w_{j})$ , and $i\in\{0,\ldots,\log(1/\theta)\}$ . Let $\beta^{*j}$ be an assignment of the random bits $B^{\leq j-1}$ and $B_{\mathrm{SS}}^{j}$ that satisfies the success conditions $\mathbf{SS}(\leq j)$ and $\mathbf{PD}(\leq j-1)$ . Let $\mathcal{J}$ be a random variable whose value is a subset of $\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ with the property that given the fixed randomness $B^{\leq j-1}$ and $B_{\mathrm{SS}}^{j}$ , each $J\in\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$ belongs to $\mathcal{J}$ with probability at most $p$ . Then the expected running time of $\textrm{EnumerateClose}(j,I\times\mathcal{J},i)$ over the choice of $\mathcal{J}$ is at most:

[TABLE]

Proof.

The proof is by induction on $j$ . Suppose $j=1$ . We run $\textrm{SLOW-GED}(z_{I},z_{J},\kappa)$ for each $J\in\mathcal{J}$ . The expected time is $\widetilde{O}(\frac{1}{\theta}pd_{0}w_{1}^{1+1/T^{\prime}})$ since the expected size of $\mathcal{J}$ is at most $8p\frac{n}{\theta w_{1}}\leq\frac{16p}{\theta}\sqrt{n}\leq\frac{32p}{\theta}d_{0}$ and each call of SLOW-GED costs $w_{1}^{1+1/T^{\prime}}$ .

Now suppose $j>1$. The loops on $i^{\prime}$ and $I^{\prime}$ starting in lines (11-12) are executed $\widetilde{O}(1)$ times. The construction of $\mathcal{J}^{\prime}$ in line (13) using ZoomIn takes $\widetilde{O}(|\mathcal{J}^{\prime}|)$ time (sort $\mathcal{J}$ in the natural order and build $\mathcal{J}^{\prime}$ "from left to right"). By Proposition 4.2, for each $J^{\prime}\in\textrm{Intervals}(w_{j-1},{\varepsilon(i^{\prime}+3)})$, the number of ${\varepsilon(i+3)}$-aligned $w_{j}$-intervals $J$ such that $J^{\prime}\in\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$ is at most 33. Since $\mathcal{J}$ is selected according to a probability distribution in which no interval $J$ belongs to $\mathcal{J}$ with probability more than $p$, $\mathcal{J}^{\prime}$ is sampled according to some distribution where for each $J^{\prime}\in\textrm{Intervals}(w_{j-1},{\varepsilon(i^{\prime}+3)})$ the probability of $J^{\prime}\in\mathcal{J}^{\prime}$ is at most $33p$. Hence, the expected size of $\mathcal{J}^{\prime}$ is at most $33p\frac{n}{w_{j-1}{\varepsilon(i^{\prime}+3)}}=O(p\frac{w_{j}}{\theta})$, since $w_{j}\geq w_{j-1}\geq\lfloor\sqrt{n}\rfloor_{2}$ and ${\varepsilon(i^{\prime}+3)}\geq\theta/8$. This is dominated by the summand for $h=j$ in (10), which is at least $\frac{p}{\theta}d_{j-1}w_{j}^{1+1/T^{\prime}}$.

By induction hypothesis, the recursive call to EnumerateClose in line (14) takes expected time $\widetilde{O}(\frac{33p}{\theta}\sum_{1\leq h\leq j-1}d_{h-1}w_{h}^{1+1/T^{\prime}})$ which is $\widetilde{O}(\frac{p}{\theta}\sum_{1\leq h\leq j-1}d_{h-1}w_{h}^{1+1/T^{\prime}})$ .

The final loop (22-26) on $J\in\mathcal{K}$ requires $O(|\mathcal{K}|w_{j}^{1+1/T^{\prime}})$ time, so we need to bound the size of $\mathcal{K}$. $\mathcal{K}$ is created in the loop on $i^{\prime},I^{\prime}$. As noted, there are $\widetilde{O}(1)$ iterations of these loops, so it suffices to bound the number of elements added to $\mathcal{K}$ for a single choice of $I^{\prime},i^{\prime}$. During lines (15-17), for each $J\in\mathcal{J}$, $J$ is added to $\mathcal{K}$ if there is a $J^{\prime}\in\mathcal{S}^{\prime}$ that is in $\textrm{ZoomIn}(j,I\times J,i,I^{\prime},i^{\prime})$. By Proposition 4.2, each $J^{\prime}\in\mathcal{S}^{\prime}$ is responsible for the addition of at most 33 intervals to $\mathcal{K}$, so $|\mathcal{K}|\leq 33|\mathcal{S}^{\prime}|$. Now, $\mathcal{S}^{\prime}$ is the output of a call to $\textrm{EnumerateClose}(j-1,I^{\prime}\times\mathcal{J}^{\prime},i^{\prime})$ where $I^{\prime}\in\textrm{SparseSample}(j,I,i^{\prime})$. By the success condition for iteration $j-1$ of ProcessDense (Section 4.5) there are at most $2d_{j-1}$ intervals $J^{\prime}\in\textrm{Intervals}(w_{j-1},{\varepsilon(i^{\prime}+3)})$ classified as close for $i^{\prime}$. As observed in the previous paragraph, each of these at most $2d_{j-1}$ intervals belongs to $\mathcal{J}^{\prime}$ with probability at most $33p$, so the expected size of $\mathcal{S}^{\prime}$ is at most $66pd_{j-1}$. Thus the expected cost of the loop (22-26) is $\widetilde{O}(pd_{j-1}w_{j}^{1+1/T^{\prime}})$. Combining with the other loop gives the claimed time bound for EnumerateClose. ∎

Now we analyze the running time of $\textrm{Preprocess}(j)$. There are $\widetilde{O}(n/w_{j})$ pairs $(I,i)$ that are enumerated in the two outer loops. For each such pair, we construct $\textrm{Sparse}(j,I,i)$ (which takes $\widetilde{O}(1)$ time), and $\mathcal{B}^{\textrm{below}}(j,I,i)$ whose running time is $\widetilde{O}(1)$ if $j=1$ and is $\widetilde{O}(w_{j}+|\mathcal{R}(j-1,I)|+|\textrm{Intervals}(w_{j},{\varepsilon(i+3)})|)$ for $j>1$, which is the time to run APM. Summing over $O(\log(n))$ values of $i$ and noting that ${\varepsilon(i)}\geq\theta$, we obtain the upper bound $\widetilde{O}(w_{j}+|\mathcal{R}(j-1,I)|+\frac{n}{\theta w_{j-1}})$. Summing over $I$ gives $\widetilde{O}(n+|\mathcal{R}(j-1)|+\frac{n^{2}}{\theta w_{j-1}w_{j}})$. $|\mathcal{R}(j-1)|$ is at most the number of level $j-1$ candidates $\langle i;I^{\prime}\times J^{\prime}\rangle$, which is at most $\widetilde{O}(\frac{n^{2}}{\theta w_{j-1}^{2}})$. Since $w_{h}\geq\lfloor\sqrt{n}\rfloor_{2}$ for all $h$ by assumption, the overall time for $\textrm{Preprocess}(j)$ is $\widetilde{O}(n/\theta)$. We observe that this term is dominated by the $h=j$ term in the inner sum of (9), which is $\frac{n}{\theta^{2}}w_{j}^{1/T^{\prime}}\frac{d_{j-1}}{d_{j}}\geq\frac{n}{\theta}$. The asymptotics of the running time do not depend on the choice of random bits $B_{\mathrm{SS}}^{j}$.

We now analyze the time of $\textrm{ProcessDense}(j)$ . We will condition the analysis on fixing the random bits $(B^{\leq j-1},B_{\mathrm{SS}}^{j})=\beta^{*j}$ so that $\mathbf{SG}\wedge\mathbf{SS}(\leq j)\wedge\mathbf{PD}(\leq j-1)$ holds. The multiplicative cost of the outer iteration on $i$ is absorbed in the $\widetilde{O}$ term. The main part is the while loop (lines 8-22) on $I\in\mathcal{T}$ . This cost is divided into two parts, the call to EnumerateClose within line (11), and the cost of (lines 14-20) which is only executed within the "else".

To bound the cost of the call to EnumerateClose in line (11), we want to apply Lemma 4.10. For the hypothesis of this lemma we need an upper bound $p^{\prime}$ on the probability of any particular $w_{j}$ interval being selected for $\mathcal{S}$ . According to the code of EnumerateClose, every interval is placed in $\mathcal{S}$ with probability at most $p=\min(1,c_{0}\log n/d_{j})$ . However we need to consider the probability of a given interval being placed in $\mathcal{S}$ conditioned on the event $\mathbf{PD}(j)$ , and this can be bounded above by $p/\Pr[\mathbf{PD}(j)]$ . As noted in Section 4.5, $\mathbf{PD}(j)$ occurs with probability at least $1-n^{-9}\geq 1/2$ so we can bound the conditional probability of any interval being placed in $\mathcal{S}$ by $2p$ . Applying Lemma 4.10, the expected time for the call to EnumerateClose in line (11) is $\widetilde{O}(\frac{2p}{\theta}\sum_{1\leq h\leq j}d_{h-1}w_{h}^{1+1/T^{\prime}})$ which is $\widetilde{O}(\frac{1}{\theta d_{j}}\sum_{1\leq h\leq j}d_{h-1}w_{h}^{1+1/T^{\prime}})$ . The number of times this is executed is the number of possible $I$ , which is at most $|\textrm{Intervals}(w_{j})|=n/w_{j}$ , so the overall expected cost of calls to EnumerateClose in line (11) is $\widetilde{O}(\frac{n}{\theta w_{j}d_{j}}\sum_{h=1}^{j}d_{h-1}w_{h}^{1+1/T^{\prime}})$ , as claimed in the theorem.

The time for executing (14-20) is dominated by the time of the two calls to EnumerateClose, which are bounded by $\widetilde{O}(\frac{1}{\theta}\sum_{h=1}^{j}d_{h-1}w_{h}^{1+1/T^{\prime}})$ using Lemma 4.10 with the trivial setting $p=1$. The number of times this is executed is bounded by the number of times in the loop on $I$ that $I$ is declared dense and used as a pivot. We claim that if $\mathbf{PD}(j)$ holds then the number of pivots is upper bounded by $O(\frac{n}{\theta w_{j}d_{j}})$. To see this, first note that if $I$ is chosen as a pivot then by Section 4.5, conditioning on $\mathbf{PD}(j)$ implies $|\textrm{EnumerateClose}(j,I\times\textrm{Intervals}(w_{j},{\varepsilon(i+3)}),i)|\geq d_{j}/2$. Furthermore, we claim that if $I$ and $I^{\prime}$ are both pivots then $\textrm{EnumerateClose}(j,I\times\textrm{Intervals}(w_{j},{\varepsilon(i+3)}),i)$ is disjoint from $\textrm{EnumerateClose}(j,I^{\prime}\times\textrm{Intervals}(w_{j},{\varepsilon(i+3)}),i)$. Suppose for contradiction that both are pivots and there is a $J$ in both sets, and that $I$ is selected first as a pivot. Then by the Soundness of EnumerateClose, $\textrm{ncost}(I\times J)\leq{\varepsilon(i-q_{j-1}-6)}$ and $\textrm{ncost}(I^{\prime}\times J)\leq{\varepsilon(i-q_{j-1}-6)}$, and so by the triangle inequality $\textrm{ncost}(I\times I^{\prime})\leq{\varepsilon(i-q_{j-1}-7)}={\varepsilon(h_{1})}$ (where $h_{1}$ is defined in the pseudocode of EnumerateClose). But in that case, the pseudocode of EnumerateClose ensures that $I^{\prime}$ is placed in $\mathcal{X}$ in line (16) and therefore removed from $\mathcal{T}$ in line (20), making it impossible for $I^{\prime}$ to be chosen as a pivot.

Since the sets $\textrm{EnumerateClose}(j,I\times\textrm{Intervals}(w_{j},{\varepsilon(i+3)}),i)$ corresponding to pivots are pairwise disjoint subsets of $\textrm{Intervals}(w_{j},{\varepsilon(i+3)})$, each of size at least $d_{j}/2$, and $|\textrm{Intervals}(w_{j},{\varepsilon(i+3)})|=O(\frac{n}{\theta w_{j}})$, the number of pivots is $O(\frac{n}{\theta w_{j}d_{j}})$. Multiplying this by the cost of a single execution of lines (14-20) as bounded above, the result is bounded above as claimed in the theorem. ∎

4.10 Choosing the parameters

The time analysis is expressed in terms of the parameters $w_{1},\ldots,w_{k}$ and $d_{0},\ldots,d_{k}$ . In this section we determine values of the parameters that achieve the claimed time bound. It is convenient to introduce parameters $\gamma_{1},\ldots,\gamma_{k}$ , $\delta_{0},\ldots,\delta_{k}$ and $\tau$ , with $w_{i}=\lfloor n^{\gamma_{i}}\rfloor_{2}$ and $d_{i}=\lfloor n^{\delta_{i}}\rfloor_{2}$ and $\theta=\lfloor n^{-\tau}\rfloor_{2}$ .

Recall that the parameters of gap-condition include $\zeta>0$ and we only need our gap algorithm to work for $\tau\leq\zeta$ . In the theorem we are allowed to choose $\zeta$ to be any positive constant. In the derivation below, we will see that we will need an upper bound on $\tau$ as a function of $T^{\prime}$ which will be used to determine $\zeta$ in the final proof of Theorem 4.1 in the next section.

We impose the following conditions.

• $d_{0}=w_{1}=\lfloor\sqrt{n}\rfloor_{2}$, so $\delta_{0}=\gamma_{1}=1/2$.

• $d_{k}=1$, so $\delta_{k}=0$.

The time for iteration $j$ is:

$$\widetilde{O}\Big(\frac{n}{\theta^{2}w_{j}d_{j}}\sum_{h=1}^{j}d_{h-1}w_{h}^{1+1/T^{\prime}}\Big).$$

Define

• $\alpha_{j}=(1-\gamma_{j}-\delta_{j}+2\tau)$

• $\nu_{i}=\delta_{i-1}+(\frac{T^{\prime}+1}{T^{\prime}})\gamma_{i}$

Then the cost of processing level $j$ can be rewritten as:

$$\widetilde{O}\Big(n^{\alpha_{j}}\sum_{h=1}^{j}n^{\nu_{h}}\Big).$$

We now choose $\gamma_{i}$ and $\delta_{i}$ subject to the following conditions:

• $\gamma_{1}=\delta_{0}=1/2$.

• $\alpha_{j}$ is the same for all $j$.

• $\nu_{i}$ is the same for all $i$.

• $\delta_{k}=0$.

It is easy to check that for any $B\geq 0$ , the first three conditions are satisfied by:

[TABLE]

The condition $\delta_{k}=0$ implies:

[TABLE]

Then $\alpha_{j}=1-\gamma_{j}-\delta_{j}+2\tau=\frac{B}{T^{\prime}+1}+2\tau$ and $\nu_{i}=1+\frac{1}{2T^{\prime}}$ . So the time for all iterations is:

$$\widetilde{O}\Big(n^{1+\frac{1}{2T^{\prime}}+\frac{B}{T^{\prime}+1}+2\tau}\Big).$$

As indicated earlier, we will impose the condition $\tau\leq\frac{3T^{\prime}-2}{6(6(T^{\prime})^{3}+7(T^{\prime})^{2}+T^{\prime})}$.

For fixed $T^{\prime}\geq 1$ , $B_{k}(T^{\prime})$ is a decreasing function of $k$ whose limiting value is $1/2$ . So we choose $k=k(T^{\prime})$ to be large enough so that $B\leq\frac{3T^{\prime}+1}{6T^{\prime}}$ . While the value $k(T^{\prime})$ is not important, it is straightforward to verify that we can choose $k(T^{\prime})=\lceil(T^{\prime}+1)(1+\ln(T^{\prime}+1))\rceil$ .
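The four conditions above determine the exponents uniquely, which can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `parameter_schedule` is hypothetical, and the closed form used for $B$ is an assumption derived here from the conditions ($\gamma_{j}+\delta_{j}=1-\frac{B}{T^{\prime}+1}$, $\nu_{i}=1+\frac{1}{2T^{\prime}}$, $\delta_{k}=0$) rather than quoted from the text. Exact rational arithmetic is used so that $\delta_{k}=0$ can be tested exactly.

```python
from fractions import Fraction


def parameter_schedule(T_prime, k):
    """Return (B, gamma, delta) satisfying the four conditions of Section 4.10.

    Assumption (derived, not quoted): with rho = T'/(T'+1), the condition
    delta_k = 0 forces B = 1 / (2 * (1 - rho**k)).
    """
    T = Fraction(T_prime)
    rho = T / (T + 1)
    B = 1 / (2 * (1 - rho ** k))
    c = 1 - B / (T + 1)          # common value of gamma_j + delta_j (alpha_j = 1 - c + 2*tau)
    nu = 1 + 1 / (2 * T)         # common value of nu_i
    gamma = {}
    delta = {0: Fraction(1, 2)}  # delta_0 = 1/2
    for j in range(1, k + 1):
        # nu = delta_{j-1} + ((T'+1)/T') * gamma_j, solved for gamma_j
        gamma[j] = (nu - delta[j - 1]) * rho
        # gamma_j + delta_j = c, solved for delta_j
        delta[j] = c - gamma[j]
    return B, gamma, delta


if __name__ == "__main__":
    B, gamma, delta = parameter_schedule(1, 3)
    print("B =", B)
    print("gamma =", [str(gamma[j]) for j in sorted(gamma)])
    print("delta =", [str(delta[j]) for j in sorted(delta)])
```

Under this assumption $B$ decreases toward $1/2$ as $k$ grows, in agreement with the behaviour of $B_{k}(T^{\prime})$ described above.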

Using the above choice for $B$ , the exponent of $n$ is at most $1+\frac{1}{2T^{\prime}}+\frac{3T^{\prime}+1}{6T^{\prime}(T^{\prime}+1)}+2\tau$ and a computation shows that setting $T=T^{\prime}+1/6$ and imposing $\tau\leq\frac{3T^{\prime}-2}{6(6(T^{\prime})^{3}+7(T^{\prime})^{2}+T^{\prime})}$ (which we can do since $T^{\prime}\geq 1$ ) results in an upper bound on the exponent of $1+1/T$ as required.
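The computation referred to above can be written out as follows. This is a worked check using only the quantities already defined ($T=T^{\prime}+1/6$, so $1/T=\frac{6}{6T^{\prime}+1}$), not additional material from the paper.

```latex
\begin{align*}
\frac{1}{2T'}+\frac{3T'+1}{6T'(T'+1)}
  &= \frac{3(T'+1)+(3T'+1)}{6T'(T'+1)}
   = \frac{3T'+2}{3T'(T'+1)},\\[4pt]
\frac{1}{T}-\frac{3T'+2}{3T'(T'+1)}
  &= \frac{6}{6T'+1}-\frac{3T'+2}{3T'(T'+1)}
   = \frac{18T'(T'+1)-(3T'+2)(6T'+1)}{3T'(T'+1)(6T'+1)}\\[4pt]
  &= \frac{3T'-2}{3\bigl(6(T')^{3}+7(T')^{2}+T'\bigr)}.
\end{align*}
```

Hence the exponent is at most $1+1/T$ exactly when $2\tau\leq\frac{3T^{\prime}-2}{3(6(T^{\prime})^{3}+7(T^{\prime})^{2}+T^{\prime})}$, which is the stated condition on $\tau$; the right-hand side is positive because $T^{\prime}\geq 1$.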

Finally, we need to verify the assumptions (1) and (2) that $\frac{n}{w_{j}}\geq d_{j}$ and $\frac{w_{j}}{w_{j+1}}\leq\theta/2$. The former is immediate as $\gamma_{j}+\delta_{j}\leq 1$. For the latter, letting $M^{\prime}=-\frac{1}{\log(n)}\log(\max_{j}2w_{j}/w_{j+1})$, we require that $\theta\geq n^{-M^{\prime}}$, which we can ensure for $n$ large enough by choosing $\zeta<M$, where $M=\min_{j}(\gamma_{j+1}-\gamma_{j})$.

4.11 Tying up the proof of Theorem 4.1

We have that SLOW-GED is a gap algorithm for edit distance satisfying $\textrm{gap-condition}(T^{\prime},\zeta^{\prime},Q^{\prime})$ where $T^{\prime}\geq 1$, $\zeta^{\prime}>0$ and $Q^{\prime}\geq 1$. We have shown that FAST-GED (using SLOW-GED as a subroutine) satisfies $\textrm{gap-condition}(T,\zeta,Q)$ with $T=T^{\prime}+1/6$, where $\zeta>0$ and $Q\geq 1$ are suitably chosen (depending only on $T^{\prime}$, $\zeta^{\prime}$ and $Q^{\prime}$). In Section 4.8 we proved that FAST-GED has quality $Q=2^{q_{k}+6}$. In Section 4.10 we adjusted the parameters so that the running time computed in Section 4.9 is $\widetilde{O}(n^{1+1/T})$ provided that $\theta\geq n^{-\frac{3T^{\prime}-2}{6(6(T^{\prime})^{3}+7(T^{\prime})^{2}+T^{\prime})}}$, $\theta\geq n^{-M/2}$ (where $M$ is defined in Section 4.10) and also $\theta\geq n^{-\zeta^{\prime}}$. So we set $\zeta=\min(\zeta^{\prime},M/2,\frac{3T^{\prime}-2}{6(6(T^{\prime})^{3}+7(T^{\prime})^{2}+T^{\prime})})$.

5 Proof of Theorem 1.1

Here we present the (routine) construction of the algorithm $\textrm{\bf{FAST-ED-UB}}^{T}$ promised by Theorem 1.1. Given $T$, let $\zeta(T)$ and $Q(T)$ be given by Theorem 1.2.

On input $x,y$, $\textrm{\bf{FAST-ED-UB}}^{T}$ defines $i_{\max}=\lfloor\zeta\log n\rfloor$ and for $i$ from 1 to $i_{\max}$, runs FAST-GED on input $(x,y,\theta=2^{-i},\delta=1/(\zeta n\log n))$. Define $i^{*}=0$ if none of the runs accepts, and otherwise define $i^{*}$ to be the largest index for which run $i^{*}$ accepts. $\textrm{\bf{FAST-ED-UB}}^{T}$ outputs $Q2^{-i^{*}}n$. This is an upper bound on $d_{\textrm{edit}}(x,y)$ since if $i^{*}=0$ then the output is $Qn\geq n$, and otherwise the first requirement of gap-condition ensures that $d_{\textrm{edit}}(x,y)\leq Q2^{-i^{*}}n$.

We claim that for $R=2Q$, the probability that the output exceeds $R(d_{\textrm{edit}}(x,y)+n^{1-\zeta})$ is at most $1/n$. If $i^{*}=i_{\max}$ then the output is at most $2Qn^{1-\zeta}\leq R(d_{\textrm{edit}}(x,y)+n^{1-\zeta})$. So assume $i^{*}<i_{\max}$. Say that the $i$th run of FAST-GED fails if $d_{\textrm{edit}}(x,y)\leq 2^{-i}n$ and the algorithm rejects. The probability that some run fails is at most $\delta\zeta\log n\leq 1/n$, so the probability that no run fails is at least $1-1/n$. If no run fails then in particular run $i^{*}+1$ does not fail, and since it rejects (by the choice of $i^{*}$) we conclude that $d_{\textrm{edit}}(x,y)>2^{-1-i^{*}}n$ and so $Q2^{-i^{*}}n\leq Rd_{\textrm{edit}}(x,y)\leq R(d_{\textrm{edit}}(x,y)+n^{1-\zeta})$. Hence $\textrm{\bf{FAST-ED-UB}}^{T}$ has all of the required properties.
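The reduction from a gap algorithm to an upper-bound algorithm can be sketched in a few lines. The code below is an illustration under stated assumptions, not the paper's implementation: `fast_ed_ub` and `exact_gap_tester` are hypothetical names, the logarithm is taken base 2 (consistent with $\theta=2^{-i}$), and the gap tester is a stand-in that uses exact quadratic dynamic programming with $Q=1$ in place of FAST-GED.

```python
import math


def edit_distance(x, y):
    """Textbook quadratic DP; used only by the stand-in gap tester below."""
    prev = list(range(len(y) + 1))
    for a, cx in enumerate(x, 1):
        cur = [a]
        for b, cy in enumerate(y, 1):
            cur.append(min(prev[b] + 1, cur[b - 1] + 1, prev[b - 1] + (cx != cy)))
        prev = cur
    return prev[-1]


def exact_gap_tester(x, y, n, theta, delta):
    """Hypothetical stand-in for FAST-GED with Q = 1: accept iff ED <= theta * n.

    delta is ignored because this stand-in is deterministic.
    """
    return edit_distance(x, y) <= theta * n


def fast_ed_ub(x, y, n, zeta, Q, gap_tester):
    """Run the gap tester for theta = 2^-1, ..., 2^-i_max and output Q * 2^-i* * n."""
    i_max = math.floor(zeta * math.log2(n))
    delta = 1.0 / (zeta * n * math.log2(n))
    i_star = 0  # stays 0 if no run accepts
    for i in range(1, i_max + 1):
        if gap_tester(x, y, n, 2.0 ** -i, delta):
            i_star = i  # keep the largest accepting index
    return Q * 2.0 ** -i_star * n


if __name__ == "__main__":
    x = "a" * 64
    y = "a" * 60 + "bbbb"
    print(fast_ed_ub(x, y, n=64, zeta=0.5, Q=1, gap_tester=exact_gap_tester))
```

With $n=64$ and $\zeta=1/2$ the loop tries $\theta\in\{1/2,1/4,1/8\}$; all three runs accept for this pair, so the sketch returns $Q\cdot 2^{-3}\cdot 64=8$, which upper-bounds the true distance of 4 and is below $R(d_{\textrm{edit}}(x,y)+n^{1-\zeta})=24$.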

6 Approximate Pattern Matching

In this section we describe the implementation of the function $\textrm{APM}(I\times\mathcal{J},\epsilon,\mathcal{R})$ from Section 4.3. This is a synthesis of algorithms from [18, 17].

We assume that $\mathcal{R}$ contains certified boxes and all $J\in\mathcal{J}$ are of the same width $\mu(I)$ .

Let $\max(\mathcal{J})=\{\max(J):J\in\mathcal{J}\}$ and $\min(\mathcal{J})=\{\min(J):J\in\mathcal{J}\}$. Let $\mathcal{R}^{+}$ be $\mathcal{R}$ augmented by auxiliary shortcut edges of cost 0 from $(\min(I),0)$ to $(\min(I),m)$ for all $m\in\min(\mathcal{J})$. Also for $J\in\mathcal{J}$ let $J^{0}$ denote the interval $\{0,\ldots,\max(J)\}$. The following was observed in [18]:

Proposition 6.1.

For all $J\in\mathcal{J}$, we have $\textrm{cost}_{\mathcal{R}^{+}}(I\times J^{0})\leq\textrm{cost}_{\mathcal{R}}(I\times J)$ and $\textrm{cost}(I\times J)\leq 2\textrm{cost}_{\mathcal{R}^{+}}(I\times J^{0})$.

Proof.

For the first inequality consider a min-cost traversal $\tau$ of $I\times J$ in the shortcut graph $\widetilde{G}(\mathcal{R})$. We construct a traversal $\tau^{\prime}$ of $I\times J^{0}$ of cost at most $\textrm{cost}_{\mathcal{R}}(\tau)$. Consider the first shortcut edge $e=(i,j)\rightarrow(i^{\prime},j^{\prime})$ of $\tau$. We may assume that prior to $e$, the path consists of a (possibly empty) sequence of horizontal edges followed by a (possibly empty) sequence of vertical edges. The final such horizontal edge ends at $(\min(I),j)$ and $j\in\min(\mathcal{J})$, so in $\widetilde{G}(\mathcal{R}^{+})$ we can replace the horizontal path by the shortcut edge $(\min(I),0)\rightarrow(\min(I),j)$ of cost 0 to get a path that is no more costly.

For the second inequality, consider a min-cost traversal $\rho$ of $I\times J^{0}$ in $\widetilde{G}(\mathcal{R}^{+})$. Let $j=0$ if the path does not use one of the auxiliary shortcut edges, and otherwise let $j$ be such that the path starts with auxiliary shortcut edge $(\min(I),0)\rightarrow(\min(I),j)$. Let $\hat{J}=\{j,\ldots,\max(J)\}$. Then the remaining portion of $\rho$ is a min-cost traversal $\hat{\rho}$ of $I\times\hat{J}$. Since $\widetilde{G}(\mathcal{R})$ is certified, $|\mu(\hat{J})-\mu(I)|\leq\textrm{cost}(I\times\hat{J})\leq\textrm{cost}_{\mathcal{R}}(\hat{\rho})=\textrm{cost}_{\mathcal{R}^{+}}(\hat{\rho})=\textrm{cost}_{\mathcal{R}^{+}}(I\times J^{0})$. Also $|J\Delta\hat{J}|=|\mu(\hat{J})-\mu(I)|=|\min(J)-j|\leq\textrm{cost}_{\mathcal{R}^{+}}(I\times J^{0})$. So $\textrm{cost}(I\times J)\leq\textrm{cost}(I\times\hat{J})+|J\Delta\hat{J}|\leq 2\textrm{cost}_{\mathcal{R}^{+}}(I\times J^{0})$. ∎

So if we compute $\textrm{cost}_{\mathcal{R}^{+}}(I\times J^{0})$ for every $J\in\mathcal{J}$ , and output the set of all $J$ for which this cost is less than $\kappa\mu(I)$ , we will satisfy the requirements of APM. We now describe a slightly modified version of an algorithm from [17] that accomplishes this in time $\widetilde{O}(|\mathcal{R}^{+}|)$ .

Let ${\widetilde{H}}$ be the graph $\widetilde{G}(\mathcal{R}^{+})$ with each cost $c_{e}$ of $e=(i,j)\to(i^{\prime},j^{\prime})$ replaced by the benefit $b_{e}=(i^{\prime}-i)+(j^{\prime}-j)-c_{e}$ (so H and V edges have benefit 0). For any interval $B$, the cost of a min-cost traversal of $I\times B$ in $\widetilde{G}(\mathcal{R}^{+})$ is $\mu(I)+\mu(B)$ minus the benefit of a max-benefit traversal of $I\times B$ in ${\widetilde{H}}$. So it suffices to compute the benefit of a max-benefit traversal of $I\times J^{0}$ in ${\widetilde{H}}$ for all $J\in\mathcal{J}$.

To do this, let $j_{1}<\cdots<j_{r}$ be the distinct second coordinates of the heads and tails of shortcut edges in $\widetilde{G}(\mathcal{R}^{+})$ . We use a binary tree data structure with leaves corresponding to the indices of $I$ , where each tree node $v$ stores a number $a_{v}$ , and a collection of lists $L_{1}$ ,…, $L_{r}$ , where $L_{h}$ stores pairs $(e,q(e))$ where the head of $e$ has $y$ -coordinate $j_{h}$ and $q(e)$ is the max benefit of a path from $(\min(I),0)$ that ends with $e$ .

We proceed in rounds $h=1,\dots,r$. In round $h$, let $A_{h}$ consist of all the shortcuts whose tail has vertical coordinate $j_{h}$. The preconditions for round $h$ are: (1) for each leaf $i$, the stored value $a_{i}$ is the max benefit of a path to $(i,j_{h})$ that includes a shortcut whose head has horizontal coordinate $i$ (or 0 if there is no such path), (2) for each internal node $v$, $a_{v}=\max\{a_{i}:i\text{ is a leaf in the subtree of }v\}$, and (3) for every shortcut edge $e=(i^{\prime},j_{h^{\prime}})\to(i^{\prime\prime},j_{h^{\prime\prime}})$ with $h^{\prime}<h$, the value $q(e)$ has been computed and $(e,q(e))$ is in list $L_{h^{\prime\prime}}$.

During round $h$ , for each shortcut $e=(i,j_{h})\to(i^{\prime},j_{h^{\prime}})$ in $A_{h}$ , $q(e)$ equals the max of $a_{\ell}+b_{e}$ over tree leaves $\ell$ with $\ell\leq i$ . This can be computed in $O(\log n)$ time as max $a_{v}+b_{e}$ , where $v$ ranges over the union of $\{i\}$ with the set of left children of vertices on the root-to- $i$ path that are not themselves on the path. Add $(e,q(e))$ to list $L_{h^{\prime}}$ . After processing $A_{h}$ , update the binary tree: for each $(e,q(e))\in L_{h+1}$ , let $i$ be the horizontal coordinate of the head of $e$ and for all vertices $v$ on the root-to- $i$ path, replace $a_{v}$ by $\max(a_{v},q(e))$ . The tree then satisfies the precondition for round $h+1$ .

To obtain the output to APM, for each $J\in\mathcal{J}$, let $h(J)$ be the index of the last round for which $j_{h(J)}\leq\max(J)$. The benefit of $I\times J^{0}$ is the value of $a_{v_{0}}$ at the end of round $h(J)$, where $v_{0}$ is the root.

For the runtime analysis, note that it would take $\widetilde{O}(\mu(I))$ time to set up the full tree data structure, so we build it incrementally, expanding only the parts of the data structure that contain non-zero values. Hence, the setup cost of the data structure is $O(1)$. It takes $O(|\mathcal{R}^{+}|\log|\mathcal{R}^{+}|)$ time to sort the shortcuts, and $O(\log\mu(I))$ processing time per shortcut (computing $q(e)$ and later updating the data structure), giving an overall runtime of $\widetilde{O}(|\mathcal{R}^{+}|+|\mathcal{J}|)$.
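The round-by-round sweep described above can be illustrated with a short sketch. It is not the implementation from [17]: the function name `max_benefit_sweep` is hypothetical, a Fenwick tree for prefix maxima over compressed head coordinates is swapped in for the explicit binary tree with root-to-leaf updates, and every shortcut is assumed to strictly increase the vertical coordinate. Auxiliary shortcut edges would be passed in like any other shortcut, with cost 0.

```python
from bisect import bisect_right


def max_benefit_sweep(shortcuts, targets):
    """shortcuts: list of ((i, j), (i2, j2), cost) with i <= i2 and j < j2 (assumed).
    targets: vertical coordinates at which to report.
    Returns {y: max total benefit of a chain of shortcuts whose heads have
    vertical coordinate <= y}, where benefit = (i2 - i) + (j2 - j) - cost.
    """
    xs = sorted({head[0] for _, head, _ in shortcuts})  # compressed head x-coordinates
    tree = [0] * (len(xs) + 1)                          # Fenwick tree storing prefix maxima

    def update(pos, val):
        while pos <= len(xs):
            tree[pos] = max(tree[pos], val)
            pos += pos & -pos

    def query(pos):
        best = 0  # the empty chain has benefit 0
        while pos > 0:
            best = max(best, tree[pos])
            pos -= pos & -pos
        return best

    # Event kinds at equal y: 0 = insert a head, 1 = evaluate a tail, 2 = report.
    events = []
    for idx, (tail, head, _) in enumerate(shortcuts):
        events.append((head[1], 0, idx))
        events.append((tail[1], 1, idx))
    events += [(y, 2, t) for t, y in enumerate(targets)]
    events.sort()

    q = [0] * len(shortcuts)  # q[idx] = max benefit of a path ending with shortcut idx
    out = {}
    for y, kind, idx in events:
        if kind == 0:
            update(bisect_right(xs, shortcuts[idx][1][0]), q[idx])
        elif kind == 1:
            (i, j), (i2, j2), cost = shortcuts[idx]
            # best chain whose last head has x-coordinate <= i, extended by this shortcut
            q[idx] = query(bisect_right(xs, i)) + (i2 - i) + (j2 - j) - cost
        else:
            out[y] = query(len(xs))
    return out


if __name__ == "__main__":
    shortcuts = [((0, 0), (2, 2), 1), ((2, 2), (4, 4), 0), ((1, 3), (4, 5), 2)]
    print(max_benefit_sweep(shortcuts, [1, 2, 4, 5]))
```

In the example the first two shortcuts chain (benefits 3 and 4), while the third cannot follow the first because its tail lies to the left of the first head. Converting back, a traversal from $(0,0)$ to $(4,4)$ has length 8 and best benefit 7, hence cost 1, which equals the sum of the two chained shortcut costs.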

Bibliography

[1] Amir Abboud and Arturs Backurs. Towards hardness of approximation for polynomial time problems. In 8th Innovations in Theoretical Computer Science Conference, ITCS 2017, January 9-11, 2017, Berkeley, CA, USA, pages 11:1–11:26, 2017.
[2] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 59–78, 2015.
[3] Amir Abboud, Thomas Dueholm Hansen, Virginia Vassilevska Williams, and Ryan Williams. Simulating branching programs with edit distance and friends: or: a polylog shaved is a lower bound made. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 375–388, 2016.
[4] Alex Andoni. Simpler constant-factor approximation to edit distance problems. Manuscript, 2018.
[5] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Polylogarithmic approximation for edit distance and the asymmetric query complexity. In 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA, pages 377–386, 2010.
[6] Alexandr Andoni and Huy L. Nguyen. Near-optimal sublinear time algorithms for Ulam distance. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 76–86, 2010. doi:10.1137/1.9781611973075.8.
[7] Alexandr Andoni and Krzysztof Onak. Approximating edit distance in near-linear time. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, STOC ’09, pages 199–204, New York, NY, USA, 2009. ACM.
[8] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC ’15, pages 51–58, New York, NY, USA, 2015. ACM.
