This paper presents a randomized algorithm that approximates the edit distance between two strings in nearly linear time, providing constant factor guarantees for inputs with sufficiently large edit distance.
Contribution
It introduces a nearly linear time algorithm that achieves constant factor approximation for edit distance on far input pairs, improving efficiency.
Findings
1. Runs in time $\tilde{O}(n^{1+1/T})$ for any $T \geq 1$
2. Provides a constant factor approximation when edit distance is large
3. Achieves high probability guarantees for the approximation
Abstract
For any $T \geq 1$, there are constants $R = R(T) \geq 1$ and $\zeta = \zeta(T) > 0$ and a randomized algorithm that takes as input an integer $n$ and two strings $x, y$ of length at most $n$, runs in time $\tilde{O}(n^{1+1/T})$, and outputs an upper bound $U$ on the edit distance $ED(x,y)$ that, with high probability, satisfies $U \leq R\,(ED(x,y) + n^{1-\zeta})$. In particular, on any input with $ED(x,y) \geq n^{1-\zeta}$ the algorithm outputs a constant factor approximation with high probability. A similar result has been proven independently by Brakensiek and Rubinstein (2019).
Constant factor approximations to edit distance on far input pairs in nearly linear time
Michal Koucký
The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013)/ERC Grant Agreement no. 616787.
Partially supported by the Grant Agency of the Czech Republic under the grant agreement no. 19-27871X.
Computer Science Institute of Charles University,
Malostranské náměstí 25,
118 00 Praha 1, Czech Republic
Michael Saks
Supported in part by Simons Foundation under award 332622.
Department of Mathematics, Rutgers University, Piscataway, NJ, USA
Abstract
For any $T \geq 1$, there are constants $R = R(T) \geq 1$ and $\zeta = \zeta(T) > 0$ and a randomized algorithm
that takes as input an integer $n$ and two strings $x, y$ of length at most $n$, runs in time $\tilde{O}(n^{1+1/T})$, and outputs an upper bound $U$
on the edit distance $d_{\mathrm{edit}}(x,y)$ that, with high probability, satisfies $U \leq R\,(d_{\mathrm{edit}}(x,y) + n^{1-\zeta})$. In particular, on any input with
$d_{\mathrm{edit}}(x,y) \geq n^{1-\zeta}$ the algorithm outputs a constant factor approximation with high probability. A similar result has been proven
independently by Brakensiek and Rubinstein [14].
1 Introduction
The edit distance (or Levenshtein distance) [21] between strings $x, y$,
denoted by $d_{\mathrm{edit}}(x,y)$, is the minimum number of character insertions, deletions, and substitutions needed to convert $x$ into $y$.
It was recently shown independently that edit distance can be approximated within a constant factor in truly subquadratic time in the quantum computation model [12, 13]
and in the classical model [16, 17]. The running time for a classical algorithm obtained
in [16, 17] is $\tilde{O}(n^{12/7})$, which was improved by Andoni [4] to $O(n^{3/2+\epsilon})$.
This raises the natural question: what is the best possible running time of a constant factor approximation classical algorithm?
We make progress on this problem by developing a nearly linear time algorithm that gives a constant factor approximation when restricted to inputs whose edit distance is not too small:
Theorem 1.1.
For every $T \geq 1$ there are constants $\zeta = \zeta(T)$ and $R = R(T)$ and a randomized algorithm FAST-ED-UB$_T$ that takes as input an integer $n$ and
two strings $x$ and $y$, with $|x|, |y| \leq n$, over an (arbitrary) alphabet $\Sigma$, and runs in time
$\tilde{O}(n^{1+1/T})$ and outputs an upper bound $U$ on $d_{\mathrm{edit}}(x,y)$, such that
with probability at least $1 - 1/n$, $U \leq R \cdot (d_{\mathrm{edit}}(x,y) + n^{1-\zeta})$.
In particular, on any input $x, y$ with $d_{\mathrm{edit}}(x,y) \geq n^{1-\zeta}$ the algorithm gives a constant factor approximation.
The additive $n^{1-\zeta}$ term arises from some technical limitations in our algorithm and analysis, but since known algorithms for the exact edit distance problem run faster on instances $x, y$ with small edit distance ([20, 24]),
we expect that it should be possible
to extend our result to give a nearly linear constant approximation algorithm for all ranges of edit distance.
Brakensiek and Rubinstein [14] independently obtained essentially the same theorem. While both our work and theirs build on the techniques of [16, 17, 12, 13], the algorithms have quite different structure.
Other prior work (quoted from [16].)
Edit distance can be evaluated exactly in quadratic time via
dynamic programming (Wagner and Fischer [25]).
Masek and Paterson [22] obtained the first (slightly) sub-quadratic $O(n^2/\log n)$ time algorithm, and the current asymptotically fastest algorithm (Grabowski [19]) runs in time
$O(n^2 \log\log n/\log^2 n)$.
Backurs and Indyk [8] showed that a truly sub-quadratic algorithm ($O(n^{2-\delta})$ for some $\delta > 0$) would imply a $2^{(1-\gamma)n}$ time algorithm
for CNF-satisfiability, contradicting the Strong Exponential Time Hypothesis (SETH). Abboud et al. [3] showed that even shaving an arbitrarily large polylog factor from $n^2$ would have the plausible, but apparently
hard-to-prove, consequence that NEXP does not have non-uniform $NC^1$ circuits. For further “barrier” results, see [2, 15].
There is a long line of work on approximating edit distance.
The exact $O(n + k^2)$ time algorithm (where $k$ is the edit distance of the input) of Landau et al. [20] yields a linear time $\sqrt{n}$-factor approximation.
This approximation factor was improved, first to $n^{3/7}$ [9], then to $n^{1/3+o(1)}$ [11] and later to $2^{\tilde{O}(\sqrt{\log n})}$ [7], all with slightly superlinear runtime.
Batu et al. [10] provided an $O(n^{1-\alpha})$-approximation algorithm with runtime $O(n^{\max\{\alpha/2,\, 2\alpha-1\}})$. The strongest result of this type is the $(\log n)^{O(1/\epsilon)}$ factor
approximation (for every $\epsilon > 0$) with running time $n^{1+\epsilon}$ of Andoni et al. [5].
Abboud and Backurs [1] showed that a truly sub-quadratic deterministic time $(1+o(1))$-factor approximation algorithm for edit distance would imply new circuit lower bounds.
Andoni and Nguyen [6] found a randomized algorithm that approximates Ulam distance of two permutations of $\{1,\ldots,n\}$ (edit distance with only insertions and deletions) within a (large) constant factor in time $\tilde{O}(\sqrt{n} + n/k)$, where $k$ is the Ulam distance of the input; this was improved by Naumovitz et al. [23] to a $(1+\varepsilon)$-factor approximation (for any $\varepsilon > 0$) with similar runtime.
Reduction to a Gap-Algorithm.
For simplicity we will assume that the bound n on the length max(∣x∣,∣y∣) is a power of 2 and ∣x∣=∣y∣=n. It is easy to reduce the general
case to this case: on input x′,y′, let n be the least power of 2 that is at least max(∣x′∣,∣y′∣) and pad
both x′ and y′ using a single new symbol to obtain strings x, y of length n.
It is easy to verify that dedit(x′,y′)≤dedit(x,y)≤2dedit(x′,y′), and so it suffices to approximate
dedit(x,y).
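The two inequalities can be justified in two short steps. The derivation below is a reconstruction added for the reader rather than text from the paper; it writes $\#$ for the new padding symbol, which is an assumed name.

```latex
% Lower bound: deleting every occurrence of # from both x and y cannot
% increase the edit distance, and it turns (x, y) back into (x', y').
d_{\mathrm{edit}}(x', y') \;\le\; d_{\mathrm{edit}}(x, y)

% Upper bound: first edit the prefix x' into y', then insert or delete
% padding symbols to correct the length difference, which is itself at
% most d_edit(x', y').
d_{\mathrm{edit}}(x, y)
  \;\le\; d_{\mathrm{edit}}(x', y') + \bigl|\,|x'| - |y'|\,\bigr|
  \;\le\; 2\, d_{\mathrm{edit}}(x', y')
```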
Following a common paradigm for approximation algorithms, our approximation algorithm is built by reducing to a gap algorithm.
In this paper, we consider randomized gap algorithms for edit distance.
These algorithms take as input
(n,θ,δ;x,y)
where n is an integral power of 2, x and y are strings of length n,
θ∈(0,1] is a nonnegative power of 1/2
and δ∈(0,1).
The triple (n,θ,δ) is referred to as the input parameters
and x,y as the input strings.
We say that the algorithm has quality Q with respect to (n,θ,δ) provided that for all
strings x,y of length n:
Gap Algorithm Soundness.
If dedit(x,y)>Qθn then the algorithm returns reject.
Gap Algorithm Completeness.
If dedit(x,y)≤θn then the algorithm returns accept with probability at least 1−δ.
We say that the algorithm satisfies gap-condition(T,ζ,Q), where T,Q≥1 and ζ≥0 provided that
for n a power of 2, and for all θ≥n−ζ
•
The algorithm has quality Q with respect to (n,θ,δ),
•
The running time of the algorithm on any input $(n,\theta,\delta;x,y)$ is $\tilde{O}(n^{1+1/T}\log(1/\delta))$ with probability 1. Here $\tilde{O}$
hides powers of $\log(n)$ whose exponent may depend on $T$.
We will prove:
Theorem 1.2.
For every $T \geq 1$ there are constants $\zeta = \zeta(T) > 0$ and $Q = Q(T) \geq 1$, and a gap-algorithm
GAP-ED$_T$ that satisfies gap-condition$(T,\zeta,Q)$.
In Section 5 we present the (routine) construction of the algorithm FAST-ED-UB$_T$ from GAP-ED$_T$, which proves Theorem 1.1.
The focus of the paper is on proving Theorem 1.2.
1.1 Speed-up routines
Our algorithm, like that of [16] is built from a core speed-up algorithm having access to an existing
"slow" gap algorithm. The speed-up algorithm produces a faster gap algorithm, with worse (but still constant) approximation quality,
while making queries to the slow algorithm on pairs of "short" substrings.
Given such a speed-up algorithm, one can build up a sequence of increasingly faster gap
algorithms $A_0, A_1, \ldots$, where $A_0$ is just the quadratic exact edit distance algorithm,
and $A_j$ is obtained by using the core speed-up algorithm with $A_{j-1}$ playing the role of the "slow" algorithm.
If the core speed-up algorithm involves some free parameters that may be optimized for best performance, this optimization
can be done separately for each $A_j$.
The core speed-up algorithm designed in [16] gives an algorithm $A_1$ that has running time $\tilde{O}(n^{12/7})$. The algorithms
$A_j$ are successively faster, but do not get below $n^{\phi}$ where $\phi = 1.61\ldots$.
The core speed-up algorithm we design in this paper gives a sequence of gap-algorithms where the
exponent of n in the run-time converges to 1.
2 Preliminaries
Many definitions and routine claims are adapted (with some modifications) from [17].
The edit distance of strings u,v is denoted dedit(u,v) and the normalized edit distance of u,v, denoted
Δedit(u,v) is defined to be dedit(u,v)/∣u∣.
Throughout the paper x,y denote two input strings of length n, where n is a power of 2, and z denotes the concatenation xy.
Intervals, Decompositions, aligned intervals, and δ-aligned intervals.
We consider intervals in {0,…,2n} which are as usual, subsets consisting of consecutive integers. The width of interval I, μ(I) is equal to max(I)−min(I)=∣I∣−1. Most intervals we consider have width a power of 2. An interval of width w is a w-interval.
Intervals index substrings of z, where
$z_I$ denotes the substring indexed by the set $I \setminus \{\min(I)\}$. (Note that $z_{\min(I)}$ is not part of $z_I$.) In particular, $z = z_{\{0,\ldots,2n\}}$, $x = z_{\{0,\ldots,n\}}$ and $y = z_{\{n,\ldots,2n\}}$.
A decomposition of an interval I is a sequence I1,…,Ik of intervals with min(I1)=min(I), max(Ik)=max(I)
and min(Ij+1)=max(Ij) for j∈{1,…,k−1}. Note that zI1,…,zIk partitions the string zI.
Let w be a power of 2 that is at most n, and let δ be a power of 2 that is at most 1.
An interval of width w is aligned if min(I) is a multiple of w (and consequently max(I) is also a multiple of w).
The interval is δ-aligned if min(I) is a multiple of max(δw,1) (and consequently so is max(I)). In particular a 1-aligned interval
is aligned.
We define:
•
Intervals(w) is the set of aligned intervals of width w, subsets of {0,…,n}.
•
Intervals(w,δ) to be the set of δ-aligned intervals of width w, subsets of {0,…,2n}.
•
For an interval I, Intervals(w;I)={I′∈Intervals(w):I′⊆I},
and Intervals(w,δ;I)={I′∈Intervals(w,δ):I′⊆I}.
Since n and w are powers of 2,
Intervals(w) is a decomposition of {0,…,n}. When we use the notation Intervals(w;I), I will be an aligned interval
of width a power of 2, so that Intervals(w;I) is a decomposition of I.
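The interval families above are easy to enumerate directly. The following is a minimal sketch, assuming intervals are represented as `(min, max)` pairs; the function names `intervals`, `delta_aligned_intervals` and `is_decomposition` are illustrative and do not appear in the paper.

```python
def intervals(w, n):
    """Intervals(w): aligned intervals of width w inside {0,...,n}."""
    return [(a, a + w) for a in range(0, n - w + 1, w)]


def delta_aligned_intervals(w, delta, n):
    """Intervals(w, delta): delta-aligned intervals of width w inside {0,...,2n}.

    An interval is delta-aligned if its minimum is a multiple of max(delta*w, 1).
    """
    step = max(int(delta * w), 1)
    return [(a, a + w) for a in range(0, 2 * n - w + 1, step)]


def is_decomposition(parts, interval):
    """Check that consecutive parts share endpoints and span the interval."""
    lo, hi = interval
    if not parts or parts[0][0] != lo or parts[-1][1] != hi:
        return False
    return all(parts[k][1] == parts[k + 1][0] for k in range(len(parts) - 1))
```

For example, with `n = 16` and `w = 4`, `intervals(4, 16)` yields four intervals that form a decomposition of `{0,...,16}`, while `delta_aligned_intervals(4, 0.5, 16)` steps by 2 and yields overlapping intervals.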
The grid {0,…,n}×{0,…,2n}, boxes and stacks.
Consider the grid {0,…,n}×{0,…,2n} lying in the coordinate plane.
For S⊆{0,…,n}×{0,…,n}, the horizontal projection πH(S) is the set of first coordinates of elements of S,
and the vertical projection of S, πV(S) is the set of second coordinates.
A box is a set I×J⊆{0,…,n}×{0,…,2n} for intervals I,J, and it represents
the pair xI,zJ of substrings. Since I⊆{0,…,n}, zI=xI. Note that if J⊆{0,…,n}
then zI,zJ is a pair of substrings of x and if J⊆{n,…,2n}, it is a pair (substring of x, substring of y).
I×J is a w-box if μ(I)=μ(J)=w.
The lower left hand corner is (min(I),min(J)) and the upper right hand corner is (max(I),max(J)).
Note that πH(I×J)=I and πV(I×J)=J.
Box I×J is horizontally aligned if I is aligned, and it is vertically δ-aligned or simply
δ-aligned if J is
δ-aligned; we have no need to refer to horizontally δ-aligned boxes. Box I×J is square if μ(I)=μ(J).
A stack is a set of boxes all having the same horizontal projection. For an interval $I$ and a set of intervals $\mathcal{J}$,
$I \times \mathcal{J}$ is the stack $\{I \times J : J \in \mathcal{J}\}$.
Grid graphs.
The grid graph of z, Gz, is a directed graph
with edge costs, having
vertex set {0,…,n}×{0,…,2n} and
all edges of the form (i−1,j)→(i,j) (H-edges), (i,j−1)→(i,j) (V-edges)
and $(i-1,j-1) \to (i,j)$ (D-edges). Every H-edge and V-edge costs 1, and a D-edge
has cost 1 if $z_i \neq z_j$ and 0 otherwise. $G_z$ is
acyclic, with edges moving "up and to the right". A directed path τ
joins a pair of vertices source(τ) and sink(τ) with source(τ)≤sink(τ).
The box spanned by τ is the unique minimal box I×J that contains τ;
this is equal to πH(τ)×πV(τ).
We say τ traverses I×J if
I×J is the box spanned by τ, which is equivalent to source(τ)=(min(I),min(J)) and sink(τ)=(max(I),max(J)).
A traversal of I×J is any path that traverses I×J.
For I⊆πH(τ), let τI denote the minimal subpath of τ whose horizontal projection is I.
Cost and normalized cost.
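The cost of a directed path $\tau$, $\mathrm{cost}(\tau)$, is the sum of the edge costs, and the normalized cost is
$\mathrm{ncost}(\tau) = \mathrm{cost}(\tau)/\mu(\pi_H(\tau))$. The cost of box $I \times J$,
$\mathrm{cost}(I \times J)$, is the min-cost of a traversal of $I \times J$,
and $\mathrm{ncost}(I \times J) = \mathrm{cost}(I \times J)/\mu(I)$.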
It is well known (and easy to see) that for any box I×J, a traversal of I×J corresponds to an alignment from a=zI to b=zJ, i.e.
a set of character deletions, insertions and substitutions that changes a to b, where
an H-edge (i−1,j)→(i,j) corresponds to "delete ai", a V-edge (i,j−1)→(i,j)
corresponds to "insert bj between ai and ai+1" and a D-edge (i−1,j−1)→(i,j) corresponds to
replace ai by bj, unless they are already equal. Thus:
Proposition 2.1.
The cost of an alignment corresponding to path τ is cost(τ). Thus for any I,J⊆{0,…,2n},
dedit(zI,zJ)=cost(I×J). In particular dedit(x,y)=cost({0,…,n}×{n,…,2n}).
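Proposition 2.1 can be checked on small inputs by computing the min-cost traversal with the standard quadratic dynamic program. This is a minimal sketch, assuming boxes are given as pairs of `(min, max)` intervals and that `z` is a 0-indexed Python string, so that $z_I$ corresponds to the slice `z[min(I):max(I)]`; the name `box_cost` is illustrative.

```python
def box_cost(z, I, J):
    """cost(I x J): min cost of a traversal of the box in the grid graph of z."""
    a = z[I[0]:I[1]]   # z_I excludes position min(I), hence this slice
    b = z[J[0]:J[1]]   # z_J, same convention
    # dist[v] = min cost of a path from the lower left corner to (u, v)
    dist = list(range(len(b) + 1))            # column u = 0: V-edges only
    for u in range(1, len(a) + 1):
        new = [u]                              # row v = 0: H-edges only
        for v in range(1, len(b) + 1):
            d_edge = 0 if a[u - 1] == b[v - 1] else 1
            new.append(min(dist[v] + 1,            # H-edge into (u, v)
                           new[v - 1] + 1,         # V-edge into (u, v)
                           dist[v - 1] + d_edge))  # D-edge into (u, v)
        dist = new
    return dist[-1]
```

With `x = "flaw"`, `y = "lawn"` and `z = x + y`, the call `box_cost(z, (0, 4), (4, 8))` computes the edit distance of `x` and `y`.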
Displacement of a box relative to a path or box.
The following easy fact (noted in [16]) relates the cost of two boxes having the same horizontal projection:
Proposition 2.2.
For intervals I,J,J′⊆{0,…,n},
∣cost(I×J)−cost(I×J′)∣≤∣JΔJ′∣, where Δ denotes symmetric difference.
Let τ be a path whose horizontal projection includes I.
The displacement of the square box I×J with respect to τ, disp(I×J,τ) is the smallest K such
that (min(I),min(J)) is within K vertical units of source(τI) and (max(I),max(J)) is within K vertical units of sink(τI).
We make a few easy observations.
Proposition 2.3.
Let τ be a path whose horizontal projection includes I and let I×J be a box.
Then cost(I×J)≤cost(τI)+2disp(I×J,τ).
Proof.
Let J′ be the vertical projection of τI. Then:
cost(I×J)≤cost(I×J′)+∣JΔJ′∣≤cost(τI)+∣JΔJ′∣≤cost(τI)+2disp(I×J,τ).
∎
The following fact (which is essentially the same as Proposition 3.4 of [17]) says that every path τ with projection I′ can be approximately covered by a δ-aligned box whose cost is close
to cost(τ) and whose displacement from τ is small:
Proposition 2.4.
Let I′ and J be intervals and suppose δ∈(0,1].
Let τ be a path lying inside of I′×J
whose horizontal projection is I′.
There is a δ-aligned interval J′ of width μ(I′) so that
disp(I′×J′,τI′)≤δμ(I′)+cost(τI′) and
ncost(I′×J′)≤2ncost(τI′)+δ.
Proof.
Let J be the vertical projection of τI′.
If μ(J)≥μ(I′) then let J^ be the interval of width μ(I′) with min(J^)=min(J).
Otherwise let J^ be any interval of width μ(I′) that contains J.
The box I′×J^ has displacement at most cost(τI′) from τI′, and
has cost at most 2cost(τI′). Finally, let J′ be obtained by shifting J^ up or down to the closest
δ-aligned interval. This shift is at most δ/2 units.
This increases both the displacement and the cost by at most δμ(I′).
∎
The diagonal of a square box I×J is the diagonal path joining (min(I),min(J)) to (max(I),max(J)).
Let I×J and I′×J′ be square boxes with I′⊆I. The displacement of I′×J′ with respect to I×J,
disp(I′×J′,I×J)
is the displacement of I′×J′ with respect to the diagonal of I×J, which is just the number of vertical units
one needs to shift I′×J′ so that its diagonal is a subpath of the diagonal of I×J.
Proposition 2.5.
Suppose τ traverses the square box I×J of width w. Then every point of τ is within vertical distance
cost(τ)/2 of the diagonal of I×J.
Proof.
Consider a point of τ expressed as P=(min(I)+u,min(J)+v). Then τ can be split into
two parts τ1, ending at P and τ2 starting at P. Then cost(τ)=cost(τ1)+cost(τ2)≥2∣v−u∣
which is twice the vertical distance of P to the diagonal of I×J.
∎
Weighted boxes and stacks, certified boxes and stacks, shortcut graphs.
A weighted box is a pair (I×J,κ) where κ≥0. If ncost(I×J)≤κ
we say that (I×J,κ) is a certified box.
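A weighted stack $(I \times \mathcal{J}, \kappa)$ is a pair where $I \times \mathcal{J}$ is a stack and $\kappa \geq 0$.
We associate $(I \times \mathcal{J}, \kappa)$ with the
set $\{(I \times J, \kappa) : J \in \mathcal{J}\}$.
If every box in $(I \times \mathcal{J}, \kappa)$ is certified, we call it a certified stack.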
Let G be the digraph on {0,…,n}×{0,…,2n} with arc set
$\{(i,j) \to (i',j') : i \leq i',\ j \leq j',\ (i,j) \neq (i',j')\}$. The edges
with $i < i'$ and $j < j'$ are called shortcuts. Associated to any weighted box $(I \times J, \kappa)$
there is a weighted shortcut edge $(\min(I),\min(J)) \to (\max(I),\max(J))$ with weight $\kappa\mu(I)$.
Given a set $R$ of weighted boxes, we define the weighted shortcut graph $G(R)$
to be the weighted directed graph consisting of all H-edges and V-edges with weight 1, and
all of the shortcut edges corresponding to the boxes in R. For a box
I×J, let costR(I×J) denote the minimum cost of a traversal
of I×J in G(R).
If every box in R is certified
we say that G(R) is a certified shortcut graph.
A certified shortcut graph $G(R)$
provides upper bounds on the edit distance.
We omit the proof of the following easy fact:
Proposition 2.6.
Let R be a set of certified boxes. For any box I×J,
dedit(zI,zJ)≤costR(I×J).
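The quantity $\mathrm{cost}_R(I \times J)$ can be computed by a shortest-path dynamic program over the grid points of the box. The sketch below is a plain quadratic reference version, not the faster routine the paper uses; the name `shortcut_cost` and the tuple representation `((i0, i1), (j0, j1), kappa)` of a weighted box are assumptions made for illustration.

```python
def shortcut_cost(boxes, I, J):
    """cost_R(I x J): min cost of a traversal of I x J in the shortcut graph G(R).

    Each weighted box ((a0, a1), (b0, b1), kappa) in `boxes` contributes a
    shortcut edge (a0, b0) -> (a1, b1) of weight kappa * (a1 - a0).
    """
    (i0, i1), (j0, j1) = I, J
    # Index shortcut edges by their head; only edges inside the box matter.
    into = {}
    for (a0, a1), (b0, b1), kappa in boxes:
        if i0 <= a0 and a1 <= i1 and j0 <= b0 and b1 <= j1:
            into.setdefault((a1, b1), []).append((a0, b0, kappa * (a1 - a0)))
    dist = {}
    for i in range(i0, i1 + 1):
        for j in range(j0, j1 + 1):
            if i == i0 and j == j0:
                best = 0
            else:
                best = float("inf")
                if i > i0:
                    best = min(best, dist[(i - 1, j)] + 1)   # H-edge
                if j > j0:
                    best = min(best, dist[(i, j - 1)] + 1)   # V-edge
                for a0, b0, weight in into.get((i, j), []):
                    best = min(best, dist[(a0, b0)] + weight)  # shortcut
            dist[(i, j)] = best
    return dist[(i1, j1)]
```

With no boxes, the only traversals use H-edges and V-edges, so the value is $\mu(I) + \mu(J)$; each certified box can only lower this upper bound.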
As discussed in Section 1.1, the main ingredient in [16] is a core speed-up algorithm
that has access to a slow edit distance approximation algorithm and uses it to build a faster approximation algorithm.
We review the main ideas of the core speed-up algorithm in [16], which provides the starting point for ours.
To simplify the description
we assume that the slow edit distance algorithm is just the quadratic exact edit distance algorithm.
In their work, they reduce to the case $\theta > n^{-1/5}$ and build a subquadratic time algorithm for the
gap-problem where $\theta \geq n^{-1/5}$.
The algorithm operates in two phases.
The discovery phase generates a set Q of certified boxes. In the shortest path phase the
algorithm evaluates the cost of ({0,…,n}×{n,…,2n}) in the shortcut graph G(R) where R is a set of
certified boxes obtained by a minor modification of Q. Proposition 2.6 implies that this
is an upper bound on dedit(x,y). The main work is to define the discovery phase to ensure that this upper bound is not too much bigger than
the true value. The shortest path phase is implemented by a straightforward variant of dynamic programming.
The discovery phase is
defined in terms of parameters $w_1 < d < w_2$, which are powers of 2 that are, respectively, approximately $n^{1/7}$, $n^{2/7}$ and $n^{3/7}$.
The set Q consists of certified w1-boxes and certified w2-boxes, and satisfies with high probability: for
every horizontally
aligned w2-box I×J, costR(I×J)≤C⋅[cost(I×J)+θw2] for some constant C. It is not difficult
to show that this implies that the upper bound on dedit(x,y) output by the shortest path inference phase will be at most C⋅[dedit(x,y)+θn],
which is enough to solve the gap-problem.
The algorithm generates boxes of width $w_1$ iteratively for $i$ from $0,\ldots,\log(1/\theta)$ and $\varepsilon(i) = 2^{-i}$.
For each horizontally aligned $I$, let $N_{\varepsilon(i)}(I)$ be the set of $J$ that are $\varepsilon(i+3)$-aligned and satisfy
$\mathrm{ncost}(I \times J) \leq \varepsilon(i)$.
Iteration $i$ starts by classifying each of the $n/w_1$ aligned $w_1$-intervals as dense or sparse, subject to the requirement that
every $I$ with $|N_{\varepsilon(i)}(I)| \geq 2d$ is classified as dense, and every $I$ with $|N_{\varepsilon(i)}(I)| \leq d/2$ is classified as sparse;
this classification of $I$ is
done with high probability by sampling $J$ at a rate $\log(n)/d$ and calling $I$ dense (resp. sparse) if at least (resp. at most) $\log(n)$ of the sample
are within distance $\varepsilon(i)$ of $I$.
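The sampling test can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of [16]: the function name `classify`, the predicate `is_close` (standing in for the check that $I \times J$ has normalized cost at most $\varepsilon(i)$), the use of base-2 logarithms, and the tie-breaking at exactly $\log(n)$ hits are all choices made here.

```python
import math
import random


def classify(I, candidates, is_close, d, n, rng=random):
    """Label I as 'dense' or 'sparse' from a random sample of candidate J's.

    Each candidate is kept independently with probability log(n)/d; I is
    called dense when at least log(n) sampled candidates are close to it.
    """
    threshold = math.log2(n)
    rate = min(1.0, threshold / d)
    sample = [J for J in candidates if rng.random() < rate]
    hits = sum(1 for J in sample if is_close(I, J))
    return "dense" if hits >= threshold else "sparse"
```

Only the sampled candidates are compared against `I`, so the expected number of distance evaluations per interval drops by a factor of about `d / log(n)` relative to checking every candidate.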
Next for each dense interval $I$ a set $\mathcal{J}(I)$ of $\varepsilon(i+3)$-aligned $w_1$-intervals $J$
is constructed such that $\mathrm{ncost}(I \times J) \leq 5\varepsilon(i)$ and $N_{\varepsilon(i)}(I) \subseteq \mathcal{J}(I)$.
For any given $I$ we can construct $\mathcal{J}(I)$ by computing its edit distance with every $\varepsilon(i)/8$-aligned interval,
in time $O(n w_1/\varepsilon(i))$. If we do this for all $n/w_1$ aligned intervals the time is
$\Theta(n^2/\varepsilon(i))$, but the restriction to dense intervals allows a savings of a factor of $\varepsilon(i) d$:
Initialize $D$ to be the set of dense aligned $w_1$-intervals. While $D \neq \emptyset$, choose $I \in D$ (the pivot for the current round), construct $X = N_{2\varepsilon(i)}(I)$ and $Y = N_{3\varepsilon(i)}(I)$, and certify
all boxes $(I' \times J', 5\varepsilon(i))$ for $I' \in X$ and $J' \in Y$. Delete $X$ from $D$
and continue. The number of pivots is thus only
$O(n/w_1\varepsilon(i)d)$ since the sets $N_{\varepsilon(i)}(I)$ are of size at least $d$ and
are disjoint for different pivots.
The rest of the discovery phase constructs a (relatively small) set of w2-boxes. For each horizontally aligned w2-interval I′, the
w1-subintervals of I′ that were declared sparse (over all iterations of i) are used to select a small subset J′(I′) of
the w2-intervals, and we certify each box I′×J′ for J′∈J′(I′) by computing their
edit distance exactly. The set J′(I′) is obtained as follows: For each i∈{0,…,log(1/θ)},
select a polylog(n) size subset Si(I′) of the subintervals of I′ that were declared sparse in iteration i,
and for each I′′∈Si(I′) exactly compute cost(I′′,J) for all ε(i+3)-aligned intervals J
to determine Nε(i)(I′′) (which has size at most 2d). For each box I′′×J, let J′ be the unique w2-interval
such that the diagonal of I′′×J is a subset of the diagonal of I′×J′ and add J′ to J′(I′). The size of
$\mathcal{J}'(I')$ is $O(d)$ and so the total cost of evaluating the edit distance of boxes $I' \times J'$ for
$I' \in \mathrm{Intervals}(w_2;\{0,\ldots,n\})$ and $J' \in \mathcal{J}'(I')$ is $O(n d w_2)$.
The parameters $w_1, d, w_2$ are adjusted to minimize the run time at $\tilde{O}(n^{12/7})$.
The key claim in [16] is that for every horizontally aligned w2-box I×J, the boxes from the discovery phase
imply an upper bound on ncost(I×J) that is at most C⋅ncost(I×J)+C′θ, which is sufficient for the shortcut phase to succeed. The claim is proved by showing
that if the set of certified w1-boxes does not imply a sufficiently good upper bound on ncost(I×J), then with high probability, one of the w2-boxes I×J′ constructed
in the second part of the discovery phase is within a small vertical shift of I×J, and therefore can be used in the inference phase
to imply a good upper bound on cost(I×J).
4 The new core speed-up algorithm
The main new ingredient of the new core speed-up algorithm presented here is the replacement of the pair w1<w2 of widths from [16] by a hierarchy
w1<⋯<wk of widths. While the idea of such an extension is natural, it is not a priori clear how to extend the ideas of [16] to such a hierarchy. Our new algorithm proceeds in k iterations. During iteration j the algorithm
builds a data structure that supports approximate distance queries between substrings of width wj.
Each successive data structure recursively uses the data structure from the previous iterations. Iteration j is accomplished by
a suitable variant of the algorithm from [16].
The algorithm of [16] splits neatly into a discovery phase and an inference phase. In the new algorithm, each iteration
begins with an inference phase (using boxes discovered in the previous phase) followed by a discovery phase.
Here is our main speed-up theorem.
Theorem 4.1.
Suppose that SLOW-GED is a gap algorithm for edit distance satisfying gap-condition(T′,ζ′,Q′) where T′≥1,
ζ′>0 and Q′≥1. There is an algorithm FAST-GED
(using SLOW-GED as a subroutine) that satisfies gap-condition(T,ζ,Q) with T=T′+1/6 where
ζ>0 and Q≥1 are suitably chosen (depending only on T′,ζ′ and Q′).
Applying this theorem inductively with $A_0$ being the exact edit distance algorithm, we get a sequence of algorithms
$A_j$ where $A_j$ satisfies gap-condition$(1+j/6, \zeta_j, Q_j)$ for suitable constants $\zeta_j > 0$ and $Q_j$,
and taking $j = 6(T-1)$ gives
Theorem 1.2.
The proof of Theorem 4.1 is the heart of the paper.
We describe the algorithm in the following order:
1. The parameters used by the algorithm (Section 4.1).
2. The overall architecture, including data objects, of the algorithm (Section 4.2).
3. Some basic functions used in the algorithm (Section 4.3).
4. The mechanics of the algorithm (Section 4.4).
5. The use of randomness in the algorithm (Section 4.5).
6. The properties enforced by the algorithm (Sections 4.6 and 4.7).
7. The proof that FAST-GED satisfies the gap-algorithm Soundness and Completeness requirements (Section 4.8).
8. The running time analysis in terms of the parameters (Section 4.9).
9. The choice of parameters that attain the run time claims for FAST-GED (Section 4.10).
Recall that a gap-algorithm takes as input (n,θ,δ;x,y) where n is a power of 2 and ∣x∣=∣y∣=n, and θ∈(0,1] is a power of 1/2.
In our description of the algorithm,
we fix the input parameter δ in the algorithm FAST-GED to δ=1/2.
For $\delta < 1/2$, we execute the algorithm with $\delta = 1/2$ independently
$r = \lceil \log_2(1/\delta) \rceil$ times, and reject only if every run returns reject. This compound algorithm will reject every input $x, y$ such that $d_{\mathrm{edit}}(x,y) > Q\theta n$, since every run will reject. The probability that the compound algorithm incorrectly returns
reject on input with $d_{\mathrm{edit}}(x,y) \leq \theta n$ is at most $(1/2)^r \leq \delta$, as required.
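The amplification step is a simple wrapper around any gap algorithm with error 1/2. A minimal sketch, assuming the gap algorithm is a Python callable `gap_half(n, theta, x, y)` returning the string `"accept"` or `"reject"` (both the name and the calling convention are hypothetical):

```python
import math


def amplify(gap_half, delta):
    """Turn a gap algorithm with error 1/2 into one with error at most delta."""
    r = max(1, math.ceil(math.log2(1 / delta)))  # number of independent runs

    def compound(n, theta, x, y):
        # Reject only if every run rejects; one accepting run suffices.
        for _ in range(r):
            if gap_half(n, theta, x, y) == "accept":
                return "accept"
        return "reject"

    return compound
```

Stopping at the first accepting run does not change the output, since the compound algorithm accepts exactly when at least one of the `r` runs accepts.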
Second, we fix the value of $\delta$ for all calls of SLOW-GED within FAST-GED to $\delta = n^{-12}$, where $n$ is the length of the global input
to FAST-GED. Since the number of calls to SLOW-GED will be bounded above (easily) by $O(n^2)$, a union bound implies that the probability that every
call to SLOW-GED is correct is at least $1 - n^{-8}$.
The algorithm FAST-GED takes as input n,θ;x,y where
n is a power of 2, x and y are strings of length n and the gap parameter
θ∈(0,1] is a power of 1/2. The algorithm sets
z to be the concatenation of xy and treats z as a global variable.
The number of iterations (levels) of FAST-GED is a parameter k.
For each $j \in \{1,\ldots,k+1\}$, there is a width parameter $w_j$ and for each $j \in \{0,\ldots,k\}$, there is a density parameter $d_j$. These parameters are integer powers of 2 satisfying the conditions below (here $\lfloor \cdot \rfloor_2$ denotes the closest power of two of size smaller or equal):
[TABLE]
[TABLE]
Furthermore, for 1≤j≤k:
[TABLE]
These parameters will be chosen in Section 4.10 to optimize
the time analysis.
For now we note a technical assumption, that will be verified in Section 4.10, that is needed in the analysis.
For 1≤j≤k:
[TABLE]
For each j∈{0,…,k}, there are quality parameters qj that satisfy the recurrence:
[TABLE]
The quality of the final approximation is Q=2qk+6
We also define, for integers $i$, $\varepsilon(i) = 2^{-i}$. In most cases, $i \in \{0,\ldots,\log(1/\theta)\}$ so
$1 \geq \varepsilon(i) \geq \theta$.
There is a constant c0 used in the definition of the procedure ProcessDense. (See Section 4.5.)
4.2 The architecture of the algorithm, and the neighborhood data structure
FAST-GED consists of k iterations (levels), and a final post-processing step.
During iteration j, the algorithm examines pairs ⟨i;I×J⟩, called candidates, where i∈{0,…,log(1/θ)},
I∈Intervals(wj) and J∈Intervals(wj,ε(i+3)). (Hence, a candidate is any ⟨i;I×J⟩ that satisfies some weak consistency requirements.)
The pair I×J is called a level j box
and ⟨i;I×J⟩ is a level j candidate.
Iteration j implicitly classifies all level j-candidates as close or far. This classification satisfies:
•
If ncost(I×J)≤ε(i) then ⟨i;I×J⟩ is classified as close.
•
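If $\mathrm{ncost}(I \times J) > \varepsilon(i - q_{j-1} - 6)$ then $\langle i; I \times J\rangle$ is classified as far.
If $\varepsilon(i) < \mathrm{ncost}(I \times J) \leq \varepsilon(i - q_{j-1} - 6)$ then $\langle i; I \times J\rangle$ may be classified as either close or far.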
This implicit classification is accomplished by
a data structure, called the neighborhood data structure. The
data structure implements a query EnumerateClose
which
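takes as input $(j, I \times \mathcal{J}, i)$ where:
•
$j \in \{1,\ldots,k\}$ is the level,
•
$I \times \mathcal{J}$ is a stack satisfying $I \in \mathrm{Intervals}(w_j)$ and $\mathcal{J} \subseteq \mathrm{Intervals}(w_j, \varepsilon(i+3))$,
•
$i \in \{0,\ldots,\log(1/\theta)\}$,
and returns the set of $J \in \mathcal{J}$ for which $\langle i; I \times J\rangle$ is close.
In particular, $\mathrm{EnumerateClose}(j, I \times \{J\}, i)$ returns $\{J\}$ if $\langle i; I \times J\rangle$ is close and returns $\emptyset$ otherwise.
The pair $\langle i; I \times \mathcal{J}\rangle$ is called a level $j$ candidate stack.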
The queries with level parameter j are the
level j queries. Initially the data structure is unable to answer any queries. During iteration j
the algorithm constructs the part of the data structure that determines the classification of level j candidates
as close or far, and thereby enabling level j queries.
At the start of iteration j, queries up to level j−1 have been enabled.
To enable EnumerateClose(j,⋅) the algorithm constructs families of sets
for each I∈Intervals(wj) and each i∈{0,…,log(1/θ)} as follows:
•
A subset of Intervals(wj,ε(i+3))
denoted Bbelow(j,I,i).
•
A subset of
Intervals(wj−1;I) denoted
SparseSample(j,I,i).
The query EnumerateClose(j,⋅) uses these sets, as well as calls to EnumerateClose(j−1,⋅).
Thus the level j neighborhood data structure consists of all of the sets Bbelow(j′,⋅) and SparseSample(j′,⋅) for
1≤j′≤j.
During iteration j, subroutines Preprocess and ProcessDense are called with parameter j.
The purpose of Preprocess(j) is to create the sets
Bbelow(j,⋅) and SparseSample(j,⋅). The construction of these sets
involves some random choices, which affect the close/far classification; but
once the choices are made the close/far classification is fixed.
The creation of these sets activates
EnumerateClose(j,⋅). While the data structure grows during each iteration
to enable higher level queries, once EnumerateClose(j,⋅) is enabled, the portion of the data structure
used to handle level j queries is static.
The other procedure in iteration j of FAST-GED() is ProcessDense(j). ProcessDense(j) creates
the following sets
for each i∈{0,…,log(1/θ)}:
•
Sparse(j,i)⊆Intervals(wj).
•
For each I∈Sparse(j,i),
a subset of Intervals(wj,ε(i+3))
denoted Bdense(j,I,i).
•
A set R(j) of weighted boxes (which we will prove are all certified).
The sets Bdense(j,⋅) are local variables within ProcessDense(j), used
to create R(j).
The set R(j) and Sparse(j,⋅) are global variables but,
with the exception of the final iteration j=k,
they are used only in Preprocess(j+1), and then never used again.
Following iteration k, the set R(k) is used in the post-processing step to generate the final output which is
$\mathrm{cost}_{R(k)}(\{0,\ldots,n\} \times \{n,\ldots,2n\})$.
4.3 Elementary primitives
We describe some elementary functions used within the algorithm.
The function Round. Round(J,ϵ) where J is an interval and ϵ≤1 is a power of 2, is equal to the
ϵ-aligned interval J′ of width μ(J) obtained by shifting J down (decreasing its two endpoints) at
most ϵμ(J)−1 units.
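A minimal sketch of Round, assuming intervals are `(min, max)` pairs and using the definition of $\epsilon$-aligned from Section 2 (the minimum is a multiple of $\max(\epsilon\mu(J), 1)$); the name `round_interval` is illustrative.

```python
def round_interval(J, eps):
    """Round(J, eps): shift J down to the nearest eps-aligned interval."""
    lo, hi = J
    width = hi - lo
    step = max(int(eps * width), 1)  # alignment grid for intervals of this width
    shift = lo % step                # between 0 and step - 1 units
    return (lo - shift, hi - shift)
```

For instance, `round_interval((13, 21), 0.5)` shifts an interval of width 8 down by one unit to `(12, 20)`, whose minimum is a multiple of 4.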
The function ZoomIn.
Recall the definition of displacement in Section 2.
The function ZoomIn takes as input a box I×J, and a subinterval I′ of I and some additional
parameters, and outputs a set of suitably aligned intervals J′ of width
μ(I′) so that each box I′×J′ has small displacement from I×J.
More precisely,
for a box I×J, a subinterval I′⊆I, and 0≤i′≤i≤log(1/θ), ZoomIn(j,I×J,i,I′,i′)
is the set of all ε(i′+3)-aligned intervals J′⊆J of width μ(I′), for which the displacement of I′×J′ from I×J
is at most 2ε(i)μ(I).
Proposition 4.2.
Let $I$ be an interval of width $w$ and $I' \subseteq I$ of width $w'$ a divisor of $w$. Let $i' \leq i \in \{0,\ldots,\log(1/\theta)\}$.
1. For $J$ of width $w$, $\mathrm{ZoomIn}(j, I \times J, i, I', i')$ has size at most $1 + 32\varepsilon(i-i')w/w'$.
2. Let $I' \times J'$ be a box. The number of $\varepsilon(i+3)$-aligned width-$w$ intervals $J$ such that
$J' \in \mathrm{ZoomIn}(j, I \times J, i, I', i')$ is at most 33.
Proof.
Set Δ=min(I′)−min(I). If J′∈ZoomIn(j,I×J,i,I′,i′) then ∣min(J′)−Δ−min(J)∣≤2ε(i)w.
Proof of (1). Holding J fixed, we have min(J′)∈[min(J)+Δ−2ε(i)w,min(J)+Δ+2ε(i)w]. This is an interval of width 4ε(i)w, and the number of ε(i′+3)-aligned intervals of width w′
that start in this interval is at most 1+32ε(i−i′)w/w′.
Proof of (2). Holding J′ fixed, we have min(J)∈[min(J′)−Δ−2ε(i)w,min(J′)−Δ+2ε(i)w]. This is an interval of width 4ε(i)w, and the number of ε(i+3)-aligned intervals of width w
that start in this interval is at most 33.
∎
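The counting in part (1) can be checked numerically. The sketch below enumerates only the aligned starting points that satisfy the necessary condition ∣min(J′)−Δ−min(J)∣≤2ε(i)w used in the proof; it does not implement the displacement test of Section 2. It assumes ε(i)=2^{−i}, integer endpoint pairs for intervals, and an alignment grid of multiples of ε(i′+3)w′. All function names are hypothetical.

```python
from fractions import Fraction

def eps(i):
    # Assumed convention: eps(i) = 2^-i (consistent with 2*eps(i) = eps(i-1) in the text).
    return Fraction(1, 2 ** i)

def zoom_in_candidates(I, J, i, I_sub, i_sub):
    """Aligned starts of width-w' intervals J' inside J satisfying
    |min(J') - Delta - min(J)| <= 2*eps(i)*w, the condition from the proof."""
    w = I[1] - I[0]
    w_sub = I_sub[1] - I_sub[0]
    delta = I_sub[0] - I[0]
    step = max(1, int(eps(i_sub + 3) * w_sub))  # assumed alignment grid
    slack = 2 * eps(i) * w
    first = -(-J[0] // step) * step             # smallest grid point >= min(J)
    return [s for s in range(first, J[1] - w_sub + 1, step)
            if abs(s - delta - J[0]) <= slack]

def size_bound(w, w_sub, i, i_sub):
    """The bound 1 + 32*eps(i - i')*w/w' from part (1)."""
    return 1 + 32 * eps(i - i_sub) * w / w_sub
```

With w=1024, w′=64, i=4 and i′=2 the window has width 4ε(i)w=256 and the grid spacing is 2, so the enumeration meets the bound exactly.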
Calling ZoomIn(j,I×𝒥,i,I′,i′) with a stack I×𝒥 returns the union of results ⋃J∈𝒥 ZoomIn(j,I×J,i,I′,i′).
The function InducedBoxes. This is a function that takes as input a set of weighted square boxes Q and outputs a collection
of weighted boxes induced by Q. For an interval J, and t≤μ(J)/2
let J/[t] denote the interval [min(J)+t,max(J)−t]. For each (I×J,κ) in Q,
InducedBoxes(Q) includes (I×J,κ) together with boxes of the form (I×J/[2^i],κ+2^{i+1}/μ(I))
for i∈{0,…,log(μ(J))−1}.
Proposition 4.3.
If all boxes of Q are certified boxes then so are all boxes of InducedBoxes(Q).
Proof.
Note that ∣JΔ(J/[2^i])∣=2^{i+1} and apply Proposition 2.2.
∎
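A small sketch of InducedBoxes, assuming boxes are given as triples (I, J, κ) with integer endpoint pairs and that the weight increment is 2^{i+1}/μ(I), i.e. the 2^{i+1} removed symbols normalized by the box width. The name induced_boxes and the tuple representation are hypothetical.

```python
from fractions import Fraction

def induced_boxes(Q):
    """Return each box of Q together with its shrunken copies I x J/[2^i]."""
    out = []
    for I, J, kappa in Q:
        out.append((I, J, kappa))
        width_I = I[1] - I[0]
        width_J = J[1] - J[0]
        t = 1  # t = 2^i for i = 0, ..., log(width_J) - 1
        while 2 * t <= width_J:
            shrunk = (J[0] + t, J[1] - t)            # J/[t]
            out.append((I, shrunk, kappa + Fraction(2 * t, width_I)))
            t *= 2
    return out
```

For a single width-8 box with κ=1/4 this yields four boxes; the last one, with t=4, has a degenerate vertical interval and normalized weight 5/4.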
The function APM (Approximate pattern match).
Recall from Section 2 that costR(I×J) is the length of the min-cost traversal of I×J in the shortcut
graph G(R). APM takes as input a stack I×𝒥, κ>0 and a set R of certified boxes,
and outputs a subset S of 𝒥 that satisfies:
Completeness of APM.
For all J∈𝒥 satisfying costR(I×J)≤κμ(I), J∈S.
Soundness of APM.
For all J∈𝒥 satisfying cost(I×J)>2κμ(I), J∉S.
The running time is O(μ(I)+∣𝒥∣+∣R∣). (Notice the subtle distinction between costR and cost in Soundness and Completeness.) The implementation, described in Section 6, is a customized variant of dynamic programming that closely follows [17, 18].
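The two guarantees can be phrased as a checker for any candidate output S. This is only a sketch of the contract, not of the APM algorithm of Section 6; the cost oracles cost_R and cost are assumed to be supplied by the caller, and the function name is hypothetical.

```python
def apm_contract_violations(S, stack, cost_R, cost, kappa, mu_I):
    """Return (missing, spurious): intervals violating Completeness and Soundness."""
    # Completeness: every J with cost_R(I x J) <= kappa * mu(I) must be in S.
    missing = [J for J in stack if cost_R(J) <= kappa * mu_I and J not in S]
    # Soundness: no J with cost(I x J) > 2 * kappa * mu(I) may be in S.
    spurious = [J for J in stack if cost(J) > 2 * kappa * mu_I and J in S]
    return missing, spurious
```

Intervals whose cost lies between the two thresholds may be reported or omitted without violating either condition.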
4.4 The mechanics of the algorithm
We are now ready to present the pseudocode for FAST-GED and the three main subroutines:
Preprocess, ProcessDense, and EnumerateClose.
The algorithm FAST-GED. This algorithm inputs an integer n which is a power of 2,
θ∈(0,1] a power of 1/2, and
two strings x,y of length n, and returns accept
or reject. (Recall that the error parameter δ is fixed to 1/2.)
The algorithm consists of iterations indexed by j∈{1,…,k}. Preprocess(j) creates the sets Bbelow(j,I,i) and SparseSample(j,I,i) that enable the
level j queries EnumerateClose(j,⋅).
ProcessDense(j) creates sets R(j) and Sparse(j,i) needed for Preprocess(j+1).
The subroutine Preprocess. On input j, the sets
Sparse(j−1,i) and R(j−1) created by ProcessDense(j−1) are used to produce the
sets Bbelow(j,I,i) and SparseSample(j,I,i) for I∈Intervals(wj) and i∈{0,…,log(1/θ)}.
To begin, the set of weighted wj−1-boxes R(j−1) is partitioned into sets
R(j−1,I), with I′×J′ assigned to R(j−1,I) for I′⊆I.
For each i and I:
1. The set Sparse(j−1,i)⊆Intervals(wj−1) was produced by ProcessDense(j−1).
SparseSample(j,I,i)=∅ if Sparse(j−1,i) contains no subintervals of I, and otherwise
is an independent random sample (multiset) of size log(n)θ(1) selected from the subsets of I belonging to Sparse(j−1,i).
2. Run APM with input stack I×Intervals(wj,ε(i+3)) and R(j−1,I) to determine
the set of intervals J∈Intervals(wj,ε(i+3)) that are suitably close to I
in the shortcut graph G(R(j−1,I)).
The subroutine EnumerateClose. The creation of Bbelow(j,I,i) and SparseSample(j,I,i) by Preprocess
enables the query EnumerateClose(j,⋅), which implicitly classifies all level j
candidates ⟨i;I×J⟩ as close or far subject to:
Completeness of EnumerateClose.
If ncost(I×J)≤ε(i) then with high probability
⟨i;I×J⟩ is close.
Soundness of EnumerateClose.
If ncost(I×J)>ε(i−qj−1−6) then ⟨i;I×J⟩ is far.
EnumerateClose(j,⋅) takes
a stack I×𝒥 and i∈{0,…,log(1/θ)} with I∈Intervals(wj) and 𝒥⊆Intervals(wj,ε(i+3)) and returns {J∈𝒥:⟨i;I×J⟩ is close}.
S accumulates the set of intervals to be output. For j=1, SLOW-GED(zI,zJ,ϵ) is run for
each J∈𝒥 and S is the set of accepted J. For j>1, S is the union of two sets.
The first is
Bbelow(j,I,i)∩𝒥 found by Preprocess(j). The second is obtained by identifying (as described below) a
small subset K⊆𝒥, testing each J∈K using SLOW-GED, and adding J to S if zJ is
suitably close to zI. To identify K, for each (I′,i′)∈SparseSample(j,I,i)×{0,…,i}
use
ZoomIn to identify the set 𝒥′ of J′∈Intervals(wj−1,ε(i′+3)) such that I′×J′ has displacement
at most 2ε(i)μ(I) from I×J for some J∈𝒥.
Recursively use EnumerateClose(j−1,I′×𝒥′,i′) to select S′={J′∈𝒥′:⟨i′;I′×J′⟩ is close}. K consists of those J for which I×J has small displacement
from I′×J′ for some J′∈S′.
The loops on i′,I′ (lines 11-21) produce K⊆𝒥. For each J∈K,
SLOW-GED is run on zI,zJ. The loop on I′ is over SparseSample(j,I,i′). The subset K of 𝒥 depends
on the random sample SparseSample(j,I,i′) of Sparse(j−1,i′)∩Intervals(wj−1;I).
The following definitions highlight this dependence.
•
For ⟨i;I×J⟩, let I′∈Intervals(wj−1;I) and i′∈{0,…,i}.
The pair (I′,i′) is a marker for the candidate ⟨i;I×J⟩
if I′∈Sparse(j−1,i′) and there is some J′∈ZoomIn(j,I×J,i,I′,i′) such that
⟨i′;I′×J′⟩ is classified as close.
(Footnote 2: We call it a marker as in genomics, where a short DNA sequence identifies a gene. Similarly here, a marker for zI
is its substring zI′ which is relatively rare in z, i.e., I′ belongs to Sparse(j−1,i′).)
When lines (13-18) are executed for a marker (I′,i′), J is added to K in line 17. Ideally, K will consist of all intervals J identifiable by their markers.
•
M(j,I×J,i,i′)={I′∈Sparse(j−1,i′)∩Intervals(wj−1;I):(I′,i′) is a marker
for ⟨i;I×J⟩}.
We will be interested in situations where, for some i′≤i, there are many markers, namely ∣M(j,I×J,i,i′)∣≥(1/3)∣Sparse(j−1,i′)∩Intervals(wj−1;I)∣, so that with high probability SparseSample(j,I,i′) will contain a marker that will identify J.
The procedure ProcessDense. This takes as input a level number j. The procedure
corresponds closely to the procedure Dense Strip Removal in [17].
For each
i∈{0,…,log(1/θ)} the procedure builds a set Sparse(j,i)⊆Intervals(wj)
and also builds sets Bdense(j,I,i)⊆Intervals(wj,ε(i+3)) for every I∈Intervals(wj)∖Sparse(j,i).
This is done by processing the intervals of Intervals(wj); when interval I is processed it is either assigned to Sparse(j,i)
or the set Bdense(j,I,i) is constructed. We keep track of a subset T⊆Intervals(wj) of unprocessed intervals. This set
is initialized to Intervals(wj) and the iteration ends when T=∅. We proceed in rounds. In a round we select an arbitrary I
from T. We perform a test (see "Testing potential pivots in ProcessDense" in Section 4.5) to decide whether to put it in Sparse(j,i). If I is not placed in Sparse(j,i)
then I is designated the pivot for that round. We then call EnumerateClose on the stack I×T (with suitable parameters)
to determine the subset X of Intervals(wj). We also call EnumerateClose on the stack I×Intervals(wj,κ)
(for a suitable κ≥ε(i)) to determine Y′⊆Intervals(wj,κ), and we let Y be the set of intervals from Intervals(wj,ε(i+3)) which round to an interval in Y′.
We then define Bdense(j,I′,i)=Y for all I′∈X,
and remove X from T, to complete the round.
The parameters used in the above calls are expressed in terms of h1 and h2 introduced in the pseudocode.
The particular choice h1 and h2 is motivated by both the correctness analysis and the time analysis (Section 4.9).
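The round structure described above can be sketched as follows for a single value of i. The three callbacks stand in for the sparsity test and the two EnumerateClose calls and are hypothetical; the sketch also assumes that an interval placed in Sparse(j,i) is removed from T and that the pivot belongs to its own set X, neither of which is spelled out in the prose.

```python
def process_dense_rounds(intervals, looks_sparse, close_sources, close_targets):
    """Split intervals into Sparse and pivot-covered ones; returns (sparse, bdense)."""
    T = set(intervals)            # unprocessed intervals
    sparse, bdense = set(), {}
    while T:
        I = min(T)                # "arbitrary" choice, made deterministic here
        if looks_sparse(I):
            sparse.add(I)
            T.remove(I)           # assumption: a sparse interval counts as processed
            continue
        X = close_sources(I, T) | {I}   # assumption: the pivot is close to itself
        Y = close_targets(I)            # plays the role of the set Y
        for I2 in X:
            bdense[I2] = Y        # Bdense(j, I', i) = Y for all I' in X
        T -= X                    # completes the round
    return sparse, bdense
```

Every interval ends up either in sparse or as a key of bdense, which mirrors the dichotomy used in the definition of approved candidates.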
In the sequel, we will need the following definition and observation.
Approved Candidate. A candidate ⟨i;I×J⟩ is said to be approved if I∉Sparse(j,i) and
J∈Bdense(j,I,i). Note that the boxes in Q(j) are in one-to-one correspondence with the
approved candidates, with (I×J,ε(i−qj))∈Q(j) if and only if ⟨i;I×J⟩ is approved.
All candidates of the form ⟨i;I×J⟩ are approved for i≤qj.
Proposition 4.4.
At level k, the sets Sparse(k,i) are empty for all i∈{0,…,log(1/θ)}.
Proof.
Since dk=1, the set S created in line (10) is all of Intervals(wj,ε(i+3)) which, in particular includes I.
The set returned by EnumerateClose in line (11) includes I and so the if condition fails, and I is not added to Sparse(k,i).
∎
4.5 The use of randomization
Randomization is used in three parts of the algorithm: the subroutine SLOW-GED, the construction of SparseSample during Preprocess
and in ProcessDense, each time we test a selected I∈T to decide whether it is a pivot. We discuss each of these uses below.
The subroutine SLOW-GED.
SLOW-GED takes calling parameters (n′,θ′,δ′;x′,y′). By our
assumption δ′ is fixed to n^{−12} for all calls.
The gap-soundness and completeness conditions for SLOW-GED guarantee that
if Δedit(x′,y′)>Q′θ′n′ then SLOW-GED returns reject,
and if Δedit(x′,y′)≤θ′n′ then SLOW-GED returns accept with
probability at least 1−n^{−12}. Say that an execution of SLOW-GED(n′,θ′,n^{−12};x′,y′) fails if Δedit(x′,y′)≤θ′n′ and SLOW-GED returns reject.
We will introduce an event SG that no call to SLOW-GED fails.
To simplify the analysis, we make the following assumption: when
we run FAST-GED we pregenerate a single string BSG of b random bits where
b is an upper bound on the number of random bits used in any call to SLOW-GED.
In every call to SLOW-GED we use (a prefix of) BSG to provide the random bits
for the call. This makes all
calls to SLOW-GED deterministic, and also ensures that if the algorithm
makes multiple calls
to SLOW-GED with the same input parameters then all such calls yield the same output.
Reusing random bits for different calls of SLOW-GED makes these calls
dependent, but this is irrelevant to the analysis.
The proof of correctness relies only on the fact that the event SG holds.
We now upper bound the probability that there is a call that does not
succeed.
Every possible input tuple
(n′,θ′,n^{−12};x′,y′) for SLOW-GED satisfies that
n′ is a power of 2 with n′<n, x′,y′ are substrings of z=xy of
length n′, and θ′ is an integral power of 1/2. We may assume
that θ′≥1/n since for θ′<1/n we may assume that
SLOW-GED is the deterministic algorithm that returns accept if x′=y′
and reject otherwise. Let SG denote the event that for all possible choices of input parameters (n′,x′,y′,θ′)
with θ′≥1/n, the choice of random bits succeeds.
The number of possible choices
of input parameters for which randomness is used is at most 4n^2 log^2(n).
(There are at most log(n) ways to choose n′, and to choose θ′,
and at most 2n ways to choose the starting location of x′ and of y′.)
Thus by a union bound, the probability that SG does not hold
is at most n^{−8}.
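The union bound can be sanity-checked numerically. The sketch below assumes the per-call failure probability is n^{−12} and that logarithms are base 2; the function name is hypothetical.

```python
import math

def slow_ged_union_bound(n, per_call_failure_exponent=12):
    """Upper bound on Pr[some SLOW-GED call fails] via a union bound."""
    # At most 4 n^2 log^2(n) distinct inputs that use randomness.
    num_inputs = 4 * n ** 2 * math.log2(n) ** 2
    return num_inputs * float(n) ** (-per_call_failure_exponent)
```

For n=1024 there are 4·2^{20}·100 possible inputs, and the resulting bound is far below n^{−8}.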
The construction of SparseSample. SparseSample(j,I,i) is a random sample of Sparse(j−1,i) generated during Preprocess(j).
What we want from this sample is that for each i′∈{0,…,i}, if a nontrivial fraction of Sparse(j−1,i) belongs
to the set of markers M(j,I×J,i,i′) then SparseSample(j,I,i) should include a member of M(j,I×J,i,i′).
(Note: for the purposes of this discussion, the exact technical definition of M(j,I×J,i,i′) is unimportant, we only need that for each j,I,J,i,i′, M(j,I×J,i,i′) and Sparse(j−1,i)
are completely determined after iteration j−1, and
M(j,I×J,i,i′)⊆Sparse(j−1,i).)
Formally, we say that SparseSample(j,I,i) fails
for J∈Intervals(wj,ε(i+3))
and i′∈{0,…,i} if ∣Sparse(j−1,i)∩Intervals(wj−1;I)∣>0, ∣M(j,I×J,i,i′)∣≥∣Sparse(j−1,i)∩Intervals(wj−1;I)∣/3
and SparseSample(j,I,i)∩M(j,I×J,i,i′)=∅.
Since M(j,I×J,i,i′) is completely determined by the
end of iteration j−1, and SparseSample(j,I,i) is an independent sample of 30 log n
elements from Sparse(j−1,i)∩Intervals(wj−1;I) selected during iteration j,
the probability that SparseSample(j,I,i) fails for J,i′ is at most
(1−1/3)^{30 log n}≤n^{−10}. There are at most n pairs J,i′ so the probability that SparseSample(j,I,i)
fails for some J,i′ is at most n^{−9}. There are at most n triples j,I,i so the probability
that some SparseSample(j,I,i) fails is at most n^{−8}.
We denote by BSSj the random bits that are used at iteration j to generate the samples from Sparse(j,I,i) for all I and i.
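The failure probability of a single sample can be evaluated directly. This sketch assumes base-2 logarithms and a marker fraction of exactly 1/3, the worst case allowed by the definition of failure; the function name is hypothetical.

```python
import math

def sample_failure_bound(n, marker_fraction=1 / 3, size_factor=30):
    """Probability that size_factor * log2(n) independent draws all miss the markers."""
    draws = size_factor * math.log2(n)
    return (1 - marker_fraction) ** draws
```

Since (2/3)^{30} is roughly 2^{−17.5}, the bound is about n^{−17.5}, comfortably below the n^{−10} used in the text.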
Testing potential pivots in ProcessDense. During the while loop for I∈T of ProcessDense, we make a random
selection of a set S, and this choice affects whether I is assigned to Sparse(j,i) or becomes a pivot. The constant c0 in line (10) is chosen below to
satisfy certain technical conditions.
We denote by BPDj the random bits used at iteration j to generate sets S where we make the simplifying assumption
that there is a designated block of bits for each possible I∈Intervals(wj) and i to select the corresponding S. (Some of the blocks
might be unused.)
There are two bad events that depend on the choice of S:
1. ∣EnumerateClose(j,I×Intervals(wj,ε(i+3)),i)∣<dj/2 and I is not assigned to Sparse(j,i).
2. ∣EnumerateClose(j,I×Intervals(wj,ε(i+3)),i)∣>2dj and I is assigned to Sparse(j,i).
For both of the bad events, we observe that (i) for any input (j,I×J,i),
EnumerateClose(j,I×J,i) returns the stack of candidates
⟨i;I×J⟩ that are classified as close among I×J, and
(ii) the classification of
level j candidates as close or far is completely deterministic given the random
bits BSG for SLOW-GED, and the random bits BSS≤j and BPD≤j−1 for the first j−1 iterations and Preprocess(j).
Thus, for the random sample S of Intervals(wj,ε(i+3)), where
each interval is placed in S independently with probability p,
(1/p)∣EnumerateClose(j,I×S,i)∣ is an estimate of
∣EnumerateClose(j,I×Intervals(wj,ε(i+3)),i)∣, and the
bad events can only occur if this estimate is sufficiently inaccurate.
For suitably large c0, a simple Chernoff-Hoeffding bound shows that for each (I,i) the probability of a bad event
is at most n^{−10},
and summing over the at most O(n) such pairs, the probability of a bad event is at most n^{−9}.
We say ProcessDense has successful sampling if no such bad event occurs.
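The estimator (1/p)∣EnumerateClose(j,I×S,i)∣ can be illustrated with a toy simulation. Here the predicate is_close is a hypothetical stand-in for the deterministic close/far classification, and the population is an abstract list rather than a set of intervals.

```python
import random

def estimate_close_count(population, is_close, p, rng):
    """Estimate the number of close elements from a Bernoulli(p) sample."""
    # Each element is placed in the sample independently with probability p.
    sample = [J for J in population if rng.random() < p]
    return sum(1 for J in sample if is_close(J)) / p
```

With 20000 close elements out of 100000 and p=0.1, the estimate concentrates tightly around 20000, so an error by a factor of 2 (which is what a bad event requires) is extremely unlikely.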
Successful randomization.
An execution of FAST-GED has successful randomization if all calls to SLOW-GED are correct, all calls to SparseSample are successful,
and ProcessDense has successful sampling. We denote the event
of successful randomization by SR. By the above, Pr[SR]≥1−1/n^7.
4.6 The properties enforced by FAST-GED.
In this section we state and prove a theorem that states the main properties enforced by FAST-GED.
By hypothesis, SLOW-GED is a gap algorithm for edit distance satisfying gap-condition(T′,ζ′,Q′). We want to show that FAST-GED
satisfies gap-condition(T,ζ,Q) with T=T′+1/6 and suitably chosen
ζ>0 and Q≥1 (depending only on T′,ζ′ and Q′).
As in the discussion in
Section 4.2, we say that the level j candidate
⟨i;I×J⟩ is classified as close if EnumerateClose(j,I×{J},i) returns {J} and is classified as far if EnumerateClose(j,I×{J},i) returns
∅.
Theorem 4.5.
Assume that SLOW-GED is a gap algorithm for edit distance satisfying gap-condition(T′,ζ′,Q′).
Consider a run of FAST-GED on input (n,θ,1/2;x,y) where n^{−ζ′}≤θ≤1 and ∣x∣=∣y∣=n, that
meets the conditions for successful randomization.
For all j∈{1,…,k}, i∈{0,…,log(1/θ)}, I∈Intervals(wj), J∈Intervals(wj,ε(i+3)), 𝒥⊆Intervals(wj,ε(i+3)):
Soundness of Bbelow.
If J∈Bbelow(j,I,i) then ncost(I×J)≤ε(i−qj−1−6).
Completeness of Bbelow.
If ncost(I×J)≤ε(i) then (i) J∈Bbelow(j,I,i) or (ii) there exists an i′≤i such that ∣M(j,I×J,i,i′)∣>(1/3)∣Intervals(wj−1;I)∩Sparse(j−1,i′)∣.
Consistency of EnumerateClose.
J∈EnumerateClose(j,I×𝒥,i) if and only if J∈EnumerateClose(j,I×{J},i).
If J∈EnumerateClose(j,I×{J},i) then ⟨i;I×J⟩ is classified as close.
Soundness of EnumerateClose.
If ⟨i;I×J⟩ is classified as close then ncost(I×J)≤ε(i−qj−1−6).
Completeness of EnumerateClose.
If ncost(I×J)≤ε(i)
then ⟨i;I×J⟩ is classified as close.
Validity of Sparse.
I∈Sparse(j,i) implies that EnumerateClose(j,I×Intervals(wj,ε(i+3))) has size at most 2dj.
Soundness of Bdense.
If J∈Bdense(j,I,i) then ncost(I×J)≤ε(i−qj).
Completeness of Bdense.
If I∉Sparse(j,i) and ncost(I×J)≤ε(i) then J∈Bdense(j,I,i).
Soundness of R(j).
Every box in R(j) is correctly certified, i.e., (I×J,κ)∈R(j) implies ncost(I×J)≤κ.
Completeness of Q(j).
If I∉Sparse(j,i) and ncost(I×J)≤ε(i) then (I×J,min(1,ε(i−qj)))∈Q(j).
The proof of this theorem is by induction on j. For fixed j when we prove a property we assume the properties
listed above it hold. With the exception of the Completeness of Bbelow, which we defer to the next subsection, the proofs are straightforward.
Proof of Soundness of Bbelow.
For j=1, the requirement is vacuously satisfied.
Suppose j>1. By Soundness of R(j−1),
every box in R(j−1,I) is certified. If J∈Bbelow(j,I,i), then the pseudocode
implies that J∈APM(j,I×Intervals(wj,ε(i+3)),ε(i−qj−1−5),R(j−1,I)). By the Soundness of APM, the fact that J is included in the output of this call implies that
ncost(I×J)≤2ε(i−qj−1−5)=ε(i−qj−1−6).
Proof of Completeness of Bbelow. See subsection 4.7
Proof of Consistency of EnumerateClose.
We must show that whether
J∈EnumerateClose(j,I×𝒥,i) does not depend on 𝒥∖{J}.
In the case j=1, J∈EnumerateClose(1,I×𝒥,i) if and only if SLOW-GED(zI,zJ,ε(i)) returns accept, which
does not depend on 𝒥∖{J}. Assume j>1. From the pseudocode of EnumerateClose, J∈EnumerateClose(j,I×𝒥,i) if and only if
(i) J∈Bbelow(j,I,i) or (ii) J∈K and SLOW-GED(zI,zJ,ε(i)) returns accept.
Neither condition (i) nor SLOW-GED(zI,zJ,ε(i)) depends on 𝒥∖{J}.
It remains to show that whether J∈K is also independent of 𝒥∖{J}. Now J∈K if and only if
there exists i′∈{0,…,i}, I′∈SparseSample(j,I,i) and J′∈ZoomIn(j,I×J,i,I′,i′) such that J′∈S′.
The set ZoomIn(j,I×J,i,I′,i′) obviously doesn’t depend on 𝒥∖{J}.
For J′∈ZoomIn(j,I×J,i,I′,i′) we must have J′∈𝒥′, and therefore by the consistency of EnumerateClose
at level j−1, J′∈EnumerateClose(j−1,I′×𝒥′,i′) if and only if J′∈EnumerateClose(j−1,I′×{J′},i′).
Proof of Soundness of EnumerateClose. ⟨i;I×J⟩ is classified as close
means that J∈EnumerateClose(j,I×{J},i). Now for this to happen either (i) SLOW-GED(zI,zJ,ε(i)) returns accept,
or (ii) J∈Bbelow(j,I,i). If (i) holds then the guarantee on SLOW-GED implies ncost(I×J)≤Q′ε(i)≤ε(i−qj−1−6), since log(Q′)=q0≤qj−1 for all j≥1.
If (ii) holds then the result follows from the Soundness of Bbelow.
Proof of Completeness of EnumerateClose.
Suppose ncost(I×J)≤ε(i). By the Completeness of Bbelow, we have (i) J∈Bbelow(j,I,i) or (ii) Sparse(j−1,i)∩Intervals(wj−1;I)≠∅ and there exists an i∗≤i so that ∣M(j,I×J,i,i∗)∣≥(1/3)∣Intervals(wj−1;I)∩Sparse(j−1,i∗)∣.
If (i) holds, then the definition of EnumerateClose immediately gives J∈EnumerateClose(j,I×{J},i). If (ii) holds,
then the success condition for SparseSample(j,I,i) (from Section 4.5) implies that there is an I∗∈SparseSample(j,I,i∗) such that (I∗,i∗) is a marker for ⟨i;I×J⟩.
During the execution of EnumerateClose(j,I×J,i), when i∗ is selected in line (11) and I∗ in line (12), by the definition
of marker, J is added to K in line (17).
The correctness of SLOW-GED implies
that SLOW-GED(I×J,ε(i)) will accept in line (23) and so J will be added to S.
Proof of Validity of Sparse.
This follows immediately from the assumption that ProcessDense has successful sampling.
Proof of Soundness of Bdense.
For i≤qj the claim is trivial so we assume i−qj>0.
Suppose J∈Bdense(j,I,i). Bdense(j,I,i) was defined during iteration i of the main loop (1-34) of ProcessDense(j), during one of the iterations of the while loop (8-23). Let I∗ be the pivot during
that iteration. Then I∈EnumerateClose(j,I∗×Intervals(wj),h1) and J′∈EnumerateClose(j,I∗×Intervals(wj,ε(h2+3)),h2),
for J′=Round(J,ε(h2+3)). By the Soundness of EnumerateClose, ncost(I∗×I)≤ε(h1−qj−1−6) and ncost(I∗×J′)≤ε(h2−qj−1−6). By
the triangle inequality and Proposition 2.2, we have ncost(I×J)≤ε(h1−qj−1−6)+ε(h2−qj−1−6)+ε(h2+3)≤2ε(h2−qj−1−6)=ε(h2−qj−1−7)=ε(i−3qj−1−21)=ε(i−qj).
Proof of Completeness of Bdense.
Suppose I∉Sparse(j,i) and ncost(I×J)≤ε(i). Since I∉Sparse(j,i) during iteration i of the main loop (1),
there is an iteration of the while loop (8-22) of ProcessDense(j)
where I was removed from T. Let I∗ be the pivot for that iteration. Since
I was removed from T, I∈X during this iteration, so I∈EnumerateClose(j,I∗×Intervals(wj),h1) and by the Soundness of EnumerateClose
ncost(I∗×I)≤ε(h1−qj−1−6). Let J′=Round(J,ε(h2+3)). It suffices to show that J∈Y for this same iteration, which would follow
from J′∈EnumerateClose(j,I∗×Intervals(wj,ε(h2+3)),h2). By the Completeness of EnumerateClose it suffices to
show that ncost(I∗×J′)≤ε(h2). By the triangle inequality and Proposition 2.2,
ncost(I∗×J′)≤ncost(I∗×I)+ncost(I×J′)≤ncost(I∗×I)+ncost(I×J)+ε(h2+3)≤ε(h1−qj−1−6)+ε(i)+ε(h2+3)≤ε(h2)/2+ε(h2)/4+ε(h2)/8≤ε(h2) as required.
Proof of Soundness of R(j). By Proposition 4.3, it suffices that every box in Q(j) is correctly certified. In line (26) of ProcessDense(j), (I×J,ε(i−qj))∈Q(j) only if J∈Bdense(j,I,i)
which is correctly certified by
the Soundness of Bdense.
Proof of Completeness of Q(j).
Suppose I∉Sparse(j,i) and ncost(I×J)≤ε(i). By the Completeness of Bdense, J∈Bdense(j,I,i)
and so the definition of Q(j) implies that (I×J,ε(i−qj))∈Q(j).
4.7 Proof of Completeness of Bbelow
Here we finish the proof of Theorem 4.5, by establishing the final property, whose proof
is significantly more involved than that of the others. The proof is based on ideas from [17].
Consider a candidate ⟨i;I×J⟩ with ncost(I×J)≤ε(i).
We assume condition (ii) fails and deduce costR(j−1,I)(I×J)≤ε(i−qj−1−5)wj. By the definition
of Preprocess and the Completeness of APM, this immediately implies condition
J∈Bbelow(j,I,i), which is condition (i).
Fix a minimum cost traversal τ of I×J.
The proof proceeds via the following steps.
Step 1.
For each
I′∈Intervals(wj−1;I) we specify a candidate
⟨t(I′);I′×J^(I′)⟩, which is approved in the sense defined in the description of ProcessDense
in Section 4.4. (The
collection of boxes {I′×J^(I′):I′∈Intervals(wj−1;I)} should be thought of as approximately covering τ.)
Step 2.
We upper bound costR(j−1,I)(I×J) as a constant times ∑I′ε(t(I′))wj−1 plus 8ε(i)wj.
Step 3.
We show that if (ii) fails, then ∑I′ε(t(I′))wj−1 can be upper bounded by a constant multiple of ε(i)wj
Step 4.
This gives that costR(j−1,I)(I×J) is at most a constant multiple of ε(i)wj.
Step 1. Specifying ⟨t(I′);I′×J^(I′)⟩ for each I′.
Consider a pair (I′,i′) where i′∈{0,…,i} and I′∈Intervals(wj−1;I).
Proposition 2.4 implies there
is a level j−1 candidate ⟨i′;I′×J′⟩ such that
ncost(I′×J′)≤2ncost(τI′)+ε(i′+3) and
disp(I′×J′,τI′)≤cost(τI′)+ε(i′+3)wj−1.
Select such an interval J′ and denote it by Ji′(I′) (keeping the dependence on τ implicit.)
For each I′ let us define t(I′) to be the largest index h≤i for which the candidate ⟨h;I′×Jh(I′)⟩ is approved,
that is, I′∉Sparse(j−1,h) and Jh(I′)∈Bdense(j−1,I′,h). Let J^(I′)=Jt(I′)(I′). We record the important properties:
Proposition 4.6.
For each I′∈Intervals(wj−1;I):
1. The box I′×J^(I′) satisfies ncost(I′×J^(I′))≤2ncost(τI′)+ε(t(I′)+3) and
disp(I′×J^(I′),τI′)≤cost(τI′)+ε(t(I′)+3)wj−1.
2. The candidate ⟨t(I′);I′×J^(I′)⟩ is approved, and hence (I′×J^(I′),ε(t(I′)−qj−1))∈Q(j−1).
3. For any i′∈{t(I′)+1,…,i} either I′∈Sparse(j−1,i′) or I′∉Sparse(j−1,i′)
and Ji′(I′)∉Bdense(j−1,I′,i′).
Proof.
The first two properties follow immediately from the definitions of t(I′) and J^(I′). For the third property, the maximality of t(I′)
implies that for i′∈{t(I′)+1,…,i}, ⟨i′;I′×Ji′(I′)⟩ is not approved, and the result follows from the definition of approved.
∎
Step 2. Upper bound on costR(j−1,I)(I×J).
Proposition 4.7.
[TABLE]
(This is closely related to Lemma 4.1 of [17] and the proof is similar.)
Proof.
We transform the path τ in Gz to a path τ′ in the shortcut graph G(R(j−1,I)) (see Section 2) and control the increase in cost. Let I1,…,Im be the intervals of
Intervals(wj−1;I) in order, and for h∈[m],
let ih=t(Ih) and
Jh=J^(Ih).
Let δh be the smallest power of 2 such that δhwj−1≥disp(Ih×Jh,τIh).
By Proposition 4.6,
δh≤2ncost(τIh)+2ε(ih+3), and (Ih×Jh,ε(ih−qj−1))∈Q(j−1).
Let L={h∈[m]:δh<1/2}. For h∈L, let Jh′=Jh/[δhwj−1] (the interval
obtained by
removing the first and last δhwj−1 indices from Jh). The
certified box (Ih×Jh′,ε(ih−qj−1)+2δh) belongs to R(j−1), and since Ih⊆I,
it also belongs to R(j−1,I). Let eh=eIh,Jh′ be the shortcut edge with cost
(ε(ih−qj−1)+2δh)wj−1.
We claim (1) there is a source-sink path τ′ in G(R(j−1,I)) that consists of {ei:i∈L}, plus a collection {Hi:i∈[m]∖L}
where Hi is a horizontal path whose projection to the x-axis is Ii, plus
a collection of (possibly empty) vertical paths V0,V1,…,Vm where the x-coordinate of Vi for i>0 is max(Ii)
and 0 for V0,
and (2) cost(τ′) satisfies the bound of the lemma.
For the first claim, for h∈[m],
let ph=(ih,jh) be the first point in τIh and define pm+1 to be the final point of τ. We will define
τ′ to pass through all of the ph. Let Jh∗ be the vertical projection of τIh
so that τIh traverses Ih×Jh∗. The choice of δh implies that for h∈L, Jh′⊆Jh∗.
Define the portion τh′ between ph and ph+1 as follows: if h∈L,
climb vertically from ph to (ih,min(Jh′)), traverse eIh,Jh′, and climb vertically to ph+1;
if h∉L, move horizontally from ph to (ih+1,jh) and then climb vertically to ph+1.
For the second claim, we upper bound cost(τ′).
For h∈L, eIh,Jh′ has cost at most
(ε(ih−qj−1)+2δh)wj−1, and
for h∉L, the horizontal path that projects to Ih costs wj−1≤2δhwj−1; the total cost of
shortcut and horizontal edges is at most
∑h(ε(ih−qj−1)+2δh)wj−1.
The cost of vertical edges
is ∑h∈L(wj−1−μ(Jh′))+∑h∉Lwj−1=∑h∈L2δhwj−1+∑h∉Lwj−1≤∑h2δhwj−1.
The combined cost of all edges is at most
[TABLE]
which implies the desired bound.
∎
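The arithmetic behind the second claim can be written out as follows. This is a reconstruction from the bounds stated in the proof, not a transcription of the displayed equation; the last line substitutes the bound δh≤2ncost(τIh)+2ε(ih+3) from the beginning of the proof.

```latex
\begin{align*}
\mathrm{cost}(\tau')
 &\le \sum_{h}\bigl(\varepsilon(i_h-q_{j-1})+2\delta_h\bigr)\,w_{j-1}
    \;+\; \sum_{h} 2\delta_h\, w_{j-1} \\
 &= \sum_{h}\bigl(\varepsilon(i_h-q_{j-1})+4\delta_h\bigr)\,w_{j-1} \\
 &\le \sum_{h}\bigl(\varepsilon(i_h-q_{j-1})
      +8\,\mathrm{ncost}(\tau_{I_h})+8\,\varepsilon(i_h+3)\bigr)\,w_{j-1}.
\end{align*}
```

Under the convention ε(i)=2^{−i}, which is assumed here, the term 8ε(ih+3) equals ε(ih), and ∑h ncost(τIh)wj−1 is the total cost of τ restricted to the intervals Ih.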
Step 3. Implication of failure of condition (ii).
We now use the failure of (ii) to obtain an upper bound on the righthand side of Proposition 4.7.
For i′≤i, let Mi′=M(j,I×J,i,i′) and Si′ represent the set Sparse(j−1,i′)∩Intervals(wj−1;I).
Let I′=Intervals(wj−1;I).
The failure of condition (ii) implies:
[TABLE]
.
Multiplying (3) by ε(i′)
and summing on i′ yields:
[TABLE]
Switching the sums:
[TABLE]
To reduce this further, we need the
following sufficient condition for I′∈Mi′.
Proposition 4.8.
Suppose the candidate ⟨i;I×J⟩ satisfies ncost(I×J)≤ε(i)
and τ is a min-cost traversal of I×J. Let (I′,i′) be a pair such that I′∈Intervals(wj−1;I) and i′∈{0,…,i}.
1. If ε(i′)≥2ncost(τI′)+ε(i+3)
then ncost(I′×Ji′(I′))≤ε(i′).
2. If ε(i′)≥2ncost(τI′)+ε(i+3) and I′∈Sparse(j−1,i′) then (I′,i′) is a marker for ⟨i;I×J⟩.
Proof.
For the first part, by the choice of
Ji′(I′), we have ncost(I′×Ji′(I′))≤2ncost(τI′)+ε(i+3) and by the hypothesis of the
Proposition, this is at most ε(i′).
For the second part, by Completeness of EnumerateClose(j−1,⋅) and the first part, ⟨i′;I′×Ji′(I′)⟩ is classified as close.
So we just have to show that Ji′(I′)∈ZoomIn(j,I×{J},i,I′,i′).
It suffices that disp(I′×Ji′(I′),I×J)≤2ε(i)wj.
To bound disp(I′×Ji′(I′),I×J) it suffices to bound the vertical distance from the point
(min(I′),min(Ji′(I′))) to the diagonal of I×J. Let (p,q) be the initial point of
τI′. By the definition of Ji′(I′), the vertical distance from (min(I′),min(Ji′(I′))) to (p,q) is
at most cost(τI′)+ε(i′+3)wj−1≤cost(τ)+wj−1. By Proposition 2.5 the
vertical distance from (p,q) to the diagonal of I×J is at most cost(τ)/2.
So disp(I′×Ji′(I′),I×J)≤(3/2)cost(τ)+wj−1.
By hypothesis, cost(τ)≤ε(i)wj, and by assumption (2)
in Section 4.1, wj−1≤(θ/2)wj, and so disp(I′×Ji′(I′),I×J)≤((3/2)ε(i)+(1/2)θ)wj≤2ε(i)wj, as
required.
∎
Let G(I)={I′∈Intervals(wj−1;I):t(I′)<i and ε(t(I′)+1)≥2ncost(τI′)+ε(i+3)}.
We claim that for each I′∈G(I), I′∈Sparse(j−1,t(I′)+1). If it were not then by Part 3 of Proposition 4.6,
Ji′(I′)∉Bdense(j−1,I′,i′) for i′=t(I′)+1. But by Part 1 of Proposition 4.8 this would
contradict completeness of Bdense(j−1,I′,t(I′)+1). Hence, by Part 2 of Proposition 4.8, for each I′∈G(I), (I′,t(I′)+1) is a marker.
We will combine Proposition 4.8 with inequality (4).
The sum on the lefthand side of (4) includes
all pairs (I′,t(I′)+1) where I′∈G(I) and so is bounded below by ∑I′∈G(I)ε(t(I′)+1).
To upper bound the righthand sum of (4), we look at the inner sum corresponding to a given I′∈Intervals(wj−1;I).
This is a sum of ε(i′) over those i′ such that I′ in Sparse(j−1,i′) and I′ not in Mi′.
We claim that if i′ contributes to this sum then
[TABLE]
To see this note that if ε(i′)≥2ncost(τI′)+ε(i+3) then
Part 2 of Proposition 4.8 implies that I′∉Si′∖Mi′,
so i′ is not included in the sum.
Now in the case that I′∈G(I) then (5) implies that
ε(i′)<ε(t(I′)+1) and so ε(i′)≤ε(t(I′)+2). Summing over all such i′,
the geometric series is at most ε(t(I′)+1).
For I′∉G(I), let v(I′) be the least i′ that contributes to the sum.
So the sum is at most 2ε(v(I′)), and by (5) this is at most
4ncost(τI′)+ε(i+2).
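Both geometric-series bounds used in this case analysis can be checked with exact arithmetic. The sketch assumes ε(i)=2^{−i}; the helper names are hypothetical.

```python
from fractions import Fraction

def eps(i):
    # Assumed convention: eps(i) = 2^-i.
    return Fraction(1, 2 ** i)

def tail_sum(first, last):
    """Sum of eps(m) over m = first, ..., last."""
    return sum(eps(m) for m in range(first, last + 1))
```

For any t, the sum of ε(i′) over i′≥t+2 stays below ε(t+1), and for any v, the sum over i′≥v stays below 2ε(v), which are the two bounds applied above.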
Multiplying the inequality by 2 and subtracting ∑I′∈G(I)ε(t(I′)+1) from both sides gives:
[TABLE]
Now add ∑I′∉G(I)ε(t(I′)+1) to both sides:
[TABLE]
For the first sum on the right, I′∉G(I) implies either ε(t(I′)+1)=ε(i+1) or
ε(t(I′)+1)<2ncost(τI′)+ε(i+3), so in both cases it can be bounded
by 2ncost(τI′)+ε(i+1). Thus we get:
[TABLE]
Step 4. Combining the bounds.
Combining the previous bound with the bound of Proposition 4.7 gives:
[TABLE]
as required to establish the Completeness of Bbelow.
4.8 Correctness of FAST-GED
We now complete the proof that the output of FAST-GED gives a constant factor approximation to edit distance with high probability.
As in Theorem 4.5 we assume
that SLOW-GED is a gap algorithm for edit distance satisfying gap-condition(T′,ζ′,Q′).
Consider a run of FAST-GED on input (n,θ,1/2;x,y) where n^{−ζ′}≤θ≤1 and ∣x∣=∣y∣=n.
The conclusion of the theorem has a quality parameter Q which we set to 2^{qk+6}. We must prove that
FAST-GED satisfies the Soundness and Completeness properties for gap algorithms from Section 1.
The final post-processing step is a call to APM({0,…,n}×{{n,…,2n}},θ2^{qk+5},R(k)), and
the algorithm returns accept or reject according to the output of this call. We will apply the Soundness and
Completeness of Bbelow (with j=k+1) by reinterpreting this final step as asking whether {n,…,2n}∈Bbelow(k+1,{0,…,n},log(1/θ)−qk−6) (where
wk+1=n). The Soundness and Completeness of Bbelow extends (with no change) to this case.
Thus if the algorithm returns accept, then ncost({0,…,n},{n,…,2n})≤θ2^{qk+6}=θQ, and the gap-algorithm satisfies Soundness. For Completeness, assume Δedit(x,y)≤θn.
The Completeness of Bbelow extends (with no change) to this case. We conclude that
(i) J∈Bbelow(k+1,{0,…,n},log(1/θ)−qk−5) or (ii)
there exists an i′≤i such that ∣M(k+1,I×J,i,i′)∣>(1/3)∣Intervals(wk;I)∩Sparse(k,i′)∣.
Since dk=1, Proposition 4.4 implies
all sets Sparse(k,i) are empty, so M(k+1,I×J,i,i′) are also empty but (ii) requires them to be non-empty.
Hence, (ii) cannot hold, and so (i) holds, which implies
FAST-GED must accept, and so Completeness holds.
4.9 Time analysis
In this subsection, we upper bound the expected running time of FAST-GED
conditioned on the event SR of successful randomization,
in terms of the algorithm parameters w1,…,wk,
and d0,…,dk. These parameters will be optimized in the next subsection.
Theorem 4.9.
Suppose that SLOW-GED is a gap algorithm for edit distance satisfying gap-condition(T′,ζ′,Q′). For θ≥n^{−ζ′}
the expected running time of FAST-GED(n,θ,1/2;x,y) conditioned on
SR is upper-bounded by:
[TABLE]
The above theorem is not quite sufficient for our purposes since
it gives only an expected upper bound on the running time of the algorithm,
while we want an absolute upper bound. We can replace
the expected upper bound by an absolute upper bound by the following
routine modification of FAST-GED.
On a given input, use the above theorem to determine a number
τ∗ which is at least six times the expected upper bound on running
time given by the above theorem. Then the probability that FAST-GED
takes more than τ∗ steps is at most 1/6. So we run FAST-GED
but terminate with reject if it reaches τ∗ steps. This converts
the expected running time to an absolute bound on running time, but now
the completeness error (the probability of false rejection) is increased
from 1/2 to 2/3. But by running this algorithm twice and accepting if either
run accepts we restore the completeness error to below 1/2.
Combining the above theorem with this modification gives an algorithm
satisfying the correctness properties proved for FAST-GED and having
an absolute upper bound on time given as in the above theorem.
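The modification can be sketched as a wrapper around any step-counted computation. Here the algorithm is modeled, purely for illustration, as a Python generator that yields once per step and returns its accept/reject verdict; make_run is a hypothetical factory that starts a fresh run with fresh randomness.

```python
def run_with_budget(make_run, budget):
    """Run a step-yielding computation; reject (False) if it exceeds the budget."""
    run = make_run()
    for _ in range(budget):
        try:
            next(run)             # advance by one counted step
        except StopIteration as done:
            return bool(done.value)
    return False                  # budget exhausted: terminate with reject

def run_twice(make_run, budget):
    """Accept if either of two independent budgeted runs accepts."""
    return run_with_budget(make_run, budget) or run_with_budget(make_run, budget)

def completeness_error_after_two_runs(single_run_error):
    """Both independent runs must falsely reject for the combined run to reject."""
    return single_run_error ** 2
```

With the budget set to six times the expected running time, Markov's inequality bounds the timeout probability by 1/6, so a single budgeted run has completeness error at most 1/2+1/6=2/3, and two independent runs have error at most (2/3)^2=4/9<1/2. Soundness is unaffected, since cutting a run short can only turn an accept into a reject.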
Recall from Section 4.5 that
successful randomization means: (1) All calls to SLOW-GED return
correct answers, (2) All calls to SparseSample are successful and (3)
ProcessDense has successful sampling.
Recall that BSG is the sequence of random bits pregenerated for the calls to SLOW-GED
(as described in Section 4.5). For j∈[k],
BSSj are the random bits generated to select SparseSample’s in Preprocess(j) at iteration j of the algorithm, and
BPDj are the random bits generated to select sets S in ProcessDense(j) at iteration j
(also as described in Section 4.5).
Let B≤j denote the random bits BSG,BSS1,BPD1,…,BSSj,BPDj.
We introduce the following events:
SG: All calls to SLOW-GED return correct answers.
SS(j): All calls to SparseSample during iteration j are successful.
SS(≤j): All calls to SparseSample through the end of iteration j are successful.
PD(j): ProcessDense has successful sampling during iteration j.
PD(≤j): ProcessDense has successful sampling through the end of iteration j.
SR: Successful randomization, i.e. SG∧SS(≤k)∧PD(≤k).
We will argue that
the expected running time of FAST-GED conditioned on SR is bounded by (9).
In the bound, the outer sum on j corresponds to iterations of FAST-GED. We will show that the cost of iteration j
is bounded by the inner sum. When we analyze iteration j we fix the randomness B≤j in such a way that SG∧SS(≤j−1)∧PD(≤j−1) holds. The cost of iteration j is bounded conditioned on these
fixed random bits and subject to requirement SS(j)∧PD(j).
As a first step, we need a bound on the running time for EnumerateClose.
Recall that fixing the random bits B≤j−1 and BSSj makes EnumerateClose(j,⋅) run deterministically.
In the lemma below, we condition on (B≤j−1,BSSj)=β∗j and
consider the expected time of EnumerateClose(j,I×J,i) where J is a set
of intervals chosen
according to any distribution (possibly depending on β∗j) in which no set appears in J with probability more than some fixed bound p.
Lemma 4.10.
Let p∈[0,1], j∈[k], I∈Intervals(wj), and i∈{0,…,log(1/θ)}. Let β∗j be an assignment of the random bits
B≤j−1 and BSSj that satisfies
the success conditions SS(≤j) and PD(≤j−1).
Let J be a random variable whose value is a subset of
Intervals(wj,ε(i+3))
with the property that given the fixed randomness B≤j−1 and BSSj,
each J∈Intervals(wj,ε(i+3)) belongs to J with probability
at most p. Then
the expected
running time of EnumerateClose(j,I×J,i) over the choice of J is at most:
[TABLE]
Proof.
The proof is by induction on j.
Suppose j=1. We run SLOW-GED(zI,zJ,κ) for each J∈J.
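The expected time is $O\left(\frac{1}{\theta}\, p\, d_0\, w_1^{1+1/T'}\right)$ since
the expected size of J is at most $\frac{8pn}{\theta w_1} \leq \frac{16 p \sqrt{n}}{\theta} \leq \frac{32 p\, d_0}{\theta}$
and each call of SLOW-GED costs $w_1^{1+1/T'}$.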
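The chain of inequalities in the base case relies only on the rounding of w1 and d0. The sketch below checks it in exact integer arithmetic, assuming that ⌊x⌋2 denotes the largest power of two that is at most x (the helper names are hypothetical). The common factor p/θ cancels, so it is omitted.

```python
import math

def floor_pow2_of_sqrt(n):
    """Largest power of two that is at most sqrt(n), for an integer n >= 1.

    Assumption: this is the meaning of the rounding applied to sqrt(n).
    """
    return 1 << (math.isqrt(n).bit_length() - 1)

def base_case_chain_holds(n):
    """Check 8n/w1 <= 16*sqrt(n) <= 32*d0 without floating point."""
    w1 = d0 = floor_pow2_of_sqrt(n)
    # 8n/w1 <= 16*sqrt(n)  <=>  sqrt(n) <= 2*w1  <=>  n <= 4*w1^2
    first = n <= 4 * w1 * w1
    # 16*sqrt(n) <= 32*d0  <=>  sqrt(n) <= 2*d0  <=>  n <= 4*d0^2
    second = n <= 4 * d0 * d0
    return first and second
```

Both steps reduce to the fact that rounding down to a power of two loses at most a factor of 2.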
Now suppose j>1. The loops on i′ and I′ starting in lines (11-12) are executed O(1) times.
The construction of J′ in line (13) using ZoomIn
takes O(∣J′∣) time (sort J in the natural order and build J′ "from left to right").
By Proposition 4.2, for each J′∈Intervals(wj−1,ε(i′+3)), the number
of ε(i+3)-aligned wj-intervals J such that J′∈ZoomIn(j,I×J,i,I′,i′) is at most 33. Since J is selected according to a probability
distribution so that no set J belongs to J with probability more than p, J′ is sampled according to some distribution where for each
J′∈Intervals(wj−1,ε(i′+3)) the probability of J′∈J′ is at most 33p.
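Hence, the expected size of J′ is at most $33p \cdot \frac{n}{w_{j-1}\,\varepsilon(i'+3)} \leq O\left(\frac{p\, w_j}{\theta}\right)$,
since $w_j \geq w_{j-1} \geq \lfloor \sqrt{n} \rfloor_2$ and $\varepsilon(i'+3) \geq \theta/8$.
This is dominated by the summand
for h=j in (10), which
is at least
$\frac{p}{\theta}\, d_{j-1}\, w_j^{1+1/T'}$.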
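By the induction hypothesis, the recursive call to EnumerateClose in
line (14) takes expected time $O\left(\frac{33p}{\theta} \sum_{1 \leq h \leq j-1} d_{h-1} w_h^{1+1/T'}\right)$, which is
$O\left(\frac{p}{\theta} \sum_{1 \leq h \leq j-1} d_{h-1} w_h^{1+1/T'}\right)$.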
The final loop (22-26) on J∈K requires $O(|K|\, w_j^{1+1/T'})$ time. So we need to bound the size of K.
K is created in the loop on i′,I′. As noted
there are O(1) iterations of these loops, so it suffices to bound the number of elements added to K for
a single choice of I′,i′. During lines (15-17), for each J∈J, J
is added to K if there is a J′∈S′ that is in ZoomIn(i,I×J,I′,i′). By Proposition 4.2,
each J′∈S′ is responsible for the addition of at most 33 intervals to K, so ∣K∣≤33∣S′∣. Now, S′ is the output of a call to EnumerateClose(j−1,I′×J′,i′) where I′∈SparseSample(j,I,i′). By the success condition for
iteration j−1 of ProcessDense (Section 4.5) there are at most $2d_{j-1}$ intervals J′∈Intervals(wj−1,ε(i′+3))
classified as close for i′. As observed in the previous paragraph,
each of these at most $2d_{j-1}$ intervals belongs to J′ with probability at most 33p. So the expected size of S′ is at most $66 p\, d_{j-1}$.
Thus the expected cost of the loop (22-26) is $O(p\, d_{j-1}\, w_j^{1+1/T'})$.
Combining with the other loop gives the claimed time bound for EnumerateClose.
∎
Now we analyze the running time of Preprocess(j).
There are O(n/wj) pairs (I,i) that are enumerated in the two outer loops.
For each such pair, we construct Sparse(j,I,i) (which takes O(1) time), and Bbelow(j,I,i)
whose running time is O(1) if j=1 and is O(wj+∣R(j−1,I)∣+∣Intervals(wj,ε(i+3))∣) for j>1, which is the time to run APM. Summing over O(log(n)) values of i and noting that ε(i)≥θ, we obtain the upper bound
$O\left(w_j + |R(j-1,I)| + \frac{n}{\theta w_{j-1}}\right)$.
Summing over I gives $O\left(n + |R(j-1)| + \frac{n^2}{\theta w_{j-1} w_j}\right)$.
∣R(j−1)∣ is at most the number of level j−1 candidates ⟨i;I′×J′⟩, which is
at most $O\left(\frac{n^2}{\theta w_{j-1}^2}\right)$. Since $w_h \geq \lfloor \sqrt{n} \rfloor_2$ for all h by assumption,
the overall time for Preprocess(j) is O(n/θ). We observe that this term is dominated
by the h=j term in the inner sum of (9), which is $\frac{n\, w_j^{1/T'}}{\theta^2} \cdot \frac{d_{j-1}}{d_j} \geq \frac{n}{\theta}$.
The asymptotics of the running time do not depend on the choice of random bits BSSj.
We now analyze the time of ProcessDense(j).
We will condition the analysis on fixing the random bits (B≤j−1,BSSj)=β∗j so that SG∧SS(≤j)∧PD(≤j−1) holds.
The multiplicative cost of the outer iteration on i is absorbed in the O term.
The main part is the while loop (lines 8-22) on I∈T. This cost is divided into two parts, the
call to EnumerateClose within line (11), and the cost of (lines 14-20) which is only executed within the "else".
To bound the cost of the call to EnumerateClose in line (11), we want to apply
Lemma 4.10. For the hypothesis of this lemma
we need an upper bound p′ on the probability of any particular wj
interval being selected for S. According to the code of EnumerateClose,
every interval is placed in S with probability at most
p=min(1,c0logn/dj). However we need to consider the probability
of a given interval being placed in S conditioned on the
event PD(j), and this can be bounded above by p/Pr[PD(j)].
As noted in Section 4.5,
PD(j) occurs with probability at least $1 - n^{-9} \geq 1/2$,
so we can bound the conditional probability of any interval being placed in
S by 2p.
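Spelled out, the conditioning step is the following one-line calculation, where J denotes an arbitrary fixed interval and n≥2 is assumed so that the last inequality holds:

```latex
\Pr[J \in S \mid PD(j)]
  = \frac{\Pr[J \in S \wedge PD(j)]}{\Pr[PD(j)]}
  \le \frac{\Pr[J \in S]}{\Pr[PD(j)]}
  \le \frac{p}{1 - n^{-9}}
  \le 2p .
```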
Applying Lemma 4.10,
the expected time for the call to EnumerateClose in line (11)
is
$O\left(\frac{2p}{\theta} \sum_{1 \leq h \leq j} d_{h-1} w_h^{1+1/T'}\right)$,
which is $O\left(\frac{1}{\theta d_j} \sum_{1 \leq h \leq j} d_{h-1} w_h^{1+1/T'}\right)$.
The number of times this is executed is the number of possible I, which is
at most ∣Intervals(wj)∣=n/wj, so the overall expected cost of calls to EnumerateClose in line (11) is
$O\left(\frac{n}{\theta w_j d_j} \sum_{h=1}^{j} d_{h-1} w_h^{1+1/T'}\right)$,
as claimed in the theorem.
The time for executing (14-20) is dominated by the time of
the two calls to EnumerateClose, which are bounded to be at most
$O\left(\frac{1}{\theta} \sum_{h=1}^{j} d_{h-1} w_h^{1+1/T'}\right)$
using Lemma 4.10
with
the trivial setting p=1.
The number of times this is executed is bounded by the number of times in the loop on I that
I is declared dense and used as a pivot. We claim that if PD(j) holds then the number of pivots is upper bounded
by $O\left(\frac{n}{\theta w_j d_j}\right)$. To
see this, first note that if I is chosen as a pivot then by Section 4.5, conditioning on PD(j) implies
∣EnumerateClose(j,I×Intervals(wj,ε(i+3)),i)∣≥dj/2.
Furthermore, we claim that if I and I′ are both pivots then EnumerateClose(j,I×Intervals(wj,ε(i+3)),i)
is disjoint from EnumerateClose(j,I′×Intervals(wj,ε(i+3)),i). Suppose for contradiction
that both are pivots and there is a J in both sets, and that I is selected first as a pivot.
Then by the Soundness of EnumerateClose, ncost(I×J)≤ε(i−qj−1−6) and ncost(I′×J)≤ε(i−qj−1−6) and so by the triangle inequality ncost(I×I′)≤ε(i−qj−1−7)=ε(h1)
(where h1 is defined in the pseudocode of EnumerateClose.) But, in that case, the pseudocode of EnumerateClose ensures that I′
is placed in X in line (16) and therefore removed from T in line (20), making it impossible for I′ to be chosen as a pivot.
Since the sets EnumerateClose(j,I×Intervals(wj,ε(i+3)),i) corresponding to pivots are pairwise disjoint
subsets of Intervals(wj,ε(i+3)), each of size at least dj/2, and $|\mathrm{Intervals}(w_j,\varepsilon(i+3))| = O\left(\frac{n}{\theta w_j}\right)$,
the number of pivots is $O\left(\frac{n}{\theta w_j d_j}\right)$. Multiplying this by the cost of a single loop iteration as bounded above,
the result is bounded above as claimed in the theorem.
∎
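The pivot-counting argument is a packing bound, and it can be illustrated on a toy metric. The sketch below is not ProcessDense: it assumes integer points on a line in place of intervals, absolute difference in place of ncost, and a hypothetical function name. A point becomes a pivot when its radius-r ball is large enough, and every point within 2r of a pivot is then discarded, which by the triangle inequality forces the balls of distinct pivots to be disjoint.

```python
def select_pivots(points, radius, min_ball_size):
    """Greedy pivot selection on integer points on a line (toy metric).

    Returns a list of (pivot, ball) pairs, where ball is the set of all
    points within `radius` of the pivot.
    """
    universe = sorted(set(points))
    remaining = list(universe)
    pivots = []
    while remaining:
        p = remaining.pop(0)
        ball = {q for q in universe if abs(q - p) <= radius}
        if len(ball) >= min_ball_size:
            pivots.append((p, ball))
            # Discard every candidate within 2*radius of the new pivot, so
            # any later pivot has a ball disjoint from this one.
            remaining = [q for q in remaining if abs(q - p) > 2 * radius]
    return pivots

pivots = select_pivots(range(100), radius=3, min_ball_size=4)
```

Since the balls are pairwise disjoint and each has at least min_ball_size elements, the number of pivots is at most the universe size divided by min_ball_size, mirroring the bound on the number of pivots in the proof.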
4.10 Choosing the parameters
The time analysis is expressed in terms of the parameters w1,…,wk and d0,…,dk.
In this section we determine values of the parameters that achieve the claimed time bound.
It is convenient to introduce parameters γ1,…,γk, δ0,…,δk and τ, with
$w_i = \lfloor n^{\gamma_i} \rfloor_2$ and $d_i = \lfloor n^{\delta_i} \rfloor_2$ and $\theta = \lfloor n^{-\tau} \rfloor_2$.
Recall that the parameters of gap-condition include ζ>0 and we only need our gap
algorithm to work for τ≤ζ. In the theorem we are allowed to choose ζ to be any positive
constant. In the derivation below, we will see that we will need an upper bound on τ
as a function of T′ which will be used to determine ζ in
the final proof of Theorem 4.1 in the next section.
We impose the following conditions.
• $d_0 = w_1 = \lfloor \sqrt{n} \rfloor_2$, so δ0=γ1=1/2
• dk=1, so δk=0.
The time for iteration j is:
[TABLE]
Define
• αj=(1−γj−δj+2τ)
• $\nu_i = \delta_{i-1} + \left(\frac{T'+1}{T'}\right)\gamma_i$
Then the cost of processing level j can be rewritten as:
[TABLE]
We now choose γi and δi subject to
the following conditions:
• γ1=δ0=1/2
• αj is the same for all j
• νi is the same for all i
• δk=0
It is easy to check that for any B≥0, the first three conditions are satisfied by:
[TABLE]
The condition δk=0 implies:
[TABLE]
Then $\alpha_j = 1 - \gamma_j - \delta_j + 2\tau = \frac{B}{T'+1} + 2\tau$ and
$\nu_i = 1 + \frac{1}{2T'}$.
So the time for all iterations is:
[TABLE]
As indicated earlier, we will impose the condition $\tau \leq \frac{3T'-2}{6(6(T')^3 + 7(T')^2 + T')}$.
For fixed T′≥1, Bk(T′) is a decreasing function of k whose limiting value is 1/2. So we choose
k=k(T′) to be large enough so that
$B \leq \frac{3T'+1}{6T'}$. While the value k(T′) is not important, it is straightforward to verify that we can choose k(T′)=⌈(T′+1)(1+ln(T′+1))⌉.
Using the above choice for B, the exponent of n is at most
$1 + \frac{1}{2T'} + \frac{3T'+1}{6T'(T'+1)} + 2\tau$, and a computation shows that setting T=T′+1/6
and imposing $\tau \leq \frac{3T'-2}{6(6(T')^3 + 7(T')^2 + T')}$ (which we can do since T′≥1) results in an upper bound
on the exponent of 1+1/T as required.
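The computation referred to above can be reproduced in exact rational arithmetic. The sketch below uses hypothetical function names and evaluates the exponent at the largest permitted value of τ. It relies only on the expressions stated in this section.

```python
from fractions import Fraction

def tau_max(t_prime):
    """Largest tau allowed by the condition imposed on tau above."""
    t = Fraction(t_prime)
    return (3 * t - 2) / (6 * (6 * t**3 + 7 * t**2 + t))

def exponent_bound(t_prime, tau):
    """Upper bound on the exponent of n for the stated bound on B."""
    t = Fraction(t_prime)
    return 1 + 1 / (2 * t) + (3 * t + 1) / (6 * t * (t + 1)) + 2 * tau

def target_exponent(t_prime):
    """The exponent 1 + 1/T for T = T' + 1/6."""
    return 1 + 1 / (Fraction(t_prime) + Fraction(1, 6))
```

At τ equal to tau_max(T′) the two exponents coincide exactly, so any smaller τ gives an exponent of at most 1+1/T. For example, T′=1 gives τ≤1/84 and exponent 1+6/7.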
Finally, we need to verify the assumptions (1) and (2) that $\frac{n}{w_j} \geq d_j$ and $\frac{w_j}{w_{j+1}} \leq \theta/2$.
The former is immediate as γj+δj≤1. For the latter,
letting $M' = -\frac{1}{\log(n)} \log\left(\max_j 2w_j/w_{j+1}\right)$, we require that $\theta \geq n^{-M'}$, which we can
ensure for n large enough by choosing ζ<M, where $M = \min_j (\gamma_{j+1} - \gamma_j)$.
We have that SLOW-GED is a gap algorithm for edit distance satisfying gap-condition(T′,ζ′,Q′) where T′≥1,
ζ′>0 and Q′≥1. We have constructed FAST-GED
(using SLOW-GED as a subroutine), which satisfies gap-condition(T,ζ,Q) with T=T′+1/6, where
ζ>0 and Q≥1 are suitably chosen (depending only on T′, ζ′ and Q′).
In Section 4.8 we proved that FAST-GED has quality Q=2qk+6.
In Section 4.10 we adjusted the parameters so that the running time
computed in Section 4.9 is $O(n^{1+1/T})$ provided that
$\theta \geq n^{-\frac{3T'-2}{6(6(T')^3 + 7(T')^2 + T')}}$, $\theta \geq n^{-M/2}$ (where M is
defined in Section 4.10) and also $\theta \geq n^{-\zeta'}$.
So we set $\zeta = \min\left(\zeta', M/2, \frac{3T'-2}{6(6(T')^3 + 7(T')^2 + T')}\right)$.
Here we present the (routine) construction of the algorithm FAST-ED-UBT promised by Theorem 1.1.
Given T, let ζ(T) and Q(T) be given by Theorem 1.2.
On input x,y, FAST-ED-UBT defines $i_{\max} = \lfloor \zeta \log n \rfloor$ and for i from 1 to $i_{\max}$,
runs FAST-GED on input $(x, y, \theta = 2^{-i}, \delta = 1/(\zeta n \log n))$. Define i∗=0 if none of the runs accepts, and otherwise define i∗ to be the largest index for which
run i∗ accepts. FAST-ED-UBT outputs $Q 2^{-i^*} n$.
This is an upper bound on $d_{\mathrm{edit}}(x,y)$ since if i∗=0 then the output is Qn≥n, and otherwise
the first requirement of gap-condition ensures that
$d_{\mathrm{edit}}(x,y) \leq Q 2^{-i^*} n$.
We claim that for R=2Q, the probability that the output exceeds $R(d_{\mathrm{edit}}(x,y) + n^{1-\zeta})$
is at most 1/n. If $i^* = i_{\max}$ then the output is at most $2Q n^{1-\zeta} \leq R(d_{\mathrm{edit}}(x,y) + n^{1-\zeta})$.
So assume $i^* < i_{\max}$.
Say that the ith run of FAST-ED-UBT fails if $d_{\mathrm{edit}}(x,y) \leq 2^{-i} n$ and the algorithm rejects. The probability that some iteration fails is at most $\delta \zeta \log n \leq 1/n$,
so the probability that no iteration fails is at least 1−1/n. If no iteration fails then in particular iteration i∗+1
does not fail, and since it rejects (by the choice of i∗) we conclude that $d_{\mathrm{edit}}(x,y) > 2^{-1-i^*} n$
and so $Q 2^{-i^*} n \leq R\, d_{\mathrm{edit}}(x,y) \leq R(d_{\mathrm{edit}}(x,y) + n^{1-\zeta})$, and so FAST-ED-UBT has all of the required properties.
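The driver loop just described can be sketched as follows. This is an illustration under stated assumptions, not the construction itself. FAST-GED is replaced by a hypothetical stand-in oracle that computes the edit distance exactly in quadratic time, and therefore satisfies the gap requirements with Q=1 and no error. The logarithm is taken to base 2, and the parameter δ is omitted because the stand-in is deterministic.

```python
import math

def edit_distance(x, y):
    """Textbook quadratic-time dynamic program (stand-in only)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]

def exact_gap_oracle(x, y, n, theta):
    """Hypothetical stand-in for FAST-GED: accept iff ED(x, y) <= theta * n."""
    return edit_distance(x, y) <= theta * n

def fast_ed_ubt(x, y, n, zeta, Q, gap_oracle):
    """Return Q * 2^(-i*) * n, where i* is the largest accepting index (or 0)."""
    i_max = math.floor(zeta * math.log2(n))
    i_star = 0
    for i in range(1, i_max + 1):
        if gap_oracle(x, y, n, 2.0 ** -i):
            i_star = i  # indices increase, so the last accepting i is the largest
    return Q * 2.0 ** -i_star * n

x = "abcdefgh" * 4
y = x[:3] + "z" + x[4:20] + "z" + x[21:]
upper_bound = fast_ed_ubt(x, y, n=32, zeta=0.5, Q=1, gap_oracle=exact_gap_oracle)
```

With the exact stand-in the returned value is always an upper bound on the edit distance, and it is at most 2Q times the edit distance plus $n^{1-\zeta}$, matching the guarantee argued above with R=2Q.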
6 Approximate Pattern Matching
In this section we describe the implementation of the function
APM(I×J,ϵ,R) from Section 4.3.
This is a synthesis of algorithms from [18, 17].
We assume that R contains certified boxes and all J∈J are of the same width μ(I).
Let max(J)={max(J):J∈J} and min(J)={min(J):J∈J}.
Let R+ be R augmented by auxiliary shortcut edges of cost 0
from (min(I),0) to (min(I),m) for all m∈min(J).
Also for J∈J let J0 denote the interval {0,…,max(J)}. The following was observed in [18]:
Proposition 6.1.
For all J∈J, costR(I×J) satisfies costR+(I×J0)≤costR(I×J)
and cost(I×J)≤2costR+(I×J0).
Proof.
For the first inequality consider a min-cost traversal τ of I×J in the shortcut graph G(R).
We construct a traversal τ′ of I×J0 of cost at most costR(τ). Consider the first shortcut edge e=(i,j)→(i′,j′) of τ.
We may assume that prior to e, the path consists of a (possibly empty) sequence of horizontal edges followed by a (possibly empty)
sequence of vertical edges. The final such horizontal edge ends at (min(I),j) and j∈min(J) so in G(R+) we can replace
the horizontal path by the shortcut edge (min(I),0)→(min(I),j) of cost 0 to get a path that is no more costly.
For the second inequality, consider a min-cost traversal ρ of I×J0 in G(R+). Let j=0
if the path does not use one of the auxiliary shortcut edges, and otherwise let j be such that the path
starts with auxiliary shortcut edge (min(I),0)→(min(I),j). Let J^={j,…,maxJ}.
So the remaining portion of ρ is a min-cost traversal ρ^ of I×J^.
Since G(R) is certified, ∣μ(J^)−μ(I)∣≤cost(I×J^)≤costR(ρ^)=costR+(ρ^)=costR+(I×J0).
Also ∣JΔJ^∣=∣μ(J^)−μ(I)∣=∣min(J)−j∣≤costR+(I×J0).
So cost(I×J)≤cost(I×J^)+∣JΔJ^∣≤2costR+(I×J0).
∎
So if we compute costR+(I×J0) for every J∈J, and output the set of all J for which this cost
is less than κμ(I), we will satisfy the requirements of APM. We now describe a slightly modified version
of an algorithm from [17] that accomplishes this in time O(∣R+∣).
Let H be the graph G(R+) with each
cost $c_e$ of e=(i,j)→(i′,j′) replaced by the benefit $b_e = (i'-i) + (j'-j) - c_e$ (so H and V edges have
benefit 0).
For any interval B, the min-cost traversal of I×B in G(R+) is μ(I)+μ(B) minus the max-benefit traversal of I×B
in H. So it suffices to compute the max-benefit traversal of I×J0 in H for all J∈J.
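The cost-to-benefit conversion is a per-edge identity that telescopes along any traversal. The toy path below is made up for illustration. It mixes unit-length edges of cost 1 with two shortcut edges, and confirms that total benefit equals total displacement minus total cost.

```python
def benefit(edge):
    """Benefit of an edge ((i, j), (i2, j2), cost), as defined above."""
    (i, j), (i2, j2), cost = edge
    return (i2 - i) + (j2 - j) - cost

# A hypothetical traversal from (0, 0) to (4, 5).
path = [
    ((0, 0), (1, 0), 1),  # unit edge: benefit 0
    ((1, 0), (3, 2), 1),  # shortcut edge of cost 1: benefit 3
    ((3, 2), (3, 3), 1),  # unit edge: benefit 0
    ((3, 3), (4, 5), 2),  # shortcut edge of cost 2: benefit 1
]

total_cost = sum(cost for _, _, cost in path)
total_benefit = sum(benefit(e) for e in path)
displacement = (4 - 0) + (5 - 0)  # plays the role of mu(I) + mu(B)
```

Because the displacement is the same for every traversal of a given box, minimizing cost and maximizing benefit select the same paths.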
To do this, let j1<⋯<jr be the distinct second coordinates of the heads and tails of shortcut edges in G(R+).
We use a binary tree data structure with leaves corresponding to the indices of I,
where each tree node v stores a number av, and a collection of lists L1,…,Lr, where Lh stores pairs (e,q(e)) where the head of e has y-coordinate jh and q(e) is the max benefit of a path from (min(I),0) that ends with e.
We proceed in rounds h=1,…,r. In round h, let Ah consist of all the shortcuts whose tail has vertical coordinate jh. The preconditions for round h are: (1) for each leaf i, the stored value ai is the max benefit path to (i,jh) that includes a shortcut
whose head has horizontal coordinate i (or 0 if there is no such path), (2) for each internal node v, $a_v = \max\{a_i : i \text{ is a leaf in the subtree of } v\}$,
and (3) for every shortcut edge e=(i′,jh′)→(i′′,jh′′) with h′<h, the value q(e)
has been computed and (e,q(e)) is in list Lh′′.
During round h,
for each shortcut e=(i,jh)→(i′,jh′) in Ah, q(e) equals
the max of $a_\ell + b_e$ over tree leaves ℓ with ℓ≤i. This can be computed in O(logn) time as $\max_v (a_v + b_e)$,
where v ranges over the union of {i} with
the set of left children of vertices on the root-to-i path
that are not themselves on the path.
Add (e,q(e)) to list Lh′. After processing Ah,
update the binary tree: for each (e,q(e))∈Lh+1, let i be the horizontal coordinate
of the head of e and for all vertices v on the root-to-i path, replace av by max(av,q(e)).
The tree then satisfies the precondition for round h+1.
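The two tree operations used in each round, a prefix-maximum query and a root-to-leaf update, can be sketched as below. The class and method names are hypothetical. The sketch allocates the full tree up front instead of expanding it incrementally, so it illustrates the per-operation O(log n) cost but not the O(1) set-up cost discussed in the runtime analysis.

```python
class PrefixMaxTree:
    """Complete binary tree over leaves 0..size-1 storing subtree maxima."""

    def __init__(self, size):
        self.n = 1
        while self.n < size:
            self.n *= 2
        # a[1] is the root; leaf i is stored at index n + i. All values start at 0.
        self.a = [0] * (2 * self.n)

    def update(self, i, value):
        """Replace a_v by max(a_v, value) for every v on the root-to-i path."""
        v = self.n + i
        while v >= 1:
            self.a[v] = max(self.a[v], value)
            v //= 2

    def prefix_max(self, i):
        """Max of a_l over leaves l <= i.

        Combines leaf i with the left siblings of path nodes, i.e. the left
        children of path vertices that are not themselves on the path.
        """
        v = self.n + i
        best = self.a[v]
        while v > 1:
            if v % 2 == 1:  # v is a right child, so its left sibling is off the path
                best = max(best, self.a[v - 1])
            v //= 2
        return best

tree = PrefixMaxTree(8)
tree.update(5, 7)
tree.update(2, 3)
```

In round h, q(e) for a shortcut e with tail (i,jh) would then be tree.prefix_max(i) plus the benefit of e, and each pair in the next list triggers one call to update at the horizontal coordinate of the head of e.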
To obtain the output to APM,
for each J∈J, let h(J) be the index of the last iteration for which $j_{h(J)} \leq \max(J)$.
The benefit of I×J0 is the value of $a_{v_0}$ at the end of iteration h(J), where v0 is the root.
For the runtime analysis:
It would take O(μ(I)) time to set up the full tree data structure so we will build it incrementally by expanding only the parts of the data
structure that contain non-zero values. Hence, the set up cost of the data structure is O(1). It takes O(∣R+∣log∣R+∣) time to sort the shortcuts, and
O(logμ(I)) processing time per shortcut (computing q(e) and later updating the data structure), overall
giving runtime O(∣R+∣+∣J∣) up to logarithmic factors.
Bibliography
[1] Amir Abboud and Arturs Backurs. Towards hardness of approximation for polynomial time problems. In 8th Innovations in Theoretical Computer Science Conference, ITCS 2017, January 9-11, 2017, Berkeley, CA, USA, pages 11:1–11:26, 2017.
[2] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 59–78, 2015.
[3] Amir Abboud, Thomas Dueholm Hansen, Virginia Vassilevska Williams, and Ryan Williams. Simulating branching programs with edit distance and friends: or: a polylog shaved is a lower bound made. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 375–388, 2016.
[4] Alex Andoni. Simpler constant-factor approximation to edit distance problems. Manuscript, 2018.
[5] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Polylogarithmic approximation for edit distance and the asymmetric query complexity. In 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA, pages 377–386, 2010.
[6] Alexandr Andoni and Huy L. Nguyen. Near-optimal sublinear time algorithms for Ulam distance. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 76–86, 2010. doi:10.1137/1.9781611973075.8.
[7] Alexandr Andoni and Krzysztof Onak. Approximating edit distance in near-linear time. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, STOC ’09, pages 199–204, New York, NY, USA, 2009. ACM.
[8] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC ’15, pages 51–58, New York, NY, USA, 2015. ACM.