Walking the Dog Fast in Practice: Algorithm Engineering of the Fréchet Distance
Karl Bringmann, Marvin Künnemann, André Nusser

TL;DR
This paper introduces a highly efficient, certifying implementation for the Fréchet distance decision procedure, significantly improving practical performance and addressing the gap between theoretical hardness and real-world applications.
Contribution
It presents a fast, certifying algorithm for the Fréchet distance decision problem, enhancing practical efficiency and empirical analysis on realistic data.
Findings
Up to 100x faster decision procedure compared to previous methods.
Up to 30x faster queries in near-neighbor data structures.
Empirical validation on diverse datasets including handwritten characters and GPS trajectories.
Abstract
The Fréchet distance provides a natural and intuitive measure for the popular task of computing the similarity of two (polygonal) curves. While a simple algorithm computes it in near-quadratic time, a strongly subquadratic algorithm cannot exist unless the Strong Exponential Time Hypothesis fails. Still, fast practical implementations of the Fréchet distance, in particular for realistic input curves, are highly desirable. This has even led to a designated competition, the ACM SIGSPATIAL GIS Cup 2017: Here, the challenge was to implement a near-neighbor data structure under the Fréchet distance. The bottleneck of the top three implementations turned out to be precisely the decision procedure for the Fréchet distance. In this work, we present a fast, certifying implementation for deciding the Fréchet distance, in order to (1) complement its pessimistic worst-case hardness by…
| data set | type | #curves | mean hops | stddev hops |
|---|---|---|---|---|
| Sigspatial | synthetic GPS-like | 20199 | 247.8 | 154.0 |
| Characters | handwritten | 2858 | 120.9 | 21.0 |
| GeoLife | GPS (multi-modal) | 16966 | 1080.4 | 1844.1 |
| | Sigspatial | Characters | GeoLife |
|---|---|---|---|
| omit none | 99.085 | 153.195 | 552.661 |
| omit Rule II | 112.769 | 204.347 | 1382.306 |
| omit Rule IIIa | 193.437 | 296.679 | 1779.810 |
| omit Rule IIIb | 5317.665 | 1627.817 | 385031.421 |
| omit Rule IIIc | 202.469 | 273.146 | 2049.632 |
| omit Rule IV | 110.968 | 161.142 | 696.382 |
| | Sigspatial | | | | | Characters | | | | | GeoLife | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 10 | 100 | 1000 | 0 | 1 | 10 | 100 | 1000 | 0 | 1 | 10 | 100 | 1000 |
| [7] | 0.094 | 0.123 | 0.322 | 1.812 | 8.408 | 0.187 | 0.217 | 0.421 | 2.222 | 17.169 | 0.298 | 0.741 | 4.327 | 33.034 | 109.44 |
| [13] | 0.421 | 0.618 | 1.711 | 7.86 | 35.704 | 0.176 | 0.28 | 0.611 | 3.039 | 17.681 | 3.627 | 6.067 | 26.343 | 120.509 | 415.548 |
| [22] | 0.197 | 0.188 | 0.643 | 5.564 | 76.144 | 0.142 | 0.147 | 0.222 | 1.849 | 22.499 | 2.614 | 4.112 | 16.428 | 166.206 | 1352.19 |
| ours | 0.017 | 0.007 | 0.026 | 0.130 | 0.490 | 0.004 | 0.020 | 0.058 | 0.301 | 1.176 | 0.027 | 0.089 | 0.341 | 1.108 | 3.642 |
| | Sigspatial | | | | | Characters | | | | | GeoLife | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 10 | 100 | 1000 | 0 | 1 | 10 | 100 | 1000 | 0 | 1 | 10 | 100 | 1000 |
| spatial hashing | 0.002 | 0.003 | 0.005 | 0.017 | 0.074 | 0.002 | 0.002 | 0.004 | 0.011 | 0.032 | 0.006 | 0.009 | 0.016 | 0.032 | 0.091 |
| greedy filter | 0.004 | 0.006 | 0.024 | 0.143 | 0.903 | 0.004 | 0.010 | 0.032 | 0.153 | 0.721 | 0.009 | 0.017 | 0.060 | 0.273 | 1.410 |
| adaptive equal-time filter | 0.000 | 0.001 | 0.006 | 0.030 | 0.088 | 0.001 | 0.004 | 0.018 | 0.088 | 0.424 | 0.005 | 0.017 | 0.063 | 0.273 | 1.211 |
| negative filter | 0.001 | 0.002 | 0.010 | 0.044 | 0.107 | 0.003 | 0.012 | 0.038 | 0.152 | 0.309 | 0.008 | 0.020 | 0.069 | 0.200 | 0.606 |
| complete decider | 0.002 | 0.011 | 0.044 | 0.214 | 0.330 | 0.005 | 0.030 | 0.109 | 0.671 | 2.639 | 0.062 | 0.210 | 0.998 | 3.025 | 8.760 |
| | Sigspatial | | | | |
|---|---|---|---|---|---|
| | 0 | 1 | 10 | 100 | 1000 |
| computation without certification | 6.9 | 21.3 | 84.5 | 429.4 | 1409.1 |
| certifying computation | 10.0 | 29.6 | 117.8 | 553.8 | 1840.2 |
| – computation of certificates | 1.0 | 3.7 | 12.0 | 40.9 | 65.7 |
| – YES certificates (complete decider) | 0.0 | 0.4 | 1.6 | 8.2 | 12.1 |
| – NO certificates (complete decider) | 1.0 | 3.3 | 10.0 | 31.4 | 50.0 |
| checking certificates | 6.4 | 13.9 | 63.7 | 426.5 | 3803.2 |
| – checking filter certificates | 6.0 | 11.3 | 51.6 | 361.2 | 3666.7 |
| – checking complete decider certificates | 0.4 | 2.6 | 12.1 | 65.3 | 136.5 |
Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken (all three authors)
Copyright: Karl Bringmann, Marvin Künnemann, and André Nusser
Supplementary material: https://github.com/chaot4/frechet_distance
Abstract.
The Fréchet distance provides a natural and intuitive measure for the popular task of computing the similarity of two (polygonal) curves. While a simple algorithm computes it in near-quadratic time, a strongly subquadratic algorithm cannot exist unless the Strong Exponential Time Hypothesis fails. Still, fast practical implementations of the Fréchet distance, in particular for realistic input curves, are highly desirable. This has even led to a designated competition, the ACM SIGSPATIAL GIS Cup 2017: Here, the challenge was to implement a near-neighbor data structure under the Fréchet distance. The bottleneck of the top three implementations turned out to be precisely the decision procedure for the Fréchet distance.
In this work, we present a fast, certifying implementation for deciding the Fréchet distance, in order to (1) complement its pessimistic worst-case hardness by an empirical analysis on realistic input data and to (2) improve the state of the art for the GIS Cup challenge. We experimentally evaluate our implementation on a large benchmark consisting of several data sets (including handwritten characters and GPS trajectories). Compared to the winning implementation of the GIS Cup, we obtain running time improvements of up to more than two orders of magnitude for the decision procedure and of up to a factor of 30 for queries to the near-neighbor data structure.
Key words and phrases:
Curve simplification, Fréchet distance, algorithm engineering
ACM Subject Classification:
Theory of computation → Computational geometry; Theory of computation → Design and analysis of algorithms
1. Introduction
A variety of practical applications analyze and process trajectory data coming from different sources like GPS measurements, digitized handwriting, motion capturing, and many more. One elementary task on trajectories is to compare them, for example in the context of signature verification [30], map matching [18, 29, 19, 12], and clustering [14, 16]. In this work we consider the Fréchet distance as a curve similarity measure, as it is arguably the most natural and popular one. Intuitively, the Fréchet distance between two curves is explained using the following analogy. A person walks a dog, connected by a leash. Both walk along their respective curve, with possibly varying speeds and without ever walking backwards. Over all such traversals, we search for the ones which minimize the leash length, i.e., we minimize the maximal distance between the dog and the person during the traversal.
Initially defined more than one hundred years ago [24], the Fréchet distance quickly gained popularity in computer science after the first algorithm to compute it was presented by Alt and Godau [5]. In particular, they showed how to decide whether two length-$n$ curves have Fréchet distance at most $\delta$ in time $\mathcal{O}(n^2)$ by full exploration of a quadratic-sized search space, the so-called free-space (we refer to Section 3.1 for a definition). Almost twenty years later, it was shown that, conditional on the Strong Exponential Time Hypothesis (SETH), there cannot exist an algorithm with running time $\mathcal{O}(n^{2-\epsilon})$ for any $\epsilon > 0$ [8]. Even for realistic models of input curves, such as $c$-packed curves [21], exact computation of the Fréchet distance requires time $n^{2-o(1)}$ under SETH [8]. Only if we relax the goal to finding a $(1+\epsilon)$-approximation of the Fréchet distance, algorithms with near-linear running times in $n$ and $c$ on $c$-packed curves are known to exist [21, 9].
It is a natural question whether these hardness results are mere theoretical worst-case results or whether computing the Fréchet distance is also hard in practice. This line of research was particularly fostered by the research community in the form of the GIS Cup 2017 [28]. In this competition, the 28 contesting teams were challenged to give a fast implementation for the following problem: Given a data set $\mathcal{D}$ of two-dimensional trajectories, answer queries that ask to return, given a curve $\pi$ and query distance $\delta$, all $\sigma \in \mathcal{D}$ with Fréchet distance at most $\delta$ to $\pi$. We call this the near-neighbor problem.
The three top implementations [7, 13, 22] use multiple layers of heuristic filters and spatial hashing to decide as early as possible whether a curve belongs to the output set or not, and finally use an essentially exhaustive Fréchet distance computation for the remaining cases. Specifically, these implementations perform the following steps:
- (0) Preprocess $\mathcal{D}$.

On receiving a query with curve $\pi$ and query distance $\delta$:
- (1) Use spatial hashing to identify candidate curves $\sigma \in \mathcal{D}$.
- (2) For each candidate $\sigma$, decide whether $\pi, \sigma$ have Fréchet distance $\le \delta$:
  - (a) Use heuristics (filters) for a quick resolution in simple cases.
  - (b) If unsuccessful, use a complete decision procedure via free-space exploration.
Let us highlight the Fréchet decider outlined in Steps 2a and 2b: Here, filters refer to sound, but incomplete Fréchet distance decision procedures, i.e., whenever they succeed in finding an answer, it is correct, but they may return that the answer remains unknown. In contrast, a complete decision procedure via free-space exploration explores a sufficient part of the free space (the search space) to always determine the correct answer. As it turns out, the bottleneck in all three implementations is precisely Step 2b, the complete decision procedure via free-space exploration. In particular, [7] improved upon the naive implementation of the free-space exploration by designing very basic pruning rules, which may be the advantage that won them the competition. There are two directions for further substantial improvements over the cup implementations: (1) increasing the range of instances covered by fast filters and (2) algorithmic improvements of the exploration of the reachable free-space.
Our Contribution.
We develop a fast, practical Fréchet distance implementation. To this end, we give a complete decision procedure via free-space exploration that uses a divide-and-conquer interpretation of the Alt-Godau algorithm for the Fréchet distance and optimize it using sophisticated pruning rules. These pruning rules greatly reduce the search space for the realistic benchmark sets we consider – this is surprising given that simple constructions generate hard instances which require the exploration of essentially the full quadratic-sized search space [8, 10]. Furthermore, we present improved filters that are sufficiently fast compared to the complete decider. Here, the idea is to use adaptive step sizes (combined with useful heuristic tests) to achieve essentially “sublinear” time behavior for testing if an instance can be resolved quickly. Additionally, our implementation is certifying (see [25] for a survey on certifying algorithms), meaning that for every decision of curves being close/far, we provide a short proof (certificate) that can be checked easily; we also implemented a computational check of these certificates. See Section 8 for details.
An additional contribution of this work is the creation of benchmarks to make future implementations more easily comparable. We compile benchmarks both for the near-neighbor problem (Steps 0 to 2) and for the decision problem (Step 2). For this, we used publicly available curve data and created queries in a way that should be representative for the performance analysis of an implementation. As data sets we use the GIS Cup trajectories [1], a set of handwritten characters called the Character Trajectories Data Set [2] from [20], and the GeoLife data set [3] of Microsoft Research [32, 31, 33]. Our benchmarks cover different distances and also curves of different similarity, giving a broad overview over different settings. We make the source code as well as the benchmarks publicly available to enable independent comparisons with our approach (code and benchmarks are available at https://github.com/chaot4/frechet_distance). Additionally, we particularly focus on making our implementation easily readable to enable and encourage others to reuse the code.
Evaluation.
The GIS Cup 2017 had 28 submissions, with the top three submissions (in decreasing order) due to Bringmann and Baldus [7], Buchin et al. [13], and Dütsch and Vahrenhold [22]. (The submissions were evaluated "for their correctness and average performance on a[sic!] various large trajectory databases and queries". Additional criteria were the following: "We will use the total elapsed wall clock time as a measure of performance. For breaking ties, we will first look into the scalability behavior for more and more queries on larger and larger datasets. Finally, we break ties on code stability, quality, and readability and by using different datasets.") We compare our implementation with all of them by running their implementations on our new benchmark set for the near-neighbor problem and also comparing to the improved decider of [7]. The comparison shows significant speed-ups up to almost a factor of 30 for the near-neighbor problem and up to more than two orders of magnitude for the decider.
Related Work.
The best known algorithm for deciding the Fréchet distance runs in time $\mathcal{O}(n^2 (\log\log n)^2 / \log n)$ on the word RAM [11]. This relies on the Four Russians technique and is mostly of theoretical interest. There are many variants of the Fréchet distance, e.g., the discrete Fréchet distance [4, 23]. After the GIS Cup 2017, several practical papers studying aspects of the Fréchet distance appeared [6, 17, 27]. Some of this work [6, 17] addressed how to improve upon the spatial hashing step (Step 1) if we relax the requirement of exactness. Since this is orthogonal to our approach of improving the complete decider, these improvements could possibly be combined with our algorithm. The other work [27] neither compared with the GIS Cup implementations, nor provided their source code publicly to allow for a comparison, which is why we have to ignore it here.
Organization.
First, in Section 2, we present all the core definitions. Subsequently, we explain our complete decider in Section 3. Section 4 then explains the decider and its filtering steps. Then, in Section 5, we present a query data structure which enables us to compare to the GIS Cup submissions. Section 6 contains some details regarding the implementation to highlight crucial points we deem relevant for similar implementations. We conduct extensive experiments in Section 7, detailing the improvements over the current state of the art by our implementation. Finally, in Section 8, we describe how we make our implementation certifying and evaluate the certifying code experimentally.
2. Preliminaries
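Our implementation as well as the description are restricted to two dimensions; however, the approach can also be generalized to polygonal curves in $d$ dimensions. Therefore, a curve $\pi$ is defined by its vertices $\pi_1, \dots, \pi_n \in \mathbb{R}^2$ which are connected by straight lines. We also allow continuous indices as follows. For $p = i + \lambda$ with $i \in \{1, \dots, n-1\}$ and $\lambda \in [0,1]$, let
$$\pi_p := (1-\lambda)\,\pi_i + \lambda\,\pi_{i+1}.$$
We call the $\pi_p$ with $p \in [1,n]$ the points on $\pi$. A subcurve of $\pi$ which starts at point $p$ and ends at point $q$ on $\pi$ is denoted by $\pi_{p \dots q}$. In the remainder, we denote the number of vertices of $\pi$ (resp. $\sigma$) with $n$ (resp. $m$) if not stated otherwise. We denote the length of a curve $\pi$ by $\|\pi\|$, i.e., the sum of the Euclidean lengths of its line segments. Additionally, we use $\|v\|$ for the Euclidean norm of a vector $v \in \mathbb{R}^2$. For two curves $\pi$ and $\sigma$, the Fréchet distance $d_F(\pi, \sigma)$ is defined as
$$d_F(\pi, \sigma) := \inf_{f \in T_n,\, g \in T_m} \; \max_{t \in [0,1]} \; \|\pi_{f(t)} - \sigma_{g(t)}\|,$$
where $T_k$ is the set of monotone and continuous functions $f: [0,1] \to [1,k]$ with $f(0) = 1$ and $f(1) = k$. We define a traversal as a pair $(f,g) \in T_n \times T_m$. Given two curves $\pi, \sigma$ and a query distance $\delta$, we call them close if $d_F(\pi, \sigma) \le \delta$ and far otherwise. There are two problem settings that we consider in this paper: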
**Decider Setting:**
Given curves $\pi, \sigma$ and a distance $\delta$, decide whether $d_F(\pi, \sigma) \le \delta$. (With such a decider, we can compute the exact distance by using parametric search in theory and binary search in practice.)
**Query Setting:**
Given a curve dataset $\mathcal{D}$, build a data structure that on query $(\pi, \delta)$ returns all $\sigma \in \mathcal{D}$ with $d_F(\pi, \sigma) \le \delta$.
We mainly focus on the decider in this work. To allow for a comparison with previous implementations (which are all in the query setting), we also run experiments with our decider plugged into a data structure for the query setting.
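The remark about binary search in the Decider Setting can be illustrated with a minimal sketch. It assumes a monotone decider passed in as a callable and a known upper bound on the distance; the function name, the tolerance parameter, and the stand-in decider in the usage note are illustrative and not part of the paper's code. The result is an approximation up to the tolerance, not the exact distance.

```python
from typing import Callable


def smallest_accepted_distance(
    decide: Callable[[float], bool], upper: float, tol: float = 1e-6
) -> float:
    """Binary search for the smallest delta accepted by a monotone decider.

    Assumes decide(upper) is True and that decide is monotone in delta
    (once it answers True, it answers True for every larger delta).
    """
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if decide(mid):
            hi = mid  # the distance is at most mid
        else:
            lo = mid  # the distance is larger than mid
    return hi
```

For instance, with the stand-in decider `lambda d: d >= 2.5` and `upper=10.0`, the search converges to a value within the tolerance of 2.5.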
2.1. Preprocessing
When reading the input curves we immediately compute additional data which is stored with each curve:
**Prefix Distances:**
To be able to quickly compute the curve length between any two vertices of $\pi$, we precompute the prefix lengths, i.e., the curve lengths $\|\pi_{1 \dots i}\|$ for every $i \in \{2, \dots, n\}$. We can then compute the curve length for two indices $i < i'$ on $\pi$ by $\|\pi_{i \dots i'}\| = \|\pi_{1 \dots i'}\| - \|\pi_{1 \dots i}\|$.
**Bounding Box:**
We compute the bounding box of all curves, which is a simple coordinate-wise maximum and minimum computation.
Both of these preprocessing steps are extremely cheap as they only require a single pass over all curves, which we anyway do when parsing them. In the remainder of this work we assume that this additional data was already computed, in particular, we do not measure it in our experiments as it is dominated by reading the curves.
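As a minimal sketch of this preprocessing, assuming curves are stored as lists of `(x, y)` tuples with 0-based vertex indices (the function names are illustrative, not taken from the authors' implementation):

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def prefix_lengths(curve: List[Point]) -> List[float]:
    """prefix[k] = length of the curve from vertex 0 up to vertex k."""
    prefix = [0.0]
    for a, b in zip(curve, curve[1:]):
        prefix.append(prefix[-1] + math.dist(a, b))
    return prefix


def subcurve_length(prefix: List[float], i: int, j: int) -> float:
    """Curve length between vertices i <= j, in constant time."""
    return prefix[j] - prefix[i]


def bounding_box(curve: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned bounding box as (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in curve]
    ys = [p[1] for p in curve]
    return (min(xs), min(ys), max(xs), max(ys))
```

Both helpers make a single pass over the vertices, which matches the claim that the cost is dominated by reading the curves.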
3. Complete Decider
The key improvement of this work lies in the complete decider via free-space exploration. Here, we use a divide-and-conquer interpretation of the algorithm of Alt and Godau [5] which is similar to [7] where a free-space diagram is built recursively. This interpretation allows us to prune away large parts of the search space by designing powerful pruning rules identifying parts of the search space that are irrelevant for determining the correct output. Before describing the details, we formally define the free-space diagram.
3.1. Free-Space Diagram
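The free-space diagram was first defined in [5]. Given two polygonal curves $\pi$ and $\sigma$ and a distance $\delta$, it is defined as the set of all pairs of indices of points from $\pi$ and $\sigma$ that are close to each other, i.e.,
$$F := \{(p,q) \in [1,n] \times [1,m] \mid \|\pi_p - \sigma_q\| \le \delta\}.$$
For an example see Figure 1. A path from $a$ to $b$ in the free-space diagram $F$ is defined as a continuous mapping $P: [0,1] \to F$ with $P(0) = a$ and $P(1) = b$. A path $P$ in the free-space diagram is monotone if $P(x)$ is component-wise at most $P(y)$ for any $0 \le x \le y \le 1$. The reachable space is then defined as
$$R := \{(p,q) \in F \mid \text{there exists a monotone path from } (1,1) \text{ to } (p,q) \text{ in } F\}.$$
Figure 2 shows the reachable space for the free-space diagram of Figure 1. It is well known that $d_F(\pi, \sigma) \le \delta$ if and only if $(n,m) \in R$.
This leads us to a simple dynamic programming algorithm to decide whether the Fréchet distance of two curves is at most some threshold distance. We iteratively compute $R$ starting from $(1,1)$ and ending at $(n,m)$, using the previously computed values. As $R$ is potentially a set of infinite size, we have to discretize it. A natural choice is to restrict to cells. The cell of $R$ with coordinates $(i,j) \in \{1, \dots, n-1\} \times \{1, \dots, m-1\}$ is defined as $C_{i,j} := R \cap [i,i+1] \times [j,j+1]$. This choice works because, given $R \cap [i,i+1] \times \{j\}$ and $R \cap \{i\} \times [j,j+1]$, we can compute $R \cap [i,i+1] \times \{j+1\}$ and $R \cap \{i+1\} \times [j,j+1]$ in constant time; this follows from the simple fact that $F \cap [i,i+1] \times [j,j+1]$ is convex [5]. We call this computation of the outputs of a cell the cell propagation. This algorithm runs in time $\mathcal{O}(nm)$ and was introduced by Alt and Godau [5].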
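The cell-by-cell dynamic program can be sketched as follows. This is an unoptimized illustration of the classical procedure, not the pruned decider developed in this paper; it assumes 0-based vertex lists with at least two vertices per curve, stores each boundary as a sub-interval of $[0,1]$ along the corresponding segment, and ignores floating-point robustness. All function names are illustrative.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Interval = Optional[Tuple[float, float]]  # sub-interval of [0, 1]; None = empty


def free_interval(p: Point, a: Point, b: Point, delta: float) -> Interval:
    """Parameters t in [0, 1] for which a + t*(b - a) is within delta of p."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    fx, fy = a[0] - p[0], a[1] - p[1]
    qa = dx * dx + dy * dy
    qb = 2.0 * (fx * dx + fy * dy)
    qc = fx * fx + fy * fy - delta * delta
    if qa == 0.0:  # degenerate segment (a == b)
        return (0.0, 1.0) if qc <= 0.0 else None
    disc = qb * qb - 4.0 * qa * qc
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    lo = max(0.0, (-qb - root) / (2.0 * qa))
    hi = min(1.0, (-qb + root) / (2.0 * qa))
    return (lo, hi) if lo <= hi else None


def propagate(free: Interval, across: Interval, along: Interval) -> Interval:
    """Reachable part of one output boundary of a cell (cell propagation).

    across: reachable input on the perpendicular boundary; by convexity of the
            free space in a cell it reaches the whole free output.
    along:  reachable input on the opposite, parallel boundary; monotonicity
            only allows output points that are not below its lowest point.
    """
    if free is None:
        return None
    if across is not None:
        return free
    if along is not None and max(free[0], along[0]) <= free[1]:
        return (max(free[0], along[0]), free[1])
    return None


def frechet_at_most(P: List[Point], Q: List[Point], delta: float) -> bool:
    """Decide d_F(P, Q) <= delta by propagating reachability through all cells."""
    n, m = len(P), len(Q)
    if math.dist(P[0], Q[0]) > delta or math.dist(P[-1], Q[-1]) > delta:
        return False
    # left[i][j]: reachable part of {i} x [j, j+1]; bottom[i][j]: of [i, i+1] x {j}
    left: List[List[Interval]] = [[None] * (m - 1) for _ in range(n)]
    bottom: List[List[Interval]] = [[None] * m for _ in range(n - 1)]
    # Outer boundaries are reachable only while the free part is uninterrupted.
    for j in range(m - 1):
        f = free_interval(P[0], Q[j], Q[j + 1], delta)
        if f is None or f[0] > 0.0:
            break
        left[0][j] = f
        if f[1] < 1.0:
            break
    for i in range(n - 1):
        f = free_interval(Q[0], P[i], P[i + 1], delta)
        if f is None or f[0] > 0.0:
            break
        bottom[i][0] = f
        if f[1] < 1.0:
            break
    for i in range(n - 1):
        for j in range(m - 1):
            l_in, b_in = left[i][j], bottom[i][j]
            right_free = free_interval(P[i + 1], Q[j], Q[j + 1], delta)
            top_free = free_interval(Q[j + 1], P[i], P[i + 1], delta)
            left[i + 1][j] = propagate(right_free, b_in, l_in)
            bottom[i][j + 1] = propagate(top_free, l_in, b_in)
    last = left[n - 1][m - 2]
    return last is not None and last[1] >= 1.0
```

For the curves `[(0, 0), (5, 3), (10, 0)]` and `[(0, 0), (10, 0)]`, whose Fréchet distance is 3, this sketch answers close for $\delta = 3.1$ and far for $\delta = 2.9$.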
3.2. Basic Algorithm
For integers $1 \le i \le i' \le n$ and $1 \le j \le j' \le m$ we call the set $B = [i,i'] \times [j,j']$ a box. We denote the left/right/bottom/top boundaries of $B$ by $B_l = \{i\} \times [j,j']$, $B_r = \{i'\} \times [j,j']$, $B_b = [i,i'] \times \{j\}$, $B_t = [i,i'] \times \{j'\}$. The left input of $B$ is $B_l^R = B_l \cap R$, and its bottom input is $B_b^R = B_b \cap R$. Similarly, the right/top output of $B$ is $B_r^R = B_r \cap R$, $B_t^R = B_t \cap R$. A box is a cell if $i + 1 = i'$ and $j + 1 = j'$. We always denote the lower left corner of a box by $(i,j)$ and the top right by $(i',j')$, if not mentioned otherwise.
A recursive variant of the standard free-space decision procedure is as follows: Start with $B = [1,n] \times [1,m]$. At any recursive call, if $B$ is a cell, then determine its outputs from its inputs in constant time, as described by [5]. Otherwise, split $B$ vertically or horizontally into $B_1$ and $B_2$; first compute the outputs of $B_1$ from the inputs of $B$, and then compute the outputs of $B_2$ from the inputs of $B$ and the outputs of $B_1$. In the end, we just have to check $(n,m) \in R$ to decide whether the curves are close or far. This is a constant-time operation after calculating all outputs.
Now comes the main idea of our approach: we try to avoid recursive splitting by directly computing the outputs for non-cell boxes using certain rules. We call them pruning rules as they enable pruning large parts of the recursion tree induced by the divide-and-conquer approach. Our pruning rules are heuristic, meaning that they are not always applicable; however, we show in the experiments that on practical curves they apply very often and therefore massively reduce the number of recursive calls. The detailed pruning rules are described in Section 3.3. Using these rules, we change the above recursive algorithm as follows. In any recursive call on box $B$, we first try to apply the pruning rules. If this is successful, then we obtained the outputs of $B$ and we are done with this recursive call. Otherwise, we perform the usual recursive splitting. Corresponding pseudocode is shown in Algorithm 1.
In the remainder of this section, we describe our pruning rules and their effects.
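The control flow of this recursion can be sketched independently of the geometry. The sketch below is not Algorithm 1: the callbacks `try_prune` and `propagate_cell` are hypothetical placeholders for the pruning rules and the cell propagation, and splitting along the longer side is an assumption made here, since the text only says that a box is split vertically or horizontally. The function returns the number of recursive calls so the effect of pruning is visible.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (i, i2, j, j2) with i < i2 and j < j2


def explore(
    box: Box,
    try_prune: Callable[[Box], bool],
    propagate_cell: Callable[[Box], None],
) -> int:
    """Divide-and-conquer over boxes; returns the number of recursive calls."""
    i, i2, j, j2 = box
    if try_prune(box):  # outputs were determined directly by a pruning rule
        return 1
    if i2 - i == 1 and j2 - j == 1:  # a single cell: constant-time propagation
        propagate_cell(box)
        return 1
    if i2 - i >= j2 - j:  # assumption: split the longer side in the middle
        mid = (i + i2) // 2
        first, second = (i, mid, j, j2), (mid, i2, j, j2)
    else:
        mid = (j + j2) // 2
        first, second = (i, i2, j, mid), (i, i2, mid, j2)
    # The lower/left part is processed first because its outputs feed the second part.
    calls_first = explore(first, try_prune, propagate_cell)
    calls_second = explore(second, try_prune, propagate_cell)
    return 1 + calls_first + calls_second
```

With a `try_prune` that never succeeds, a box of 4 by 4 cells costs 31 calls (16 cells plus 15 inner boxes); if the pruning rules resolve the root box, a single call suffices.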
3.3. Pruning Rules
In this section we introduce the rules that we use to compute outputs of boxes which are above cell-level in certain special cases. Note that we aim at catching special cases which occur often in practice, as we cannot hope for improvements on adversarial instances due to the conditional lower bound of [8]. Therefore, we make no claims whether they are applicable, only that they are sound and fast. In what follows, we call a boundary empty if its intersection with $R$ is $\emptyset$.
Rule I: Empty Inputs
The simplest case where we can compute the outputs of a box is if both inputs are empty, i.e. $B_l^R = B_b^R = \emptyset$. In this case no propagation of reachability is possible and thus the outputs are empty as well, i.e. $B_r^R = B_t^R = \emptyset$. See Figure 3 for an example.
Rule II: Shrink Box
Instead of directly computing the outputs, this rule allows us to shrink the box we are currently working on, which reduces the problem size. Assume that for a box $B$ we have that $B_b^R = \emptyset$ and the lowest point of $B_l^R$ is $(i, j_{\min})$ with $j_{\min} > j$. In this case, no pair in $[i,i'] \times [j, j_{\min})$ is reachable. Thus, we can shrink the box to the coordinates $[i,i'] \times [\lfloor j_{\min} \rfloor, j']$ without losing any reachability information. An equivalent rule can be applied if we swap the role of $B_b$ and $B_l$. See Figure 4 for an example of applying this rule.
Rule III: Simple Boundaries
Simple boundaries are boundaries of a box that contain at most one free section. To define this formally, a set $I \subseteq [1,n] \times [1,m]$ is called an interval if $I = \emptyset$ or $I = \{p\} \times [q,q']$ or $I = [q,q'] \times \{p\}$ for real $p$ and an interval $[q,q']$. In particular, the four boundaries of a box are intervals. We say that an interval $I$ is simple if $I \cap F$ is again an interval. Geometrically, we have a free interval of a point $\pi_p$ and a curve $\sigma_{q \dots q'}$ (which is the form of a boundary in the free-space diagram) if the circle of radius $\delta$ around $\pi_p$ intersects $\sigma_{q \dots q'}$ at most twice. See Figure 5 for an example. We call such a boundary simple because it is of low complexity, which we can exploit for pruning.
There are three pruning rules that we apply based on simple boundaries (see Figure 6 for visualizations). They are stated here for the top boundary $B_t$, but symmetric rules apply to $B_r$. Later, in Section 3.4, we then explain how to actually compute simple boundaries, i.e., also how to compute $B_t \cap F$. The pruning rules are:
- (a) If $B_t$ is simple because $B_t \cap F$ is empty then we also know that the output of this boundary is empty. Thus, we are done with $B_t$.
- (b) Suppose that $B_t$ is simple and, more specifically, of the form that it first has a free and then a non-free part; in other words, we have $(i,j') \in B_t \cap F$. Due to our recursive approach, we already computed the left inputs of the box and thus know whether the top left corner of the box is reachable, i.e. whether $(i,j') \in R$. If this is the case, then we also know the reachable part of our simple boundary: Since $(i,j') \in R$ and $B_t \cap F$ is an interval containing $(i,j')$, we conclude that $B_t^R = B_t \cap F$ and we are done with $B_t$.
- (c) Suppose that $B_t$ is simple, but the leftmost point $(i_{\min}, j')$ of $B_t \cap F$ has $i_{\min} > i$. In this case, we try to certify that $(i_{\min}, j') \in R$, because then it follows that $B_t^R = B_t \cap F$ and we are done with $B_t$. To check for reachability of $(i_{\min}, j')$, we try to propagate the reachability through the inside of the box, which in this case means to propagate it from the bottom boundary. We test whether $(i_{\min}, j)$ is in the input, i.e., if $(i_{\min}, j) \in B_b^R$, and whether $\{i_{\min}\} \times [j,j'] \subseteq F$ (by slightly modifying the algorithm for simple boundary computations). If this is the case, then we can reach every point in $B_t \cap F$ from $(i_{\min}, j)$ via $\{i_{\min}\} \times [j,j']$. Note that this is an operation in the complete decider where we explicitly use the inside of a box and not exclusively operate on its boundaries.

We also use symmetric rules by swapping "top" with "right" and "bottom" with "left".
Rule IV: Boxes at Free-Space Diagram Boundaries
The boundaries of a free-space diagram are a special form of boundary which allows us to introduce an additional rule. Consider a box $B$ which touches the top boundary of the free-space diagram, i.e., $j' = m$. Suppose the previous rules allowed us to determine the output for $B_r$. Since any valid traversal from $(1,1)$ to $(n,m)$ passing through $B$ intersects $B_r$, the output $B_t^R$ is not needed anymore, and we are done with $B$. A symmetric rule applies to boxes which touch the right boundary of the free-space diagram.
3.4. Implementation Details of Simple Boundaries
It remains to describe how we test whether a boundary is simple, and how we determine the free interval of a simple boundary. One important ingredient for the fast detection of simple boundaries are two simple heuristic checks that check whether two polygonal curves are close or far, respectively. The former check was already used in [7]. We first explain these heuristic checks, and then explain how to use them for the detection of simple boundaries.
Heuristic check whether two curves are close.
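Given two subcurves $\pi' := \pi_{i \dots i'}$ and $\sigma' := \sigma_{j \dots j'}$, this filter heuristically tests whether $d_F(\pi', \sigma') \le \delta$. Let $i_c := \lfloor (i+i')/2 \rfloor$ and $j_c := \lfloor (j+j')/2 \rfloor$ be the indices of the midpoints of $\pi'$ and $\sigma'$ (with respect to hops). Then $d_F(\pi', \sigma') \le \delta$ holds if
$$\|\pi_{i_c} - \sigma_{j_c}\| + \max\{\|\pi_{i \dots i_c}\|, \|\pi_{i_c \dots i'}\|\} + \max\{\|\sigma_{j \dots j_c}\|, \|\sigma_{j_c \dots j'}\|\} \le \delta.$$
The triangle inequality ensures that the left-hand side is an upper bound on all distances between two points on the curves. For a visualization, see Figure 7(a). Observe that all curve lengths that need to be computed in the above equation can be determined quickly due to our preprocessing, see Section 2.1. We call this procedure HeurClose.
Heuristic check whether two curves are far.
Symmetrically, we can test whether all pairs of points on $\pi'$ and $\sigma'$ are far by testing
$$\|\pi_{i_c} - \sigma_{j_c}\| - \max\{\|\pi_{i \dots i_c}\|, \|\pi_{i_c \dots i'}\|\} - \max\{\|\sigma_{j \dots j_c}\|, \|\sigma_{j_c \dots j'}\|\} > \delta.$$
We call this procedure HeurFar.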
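Both checks reduce to one midpoint distance plus precomputed curve lengths, which the following minimal sketch makes explicit. It assumes 0-based integer vertex indices and prefix-length arrays as described in Section 2.1; the function names and the shared helper `heur_bounds` are illustrative and not taken from the authors' code.

```python
import math
from itertools import accumulate
from typing import List, Tuple

Point = Tuple[float, float]


def prefix_lengths(curve: List[Point]) -> List[float]:
    """prefix[k] = length of the curve from vertex 0 up to vertex k."""
    return [0.0] + list(accumulate(math.dist(a, b) for a, b in zip(curve, curve[1:])))


def _radius(prefix: List[float], lo: int, mid: int, hi: int) -> float:
    """Longer of the two curve lengths from the midpoint to either end."""
    return max(prefix[mid] - prefix[lo], prefix[hi] - prefix[mid])


def heur_bounds(P, pre_p, i, i2, Q, pre_q, j, j2) -> Tuple[float, float]:
    """Lower and upper bound on all point distances between P[i..i2] and Q[j..j2]."""
    ic, jc = (i + i2) // 2, (j + j2) // 2
    center = math.dist(P[ic], Q[jc])
    slack = _radius(pre_p, i, ic, i2) + _radius(pre_q, j, jc, j2)
    return center - slack, center + slack


def heur_close(P, pre_p, i, i2, Q, pre_q, j, j2, delta: float) -> bool:
    """True only if every pair of points on the two subcurves is within delta."""
    return heur_bounds(P, pre_p, i, i2, Q, pre_q, j, j2)[1] <= delta


def heur_far(P, pre_p, i, i2, Q, pre_q, j, j2, delta: float) -> bool:
    """True only if every pair of points on the two subcurves is farther than delta."""
    return heur_bounds(P, pre_p, i, i2, Q, pre_q, j, j2)[0] > delta
```

Both checks are sound but incomplete: a `False` result means "unknown", not the opposite answer. Each call costs one Euclidean distance and four array lookups, independent of the subcurve length.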
Computation of Simple Boundaries.
Recall that an interval is defined as $I = \{p\} \times [q,q']$ (intervals of the form $[q,q'] \times \{p\}$ are handled symmetrically). The naive way to decide whether interval $I$ is simple would be to go over all the segments of $\sigma_{q \dots q'}$ and compute the intersection with the circle of radius $\delta$ around $\pi_p$. However, this is too expensive because
(i) computing the intersection of a disc and a segment involves taking a square root, which is an expensive operation with a large constant running time, and
(ii) iterating over all segments of $\sigma_{q \dots q'}$ incurs a linear factor in $n$ for large boxes, while we aim at a logarithmic dependence on $n$ for simple boundary detection.
We avoid these issues by resolving long subcurves $\sigma_{j \dots j+s}$ using our heuristic checks (HeurClose, HeurFar). Here, $s$ is an adaptive step size that grows whenever the heuristic checks were applicable, and shrinks otherwise. See Algorithm 2 for pseudocode of our simple boundary detection. It is straightforward to extend this algorithm to not only detect whether a boundary is simple, but also compute the free interval of a simple boundary; we call the resulting procedure SimpleBoundary.
3.5. Effects of Combined Pruning Rules
All the pruning rules presented above can in practice lead to a reduction of the number of boxes that are necessary to decide the Fréchet distance of two curves. We exemplify this on two real-world curves; see Figure 8 for the curves and their corresponding free-space diagram. We explain in the following where the single rules come into play. For Box 1 we apply Rule IIIb twice, for the top and right output. The top boundary of Box 2 is empty and thus we computed the outputs according to Rule IIIa. Note that the right boundary of this box is on the right boundary of the free-space diagram and thus we do not have to compute it according to Rule IV. For Box 3 we again use Rule IIIb for the top, but we use Rule IIIc for the right boundary; the blue dotted line indicates that the reachability information is propagated through the box. For Box 4 we first use Rule II to move the bottom boundary significantly up, until the end of the left empty part; we can do this because the bottom boundary is empty and the left boundary is simple, starting with an empty part. After two splits of the remaining box, we see that the two outputs of the leftmost box are empty as the top and right boundaries are non-free, using Rule IIIa. For the remaining two boxes we use Rule I as their inputs are empty.
This example illustrates how propagating through a box (in Box 3) and subsequently moving a boundary (in Box 4) leads to pruning large parts. Additionally, we can see how using simple boundaries leads to early decisions and thus avoids many recursive steps. In total, we can see how all the explained pruning rules together lead to a free-space diagram with only twelve boxes, i.e., twelve recursive calls, for curves with more than 50 vertices and more than 1500 reachable cells. Figure 9 shows what effects the pruning rules have by introducing them one by one in an example.
4. Decider with Filters
Now that we introduced the complete decider, we are ready to present the decider. We first give a high-level overview.
4.1. Decider
The decider can be divided into two parts:
- (1) Filters (see this section)
- (2) Complete decider via free-space exploration (see Section 3)
As outlined in Section 1, we first try to determine the correct output by using fast but incomplete filtering mechanisms and only resort to the slower complete decider presented in the last section if none of the heuristic deciders (filters) gave a result. The high-level pseudocode of the decider is shown in Algorithm 3.
The speed-ups introduced by our complete decider were already explained in Section 3. A second source for our speed-ups lies in the usage of a good set of filters. Interestingly, since our optimized complete decider via free-space exploration already solves many simple instances very efficiently, our filters have to be extremely fast to be useful – otherwise, the additional effort for an incomplete filter does not pay off. In particular, we cannot afford expensive preprocessing and ideally, we would like to achieve sublinear running times for our filters. To this end, we only use filters that can traverse large parts of the curves quickly. We achieve sublinear-type behavior by making previously used filters work with an adaptive step size (exploiting fast heuristic checks), and designing a new adaptive negative filter.
In the remainder of this section, we describe all the filters that we use to heuristically decide whether two curves are close or far. There are two types of filters: positive filters check whether a curve is close to the query curve and return either "close" or "unknown"; negative filters check if a curve is far from the query curve and return either "far" or "unknown".
4.2. Bounding Box Check
This is a positive filter already described in [22], which heuristically checks whether all pairs of points on are in distance at most . Recall that we compute the bounding box of each curve when we read it. We can thus check in constant time whether the furthest points on the bounding boxes of are in distance at most . If this is the case, then also all points of have to be close to each other and thus the free-space diagram is completely free and a valid traversal trivially exists.
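A constant-time check of this kind might look as follows. This is a sketch under assumptions: the `Box` layout and the function names are hypothetical, and the threshold is assumed to be non-negative. It computes the largest possible distance between a point of one box and a point of the other, and compares squared values to avoid a square root.

```cpp
#include <algorithm>

// Axis-aligned bounding box of a curve (hypothetical layout).
struct Box { double min_x, min_y, max_x, max_y; };

// Squared distance between the two furthest points of boxes a and b.
double max_squared_distance(const Box& a, const Box& b) {
    // Largest coordinate difference per axis; always non-negative.
    const double dx = std::max(a.max_x - b.min_x, b.max_x - a.min_x);
    const double dy = std::max(a.max_y - b.min_y, b.max_y - a.min_y);
    return dx * dx + dy * dy;
}

// Positive filter: true means all point pairs are within delta.
// False means "unknown", not "far".
bool bounding_box_check(const Box& a, const Box& b, double delta) {
    return max_squared_distance(a, b) <= delta * delta;
}
```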
4.3. Greedy
This is a positive filter. To assert that two curves and are close, it suffices to find a traversal satisfying . We try to construct such a traversal staying within distance by making greedy steps that minimize the current distance. This may yield a valid traversal: if after at most steps we reach both endpoints and during the traversal the distance was always at most , we return "near". We can also get stuck: if a step on each of the curves would lead to a distance greater than , we return "unknown". A similar filter was already used in [7], however, here we present a variant with adaptive step size. This means that instead of just advancing to the next node in the traversal, we try to make larger steps, leveraging the heuristic checks presented in Section 3.4. We adapt the step size depending on the success of the last step. For pseudocode of the greedy filter see Algorithm 4, and for a visualization see Figure 10a.
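To illustrate the greedy idea, the following sketch implements a simplified variant that always advances by exactly one vertex. It deliberately omits the adaptive step size and the heuristic checks of Algorithm 4; all names are hypothetical. A traversal over vertex pairs that stays within the threshold also bounds the continuous Fréchet distance, since the distance between two linearly moving points is convex along each step.

```cpp
#include <cstddef>
#include <vector>

struct Point { double x, y; };
using Curve = std::vector<Point>;
enum class Verdict { Close, Unknown };

double squared_distance(const Point& a, const Point& b) {
    const double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Simplified greedy positive filter with fixed step size 1.
Verdict greedy_filter(const Curve& pi, const Curve& sigma, double delta) {
    if (pi.empty() || sigma.empty()) return Verdict::Unknown;
    const double limit = delta * delta;
    std::size_t i = 0, j = 0;
    if (squared_distance(pi[0], sigma[0]) > limit) return Verdict::Unknown;
    while (i + 1 < pi.size() || j + 1 < sigma.size()) {
        double best = limit;
        std::size_t best_i = i, best_j = j;
        bool found = false;
        // Keep the admissible step with the smallest resulting distance.
        auto consider = [&](std::size_t ni, std::size_t nj) {
            const double d = squared_distance(pi[ni], sigma[nj]);
            if (d <= best) { best = d; best_i = ni; best_j = nj; found = true; }
        };
        if (i + 1 < pi.size()) consider(i + 1, j);
        if (j + 1 < sigma.size()) consider(i, j + 1);
        if (i + 1 < pi.size() && j + 1 < sigma.size()) consider(i + 1, j + 1);
        if (!found) return Verdict::Unknown;  // stuck: every step exceeds delta
        i = best_i;
        j = best_j;
    }
    return Verdict::Close;  // both endpoints reached within distance delta
}
```

Every iteration advances at least one index, so the loop performs at most as many steps as the two curves have vertices in total.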
4.4. Adaptive Equal-Time
We also consider a variation of Greedy Filter, which we call Adaptive Equal-Time Filter. The only difference to Algorithm 4 is that the allowed steps are now:
[TABLE]
In contrast to Greedy Filter, this searches for a traversal that stays as close as possible to the diagonal.
4.5. Negative
A negative filter was already used in [7] and [22]. However, changing this filter to use an adaptive step size does not seem to be practical when used with our approach. Preliminary tests showed that this filter would dominate our running time. Therefore, we developed a new negative filter which is more suited to be used with an adaptive step size and thus can be used with our approach.
Let be the points at which Greedy Filter got stuck. We check whether some point , is far from all points of using HeurFar. If so, we conclude that . We do the same with the roles of and exchanged. See Algorithm 5 for the pseudocode of this filter; for a visualization see Figure 10c.
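The core test of such a negative filter, namely whether a single point is far from every point of the other curve, can be sketched as below. This is an exact linear-time version for illustration only. It does not use HeurFar or adaptive step sizes as Algorithm 5 does, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { double x, y; };
using Curve = std::vector<Point>;

// Squared distance from point c to the segment from a to b.
double squared_distance_to_segment(const Point& c, const Point& a, const Point& b) {
    const double abx = b.x - a.x, aby = b.y - a.y;
    const double len2 = abx * abx + aby * aby;
    double t = 0.0;
    if (len2 > 0.0) {
        // Project c onto the line through a and b, clamped to the segment.
        t = ((c.x - a.x) * abx + (c.y - a.y) * aby) / len2;
        t = std::max(0.0, std::min(1.0, t));
    }
    const double dx = a.x + t * abx - c.x, dy = a.y + t * aby - c.y;
    return dx * dx + dy * dy;
}

// True if c has distance strictly greater than delta to all points of sigma.
bool far_from_curve(const Point& c, const Curve& sigma, double delta) {
    if (sigma.empty()) return false;
    const double limit = delta * delta;
    if (sigma.size() == 1)
        return squared_distance_to_segment(c, sigma[0], sigma[0]) > limit;
    for (std::size_t k = 0; k + 1 < sigma.size(); ++k)
        if (squared_distance_to_segment(c, sigma[k], sigma[k + 1]) <= limit)
            return false;
    return true;
}
```

If such a point exists on one curve, no traversal can match it within the threshold, so the instance is a NO instance.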
5. Query Data Structure
In this section we give the details of extending the fast decider to compute the Fréchet distance in the query setting. Recall that in this setting we are given a curve dataset that we want to preprocess for the following queries: Given a polygonal curve (the query curve) and a threshold distance , report all that are -close to . To be able to compare our new approach to existing work (especially the submissions of the GIS Cup) we present a query data structure here, which is influenced by the one presented in [7].
The most important component that we need in addition to the decider to obtain an efficient query data structure is a mechanism to quickly determine a set of candidate curves on which we can then run the decider presented above. The candidate selection is done using a kd-tree on 8-dimensional points, similar to the octree used in [7]; see Section 5.1 for more details. The high-level structure of the algorithm for answering queries is shown in Algorithm 6.
5.1. Kd-Tree
Fetching an initial set of candidate curves via a space-partitioning data structure was already used in [7, 13, 22]. We use a kd-tree which contains 8-dimensional points, each corresponding to one of the curves in the data set. Four dimensions are used for the start and end point of the curve and the remaining four dimensions are used for the maximum/minimum coordinates in x/y direction. We can then query this kd-tree with the threshold distance and obtain a set of candidate curves. See [7] for why proximity in the kd-tree is a necessary condition for every close curve.
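The 8-dimensional point of a curve could be built as in the following sketch. The ordering of the dimensions, the names `Key`, `make_key`, and `may_be_close`, and the coordinate-wise candidate test are assumptions made for illustration; the actual kd-tree query is described in [7] and is not reproduced here. The coordinate-wise test is a necessary condition only: matched start points, matched end points, and the extreme coordinates of two close curves can each differ by at most the threshold.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };
using Curve = std::vector<Point>;
// Assumed layout: start x/y, end x/y, min x/y, max x/y.
using Key = std::array<double, 8>;

// Build the 8-dimensional point for a non-empty curve.
Key make_key(const Curve& c) {
    Key k = {c.front().x, c.front().y, c.back().x, c.back().y,
             c.front().x, c.front().y, c.front().x, c.front().y};
    for (const Point& p : c) {
        k[4] = std::min(k[4], p.x);
        k[5] = std::min(k[5], p.y);
        k[6] = std::max(k[6], p.x);
        k[7] = std::max(k[7], p.y);
    }
    return k;
}

// Necessary (not sufficient) condition for two curves to be delta-close.
bool may_be_close(const Key& a, const Key& b, double delta) {
    for (std::size_t d = 0; d < a.size(); ++d)
        if (std::abs(a[d] - b[d]) > delta) return false;
    return true;
}
```

Curves that pass this test are only candidates and still have to be handed to the decider.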
6. Implementation Details
Square Root.
Computing which parts are close and which are far between a point and a segment involves intersecting a circle and a line segment, which in turn requires computing a square root. As square roots are computationally quite expensive, we avoid them by:
- filtering out simple comparisons by heuristic checks not involving square roots
- testing instead of (and analogous for other comparisons)
While these changes seem trivial, they have a significant effect on the running time due to the large amount of distance computations in the implementation.
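As a small illustration of avoiding square roots, a distance comparison can be carried out on squared values. The function name `within` is hypothetical, and the sketch assumes a non-negative threshold.

```cpp
struct Point { double x, y; };

// Decide whether a and b are within distance delta without a square root.
// Assumes delta >= 0, so comparing squares preserves the order.
bool within(const Point& a, const Point& b, double delta) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    return dx * dx + dy * dy <= delta * delta;
}
```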
Recursion.
Note that the complete decider (Algorithm 1) is currently formulated as a recursive algorithm. Indeed, our implementation is also recursive, which is feasible due to the logarithmic depth of the recursion. An iterative variant that we implemented turned out to be equally fast but more complicated, thus we settled for the recursive variant.
7. Experiments
In the experiments, we aim to substantiate the following two claims. First, we want to verify that our main contribution, the decider, actually is a significant improvement over the state of the art. To this end, we compare our implementation with the – to our knowledge – currently fastest Fréchet distance decider, namely [7]. Second, we want to verify that our improvements in the decider setting also carry over to the query setting, also significantly improving the state of the art. To show this, we compare to the top three submissions of the GIS Cup.
We use three different data sets: the GIS Cup set (Sigspatial) [1], the handwritten characters (Characters) [2], and the GeoLife data set (GeoLife) [3]. For all experiments, we used a laptop with an Intel i5-6440HQ processor with 4 cores and 16GB of RAM.
Hypotheses.
In what follows, we verify the following hypotheses:
- (1) Our implementation is significantly faster than the fastest previously known implementation in the query and in the decider setting.
- (2) Our implementation is fast on a wide range of data sets.
- (3) Each of the described improvements of the decider speeds up the computation significantly.
- (4) The running time of the complete decider is proportional to the number of recursive calls.
The first two we verify by running time comparisons on different data sets. The third we verify by leaving out single pruning rules and then comparing the running time with the final implementation. Finally, we verify the fourth hypothesis by correlating the running time for different decider computations against the number of recursive calls encountered during the computation.
7.1. Data Sets Information.
Some properties of the data sets are shown in Table 1. Sigspatial has the most curves, while GeoLife has by far the longest. Characters is interesting as it does not stem from GPS data. By this selection of data sets, we hope to cover a sufficiently diverse set of curves.
Hardware.
We used standard desktop hardware for our experiments. More specifically, we used a laptop with an Intel i5-6440HQ processor with 4 cores (2.6 to 3.1 GHz) with cache sizes 256KiB, 1MiB, and 6MiB (L1, L2, L3).
Code.
The implementation is written in modern C++ and only has the standard library and OpenMP as dependencies. The target platforms are Linux and OS X, with little work expected to adapt it to other platforms. The code was optimized for speed as well as readability (as we hope to give a reference implementation).
7.2. Decider Setting
In this section we test the running time performance of our new decider algorithm (Algorithm 3). We first describe our new benchmark using the three data sets, and then discuss our experimental findings, in particular how the performance and improvement over the state of the art varies with the distance and also the "neighbor rank" in the data set.
Benchmark.
For the decider, we want to specifically test how the decision distance and how the choice of the second curve influences the running time of the decider. To experimentally evaluate this, we create a benchmark for each data set in the following way. We select a random curve and sort the curves in the data set by their distance to in increasing order, obtaining the sequence . For all , we
- select a curve uniformly at random (see footnote 3),
- compute the exact distance ,
- for each , add benchmark tests and .

Footnote 3: Note that for some curves might be undefined as possibly . In this case we select a curve uniformly at random from .
By repeating this process for 1000 uniformly random curves , we create 1000 test cases for every pair of and .
Running Times.
First we show how our implementation performs in this benchmark. In Figure 11 we depict timings for running our implementation on the benchmark for all data sets. We can see that distances larger than the exact Fréchet distance are harder than smaller distances. This effect is most likely caused by the fact that decider instances with a positive result need to find a path through the free-space diagram, while negative instances might be resolved earlier as it already becomes clear close to the lower left corner of the free-space diagram that there cannot exist such a path. Also, the performance of the decider is worse for computations on when is smaller. This seems natural, as curves in the data set that are closer are more likely to actually be of similar shape, and similar shapes often lead to bottlenecks in the free-space diagram (i.e., small regions where a witness path can barely pass through), which have to be resolved in much more detail and therefore lead to a higher number of recursive calls. It follows that the benchmark instances for low and are the hardest; this is the case for all data sets. In Characters we can also see that for there is suddenly a rise in the running time for certain distance factors. We assume that this comes from the fact that the previous values of all correspond to the same written character and this changes for .
We also run the original code of the winner of the GIS Cup, namely [7], on our benchmark and compare it with the running time of our implementation. See Figure 12 for the speed-up factors of our implementation over the GIS Cup winner implementation. The speed-ups obtained depend on the data set. While for every data set a significant amount of benchmarks for different and are more than one order of magnitude faster, for GeoLife even speed-ups by 2 orders of magnitude are reached. Speed-ups tend to be higher for larger distance factors. The results on GeoLife suggest that for longer curves, our implementation becomes significantly faster relative to the current state of the art. Note that there also are situations where our decider shows similar performance to the one of [7]; however, those are cases where both deciders can easily recognize that the curves are far (due to, e.g., their start or end points being far). We additionally show the percentage of instances that are already decided by the filters in Figure 13.
7.3. Influence of the Individual Pruning Rules
We also verified that the improvements that we introduced indeed are all necessary. In Section 3.3 we introduced six pruning rules. Rule I, i.e., "Empty Inputs", is essential. If we were to omit it, we would hardly improve over the naive free-space exploration algorithm. The remaining five rules can potentially be omitted. Thus, for each of these pruning rules, we let our implementation run on the decider benchmark with this single rule disabled; and once with all rules enabled. See Table 2 for the results. Clearly, all pruning rules yield significant improvements when considering the timings of the GeoLife benchmark. All rules, except Rule IV, also show significant speed-ups for the other two data sets. Additionally, note that omitting Rule IIIb drastically increases the running time. This effect results from Rule IIIb being the main rule to prune large reachable parts, which we otherwise have to explore completely. One can clearly observe this effect in Figure 9.
Filters.
In Figure 13 we show what percentage of the queries are decided by the filters. We can see that the closer we get to the actual distance of two curves, the less likely it gets that the filters can make a decision. Furthermore, for the distances that are greater than the filters perform worse than for distances less than . We additionally observe that on Characters the filters perform significantly worse than on the other two data sets. Also, the running times are inversely correlated with the percentage of decisions of the filters, as returning earlier in the decider naturally reduces the overall runtime.
7.4. Query Setting
We now turn to the experiments conducted for our query data structure, which we explained in Section 5.
Benchmark.
We build a query benchmark similar to the one used in [7]. For each , we select a random curve and then pick a threshold distance such that a query of the form returns exactly curves (note that the curve itself is also always returned). We repeat this 1000 times for each value of and also create such a benchmark for each of the three data sets.
Running Times.
We compare our implementation with the top three implementations of the GIS Cup on this benchmark. The results are shown in Table 3. Again the running time improvement of our implementation depends on the data set. For Characters the maximal improvement factor over the second best implementation is , for Sigspatial , and for GeoLife . For Sigspatial and Characters it is attained at , while for GeoLife it is reached at but shows a very similar but slightly smaller factor.
To give deeper insights about the single parts of our decider, a detailed analysis of the running times of the single parts of the algorithm is shown in Table 4. Again we witness different behavior depending on the data set. It is remarkable that for Sigspatial the running time for is dominated by the greedy filter. This suggests that improving the filters might still lead to a significant speed-up in this case. However, for most of the remaining cases the running time is clearly dominated by the complete decider, suggesting that our efforts of improving the state of the art focused on the right part of the algorithm.
7.5. Other Experiments
The main goal of the complete decider was to reduce the number of recursive calls that we need to consider during the computation of the free-space diagram. Due to our optimized algorithm to compute simple boundaries with adaptive step size, we expect roughly a constant (or possibly polylogarithmic) running time effort per box, essentially independent of the size of the box. To test this hypothesis, we ask whether the number of recursive calls is indeed correlated with the running time. To this end, we measured the time for each complete decider call in the query benchmark and plotted it over the number of boxes that were considered in this call. The result of this experiment is shown in Figure 14. We can see a practically (near-)linear correlation between the number of boxes and the running time.
8. Certificates
Whenever we replace a naive implementation by a fast, optimized, but typically more complex implementation, it is almost unavoidable to introduce bugs to the code. As a useful countermeasure the concept of certifying algorithms has been introduced; we refer to [25] for a survey. In a nutshell, we aim for an implementation that outputs, apart from the desired result, also a proof of correctness of the result. Its essential property is that the certificate should be simple to check (i.e., much simpler than solving the original problem). In this way, the certificate gives any user of the implementation a simple means to check the output for any conceivable instance.
Following this philosophy, we have made our implementation of the Fréchet decider certifying: for any input curves and query distance , we are able to return, apart from the output whether the Fréchet distance of and is at most , also a certificate . On our realistic benchmarks, constructing this certificate slows down the Fréchet decider by roughly 50%. The certificate can be checked by a simple verification procedure consisting of roughly 200 lines of code.
In Sections 8.1 and 8.2, we define our notion of YES and NO certificates, prove that they indeed certify YES and NO instances and discuss how our implementation finds them. In Section 8.3, we describe the simple checking procedure for our certificates. Finally, we conclude with an experimental evaluation in Section 8.4.
8.1. Certificate for YES Instances
To verify that , by definition it suffices to give a feasible traversal, i.e., monotone and continuous functions and such that for all , we have , where denotes the free-space (see Section 3.1). We slightly simplify this condition by discretizing , as follows.
Definition 8.1.
We call with a YES certificate if it satisfies the following conditions: (See also Figure 15 for an example.)
- (1) (start) ,
- (2) (end) ,
- (3) (step) For any and , we have either
  - (a) and : In this case, we require that for all ,
  - (b) and : In this case, we require that for all ,
  - (c) , for some : In this case, we require that .
It is straightforward to show that a YES certificate proves correctness for YES instances as follows.
Proposition 8.2.
Any YES certificate with proves that .
Proof 8.3.
View as a polygonal curve in and let be a reparameterization of . Let be the projection of to the first and second coordinate, respectively. Note that by the assumption on , and are monotone and satisfy and . We claim that for all , which thus yields by definition.
To see the claim, we recall that for any cell , the free-space restricted to this cell, i.e., , is convex (as it is the intersection of an ellipse with , see [5]). Observe that for any segment from to , we (implicitly) decompose it into subsegments contained in single cells (e.g., for and , the segment from to is decomposed into the segments connecting the sequence ). As each such subsegment is contained in a single cell, by convexity we see that the whole subsegment is contained in if the corresponding endpoints of the subsegment are in . This concludes the proof.
It is not hard to prove that for YES instances, such a certificate always exists (in fact, there always is a certificate of length ). Furthermore, for each YES instance in our benchmark set, our implementation indeed finds and returns a YES certificate, in a way we describe next.
Certifying positive filters.
It is straightforward to construct YES certificates for instances that are resolved by our positive filters (Bounding Box Check, Greedy and Adaptive Equal-Time): All of these filters implicitly construct a feasible traversal. In particular, for any instance for which the Bounding Box Check applies (which shows that any pair of points of and are within distance ), already the sequence yields a YES certificate.
For Greedy, note that the sequence of positions visited in Algorithm 4 yields a YES certificate: Indeed, any step from is either a vertical step to (corresponding to Case 3a), a horizontal step to (corresponding to Case 3b), or a diagonal step within a cell to (corresponding to Case 3c of Definition 8.1). Furthermore, such a step is only performed if it stays within the free-space.
Finally, for Adaptive Equal-Time, we also record the sequence of positions visited in Algorithm 4 (recall that here, we change the set of possible steps for to with ) – with the only difference that we need to replace any step from to by the sequence . Note that this sequence satisfies Condition (step) of Definition 8.1, as Adaptive Equal-Time only performs this step if it can verify that all pairwise distances between and are bounded by .
Certifying YES instances in the complete decider.
Recall that the complete decider via free-space exploration decides an instance by recursively determining, given the inputs of a box , the corresponding outputs . In particular, YES instances are those with (or equivalently ) for the box . To certify such instances, we memorize for each point in and a predecessor of a feasible traversal from to this point. Note that here, it suffices to memorize such a predecessor only for the first, i.e., lowest or leftmost, point of each interval in and (as any point in this interval can be reached by traversing to the first point of the interval and then along this reachable interval to the destination point). This gives rise to a straightforward recursive approach to determine a feasible traversal.
In the complete decider, whenever we determine some output , it is because of one of the following reasons: (1) one of our pruning rules is successful, (2) the box is on the cell-level, or (3) we determine as the union of the outputs , of the boxes obtained by splitting vertically. Note that we only need to consider the case in which is determined as non-empty (otherwise nothing needs to be memorized). Let us consider each case separately.
If reason (1) determines a non-empty , then this happens either by Rule IIIb or by Rule IIIc. Note that in both cases, consists of a single interval. If Rule IIIb applies, then the last, i.e., topmost, point on is reachable and proves that the free prefix of is reachable. Thus, we store the last interval of as the responsible interval for the (single) interval in . Similarly, if Rule IIIc applies, then consider the first, i.e., leftmost, point on . Since the rule applies, the opposite point on must be reachable and the path must be free. Thus, we can store the interval of containing as the responsible interval for the (single) interval in .
If reason (2) determines a non-empty , then we are on a cell-level. In this case, either or is a non-empty interval, and we can store such an interval as the responsible interval for the (single) interval in . Finally, if reason (3) determines a non-empty , then we simply keep track of the responsible interval for each interval in and (to be precise, if the last interval of and the first interval of overlap by the boundary point, we merge the two corresponding intervals and only keep track of the responsible interval of the last interval of and can safely forget about the responsible interval of the first interval of ).
Note that we proceed analogously for outputs . Furthermore, the required memorization overhead is very limited.
It is straightforward to use the memorized information to compute a YES certificate recursively: Specifically, to compute a YES certificate reaching some point on an output interval , we perform the following steps. Let be the responsible interval of . We recursively determine a YES certificate reaching the first point . Then we append a point to the certificate to traverse to the point of from which we can reach the first point of (this point is easily determined by distinguishing whether we are on the cell-level, and whether is opposite to or intersects in a corner point). We append the first point of to the certificate, and finally append the point to the certificate (see footnote 4). By construction, the corresponding traversal never leaves the free-space. Using this procedure, we can compute a YES certificate by computing a YES certificate reaching on the last interval of for the initial box .

Footnote 4: To be precise, we only append a point if it is different from the last point of the current certificate.
8.2. Certificate for NO Instances
We say that a point lies on the bottom boundary if , on the right boundary if , on the top boundary if , and on the left boundary if . Likewise, we say that a point lies to the lower right of a point , if and .
Definition 8.4.
We call with a NO certificate if it satisfies the following conditions: (See also Figure 16 for an example.)
- (1) (start) lies on the right or bottom boundary and ,
- (2) (end) lies on the left or upper boundary and ,
- (3) (step) For any and , we have either
  - (a) and : In this case, for any neighboring elements in , we require that ,
  - (b) and : In this case, for any neighboring elements in , we require that ,
  - (c) lies to the lower right of , i.e., and .
We show that a NO certificate proves correctness for NO instances as follows.
Proposition 8.5.
Any NO certificate with proves that .
Proof 8.6.
We inductively prove that no feasible traversal from to can visit any point to the lower right of , for all . As an immediate consequence, , since lies on the left or upper boundary and thus any feasible traversal must visit a point to the lower right of – hence, such a traversal cannot exist.
As base case, note that lies on the right or bottom boundary and is not contained in the free-space. Thus, by monotonicity, no feasible traversal can visit any point to the lower right of . Thus, assume that the claim is true for and consider the next point in the sequence. If lies to the lower right of , the claim is trivially fulfilled for by monotonicity. If, however, and , then Condition 3a of Definition 8.4 is equivalent to . Note that any feasible traversal visiting a point to the lower right of must either visit a point to the lower right of – which is not possible by inductive assumption – or must cross the path – which is not possible as . We argue symmetrically for the case that and . This concludes the proof.
Note that our definition of NO certificate essentially coincides with the definition of a cut of positive width in [15]. For NO instances, such a NO certificate always exists (in contrast to YES certificates, the shortest such certificate is of length in the worst case). For all NO instances in our benchmark sets, our implementation manages to find and return such a NO certificate, in a way we describe next.
Certifying the negative filter.
It is straightforward to compute a NO certificate for instances resolved by our negative filter. Note that this filter, if successful, determines an index such that is far from all points on , or symmetrically an index such that is far from all points on . Thus, in these cases, we can simply return the NO certificate or , respectively.
Certifying NO instances in the complete decider.
Whenever the complete decider via free-space exploration returns a negative answer, the explored parts of the free-space diagram must be sufficient to derive a negative answer. This gives rise to the following approach: Consider all non-free segments computed by the complete decider. We start from a non-free segment touching the bottom or right boundary and traverse non-free segments (possibly also making use of monotonicity steps according to Case 3c of Definition 8.4) and stop as soon as we have found a non-free segment touching the left or top boundary.
Formally, consider Algorithm 7. Here, we use the notation that denotes the lower right endpoint of , i.e., the right endpoint if is a horizontal segment and the lower endpoint if is a vertical segment. Analogously, denotes the upper left endpoint of .
The initial set of non-free segments in Algorithm 7 consists of the non-free segments of all simple boundaries determined by the complete decider via free-space exploration. We maintain a queue of non-free segments, which initially contains all non-free segments touching the right or bottom boundary. Furthermore, we maintain a data structure of yet unreached non-free intervals. Specifically, we require to store intervals under the corresponding key in a way to support the query : Such a query returns all such that lies to the lower right of and deletes all returned intervals from .
Equipped with such a data structure, we can traverse all elements in the queue as follows: We delete any interval from and check whether it reaches the upper or left boundary. If this is the case, we have (implicitly) found a NO certificate, which we then reconstruct (by memorizing why each element of the queue was put into the queue). Otherwise, we add to all intervals from that can be reached by a monotone step (according to Case 3c of Definition 8.4) from ; these intervals are additionally deleted from .
To implement , we observe that it essentially asks for a 2-dimensional orthogonal range search data structure where the ranges are unbounded in two directions (and bounded in the other two). Already for the case of 2-dimensional ranges with only a single unbounded direction (sometimes called 1.5-dimensional), a very efficient solution is provided by a classic data structure due to McCreight, the priority search tree data structure [26]. We can adapt it in a straightforward manner to implement such that it (1) takes time and space to construct on an initial set of size and (2) supports queries in time , where denotes the number of reported elements. Thus, Algorithm 7 can be implemented to run in time .
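The search over non-free segments can be sketched as a breadth-first search. The sketch below is not Algorithm 7: it replaces the priority search tree by a plain linear scan over the unreached segments (quadratic time in the worst case), and the types `Pos`, `Segment`, `Bounds` as well as the coordinate convention (first coordinate horizontal, second vertical) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

struct Pos { double p, q; };                      // position in the free-space diagram
struct Segment { Pos lower_right, upper_left; };  // endpoints of a non-free segment
struct Bounds { double p_min, p_max, q_min, q_max; };

// True if a lies to the lower right of b.
bool lower_right_of(const Pos& a, const Pos& b) { return a.p >= b.p && a.q <= b.q; }

// Returns the indices of a chain of non-free segments leading from the
// bottom/right boundary to the left/top boundary, or an empty vector.
std::vector<std::size_t> find_no_chain(const std::vector<Segment>& segs, const Bounds& b) {
    const std::size_t none = segs.size();
    std::vector<std::size_t> pred(segs.size(), none);  // why each segment was enqueued
    std::vector<bool> reached(segs.size(), false);
    std::queue<std::size_t> queue;
    for (std::size_t i = 0; i < segs.size(); ++i) {
        const Pos& lr = segs[i].lower_right;
        if (lr.q == b.q_min || lr.p == b.p_max) { reached[i] = true; queue.push(i); }
    }
    while (!queue.empty()) {
        const std::size_t s = queue.front();
        queue.pop();
        const Pos& ul = segs[s].upper_left;
        if (ul.p == b.p_min || ul.q == b.q_max) {
            // Reconstruct the chain by following the memorized predecessors.
            std::vector<std::size_t> chain;
            for (std::size_t k = s; k != none; k = pred[k]) chain.push_back(k);
            std::reverse(chain.begin(), chain.end());
            return chain;
        }
        // Linear scan standing in for the range query on unreached segments.
        for (std::size_t j = 0; j < segs.size(); ++j) {
            if (!reached[j] && lower_right_of(segs[j].lower_right, ul)) {
                reached[j] = true;
                pred[j] = s;
                queue.push(j);
            }
        }
    }
    return {};
}
```

Each segment enters the queue at most once, so the sketch terminates; the priority search tree only serves to make the inner scan efficient.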
8.3. Certificate Checker
It remains to describe how to check the correctness of a given certificate . For this, we simply verify that all properties of Definition 8.1 or Definition 8.4 are satisfied.
Checking YES certificates.
Observe that the only conditions in the definition of YES certificates are either simple comparisons of neighboring elements in the sequence or freeness tests, specifically, whether a given position is free, i.e., whether and have distance at most . The latter test only requires interpolation along a curve segment (to obtain and ) and a Euclidean distance computation. Thus, YES certificates are extremely simple to check.
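A freeness test of this kind might be sketched as follows. The 0-based, real-valued position convention (integer part selects the segment, fractional part the point on it) and the function names are assumptions for illustration and need not match the indexing used in the definitions above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };
using Curve = std::vector<Point>;

// Point at real-valued position pos on a non-empty curve,
// assuming 0 <= pos <= c.size() - 1.
Point interpolate(const Curve& c, double pos) {
    const std::size_t i = static_cast<std::size_t>(std::floor(pos));
    if (i + 1 >= c.size()) return c.back();
    const double t = pos - static_cast<double>(i);
    return {c[i].x + t * (c[i + 1].x - c[i].x),
            c[i].y + t * (c[i + 1].y - c[i].y)};
}

// Freeness test: are the points at positions p and q within distance delta?
bool is_free(const Curve& pi, const Curve& sigma, double p, double q, double delta) {
    const Point a = interpolate(pi, p);
    const Point b = interpolate(sigma, q);
    const double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy <= delta * delta;
}
```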
Checking NO certificates.
Checking NO certificates involves a slightly more complicated geometric primitive than the freeness tests of YES certificates. Apart from simple comparisons of neighboring elements , the conditions in the definition involve the following non-freeness tests: Given a (sub)segment with for some , as well as a point with , determine whether all points on have distance strictly larger than from . Besides the (simple) interpolation along a line segment to obtain , we need to determine intersection points of the line containing and the circle of radius around (if these exist). From these intersection points, we verify that and the circle do not intersect, concluding the check.
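The non-freeness test can be sketched via the line-circle intersection described above. The function name and parameter layout are hypothetical, and the sketch uses plain floating-point arithmetic without any tolerance handling. It parameterizes the segment as a + t(b - a) with t in [0, 1] and solves the quadratic equation for the parameters at which the line meets the circle.

```cpp
#include <cmath>

struct Point { double x, y; };

// True if every point of the segment from a to b has distance
// strictly larger than delta from c.
bool segment_is_non_free(const Point& a, const Point& b, const Point& c, double delta) {
    const double dx = b.x - a.x, dy = b.y - a.y;
    const double fx = a.x - c.x, fy = a.y - c.y;
    // |a + t*(b - a) - c|^2 <= delta^2  <=>  A*t^2 + B*t + C <= 0
    const double A = dx * dx + dy * dy;
    const double B = 2.0 * (fx * dx + fy * dy);
    const double C = fx * fx + fy * fy - delta * delta;
    if (A == 0.0) return C > 0.0;        // degenerate segment: a single point
    const double disc = B * B - 4.0 * A * C;
    if (disc < 0.0) return true;         // the line misses the circle entirely
    const double root = std::sqrt(disc);
    const double t1 = (-B - root) / (2.0 * A);
    const double t2 = (-B + root) / (2.0 * A);
    // Points within delta correspond to t in [t1, t2]; the segment is
    // non-free iff this interval does not meet [0, 1].
    return t2 < 0.0 || t1 > 1.0;
}
```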
In summary, certificate checkers are straightforward and simple to implement.
8.4. Certification Experiments
We evaluate the overhead introduced by computing certificates using our benchmark sets for the query setting. In particular, as our implementation can be compiled both as a certifying and a non-certifying version, we compare the running times of both versions. The results are depicted in Table 5. Notably, the slowdown factor introduced by computing certificates ranges between 1.29 and 1.46 (Sigspatial), 1.44 and 1.67 (Characters) and 1.42 and 1.73 (GeoLife). As expected, the certificate computation time is dominated by the task of generating NO certificates (which is more complex than computing YES certificates), even for large values of for which most unfiltered instances are YES instances.
At first sight, it might be surprising that checking the certificates takes longer than computing them. This is because our filters often display sublinear running time behavior (by using the heuristic checks and adaptive step sizes). In contrast, to keep our certificate checker elementary, we have not introduced any such improvements to the checker, which thus has to traverse essentially all points on the curves. This effect is particularly prominent for large values of .
9. Conclusion
In this work we presented an implementation for computing the Fréchet distance which beats the state-of-the-art by one to two orders of magnitude in running time in the query as well as the decider setting. Furthermore, it can be used to compute certificates of correctness with little overhead. To facilitate future research, we created two benchmarks on several data sets – one for each setting – such that comparisons can easily be conducted. Given the variety of applications of the Fréchet distance, we believe that this result will also be of broader interest and implies significant speed-ups for other computational problems in practice.
This enables a wide range of future work. An obvious direction is to take the results back to theory and show that our pruning approach provably has subquadratic running time on a natural class of realistic curves. On the other hand, one could try to find further pruning rules or replace the divide-and-conquer approach by a more sophisticated search. To make full use of the work presented here, it would make sense to incorporate this algorithm into software libraries. Currently, we are not aware of any library with a non-naive implementation of a Fréchet distance decider or query. Finally, another possible research direction would be to work on efficient implementations for similar problems such as the Fréchet distance under translation or rotation, or variants of map matching with respect to the Fréchet distance. In summary, this paper should lay the groundwork for a variety of improvements to practical aspects of curve similarity.
