Distributions of Matching Distances in Topological Data Analysis

So Mang Han; Taylor Okonek; Nikesh Yadav; Xiaojun Zheng

arXiv:1812.11258·cs.CG·January 10, 2020


TL;DR

This paper explores the behavior of matching distances in two-parameter persistent homology, revealing how geometric differences in data influence topological similarity measures, and provides foundational insights for analyzing complex data structures.

Contribution

It introduces key results on matching distances in two-parameter persistence modules, addressing a less-studied area in topological data analysis with practical implications.

Findings

01 Matching distance varies with geometric differences in point clouds

02 Results serve as a foundation for analyzing complex data structures

03 Provides insights into two-parameter persistent homology behavior

Abstract

In topological data analysis, we want to discern topological and geometric structure of data, and to understand whether or not certain features of data are significant as opposed to simply random noise. While progress has been made on statistical techniques for single-parameter persistence, the case of two-parameter persistence, which is highly desirable for real-world applications, has been less studied. This paper provides an accessible introduction to two-parameter persistent homology and presents results about matching distance between 2-D persistence modules obtained from families of point clouds. Results include observations of how differences in geometric structure of point clouds affect the matching distance between persistence modules. We offer these results as a starting point for the investigation of more complex data.

Equations

$$X_{1}\subset X_{2}\subset X_{3}\subset\cdots\subset X_{n}\subset\cdots.$$

$$\mathcal{R}_{\epsilon_{0}}\subset\mathcal{R}_{\epsilon_{1}}\subset\mathcal{R}_{\epsilon_{2}}\subset\cdots\subset\mathcal{R}_{\epsilon_{n}}$$

$$H_{i}(X_{1})\to H_{i}(X_{2})\to H_{i}(X_{3})\to\cdots\to H_{i}(X_{n})\to\cdots.$$

$$\|x-\eta(x)\|_{\infty}=\max(|x_{1}-y_{1}|,|x_{2}-y_{2}|).$$

$$d_{B}(\mathcal{D}_{1},\mathcal{D}_{2})=\inf_{\eta}\sup_{x}\|x-\eta(x)\|_{\infty},$$

$$f^{-1}(-\infty,r]=\{p\in\mathcal{P}\mid f(p)\leq r\}.$$

$$C_{i,j}\hookrightarrow C_{i^{\prime},j^{\prime}}$$

$$d_{M}=\sup_{\ell}\{d_{B}(\mathcal{D}(\mathcal{M}_{\ell}),\mathcal{D}(\mathcal{N}_{\ell}))\cdot\mathrm{weight}(\mathrm{slope}(\ell))\},$$

Taxonomy

Topics: Topological and Geometric Data Analysis

So Mang Han

St. Olaf College, Northfield, Minnesota, USA

Taylor Okonek

St. Olaf College, Northfield, Minnesota, USA

Nikesh Yadav

St. Olaf College, Northfield, Minnesota, USA

Xiaojun Zheng

St. Olaf College, Northfield, Minnesota, USA

Abstract

Topological data analysis seeks to discern topological and geometric structure of data, and to understand whether or not certain features of data are significant as opposed to random noise. While progress has been made on statistical techniques for single-parameter persistence, the case of two-parameter persistence, which is highly desirable for real-world applications, has been less studied. This paper provides an accessible introduction to two-parameter persistent homology and presents results about matching distance between 2-parameter persistence modules obtained from families of simple point clouds. Results include observations of how differences in geometric structure of point clouds affect the matching distance between persistence modules. We offer these results as a starting point for the investigation of more complex data.

1 Introduction and Motivation

Topological data analysis (TDA) is a collection of methods used to discern the shape of data. TDA detects topological features, such as clusters, holes, and voids. Topological methods are especially useful for high-dimensional, noisy data. TDA has been applied in numerous settings, including image analysis [1], protein structure [3], texture representation in images [11], astronomical data [14], and neuroscience [15].

One of the main tools in TDA is persistent homology. Persistent homology associates to a dataset an algebraic object known as a persistence module, which encodes topological features of the data. The study of persistence modules can then reveal insights about the data that underlies the modules.

One common problem is to compare two datasets via their persistence modules. In this setting, notions of distance between persistence modules are useful for quantifying the amount of difference between persistence modules. This paper examines one such distance, the matching distance, which is easily computed. Our goal is to understand how the matching distance quantifies similarity between datasets.

We computed matching distances between persistence modules arising from datasets of two types. The first type of dataset consists of three points in various configurations. The second type of dataset consists of points sampled from two circles of radius $r$ separated by a distance $d$. We examined how changes to $r$ or $d$ affect the matching distance.

The organization of the paper is as follows: In Section 2, we provide mathematical background for persistent homology and the matching distance. In Section 3, we describe our data analysis and matching distance computations. Discussion and directions for future research are provided in Section 4.

2 Mathematical Background

Persistent homology is one of the main tools in TDA and can be applied to many types of data, including real-valued functions and sets of points in Euclidean space. It quantifies multi-scale topological features of data: connected components, holes, voids, and their higher-dimensional analogs.

Previous research has used persistent homology to discern topological structure in data from many fields [1, 3, 11, 14, 15]. Nearly all of this prior work has used one-parameter persistent homology, which produces easily-visualized descriptors called barcodes, but which is sensitive to outliers. This sensitivity can be avoided by using two-parameter persistence. We give here a brief introduction to persistent homology in both the one- and two-parameter settings; more detailed surveys of the subject are found in [6, 7].

2.1 One-parameter persistence

Given a set of point-cloud data, we first build a simplicial complex. Our building blocks are simplices: a point is a $0$ -simplex; an edge is a $1$ -simplex; a triangular face is a $2$ -simplex. More formally, an $n$ -simplex is an $n$ -dimensional geometric object that is the convex hull of $n+1$ points which are not contained in any $(n-1)$ -dimensional plane. A simplicial complex $X$ is a set of simplices such that if $v\in X$, then every face of $v$ is also in $X$, and if $v,w\in X$, then $v\cap w$ is either empty or also in $X$. A common type of simplicial complex built from a point cloud is the Rips complex, which we now define.

Definition 1

Given a collection of points $\{x_{\alpha}\}$ in Euclidean space $\mathbb{E}^{n}$ and $\epsilon>0$, the Rips complex $\mathcal{R}_{\epsilon}$ is the simplicial complex whose $k$ -simplices correspond to unordered $(k+1)$ -tuples of points $\{x_{\alpha}\}^{k}_{0}$ whose pairwise distances are at most $\epsilon$.

In other words, the Rips complex depends on a scale parameter $\epsilon$ . The complex contains an edge between two points if and only if the distance between the points is at most $\epsilon$ . The complex contains a triangular face for any three points whose pairwise distances are at most $\epsilon$ . Figure 1 shows three Rips complexes built from the same underlying point cloud, but with different scale parameters. For illustration purposes, we have drawn a ball of diameter $\epsilon$ around each data point. The complex contains an edge for each pair of balls that intersect and a triangular face for each three balls that intersect pairwise.
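The construction in Definition 1 can be sketched directly in Python. This is a brute-force illustration for small point clouds, not an efficient implementation; the function name `rips_complex` and the `max_dim` cutoff are assumptions introduced here.

```python
from itertools import combinations
from math import dist

def rips_complex(points, eps, max_dim=2):
    """Return the simplices of the Rips complex at scale eps as sorted index tuples."""
    n = len(points)
    simplices = [(i,) for i in range(n)]  # every point is a 0-simplex
    for k in range(1, max_dim + 1):
        for combo in combinations(range(n), k + 1):
            # A k-simplex is present when all pairwise distances are at most eps.
            if all(dist(points[a], points[b]) <= eps
                   for a, b in combinations(combo, 2)):
                simplices.append(combo)
    return simplices

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
small = rips_complex(points, eps=1.0)   # two edges, no triangle
large = rips_complex(points, eps=1.5)   # three edges and the triangle (0, 1, 2)
```

Every simplex of `small` is also a simplex of `large`, which is the nesting property that the filtration defined next relies on.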

A Rips complex is built with a fixed scale parameter $\epsilon$ , but usually no single choice of $\epsilon$ reveals all structure of the data. Instead, we consider many Rips complexes, one for every positive value $\epsilon$ . Imagine growing balls of diameter $\epsilon$ centered at each point: as $\epsilon$ increases from zero, an $n$ -simplex appears whenever $n+1$ balls pairwise intersect. Figure 1 shows three snapshots of this process, which leads us to the concept of a filtration.

A filtration is a sequence of simplicial complexes, each a subcomplex of the next:

$$X_{1}\subset X_{2}\subset X_{3}\subset\cdots\subset X_{n}\subset\cdots. \tag{1}$$

Figure 2 illustrates a filtration. In the figure, $X_{i}\hookrightarrow X_{i+1}$ denotes a map that takes each simplex in $X_{i}$ to its corresponding simplex in $X_{i+1}$ ; this is possible because $X_{i}$ is a subcomplex of $X_{i+1}$ . Note that if a simplex appears in $X_{i}$ , it must be present in $X_{j}$ for all $j>i$ .

If every complex in a filtration is a Rips complex, then we call the filtration a Rips filtration. Given a finite point set, simplices appear at only finitely many values of $\epsilon$ . Thus, a Rips filtration of a finite point set can be denoted

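$$\mathcal{R}_{\epsilon_{0}}\subset\mathcal{R}_{\epsilon_{1}}\subset\mathcal{R}_{\epsilon_{2}}\subset\cdots\subset\mathcal{R}_{\epsilon_{n}}$$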

for some sequence $0=\epsilon_{0}<\epsilon_{1}<\epsilon_{2}<\cdots<\epsilon_{n}$ .

Figure 2 illustrates a Rips filtration; note that we have not drawn a complex for every $\epsilon$ at which a simplex appears. The shaded areas represent the triangular faces and, in $X_{6}$ , the boundary of a $3$ -simplex.

Any topological feature (such as a component or a hole) appears in the filtration at some scale parameter $\epsilon_{1}$ and disappears at some scale parameter $\epsilon_{2}$; the pair $(\epsilon_{1},\epsilon_{2})$ gives the persistence of the feature. Plotting each persistence pair as an interval along the scale axis produces a barcode, as seen at the bottom of Figure 2. The orange bars in Figure 2 represent components: each of the five points corresponds to an orange bar starting at $\epsilon=0$. An orange bar ends at each $\epsilon$ value at which two components become connected. The blue bar represents the hole in the simplicial complex, which persists over a range of $\epsilon$ values.
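For components, the bars described above can be computed with a union-find structure: every point starts a bar at $\epsilon=0$, and a bar ends each time an edge joins two previously separate components. The following is a minimal sketch for a non-empty point cloud; the name `h0_barcode` is an assumption made for illustration.

```python
from itertools import combinations
from math import dist, inf

def h0_barcode(points):
    """Degree-0 barcode of the Rips filtration of a non-empty point cloud."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative of i's component.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist(points[a], points[b]), a, b)
                   for a, b in combinations(range(len(points)), 2))
    bars = []
    for length, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb              # two components merge at this scale
            bars.append((0.0, length))   # so one bar ends here
    bars.append((0.0, inf))              # one component persists forever
    return bars

bars = h0_barcode([(0, 0), (1, 0), (5, 0)])
# bars == [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
```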

The information in a barcode can also be visualized as a persistence diagram, which is a collection of points above the diagonal in the $xy$ -plane. Bars in the barcode are in one-to-one correspondence with points in the persistence diagram. A bar from $a$ to $b$ is plotted as the point $(a,b)$ in the persistence diagram.

In order to quantify the topological features of a simplicial complex, as illustrated in the barcode, we use the mathematics of homology. Homology associates a vector space to each simplicial complex and a linear map to each inclusion map in the filtration. The homology vector space $H_{k}(X)$ is generated by the $k$ -dimensional holes of simplicial complex $X$ . Making this precise requires some definitions, which we introduce briefly; for more details, see [16].

Let $C_{k}$ be a vector space whose basis consists of all $k$ -simplices in simplicial complex $X$. That is, $C_{k}$ contains $k$ -chains, which are sums of $k$ -simplices with coefficients in a field $\mathbb{F}$. In TDA, $\mathbb{F}$ is usually chosen to be the $2$ -element field, a choice we make in this paper as well. The boundary operator $\partial_{k}:C_{k}\to C_{k-1}$ maps a $k$ -simplex to the sum of its $(k-1)$ -faces, extending by linearity to $k$ -chains. Let $B_{k}\subseteq C_{k}$ be the subspace of boundaries, which are images of $\partial_{k+1}$. Let $Z_{k}\subseteq C_{k}$ be the subspace of cycles, defined by the property that $v\in Z_{k}$ if and only if $\partial_{k}(v)=0$. Crucially, $B_{k}\subseteq Z_{k}$, since $\partial\circ\partial=0$. We then define the homology vector space $H_{k}=Z_{k}/B_{k}$. Thus, the nonzero elements of $H_{k}$ are classes of cycles that are not boundaries. The dimension of $H_{k}$ is the number of independent equivalence classes of holes, in the sense that two holes are equivalent if they differ by a boundary.
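Since $\dim H_{k}=\dim C_{k}-\operatorname{rank}\partial_{k}-\operatorname{rank}\partial_{k+1}$, the dimensions of the homology vector spaces over the $2$ -element field can be computed from ranks of boundary matrices. The sketch below illustrates this for small complexes given as sorted vertex tuples; the helper names `rank_gf2` and `betti_numbers` are assumptions introduced here, and the input is assumed to be closed under taking faces.

```python
from itertools import combinations

def rank_gf2(rows):
    """Rank over the 2-element field; each row is an int used as a bit vector."""
    pivots = {}  # leading-bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = row
                break
            row ^= pivots[lead]  # addition over the 2-element field is XOR
    return len(pivots)

def betti_numbers(simplices, max_dim):
    """Return [dim H_0, ..., dim H_max_dim] for a complex of sorted vertex tuples."""
    by_dim = {k: sorted(s for s in simplices if len(s) == k + 1)
              for k in range(max_dim + 2)}
    index = {k: {s: i for i, s in enumerate(by_dim[k])} for k in by_dim}

    def boundary_rank(k):
        if k == 0 or not by_dim.get(k):
            return 0
        rows = []
        for s in by_dim[k]:
            row = 0
            for face in combinations(s, k):  # the (k-1)-faces of s
                row |= 1 << index[k - 1][face]
            rows.append(row)
        return rank_gf2(rows)

    return [len(by_dim[k]) - boundary_rank(k) - boundary_rank(k + 1)
            for k in range(max_dim + 1)]

hollow = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
filled = hollow + [(0, 1, 2)]
```

The hollow triangle has one component and one hole, while adding the $2$ -simplex turns the cycle into a boundary and removes the hole.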

The homology of a filtration is a one-parameter persistence module. The inclusion maps in the filtration induce linear maps between the homology vector spaces. Specifically, the degree $i$ homology of the filtration in equation (1) is a persistence module consisting of the following vector spaces and linear maps:

$$H_{i}(X_{1})\to H_{i}(X_{2})\to H_{i}(X_{3})\to\cdots\to H_{i}(X_{n})\to\cdots.$$

The structure theorem for persistence modules says each persistence module is the direct sum of interval modules; each interval gives the persistence of one topological feature in the filtration [5]. Thus, a barcode is a visualization of a persistence module, with each interval module shown as a bar.

In order to compare barcodes, we need a notion of distance between barcodes. We use the bottleneck distance, which is easily computable, though other options exist [2]. Before defining the bottleneck distance, we introduce the concept of a matching, which we explain in terms of persistence diagrams.

A matching $\eta$ between persistence diagrams $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ pairs each point in $\mathcal{D}_{1}$ with a point in $\mathcal{D}_{2}$ or a point on the diagonal line, and pairs each point in $\mathcal{D}_{2}$ with a point in $\mathcal{D}_{1}$ or a point on the diagonal. For an illustration of a matching between two persistence diagrams, see Figure 3. By convention, we use the $L_{\infty}$ metric to obtain the distance from a point $x=(x_{1},x_{2})$ to its matched point $\eta(x)=(y_{1},y_{2})$ :

$$\|x-\eta(x)\|_{\infty}=\max(|x_{1}-y_{1}|,|x_{2}-y_{2}|).$$

Let the size of a matching refer to the supremum of the $L_{\infty}$ distance between matched points. Among all possible matchings, we seek a matching with the smallest size. The bottleneck distance between $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ is the size of this optimal matching, as defined below [6].

Definition 2

The bottleneck distance between persistence diagrams $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ is

$$d_{B}(\mathcal{D}_{1},\mathcal{D}_{2})=\inf_{\eta}\sup_{x}\|x-\eta(x)\|_{\infty},$$

where the supremum is taken over all matched points $x$ and the infimum is taken over all matchings $\eta$ .

Figure 3 shows the optimal matching between two persistence diagrams $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ . The size of this matching is given by the max $L_{\infty}$ distance between matched points, which is $\max(|a-c|,|b-d|)$ . Since no other matching between these persistence diagrams has smaller size, the bottleneck distance $d_{B}(\mathcal{D}_{1},\mathcal{D}_{2})$ is equal to $\max(|a-c|,|b-d|)$ .
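For very small diagrams, Definition 2 can be evaluated by brute force: pad each diagram with diagonal slots, enumerate every matching, and keep the smallest size. The sketch below assumes finite coordinates and is only practical for a handful of points; the name `bottleneck_distance` and the use of `None` for a diagonal slot are choices made here for illustration, not the algorithm used by any particular library.

```python
from itertools import permutations

def bottleneck_distance(d1, d2):
    """Brute-force bottleneck distance between two small finite persistence diagrams."""
    n, m = len(d1), len(d2)
    # Pad each diagram with one diagonal slot (None) per point of the other diagram.
    left = list(d1) + [None] * m
    right = list(d2) + [None] * n

    def cost(p, q):
        if p is None and q is None:
            return 0.0
        if p is None:
            return (q[1] - q[0]) / 2  # L-infinity distance from q to the diagonal
        if q is None:
            return (p[1] - p[0]) / 2
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    best = float("inf")
    for perm in permutations(range(n + m)):
        # Size of this matching: the largest distance between matched points.
        size = max((cost(left[i], right[j]) for i, j in enumerate(perm)),
                   default=0.0)
        best = min(best, size)
    return best

print(bottleneck_distance([(1, 5)], [(2, 7)]))  # 2, i.e. max(|1-2|, |5-7|)
```

With one point $(a,b)$ in each diagram far from the diagonal, the result agrees with $\max(|a-c|,|b-d|)$ as in the Figure 3 discussion.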

While one-parameter persistence is stable with respect to perturbation of the point cloud data, it is unstable with respect to the presence of outliers. A density estimator on the points (i.e., a function that indicates whether each point has many nearby neighbors) might be able to identify outliers, but this requires introducing a threshold. Instead, we prefer to use the density estimator as a second filtration parameter, which brings us into the realm of two-parameter persistence.

2.2 Two-parameter persistence

Two-parameter persistence arises from data that is simultaneously indexed by two parameters. For example, suppose we have a point cloud $\mathcal{P}$ and a real-valued function $f:\mathcal{P}\to\mathbb{R}$ on each point. In particular, $f$ may arise from a density estimator. For any $r\in\mathbb{R}$ , let

$$f^{-1}(-\infty,r]=\{p\in\mathcal{P}\mid f(p)\leq r\}.$$

We can then construct a Rips filtration from $f^{-1}(-\infty,r]$ . Repeating this construction for an increasing sequence $r_{1},r_{2},\ldots,r_{n}$ , we obtain a sequence of Rips filtrations, which yields a bifiltration.

A bifiltration is a set of simplicial complexes, each indexed by two parameters, with inclusion maps in the direction of each increasing parameter. Specifically, the set of simplicial complexes $\{C_{i,j}\}_{i,j}$ forms a bifiltration if there exist commuting inclusion maps

$$C_{i,j}\hookrightarrow C_{i^{\prime},j^{\prime}}$$

whenever $i\leq i^{\prime}$ and $j\leq j^{\prime}$ . Figure 4 (left) gives an example of a bifiltration.

The homology of a bifiltration is a 2-parameter persistence module, which is a set of vector spaces $H_{p}(C_{i,j})$ with commuting linear maps in the directions of increase of $i$ and $j$ , as illustrated in Figure 4 (right).

Unfortunately, the algebraic structure of 2-parameter persistence modules is extremely complicated, and there is no reasonable “barcode” for such modules [9]. Instead, we can obtain a barcode along any line with nonnegative slope in the 2-parameter space by restricting the 2-parameter persistence module to such a line, as we now explain.

Let $\mathcal{M}$ be a 2-parameter persistence module with parameter values in discrete indexing sets $I$ and $J$ . Denote the vector spaces in $\mathcal{M}$ as $M_{i,j}$ for every $i\in I$ and $j\in J$ , with linear maps $M_{i,j}\to M_{i^{\prime},j^{\prime}}$ whenever $i\leq i^{\prime}$ and $j\leq j^{\prime}$ . We may then adopt a continuous perspective, assigning a vector space from $\mathcal{M}$ to every point in $(x,y)\in\mathbb{R}^{2}$ . If $x<\min(I)$ or $y<\min(J)$ , then the point $(x,y)$ is assigned the zero vector space; otherwise, the vector space assigned to $(x,y)$ is $M_{a,b}$ , where $a=\max\{i\in I\mid i\leq x\}$ and $b=\max\{j\in J\mid j\leq y\}$ , as illustrated in Figure 5.
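The continuous extension amounts to a lookup of the largest grade not exceeding each coordinate. The sketch below illustrates this with a simplification: the module is represented only by a table of vector-space dimensions, whereas an actual persistence module also carries the linear maps. The names `grade_index` and `module_at` are assumptions made for illustration.

```python
from bisect import bisect_right

def grade_index(grades, x):
    """Index of the largest grade <= x in a sorted list, or None if x is below all."""
    pos = bisect_right(grades, x)
    return pos - 1 if pos > 0 else None

def module_at(dims, I, J, x, y):
    """Dimension of the vector space assigned to (x, y); 0 stands for the zero space."""
    a, b = grade_index(I, x), grade_index(J, y)
    if a is None or b is None:   # x < min(I) or y < min(J)
        return 0
    return dims[a][b]

I = [0, 1, 2]                    # hypothetical grades for the first parameter
J = [1, 3]                       # hypothetical grades for the second parameter
dims = [[3, 2], [2, 1], [1, 1]]  # dims[a][b] = dimension of M_{I[a], J[b]}
```

For instance, the point $(1.5, 2.0)$ receives the space indexed by $a=1$ and $b=1$ in grade values, which is `dims[1][0]` in the table above.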

Let $\ell$ be a line in $\mathbb{R}^{2}$ with non-negative slope. Let $\mathcal{M}_{\ell}$ be the 1-parameter persistence module obtained by restricting $\mathcal{M}$ to line $\ell$ : every point along $\ell$ is assigned the homology vector space of $\mathcal{M}$ at that point in $\mathbb{R}^{2}$ , with linear maps induced from $\mathcal{M}$ (as in Figure 5). Since $\mathcal{M}_{\ell}$ is a 1-parameter persistence module, it has a barcode, or equivalently, a persistence diagram.

Furthermore, we can define a distance between 2-parameter persistence modules by considering the bottleneck distances between persistence diagrams along all possible lines through the 2-parameter space. In the following definition, $\mathcal{D}(\mathcal{M}_{\ell})$ denotes the persistence diagram of the 1-parameter module $\mathcal{M}_{\ell}$ .

Definition 3

The matching distance, $d_{M}$ , between two 2-parameter persistence modules $\mathcal{M}$ and $\mathcal{N}$ is the supremum of the bottleneck distances between the persistence diagrams on corresponding lines of non-negative slope in the two modules. Precisely,

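$$d_{M}=\sup_{\ell}\{d_{B}(\mathcal{D}(\mathcal{M}_{\ell}),\mathcal{D}(\mathcal{N}_{\ell}))\cdot\mathrm{weight}(\mathrm{slope}(\ell))\},$$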

where the supremum is over all lines of nonnegative slope and $\mathrm{weight}(m)=\frac{1}{\sqrt{1+q^{2}}}$ , where $q=\max\left(m,\frac{1}{m}\right)$ .

In the definition of the matching distance, a weight is assigned to each line $\ell$, which depends on the slope of $\ell$. A line with slope $1$ gets the maximum weight, and the weight approaches zero as the slope approaches zero or infinity. The weight is chosen such that if the interleaving distance between two persistence modules is 1, then the weighted bottleneck distance is at most 1 [8, 9].$^{1}$

$^{1}$ Prior to computing the matching distance, the persistence modules $\mathcal{M}$ and $\mathcal{N}$ are often normalized. That is, the parameter axes for each module are rescaled so that the parameter values for all generators and relations occur in specified intervals on each axis. For details, see [13].
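The weight in Definition 3 is simple to evaluate. The following sketch implements $\mathrm{weight}(m)=1/\sqrt{1+q^{2}}$ with $q=\max(m,1/m)$; treating slope $0$ and infinite slope as weight $0$ is an assumption made here to match the limiting behavior described above.

```python
from math import sqrt, inf

def weight(m):
    """Weight of a line with slope m >= 0, following Definition 3."""
    if m < 0:
        raise ValueError("slope must be non-negative")
    if m == 0 or m == inf:
        return 0.0               # limiting value as q grows without bound
    q = max(m, 1 / m)
    return 1 / sqrt(1 + q * q)

for m in (0.25, 0.5, 1, 2, 4):
    print(m, round(weight(m), 4))
```

The weight is symmetric under $m\mapsto 1/m$ and is largest at slope $1$, where it equals $1/\sqrt{2}$.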

3 Computations and Analysis

Our datasets are point clouds with simple structure depending on a few parameters. Adjusting these parameters allows us to change the size of topological features (namely, components and holes) that are captured by the persistence modules. We expect the matching distance between these persistence modules to reflect the parameter differences in the underlying datasets.

From each dataset, we construct a density-Rips bifiltration. That is, our two parameters are a density estimator and Euclidean distance. A density estimator $f$ is assigned to each point $p$ such that $f(p)$ is small if $p$ has many nearby neighbors and $f(p)$ is large if $p$ has no nearby neighbors; this causes points with more neighbors to appear before points with few neighbors in the density filtration.$^{2}$ For any $r\in\mathbb{R}$, the Rips filtration is constructed on $f^{-1}(-\infty,r]$. This produces a 2-parameter family of simplicial complexes with inclusion maps in the increasing directions of both density and distance.

$^{2}$ We used a $k$ -nearest-neighbor density estimator, but many other options are available.
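One estimator with the property described above is the distance from each point to its $k$ -th nearest neighbor, which is small in dense regions and large for isolated points. The sketch below computes it by brute force and then extracts a sublevel set $f^{-1}(-\infty,r]$; the function names are assumptions made for illustration, and this is not necessarily the exact estimator code used in our computations.

```python
from math import dist

def knn_density(points, k):
    """f(p) = distance from p to its k-th nearest neighbor (smaller means denser)."""
    values = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(d[k - 1])
    return values

def sublevel_set(points, values, r):
    """Points p with f(p) <= r, on which a Rips filtration is then built."""
    return [p for p, v in zip(points, values) if v <= r]

points = [(0, 0), (1, 0), (0, 1), (10, 10)]
values = knn_density(points, k=1)
dense = sublevel_set(points, values, r=2.0)  # the outlier (10, 10) is excluded
```

Increasing `r` admits more points, so the sublevel sets are nested, which is what gives the inclusion maps in the density direction.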

We begin with very simple point-cloud datasets, each consisting of three points in the $xy$ -plane. We examine the matching distances while keeping two points fixed and moving the third point around the plane. Three points are enough to produce a density-Rips bifiltration that is nontrivial in both density and distance. Thus, we regard these datasets as the simplest datasets that allow us to study the effect of moving a single point on the matching distance.

The second collection of datasets consists of points sampled from two circles with radii $r$ and separated by distance $d$ . These datasets have nontrivial structure in persistent homology of degree zero and one. We generated many datasets with various values of $d$ and fixed $r$ , investigating how varying $d$ affects the matching distance computed from zero-degree persistent homology modules. We also fix $d$ and vary $r$ , investigating the matching distance between first-degree persistent homology modules.

We computed two-parameter persistent homology using RIVET, an interactive visualization software developed by Michael Lesnick and Matthew Wright [12]. A detailed description of RIVET and its algorithms appears in a comprehensive preprint by Lesnick and Wright [10]. Given point-cloud data, RIVET computes a 2-parameter persistence module, and then computes barcodes along linear slices of the persistence module. We approximated the matching distance from these barcodes using Python code written by Bryn Keller and Michael Lesnick [13], which uses a finite set of lines to approximate the supremum in Definition 3.$^{3}$ We regard the matching distance as giving a measure of similarity between two point-cloud datasets.

$^{3}$ The approximation algorithm requires us to specify the number of lines used in the approximation. This is achieved by specifying a grid-size parameter, which determines the number of different slope and intercept values that the algorithm uses. For example, if grid-size is $20$, then the algorithm uses $20$ slope values and $20$ intercept values to produce $400$ lines, computing the bottleneck distance along each. In our experiments, we found little difference in the approximated matching distance when grid-size was set to $20$ or to a larger value, such as $50$, though the computation time increases according to the square of the grid-size value.
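To illustrate why the cost grows with the square of the grid-size value, the following sketch builds a grid of lines from a given number of slope and offset values. The parameterization by angle and offset, and the names `line_grid` and `offset_range`, are assumptions made for illustration; this is not the code of [13].

```python
from math import radians, tan

def line_grid(grid_size, offset_range):
    """Return grid_size * grid_size lines as (slope, offset) pairs."""
    lo, hi = offset_range
    lines = []
    for i in range(grid_size):
        # Angles strictly between 0 and 90 degrees give positive finite slopes.
        angle = 90.0 * (i + 0.5) / grid_size
        for j in range(grid_size):
            offset = lo + (hi - lo) * j / (grid_size - 1) if grid_size > 1 else lo
            lines.append((tan(radians(angle)), offset))
    return lines

coarse = line_grid(20, (-1.0, 1.0))  # 400 lines
fine = line_grid(50, (-1.0, 1.0))    # 2500 lines
```

An approximate matching distance would then be the largest weighted bottleneck distance found over these lines, so it is a lower bound on the supremum in Definition 3.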

3.1 Three-Point Datasets

For our first investigation, each data set consisted of two fixed points $A=(1,1)$ and $B=(6.1,1)$ , and a third point $C_{i}$ in the $xy$ -plane. Figure 6 shows the locations of $A$ and $B$ (in red), as well as the locations of all $C_{i}$ (blue). Since $A$ and $C_{i}$ are always the closest pair of points, we assigned to these points a density parameter of $1$ , and then we assigned point $B$ a density parameter of $2$ .

For concreteness, let $X_{r,s}$ denote a 3-point dataset where $C_{i}$ has coordinates $(r,s)$ . That is, $X_{r,s}=\{A,B,(r,s)\}$ . Let $X_{t,u}=\{A,B,(t,u)\}$ be another 3-point dataset. We compute the matching distance between the 2-parameter persistence modules constructed from $X_{r,s}$ and $X_{t,u}$ .

First, we fix $s=u=3$ . Figure 7 displays the matching distance between the 2-parameter persistence modules constructed from $X_{r,3}$ and $X_{t,3}$ , for various choices of $r$ and $t$ . The horizontal axis gives $r$ , and the color of each curve represents the value of $t$ . Specifically, $t$ ranges from [math], colored dark green, to $3.3$ , colored brown, and the step size is $.184$ .

We observe that when $s$ and $u$ are small, such as $s=u=3$, as shown in Figure 7, there is a relatively linear increase in matching distance as $t$ increases (above some threshold) with $r$ fixed. Some of the curves display a nonlinear region for small values of $t$; we note that these curves have $t$ smaller than $1$, which is the $x$ -coordinate of $A$.

Next, we fix $s=u=10$. Figure 8, analogous to Figure 7, displays the matching distance between the 2-parameter persistence modules constructed from $X_{r,10}$ and $X_{t,10}$, for various choices of $r$ and $t$. No linear trend in matching distances is present in this case, as the matching distance increases faster as $r$ increases. A prominent feature in Figure 8 is that the matching distance attains the value $0$; the following proposition explains why this occurs.

Proposition 1

Suppose $A$ and $B$ are points in the $xy$ -plane such that the Euclidean distance between $A$ and $B$ is $d>0$. Now suppose there are two points $C_{1}$ and $C_{2}$, both at distance $r<d$ from $A$, with the distance between $C_{i}$ and $B$ equal to $h_{i}>d$ for $i\in\{1,2\}$. Then the matching distance between the 2-parameter persistence modules constructed from the two point clouds $\{A,B,C_{1}\}$ and $\{A,B,C_{2}\}$ is $0$ (see Figure 9).

Proof: Since the distance between each $C_{i}$ and $A$ is smaller than the distance between each $C_{i}$ and $B$ , the density parameter $1$ is assigned to $C_{1}$ , $C_{2}$ , and $A$ , while point $B$ is assigned density parameter $2$ . From such a point cloud $\{A,B,C_{i}\}$ , we construct a bifiltration as shown on the left side of Figure 10, as we now explain, considering each density parameter value in turn.

Density 1: When the distance parameter $\epsilon=0$ , only $C_{1}$ and $A$ appear, so we have two isolated points. When $\epsilon$ increases to $r$ , $C_{1}$ and $A$ will connect to form an edge. As $\epsilon$ increases from $r$ , the simplicial complex remains unchanged, since no other points exist to produce edges at this density.

Density 2: When the scale parameter $\epsilon=0$ , all points $A,B,C_{1}$ in the point cloud appear, so we have three isolated points. When $\epsilon$ increases to $r$ , $C_{1}$ and $A$ will form an edge. This results in one connected component and one isolated point. When $\epsilon$ increases to $d$ , an edge connects $B$ to $A$ . Since all points are connected at distance $d$ , $H_{0}$ homology doesn’t change as $\epsilon$ increases further.

Now consider the point cloud $\{A,B,C_{2}\}$. The bifiltration constructed from this point cloud is nearly the same as that described above; the only difference is the distance value $h_{2}$ at which the longest edge appears. However, this edge does not connect any components that were not already connected at distance $d$, so no new (zeroth) homology appears at distance $h_{2}$. Thus, we obtain topologically equivalent bifiltrations for the two data sets. This implies that the two 2-parameter persistence modules are the same, and so barcodes along any linear slice of the two modules are also the same. Therefore, the matching distance between these modules is $0$.
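The key step of the proof, that components merge at the same scales $r$ and $d$ at each density level regardless of $h_{i}$, can be checked numerically on a concrete example. The sketch below uses $A=(1,1)$ and $B=(6.1,1)$ from Section 3.1 with two hypothetical choices of $C_{i}$ at distance $r=2$ from $A$. Agreement of merge scales is a consistency check on the argument, not a full proof that the modules coincide.

```python
from itertools import combinations
from math import dist

def h0_merge_scales(points):
    """Scales at which components merge in the Rips filtration (Kruskal's algorithm)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    scales = []
    for length, a, b in sorted((dist(points[a], points[b]), a, b)
                               for a, b in combinations(range(len(points)), 2)):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            scales.append(length)
    return scales

A, B = (1.0, 1.0), (6.1, 1.0)

def merge_scales_by_density(C):
    # Density 1 contains A and C only; density 2 contains all three points.
    return {1: h0_merge_scales([A, C]), 2: h0_merge_scales([A, B, C])}

first = merge_scales_by_density((1.0, 3.0))    # C_1: distance 2 from A, h_1 > d
second = merge_scales_by_density((-1.0, 1.0))  # C_2: distance 2 from A, h_2 > d
```

Both dictionaries are equal: the longest edge, of length $h_{1}$ or $h_{2}$, never merges components, so it leaves no trace in degree zero.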

According to the proposition above, it would be easy to compute the probability of having matching distance of $0$ if $C_{1}$ and $C_{2}$ are randomly selected on circle $O_{A}$ and circle $O_{B}$ in Figure 9. The following corollary gives such a probabilistic interpretation of the proposition; we leave the proof to the reader.

Corollary 1

Suppose $A$ and $B$ are points in the $xy$ -plane with distance $d>0$ between them. Let $O_{A}$ be a circle of radius $r<d$ centered at $A$, and $O_{B}$ a circle of radius $r<d$ centered at $B$. Now suppose there are two points $C_{1}$ and $C_{2}$, which are randomly selected on the two circles. Then the two point clouds $\{A,B,C_{1}\}$ and $\{A,B,C_{2}\}$ have a matching distance of $0$ with probability $\left(1-\frac{1}{\pi}\arccos\left(\frac{r}{2d}\right)\right)^{2}$.
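A Monte Carlo simulation gives a quick sanity check of the probability $\left(1-\frac{1}{\pi}\arccos\left(\frac{r}{2d}\right)\right)^{2}$. A point on $O_{A}$ at angle $\theta$ from the direction of $B$ is farther than $d$ from $B$ exactly when $\cos\theta<\frac{r}{2d}$, which happens for a fraction $1-\frac{1}{\pi}\arccos\left(\frac{r}{2d}\right)$ of the circle. The sketch assumes both points are drawn independently and uniformly from $O_{A}$ (the case of $O_{B}$ is symmetric) and that the event of interest is the hypothesis of Proposition 1; the function names and parameter values are illustrative.

```python
import random
from math import acos, cos, sin, pi, hypot

def closed_form(r, d):
    """Probability that two independent uniform points both satisfy the hypothesis."""
    return (1 - acos(r / (2 * d)) / pi) ** 2

def simulate(r, d, trials=200_000, seed=0):
    rng = random.Random(seed)

    def far_from_b():
        theta = rng.uniform(0, 2 * pi)
        # A is at the origin, B at (d, 0), and C lies on the circle O_A.
        return hypot(r * cos(theta) - d, r * sin(theta)) > d

    hits = 0
    for _ in range(trials):
        # Draw both points explicitly so every trial uses two samples.
        first, second = far_from_b(), far_from_b()
        hits += first and second
    return hits / trials

estimate = simulate(2.0, 5.1)
exact = closed_form(2.0, 5.1)
```

With these parameters the estimate and the closed form should agree to about two decimal places.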

The following theorem gives an $n$ -point generalization of the 3-point proposition. In the theorem, a dataset consists of $n$ vertices of a regular polygon in the $xy$ -plane, as well as one additional point.

Theorem 1

Suppose $A_{1},A_{2},\ldots,A_{n}$ are points in the $xy$ -plane forming the vertices of a regular polygon with side length $d>0$. Now suppose there are two points $C_{1}$ and $C_{2}$, both at distance $r$, with $0<r<d$, from the $A_{i}$ that they are closest to, and at distance greater than $d$ from all other $A_{i}$. Then the matching distance between the 2-parameter persistence modules constructed from the two point clouds $\{A_{1},A_{2},\ldots,A_{n},C_{1}\}$ and $\{A_{1},A_{2},\ldots,A_{n},C_{2}\}$ is $0$.

Proof: Similar to the proof for the proposition above, each point will be assigned a density parameter of $1$ or $2$ . Since the distance between points $C_{1},C_{2}$ and the respective $A_{i}$ that they are closest to is smaller than the distance between the $C_{i}$ and all other $A_{i}$ , density $1$ is assigned to $C_{1},C_{2},$ and the respective $A_{i}$ that they are closest to (note that they could both be closest to the same $A_{i}$ ). All other $A_{i}$ are assigned a density of $2$ . In the point clouds consisting of these points, we have a bifiltration for both density $1$ and density $2$ .

In the bifiltration for density $1$, only $C_{1}$ and the $A_{i}$ that it is closest to appear, so the proof follows the same steps as the proof of the proposition.

In the bifiltration for density $2$ , all points in the point cloud appear (all $A_{i}$ and $C_{1}$ ). At $\epsilon=r$ , $C_{1}$ and the $A_{i}$ it is closest to form an edge, but no other edges form between any points. As $\epsilon$ increases to $d$ , all edges of the regular polygon appear.

Similarly, we have the exact same bifiltration for the point cloud consisting of $C_{2}$ and the $A_{i}$. Following the same reasoning as in the proof of the proposition, the matching distance between the two point clouds is $0$.

3.2 Two-Circle Datasets

Our second investigation involved datasets consisting of points sampled from two circles with radii $r$ , separated by distance $d$ . For each dataset, two hundred points were selected randomly from a uniform distribution on each circle; Figure 11 displays an example. Each point was assigned a density estimator defined as the distance to the $20$ th nearest neighbor.

Suppose that we fix the radius $r$ and vary the separation distance $d$ . Intuitively, the larger the change in separation distance, the more different we regard the point clouds. Thus, we expect that a large change in $d$ will result in a large matching distance between the 2-parameter persistence modules constructed from the point clouds.

We generated $60$ datasets using $r=3$ and $d\in\{0.5,1,1.5,\ldots,30\}$ . We computed 2-parameter persistence modules for each dataset and computed the matching distance between each pair of modules.
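A dataset of this kind can be generated as follows. The sketch assumes that the separation distance $d$ is the gap between the two circles, so that the centers are $2r+d$ apart; the text does not fix this convention, and the function name `two_circles`, the seed handling, and the placement of the centers on the $x$ -axis are likewise illustrative choices.

```python
import random
from math import cos, sin, pi

def two_circles(r, d, n_per_circle=200, seed=0):
    """Sample points uniformly from two circles of radius r separated by a gap d."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (2 * r + d, 0.0)]  # assumed convention for the separation
    points = []
    for cx, cy in centers:
        for _ in range(n_per_circle):
            theta = rng.uniform(0, 2 * pi)
            points.append((cx + r * cos(theta), cy + r * sin(theta)))
    return points

separations = [0.5 * k for k in range(1, 61)]  # 0.5, 1, 1.5, ..., 30
datasets = {d: two_circles(3.0, d, seed=i) for i, d in enumerate(separations)}
```

Each dataset then receives a density value per point before the bifiltration is built.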

Consider a pair of datasets, one with separation distance $d_{1}$ and the other with separation distance $d_{2}$. Figure 12 displays the matching distance between the persistence modules plotted against the difference in separation distance $d_{2}-d_{1}$. For the plot, we chose four representative values of $d_{2}$ from those listed above: $d_{2}\in\{10,15,22,28\}$. For each of these four values of $d_{2}$, we plot the matching distance for those datasets with separation distance $d_{1}$ such that the difference $d_{2}-d_{1}$ is in the range from $0.5$ to $d_{2}-0.5$.

Figure 12 (left) shows the trend when $d_{2}$ is $10$ or $15$. As the difference in separation distance $d_{2}-d_{1}$ increases, the matching distance between the pair of datasets increases initially, but then remains nearly constant. In the region of increasing matching distance, we note three linear segments, with a jump between each. Though the plots display nearly the same shape for $d_{2}=10$ and $d_{2}=15$, we note that the matching distance attains a higher value for $d_{2}=15$, indicating that the matching distance reveals a greater difference between the persistence modules when one dataset consists of circles that are farther apart.

When the separation distance $d_{2}$ is $22$ or $28$, as shown in Figure 12 (right), the matching distance follows almost the same trend as in Figure 12 (left). However, when $d_{2}=22$, we note some randomness, possibly due to sampling irregularities, when $d_{2}-d_{1}$ is small. Also, we note only two linearly increasing segments, with a single jump, when $d_{2}=28$. Interestingly, the near-constant part of the plot is higher when $d_{2}$ is the smaller of the two values, which is contrary to what we observed in Figure 12 (left).

To understand the structure observed in Figure 12, we looked into the barcodes involved in the matching distance calculation. Recall that the matching distance between two 2-parameter persistence modules is the maximum weighted bottleneck distance between two barcodes, one from each persistence module, obtained by restricting both modules to the same line. RIVET provides us access to these barcodes.
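For reference, the standard form of this definition can be written as follows. The symbols here are assumptions chosen for illustration and may differ from the notation used earlier in the paper: $M^{\ell}$ and $N^{\ell}$ denote the 1-parameter modules obtained by restricting $M$ and $N$ to a line $\ell$, $d_{B}$ is the bottleneck distance between their barcodes, and $w(\ell)$ is the normalizing weight that depends on the slope of $\ell$.

```latex
d_{\mathrm{match}}(M, N) \;=\; \sup_{\ell}\; w(\ell)\, d_{B}\!\left(M^{\ell}, N^{\ell}\right)
```

The supremum is taken over the admissible lines $\ell$, so a pair of barcodes "realizes" the matching distance when it comes from a line at which this weighted bottleneck distance is largest.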

For example, consider the pair of datasets for $d_{1}=8$ and $d_{2}=10$. The matching distance between the persistence modules is plotted as one of the red dots in Figure 12 (left). The barcodes that realize this matching distance are shown (as persistence diagrams) in Figure 13. We compared these persistence diagrams to understand the matching distance between the persistence modules. We observe that the persistence diagram for the dataset with separation distance $d_{1}=8$ has only three dots with finite coordinates (i.e., the barcode has only three finite bars). Our persistence diagrams for zero-degree persistent homology always have an additional dot at $(0,\infty)$, since there is one connected component that persists at all distance scales. In comparison, the persistence diagram for the dataset with separation distance $d_{2}=10$ has five dots with finite coordinates (i.e., the barcode has five finite bars). Given the small number of points, we manually matched the points and computed the matching distance. In this example, the two dots far from the diagonal are matched together in the optimal matching, and the distance between that pair is what gives the matching distance. By identifying how points are matched in persistence diagrams, we could better understand the structure in Figure 12.
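For diagrams this small, the matching step can be checked by brute force. The sketch below is an illustration under stated assumptions, not the procedure used by RIVET: it computes the unweighted bottleneck distance between two small diagrams of finite points, using the $L^{\infty}$ distance between points and allowing any point to be matched to the diagonal. The function name and the example coordinates are hypothetical and are not the values shown in Figure 13.

```python
from itertools import permutations

def bottleneck_bruteforce(diag_a, diag_b):
    """Unweighted bottleneck distance between two small persistence diagrams.

    Each diagram is a list of finite (birth, death) pairs. Tries every
    perfect matching, so it is only practical for a handful of points.
    """
    n, m = len(diag_a), len(diag_b)
    size = n + m  # each side is padded with diagonal slots for the other side

    def cost(i, j):
        a_is_point = i < n
        b_is_point = j < m
        if a_is_point and b_is_point:
            (b1, d1), (b2, d2) = diag_a[i], diag_b[j]
            return max(abs(b1 - b2), abs(d1 - d2))   # L-infinity distance
        if a_is_point:
            b1, d1 = diag_a[i]
            return (d1 - b1) / 2.0                   # distance to the diagonal
        if b_is_point:
            b2, d2 = diag_b[j]
            return (d2 - b2) / 2.0
        return 0.0                                   # diagonal to diagonal

    best = float("inf")
    for perm in permutations(range(size)):
        worst = max((cost(i, perm[i]) for i in range(size)), default=0.0)
        best = min(best, worst)
    return best

# Hypothetical diagrams: one long bar on each side, plus short bars.
diagram_1 = [(0.0, 1.0), (0.0, 10.0)]
diagram_2 = [(0.0, 9.0)]
distance = bottleneck_bruteforce(diagram_1, diagram_2)
```

In this hypothetical example the two long bars are matched to each other and the short bar is matched to the diagonal, mirroring the situation described above. To obtain a value comparable to the matching distance, the result would still need to be multiplied by the weight of the line from which the barcodes were obtained.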

To understand the near-constant part of the plot, we looked at the persistence diagrams that realize the matching distances between pairs of datasets. Let $\mathcal{M}_{d}$ be the 2-parameter persistence module computed from the dataset with separation distance $d$. In Figure 12 (left), the near-constant portion of dots for $d_{2}=10$ extends from $d_{2}-d_{1}=5$ (that is, $d_{1}=5$) to $d_{2}-d_{1}=9.5$ (that is, $d_{1}=0.5$). We examined the calculation of the matching distance between $\mathcal{M}_{5}$ and $\mathcal{M}_{10}$, and also the matching distance between $\mathcal{M}_{0.5}$ and $\mathcal{M}_{10}$. We found that in both matching distance calculations, the line $\ell$ that maximizes the weighted bottleneck distance (thus realizing the matching distance) is the same. Let $\mathcal{D}^{\ell}_{d}$ be the persistence diagram obtained from $\mathcal{M}_{d}$ along line $\ell$. We found that the diagram $\mathcal{D}^{\ell}_{10}$ has only one finite bar. When matching $\mathcal{D}^{\ell}_{10}$ to $\mathcal{D}^{\ell}_{0.5}$ or $\mathcal{D}^{\ell}_{5}$, this finite bar is always matched to the diagonal, and this gives the maximum distance between matched points. Thus, for the persistence modules that produce the near-constant portion of the plot, the matching distance is determined by the distance of the finite bar in $\mathcal{D}^{\ell}_{10}$ from the diagonal.

Since circles have nontrivial first homology, we performed a second experiment to investigate the effect of varying the circle radius on the matching distance between first persistent homology modules. Specifically, we generated datasets with circle radius $r\in\{0.2,0.4,0.6,\ldots,6\}$ and fixed separation distance $d=3$. We computed the 2-parameter persistence modules using first homology for each dataset, and compared them pairwise using the matching distance. In this experiment, we observed patterns very similar to those displayed in Figure 12: as the difference in radius increases, the matching distance first increases, but then becomes near-constant. Furthermore, the near-constant portion is slightly decreasing as the difference in radius increases.

Lastly, we wanted to determine whether our two-circle experiments are sensitive to the presence of outliers, given that an important advantage of 2-parameter persistent homology is robustness against noise. In our previous experiments, we sampled points precisely from the circles, but now we introduced some noise. We added a small error to $20\%$ or $40\%$ of the data points, and re-computed the matching distances. We found that the distribution of the matching distance for noisy data shares almost the same features as for the original datasets. We conclude that the matching distance still provides information about the change in separation distance or circle radius, even in the presence of noisy data. This is consistent with the robustness of two-parameter persistent homology against outliers, which is one of the primary motivations for this study.
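The perturbation step can be sketched as follows. The text does not specify the form or size of the "small error", so this sketch assumes independent Gaussian noise with a hypothetical standard deviation `sigma`; the function name `add_noise` is likewise an illustration only.

```python
import numpy as np

def add_noise(points, fraction, sigma=0.1, rng=None):
    """Perturb a randomly chosen fraction of the points.

    Assumptions (not specified in the text): the error is Gaussian with
    standard deviation sigma, applied independently to each coordinate.
    Returns the noisy copy and the indices of the perturbed points.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.asarray(points, dtype=float).copy()
    n_noisy = int(round(fraction * len(noisy)))
    idx = rng.choice(len(noisy), size=n_noisy, replace=False)
    noisy[idx] += rng.normal(scale=sigma, size=(n_noisy, noisy.shape[1]))
    return noisy, idx

# Example: perturb 20% of a 400-point cloud (placeholder coordinates).
cloud = np.zeros((400, 2))
noisy_cloud, perturbed = add_noise(cloud, fraction=0.2,
                                   rng=np.random.default_rng(1))
```

The matching distances would then be recomputed from the perturbed clouds in the same way as for the original datasets.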

4 Discussion and Future Research

Our findings, though for very simple datasets, suggest that the matching distance can provide a notion of similarity for point-cloud data. This research provides a step towards a deeper understanding of what matching distance reveals about the similarity or difference between point-cloud datasets. Moreover, this leads to further questions regarding how to quantify the similarity between geometric data.

To better understand our results in Figure 12, we would like to study why the jumps appear in the increasing segments in the plot of matching distances. We would like to determine why the near-constant part of the plot is slightly decreasing as $d_{2}-d_{1}$ increases. We are intrigued by the fact that the value of the matching distance along the near-constant part of the plot initially increases with $d_{2}$, but then decreases when $d_{2}$ gets sufficiently large. In our experiments, the maximum matching distance occurs when $d_{2}=21$. We would like to study this further.

In this work, we obtained 1-parameter persistence modules by restricting 2-parameter persistence modules along lines of nonnegative slope. We would like to explore the structures that exist along lines of negative slope, but this is algebraically complicated and would likely involve zigzag persistence, as discussed in [4].

Furthermore, we would like to extend this research to the analysis of real-world data. To give one example, textual data such as Wikipedia articles can be converted to high-dimensional vectors (e.g., using a word2vec algorithm). We would like to use the matching distance to explore similarities between articles, and to compare collections of the article vectors with random point-cloud data.

Acknowledgements

We would like to thank Professor Matthew Wright and Professor Matthew Richey for their guidance in this project. We would also like to thank the anonymous reviewer for many suggestions that greatly improved this paper. In addition, we are grateful to the Collaborative Undergraduate Research and Inquiry (CURI) program for the generous support of undergraduate research at St. Olaf College. This work was supported by NSF DMS-1606967 and NSF DMS-1045015.

Bibliography

[1] A. Asaad and S. Jassim, Topological data analysis for image tampering detection, in International Workshop on Digital Watermarking, Springer, 2017, pp. 136–146.
[2] P. Bubenik, V. de Silva, and J. Scott, Metrics for generalized persistence modules, Foundations of Computational Mathematics, 15 (2015), pp. 1501–1531.
[3] Z. Cang, L. Mu, K. Wu, K. Opron, K. Xia, and G.-W. Wei, A topological approach for protein classification, Computational and Mathematical Biophysics, 3 (2015).
[4] G. Carlsson and V. de Silva, Zigzag persistence, Foundations of Computational Mathematics, 10 (2010), pp. 367–405.
[5] F. Chazal, W. Crawley-Boevey, and V. de Silva, The observable structure of persistence modules, arXiv e-prints, (2014).
[6] H. Edelsbrunner and J. Harer, Persistent homology - a survey, Contemporary Mathematics, 453 (2008), pp. 257–282.
[7] R. Ghrist, Barcodes: The persistent topology of data, Bulletin of the American Mathematical Society, 45 (2008), pp. 61–75.
[8] C. Landi, The rank invariant stability via interleavings, arXiv e-prints, (2014).