Same But Different: Distance Correlations Between Topological Summaries

Katharine Turner; Gard Spreemann

arXiv:1903.01051·math.AT·June 24, 2019


TL;DR

This paper explores how distance correlation can compare different topological summaries of data, highlighting its application in analyzing complex data structures across various metric spaces.

Contribution

It introduces the use of distance correlation for comparing topological summaries in different metric spaces, providing a non-parametric statistical tool for topological data analysis.

Findings

01

Distance correlation effectively compares topological summaries across different metric spaces.

02

Different topological summaries can yield varying statistical conclusions.

03

The method is applicable to various data models and scalar measures.

Abstract

Persistent homology allows us to create topological summaries of complex data. In order to analyse these statistically, we need to choose a topological summary and a relevant metric space in which this topological summary exists. While different summaries may contain the same information (as they come from the same persistence module), they can lead to different statistical conclusions since they lie in different metric spaces. The best choice of metric will often be application-specific. In this paper we discuss distance correlation, which is a non-parametric tool for comparing data sets that can lie in completely different metric spaces. In particular we calculate the distance correlation between different choices of topological summaries. We compare some different topological summaries for a variety of random models of underlying data via the distance correlation between the samples.…

Tables (4)

Table 1: Counterexample showing that the space of persistence diagrams with $W_{p}$ is not of negative type when $p<\ln(2)/\ln(4/3)$. The off-diagonal points come from Figure 2. The weight columns refer to the $\alpha$'s in inequality (2.2).

Diagram	Off-diagonal points	Weight
$x_{1}$	$\{a_{1}, b_{1}, a_{2}, d_{2}\}$	$1$
$x_{2}$	$\{a_{1}, c_{1}, a_{2}, d_{2}\}$	$1$
$x_{3}$	$\{b_{1}, d_{1}, a_{2}, d_{2}\}$	$1$
$x_{4}$	$\{c_{1}, d_{1}, a_{2}, d_{2}\}$	$1$
$x_{5}$	$\{a_{1}, b_{1}, b_{2}, c_{2}\}$	$1$
$x_{6}$	$\{a_{1}, c_{1}, b_{2}, c_{2}\}$	$1$
$x_{7}$	$\{b_{1}, d_{1}, b_{2}, c_{2}\}$	$1$
$x_{8}$	$\{c_{1}, d_{1}, b_{2}, c_{2}\}$	$1$

Table 2: Distances $d_{p}(x,x^{\prime})$ for $x\in X$ and $x^{\prime}=\{A_{1},A_{2},e_{1},e_{2}\}$. The off-diagonal points are those shown in Figure 2.

Same corner	Share an edge	Diag. opp. corners	Example diagram	No. of such $x \in X$	$d_{p}(x, x^{\prime})$
$2$	$0$	$0$	$\{A_{1}, A_{2}, e_{1}, e_{2}\}$	$1$	$0$
$1$	$1$	$0$	$\{A_{1}, B_{2}, e_{1}, e_{2}\}$	$4$	$1$
$1$	$0$	$1$	$\{A_{1}, D_{2}, e_{1}, e_{2}\}$	$2$	$2^{1/p}$
$0$	$2$	$0$	$\{B_{1}, C_{2}, e_{1}, e_{2}\}$	$4$	$2^{1/p}$
$0$	$1$	$1$	$\{B_{1}, D_{2}, e_{1}, e_{2}\}$	$4$	$3^{1/p}$
$0$	$0$	$2$	$\{D_{1}, D_{2}, e_{1}, e_{2}\}$	$1$	$4^{1/p}$

Table 3: (Square roots of) distance correlation between topological summaries and the parameter $\gamma$.

Topological summary	$dCov (∙, 𝒫)$
Persistence scale space kernel, $σ = 0.001$	$0.96$
$1$ -Wasserstein	$0.95$
$β_{1}$ with $L^{1}$	$0.95$
$β_{1}$ with $L^{2}$	$0.95$
$2$ -Wasserstein	$0.94$
Persistence landscape with $L^{\infty}$	$0.94$
Persistence scale space kernel, $σ = 0.01$	$0.93$
Persistence landscape with $L^{2}$	$0.92$
Persistence landscape with $L^{1}$	$0.92$
Sliced Wasserstein kernel, $σ = 1$	$0.66$
Persistence scale space kernel, $σ = 1$	$0.60$
Sliced Wasserstein kernel, $σ = 0.01$	$0.40$

Table 4: (Square roots of) distance correlation between elevation topological summaries and the terrain ruggedness index and the geodesic distance.

Summary (topological, apart from first two)	$dCor (∙, TRI)$	$dCor (∙, d_{geodesic})$
$TRI$	$1$	$0.72$
$d_{geodesic}$	$0.72$	$1$
Bottleneck	$0.62$	$0.52$
$2$ -Wasserstein	$0.92$	$0.74$
Persistence scale space kernel, $σ = 1$	$0.74$	$0.64$
Persistence scale space kernel, $σ = 10$	$0.75$	$0.63$
Persistence landscape with $L^{1}$	$0.73$	$0.61$
Persistence landscape with $L^{2}$	$0.72$	$0.59$
Persistence landscape with $L^{\infty}$	$0.62$	$0.52$
Sliced Wasserstein kernel, $σ = 1$	$0.44$	$0.55$
Sliced Wasserstein kernel, $σ = 0.01$	$0.44$	$0.55$
$β_{1}$ with $L^{1}$	$0.75$	$0.65$
$β_{1}$ with $L^{2}$	$0.77$	$0.63$

Equations (85)

H_{k}(a,b) := Z_{k}(K_{a}) / \bigl(B_{k}(K_{b}) \cap Z_{k}(K_{a})\bigr).

\rho_{X,Y} = \frac{\mathbb{E}[(X-\overline{X})(Y-\overline{Y})]}{\sigma_{X}\sigma_{Y}},

\tau := \frac{N_{\mathrm{s}}-N_{\mathrm{d}}}{n(n-1)/2} \quad\text{and}\quad \gamma := \frac{N_{\mathrm{s}}-N_{\mathrm{d}}}{N_{\mathrm{s}}+N_{\mathrm{d}}},

d_{\mu}(x,x^{\prime}) := d_{\mathcal{X}}(x,x^{\prime}) - a_{\mu}(x) - a_{\mu}(x^{\prime}) + D(\mu).

\operatorname{dcov}(\theta) = \int d_{\mu}(x,x^{\prime})\,d_{\nu}(y,y^{\prime})\,d\theta^{2}((x,y),(x^{\prime},y^{\prime})).

\operatorname{dvar}(\theta^{X}) = \int d_{\mu}(x,x^{\prime})^{2}\,d\theta^{2}((x,x^{\prime})),

\operatorname{dcor}(X,Y) = \frac{\operatorname{dcov}(X,Y)}{\sqrt{\operatorname{dvar}(\theta^{X})\operatorname{dvar}(\theta^{Y})}}.

\sqrt{\int d_{\mu}(x,x^{\prime})\,d_{\nu}(y,y^{\prime})\,d\theta^{2}((x,y),(x^{\prime},y^{\prime}))}.

\operatorname{dCov}(X,Y) = \sqrt{\operatorname{dcov}(X,Y)} = \sqrt{\int d_{\mu}(x,x^{\prime})\,d_{\nu}(y,y^{\prime})\,d\theta^{2}((x,y),(x^{\prime},y^{\prime}))}.

\operatorname{dCor}(X,Y) = \frac{\operatorname{dCov}(X,Y)}{\sqrt{\operatorname{dVar}(X)\operatorname{dVar}(Y)}} = \sqrt{\operatorname{dcor}(X,Y)}.

\operatorname{dcov}_{n} = \frac{1}{n^{2}}\sum_{k,l=1}^{n} A_{k,l}B_{k,l}

\operatorname{dcov}(\theta) = \int d_{\mu}(x,x^{\prime})\,d\theta_{X}^{2}(x,x^{\prime}) \int d_{\nu}(y,y^{\prime})\,d\theta_{Y}^{2}(y,y^{\prime}).

\sum_{i,j=1}^{n} \alpha_{i}\alpha_{j}\,d(x_{i},x_{j}) \leq 0.

\int d(x,x^{\prime})\,d(\mu_{1}-\mu_{2})^{2}(x,x^{\prime}) \leq 0.

\|(a,b)-\Delta\|_{p} = \inf_{t\in\mathbb{R}} \|(a,b)-(t,t)\|_{p} = 2^{\frac{1}{p}-1}\,|b-a|

d_{p}(X,Y) = \inf_{\phi:X\to Y \text{ bijection}} \Bigl(\sum_{x\in X}\|x-\phi(x)\|_{p}^{p}\Bigr)^{1/p}

d_{\infty}(X,Y) = \inf_{\phi:X\to Y \text{ bijection}} \sup_{x\in X}\|x-\phi(x)\|_{\infty}.

\inf_{\phi:X\to Y} \Bigl(\sum_{x\in X}\|x-\phi(x)\|_{q}^{p}\Bigr)^{1/p}

\begin{pmatrix}
0 & 2^{1/p} & 2^{1/p} & 2^{1/p} & 2^{1/p} & 4^{1/p} & 4^{1/p} & 4^{1/p} \\
2^{1/p} & 0 & 2^{1/p} & 2^{1/p} & 4^{1/p} & 2^{1/p} & 4^{1/p} & 4^{1/p} \\
2^{1/p} & 2^{1/p} & 0 & 2^{1/p} & 4^{1/p} & 4^{1/p} & 2^{1/p} & 4^{1/p} \\
2^{1/p} & 2^{1/p} & 2^{1/p} & 0 & 4^{1/p} & 4^{1/p} & 4^{1/p} & 2^{1/p} \\
2^{1/p} & 4^{1/p} & 4^{1/p} & 4^{1/p} & 0 & 2^{1/p} & 2^{1/p} & 2^{1/p} \\
4^{1/p} & 2^{1/p} & 4^{1/p} & 4^{1/p} & 2^{1/p} & 0 & 2^{1/p} & 2^{1/p} \\
4^{1/p} & 4^{1/p} & 2^{1/p} & 4^{1/p} & 2^{1/p} & 2^{1/p} & 0 & 2^{1/p} \\
4^{1/p} & 4^{1/p} & 4^{1/p} & 2^{1/p} & 2^{1/p} & 2^{1/p} & 2^{1/p} & 0
\end{pmatrix}

\sum_{i,j=1}^{8} d_{p}(x_{i},x_{j}) = \sum_{i,j=1}^{8} d_{p}(y_{i},y_{j}) = 32\cdot 2^{1/p} + 24\cdot 4^{1/p},

\sum_{i,j=1}^{8} d_{p}(x_{i},y_{j}) = \sum_{i,j=1}^{8} d_{p}(y_{i},x_{j}) = 64\cdot 2^{1/p}.

\sum_{i,j=1}^{8} d_{p}(x_{i},x_{j}) + \sum_{i,j=1}^{8} d_{p}(y_{i},y_{j}) - \sum_{i,j=1}^{8} d_{p}(x_{i},y_{j}) - \sum_{i,j=1}^{8} d_{p}(y_{i},x_{j})

= 64\cdot 2^{1/p} + 48\cdot 4^{1/p} - 128\cdot 2^{1/p}

= 48\cdot 4^{1/p} - 64\cdot 2^{1/p}.

32\cdot\bigl(4 + 6\cdot 2^{1/p} + 4\cdot 3^{1/p} + 4^{1/p}\bigr) - 32\cdot 16\cdot 8^{1/p}/2 > 0.

4(1/8)^{1/p} + 6(1/4)^{1/p} + 4(3/8)^{1/p} + (1/2)^{1/p} > 8.

4(1/8)^{1/p} + 6(1/4)^{1/p} + 4(3/8)^{1/p} + (1/2)^{1/p} > 4\cdot 0.42 + 6\cdot 0.56 + 4\cdot 0.66 + 0.74 = 8.42

\lambda_{k}(t) = \sup\{m\geq 0 \mid \beta^{t-m,\,t+m} \geq k\}.

\|\lambda\|_{p}^{p} = \sum_{k=1}^{\infty} \|\lambda_{k}\|_{p}^{p}

X_{1} = I_{[0,1)} \oplus I_{[3,4)},

X_{2} = I_{[1,2)} \oplus I_{[2,3)},


Taxonomy

Topics: Topological and Geometric Data Analysis · Data Management and Algorithms · Geochemistry and Geologic Mapping

Full text

Same But Different

Distance Correlations Between Topological Summaries

Katharine Turner and Gard Spreemann

Abstract

Persistent homology allows us to create topological summaries of complex data. In order to analyse these statistically, we need to choose a topological summary and a relevant metric space in which this topological summary exists. While different summaries may contain the same information (as they come from the same persistence module), they can lead to different statistical conclusions since they lie in different metric spaces. The best choice of metric will often be application-specific. In this paper we discuss distance correlation, which is a non-parametric tool for comparing data sets that can lie in completely different metric spaces. In particular we calculate the distance correlation between different choices of topological summaries. We compare some different topological summaries for a variety of random models of underlying data via the distance correlation between the samples. We also give examples of performing distance correlation between topological summaries and other scalar measures of interest, such as a paired random variable or a parameter of the random model used to generate the underlying data. This article is meant to be expository in style, and will include the definitions of standard statistical quantities in order to be accessible to non-statisticians.

1 Introduction

The development and application of statistical theory and methods within Topological Data Analysis (TDA) are still in their infancies. The main reason is that distributions of topological summaries are harder to study than distributions of real numbers, or of vectors. Complications arise both from the geometry of the spaces that the summaries lie in, and, more importantly, from the complete lack of nice parameterised families which one could expect the distributions of topological summaries to follow. Even when the distribution of filtrations of topological spaces is parametric, topological summaries do not necessarily preserve distributions in any meaningful way, so the resulting topological summaries will generally not be in the form of a tractable exponential family. Effectively none of the methods from basic statistics can be directly applied, at least not without significant caveats and great care. We should instead turn to the world of non-parametric statistics, in which the methods are usually distribution-free and can sometimes be applied to random elements lying in quite general metric spaces.

A quintessential example of the challenges faced when improving the statistical rigour in TDA is that of correlation. The Pearson correlation coefficient is the correlation, the “r-value”, taught in every introductory course on statistics. However, it is only defined for real-valued functions, and inference involving Pearson correlation often assumes the variables follow normal distributions. It is very much a method from parametric statistics. It measures the strength of the linear relationship between normally distributed random variables. Here the parametric families are the normal distributions with the mean and covariance as the parameters, and the correlation coefficient is a straightforward function of the covariance matrix. The Pearson correlation is very useful and appropriate if the distributions are normal. However, that is a very big “if”; and one that will rarely hold for topological summaries.

Thankfully for practitioners of TDA, correlation as a concept is not defined by the formula of the Pearson correlation coefficient, but rather should be thought of more philosophically as some quantity measuring the extent of interdependence of random variables. This research is the result of a treasure hunt within the field of non-parametric statistics for an appropriate notion of correlation applicable to topological summaries. Our finding was distance correlation. Effectively it considers the correlations between pairwise distances (appropriately recentred) instead of the raw values. This makes it applicable to distributions over any pair of metric spaces.

Distance correlation is a distribution-free method and exemplifies a non-parametric approach. It can detect relationships between variables that are not linear, and not even monotonic. If the variables are independent, then the distance correlation is zero. In the other direction, if the metric spaces are of strong negative type, then a distance correlation of zero implies the variables are independent. This is true for any joint distribution. In contrast, we can only conclude from a Pearson correlation coefficient of zero that the variables are independent if we assume that the joint distribution is bivariate normal.

There are two take-home messages. The first is that distance correlation is a useful tool in the statistical analysis of topological summaries. The current exposition serves as an introduction to the potential of distance correlation for statistical analysis in TDA. In section 6 we outline some further opportunities that distance correlation can offer. The second message is the simple observation that the choice of topological summary statistic matters. A responsible topological data analyst should carefully consider which is the most appropriate topological summary. A better choice is one where the pairwise distances better reflect the differences of interest in the raw data. This will be domain- and application-specific. There may be other considerations for the choice of topological summary in terms of the statistical methods available, computational complexity and inference power, but this is beyond the scope of this discussion.

2 Background theory

We summarize the relevant basic concepts from TDA, the statistical analysis of topological summaries, and of metric spaces of strong negative type.

2.1 Topological summary statistics

TDA is usually concerned with analysing complex and hard-to-visualize data. This data may have complicated geometric or topological structure, and one creates a family of topological spaces which can then be studied using algebraic-topological methods in order to reveal information about said structure. We call a family of spaces $(K_{a})_{a\in\mathbb{R}}$ such that $K_{a}\subset K_{b}$ whenever $a\leq b$ a filtration. The inclusion $K_{a}\subset K_{b}$ for $a\leq b$ induces a homomorphism $H_{k}(K_{a})\to H_{k}(K_{b})$ between the homology groups. The persistent homology group is the image of $H_{k}(K_{a})$ in $H_{k}(K_{b})$ . It encodes the $k$ -cycles in $K_{a}$ that are independent with respect to boundaries in $K_{b}$ , i.e.

$$H_{k}(a,b):=Z_{k}(K_{a})/\bigl(B_{k}(K_{b})\cap Z_{k}(K_{a})\bigr),$$

where $Z_{k}$ and $B_{k}$ are, respectively, the kernel and the image of the $k$ ’th boundary map in the given homology theory.

Under very general assumptions on the filtration, and assuming one works with coefficients in a field, persistent homology is fully described by two equivalent representations: the barcode and the persistence diagram. The barcode is a collection of intervals $[b,d)$, each representing the first appearance (“birth”), $b$, and first disappearance (“death”), $d$, of a persistent homology class. This collection of intervals satisfies the condition that for every $b\leq d$, the number of intervals containing $[b,d)$ is $\dim(H_{k}(b,d))$. The corresponding persistence diagram is the multi-set of points in the plane with birth filtrations as one coordinate and death filtrations as the other. (Footnote 1: We will consider only persistent homology with intervals with finite death.)

A summary statistic is an object that is used to summarise a set of observations, in order to communicate the largest amount of information as simply as possible. Simple examples in the case of real-valued distributions include the mean, the variance, the median and the box plot. Many summary statistics in TDA are created via persistent homology. We have a filtration of topological spaces built from our observations, and by applying persistent homology we can summarise this filtration in terms of the evolution of homology. Notably, one creates a summary from a single complex object, whether it be a point cloud, a graph, etc.

There is now a wide array of topological summaries that can be computed directly from a persistence diagram or barcode. Each of these is a different expression of the persistent homology in the form of a topological summary statistic. The practitioner wanting to perform statistical analysis using topological summaries needs to choose which type of summary to represent their data with, as well as the metric on the space where that summary takes values. For some of these topological summaries there are parameters to choose which play roles like bandwidth, and some depend on choices like norm order (for us, $p\in\{1,2,\infty\}$), akin to choosing $p$ in the $L^{p}$ distance for function spaces. In addition, there are topological summaries not based on persistent homology, such as simplex count functions.

In this paper we will consider a range of different topological summaries and distances defined on them, namely:

• Persistence diagrams, with Wasserstein distances for $p=1,2,\infty$

• Persistence landscapes [4], with $L^{p}$ distances for $p=1,2,\infty$

• Persistence scale space kernel [19] for two different bandwidths, with $L^{2}$ distances

• Betti and Euler characteristic curves, with $L^{p}$ distances for $p=1,2$

• Sliced Wasserstein kernel [6] distance

Note that all of these topological summaries can be computed from the persistence diagram and that with the exception of the Betti and Euler curves, which collapse information, they all are distance functions of the information provided in the original persistence diagram. In this sense they are the “same”. It is merely the metric space structure that is different.

It is worth noting that the above list is by no means an exhaustive list of topological summaries. Other examples include the persistent homology rank function [21], the accumulation persistence function [3], the persistence weighted Gaussian kernel [12], the persistence Fisher kernel [13], using tangent vectors from the mean of the square root framework with principal geodesic analysis [1], using points in the persistence diagrams as roots of a complex polynomial for concatenated-coefficient vector representations [9], or using distance matrices of points in persistence diagrams for sorted-entry vector representations [7]. Notably, most of these are functional summaries with an $L^{2}$ metric or lie in a reproducing kernel Hilbert space. Analogous arguments to those for the persistence scale space discussed later could be used to show that many of them lie in metric spaces of strong negative type as a corollary of being separable Hilbert spaces.

2.2 Distance correlation

A random element is a map from a probability space $\Omega$ to a set $\mathcal{X}$ . Its distribution is the pushforward measure on $\mathcal{X}$ . Given two random elements $X:\Omega\to\mathcal{X}$ and $Y:\Omega\to\mathcal{Y}$ , one can consider the paired samples $(X,Y):\Omega\to\mathcal{X}\times\mathcal{Y}$ . This has a joint distribution on $\mathcal{X}\times\mathcal{Y}$ . The marginal distributions for this joint distribution are the pushforwards via the projection maps onto each of the coordinates. An important notion in statistics is whether two variables are independent. This occurs precisely when the joint distribution is the product of the marginal distributions.

The most common measure of correlation between two random variables is the Pearson correlation coefficient. It is defined as the covariance of the two variables divided by the product of their standard deviations. For paired random variables $X,Y$ , the Pearson correlation coefficient is defined by

$$\rho_{X,Y}=\frac{\mathbb{E}[(X-\overline{X})(Y-\overline{Y})]}{\sigma_{X}\sigma_{Y}},$$

where $\overline{X},\overline{Y}$ are the means of $X$ and $Y$, $\sigma_{X},\sigma_{Y}$ their standard deviations, and $\mathbb{E}$ denotes expectation. Note that if $X$ and $Y$ are independent, then $\mathbb{E}[(X-\overline{X})(Y-\overline{Y})]=\mathbb{E}[(X-\overline{X})]\mathbb{E}[(Y-\overline{Y})]=0$. This implies that a non-zero correlation is evidence of a lack of independence, and hence the variables are related somehow (though possibly only indirectly).

The Pearson correlation is designed to analyse bivariate Gaussian distributions. In this case, a correlation of $0$ implies that the variables are independent. Furthermore, Pearson correlation determines the ellipticity of the distribution. We can calculate the Pearson correlation for more general distributions, but in that case it detects linear relationships, and nonlinear relationships can be lost. Some examples of the Pearson correlation coefficient are illustrated in Figure 1. Any test using the correlation coefficient (such as significance testing) depends on the bivariate Gaussian assumption.
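To make the loss of nonlinear relationships concrete, here is a minimal sketch in plain Python. The data (an evenly spaced grid and its square) and the helper name `pearson` are illustrative assumptions, not taken from the paper or from Figure 1.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


# Illustrative data: 21 evenly spaced points, symmetric about zero.
xs = [i / 10 for i in range(-10, 11)]
linear = [2 * x + 1 for x in xs]  # exact linear relationship
parabola = [x * x for x in xs]    # exact, but non-monotone, relationship

r_linear = pearson(xs, linear)      # equals 1 up to rounding
r_parabola = pearson(xs, parabola)  # equals 0 up to rounding, despite full dependence
```

Here `parabola` is a deterministic function of `xs`, yet the coefficient vanishes because the grid is symmetric about zero, so the covariance between $x$ and $x^{2}$ cancels exactly.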

In parametric statistics we make some assumption about the parameters (defining properties) of the population distribution(s) from which the data are drawn, while in non-parametric statistics we do not make such assumptions. Given the lack of parametric families of topological summary statistics, it makes sense to consider non-parametric methods. One option when studying real-valued random variables which are not normally distributed, or when the relationship between the variables is not linear, is to use the rankings of the samples. It should also be mentioned that such ranking correlations are designed to detect monotone relationships, which, although more general than the linearity of Pearson’s correlation, is still a significant restriction. There are multiple ways to measure the similarity of the orderings of the data when ranked by each of the quantities. The most common is the Spearman rank correlation method, which is the Pearson correlation of the ranks. Alternatives are Kendall’s $\tau$ and Goodman and Kruskal’s $\gamma$, which measure pairwise concordance. We say that a pair of samples is concordant if the cases are ranked in the same order for both variables, and reversed if the orders differ. We drop any pair of samples where the values of either variable are equal. We then define

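$$\tau:=\frac{N_{\mathrm{s}}-N_{\mathrm{d}}}{n(n-1)/2}\quad\text{and}\quad\gamma:=\frac{N_{\mathrm{s}}-N_{\mathrm{d}}}{N_{\mathrm{s}}+N_{\mathrm{d}}},$$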

where $N_{\mathrm{s}}$ is the number of concordant pairs, $N_{\mathrm{d}}$ is the number of reversed pairs, and $n$ is the total number of samples. The only difference between these rank correlations is the treatment of pairs with equal rank; $\tau$ penalises ties while $\gamma$ does not. While these methods are distribution-free, they are not suitable to be directly applied to topological summaries, as the summaries do not lie in spaces with an order. We cannot rank the samples, and thus we cannot apply tests that use the ranks.
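The two concordance-based coefficients can be sketched as follows. This assumes the standard conventions, in which Kendall's $\tau$ divides by the total number of pairs $n(n-1)/2$ (so tied pairs lower its magnitude) and Goodman and Kruskal's $\gamma$ divides by $N_{\mathrm{s}}+N_{\mathrm{d}}$ (so tied pairs are ignored). The function names and the small data set are hypothetical.

```python
from itertools import combinations
from typing import Sequence, Tuple


def concordance_counts(xs: Sequence[float], ys: Sequence[float]) -> Tuple[int, int]:
    """Return (N_s, N_d): the numbers of concordant and reversed pairs."""
    n_s = n_d = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            n_s += 1
        elif sign < 0:
            n_d += 1
        # sign == 0 means a tie in at least one variable: the pair is dropped
    return n_s, n_d


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """(N_s - N_d) over the number of all pairs, so ties shrink the value."""
    n_s, n_d = concordance_counts(xs, ys)
    n = len(xs)
    return (n_s - n_d) / (n * (n - 1) / 2)


def goodman_kruskal_gamma(xs: Sequence[float], ys: Sequence[float]) -> float:
    """(N_s - N_d) over the number of untied pairs, so ties are ignored."""
    n_s, n_d = concordance_counts(xs, ys)
    return (n_s - n_d) / (n_s + n_d)


xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 4, 4]  # the last two samples are tied in y

counts = concordance_counts(xs, ys)    # (8, 1): one of the 10 pairs is tied
tau = kendall_tau(xs, ys)              # 7 / 10
gamma = goodman_kruskal_gamma(xs, ys)  # 7 / 9
```

Both functions need an ordering on the values, which is exactly what topological summaries lack.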

A new, non-parametric alternative is to work with the pairwise distances. Given paired samples $(X,Y)=\{(x_{i},y_{i})\mid i=1,...,n\}$ , where the $x_{i}$ and $y_{i}$ lie in metric spaces $\mathcal{X}$ and $\mathcal{Y}$ , respectively, we can ask what the joint variability of the pairwise distances is (i.e. how related $d_{\mathcal{Y}}(y_{i},y_{j})$ is to $d_{\mathcal{X}}(x_{i},x_{j})$ ). The statistical tools of distance covariance and distance correlation are apt for this purpose. The notion was introduced in [23] for the case of samples lying in Euclidean space.

Distance correlation can be applied to distributions of samples lying in more general metric spaces. It can detect relationships between variables that are not linear, and not even monotonic, as can be seen in Figure 1. There are strong theoretical results about independence. If the variables are independent, then the distance correlation is zero. In the other direction, if the metric spaces are separable and of strong negative type, then a distance correlation of zero implies the variables are independent. This is discussed in more detail in section 2.3.

In contrast, we can only conclude from the Pearson correlation coefficient being zero that the variables are independent when we can assume that the joint distribution is bivariate normal. This difference between Pearson correlation and distance correlation is illustrated with real-valued random variables in Figure 1.

The formal definitions follow.

Definition 2.1.

Let $X$ be a random element taking values in a connected metric space $(\mathcal{X},d_{\mathcal{X}})$ with distribution $\mu$. For $x\in\mathcal{X}$ we call $\mathbb{E}[d_{\mathcal{X}}(x,X)]$ the expected distance of $X$ to $x$, and denote it by $a_{\mu}(x)$. We say that $X$ has finite first moment if for any $x\in\mathcal{X}$ the expected distance to $x$ is finite. In this case we set $D(\mu):=\mathbb{E}[a_{\mu}(X)]$. For $X$ with finite first moment we define its doubly centred distance function as

$$d_{\mu}(x,x^{\prime}):=d_{\mathcal{X}}(x,x^{\prime})-a_{\mu}(x)-a_{\mu}(x^{\prime})+D(\mu).$$

It is worth observing that $d_{\mu}$ is not a distance function. Lyons showed [15] that $a_{\mu}(x)>D(\mu)/2$ for all $x$ as long as the support of $\mu$ contains at least two points. This implies $d_{\mu}(x,x)<0$ for all $x$ .
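As a small worked example of the doubly centred distance, the sketch below evaluates $d_{\mu}$ for a uniform distribution on three points of the real line. The support, the metric and all function names are illustrative assumptions.

```python
from typing import Callable, List, Sequence

Metric = Callable[[float, float], float]


def expected_distance(x: float, support: Sequence[float], dist: Metric) -> float:
    """a_mu(x): mean distance from x to a point drawn uniformly from the support."""
    return sum(dist(x, s) for s in support) / len(support)


def doubly_centred(x: float, y: float, support: Sequence[float], dist: Metric) -> float:
    """d_mu(x, y) = d(x, y) - a_mu(x) - a_mu(y) + D(mu) for the uniform distribution."""
    # D(mu) is the mean of a_mu over the support.
    big_d = sum(expected_distance(s, support, dist) for s in support) / len(support)
    return (dist(x, y)
            - expected_distance(x, support, dist)
            - expected_distance(y, support, dist)
            + big_d)


def abs_diff(x: float, y: float) -> float:
    return abs(x - y)


support: List[float] = [0.0, 1.0, 3.0]
diagonal = [doubly_centred(s, s, support, abs_diff) for s in support]
# diagonal is approximately [-4/3, -2/3, -2]: every d_mu(x, x) is strictly negative

total = sum(doubly_centred(s, t, support, abs_diff) for s in support for t in support)
# total is 0 up to rounding: d_mu integrates to zero against mu x mu
```

Here $a_{\mu}$ takes the values $4/3$, $1$ and $5/3$, and $D(\mu)=4/3$, so each $a_{\mu}(x)$ exceeds $D(\mu)/2=2/3$, in line with the bound quoted from [15].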

Definition 2.2.

Let $\mathcal{X}$ and $\mathcal{Y}$ be metric spaces. Let $\theta=(X,Y)$ be a probability distribution over the product space $\mathcal{X}\times\mathcal{Y}$ with marginals $\mu$ and $\nu$ such that $X$ and $Y$ both have finite first moment. We define the distance covariance of $\theta$ as

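$$\operatorname{dcov}(\theta)=\int d_{\mu}(x,x^{\prime})\,d_{\nu}(y,y^{\prime})\,d\theta^{2}((x,y),(x^{\prime},y^{\prime})).$$

The distance variance is the special case in which the joint distribution consists of two identical copies, $\theta^{X}=(X,X)$. Here we have

$$\operatorname{dvar}(\theta^{X})=\int d_{\mu}(x,x^{\prime})^{2}\,d\theta^{2}((x,x^{\prime})),$$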

which is always non-negative, and zero only in the case of a distribution supported on a single point.

The distance correlation of $\theta=(X,Y)$ is defined as

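$$\operatorname{dcor}(X,Y)=\frac{\operatorname{dcov}(X,Y)}{\sqrt{\operatorname{dvar}(\theta^{X})\operatorname{dvar}(\theta^{Y})}}.$$

Remark.

There are some variations of notation with regard to whether to include a square root in the definition of distance covariance and correlation. In the introduction of distance correlation in [23], the authors restricted their analysis to Euclidean spaces. Euclidean spaces are metric spaces of negative type, and such spaces have the property that the distance correlation is always non-negative. They could thus define the distance covariance as

$$\sqrt{\int d_{\mu}(x,x^{\prime})\,d_{\nu}(y,y^{\prime})\,d\theta^{2}((x,y),(x^{\prime},y^{\prime}))}.$$

We will follow the notation of [15] and use $\operatorname{dCov}$ to denote the square root of $\operatorname{dcov}$, i.e.

$$\operatorname{dCov}(X,Y)=\sqrt{\operatorname{dcov}(X,Y)}=\sqrt{\int d_{\mu}(x,x^{\prime})\,d_{\nu}(y,y^{\prime})\,d\theta^{2}((x,y),(x^{\prime},y^{\prime}))}.$$

We will also use $\operatorname{dVar}$ as the square root of the distance variance and $\operatorname{dCor}$ to denote the square root of the distance correlation, i.e.

$$\operatorname{dCor}(X,Y)=\frac{\operatorname{dCov}(X,Y)}{\sqrt{\operatorname{dVar}(X)\operatorname{dVar}(Y)}}=\sqrt{\operatorname{dcor}(X,Y)}.$$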

In the simulations and calculations involving topological summaries, it turns out that all the values of the distance correlation are non-negative, even for those involving spaces that are not of negative type.

Given a set of paired samples drawn from a joint distribution, we can compute a sample distance covariance. This is an estimator of the distance covariance of the joint distribution from which the paired samples were taken.

The estimation of the distance covariance of a joint distribution by sample distance covariances is reasonable. This means that if $\theta_{n}$ is the sampled joint distribution from $n$ i.i.d. samples of $\theta$, then $\operatorname{dcov}(\theta_{n})\rightarrow\operatorname{dcov}(\theta)$ as $n\to\infty$ with probability $1$; see Proposition 2.6 in [15]. This justifies the approximation of the distance correlation via simulations, which is particularly important when dealing with distributions for which there is no closed-form expression, as is usually the case when dealing with topological summaries.

The following procedure computes the sample distance covariance between paired samples $(X,Y)=\{(x_{i},y_{i})\mid i=1,...,n\}$ , which we denote $\operatorname{dcov}_{n}(X,Y)$ :

1. Compute the pairwise distance matrices $a=(a_{i,j})_{i,j}$, $b=(b_{i,j})_{i,j}$ with $a_{i,j}=d_{\mathcal{X}}(x_{i},x_{j})$ and $b_{i,j}=d_{\mathcal{Y}}(y_{i},y_{j})$.

2. Compute the means of each row and column in $a$ and $b$ as well as the total means of the matrices. Let $\bar{a}^{i}$ and $\bar{b}^{i}$ denote the row means and $\bar{a}_{j}$ and $\bar{b}_{j}$ the column means. Let $\bar{a}$ and $\bar{b}$ denote the total matrix means.

3. Compute doubly centered matrices $(A_{k,l})_{k,l}$ and $(B_{k,l})_{k,l}$ with $A_{k,l}=a_{k,l}-\bar{a}^{k}-\bar{a}_{l}+\bar{a}$ and $B_{k,l}=b_{k,l}-\bar{b}^{k}-\bar{b}_{l}+\bar{b}$.

4. The sample distance covariance is

$$\operatorname{dcov}_{n}(X,Y)=\frac{1}{n^{2}}\sum_{k,l=1}^{n}A_{k,l}B_{k,l}.$$

Note that the matrices $A$ and $B$ have the property that all rows and columns sum to zero.
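The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the normalisation assumes $\operatorname{dcor}=\operatorname{dcov}/\sqrt{\operatorname{dvar}\cdot\operatorname{dvar}}$ as in [15], and clipping negative covariances to zero before the square root is a convention chosen here.

```python
import math
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")
T = TypeVar("T")
Matrix = List[List[float]]


def distance_matrix(points: Sequence[S], dist: Callable[[S, S], float]) -> Matrix:
    """Step 1: pairwise distances in the given metric."""
    return [[dist(p, q) for q in points] for p in points]


def double_centre(a: Matrix) -> Matrix:
    """Steps 2 and 3: subtract row and column means, add back the total mean."""
    n = len(a)
    row_means = [sum(row) / n for row in a]
    col_means = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[a[k][l] - row_means[k] - col_means[l] + total_mean for l in range(n)]
            for k in range(n)]


def sample_dcov(a: Matrix, b: Matrix) -> float:
    """Step 4: mean of the entrywise product of the doubly centred matrices."""
    n = len(a)
    big_a, big_b = double_centre(a), double_centre(b)
    return sum(big_a[k][l] * big_b[k][l] for k in range(n) for l in range(n)) / n ** 2


def sample_dcor_root(xs: Sequence[S], ys: Sequence[T],
                     dist_x: Callable[[S, S], float],
                     dist_y: Callable[[T, T], float]) -> float:
    """Square root of the sample distance correlation (dCor in the notation above)."""
    a = distance_matrix(xs, dist_x)
    b = distance_matrix(ys, dist_y)
    denom = math.sqrt(sample_dcov(a, a) * sample_dcov(b, b))
    if denom == 0.0:
        return 0.0  # degenerate sample (all points coincide); convention chosen here
    # Assumption: negative values (possible outside negative type) are clipped to 0.
    return math.sqrt(max(sample_dcov(a, b), 0.0) / denom)


def abs_diff(x: float, y: float) -> float:
    return abs(x - y)


xs = [i / 10 for i in range(-10, 11)]
parabola = [x * x for x in xs]

centred = double_centre(distance_matrix(xs, abs_diff))
self_dcor = sample_dcor_root(xs, xs, abs_diff, abs_diff)            # 1 up to rounding
parabola_dcor = sample_dcor_root(xs, parabola, abs_diff, abs_diff)  # clearly positive
```

Because only the two distance functions are passed in, the same code applies unchanged when `xs` and `ys` are persistence diagrams, landscapes or any other summaries for which a metric is available.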

2.3 Metric spaces of strong negative type

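A straightforward application of the definition shows that the distance correlation of a product measure is always zero. To see this, observe that when $\theta$ is a product of $\theta_{X}$ and $\theta_{Y}$, then

$$\operatorname{dcov}(\theta)=\int d_{\mu}(x,x^{\prime})\,d_{\nu}(y,y^{\prime})\,d\theta^{2}((x,y),(x^{\prime},y^{\prime}))=\int d_{\mu}(x,x^{\prime})\,d\theta_{X}^{2}(x,x^{\prime})\int d_{\nu}(y,y^{\prime})\,d\theta_{Y}^{2}(y,y^{\prime}).$$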

By construction of $d_{\mu}$ and $d_{\nu}$ , we have $\int d_{\mu}(x,x^{\prime})d\theta^{2}_{X}(x,x^{\prime})=0=\int d_{\nu}(y,y^{\prime})d\theta^{2}_{Y}(y,y^{\prime})$ . The converse of this statement holds under conditions on the metric spaces the distributions are over (not the distributions themselves).
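The vanishing of the distance covariance for a product measure can be checked numerically. In the sketch below the paired sample consists of every combination $(x_{i},y_{j})$, so its empirical joint distribution is exactly the product of its marginals. The values and function names are illustrative assumptions.

```python
from itertools import product
from typing import List, Tuple

Matrix = List[List[float]]


def double_centre(a: Matrix) -> Matrix:
    """Subtract row and column means and add back the total mean."""
    n = len(a)
    row_means = [sum(row) / n for row in a]
    col_means = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[a[k][l] - row_means[k] - col_means[l] + total_mean for l in range(n)]
            for k in range(n)]


def sample_dcov(a: Matrix, b: Matrix) -> float:
    """Mean of the entrywise product of the doubly centred distance matrices."""
    n = len(a)
    big_a, big_b = double_centre(a), double_centre(b)
    return sum(big_a[k][l] * big_b[k][l] for k in range(n) for l in range(n)) / n ** 2


def paired_dcov(pairs: List[Tuple[float, float]]) -> float:
    """Sample distance covariance of real-valued pairs under |.| in each coordinate."""
    a = [[abs(p[0] - q[0]) for q in pairs] for p in pairs]
    b = [[abs(p[1] - q[1]) for q in pairs] for p in pairs]
    return sample_dcov(a, b)


xs = [0.0, 1.0, 4.0]
ys = [0.0, 2.0, 3.0, 7.0]

# All 12 combinations: the empirical joint distribution is a product measure.
independent_dcov = paired_dcov(list(product(xs, ys)))  # 0 up to rounding

# For contrast, a fully dependent pairing y = 2x.
dependent_dcov = paired_dcov([(x, 2 * x) for x in xs])  # strictly positive
```

In the product case the doubly centred matrix of the $x$-distances depends only on the indices $(i,i^{\prime})$ and that of the $y$-distances only on $(j,j^{\prime})$, so the double sum factors into two sums that are each zero, mirroring the integral argument.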

Definition 2.3.

A metric space $(X,d)$ has negative type if for all $x_{1},\ldots,x_{n}\in X$ and $\alpha_{1},\ldots,\alpha_{n}\in\mathbb{R}$ with $\sum_{i}\alpha_{i}=0$

$$\sum_{i,j=1}^{n}\alpha_{i}\alpha_{j}\,d(x_{i},x_{j})\leq 0.\tag{2.2}$$

For spaces of negative type it is always true that the distance covariance is non-negative [15]. We have further nice properties when the metric space is of strong negative type.
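The negative type inequality can be probed numerically. The sketch below evaluates the weighted sum for random zero-sum weights on points in the Euclidean plane, and, for contrast, on the shortest-path metric of the complete bipartite graph $K_{2,3}$. The latter is a small example chosen here for illustration and does not appear in the text; all names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")


def negative_type_sum(points: Sequence[P], weights: Sequence[float],
                      dist: Callable[[P, P], float]) -> float:
    """Sum over all ordered pairs of alpha_i * alpha_j * d(x_i, x_j)."""
    return sum(wi * wj * dist(p, q)
               for p, wi in zip(points, weights)
               for q, wj in zip(points, weights))


def zero_sum_weights(n: int, rng: random.Random) -> List[float]:
    """Random real weights, recentred so that they sum to zero."""
    raw = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    mean = sum(raw) / n
    return [w - mean for w in raw]


rng = random.Random(0)
plane_points = [(rng.random(), rng.random()) for _ in range(6)]
euclidean_sums = [negative_type_sum(plane_points, zero_sum_weights(6, rng), math.dist)
                  for _ in range(100)]
# Every entry is <= 0 up to rounding, as expected for a Euclidean space.


def k23_dist(p: str, q: str) -> float:
    """Shortest-path metric on K_{2,3}: 1 across the two parts, 2 within a part."""
    if p == q:
        return 0.0
    return 2.0 if p[0] == q[0] else 1.0


k23_vertices = ["u1", "u2", "v1", "v2", "v3"]
k23_weights = [3.0, 3.0, -2.0, -2.0, -2.0]  # sums to zero
k23_sum = negative_type_sum(k23_vertices, k23_weights, k23_dist)  # 36 - 72 + 48 = 12
```

A single positive value such as `k23_sum` is enough to show that a metric space is not of negative type, whereas non-positive values on random weights are only evidence for negative type, not a proof.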

Definition 2.4.

A metric space has strict negative type if it is a space of negative type where equality in (2.2) implies that the $\alpha_{i}$ are all zero. By extending to distributions of infinite support we get the definition of strong negative type: A metric space $(\mathcal{X},d)$ has strong negative type if it has negative type and for all probability measures $\mu_{1},\mu_{2}$ we have

[TABLE]

Lyons [15] used the notion to characterize the spaces where one can test for independence of random variables using distance correlation.

Theorem 2.5 (Lyons [15]).

Suppose that $\mathcal{X}$ and $\mathcal{Y}$ are separable metric spaces of strong negative type and that $\theta$ is a probability measure on $\mathcal{X}\times\mathcal{Y}$ whose marginals have finite first moment. If $\operatorname{dcov}(\theta)=0$ , then $\theta$ is a product measure.

This means that, given a pair of random variables $(X,Y)$ with joint distribution $\theta$ , we can test for independence by computing $\operatorname{dcov}(\theta)$ : the variables are independent if $\operatorname{dcov}(\theta)=0$ , and not independent if $\operatorname{dcov}(\theta)>0$ . The challenge is then how to implement such a test given only a sample distance correlation, since we expect the sample distance correlation to be non-zero even when the variables are independent.

There is a range of spaces that are proven to be of strong negative type, including all separable Hilbert spaces.

Theorem 2.6 (Lyons [15]).

Every separable Hilbert space is of strong negative type. Moreover, if $(X,d)$ has negative type, then $(X,d^{r})$ has strong negative type when $0<r<1$ .

A list of metric spaces of negative type appears as Theorem 3.6 of [16]; in particular, this includes all $L^{p}$ spaces for $1\leq p\leq 2$ . On the other hand, $\mathbb{R}^{n}$ with the $l^{p}$ -metric is not of negative type whenever $3<n<\infty$ and $2<p<\infty$ .

The distance correlation still contains useful information even when the spaces are not of strong negative type. It is just more powerful as a test statistic when the spaces are of strong negative type. This is analogous to how the Pearson correlation coefficient still can be evidence of a relationship between two variables even when the joint distribution is not Gaussian. Here the Pearson correlation coefficient is detecting linear relationships. It is an open problem to characterise which relationships are, and which are not, detected by the distance correlation in spaces that are not of strong negative type.

Distance correlation lends itself to non-parametric methods. One possibility is to combine it with permutation tests to construct $p$ -values for independence. Permutation tests construct a sampling distribution by resampling the observed data. We can permute the observed data without replacement to create a null distribution (in this case a distribution of distance correlation values under the assumption that the random variables are independent). The use and exploration of permutation tests in relation to distance correlation is beyond the scope of this paper. We direct the interested reader to section 6 for more details.
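To make the idea concrete, here is a minimal sketch of such a permutation test. It is an illustration under stated assumptions and not a procedure used in this paper: the helper names are hypothetical, the distance correlation uses the common $1/n^{2}$ normalisation, and the $p$-value uses the usual "add one" convention.

```python
import math
import random

def _centre(d):
    """Doubly centre a square distance matrix."""
    n = len(d)
    rows = [sum(r) / n for r in d]
    cols = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(rows) / n
    return [[d[k][l] - rows[k] - cols[l] + tot for l in range(n)]
            for k in range(n)]

def _dcov_sq(A, B):
    """Squared sample distance covariance of two centred matrices."""
    n = len(A)
    return sum(A[k][l] * B[k][l] for k in range(n) for l in range(n)) / n ** 2

def dcor(a, b):
    """Sample distance correlation of two distance matrices."""
    A, B = _centre(a), _centre(b)
    denom = math.sqrt(_dcov_sq(A, A) * _dcov_sq(B, B))
    return math.sqrt(max(_dcov_sq(A, B), 0.0) / denom) if denom > 0 else 0.0

def permutation_p_value(a, b, n_perm=200, seed=0):
    """Estimate a p-value for independence by relabelling the second sample."""
    rng = random.Random(seed)
    n = len(a)
    observed = dcor(a, b)
    exceed = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)  # permutation without replacement
        b_perm = [[b[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if dcor(a, b_perm) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Toy example with a strongly dependent pair: y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]
a = [[abs(u - v) for v in xs] for u in xs]
b = [[abs(u - v) for v in ys] for u in ys]
p_value = permutation_p_value(a, b)
```

Permuting the rows and columns of one distance matrix simultaneously corresponds to breaking the pairing between the two samples while leaving each marginal sample unchanged.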

3 A veritable zoo of topological summaries, some of which are of strong negative type

Persistent homology has become a very important tool in TDA. Certainly there are many choices that are made in any persistent homology analysis, with much of the focus being on the filtration. In this paper we want to highlight another choice, namely the metric space structure to put on the topological summary of choice. Examples include persistence diagrams with bottleneck and Wasserstein distances, persistence landscapes or rank function with an $L^{p}$ distance, or one of the many kernel representations. The choice of which topological summary we use to represent persistent homology, and the choice of metric on this space of topological summaries, will affect any statistical analysis and will influence whether or not the summary captures the information that is of relevance to the application.

For spaces of strong negative type, distance correlation is known to have additional nice properties. As a rule, function spaces with an $L^{2}$ metric and those lying in a reproducing kernel Hilbert space are of strong negative type. This implies that the Euler characteristic and Betti curves with the $L^{2}$ metric are of strong negative type, and that the space of persistence scale space kernels is of strong negative type. In this section we will characterise which of the spaces of persistence landscapes are of strong negative type and show that the space of persistence diagrams is never of strong negative type. The main results are as follows.

Theorem (Theorem 3.2).

The space of persistence diagrams is not of negative type under the bottleneck metric or under any of the Wasserstein metrics.

Theorem (Theorem 3.4).

(a) The space of persistence landscapes with the $L^{2}$ norm is of strong negative type.

(b) The space of persistence landscapes with the $L^{p}$ norm is of negative type when $1\leq p\leq 2$ .

(c) The space of persistence landscapes with the $L^{1}$ norm is not of strong negative type, even when restricting to persistence landscapes that arise from persistence diagrams.

(d) The space of persistence landscapes with the $L^{\infty}$ norm is not of negative type, even when restricting to persistence landscapes that arise from persistence diagrams.

It is an open question whether the sliced Wasserstein kernel distance is of strong negative type; it is if the associated reproducing kernel Hilbert space is separable.

3.1 Betti and Euler characteristic curves

Some of the first topological summaries often considered for parameterised families of topological spaces ( $\{K_{a}\}$ ) are the Betti and the Euler characteristic curves, which we denote by $\beta_{k}:\mathbb{R}\to\mathbb{N}_{0}$ and $\chi:\mathbb{R}\to\mathbb{Z}$ . These are integer valued functions with $\beta_{k}(a)=\dim H_{k}(K_{a})$ and $\chi(a)=\chi(K_{a})$ . From the point of view of barcodes, one thinks of $\beta_{k}(a)$ as the number of bars that contain the point $a$ . The Euler curve is then the alternating sum of the Betti curves, $\chi(a)=\sum_{k=0}^{\infty}(-1)^{k}\beta_{k}(a)$ , as one would expect.

Clearly the Betti and Euler curves contain less information than the persistence diagrams; in particular, the Betti curves can be thought of as encoding point-wise homological information without considering the induced maps $H_{k}(a)\to H_{k}(b)$ .

Since $\beta_{k}$ and $\chi$ are functions, we can consider functional distances between them. In this paper we consider both $L^{1}$ and $L^{2}$ distances. Since $L^{2}(\mathbb{R})$ is a separable Hilbert space, it is of strong negative type. In comparison $L^{1}(\mathbb{R})$ is of negative type, but not of strict negative type (see [15]). For an explicit counterexample, the reader can modify the one used for the $p=1$ case in section 3.3.
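The barcode description of these curves translates directly into code. The sketch below is only an illustration: the function names are hypothetical, bars are assumed to be half-open intervals $[\text{birth},\text{death})$, and the $L^{p}$ distance is approximated by a Riemann sum on a uniform grid.

```python
def betti_curve(bars, grid):
    """Number of half-open bars [birth, death) containing each grid value."""
    return [sum(1 for (b, d) in bars if b <= a < d) for a in grid]

def euler_curve(bars_by_degree, grid):
    """Alternating sum of Betti curves; bars_by_degree[k] holds degree-k bars."""
    curves = [betti_curve(bars, grid) for bars in bars_by_degree]
    return [sum((-1) ** k * c[i] for k, c in enumerate(curves))
            for i in range(len(grid))]

def lp_distance(f, g, step, p):
    """Riemann-sum approximation of the L^p distance between sampled curves."""
    return (sum(abs(u - v) ** p for u, v in zip(f, g)) * step) ** (1.0 / p)

# Toy barcode: two bars in degree 0 and one bar in degree 1.
grid = [0.5 * i for i in range(9)]        # 0, 0.5, ..., 4
bars0 = [(0.0, 4.0), (0.0, 1.0)]
bars1 = [(1.0, 3.0)]
b0 = betti_curve(bars0, grid)
b1 = betti_curve(bars1, grid)
chi = euler_curve([bars0, bars1], grid)
d1 = lp_distance(b0, b1, 0.5, 1)
d2 = lp_distance(b0, b1, 0.5, 2)
```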

3.2 Persistence diagrams

Persistence diagrams are arguably the most common way of representing persistent homology. A persistence diagram is a multiset of points above the diagonal in the real plane, with lines at $\pm\infty$ in the second coordinate.

In what follows, let $\mathbb{R}^{2+}=\{(x,y)\in\mathbb{R}^{2}\mid x<y\}$ be the subset of the plane above the diagonal $\Delta=\{(x,x)\mid x\in\mathbb{R}\}$ , and let $\mathcal{L}_{\pm\infty}=\{(x,\pm\infty)\mid x\in\mathbb{R}\}$ denote horizontal lines at infinity.

Definition 3.1.

A persistence diagram $X$ is a multiset in $\mathcal{L}_{\infty}\cup\mathcal{L}_{-\infty}\cup\mathbb{R}^{2+}\cup\Delta$ such that

• The numbers of elements in $X|_{\mathcal{L}_{\infty}}$ and $X|_{\mathcal{L}_{-\infty}}$ are finite.

• $\sum_{(x_{i},y_{i})\in X\cap{\mathbb{R}^{2+}}}(y_{i}-x_{i})<\infty$ .

• $X$ contains countably infinitely many copies of $\Delta$ .

For our purposes, it suffices to consider persistence diagrams with only finitely many off-diagonal points.

Let $\mathcal{D}$ denote the set of all persistence diagrams. We will consider a family of metrics which are analogous to the $p$ -Wasserstein distances on the set of probability measures, and to the $L^{p}$ distances on the set of functions on a discrete set. $\mathbb{R}^{2+}$ inherits natural $L^{p}$ distances from $\mathbb{R}^{2}$ . For $p\in[1,\infty)$ we have $\|(a_{1},b_{1})-(a_{2},b_{2})\|_{p}^{p}=|a_{1}-a_{2}|^{p}+|b_{1}-b_{2}|^{p}$ and $\|(a_{1},b_{1})-(a_{2},b_{2})\|_{\infty}=\max\{|a_{1}-a_{2}|,|b_{1}-b_{2}|\}$ .

With a slight abuse of notation we write $\|(a,b)-\Delta\|_{p}$ to denote the shortest $L^{p}$ distance to $\Delta$ from a point $(a,b)$ in a persistence diagram. Thus

$$\|(a,b)-\Delta\|_{p}=\inf_{t\in\mathbb{R}}\|(a,b)-(t,t)\|_{p}=2^{\frac{1}{p}-1}|b-a|$$

for $p<\infty$ , and $\|(a,b)-\Delta\|_{\infty}=\inf_{t\in\mathbb{R}}\|(a,b)-(t,t)\|_{\infty}=|b-a|/2$ . Both $\mathcal{L}_{-\infty}$ and $\mathcal{L}_{\infty}$ inherit natural $L^{p}$ distances from the $L^{p}$ metric on $\mathbb{R}$ , i.e. $\|(-\infty,b_{1})-(-\infty,b_{2})\|_{p}=|b_{1}-b_{2}|$ and $\|(a_{1},\infty)-(a_{2},\infty)\|_{p}=|a_{1}-a_{2}|$ .
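As a sanity check on the distance to the diagonal, the following sketch compares the closed form $2^{1/p-1}|b-a|$ for finite $p$ (and $|b-a|/2$ for $p=\infty$), attained at the midpoint $t=(a+b)/2$, with a brute-force grid search over $t$. The function names are hypothetical.

```python
INF = float("inf")

def dist_to_diagonal(a, b, p):
    """Closed-form L^p distance from the point (a, b) to the diagonal."""
    if p == INF:
        return abs(b - a) / 2.0
    return 2.0 ** (1.0 / p - 1.0) * abs(b - a)

def dist_to_diagonal_brute(a, b, p, steps=20001):
    """Grid search over diagonal points (t, t) with t between a and b."""
    best = INF
    for i in range(steps):
        t = a + (b - a) * i / (steps - 1)
        if p == INF:
            val = max(abs(a - t), abs(b - t))
        else:
            val = (abs(a - t) ** p + abs(b - t) ** p) ** (1.0 / p)
        best = min(best, val)
    return best

# Compare the two computations at the point (1, 4) for several p.
checks = [(p, dist_to_diagonal(1.0, 4.0, p), dist_to_diagonal_brute(1.0, 4.0, p))
          for p in (1, 2, 3, INF)]
```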

Given persistence diagrams $X$ and $Y$ , we can consider all the bijections between them. This set is non-empty due to the presence of $\Delta$ in the diagrams. Each bijection can be thought of as providing a transport plan from $X$ to $Y$ . One defines a family of metrics in terms of the cost of the most efficient transport plan.

For each $p\in[1,\infty)$ , define

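$$d_{p}(X,Y)=\Big(\inf_{\phi:X\to Y}\sum_{x\in X}\|x-\phi(x)\|_{p}^{p}\Big)^{1/p}$$

and

$$d_{\infty}(X,Y)=\inf_{\phi:X\to Y}\sup_{x\in X}\|x-\phi(x)\|_{\infty},$$

where the infima are taken over all bijections $\phi$ from $X$ to $Y$ .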

These distances may be infinite. Indeed, if $X$ and $Y$ contain a different number of points in $\mathcal{L}_{\infty}$ , then $d_{p}(X,Y)=\infty$ for all $p$ .
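For very small diagrams the optimal transport plan can be found by exhaustive search, which makes the definition easy to experiment with. The sketch below is an illustration only (the experiments in this paper use Hera): it assumes finitely many off-diagonal points and no points at infinity, it pads each diagram with copies of the diagonal, and it enumerates all bijections, so it is only usable for a handful of points. All names are hypothetical.

```python
import itertools

INF = float("inf")

def point_cost(u, v, p):
    """||u - v||_p^p for finite p, or ||u - v||_inf when p is infinite."""
    dx, dy = abs(u[0] - v[0]), abs(u[1] - v[1])
    return max(dx, dy) if p == INF else dx ** p + dy ** p

def diag_cost(u, p):
    """Cost of matching u to the diagonal, i.e. to its midpoint projection."""
    half = abs(u[1] - u[0]) / 2.0
    return half if p == INF else 2.0 * half ** p

def diagram_distance(X, Y, p):
    """Brute-force d_p between two tiny diagrams given as lists of (birth, death)."""
    n, m = len(X), len(Y)
    size = n + m  # X padded with m diagonal copies, Y padded with n

    def cost(i, j):
        if i < n and j < m:
            return point_cost(X[i], Y[j], p)
        if i < n:
            return diag_cost(X[i], p)
        if j < m:
            return diag_cost(Y[j], p)
        return 0.0  # diagonal matched to diagonal

    best = INF
    for perm in itertools.permutations(range(size)):
        costs = [cost(i, perm[i]) for i in range(size)]
        total = max(costs, default=0.0) if p == INF else sum(costs)
        best = min(best, total)
    return best if p == INF else best ** (1.0 / p)

d_one = diagram_distance([(0.0, 2.0)], [(0.0, 3.0)], 1)
d_two = diagram_distance([(0.0, 2.0)], [], 2)
d_bottleneck = diagram_distance([(0.0, 2.0)], [(0.0, 3.0)], INF)
```

In the first and third example it is cheaper to match the two off-diagonal points to each other than to send both to the diagonal; in the second the only option is the diagonal.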

In theory, for every pair $p,q\in[1,\infty]$ one can construct a distance function of the form

[TABLE]

with $p$ and $q$ potentially different. Some of the computational topology literature uses a family of metrics $d_{W_{p}}$ where $p$ varies but $q=\infty$ is fixed. The families $\{d_{p}\}$ and $\{d_{W_{p}}\}$ share many properties. The metrics $d_{p}$ and $d_{W_{p}}$ are bi-Lipschitz equivalent, as for any $x,y\in\mathbb{R}^{2}$ we have $\|x-y\|_{\infty}\leq\|x-y\|_{p}\leq 2\|x-y\|_{\infty}$ , implying $d_{W_{p}}(X,Y)\leq d_{p}(X,Y)\leq 2d_{W_{p}}(X,Y)$ . Any stability results (i.e. results pertaining to the change in persistence diagrams due to perturbations of the underlying filtration) for $\{d_{p}\}$ or $\{d_{W_{p}}\}$ extend (with minor changes in the constants involved) to stability results for the other.

We feel that the choice of $q=p$ is cleaner in theory and in practice. The coordinates of the points within a persistence diagram have particular meanings; one is the birth time and one is the death time. They are often locally independent (even though not globally so). For example, if we have generated our persistence diagram from the distance function to a point cloud, then each persistence class has its birth and death time locally determined by the location of two pairs of points, which are often distinct. Whenever these pairs are distinct, moving any of these four points will change either the birth or the death but not both. Treating birth and death times as separate quantities may seem more philosophically pleasing to the reader in the setting of barcodes.

Unfortunately, the geometry of the space of persistence diagrams is complicated and statistical methods are not easy to apply. For example, there are challenges even in computing the mean or median of finite samples (see [26, 25]). Given this, it is perhaps not surprising that the space of persistence diagrams is not of negative type (let alone of strong negative type) under the bottleneck or indeed any of the Wasserstein metrics. Although this has been indirectly mentioned or suggested before (notably in [19, 6]), we include here explicit counterexamples.

Theorem 3.2.

The space of persistence diagrams is not of negative type under the bottleneck or any of the Wasserstein metrics.

Proof.

We will construct two different counterexamples; one for small $p$ and one for large $p$ . Note that the bottleneck metric is the Wasserstein metric with $p=\infty$ .

For small $p$ , consider the two separate unit squares formed by the points $a_{1},b_{1},c_{1},d_{1}$ and $a_{2},b_{2},c_{2},d_{2}$ in Figure 2. Each persistence diagram will be a union of a pair of corners sharing an edge in one of the squares, together with a pair of corners diagonally opposite on the other square. We then choose the weights (the $\alpha$ ’s in inequality (2.2)) to be $1$ if the off-diagonal points are diagonally opposite in the rightmost square, and $-1$ if they are diagonally opposite in the leftmost square. A list of the diagrams is in Table 1.

We have the following distance matrix for the within-group distances, i.e. the symmetric matrix with entries $(d_{p}(x_{i},x_{j}))_{i,j}=(d_{p}(y_{i},y_{j}))_{i,j}$ :

[TABLE]

This implies that

[TABLE]

and similarly

[TABLE]

The sum of interest, using the weighting in Table 1, is

[TABLE]

Now $48\cdot 4^{1/p}-64\cdot 2^{1/p}>0$ exactly when $p<\ln(2)/\ln(4/3)$ . This thus shows that the metric space of persistence diagrams with $W_{p}$ is not of negative type when $p<\ln(2)/\ln(4/3)$ .
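For completeness, the threshold follows from a short computation, valid for $p>0$:

```latex
\begin{align*}
48\cdot 4^{1/p}-64\cdot 2^{1/p}>0
  &\iff 3\cdot 2^{2/p}>4\cdot 2^{1/p}
   \iff 2^{1/p}>\tfrac{4}{3}\\
  &\iff \tfrac{1}{p}\ln 2>\ln\tfrac{4}{3}
   \iff p<\frac{\ln 2}{\ln(4/3)}\approx 2.409 .
\end{align*}
```

In particular the range of this first counterexample overlaps with the range $p\geq 2.4$ treated next.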

We will now construct a counterexample for the space of persistence diagrams under the $p$ -Wasserstein distance with $p\geq 2.4$ . We will construct our counterexample with persistence diagrams containing points listed in Figure 3. This has separate squares with unit edge length that are sufficiently far apart. We will have two sets of persistence diagrams, $X$ and $Y$ , and we will be giving a weight of $1$ to all the persistence diagrams in $X$ and a weight of $-1$ to all the persistence diagrams in $Y$ .

Each persistence diagram in $X$ will have 4 off-diagonal points; one corner point from each of the squares labelled with upper case letters, and $e_{1}$ and $e_{2}$ . An example is $\{A_{1},B_{2},e_{1},e_{2}\}$ . There are a total of 16 such persistence diagrams.

Each persistence diagram in $Y$ will have 4 off-diagonal points; one corner point from each of the squares labelled with lower case letters, and $E_{1}$ and $E_{2}$ . An example is $\{c_{1},c_{2},E_{1},E_{2}\}$ .

For every pair of persistence diagrams $(x,y)\in X\times Y$ we have $d_{p}(x,y)=8^{1/p}/2$ . This implies that the total between-group pairwise distance is $32\cdot 16\cdot 8^{1/p}/2$ .

To compute the within group distances we first observe that the symmetry of the counterexample ensures that the sum of distance $\sum_{x\in X}d_{p}(x,x^{\prime})$ is the same for all $x^{\prime}\in X$ and that this is also the same as $\sum_{y\in Y}d_{p}(y,y^{\prime})$ for all $y^{\prime}\in Y$ . This means we can compute for a fixed $x^{\prime}\in X$ . We can split the remaining $x\in X$ into cases depending on how many of the off-diagonal points in the persistence diagrams are the same as that in $x^{\prime}$ , are on the same edge of the corresponding square as that in $x^{\prime}$ , or are diagonally opposite corners of the corresponding square. We describe this distribution in Table 2, giving example persistence diagrams.

Using this table, we calculate $\sum_{x\in X}d_{p}(x,x^{\prime})=4+6\cdot 2^{1/p}+4\cdot 3^{1/p}+4^{1/p}$ . To prove this is a counterexample we need to show that

[TABLE]

This is equivalent to $4+6\cdot 2^{1/p}+4\cdot 3^{1/p}+4^{1/p}>8\cdot 8^{1/p}$ , and by dividing through by $8^{1/p}$ this is equivalent to the condition that

[TABLE]

Now $\lambda^{1/p}$ is an increasing function in $p$ , when $\lambda<1$ and $p>1$ . Thus for all $p\geq 2.4$ we know $(1/8)^{1/p}\geq(1/8)^{1/2.4}>0.42$ , $(1/4)^{1/p}\geq(1/4)^{1/2.4}>0.56$ , $(3/8)^{1/p}\geq(3/8)^{1/2.4}>0.66$ and $(1/2)^{1/p}\geq(1/2)^{1/2.4}>0.74$ . Together these imply that

[TABLE]

and (3.1) holds for all $p\geq 2.4$ . ∎
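The final estimate can also be checked numerically. The sketch below assumes, from the terms listed in the proof, that the condition to verify is $4(1/8)^{1/p}+6(1/4)^{1/p}+4(3/8)^{1/p}+(1/2)^{1/p}>8$ ; the function name `lhs` and the grid of $p$ values are hypothetical, and a finite grid is of course only an illustration, not a replacement for the monotonicity argument.

```python
def lhs(p):
    """4(1/8)^(1/p) + 6(1/4)^(1/p) + 4(3/8)^(1/p) + (1/2)^(1/p)."""
    e = 1.0 / p
    return 4 * (1 / 8) ** e + 6 * (1 / 4) ** e + 4 * (3 / 8) ** e + (1 / 2) ** e

# Evaluate on a grid starting at p = 2.4 with step 0.1.
grid = [2.4 + 0.1 * k for k in range(1000)]
values = [lhs(p) for p in grid]

# Lower bound obtained from the four rounded constants used in the proof.
rounded_bound = 4 * 0.42 + 6 * 0.56 + 4 * 0.66 + 0.74
```

The rounded constants give the lower bound $8.42$ , and the values on the grid increase with $p$ towards the limit $15$ .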

When performing computations with Wasserstein distances, we used the approximate Wasserstein distance algorithm implemented in Hera [11]. The algorithm computes the distances up to an arbitrarily chosen relative error, which we set very low.

3.3 Persistence landscapes

Recall that $H_{*}(a,b):=Z_{*}(K_{a})/(B_{*}(K_{b})\cap Z_{*}(K_{a}))$ is the vector space of non-trivial homology classes in $H_{*}(K_{a})$ that are still distinct when thought of as elements of $H_{*}(K_{b})$ under the induced map $H_{*}(K_{a})\to H_{*}(K_{b})$ . For $a\leq b$ let $\beta^{a,b}=\dim(H_{*}(a,b))$ . We can think of $\beta^{\bullet,\bullet}$ as a persistent version of the ordinary Betti numbers. Indeed, $\beta^{a,a}$ is the Betti number of $K_{a}$ . Notably, persistent Betti numbers are non-negative integer valued functions. Furthermore, when $a\leq b\leq c\leq d$ , then $\beta^{a,d}\leq\beta^{c,d}$ . We can construct the persistence landscape as a sequence of functions which together completely describe the level sets of these functions.

Definition 3.3.

The persistence landscape of some filtration is a function $\lambda:\mathbb{N}\times\mathbb{R}\to\overline{\mathbb{R}}$ , where $\overline{\mathbb{R}}=[-\infty,\infty]$ denotes the extended real numbers, defined by

[TABLE]

We alternatively think of the landscapes as a sequence of functions $\lambda_{k}:\mathbb{R}\to\overline{\mathbb{R}}$ with $\lambda_{k}(t)=\lambda(k,t)$ .

Since persistence landscapes are real-valued functions, we can consider the space of these functions with the $L^{p}$ norm

[TABLE]

for $1\leq p\leq\infty$ .
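For a finite barcode the landscape can be evaluated pointwise. The sketch below is an illustration and not the persistence landscapes toolkit: it assumes the standard equivalent description in which $\lambda_{k}(t)$ is the $k$-th largest value of $\max(0,\min(t-b,d-t))$ over the bars $(b,d)$ , it assumes that the $L^{p}$ norm sums over all $k$ and integrates over $t$ , and it approximates the integral by a Riemann sum. All names are hypothetical.

```python
INF = float("inf")

def landscape(bars, k, t):
    """k-th landscape function (k = 1, 2, ...) at t, via the k-th largest tent."""
    tents = sorted((max(0.0, min(t - b, d - t)) for (b, d) in bars),
                   reverse=True)
    return tents[k - 1] if k <= len(tents) else 0.0

def landscape_distance(bars1, bars2, grid, step, p):
    """Discretised L^p distance between the landscapes of two finite barcodes."""
    max_k = max(len(bars1), len(bars2))
    diffs = [abs(landscape(bars1, k, t) - landscape(bars2, k, t))
             for k in range(1, max_k + 1) for t in grid]
    if p == INF:
        return max(diffs, default=0.0)
    return (sum(d ** p for d in diffs) * step) ** (1.0 / p)

grid = [0.01 * i for i in range(201)]     # uniform grid on [0, 2]
peak = landscape([(0.0, 2.0)], 1, 1.0)
second = landscape([(0.0, 2.0), (0.5, 1.5)], 2, 1.0)
d_l1 = landscape_distance([(0.0, 2.0)], [], grid, 0.01, 1)
d_sup = landscape_distance([(0.0, 2.0)], [(0.0, 1.0)], grid, 0.01, INF)
```

A single bar $(0,2)$ gives a tent of height $1$ and area $1$ , which is what the first and third quantities recover.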

Theorem 3.4.

The following are true for the space of persistence landscapes under different $L^{p}$ norms:

1. $p=2$ : It is of strong negative type.

2. $1\leq p\leq 2$ : It is of negative type.

3. $p=1$ : It is not of strong negative type, even when restricting to persistence landscapes that arise from persistence diagrams.

4. $p=\infty$ : It is not of negative type, even when restricting to persistence landscapes that arise from persistence diagrams.

Proof.

1. The space of persistence landscapes with the $L^{2}$ norm is a separable Hilbert space. Applying Theorem 2.6 shows it is of strong negative type.

2. As discussed in [4], these function spaces are $L^{p}$ function spaces. From Theorem 3.6 in [16] we know that these are of negative type when $1\leq p\leq 2$ .

3. The space of persistence landscapes with $L^{1}$ norm is of negative type but not of strong negative type. We can construct a counterexample using only distributions of landscapes that arise from persistent homology. To this end it is sufficient to provide appropriate barcodes, each with finitely many bars, as every such barcode can be realised. Let

[TABLE]

Since all the bars in each barcode are disjoint, only the first persistence landscape is non-zero.

Let $\operatorname{PL}(Z)$ denote the persistence landscape of $Z$ , and $d_{1}$ the metric induced by the $L^{1}$ norm.

We have $d_{1}(\operatorname{PL}(X_{1}),\operatorname{PL}(X_{2}))=2=d_{1}(\operatorname{PL}(Y_{1}),\operatorname{PL}(Y_{2}))$ and $d_{1}(\operatorname{PL}(X_{i}),\operatorname{PL}(Y_{j}))=1$ for all $i,j$ . If we weight $X_{1}$ and $X_{2}$ by $1$ , and $Y_{1}$ and $Y_{2}$ by $-1$ , then the weighted sum from inequality (2.2) is $2\cdot 2+2\cdot 2-8\cdot 1=0$ with non-zero weights, which means that the space of persistence landscapes with $L^{1}$ norm is of non-strict negative type, and hence not of strong negative type.

4. For $p=\infty$ the space of persistence landscapes is not of negative type. We can construct a counterexample using only distributions of landscapes that arise from persistent homology. Again we do this via barcodes. Let

[TABLE]

Since all the bars in each barcode are disjoint, only the first persistence landscape is non-zero.

It is straightforward to compute the $L^{\infty}$ distances between the corresponding persistence landscapes. Let $\operatorname{PL}(Z)$ denote the persistence landscape of $Z$ , and $d_{\infty}$ the metric induced by the $L^{\infty}$ norm. We see that $d_{\infty}(\operatorname{PL}(X_{i}),\operatorname{PL}(X_{j}))=1=d_{\infty}(\operatorname{PL}(Y_{i}),\operatorname{PL}(Y_{j}))$ when $i\neq j$ and $d_{\infty}(\operatorname{PL}(X_{i}),\operatorname{PL}(Y_{j}))=0.5$ for all $i,j$ . If we weight each of the $X_{i}$ with $1$ and the $Y_{i}$ by $-1$ we get the desired counterexample showing that the space of persistence landscapes with the $L^{\infty}$ distance is not of negative type.

∎

Persistence landscapes computations were performed using the persistence landscapes toolkit [5].

3.4 Persistence scale space kernel

The persistence scale space kernel is a modification of scale space theory to a persistence diagram setting. Extra care is needed to consider the role of the diagonal. The idea is to consider the heat kernel with an initial heat energy of Dirac masses at each of the points in the persistence diagram with the boundary condition that it is zero on the diagonal. The amount of time over which the heat diffusion takes place is a parameter. More formally, it is defined in [19] as follows.

Definition 3.5.

Let $\delta_{p}$ denote a Dirac delta centered at the point $p$ . For a given finite persistence diagram $D$ with only finite lifetimes (when analyzing real data, one often cones off the space at some more or less meaningful maximum filtration so as to avoid infinite intervals), we now consider the solution $u:\mathbb{R}^{2+}\times\mathbb{R}_{\geq 0}\to\mathbb{R}$ of the partial differential equation

[TABLE]

The solution $u(\bullet,t)$ lies in $L^{2}(\mathbb{R}^{2+})$ whenever $D$ has finitely many points. It has a nice closed expression using the observation that it is the restriction of the solution of a PDE with an initial condition where below the diagonal we start with the negative of the Dirac masses over the reflection of the points in the diagram above the diagonal. For $x\in\mathbb{R}^{2+}$ and $t>0$ we have

[TABLE]

The metric for the space of persistence scale space kernels is that of $L^{2}(\mathbb{R}^{2+})$ . The closed form for the persistence scale space kernel allows a closed form of the pairwise distances in terms of the points in the original diagrams. In particular for diagrams $F$ and $G$ and fixed $\sigma>0$ , this distance can be written in terms of a kernel $k_{\sigma}(F,G)$ , where

[TABLE]

and the corresponding distance function is

[TABLE]

Since $L^{2}(\mathbb{R}^{2+})$ is a separable Hilbert space, this metric is of strong negative type.
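The closed form makes this distance straightforward to compute. The sketch below is an illustration under an explicit assumption: we take the kernel to be $k_{\sigma}(F,G)=\frac{1}{8\pi\sigma}\sum_{p\in F,\,q\in G}\big(e^{-\|p-q\|^{2}/(8\sigma)}-e^{-\|p-\bar{q}\|^{2}/(8\sigma)}\big)$ , where $\bar{q}$ is the reflection of $q$ in the diagonal, which is our reading of the formula in [19], and the distance to be $\sqrt{k_{\sigma}(F,F)+k_{\sigma}(G,G)-2k_{\sigma}(F,G)}$ . The function names are hypothetical.

```python
import math

def pss_kernel(F, G, sigma):
    """Assumed closed form of the persistence scale space kernel."""
    total = 0.0
    for (px, py) in F:
        for (qx, qy) in G:
            d2 = (px - qx) ** 2 + (py - qy) ** 2
            # The reflection of q = (qx, qy) in the diagonal is (qy, qx).
            d2_mirror = (px - qy) ** 2 + (py - qx) ** 2
            total += (math.exp(-d2 / (8 * sigma))
                      - math.exp(-d2_mirror / (8 * sigma)))
    return total / (8 * math.pi * sigma)

def pss_distance(F, G, sigma):
    """Distance induced by the kernel; tiny negative rounding is clamped to 0."""
    sq = (pss_kernel(F, F, sigma) + pss_kernel(G, G, sigma)
          - 2 * pss_kernel(F, G, sigma))
    return math.sqrt(max(sq, 0.0))

F = [(0.0, 1.0)]
G = [(0.0, 2.0)]
d_fg = pss_distance(F, G, 1.0)
```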

3.5 Sliced Wasserstein kernel distance

The sliced Wasserstein distance between persistence diagrams, introduced in [6], works with projections onto lines through the origin. For each choice of line, one intuitively computes the Wasserstein distance between the two projections (a computationally much easier problem, being a matching of points in one dimension), and then integrates the result over all choices of lines. More formally the definition in [6] is as follows.

Definition 3.6.

Given $\theta\in\mathbb{R}^{2}$ with $\|\theta\|_{2}=1$ , let $L(\theta)$ denote the line $\{\lambda\theta:\lambda\in\mathbb{R}\}$ , and let $\pi_{\theta}:\mathbb{R}^{2}\to L(\theta)$ be the orthogonal projection onto $L(\theta)$ . Let $D_{1}$ and $D_{2}$ be two persistence diagrams, and let $\mu_{i}^{\theta}=\sum_{p\in D_{i}}\delta_{\pi_{\theta}(p)}$ and $\mu_{i\Delta}^{\theta}=\sum_{p\in D_{i}}\delta_{\pi_{\theta}\circ\pi_{(\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})}(p)}$ for $i=1,2$ . Then the sliced Wasserstein distance is defined as

[TABLE]

where the $1$ -Wasserstein distance $\mathcal{W}(\mu,\nu)$ is defined as $\inf_{P\in\Pi(\mu,\nu)}\int\int_{\mathbb{R}\times\mathbb{R}}|x-y|P(dx,dy)$ where $\Pi(\mu,\nu)$ is the set of measures on $\mathbb{R}^{2}$ with marginals $\mu$ and $\nu$ .

It was shown in [6] that the sliced Wasserstein distance is conditionally seminegative definite on the space of finite and bounded persistence diagrams. This is equivalent to the condition of being of negative type. It is an open question as to whether it is of strong negative type.

In [6], the authors construct a kernel with bandwidth parameter $\sigma>0$ in the standard way (see [27]), namely

[TABLE]

It being a kernel in the sense that

[TABLE]

for some function $\phi$ into a Hilbert space $\mathcal{H}$ , one obtains a distance function $d_{\text{kSW}}$ with

[TABLE]

If this reproducing kernel Hilbert space $\mathcal{H}$ is separable, then the space of persistence diagrams with $d_{\text{kSW}}$ will be of strong negative type. This separability property is an open question.

In our computations, we always projected onto $10$ equidistributed lines.
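A minimal sketch of this approximation is given below. It is not the code used for our computations, and it rests on several assumptions that should be checked against [6]: each diagram is augmented with the diagonal projections of the other, the one-dimensional $1$-Wasserstein cost between equal-size point sets is computed by sorting, the result is normalised as the average over directions with angles in $[-\pi/2,\pi/2)$ , and the kernel is taken to be $\exp(-\mathrm{SW}/(2\sigma^{2}))$ . All function names are hypothetical.

```python
import math

def diag_proj(pt):
    """Orthogonal projection of a point onto the diagonal."""
    m = (pt[0] + pt[1]) / 2.0
    return (m, m)

def sliced_wasserstein(D1, D2, n_lines=10):
    """Approximate sliced Wasserstein distance using n_lines directions."""
    pts1 = list(D1) + [diag_proj(q) for q in D2]
    pts2 = list(D2) + [diag_proj(q) for q in D1]
    total = 0.0
    for i in range(n_lines):
        theta = -math.pi / 2 + i * math.pi / n_lines
        c, s = math.cos(theta), math.sin(theta)
        # Signed coordinates of the projections onto the line L(theta).
        v1 = sorted(x * c + y * s for (x, y) in pts1)
        v2 = sorted(x * c + y * s for (x, y) in pts2)
        # In one dimension the optimal matching pairs sorted points.
        total += sum(abs(u - v) for u, v in zip(v1, v2))
    return total / n_lines

def ksw_distance(D1, D2, sigma, n_lines=10):
    """Distance induced by the assumed kernel exp(-SW / (2 sigma^2))."""
    sw = sliced_wasserstein(D1, D2, n_lines)
    return math.sqrt(2.0 - 2.0 * math.exp(-sw / (2.0 * sigma ** 2)))

D1 = [(0.0, 2.0), (1.0, 4.0)]
D2 = [(0.0, 3.0)]
sw_12 = sliced_wasserstein(D1, D2)
k_12 = ksw_distance(D1, D2, sigma=1.0)
```

Since the kernel of a diagram with itself is $1$ under this assumption, the induced distance is bounded above by $\sqrt{2}$ regardless of the bandwidth.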

4 Distance correlation between different topological summaries

The differences between the metrics used can dramatically affect the statistical analysis of a data set. It is important to choose a summary such that the domain-specific differences in the input data that are of interest are reflected in the distances between their corresponding topological summaries.

The key idea in this section is to take the same object, for example generated through a random process, and then to record different topological summaries of it. As we have seen, this gives us different metric space structures on the data. We then compare the pairwise distances using distance correlation.

We consider a variety of more or less standard or well known families of random cell complexes and their filtrations, as well as some non-random data.

4.1 Erdős–Rényi

We constructed the weighted version of $100$ -vertex Erdős–Rényi random graphs, which is to say we endow the complete graph on $100$ vertices with uniform random independent edge weights. The flag complexes of each of these are then the filtrations we consider. We generated $100$ such filtrations to sample the distribution of degree- $1$ persistent homology of such complexes. An example persistence diagram is shown in Figure 4. We then computed the distance correlation between the different topological summaries, with the result shown in Figure 4.

The persistent homology computations were performed using Ripser [2].

4.2 Directed Erdős–Rényi

A directed analog of the flag complex of undirected graphs was introduced in [18]. To construct such flag complexes, we generated $100$ instances of independent uniform random weights on the complete directed graph on $100$ vertices (taking “complete directed graph” to mean having opposing edges between every pair of vertices), and computed the corresponding filtrations and degree- $1$ persistent homology of directed flag complexes using Flagser [14]. An example persistence diagram is shown in Figure 5. We then computed the distance correlation between the different topological summaries, with the result shown in Figure 5.

4.3 Geometric random complexes for points sampled on a torus

For this dataset, we randomly sampled $500$ points independently from a flat torus in $\mathbb{R}^{4}$ by sampling $[0,2\pi)^{2}$ uniformly and considering the image of $(s,t)\mapsto(\cos s,\sin s,\cos t,\sin t)$ . We then built the alpha complex over this set of points. This was performed $100$ times to construct samples of the distribution of persistent homology in degree $1$ for such complexes. An example persistence diagram is shown in Figure 6. We then computed the distance correlation between the different topological summaries, which is shown in Figure 6.
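The sampling step can be sketched as follows. The function name and the seed handling are hypothetical; building the alpha complex and computing persistent homology are not shown.

```python
import math
import random

def sample_flat_torus(n, seed=None):
    """Sample n points from the flat torus in R^4 via uniform angles (s, t)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        s = rng.uniform(0.0, 2 * math.pi)
        t = rng.uniform(0.0, 2 * math.pi)
        points.append((math.cos(s), math.sin(s), math.cos(t), math.sin(t)))
    return points

cloud = sample_flat_torus(500, seed=0)
```

Each sampled point lies on the product of two unit circles, so its first two and its last two coordinates each have unit Euclidean norm.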

The computations of alpha complexes and persistent homology for this dataset were done using GUDHI [24].

4.4 Geometric random complexes for points sampled from a unit cube

For this dataset, we uniformly randomly sampled $500$ points independently from the unit cube $[0,1]^{3}$ . We then constructed the alpha complex over this set of points. This was performed $100$ times to sample the distribution of persistent homology for such complexes. An example persistence diagram is in Figure 7. We then computed the distance correlation between the different topological summaries which is shown in Figure 7.

For this particular dataset, we also computed a very non-topological summary based on the same underlying complex, namely the counts of $1$ -simplices ( $\#_{1}$ in Figure 7). These were considered as “count curves” in the obvious way, and endowed with the $L^{1}$ and $L^{2}$ metrics. They, unsurprisingly, correlate little with the topological summaries.

The computations of the alpha complexes and persistent homology for this dataset were done using GUDHI [24].

4.5 Observations about the distance correlation in these simulations

The first general comment is that the sampled distance correlations for the topological summaries split these simulations into two groups: one group contains the directed Erdős–Rényi filtrations and the Erdős–Rényi filtrations, and the other contains the filtrations built from random point clouds, either on the flat torus or in the unit cube. This is not too surprising, as both the directed Erdős–Rényi and the Erdős–Rényi scenarios represent completely random types of complexes without correlations between the simplex values. In contrast, for filtrations built on point clouds there are geometric constraints which imply correlations between the filtration values on neighbouring simplices. This in turn affects the observed topology.

For both the persistence diagrams and the persistence landscapes we sampled the distance correlations for $p=1,2$ and $\infty$ . In all of the simulation studies the metrics for $p=1$ and $p=2$ on the same topological summary generally have high distance correlation with each other, but they are quite different from the $p=\infty$ version of that same topological summary. This is particularly pronounced in the Erdős–Rényi and directed Erdős–Rényi filtrations. In fact, here the distance correlation between diagrams and landscapes with $p=1$ and $p=2$ is higher than the distance correlation between bottleneck distances and $p$ -Wasserstein distances for $p=1$ or $p=2$ , and similarly for landscapes. One explanation is that in the completely random scenario we can have more extremal persistent homology classes, and these extremal classes dominate the $p=\infty$ metrics more than they do the $p=1$ and $p=2$ metrics.

Another observation is that overall we see high correlation between the sliced Wasserstein distances and the Wasserstein ( $p=1$ or $p=2$ ) distances. This is perhaps not surprising, since both measure geometrically similar quantities and both involve a pairing of points (though the pairing may vary between slices in the sliced Wasserstein distance).

5 Distance correlation to another parameter

Instead of considering the correlation between two distances of topological summaries, one may want to consider the correlation between a metric on topological summaries and some real number relating to the underlying model. The real number may for example parameterise the underlying model, or it may be some function of the model that has domain-specific meaning. We will here consider only parameters and functions with codomain in (intervals in) $\mathbb{R}$ , and consider that space as a metric space equipped with the absolute value distance.

For brevity, we will from now on refer also to the value of certain domain-specific functions on the underlying model as “parameters”, even though they strictly speaking are not (see for example the case of elevation data below, where terrain smoothness will incorrectly be referred to as a parameter of the landscape). We will also use the letter $\mathcal{P}$ to denote the parameter space as a metric space with the absolute value distance.

We can use distance correlation to quantify how well the distances between some topological summaries relate to the differences in the parameter. The varying performances of the different topological summaries in correlating to the parameter highlight how the choice of topological summary has statistical significance.

5.1 Parameterised interpolation between Erdős–Rényi and geometric complexes

Our parameter space is now $[0,1]$ . Each sampled filtered complex with parameter $\gamma\in[0,1]$ is built by sampling $100$ points $X=\{x_{1},\dotsc,x_{100}\}$ i.i.d. uniformly from the unit cube $[0,1]^{3}$ , and sampling the entries of a symmetric matrix $E\in\mathbb{R}^{100\times 100}$ i.i.d. uniformly from $[0,1]$ . We endow a complete graph on $100$ vertices with weights $w_{i,j}$ for each pair $1\leq i<j\leq 100$ by letting $w_{i,j}=E_{i,j}$ with probability $\gamma$ and $w_{i,j}=\|x_{i}-x_{j}\|$ with probability $1-\gamma$ . The filtered complex generated is then the flag complex of this graph. Observe that this is a (Vietoris–Rips) version of the random geometric complex considered before when $\gamma=0$ , and the Erdős–Rényi complex when $\gamma=1$ .
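The construction of the edge weights can be sketched as follows. This is an illustration with hypothetical names and a reduced number of vertices in the example; the flag complex and persistent homology computations are not shown.

```python
import math
import random

def interpolated_weights(n, gamma, seed=None):
    """Edge weights interpolating between geometric (gamma=0) and random (gamma=1)."""
    rng = random.Random(seed)
    # n points sampled i.i.d. uniformly from the unit cube.
    pts = [(rng.random(), rng.random(), rng.random()) for _ in range(n)]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            e_ij = rng.random()            # entry of the symmetric matrix E
            if rng.random() < gamma:       # with probability gamma
                w_ij = e_ij
            else:                          # with probability 1 - gamma
                w_ij = math.dist(pts[i], pts[j])
            w[i][j] = w[j][i] = w_ij
    return pts, w

# Equally spaced parameter values in [0, 1], endpoints included.
gammas = [k / 99 for k in range(100)]
pts0, w0 = interpolated_weights(20, 0.0, seed=1)
pts1, w1 = interpolated_weights(20, 1.0, seed=1)
```

With $\gamma=0$ every weight is a Euclidean distance, and with $\gamma=1$ every weight is one of the uniform random entries.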

A correlation between the parameter space and a given metric on a topological summary is then a measure of how well that metric detects the parameter.

For this experiment, we let $\gamma$ take the $100$ equally spaced values from $[0,1]$, including the endpoints. The resulting distance correlations are displayed in Table 3. The higher the distance correlation, the better the topological summary reflects the effect of the parameter $\gamma$. We see that generally the function distances between Betti curves, the Wasserstein and bottleneck distances between persistence diagrams, and the function distances between persistence landscapes had a higher correlation, all with a distance correlation greater than $0.9$. This illustrates that these topological summaries would be good choices if we wish to do learning problems or statistical analysis with regard to this parameterised random model, such as parameter estimation. We also see the importance of the choice of bandwidth, which has a dramatic effect on the distance correlation of the persistence scale space kernel and the Sliced Wasserstein kernel.

The persistent homology computations were performed using Ripser [2].

5.2 Digital elevation models and terrain ruggedness

As a simple example of “real world” data, we considered digital elevation model (DEM) data for a $50$ km by $50$ km patch around the city of Trondheim, Norway. (The data was provided by the Norwegian Mapping Authority [10] under a CC-BY-4.0 license.) The DEM data set maps elevation data with a horizontal resolution of $10$ m $\times$ $10$ m and a vertical resolution of about $1$ m, and as such provides a terrain height map. Figure 8 shows the DEM our data was based on.

The data can, at the aforementioned horizontal resolution, be considered as a $5000\times 5000$ integer-valued matrix $Z$, where each entry is interpreted as the height above a reference elevation in the vertical resolution unit. Each filtered complex we consider comes from a $1000\times 1000$ block in $Z$. The blocks overlap up to $50\%$, and we keep a total of $64$ blocks. Each block is then considered as a two-dimensional cubical complex with the elevation data as the height filtration on the $2$-cells.
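One way to extract such overlapping blocks is sketched below. The stride of $500$ pixels (a $50\%$ overlap between neighbouring blocks) is our assumption, as is the function name; it yields a $9\times 9$ grid of candidate blocks, and the rule by which $64$ of them are kept is not specified here, so the sketch does not attempt it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def overlapping_blocks(z, block=1000, stride=500):
    """Return a grid of block-by-block views into z, offset by `stride` pixels.

    Assumption: a regular grid with stride = block / 2, so that horizontally
    or vertically adjacent blocks overlap by 50%.
    """
    windows = sliding_window_view(z, (block, block))
    return windows[::stride, ::stride]

# Small stand-in for the 5000-by-5000 height matrix Z.
z_demo = np.arange(100).reshape(10, 10)
blocks_demo = overlapping_blocks(z_demo, block=4, stride=2)
```

The result is a read-only view, so no elevation data is copied; `blocks_demo[i, j]` is the block whose top-left corner sits at row `i * stride` and column `j * stride`.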

The terrain ruggedness indicator (TRI) is an extremely simple measure of local terrain ruggedness that is widely employed in GIS and topography [20]. The TRI itself is a real-valued function defined on each map point/pixel, and a high value indicates a locally more rugged terrain. We simplify the measure even further by averaging the TRI for the whole map chunk considered, thus assigning a single real number to each map chunk. It is this number that will play the role of one parameter assigned to each of the $64$ map chunks considered, which we will call $\mathrm{TRI}$ . The results are shown in Table 4.
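For concreteness, a chunk-averaged TRI could be computed along the following lines. We assume here the common definition in which the TRI at a pixel is the square root of the sum of squared elevation differences to its eight neighbours; other variants exist, and the exact variant and boundary handling used for the experiment are not restated in this section. The function name is ours.

```python
import numpy as np

def mean_tri(z):
    """Average terrain ruggedness over the interior pixels of a height map.

    Assumption: TRI(pixel) = sqrt(sum of squared differences to the 8
    neighbours); boundary pixels are skipped rather than padded.
    """
    z = np.asarray(z, dtype=float)
    rows, cols = z.shape
    centre = z[1:-1, 1:-1]
    sq = np.zeros_like(centre)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            # Same-shaped window shifted by (di, dj) relative to the centre.
            neighbour = z[1 + di : rows - 1 + di, 1 + dj : cols - 1 + dj]
            sq += (neighbour - centre) ** 2
    return float(np.sqrt(sq).mean())

flat = np.zeros((6, 6))
ramp = np.arange(6.0)[:, None] * np.ones((1, 6))   # height rises by 1 per row
```

Applying `mean_tri` to each map chunk gives one real number per chunk, and the matrix of absolute differences between these numbers is the distance matrix on the parameter space.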

Another natural metric that can be defined on the raw data (the $1000\times 1000$ chunks) itself is the actual geodesic distance between the centers of the chunks. We also computed distance correlations between the metrics on topological summaries and this geodesic distance, although one must remember that it is perhaps not reasonable to expect a high correlation here; indeed, topographies with topologically highly interesting height functions may exist on a coastline, and thus be very close to topologically trivial terrain. The results in Table 4 are therefore quite surprising.

The cubical complex persistent homology calculations were done using GUDHI [24].

6 Future directions

Non-parametric statistics is a fruitful area for ideas and inspiration for methods that can be applied in conjunction with TDA. There are already a variety of options that only use pairwise distances, including null-hypothesis testing, clustering, classification, and parameter estimation. In all these cases, we would expect that distance correlation would be a good estimator for similarity of statistical analyses.

We can perform null hypothesis testing with topological summaries via a permutation test whose loss function is a function of the pairwise distances (see [22]). Intuitively, when there is a high distance correlation, the pairwise distances are correlated and the corresponding loss functions should be similar for each permutation of the labels. This implies we should expect that the $p$-values given a sample distribution should be close, at least with high probability. It may be possible to show that the powers of the null hypothesis tests are close. An experimental and theoretical exploration of this relationship is a future direction.

We can also consider a modification of the permutation test for independence that uses distance correlation (instead of Pearson correlation). This can then be applied to topological summaries. One can get a $p$-value for whether two variables are independent by permuting the coupling of the variables but keeping the marginal distributions the same. A high ranking of the distance correlation for the original joint distribution would indicate that the variables are not independent, with high probability. Exploring the power of this test is a future direction of research.
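A minimal sketch of such a permutation test, working only with two pairwise distance matrices, might look as follows. The estimator, the number of permutations and the function names are our assumptions rather than a procedure taken from the literature cited here.

```python
import numpy as np

def _dcor(dx, dy):
    """Sample distance correlation (double-centring estimator, assumed convention)."""
    def center(d):
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()
    a, b = center(dx), center(dy)
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0
    return float(np.sqrt(max((a * b).mean(), 0.0) / denom))

def dcor_permutation_pvalue(dx, dy, n_perm=999, rng=None):
    """p-value for independence: rank of the observed dCor among re-coupled samples."""
    rng = np.random.default_rng(rng)
    observed = _dcor(dx, dy)
    count = 0
    for _ in range(n_perm):
        p = rng.permutation(dx.shape[0])
        # Relabelling the second sample permutes rows and columns together,
        # which changes the coupling but leaves both marginals untouched.
        if _dcor(dx, dy[np.ix_(p, p)]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

The `+ 1` in numerator and denominator counts the observed coupling as one of the permutations, so the returned value is never exactly zero.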

Another non-parametric method is parameter estimation using nearest neighbours. One method for estimating a real valued parameter which is unknown on a particular sample, but is known on a training set, is to take a weighted average of the values of the parameter on the training set with the weighting dependent on the pairwise distances from the sample of interest to those in the training set. We would expect better estimation when the distance correlation between the samples and the parameter of interest is high. Future directions for research can include experimental and theoretical results along these lines with respect to topological summaries. In particular, we would expect that we should be able to create confidence intervals for the parameter, dependent on the distance correlation. This is also an area where we should expect similar statistical analysis when the samples have high distance correlation.
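The weighted nearest-neighbour estimate described here can be sketched as below. The choice of inverse-distance weights, the number of neighbours `k` and the function name are illustrative assumptions; any weighting that depends only on the pairwise distances would fit the description.

```python
import numpy as np

def knn_estimate(dist_to_train, train_params, k=5, eps=1e-12):
    """Estimate a real-valued parameter from the k nearest training samples.

    dist_to_train[i] is the distance (in any metric on topological summaries)
    from the sample of interest to training sample i.
    """
    dist_to_train = np.asarray(dist_to_train, dtype=float)
    train_params = np.asarray(train_params, dtype=float)
    nearest = np.argsort(dist_to_train)[:k]
    # Inverse-distance weights; eps avoids division by zero for exact matches.
    weights = 1.0 / (dist_to_train[nearest] + eps)
    return float(np.dot(weights, train_params[nearest]) / weights.sum())

estimate = knn_estimate([0.1, 0.1, 5.0, 5.0], [1.0, 3.0, 100.0, 100.0], k=2)
```

In this toy call the two nearest training samples are equally close, so the estimate is the plain average of their parameter values.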

Completely analogous to the above comments, clustering methods using pairwise distances should have similar results when the sets of samples have high distance correlation and future work could explore this with respect to topological summaries.

Acknowledgments

G.S. would like to thank Andreas Prebensen Korsnes of the Norwegian Mapping Authority for going out of his way to facilitate bulk downloads of DEM data before a single region was decided upon for the experiment in section 5.2.

G.S. was supported by Swiss National Science Foundation grant number 200021_172636.

Bibliography

[1] Rushil Anirudh, Vinay Venkataraman, Karthikeyan Natesan Ramamurthy and Pavan Turaga “A Riemannian Framework for Statistical Analysis of Topological Persistence Diagrams” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 68–76
[2] Ulrich Bauer “Ripser” URL: https://github.com/Ripser/ripser
[3] Christophe Biscio and Jesper Møller “The accumulated persistence function, a new useful functional summary statistic for topological data analysis, with a view to brain artery trees and spatial point process applications” In Journal of Computational and Graphical Statistics Taylor & Francis, 2019, pp. 1–20
[4] Peter Bubenik “Statistical topological data analysis using persistence landscapes” In The Journal of Machine Learning Research 16.1 JMLR.org, 2015, pp. 77–102
[5] Peter Bubenik and Paweł Dłotko “A persistence landscapes toolbox for topological statistics” In Journal of Symbolic Computation 78 Elsevier, 2017, pp. 91–114
[6] Mathieu Carrière, Marco Cuturi and Steve Oudot “Sliced Wasserstein kernel for persistence diagrams” In Proceedings of the 34th International Conference on Machine Learning 70 JMLR.org, 2017, pp. 664–673
[7] Mathieu Carrière, Steve Y Oudot and Maks Ovsjanikov “Stable topological signatures for points on 3d shapes” In Computer Graphics Forum 34.5 Wiley Online Library, 2015, pp. 1–12
[8] Denis Boigelot (Wikimedia) “Examples of correlations” In the public domain, 2011 URL: https://commons.wikimedia.org/wiki/File:Correlation_examples2.svg
