A Stable Cardinality Distance for Topological Classification

Vasileios Maroulas; Cassie Putman Micucci; Adam Spannaus

arXiv:1812.01664·stat.ML·November 11, 2019


TL;DR

This paper introduces a new stable distance metric for persistence diagrams that improves topological classification of point cloud data, especially in noisy, sparse materials science applications.

Contribution

It proposes a novel distance measure on persistence diagrams that accounts for cardinality differences and proves its stability, enhancing topological data analysis for material classification.

Findings

01 Effective classification of crystal structures from noisy data

02 The new distance is stable under data perturbations

03 Successful application to synthetic atom probe tomography data

Abstract

This work incorporates topological features via persistence diagrams to classify point cloud data arising from materials science. Persistence diagrams are multisets summarizing the connectedness and holes of given data. A new distance on the space of persistence diagrams generates relevant input features for a classification algorithm for materials science data. This distance measures the similarity of persistence diagrams using the cost of matching points and a regularization term corresponding to cardinality differences between diagrams. Establishing stability properties of this distance provides theoretical justification for the use of the distance in comparisons of such diagrams. The classification scheme succeeds in determining the crystal structure of materials on noisy and sparse data retrieved from synthetic atom probe tomography experiments.

Tables

Table 1. The atomic positions in the APT data are $\mathcal{N}(0,\tau^{2})$ distributed with 67% of the atoms missing. We employ the $d_{p}^{c}$ classifier, where $c$ has been optimized in each noise level case. The accuracy in the 10-fold cross validation is listed in the third column.

$\tau$	$c$-value	Accuracy
0.0	0.01	99%
0.25	0.05	99.4%
0.75	0.03	96.5%
1.0	0.13	96.4%

Equations

$$d_{p}^{c}(X,Y)=\left(\frac{1}{m}\left(\min_{\pi\in\Pi_{m}}\sum_{\ell=1}^{n}\min\left(c,\|x_{\ell}-y_{\pi(\ell)}\|_{\infty}\right)^{p}+c^{p}|m-n|\right)\right)^{\frac{1}{p}},$$

$$d_{p}^{c}(A,A_{i})\geq\left(c^{p}\frac{|A_{i}|-|A|}{|A_{i}|}\right)^{\frac{1}{p}}\geq c\,\frac{|A_{i}|-|A|}{|A_{i}|}.$$

$$\begin{aligned}\|E\|_{\infty}&=\max_{k,l}\big|\|a_{k}-a_{l}\|_{d}+\|a_{l}-a_{k}^{i}\|_{d}-\|a_{l}-a_{k}^{i}\|_{d}-\|a_{k}^{i}-a_{l}^{i}\|_{d}\big|\\&\leq\big|\|a_{k}-a_{l}\|_{d}-\|a_{l}-a_{k}^{i}\|_{d}\big|+\big|\|a_{k}^{i}-a_{l}^{i}\|_{d}-\|a_{l}-a_{k}^{i}\|_{d}\big|\\&\leq\|a_{k}-a_{k}^{i}\|_{d}+\|a_{l}-a_{l}^{i}\|_{d}.\end{aligned}$$

$$M_{d}(\rho)\leq(K_{d}-1)\rho.$$

$$\left(\min_{\pi\in\Pi_{m}}\sum_{\ell=1}^{n}\min\left(c,\|x_{\ell}^{1}-y_{\pi(\ell)}^{1}\|_{\infty}\right)^{p}+c^{p}\,2\,t_{1-\alpha/2,N-2}\,s\sqrt{[1\;\mu](\mathbf{b_{0}}^{T}\mathbf{W}\mathbf{b_{0}})^{-1}[1\;\mu]^{T}+\mu}\right)^{\frac{1}{p}}.$$

$$\log\left(\frac{\pi_{j}}{1-\pi_{j}}\right)=\alpha+\sum_{i=1}^{L}\varphi_{i}(\Sigma_{i}),$$

$$\Sigma_{i}=\left(E_{i,B}^{0},E_{i,B}^{1},\mathrm{Var}_{i,B}^{0},\mathrm{Var}_{i,B}^{1},E_{i,F}^{0},E_{i,F}^{1},\mathrm{Var}_{i,F}^{0},\mathrm{Var}_{i,F}^{1}\right).$$


Full text

A Stable Cardinality Distance for Topological Classification

Vasileios Maroulas, Cassie Putman Micucci, and Adam Spannaus

Department of Mathematics, University of Tennessee, Knoxville, TN 37996


Key words and phrases:

Stability, Classification, Persistent Homology, Persistence Diagrams, Crystal Structure of Materials

This work has been partially supported by the ARO Grant W911NF-17-1-0313, and the NSF DMS-1821241.

1. Introduction

A crucial first step in understanding properties of a crystalline material is determining its crystal structure. For highly disordered metallic alloys, such as high-entropy alloys (HEAs), atom probe tomography (APT) gives a snapshot of the local atomic environment. APT has two main drawbacks: experimental noise and missing data. Approximately 65% of the atoms in a sample are not registered in a typical experiment, and those atoms that are captured have their spatial coordinates corrupted by experimental noise. As noted by [21] and [31], APT has a spatial resolution of approximately the length of the unit cell we consider, as seen in Fig. 1. Hence the process is unable to see the finer details of a material, making the determination of a lattice structure a challenging problem. Existing algorithms for detecting the crystal structure [8, 18, 19, 22, 32, 37] are not able to establish the crystal lattice of an APT dataset, as they rely on symmetry arguments. Consequently, the field of atom probe crystallography, i.e., determining the crystal structure from APT data, has emerged in recent years [15] and [32]. These algorithms rely on knowing the global lattice structure a priori and aim to determine local small-scale structures within a larger sample. For some materials this information is readily known; for others, such as HEAs, the global structure is unknown and must be inferred. A recent work by [40] proposes a machine-learning approach to classifying crystal structures of a noisy and sparse materials dataset, without knowing the global structure a priori. The authors employ a convolutional neural network for classifying the crystal structure by looking at a diffraction image, a computer-generated diffraction pattern. The authors suggest their method could be used to determine the crystal structure of APT data or other noisy and sparse data from materials science. However, the synthetic data considered in [40] is not a realistic representation of experimental APT data, where about 65% of the data is missing [35] and the remainder is corrupted by more observational noise [31]. Most importantly, their synthetic data is either sparse or noisy, not a combination of both. We consider a combination of noise and sparsity, as is the case in real APT data.

In this work, we provide a machine learning approach to classify the crystal structure of a noisy and sparse materials dataset. Specifically, we consider materials that are either body-centered cubic (BCC) or face-centered cubic (FCC), as these lattice structures are the essential building blocks of HEAs [39] and have fundamental differences that set them apart in the case of noise-free, complete materials data. The BCC structure has a single atom in the center of the cube, while the FCC has a void in its center but has atoms on the center of the cubes’ faces, see Fig. 1. These two crystal structures are distinct when viewed through the lens of Topological Data Analysis (TDA). Differentiating between the holes and connectedness of these two lattice structures allows us to create an accurate classification rule. This fundamental distinction between BCC and FCC point clouds is captured well by topological methods and explains the high degree of accuracy in the classification scheme presented herein. TDA provides input features for machine learning algorithms, as well as a useful toolbox for classification. Several authors have used TDA on real-world problems, see [4, 12, 24, 26, 27, 28, 38, 41] and the references therein. Persistent homology, which measures changes in topological features over different scales, is the main framework considered by these authors.

Persistent homology is applicable to classification problems as it studies and differentiates holes within data as viewed in different dimensions, e.g., the space enclosed by a loop is a one-dimensional hole. Overall, persistent homology provides a summary of the connectedness and holes (empty space in atomic cells) of data, which indirectly gives information about the shape of the data as well. Indeed, persistent homology records when different homological features emerge and vanish in the data. This analysis quantifies the significance of a homological feature and provides a tool to contend with noisy data. The appearance and disappearance of each homological feature is calculated and recorded in a persistence diagram. Persistence diagrams yield topological summaries of the persistent homology of a dataset and are rich sources of detail about underlying topological features. The diagrams could be used in distance-based classifiers [5, 25] or vectorized and input into standard classification algorithms, such as support vector machines [1, 3].

Distances on the space of persistence diagrams yield a means of comparison between diagrams. The Wasserstein and bottleneck distances compute the cost of the optimal matching between the points in each persistence diagram, while allowing matching to additional points on the diagonal to allow for cardinality differences and to prove stability properties as in [9]. Motivated by [25], we consider here the $d_{p}^{c}$ distance, a distance on the space of persistence diagrams. This distance employs the cardinality of the persistence diagrams, as well as distances between points in the diagrams. It calculates the cost of an optimal matching between the persistence diagrams without any points added to the diagonal. A regularization term then considers the cardinality differences between persistence diagrams.

The stability of the $d_{p}^{c}$ distance is also verified in this paper. This property guarantees that when the distances between point clouds go to zero, the distances between the associated persistence diagrams go to zero as well. Another formulation of this stability is given in [7]; using a related approach, we show continuity of the mapping of point cloud to persistence diagram under the $d_{p}^{c}$ distance. This analysis provides insight into how the cardinality of the diagrams changes with the size of the input point clouds. Additionally, using statistics on the diagram’s cardinality generates corresponding prediction intervals, which give probabilistic bounds on the $d_{p}^{c}$ distances between persistence diagrams. The idea is that point clouds generated from the same process have small variability with respect to cardinality of the persistence diagrams.

The contributions of this work are:

(1) The stability of the $d_{p}^{c}$ distance in a continuous fashion.

(2) Theoretical and statistical bounds on the number of 1-dim holes represented in a persistence diagram based on the cardinality of the underlying point cloud.

(3) A $d_{p}^{c}$ distance based classification algorithm for the crystal structure of high entropy alloys using synthetic atom probe tomography experiments.

The work is organized as follows. Relevant definitions and concepts necessary for persistent homology are presented in Section 2. Stability results of the $d_{p}^{c}$ distance are in Section 3, as well as prediction interval bounds. Section 4 demonstrates a classification scheme for materials science data retrieved from synthetic APT experiments. We conclude and provide future directions in Section 5.

2. Persistent Homology Background

This section succinctly explains the construction of persistence diagrams, which are topological summaries of the underlying space. The Vietoris-Rips complex provides the necessary computational link between the point cloud, a subset of $\mathbb{R}^{d}$ under the Euclidean distance, and its persistence diagram. Below we give a brief summary of the necessary background. For a detailed treatment, see [11].

Definition 1.

A $\nu$ -simplex is the convex hull of an affinely independent point set of size $\nu+1$ .

Definition 2.

For a set of points $\mathcal{P}$ , an abstract simplicial complex $\sigma$ is a collection of finite subsets of $\mathcal{P}$ such that for every set $A$ in $\sigma$ and every nonempty set $B\subset A$ , we have that $B$ is in $\sigma$ . The elements of $\sigma$ are called abstract simplices and are the combinatorial analogues of the geometric simplices in Def. 1.
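As a small illustration of Definition 2, the following Python sketch checks the closure property (every nonempty subset of a member is again a member) on two hand-built collections. The function name `is_abstract_simplicial_complex` and both example collections are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def is_abstract_simplicial_complex(sigma):
    """Return True if every nonempty proper subset of each set in sigma is in sigma."""
    simplices = {frozenset(s) for s in sigma}
    for simplex in simplices:
        # Proper nonempty subsets have sizes 1, ..., len(simplex) - 1.
        for size in range(1, len(simplex)):
            for face in combinations(simplex, size):
                if frozenset(face) not in simplices:
                    return False
    return True


# A filled triangle on the vertices 0, 1, 2: all faces are present.
filled_triangle = [{0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}]
# The same triangle with two of its edges left out: not a complex.
missing_edges = [{0}, {1}, {2}, {0, 1}, {0, 1, 2}]
```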

Definition 3.

For a given threshold $\epsilon$, the Vietoris-Rips complex of a point set is the simplicial complex that contains a $\nu$-simplex for each subset of $\nu+1$ points of the set whose pairwise distances are all at most $\epsilon$.

The Vietoris-Rips complex can be visualized by placing a ball of radius $\epsilon/2$ at each point in the set and then adding a $\nu$-simplex whenever the $\nu+1$ balls centered at its vertices intersect pairwise. See Fig. 2 for an illustration. For the Vietoris-Rips complex corresponding to $\epsilon$, denoted by $VR_{\epsilon}$, it is clear that $VR_{\epsilon}\subset VR_{\epsilon^{\prime}}$ for $\epsilon<\epsilon^{\prime}$. Thus we need only examine specific $\epsilon$ values corresponding to the emergence and disappearance of homological features. These $\epsilon$ values are recorded as ordered pairs $(b,d)$ in a persistence diagram, where $b$ denotes the birth of a feature and $d$ its death.
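The construction can be sketched in a few lines of Python. This is a brute-force illustration under the convention that a $\nu$-simplex is spanned by $\nu+1$ points whose pairwise Euclidean distances are at most $\epsilon$. The function name `vietoris_rips`, the `max_dim` cutoff, and the unit-square example are assumptions made here for illustration; practical computations would normally use a dedicated persistent homology library.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, eps, max_dim=2):
    """Simplices (as index tuples) of the Vietoris-Rips complex at threshold eps.

    A set of nu + 1 points spans a nu-simplex when all of its pairwise
    Euclidean distances are at most eps. Only simplices up to dimension
    max_dim are enumerated.
    """
    n = len(points)
    close = [[dist(points[i], points[j]) <= eps for j in range(n)] for i in range(n)]
    simplices = []
    for dim in range(max_dim + 1):
        for idx in combinations(range(n), dim + 1):
            if all(close[i][j] for i, j in combinations(idx, 2)):
                simplices.append(idx)
    return simplices


# Corners of the unit square: sides have length 1, diagonals length sqrt(2).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
small = vietoris_rips(square, eps=1.0)   # 4 vertices and 4 edges: a loop
large = vietoris_rips(square, eps=1.5)   # diagonals and triangles fill the loop
```

At `eps=1.0` the four sides form a loop with no triangle to fill it, which is the kind of 1-dim feature a persistence diagram records; at `eps=1.5` the diagonals appear and the loop is filled, illustrating the nesting $VR_{\epsilon}\subset VR_{\epsilon^{\prime}}$.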

As can be seen in Fig. 2, a 0-dim homological feature is a connected component of a simplex, a 1-dim homological feature is a hole, such as those created by a loop or the circle $S^{1}$ , and a 2-dim homological feature describes voids, e.g., the inside of a sphere; see [38] for details. Higher dimensional data analogously yields higher dimensional holes.

Remark 1.

Persistence diagrams can also be computed using a pertinent function $g$ from a topological space to $\mathbb{R}$ . Such a function can act as an approximation to a point cloud; typical functions used are kernel density estimators as in [14] and the distance to measure function as in [6]. Homological features are born and die within the sublevel sets $g^{-1}(-\infty,t]$ as $t$ increases. These birth and death times create another persistence diagram, see Fig. 2(f).

To calculate the similarity between diagrams for classification problems, a distance on the space of persistence diagrams is needed. A typical distance is the Wasserstein distance.

Definition 4.

The $p$ -Wasserstein distance between two persistence diagrams $X$ and $Y$ is given by $W_{p}(X,Y)=\left(\inf_{\eta:X\to Y}\sum_{x\in X}\|x-\eta(x)\|_{\infty}^{p}\right)^{\frac{1}{p}}$ , where the infimum is taken over all bijections $\eta$ , and the points of the diagonal are added with infinite multiplicity to each diagram. If $p\to\infty$ , then $W_{\infty}(X,Y)\;=\inf_{\eta:X\to Y}\sup_{x\in X}\|x-\eta(x)\|_{\infty}$ is the bottleneck distance between diagrams $X$ and $Y$ .
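For very small diagrams, Definition 4 can be evaluated by exhaustive search. The sketch below uses a common reduction, assumed here rather than taken from the paper: each diagram is padded with as many copies of the diagonal as the other diagram has points, so that a bijection always exists, and the $\ell^{\infty}$ distance from $(b,d)$ to the diagonal is $(d-b)/2$. The function name `wasserstein_bruteforce` and the example diagrams are hypothetical, and the factorial cost makes this suitable only for a handful of points.

```python
from itertools import permutations


def wasserstein_bruteforce(X, Y, p=2):
    """p-Wasserstein distance between two small diagrams of (birth, death) pairs."""
    n, m = len(X), len(Y)
    size = n + m
    if size == 0:
        return 0.0

    def cost(i, j):
        # Indices below n (resp. m) are real points of X (resp. Y);
        # the remaining indices stand for copies of the diagonal.
        if i < n and j < m:
            return max(abs(X[i][0] - Y[j][0]), abs(X[i][1] - Y[j][1]))
        if i < n:
            return (X[i][1] - X[i][0]) / 2.0   # point of X sent to the diagonal
        if j < m:
            return (Y[j][1] - Y[j][0]) / 2.0   # point of Y sent to the diagonal
        return 0.0                              # diagonal matched to diagonal

    best = min(
        sum(cost(i, perm[i]) ** p for i in range(size))
        for perm in permutations(range(size))
    )
    return best ** (1.0 / p)


# Made-up diagrams: X has one extra short-lived point near the diagonal.
X = [(0.0, 1.0), (0.0, 0.2)]
Y = [(0.0, 1.1)]
w2 = wasserstein_bruteforce(X, Y, p=2)
```

In this example the extra point of `X` is matched to the diagonal at cost 0.1, so its presence contributes only through its short lifetime and not through the difference in cardinality.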

The Wasserstein distance yields the penalty of matched points under the optimal bijection. Points can be matched to the diagonal of each persistence diagram, which is assumed to have infinitely many points with infinite multiplicity; this ensures that a bijection between $X$ and $Y$ actually exists, since $X$ and $Y$ may not have the same cardinality. In other words, the Wasserstein distance gives no explicit penalty for differences in cardinality between two diagrams. Instead, the Wasserstein distance penalizes unmatched points by using their distance to the diagonal. However, cardinality differences may play a key role in machine learning problems, and to that end, [25] proposed the $d_{p}^{c}$ distance given below.

Definition 5.

Let $X$ and $Y$ be two persistence diagrams with cardinalities $n$ and $m$ respectively such that $n\leq m$ and denoted $X=\{x_{1},\ldots,x_{n}\}$ , $Y=\{y_{1},\ldots,y_{m}\}$ . Let $c>0$ and $1\leq p<\infty$ be fixed parameters. The $d_{p}^{c}$ distance between two persistence diagrams $X$ and $Y$ is

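$$d_{p}^{c}(X,Y)=\left(\frac{1}{m}\left(\min_{\pi\in\Pi_{m}}\sum_{\ell=1}^{n}\min\left(c,\|x_{\ell}-y_{\pi(\ell)}\|_{\infty}\right)^{p}+c^{p}|m-n|\right)\right)^{\frac{1}{p}},\tag{1}$$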

where $\Pi_{m}$ is the set of permutations of $(1,\dots,m)$ . If $m<n$ , define $d_{p}^{c}(X,Y):=d_{p}^{c}(Y,X)$ .
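A direct, brute-force reading of Definition 5 is sketched below for small diagrams. Only the first $n$ entries of a permutation of $(1,\dots,m)$ enter the sum, so the code enumerates injective assignments of the smaller diagram into the larger one. The function name `dpc_distance`, the convention for two empty diagrams, and the example diagrams are assumptions made for illustration; a practical implementation would replace the enumeration with an assignment solver.

```python
from itertools import permutations


def dpc_distance(X, Y, c, p=2):
    """Brute-force d_p^c distance between two small diagrams (Definition 5)."""
    if len(X) > len(Y):          # the definition assumes n <= m
        X, Y = Y, X
    n, m = len(X), len(Y)
    if m == 0:
        return 0.0               # convention assumed here for two empty diagrams

    def capped(x, y):
        # l-infinity distance between two points, capped at c
        return min(c, max(abs(x[0] - y[0]), abs(x[1] - y[1])))

    # Cost of the best matching of the n points of X to distinct points of Y.
    matching = min(
        sum(capped(X[l], Y[pi[l]]) ** p for l in range(n))
        for pi in permutations(range(m), n)
    )
    # Each of the m - n unmatched points of Y is charged c^p.
    return ((matching + c ** p * (m - n)) / m) ** (1.0 / p)


# Made-up diagrams: Y contains X plus one additional point.
X = [(0.0, 1.0)]
Y = [(0.0, 1.0), (0.2, 0.9)]
d = dpc_distance(X, Y, c=0.5, p=2)
```

Here the matched pair costs nothing, and the whole distance comes from the cardinality term: $d=\left(\tfrac{1}{2}\,c^{2}\right)^{1/2}$ with $c=0.5$.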

Remark 2.

Note that this distance can be applied to arbitrary point clouds with finite cardinality as well. As shown in [25], a smaller $c$ in Eq. (1) accounts for local geometric differences, while a larger $c$ focuses on global geometry. It is precisely by considering differences in cardinality that the $d_{p}^{c}$ distance can distinguish between features of the point cloud that other distances may miss. Also in Eq. (1), if $X$ is fixed and $m\to\infty$ , then $d_{p}^{c}(X,Y)\to c$ .
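The limit stated at the end of Remark 2 can be checked with a short squeeze argument. The derivation below is a sketch that uses only the form of the distance in Definition 5, with $n=|X|$ fixed and $m=|Y|\geq n$.

```latex
% Each of the n matched terms min(c, ||x - y||)^p lies in [0, c^p],
% so the p-th power of the distance is squeezed between two bounds:
\[
  \frac{c^{p}(m-n)}{m}
  \;\le\; d_{p}^{c}(X,Y)^{p}
  \;\le\; \frac{n\,c^{p} + c^{p}(m-n)}{m} \;=\; c^{p}.
\]
% The left-hand side equals c^p (1 - n/m), which tends to c^p as m -> infinity,
% hence d_p^c(X,Y) -> c.
```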

3. Stability Properties for $d_{p}^{c}$ distance

The stability of the $d_{p}^{c}$ distance is proved in this section. Stability of the distance under investigation means that small perturbations in the underlying space result in small perturbations of the generated persistence diagrams. Adopting the approach of estimating a point cloud via a pertinent function, e.g., a kernel density estimator as in [14], persistence diagrams may be constructed using sublevel sets as in Fig. 2(f) and Remark 1. Their differences can be computed using the Wasserstein and bottleneck distances. Using this functional representation, stability of the Wasserstein and bottleneck distances has been shown in [10] and [9] respectively, by verifying Lipschitz (and respectively Hölder) continuity of the mapping from the underlying function of the data to its persistence diagram in the bottleneck and Wasserstein distances. Considering discrete point clouds whose distances shrink to zero, Theorem 1 shows that the distance between persistence diagrams goes to zero as well.

Theorem 1 (Stability Theorem).

Consider $c>0$ and $1\leq p<\infty$. Let $A$ be a finite nonempty point cloud in $\mathbb{R}^{d}$. Suppose that $\{A_{i}\}_{i\in\mathbb{N}}$ is a sequence of finite nonempty point clouds such that $d_{p}^{c}(A,A_{i})\to 0$ as $i\to\infty$. Let $X^{k}$ and $X_{i}^{k}$ be the $k$-dim persistence diagrams created from the Vietoris-Rips complex for $A$ and $A_{i}$ respectively. Then $d_{p}^{c}(X^{k},X_{i}^{k})\to 0$ as $i\to\infty$.

Note that Theorem 1 does not depend on a function created from the points such as a kernel density estimator as in [14], but simply on the points themselves and the Vietoris-Rips complex generated from these points. In fact, Theorem 1 shows that the mapping from a point cloud to the persistence diagram of its Vietoris-Rips complex is continuous under the $d_{p}^{c}$ distance. This continuous-type stability result is weaker than Lipschitz stability. In order to prove Theorem 1, we first show that if the $d_{p}^{c}$ distance between the underlying point clouds goes to 0, then eventually the size of the point clouds must be the same.

Lemma 6.

Let $A$ and $A_{i}$ be as in Theorem 1 such that $d_{p}^{c}(A,A_{i})\to 0$ as $i\to\infty$. Then $A_{i}$ and $A$ have the same number of points for $i\geq N_{0}$ for some $N_{0}\in\mathbb{N}$.

Proof.

Denote by $|A|$ the number of points in the point cloud $A$. Suppose that $|A_{i}|\neq|A|$ infinitely often. Since $d_{p}^{c}(A,A_{i})\to 0$, for every $\epsilon>0$, there is an $N\in\mathbb{N}$ such that $i\geq N$ implies that $d_{p}^{c}(A,A_{i})<\epsilon$. Let $\epsilon=\frac{c}{|A|+1}$, noting that $|A|$ is fixed. By assumption $|A_{i}|<|A|$, $|A_{i}|>|A|$, or both, infinitely often. If $|A|<|A_{i}|$, then by Def. 5

$$d_{p}^{c}(A,A_{i})\geq\left(c^{p}\frac{|A_{i}|-|A|}{|A_{i}|}\right)^{\frac{1}{p}}\geq c\,\frac{|A_{i}|-|A|}{|A_{i}|}.\tag{2}$$

The function $h:\mathbb{N}\to\mathbb{R}$ given by $h(z)=\frac{z-|A|}{z}$ is strictly increasing. Whenever $|A|<|A_{i}|$, we have $|A_{i}|\geq|A|+1$. The restriction of $h$ to $\{|A|+1,|A|+2,|A|+3,\ldots\}$ achieves its minimum at $|A|+1$. This shows that the RHS of Eq. (2) is greater than or equal to $\frac{c}{|A|+1}$ whenever $|A|<|A_{i}|$, which by assumption happens infinitely often. This contradicts $d_{p}^{c}(A,A_{i})<\epsilon$ for all $i\geq N$. The case where $|A|>|A_{i}|$ follows similarly. ∎

Lemma 7.

Let $A$ and $A_{i}$ be as in Theorem 1. Suppose the points of each point cloud $A_{i}$ are ordered so that $A_{i}=\{a_{\pi_{i}(1)},a_{\pi_{i}(2)},\ldots,a_{\pi_{i}(|A|)}\}$ , where $\pi_{i}$ is the permutation used to calculate the $d_{p}^{c}$ distance between $A_{i}$ and $A$ as in Eq. (1). Let $D_{A}$ and $D_{A_{i}}$ be the distance matrices for the points of $A$ and $A_{i}$ respectively, i.e., the $kl$ -th entry of $D_{A}$ is $\|a_{k}-a_{l}\|_{d}$ . Then,

(i) $\|D_{A}-D_{A_{i}}\|_{\infty}\to 0$ as $i\to\infty$, and

(ii) for some $N_{1}\in\mathbb{N}$, the order of the entries of the upper triangular portion of $D_{A}$ and $D_{A_{i}}$ is the same for $i\geq N_{1}$, up to permutation when either $D_{A}$ or $D_{A_{i}}$ has duplicate entries.

Proof.

(i) Let $A=\{a_{1},\ldots,a_{k}\}$, $A_{i}=\{a^{i}_{1},\ldots,a^{i}_{k}\}$, and ${\lambda}^{i}_{\alpha}=\|a_{\alpha}-a^{i}_{\pi_{i}(\alpha)}\|_{d}$ for the permutation $\pi_{i}$ in the $d_{p}^{c}$ distance between $A_{i}$ and $A$. Suppose that $d_{p}^{c}(A,A_{i})\to 0$. Since $c$ is fixed, Lemma 6 together with $d_{p}^{c}(A,A_{i})\to 0$ gives some $N_{c}$ such that, for $i\geq N_{c}$, the cardinalities agree and no matched distance is capped at $c$, so that $d_{p}^{c}(A_{i},A)=\left(\frac{1}{|A|}\min_{\pi_{i}\in\Pi_{|A|}}\sum_{\ell=1}^{|A|}\|a_{\ell}-a^{i}_{\pi_{i}(\ell)}\|_{d}^{p}\right)^{\frac{1}{p}}$. By assumption $d_{p}^{c}(A,A_{i})\to 0$, which shows that $|A|^{-\frac{1}{p}}\|\lambda^{i}\|_{p}\to 0$ as $i\to\infty$. Thus $\|\lambda^{i}\|_{p}\to 0$ as $i\to\infty$.

Now, let $E=D_{A}-D_{A_{i}}$ .

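$$\begin{aligned}\|E\|_{\infty}&=\max_{k,l}\big|\|a_{k}-a_{l}\|_{d}+\|a_{l}-a_{k}^{i}\|_{d}-\|a_{l}-a_{k}^{i}\|_{d}-\|a_{k}^{i}-a_{l}^{i}\|_{d}\big|\\&\leq\big|\|a_{k}-a_{l}\|_{d}-\|a_{l}-a_{k}^{i}\|_{d}\big|+\big|\|a_{k}^{i}-a_{l}^{i}\|_{d}-\|a_{l}-a_{k}^{i}\|_{d}\big|\\&\leq\|a_{k}-a_{k}^{i}\|_{d}+\|a_{l}-a_{l}^{i}\|_{d}.\end{aligned}\tag{3}$$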

The last term in Eq. (3) goes to 0 as $i\to\infty$ , proving (i).

(ii) Suppose that the $m$ distinct upper triangular entries of $D_{A}$ are ordered from smallest to largest, say $d_{1}^{A}<d_{2}^{A}<\cdots<d_{m}^{A}$, where $m\leq{|A|(|A|-1)/2}$. For $\eta\in\{1,\ldots,m+1\}$ let $h_{\eta}\subset[0,\infty)$ be a sequence such that $h_{1}<d_{1}^{A}<h_{2}<d_{2}^{A}<\cdots<h_{m}<d_{m}^{A}<h_{m+1}$. Let $\|D_{A}-D_{A_{i}}\|_{\infty}<\frac{h}{2}$, where $h=\min_{\eta_{1},\eta_{2}\in\{1,\ldots,m+1\}}\{|h_{\eta_{1}}-h_{\eta_{2}}|\}$. We show that there exists a sequence $g_{\eta}$ such that $|h_{\eta}-g_{\eta}|<2h$ for each $\eta\in\{1,\ldots,m+1\}$ and $h_{\eta}<d_{j}^{A}<h_{\eta+1}$ implies $g_{\eta}<d_{j}^{A_{i}}\leq g_{\eta+1}$. Let $h_{\eta}<d_{j}^{A}<h_{\eta+1}$, and suppose that it is not the case that $h_{\eta}<d_{j}^{A_{i}}\leq h_{\eta+1}$. Since $\|D_{A}-D_{A_{i}}\|_{\infty}<\frac{h}{2}$, either $d_{j}^{A_{i}}\in(h_{\eta-1},h_{\eta}]$ or $d_{j}^{A_{i}}\in(h_{\eta+1},h_{\eta+2}]$. If the first case is true, then take $g_{\eta}=d_{j}^{A}-\frac{h}{2}$. If the second, then take $g_{\eta}=d_{j}^{A}+\frac{h}{2}$. This proves the existence of the sequence. Now proceeding by contradiction, if the lemma does not hold for some entries $d^{A}_{j}\in D_{A}$ and $d^{A_{i}}_{j}\in D_{A_{i}}$, then take $\|D_{A}-D_{A_{i}}\|_{\infty}<\frac{1}{2}|d^{A}_{j}-d^{A_{i}}_{j}|$. ∎
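The estimate behind part (i) of Lemma 7 can be checked numerically: when the points of two clouds are already paired by index, each entry of the difference of the distance matrices is at most the sum of the displacements of the two points involved. The sketch below is only an illustration; the helper names and the perturbed cloud `A_i` are made up, and the index pairing stands in for the permutation $\pi_{i}$.

```python
from math import dist


def distance_matrix(points):
    """Matrix of pairwise Euclidean distances."""
    return [[dist(a, b) for b in points] for a in points]


def max_entry_gap(A, B):
    """Largest absolute difference between corresponding pairwise distances."""
    DA, DB = distance_matrix(A), distance_matrix(B)
    n = len(A)
    return max(abs(DA[k][l] - DB[k][l]) for k in range(n) for l in range(n))


def displacement_bound(A, B):
    """Largest value of ||a_k - b_k|| + ||a_l - b_l|| over index pairs (k, l)."""
    moves = [dist(a, b) for a, b in zip(A, B)]
    n = len(A)
    return max(moves[k] + moves[l] for k in range(n) for l in range(n))


A = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
A_i = [(0.01, 0.0), (1.0, -0.02), (0.0, 2.03)]   # A with each point nudged slightly
gap = max_entry_gap(A, A_i)
bound = displacement_bound(A, A_i)
```

As the displacements shrink, `bound` shrinks with them, which forces `gap` to zero as well.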

Proof of Theorem 1.

By Lemma 6, take $|A_{i}|=|A|$ without loss of generality. By Lemma 7 (i), $\|D_{A}-D_{A_{i}}\|_{\infty}\to 0$ as $i\to\infty$ . If the Vietoris-Rips complex were computed at every threshold value in $[0,\infty)$ , then the birth and death times of all features of all dimensions would be distances between points in the underlying point cloud (including the birth time of 0 in the 0-dim diagram). Since the order of the entries of $D_{A}$ and $D_{A_{i}}$ may be taken to be the same from Lemma 7 (ii), the same number of simplices are formed in the complexes for $A$ and $A_{i}$ for each dimension of simplex. Also, the labels of the simplices according to the points of $A$ and $A_{i}$ are given from the permutation $\pi_{i}$ in Lemma 7 (i).

Now, for 0-dim it is clear that for the cardinalities of the persistence diagrams, $|X^{0}|=|X_{i}^{0}|$ since for the sizes of their associated point clouds, $|A_{i}|=|A|$ . For a higher dimensional feature ( $k\geq 1$ ) to appear in the complex, we must have that a certain number of the distances are less than or equal to the threshold $\epsilon$ and a certain number of the distances are greater than $\epsilon$ . Lemma 7 (ii) shows that although the thresholds where the features are created may be different, the same number of features are formed in the Vietoris-Rips complexes of $A$ and $A_{i}$ , and these features are formed in the same order and with the points that correspond under $\pi_{i}$ .

If $X^{k}=\{x_{1},x_{2},\dots,x_{|X^{k}|}\}$ and $X_{i}^{k}=\{x^{i}_{1},x^{i}_{2},\dots,x^{i}_{|X_{i}^{k}|}\}$, then we have that $|X^{k}|=|X_{i}^{k}|$ and that $d_{p}^{c}(X^{k},X_{i}^{k})<2h$. Thus $d_{p}^{c}(X^{k},X_{i}^{k})\to 0$ as $i\to\infty$. ∎

To provide a practical way to control $c$ in computing the $d_{p}^{c}$ distance of Eq. (1) and consequently compute the possible fluctuations of the $d_{p}^{c}$ distance, a probabilistic upper bound, which relies on least squares, is provided. Specifically, the following analysis gives predictions on the number of 1-dim holes represented in the persistence diagram, which we denote by $b_{1}$ . The parameter $b_{1}$ relies on the number of connected components (or equivalently the number of points in the point cloud) represented in the persistence diagram, denoted by $b_{0}$ .

Definition 8 ([33]).

The kissing number in $\mathbb{R}^{d}$ is the maximum number of nonoverlapping unit spheres that can be arranged so that each touches another common central unit sphere.

Lemma 9 ([17]).

For a finite point cloud with no more than $\rho$ points in $\mathbb{R}^{d}$ under the Euclidean distance, let $M_{d}(\rho)$ denote the maximum possible number of 1-dim holes in the Vietoris-Rips complex for the point cloud for a given threshold. Then

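$$M_{d}(\rho)\leq(K_{d}-1)\rho,$$

where $K_{d}$ denotes the kissing number in $\mathbb{R}^{d}$ from Definition 8.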

Proposition 10.

Consider a point cloud in $\mathbb{R}^{d}$ with $\rho$ points and its associated persistence diagram. Let $B_{1}$ denote the possible range of the number of 1-dim holes $b_{1}$ . Then $B_{1}$ is such that $\{0,1,\ldots,\lfloor\frac{\rho}{2}\rfloor-1\}\subseteq B_{1}\subseteq\{0,1,\ldots,\frac{1}{2}(K_{d}-1)\rho^{2}(\rho-1)\},$ i.e., the possible range of $b_{1}$ is expanding as the number of points, $b_{0}$ , in the point cloud increases.

Proof.

We first show the inclusion $\{0,1,\ldots,\lfloor\frac{\rho}{2}\rfloor-1\}\subseteq B_{1}$ . To form a point cloud with $\rho$ points that has $b_{1}=0$ , simply take the $\rho$ points and arrange them on a line. To form a point cloud with $\rho$ points that has $b_{1}=\lfloor\frac{\rho}{2}\rfloor-1$ , arrange the $\rho$ points in two rows each with $\lfloor\frac{\rho}{2}\rfloor$ points. Set the spacing between adjacent points in each of the rows to be 1 and then place the two rows directly beside each other so that for each point in the first row, there is exactly one point in the second row at a distance of 1. Adding edges appropriately creates $b_{1}=\lfloor\frac{\rho}{2}\rfloor-1$ squares with side length 1. Thus, creating the Vietoris-Rips complex and corresponding diagram gives $b_{1}=\lfloor\frac{\rho}{2}\rfloor-1$ . For an illustration of the arrangement, see Fig. 4(a).

To form a point cloud with $\rho$ points that has $b_{1}\in\{1,2,\dots\lfloor\frac{\rho}{2}\rfloor-2\}$ , arrange $2(b_{1}+1)$ points in two rows as in Fig. 4(a). Arrange the other $\rho-2(b_{1}+1)$ points in a line with the minimum distance from any points in the line to points of the two rows such that it is greater than or equal to 1. Then exactly $b_{1}$ holes are formed from the two rows, with no holes formed by the line. For an illustration, see Fig. 4(b).

Next, we verify the inclusion $B_{1}\subseteq\{0,1,\ldots,\frac{1}{2}(K_{d}-1)\rho^{2}(\rho-1)\}$. By Lemma 9, the number of 1-dim holes in the Vietoris-Rips complex for a fixed radius $\epsilon$ for the point cloud is bounded above by $(K_{d}-1)\rho$. The homology of the Vietoris-Rips complex changes at most ${\rho\choose 2}$ times as the radius $\epsilon$ increases, since there are at most ${\rho\choose 2}$ distinct distances between points in the point cloud. Therefore, there can be at most $(K_{d}-1)\rho\cdot{\rho\choose 2}=\frac{1}{2}(K_{d}-1)\rho^{2}(\rho-1)$ 1-dim holes formed over the entire evolution of the Vietoris-Rips complex. This gives the desired bound of $b_{1}\leq\frac{1}{2}(K_{d}-1)\rho^{2}(\rho-1)$. ∎
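To get a feel for how far apart the two ends of Proposition 10 are, the sketch below evaluates both for a given number of points. The kissing numbers used ($K_{1}=2$, $K_{2}=6$, $K_{3}=12$, $K_{4}=24$) are the known values in low dimensions; the function names and the lookup table are assumptions made here for illustration.

```python
# Known kissing numbers in dimensions 1 through 4.
KISSING_NUMBER = {1: 2, 2: 6, 3: 12, 4: 24}


def guaranteed_max_b1(rho):
    """Largest b_1 that Proposition 10 shows is attainable: floor(rho / 2) - 1."""
    return rho // 2 - 1


def upper_bound_b1(rho, d):
    """Upper bound (1/2) (K_d - 1) rho^2 (rho - 1) on b_1 from Proposition 10."""
    # rho^2 (rho - 1) is always even, so the integer division is exact.
    return (KISSING_NUMBER[d] - 1) * rho ** 2 * (rho - 1) // 2


rho = 10
attainable = guaranteed_max_b1(rho)    # two rows of five points give 4 squares
ceiling = upper_bound_b1(rho, d=3)     # 11 * 100 * 9 / 2
```

For ten points in $\mathbb{R}^{3}$ the attainable range reaches 4 while the upper bound is 4950, so the proposition brackets $b_{1}$ only loosely; what it establishes is that the range grows with the number of points.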

Now, let $N$ point clouds be generated from some process, and $N$ corresponding persistence diagrams be created. For each persistence diagram $X_{i}^{k},k\in\{0,1\},i=1,\ldots,N$ , record the cardinality $b_{0}^{i}$ of the 0-dim diagram and the cardinality $b_{1}^{i}$ of the 1-dim diagram. Let $\bm{b_{0}}\in\mathbb{R}^{N\times 2}$ be the predictor matrix whose rows are $[1\;b_{0}^{i}]$ and $\bm{b_{1}}\in\mathbb{R}^{N}$ be the vector of responses with entries $b_{1}^{i}$ . Proposition 10 gives that the possible range of $\bm{b_{1}}$ is increasing as $\bm{b_{0}}$ grows, which yields that an increase in variance as $\bm{b_{0}}$ grows may be present, i.e., heteroscedasticity exists. Thus the analysis of the change in number of 1-dim holes as the size of the point cloud changes needs to account for heteroscedasticity in order to capture the non-constant variance behavior. Therefore to estimate the number of 1-dim holes, we use weighted least squares as in [13]. If $\mathbf{W}\in\mathbb{R}^{N\times N}$ is the weight matrix $\mathbf{W}=\textrm{diag}(a_{1},\ldots,a_{N})$ , then a weighted least-squares regression can be found for $\bm{b_{1}}=\bm{b_{0}}\bm{\gamma}+\bm{\epsilon}$ , where $\epsilon_{i}\sim\mathcal{N}(0,\sigma_{i}^{2})$ . The approximation is then given by $\mathbf{b_{0}}\bm{\hat{\gamma}}=\bm{b_{1}}$ , with $\hat{\bm{\gamma}}=(\mathbf{b_{0}}^{T}\mathbf{W}\mathbf{b_{0}})^{-1}\mathbf{b_{0}}^{T}\mathbf{W}\bm{b_{1}}$ . In turn, Proposition 11 provides bounds from prediction intervals using weighted least squares for the $d_{p}^{c}$ distance.
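For the two-parameter model used here, the weighted least-squares estimate $\hat{\bm{\gamma}}=(\mathbf{b_{0}}^{T}\mathbf{W}\mathbf{b_{0}})^{-1}\mathbf{b_{0}}^{T}\mathbf{W}\bm{b_{1}}$ reduces to solving a $2\times 2$ system, which the following sketch does in closed form. The function name and the data are made up for illustration, and the weights $1/b_{0}^{i}$ are an assumption chosen to mirror the weight $w^{*}=1/b_{0}^{*}$ that appears in the proof of Proposition 11.

```python
def weighted_least_squares(b0, b1, weights):
    """Fit b1 ~ gamma0 + gamma1 * b0 by weighted least squares.

    Returns (gamma0, gamma1), the solution of the 2 x 2 normal equations
    (B^T W B) gamma = B^T W b1, where the rows of B are [1, b0_i].
    """
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, b0))
    swxx = sum(w * x * x for w, x in zip(weights, b0))
    swy = sum(w * y for w, y in zip(weights, b1))
    swxy = sum(w * x * y for w, x, y in zip(weights, b0, b1))
    det = sw * swxx - swx * swx
    gamma0 = (swxx * swy - swx * swxy) / det
    gamma1 = (sw * swxy - swx * swy) / det
    return gamma0, gamma1


# Made-up cardinalities lying exactly on the line b1 = 2 + 0.1 * b0.
b0_sizes = [100, 200, 300, 400]
b1_counts = [12, 22, 32, 42]
w = [1.0 / x for x in b0_sizes]        # down-weight the larger, noisier clouds
gamma0, gamma1 = weighted_least_squares(b0_sizes, b1_counts, w)
```

Because the example data are exactly linear, the fit recovers the intercept 2 and slope 0.1 up to rounding; with real diagrams the weights would matter, since they control how much the high-variance large clouds influence the fit.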

Proposition 11.

Suppose $N$ point clouds are generated from a process, and $N$ corresponding persistence diagrams are created. For each persistence diagram $X_{i}^{k},k\in\{0,1\}$ , record the cardinality of the 0-dim diagram $b_{0}^{i}$ and of the 1-dim diagram $b_{1}^{i}$ . Let $\bm{b_{0}}\in\mathbb{R}^{N\times 2}$ be the predictor matrix whose rows are $[1\;b_{0}^{i}]$ and $\bm{b_{1}}\in\mathbb{R}^{N}$ be the vector of responses of $b_{1}^{i}$ . Assume the model $\bm{b_{1}}=\mathbf{b_{0}}\bm{\gamma}+\bm{\epsilon}$ , where $\epsilon_{i}\sim\mathcal{N}(0,\sigma_{i}^{2})$ depends on the value of the input $b_{0}^{i}$ . Let $X^{1}$ and $Y^{1}$ be persistence diagrams generated from the same process as $\bm{b_{0}}$ with $|X^{0}|=\mu$ . Considering the $(1-\alpha)\cdot 100\%$ -level prediction interval for $\bm{b_{1}}$ , the distance $d_{p}^{c}(X^{1},Y^{1})$ is bounded above by

[TABLE]

Proof.

Prediction intervals can be constructed for the cardinality of a 1-dim diagram for a point cloud of size ${b_{0}}^{*}$ using standard results on weighted least squares. Specifically, a $(1-\alpha)\cdot 100\%$ -level prediction interval for the new response ${b_{1}}^{*}$ is sought. The predicted mean response is $\widehat{b_{1}}^{*}=[1\;{b_{0}}^{*}]\widehat{\bm{\gamma}}$ , and the standardized prediction error satisfies $\frac{\widehat{b_{1}}^{*}-{b_{1}}^{*}}{\sqrt{\widehat{\operatorname{\mathrm{Var}}}(\widehat{b_{1}}^{*}-{b_{1}}^{*})}}\sim t_{N-2},$ where $\widehat{\operatorname{\mathrm{Var}}}$ denotes the variance below with $\operatorname{\mathrm{Var}}(\bm{\epsilon})$ replaced by its estimator $s^{2}$ . Also, $\operatorname{\mathrm{Var}}(\widehat{b_{1}}^{*}-{b_{1}}^{*})=\operatorname{\mathrm{Var}}(\bm{\epsilon})[1\;{b_{0}}^{*}](\mathbf{b_{0}}^{T}\mathbf{W}\mathbf{b_{0}})^{-1}[1\;{b_{0}}^{*}]^{T}+\frac{\operatorname{\mathrm{Var}}(\bm{\epsilon})}{w^{*}}$ , where $w^{*}=\frac{1}{b_{0}^{*}}$ is the weight corresponding to ${b_{0}}^{*}$ . Prediction intervals for $b_{1}^{*}$ are thus $\widehat{b_{1}}^{*}\pm t_{1-{\alpha/2},N-2}\,s\sqrt{[1\;{b_{0}}^{*}](\mathbf{b_{0}}^{T}\mathbf{W}\mathbf{b_{0}})^{-1}[1\;{b_{0}}^{*}]^{T}+{b_{0}}^{*}},$ where $s^{2}=\frac{\bm{\widehat{\epsilon}}^{T}\mathbf{W}\bm{\widehat{\epsilon}}}{N-2}$ is the unbiased estimator for $\operatorname{\mathrm{Var}}(\bm{\epsilon})$ computed from the residuals $\bm{\widehat{\epsilon}}$ . Thus the cardinality difference term in the calculation of the $d_{p}^{c}$ distance as in Eq. (1) is bounded above by the length of the prediction interval with $(1-\alpha)\cdot 100\%$ -level confidence. Substituting this length into Eq. (1) gives the result. ∎
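The prediction interval used in the proof can be computed as follows. This is a sketch under stated assumptions rather than the authors' code: the weights are taken to be $w_{i}=1/b_{0}^{i}$ (matching $w^{*}=1/b_{0}^{*}$ in the proof), the function name and the toy data are hypothetical, and the Student $t$ quantile is passed in by the caller (for example from a statistical table) so that the example needs only the standard library.

```python
import math


def wls_prediction_interval(b0, b1, b0_new, t_quantile):
    """Prediction interval for the 1-dim cardinality at point cloud size b0_new.

    Assumes weights w_i = 1 / b0_i. t_quantile is the (1 - alpha/2) quantile
    of Student's t distribution with N - 2 degrees of freedom.
    """
    n = len(b0)
    w = [1.0 / x for x in b0]
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, b0))
    swxx = sum(wi * x * x for wi, x in zip(w, b0))
    swy = sum(wi * y for wi, y in zip(w, b1))
    swxy = sum(wi * x * y for wi, x, y in zip(w, b0, b1))
    det = sw * swxx - swx * swx
    g0 = (swxx * swy - swx * swxy) / det
    g1 = (sw * swxy - swx * swy) / det
    # s^2 = eps^T W eps / (N - 2), computed from the weighted residuals.
    resid = [y - (g0 + g1 * x) for x, y in zip(b0, b1)]
    s2 = sum(wi * e * e for wi, e in zip(w, resid)) / (n - 2)
    # Quadratic form [1 x*] (B^T W B)^{-1} [1 x*]^T for the 2x2 case.
    quad = (swxx - 2.0 * swx * b0_new + sw * b0_new ** 2) / det
    half_width = t_quantile * math.sqrt(s2 * (quad + b0_new))
    centre = g0 + g1 * b0_new
    return centre - half_width, centre + half_width


# Hypothetical cardinalities; 3.182 is the tabulated 0.975 quantile of
# Student's t with 5 - 2 = 3 degrees of freedom (a 95% interval).
b0 = [10, 20, 30, 40, 50]
b1 = [3, 6, 6, 10, 10]
lower, upper = wls_prediction_interval(b0, b1, b0_new=25, t_quantile=3.182)
```

The length `upper - lower` is the quantity that the proof substitutes for the cardinality difference term in Eq. (1).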

4. Classification of Materials Data

Here we describe the $d_{p}^{c}$ -distance based classification of crystal structures of high-entropy alloys (HEAs) using atom probe tomography (APT) experiments. Recall that the building blocks of HEAs are either body-centered cubic (BCC) or face-centered cubic (FCC). Topological considerations are a natural fit for this problem since BCC and FCC crystal structures have different atomic configurations within a unit cell. Indeed, the BCC structure has one atom at its center, whereas the FCC structure contains a void there (recall Figs. 1(a) and 1(b)). This distinction is important from the viewpoint of persistent homology.

However, topology alone is insufficient to distinguish accurately between noisy and sparse BCC and FCC lattice structures. Counting the atoms in a unit cell (see Figs. 1(a) and 1(b)) shows that a BCC unit cell has two atoms: one at the center plus one-eighth of each of the eight corner atoms, since each corner atom is shared with the neighboring cells. Similarly, an FCC unit cell has four atoms: the same one-eighth of each of the eight corner atoms plus one-half of each of the six atoms on the cell’s faces. In both cases, the atoms on the faces and at the corner lattice points are shared with the cell’s neighbors and are counted only in the proportion they contribute to the unit cell.
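As a worked check of these counts, each of the eight corner atoms contributes one-eighth to the cell and each of the six face atoms contributes one-half. The symbols $A_{\mathrm{BCC}}$ and $A_{\mathrm{FCC}}$ are introduced here only for illustration and denote the number of atoms attributed to one unit cell:

$$A_{\mathrm{BCC}}=8\cdot\tfrac{1}{8}+1=2,\qquad A_{\mathrm{FCC}}=8\cdot\tfrac{1}{8}+6\cdot\tfrac{1}{2}=4.$$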

Another way to see this difference in cardinality is by plotting the number of connected components against the number of holes for both BCC and FCC crystal structures. Figs. 7(c) and 7(d) show that FCC structures have larger point clouds and, consequently, a greater number of connected components. Observe in Fig. 6 that the numbers of connected components and of 1-dim holes are greater in the FCC diagrams than in the BCC diagrams. Consequently, we must account for more than just homological differences when considering persistence diagrams derived from these atomic neighborhoods. Variability in the size of the underlying point clouds must be considered, as verified in Proposition 11. Given the salient topological and cardinality differences between these two crystal structures, we seek to classify their associated persistence diagrams via these essential differences. To that end, we consider the $d_{p}^{c}$ distance given in Eq. (1).

In the numerical experiments, the point clouds (atomic neighborhoods) are extracted from a sample containing approximately 10,000 atoms. We remove atoms to create sparsity and add Gaussian noise to the larger sample, mirroring the levels found in true APT experimental data. To create these neighborhoods, we consider a fixed volume around each atom in the perturbed sample, and the atoms within that volume are recorded for our classification methodology. Here we consider $N=1,000$ synthetic atomic neighborhoods ( $N_{BCC}=500$ BCC structures and $N_{FCC}=500$ FCC structures) with noise and sparsity levels similar to those found in true APT experiments. Let $\bm{q}=(q_{1},\dots,q_{M})^{T}$ be the atoms’ positions within an atomic neighborhood. Applying the persistent homology machinery of Section 2, one obtains the associated persistence diagram, denoted by $X_{q}$ ; see Fig. 6. For our classification problem, we are interested in the conditional probability, $\widetilde{\pi}_{j}=\mathbb{P}(Y_{i}=j\mid X_{i})$ , of the persistence diagram $X_{i}$ being in class $j$ , for $j=0$ (BCC) or $j=1$ (FCC). To that end, we consider a logistic regression model,

[TABLE]

where $\varphi_{i}$ is some pertinent smooth function, and $\bm{\Sigma}\in\mathbb{R}^{N\times 8}$ is the feature matrix whose $i^{th}$ row is

[TABLE]

For any persistence diagram $X^{k}_{i}$ with $k$ -dimensional homology ( $k=0,1$ ), $\mathbb{E}_{i,B}^{k}=\frac{1}{N_{BCC}}\sum_{j=1}^{N_{BCC}}d_{p}^{c}(X_{i}^{k},X_{j}^{k})$ and $\operatorname{\mathrm{Var}}_{i,B}^{k}=\frac{1}{N_{BCC}-1}\sum_{j=1}^{N_{BCC}}\,(d_{p}^{c}(X_{i}^{k},X_{j}^{k})-\mathbb{E}_{i,B}^{k})^{2}$ respectively yield the average and variance of the distance between $X^{k}_{i}$ and the collection of all BCC persistence diagrams. Similarly, $\mathbb{E}_{i,F}^{k}$ and $\operatorname{\mathrm{Var}}_{i,F}^{k}$ are the average and variance of the distance between $X^{k}_{i}$ and the collection of all FCC persistence diagrams.
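The eight features per diagram (mean and variance of the distance to the BCC and FCC collections, for $k=0$ and $k=1$ ) might be assembled as in the sketch below. Several things here are assumptions made for illustration: the function names, the ordering of the eight entries within a row, and above all `toy_distance`, which is a placeholder that compares cardinalities only and is not the $d_{p}^{c}$ distance of Eq. (1). Any callable implementing $d_{p}^{c}$ could be passed in its place.

```python
from statistics import mean, variance


def distance_features(x, bcc_diagrams, fcc_diagrams, dist):
    """Return [E_B, Var_B, E_F, Var_F] for one diagram x in one homology dimension."""
    feats = []
    for reference in (bcc_diagrams, fcc_diagrams):
        d = [dist(x, y) for y in reference]
        # statistics.variance uses the 1/(n - 1) normalisation, as in the text.
        feats.extend([mean(d), variance(d)])
    return feats


def feature_row(x0, x1, bcc0, bcc1, fcc0, fcc1, dist):
    """Concatenate the 0-dim and 1-dim features into one 8-entry row."""
    return (distance_features(x0, bcc0, fcc0, dist)
            + distance_features(x1, bcc1, fcc1, dist))


def toy_distance(x, y):
    """Placeholder distance (cardinality difference only); NOT d_p^c."""
    return abs(len(x) - len(y))


# Hypothetical diagrams as lists of (birth, death) pairs.
x0 = [(0.0, 1.0), (0.0, 2.0)]
x1 = [(1.0, 1.5)]
bcc0 = [[(0.0, 1.0)], [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]]
fcc0 = [[(0.0, 1.0), (0.0, 2.0)], [(0.0, 1.0)] * 4]
bcc1 = [[], [(1.0, 2.0)]]
fcc1 = [[(1.0, 2.0)], [(1.0, 2.0)] * 3]
row = feature_row(x0, x1, bcc0, bcc1, fcc0, fcc1, toy_distance)
```

Passing the distance as an argument keeps the feature construction independent of the particular metric, which is also how the Wasserstein comparison could reuse the same code.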

We perform 10-fold cross validation on the 1,000 synthetic crystal structures. In other words, the data is divided randomly into 10 folds, and 9 folds of the data are used as a training set. For any unknown crystal structure in the remaining fold, the feature vector of the unknown crystal structure is computed according to Eq. (6) and used as input for the decision tree classifier. Similarly, the other 9 folds are each used once as test sets employing the same procedure. The tree finds the best fit for the features from the additive model in Eq. (5) and returns the class of the unknown structure.
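The 10-fold procedure can be sketched with the standard library alone. The names `k_fold_indices`, `cross_validate`, `fit_threshold` and `predict_threshold` are hypothetical, and the one-feature threshold rule merely stands in for the decision tree so that the example is self-contained. The sketch also assumes the feature rows are already computed, whereas the text computes the features of each held-out structure as part of the procedure.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Randomly split range(n_samples) into n_folds disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def cross_validate(features, labels, fit, predict, n_folds=10, seed=0):
    """Return accuracy (1 - misclassification rate) over all held-out folds."""
    correct = 0
    for held_out in k_fold_indices(len(labels), n_folds, seed):
        test = set(held_out)
        train = [i for i in range(len(labels)) if i not in test]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct += sum(int(predict(model, features[i]) == labels[i])
                       for i in held_out)
    return correct / len(labels)


def fit_threshold(train_x, train_y):
    """Stand-in classifier: midpoint between the two class means of feature 0."""
    m0 = [x[0] for x, y in zip(train_x, train_y) if y == 0]
    m1 = [x[0] for x, y in zip(train_x, train_y) if y == 1]
    return (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2.0


def predict_threshold(threshold, x):
    return int(x[0] > threshold)


# Hypothetical, well-separated data: label 0 ("BCC") and label 1 ("FCC").
features = [[float(i)] for i in range(10)] + [[100.0 + i] for i in range(10)]
labels = [0] * 10 + [1] * 10
accuracy = cross_validate(features, labels, fit_threshold, predict_threshold)
```

Each sample is held out exactly once, so the returned value is the fraction of all samples classified correctly while unseen.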

For our numerical experiments, the persistence diagrams are constructed using the C++ Ripser software, and classification uses the scikit-learn decision tree implementation. The studies [31, 35] estimate that approximately 65% of the data is missing. However, an estimate of the experimental noise is not provided. In fact, as noted by [23, 30], the noise varies between experiments and specimens. Our synthetic data replicates this resolution by drawing the positional noise from a Gaussian distribution [16, 29, 32], $\mathcal{N}(0,\tau^{2})$ , with four different levels of variance to give a more representative approximation of true APT datasets. We compute the $d_{p}^{c}$ distances with $p=2$ , mirroring the usual Euclidean distance, and select $c$ via a grid search for each of the four levels of variance, $\tau^{2},$ in both 0- and 1-dim homology, employing a dataset different from the one used for the classification. In each case, the grid is a geometric sequence of 10 values between $0.01$ and $1$ . The results and the associated algorithmic accuracy are presented in Table 1.
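The grid of candidate values for $c$ can be generated as in the following sketch. It assumes that both endpoints, $0.01$ and $1$, are included in the sequence, which the text does not state explicitly. The function names and the `score` callable (a stand-in for the validation accuracy obtained with a given $c$ ) are hypothetical.

```python
def geometric_grid(low=0.01, high=1.0, num=10):
    """Geometric sequence of num values from low to high, endpoints included."""
    ratio = (high / low) ** (1.0 / (num - 1))
    return [low * ratio ** i for i in range(num)]


def grid_search(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)


c_grid = geometric_grid()


# Placeholder score that peaks near c = 0.1; in practice this would be the
# accuracy of the classifier on a separate dataset for the given c.
def toy_score(c):
    return -abs(c - 0.1)


best_c = grid_search(c_grid, toy_score)
```

Consecutive grid values differ by the constant factor $100^{1/9}\approx 1.67$, so the search covers two orders of magnitude evenly on a logarithmic scale.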

As a comparison, the feature matrix in Eq. (6) is also calculated using the Wasserstein distance with $p=2$ . Moreover, we adopt a counting classifier, which takes only the number of points in an atomic neighborhood as the input feature in the tree classifier. Our $d_{p}^{c}$ classifier successfully dichotomizes these 1,000 persistence diagrams generated by BCC and FCC lattice structures at better than 96% accuracy, where accuracy is measured as (1 - Misclassification rate). The $d_{p}^{c}$ classifier outperforms both the Wasserstein and the counting classifier, see Fig. 8. These results demonstrate that using just the differences in cardinality between the two classes of crystal structures is insufficient to distinguish between them.

As demonstrated in Proposition 11, there is a relationship between the number of connected components, $\mathbf{b_{0}}$ (the number of atoms in this case), and the number of 1-dim homological features, $\mathbf{b_{1}}$ , in the persistence diagrams. Figs. 7(a) and 7(b) demonstrate this relationship, as well as the presence of heteroscedasticity between $\mathbf{b_{0}}$ and $\mathbf{b_{1}}$ , which is also verified by the Breusch-Pagan test [2] with a $p$ -value of $9.3\times 10^{-54}$ for FCC cells and a $p$ -value of $2.01\times 10^{-47}$ for BCC cells. Figs. 7(a) and 7(b) also provide 95% prediction intervals for $\mathbf{b_{1}}$ based on the weighted least squares regression analysis of Proposition 11. Thus, the fine balance between the number of atoms in a neighborhood and the associated topology created by the positions of these atoms in the cubic cell is captured by the $d_{p}^{c}$ distance.
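A heteroscedasticity check of this kind can be sketched as follows. The code implements the studentized (Koenker) form of the Breusch-Pagan statistic, $LM=N R^{2}$ from the auxiliary regression of the squared ordinary least-squares residuals on the predictor. The text does not say which variant was used, so this choice, the function names, and the deterministic toy data are all assumptions. With a single predictor, $LM$ is compared with a chi-square distribution with one degree of freedom, whose upper tail equals $\operatorname{erfc}(\sqrt{LM/2})$ .

```python
import math


def ols_fit(x, y):
    """Ordinary least squares for y ~ a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(x, y):
    a, b = ols_fit(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Studentized Breusch-Pagan statistic and p-value for one predictor."""
    a, b = ols_fit(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = max(len(x) * r_squared(x, sq_resid), 0.0)  # guard against rounding
    p_value = math.erfc(math.sqrt(lm / 2.0))  # chi-square(1) upper tail
    return lm, p_value


# Deterministic toy data: the spread of y around the line grows with x.
x = list(range(1, 51))
sign = [1.0 if i % 2 == 0 else -1.0 for i in x]
y_hetero = [0.5 * xi + 0.1 * xi * s for xi, s in zip(x, sign)]
y_homo = [0.5 * xi + s for xi, s in zip(x, sign)]
lm_hetero, p_hetero = breusch_pagan(x, y_hetero)
lm_homo, p_homo = breusch_pagan(x, y_homo)
```

For the first toy series the squared residuals grow with `x`, so the test rejects constant variance; for the second the spread is constant and it does not.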

5. Conclusions

This work combined statistical learning and topology to classify the crystal structure of high entropy alloys using atom probe tomography (APT) experiments. These APT experiments produce a noisy and sparse dataset, from which we extract atomic neighborhoods, i.e., atoms within a fixed volume forming a point cloud, and apply the machinery of Topological Data Analysis (TDA) to these point clouds. Viewed through the lens of TDA, these point clouds are a rich source of topological information. Indeed, employing persistent homology, we summarized the shape of these atomic neighborhoods and classified their crystal structures as either BCC or FCC. The classifier was based on features derived from the new distance on persistence diagrams, denoted herein by $d_{p}^{c}$ . This distance is different from all other existing distances on persistence diagrams in that it explicitly penalizes differences in cardinality between diagrams.

We proved a stability result for the $d_{p}^{c}$ distance, demonstrating that small perturbations of the underlying point clouds resulted in small changes to the $d_{p}^{c}$ distance. We also provided guidance for the choice of the $c$ parameter by looking at confidence bounds using a function of the cardinalities of the persistence diagrams.

The classification results presented herein could aid materials science researchers by providing a previously unavailable representation of the local atomic environment of high entropy alloys from APT data. The methodology need not be limited to a binary choice between BCC and FCC, e.g., entropy-stabilized oxides [34] are amenable to APT characterizations and our process could be generalized to those materials as well. Moreover, as APT experiments produce datasets on the order of 10 million atoms, materials science research has moved into the realm of big data, and the necessary computational and modelling tools have yet to be developed for this regime according to [20]. The $d_{p}^{c}$ classifier, coupled with our ongoing research of quantifying local atomic distributions as in [36], aims to recover global atomic structure of high entropy alloys.

Acknowledgments

The authors would like to thank the anonymous associate editor and two anonymous reviewers for their insightful comments which substantially improved the manuscript. Moreover, the authors would like to thank Professor David J. Keffer (Department of Materials Science and Engineering at The University of Tennessee) for providing the codes which create the realistic APT datasets and for useful discussions, as well as Professor Kody J.H. Law (School of Mathematics at the University of Manchester) for insightful discussions.

Bibliography


[1] Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., Chepushtanova, S., Hanson, E., Motta, F., and Ziegelmeier, L. Persistence images: A stable vector representation of persistent homology. The Journal of Machine Learning Research 18, 1 (2017), 218–252.
[2] Breusch, T. S., and Pagan, A. R. A simple test for heteroscedasticity and random coefficient variation. Econometrica: Journal of the Econometric Society (1979), 1287–1294.
[3] Bubenik, P. Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research 16, 1 (2015), 77–102.
[4] Carlsson, G., Zomorodian, A., Collins, A., and Guibas, L. J. Persistence barcodes for shapes. International Journal of Shape Modeling 11, 02 (2005), 149–187.
[5] Carriere, M., Cuturi, M., and Oudot, S. Sliced Wasserstein kernel for persistence diagrams. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (2017), JMLR.org, pp. 664–673.
[6] Chazal, F., Cohen-Steiner, D., and Mérigot, Q. Geometric inference for probability measures. Foundations of Computational Mathematics 11, 6 (2011), 733–751.
[7] Chazal, F., de Silva, V., and Oudot, S. Persistence stability for geometric complexes. Geometriae Dedicata 173, 1 (Dec 2014), 193–214.
[8] Chisholm, J. A., and Motherwell, S. A new algorithm for performing three-dimensional searches of the Cambridge Structural Database. Journal of Applied Crystallography 37, 2 (2004), 331–334.


W911NF-17-1-0313, and the NSF DMS-1821241.
