Stable Topological Summaries for Analyzing the Organization of Cells in   a Packed Tissue

N. Atienza; M. J. Jimenez; M. Soriano-Trigueros

arXiv:1902.06467·cs.CV·April 11, 2022

Stable Topological Summaries for Analyzing the Organization of Cells in a Packed Tissue

N. Atienza, M. J. Jimenez, M. Soriano-Trigueros

PDF

TL;DR

This paper applies topological data analysis to segmented epithelial tissue images, introducing stable topological summaries that improve tissue classification and provide new biological insights.

Contribution

It develops a novel set of stable topological summaries for tissue analysis, combining normalization methods and demonstrating their effectiveness in classification and biological interpretation.

Findings

01

Topological summaries improve classification accuracy of tissue images.

02

Normalization ensures stability and invariance of topological features.

03

New biological indicators for tissue development are proposed.

Abstract

We use Topological Data Analysis tools for studying the inner organization of cells in segmented images of epithelial tissues. More specifically, for each segmented image, we compute different persistence barcodes, which codify lifetime of homology classes (persistent homology) along different filtrations (increasing nested sequences of simplicial complexes) that are built from the regions representing the cells in the tissue. We use a complete and well-grounded set of numerical variables over those persistence barcodes, also known as topological summaries. A novel combination of normalization methods for both, the set of input segmented images and the produced barcodes, allows to prove stability results for those variables with respect to small changes in the input, as well as invariance to image scale. Our study provides new insights to this problem, such as a possible novel indicator…

Figures40

Click any figure to enlarge with its caption.

Tables6

Table 1. Table 1: The number of valid cells in each image of the epithelial tissues.

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
cEE	140	206	229	241	385	380	261	187	405	246	327	204	348	270
cNT	666	661	566	574	669	532	420	592	744	527	594	473	704	748	469	834
dNP	513	723	588	525	439	823	1102	533	309	575	302	375
dWL	432	556	485	525	501	936	890	790	977	913	606	835	785	748	622
dWP	748	806	566	415	454	654	752	713	504	430	387	516	419	455	277	257

Table 2. Table 2: Differences between the tissues for 187 187 187 number of cells. A check mark implies the p 𝑝 p -value of that variable is smaller than 0.01 0.01 0.01 in the Dunn test.

187 cells	$\| λ_{1}^{sub} (0.05 N) \|$	$\| λ_{0}^{sup} (0.03 N) \|$	${Poly}_{1}^{sup} (2, 0.02 N)$	$ℓ_{0}^{rips} (0.05 N)$	$P E_{0}^{rips}$
cEE vs cNT	✓	$\times$	$\times$	$\times$	$\times$
cEE vs dNP	✓	$\times$	✓	$\times$	✓
cNT vs dNP	✓	$\times$	✓	$\times$	$\times$
cEE vs dWL	✓	$\times$	✓	✓	✓
cNT vs dWL	$\times$	$\times$	✓	✓	✓
dNP vs dWL	$\times$	$\times$	$\times$	✓	$\times$
cEE vs dWP	✓	✓	✓	✓	✓
cNT vs dWP	$\times$	✓	✓	✓	✓
dNP vs dWP	$\times$	$\times$	$\times$	✓	✓
dWL vs dWP	$\times$	✓	$\times$	$\times$	$\times$

Table 3. Table 3: The p 𝑝 p -values of some variables for the Mann-Whitney U test between dWL and dWP. The number of cells is set in 187 187 187 and 257 257 257 .

dWL vs dWP	$\| λ_{0}^{sup} (0.03 N) \|$	$\| λ_{1}^{sup} (0.02 N) \|$	${Poly}_{1}^{sup} (1, 0.02 N)$
N = 187	0.013	0.01	0.019
N = 257	0.012	0.006	0.005

Table 4. Table 4: Differences between some tissues and their CVT path counterpart. A check mark implies the p 𝑝 p -value of that variable is smaller than 0.01 0.01 0.01 in the Mann-Whitney U test.

257 cells	$\| λ_{0}^{sub} (0.15 N) \|$	$P E_{0}^{rips}$	${Poly}_{0}^{rips} (1, 0.05 N)$	$ℓ_{0}^{rips} (0.10 N)$
cNT vs $C V T_{1}$	✓	✓	✓	✓
dWL vs $C V T_{4}$	$\times$	✓	✓	$\times$
dWP vs $C V T_{5}$	$\times$	✓	$\times$	✓

Table 5. Table 5: Classification of tissues using all TDA variables, all network variables, all variables together, mean and variance of the degree and a combination of these two with | λ 0 sub ( 0.10 N ) | superscript subscript 𝜆 0 sub 0.10 𝑁 |\lambda_{0}^{\textbf{sub}}(0.10N)| . The accuracy for the training/validation sets are displayed. Best results are highlighted.

187 cells

TDA

network

mixed

m & v

m & v &

| λ_{0}^{sub} ​ (0.10 ​ N) |

cEE

85.5/86.7

86.2/87.8

86.1/87.8

89.7/94.3

97.1/98.7

cNT

82.6/83

93.4/93.7

92.2/92.9

89.2/89.7

96.2/97.7

Drosophila

99.7/99.8

100/100

overall

92.9/93.4

95.8/96.1

95.5/96

95.5/96.2

98.6/99

Table 6. Table 6: Classification using Random Forests of the epithelial tissues. On the left, results using the number of neigbors and mean and variance of the degree. It can be seen that | λ 1 sup ( 3 ) | superscript subscript 𝜆 1 sup 3 |\lambda_{1}^{\textbf{sup}}(3)| performs better

257 cells	Network	$\| λ_{1}^{sup} (3) \|$
dWL	63.8/67.4	59.9/59.9
dWP	64.9/66.5	82.6/82.8
global	63.1/65.5	70/71.8

Equations54

K_{1} \subset K_{2} \dots \subset K_{r} = K

K_{1} \subset K_{2} \dots \subset K_{r} = K

f_{s u b} (σ)

f_{s u b} (σ)

f_{s u p} (σ)

f_{r i p s} (σ) = p, q \in σ max d (p, q),

f_{r i p s} (σ) = p, q \in σ max d (p, q),

H_{m} (K) = \frac{ker \partial _{m}}{im \partial _{m + 1}} .

H_{m} (K) = \frac{ker \partial _{m}}{im \partial _{m + 1}} .

c_{a_{i}, a_{j}}^{m} = β_{a_{i}, a_{j - 1}}^{m} - β_{a_{i - 1}, a_{j - 1}}^{m} - β_{a_{i}, a_{j}}^{m} + β_{a_{i - 1}, a_{j}}^{m}

c_{a_{i}, a_{j}}^{m} = β_{a_{i}, a_{j - 1}}^{m} - β_{a_{i - 1}, a_{j - 1}}^{m} - β_{a_{i}, a_{j}}^{m} + β_{a_{i - 1}, a_{j}}^{m}

{([b_{1}, d_{1}), c_{b_{1}, d_{1}}^{m}), \dots, ([b_{l}, d_{l}), c_{b_{l}, d_{l}}^{m})},

{([b_{1}, d_{1}), c_{b_{1}, d_{1}}^{m}), \dots, ([b_{l}, d_{l}), c_{b_{l}, d_{l}}^{m})},

B = {[b_{i}, d_{i})}_{i = 1, \dots n}

B = {[b_{i}, d_{i})}_{i = 1, \dots n}

\displaystyle d_{b}(B_{1},B_{2})=\min\big{\{}\delta|

\displaystyle d_{b}(B_{1},B_{2})=\min\big{\{}\delta|

\displaystyle\text{ between }B_{1}\text{ and }B_{2}\big{\}}

d_{b} (B_{f}, B_{g}) \leq ∣ f - g ∣_{\infty}

d_{b} (B_{f}, B_{g}) \leq ∣ f - g ∣_{\infty}

d_{b} (ξ_{z} B_{1}, ξ_{z} B_{2}) \leq d_{b} (B_{1}, B_{2})

d_{b} (ξ_{z} B_{1}, ξ_{z} B_{2}) \leq d_{b} (B_{1}, B_{2})

d_{b} (ϕ (B_{1}), ϕ (B_{2})) \leq \frac{n _{ma x}}{max { L _{B_{1}} , L _{B_{2}} }} d_{b} (B_{1}, B_{2})

d_{b} (ϕ (B_{1}), ϕ (B_{2})) \leq \frac{n _{ma x}}{max { L _{B_{1}} , L _{B_{2}} }} d_{b} (B_{1}, B_{2})

d_{b} (ϕ (B_{1}), ϕ (B_{2})) \leq \frac{d _{b} ( B _{1} , B _{2} )}{ℓ ~}

d_{b} (ϕ (B_{1}), ϕ (B_{2})) \leq \frac{d _{b} ( B _{1} , B _{2} )}{ℓ ~}

d (α x, α y) = ∣∣ α x - α y ∣∣ = ∣∣ α (x - y) ∣∣ = ∣ α ∣∣∣ x - y ∣∣ = ∣ α ∣ d (x, y)

d (α x, α y) = ∣∣ α x - α y ∣∣ = ∣∣ α (x - y) ∣∣ = ∣ α ∣∣∣ x - y ∣∣ = ∣ α ∣ d (x, y)

ϕ (B_{2}) = \frac{B _{2}}{L ( B _{2} )} = \frac{∣ α ∣ B _{1}}{∣ α ∣ L ( B _{1} )} = ϕ (B_{1})

ϕ (B_{2}) = \frac{B _{2}}{L ( B _{2} )} = \frac{∣ α ∣ B _{1}}{∣ α ∣ L ( B _{1} )} = ϕ (B_{1})

P E (B) = i = 1 \sum n - \frac{ℓ _{i}}{L ( B )} lo g (\frac{ℓ _{i}}{L ( B )})

P E (B) = i = 1 \sum n - \frac{ℓ _{i}}{L ( B )} lo g (\frac{ℓ _{i}}{L ( B )})

|E_{S}(P)-E_{S}(Q)|\leq||P-Q||_{1}\big{(}\log({u})-\log(||P-Q||_{1})\big{)}

|E_{S}(P)-E_{S}(Q)|\leq||P-Q||_{1}\big{(}\log({u})-\log(||P-Q||_{1})\big{)}

∣ P E (A) - P E (B) ∣ \leq - n d_{b} (A, B) lo g (d_{b} (A, B)) .

∣ P E (A) - P E (B) ∣ \leq - n d_{b} (A, B) lo g (d_{b} (A, B)) .

a ⊞ b := max (a, b) a ⊙ b := a + b .

a ⊞ b := max (a, b) a ⊙ b := a + b .

max (a_{1} + a_{1}^{1} x_{1} + \dots + a_{q}^{1} x_{q}, \dots, a_{r} + a_{1}^{r} x_{1} + \dots + a_{q}^{r} x_{q}),

max (a_{1} + a_{1}^{1} x_{1} + \dots + a_{q}^{1} x_{q}, \dots, a_{r} + a_{1}^{r} x_{1} + \dots + a_{q}^{r} x_{q}),

max (a_{1} + a_{1}^{1} ℓ_{1} + \dots + a_{q}^{1} ℓ_{q}, \dots, a_{r} + a_{1}^{r} ℓ_{1} + \dots + a_{q}^{r} ℓ_{q})

max (a_{1} + a_{1}^{1} ℓ_{1} + \dots + a_{q}^{1} ℓ_{q}, \dots, a_{r} + a_{1}^{r} ℓ_{1} + \dots + a_{q}^{r} ℓ_{q})

∣ F (A) - F (B) ∣ \leq C d_{b} (A, B)

∣ F (A) - F (B) ∣ \leq C d_{b} (A, B)

l = \frac{d + b}{2} h = \frac{d - b}{2}

l = \frac{d + b}{2} h = \frac{d - b}{2}

\lambda(l,h)=\left\{\begin{array}[]{l l}\beta_{l-h,l+h}&\text{ if }h\geq 0\\ 0&\text{ otherwise }\end{array}\right.

\lambda(l,h)=\left\{\begin{array}[]{l l}\beta_{l-h,l+h}&\text{ if }h\geq 0\\ 0&\text{ otherwise }\end{array}\right.

λ_{k} (t) = sup (x \geq 0 ∣ β_{t - x, t + x} \geq k)

λ_{k} (t) = sup (x \geq 0 ∣ β_{t - x, t + x} \geq k)

∣ λ_{k}^{A} (t) - λ_{k}^{B} (t) ∣ \leq d_{b} (A, B)

∣ λ_{k}^{A} (t) - λ_{k}^{B} (t) ∣ \leq d_{b} (A, B)

∣∣ λ_{k}^{A} - λ_{k}^{B} ∣ ∣_{1} \leq C t max (∣ λ_{k}^{A} (t) - λ_{k}^{B} (t) ∣) \leq C d_{b} (A, B)

∣∣ λ_{k}^{A} - λ_{k}^{B} ∣ ∣_{1} \leq C t max (∣ λ_{k}^{A} (t) - λ_{k}^{B} (t) ∣) \leq C d_{b} (A, B)

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Stable Topological Summaries for Analyzing the Organization of Cells in a Packed Tissue

Nieves Atienza, Maria-Jose Jimenez, Manuel Soriano-Trigueros

Abstract

We use Topological Data Analysis tools for studying the inner organization of cells in segmented images of epithelial tissues. More specifically, for each segmented image, we compute different persistence barcodes, which codify lifetime of homology classes (persistent homology) along different filtrations (increasing nested sequences of simplicial complexes) that are built from the regions representing the cells in the tissue. We use a complete and well-grounded set of numerical variables over those persistence barcodes, also known as topological summaries. A novel combination of normalization methods for both, the set of input segmented images and the produced barcodes, allows to prove stability results for those variables with respect to small changes in the input, as well as invariance to image scale. Our study provides new insights to this problem, such as a possible novel indicator for the development of the drosophila wing disc tissue or the importance of centroids distribution to differentiate some tissues from their CVT-path counterpart (a mathematical model of epithelia based on Voronoi diagrams). We also show how the use of topological summaries may improve the classification accuracy of epithelial images using Random Forests algorithm.

Department of Applied Mathematics I, University of Seville

{natienza, majiro, msoriano4}@us.es

1 Introduction

Epithelia are packed tissues formed by tightly assembled cells, with almost no intercellular spaces. There are many studies in the literature focused on their natural organization, since changes on it may indicate an early onset of disease [1, 2]. Therefore, being able to quantify differences in the spatial distribution of cells is an interesting problem.

In some of these tissues, their apical surfaces are similar to convex polygons forming a natural tessellation. That allows to identify each cell with a polygon with as many sides as neighboring cells. The study of epithelial organization has been mainly focused on the polygon distributions, that is, the distribution of the number of neighbors (sides) of the cells (polygons). In [3], the authors looked for differences in polygon distributions on two proliferative stages of drosophila wing disc. In later studies, the concept of centroidal Voronoi tessellation (CVT) was used, which is a Voronoi diagram where the point generating each region coincides with its centroid. The Lloyd algorithm is an iterative algorithm that, starting from a random cloud of points, produces a series of Voronoi diagrams, that we will denote by ( $CVT_{1},\,CVT_{2},$ $\,CVT_{3},\ldots)$ , that converge to a CVT [4]. Such a sequence of Voronoi diagrams is called CVT-path in [5]. There, the authors compared the polygon distributions of images of natural packed tissues with those of the CVT-path and showed that the former fit to certain distributions given inside the CVT-path. A different approach was developed in [6], were the authors provided an image analysis tool implemented in the open-access platform FIJI, to quantify epithelial organization based in computational geometry and graph theory concepts. More specifically, considering the contact graph, that is, the graph generated by the cells (vertices) and the cell-to-cell contacts (edges), they searched locally for specific motifs represented by small subgraphs (graphlets) to characterise the tissue.

The previous approaches have two main problems: assuming that the cells are similar to convex polygons and that the analysis is mostly related to local features ignoring other aspects of the contact graph, as for example, what types of polygons are connected among them.

As pointed out by P. Villoutreix in his thesis, the standard topological analysis of complex networks is very limited in this context, since all contact graphs are planar (see [7][Sec. 8.2.2]).

In [7], the author proposed the use of topological data analysis (TDA) as a possible solution to obtain richer information than just the polygon distribution while keeping the stability of the method respect to the inner organization of the tissue.

In this paper we prove these two postulates. In particular, we will formalize the intuitive notion of inner organization of the tissue using its contact graph and its spatial centroid distribution. In addition, we will show that TDA may work when cells are not convex-like. Finally, we add TDA variables to a machine learning workflow to improve the classification of images coming from cellular tissues. We also explain some interpretations of the TDA variables at the tissue level, giving new insights about the organization of the cell.

1.1 Previous topological data analysis approaches

Recall that topology is the branch of mathematics that deals with properties of space that keep invariant under continuous transformations. These properties may be extremely important when the space is a network.

Nowadays, TDA is spreading as a useful approach in very different scientific fields, playing an increasing role in biological and, in general, biomedical imaging. Its main analysis tool, persistent homology [8, 9], has been successfully applied in solving problems such as tumour segmentation [10], analysis of biological networks [11] or diagnostic of chronic obstructive pulmonary disease [12]. Persistent homology studies the evolution of homology classes and their life-times (persistence) in an increasing nested sequence of spaces (called a filtration). A filtration could be thought as a multiscale combinatorial model that represents topological (and somehow, geometric) information of the data. The result of persistent homology computation can be codified as a combinatorial invariant, called the persistence barcode, which plays as a topological signature of the filtration, and therefore of the original data set.

In [7], a model for analyzing the contact graph of the cell tissues using sub and sup* filtrations* (see Section 2.3.1). To our knowledge, this was the first experiment relating epithelial tissues with TDA. Its nature was exploratory and the results similar to the ones obtained using the mean and variance of the degree of the cells. Another problem of this analysis was that the polynomial variables used for analyzing the barcodes were not stable with respect to the bottleneck distance (see Section 2.5). Defining stable polynomial variables was one of the main reasons why tropical coordinates were defined in [13]. We have found the work in [7] extremely useful as a first approach and this paper can be seen as a continuation of it. Independently, a similar approach was presented in the conference paper [14]. In this case, persistent entropy was used instead of polynomial variables. Again, stability was not guaranteed, since the number of bars of each barcode was not fixed, as required in [15] for a stable result. Finally, another approach was introduced in the conference paper [16]. In that case, instead of using the contact graph of the cells, spatial distribution of their centroids was studied, using the alpha* filtration* (or alpha-complex) which was constructed over the Delaunay complex generated by the set of centroids [17]. Nevertheless, keeping the infinity bar in the barcodes made the summaries depend on the original scale of the image, introducing bias in the analysis. One of the major motivations of this paper was to solve that problem. Needless to say that finding a setting for normalizing the barcodes (so that they can be compared), while guaranteeing the stability of the variables used to analyze them is far away from being trivial and require a mathematical analysis even for a specific case study like this one.

Recently, a paper applying TDA to cell images has been published [18]. Their approach uses cycles to analyze the clustering of epithelial cells as self-propelled particles (not forming packed tissues).

1.2 Overview of the paper

In Section 2, we will modify the methods appearing in [7] and [16] to guarantee the stability in all the procedure. Mathematical proofs to support correctness are provided. No assumption about convexity of the cells is needed. In Section 3, a rigorous statistical analysis of the results for tissues (both, epithelium and CVTs) is carried out using TDA variables. Also, these variables will be used to improve the performance of a Random Forests classification of the tissues based on their neighbors distribution. Interpretation of some of the variables at the tissue level are provided. Finally, we will summarize the results in Section 4 and propose new interesting questions arising from this paper that might be of interest for different fields, such as developmental biology, pattern recognition or TDA.

2 Materials and Methods

Our aim is to assign to each epithelial image an invariant, called persistence barcode, representing inner topological and geometrical information. We would like to analyze these persistence barcodes using numerical variables. There are three main difficulties:

•

Finding a correct data normalization which does not include bias in our analysis due to the number of cells or the scale of the image;

•

Guaranteeing our variables only measure topological-geometrical properties. In particular, they should be invariant to rotation or scale changes.

•

Proving robustness of our variables with respect to the cell organization.

2.1 Input data

Our method is suitable for the topologial analysis of the organization of segmented regions that partition a portion of plane. In this paper, the images for our experiments come from several types of (real) epithelial tissues as well as different mathematical tessellations. The segmented images of epithelial tissues considered are available as suplementary material in the article [19]. More specifically, they are images of chicken tissues: chicken embryonic ectoderm (cEE) images, chicken neural tube (cNT) images; and Drosophila tissues: Drosophila notum prepupa (dNP) images, wing disc in the larva (dWL) and prepupal (dWP) stages of development. The tissues dWL and dWP are taken from two proliferative stages separated by only 24 hours of development, so they are really hard to distinguish. Further information about the way these images were obtained and segmented can be found in [19].

Besides, since the morphology of cells in some epithelial tissues is commonly approximated by Voronoi tessellations [20], we also consider the so-called CVT-path: take a random set of points on the plane and, iteratively, compute the Voronoi diagram $CVT_{i}$ from them and its set of centroids as seed points for next iteration ( $i+1$ ). Information of how the CVT-path is generated can be found in [5].

In order to obtain information from the regions like the centroids, the Matlab funtion regionprops was used. For the contact graph, a small dilatation was performed on each region and labels of adjacent regions reached by the dilation were retrieved. Cells are said to be valid if they do not touch the exterior limits of the image. Only valid cells will be processed. The data extraction procedure, together with the whole code, can be found in a publicly available repository.111github.com/Cimagroup/topo-summaries-for-packed-tissues

2.2 Normalization and cell selection

In order to avoid the bias induced by the number of valid cells in each tissue, we will consider always the same number of cells from each image to proceed to the topological analysis. Besides, this normalization is key to prove some stability results in Subsections 2.6 and 2.7. Although a higher number of cells is, in general, better to have a more global picture of the organization, the amount considered will be constrained by the minimum number of cells in the images of the given database. The number of valid cells in the whole set of images range from $140$ to $1102$ , see Table 1. Then, as a starting point, we can fix $N=140$ as the number of cells picked. Unfortunately, a rare event happens in the first cEE image: some valid cells are completely surrounded by non-valid cells making them disconnected from the rest. We have decided to dismiss it as an outlier (representing a $2.8\%$ of the total sample). Then, we will fix $N=187$ , as it is the second minimun number of cells appearing in Table 1. We follow Algorithm 1 [16] to select the desired number of cells in each image. In Figure 1 an intuitive idea of how the algorithm works is given. From now on, we will denote as valid cells only to the ones selected by this algorithm.

2.3 Simplicial complexes and filtrations

A $k$ -simplex (or simplex of dimension $k$ ) in $\mathbb{R}^{m}$ is the convex hull of a set of $k+1$ affinely independent points $\tau=\{p_{0},\ldots,p_{k}\}$ . The points of $\tau$ are called the vertices of $\tau$ and the subsets of $\tau$ form the faces of $\tau$ . That is, each $\ell$ -simplex contained in $\tau$ with $0\leq\ell<k$ is called a face of $\tau$ . A (geometric) simplicial complex $\mathcal{K}$ is formed by a set of simplices satisfying:

Every face of a simplex in $\mathcal{K}$ is also in $\mathcal{K}$ . 2. 2.

The intersection of any two simplices in $\mathcal{K}$ is either a face of both simplices or the empty set.

The dimension of a simplicial complex is the maximum of the dimensions of its simplices. The combinatorial description of $\mathcal{K}$ as finite subsets of the whole set of vertices $V$ (without considering the geometric embedding in $\mathbb{R}^{m}$ ) is known as an abstract simplicial complex. In the following, when we refer to a simplicial complex we mean an abstract simplicial complex.

A filtration over a simplicial complex $\mathcal{K}$ is a finite nested sequence of simplicial subcomplexes

[TABLE]

It is commonly defined using a monotonic function $f:\mathcal{K}\rightarrow\mathbb{R}$ i.e. for any two simplices $\delta,\tau\in\mathcal{K}$ , if $\sigma$ is a face of $\tau$ , then $f(\sigma)\leq f(\tau)$ . That way, if $a_{1}\leq\ldots\leq a_{r}$ are the function values of all the simplices in $\mathcal{K}$ , then the subcomplexes $\mathcal{K}_{i}=f^{-1}(-\infty,a_{i}]$ , for $i=1\ldots r$ define a filtration over $\mathcal{K}$ . We may call $f$ a filtration when we actually refer to the filtration induced by $f$ .

We will use three types of filtrations: the clique complex filtration sub, the clique complex filtration sup, and the Vietoris-Rips filtration rips.

2.3.1 The sub and sup filtrations

Since we know which valid cells are neighbors between them, we can build a graph representing this relation as edges. We denote it as a contact graph. We construct the clique complex of a graph, $CK$ , adding a $k$ -simplex $[x_{0},\ldots,x_{k}]$ whenever the graph has a clique formed by the vertices $x_{0},\ldots,x_{k}$ . Define the sub, sup filtration [7] over the clique complex of a contact graph using the following functions,

[TABLE]

where $VN(x)$ is the number of valid neighbors of the cell (i.e. the degree of the vertex) $x$ and $\sigma$ a face of the simplicial complex. We use the value $15$ in the sup filtration since it is rare to find a cell with such a number of neighbors, and in fact, there is not such a cell in our samples. Note that both sup and sub only carry information about the topology of the contact graph of the cells (it is a topological invariant). See Figure 2 for two examples.

2.3.2 The rips filtration

Another strategy is to obtain the centroid of each cell and study their distribution. In order to do that, we will use the Vietoris-Rips filtration [17]. It is constructed using the function

[TABLE]

Over the simplex generated from the whole set of centroids. See Figure 3 for an example.

It is important to emphasize that the immersion of the point clouds in the image to $\mathbb{R}^{2}$ depends on the “distance per pixel relation” of the original image, and so does rips. Then, rips is not scale-invariant. We would like to eliminate this bias with a normalization process, but due to stability arguments we will apply it in the next step and not directly on the filtration.

2.4 Persistent homology and barcodes

Intuitively, homology formalize the notion of $m$ -dimensional hole. A [math]-dimensional hole is a connected component, a $1$ -dimensional hole is a tunnel (or a cycle in a graph), a $2$ -dimensional hole is a cavity, and so on. More specifically, homology gives a procedure to assign to a simplicial complex $\mathcal{K}$ , a vector space $H_{m}(\mathcal{K})$ as follows:

First, define the $m$ -chains of $\mathcal{K}$ , $C_{m}(\mathcal{K})$ , as the vector space over a field $\mathbb{F}$ , with basis the set of $m$ -dimensional simplices of $\mathcal{K}$ . In this paper, we will use $\mathbb{F}=\mathbb{Z}_{2}$ . If $\tau$ is an $m$ -simplex, define the boundary operator in each $m$ -simplex as $\partial_{m}(\tau)=\sum_{\sigma\in F}\sigma$ where $F$ are the $n-1$ faces of $\tau$ . Then, extend it linearly to obtain $\partial_{m}:C_{m}(\mathcal{K})\rightarrow C_{m-1}(\mathcal{K})$ . Homology is the vector space

[TABLE]

where $\ker$ is the Kernel of $\partial$ and *im * is the Image. Each of the classes of $H_{m}(\mathcal{K})$ can be seen as a hole of $\mathcal{K}$ . The $m$ -betti number, $\beta_{m}=\dim H_{m}(\mathcal{K})$ , is interpreted as its amount of $m$ -dimensional holes.

Besides, if we have two simplicial complexes $\mathcal{K}_{a}\subset\mathcal{K}_{b}$ , homology induces a linear map $f_{a,b}$ between $H_{m}(\mathcal{K}_{a})$ and $H_{m}(\mathcal{K}_{b})$ . In this case, $\beta^{m}_{a,b}=\dim f_{a,b}H_{m}(\mathcal{K}_{a})$ can be seen as the number of $m$ -dimensional holes shared by both simplicial complexes.

Persistent homology study how the $m$ -dimensional holes appear and disappear in a filtration $\mathcal{K}_{a_{1}}\subset\mathcal{K}_{a_{2}}\ldots\subset\mathcal{K}_{a_{r}}=\mathcal{K}$ . Fix a pair of numbers $i=1,\ldots,r$ and $j=2,\ldots r+1$ . Following the previous reasoning, note that the value

[TABLE]

can be interpreted as the number of $d$ -dimensional holes which appears at $i$ and disappear at $j$ . Note that $\beta_{a_{i},a_{m+1}}$ and $\beta_{a_{0},a_{j}}$ are values out of the original filtration. In order to proceed with the calculation, we can set them as [math]. Since $a_{r+1}$ does not correspond to a simplicial complex, holes which disappear at $a_{r+1}$ may actually be considered not to disappear and persist up to infinity. This information is summarized in the $m$ -dimensional barcode (or $m$ -barcode), a multiset of intervals

[TABLE]

where each interval $[b_{i},d_{i})$ appears $c_{b_{i},d_{i}}^{m}$ times (its multiplicity). Nevertheless, we want to use sets instead of multisets, so a barcode, $B$ , will be described as a set of intervals, each appearing repeated as many times as its multiplicity,

[TABLE]

We give an example in Figure 4. Further details on homology and persistent homology can be found in [17].

2.5 Bottleneck distance and stability

One of the main advantages of persistent homology with respect to classical homology, is that it is stable regarding small modifications of the input. In order to introduce this result, we need a notion of closeness for persistence barcodes.

Similarities between persistence barcodes can be measured by bottleneck distance. A $\delta$ -partial matching between barcodes $B_{1},B_{2}$ is a colection of pairs $M\subset B_{1}\times B_{2}$ such that

•

For each $[b_{1},d_{1})\in B_{1}$ , there is at most one $[b_{2},d_{2})\in B_{2}$ , such that $\big{(}[b_{1},d_{1}),[b_{2},d_{2})\big{)}\in M$ and vice versa.

•

If $\big{(}[b_{1},d_{1}),[b_{2},d_{2})\big{)}\in M$ then $\max\{|b_{1}-d_{2}|,|b_{1}-d_{2}|\}\leq\delta$

•

If $[b,d)\in B_{1}$ (or $\beta\in B_{2}$ ) is unpaired, then $(d-b)/2\leq\delta$ .

The bottleneck distance between two barcodes is defined as

[TABLE]

The following stability result can be found in [17]. Given two filtration $f,g:\mathcal{K}\longrightarrow\mathbb{R}$ , we have

[TABLE]

In our case, cell tissues with similar contact network or similar centroid distribution give similar barcodes.

2.6 Barcodes normalization

Barcodes are not good for statistical analysis as shown in [21]. We will use numerical variables calculated from the barcodes, but we need to deal with infinity bars before. Consider a barcode $B$ representing a sub or sup filtration. We would like somehow to keep infinity bars since they gives information about the barcode. Define the function $\xi_{z}B$ which gives the same barcode but with infinity bars $[a,\infty)$ transformed in $[a,z)$ . In [15] it was shown that

[TABLE]

In our sample, no cell has $15$ or more neighbors, so fixing $z=15$ will map infinity bars to bars that are always longer than the others. Note that sub and sup barcodes are always in the same units (number of neighboors) and can be compared between them. In addition, they are invariant to rotations and scale of the input, by definition.

In the rips case, the last complex appearing in the filtration is always contractible, so there are not infinity bars with dimensions greater than [math], and only one infinity interval, $[0,\infty)$ , in the [math] dimensional persistent homology. Then, infinity bars does not give information in this context so we eliminate them using $\xi_{0}$ . Note that this is equivalent to calculate the reduced homology.

Recall that we mentioned in Section 2.3.2 that the centroids from which rips is defined, still carry the units from the image and so does the barcode associated to rips. Hence, it is not invariant to scale, since it depends on the distance matrix of the point cloud. The following normalization solves this problem: we are dividing each barcode by the sum of the lengths of the bars. More specifically, given $B=\{[b_{i},d_{i})\}$ with no infinity bars, define $L_{B}=\sum_{i}d_{i}-b_{i}$ and $\phi(B)=\{[b/L_{B},d/L_{B}):[b,d)\in B\}$ . It is a direct consequence from [15, lemma 3.9] that

[TABLE]

Where $n_{max}$ is the maximum number of bars between $B_{1}$ and $B_{2}$ . Note that for any barcode $B$ coming from the 0-dimenional rips we have $n-1$ number of bars (one for each of the $n$ cells minus the infinity bar) and $L_{B}=(n-1)\bar{\ell}_{B}$ , where $\bar{\ell}_{B}$ is the average length of the bars. Then, for all [math]-dimensional rips barcodes coming from our experiment

[TABLE]

where $\tilde{\ell}$ is the minimum of all averages $\bar{\ell}_{B}$ . Note that $\bar{\ell}_{B}$ cannot be arbitrarily small since there are physical constraint for the size of cells in the tissue. Unfortunately, the $1$ -dimensional case is not stable under this normalization since we cannot find a lower bound for $L$ . We drop the $1$ -dimensional rips barcodes from the experiment. As the following result shows, this normalization makes rips barcodes scale-invariant.

Proposition 2.1.

Fix a normed vector space and the induced distance, $d$ . Fix a scalar, $\alpha$ , and consider two point clouds $P_{1}$ and $P_{2}=\alpha P_{1}$ . Let $B_{1}$ and $B_{2}$ be their barcodes coming from rips. Then, $\phi(B_{1})=\phi(B_{2})$ .

Proof.

Note that for any two points

[TABLE]

Then, if the induced filtration by $P_{1}$ is $f$ , the one from $P_{2}$ is $|\alpha|f$ . In particular, $\beta_{a,b}^{m}$ in the first case must be equal to $\beta_{|\alpha|a,|\alpha|b}^{m}$ in the second case. This means that barcodes are also proportional $B_{2}=|\alpha|B_{1}$ and $L(B_{2})=|\alpha|L(B_{1})$ , so

[TABLE]

∎

In particular, it works in our setting since $\mathbb{R}^{2}$ with the Euclidean distance is a normed vector space. Then, from each image we have five barcodes, four $\xi_{15}B$ coming from [math] and $1$ -dimensional sub and sup filtrations and $\pi\xi_{0}B$ coming from [math]-dimensional rips. From now on, when we mention a barcode coming from any of these filtrations, we assume the corresponding $\xi$ has been already applied.

2.7 Stable topological summaries

In the previous section, we saw that barcodes with the bottleneck distance are stable respect to modifications in the input. Then, variables defined on the barcodes, which are stable with respect to the bottleneck distance, will be also stable with respect to to the input.

In this section, we will describe the variables used in this paper and study their stability.

2.7.1 Persistent Entropy

Persistent entropy [22, 23] is a topological summary that can be seen as an adaptation of Shannon entropy (Shannon index in ecology) to the persistent homology context. Given a barcode with finite bars $B=\{[b_{i},d_{i})\}_{i=1\ldots n}$ consider the length of the bars $\ell_{i}=d_{i}-b_{i}$ and their sum $L(B)=\ell_{1}+\ldots+\ell_{n}$ . Then, its persistent entropy is:

[TABLE]

When computed over an $m$ -dimensional barcode $B$ . The stability result appearing in [15] is simplified greatly in our case. In particular, We have that the [math]-barcodes coming from rips satisfy $L(B)=1$ after normalization.

First, recall a result relative to the Shannon entropy, $E_{S}$ .

Proposition 2.2 ([24, p. 664]).

Let $P$ and $Q$ be two finite probability distributions (seen as vectors in $\mathbb{R}^{u}$ ), and let $E_{S}(P)$ and $E_{S}(Q)$ be, respectively, their Shannon entropy. If $||P-Q||_{1}\leq\frac{1}{2}$ then

[TABLE]

We can transform the previous proposition in the following result for persistent entropy:

Proposition 2.3.

Let $A$ and $B$ be two barcodes with the same number of bars, $n$ , all of them starting at [math] and satisfying $L(A)=L(B)=1$ . If $d_{b}(A,B)\leq\frac{1}{2n}$ then

[TABLE]

Proof.

Note that since both barcodes have the same number of bars and all of them start at [math], the matching provided by the bottleneck distance is a one-to-one mapping between both sets of intervals. Then, we can order the barcodes in such a way that the bars matched by bottleneck distance are listed in the same position. Besides, since we have $L(A)=1$ we can treat its barcode $\{[0,d_{i})\}$ as a finite probability distribution $P=\{d_{i}\}$ . Name $Q$ the probability distribution of $B$ . Note that $||P-Q||_{1}\leq n\,d_{b}(A,B)$ . Then, substituting in the formula in Proposition 2.2 and using $\log(nd_{b}(A,B))=log(n)+log(d_{b}(A,B))$ , the result follows. ∎

Then, this proposition gives a stability result for the rips filtration. In the sub and sup case the result is not straight forward, but we can still talk about stability (see Theorem 3.12 in [15]). Onward, We will refer to the persistent entropy of a $d$ -dimensional barcode as $PE_{d}(B)$ .

2.7.2 Tropical polynomials

Tropical coordinates allow to define stable polynomials over barcodes as explained in [13]. These polynomials are defined on the max-plus semiring $(\mathbb{R}\cup\{-\infty\},\boxplus,\odot)$ with addition and multiplication being defined as:

[TABLE]

In particular, for the variables $x_{i}$ , polynomials of this semi-ring are written (with the usual notation) as

[TABLE]

where $a_{i}\in\mathbb{R}$ and $a_{i}^{j}\in\mathbb{N}_{0}$ . If we make analogous definition for barcodes, using the length of the bars, $\ell_{i}$ , as variables, polynomials of the form

[TABLE]

are shown to be stable with respect to the bottleneck distance.

Proposition 2.4.

[13]** Let $F$ be a polynomial as stated before, defined over barcodes $A,B$ . Then, there exists a constant $C$ such that

[TABLE]

2.7.3 Persistence landscapes

A Persistence landscape [25] is a sequence of summary functions obtained from a barcode. Given a barcode $B=\{[b_{i},d_{i})\}$ perform the change of coordinates

[TABLE]

The rescaled rank function, $\lambda:\mathbb{R}^{2}\rightarrow\mathbb{R}$ is defined as

[TABLE]

The persistence landscape is the set of function $\lambda_{k}:\mathbb{R}\rightarrow\mathbb{R}$ with $k\in\mathbb{N}$ given by

[TABLE]

Persistence landscape are related with tropical polynomials. In particular, they are an example of what is called tropical rational function (see [26, 13]). See Figure 5 for an illustration.

There is also a stability result available,

Proposition 2.5.

[25, 26]** Let A and B be two persistence barcodes and let $\lambda^{A}$ and $\lambda^{B}$ be their persistence landscape. Then, for all $k$ and $t$ ,

[TABLE]

Since we are interested in variables and not summary functions, we will use the $1$ -norm of $\lambda_{k}$ . Note that for sub and sup the domain of the landscape is restricted to $[0,15]$ and for rips all the intervals will lie in $[0,1]$ . Then, in our case

[TABLE]

where $C=15$ or $1$ depending on the filtration.

We have seen how to obtain numerical summaries from the tissue images, each of them stable respect to the filtration induced by the cell organization. This means that, if their contact network or their centroid distribution were similar (up to scaling), the resulting variables will be similar. We have proved these summaries satisfy the desired conditions: they measure topological and geometrical properties (at least up to scaling or rotation), they are robust to modification in the organization of the cells (we mean respect to modification in their contact network or centroid distribution) and all barcodes have been normalized to avoid bias when comparing them.

3 Results

Our experiment is divided in two parts. First, we analyze the barcodes using statistical techniques. Then, we try to classify the images using Random Forests. In both cases, we use variables coming from the previous section. Before each experiment, we performed an exploratory analysis with many variables. We selected different polynomials and $k$ values for the landscapes. The notation is as follows: $|\lambda_{d}^{\textbf{filt}}(k)|$ means the summary corresponding to the norm of the landscape computed from the $d$ -dimensional persistence barcode of the filtration filt with parameter $k$ . Instead of a fixed number, we may express $k$ with a percentage together with the letter $N$ . For example, $|\lambda_{1}^{\textbf{sub}}(0.05N)|$ means we have used the sub filtration and $k=\emph{floor}(0.0.05N)$ . For $N=187$ we have $k=9$ . $Poly_{d}^{\textbf{filt}}(r,k)$ means the sum $\ell_{r}+\ell_{r+1}+\ldots+\ell_{k}$ where $\ell_{r}$ is the $k$ -th largest length in the $d$ -dimensional barcode of the filtration filt. Again, we may express $r$ or $k$ as percentages instead of fixed numbers. For example, $Poly_{0}^{\textbf{sup}}(2,0.02N)$ means $\ell_{2}+\ell_{3}$ in the 0-dimensional sup barcode when $N=187$ , since $k=\emph{floor}(0.02\cdot 187)=3$ . If only one element appear in the sum, we write directly $\ell_{d}^{\textbf{rips}}(k)$ for the $k$ -th length. Finally $PE_{d}^{\textbf{filt}}$ mean the persistent entropy of the $d$ -dimensional filt barcode. As mentioned in Section 2.2, we fix $N=187$ . After choosing different values for the parameters of the variables in each of the $5$ barcodes (the [math] and $1$ -dimensional barcodes of sub and sup, and the [math]-dimensional barcodes of the rips filtration), we obtain a total of $57$ summary variables per image. The code with the whole experiment can be found in a publicly available repository 222github.com/Cimagroup/topo-summaries-for-packed-tissues.

3.1 The statistical analysis

We will look for significant differences in the distributions followed by the TDA summaries in each of the tissues.

3.1.1 Eplithelial tissues

Note that our samples are relatively small: between $12$ and $16$ images per tissue. Then, we cannot assume the variables follow any parametric distribution. This means that our statistical analysis must be based on a non-parametric test. First, we will use the Kruskall-Wallis test to see if each variable follows the same distribution in all the tissues. If this is not true, we will try to find differences between pair of tissues using the Dunn test. Since we are using many variables, we fix the $p$ -value in $0.01$ .

The Kruskall-Wallis test found significant differences among the tissues for all the variables except for two. Due to a high redundancy among the variables, results of the Dunn Test of some selected variables are shown in Table 2. Note that cEE and cNT can be easily differentiated between them and from the rest using sub and sup, as expected. Nevertheless, no differences were found between cEE vs cNT and cNT vs dNP for rips, what means that we could not distinguish their centroid distributions. Differences between dNP and both wing tissues, dWL and dWP, were found but only for rips. In particular, dNP vs dWL can only be differentiated by $\ell^{\textbf{rips}}_{0}(0.05N)$ , $\ell^{\textbf{rips}}_{0}(0.10N)$ and $\ell^{\textbf{rips}}_{0}(0.15N)$ .

Finally, we could find differences between dWL and dWP only for one variable, $|\lambda_{0}^{\textbf{sup}}(0.03N)|$ . A possible explanation can be the small amount of cells selected from each image of the sample. With the expectation that we could find more differences by increasing the number of cells, we design a more specific experiment to compare these two tissues taking the maximum number of available cells ( $N=257$ , see Table 1) and performing a Mann-Whitney U test. In Table 3, the results for the test fixing $N=187$ and $N=257$ are displayed for three significant variables. As we expected, we find more significant differences with the increase in the number of cells. Note that a change in $N$ is just a change in a parameter of the variables, and not a change of the sample size (because the sample images remain the same).

We will analyze the meaning of some of the variables appearing in this section in Section 3.3.

3.1.2 Comparing the CVT-path with epithelia

Some of the epithelial tissues will be compared with their most similar tessellation in the CVT-path. Following [5], cNT follows a similar neighbor distribution to $CVT_{1}$ , dWL to $CVT_{4}$ and dWP to $CVT_{5}$ . Since we are interested only in making those pairwise comparisons, we are performing again the Mann-Whitney U test instead of the Kruskall-Wallis one. The minimum valid cells per image is $257$ (see Table 1). A selection of the results are displayed in Table 4. Many variables follow different distributions between the CVT tissue and its epithelium counterpart (between $8$ and $17$ depending on the type compared), most of them in the rips filtration. Differences in the sub and sup filtrations were only found between cNT and $CVT_{1}$ .

3.2 Classifying the images

We will classify the epithelial images in three classes: cEE, cNT and Drosophila tissues. Drosophila tissues are dWL, dWP and dNP. These tissues can be easily separated from $cEE$ and $cNT$ using the mean and variance of the degrees in the contact graph. Nevertheless, distinguishing between cEE and cNT is more difficult. Since we do not have a big sample of data, we will use the Random Forests technique to avoid over-fitting. Many variables used in network analysis have a strong relation with the mean degree in this specific context [7][Sec. 8.2.2]. Then, variables used for the network analysis are: the mean and variance of the degree and the amount of cells with degree equals to $2,\,3,\,4,\ldots 13$ cells. We fix $N=187$ as in the first experiment of Section 3.1. We use $3/4$ of data as a training test and $1/4$ for validation. We fix the number of trees in $200$ since the accuracy is already stabilized for that number. This procedure is repeated $10^{4}$ times, the average accuracy of the classification is shown in Table 5. The best results is reached with only $3$ variables: the mean and variance of the degree and $|\lambda_{0}^{\textbf{sub}}(0.10N)|$ . The validation results are slightly better than the training ones, so we do not commit over-fitting. The selected variables outperform the others. This proves that TDA variable may be useful to complement other variables in machine learning tasks.

3.3 Interpretation of the variables

As we will see, information carried by sub and sup filtration are strongly related with the neighbor distribution, but not only, since details of the inner organization of the tissue might enrich it. Besides, rips is strongly related with the relative proximity of the centroids of the cells inside the same image and any interpretation of a variable must be in that terms. In the sequel, we will use the term $n$ -cell to refer to a cell with $n$ neighbors.

3.3.1 The variable $|\lambda_{1}^{\textbf{sub}}(k)|$

This landscape is detecting when there are at least $m$ $1$ -dimensional holes alive. For $N=187$ we obtained the most discriminating $m$ value is $\emph{floor}(0.0.05N)=9$ , see Figure 6. $1$ -dimensional holes in the sub filtration are formed when there are clusters of cells surrounded by other cells with a smaller amount of neighbors, see Figure 2. For example, cEE has a big variance with more cells with few neighbors ( $2$ or $3$ ) or many neighbors ( $8$ or $9$ ) than other tissues. In particular, cells with more neighbors have a greater chance to appear forming clusters than in other tissues, where is more common to find them isolated. Then, a smaller number of $1$ -dimensional holes is expected. There is another factor, cEE cells are far away of being convex allowing settings like isolated small cells embedded between two or three big cells. So again, $1$ -dimensional holes in cEE are less likely that in other tissues, and never reach the $9$ simultaneous $1$ -dimensional holes threshold.

On the other hand, Drosophila images have small variance with plenty of cells with $6$ neighbors. In particular, when $i=6$ all $6$ -cells are connected and many $1$ -dimensional holes appear, one for each cluster formed by cells with more than $6$ neighbors. See Figure 2.

Finally, cNT is halfway of both.

In general, there is a strong correlation between this variable and the variance of the degrees. For the rest of variables, complementary information will become more important than just the degree distributions.

3.3.2 The variable $|\lambda_{0}^{\textbf{sub}}(k)|$

In this case, landscape is detecting when there are at least $m$ connected components simultaneously alive. In our experiment, the best $m$ is $\emph{floor}(0.0.10N)=18$ . An interesting pattern arise, which improves the results with respect to just using the number of neighbors. Many cells with $2$ or $3$ neighbors are just cells on the boundary which will be connected soon with some neighbor. Nevertheless, cEE tissues, have non-boundary cells with $2$ or $3$ neighbors which are isolated and surrounded by cells with $6$ or $7$ neighbors. That will generate some longest connected components than in the other tissues.

In the other chicken tissue, cNT, cells with degree from $4$ to $8$ neighbors are more uniformly distributed in the image. Hence, a greater proportion of connected components arises with birth time $4$ or $5$ and death time for $6$ neighbors.

This effect is even clearer in Drosophila tissues: since there are fewer $4$ -cells, connected components with $2$ or $3$ cells on the boundary are alive until cells with $5$ neighbors appear. Many of them connect with cells on the boundary, but other $5$ -cells keep isolated or in small clusters creating connected components. These new connected components have a short life and usually die when $6$ -cells appear.

Then, cEE landscape tends to have a greater area with two close peaks, cNT landscape tends to have a medium area with only one peak and Drosophila landscape tends to have a smaller area and $1$ or $2$ separated peaks depending if there have enough $2,3$ -cells or not; see Figure 7.

Therefore, this variable is not only taking into account the distribution of the cells but how cells with different number of neighbors are connected among them and with regard to the boundary. As it is shown in the Random Forests classification, this variable performs better for classification than comparing directly the number of neighbors (see the accuracy of network variables vs mean + variance + $|\lambda_{0}^{\textbf{sub}}(0.10N)|$ in Table 5).

3.3.3 $|\lambda_{1}^{\textbf{sup}}(k)|$ and $\emph{Poly}^{\textbf{sup}}_{1}(1,k)$

These variables become specially important when comparing dWL and dWP. Actually, in this case there is a huge correlation between both variables since they are measuring the same feature at the tissue level. 1-dimensional holes in the sup filtrations appear when there are cells (or cluster of cells) with a small number of neighbors which are surrounded by cells with a higher number of neighbors. In this case, the key difference between both tissues is provided by persistence bars which appear when there are 4-cells, some of whose neighbors are 6 or 7-cells. The presence of this combination of cells provides more bars with longer persistence in dWL than in dWP, see Figure 8. Then, $|\lambda_{1}^{\textbf{sup}}(k)|$ (and the sum $\emph{Poly}^{\textbf{sup}}_{1}(1,k)$ ) will be greater in dWL. For $N=257$ , the best result is reached when $k=\emph{floor}(0.0.02N)=3$ . In Table 6, a small experiment showing this variable is measuring different features than just the neighbors distribution is displayed.

3.3.4 $\ell_{0}^{\textbf{rips}}(k)$

Calculate the length of the $m$ -th longest bar in the Rips filtration is equivalent to the distance for which there are less than $m$ connected components. Since our barcodes are normalized, the distance have no units and can be interpreted as a proportion. Besides, in this data set the longest bars are associated to connected components corresponding to isolated centroids, or small clusters of centroids (recall the infinity bar was removed). Then, this variable is directly related with the (relative) distance between the centroids. In practice, it is measuring if there are at least $m$ centroids with a relative distance to the main connected component bigger than the others in the same image. The most discriminating $m$ for $N=187$ is $\emph{floor}(0.0.10N)=18$ . This type of variable becomes important when analyzing Drosophila tissues, since it is the only one finding differences between dNP and the others (dWL, dWP). An example is shown in Figure 9.

3.3.5 $PE_{0}^{\textbf{rips}}$

Note that since Shannon entropy is a concave function in the space of probability distribution [24], so is the persistent entropy in the space of normalized rips barcodes. Then, knowing the maximum will give us a valuable hint for understanding this variable at the tissue level.

We will consider the point cloud as a finite metric space, $M$ . Then, we define a distance graph on $M$ , $G_{M}$ , as a clique graph whose set of vertices is given by the centroids and the weight on each edge is the distance between the corresponding vertices.

Proposition 3.1.

Let $M$ be a finite metric space and $B_{M}$ its persistence barcode (with the infinity bar removed) coming from the rips filtration over $G_{M}$ . Then, $PE_{0}(B_{M})$ is maximum if and only if the minimum spanning tree of its distance graph has a constant weight for all its edges.

Proof.

Note that since we only consider the [math]-dimensional rips barcode, all bars are born at [math]and their death value is the same than their length. Besides, note that Shannon entropy reaches its maximum value when all probabilities are the same [24]. Then, persistent entropy reaches its maximum value when all bars have the same length. Combining these two facts, we have that persistent entropy will be maximum if and only if all bars have the same death value. One direction is clear: If the minimum spanning tree has a constant weight, it means that all the vertices are isolated until the filtration value reaches that constant. Then, all the finite bars die at that value. For proving the other direction, assume the minimum spanning tree does not have a constant weight and define $w$ as its minimum weight. When the filtration value is $w$ , some of the vertices become connected between them but not all of them. This means that some bars die at $w$ , but some will die later. Then, persistent entropy cannot be maximum. ∎

Then, $PE_{0}^{\bf rips}$ is strongly related with the variability of the weights in the minimum spanning tree of the distance graph. It makes sense since entropy may also be understood as a diversity index. In particular, it may have an interpretation in terms of how centroids of the cells are related between them. This variable becomes specially important when comparing some tissues with their CVT counterpart, since it was the only one (together with $\emph{Poly}_{0}^{\bf rips}(1,0.05N)$ ) succeeding in differentiating all the cases, see Table 4. Then, it means that the CVT-path are failing to imitate the centroid distribution of the cells.

4 Discussion

We summarize here the results of this paper. Normalizing the number of cells obtained from each tissue, as well as the rips barcodes, allowed us to compare the network and centroid distributions of different cell tissues without loosing stability properties. This was proven by some theoretical results appearing in Sections 2.6 and 2.7. Note that these results may be generalized with respect to other tessellations of the plane. In Section 3, we compared some epithelia, obtaining some conclusions that might be of interest for the biological community:

•

The geometry of the cells in cEE and cNT are completely different (cNT cells tend to be convex, while cEE do not). Nevertheless, their centroid distributions turned out to be very similar. Is there any biological or physical reason for this fact?

•

Wing tissues in different state of development, like dWL and dWP, are difficult to differentiate. However, the variable $\lambda_{1}^{\textbf{sup}}(k)$ and $\emph{Poly}_{1}^{\textbf{sup}}(1,k)$ worked pretty well in this context. The main reason was a difference in the number of $4$ -cells surrounded by a mix of $6$ and $7$ -cells. Is this kind of cell combination a good indicator for the level of development? If so, why?

There are also interesting results for the pattern recognition community:

•

We provided an example where TDA may be useful to study networks with a very simple topology, leading to the study of variables which would have been difficult to discover otherwise.

•

In particular, we propose a combination of normalization in the original image and in the barcode which allows to prove formal stability of the method.

•

This paper also provides an example where TDA variables may be combined with others to improve machine learning performance.

We have also seen that, despite the fact that CVT-path is good imitating the polygonal distribution of some of the tissues, they may fail to model their centroid distribution. A future work could be to use rips variables, specially persistent entropy, to improve the simulation of epithelium. Another interesting future work is to adapt this analysis to 3D epithelium, as well as to other fields, such as material science.

Acknowledgments: Authors would like to thanks researchers L.M. Escudero, P. Gómez-Gálvez and C. Molero-Ríos for their valuable help during the development of this research.

Funding: This research was funded by Ministerio de Ciencia e Innovación – Agencia Estatal de Investigación /10.13039/501100011033, grant number PID2019-107339GB-I00. The author M. Soriano-Trigueros was partially funded by the grant VI-PPITUS from University of Seville.

Bibliography26

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] V. Emmanuele, A. Kubota, and M. H. et al., “Fhl 1 w 122s causes loss of protein function and late-onset mild myopathy,” Hum. Mol. Genet. , vol. 24, no. 3, pp. 714–726, 2014.
2[2] J.-A. Park, J. H. Kim, and J. F. et al., “Unjamming and cell shape in the asthmatic airway epithelium,” Nat. Mater. , vol. 14, no. 10, pp. 1040–1048, 2015.
3[3] D. Sánchez-Gutiérrez, A. Sáez, A. Pascual, and L. Escudero, “Topological progression in proliferating epithelia is driven by a unique variation in polygon distribution,” P Lo S ONE , vol. 8, no. 11, p. e 79227, 2013.
4[4] M. Emelianenko, L. Ju, and A. Rand, “Nondegeneracy and weak global convergence of the lloyd algorithm in ℝ d superscript ℝ 𝑑 \mathbb{R}^{d} ,” SIAM Journal on Numerical Analysis , vol. 46, no. 3, pp. 1423–1441, 2008.
5[5] D. Sanchez-Gutierrez, M. Tozluoglu, and L. M. E. et al., “Fundamental physical cellular constraints drive self-organization of tissues,” The EMBO J. , vol. 35, no. 1, pp. 77–88, 2016.
6[6] P. Vicente-Munuera, P. Gomez-Galvez, R. J. Tetley, C. Forja, A. Tagua, M. Letran, M. Tozluoglu, Y. Mao, and L. M. Escudero, “Epigraph: an open-source platform to quantify epithelial organization,” Bioinformatics , vol. 36, no. 4, pp. 1314–1316, 2019.
7[7] P. Villoutreix, Randomness and variability in animal embryogenesis, a multi-scale approach . Ph D thesis, Université Sorbonne Paris Cité, 2015.
8[8] H. Edelsbrunner, D. Letscher, and A. Zomorodian, “Topological persistence and simplification,” Discret. & Comput. Geom. , vol. 28, no. 4, pp. 511–533, 2002.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Stable Topological Summaries for Analyzing the Organization of Cells in a Packed Tissue

Abstract

1 Introduction

1.1 Previous topological data analysis approaches

1.2 Overview of the paper

2 Materials and Methods

2.1 Input data

2.2 Normalization and cell selection

2.3 Simplicial complexes and filtrations

2.3.1 The sub and sup filtrations

2.3.2 The rips filtration

2.4 Persistent homology and barcodes

2.5 Bottleneck distance and stability

2.6 Barcodes normalization

Proposition 2.1**.**

Proof.

2.7 Stable topological summaries

2.7.1 Persistent Entropy

Proposition 2.2** ([24, p. 664]).**

Proposition 2.3**.**

Proof.

2.7.2 Tropical polynomials

Proposition 2.4**.**

2.7.3 Persistence landscapes

Proposition 2.5**.**

3 Results

3.1 The statistical analysis

3.1.1 Eplithelial tissues

3.1.2 Comparing the CVT-path with epithelia

3.2 Classifying the images

3.3 Interpretation of the variables

3.3.1 The variable ∣λ1sub(k)∣|\lambda_{1}^{\textbf{sub}}(k)|∣λ1sub​(k)∣

3.3.2 The variable ∣λ0sub(k)∣|\lambda_{0}^{\textbf{sub}}(k)|∣λ0sub​(k)∣

3.3.3 ∣λ1sup(k)∣|\lambda_{1}^{\textbf{sup}}(k)|∣λ1sup​(k)∣ and Poly1sup(1,k)\emph{Poly}^{\textbf{sup}}_{1}(1,k)Poly1sup​(1,k)

3.3.4 ℓ0rips(k)\ell_{0}^{\textbf{rips}}(k)ℓ0rips​(k)

3.3.5 PE0ripsPE_{0}^{\textbf{rips}}PE0rips​

Proposition 3.1**.**

Proof.

4 Discussion

Proposition 2.1.

Proposition 2.2 ([24, p. 664]).

Proposition 2.3.

Proposition 2.4.

Proposition 2.5.

3.3.1 The variable $|\lambda_{1}^{\textbf{sub}}(k)|$

3.3.2 The variable $|\lambda_{0}^{\textbf{sub}}(k)|$

3.3.3 $|\lambda_{1}^{\textbf{sup}}(k)|$ and $\emph{Poly}^{\textbf{sup}}_{1}(1,k)$

3.3.4 $\ell_{0}^{\textbf{rips}}(k)$

3.3.5 $PE_{0}^{\textbf{rips}}$

Proposition 3.1.