Stable Topological Summaries for Analyzing the Organization of Cells in a Packed Tissue
N. Atienza, M. J. Jimenez, M. Soriano-Trigueros

TL;DR
This paper applies topological data analysis to segmented epithelial tissue images, introducing stable topological summaries that improve tissue classification and provide new biological insights.
Contribution
It develops a novel set of stable topological summaries for tissue analysis, combining normalization methods and demonstrating their effectiveness in classification and biological interpretation.
Findings
Topological summaries improve classification accuracy of tissue images.
Normalization ensures stability and invariance of topological features.
New biological indicators for tissue development are proposed.
Abstract
We use Topological Data Analysis tools for studying the inner organization of cells in segmented images of epithelial tissues. More specifically, for each segmented image, we compute different persistence barcodes, which codify lifetime of homology classes (persistent homology) along different filtrations (increasing nested sequences of simplicial complexes) that are built from the regions representing the cells in the tissue. We use a complete and well-grounded set of numerical variables over those persistence barcodes, also known as topological summaries. A novel combination of normalization methods for both, the set of input segmented images and the produced barcodes, allows to prove stability results for those variables with respect to small changes in the input, as well as invariance to image scale. Our study provides new insights to this problem, such as a possible novel indicator…
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16
Figure 17
Figure 18
Figure 19
Figure 20
Figure 21
Figure 22
Figure 23
Figure 24
Figure 25
Figure 26
Figure 27
Figure 28
Figure 29
Figure 30
Figure 31
Figure 32
Figure 33
Figure 34
Figure 35
Figure 36
Figure 37
Figure 38
Figure 39
Figure 40| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cEE | 140 | 206 | 229 | 241 | 385 | 380 | 261 | 187 | 405 | 246 | 327 | 204 | 348 | 270 | ||
| cNT | 666 | 661 | 566 | 574 | 669 | 532 | 420 | 592 | 744 | 527 | 594 | 473 | 704 | 748 | 469 | 834 |
| dNP | 513 | 723 | 588 | 525 | 439 | 823 | 1102 | 533 | 309 | 575 | 302 | 375 | ||||
| dWL | 432 | 556 | 485 | 525 | 501 | 936 | 890 | 790 | 977 | 913 | 606 | 835 | 785 | 748 | 622 | |
| dWP | 748 | 806 | 566 | 415 | 454 | 654 | 752 | 713 | 504 | 430 | 387 | 516 | 419 | 455 | 277 | 257 |
| 187 cells | |||||
| cEE vs cNT | ✓ | ||||
| cEE vs dNP | ✓ | ✓ | ✓ | ||
| cNT vs dNP | ✓ | ✓ | |||
| cEE vs dWL | ✓ | ✓ | ✓ | ✓ | |
| cNT vs dWL | ✓ | ✓ | ✓ | ||
| dNP vs dWL | ✓ | ||||
| cEE vs dWP | ✓ | ✓ | ✓ | ✓ | ✓ |
| cNT vs dWP | ✓ | ✓ | ✓ | ✓ | |
| dNP vs dWP | ✓ | ✓ | |||
| dWL vs dWP | ✓ |
| dWL vs dWP | |||
|---|---|---|---|
| N = 187 | 0.013 | 0.01 | 0.019 |
| N = 257 | 0.012 | 0.006 | 0.005 |
| 257 cells | ||||
| cNT vs | ✓ | ✓ | ✓ | ✓ |
| dWL vs | ✓ | ✓ | ||
| dWP vs | ✓ | ✓ |
| 187 cells | TDA | network | mixed | m & v |
|
||
| cEE | 85.5/86.7 | 86.2/87.8 | 86.1/87.8 | 89.7/94.3 | 97.1/98.7 | ||
| cNT | 82.6/83 | 93.4/93.7 | 92.2/92.9 | 89.2/89.7 | 96.2/97.7 | ||
| Drosophila | 99.7/99.8 | 100/100 | 100/100 | 100/100 | 100/100 | ||
| overall | 92.9/93.4 | 95.8/96.1 | 95.5/96 | 95.5/96.2 | 98.6/99 |
| 257 cells | Network | |
|---|---|---|
| dWL | 63.8/67.4 | 59.9/59.9 |
| dWP | 64.9/66.5 | 82.6/82.8 |
| global | 63.1/65.5 | 70/71.8 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Stable Topological Summaries for Analyzing the Organization of Cells in a Packed Tissue
Nieves Atienza, Maria-Jose Jimenez, Manuel Soriano-Trigueros
Abstract
We use Topological Data Analysis tools for studying the inner organization of cells in segmented images of epithelial tissues. More specifically, for each segmented image, we compute different persistence barcodes, which codify lifetime of homology classes (persistent homology) along different filtrations (increasing nested sequences of simplicial complexes) that are built from the regions representing the cells in the tissue. We use a complete and well-grounded set of numerical variables over those persistence barcodes, also known as topological summaries. A novel combination of normalization methods for both, the set of input segmented images and the produced barcodes, allows to prove stability results for those variables with respect to small changes in the input, as well as invariance to image scale. Our study provides new insights to this problem, such as a possible novel indicator for the development of the drosophila wing disc tissue or the importance of centroids distribution to differentiate some tissues from their CVT-path counterpart (a mathematical model of epithelia based on Voronoi diagrams). We also show how the use of topological summaries may improve the classification accuracy of epithelial images using Random Forests algorithm.
Department of Applied Mathematics I, University of Seville
{natienza, majiro, msoriano4}@us.es
1 Introduction
Epithelia are packed tissues formed by tightly assembled cells, with almost no intercellular spaces. There are many studies in the literature focused on their natural organization, since changes on it may indicate an early onset of disease [1, 2]. Therefore, being able to quantify differences in the spatial distribution of cells is an interesting problem.
In some of these tissues, their apical surfaces are similar to convex polygons forming a natural tessellation. That allows to identify each cell with a polygon with as many sides as neighboring cells. The study of epithelial organization has been mainly focused on the polygon distributions, that is, the distribution of the number of neighbors (sides) of the cells (polygons). In [3], the authors looked for differences in polygon distributions on two proliferative stages of drosophila wing disc. In later studies, the concept of centroidal Voronoi tessellation (CVT) was used, which is a Voronoi diagram where the point generating each region coincides with its centroid. The Lloyd algorithm is an iterative algorithm that, starting from a random cloud of points, produces a series of Voronoi diagrams, that we will denote by ( , that converge to a CVT [4]. Such a sequence of Voronoi diagrams is called CVT-path in [5]. There, the authors compared the polygon distributions of images of natural packed tissues with those of the CVT-path and showed that the former fit to certain distributions given inside the CVT-path. A different approach was developed in [6], were the authors provided an image analysis tool implemented in the open-access platform FIJI, to quantify epithelial organization based in computational geometry and graph theory concepts. More specifically, considering the contact graph, that is, the graph generated by the cells (vertices) and the cell-to-cell contacts (edges), they searched locally for specific motifs represented by small subgraphs (graphlets) to characterise the tissue.
The previous approaches have two main problems: assuming that the cells are similar to convex polygons and that the analysis is mostly related to local features ignoring other aspects of the contact graph, as for example, what types of polygons are connected among them.
As pointed out by P. Villoutreix in his thesis, the standard topological analysis of complex networks is very limited in this context, since all contact graphs are planar (see [7][Sec. 8.2.2]).
In [7], the author proposed the use of topological data analysis (TDA) as a possible solution to obtain richer information than just the polygon distribution while keeping the stability of the method respect to the inner organization of the tissue.
In this paper we prove these two postulates. In particular, we will formalize the intuitive notion of inner organization of the tissue using its contact graph and its spatial centroid distribution. In addition, we will show that TDA may work when cells are not convex-like. Finally, we add TDA variables to a machine learning workflow to improve the classification of images coming from cellular tissues. We also explain some interpretations of the TDA variables at the tissue level, giving new insights about the organization of the cell.
1.1 Previous topological data analysis approaches
Recall that topology is the branch of mathematics that deals with properties of space that keep invariant under continuous transformations. These properties may be extremely important when the space is a network.
Nowadays, TDA is spreading as a useful approach in very different scientific fields, playing an increasing role in biological and, in general, biomedical imaging. Its main analysis tool, persistent homology [8, 9], has been successfully applied in solving problems such as tumour segmentation [10], analysis of biological networks [11] or diagnostic of chronic obstructive pulmonary disease [12]. Persistent homology studies the evolution of homology classes and their life-times (persistence) in an increasing nested sequence of spaces (called a filtration). A filtration could be thought as a multiscale combinatorial model that represents topological (and somehow, geometric) information of the data. The result of persistent homology computation can be codified as a combinatorial invariant, called the persistence barcode, which plays as a topological signature of the filtration, and therefore of the original data set.
In [7], a model for analyzing the contact graph of the cell tissues using sub and sup* filtrations* (see Section 2.3.1). To our knowledge, this was the first experiment relating epithelial tissues with TDA. Its nature was exploratory and the results similar to the ones obtained using the mean and variance of the degree of the cells. Another problem of this analysis was that the polynomial variables used for analyzing the barcodes were not stable with respect to the bottleneck distance (see Section 2.5). Defining stable polynomial variables was one of the main reasons why tropical coordinates were defined in [13]. We have found the work in [7] extremely useful as a first approach and this paper can be seen as a continuation of it. Independently, a similar approach was presented in the conference paper [14]. In this case, persistent entropy was used instead of polynomial variables. Again, stability was not guaranteed, since the number of bars of each barcode was not fixed, as required in [15] for a stable result. Finally, another approach was introduced in the conference paper [16]. In that case, instead of using the contact graph of the cells, spatial distribution of their centroids was studied, using the alpha* filtration* (or alpha-complex) which was constructed over the Delaunay complex generated by the set of centroids [17]. Nevertheless, keeping the infinity bar in the barcodes made the summaries depend on the original scale of the image, introducing bias in the analysis. One of the major motivations of this paper was to solve that problem. Needless to say that finding a setting for normalizing the barcodes (so that they can be compared), while guaranteeing the stability of the variables used to analyze them is far away from being trivial and require a mathematical analysis even for a specific case study like this one.
Recently, a paper applying TDA to cell images has been published [18]. Their approach uses cycles to analyze the clustering of epithelial cells as self-propelled particles (not forming packed tissues).
1.2 Overview of the paper
In Section 2, we will modify the methods appearing in [7] and [16] to guarantee the stability in all the procedure. Mathematical proofs to support correctness are provided. No assumption about convexity of the cells is needed. In Section 3, a rigorous statistical analysis of the results for tissues (both, epithelium and CVTs) is carried out using TDA variables. Also, these variables will be used to improve the performance of a Random Forests classification of the tissues based on their neighbors distribution. Interpretation of some of the variables at the tissue level are provided. Finally, we will summarize the results in Section 4 and propose new interesting questions arising from this paper that might be of interest for different fields, such as developmental biology, pattern recognition or TDA.
2 Materials and Methods
Our aim is to assign to each epithelial image an invariant, called persistence barcode, representing inner topological and geometrical information. We would like to analyze these persistence barcodes using numerical variables. There are three main difficulties:
- •
Finding a correct data normalization which does not include bias in our analysis due to the number of cells or the scale of the image;
- •
Guaranteeing our variables only measure topological-geometrical properties. In particular, they should be invariant to rotation or scale changes.
- •
Proving robustness of our variables with respect to the cell organization.
2.1 Input data
Our method is suitable for the topologial analysis of the organization of segmented regions that partition a portion of plane. In this paper, the images for our experiments come from several types of (real) epithelial tissues as well as different mathematical tessellations. The segmented images of epithelial tissues considered are available as suplementary material in the article [19]. More specifically, they are images of chicken tissues: chicken embryonic ectoderm (cEE) images, chicken neural tube (cNT) images; and Drosophila tissues: Drosophila notum prepupa (dNP) images, wing disc in the larva (dWL) and prepupal (dWP) stages of development. The tissues dWL and dWP are taken from two proliferative stages separated by only 24 hours of development, so they are really hard to distinguish. Further information about the way these images were obtained and segmented can be found in [19].
Besides, since the morphology of cells in some epithelial tissues is commonly approximated by Voronoi tessellations [20], we also consider the so-called CVT-path: take a random set of points on the plane and, iteratively, compute the Voronoi diagram from them and its set of centroids as seed points for next iteration (). Information of how the CVT-path is generated can be found in [5].
In order to obtain information from the regions like the centroids, the Matlab funtion regionprops was used. For the contact graph, a small dilatation was performed on each region and labels of adjacent regions reached by the dilation were retrieved. Cells are said to be valid if they do not touch the exterior limits of the image. Only valid cells will be processed. The data extraction procedure, together with the whole code, can be found in a publicly available repository.111github.com/Cimagroup/topo-summaries-for-packed-tissues
2.2 Normalization and cell selection
In order to avoid the bias induced by the number of valid cells in each tissue, we will consider always the same number of cells from each image to proceed to the topological analysis. Besides, this normalization is key to prove some stability results in Subsections 2.6 and 2.7. Although a higher number of cells is, in general, better to have a more global picture of the organization, the amount considered will be constrained by the minimum number of cells in the images of the given database. The number of valid cells in the whole set of images range from to , see Table 1. Then, as a starting point, we can fix as the number of cells picked. Unfortunately, a rare event happens in the first cEE image: some valid cells are completely surrounded by non-valid cells making them disconnected from the rest. We have decided to dismiss it as an outlier (representing a of the total sample). Then, we will fix , as it is the second minimun number of cells appearing in Table 1. We follow Algorithm 1 [16] to select the desired number of cells in each image. In Figure 1 an intuitive idea of how the algorithm works is given. From now on, we will denote as valid cells only to the ones selected by this algorithm.
2.3 Simplicial complexes and filtrations
A -simplex (or simplex of dimension ) in is the convex hull of a set of affinely independent points . The points of are called the vertices of and the subsets of form the faces of . That is, each -simplex contained in with is called a face of . A (geometric) simplicial complex is formed by a set of simplices satisfying:
Every face of a simplex in is also in . 2. 2.
The intersection of any two simplices in is either a face of both simplices or the empty set.
The dimension of a simplicial complex is the maximum of the dimensions of its simplices. The combinatorial description of as finite subsets of the whole set of vertices (without considering the geometric embedding in ) is known as an abstract simplicial complex. In the following, when we refer to a simplicial complex we mean an abstract simplicial complex.
A filtration over a simplicial complex is a finite nested sequence of simplicial subcomplexes
[TABLE]
It is commonly defined using a monotonic function i.e. for any two simplices , if is a face of , then . That way, if are the function values of all the simplices in , then the subcomplexes , for define a filtration over . We may call a filtration when we actually refer to the filtration induced by .
We will use three types of filtrations: the clique complex filtration sub, the clique complex filtration sup, and the Vietoris-Rips filtration rips.
2.3.1 The sub and sup filtrations
Since we know which valid cells are neighbors between them, we can build a graph representing this relation as edges. We denote it as a contact graph. We construct the clique complex of a graph, , adding a -simplex whenever the graph has a clique formed by the vertices . Define the sub, sup filtration [7] over the clique complex of a contact graph using the following functions,
[TABLE]
where is the number of valid neighbors of the cell (i.e. the degree of the vertex) and a face of the simplicial complex. We use the value in the sup filtration since it is rare to find a cell with such a number of neighbors, and in fact, there is not such a cell in our samples. Note that both sup and sub only carry information about the topology of the contact graph of the cells (it is a topological invariant). See Figure 2 for two examples.
2.3.2 The rips filtration
Another strategy is to obtain the centroid of each cell and study their distribution. In order to do that, we will use the Vietoris-Rips filtration [17]. It is constructed using the function
[TABLE]
Over the simplex generated from the whole set of centroids. See Figure 3 for an example.
It is important to emphasize that the immersion of the point clouds in the image to depends on the “distance per pixel relation” of the original image, and so does rips. Then, rips is not scale-invariant. We would like to eliminate this bias with a normalization process, but due to stability arguments we will apply it in the next step and not directly on the filtration.
2.4 Persistent homology and barcodes
Intuitively, homology formalize the notion of -dimensional hole. A [math]-dimensional hole is a connected component, a -dimensional hole is a tunnel (or a cycle in a graph), a -dimensional hole is a cavity, and so on. More specifically, homology gives a procedure to assign to a simplicial complex , a vector space as follows:
First, define the -chains of , , as the vector space over a field , with basis the set of -dimensional simplices of . In this paper, we will use . If is an -simplex, define the boundary operator in each -simplex as where are the faces of . Then, extend it linearly to obtain . Homology is the vector space
[TABLE]
where is the Kernel of and *im * is the Image. Each of the classes of can be seen as a hole of . The -betti number, , is interpreted as its amount of -dimensional holes.
Besides, if we have two simplicial complexes , homology induces a linear map between and . In this case, can be seen as the number of -dimensional holes shared by both simplicial complexes.
Persistent homology study how the -dimensional holes appear and disappear in a filtration . Fix a pair of numbers and . Following the previous reasoning, note that the value
[TABLE]
can be interpreted as the number of -dimensional holes which appears at and disappear at . Note that and are values out of the original filtration. In order to proceed with the calculation, we can set them as [math]. Since does not correspond to a simplicial complex, holes which disappear at may actually be considered not to disappear and persist up to infinity. This information is summarized in the -dimensional barcode (or -barcode), a multiset of intervals
[TABLE]
where each interval appears times (its multiplicity). Nevertheless, we want to use sets instead of multisets, so a barcode, , will be described as a set of intervals, each appearing repeated as many times as its multiplicity,
[TABLE]
We give an example in Figure 4. Further details on homology and persistent homology can be found in [17].
2.5 Bottleneck distance and stability
One of the main advantages of persistent homology with respect to classical homology, is that it is stable regarding small modifications of the input. In order to introduce this result, we need a notion of closeness for persistence barcodes.
Similarities between persistence barcodes can be measured by bottleneck distance. A -partial matching between barcodes is a colection of pairs such that
- •
For each , there is at most one , such that \big{(}[b_{1},d_{1}),[b_{2},d_{2})\big{)}\in M and vice versa.
- •
If \big{(}[b_{1},d_{1}),[b_{2},d_{2})\big{)}\in M then
- •
If (or ) is unpaired, then .
The bottleneck distance between two barcodes is defined as
[TABLE]
The following stability result can be found in [17]. Given two filtration , we have
[TABLE]
In our case, cell tissues with similar contact network or similar centroid distribution give similar barcodes.
2.6 Barcodes normalization
Barcodes are not good for statistical analysis as shown in [21]. We will use numerical variables calculated from the barcodes, but we need to deal with infinity bars before. Consider a barcode representing a sub or sup filtration. We would like somehow to keep infinity bars since they gives information about the barcode. Define the function which gives the same barcode but with infinity bars transformed in . In [15] it was shown that
[TABLE]
In our sample, no cell has or more neighbors, so fixing will map infinity bars to bars that are always longer than the others. Note that sub and sup barcodes are always in the same units (number of neighboors) and can be compared between them. In addition, they are invariant to rotations and scale of the input, by definition.
In the rips case, the last complex appearing in the filtration is always contractible, so there are not infinity bars with dimensions greater than [math], and only one infinity interval, , in the [math] dimensional persistent homology. Then, infinity bars does not give information in this context so we eliminate them using . Note that this is equivalent to calculate the reduced homology.
Recall that we mentioned in Section 2.3.2 that the centroids from which rips is defined, still carry the units from the image and so does the barcode associated to rips. Hence, it is not invariant to scale, since it depends on the distance matrix of the point cloud. The following normalization solves this problem: we are dividing each barcode by the sum of the lengths of the bars. More specifically, given with no infinity bars, define and . It is a direct consequence from [15, lemma 3.9] that
[TABLE]
Where is the maximum number of bars between and . Note that for any barcode coming from the 0-dimenional rips we have number of bars (one for each of the cells minus the infinity bar) and , where is the average length of the bars. Then, for all [math]-dimensional rips barcodes coming from our experiment
[TABLE]
where is the minimum of all averages . Note that cannot be arbitrarily small since there are physical constraint for the size of cells in the tissue. Unfortunately, the -dimensional case is not stable under this normalization since we cannot find a lower bound for . We drop the -dimensional rips barcodes from the experiment. As the following result shows, this normalization makes rips barcodes scale-invariant.
Proposition 2.1**.**
Fix a normed vector space and the induced distance, . Fix a scalar, , and consider two point clouds and . Let and be their barcodes coming from rips. Then, .
Proof.
Note that for any two points
[TABLE]
Then, if the induced filtration by is , the one from is . In particular, in the first case must be equal to in the second case. This means that barcodes are also proportional and , so
[TABLE]
∎
In particular, it works in our setting since with the Euclidean distance is a normed vector space. Then, from each image we have five barcodes, four coming from [math] and -dimensional sub and sup filtrations and coming from [math]-dimensional rips. From now on, when we mention a barcode coming from any of these filtrations, we assume the corresponding has been already applied.
2.7 Stable topological summaries
In the previous section, we saw that barcodes with the bottleneck distance are stable respect to modifications in the input. Then, variables defined on the barcodes, which are stable with respect to the bottleneck distance, will be also stable with respect to to the input.
In this section, we will describe the variables used in this paper and study their stability.
2.7.1 Persistent Entropy
Persistent entropy [22, 23] is a topological summary that can be seen as an adaptation of Shannon entropy (Shannon index in ecology) to the persistent homology context. Given a barcode with finite bars consider the length of the bars and their sum . Then, its persistent entropy is:
[TABLE]
When computed over an -dimensional barcode . The stability result appearing in [15] is simplified greatly in our case. In particular, We have that the [math]-barcodes coming from rips satisfy after normalization.
First, recall a result relative to the Shannon entropy, .
Proposition 2.2** ([24, p. 664]).**
Let and be two finite probability distributions (seen as vectors in ), and let and be, respectively, their Shannon entropy. If then
[TABLE]
We can transform the previous proposition in the following result for persistent entropy:
Proposition 2.3**.**
Let and be two barcodes with the same number of bars, , all of them starting at [math] and satisfying . If then
[TABLE]
Proof.
Note that since both barcodes have the same number of bars and all of them start at [math], the matching provided by the bottleneck distance is a one-to-one mapping between both sets of intervals. Then, we can order the barcodes in such a way that the bars matched by bottleneck distance are listed in the same position. Besides, since we have we can treat its barcode as a finite probability distribution . Name the probability distribution of . Note that . Then, substituting in the formula in Proposition 2.2 and using , the result follows. ∎
Then, this proposition gives a stability result for the rips filtration. In the sub and sup case the result is not straight forward, but we can still talk about stability (see Theorem 3.12 in [15]). Onward, We will refer to the persistent entropy of a -dimensional barcode as .
2.7.2 Tropical polynomials
Tropical coordinates allow to define stable polynomials over barcodes as explained in [13]. These polynomials are defined on the max-plus semiring with addition and multiplication being defined as:
[TABLE]
In particular, for the variables , polynomials of this semi-ring are written (with the usual notation) as
[TABLE]
where and . If we make analogous definition for barcodes, using the length of the bars, , as variables, polynomials of the form
[TABLE]
are shown to be stable with respect to the bottleneck distance.
Proposition 2.4**.**
[13]** Let be a polynomial as stated before, defined over barcodes . Then, there exists a constant such that
[TABLE]
2.7.3 Persistence landscapes
A Persistence landscape [25] is a sequence of summary functions obtained from a barcode. Given a barcode perform the change of coordinates
[TABLE]
The rescaled rank function, is defined as
[TABLE]
The persistence landscape is the set of function with given by
[TABLE]
Persistence landscape are related with tropical polynomials. In particular, they are an example of what is called tropical rational function (see [26, 13]). See Figure 5 for an illustration.
There is also a stability result available,
Proposition 2.5**.**
[25, 26]** Let A and B be two persistence barcodes and let and be their persistence landscape. Then, for all and ,
[TABLE]
Since we are interested in variables and not summary functions, we will use the -norm of . Note that for sub and sup the domain of the landscape is restricted to and for rips all the intervals will lie in . Then, in our case
[TABLE]
where or depending on the filtration.
We have seen how to obtain numerical summaries from the tissue images, each of them stable respect to the filtration induced by the cell organization. This means that, if their contact network or their centroid distribution were similar (up to scaling), the resulting variables will be similar. We have proved these summaries satisfy the desired conditions: they measure topological and geometrical properties (at least up to scaling or rotation), they are robust to modification in the organization of the cells (we mean respect to modification in their contact network or centroid distribution) and all barcodes have been normalized to avoid bias when comparing them.
3 Results
Our experiment is divided in two parts. First, we analyze the barcodes using statistical techniques. Then, we try to classify the images using Random Forests. In both cases, we use variables coming from the previous section. Before each experiment, we performed an exploratory analysis with many variables. We selected different polynomials and values for the landscapes. The notation is as follows: means the summary corresponding to the norm of the landscape computed from the -dimensional persistence barcode of the filtration filt with parameter . Instead of a fixed number, we may express with a percentage together with the letter . For example, means we have used the sub filtration and . For we have . means the sum where is the -th largest length in the -dimensional barcode of the filtration filt. Again, we may express or as percentages instead of fixed numbers. For example, means in the 0-dimensional sup barcode when , since . If only one element appear in the sum, we write directly for the -th length. Finally mean the persistent entropy of the -dimensional filt barcode. As mentioned in Section 2.2, we fix . After choosing different values for the parameters of the variables in each of the barcodes (the [math] and -dimensional barcodes of sub and sup, and the [math]-dimensional barcodes of the rips filtration), we obtain a total of summary variables per image. The code with the whole experiment can be found in a publicly available repository 222github.com/Cimagroup/topo-summaries-for-packed-tissues.
3.1 The statistical analysis
We will look for significant differences in the distributions followed by the TDA summaries in each of the tissues.
3.1.1 Eplithelial tissues
Note that our samples are relatively small: between and images per tissue. Then, we cannot assume the variables follow any parametric distribution. This means that our statistical analysis must be based on a non-parametric test. First, we will use the Kruskall-Wallis test to see if each variable follows the same distribution in all the tissues. If this is not true, we will try to find differences between pair of tissues using the Dunn test. Since we are using many variables, we fix the -value in .
The Kruskall-Wallis test found significant differences among the tissues for all the variables except for two. Due to a high redundancy among the variables, results of the Dunn Test of some selected variables are shown in Table 2. Note that cEE and cNT can be easily differentiated between them and from the rest using sub and sup, as expected. Nevertheless, no differences were found between cEE vs cNT and cNT vs dNP for rips, what means that we could not distinguish their centroid distributions. Differences between dNP and both wing tissues, dWL and dWP, were found but only for rips. In particular, dNP vs dWL can only be differentiated by , and .
Finally, we could find differences between dWL and dWP only for one variable, . A possible explanation can be the small amount of cells selected from each image of the sample. With the expectation that we could find more differences by increasing the number of cells, we design a more specific experiment to compare these two tissues taking the maximum number of available cells (, see Table 1) and performing a Mann-Whitney U test. In Table 3, the results for the test fixing and are displayed for three significant variables. As we expected, we find more significant differences with the increase in the number of cells. Note that a change in is just a change in a parameter of the variables, and not a change of the sample size (because the sample images remain the same).
We will analyze the meaning of some of the variables appearing in this section in Section 3.3.
3.1.2 Comparing the CVT-path with epithelia
Some of the epithelial tissues will be compared with their most similar tessellation in the CVT-path. Following [5], cNT follows a similar neighbor distribution to , dWL to and dWP to . Since we are interested only in making those pairwise comparisons, we are performing again the Mann-Whitney U test instead of the Kruskall-Wallis one. The minimum valid cells per image is (see Table 1). A selection of the results are displayed in Table 4. Many variables follow different distributions between the CVT tissue and its epithelium counterpart (between and depending on the type compared), most of them in the rips filtration. Differences in the sub and sup filtrations were only found between cNT and .
3.2 Classifying the images
We will classify the epithelial images in three classes: cEE, cNT and Drosophila tissues. Drosophila tissues are dWL, dWP and dNP. These tissues can be easily separated from and using the mean and variance of the degrees in the contact graph. Nevertheless, distinguishing between cEE and cNT is more difficult. Since we do not have a big sample of data, we will use the Random Forests technique to avoid over-fitting. Many variables used in network analysis have a strong relation with the mean degree in this specific context [7][Sec. 8.2.2]. Then, variables used for the network analysis are: the mean and variance of the degree and the amount of cells with degree equals to cells. We fix as in the first experiment of Section 3.1. We use of data as a training test and for validation. We fix the number of trees in since the accuracy is already stabilized for that number. This procedure is repeated times, the average accuracy of the classification is shown in Table 5. The best results is reached with only variables: the mean and variance of the degree and . The validation results are slightly better than the training ones, so we do not commit over-fitting. The selected variables outperform the others. This proves that TDA variable may be useful to complement other variables in machine learning tasks.
3.3 Interpretation of the variables
As we will see, information carried by sub and sup filtration are strongly related with the neighbor distribution, but not only, since details of the inner organization of the tissue might enrich it. Besides, rips is strongly related with the relative proximity of the centroids of the cells inside the same image and any interpretation of a variable must be in that terms. In the sequel, we will use the term -cell to refer to a cell with neighbors.
3.3.1 The variable
This landscape is detecting when there are at least -dimensional holes alive. For we obtained the most discriminating value is , see Figure 6. -dimensional holes in the sub filtration are formed when there are clusters of cells surrounded by other cells with a smaller amount of neighbors, see Figure 2. For example, cEE has a big variance with more cells with few neighbors ( or ) or many neighbors ( or ) than other tissues. In particular, cells with more neighbors have a greater chance to appear forming clusters than in other tissues, where is more common to find them isolated. Then, a smaller number of -dimensional holes is expected. There is another factor, cEE cells are far away of being convex allowing settings like isolated small cells embedded between two or three big cells. So again, -dimensional holes in cEE are less likely that in other tissues, and never reach the simultaneous -dimensional holes threshold.
On the other hand, Drosophila images have small variance with plenty of cells with neighbors. In particular, when all -cells are connected and many -dimensional holes appear, one for each cluster formed by cells with more than neighbors. See Figure 2.
Finally, cNT is halfway of both.
In general, there is a strong correlation between this variable and the variance of the degrees. For the rest of variables, complementary information will become more important than just the degree distributions.
3.3.2 The variable
In this case, landscape is detecting when there are at least connected components simultaneously alive. In our experiment, the best is . An interesting pattern arise, which improves the results with respect to just using the number of neighbors. Many cells with or neighbors are just cells on the boundary which will be connected soon with some neighbor. Nevertheless, cEE tissues, have non-boundary cells with or neighbors which are isolated and surrounded by cells with or neighbors. That will generate some longest connected components than in the other tissues.
In the other chicken tissue, cNT, cells with degree from to neighbors are more uniformly distributed in the image. Hence, a greater proportion of connected components arises with birth time or and death time for neighbors.
This effect is even clearer in Drosophila tissues: since there are fewer -cells, connected components with or cells on the boundary are alive until cells with neighbors appear. Many of them connect with cells on the boundary, but other -cells keep isolated or in small clusters creating connected components. These new connected components have a short life and usually die when -cells appear.
Then, cEE landscape tends to have a greater area with two close peaks, cNT landscape tends to have a medium area with only one peak and Drosophila landscape tends to have a smaller area and or separated peaks depending if there have enough -cells or not; see Figure 7.
Therefore, this variable is not only taking into account the distribution of the cells but how cells with different number of neighbors are connected among them and with regard to the boundary. As it is shown in the Random Forests classification, this variable performs better for classification than comparing directly the number of neighbors (see the accuracy of network variables vs mean + variance + in Table 5).
3.3.3 and
These variables become specially important when comparing dWL and dWP. Actually, in this case there is a huge correlation between both variables since they are measuring the same feature at the tissue level. 1-dimensional holes in the sup filtrations appear when there are cells (or cluster of cells) with a small number of neighbors which are surrounded by cells with a higher number of neighbors. In this case, the key difference between both tissues is provided by persistence bars which appear when there are 4-cells, some of whose neighbors are 6 or 7-cells. The presence of this combination of cells provides more bars with longer persistence in dWL than in dWP, see Figure 8. Then, (and the sum ) will be greater in dWL. For , the best result is reached when . In Table 6, a small experiment showing this variable is measuring different features than just the neighbors distribution is displayed.
3.3.4
Calculate the length of the -th longest bar in the Rips filtration is equivalent to the distance for which there are less than connected components. Since our barcodes are normalized, the distance have no units and can be interpreted as a proportion. Besides, in this data set the longest bars are associated to connected components corresponding to isolated centroids, or small clusters of centroids (recall the infinity bar was removed). Then, this variable is directly related with the (relative) distance between the centroids. In practice, it is measuring if there are at least centroids with a relative distance to the main connected component bigger than the others in the same image. The most discriminating for is . This type of variable becomes important when analyzing Drosophila tissues, since it is the only one finding differences between dNP and the others (dWL, dWP). An example is shown in Figure 9.
3.3.5
Note that since Shannon entropy is a concave function in the space of probability distribution [24], so is the persistent entropy in the space of normalized rips barcodes. Then, knowing the maximum will give us a valuable hint for understanding this variable at the tissue level.
We will consider the point cloud as a finite metric space, . Then, we define a distance graph on , , as a clique graph whose set of vertices is given by the centroids and the weight on each edge is the distance between the corresponding vertices.
Proposition 3.1**.**
Let be a finite metric space and its persistence barcode (with the infinity bar removed) coming from the rips filtration over . Then, is maximum if and only if the minimum spanning tree of its distance graph has a constant weight for all its edges.
Proof.
Note that since we only consider the [math]-dimensional rips barcode, all bars are born at [math]and their death value is the same than their length. Besides, note that Shannon entropy reaches its maximum value when all probabilities are the same [24]. Then, persistent entropy reaches its maximum value when all bars have the same length. Combining these two facts, we have that persistent entropy will be maximum if and only if all bars have the same death value. One direction is clear: If the minimum spanning tree has a constant weight, it means that all the vertices are isolated until the filtration value reaches that constant. Then, all the finite bars die at that value. For proving the other direction, assume the minimum spanning tree does not have a constant weight and define as its minimum weight. When the filtration value is , some of the vertices become connected between them but not all of them. This means that some bars die at , but some will die later. Then, persistent entropy cannot be maximum. ∎
Then, is strongly related with the variability of the weights in the minimum spanning tree of the distance graph. It makes sense since entropy may also be understood as a diversity index. In particular, it may have an interpretation in terms of how centroids of the cells are related between them. This variable becomes specially important when comparing some tissues with their CVT counterpart, since it was the only one (together with ) succeeding in differentiating all the cases, see Table 4. Then, it means that the CVT-path are failing to imitate the centroid distribution of the cells.
4 Discussion
We summarize here the results of this paper. Normalizing the number of cells obtained from each tissue, as well as the rips barcodes, allowed us to compare the network and centroid distributions of different cell tissues without loosing stability properties. This was proven by some theoretical results appearing in Sections 2.6 and 2.7. Note that these results may be generalized with respect to other tessellations of the plane. In Section 3, we compared some epithelia, obtaining some conclusions that might be of interest for the biological community:
- •
The geometry of the cells in cEE and cNT are completely different (cNT cells tend to be convex, while cEE do not). Nevertheless, their centroid distributions turned out to be very similar. Is there any biological or physical reason for this fact?
- •
Wing tissues in different state of development, like dWL and dWP, are difficult to differentiate. However, the variable and worked pretty well in this context. The main reason was a difference in the number of -cells surrounded by a mix of and -cells. Is this kind of cell combination a good indicator for the level of development? If so, why?
There are also interesting results for the pattern recognition community:
- •
We provided an example where TDA may be useful to study networks with a very simple topology, leading to the study of variables which would have been difficult to discover otherwise.
- •
In particular, we propose a combination of normalization in the original image and in the barcode which allows to prove formal stability of the method.
- •
This paper also provides an example where TDA variables may be combined with others to improve machine learning performance.
We have also seen that, despite the fact that CVT-path is good imitating the polygonal distribution of some of the tissues, they may fail to model their centroid distribution. A future work could be to use rips variables, specially persistent entropy, to improve the simulation of epithelium. Another interesting future work is to adapt this analysis to 3D epithelium, as well as to other fields, such as material science.
Acknowledgments: Authors would like to thanks researchers L.M. Escudero, P. Gómez-Gálvez and C. Molero-Ríos for their valuable help during the development of this research.
Funding: This research was funded by Ministerio de Ciencia e Innovación – Agencia Estatal de Investigación /10.13039/501100011033, grant number PID2019-107339GB-I00. The author M. Soriano-Trigueros was partially funded by the grant VI-PPITUS from University of Seville.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] V. Emmanuele, A. Kubota, and M. H. et al., “Fhl 1 w 122s causes loss of protein function and late-onset mild myopathy,” Hum. Mol. Genet. , vol. 24, no. 3, pp. 714–726, 2014.
- 2[2] J.-A. Park, J. H. Kim, and J. F. et al., “Unjamming and cell shape in the asthmatic airway epithelium,” Nat. Mater. , vol. 14, no. 10, pp. 1040–1048, 2015.
- 3[3] D. Sánchez-Gutiérrez, A. Sáez, A. Pascual, and L. Escudero, “Topological progression in proliferating epithelia is driven by a unique variation in polygon distribution,” P Lo S ONE , vol. 8, no. 11, p. e 79227, 2013.
- 4[4] M. Emelianenko, L. Ju, and A. Rand, “Nondegeneracy and weak global convergence of the lloyd algorithm in ℝ d superscript ℝ 𝑑 \mathbb{R}^{d} ,” SIAM Journal on Numerical Analysis , vol. 46, no. 3, pp. 1423–1441, 2008.
- 5[5] D. Sanchez-Gutierrez, M. Tozluoglu, and L. M. E. et al., “Fundamental physical cellular constraints drive self-organization of tissues,” The EMBO J. , vol. 35, no. 1, pp. 77–88, 2016.
- 6[6] P. Vicente-Munuera, P. Gomez-Galvez, R. J. Tetley, C. Forja, A. Tagua, M. Letran, M. Tozluoglu, Y. Mao, and L. M. Escudero, “Epigraph: an open-source platform to quantify epithelial organization,” Bioinformatics , vol. 36, no. 4, pp. 1314–1316, 2019.
- 7[7] P. Villoutreix, Randomness and variability in animal embryogenesis, a multi-scale approach . Ph D thesis, Université Sorbonne Paris Cité, 2015.
- 8[8] H. Edelsbrunner, D. Letscher, and A. Zomorodian, “Topological persistence and simplification,” Discret. & Comput. Geom. , vol. 28, no. 4, pp. 511–533, 2002.
