Bayesian contiguity constrained clustering, spanning trees and dendrograms
Etienne C\^ome

TL;DR
This paper introduces a Bayesian approach to contiguity-constrained clustering using spanning trees, enabling exact posterior probability computation, MAP partitioning, and Bayesian dendrogram extraction, with real-world applications demonstrated.
Contribution
It formalizes contiguity-constrained clustering in a Bayesian framework and develops algorithms for MAP partitioning and dendrogram extraction based on spanning trees.
Findings
Exact posterior probabilities of partitions computed using spanning trees
Algorithm for finding MAP partitions in the Bayesian setting
Demonstrated effectiveness on real-world datasets
Abstract
Clustering is a well-known and studied problem, one of its variants, called contiguity-constrained clustering, accepts as a second input a graph used to encode prior information about cluster structure by means of contiguity constraints i.e. clusters must form connected subgraphs of this graph. This paper discusses the interest of such a setting and proposes a new way to formalise it in a Bayesian setting, using results on spanning trees to compute exactly a posteriori probabilities of candidate partitions. An algorithmic solution is then investigated to find a maximum a posteriori (MAP) partition and extract a Bayesian dendrogram from it. The interest of this last tool, which is reminiscent of the classical output of a simple hierarchical clustering algorithm, is analysed. Finally, the proposed approach is demonstrated with real applications. A reference implementation of this work is…
| 0.25 | 0.5 | 0.75 | 1 | 1.25 | 1.5 | 1.75 | 2 | 2.25 | |
|---|---|---|---|---|---|---|---|---|---|
| azpg | 0.92 | 0.83 | 0.79 | 0.74 | 0.70 | 0.64 | 0.63 | 0.59 | 0.55 |
| azpsa | 0.92 | 0.87 | 0.81 | 0.75 | 0.69 | 0.63 | 0.61 | 0.57 | 0.54 |
| gtclust | 1.00 | 1.00 | 0.98 | 0.96 | 0.91 | 0.84 | 0.74 | 0.62 | 0.54 |
| gtclust (9) | 1.00 | 1.00 | 0.98 | 0.96 | 0.91 | 0.85 | 0.77 | 0.69 | 0.63 |
| mclust | 0.90 | 0.40 | 0.33 | 0.24 | 0.20 | 0.17 | 0.10 | 0.04 | 0.01 |
| mclust (9) | 0.91 | 0.65 | 0.52 | 0.43 | 0.36 | 0.31 | 0.26 | 0.23 | 0.20 |
| redcap | 0.86 | 0.87 | 0.88 | 0.84 | 0.80 | 0.75 | 0.70 | 0.64 | 0.62 |
| skater | 0.90 | 0.90 | 0.90 | 0.83 | 0.80 | 0.74 | 0.69 | 0.66 | 0.65 |
| 0.25 | 0.5 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 2.00 | 2.25 | |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 100 | 96 | 94 | 74 | 58 | 20 | 0 | 0 | |
| average | 9 | 9 | 9 | 8.98 | 8.80 | 8.42 | 7.14 | 5.82 | 5.02 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Data Management and Algorithms
A tree is all you need
Etienne Côme*∗*
*∗*Cosys/Grettia, Gustave-Eiffel University, Noisy Champs, France
Abstract
Clustering is a well-known and studied problem, one of its variants, called contiguity-constrained clustering, accepts as a second input a graph used to encode prior information about cluster structure by means of contiguity constraints i.e. clusters must form connected subgraphs of this graph. This paper discusses the interest of such a setting and proposes a new way to formalise it in a Bayesian setting, using results on spanning trees to compute exactly a posteriori probabilities of candidate partitions. An algorithmic solution is then investigated to find a maximum a posteriori (MAP) partition and extract a Bayesian dendrogram from it. The interest of this last tool, which is reminiscent of the classical output of a simple hierarchical clustering algorithm, is analysed. Finally, the proposed approach is demonstrated with real applications. A reference implementation of this work is available in the R package gtclust that accompanies the paper111available at http://github.com/comeetie/gtclust.
1 Context, motivations and related works
Contiguity-constrained clustering combines two pieces of information: on the one hand, a classical data set corresponding to a sample of observations living for example in or ; and on the other hand, an undirected graph between the same observations, with edges set . The informal objective being to find a partition of the data points that explains well and where each cluster forms a connected sub-graph of . A subgraph is said to be connected if, for any pair of nodes and in , there is at least one path connecting them without leaving . We will note a partition that fulfils this property. Such a problem allows the use of prior information about the clusters to ensure, for example, that they correspond to spatially coherent regions or time segments. These constraints are quite natural in several applications and greatly reduce the size of the space of possible solutions. Such constraints are therefore interesting from both a computational and an application point of view. Contiguity-constrained clustering has a long history, particularly in geography, where this approach is known as regionalisation and has been studied since the seminal papers of Masser \BBA Brown (\APACyear1975) and Openshaw (\APACyear1977). The interest for such a problem in spatial statistics seems natural, since contiguity graphs and distance graphs play an important role in this area (Anselin, \APACyear2001). For example, a contiguity graph can be inferred from spatial polygon data by looking at common boundaries, as shown in Figure 1(a), or from geographic networks (such as road networks) by looking at the intersections between segments, as shown in Figure 1(b). In the former case, contiguity constraints ensure that the clusters found correspond to coherent spatial regions, and in the latter case to connected roads. Since these pioneering contributions, several research papers in the field of quantitative geography have revisited this issue. The Spatial ‘K‘luster Analysis by Tree Edge Removal (SKATER) proposed by Assunção \BOthers. (\APACyear2006) is a graph-based method that uses a minimal spanning tree to reduce the search space. The regions are then defined by removing edges from the minimal spanning tree. The removed edges are chosen to minimise a dissimilarity measure. Inspired by SKATER, Guo (\APACyear2008) proposed REDCAP (Regionalization with Dynamically Constrained Agglomerative clustering and Partitioning), which discusses several variants of agglomerative approaches to solving contiguity constrained clustering problems by varying the affinity metrics and considering full order and partial order constraints.
Statisticians have also been studying contiguity constrained clustering for some time, and relevant references include the work of Lebart (\APACyear1978), Murtagh (\APACyear1985), Grimm (\APACyear1987) and Gordon (\APACyear1996), which study hierarchical agglomerative approaches to solving such a problem. As these references show, the interest in contiguity-constrained clustering and the fact that such constraints can be easily combined with an agglomerative approach has long been known in the statistical community. In addition to applications in geography and statistics, contiguity constraints have also been studied in the context of sequence analysis. Sequences, of course, can be equipped with simple line-graph and therefore benefit from contiguity constraints to solve segmentation tasks. This has led to applications of such an approach in genetics to study genome sequences (Christophe \BOthers., \APACyear2019) or in time series analysis (Barry \BBA Hartigan, \APACyear1993; Schwaller \BBA Robin, \APACyear2017). Finally, in the network analysis community, the Louvain (Blondel \BOthers., \APACyear2008) and Leiden algorithms (Traag \BOthers., \APACyear2019), two of the most well-known graph clustering approaches, leverage, even if not explicitly stated, contiguity constraints to speed up computations and extract communities.
In the Bayesian context, a first line of related work deals with the use of Spatial Product Partition Models (Hegarty \BBA Barry, \APACyear2008; Page \BBA Quintana, \APACyear2016). PPMs were first introduced by Hartigan (\APACyear1990) and assume that the partition prior is a product of subjective non-negative functions called prior cohesions. The cohesion functions measure how likely it is that elements in a given cluster are clustered a priori. This generic setting was then used to define priors that enforce the spatial coherence of the clusters. To do so, Hegarty \BBA Barry (\APACyear2008) defines these cohesion functions by counting the number of zones that are neighbours of an element of clusters but do not belong to , while Page \BBA Quintana (\APACyear2016) considers the locations and spatial distances between the elements of the clusters. Both approaches favour spatially coherent clustering, but do not strictly enforce the contiguity constraints, as we will do in this proposal. This line of research can also be linked to approaches that try to incorporate spatial information into classical clustering approaches, see for example (Chavent \BOthers., \APACyear2018), where a Ward-like hierarchical clustering algorithm is extended to minimise, at each stage, a convex combination of a homogeneity criterion computed in feature space and a homogeneity criterion incorporating the spatial dimension of the problem.
Finally, the approach of Teixeira \BOthers. (\APACyear2019) is the closest to our proposal since it also relies on spanning trees to define a prior over partitions that strictly enforce contiguity constraints. However, it differs from the present work since it is based on Markov Chain Monte Carlo (MCMC) computations to solve the estimation problem and does not use exactly the same prior/criterion. The interest of our proposal with respect to this approach lies in it’s cheaper computational cost and the ease of interpretation of the results (dendrogram and MAP partition).
As we have just seen, clustering with adjacency constraints is an important problem in unsupervised learning that has been studied for a long time and has important applications in various fields. This paper proposes to revisit this problem and aims at proposing a hierarchical Bayesian approach to solve it. This proposal is based on the definition of a prior on the partition space that enforces the graph-induced contiguity constraints and favours partitions that are easy to disentangle. To this end, the main contributions of the paper are as follows:
- •
the definition of a partition prior to ensure the contiguity constraints induced by a graph
- •
the derivation of an analytical expression to compute exactly this prior
- •
a hierarchical agglomerative algorithm to find an approximate maximum a posteriori partition
- •
a simple approach to building a dendrogram from the computed cluster hierarchy
Finally, the open source R package (R Core Team, \APACyear2019) gtclust provides a reference implementation of the algorithm presented in this paper. The implementation is extensible and new models can be integrated. The main computationally intensive methods have been developed in Cpp thanks to the Rcpp package (Eddelbuettel \BBA Balamuta, \APACyear2017), exploiting the computational efficiency of sparse matrices thanks to the Matrix packages (Eddelbuettel \BBA Sanderson, \APACyear2014; Bates \BBA Maechler, \APACyear2019) and the cholmod library. The c library (Chen \BOthers., \APACyear2008) for sparse Cholesky matrix factorisation. Finally, the sf package (Pebesma, \APACyear2018) was used to seamlessly integrate the package with the standards used to handle spatial datasets in R.
The remainder of this paper is structured as follows: in Section 2 we present details of our methodology, in Subsection 2.1 we present the mathematical framework we will use and how it can be adapted to different observational models, and in Subsection 2.2 we discuss the prior we propose and the elements needed to compute it. In Subsection 2.3 we describe the algorithm we propose to search for an optimal partition, and in Subsection 2.4 we show how to build a dendrogram from it. Finally, in Section 3 we describe several applications of the method to real and simulated data sets. Section 4 concludes the paper.
2 Method
Since we will be relying on a non-parametric Bayesian approach, the target we are looking for is an ordered partition of points with an unknown number of clusters in . More formally, let be an ordered partition of , that is:
[TABLE]
We assume that all data points belonging to the same element of the partition are , i.e. conditional independence, and we marginalise the cluster parameters so that the probability of the observed data knowing the partition is given by:
[TABLE]
where is a parametric distribution such as a Gaussian that generates the samples coming from group and its parameter vector. Assuming a clustering objective, we will try to find the MAP partition:
[TABLE]
Such a criterion, already successfully used in clustering application (Côme \BOthers., \APACyear2021), corresponds to an exact version of the Integrated Classification Likelihood (Biernacki \BOthers., \APACyear2000, \APACyear2010) and is thus a penalised criterion. It will therefore allow to automatically tune the number of clusters. To simplify the derivation, we will mainly work with the un-normalised log posterior:
[TABLE]
which has the same maxima. Before discussing the main point of this paper, which concerns the definition of an adequate prior over the partition to enforce contiguity constraints, we briefly introduce some elements on the computations of the first part of the criterion i.e. the probability of the observed data knowing the partition.
2.1 Exponential family and observation models
Computing involves computations of quantities :
[TABLE]
In this section, we show that such quantities depend on the data only through the sum of sufficient statistics in exponential family distributions, and that the integral can be computed analytically. To this end, we assume an exponential family observation model for the sample in cluster :
[TABLE]
where is the natural parameter, is the log partition function and is a positive normalisation function to ensure that this is a proper density. Then we take the same conjugate prior (Diaconis \BBA Ylvisaker, \APACyear1979) on for each cluster :
[TABLE]
where is another normalisation function. Finally, the integrated observational log-likelihood is obtained via the difference of the normalisation function before and after the update by sufficient statistics:
[TABLE]
Thus, any greedy merge heuristics relying on computations of only has to compute:
[TABLE]
these will easily fit into a hierarchical approach where there are evaluations of the ’s at the beginning and then merging two clusters only amounts to summing their respective sufficient statistics . This property allows the use of a simple stored-data approach Murtagh \BBA Contreras (\APACyear2012), where the stored data are the cluster sufficient statistics, to design an agglomerative hierarchical clustering algorithm, and this article will follow this line of work. Note that such an approach could be costly without the contiguity constraints, since all possible merges have to be considered, but that it becomes very competitive in the constrained case, since the constraints remove a lot of candidates, especially when the contiguity graph has a small average degree. Finally, note that a similar approach can be used for swap moves and allows efficient computations for several classical observation models i.e. Gaussian, Poisson, Multinomial. We refer the reader to the supplementary material of Côme \BOthers. (\APACyear2021) for derivations of specific models. Let us now introduce the main contribution of this work, which concerns the definition of the contiguity-constrained partition prior.
2.2 Contiguity-constrained partition prior
In the unconstrained case, the classical solutions to define a prior over the space of partitions are mainly based on the Dirichlet process (Rasmussen \BBA Ghahramani, \APACyear2001) or on uniform priors (Peixoto, \APACyear2019). We will follow a path closer to the latter approach here, and we will start by recognising that a contiguity-constrained prior can be easily defined if the graph is a tree. In fact, trees have the interesting property of being minimally connected, so removing any edges will create two connected components, and removing edges will create connected subgraphs. This property shows that there is a one-to-one relationship between combinations of edges and compatible partitions. Thus, if is a tree, there are ways of partitioning the vertices into groups that satisfy the constraints induced by , since removing any edges in the tree creates connected subgraphs and the tree has edges. Furthermore, there are ways of ordering the clusters, and so we can easily define a uniform distribution for when is a tree:
[TABLE]
For the moment, the prior distribution on the number of clusters can also be set to a uniform prior to get , but this assumption will be discussed later in Subsection 2.4. The interesting properties of trees that we have just highlighted give us a uniform prior when is a tree, but do not solve the general case. To do so, following Teixeira \BOthers. (\APACyear2019), we will introduce in the setting the set of all spanning tree of , denoted by . A spanning tree of a graph is a connected subgraph with no cycles containing all nodes of . Since a spanning tree is a tree, it inherits the tree properties: any two nodes of are connected by a unique path in , the number of edges in is , and removing any edges divides the vertices of into clusters, each cluster being a connected subgraph of . The spanning trees of can thus be used to define a two-stage prior, as follows:
Sample a spanning tree of uniformly:
[TABLE]
with the set of all spanning tree of . 2. 2.
Then remove edges, and draw an ordering for the corresponding clusters:
[TABLE]
This generative scheme for partitions will produce partitions that are compatible with , since the clusters form connected subgraphs of , and the edges of are by definition contained in the edges of . To obtain a prior on partition from this process, and since the specific spanning tree used to generate the clustering is not of particular interest, it is natural to marginalise this random variable, which results in the following expression:
[TABLE]
At this point, the introduction of spanning trees may seem artificial, but such a prior has interesting properties. This prior is quite informative, since it depends on the constraints defined by , but it goes even further, since it does not correspond to a uniform prior on feasible partitions. In fact, it is known that the probability that an edge is used in a uniformly sampled spanning tree is a good way of measuring how crucial that edge is for the graph to be connected. This measure, known as spanning tree centrality, (Angriman \BOthers., \APACyear2020; Hayashi \BOthers., \APACyear2016), gives us an interesting perspective on the proposed prior. This prior can be seen as a way of extending this measure to partitions of the graph vertices, measuring how easily the different elements of a given partition can be disentangled in the graph. This formalises in an interesting way the kind of prior that practitioners want to express with the contiguity graph.
Let us now discuss how to analytically compute the quantities involved in this prior. To use such a prior, we need to compute the number of spanning trees of and the number of spanning trees of that are compatible with a given partition . These quantities, which at first sight may seem quite difficult to compute, are in fact perfectly amenable to analytical expressions, thanks to the Kirchhoff’s theorem; which relates the number of spanning trees of a graph to the spectrum of its Laplacian Kirchhoff (\APACyear1847); Cayley (\APACyear1889).
Theorem 1** (Kirchhoff’s theorem).**
The number of spanning tree of a graph or multi-graph , without self-loops is given by:
[TABLE]
with be the non-zero eigenvalues of the Laplacian matrix of , given by:
[TABLE]
where is the adjacency matrix of , with equal to the multiplicity of edge for multi-graphs.
Using this theorem, it is therefore possible to compute . There are several solutions to this, but from a computational point of view it is interesting to note that a consequence of this theorem is that the number of spanning trees is given by any cofactor of , and is thus equal to the determinant of with one of its rows and columns removed. The Laplacian matrix with row and column removed is denoted by . To avoid numerical problems, the logarithm of the determinant is used:
[TABLE]
Finally, if the graph is sparse, the computation of the determinant can be solved in reasonable time using a sparse Cholesky decomposition for medium-sized problems. Such a factorisation will give sparse matrices and such that and where is a lower triangular matrix (with ones on the diagonal) and is a diagonal matrix and hence . This approach solves the problem of computing the number of spanning trees of a given graph, but does not answer the question of the number of spanning trees compatible with a given partition. However, we will see that it can be used as a building block for this. To this end, we can decompose the problem by studying the inter-cluster and intra-cluster connectivity of . The intra-cluster connectivity is easy to analyse by considering the subgraphs given by each element of the partition:
[TABLE]
To study the connectivity between clusters, the main objects of interest are the cutsets between the different elements of the partition, i.e. the set of edges connecting the different clusters:
[TABLE]
Using these cutsets, we can define an aggregate multi-graph that we will call , which describes the connection patterns between the clusters:
[TABLE]
where represents edges between and . This multi-graph thus has as many vertices as there are elements in , and the multiplicity of an edge between two vertices and is given by the size of the corresponding cutset in . Using this tool, the following proposition can be used to find the number of spanning trees that may have led to a given partition:
Proposition 1** (Compatible spanning trees).**
The number of spanning trees in a graph compatible with a given partition of its vertices is given by:
[TABLE]
Proof.
To form a compatible spanning tree we need to take () edges in the union of the of each cluster pairs. To get a spanning tree of these edges must define a spanning tree of , i.e. belongs to . However, any combination of intra-cluster spanning trees and inter-cluster spanning tree when combined will be compatible with the partition by definition and they will also form a spanning tree of . ∎
This proposition can then be combined with Kirchhoff’s theorem to compute exactly the desired quantity needed to compute the prior defined in Equation 8, since the latter formula only involves the number of spanning trees of given graphs, and Kirchhoff’s theorem generalises to multi-graphs. So that,
[TABLE]
can be computed exactly.
2.3 A greedy agglomerative algorithm
This section deals with the algorithmic details that must be taken into account in order to design an efficient agglomerative algorithm from the previous proposal. In particular, we will see that great care must be taken to avoid unnecessary computations when computing the prior probability of a partition along the merge tree.
In its simplest form, greedy agglomerative clustering simply reduces to iteratively selecting the best merge action to perform until only one cluster remains. Such a process builds a complete cluster hierarchy, since at each step of the algorithm with the current partition , the number of clusters is reduced by one, and the process starts with each individual data point in its own cluster:
[TABLE]
Ideally, we are interested in all partitions that maximise , given by the Equation 10 for ranging from to . Such an output can then be used to find the partition with maximum a posteriori probability, and to construct a dendrogram as shown in the next section. A greedy agglomerative algorithm with proper merge ranking can be seen as a way to approximate this set of solutions by selecting the best partition from the partitions that can be constructed by merging two clusters of . Contiguity constraints can, of course, speed up this process by reducing the space of possible merges at each step. More formally, the possible merges compatible with the constraints and a current partition correspond to the support of the multiset of edges of that we introduced earlier. At the first iteration, and therefore the possible merges are defined by the initial graph, so every must be inspected and the best one taken. When a merge occurs, the contiguity graph between clusters must be updated. This can be done by contracting the edge : all edges between are removed, as well as their two incident vertices and , which are merged into a new vertex , where the edges incident to each correspond to an edge incident to either or , keeping possible duplicate edges leading to a multi-graph. This ensures that the support of the muti-graph edges always corresponds to all potential merges that fulfil the constraints. We will denote the multi-graph resulting from such an edge contraction by , which allows us to write down the following recurrence relation:
[TABLE]
where is the edge (and thus the merge) selected at step in . The use of edge contraction thus allows the cluster contiguity graphs to be efficiently updated.
To design a greedy agglomerative algorithm for the contiguity constrained problem, the main technical difficulty is to define how to rank the possible merges in an efficient way. Formally, a merge corresponds to modifying the current partition , leaving all clusters as they were, except two clusters and , which are removed and replaced by a single cluster corresponding to their union. We will refer to this merged partition as :
[TABLE]
To rank the merges compatible with the constraints, we are naturally interested in the effect of this change on the log posterior:
[TABLE]
This quantity can be decomposed into two parts, one coming from the terms and one from the priors:
[TABLE]
Removing the terms appearing on both the initial partition and the one with the two clusters merged, it is easy to see that:
[TABLE]
This quantity can be easily evaluated using Equation 2.1 from the cluster sufficient statistics and . Furthermore, it depends only on and , and is therefore independent from the other elements of . This property is important because we didn’t want to update all the possible merge scores at each step, but only the new ones. Following a similar approach for the second term in the right hand side of Equation 13 and simplifying we obtain:
[TABLE]
This second term is more problematic from a computational point of view, because it depends on K and on the whole partition. The terms and depend on the whole partition and not only on the clusters and , and as already explained, this must be avoided. The dependence on is actually not a problem: at each iteration of the algorithm, the merge to be compared will result in the same number of clusters , and therefore the second term on the right and side of Equation 14 can be safely ignored to rank the merges. Regarding the dependence on the entire partition, this can also be relaxed by considering a lower bound on . In fact, we can use the deletion-contraction recurrence for multigraphs given by Lemma 1 to get a lower bound on the ratio of the two problematic terms. This bound depends only on the size of the cutset between clusters and and is given in Proposition 2.
Proposition 2** (Lower bound on merge score).**
Given a graph , for all possible partition of its vertices , with at least two elements and , the following inequality holds:
[TABLE]
Proof.
This is a direct consequence of applying Lemma 1 with and vertices and , which gives , and since the inequality holds. ∎
Lemma 1** (Deletion-contraction recurrence for multi-graphs).**
Given a multigraph , without self-loops and more than 3 vertices, for any pair of vertices and , connected by at least one edge, we have:
[TABLE]
with the multiplicity of edge and the multigraph with all the edges between and removed.
Proof.
This Lemma is an extension of a classical result on spanning tree deletion-contraction recurrence [ref] to multi-graph. is the sum of the number of spanning trees that use one of the edges and the number of spanning trees of that do not pass through any of the edges. The number of spanning trees of that do not pass through any of the edges is given by since every spanning tree of is a spanning tree of that do not contains any edge and conversely any spanning tree of that do not contains any edge is a spanning tree of . If is a spanning tree of G containing one of the edges, the contraction of in both and results in a spanning tree of . But, if is a spanning tree of , there exists spanning trees of , one for each of the edges between and , such that . Thus, the number of spanning trees of passing trough one the edges is . Hence ∎
Putting the previous elements together, we get the following lower bound for the merge scores, which depends only on the clusters and :
[TABLE]
This quantity can be computed quite efficiently. During the merging process we keep track of the cluster contiguity graph, updating it with edge contraction, so the size of the cutset between two clusters is readily available. The other terms can also be computed efficiently, noting that we can compute with a small rank update of the Cholesky factorisation used to compute and since the Laplacian matrix of can be written as the sum of a block diagonal matrix with the Laplacian matrices of and on the diagonal plus an update for each edge in the cutset:
[TABLE]
with the Laplacian matrix of graph and , the column vector for edge with all elements equal to zero, except element equal to 1 and element equal to . This allows the use of efficient numerical routines for updating a sparse Cholesky factorization as the ones provided by the cholmod library (Davis \BBA Hager, \APACyear1999, \APACyear2001, \APACyear2005). Finally, once the merging process has stopped, a similar approach can be used, but backwards, to compute the terms from a previous decomposition of the Laplacian of . This allows us to compute for all the partitions extracted by the hierarchical algorithm, and thus choose the best one. For an overview of the whole process, see Algorithm 1.
2.4 Dendrogram and prior for the number of clusters
The previous sections have shown how to define a contiguity constrained prior when the number of desired clusters is known, but one of the main interests of the proposed prior is to help choose an appropriate number of clusters. For this purpose, one can use a uniform prior for over the value and search for a MAP partition for all possible . However, in a clustering application, practitioners may have some prior knowledge about the desired number of clusters and would prefer to gain some insight into the evolution of the model description capabilities with respect to this parameter. A classic tool often used in such a setting is the so-called dendrogram, which represents the merge tree of a hierarchical agglomerative clustering algorithm, with branch heights proportional to the difference in criterion values induced by the corresponding merge. A dendrogram is therefore a binary tree in which each node corresponds to a cluster. The edges connect the two clusters (nodes) merged in a given step of the algorithm. The height of the leaves is generally assumed to be 0. The leaves are ordered by a permutation of the initial clusters that ensures that successive merges are neighbours in the dendrogram. The height of the node corresponding to the cluster created at merge , , is often the value of the linkage criterion.
We propose to revisit this classical tool in the Bayesian context on which this paper is based. To do so, we start by replacing the uniform distribution on the number of clusters by an informative prior. Considering that the expected value of this random quantity is known, the maximum entropy prior over is given by the truncated geometric distribution:
[TABLE]
Such a prior is therefore a natural way of providing information about the number of clusters. This prior also has the attractive property of including the uniform distribution as a special case. In fact, the prior parameter allows a smooth interpolation between the extreme prior cases, when is equal to a uniform prior is recovered, and when is equal to [math] this prior is equivalent to a Dirac distribution at . Using a small value of will therefore lead to solutions with fewer clusters. Thus, can be seen as a regularisation parameter that gives access to simpler, coarser solutions. Adapting the approach previously proposed in (Côme \BOthers., \APACyear2021) to the non-parametric Bayesian setting of the current proposal, we will show that using this prior together with a greedy agglomerative algorithm that produces a hierarchy of nested partitions allows the exact computation of the sequence of regularisation parameters that unlock the fusions. A difference with the previous proposal is that here the sequence of values that enable the fusion is computed in an exact manner, without relying on any approximation of the posterior. Going back to the definition of the un-normalised posterior given by Equation 10, combining it with the prior defined in Equation 17 and making its dependence on the chosen value explicit, we have:
[TABLE]
The first two terms of this equation do not depend on and can be aggregated into , then substituting by its value given by Equation 17, we obtain:
[TABLE]
The last term of Equation 19 does not depend on the size of the solutions (i.e. their number of clusters ) and can therefore be ignored when comparing partitions. This shows that it is sufficient to compute the intersection of two lines to get the exact value where one partition outperforms the other. In fact, we can examine the front of all solutions extracted by a greedy agglomerative algorithm in the plane as shown in Figure 2 (a) and extract the sequence of regularisation parameters that unlocks the fusions.
These tipping points can then be used to draw a dendrogram, as shown in Figure 2 (a). An important point to note is that some of the partitions extracted by an agglomerative greedy algorithm may not be dominant over the others anywhere in . This corresponds to situations where combining several merges in one step is better than performing them sequentially. This is quite natural, since is a penalised criterion, so it does not necessarily increase with model complexity. Since such partitions cannot belong to the approximated Pareto front over , it is sufficient to remove them and to record only the point corresponding to the real change of the partition on the front; with such an approach, several merges may appear at the same level in the dendrogram, as can be seen by looking carefully at Figure 2 (b). This approach offers an interesting solution to the problem encountered with classical dendrograms when working with contiguity constraints. Indeed, when using contiguity constraints, it is not guaranteed that the classical criterion (i.e. loss of information) increases (Randriamihamison \BOthers., \APACyear2020), leading to difficulties (reversal) which are avoided here. In conclusion, this approach, although reminiscent of classical dendrograms, has several advantages: it naturally avoids the overplotting problems encountered at the bottom of classical dendrograms, since it focuses only on the interesting part of the merging process; it solves the possible reversal problem. Finally, it provides a natural way to interpret the heights of the merges in the dendrograms as the prior parameter value needed to accept the merge.
3 Results and discussion
3.1 Simulation study
To study the performance of the algorithm, we first compare its performance on simulated data in a controlled setting with state-of-the-art clustering algorithms with and without adjacency constraints. The data were generated on a regular grid that can be assimilated to an image of 30 by 30 pixels. These images were then divided into 9 square regions (i.e. clusters) of equal size (10 pixels by 10 pixels), each with a different mean, given by:
[TABLE]
The data were then simulated with a Gaussian distribution whose mean is determined by the region to which the pixel belongs and whose variance varied between . These simulations allow us to study the performance of the different algorithms in simple situations () as well as in more difficult situations (). The first column of Figure 3 shows a random sample of simulated images for different values of .
We have compared our proposal with different state-of-the-art approaches:
mclust:
A Gaussian mixture model without contiguity constraints, with automatic model selection and the number of cluster varied between 1 and 20. The implementation of the R package mclust, (Scrucca \BOthers., \APACyear2016) was used to derive those results.
mclust (9):
The same implementation and algorithm with the number of clusters fixed to the true number of cluster.
skater
The skater algorithm with a number of clusters equal to the true number of cluster. The implementation used is the one available in the rgeoda package.
redcap:
The redcap algorithm with a number of clusters equal to the true number of cluster. The implementation used is the one available in the rgeoda package.
azpg:
The azp algorithm with greedy optimization and a number of clusters equal to the true number of cluster. The implementation used is the one available in the rgeoda package.
azpsa:
The azp algorithm with simulated-annealing optimization and a number of clusters equal to the true number of cluster. The implementation used is the one available in the rgeoda package.
The solutions obtained with these algorithms were compared with those obtained with our implementation of the algorithm introduced in this paper and available in the gtclust R package with a classical Gaussian model (Gaussian prior for the mean and Gamma prior for the variances, with values , , , ). We extracted for the analysis two solutions from the hierarchy :
gtclust:
the MAP partition found by the algorithm.
gtclust (9):
the partition with 9 clutsers found by the algorithm.
For each value on the grid, 50 simulated records were generated and the results of each algorithm were recorded in terms of Normalised Mutual Information (NMI) between the extracted partition and the simulated one. All algorithms were given the same contiguity graph (constructed with rook adjacency). The details of the results obtained are shown in Figure 3. The average performance of each algorithm is also given in Table 1. The results obtained by gtclust are quite good, outperforming all the other methods until reaches the value of 1.75, after this value the problems are quite hard with a very low signal to noise ratio and all the contiguity constrained methods reach similar performances. But below this value, gtclust, even without fixing the number of clusters to the true value, has better performances than the other methods (which need to know the number of clusters in advance) and is also more stable than the other approaches, as we can see in Figure 4. Furthermore, in this range of settings, the number of clusters automatically found by gtclust is almost always equal to 9 or 8, as we can see in Table 2.
With respect to the other approaches, the performance of the mixture models without contiguity constraints deteriorates rapidly. The solutions found by SKATER and REDCAP are comparable and slightly better than those found by the two AZP variants. These simulations thus highlight the interest of the proposed solution and of contiguity constraints.
3.2 Regionalization of french mobility statistics
The first results on real data that we describe come from the analysis of the mobility statistics of the 2018 French census, provided by INSEE222see https://www.insee.fr/fr/statistiques/5650708. Among the information collected for the census, several pieces of information on the main mode of transport are collected, and we analysed the share of residents using: car/transit/motorised two-wheelers/bicycle/footwalking or any other mode as the main mode of transport in the ” region centre ” of France at the IRIS level. IRIS are geographical units that divide the French municipalities into regions of around 2000 inhabitants. There are 2150 IRIS in the dataset used and they are described by the six variables mentioned above. Geographical units with less than 50 inhabitants were smoothed with a weighted average of their neighbourhood values. We considered the data as compositional and transformed them using an additive log-ratio transformation with the car variable as the reference variable, leaving 5 variables to be analysed. A multivariate Gaussian mixture model with diagonal covariance matrices and normal gamma prior was then applied to these transformed data. The prior parameters were set to , , , ).
The dendrogram of the solution found is shown in Figure 5 and starts with 23 clusters. As shown in Figure 6, the main characteristics of the territory are clearly extracted by the algorithm, the main agglomerations (highlighted with labels) form specific clusters, the two largest (Tours and Orléans) are even represented by two clusters (one for the inner city and another for the suburbs), which is easily explained by the higher share of walking and transit use in these regions. Two large clusters divide the region in a north-south direction, with a lower use of transit (and a higher use of walking) in the southern part of the region. Finally, a final cluster in the north-east of the map presents a heavy use of transit and can be explained by important commuting flows with Paris and made to a large extent by transit.
3.3 Traffic speed clustering
The second application on real data that we have carried out deals with traffic speeds on a road network. Such an application is of interest because finding homogeneous traffic zones in road networks can be very helpful for traffic control. An important open question in traffic theory for the application of the Macroscopic Fundamental Diagram (MFD), a classical tool in traffic control, is how best to segment an urban network into regional subnetworks and how to treat endogenous heterogeneity in the spatial distribution of congestion (Ji \BBA Geroliminis, \APACyear2012; Haghbayan \BOthers., \APACyear2021). This segmentation must fulfil several properties: first, the regions must be of reasonable and similar size, and second, they must be topologically connected and compact. Finally, the traffic conditions in each region must be approximately homogeneous (i.e. congestion must be approximately homogeneous in the region). Contiguity constrained clustering therefore seems a natural way to solve this problem and we have investigated the use of the proposed algorithm in this context.
The data used in this section is one of the benchmark datasets used to study this task (Bellocchi \BBA Geroliminis, \APACyear2019). This dataset concerns the road network of Shenzhen, where the average speed of road segments is estimated every 5 minutes from map-matched traces of about 20,000 taxi GPS points collected over three days in 2011. In order to replicate the preprocessing steps used in previous studies (Haghbayan \BOthers., \APACyear2021), the following preprocessing steps were performed: nodes with no structural role (i.e. that are not intersections) were smoothed (their incoming edges were merged and the associated speed value was taken as the mean of the speeds of the combined edges); average speed values were also computed over 15-minute time intervals. Finally, the graph between road segments was constructed by connecting any two segments connected by an intersection, as shown in Figure 1. A Gaussian mixture model with Norma-Gamma conjugate prior with prior parameters (, , , ) and the velocities in were then used to cluster the data set.
We present in Figure 7 the clustering results obtained over the 7h30-7h45 time slot of the 9th of September. The figure shows the observed velocity in a first map and the extracted clusters in another. These two maps are accompanied by the computed dendrogram and per cluster box plots of the observed velocities. The uniform prior leads to 11 clusters, from the dendrogram we can see that the next interesting solutions contain only 5 clusters. The traffic conditions are homogeneous in each cluster, two clusters correspond to free flow zones with an average speed around 50 km/h, the others correspond to congested zones with speeds around 20 km/h or 35 km/h. When aggregated to the next level of interest in the dendrogram, the remaining clusters are the pink, light green and dark blue clusters, the others being merged into a single ”non-congested” cluster. Again, the segmentation obtained by clustering seems to be quite coherent and meaningful, showing the interest of the proposal for this type of applications.
4 Conclusion and future works
This paper has introduced a Bayesian approach to contiguity constrained clustering that allows semi-automatic tuning of the model complexity (number of clusters), together with an efficient algorithm for finding a MAP partition and building a Bayesian dendrogram from it. A simulation study has shown the interest of this algorithm compared to other solutions for contiguity constrained clustering and its ability to recover the number of clusters. Finally, two experiments on real data have demonstrated the applicability of the proposal on real test cases. From an algorithmic point of view, it would be interesting to compare the hierarchical approach used in this proposal with other solutions based, for example, on local swap moves (ie. in the sense of the Louvain or Leiden algorithms), and to compare the advantages or disadvantages of these two solutions on benchmark datasets. From a modelling point of view, we have mainly illustrated the proposal with Gaussian Mixture Models, but as the proposal is quite general, other models should be investigated (Poisson, Mulitnomial, …). In this spirit, we are currently working on a solution based on the Laplace approximation to deal with more complex observation models where conjugate priors do not exist.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Angriman \B Others . ( \APA Cyear 2020) \APA Cinsertmetastar Angriman 20 {APA Crefauthors} Angriman, E., Predari, M., van der Grinten, A. \BCBL \BBA Meyerhenke, H. \APA Cref Year Month Day 2020. \APA Crefbtitle Approximation of the Diagonal of a Laplacian’s Pseudoinverse for Complex Network Analysis. Approximation of the diagonal of a laplacian’s pseudoinverse for complex network analysis. \APA Caddress Publisher ar Xiv. {APA Cref URL} https://arxiv.org/abs/2006.13679 doi: 10.48 · doi ↗
- 2Anselin ( \APA Cyear 2001) \APA Cinsertmetastar Anselin 01 {APA Crefauthors} Anselin, L. \APA Cref Year Month Day 2001. \BBOQ \APA Crefatitle Spatial econometrics Spatial econometrics. \BBCQ \B In \APA Crefbtitle A companion to theoretical econometrics A companion to theoretical econometrics ( \BVOL 310330, \BPGS 310–330). \Print Back Refs \Current Bib
- 3Assunção \B Others . ( \APA Cyear 2006) \APA Cinsertmetastar Assuncao 2006 {APA Crefauthors} Assunção, R \BPBI M., Neves, M \BPBI C., Câmara, G. \BCBL \BBA Freitas, C \BPBI D \BPBI C. \APA Cref Year Month Day 2006. \BBOQ \APA Crefatitle Efficient regionalization techniques for socio‐economic geographical units using minimum spanning trees Efficient regionalization techniques for socio‐economic geographical units using minimum spanning trees. \BBCQ \APA Cjournal Vol Num Pages International · doi ↗
- 4Barry \BBA Hartigan ( \APA Cyear 1993) \APA Cinsertmetastar Barry 93 {APA Crefauthors} Barry, D. \BCBT \BBA Hartigan, J \BPBI A. \APA Cref Year Month Day 1993. \BBOQ \APA Crefatitle A Bayesian Analysis for Change Point Problems A bayesian analysis for change point problems. \BBCQ \APA Cjournal Vol Num Pages Journal of the American Statistical Association 88421309-319. {APA Cref URL} https://doi.org/10.1080/01621459.1993.10594323 doi: 10.1080/01621459.1993.10594323 \Print Back R · doi ↗
- 5Bates \BBA Maechler ( \APA Cyear 2019) \APA Cinsertmetastar Bates 2019 {APA Crefauthors} Bates, D. \BCBT \BBA Maechler, M. \APA Cref Year Month Day 2019. \BBOQ \APA Crefatitle Matrix: Sparse and Dense Matrix Classes and Methods Matrix: Sparse and dense matrix classes and methods \BBCQ [ \bibcomputersoftwaremanual ]. {APA Cref URL} https://CRAN.R-project.org/package=Matrix \APA Crefnote R package version 1.2-17 \Print Back Refs \Current Bib
- 6Bellocchi \BBA Geroliminis ( \APA Cyear 2019) \APA Cinsertmetastar Bellocchi 2019 {APA Crefauthors} Bellocchi, L. \BCBT \BBA Geroliminis, N. \APA Cref Year Month Day 20193. \APA Crefbtitle Shenzhen whole day Speeds. Shenzhen whole day speeds. {APA Cref URL} https://figshare.com/articles/dataset/Shenzhen_whole_day_Speeds/7212230 doi: 10.6084/m 9.figshare.7212230.v 2 \Print Back Refs \Current Bib · doi ↗
- 7Biernacki \B Others . ( \APA Cyear 2000) \APA Cinsertmetastar Biernacki 2000 {APA Crefauthors} Biernacki, C., Celeux, G. \BCBL \BBA Govaert, G. \APA Cref Year Month Day 2000. \BBOQ \APA Crefatitle Assessing a mixture model for clustering with the integrated completed likelihood Assessing a mixture model for clustering with the integrated completed likelihood. \BBCQ \APA Cjournal Vol Num Pages IEEE Transaction on Pattern Analysis and Machine Intelligence 7719-725. \Print Back Refs \Current Bi
- 8Biernacki \B Others . ( \APA Cyear 2010) \APA Cinsertmetastar Biernacki 2010 {APA Crefauthors} Biernacki, C., Celeux, G. \BCBL \BBA Govaert, G. \APA Cref Year Month Day 2010. \BBOQ \APA Crefatitle Exact and monte carlo calculations of integrated likelihoods for the latent class model Exact and monte carlo calculations of integrated likelihoods for the latent class model. \BBCQ \APA Cjournal Vol Num Pages Journal of Statistical Planning and Inference 1402991-3002. \Print Back Refs \Current Bi
