Bayesian contiguity constrained clustering, spanning trees and   dendrograms

Etienne C\^ome

arXiv:2302.12546·stat.CO·February 27, 2023

Bayesian contiguity constrained clustering, spanning trees and dendrograms

Etienne C\^ome

PDF

Open Access 1 Repo

TL;DR

This paper introduces a Bayesian approach to contiguity-constrained clustering using spanning trees, enabling exact posterior probability computation, MAP partitioning, and Bayesian dendrogram extraction, with real-world applications demonstrated.

Contribution

It formalizes contiguity-constrained clustering in a Bayesian framework and develops algorithms for MAP partitioning and dendrogram extraction based on spanning trees.

Findings

01

Exact posterior probabilities of partitions computed using spanning trees

02

Algorithm for finding MAP partitions in the Bayesian setting

03

Demonstrated effectiveness on real-world datasets

Abstract

Clustering is a well-known and studied problem, one of its variants, called contiguity-constrained clustering, accepts as a second input a graph used to encode prior information about cluster structure by means of contiguity constraints i.e. clusters must form connected subgraphs of this graph. This paper discusses the interest of such a setting and proposes a new way to formalise it in a Bayesian setting, using results on spanning trees to compute exactly a posteriori probabilities of candidate partitions. An algorithmic solution is then investigated to find a maximum a posteriori (MAP) partition and extract a Bayesian dendrogram from it. The interest of this last tool, which is reminiscent of the classical output of a simple hierarchical clustering algorithm, is analysed. Finally, the proposed approach is demonstrated with real applications. A reference implementation of this work is…

Tables2

Table 1. Table 1: Normalized Mutual Information with the simulated partition average over 50 simulations for each algorithm and σ 𝜎 \sigma value. Results where gtclust outperforms significantly (according to a paired Wilcoxon tests) all the other methods are in bold.

$σ$	0.25	0.5	0.75	1	1.25	1.5	1.75	2	2.25
azpg	0.92	0.83	0.79	0.74	0.70	0.64	0.63	0.59	0.55
azpsa	0.92	0.87	0.81	0.75	0.69	0.63	0.61	0.57	0.54
gtclust	1.00	1.00	0.98	0.96	0.91	0.84	0.74	0.62	0.54
gtclust (9)	1.00	1.00	0.98	0.96	0.91	0.85	0.77	0.69	0.63
mclust	0.90	0.40	0.33	0.24	0.20	0.17	0.10	0.04	0.01
mclust (9)	0.91	0.65	0.52	0.43	0.36	0.31	0.26	0.23	0.20
redcap	0.86	0.87	0.88	0.84	0.80	0.75	0.70	0.64	0.62
skater	0.90	0.90	0.90	0.83	0.80	0.74	0.69	0.66	0.65

Table 2. Table 2: Observed frequency of K ^ = 9 ^ 𝐾 9 \hat{K}=9 for gtclust and average value of K ^ ^ 𝐾 \hat{K} over 50 simulations, with respect to σ 𝜎 \sigma .

$σ$	0.25	0.5	0.75	1.00	1.25	1.50	1.75	2.00	2.25
$p (\hat{K} = 9) %$	100	100	96	94	74	58	20	0	0
average $\hat{K}$	9	9	9	8.98	8.80	8.42	7.14	5.82	5.02

Equations73

c = {c_{1}, \dots, c_{K}}

c = {c_{1}, \dots, c_{K}}

p (X ∣ c) = k \prod \int_{θ_{k}} i \in c_{k} \prod p (x_{i} ∣ θ_{k}) p (θ_{k}) d θ_{k},

p (X ∣ c) = k \prod \int_{θ_{k}} i \in c_{k} \prod p (x_{i} ∣ θ_{k}) p (θ_{k}) d θ_{k},

\hat{c} = ar g c max p (c ∣ X, G) = ar g c max p (X ∣ c) p (c ∣ G)

\hat{c} = ar g c max p (c ∣ X, G) = ar g c max p (X ∣ c) p (c ∣ G)

L^{p os t} (c ∣ X, G) = lo g (p (X ∣ c)) + lo g (p (c ∣ G)),

L^{p os t} (c ∣ X, G) = lo g (p (X ∣ c)) + lo g (p (c ∣ G)),

L^{o b s} (X, c_{k}) = lo g (\int_{θ_{k}} i \in c_{k} \prod p (x_{i} ∣ θ_{k}) p (θ_{k}) d θ_{k})

L^{o b s} (X, c_{k}) = lo g (\int_{θ_{k}} i \in c_{k} \prod p (x_{i} ∣ θ_{k}) p (θ_{k}) d θ_{k})

p (x ∣ θ_{k}) = B (x) exp (Φ (θ_{k})^{⊤} T (x) - A (Φ (θ_{k}))),

p (x ∣ θ_{k}) = B (x) exp (Φ (θ_{k})^{⊤} T (x) - A (Φ (θ_{k}))),

p (θ_{k} ∣ β) = H (β) exp (Φ (θ_{k})^{⊤} β - A (Φ (θ_{k}))),

p (θ_{k} ∣ β) = H (β) exp (Φ (θ_{k})^{⊤} β - A (Φ (θ_{k}))),

L^{o b s} (X, c_{k})

L^{o b s} (X, c_{k})

\displaystyle=\log H\bigg{(}\bm{\beta}\bigg{)}-\log H\bigg{(}\bm{\beta}+\sum_{i\in\bm{c}_{k}}T(\mathbf{x}_{i})\bigg{)}+cst.

T_{k} = i \in c_{k} \sum T (x_{i}), \forall k \in {1, \dots, K},

T_{k} = i \in c_{k} \sum T (x_{i}), \forall k \in {1, \dots, K},

p (c ∣ G, K) = \frac{1}{C _{N - 1}^{K - 1} K !} 1_{{c /∣ c ∣ = K; c ≺ G}} .

p (c ∣ G, K) = \frac{1}{C _{N - 1}^{K - 1} K !} 1_{{c /∣ c ∣ = K; c ≺ G}} .

p (t ∣ G) = \frac{1}{∣ T _{G} ∣},

p (t ∣ G) = \frac{1}{∣ T _{G} ∣},

p (c ∣ t, K) = \frac{1}{C _{K - 1}^{N - 1} K !} 1_{{c / c ≺ t}}

p (c ∣ t, K) = \frac{1}{C _{K - 1}^{N - 1} K !} 1_{{c / c ≺ t}}

p (c ∣ G, K)

p (c ∣ G, K)

∣ T_{G} ∣ = \frac{1}{N} λ_{1} ... λ_{N - 1},

∣ T_{G} ∣ = \frac{1}{N} λ_{1} ... λ_{N - 1},

L_{ij} = {- A_{ij} \sum_{v \in V} A_{i v} if i \neq = j if i = j

L_{ij} = {- A_{ij} \sum_{v \in V} A_{i v} if i \neq = j if i = j

lo g (∣ T_{G} ∣) = lo g (det (L_{- j, - j}))

lo g (∣ T_{G} ∣) = lo g (det (L_{- j, - j}))

G [c_{k}] = (c_{k}, {(u, v) \in E / u \in c_{k}, v \in c_{k}})

G [c_{k}] = (c_{k}, {(u, v) \in E / u \in c_{k}, v \in c_{k}})

cutset (G, c_{g}, c_{h}) = {(u, v) \in E / u \in c_{g}, v \in c_{h}},

cutset (G, c_{g}, c_{h}) = {(u, v) \in E / u \in c_{g}, v \in c_{h}},

G ⋄ c = ({1, \dots, ∣ c ∣}, {(g, h, ∣ cutset (G, c_{g}, c_{h}) ∣), \forall g, h \in {1, \dots, ∣ c ∣}^{2}}),

G ⋄ c = ({1, \dots, ∣ c ∣}, {(g, h, ∣ cutset (G, c_{g}, c_{h}) ∣), \forall g, h \in {1, \dots, ∣ c ∣}^{2}}),

lo g (∣ {t / c ≺ t} ∣) = k = 1 \sum K lo g (∣ T_{G [c_{k}]} ∣) intra-cluster spanning trees + inter-clusters spanning trees lo g (∣ T_{G ⋄ c} ∣)

lo g (∣ {t / c ≺ t} ∣) = k = 1 \sum K lo g (∣ T_{G [c_{k}]} ∣) intra-cluster spanning trees + inter-clusters spanning trees lo g (∣ T_{G ⋄ c} ∣)

L^{p os t} (c ∣ X, G, K) = k = 1 \sum K L^{o b s} (X, c_{k}) + lo g (\frac{∣ { t / c ≺ t } ∣}{∣ T _{G} ∣ C _{K - 1}^{N - 1} K !}),

L^{p os t} (c ∣ X, G, K) = k = 1 \sum K L^{o b s} (X, c_{k}) + lo g (\frac{∣ { t / c ≺ t } ∣}{∣ T _{G} ∣ C _{K - 1}^{N - 1} K !}),

c^{(0)} = {{1}, {2}, \dots, {N}} .

c^{(0)} = {{1}, {2}, \dots, {N}} .

G ⋄ c^{(0)} = G, G ⋄ c^{(t + 1)} = G ⋄ c^{(t)} / e^{(t + 1)},

G ⋄ c^{(0)} = G, G ⋄ c^{(t + 1)} = G ⋄ c^{(t)} / e^{(t + 1)},

c^{g \cup h} = {c_{k, \forall k \neq = g, h}, c_{g} \cup c_{h}} .

c^{g \cup h} = {c_{k, \forall k \neq = g, h}, c_{g} \cup c_{h}} .

Δ (g, h) = L^{p os t} (c^{g \cup h} ∣ X, G, K - 1) - L^{p os t} (c ∣ X, G, K)

Δ (g, h) = L^{p os t} (c^{g \cup h} ∣ X, G, K - 1) - L^{p os t} (c ∣ X, G, K)

Δ (g, h) = Δ L^{o b s} (g, h) + Δ^{p r i or} (g, h)

Δ (g, h) = Δ L^{o b s} (g, h) + Δ^{p r i or} (g, h)

Δ L^{o b s} (g, h) = L^{o b s} (X, c_{g} \cup c_{h}) - L^{o b s} (X, c_{g}) - L^{o b s} (X, c_{h}) .

Δ L^{o b s} (g, h) = L^{o b s} (X, c_{g} \cup c_{h}) - L^{o b s} (X, c_{g}) - L^{o b s} (X, c_{h}) .

Δ^{p r i or} (g, h) = lo g (\frac{∣ T _{G ⋄ c / (g, h)} ∣∣ T _{G [c_{g} \cup c_{h}]} ∣}{∣ T _{G ⋄ c} ∣∣ T _{G [c_{h}]} ∣∣ T _{G [c_{g}]} ∣}) + lo g (\frac{K ( N - K + 2 )}{K - 1}) .

Δ^{p r i or} (g, h) = lo g (\frac{∣ T _{G ⋄ c / (g, h)} ∣∣ T _{G [c_{g} \cup c_{h}]} ∣}{∣ T _{G ⋄ c} ∣∣ T _{G [c_{h}]} ∣∣ T _{G [c_{g}]} ∣}) + lo g (\frac{K ( N - K + 2 )}{K - 1}) .

\frac{∣ T _{G ⋄ c / (g, h)} ∣}{∣ T _{G ⋄ c} ∣} \geq \frac{1}{∣ cutset ( G , c _{g} , c _{h} ) ∣}

\frac{∣ T _{G ⋄ c / (g, h)} ∣}{∣ T _{G ⋄ c} ∣} \geq \frac{1}{∣ cutset ( G , c _{g} , c _{h} ) ∣}

∣ T_{G} ∣ = m ∣ T_{G / (g, h)} ∣ + ∣ T_{G - (g, h)} ∣,

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

comeetie/gtclust
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Data Management and Algorithms

Full text

A tree is all you need

Etienne Côme*∗*

*∗*Cosys/Grettia, Gustave-Eiffel University, Noisy Champs, France

[email protected]

Abstract

Clustering is a well-known and studied problem, one of its variants, called contiguity-constrained clustering, accepts as a second input a graph used to encode prior information about cluster structure by means of contiguity constraints i.e. clusters must form connected subgraphs of this graph. This paper discusses the interest of such a setting and proposes a new way to formalise it in a Bayesian setting, using results on spanning trees to compute exactly a posteriori probabilities of candidate partitions. An algorithmic solution is then investigated to find a maximum a posteriori (MAP) partition and extract a Bayesian dendrogram from it. The interest of this last tool, which is reminiscent of the classical output of a simple hierarchical clustering algorithm, is analysed. Finally, the proposed approach is demonstrated with real applications. A reference implementation of this work is available in the R package gtclust that accompanies the paper111available at http://github.com/comeetie/gtclust.

1 Context, motivations and related works

Contiguity-constrained clustering combines two pieces of information: on the one hand, a classical data set $\mathbf{X}$ corresponding to a sample of observations $\{\mathbf{x}_{1},\ldots,\mathbf{x}_{N}\}$ living for example in $\mathbb{R}^{d}$ or $\mathbb{N}^{d}$ ; and on the other hand, an undirected graph $G$ between the same $N$ observations, with edges set $E$ . The informal objective being to find a partition $\bm{c}$ of the data points that explains well $\mathbf{X}$ and where each cluster $\bm{c}_{k}$ forms a connected sub-graph of $G$ . A subgraph $G[\bm{c}_{k}]$ is said to be connected if, for any pair of nodes $u$ and $v$ in $\bm{c}_{k}$ , there is at least one path connecting them without leaving $\bm{c}_{k}$ . We will note $\bm{c}\prec G$ a partition that fulfils this property. Such a problem allows the use of prior information about the clusters to ensure, for example, that they correspond to spatially coherent regions or time segments. These constraints are quite natural in several applications and greatly reduce the size of the space of possible solutions. Such constraints are therefore interesting from both a computational and an application point of view. Contiguity-constrained clustering has a long history, particularly in geography, where this approach is known as regionalisation and has been studied since the seminal papers of Masser \BBA Brown (\APACyear1975) and Openshaw (\APACyear1977). The interest for such a problem in spatial statistics seems natural, since contiguity graphs and distance graphs play an important role in this area (Anselin, \APACyear2001). For example, a contiguity graph can be inferred from spatial polygon data by looking at common boundaries, as shown in Figure 1(a), or from geographic networks (such as road networks) by looking at the intersections between segments, as shown in Figure 1(b). In the former case, contiguity constraints ensure that the clusters found correspond to coherent spatial regions, and in the latter case to connected roads. Since these pioneering contributions, several research papers in the field of quantitative geography have revisited this issue. The Spatial ‘K‘luster Analysis by Tree Edge Removal (SKATER) proposed by Assunção \BOthers. (\APACyear2006) is a graph-based method that uses a minimal spanning tree to reduce the search space. The regions are then defined by removing edges from the minimal spanning tree. The removed edges are chosen to minimise a dissimilarity measure. Inspired by SKATER, Guo (\APACyear2008) proposed REDCAP (Regionalization with Dynamically Constrained Agglomerative clustering and Partitioning), which discusses several variants of agglomerative approaches to solving contiguity constrained clustering problems by varying the affinity metrics and considering full order and partial order constraints.

Statisticians have also been studying contiguity constrained clustering for some time, and relevant references include the work of Lebart (\APACyear1978), Murtagh (\APACyear1985), Grimm (\APACyear1987) and Gordon (\APACyear1996), which study hierarchical agglomerative approaches to solving such a problem. As these references show, the interest in contiguity-constrained clustering and the fact that such constraints can be easily combined with an agglomerative approach has long been known in the statistical community. In addition to applications in geography and statistics, contiguity constraints have also been studied in the context of sequence analysis. Sequences, of course, can be equipped with simple line-graph and therefore benefit from contiguity constraints to solve segmentation tasks. This has led to applications of such an approach in genetics to study genome sequences (Christophe \BOthers., \APACyear2019) or in time series analysis (Barry \BBA Hartigan, \APACyear1993; Schwaller \BBA Robin, \APACyear2017). Finally, in the network analysis community, the Louvain (Blondel \BOthers., \APACyear2008) and Leiden algorithms (Traag \BOthers., \APACyear2019), two of the most well-known graph clustering approaches, leverage, even if not explicitly stated, contiguity constraints to speed up computations and extract communities.

In the Bayesian context, a first line of related work deals with the use of Spatial Product Partition Models (Hegarty \BBA Barry, \APACyear2008; Page \BBA Quintana, \APACyear2016). PPMs were first introduced by Hartigan (\APACyear1990) and assume that the partition prior $\bm{c}$ is a product of subjective non-negative functions $\kappa(\bm{c}_{k})$ called prior cohesions. The cohesion functions measure how likely it is that elements in a given cluster are clustered a priori. This generic setting was then used to define priors that enforce the spatial coherence of the clusters. To do so, Hegarty \BBA Barry (\APACyear2008) defines these cohesion functions by counting the number of zones that are neighbours of an element of clusters $\bm{c}_{k}$ but do not belong to $\bm{c}_{k}$ , while Page \BBA Quintana (\APACyear2016) considers the locations and spatial distances between the elements of the clusters. Both approaches favour spatially coherent clustering, but do not strictly enforce the contiguity constraints, as we will do in this proposal. This line of research can also be linked to approaches that try to incorporate spatial information into classical clustering approaches, see for example (Chavent \BOthers., \APACyear2018), where a Ward-like hierarchical clustering algorithm is extended to minimise, at each stage, a convex combination of a homogeneity criterion computed in feature space and a homogeneity criterion incorporating the spatial dimension of the problem.

Finally, the approach of Teixeira \BOthers. (\APACyear2019) is the closest to our proposal since it also relies on spanning trees to define a prior over partitions that strictly enforce contiguity constraints. However, it differs from the present work since it is based on Markov Chain Monte Carlo (MCMC) computations to solve the estimation problem and does not use exactly the same prior/criterion. The interest of our proposal with respect to this approach lies in it’s cheaper computational cost and the ease of interpretation of the results (dendrogram and MAP partition).

As we have just seen, clustering with adjacency constraints is an important problem in unsupervised learning that has been studied for a long time and has important applications in various fields. This paper proposes to revisit this problem and aims at proposing a hierarchical Bayesian approach to solve it. This proposal is based on the definition of a prior on the partition space that enforces the graph-induced contiguity constraints and favours partitions that are easy to disentangle. To this end, the main contributions of the paper are as follows:

•

the definition of a partition prior to ensure the contiguity constraints induced by a graph

•

the derivation of an analytical expression to compute exactly this prior

•

a hierarchical agglomerative algorithm to find an approximate maximum a posteriori partition

•

a simple approach to building a dendrogram from the computed cluster hierarchy

Finally, the open source R package (R Core Team, \APACyear2019) gtclust provides a reference implementation of the algorithm presented in this paper. The implementation is extensible and new models can be integrated. The main computationally intensive methods have been developed in Cpp thanks to the Rcpp package (Eddelbuettel \BBA Balamuta, \APACyear2017), exploiting the computational efficiency of sparse matrices thanks to the Matrix packages (Eddelbuettel \BBA Sanderson, \APACyear2014; Bates \BBA Maechler, \APACyear2019) and the cholmod library. The c library (Chen \BOthers., \APACyear2008) for sparse Cholesky matrix factorisation. Finally, the sf package (Pebesma, \APACyear2018) was used to seamlessly integrate the package with the standards used to handle spatial datasets in R.

The remainder of this paper is structured as follows: in Section 2 we present details of our methodology, in Subsection 2.1 we present the mathematical framework we will use and how it can be adapted to different observational models, and in Subsection 2.2 we discuss the prior we propose and the elements needed to compute it. In Subsection 2.3 we describe the algorithm we propose to search for an optimal partition, and in Subsection 2.4 we show how to build a dendrogram from it. Finally, in Section 3 we describe several applications of the method to real and simulated data sets. Section 4 concludes the paper.

2 Method

Since we will be relying on a non-parametric Bayesian approach, the target we are looking for is an ordered partition of $N$ points with an unknown number of clusters $K$ in $\{1,\ldots,N\}$ . More formally, let $\bm{c}\in\mathcal{P}_{N}$ be an ordered partition of $\{1,\ldots,N\}$ , that is:

[TABLE]

We assume that all data points belonging to the same element of the partition are $iid$ , i.e. conditional independence, and we marginalise the cluster parameters so that the probability of the observed data knowing the partition is given by:

[TABLE]

where $p(.\mid\bm{\theta}_{k})$ is a parametric distribution such as a Gaussian that generates the samples coming from group $k$ and $\bm{\theta}_{k}$ its parameter vector. Assuming a clustering objective, we will try to find the MAP partition:

[TABLE]

Such a criterion, already successfully used in clustering application (Côme \BOthers., \APACyear2021), corresponds to an exact version of the Integrated Classification Likelihood (Biernacki \BOthers., \APACyear2000, \APACyear2010) and is thus a penalised criterion. It will therefore allow to automatically tune the number of clusters. To simplify the derivation, we will mainly work with the un-normalised log posterior:

[TABLE]

which has the same maxima. Before discussing the main point of this paper, which concerns the definition of an adequate prior $p(\bm{c}\mid G)$ over the partition to enforce contiguity constraints, we briefly introduce some elements on the computations of the first part of the criterion i.e. the probability of the observed data knowing the partition.

2.1 Exponential family and observation models

Computing $\log p(\bm{X}\mid\bm{c})$ involves computations of quantities $\mathscr{L}^{obs}(\mathbf{X},\bm{c}_{k})$ :

[TABLE]

In this section, we show that such quantities depend on the data $\bm{X}$ only through the sum of sufficient statistics $T(\cdot)$ in exponential family distributions, and that the integral can be computed analytically. To this end, we assume an exponential family observation model for the sample in cluster $k$ :

[TABLE]

where $\Phi(\bm{\theta})$ is the natural parameter, $A(\cdot)$ is the log partition function and $B(\cdot)$ is a positive normalisation function to ensure that this is a proper density. Then we take the same conjugate prior (Diaconis \BBA Ylvisaker, \APACyear1979) on $\bm{\theta}_{k}$ for each cluster $k$ :

[TABLE]

where $H$ is another normalisation function. Finally, the integrated observational log-likelihood is obtained via the difference of the normalisation function before and after the update by sufficient statistics:

[TABLE]

Thus, any greedy merge heuristics relying on computations of $\mathscr{L}^{obs}(\mathbf{X},\bm{c}_{k})$ only has to compute:

[TABLE]

these will easily fit into a hierarchical approach where there are $N$ evaluations of the $T_{i}$ ’s at the beginning and then merging two clusters $g\cup h$ only amounts to summing their respective sufficient statistics $T_{g\cup h}=T_{g}+T_{h}$ . This property allows the use of a simple stored-data approach Murtagh \BBA Contreras (\APACyear2012), where the stored data are the cluster sufficient statistics, to design an agglomerative hierarchical clustering algorithm, and this article will follow this line of work. Note that such an approach could be costly without the contiguity constraints, since all possible merges have to be considered, but that it becomes very competitive in the constrained case, since the constraints remove a lot of candidates, especially when the contiguity graph has a small average degree. Finally, note that a similar approach can be used for swap moves and allows efficient computations for several classical observation models i.e. Gaussian, Poisson, Multinomial. We refer the reader to the supplementary material of Côme \BOthers. (\APACyear2021) for derivations of specific models. Let us now introduce the main contribution of this work, which concerns the definition of the contiguity-constrained partition prior.

2.2 Contiguity-constrained partition prior

In the unconstrained case, the classical solutions to define a prior over the space of partitions $p(\bm{c})$ are mainly based on the Dirichlet process (Rasmussen \BBA Ghahramani, \APACyear2001) or on uniform priors (Peixoto, \APACyear2019). We will follow a path closer to the latter approach here, and we will start by recognising that a contiguity-constrained prior can be easily defined if the graph $G$ is a tree. In fact, trees have the interesting property of being minimally connected, so removing any edges will create two connected components, and removing $K-1$ edges will create $K$ connected subgraphs. This property shows that there is a one-to-one relationship between combinations of edges and compatible partitions. Thus, if $G$ is a tree, there are $C_{N-1}^{K-1}$ ways of partitioning the vertices into $K$ groups that satisfy the constraints induced by $G$ , since removing any $(K-1)$ edges in the tree creates $K$ connected subgraphs and the tree has $N-1$ edges. Furthermore, there are $K!$ ways of ordering the $K$ clusters, and so we can easily define a uniform distribution for $\bm{c}\mid K,G$ when $G$ is a tree:

[TABLE]

For the moment, the prior distribution on the number of clusters can also be set to a uniform prior $p(K)=\frac{1}{N}$ to get $p(\bm{c}\mid G)$ , but this assumption will be discussed later in Subsection 2.4. The interesting properties of trees that we have just highlighted give us a uniform prior when $G$ is a tree, but do not solve the general case. To do so, following Teixeira \BOthers. (\APACyear2019), we will introduce in the setting the set of all spanning tree of $G$ , denoted by $\mathcal{T}_{G}$ . A spanning tree $\bm{t}\in\mathcal{T}_{G}$ of a graph $G$ is a connected subgraph with no cycles containing all nodes of $G$ . Since a spanning tree is a tree, it inherits the tree properties: any two nodes of $G$ are connected by a unique path in $\bm{t}$ , the number of edges in $\bm{t}$ is $N-1$ , and removing any $(K-1)$ edges divides the vertices of $G$ into $K$ clusters, each cluster being a connected subgraph of $G$ . The spanning trees of $G$ can thus be used to define a two-stage prior, as follows:

Sample a spanning tree $\bm{t}$ of $G$ uniformly:

[TABLE]

with $\mathcal{T}_{G}$ the set of all spanning tree of $G$ . 2. 2.

Then remove $K-1$ edges, and draw an ordering for the corresponding clusters:

[TABLE]

This generative scheme for partitions will produce partitions that are compatible with $G$ , since the clusters form connected subgraphs of $\bm{t}$ , and the edges of $\bm{t}$ are by definition contained in the edges of $G$ . To obtain a prior on partition from this process, and since the specific spanning tree used to generate the clustering is not of particular interest, it is natural to marginalise this random variable, which results in the following expression:

[TABLE]

At this point, the introduction of spanning trees may seem artificial, but such a prior has interesting properties. This prior is quite informative, since it depends on the constraints defined by $G$ , but it goes even further, since it does not correspond to a uniform prior on feasible partitions. In fact, it is known that the probability that an edge $e$ is used in a uniformly sampled spanning tree is a good way of measuring how crucial that edge is for the graph to be connected. This measure, known as spanning tree centrality, (Angriman \BOthers., \APACyear2020; Hayashi \BOthers., \APACyear2016), gives us an interesting perspective on the proposed prior. This prior can be seen as a way of extending this measure to partitions of the graph vertices, measuring how easily the different elements of a given partition can be disentangled in the graph. This formalises in an interesting way the kind of prior that practitioners want to express with the contiguity graph.

Let us now discuss how to analytically compute the quantities involved in this prior. To use such a prior, we need to compute $|\mathcal{T}_{G}|$ the number of spanning trees of $G$ and $|\{\bm{t}/\bm{c}\prec\bm{t}\}|$ the number of spanning trees of $G$ that are compatible with a given partition $\bm{c}$ . These quantities, which at first sight may seem quite difficult to compute, are in fact perfectly amenable to analytical expressions, thanks to the Kirchhoff’s theorem; which relates the number of spanning trees of a graph to the spectrum of its Laplacian Kirchhoff (\APACyear1847); Cayley (\APACyear1889).

Theorem 1 (Kirchhoff’s theorem).

The number of spanning tree of a graph or multi-graph $G$ , without self-loops is given by:

[TABLE]

with $\lambda_{i}$ be the non-zero eigenvalues of the Laplacian matrix $L$ of $G$ , given by:

[TABLE]

where $A$ is the adjacency matrix of $G$ , with $A_{ij}$ equal to the multiplicity of edge $(i,j)$ for multi-graphs.

Using this theorem, it is therefore possible to compute $\log(|\mathcal{T}_{G}|)$ . There are several solutions to this, but from a computational point of view it is interesting to note that a consequence of this theorem is that the number of spanning trees is given by any cofactor of $L$ , and is thus equal to the determinant of $L$ with one of its rows and columns $(j)$ removed. The Laplacian matrix with row and column $(j)$ removed is denoted by $L_{\text{-}j,\text{-}j}$ . To avoid numerical problems, the logarithm of the determinant is used:

[TABLE]

Finally, if the graph is sparse, the computation of the determinant can be solved in reasonable time using a sparse Cholesky decomposition for medium-sized problems. Such a factorisation will give sparse matrices $U$ and $D$ such that $L_{\text{-}j,\text{-}j}=U.D.U^{t}$ and where $U$ is a lower triangular matrix (with ones on the diagonal) and $D$ is a diagonal matrix and hence $\log(|\mathcal{T}_{G}|)=\sum_{i=1}^{N-1}\log(D_{ii})$ . This approach solves the problem of computing the number of spanning trees of a given graph, but does not answer the question of the number of spanning trees compatible with a given partition. However, we will see that it can be used as a building block for this. To this end, we can decompose the problem by studying the inter-cluster and intra-cluster connectivity of $G$ . The intra-cluster connectivity is easy to analyse by considering the subgraphs given by each element of the partition:

[TABLE]

To study the connectivity between clusters, the main objects of interest are the cutsets between the different elements of the partition, i.e. the set of edges connecting the different clusters:

[TABLE]

Using these cutsets, we can define an aggregate multi-graph that we will call $G\diamond\bm{c}$ , which describes the connection patterns between the clusters:

[TABLE]

where $(g,h,m)$ represents $m$ edges between $g$ and $h$ . This multi-graph thus has as many vertices as there are elements in $\bm{c}$ , and the multiplicity of an edge between two vertices $g$ and $h$ is given by the size of the corresponding cutset in $G$ . Using this tool, the following proposition can be used to find the number of spanning trees that may have led to a given partition:

Proposition 1 (Compatible spanning trees).

The number of spanning trees in a graph $G$ compatible with a given partition $\bm{c}$ of its vertices is given by:

[TABLE]

Proof.

To form a compatible spanning tree we need to take ( $K-1$ ) edges in the union of the $\bm{cutset}(G,\bm{c}_{g},\bm{c}_{h})$ of each cluster pairs. To get a spanning tree $\bm{t}$ of $G$ these edges must define a spanning tree of $G\diamond\bm{c}$ , i.e. belongs to $\mathcal{T}_{G\diamond\bm{c}}$ . However, any combination of intra-cluster spanning trees and inter-cluster spanning tree when combined will be compatible with the partition by definition and they will also form a spanning tree of $G$ . ∎

This proposition can then be combined with Kirchhoff’s theorem to compute exactly the desired quantity needed to compute the prior defined in Equation 8, since the latter formula only involves the number of spanning trees of given graphs, and Kirchhoff’s theorem generalises to multi-graphs. So that,

[TABLE]

can be computed exactly.

2.3 A greedy agglomerative algorithm

This section deals with the algorithmic details that must be taken into account in order to design an efficient agglomerative algorithm from the previous proposal. In particular, we will see that great care must be taken to avoid unnecessary computations when computing the prior probability of a partition along the merge tree.

In its simplest form, greedy agglomerative clustering simply reduces to iteratively selecting the best merge action to perform until only one cluster remains. Such a process builds a complete cluster hierarchy, since at each step $t$ of the algorithm with the current partition $\bm{c}^{(t)}$ , the number of clusters is reduced by one, and the process starts with each individual data point in its own cluster:

[TABLE]

Ideally, we are interested in all partitions $\bm{c}^{(0)},\ldots,\bm{c}^{(N)}$ that maximise $\mathscr{L}^{post}$ , given by the Equation 10 for $K$ ranging from $N$ to $1$ . Such an output can then be used to find the partition $\bm{c}^{*}$ with maximum a posteriori probability, and to construct a dendrogram as shown in the next section. A greedy agglomerative algorithm with proper merge ranking can be seen as a way to approximate this set of solutions by selecting the best partition $\bm{c}^{(t+1)}$ from the partitions that can be constructed by merging two clusters of $\bm{c}^{(t)}$ . Contiguity constraints can, of course, speed up this process by reducing the space of possible merges at each step. More formally, the possible merges compatible with the constraints and a current partition $\bm{c}^{(t)}$ correspond to the support of the multiset of edges of $G\diamond\bm{c}^{(t)}$ that we introduced earlier. At the first iteration, $G\diamond\bm{c}^{(0)}=G$ and therefore the possible merges are defined by the initial graph, so every $(g,h)\in E$ must be inspected and the best one $e^{*}$ taken. When a merge occurs, the contiguity graph between clusters $G\diamond\bm{c}^{(t)}$ must be updated. This can be done by contracting the edge $e^{*}=(g,h)$ : all edges between $(g,h)$ are removed, as well as their two incident vertices $g$ and $h$ , which are merged into a new vertex $u$ , where the edges incident to $u$ each correspond to an edge incident to either $g$ or $h$ , keeping possible duplicate edges leading to a multi-graph. This ensures that the support of the muti-graph edges always corresponds to all potential merges that fulfil the constraints. We will denote the multi-graph resulting from such an edge contraction by $G/(g,h)$ , which allows us to write down the following recurrence relation:

[TABLE]

where $e^{(t+1)}$ is the edge (and thus the merge) selected at step $t+1$ in $G\diamond\bm{c}^{(t)}$ . The use of edge contraction thus allows the cluster contiguity graphs $G\diamond\bm{c}$ to be efficiently updated.

To design a greedy agglomerative algorithm for the contiguity constrained problem, the main technical difficulty is to define how to rank the possible merges in an efficient way. Formally, a merge corresponds to modifying the current partition $\bm{c}$ , leaving all clusters as they were, except two clusters $g$ and $h$ , which are removed and replaced by a single cluster corresponding to their union. We will refer to this merged partition as $\bm{c}^{g\cup h}$ :

[TABLE]

To rank the merges compatible with the constraints, we are naturally interested in the effect of this change on the log posterior:

[TABLE]

This quantity can be decomposed into two parts, one coming from the $\mathscr{L}^{obs}$ terms and one from the priors:

[TABLE]

Removing the terms appearing on both the initial partition and the one with the two clusters merged, it is easy to see that:

[TABLE]

This quantity can be easily evaluated using Equation 2.1 from the cluster sufficient statistics $T_{g},T_{h}$ and $T_{g\cup h}=T_{g}+T_{h}$ . Furthermore, it depends only on $\bm{c}_{g}$ and $\bm{c}_{h}$ , and is therefore independent from the other elements of $\bm{c}^{(t)}$ . This property is important because we didn’t want to update all the possible merge scores $\Delta(g,h)$ at each step, but only the new ones. Following a similar approach for the second term in the right hand side of Equation 13 and simplifying we obtain:

[TABLE]

This second term is more problematic from a computational point of view, because it depends on K and on the whole partition. The terms $|\mathcal{T}_{G\diamond\bm{c}/(g,h)}|$ and $|\mathcal{T}_{G\diamond\bm{c}}|$ depend on the whole partition and not only on the clusters $g$ and $h$ , and as already explained, this must be avoided. The dependence on $K$ is actually not a problem: at each iteration of the algorithm, the merge to be compared will result in the same number of clusters $K-1$ , and therefore the second term on the right and side of Equation 14 can be safely ignored to rank the merges. Regarding the dependence on the entire partition, this can also be relaxed by considering a lower bound on $\Delta^{prior}$ . In fact, we can use the deletion-contraction recurrence for multigraphs given by Lemma 1 to get a lower bound on the ratio of the two problematic terms. This bound depends only on the size of the cutset between clusters $g$ and $h$ and is given in Proposition 2.

Proposition 2 (Lower bound on merge score).

Given a graph $G$ , for all possible partition of its vertices $\bm{c}$ , with at least two elements $\bm{c}_{g}$ and $\bm{c}_{h}$ , the following inequality holds:

[TABLE]

Proof.

This is a direct consequence of applying Lemma 1 with $G\diamond\bm{c}$ and vertices $g$ and $h$ , which gives $|\mathcal{T}_{G\diamond\bm{c}}|=|\bm{cutset}(G,\bm{c}_{g},\bm{c}_{h})||\mathcal{T}_{G\diamond\bm{c}/(g,h)}|+|\mathcal{T}_{G\diamond\bm{c}-(g,h)}|$ , and since $|\mathcal{T}_{G\diamond\bm{c}-(g,h)}|\geq 0$ the inequality holds. ∎

Lemma 1 (Deletion-contraction recurrence for multi-graphs).

Given a multigraph $G$ , without self-loops and more than 3 vertices, for any pair of vertices $g$ and $h$ , connected by at least one edge, we have:

[TABLE]

with $m$ the multiplicity of edge $(g,h)$ and $G-(g,h)$ the multigraph with all the $m$ edges between $g$ and $h$ removed.

Proof.

This Lemma is an extension of a classical result on spanning tree deletion-contraction recurrence [ref] to multi-graph. $|\mathcal{T}_{G}|$ is the sum of the number of spanning trees that use one of the $(g,h)$ edges and the number of spanning trees of $G$ that do not pass through any of the $(g,h)$ edges. The number of spanning trees of $G$ that do not pass through any of the $(g,h)$ edges is given by $|\mathcal{T}_{G-(g,h)}|$ since every spanning tree of $G-(g,h)$ is a spanning tree of $G$ that do not contains any $(g,h)$ edge and conversely any spanning tree of $G$ that do not contains any $(g,h)$ edge is a spanning tree of $G-(g,h)$ . If $\bm{t}$ is a spanning tree of G containing one of the $(g,h)$ edges, the contraction of $(g,h)$ in both $\bm{t}$ and $G$ results in a spanning tree $\bm{t}/(g,h)$ of $G/(g,h)$ . But, if $\bm{t}^{*}$ is a spanning tree of $G/(g,h)$ , there exists $m$ spanning trees $\bm{t}$ of $G$ , one for each of the $m$ edges between $g$ and $h$ , such that $\bm{t}/(g,h)=\bm{t}^{*}$ . Thus, the number of spanning trees of $G$ passing trough one the $(g,h)$ edges is $m.|\mathcal{T}_{G/(g,h)}|$ . Hence $|T_{G/(g,h)}|=m.|\mathcal{T}_{G/(g,h)}|+|\mathcal{T}_{G-(g,h)}|.$ ∎

Putting the previous elements together, we get the following lower bound for the merge scores, which depends only on the clusters $g$ and $h$ :

[TABLE]

This quantity can be computed quite efficiently. During the merging process we keep track of the cluster contiguity graph, updating it with edge contraction, so the size of the cutset between two clusters is readily available. The other terms can also be computed efficiently, noting that we can compute $|\mathcal{T}_{G[\bm{c}_{g}\cup\bm{c}_{h}]}|$ with a small rank update of the Cholesky factorisation used to compute $|\mathcal{T}_{G[\bm{c}_{g}]}|$ and $|\mathcal{T}_{G[\bm{c}_{g}]}|$ since the Laplacian matrix of $G[\bm{c}_{g}\cup\bm{c}_{h}]$ can be written as the sum of a block diagonal matrix with the Laplacian matrices of $G[\bm{c}_{g}]$ and $G[\bm{c}_{h}]$ on the diagonal plus an update for each edge in the $(g,h)$ cutset:

[TABLE]

with $L(G)$ the Laplacian matrix of graph $G$ and $l_{(e)}$ , the column vector for edge $e=(u,v)$ with all elements equal to zero, except element $u$ equal to 1 and element $v$ equal to $-1$ . This allows the use of efficient numerical routines for updating a sparse Cholesky factorization as the ones provided by the cholmod library (Davis \BBA Hager, \APACyear1999, \APACyear2001, \APACyear2005). Finally, once the merging process has stopped, a similar approach can be used, but backwards, to compute the terms $|\mathcal{T}_{G\diamond\bm{c}^{(K)}}|$ from a previous decomposition of the Laplacian of $\mathcal{T}_{G\diamond\bm{c}^{(K-1)}}$ . This allows us to compute $\mathscr{L}^{post}(\bm{X},\bm{c}^{(K)})$ for all the partitions extracted by the hierarchical algorithm, and thus choose the best one. For an overview of the whole process, see Algorithm 1.

2.4 Dendrogram and prior for the number of clusters

The previous sections have shown how to define a contiguity constrained prior when the number of desired clusters is known, but one of the main interests of the proposed prior is to help choose an appropriate number of clusters. For this purpose, one can use a uniform prior for $K$ over the value $\{1,\ldots,N\}$ and search for a MAP partition for all possible $K$ . However, in a clustering application, practitioners may have some prior knowledge about the desired number of clusters and would prefer to gain some insight into the evolution of the model description capabilities with respect to this parameter. A classic tool often used in such a setting is the so-called dendrogram, which represents the merge tree of a hierarchical agglomerative clustering algorithm, with branch heights proportional to the difference in criterion values induced by the corresponding merge. A dendrogram is therefore a binary tree in which each node corresponds to a cluster. The edges connect the two clusters (nodes) merged in a given step of the algorithm. The height of the leaves is generally assumed to be 0. The leaves are ordered by a permutation of the initial clusters that ensures that successive merges are neighbours in the dendrogram. The height of the node corresponding to the cluster created at merge $t$ , $h_{t}$ , is often the value of the linkage criterion.

We propose to revisit this classical tool in the Bayesian context on which this paper is based. To do so, we start by replacing the uniform distribution on the number of clusters $K$ by an informative prior. Considering that the expected value of this random quantity is known, the maximum entropy prior over $\{1,\ldots,N\}$ is given by the truncated geometric distribution:

[TABLE]

Such a prior is therefore a natural way of providing information about the number of clusters. This prior also has the attractive property of including the uniform distribution as a special case. In fact, the prior parameter $\alpha$ allows a smooth interpolation between the extreme prior cases, when $\alpha$ is equal to $1$ a uniform prior is recovered, and when $\alpha$ is equal to [math] this prior is equivalent to a Dirac distribution at $K=1$ . Using a small value of $\alpha$ will therefore lead to solutions with fewer clusters. Thus, $\alpha$ can be seen as a regularisation parameter that gives access to simpler, coarser solutions. Adapting the approach previously proposed in (Côme \BOthers., \APACyear2021) to the non-parametric Bayesian setting of the current proposal, we will show that using this prior together with a greedy agglomerative algorithm that produces a hierarchy of nested partitions allows the exact computation of the sequence of regularisation parameters that unlock the fusions. A difference with the previous proposal is that here the sequence of $\alpha$ values that enable the fusion is computed in an exact manner, without relying on any approximation of the posterior. Going back to the definition of the un-normalised posterior given by Equation 10, combining it with the prior defined in Equation 17 and making its dependence on the chosen $\alpha$ value explicit, we have:

[TABLE]

The first two terms of this equation do not depend on $\alpha$ and can be aggregated into $I(\bm{c},\bm{X},G)$ , then substituting $p(K|\alpha)$ by its value given by Equation 17, we obtain:

[TABLE]

The last term of Equation 19 does not depend on the size of the solutions (i.e. their number of clusters $K$ ) and can therefore be ignored when comparing partitions. This shows that it is sufficient to compute the intersection of two lines to get the exact $\alpha$ value where one partition outperforms the other. In fact, we can examine the front of all solutions extracted by a greedy agglomerative algorithm in the $\left(-\log(\alpha),\mathscr{L}^{post}(\bm{c}|\bm{X},G,\alpha)\right)$ plane as shown in Figure 2 (a) and extract the sequence of regularisation parameters that unlocks the fusions.

These tipping points $-log(\alpha^{(t)})$ can then be used to draw a dendrogram, as shown in Figure 2 (a). An important point to note is that some of the partitions extracted by an agglomerative greedy algorithm may not be dominant over the others anywhere in $\alpha\in]0,1]$ . This corresponds to situations where combining several merges in one step is better than performing them sequentially. This is quite natural, since $\mathscr{L}^{post}$ is a penalised criterion, so it does not necessarily increase with model complexity. Since such partitions cannot belong to the approximated Pareto front over $\alpha\in[0,1]$ , it is sufficient to remove them and to record only the point corresponding to the real change of the partition on the front; with such an approach, several merges may appear at the same level in the dendrogram, as can be seen by looking carefully at Figure 2 (b). This approach offers an interesting solution to the problem encountered with classical dendrograms when working with contiguity constraints. Indeed, when using contiguity constraints, it is not guaranteed that the classical criterion (i.e. loss of information) increases (Randriamihamison \BOthers., \APACyear2020), leading to difficulties (reversal) which are avoided here. In conclusion, this approach, although reminiscent of classical dendrograms, has several advantages: it naturally avoids the overplotting problems encountered at the bottom of classical dendrograms, since it focuses only on the interesting part of the merging process; it solves the possible reversal problem. Finally, it provides a natural way to interpret the heights of the merges in the dendrograms as the prior parameter value needed to accept the merge.

3 Results and discussion

3.1 Simulation study

To study the performance of the algorithm, we first compare its performance on simulated data in a controlled setting with state-of-the-art clustering algorithms with and without adjacency constraints. The data were generated on a regular grid that can be assimilated to an image of 30 by 30 pixels. These images were then divided into 9 square regions (i.e. clusters) of equal size (10 pixels by 10 pixels), each with a different mean, given by:

[TABLE]

The data were then simulated with a Gaussian distribution whose mean is determined by the region to which the pixel belongs and whose variance $\sigma$ varied between $\{0.25,0.5,0.75,1,1.25,1.5,1.75,2,2.25\}$ . These simulations allow us to study the performance of the different algorithms in simple situations ( $\sigma=0.25,0.5$ ) as well as in more difficult situations ( $\sigma=1.75,2,2.25$ ). The first column of Figure 3 shows a random sample of simulated images for different values of $\sigma$ .

We have compared our proposal with different state-of-the-art approaches:

mclust:

A Gaussian mixture model without contiguity constraints, with automatic model selection and the number of cluster varied between 1 and 20. The implementation of the R package mclust, (Scrucca \BOthers., \APACyear2016) was used to derive those results.

mclust (9):

The same implementation and algorithm with the number of clusters fixed to the true number of cluster.

skater

The skater algorithm with a number of clusters equal to the true number of cluster. The implementation used is the one available in the rgeoda package.

redcap:

The redcap algorithm with a number of clusters equal to the true number of cluster. The implementation used is the one available in the rgeoda package.

azpg:

The azp algorithm with greedy optimization and a number of clusters equal to the true number of cluster. The implementation used is the one available in the rgeoda package.

azpsa:

The azp algorithm with simulated-annealing optimization and a number of clusters equal to the true number of cluster. The implementation used is the one available in the rgeoda package.

The solutions obtained with these algorithms were compared with those obtained with our implementation of the algorithm introduced in this paper and available in the gtclust R package with a classical Gaussian model (Gaussian prior for the mean and Gamma prior for the variances, with values $\tau=0.01$ , $\kappa=1$ , $\beta=0.1s^{2}$ , $\mu=\bar{x}$ ). We extracted for the analysis two solutions from the hierarchy :

gtclust:

the MAP partition found by the algorithm.

gtclust (9):

the partition with 9 clutsers found by the algorithm.

For each $\sigma$ value on the grid, 50 simulated records were generated and the results of each algorithm were recorded in terms of Normalised Mutual Information (NMI) between the extracted partition and the simulated one. All algorithms were given the same contiguity graph (constructed with rook adjacency). The details of the results obtained are shown in Figure 3. The average performance of each algorithm is also given in Table 1. The results obtained by gtclust are quite good, outperforming all the other methods until $\sigma$ reaches the value of 1.75, after this value the problems are quite hard with a very low signal to noise ratio and all the contiguity constrained methods reach similar performances. But below this value, gtclust, even without fixing the number of clusters to the true value, has better performances than the other methods (which need to know the number of clusters in advance) and is also more stable than the other approaches, as we can see in Figure 4. Furthermore, in this range of settings, the number of clusters automatically found by gtclust is almost always equal to 9 or 8, as we can see in Table 2.

With respect to the other approaches, the performance of the mixture models without contiguity constraints deteriorates rapidly. The solutions found by SKATER and REDCAP are comparable and slightly better than those found by the two AZP variants. These simulations thus highlight the interest of the proposed solution and of contiguity constraints.

3.2 Regionalization of french mobility statistics

The first results on real data that we describe come from the analysis of the mobility statistics of the 2018 French census, provided by INSEE222see https://www.insee.fr/fr/statistiques/5650708. Among the information collected for the census, several pieces of information on the main mode of transport are collected, and we analysed the share of residents using: car/transit/motorised two-wheelers/bicycle/footwalking or any other mode as the main mode of transport in the ” region centre ” of France at the IRIS level. IRIS are geographical units that divide the French municipalities into regions of around 2000 inhabitants. There are 2150 IRIS in the dataset used and they are described by the six variables mentioned above. Geographical units with less than 50 inhabitants were smoothed with a weighted average of their neighbourhood values. We considered the data as compositional and transformed them using an additive log-ratio transformation with the car variable as the reference variable, leaving 5 variables to be analysed. A multivariate Gaussian mixture model with diagonal covariance matrices and normal gamma prior was then applied to these transformed data. The prior parameters were set to $(\tau=0.01$ , $\kappa=1$ , $\beta=0.64$ , $\mu=\bar{\mathbf{x}}$ ).

The dendrogram of the solution found is shown in Figure 5 and starts with 23 clusters. As shown in Figure 6, the main characteristics of the territory are clearly extracted by the algorithm, the main agglomerations (highlighted with labels) form specific clusters, the two largest (Tours and Orléans) are even represented by two clusters (one for the inner city and another for the suburbs), which is easily explained by the higher share of walking and transit use in these regions. Two large clusters divide the region in a north-south direction, with a lower use of transit (and a higher use of walking) in the southern part of the region. Finally, a final cluster in the north-east of the map presents a heavy use of transit and can be explained by important commuting flows with Paris and made to a large extent by transit.

3.3 Traffic speed clustering

The second application on real data that we have carried out deals with traffic speeds on a road network. Such an application is of interest because finding homogeneous traffic zones in road networks can be very helpful for traffic control. An important open question in traffic theory for the application of the Macroscopic Fundamental Diagram (MFD), a classical tool in traffic control, is how best to segment an urban network into regional subnetworks and how to treat endogenous heterogeneity in the spatial distribution of congestion (Ji \BBA Geroliminis, \APACyear2012; Haghbayan \BOthers., \APACyear2021). This segmentation must fulfil several properties: first, the regions must be of reasonable and similar size, and second, they must be topologically connected and compact. Finally, the traffic conditions in each region must be approximately homogeneous (i.e. congestion must be approximately homogeneous in the region). Contiguity constrained clustering therefore seems a natural way to solve this problem and we have investigated the use of the proposed algorithm in this context.

The data used in this section is one of the benchmark datasets used to study this task (Bellocchi \BBA Geroliminis, \APACyear2019). This dataset concerns the road network of Shenzhen, where the average speed of road segments is estimated every 5 minutes from map-matched traces of about 20,000 taxi GPS points collected over three days in 2011. In order to replicate the preprocessing steps used in previous studies (Haghbayan \BOthers., \APACyear2021), the following preprocessing steps were performed: nodes with no structural role (i.e. that are not intersections) were smoothed (their incoming edges were merged and the associated speed value was taken as the mean of the speeds of the combined edges); average speed values were also computed over 15-minute time intervals. Finally, the graph between road segments was constructed by connecting any two segments connected by an intersection, as shown in Figure 1. A Gaussian mixture model with Norma-Gamma conjugate prior $\mu_{k}\sim\mathcal{N}\left(\mu,(\tau.V_{k})^{-1}\right),V_{k}\sim Gam(\kappa,\beta),$ with prior parameters ( $\tau=0.01$ , $\kappa=1$ , $\beta=0.1s^{2}\approx 0.62$ , $\mu=\bar{x}\approx 9$ ) and the velocities in $m/s$ were then used to cluster the data set.

We present in Figure 7 the clustering results obtained over the 7h30-7h45 time slot of the 9th of September. The figure shows the observed velocity in a first map and the extracted clusters in another. These two maps are accompanied by the computed dendrogram and per cluster box plots of the observed velocities. The uniform prior leads to 11 clusters, from the dendrogram we can see that the next interesting solutions contain only 5 clusters. The traffic conditions are homogeneous in each cluster, two clusters correspond to free flow zones with an average speed around 50 km/h, the others correspond to congested zones with speeds around 20 km/h or 35 km/h. When aggregated to the next level of interest in the dendrogram, the remaining clusters are the pink, light green and dark blue clusters, the others being merged into a single ”non-congested” cluster. Again, the segmentation obtained by clustering seems to be quite coherent and meaningful, showing the interest of the proposal for this type of applications.

4 Conclusion and future works

This paper has introduced a Bayesian approach to contiguity constrained clustering that allows semi-automatic tuning of the model complexity (number of clusters), together with an efficient algorithm for finding a MAP partition and building a Bayesian dendrogram from it. A simulation study has shown the interest of this algorithm compared to other solutions for contiguity constrained clustering and its ability to recover the number of clusters. Finally, two experiments on real data have demonstrated the applicability of the proposal on real test cases. From an algorithmic point of view, it would be interesting to compare the hierarchical approach used in this proposal with other solutions based, for example, on local swap moves (ie. in the sense of the Louvain or Leiden algorithms), and to compare the advantages or disadvantages of these two solutions on benchmark datasets. From a modelling point of view, we have mainly illustrated the proposal with Gaussian Mixture Models, but as the proposal is quite general, other models should be investigated (Poisson, Mulitnomial, …). In this spirit, we are currently working on a solution based on the Laplace approximation to deal with more complex observation models where conjugate priors do not exist.

Bibliography44

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Angriman \B Others . ( \APA Cyear 2020) \APA Cinsertmetastar Angriman 20 {APA Crefauthors} Angriman, E., Predari, M., van der Grinten, A. \BCBL \BBA Meyerhenke, H. \APA Cref Year Month Day 2020. \APA Crefbtitle Approximation of the Diagonal of a Laplacian’s Pseudoinverse for Complex Network Analysis. Approximation of the diagonal of a laplacian’s pseudoinverse for complex network analysis. \APA Caddress Publisher ar Xiv. {APA Cref URL} https://arxiv.org/abs/2006.13679 doi: 10.48 · doi ↗
2Anselin ( \APA Cyear 2001) \APA Cinsertmetastar Anselin 01 {APA Crefauthors} Anselin, L. \APA Cref Year Month Day 2001. \BBOQ \APA Crefatitle Spatial econometrics Spatial econometrics. \BBCQ \B In \APA Crefbtitle A companion to theoretical econometrics A companion to theoretical econometrics ( \BVOL 310330, \BPGS 310–330). \Print Back Refs \Current Bib
3Assunção \B Others . ( \APA Cyear 2006) \APA Cinsertmetastar Assuncao 2006 {APA Crefauthors} Assunção, R \BPBI M., Neves, M \BPBI C., Câmara, G. \BCBL \BBA Freitas, C \BPBI D \BPBI C. \APA Cref Year Month Day 2006. \BBOQ \APA Crefatitle Efficient regionalization techniques for socio‐economic geographical units using minimum spanning trees Efficient regionalization techniques for socio‐economic geographical units using minimum spanning trees. \BBCQ \APA Cjournal Vol Num Pages International · doi ↗
4Barry \BBA Hartigan ( \APA Cyear 1993) \APA Cinsertmetastar Barry 93 {APA Crefauthors} Barry, D. \BCBT \BBA Hartigan, J \BPBI A. \APA Cref Year Month Day 1993. \BBOQ \APA Crefatitle A Bayesian Analysis for Change Point Problems A bayesian analysis for change point problems. \BBCQ \APA Cjournal Vol Num Pages Journal of the American Statistical Association 88421309-319. {APA Cref URL} https://doi.org/10.1080/01621459.1993.10594323 doi: 10.1080/01621459.1993.10594323 \Print Back R · doi ↗
5Bates \BBA Maechler ( \APA Cyear 2019) \APA Cinsertmetastar Bates 2019 {APA Crefauthors} Bates, D. \BCBT \BBA Maechler, M. \APA Cref Year Month Day 2019. \BBOQ \APA Crefatitle Matrix: Sparse and Dense Matrix Classes and Methods Matrix: Sparse and dense matrix classes and methods \BBCQ [ \bibcomputersoftwaremanual ]. {APA Cref URL} https://CRAN.R-project.org/package=Matrix \APA Crefnote R package version 1.2-17 \Print Back Refs \Current Bib
6Bellocchi \BBA Geroliminis ( \APA Cyear 2019) \APA Cinsertmetastar Bellocchi 2019 {APA Crefauthors} Bellocchi, L. \BCBT \BBA Geroliminis, N. \APA Cref Year Month Day 20193. \APA Crefbtitle Shenzhen whole day Speeds. Shenzhen whole day speeds. {APA Cref URL} https://figshare.com/articles/dataset/Shenzhen_whole_day_Speeds/7212230 doi: 10.6084/m 9.figshare.7212230.v 2 \Print Back Refs \Current Bib · doi ↗
7Biernacki \B Others . ( \APA Cyear 2000) \APA Cinsertmetastar Biernacki 2000 {APA Crefauthors} Biernacki, C., Celeux, G. \BCBL \BBA Govaert, G. \APA Cref Year Month Day 2000. \BBOQ \APA Crefatitle Assessing a mixture model for clustering with the integrated completed likelihood Assessing a mixture model for clustering with the integrated completed likelihood. \BBCQ \APA Cjournal Vol Num Pages IEEE Transaction on Pattern Analysis and Machine Intelligence 7719-725. \Print Back Refs \Current Bi
8Biernacki \B Others . ( \APA Cyear 2010) \APA Cinsertmetastar Biernacki 2010 {APA Crefauthors} Biernacki, C., Celeux, G. \BCBL \BBA Govaert, G. \APA Cref Year Month Day 2010. \BBOQ \APA Crefatitle Exact and monte carlo calculations of integrated likelihoods for the latent class model Exact and monte carlo calculations of integrated likelihoods for the latent class model. \BBCQ \APA Cjournal Vol Num Pages Journal of Statistical Planning and Inference 1402991-3002. \Print Back Refs \Current Bi

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Taxonomy

A tree is all you need

Abstract

1 Context, motivations and related works

2 Method

2.1 Exponential family and observation models

2.2 Contiguity-constrained partition prior

Theorem 1** (Kirchhoff’s theorem).**

Proposition 1** (Compatible spanning trees).**

Proof.

2.3 A greedy agglomerative algorithm

Proposition 2** (Lower bound on merge score).**

Proof.

Lemma 1** (Deletion-contraction recurrence for multi-graphs).**

Proof.

2.4 Dendrogram and prior for the number of clusters

3 Results and discussion

3.1 Simulation study

3.2 Regionalization of french mobility statistics

3.3 Traffic speed clustering

4 Conclusion and future works

Theorem 1 (Kirchhoff’s theorem).

Proposition 1 (Compatible spanning trees).

Proposition 2 (Lower bound on merge score).

Lemma 1 (Deletion-contraction recurrence for multi-graphs).