Semi-supervised Clustering with Two Types of Background Knowledge: Fusing Pairwise Constraints and Monotonicity Constraints

Germán González-Almagro; Juan Luis Suárez; Pablo Sánchez-Bermejo; José-Ramón Cano; Salvador García

arXiv:2302.14060 · cs.LG · March 1, 2023

TL;DR

This paper introduces a novel semi-supervised clustering method that integrates pairwise constraints and monotonicity constraints using a new distance measure and EM optimization, effectively fusing two types of background knowledge.

Contribution

It is the first method to combine pairwise and monotonicity constraints in clustering, providing a formal framework and an optimization scheme for this integration.

Findings

01. Effective in benchmark datasets

02. Successful application to real-world data

03. Outperforms existing clustering methods

Abstract

This study addresses the problem of clustering in the presence of two types of background knowledge: pairwise constraints and monotonicity constraints. To this end, a formal framework for clustering under monotonicity constraints is first defined, resulting in a specific distance measure. Pairwise constraints are then integrated by designing an objective function that combines the proposed distance measure with a pairwise constraint-based penalty term, fusing both types of information. This objective function can be optimized with an EM optimization scheme. The proposed method is the first approach to this problem, as it is the first method designed to work with both types of background knowledge. Our proposal is tested on a variety of benchmark datasets and in a real-world case study.

Tables

Table 1: Datasets and Constraint Sets Summary

Dataset	Instances	Classes	Features	$CS_{10}$ ML	$CS_{10}$ CL	$CS_{15}$ ML	$CS_{15}$ CL	$CS_{20}$ ML	$CS_{20}$ CL
Artiset	899	10	2	494	3422	1240	7671	2061	13870
Balance	625	3	4	832	1059	1799	2479	3332	4418
BostonHousing4CL	506	4	13	284	941	686	2089	1266	3784
Car	1728	4	6	7961	6745	18167	15244	32076	27264
ERA	1000	9	4	676	4274	1562	9613	2760	17140
ESL	488	9	4	216	912	521	2107	949	3707
LEV	1000	5	4	1381	3569	3174	8001	5692	14208
MachineCPU	209	4	6	41	149	99	366	205	615
Qualitative Bankruptcy	250	2	6	35	147	153	344	322	617
SWD	1000	4	10	1566	3384	3674	7501	6583	13317
Windsor Housing	546	2	11	915	516	2105	1135	3827	2059
Wisconsin	683	2	9	1273	1005	2834	2317	5146	4034

Table 2: Results obtained by the five compared methods for the $CS_{10}$ constraint set.

Dataset	ARI ( $↑$ )					NMI ( $↓$ )					Unsat ( $↓$ )
	PCKM-Mono	P2Clust	COP-KMeans	KMeans	PCSKMeans	PCKM-Mono	P2Clust	COP-KMeans	KMeans	PCSKMeans	PCKM-Mono	P2Clust	COP-KMeans	KMeans	PCSKMeans
Artiset	1.000	0.366	-	0.241	0.995	0.020	0.000	-	0.810	0.189	0.000	0.136	-	0.160	0.000
Balance	0.016	0.005	1.000	0.146	1.000	0.610	0.000	0.767	0.914	0.631	0.053	0.477	0.000	0.406	0.000
Bostonhousing4Cl	0.123	0.122	1.000	0.124	0.657	0.000	0.000	0.000	0.000	0.000	0.091	0.332	0.000	0.356	0.005
Car	0.825	0.036	1.000	0.112	0.993	0.058	0.000	0.057	0.128	0.164	0.008	0.500	0.000	0.458	0.000
ERA	0.996	0.013	-	-0.045	0.997	1.000	0.000	-	1.000	1.000	0.000	0.256	-	0.283	0.000
ESL	0.979	0.281	0.975	0.241	0.974	0.584	0.000	0.653	1.000	0.584	0.001	0.210	0.000	0.206	0.000
LEV	1.000	-0.224	1.000	0.071	0.999	0.989	0.000	0.971	1.000	0.971	0.000	0.558	0.000	0.345	0.000
MachineCPU	0.156	0.159	0.987	0.224	0.196	0.143	0.000	0.258	0.364	0.258	0.034	0.385	0.000	0.339	0.011
Qualitative Bankruptcy	1.000	0.665	-0.300	0.934	1.000	0.000	0.000	0.650	0.000	0.000	0.000	0.143	0.650	0.035	0.000
SWD	0.217	0.111	1.000	0.066	0.955	0.947	0.000	0.947	0.973	0.933	0.093	0.380	0.000	0.390	0.001
Windsor Housing	0.984	0.073	0.994	0.064	1.000	0.000	0.000	0.000	0.000	0.000	0.001	0.452	0.000	0.452	0.000
Wisconsin	1.000	0.857	1.000	0.849	1.000	0.009	0.000	0.009	0.764	0.009	0.000	0.070	0.000	0.074	0.000
Mean	0.691	0.205	0.555	0.252	0.897	0.363	0.000	0.526	0.579	0.395	0.023	0.325	0.221	0.292	0.001

Table 3: Results obtained by the five compared methods for the $CS_{15}$ constraint set.

Dataset	ARI ( $↑$ )					NMI ( $↓$ )					Unsat ( $↓$ )
	PCKM-Mono	P2Clust	COP-KMeans	KMeans	PCSKMeans	PCKM-Mono	P2Clust	COP-KMeans	KMeans	PCSKMeans	PCKM-Mono	P2Clust	COP-KMeans	KMeans	PCSKMeans
Artiset	0.591	0.370	1.000	0.244	0.404	0.182	0.000	0.000	0.999	0.957	0.002	0.134	0.000	0.157	0.001
Balance	1.000	0.005	1.000	0.140	1.000	0.566	0.000	0.790	0.914	0.778	0.000	0.482	0.000	0.392	0.000
Bostonhousing4Cl	0.999	0.122	1.000	0.127	0.989	0.000	0.000	0.000	0.000	0.000	0.000	0.342	0.000	0.349	0.000
Car	0.999	0.029	1.000	0.113	1.000	0.057	0.000	0.057	0.120	0.131	0.000	0.501	0.000	0.461	0.000
ERA	0.998	0.012	-	-0.070	0.573	1.000	0.000	-	1.000	1.000	0.000	0.252	-	0.307	0.000
ESL	0.995	0.273	0.993	0.245	0.366	0.403	0.000	0.626	0.998	0.594	0.000	0.209	0.000	0.211	0.001
LEV	0.934	-0.225	0.250	0.062	0.896	0.971	0.000	0.982	1.000	0.985	0.003	0.555	0.375	0.340	0.000
MachineCPU	1.000	0.159	1.000	0.216	0.803	0.212	0.000	0.258	0.349	0.258	0.000	0.344	0.000	0.377	0.000
Qualitative Bankruptcy	1.000	0.665	0.689	0.935	1.000	0.000	0.000	0.132	0.000	0.000	0.000	0.173	0.125	0.038	0.000
SWD	0.997	0.114	1.000	0.068	1.000	0.947	0.000	0.947	0.963	0.947	0.001	0.381	0.000	0.394	0.000
Windsor Housing	1.000	0.073	-	0.058	0.992	0.000	0.000	-	0.000	0.000	0.000	0.464	-	0.451	0.000
Wisconsin	1.000	0.857	-	0.848	1.000	0.009	0.000	-	0.764	0.009	0.000	0.069	-	0.075	0.000
Mean	0.960	0.205	0.411	0.249	0.835	0.362	0.000	0.566	0.592	0.472	0.001	0.326	0.292	0.296	0.000

Table 4: Results obtained by the five compared methods for the $CS_{20}$ constraint set.

Dataset	ARI ( $↑$ )					NMI ( $↓$ )					Unsat ( $↓$ )
	PCKM-Mono	P2Clust	COP-KMeans	KMeans	PCSKMeans	PCKM-Mono	P2Clust	COP-KMeans	KMeans	PCSKMeans	PCKM-Mono	P2Clust	COP-KMeans	KMeans	PCSKMeans
Artiset	0.941	0.364	-0.950	0.239	1.000	0.194	0.000	0.976	0.793	0.855	0.001	0.133	0.975	0.166	0.000
Balance	1.000	0.004	1.000	0.136	1.000	0.509	0.000	0.831	0.914	0.825	0.000	0.479	0.000	0.407	0.000
Bostonhousing4Cl	1.000	0.122	-	0.128	1.000	0.000	0.000	-	0.000	0.000	0.000	0.342	-	0.340	0.000
Car	0.999	0.030	-	0.107	1.000	0.057	0.000	-	0.122	0.063	0.000	0.504	-	0.464	0.000
ERA	0.040	0.012	0.034	-0.017	0.999	0.986	0.000	0.999	1.000	0.999	0.019	0.252	0.000	0.262	0.000
ESL	0.295	0.303	0.246	0.244	0.994	0.405	0.000	0.482	1.000	0.759	0.000	0.189	0.000	0.213	0.000
LEV	0.997	-0.229	-	0.078	0.012	0.971	0.000	-	0.996	0.973	0.000	0.551	-	0.343	0.001
MachineCPU	0.987	0.159	0.231	0.216	0.949	0.258	0.000	0.241	0.359	0.258	0.000	0.365	0.000	0.406	0.000
Qualitative Bankruptcy	0.727	0.665	1.000	0.938	0.937	0.000	0.000	0.000	0.116	0.031	0.106	0.151	0.000	0.031	0.000
SWD	0.998	0.113	-	0.066	1.000	0.947	0.000	-	0.963	1.000	0.001	0.385	-	0.397	0.000
Windsor Housing	0.286	0.073	1.000	0.027	0.133	0.000	0.000	0.000	0.000	0.002	0.190	0.466	0.000	0.456	0.006
Wisconsin	0.945	0.857	1.000	0.848	0.836	0.008	0.000	0.009	0.041	0.009	0.016	0.068	0.000	0.071	0.004
Mean	0.768	0.206	-0.037	0.251	0.822	0.361	0.000	0.628	0.525	0.481	0.028	0.324	0.415	0.296	0.001

Equations

L_{1}(x_{i},x_{j})=\sum_{d=1}^{u}w_{d}\,|x_{[i,d]}-x_{[j,d]}|.

L_{1}(x_{i},x_{j})=\sum_{d:\,x_{[i,d]}>x_{[j,d]}}^{u}\left(w_{d}x_{[i,d]}-w_{d}x_{[j,d]}\right)+\sum_{d:\,x_{[j,d]}>x_{[i,d]}}^{u}\left(w_{d}x_{[j,d]}-w_{d}x_{[i,d]}\right).

r(x_{i},x_{j})=\sum_{d:\,x_{[i,d]}>x_{[j,d]}}^{u}\left(w_{d}x_{[i,d]}-w_{d}x_{[j,d]}\right).

L_{1} (x_{i}, x_{j}) = r (x_{i}, x_{j}) + r (x_{j}, x_{i}) .

\begin{array}[]{lc}J_{PCKMM}=&\frac{1}{K}\sum_{k=1}^{K}\sum_{x_{i}\in c_{k}}|(r(x_{i},\mu_{k})-r(\mu_{k},x_{i}))|+\\ &\\ &\sum_{(x_{i},x_{j})\in C_{=}}\mathbb{1}\llbracket l_{i}\neq l_{j}\rrbracket+\sum_{(x_{i},x_{j})\in C_{\neq}}\mathbb{1}\llbracket l_{i}=l_{j}\rrbracket\end{array}.

\begin{array}[]{lc}x_{i}\in c_{h^{*}}\;\;\text{{if}}\;\;h^{*}=&\texttt{argmin}_{h}\left(|\sum_{j=1}^{u}(x_{[i,j]}-\mu_{[h,j]})|+\right.\\ &\left.\sum_{x_{j}:(x_{i},x_{j})\in C_{=}}\mathbb{1}\llbracket l(c_{h})\neq l_{j}\rrbracket+\sum_{x_{j}:(x_{i},x_{j})\in C_{\neq}}\mathbb{1}\llbracket l(c_{h})=l_{j}\rrbracket\right)\end{array}.

\mu_{i}=\frac{1}{|c_{i}|}\sum_{x_{i}\in c_{i}}x_{i}

\left[P(\rho<r_{\text{min}})=P(A-B<0),\;\;P(\rho\in\text{rope}),\;\;P(\rho>r_{\text{max}})=P(A-B>0)\right].

Taxonomy

Topics: Text and Document Classification Technologies · Data Management and Algorithms · Advanced Clustering Algorithms Research

Full text

Germán González-Almagro
DaSCI Andalusian Institute, DECSAI, University of Granada, Granada, Spain

Juan Luis Suárez
DaSCI Andalusian Institute, DECSAI, University of Granada, Granada, Spain

Pablo Sánchez-Bermejo
DECSAI, University of Granada, Granada, Spain

José-Ramón Cano
DaSCI Andalusian Institute, Dept. of Computer Science, University of Jaén, Jaén, Spain

Salvador García
DaSCI Andalusian Institute, DECSAI, University of Granada, Granada, Spain

Keywords Pairwise constraints $\cdot$ Monotonicity constraints $\cdot$ Expectation-minimization $\cdot$ Semi-supervised learning $\cdot$ Machine learning

1 Introduction

Clustering constitutes a key research area in the unsupervised learning paradigm, where no information on how data should be handled is available. It can be viewed as the task of grouping instances from a dataset into groups (or clusters), with the aim of extracting new information from them [1]. From the classic K-means algorithm to newer proposals [2], clustering has been applied to many problems, such as time series monitoring [3], COVID-19 medical image segmentation [4] and regular image segmentation [5], noisy speech processing [6] or band selection in hyperspectral images [7]. Background knowledge can be integrated into the classic clustering framework, thus reframing it into the semi-supervised learning paradigm [8, 9], where partial or incomplete information about the dataset is given to perform clustering.

When additional information is given in the form of constraints, the constrained clustering problem arises. Constraints can be understood in three main ways: cluster-level [10], instance-level pairwise (or simply pairwise) [11] and feature-level constrained clustering [12]. This study focuses on pairwise constraints, which indicate whether two specific instances of a dataset must be placed in the same or in different clusters, resulting in Must-link (ML) and Cannot-link (CL) constraints, respectively. Constrained clustering has been applied to a variety of real-world problems, such as satellite image time series [13], storage location assignment in warehouses [14], obstructive sleep apnea analysis [15] or electoral district design [16]. Recent studies and proposals, such as [17], show the growing interest in the area of constrained clustering.

Recently, a new type of background knowledge coming from the supervised learning paradigm has been integrated into unsupervised learning. Monotonic classification is a particular case of supervised learning where classes are a set of ordered categories and classification models must respect monotonicity constraints among instances based on their descriptive features. This means that, if an instance $x_{i}$ has greater feature values than those of instance $x_{j}$ , its assigned class must also be higher (greater) in the ordering than that of $x_{j}$ [18]. Consider the classic example of house pricing: for two houses in the same neighborhood, the bigger one is constrained to have a higher price than the smaller one when the rest of their features are similar [19]. This defines an order relationship between houses (instances) based on the value of their features; therefore, models predicting house prices must take this into account to produce accurate results. Monotonicity constraints are a type of background knowledge that can be used to produce more accurate predictive models [20], and have been successfully applied to real-world problems such as fraudulent firm classification [21], real-time dynamic malware detection [22], or learning activities analysis based on students' opinion surveys [23]. Additionally, a recent study from The Alan Turing Institute states that considering underlying data monotonicity in data science/machine learning models leads to fairer applications [24].

In [18] a methodology to perform clustering in the presence of monotonicity information (ordered clustering) is proposed within the Multi Criteria Decision Aid (MCDA) framework. It is done by defining a distance measure based on the concept of preference, which is later explained in detail (Section 2.3), and whose basic concept relies on comparing instances from the dataset, discriminating feature-level comparative relationships. This results in a distance measure that produces ordered labeling in terms of monotonicity, understanding it as in the monotonic classification models which have previously been described.

This study addresses the fusion of the two types of background knowledge mentioned above: pairwise constraints and monotonicity constraints. It extends a previous study by the same authors [25] in both the theoretical background of the proposed method and the testing of its capabilities. A real-world application is also presented in this study, addressing the Shanghai Ranking of World Universities (SRWU) dataset from a new perspective. A previous study which combines monotonicity constraints and cluster-size constraints (capacitated clustering) can be found in [26], where researchers are motivated by the existence of problems in which both types of background knowledge are available. This constitutes evidence of the interest in combining different types of background knowledge, as there are real-world problems in which background knowledge is given in a heterogeneous fashion. Following this trend, our research is motivated by the existence of real-world problems in which monotonicity constraints and pairwise constraints are available, such as the SRWU partitioning problem. To the best of our knowledge, there is no previous research on this topic, as models to perform ordered clustering have emerged very recently. In this study, the logical relationship between monotonic classification and ordered clustering is tackled, producing the monotonic clustering paradigm, in which pairwise constraints are later included, resulting in Monotonic Constrained Clustering (MCC). An expectation-minimization (EM) scheme is proposed to optimize a hybrid objective function which fuses both monotonicity and pairwise constraints. The proposed hybrid objective function is composed of a monotonic distance metric and a penalty term for pairwise constraint violations. The overall proposed optimization method for MCC is named Pairwise Constrained K-Means - Monotonic (PCKM-Mono).

The rest of this study is organized as follows. Section 2 presents the background concerning classic clustering, pairwise constrained clustering, monotonic classification and ordered clustering, which is later used in Section 3 to introduce the proposed MCC method. Once the experimental setup used to carry out our experiments is presented in Section 4, Sections 5 and 6 report and analyze the experimental results obtained by the proposed method. A real-world case study is carried out in Section 7, where our proposal is used to perform clustering on the SRWU dataset and its results are compared with those obtained by other methods in the same task. Lastly, our conclusions are discussed in Section 8.

2 Background

As stated before, partitional clustering is the action of grouping instances of a dataset into $k$ clusters. A dataset $X=\{x_{1},\cdots,x_{n}\}$ contains $n$ instances, each one described by $u$ features. The $i$th instance from $X$ is denoted as $x_{i}=(x_{[i,1]},\cdots,x_{[i,u]})$ . The goal of a clustering algorithm is to assign a class label $l_{i}$ to each instance in $X$ . The result is a list of labels $L=[l_{1},\cdots,l_{n}]$ , with $l_{i}\in\{1,\cdots,k\}\;\forall i\in\{1,\cdots,n\}$ , that effectively splits $X$ into $k$ non-overlapping clusters $c_{i}$ to form a partition called $C=\{c_{1},\cdots,c_{k}\}$ . The label associated with a given cluster $c_{i}$ can be accessed as $l(c_{i})$ . The cluster membership of every instance is determined by the similarity of the instance to the rest of the instances in the same cluster, and the dissimilarity to instances in other clusters. Many types of distance measurements can be used to determine pairwise similarities [27].

2.1 Constrained Clustering

In real world applications, it is common to have some information about the analyzed datasets, even if this information is not given in the form of labels. In pairwise constrained clustering, a set of constraints is given to guide the clustering process. Constraints involve pairs of instances, indicating whether they must or must not belong to the same cluster; thus, two types of pairwise constraints can be formalized:

• Must-link (ML) constraints $C_{=}(x_{i},x_{j})$ : instances $x_{i}$ and $x_{j}$ from $X$ must be placed in the same cluster.

• Cannot-link (CL) constraints $C_{\neq}(x_{i},x_{j})$ : instances $x_{i}$ and $x_{j}$ from $X$ cannot be assigned to the same cluster.

It is known that ML constraints are transitive, reflexive and symmetrical, and therefore they constitute an equivalence relationship. This is not the case for CL constraints; however, they can be chained to deduce new ML constraints [28]. Pairwise constraints can be enforced in two ways: hard [28] and soft [29] constraints. The former must necessarily be satisfied in the output partition of any algorithm which makes use of them, while the latter are interpreted as strong suggestions by the algorithm but can be only partially satisfied in the output partition.

In CC (Constrained Clustering), the goal is to find a partition $C=\{c_{1},\cdots,c_{k}\}$ of $X$ into $k$ clusters, ideally satisfying all constraints (in hard CC) or as many constraints as possible (in soft CC). The classic clustering requirements must also be observed: the cluster sizes must add up to the number of instances in $X$ , that is, $n=|X|=\sum_{i=1}^{k}|c_{i}|$ .

2.2 Monotonicity Constraints in Classification

Monotonicity constraints were originally integrated into the supervised learning classification task, leading to monotonic classification. It can be viewed as a special case of standard classification where the classes constitute a set of ordered categories. Monotonic classification models must respect monotonicity constraints between the feature values of the instances and their class labels [20].

Formally, monotonic classification aims to predict the class label $y_{i}$ of an instance $x_{i}$ , with $y_{i}\in\mathcal{Y}=\{l_{1},\cdots,l_{m}\}$ . The categories in $\mathcal{Y}$ are arranged in an order relation $\prec$ such that $l_{1}\prec l_{2}\prec\cdots\prec l_{m}$ . In doing so, features and class labels are monotonically constrained by the problem background knowledge, i.e. $x_{i}\succeq x_{j}\rightarrow f(x_{i})\geq f(x_{j})$ , where $x_{i}\succeq x_{j}$ means that every feature of $x_{i}$ compares to the corresponding feature of $x_{j}$ with the operator $\geq$ , that is: $x_{[i,q]}\geq x_{[j,q]}\;\forall q\in\{1,\cdots,u\}$ [30]. This relationship between instances is referred to as dominance; in this case, $x_{i}$ dominates $x_{j}$ . The goal of monotonic classification is to build a classifier that does not violate monotonicity constraints (pairwise dominance relationships). The result is a monotonic classifier [20].
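The dominance relation and the notion of a monotonicity violation can be illustrated with a minimal sketch. The function names and the toy data are hypothetical and chosen only for illustration; counting ordered pairs is one possible convention, not necessarily the one used by the cited works.

```python
from typing import Sequence


def dominates(x_i: Sequence[float], x_j: Sequence[float]) -> bool:
    """Return True if x_i >= x_j on every feature, i.e. x_i dominates x_j."""
    return all(a >= b for a, b in zip(x_i, x_j))


def count_monotonicity_violations(X, y) -> int:
    """Count ordered pairs (i, j) where x_i dominates x_j but y_i < y_j."""
    violations = 0
    for i in range(len(X)):
        for j in range(len(X)):
            if i != j and dominates(X[i], X[j]) and y[i] < y[j]:
                violations += 1
    return violations


# Toy house-pricing data (hypothetical): (size, rooms) and an ordered price class.
X = [(120, 4), (80, 2), (100, 3)]
y_monotone = [2, 0, 1]  # bigger houses get higher classes
y_broken = [0, 2, 1]    # the biggest house gets the lowest class

print(count_monotonicity_violations(X, y_monotone))
print(count_monotonicity_violations(X, y_broken))
```

A labeling is monotonic in this sense exactly when the count is zero.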

Much in the same way as with constrained clustering methods, a distinction can be made among monotonic classifiers: soft monotonic models try to minimize the number of monotonicity constraint violations, while hard monotonic models always produce monotonic predictions (they never violate monotonicity constraints) [19].

2.3 Partially Ordered Data Clustering in MCDA

In [18] the monotonicity constraints are integrated into unsupervised learning to produce the ordered clustering framework. Particularly, they are integrated into the MCDA paradigm, which is a subfield of operational research that concerns the structuring and solving of decision problems involving multiple criteria [31]. To do so, the classic symmetrical notion of distance in pattern recognition is replaced with the asymmetrical notion of preference from the MCDA paradigm. The preference of an instance over another evaluates the global advantages of the former over the latter with respect to some preference criteria. The notion of preference can be seen as a decomposition of a distance measure, taking into account the sign of the differences. To cluster instances in an MCDA context, the similarity between every pair of instances is evaluated in terms of preferences taking all the other alternatives into account. With this in mind, two instances are similar if they are preferred to or by the same set of instances. To formalize these concepts, let us consider the weighted $L_{1}$ distance (for the maximization case and without loss of generality) as in Equation 1, which can be rewritten as in Equation 2, with $w_{d}\in[0,1]$ being the weight assigned to the $d$th feature.

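$$L_{1}(x_{i},x_{j})=\sum_{d=1}^{u}w_{d}\,|x_{[i,d]}-x_{[j,d]}|. \tag{1}$$

$$L_{1}(x_{i},x_{j})=\sum_{d:\,x_{[i,d]}>x_{[j,d]}}^{u}\left(w_{d}x_{[i,d]}-w_{d}x_{[j,d]}\right)+\sum_{d:\,x_{[j,d]}>x_{[i,d]}}^{u}\left(w_{d}x_{[j,d]}-w_{d}x_{[i,d]}\right). \tag{2}$$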

Subsequently, let us define the preference of $x_{i}$ over $x_{j}$ as in Equation 3. To put this into words, $r(x_{i},x_{j})$ quantifies the sum of differences between $x_{i}$ and $x_{j}$ limited to the features in which $x_{i}$ has higher (lower) values than $x_{j}$ for the maximization (minimization) case. Intuitively, the preference $r(x_{i},x_{j})$ indicates the cumulative quantified value of the advantage of $x_{i}$ over $x_{j}$ . Please note that, as it has already been mentioned, the preference is not symmetrical: $r(x_{i},x_{j})\neq r(x_{j},x_{i})$ in most cases.

$$r(x_{i},x_{j})=\sum_{d:\,x_{[i,d]}>x_{[j,d]}}^{u}\left(w_{d}x_{[i,d]}-w_{d}x_{[j,d]}\right). \tag{3}$$

Finally, note that the weighted $L_{1}$ distance between two instances can always be expressed as in Equation 4. This decomposition can be done the same way for any $L_{p}$ distance.

$$L_{1}(x_{i},x_{j})=r(x_{i},x_{j})+r(x_{j},x_{i}). \tag{4}$$
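The decomposition of the weighted $L_{1}$ distance into two preferences can be checked numerically. The sketch below is only an illustration with hypothetical function names and toy values; it assumes the maximization case and uniform weights.

```python
def preference(x_i, x_j, w):
    """r(x_i, x_j): weighted differences summed over features where x_i > x_j."""
    return sum(wd * (a - b) for wd, a, b in zip(w, x_i, x_j) if a > b)


def weighted_l1(x_i, x_j, w):
    """Weighted L1 distance between two instances."""
    return sum(wd * abs(a - b) for wd, a, b in zip(w, x_i, x_j))


# Toy instances (hypothetical values) with uniform weights.
x_i = (3.0, 1.0, 5.0)
x_j = (1.0, 4.0, 5.0)
w = (1.0, 1.0, 1.0)

r_ij = preference(x_i, x_j, w)  # only the first feature contributes: 3 - 1
r_ji = preference(x_j, x_i, w)  # only the second feature contributes: 4 - 1

print(r_ij, r_ji, weighted_l1(x_i, x_j, w))
```

The two preferences differ, which shows the asymmetry of $r$, while their sum recovers the symmetric distance.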

3 The Proposal: Pairwise Constrained Monotonic Clustering

In this study, the combination of pairwise constraints and monotonicity constraints is investigated. Bearing in mind all formal concepts from monotonic classification and ordered clustering (from Section 2), establishing a logical relation between the concepts of dominance and preference is straightforward: if an instance $x_{i}$ dominates $x_{j}$ , then it is also true that instance $x_{i}$ is preferred over $x_{j}$ (for uniform weights). More formally: $x_{i}\succeq x_{j}\rightarrow r(x_{i},x_{j})\geq r(x_{j},x_{i})$ , since dominance leaves no feature in which $x_{j}$ exceeds $x_{i}$ , so $r(x_{j},x_{i})=0$ while $r(x_{i},x_{j})\geq 0$ . This way, any distance $L_{p}$ defined as in Equation 4 can be used to measure distances in clustering methods so that they produce output partitions satisfying monotonicity constraints. This new clustering paradigm is named monotonic clustering.

To perform pairwise constrained monotonic clustering, an Expectation-Minimization (EM) optimization scheme is used, along with a hybrid objective function which takes into account both pairwise constraints and monotonicity constraints. To this end, a distance measure designed on the basis of the definition of preference (originally used in ordered clustering), and a pairwise constraint-based penalty term are combined to produce the already mentioned function. We named this approach Pairwise Constrained K-Means - Monotonic (PCKM-Mono).

The EM optimization scheme is widely used in the literature to approach clustering problems ranging from classic clustering problems to constrained clustering [32] and monotonic clustering [18]. Two steps build the EM optimization scheme: (1) in the Expectation step (E step), given a set of cluster representatives (centroids) $\{\mu_{1},\cdots,\mu_{K}\}$ , every instance $x_{i}$ is assigned to the cluster $c_{j}$ that minimizes its contribution to the objective function, computed with respect to the cluster representatives; (2) in the Minimization step (M step), the cluster representatives $\{\mu_{1},\cdots,\mu_{K}\}$ are reestimated for the current cluster assignment $\{c_{1},\cdots,c_{K}\}$ to minimize the objective function. The EM optimization scheme iterates between these two steps until some convergence criteria are met. With this in mind, two elements need to be defined in order to apply the EM scheme to the constrained monotonic problem: the objective function and the centroid computation criteria.

Cost function.

The cost function of the proposed PCKM-Mono algorithm combines two main elements: a monotonic distance measure (proposed in [18]) and a pairwise constraint-based penalty term. Equation 5 defines the hybrid objective function optimized by PCKM-Mono, where $\mathbb{1}\llbracket\cdot\rrbracket$ is the indicator function (returns 1 if the predicate given as argument holds, and 0 otherwise), and $\mu_{k}$ is the centroid associated with cluster $k$ . The first term in Equation 5 is a preference-based distance metric, while the other two terms refer to the cost of violating ML and CL constraints (the penalty term), respectively. Please note that the first term of Equation 5 produces completely stratified clusters when applied alone, which would produce perfectly monotonic partitions. However, this is not a desirable result in most real-world problems, as will be shown in Section 5.

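$$J_{PCKMM}=\frac{1}{K}\sum_{k=1}^{K}\sum_{x_{i}\in c_{k}}\left|r(x_{i},\mu_{k})-r(\mu_{k},x_{i})\right|+\sum_{(x_{i},x_{j})\in C_{=}}\mathbb{1}\llbracket l_{i}\neq l_{j}\rrbracket+\sum_{(x_{i},x_{j})\in C_{\neq}}\mathbb{1}\llbracket l_{i}=l_{j}\rrbracket. \tag{5}$$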

This cost function can be translated into an assignment rule as in Equation 6, which can be intuitively interpreted as: assign each instance to its closest (preferred) cluster among those where it produces the fewest violated constraints.

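$$x_{i}\in c_{h^{*}}\;\;\text{if}\;\;h^{*}=\texttt{argmin}_{h}\left(\left|\sum_{j=1}^{u}(x_{[i,j]}-\mu_{[h,j]})\right|+\sum_{x_{j}:(x_{i},x_{j})\in C_{=}}\mathbb{1}\llbracket l(c_{h})\neq l_{j}\rrbracket+\sum_{x_{j}:(x_{i},x_{j})\in C_{\neq}}\mathbb{1}\llbracket l(c_{h})=l_{j}\rrbracket\right). \tag{6}$$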
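The assignment rule can be sketched for a single instance as follows. This is an illustrative reading of the rule, not the authors' implementation: the function names and toy data are hypothetical, constraints are given as index pairs, weights are uniform, and constraint partners that have no label yet are ignored (an assumption).

```python
def penalty(i, h, labels, ml, cl):
    """Number of pairwise constraints violated if instance i joins cluster h."""
    count = 0
    for a, b in ml:
        if i in (a, b):
            other = labels[b if a == i else a]
            if other is not None and other != h:  # must-link partner elsewhere
                count += 1
    for a, b in cl:
        if i in (a, b):
            other = labels[b if a == i else a]
            if other is not None and other == h:  # cannot-link partner in h
                count += 1
    return count


def assign(i, X, centroids, labels, ml, cl):
    """Return the cluster index minimizing signed-sum distance plus penalty."""
    def cost(h):
        # |sum_j (x_[i,j] - mu_[h,j])|: the preference-based distance term.
        distance = abs(sum(a - b for a, b in zip(X[i], centroids[h])))
        return distance + penalty(i, h, labels, ml, cl)
    return min(range(len(centroids)), key=cost)


# Toy setup (hypothetical values): instance 2 is still unassigned.
centroids = [(0.0, 0.0), (10.0, 10.0)]
X = [(1.0, 1.0), (9.0, 9.0), (4.75, 5.0)]
labels = [0, 1, None]

without_constraints = assign(2, X, centroids, labels, ml=[], cl=[])
with_must_link = assign(2, X, centroids, labels, ml=[(2, 1)], cl=[])
print(without_constraints, with_must_link)
```

In this toy case the instance is slightly closer to the first centroid, and a single must-link to an instance in the second cluster is enough to change the assignment. Because the penalty counts are unweighted, their influence depends on the scale of the features.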

Centroid update rule.

Regarding the computation of the centroid for every cluster after the E step, it is done by following its traditional form: every centroid is computed as the average of all instances which belong to the cluster it represents. This can be formalized as in Equation 7.

$$\mu_{i}=\frac{1}{|c_{i}|}\sum_{x_{i}\in c_{i}}x_{i} \tag{7}$$

The overall PCKM-Mono optimization procedure is summarized in Algorithm 1. It is clear that the proposed method is soft constrained for both pairwise constraints and monotonicity constraints.
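The two EM steps can be combined into a compact loop. The sketch below is a hedged illustration rather than a reproduction of Algorithm 1: the function name `pckm_mono_sketch`, the optional `init` argument, the handling of empty clusters and the in-place label updates during the E step are all assumptions. The default iteration cap and centroid-shift tolerance mirror the values reported in Section 4.2.

```python
import math
import random


def pckm_mono_sketch(X, k, ml, cl, init=None, max_iter=100, tol=1e-4, seed=0):
    """EM-style loop: penalized assignment (E step), mean centroids (M step)."""
    rng = random.Random(seed)
    centroids = [tuple(c) for c in (init if init is not None else rng.sample(X, k))]
    labels = [None] * len(X)
    for _ in range(max_iter):
        # E step: assign each instance to the cluster with the lowest cost.
        for i, x in enumerate(X):
            def cost(h):
                dist = abs(sum(a - b for a, b in zip(x, centroids[h])))
                pen = 0
                for a, b in ml:
                    if i in (a, b):
                        other = labels[b if a == i else a]
                        pen += other is not None and other != h
                for a, b in cl:
                    if i in (a, b):
                        other = labels[b if a == i else a]
                        pen += other is not None and other == h
                return dist + pen
            labels[i] = min(range(k), key=cost)
        # M step: each centroid becomes the mean of its members.
        new_centroids = []
        for h in range(k):
            members = [X[i] for i in range(len(X)) if labels[i] == h]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[h])  # assumption: keep old centroid
        # Convergence: average centroid shift below the tolerance.
        shift = sum(math.dist(a, b) for a, b in zip(centroids, new_centroids)) / k
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids


# Toy data (hypothetical): two well-separated groups with one ML and one CL pair.
X = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
labels, centroids = pckm_mono_sketch(
    X, k=2, ml=[(0, 1)], cl=[(0, 3)], init=[(0.0, 0.0), (10.0, 10.0)]
)
print(labels, centroids)
```

Passing `init` makes the run deterministic; leaving it out falls back to random initialization from the data, which is what the experimental setup describes.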

4 Experimental Setup and Calibration

In order to evaluate the capabilities of our proposal and compare its performance with previous methods, monotonic datasets need to be used. In [19] a list of 12 monotonic datasets is used to test the capabilities of monotonic methods. These are the datasets used in our experiments, which can be found in the Keel-dataset repository (https://sci2s.ugr.es/keel/category.php?cat=clas) [33] and are used in recent research concerning monotonic classification [34]. Three constraint sets with incremental levels of constraint-based information are generated for each dataset. Since the Euclidean distance is used to measure pairwise distances in all compared algorithms, a standardization procedure is applied to all datasets. No other preprocessing step is performed on the datasets.

Constraints are generated following the method in [28]. Three constraint sets are generated for every dataset, namely: $CS_{10}$ , $CS_{15}$ and $CS_{20}$ . Each constraint set is associated with a small percentage of the size of the dataset: 10%, 15% and 20%, respectively. The formula $(n_{f}(n_{f}-1))/2$ gives the number of artificial constraints created for each constraint set, with $n_{f}$ being the fraction of the size of the dataset associated with each of these percentages. Table 1 displays a summary of all datasets and constraint sets used in our experiments.
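The constraint count and one plausible generation procedure can be sketched as follows. Rounding $n_{f}$ down to an integer is an assumption; it reproduces, for example, the Artiset and Balance totals (ML plus CL) in Table 1. Drawing a random subset of labeled instances and constraining every pair within it is an illustrative reading of the method in [28], and the function names are hypothetical.

```python
import random
from itertools import combinations


def n_constraints(n_instances: int, percent: int) -> int:
    """Number of pairs among the n_f = floor(n * percent / 100) chosen instances."""
    n_f = n_instances * percent // 100
    return n_f * (n_f - 1) // 2


def generate_constraints(labels, percent, seed=0):
    """Pick n_f random instances; same-label pairs become ML, the rest CL."""
    rng = random.Random(seed)
    n_f = len(labels) * percent // 100
    chosen = rng.sample(range(len(labels)), n_f)
    ml, cl = [], []
    for i, j in combinations(sorted(chosen), 2):
        (ml if labels[i] == labels[j] else cl).append((i, j))
    return ml, cl


# Artiset: 899 instances at 10% gives n_f = 89 and 3916 = 494 ML + 3422 CL.
print(n_constraints(899, 10))

# Toy ground truth (hypothetical): 50 instances in 3 classes, 20% constraint set.
toy_labels = [i % 3 for i in range(50)]
ml, cl = generate_constraints(toy_labels, percent=20)
print(len(ml), len(cl))
```

The split between ML and CL depends on the class balance of the sampled subset, which is why datasets with few classes in Table 1 show proportionally more ML constraints.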

4.1 Evaluation Method and Validation of Results

Given the hybrid nature of our proposal, different features of the obtained partitions have to be inspected to assess their quality in terms of different measures. The Adjusted Rand Index (ARI) is used to measure the overall degree of agreement between the obtained partitions and the ground truth [35]. The Rand Index measures the degree of agreement of two partitions $C_{1}$ and $C_{2}$ for the same given dataset $X$ , with $C_{1}$ and $C_{2}$ viewed as collections of $n(n-1)/2$ pairwise decisions. This measure is corrected for chance to obtain the ARI. For more details on ARI, see [35]. An ARI value of 1 indicates total agreement between $C_{1}$ and $C_{2}$ , while -1 means total disagreement. The quality of the obtained partition with respect to monotonicity can be measured with the Non-Monotonic Index (NMI), which measures the degree to which monotonicity constraints are violated. It is defined as the rate of violations of monotonicity divided by the total number of examples in a dataset [36]. Finally, the Unsat measure is used to evaluate the quality of the results from the point of view of constrained clustering. Unsat is computed as the rate of violated constraints in a given partition [37].
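Two of these measures are simple enough to sketch directly. The code below is an illustration with hypothetical function names: Unsat is taken as the fraction of violated constraints, and the ARI follows the standard pair-counting formula. The NMI is left out because its exact normalization is defined in [36].

```python
from collections import Counter
from math import comb


def unsat(labels, ml, cl):
    """Fraction of pairwise constraints violated by a partition."""
    violated = sum(labels[i] != labels[j] for i, j in ml)
    violated += sum(labels[i] == labels[j] for i, j in cl)
    total = len(ml) + len(cl)
    return violated / total if total else 0.0


def adjusted_rand_index(a, b):
    """Pair-counting Rand Index between two labelings, corrected for chance."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, renamed labels
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # every same-cluster pair is split
print(unsat(truth, ml=[(0, 1), (0, 2)], cl=[(2, 3)]))
```

The first comparison shows that the ARI is invariant to label names, which matters because cluster indices produced by a clustering algorithm are arbitrary.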

Bayesian statistical tests are used to validate the results (which will be presented in Section 5), instead of the classic Null Hypothesis Statistical Tests (NHST), whose disadvantages are analyzed in [38], where a new statistical comparative framework is also proposed. The Bayesian version of the frequentist non-parametric sign test is used in this study. In the Bayesian sign test, the statistical distribution of a given parameter $\rho$ is obtained according to the differences between two sets of results, assuming it is a Dirichlet distribution. To do so, the Bayesian sign test proceeds as follows: the number of times that $A-B<0$ , the number of times where there are no significant differences, and the number of times that $A-B>0$ are counted; then the weights of the Dirichlet distribution are iteratively updated and finally sampled to obtain a large sample of the distribution. In order to identify cases where there are no significant differences, the region of practical equivalence (rope) $[r_{\text{min}},r_{\text{max}}]$ is defined, so that $P(A\approx B)=P(\rho\in\text{rope})$ . The result of this process is a set of triplets with the form described in Equation 8. The rNPBST R package is employed to apply the test, whose documentation and guide can be found in [39].

[TABLE]
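The counting-and-sampling procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the rNPBST implementation: the function name, the single-pass counting, the default rope, and the choice of placing one prior pseudo-observation inside the rope are all assumptions made for this example. Dirichlet samples are drawn by normalizing independent Gamma variates.

```python
import random


def bayesian_sign_test(a, b, rope=(-0.01, 0.01), prior=1.0, n_samples=5000, seed=0):
    """Sample triplets (P(A-B<0), P(A~B), P(A-B>0)) from a Dirichlet posterior."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    left = sum(d < rope[0] for d in diffs)    # times that A - B < 0 beyond the rope
    right = sum(d > rope[1] for d in diffs)   # times that A - B > 0 beyond the rope
    inside = len(diffs) - left - right        # no practical difference
    # Assumption: the prior pseudo-count is assigned to the rope region.
    alpha = [left, inside + prior, right]
    samples = []
    for _ in range(n_samples):
        # A Dirichlet draw is a vector of Gamma draws divided by their sum;
        # a zero weight contributes a component fixed at zero.
        draws = [rng.gammavariate(w, 1.0) if w > 0 else 0.0 for w in alpha]
        total = sum(draws)
        samples.append(tuple(d / total for d in draws))
    return samples
```

Each returned triplet sums to 1, so the fraction of samples in which a given component is the largest gives a rough estimate of the probability of the corresponding outcome.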

4.2 Calibration

To demonstrate the capabilities of the proposed PCKM-Mono algorithm, it is compared with four other previous EM-style clustering algorithms, including the only existing purely monotonic clustering algorithm, two purely constrained clustering algorithms (including the most recent one), and a classic clustering algorithm:

• P2Clust: The first approach to monotonic clustering. It modifies the distance measure used in the expectation step of the EM scheme to produce purely monotonic partitions. Monotonicity constraints are never violated in partitions produced by P2Clust [18], thus it is a hard constrained method for monotonicity constraints. It does not consider pairwise constraints, therefore it is purely monotonic.

• COP-Kmeans: COnstrained Partitional K-means constitutes the first approach to constrained clustering [28]. It is taken as the baseline comparison for any constrained clustering method. To integrate constraints into the clustering process, it modifies the assignment rule of instances to a cluster in such a way that no constraints can be violated. The algorithm halts when a dead-end is reached. It produces partitions which satisfy all constraints when it does not arrive at dead-ends, thus it is a hard constrained method for pairwise constraints. It is a purely constrained clustering algorithm.

• Kmeans: The original Kmeans algorithm proposed in [40]. Neither pairwise constraints nor monotonicity constraints are considered in Kmeans.

• PCSKMeans: The Pairwise Constrained Sparse K-Means algorithm is an extension of the classic Sparse K-Means algorithm that integrates constraints by means of a weighted penalty term [32]. It constitutes the most recent EM-style approach to constrained clustering.

Regarding the parameter setup, all algorithms use an EM scheme to find a partition of the datasets, thus sharing many of their parameters. The $k$ parameter, which indicates the number of clusters of the output partition, is always set to the number of classes for every dataset (in Table 1). The maximum number of iterations allowed before convergence is set to 100 in all cases. The convergence criterion is centroid shifting: the EM optimization procedure is considered to have converged when the average centroid shift is less than $10^{-4}$. Random centroid initialization is used for all algorithms. The P2Clust algorithm allows us to parameterize the computation of its internal $\alpha$ coefficient; this parameter is set to 1.1. The sparsity level of the PCSKMeans algorithm is set to 1.1. All parameters have been set by following the guidelines of the authors, and PCKM-Mono parameters have been chosen through preliminary experimentation. The ultimate purpose of this work is to provide a fair comparison between algorithms, assessing their robustness in a common environment with multiple datasets.
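The shared stopping rule can be illustrated with a plain K-means loop. The sketch below is not the PCKM-Mono implementation; it only shows, under the stated setup, how random centroid initialization, a cap of 100 iterations and the average centroid shift threshold of $10^{-4}$ fit together. The function name and the handling of empty clusters (the previous centroid is kept) are assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-4, seed=None):
    """Plain K-means with the stopping rule described in the parameter setup."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initialization
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # assumption: keep empty cluster's centroid
        # Convergence check: average distance moved by the centroids.
        shift = sum(math.dist(o, n) for o, n in zip(centroids, new_centroids)) / k
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids
```

Constrained or monotonic variants would change the assignment step (distance measure or penalty term) while keeping the same outer loop and stopping rule.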

5 Experimental Results

The experimental results obtained for all datasets and constraint sets are presented in this section. Since non-deterministic procedures are present in every compared method (such as the random initialization of centroids), the average results of 50 runs are presented in Tables 2, 3 and 4, aiming to mitigate the effects that stochastic procedures may cause. Please note that, in cases where the COP-Kmeans algorithm is not able to produce a partition, we assign that particular run the worst possible benchmark values. Cases where no result is reported are cases in which COP-Kmeans was never able to produce an output partition. Recall that ARI is an external quality index to be maximized, while NMI and Unsat are both to be minimized.

Figures 1, 2 and 3 are used to compare average results for all methods, and we refer to them as violinplots. They allow for a quick view of the distribution of results achieved by each method, as they contain a boxplot in addition to the outer violinplot.

By examining the results, it seems obvious that the proposed algorithm, PCKM-Mono, is able to find a balance between constraint satisfaction and the monotonicity of the output partition. Clearly P2Clust, which is a purely monotonic algorithm, always produces the best results with respect to NMI, as shown in Figures 1(b), 2(b), and 3(b). Similarly, Figures 1(a), 2(a), and 3(a) show how purely constrained clustering algorithms (COP-Kmeans and PCSKMeans) produce the best results with respect to Unsat. However, PCKM-Mono is able to produce the best average ARI results (see Figures 1(c), 2(c), and 3(c)), while also achieving better NMI results than purely constrained clustering algorithms, and better Unsat results than purely monotonic clustering algorithms. This is indicative of the viability of the combination of pairwise and monotonic constraints to solve benchmark problems in both areas; moreover, it provides evidence in favor of the proposed EM optimization scheme, which is simple but can be, nonetheless, suitable for this task.

Some of the particular numerical results are worth noting. For example, the COP-Kmeans algorithm achieves near-optimum results for the CS10 constraint set. The reason is that the lower the number of constraints, the easier it is for the algorithm to find a feasible partition, which is usually a very accurate partition in the case of COP-Kmeans. With regard to the results obtained by PCKM-Mono for Unsat and NMI, both are observed to be stable with the increasing amount of constraint-based information, while the ARI is observed to scale with it (although not in a consistent manner). Please note that the results obtained by Kmeans and P2Clust are practically identical regardless of the constraint set, as these methods are not affected at all by constraints.

6 Statistical Analysis of Results

In contrast with NHST, it is possible to create illustrative graphical representations of the results of the Bayesian sign test. To do so, the obtained distribution is sampled to obtain a set of triplets, which are interpreted as barycentric coordinates in an equilateral triangle, thus producing a cloud of points with varying density. This representation is known as a heatmap. Figure 4 shows heatmaps which compare the proposed method PCKM-Mono with the rest of the benchmarked methods for the three measures obtained: ARI, NMI and Unsat. The region of practical equivalence is set to $\text{rope}=[-0.02,0.02]$ for ARI, and to $\text{rope}=[-0.01,0.01]$ for NMI and Unsat, following the guidelines in [41]. The results produced by PCKM-Mono are always taken as $B$ in Equation 8, and $A$ represents the set of results obtained by the compared method. Please note that, as ARI is a measure to maximize, a cloud of points located in the region of the map corresponding to PCKM-Mono would indicate statistically significant differences between the two methods in favor of PCKM-Mono. The opposite situation is found for NMI and Unsat.
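The mapping from a sampled triplet to a point inside the triangle is a standard barycentric-to-Cartesian conversion, sketched below in Python. The vertex layout (left outcome at the bottom-left corner, right outcome at the bottom-right corner, rope at the top) and the function name are assumptions made for this example; the figures may use a different orientation.

```python
import math


def barycentric_to_cartesian(triplet):
    """Map (p_left, p_rope, p_right) to a 2D point in a unit equilateral triangle."""
    p_left, p_rope, p_right = triplet
    total = p_left + p_rope + p_right  # normalize in case of rounding error
    # Assumed vertices: left = (0, 0), right = (1, 0), rope = (0.5, sqrt(3)/2).
    x = (p_right + 0.5 * p_rope) / total
    y = (math.sqrt(3) / 2) * p_rope / total
    return x, y
```

A triplet concentrated on a single outcome lands on the corresponding vertex, and the uniform triplet $(1/3, 1/3, 1/3)$ lands on the centroid of the triangle, so the density of the point cloud near each vertex reflects the posterior probability of that outcome.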

All heatmaps reinforce the conclusions obtained in Section 5. It is clear that PCKM-Mono represents a statistically significant improvement over all compared methods with respect to ARI, except for the comparison against PCSKMeans, which is the most disputed one, with a slight advantage for PCSKMeans. Heatmap 4(d) gives the general advantage to PCSKMeans for the ARI measure, but not by a wide margin, indicating no significant differences in some cases and an advantage of PCKM-Mono in a significant portion of the experiments. When it comes to the comparison concerning NMI and Unsat, conclusions remain unchanged. Heatmap 4(e) confirms the indisputable superiority of purely monotonic algorithms with respect to NMI. However, 4(l) reveals no statistically significant differences between PCKM-Mono and PCSKMeans with respect to Unsat, and 4(j) shows an advantage of PCKM-Mono over COP-Kmeans for the same measure. With this in mind, it is reasonable to assert that, for the experiments conducted in this study, the proposed PCKM-Mono algorithm has the same or better capabilities than previous CC algorithms to include constraints into the clustering process. Please note that, even if PCKM-Mono and PCSKMeans feature disputed results for the ARI measure, PCKM-Mono is indisputably superior to PCSKMeans for NMI and statistically similar to PCSKMeans regarding the Unsat measure; thus it is fair to claim an advantage of PCKM-Mono over PCSKMeans in the general case.

7 A case of study: The Shanghai Ranking dataset

In this section we assess the applicability of our proposal to a real-world problem. The Shanghai Ranking of World Universities (SRWU) dataset has been used before to test the capabilities of monotonic clustering methods, e.g. in [18] the top 100 institutions are used to test the P2Clust method. In our experiments we used the dataset available in this Kaggle repository (https://www.kaggle.com/code/saurav9786/eda-for-university-ranking/data?select=shanghaiData.csv), which contains the SRWU results for years 2005 to 2015. Only the results from the year 2015 are used in our experiments. In our dataset, institutions are ranked from best to worst in chunks of size 50 for the first 100 institutions and in chunks of size 100 for the rest of them, generating a total of 7 classes for the 500 institutions in the dataset. Our goal is to cluster the dataset so that institutions ranked in the same chunk appear in the same cluster in the final partition.

Originally, the dataset has 9 features, although some of them, such as the institution name or its national rank, do not provide any valuable information for clustering methods and can thus be removed. The dataset has 6 features after removing the useless ones, and can be visualized in the pairplot in Figure 5. Observing Figure 5, it seems clear that SRWU is in fact a monotonic dataset, and therefore it has to be addressed with monotonic methods. However, there are some exceptions to this monotonicity. In fact, if we compute the NMI value for the true partition of the dataset, we obtain a value of $0.07$, which indicates that the monotonicity is broken by some instances. This is the reason why constraints can help improve the results: if the dataset were purely monotonic, a method like P2Clust, which is hard constrained for monotonicity constraints, could solve it more accurately.

Scaling, standardization and missing value imputation (basic Knn imputer) steps are performed before applying all 5 clustering methods considered in this study to the dataset. Figure 6 shows the results obtained by all methods for the three quality measures with increasing values of $n_{f}$ in the formula $(n_{f}(n_{f}-1))/2$, thus generating increasing levels of constraint-based information. This is done to observe how the results scale with the number of available constraints. Constraints are generated as is done for benchmark datasets (see Section 4).
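To make the role of the formula $(n_{f}(n_{f}-1))/2$ concrete, the following Python sketch generates pairwise constraints from labeled instances. It assumes a common label-based procedure: $n_{f}$ instances are drawn at random and every pair among them becomes a must-link constraint if the two labels match and a cannot-link constraint otherwise. The exact procedure used in the experiments is the one described in Section 4; the function name and the sampling details here are assumptions made for illustration.

```python
import random
from itertools import combinations


def generate_constraints(labels, n_f, seed=None):
    """Build must-link / cannot-link pairs from n_f randomly chosen labeled instances."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(labels)), n_f)  # assumption: uniform sampling
    must_link, cannot_link = [], []
    for i, j in combinations(chosen, 2):
        if labels[i] == labels[j]:
            must_link.append((i, j))    # same class: instances should share a cluster
        else:
            cannot_link.append((i, j))  # different class: instances should be separated
    return must_link, cannot_link
```

With this scheme the total number of constraints is exactly $(n_{f}(n_{f}-1))/2$, so increasing $n_{f}$ increases the amount of constraint-based information quadratically.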

In Figure 6(a) it can be clearly observed that PCKM-Mono is the best option for obtaining results whose quality scales with the number of constraints in the SRWU partitioning problem. It is followed by the purely monotonic P2Clust method, which maintains stable results, as it does not consider constraints. It is also interesting to note how COP-Kmeans scales its results by an even greater factor than PCKM-Mono, although achieving worse results, as it cannot deal with the monotonicity of the data.

Regarding the results for Unsat, presented in Figure 6(b), we can observe how the Unsat values produced by PCKM-Mono decrease as the number of constraints increases. This is indicative of constraints helping the clustering process to find the true shape of the clusters, therefore making it easier for the method to satisfy a higher number of them. The rest of the methods maintain a stable Unsat, with COP-Kmeans always producing a value of 0 for this measure (as it can never generate partitions which violate any constraint) and with PCSKMeans featuring the worst value for it. This is indicative of the method not being suitable at all for the problem, as even non-constrained non-monotonic methods such as Kmeans are able to obtain better Unsat results.

In Figure 6(c) we can observe one of the most interesting effects of constraints. Please note that the NMI values for PCKM-Mono increase as the number of constraints increases. This result can seem counterintuitive, as one could expect them to decrease. However, the NMI is actually shifting towards the NMI value produced by the true labels of SRWU ($0.07$), thus being more accurate in practice. With regard to non-constrained methods, they maintain a stable NMI value (as expected), with P2Clust always producing an NMI of 0, as it can never generate partitions which violate monotonicity. For the non-monotonic constrained clustering methods, it can be observed that the influence of constraints in COP-Kmeans is enough to divert clusters from the hyperspherical shape produced by the Euclidean distance, thus generating an acceptable NMI value, which is not the case for PCSKMeans.

All of these results support the hypothesis that pairwise constraints and monotonicity constraints benefit from each other when combined. Please note that PCKM-Mono would produce the same NMI values as P2Clust if it were not for pairwise constraints, which have proved to divert the method from this trend and towards more accurate NMI results.

8 Conclusion

In this study, the first method which addresses Monotonic Constrained Clustering (MCC) is proposed: Pairwise Constrained K-Means - Monotonic (PCKM-Mono). An expectation-minimization scheme is used to locally optimize a hybrid objective function, integrating a monotonic distance metric and a pairwise constraint-based penalty term. The experimental results obtained from a variety of datasets and the subsequent statistical analysis confirm the viability of the proposed method when compared with purely monotonic and purely pairwise constrained clustering techniques. Even if PCKM-Mono obtains results similar to those obtained by previous approaches for specific monotonicity and pairwise constraint satisfaction, there is strong statistical evidence in favor of PCKM-Mono regarding general clustering quality measures.

Acknowledgements

Our work has been supported by the research projects PID2020-119478GB-I00, A-TIC-434-UGR20 and PREDOC_01648.

Conflict of interest

The authors declare that there is no conflict of interest.

Bibliography

[1] Absalom E Ezugwu, Abiodun M Ikotun, Olaide O Oyelade, Laith Abualigah, Jeffery O Agushaka, Christopher I Eke, and Andronicus A Akinyelu. A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110:104743, 2022.
[2] Xiaosha Cai, Dong Huang, Guang-Yu Zhang, and Chang-Dong Wang. Seeking commonness and inconsistencies: A jointly smoothed approach to multi-view subspace clustering. Information Fusion, 91:364–375, 2023.
[3] Jonatan Enes, Roberto R Expósito, José Fuentes, Javier López Cacheiro, and Juan Touriño. A pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs. Information Fusion, 93:1–20, 2023.
[4] Mohamed Abd Elaziz, Mohammed AA Al-Qaness, Esraa Osama Abo Zaid, Songfeng Lu, Rehab Ali Ibrahim, and Ahmed A. Ewees. Automatic clustering method to segment COVID-19 CT images. PLoS One, 16(1):e0244416, 2021.
[5] Li Guo, Pengfei Shi, Long Chen, Chenglizhao Chen, and Weiping Ding. Pixel and region level information fusion in membership regularized fuzzy clustering for image segmentation. Information Fusion, 92:479–497, 2023.
[6] H Vani, MA Anusuya, and ML Chayadevi. Fuzzy clustering algorithms-comparative studies for noisy speech signals. Ictact J. Soft Comput., 9(3):1920–1926, 2019.
[7] Jun Wang, Chang Tang, Zhenglai Li, Xinwang Liu, Wei Zhang, En Zhu, and Lizhe Wang. Hyperspectral band selection via region-aware latent features fusion based clustering. Information Fusion, 79:162–173, 2022.
[8] Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. Machine Learning, 109(2):373–440, 2020.