Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture Models with Outliers

Yijia Zhou; Kyle A. Gallivan; Adrian Barbu

arXiv:2302.14599·stat.ML·October 16, 2024

TL;DR

This paper presents a scalable, theoretically guaranteed clustering algorithm robust to outliers, suitable for large datasets like ImageNet, and effective as an initialization for k-means.

Contribution

It introduces a new loss minimization-based clustering method with provable guarantees that scales efficiently and handles outliers in large datasets.

Findings

01 High accuracy with high probability under certain assumptions

02 Effective as an initialization for k-means clustering

03 Outperforms classic methods in speed and accuracy on large datasets

Tables

Table 1: Accuracy (%) of clustering algorithms on five image datasets.

Method	MNIST	CIFAR-10	CIFAR-100	ImageNet val	ImageNet
CL	26.50	10.05	10.29	29.83	-
SC	82.46	63.47	25.17	43.96	-
EM	77.03	60.29	34.21	43.07	-
TD	73.38	64.76	37.55	-	-
t-SNE+$k$-means++	90.83	75.45	39.97	50.81	-
$k$-means++	74.99	58.06	33.75	44.73	47.71
SCRLM	58.17	36.96	20.17	36.16	34.01
SCRLM+$k$-means	80.06	64.00	36.66	47.24	48.61

Table 2: Computation time (s) of clustering algorithms on five image datasets.

Method	MNIST	CIFAR-10	CIFAR-100	ImageNet val	ImageNet
CL	328	1,285	1,286	252	-
SC	398	2,178	2,235	1,621	-
EM	21.3	129	1,944	2,658	-
TD	21.7	1,471	1,594	-	-
t-SNE+$k$-means++	478	2,255	2,294	421	-
$k$-means++	5.61	23.8	207	52.1	10,005
SCRLM	0.46	3.25	28.2	40.6	1,269
SCRLM+$k$-means	10.5	33.6	327	67.8	2,610

Equations

$$p(\mathbf{x}\mid\Theta)=\sum_{i=1}^{m}w_{i}\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{i},\Sigma_{i})+w_{-1}O(\mathbf{x})$$

$$\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{i},\Sigma_{i})=\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{i},\sigma_{i}^{2})=\frac{1}{(2\pi)^{p/2}\,|\sigma_{i}^{2}I_{p}|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_{i})^{T}(\sigma_{i}^{2}I_{p})^{-1}(\mathbf{x}-\boldsymbol{\mu}_{i})\right\}$$

$$L(\mathbf{x};\rho)=\sum_{i=1}^{N}\ell(\mathbf{x}_{i}-\mathbf{x};\rho)=\sum_{i=1}^{N}\min\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}\|^{2}}{p\rho^{2}}-F,\,0\right)$$

$$\ell(\mathbf{d};\rho)=\min\left(\frac{\|\mathbf{d}\|^{2}}{p\rho^{2}}-F,\,0\right).$$

$$\boldsymbol{\mu}=\mathbf{x}_{k},\quad\text{where }k=\operatorname*{argmin}_{i\in S}L(\mathbf{x}_{i};\rho).$$

$$1-10N^{2}\exp\{-p/128\}-m\exp\{-na/m\}-2m\exp\{-p/128\}-m\exp\{-a(N-1)/m\}$$

$$p>128\left(2\log N+\log\frac{40}{\delta}\right),$$

$$p>128\left(\log m+\log\frac{8}{\delta}\right),$$

$$n>\frac{m}{a}\left(\log m+\log\frac{4}{\delta}\right),$$

$$N>\frac{m}{a}\left(\log m+\log\frac{4}{\delta}\right)+1,$$

$$p>\left\lceil 128\left(2\log N+\log\frac{40}{\delta}\right)\right\rceil$$

$$n>\left\lceil\frac{m}{a}\left(\log m+\log\frac{4}{\delta}\right)\right\rceil.$$

$$p>\left\lceil 128\left(\log m+\log\frac{8}{\delta}\right)\right\rceil$$

$$N>\left\lceil\frac{m}{a}\left(\log m+\log\frac{4}{\delta}\right)+1\right\rceil.$$

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}Z_{i}^{2}-1\right|\geq\epsilon\right)\leq 2\exp\{-n\epsilon^{2}/8\}.$$

$$P\left(\left|\frac{1}{p}\|\mathbf{x}\|^{2}-1\right|\geq\epsilon\right)\leq 2\exp\{-p\epsilon^{2}/8\}.$$

$$P\left(\left|\frac{1}{2p}\|\mathbf{x}-\mathbf{y}\|^{2}-1\right|\geq\epsilon\right)\leq 2\exp\{-p\epsilon^{2}/8\}.$$

$$\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}>1.5p,\quad\mathbf{x}_{i},\mathbf{x}_{k}\in H.$$

$$P\left(\left|\frac{\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}}{2p}-1\right|\geq\epsilon\right)\leq 2\exp\{-p\epsilon^{2}/8\},$$

$$P\left(\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}\leq 2p(1-\epsilon)\right)\leq 2\exp\{-p\epsilon^{2}/8\}.$$

$$\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}>2p(1-\epsilon).$$

$$\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}>1.5p.$$

$$\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}<2.5p\sigma_{j}^{2},\quad\mathbf{x}_{i},\mathbf{x}_{k}\in S_{j}.$$

$$P\left(\left|\frac{\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}}{2p\sigma_{j}^{2}}-1\right|\geq\epsilon\right)\leq 2\exp\{-p\epsilon^{2}/8\},$$

$$P\left(\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}\geq 2p\sigma_{j}^{2}(1+\epsilon)\right)\leq 2\exp\{-p\epsilon^{2}/8\}.$$

$$P\left(\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}\geq 2.5p\sigma_{j}^{2}\right)\leq 2\exp\{-p/128\}.$$

$$\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}<2.5p\sigma_{j}^{2}.$$

$$\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}>p(1.5+0.75\sigma_{j}^{2}),\quad\mathbf{x}_{i}\in H,\ \mathbf{x}_{k}\in S_{j}.$$

$$E\left(\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}\right)=E\left[\left(\boldsymbol{\mu}_{j}+\boldsymbol{\epsilon}_{1}\sqrt{\sigma_{j}^{2}+1}\right)^{T}\left(\boldsymbol{\mu}_{j}+\boldsymbol{\epsilon}_{1}\sqrt{\sigma_{j}^{2}+1}\right)\right]=E\left(\|\boldsymbol{\mu}_{j}\|^{2}\right)+(\sigma_{j}^{2}+1)E\left(\boldsymbol{\epsilon}_{1}^{T}\boldsymbol{\epsilon}_{1}\right),$$

$$E\left(\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}\right)=p+(\sigma_{j}^{2}+1)E\left(\|\boldsymbol{\epsilon}_{1}\|^{2}\right)=(2+\sigma_{j}^{2})p.$$

Full text

Yijia Zhou, Kyle A. Gallivan

Department of Mathematics, Florida State University

and

Adrian Barbu

Department of Statistics, Florida State University

Abstract

Clustering is a widely used technique with a long and rich history in a variety of areas. However, most existing algorithms do not scale well to large datasets, or are missing theoretical guarantees of convergence. This paper introduces a provably robust clustering algorithm based on loss minimization that performs well on Gaussian mixture models with outliers. It provides theoretical guarantees that the algorithm obtains high accuracy with high probability under certain assumptions. Moreover, it can also be used as an initialization strategy for $k$-means clustering. Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm when clustering a large number of clusters, and a $k$-means algorithm initialized by the algorithm outperforms many of the classic clustering methods in both speed and accuracy, while scaling well to large datasets such as ImageNet.

Keywords: $k$-means, Gaussian mixture models, clustering

1 Introduction

Clustering is an important unsupervised learning technique with applications in many areas including information retrieval (Jardine and van Rijsbergen, 1971), image segmentation (Coleman and Andrews, 1979), pattern recognition (Diday et al., 1981), data mining (Mirkin, 2005), disease diagnosis (Alashwal et al., 2019), and more. One of the most commonly used non-probabilistic clustering approaches is the $k$-means algorithm (Lloyd, 1982). Probabilistic clustering models, which can be characterized as more sophisticated versions of $k$-means, are based on Gaussian Mixture Models (GMM) and yield more flexibility than $k$-means. With GMMs, it is assumed that the data points are Gaussian distributed; this is a less restrictive assumption than saying they are circular around a mean. In this way, a mean vector and a covariance matrix can be used to describe a cluster.

The motivation for this work comes from the problem of object recognition from images. An image usually contains one or more regions/objects of interest and the rest is meaningless background. This paper introduces a Gaussian mixture model with outliers where the Gaussian mixture components represent the objects of interest (positives), and the outliers (negatives) represent the background images that do not cluster together. Since data is usually standardized to zero mean and unit standard deviation, it is assumed that the outliers come from a zero-mean Gaussian distribution.

In this paper, we are only interested in scalable clustering - methods that scale well to data with millions of observations, thousands of dimensions and a large number (e.g. thousands) of clusters. Furthermore, the proposed algorithm has theoretical guarantees of convergence of the estimated clusters to the true cluster labels.

The main contributions of this paper are summarized as follows:

1. The Gaussian mixture model with outliers is introduced as a simple framework for image classification problems in computer vision.

2. A novel clustering algorithm, Scalable Clustering by Robust Loss Minimization (SCRLM), is developed for the model. The basic idea of SCRLM is to find the positive clusters as local minima of a robust loss function that is non-zero only within a certain radius from the cluster centers and zero everywhere else, and extract the clusters one-by-one.

3. Theoretical guarantees are given that SCRLM is able to correctly cluster all the inliers and detect all the outliers with high probability under certain assumptions.

4. The performance predictions are validated with experiments using simulated data and real data. The simulation results indicate that when the assumptions are met, SCRLM outperforms other algorithms such as $k$-means++, EM, t-SNE and spectral clustering.

5. Experiments on real data indicate that SCRLM is very effective when the number of clusters and the data dimension are large, and it can be used as an initialization method for $k$-means clustering, outperforming $k$-means++ in accuracy and computation time.

The rest of the paper is organized as follows. In Section 2, an overview of the literature on various existing clustering methods is given. Section 3 develops a novel algorithm, SCRLM, to solve the clustering problem in the Gaussian mixture model with outliers, and derives the theoretical guarantees. In Section 4, experiments on both synthetic and real data show that the proposed algorithm is scalable, efficient and accurate. Section 5 summarizes the findings and concludes the paper with a discussion of future research work.

2 Literature Reviews

The study of Gaussian mixture models can be traced back to Pearson (1894). The idea of using Gaussian mixtures in unsupervised learning was popularized by Duda et al. (1973). The Expectation Maximization (EM) algorithm (Dempster et al., 1977) was one of the first clustering algorithms for GMM. Xu and Jordan (1996) analyzed the convergence of EM for well-separated Gaussian mixtures. Dasgupta and Schulman (2007) proposed a two-round variant of the EM algorithm and showed that, with high probability, it can recover the parameters of the Gaussians to near-optimal precision. In recent years, approaches have been proposed to improve convergence guarantees and applied to different kinds of GMMs. Dwivedi et al. (2018) provided theoretical guarantees in two classes of misspecified mixture models and Segol and Nadler (2021) improved sample size requirements for accurate estimation by EM and gradient EM.

Tensor Decomposition, a spectral decomposition technique, also plays an important role in learning GMMs. Hsu and Kakade (2013) developed a method based on moments of up to third order and provided theoretical guarantees that non-degenerate mixtures of spherical Gaussians can be learned in polynomial time without any separation condition.

The algorithms above, designed for GMMs, are categorized as distribution-based clustering. Aside from those methods, there are other clustering methods that do not use statistical distributions to cluster the data objects.

Hierarchical clustering (Johnson, 1967), also known as connectivity-based clustering, creates a complete dendrogram of the data. It is either agglomerative (bottom-up) or divisive (top-down). In agglomerative hierarchical clustering, the similarities between clusters are measured by distances between points, which is referred to as linkage. In complete linkage, the distance between the farthest points is taken as the inter-cluster distance, which is less susceptible to outliers than single linkage (Tan et al., 2016). The main disadvantage of hierarchical clustering is that, due to its high time and space complexity, it is not suitable for large-scale datasets.

In contrast to hierarchical clustering, $k$-means (Lloyd, 1982) is one of the most famous centroid-based clustering algorithms, which works by minimizing the squared distances between every point and its nearest centroid. Since $k$-means is an iterative algorithm involving initialization, clustering and centroid updates, proper initialization techniques such as Maxmin (Gonzalez, 1985), Refine (Bradley and Fayyad, 1998) and $k$-means++ (Arthur and Vassilvitskii, 2006) have been proposed to improve the clustering results. Fränti and Sieranoja (2019) demonstrated that for well-separated clusters, the performance of $k$-means depends completely on the goodness of initialization and $k$-means++ is the best one among those methods.

Spectral clustering (Donath and Hoffman, 1973; Shi and Malik, 2000; Meilă and Shi, 2001; Ng et al., 2002; Von Luxburg, 2007) is a graph-based clustering algorithm that utilizes the eigenvectors of the adjacency matrix for dimension reduction. It is simple to implement but computationally expensive unless the graph is sparse and the similarity matrix can be efficiently constructed. Vempala and Wang (2004) investigated the theoretical performance of spectral clustering in the isotropic Gaussian mixture model and proved that with high probability, exact recovery of the underlying cluster structure was achieved under a strong separation condition. Löffler et al. (2021) showed that spectral clustering is minimax optimal in Gaussian mixture models with isotropic covariance, when the number of clusters is fixed and the signal-to-noise ratio is large enough.

$t$-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten and Hinton, 2008) is a technique for visualizing high-dimensional data that is often applied as a processing step before clustering. $k$-means++ and other clustering algorithms can be applied to the low-dimensional feature space obtained from t-SNE.

In summary, $k$-means++ has high scalability with theoretical guarantees but it does not perform well in high dimension and is very sensitive to outliers. Spectral clustering and tensor decomposition are not sensitive to outliers but do not perform well on large-scale and high-dimensional data. The proposed SCRLM method is a novel algorithm that is efficient and scalable, has strong theoretical guarantees in dealing with outliers, and is suitable for high-dimensional data.

3 Scalable Clustering by Robust Loss Minimization

Given a set $X=\{\mathbf{x}_{i}\in\mathbb{R}^{p},i=1,...,N\}$ of $N$ points from a Gaussian mixture model with outliers containing $m$ Gaussians, the goal is to group these points into $m$ compact subsets. In this paper, we only deal with large and high-dimensional data and a reasonably large number of clusters, specifically, $N\approx 10^{6},p\approx 10^{3},m\approx 10^{3}$ .

A Gaussian mixture model with outliers is a weighted sum of $m$ component Gaussian densities and outliers as given by the equation,

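$$p(\mathbf{x}\mid\Theta)=\sum_{i=1}^{m}w_{i}\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{i},\Sigma_{i})+w_{-1}O(\mathbf{x}),\tag{1}$$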

where $\mathbf{x}\in\mathbb{R}^{p}$ is a $p$ -dimensional data point, $w_{i},i\in\{-1,1,2,\ldots,m\}$ , are the mixture weights, ${\cal N}\left(\mathbf{x}\mid\boldsymbol{\mu}_{i},\Sigma_{i}\right),i=1,\ldots,m$ , are the component Gaussian densities and $O(\mathbf{x})$ is the distribution of the outliers.

It is assumed that each component density is a $p$ -variate isotropic Gaussian function of the form,

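$$\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{i},\Sigma_{i})=\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{i},\sigma_{i}^{2})=\frac{1}{(2\pi)^{p/2}\,|\sigma_{i}^{2}I_{p}|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_{i})^{T}(\sigma_{i}^{2}I_{p})^{-1}(\mathbf{x}-\boldsymbol{\mu}_{i})\right\},\tag{2}$$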

with mean vector $\boldsymbol{\mu}_{i}$ and covariance matrix $\Sigma_{i}=\sigma_{i}^{2}I_{p}$ .

Let $l(\mathbf{x})\in\{-1,1,2,\ldots,m\}$ be the label of observation $\mathbf{x}$ , i.e. the mixture component from which it was generated. The samples $\mathbf{x}_{i}$ with $l(\mathbf{x}_{i})>0$ are called positives and the outliers (with $l(\mathbf{x}_{i})=-1$ ) are also called negatives.

Inspired by real data examples, where the observations are standardized feature vectors generated by a convolutional neural network (CNN) from real images of certain objects or background, in this paper, it is assumed that the $m$ centers ${\boldsymbol{\mu}}_{i},i=1,...m$ and all outliers are generated from $O(\mathbf{x})=\mathcal{N}\left(\boldsymbol{0},I_{p}\right)$ and the label $i$ positives are generated with frequency $w_{i}$ from $\mathcal{N}({\boldsymbol{\mu}}_{i},\sigma_{i}^{2})$ , where $\sigma_{i}<1,\forall i$ .

The structure of the Gaussian mixture model with outliers used in this paper is shown in Figure 1.

The problem of interest is to cluster a set of unlabeled observations generated from such a Gaussian mixture model with outliers and recover the labels $l(\mathbf{x}_{i})$ as well as the mixture parameters $w_{i},{\boldsymbol{\mu}}_{i},\sigma_{i}$ .
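A minimal Python sketch of the generative model just described may help make the setup concrete. It assumes NumPy; the function name `sample_gmm_with_outliers` and its arguments are hypothetical and do not come from the paper, whose experiments were run in MATLAB.

```python
import numpy as np

def sample_gmm_with_outliers(N, p, m, sigmas, weights, w_out, seed=0):
    """Draw N points from a GMM with outliers (illustrative sketch).

    Returns (X, labels, centers) with labels in {-1, 1, ..., m}.
    """
    rng = np.random.default_rng(seed)
    sigmas = np.asarray(sigmas, dtype=float)
    # Cluster centers mu_i ~ N(0, I_p), as assumed in the text.
    centers = rng.standard_normal((m, p))
    # Mixture weights: w_{-1} for outliers, then w_1..w_m for positives.
    probs = np.concatenate(([w_out], weights)).astype(float)
    probs = probs / probs.sum()
    label_values = np.concatenate(([-1], np.arange(1, m + 1)))
    labels = rng.choice(label_values, size=N, p=probs)
    # Start from N(0, I_p) noise; outliers keep these draws unchanged.
    X = rng.standard_normal((N, p))
    pos = labels > 0
    idx = labels[pos] - 1
    # Positives: mu_i + sigma_i * noise, i.e. N(mu_i, sigma_i^2 I_p).
    X[pos] = centers[idx] + sigmas[idx, None] * X[pos]
    return X, labels, centers
```

For example, `sample_gmm_with_outliers(1000, 512, 3, [0.1, 0.2, 0.25], [0.2, 0.2, 0.1], 0.5)` would draw about half of the points as outliers.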

3.1 Robust Loss Function

A loss minimization approach is taken, using the following loss function,

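$$L(\mathbf{x};\rho)=\sum_{i=1}^{N}\ell(\mathbf{x}_{i}-\mathbf{x};\rho)=\sum_{i=1}^{N}\min\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}\|^{2}}{p\rho^{2}}-F,\,0\right),\tag{3}$$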

where the per-observation loss, illustrated in Figure 2, is

$$\ell(\mathbf{d};\rho)=\min\left(\frac{\|\mathbf{d}\|^{2}}{p\rho^{2}}-F,\,0\right).\tag{4}$$

The loss function $\ell(\mathbf{d};\rho)$ is zero outside a ball of radius $R_{\rho}=\rho\sqrt{pF}$ . The constant $F$ is set in this paper to $F=2.5$ .
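The two loss functions can be sketched in a few lines of Python. This is an illustrative NumPy version, not the authors' MATLAB code; the function names are hypothetical.

```python
import numpy as np

F = 2.5  # constant used in the paper

def per_observation_loss(d, rho, F=F):
    """ell(d; rho) = min(||d||^2 / (p rho^2) - F, 0), one value per row of d."""
    d = np.atleast_2d(d)
    p = d.shape[1]
    return np.minimum(np.sum(d**2, axis=1) / (p * rho**2) - F, 0.0)

def total_loss(x, X, rho, F=F):
    """L(x; rho): sum of per-observation losses over all rows of X."""
    return per_observation_loss(X - x, rho, F).sum()
```

The per-observation loss equals $-F$ at $\mathbf{d}=\mathbf{0}$ and reaches zero exactly when $\|\mathbf{d}\|^{2}=\rho^{2}pF$, which is where the radius $R_{\rho}=\rho\sqrt{pF}$ comes from.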

The idea of the algorithm is to find the cluster centers $\boldsymbol{\mu}$ as local minima of the loss function (3). For computational reasons, the centers ${\boldsymbol{\mu}}$ are sought among a subsample $S\subset\{1,...,N\}$ of the observations $\mathbf{x}_{i},i=1,...,N$ ,

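$$\boldsymbol{\mu}=\mathbf{x}_{k},\quad\text{where }k=\operatorname*{argmin}_{i\in S}L(\mathbf{x}_{i};\rho).$$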

After one cluster center ${\boldsymbol{\mu}}=\mathbf{x}_{k}$ has been found, all samples from $S$ within the radius $\rho\sqrt{pF}$ of ${\boldsymbol{\mu}}$ are considered as belonging to this cluster and are removed. The process is repeated until $\min\limits_{j\in S}L\left(\mathbf{x}_{j};\rho\right)=-F$ .

The process is illustrated in Figure 3. Suppose there are two clusters and some outliers, as shown in Figure 3a). The first cluster center is found as the sample with minimum loss (3) among all subsamples. Then, all subsamples within the given radius $R=\rho\sqrt{pF}$ of this center are identified as belonging to this cluster (Figure 3b)) and are removed. Then the process is repeated to find another cluster center (Figure 3c)) and all subsamples within the same radius are removed. Now all the subsamples have loss values $-F$ , which means there are no clusters left, so the remaining subsamples are negatives (Figure 3d)).

Once all the cluster centers are found, each observation is assigned the label of its nearest center, based on the norm distance. If the nearest distance is greater than $\rho\sqrt{pF}$ , the observation is classified as an outlier. The procedure is summarized in Algorithm 1.
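The procedure described above can be sketched as follows. Algorithm 1 itself is not reproduced in this text, so this is a reconstruction from the prose under stated assumptions: the losses of the subsamples are computed once over all $N$ observations, the Euclidean norm is used, and the stopping test uses a small numerical tolerance. The function names `sq_dists` and `scrlm_sketch` are hypothetical.

```python
import numpy as np

def sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.maximum(d2, 0.0)  # clip tiny negatives from round-off

def scrlm_sketch(X, rho, n, T=None, F=2.5, seed=0):
    """Return (labels, centers); labels are in {-1, 1, 2, ...}."""
    N, p = X.shape
    R2 = rho**2 * p * F  # squared radius R_rho^2
    rng = np.random.default_rng(seed)
    S = rng.choice(N, size=min(n, N), replace=False)  # subsample indices
    # Loss L(x_j; rho) of every subsample, summed over all N observations.
    losses = np.minimum(sq_dists(X[S], X) / (p * rho**2) - F, 0.0).sum(axis=1)
    active = np.ones(len(S), dtype=bool)
    centers = []
    T = len(S) if T is None else T
    while active.any() and len(centers) < T:
        cand = np.flatnonzero(active)
        k = cand[np.argmin(losses[cand])]
        # A loss of -F means the point has no neighbor inside the radius.
        if losses[k] >= -F - 1e-9:
            break
        mu = X[S[k]]
        centers.append(mu)
        # Remove all subsamples within the radius of the new center.
        near = sq_dists(X[S], mu[None, :])[:, 0] <= R2
        active &= ~near
    centers = np.array(centers).reshape(-1, p)
    labels = np.full(N, -1)
    if len(centers):
        d2 = sq_dists(X, centers)
        nearest = d2.argmin(axis=1)
        inside = d2[np.arange(N), nearest] <= R2
        labels[inside] = nearest[inside] + 1  # outliers keep label -1
    return labels, centers
```

The sketch builds the full $n\times N$ distance matrix in memory, which is only practical for small examples; a large-scale implementation would process the data in blocks.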

3.2 Theoretical Guarantees

First, we summarize the notations used in this paper and the main assumption used in the derivation of the main result.

• $N$: the total number of observations

• $S$: a subset of all the observations

• $n$: the cardinality of $S$, $n=|S|$

• $p$: the dimension of the observations $\mathbf{x}_{i}\in\mathbb{R}^{p}$

• $l(\mathbf{x})$: the true label (cluster assignment) of observation $\mathbf{x}$

• $m$: the true number of positive clusters

• $T$: the number of iterations (maximum number of clusters desired) in Algorithm 1

• $S_{k}$: the elements of $S$ with label $k$, $S_{k}=\{\mathbf{x}\in S,l(\mathbf{x})=k\}$

• $w_{k}$: the weight of positive cluster $k$, $k=\overline{1,m}$

• $w_{-1}$: the weight of negatives (observations $\mathbf{x}$ with $l(\mathbf{x})=-1$)

• ${\boldsymbol{\mu}}_{k},\sigma_{k}$: true mean and standard deviation of positive cluster $k$

• $\sigma_{max}=\max_{k\geq 0}\sigma_{k}$: the maximum standard deviation among all positive clusters

• $H$: the set of all the negatives

• $F$: a constant in the loss function (3); in this paper, $F=2.5$

• $\rho$: the bandwidth parameter in the loss function (3)

• $R_{\rho}=\rho\sqrt{pF}$: the radius of the support of the loss function (3)

• $L(\mathbf{x};\rho)$: the loss function (3)

• $\ell(\mathbf{x};\rho)$: the per-observation loss function (4)

The following is the main assumption that is needed for the theoretical guarantees.

Assumption 1.

$\sigma_{max}\leq\rho<\sqrt{0.6}$ , where $\rho$ is the bandwidth parameter for the loss function (3).
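One way to see where the two sides of Assumption 1 come from is to compare the squared radius $R_{\rho}^{2}=\rho^{2}pF=2.5p\rho^{2}$ with the high-probability pairwise distance bounds that appear among the paper's equations: squared distances below $2.5p\sigma_{j}^{2}$ inside positive cluster $j$, above $1.5p$ between two negatives, and above $p(1.5+0.75\sigma_{j}^{2})$ between a negative and a positive. This is a reading of the assumption under those bounds, not a substitute for the proof in Appendix D:

```latex
\begin{align*}
\mathbf{x}_i,\mathbf{x}_k\in S_j:&\quad
  \|\mathbf{x}_i-\mathbf{x}_k\|^2 < 2.5\,p\,\sigma_j^2
  \le 2.5\,p\,\sigma_{max}^2 \le 2.5\,p\,\rho^2 = R_\rho^2,\\
\mathbf{x}_i,\mathbf{x}_k\in H:&\quad
  \|\mathbf{x}_i-\mathbf{x}_k\|^2 > 1.5\,p = 2.5\,p\cdot 0.6
  > 2.5\,p\,\rho^2 = R_\rho^2,\\
\mathbf{x}_i\in H,\ \mathbf{x}_k\in S_j:&\quad
  \|\mathbf{x}_i-\mathbf{x}_k\|^2 > p\,(1.5+0.75\,\sigma_j^2)
  > 1.5\,p > R_\rho^2.
\end{align*}
```

Read this way, the lower bound $\rho\geq\sigma_{max}$ keeps every pair from the same positive cluster inside the ball of radius $R_{\rho}$, and the upper bound $\rho<\sqrt{0.6}$ keeps the negatives outside it.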

Then, we obtain the following theorem guaranteeing that Algorithm 1 (SCRLM) can detect all outliers and cluster all positives correctly with high probability.

Theorem 1.

Given $N$ samples from a GMM with outliers, with $w_{i}\geq a/m,i=\overline{1,m}$ for some $a>0$ and $\sigma_{max}\leq\rho<\sqrt{0.6}$ , then Algorithm 1 (SCRLM) using $|S|=n$ subsamples has $100\%$ accuracy with probability at least

$$1-10N^{2}\exp\{-p/128\}-m\exp\{-na/m\}-2m\exp\{-p/128\}-m\exp\{-a(N-1)/m\}.$$

The proof of this theorem is given in Appendix D.
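To get a feel for the size of this bound, it can be evaluated numerically. The sketch below simply transcribes the expression $1-10N^{2}e^{-p/128}-me^{-na/m}-2me^{-p/128}-me^{-a(N-1)/m}$ into Python; the function name is hypothetical.

```python
import math

def theorem1_lower_bound(N, p, m, n, a):
    """Probability lower bound of Theorem 1 (negative values are vacuous)."""
    return (1.0
            - 10 * N**2 * math.exp(-p / 128)   # pairwise-distance term
            - m * math.exp(-n * a / m)         # subsample-coverage term
            - 2 * m * math.exp(-p / 128)
            - m * math.exp(-a * (N - 1) / m))
```

With the values used later in the simulations ($N=20{,}000$, $p=3700$, $m=3$, $n=27$, $a=0.8$) the bound exceeds $0.99$, while for small $p$ the term $10N^{2}e^{-p/128}$ dominates and the bound becomes vacuous.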

Based on Theorem 1, we have the following Corollary 1 on the theoretical bounds for parameters $p$ , $n$ and $N$ .

Corollary 1.

Given $N$ samples from a GMM with outliers, with $w_{i}\geq a/m,i=\overline{1,m}$ for some $a>0$ and $\sigma_{max}\leq\rho<\sqrt{0.6}$ , for any $\delta>0$ , if

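$$\begin{aligned}p&>128\left(2\log N+\log\frac{40}{\delta}\right),\\p&>128\left(\log m+\log\frac{8}{\delta}\right),\\n&>\frac{m}{a}\left(\log m+\log\frac{4}{\delta}\right),\\N&>\frac{m}{a}\left(\log m+\log\frac{4}{\delta}\right)+1,\end{aligned}$$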

then Algorithm 1 (SCRLM) using $|S|=n$ subsamples will have $100\%$ accuracy with probability at least $1-\delta$ .

The proof of this corollary is given in Appendix D.
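Each condition of Corollary 1 makes one of the four error terms of Theorem 1 smaller than $\delta/4$; for instance, $10N^{2}e^{-p/128}<\delta/4$ is equivalent to $p>128(2\log N+\log\frac{40}{\delta})$, so the logarithms are natural. A short Python sketch (with a hypothetical function name) evaluates the thresholds for given $N$, $m$, $a$ and $\delta$:

```python
import math

def corollary1_thresholds(N, m, a, delta):
    """Thresholds that p, n and N must exceed in Corollary 1."""
    p_min = 128 * max(2 * math.log(N) + math.log(40 / delta),
                      math.log(m) + math.log(8 / delta))
    n_min = (m / a) * (math.log(m) + math.log(4 / delta))
    N_min = n_min + 1
    return p_min, n_min, N_min
```

For $N=20{,}000$, $m=3$, $a=0.8$ and $\delta=0.01$, the threshold on $p$ is just under $3600$ and the threshold on $n$ rounds up to $27$, which is consistent with the value $p=3700$ used in the simulations of Section 4.1.1.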

3.3 Computational Complexity

Computing $L(\mathbf{x}_{j};\rho),j\in S$ (Step 4 of Algorithm 1) is $O(nNp)$ . Each iteration of steps 6-12 is $O(np)$ , so steps 5-13 take $O(nmp)$ . Similarly, steps 14-21 take $O(Nmp)$ . Therefore, the computational complexity of Algorithm 1 is $O(nNp+nmp+Nmp)=O(nNp+Nmp)$ , since $n\leq N$ . From Corollary 1, one can see that the subsample size $n$ should be chosen on the order of $O(m\log m)$ . Therefore, the computational complexity of Algorithm 1 is $O(mpN\log m)$ , which is linear in the dimension $p$ and the number of observations $N$ and log-linear in the number of clusters $m$ .

4 Experiments

This section presents an empirical evaluation of the performance of SCRLM using synthetic data and real datasets from computer vision. First, the tightness of the parameter bounds given in the theoretical guarantees is evaluated using synthetic data. Then, the effectiveness of SCRLM in real applications is evaluated using five real image datasets.

To compare the performance of SCRLM on synthetic and real datasets, two evaluation measures are defined for a true labeling vector $\mathbf{l}\in\mathbb{Z}^{N}$ and an obtained labeling vector $\hat{\mathbf{l}}\in\mathbb{Z}^{N}$ :

1. $\operatorname{accuracy}(\mathbf{l},\hat{\mathbf{l}})=\frac{1}{N}\max_{\pi\in P}|\pi(\hat{\mathbf{l}})\cap\mathbf{l}|$

2. $\operatorname{purity}(\mathbf{l},\hat{\mathbf{l}})=\frac{1}{N}\sum_{i=1}^{T}\max_{j}|\hat{\mathbf{l}}^{-1}(i)\cap\mathbf{l}^{-1}(j)|$

where $P$ is the set of all permutations of $\{1,...,m\}$ , and $\mathbf{l}^{-1}(j)=\{i,\mathbf{l}_{i}=j\}$ . The accuracy is computed in polynomial time using the Hungarian algorithm.
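Both measures can be computed from a contingency table of predicted versus true labels. The sketch below assumes NumPy and SciPy and uses `scipy.optimize.linear_sum_assignment` as the Hungarian-type solver; the function names are hypothetical, and treating the outlier label $-1$ like any other label in the matching is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(true, pred):
    """C[i, j] = number of points with predicted label i and true label j."""
    t_vals, t_idx = np.unique(np.asarray(true).ravel(), return_inverse=True)
    p_vals, p_idx = np.unique(np.asarray(pred).ravel(), return_inverse=True)
    C = np.zeros((len(p_vals), len(t_vals)), dtype=int)
    np.add.at(C, (p_idx, t_idx), 1)
    return C

def clustering_accuracy(true, pred):
    """Best one-to-one matching of predicted to true labels."""
    C = contingency(true, pred)
    rows, cols = linear_sum_assignment(-C)  # negate to maximize matches
    return C[rows, cols].sum() / len(true)

def purity(true, pred):
    """Each predicted cluster is credited with its most frequent true label."""
    return contingency(true, pred).max(axis=1).sum() / len(true)
```

Purity allows several predicted clusters to map to the same true label, so it can be used when the number of clusters found differs from $m$, whereas accuracy requires a one-to-one matching.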

In order to assess the effectiveness of SCRLM, its performance is compared with the following clustering methods: $k$-means++ (Arthur and Vassilvitskii, 2006), Complete Linkage Clustering (CL) (Johnson, 1967), Spectral Clustering (SC) (Ng et al., 2002), Tensor Decomposition (TD) (Hsu and Kakade, 2013), Expectation Maximization (EM) (Dempster et al., 1977) and t-Distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten and Hinton, 2008).

For consistency in comparing the accuracy and running time, the experiments use implementations of SCRLM and the state-of-the-art algorithms in MATLAB. For $k$-means++, the built-in function kmeans, which implements $k$-means++, is used. The built-in functions linkage and spectralcluster are used for CL and SC respectively. For EM, a standard EM for GMM is used. For TD, Theorem 2 from Hsu and Kakade (2013) has been implemented in MATLAB. For t-SNE+$k$-means++, the built-in tsne function is used to generate a matrix of two-dimensional embeddings, followed by an application of $k$-means++ to obtain the final results. For SCRLM+$k$-means, $k$-means is applied using the initial centers obtained from SCRLM.
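The SCRLM+$k$-means combination amounts to running Lloyd's iterations from supplied centers instead of random or $k$-means++ seeds. The paper does this with MATLAB's kmeans; the NumPy sketch below is only a stand-in to illustrate the idea, and the function name `kmeans_from_init` is hypothetical.

```python
import numpy as np

def kmeans_from_init(X, init_centers, n_iter=100):
    """Lloyd's algorithm started from given centers (e.g. found by SCRLM)."""
    centers = np.array(init_centers, dtype=float)
    labels = None
    for _ in range(n_iter):
        # Squared distances of every point to every center, shape (N, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers
```

The broadcasted distance computation uses memory proportional to $N$ times the number of centers times $p$, so it is suitable only for small illustrations.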

4.1 Simulation Experiments

This section shows experiments on synthetic data generated from a Gaussian mixture model with outliers described in Equation 1.

4.1.1 Comparison of Observed and Theoretical Accuracy

This section evaluates the tightness of the theoretical bounds for Algorithm 1. For simplicity, data is generated without outliers. The minimum and maximum weights for the positive clusters are taken to be $0.8/m$ and $1.2/m$ respectively. The standard deviations $\sigma_{i}$ of positive clusters are linearly increasing with $i$ from $1/16$ to $1/4$ . The experiments use $\rho=0.5$ .

The regions of parameter combinations where the theoretical bounds guarantee $100\%$ accuracy with probability at least $99\%$ are compared with similar regions obtained experimentally. The theoretical regions are described below on a case-by-case basis. The experimental regions are obtained by running Algorithm 1 with different parameter combinations on an exponential grid. For each parameter combination the algorithm is run 100 times and the number of times the algorithm has $100\%$ accuracy is recorded. The area where at least 99 of the 100 runs had $100\%$ accuracy is shown in light gray in Figure 4.

Figure 4 a) displays the results for the data dimension $p$ vs. the sample size $N$ , keeping the number of clusters $m$ fixed to $m=3$ and the subsample size $n=\lceil\frac{m}{a}(\log m+\log\frac{4}{\delta})\rceil$ . According to Corollary 1, the theoretical $p$ in this case should be at least

$$p>\left\lceil 128\left(2\log N+\log\frac{40}{\delta}\right)\right\rceil$$

when $\delta=0.01$ . This area is shown in dark gray in Figure 4 a). From the plot, one could see that the theoretical bound on $p$ is not very tight, since there is a large gap, by a factor of over 64 between the dark region (theoretical) and the light gray region (experimental).

Figure 4 b) displays the results for the subsample size $n$ vs. the number of clusters $m$ , when the sample size is $N=20,000$ and $p=3700$ . According to Corollary 1, the theoretical $n$ is at least

$$n>\left\lceil\frac{m}{a}\left(\log m+\log\frac{4}{\delta}\right)\right\rceil.$$

The plot indicates that the theoretical bound for $n$ is tight, within a factor of around 1.2.

Figure 4 c) displays the results for the data dimension $p$ vs. the number of clusters $m$ , when $N$ is fixed to be $N=20,000$ and $n=\lceil\frac{m}{a}(\log m+\log\frac{4}{\delta})\rceil$ . According to Corollary 1, the theoretical $p$ should be at least

$$p>\left\lceil 128\left(\log m+\log\frac{8}{\delta}\right)\right\rceil$$

when $\delta=0.01$ . The plot indicates that the bound on $p$ is not very tight, off by a factor over 32.

Figure 4 d) displays the results for the sample size $N$ vs. the number of clusters $m$ , when $p=3700$ and $n=\lceil\frac{m}{a}(\log m+\log\frac{4}{\delta})\rceil$ . According to Corollary 1, the theoretical $N$ is at least

$$N>\left\lceil\frac{m}{a}\left(\log m+\log\frac{4}{\delta}\right)+1\right\rceil.$$

Since the smallest $N$ one can pick is $n$ , that explains why the theoretical bound almost overlaps the experimental bound in this case.

The empirical results support the conclusions that the theoretical bound for $p$ is conservative and accurate results are obtained with smaller values of $p$ in practice, but the theoretical bounds for $N$ and $n$ are in good agreement with values needed in practice.

4.1.2 Stability of SCRLM w.r.t. the Bandwidth Parameter

The following experiments evaluate the tightness of the theoretical bounds on $\rho$ for Algorithm 1. The experiments use $\sigma_{max}=0.25$ .

Figure 5 a) displays the results for the bandwidth parameter $\rho$ vs. the sample size $N$ , keeping the number of clusters $m$ fixed to $m=3$ and the subsample size $n=\lceil\frac{m}{a}(\log m+\log\frac{4}{\delta})\rceil$ .

Figure 5 b) displays the results for the bandwidth parameter $\rho$ vs. the data dimension $p$ , when $N=32$ , $m=3$ and $n=\lceil\frac{m}{a}(\log m+\log\frac{4}{\delta})\rceil$ .

Figure 5 c) displays the results for the bandwidth parameter $\rho$ vs. the number of clusters $m$ , when $N$ is fixed to be $N=20000$ , $p$ is fixed to be $3700$ and $n=\lceil\frac{m}{a}(\log m+\log\frac{4}{\delta})\rceil$ .

Figure 5 d) displays the results for the bandwidth parameter $\rho$ vs. the number of subsamples $n$ , when $N=20000$ , $p=4200$ and $m=3$ .

According to Assumption 1, for all of the experiments, the theoretical upper bound of $\rho$ is $\sqrt{0.6}$ , and the theoretical lower bound of $\rho$ is $\sigma_{max}=0.25$ . From Figure 5, one could see that the theoretical upper bound on $\rho$ is not very tight with a difference of more than 0.1, but the theoretical lower bound on $\rho$ is very tight with the difference less than 0.02.

The empirical results support the conclusions that the theoretical upper bound for $\rho$ is not tight, that 100% accuracy can be achieved with $\rho>\sigma_{max}$ in practice, but the theoretical lower bounds for $\rho$ are in good agreement with values needed in practice.

4.1.3 Comparison with other clustering methods

For these simulations, the data is generated with different numbers of clusters ( $m$ ), different dimensions ( $p$ ) and different numbers of observations ( $N$ ). The data is generated to contain $50\%$ positives and $50\%$ negatives (outliers). The number of desired clusters is specified as $m+1$ for the other methods evaluated besides SCRLM. For SCRLM, the number of desired clusters $T$ was selected to be $T=N$ and thus the actual number of clusters was found automatically. From Figure 6, one could see that only SCRLM, SC and TD are able to detect outliers; the other methods are very sensitive to outliers. SCRLM and TD achieve $100\%$ accuracy in all cases.

4.2 Real Data Experiments

To show that the SCRLM is an effective method, it was applied to four real datasets: the MNIST (Deng, 2012), CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009) and ImageNet ILSVRC-2012 dataset (Russakovsky et al., 2015).

MNIST (Deng, 2012) has 70,000 images of handwritten digits from 0 to 9, with 60,000 images used for training and 10,000 images used for testing. CIFAR-10 (Krizhevsky et al., 2009) consists of 60,000 images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. CIFAR-100 (Krizhevsky et al., 2009) is just like CIFAR-10, except that it has 100 classes containing 600 images each. The ImageNet (Russakovsky et al., 2015) validation dataset has 50,000 observations in 1,000 classes with 50 observations per class, and the ImageNet training dataset has almost 1.3 million observations in 1,000 classes.

Data preprocessing. Feature extraction for image data obtains a compact feature vector from the interesting parts of an image. The model SimCLR (Chen et al., 2020) was used to obtain a version of the MNIST dataset as real vectors with dimension $p=512$ . The images from the CIFAR-10 and CIFAR-100 were resized to $144\times 144$ pixels, then a pre-trained CNN, CLIP ResNet $50\times 64$ (Radford et al., 2021) with average pooling was used to obtain a $p=4096$ dimensional feature vector for each image. The images from the ImageNet were resized to $224\times 224$ pixels, then a $p=640$ dimensional feature vector for each image was obtained using CLIP ResNet $50\times 4$ (Radford et al., 2021) and attention pooling.

Results. Figure 7 shows the cluster centers obtained by SCRLM when the number of desired clusters $T$ is set to be 100 in MNIST. One could see that each cluster center is a good representation of that cluster. The variations of simple digits like 1 and 4 are relatively small, while complex digits like 2 and 3 have more variations. This shows MNIST is likely to have a hierarchical structure that can be used to cluster data when the number of clusters has a range of values.

Figures 8 and 9 support the conclusion that the SCRLM-based methods are superior to other methods for problems with a large number of clusters. From the plots, one can see that the purity of SCRLM and SCRLM+ $k$ -means increases as the number of clusters increases. However, the purity of TD does not show an obvious increase as the number of clusters increases, and the running time of EM increases significantly as the number of clusters increases. Therefore, SCRLM+ $k$ -means is the most efficient in producing a particular level of accuracy within a particular time.
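For readers unfamiliar with the metric, the following sketch computes purity under its standard definition (each cluster is credited with its most frequent true class). The definition used for the figures is not restated in this section, so treating it as the standard one is an assumption, and the function name is illustrative.

```python
from collections import Counter, defaultdict

def purity(true_labels, cluster_ids):
    """Fraction of points that belong to the majority true class of their cluster."""
    by_cluster = defaultdict(Counter)
    for t, c in zip(true_labels, cluster_ids):
        by_cluster[c][t] += 1
    # For every cluster, count only the points of its most common true class.
    majority = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    return majority / len(true_labels)

score = purity([0, 0, 1, 1], [5, 5, 5, 7])
```

Under this definition purity cannot decrease when a cluster is split, which is one reason it is usually reported as a function of the number of clusters.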

The comparison of accuracy and time is shown in Figure 10 and summarized in Tables 1 and 2. In all cases, SCRLM outperforms all other methods in terms of running time. EM performs well when the number of clusters is small but has a prohibitive computational cost for the CIFAR-100 and ImageNet validation datasets. t-SNE and TD achieve the best accuracy but only have acceptable running time when the dimension is small. Therefore, only SCRLM, SCRLM+ $k$ -means and $k$ -means++ are compared for the ImageNet training dataset. From Tables 1 and 2 one can see that SCRLM+ $k$ -means achieves a higher accuracy on ImageNet than $k$ -means++ in far less time, by a factor of 3.83. This demonstrates that SCRLM can be used as an initialization technique for $k$ -means clustering that performs better than $k$ -means++.
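The idea of using precomputed centers as a $k$ -means initialization can be sketched with a plain Lloyd iteration. This is not the authors' implementation: `init_centers` stands in for centers that would come from SCRLM, and the helper names are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def lloyd_kmeans(X, init_centers, max_iter=50):
    """Lloyd's k-means started from the given centers instead of a random seeding."""
    centers = [list(c) for c in init_centers]
    assign = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest current center.
        new_assign = [
            min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
            for x in X
        ]
        if new_assign == assign:
            break  # assignments are stable, so the centers will not move
        assign = new_assign
        # Update step: each center becomes the mean of its members (kept if empty).
        for k in range(len(centers)):
            members = [x for x, a in zip(X, assign) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, assign = lloyd_kmeans(X, init_centers=[[0.0, 0.0], [10.0, 10.0]])
```

When the initial centers already lie inside the true clusters, the iteration typically stabilizes after very few passes, which is the mechanism by which a good initialization saves time.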

5 Conclusion

In this paper, a novel algorithm named SCRLM is proposed for clustering large scale Gaussian mixture models with outliers. The basic assumptions of the algorithm are: isotropic Gaussians for the foreground (positives) clusters, and a constraint on the range of values of the bandwidth parameter $\rho$ of the loss function. Unlike most clustering methods, the algorithm has strong theoretical guarantees that, with high probability, it is able to detect all the outliers and cluster all the observations correctly. Theoretical and numerical results confirm that SCRLM is an effective clustering method when the number of clusters and dimension are large. Moreover, it can be used as an initialization strategy for $k$ -means clustering and was observed to have better performance than other centroid initialization methods in extensive experiments.

There are still some drawbacks of SCRLM that must be overcome with additional work in the future. First, the clustering results of SCRLM depend strongly on the bandwidth parameter $\rho$ in the loss function. Its value is currently determined by trial and error. Second, it was observed that the clustering results of SCRLM for large numbers of clusters are more satisfactory than for small numbers of clusters. Hence, future work will focus on two aspects. First, strategies for determining an effective value of $\rho$ based on the distribution assumptions and the given data will be explored. Second, a hierarchical clustering method based on SCRLM that is able to handle a large number of clusters, on the order of tens of thousands to millions, will be designed and evaluated.

SUPPLEMENTARY MATERIALS

In the supplement, the basic separation and concentration results for pairs of training examples are presented in Appendix A. Appendix B contains proofs of Proposition 1 and Lemma 2, and Appendix C contains proofs of the basic propositions on loss bounds. The proofs of Theorem 1 and Corollary 1 are given in Appendix D.

Appendix A Preliminaries

Lemma 1.

(From (Wainwright, 2019), Example 2.5) If $Z_{1},...,Z_{n}$ are i.i.d Gaussian random variables $Z_{i}\sim\mathcal{N}(0,1)$ , then for any $\epsilon\in(0,1)$ ,

[TABLE]

Corollary 2.

If $\mathbf{x}=(X_{1},...,X_{p})$ is a multivariate Gaussian random variable $\mathbf{x}\sim\mathcal{N}(\mathbf{0},I_{p})$ , then $\mathbb{E}\left(\|\mathbf{x}\|^{2}\right)=p$ and for any $\epsilon\in(0,1)$ ,

[TABLE]

Proof.

Follows from Lemma 1 above taking $Z_{i}=X_{i},i=1,...,p$ . ∎
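The concentration of $\|\mathbf{x}\|^{2}/p$ around 1 can be checked numerically. The sketch below compares the empirical frequency of a deviation of at least $\epsilon$ with the tail bound $2\exp\{-p\epsilon^{2}/8\}$ , which is the form used in the proofs that follow; the values of `p`, `eps` and the number of trials are arbitrary choices for illustration.

```python
import math
import random

def tail_frequency(p, eps, trials=2000, seed=0):
    """Empirical frequency of | ||x||^2 / p - 1 | >= eps for x ~ N(0, I_p)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p))
        if abs(sq_norm / p - 1.0) >= eps:
            hits += 1
    return hits / trials

p, eps = 200, 0.25
empirical = tail_frequency(p, eps)
bound = 2.0 * math.exp(-p * eps ** 2 / 8.0)
```

In simulations of this kind the empirical frequency is usually far below the bound, since the bound is a worst-case exponential estimate rather than an exact tail probability.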

Corollary 3.

If $\mathbf{x}=(X_{1},...,X_{p}),\mathbf{y}=(Y_{1},...,Y_{p})$ are independent multivariate Gaussian random variables $\mathbf{x},\mathbf{y}\sim\mathcal{N}(\mathbf{0},I_{p})$ , then $\mathbb{E}\left(\|\mathbf{x}-\mathbf{y}\|^{2}\right)=2p$ and for any $\epsilon\in(0,1)$ ,

[TABLE]

Proof.

Follows from Lemma 1 above taking $Z_{i}=(X_{i}-Y_{i})/\sqrt{2},i=1,...,p$ . ∎

Using these results, it follows that with high probability the negatives are far away from each other.

Corollary 4 (Separation between negatives).

For two negatives $\mathbf{x}_{i}$ and $\mathbf{x}_{k}$ , with probability at least $1-2\exp\{-p/128\}$ , the separation satisfies

[TABLE]

Proof.

Since $\mathbf{x}_{k}\sim\mathcal{N}(\mathbf{0},I_{p})$ and $\mathbf{x}_{i}\sim\mathcal{N}(\mathbf{0},I_{p})$ , then $\mathbf{x}_{i}-\mathbf{x}_{k}\sim\mathcal{N}(\mathbf{0},2I_{p})$ , thus $\mathbb{E}\left(\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}\right)=2p$ . According to Corollary 3, it follows that

[TABLE]

then

[TABLE]

Then with probability at least $1-2\exp\{-p\epsilon^{2}/8\}$ , the separation satisfies

[TABLE]

Now take $\epsilon=1/4$ so that with probability at least $1-2\exp\{-p/128\}$ , the separation satisfies

[TABLE]

∎
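The constant 128 that appears throughout the appendix comes from substituting $\epsilon=1/4$ into the exponent $p\epsilon^{2}/8$ . As a worked check of that substitution:

```latex
\frac{p\epsilon^{2}}{8}\Big|_{\epsilon=1/4}
  =\frac{p}{8\cdot 16}
  =\frac{p}{128},
\qquad\text{so}\qquad
1-2\exp\{-p\epsilon^{2}/8\}=1-2\exp\{-p/128\}.
```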

It then follows that the positives from the same cluster are within a certain radius from each other with high probability.

Corollary 5 (Concentration of positives in the same cluster).

For any positive cluster $S_{j}$ with mean ${\boldsymbol{\mu}}_{j}$ and covariance matrix $\sigma_{j}^{2}I_{p}$ , with probability at least $1-2\exp\{-p/128\}$ , the concentration is bounded as

[TABLE]

Proof.

Since $\mathbf{x}_{i}\sim\mathcal{N}({\boldsymbol{\mu}}_{j},\sigma_{j}^{2}I_{p})$ and $\mathbf{x}_{k}\sim\mathcal{N}({\boldsymbol{\mu}}_{j},\sigma_{j}^{2}I_{p})$ , then $\mathbf{x}_{i}-\mathbf{x}_{k}\sim\mathcal{N}(\mathbf{0},2\sigma_{j}^{2}I_{p})$ , thus $\mathbb{E}\left(\|\mathbf{x}_{i}-\mathbf{x}_{k}\|^{2}\right)=2p\sigma_{j}^{2}$ . According to Corollary 2, it follows that

[TABLE]

then

[TABLE]

Taking $\epsilon=1/4$ yields

[TABLE]

Therefore, with probability at least $1-2\exp\{-p/128\}$ , the concentration is bounded as

[TABLE]

∎

We then prove that the positives are far away from the negatives with high probability.

Corollary 6 (Separation between positives and negatives).

For negative $\mathbf{x}_{i}$ and positive $\mathbf{x}_{k}$ from cluster $S_{j}$ with mean ${\boldsymbol{\mu}}_{j}$ and covariance matrix $\sigma_{j}^{2}I_{p}$ , with probability at least $1-2\exp\{-p/128\}$ , the separation satisfies

[TABLE]

Proof.

Since $\mathbf{x}_{k}\sim\mathcal{N}({\boldsymbol{\mu}}_{j},\sigma_{j}^{2}I_{p})$ and $\mathbf{x}_{i}\sim\mathcal{N}(\mathbf{0},I_{p})$ , then $\mathbf{x}_{i}-\mathbf{x}_{k}\sim\mathcal{N}({\boldsymbol{\mu}}_{j},\sigma_{j}^{2}I_{p}+I_{p})$ , thus $\mathbf{x}_{i}-\mathbf{x}_{k}={\boldsymbol{\mu}}_{j}+\boldsymbol{\epsilon_{1}}\sqrt{\sigma_{j}^{2}+1}$ with $\boldsymbol{\epsilon_{1}}\sim\mathcal{N}\left(\mathbf{0},I_{p}\right).$ Since ${\boldsymbol{\mu}}_{j}\sim\mathcal{N}\left(\mathbf{0},I_{p}\right)$ , then $\mathbf{x}_{i}-\mathbf{x}_{k}$ is a Gaussian with $\mathbb{E}\left(\mathbf{x}_{i}-\mathbf{x}_{k}\right)=\mathbf{0}$ and

[TABLE]

thus

[TABLE]

According to Corollary 2, it follows immediately that,

[TABLE]

then

[TABLE]

Then with probability at least $1-2\exp\{-p\epsilon^{2}/8\}$ , the separation satisfies

[TABLE]

Now take $\epsilon=1/4$ , so that with probability at least $1-2\exp\{-p/128\}$ , the separation satisfies

[TABLE]

∎

Moreover, positives from different clusters are also far from each other with high probability.

Corollary 7 (Separation between positives in different clusters).

For positive $\mathbf{x}_{i}$ from cluster $S_{i}$ with true mean $\boldsymbol{\mu}_{i}$ and covariance matrix $\sigma_{i}^{2}I_{p}$ and positive $\mathbf{x}_{k}$ from another cluster $S_{j}$ with true mean $\boldsymbol{\mu}_{j}$ and covariance matrix $\sigma_{j}^{2}I_{p}$ , with probability at least $1-2\exp\{-p/128\}$ , the separation satisfies

[TABLE]

Proof.

Since $\mathbf{x}_{k}\sim\mathcal{N}({\boldsymbol{\mu}}_{j},\sigma_{j}^{2}I_{p})$ and $\mathbf{x}_{i}\sim\mathcal{N}({\boldsymbol{\mu}}_{i},\sigma_{i}^{2}I_{p})$ , then $\mathbf{x}_{i}-\mathbf{x}_{k}\sim\mathcal{N}({\boldsymbol{\mu}}_{j}-{\boldsymbol{\mu}}_{i},\sigma_{j}^{2}I_{p}+\sigma_{i}^{2}I_{p})$ , thus $\mathbf{x}_{i}-\mathbf{x}_{k}={\boldsymbol{\mu}}_{j}-{\boldsymbol{\mu}}_{i}+\boldsymbol{\epsilon_{1}}\sqrt{\sigma_{j}^{2}+\sigma_{i}^{2}}$ with $\boldsymbol{\epsilon_{1}}\sim\mathcal{N}\left(\mathbf{0},I_{p}\right).$ Since ${\boldsymbol{\mu}}_{j}\sim\mathcal{N}\left(\mathbf{0},I_{p}\right)$ and ${\boldsymbol{\mu}}_{i}\sim\mathcal{N}\left(\mathbf{0},I_{p}\right)$ , then ${\boldsymbol{\mu}}_{j}-{\boldsymbol{\mu}}_{i}\sim\mathcal{N}\left(\mathbf{0},2I_{p}\right)$ , then $\mathbf{x}_{i}-\mathbf{x}_{k}$ is a Gaussian with $\mathbb{E}\left(\mathbf{x}_{i}-\mathbf{x}_{k}\right)=\mathbf{0}$ and

[TABLE]

thus

[TABLE]

According to Corollary 2, it follows that

[TABLE]

then

[TABLE]

Then with probability at least $1-2\exp\{-p\epsilon^{2}/8\}$ , the separation satisfies

[TABLE]

Now take $\epsilon=1/4$ so that with probability at least $1-2\exp\{-p/128\}$ , the separation satisfies

[TABLE]

∎

The previous corollaries are used to prove that with high probability, all positives from each cluster are within $2.5p\rho^{2}$ of each other, and $2.5p\rho^{2}$ away from the other clusters and from the negatives.
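This threshold behaviour can be illustrated with a small simulation. The sketch reads the statement in terms of squared Euclidean distances, which is an interpretation since the displayed bounds are not reproduced here, and uses hypothetical values $p=512$ , $\sigma=0.25$ and $\rho=0.5$ with two clusters.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

rng = random.Random(0)
p, sigma, rho = 512, 0.25, 0.5          # illustrative values with sigma <= rho < sqrt(0.6)
threshold = 2.5 * p * rho ** 2

# Two positive clusters with means from N(0, I_p), plus four negatives from N(0, I_p).
means = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(2)]
points = []
for k, mu in enumerate(means, start=1):
    for _ in range(4):
        points.append((k, [c + sigma * rng.gauss(0.0, 1.0) for c in mu]))
for _ in range(4):
    points.append((-1, [rng.gauss(0.0, 1.0) for _ in range(p)]))

same_cluster, other_pairs = [], []
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        (li, xi), (lj, xj) = points[i], points[j]
        d = sq_dist(xi, xj)
        if li == lj and li > 0:
            same_cluster.append(d)   # two positives from the same cluster
        else:
            other_pairs.append(d)    # any pair involving a negative or two clusters
```

In this setting the within-cluster squared distances concentrate near $2p\sigma^{2}$ , well below the threshold, while all other pairs concentrate near $2p$ or above.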

Appendix B Proof of Key Lemmas and Propositions

Proposition 1.

Given $N$ samples from a GMM with outliers, and $\sigma_{max}\leq\rho<\sqrt{0.6}$ , then with probability at least $1-6N^{2}\exp\{-p/128\}$ , the distance between positives within a cluster satisfies

[TABLE]

and the distance between positives from a cluster and other samples not in that cluster satisfies

[TABLE]

Proof.

From Corollary 5, with probability at least $1-2\exp\{-p/128\}$ , the distance between two positives in the same cluster is bounded as

[TABLE]

Using the union bound, with probability at least $1-2N^{2}\exp\{-p/128\}$ , the distances between all positives in the same cluster are bounded as

[TABLE]

From Corollary 6, with probability at least $1-2\exp\{-p/128\}$ , the distance between a positive and a negative satisfies

[TABLE]

Using the union bound, with probability at least $1-2N^{2}\exp\{-p/128\}$ , the distance between any positive and negative satisfies

[TABLE]

Since $0<\sigma_{k}<1$ , the condition $\sigma_{max}\leq\rho<\sqrt{\frac{1.5}{2.5}}=\sqrt{0.6}$ implies $\sigma_{max}\leq\rho<\sqrt{\frac{1.5+0.75\sigma_{k}^{2}}{2.5}}$ . Hence, with probability at least $1-2N^{2}\exp\{-p/128\}$ , the distance between any positive and negative satisfies

[TABLE]

From Corollary 7, with probability at least $1-2\exp\{-p/128\}$ , the distance between two positives from different clusters satisfies

[TABLE]

Using the union bound, with probability at least $1-2N^{2}\exp\{-p/128\}$ , the distance between any two positives from different clusters satisfies

[TABLE]

Since $0<\sigma_{i},\sigma_{j}<1$ , the condition $\sigma_{max}\leq\rho<\sqrt{0.6}$ implies $\sigma_{max}\leq\rho<\sqrt{\frac{1.5+0.75(\sigma_{i}^{2}+\sigma_{j}^{2})}{2.5}}$ . Hence, with probability at least $1-2N^{2}\exp\{-p/128\}$ , the distance between any two positives from different clusters satisfies

[TABLE]

Therefore, with probability at least $1-4N^{2}\exp\{-p/128\}$ , the distance between any positive and any sample not from that cluster satisfies

[TABLE]

Therefore, with probability at least $1-6N^{2}\exp\{-p/128\}$ , the following bounds on positives within a cluster and between clusters are satisfied

[TABLE]

and

[TABLE]

∎
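The probability $1-6N^{2}\exp\{-p/128\}$ in Proposition 1 is only informative when $p$ is large relative to $\log N$ . As a rough sketch, the smallest dimension for which the failure term is at most a target $\delta$ can be computed as follows; the function name and the chosen $N$ and $\delta$ are illustrative.

```python
import math

def min_dimension(N, delta):
    """Smallest integer p with 6 * N**2 * exp(-p / 128) <= delta."""
    return math.ceil(128.0 * math.log(6.0 * N ** 2 / delta))

def failure_term(N, p):
    """The failure probability term 6 * N**2 * exp(-p / 128) from Proposition 1."""
    return 6.0 * N ** 2 * math.exp(-p / 128.0)

N, delta = 20000, 0.05
p_min = min_dimension(N, delta)
```

Solving $6N^{2}\exp\{-p/128\}\leq\delta$ for $p$ gives $p\geq 128\log(6N^{2}/\delta)$ , so the required dimension grows only logarithmically in $N$ .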

Finally, a bound for the probability that a sample $S$ has at least one element from each positive cluster is proven.

Lemma 2.

If the clusters have weights $w_{1},...,w_{m}$ , with $\sum_{k=1}^{m}w_{k}\leq 1$ , then the probability that a sample $S$ of size $|S|=n$ contains at least one observation from each cluster is at least

[TABLE]

Proof.

The probability that $S$ contains no elements from cluster $k$ is

[TABLE]

Then using the union bound, the probability that there is a $k$ such that $S$ does not contain any elements from cluster $k$ is

[TABLE]

which implies the result. ∎
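The union-bound argument can be compared with simulation. The sketch below assumes i.i.d. sampling, so that a cluster of weight $w_{k}$ is missed with probability $(1-w_{k})^{n}\leq\exp\{-nw_{k}\}$ ; both resulting lower bounds are computed because the display in Lemma 2 is not reproduced here. The weights and $n$ are hypothetical.

```python
import math
import random

def coverage_bounds(weights, n):
    """Lower bounds on P(every cluster is hit) obtained from the union bound."""
    union_bound = 1.0 - sum((1.0 - w) ** n for w in weights)
    exp_bound = 1.0 - sum(math.exp(-n * w) for w in weights)
    return union_bound, exp_bound

def empirical_coverage(weights, n, trials=2000, seed=0):
    """Monte Carlo estimate of P(a sample of size n hits every cluster)."""
    rng = random.Random(seed)
    m = len(weights)
    labels = list(range(m)) + [-1]                           # -1 marks outliers
    probs = list(weights) + [max(1.0 - sum(weights), 0.0)]   # remaining mass is outliers
    covered = 0
    for _ in range(trials):
        draw = rng.choices(labels, weights=probs, k=n)
        if set(range(m)) <= set(draw):
            covered += 1
    return covered / trials

weights, n = [0.2, 0.2, 0.1], 40
union_bound, exp_bound = coverage_bounds(weights, n)
freq = empirical_coverage(weights, n)
```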

Appendix C Proofs of Loss Bounds

In this section, the obtained concentration and separation results from Appendix A and Appendix B are used to obtain bounds on the loss function values.

First, it is proven that with high probability, the loss value of a negative is $-F$ .

Proposition 2.

Given $N$ samples from a GMM with outliers, and $\sigma_{max}\leq\rho<\sqrt{0.6}$ , then for a negative sample $\mathbf{x}_{j},l(\mathbf{x}_{j})=-1$ , with probability at least $1-4N\exp\{-p/128\}$ , the loss satisfies $L(\mathbf{x}_{j};\rho)=-F.$

Proof.

From Corollary 4, for $\mathbf{x}_{i},l(\mathbf{x}_{i})=-1$ , with probability at least $1-2\exp\{-p/128\}$ , the distance between a negative and $\mathbf{x}_{j}$ satisfies

[TABLE]

Using the union bound, with probability at least $1-2N\exp\{-p/128\}$ , the distance between any other negative and $\mathbf{x}_{j}$ satisfies

[TABLE]

Given $\sigma_{max}\leq\rho<\sqrt{0.6}$ , then with probability at least $1-2N\exp\{-p/128\}$ , the distance between any other negative and $\mathbf{x}_{j}$ satisfies

[TABLE]

From Corollary 6, for $\mathbf{x}_{i},l(\mathbf{x}_{i})>0$ , with probability at least $1-2\exp\{-p/128\}$ , the distance between a positive and $\mathbf{x}_{j}$ satisfies

[TABLE]

Using the union bound, with probability at least $1-2N\exp\{-p/128\}$ , the distance between any positive and $\mathbf{x}_{j}$ satisfies

[TABLE]

Given $\sigma_{max}\leq\rho<\sqrt{0.6}$ , then with probability at least $1-2N\exp\{-p/128\}$ , the distance between any positive and $\mathbf{x}_{j}$ satisfies

[TABLE]

Therefore, with probability at least $1-4N\exp\{-p/128\}$ , the distance between any other sample and $\mathbf{x}_{j}$ satisfies

[TABLE]

Therefore, with probability at least $1-4N\exp\{-p/128\}$ , it follows that

[TABLE]

Therefore, with probability at least $1-4N\exp\{-p/128\}$ , the loss satisfies

[TABLE]

since $\ell\left(\left\|\mathbf{x}_{j}-\mathbf{x}_{j}\right\|;\rho\right)=\ell(0;\rho)=-F$ . ∎

Next, it is proven that, with high probability, the loss value of a positive is less than $-F.$

Proposition 3.

Given $N$ samples from a GMM with outliers, and $\sigma_{max}\leq\rho<\sqrt{0.6}$ , then for a positive sample $\mathbf{x}_{j},l(\mathbf{x}_{j})=k>0$ , with probability at least $1-2\exp\{-p/128\}-\exp\{-(N-1)w_{k}\}$ , the loss is bounded as $L(\mathbf{x}_{j};\rho)<-F.$

Proof.

The probability that a sample of size $N-1$ contains no elements from cluster $S_{k}$ is

[TABLE]

Therefore, with probability at least $1-\exp\{-(N-1)w_{k}\}$ , there is at least one more sample $\mathbf{x}_{a},a\not=j$ besides $\mathbf{x}_{j}$ in cluster $S_{k}$ .

From Corollary 5, with probability at least $1-2\exp\{-p/128\}$ , the distance between $\mathbf{x}_{a}$ and $\mathbf{x}_{j}$ is bounded as

[TABLE]

Given $\sigma_{max}\leq\rho<\sqrt{0.6}$ , with probability at least $1-2\exp\{-p/128\}$ , the distance between $\mathbf{x}_{a}$ and $\mathbf{x}_{j}$ is bounded as

[TABLE]

Therefore, with probability at least $1-2\exp\{-p/128\}-\exp\{-(N-1)w_{k}\}$ , the following equality holds

[TABLE]

Therefore, with probability at least $1-2\exp\{-p/128\}-\exp\{-(N-1)w_{k}\}$ , the loss is bounded above as

[TABLE]

∎

Proposition 4.

Given $N$ samples from a GMM with outliers, with $w_{i}\geq a/m,i=\overline{1,m}$ for some $a>0$ and $\sigma_{max}\leq\rho<\sqrt{0.6}$ , let $S$ be a randomly selected set of $|S|=n$ subsamples. Then with probability at least $1-m\exp\{-na/m\}-2m\exp\{-p/128\}-m\exp\{-a(N-1)/m\}$ , for each $k=\overline{1,m}$ there exists $\mathbf{x}_{j}\in S_{k}=\{\mathbf{x}\in S,l(\mathbf{x})=k\}$ such that $L(\mathbf{x}_{j},\rho)<-F$ .

Proof.

According to Lemma 2, the probability that a sample $S$ of size $n$ contains at least one observation from each cluster is

[TABLE]

and without loss of generality let $\mathbf{x}_{j}$ be the observation from cluster $S_{k},k=\overline{1,m}$ . Applying Proposition 3 repeatedly to these $m$ samples and using the union bound, with probability at least $1-2m\exp\{-p/128\}-\sum_{i=1}^{m}\exp\{-(N-1)w_{i}\}$ , the loss is bounded above as

[TABLE]

Since $w_{i}\geq a/m$ for all $i$ , it follows that $\sum_{i=1}^{m}\exp\{-(N-1)w_{i}\}\leq m\exp\{-a(N-1)/m\}$ . Therefore, with probability at least $1-m\exp\{-na/m\}-2m\exp\{-p/128\}-m\exp\{-a(N-1)/m\},$ for each $k=\overline{1,m}$ there exists $\mathbf{x}_{j}\in S_{k}$ such that

[TABLE]

∎

Appendix D Proofs of Theorem 1 and Corollary 1

In this section, the proofs of Theorem 1 and Corollary 1 are given.

Proof of Theorem 1.

From Proposition 2, for a negative sample $\mathbf{x}_{j}$ , with probability at least $1-4N\exp\{-p/128\}$ , $L(\mathbf{x}_{j},\rho)=-F$ , then for all the negatives, with probability at least $1-4N^{2}\exp\{-p/128\}$ , the loss satisfies $L(\mathbf{x}_{j},\rho)=-F.$

From Proposition 4, with probability at least $1-m\exp\{-na/m\}-2m\exp\{-p/128\}-m\exp\{-a(N-1)/m\}$ , for each $k=\overline{1,m}$ there is $\mathbf{x}_{j}\in S_{k},L(\mathbf{x}_{j},\rho)<-F$ .

Combining Proposition 2 and Proposition 4, with probability at least $1-4N^{2}\exp\{-p/128\}-m\exp\{-na/m\}-2m\exp\{-p/128\}-m\exp\{-a(N-1)/m\}$ , only positives will be selected at Step 8 of SCRLM.

From Proposition 1, with probability at least $1-6N^{2}\exp\{-p/128\}$ , all positives are correctly identified in Steps 9 and 17 and removed from negatives.

So with probability at least

[TABLE]

SCRLM will have $100\%$ accuracy. ∎
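To get a feel for the size of this probability, the failure terms stated in the proof can be summed numerically. The sketch below simply adds the terms from Proposition 2 (all negatives), Proposition 4 and Proposition 1 via the union bound; how the paper groups them in the displayed expression is not reproduced here, so the grouping and the parameter values are assumptions.

```python
import math

def scrlm_failure_bound(N, p, m, n, a):
    """Union-bound sum of the failure terms quoted in the proof of Theorem 1."""
    e = math.exp(-p / 128.0)
    negatives = 4.0 * N ** 2 * e                      # Proposition 2 over all negatives
    positives = (m * math.exp(-n * a / m)             # Proposition 4: sampling term
                 + 2.0 * m * e                        # Proposition 4: concentration term
                 + m * math.exp(-a * (N - 1) / m))    # Proposition 4: cluster non-emptiness
    distances = 6.0 * N ** 2 * e                      # Proposition 1
    return negatives + positives + distances

# Hypothetical setting: N and p as in Figure 5 d), with illustrative m, n and a.
bound = scrlm_failure_bound(N=20000, p=4200, m=3, n=60, a=0.5)
```

For moderate $N$ the terms in $\exp\{-p/128\}$ are negligible once $p$ is in the thousands, so in such settings the subsampling term $m\exp\{-na/m\}$ tends to dominate.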

Proof of Corollary 1.

The condition

[TABLE]

is equivalent to

[TABLE]

The condition

[TABLE]

is equivalent to

[TABLE]

The condition

[TABLE]

is equivalent to

[TABLE]

Finally, the condition

[TABLE]

is equivalent to:

[TABLE]

These conditions together imply that

[TABLE]

According to Theorem 1, SCRLM has 100% accuracy with probability at least $1-\delta$ . ∎

Bibliography

Alashwal, H., El Halaby, M., Crouse, J. J., Abdalla, A., and Moustafa, A. A. (2019), “The application of unsupervised clustering methods to Alzheimer’s disease,” Frontiers in Computational Neuroscience, 13, 31.
Arthur, D. and Vassilvitskii, S. (2006), “k-means++: The advantages of careful seeding,” Technical report, Stanford.
Bradley, P. S. and Fayyad, U. M. (1998), “Refining initial points for k-means clustering,” in ICML, vol. 98, Citeseer, pp. 91–99.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020), “A simple framework for contrastive learning of visual representations,” arXiv preprint arXiv:2002.05709.
Coleman, G. B. and Andrews, H. C. (1979), “Image segmentation by clustering,” Proceedings of the IEEE, 67, 773–785.
Dasgupta, S. and Schulman, L. J. (2007), “A probabilistic analysis of EM for mixtures of separated, spherical Gaussians,” Journal of Machine Learning Research, 8, 203–226.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), 39, 1–22.
Deng, L. (2012), “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, 29, 141–142.
