Variational Information Bottleneck for Unsupervised Clustering: Deep Gaussian Mixture Embedding

Yigit Ugur; George Arvanitakis; Abdellatif Zaidi

arXiv:1905.11741·cs.LG·April 22, 2020


TL;DR

This paper introduces an unsupervised clustering method that combines the Variational Information Bottleneck with Gaussian Mixture Models, leveraging neural networks and variational inference for effective data embedding.

Contribution

It develops a novel unsupervised generative clustering framework integrating the Variational Information Bottleneck with Gaussian Mixture Models, including a new bound and inference algorithm.

Findings

1. Demonstrates efficiency on real datasets
2. Provides a new variational bound generalizing ELBO
3. Uses neural networks for flexible encoding

Abstract

In this paper, we develop an unsupervised generative clustering framework that combines the Variational Information Bottleneck and the Gaussian Mixture Model. Specifically, in our approach, we use the Variational Information Bottleneck method and model the latent space as a mixture of Gaussians. We derive a bound on the cost function of our model that generalizes the Evidence Lower Bound (ELBO) and provide a variational inference type algorithm that allows computing it. In the algorithm, the coders' mappings are parametrized using neural networks, and the bound is approximated by Monte Carlo sampling and optimized with stochastic gradient descent. Numerical results on real datasets are provided to support the efficiency of our method.

Tables

Table I: Comparison of the clustering accuracy of various algorithms. The algorithms are run without pretraining. Each algorithm is run ten times. The values in $(\cdot)$ correspond to the standard deviations of clustering accuracies.

Algorithm	MNIST (Best Run)	MNIST (Average Run)	STL-10 (Best Run)	STL-10 (Average Run)
GMM	44.1	40.5 (1.5)	78.9	73.3 (5.1)
DEC	-	-	80.6^†	-
VaDE	91.8	78.8 (9.1)	85.3	74.1 (6.4)
VIB-GMM	95.1	83.5 (5.9)	93.2	82.1 (5.6)

Table II: Comparison of the clustering accuracy of various algorithms. A stacked autoencoder is used to pretrain the DNNs of the encoder and decoder before running the algorithms (the DNNs are initialized with the same weights and biases as in [19]). Each algorithm is run ten times. The values in $(\cdot)$ correspond to the standard deviations of clustering accuracies.

Algorithm	MNIST (Best Run)	MNIST (Average Run)	REUTERS10K (Best Run)	REUTERS10K (Average Run)
DEC	84.3^‡	-	72.2^‡	-
VaDE	94.2	93.2 (1.5)	79.8	79.1 (0.6)
VIB-GMM	96.1	95.8 (0.1)	81.6	81.2 (0.4)

Equations

$$\min_{P_{\mathbf{U}|\mathbf{X}}}\; I(\mathbf{X};\mathbf{U})-sI(C;\mathbf{U}),$$

$$C-\!\!\!\!\minuso\!\!\!\!-\mathbf{X}-\!\!\!\!\minuso\!\!\!\!-\mathbf{U}.$$

$$C-\!\!\!\!\minuso\!\!\!\!-\mathbf{U}-\!\!\!\!\minuso\!\!\!\!-\mathbf{X}.$$

$$\mathcal{L}_{s}(\mathbf{P})=I(C;\mathbf{U})-sI(\mathbf{X};\mathbf{U}),$$

$$\tilde{\mathcal{L}}_{s}(\mathbf{P}):=I(\mathbf{X};\mathbf{U})-sI(\mathbf{X};\mathbf{U})\stackrel{(a)}{=}H(\mathbf{X})-H(\mathbf{X}|\mathbf{U})-s[H(\mathbf{U})-H(\mathbf{U}|\mathbf{X})],$$

$$\mathcal{L}^{\prime}_{s}(\mathbf{P}):=-H(\mathbf{X}|\mathbf{U})-s[H(\mathbf{U})-H(\mathbf{U}|\mathbf{X})]=\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[\log P_{\mathbf{X}|\mathbf{U}}+s\log P_{\mathbf{U}}-s\log P_{\mathbf{U}|\mathbf{X}}]\Big].$$

$$\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q}):=\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[\log Q_{\mathbf{X}|\mathbf{U}}]-sD_{\mathrm{KL}}(P_{\mathbf{U}|\mathbf{X}}\|Q_{\mathbf{U}})\Big].$$

$$\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})\leq\mathcal{L}^{\prime}_{s}(\mathbf{P}),\quad\text{for all }\mathbf{Q}.$$

$$Q_{\mathbf{X}|\mathbf{U}}^{*}=P_{\mathbf{X}|\mathbf{U}},\quad Q_{\mathbf{U}}^{*}=P_{\mathbf{U}}.$$

$$\max_{\mathbf{P}}\mathcal{L}^{\prime}_{s}(\mathbf{P})=\max_{\mathbf{P}}\max_{\mathbf{Q}}\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q}).$$

$$\mathcal{L}_{1}^{\mathrm{VaDE}}:=\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[\log Q_{\mathbf{X}|\mathbf{U}}]-D_{\mathrm{KL}}(P_{C|\mathbf{X}}\|Q_{C})-\mathbb{E}_{P_{C|\mathbf{X}}}[D_{\mathrm{KL}}(P_{\mathbf{U}|\mathbf{X}}\|Q_{\mathbf{U}|C})]\Big].$$

$$\mathcal{L}_{s}^{\mathrm{VaDE}}:=\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[\log Q_{\mathbf{X}|\mathbf{U}}]-sD_{\mathrm{KL}}(P_{C|\mathbf{X}}\|Q_{C})-s\mathbb{E}_{P_{C|\mathbf{X}}}[D_{\mathrm{KL}}(P_{\mathbf{U}|\mathbf{X}}\|Q_{\mathbf{U}|C})]\Big].$$

$$\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})=\mathcal{L}_{s}^{\mathrm{VaDE}}+s\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[D_{\mathrm{KL}}(P_{C|\mathbf{X}}\|Q_{C|\mathbf{U}})]\Big],$$

$$\begin{aligned}\mathcal{L}_{s}^{\mathrm{VaDE}}&\stackrel{(a)}{=}\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[\log Q_{\mathbf{X}|\mathbf{U}}]-sD_{\mathrm{KL}}(P_{\mathbf{U}|\mathbf{X}}\|Q_{\mathbf{U}})-s\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}\big[D_{\mathrm{KL}}(P_{C|\mathbf{X}}\|Q_{C|\mathbf{U}})\big]\Big]\\&\stackrel{(b)}{=}\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})-s\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}\big[D_{\mathrm{KL}}(P_{C|\mathbf{X}}\|Q_{C|\mathbf{U}})\big]\Big],\end{aligned}$$

$$P_{\theta}(\mathbf{u}|\mathbf{x})=\mathcal{N}(\mathbf{u};\boldsymbol{\mu}_{\theta},\boldsymbol{\Sigma}_{\theta}),\ \text{where }[\boldsymbol{\mu}_{\theta},\boldsymbol{\Sigma}_{\theta}]=f_{\theta}(\mathbf{x}),\qquad Q_{\phi}(\mathbf{x}|\mathbf{u})=g_{\phi}(\mathbf{u})=[\hat{\mathbf{x}}],$$

$$Q_{\psi}(\mathbf{u})=\sum_{c}\pi_{c}\,\mathcal{N}(\mathbf{u};\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_{c}).$$

$$\max_{\theta,\phi,\psi}\mathcal{L}_{s}^{\mathrm{NN}}(\theta,\phi,\psi)$$

$$\mathcal{L}_{s}^{\mathrm{NN}}(\theta,\phi,\psi):=\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\theta}(\mathbf{U}|\mathbf{X})}[\log Q_{\phi}(\mathbf{X}|\mathbf{U})]-sD_{\mathrm{KL}}(P_{\theta}(\mathbf{U}|\mathbf{X})\|Q_{\psi}(\mathbf{U}))\Big].$$

$$\max_{\theta,\phi,\psi}\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{s,i}^{\mathrm{emp}}(\theta,\phi,\psi),$$

$$\mathcal{L}_{s,i}^{\mathrm{emp}}(\theta,\phi,\psi)=\mathbb{E}_{P_{\theta}(\mathbf{U}_{i}|\mathbf{X}_{i})}[\log Q_{\phi}(\mathbf{X}_{i}|\mathbf{U}_{i})]-sD_{\mathrm{KL}}(P_{\theta}(\mathbf{U}_{i}|\mathbf{X}_{i})\|Q_{\psi}(\mathbf{U}_{i})).$$

$$\mathbb{E}_{P_{\theta}(\mathbf{U}_{i}|\mathbf{X}_{i})}[\log Q_{\phi}(\mathbf{X}_{i}|\mathbf{U}_{i})]=\frac{1}{M}\sum_{m=1}^{M}\log q(\mathbf{x}_{i}|\mathbf{u}_{i,m}),\quad\mathbf{u}_{i,m}=\boldsymbol{\mu}_{\theta,i}+\boldsymbol{\Sigma}_{\theta,i}^{\frac{1}{2}}\cdot\boldsymbol{\epsilon}_{m},\quad\boldsymbol{\epsilon}_{m}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),$$

$$D_{\mathrm{KL}}(P_{\theta}(\mathbf{U}_{i}|\mathbf{X}_{i})\|Q_{\psi}(\mathbf{U}_{i}))=-\log\sum_{c=1}^{|\mathcal{C}|}\pi_{c}\exp\Big(-D_{\mathrm{KL}}\big(\mathcal{N}(\boldsymbol{\mu}_{\theta,i},\boldsymbol{\Sigma}_{\theta,i})\|\mathcal{N}(\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_{c})\big)\Big).$$

$$D_{\mathrm{KL}}(P_{\theta}(\mathbf{U}_{i}|\mathbf{X}_{i})\|Q_{\psi}(\mathbf{U}_{i}))=-\log\sum_{c=1}^{|\mathcal{C}|}\pi_{c}\exp\bigg(-\frac{1}{2}\sum_{j=1}^{n_{u}}\Big[\frac{(\mu_{\theta,i,j}-\mu_{c,j})^{2}}{\sigma_{c,j}^{2}}+\log\frac{\sigma_{c,j}^{2}}{\sigma_{\theta,i,j}^{2}}-1+\frac{\sigma_{\theta,i,j}^{2}}{\sigma_{c,j}^{2}}\Big]\bigg),$$

$$p(c|\mathbf{x}_{i})=q(c|\mathbf{u}_{i})=\frac{q_{\psi^{\star}}(c)\,q_{\psi^{\star}}(\mathbf{u}_{i}|c)}{q_{\psi^{\star}}(\mathbf{u}_{i})}=\frac{\pi_{c}^{\star}\,\mathcal{N}(\mathbf{u}_{i};\boldsymbol{\mu}_{c}^{\star},\boldsymbol{\Sigma}_{c}^{\star})}{\sum_{c}\pi_{c}^{\star}\,\mathcal{N}(\mathbf{u}_{i};\boldsymbol{\mu}_{c}^{\star},\boldsymbol{\Sigma}_{c}^{\star})},$$

$$\mathcal{L}^{\prime}_{s}(\mathbf{P})-\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})=\mathbb{E}_{P_{\mathbf{U}}}[D_{\mathrm{KL}}(P_{\mathbf{X}|\mathbf{U}}\|Q_{\mathbf{X}|\mathbf{U}})]+sD_{\mathrm{KL}}(P_{\mathbf{U}}\|Q_{\mathbf{U}})\geq 0,$$


Full text

Variational Information Bottleneck for Unsupervised Clustering: Deep Gaussian Mixture Embedding

Yiğit Uğur †‡, George Arvanitakis †, Abdellatif Zaidi ‡

† Mathematical and Algorithmic Sciences Lab, Paris Research Center, Huawei Technologies,

Boulogne-Billancourt, 92100, France

‡ Laboratoire d’informatique Gaspard-Monge, Université Paris-Est, Champs-sur-Marne, 77454, France


Abstract

In this paper, we develop an unsupervised generative clustering framework that combines the Variational Information Bottleneck and the Gaussian Mixture Model. Specifically, in our approach, we use the Variational Information Bottleneck method and model the latent space as a mixture of Gaussians. We derive a bound on the cost function of our model that generalizes the Evidence Lower Bound (ELBO) and provide a variational inference type algorithm that allows computing it. In the algorithm, the coders’ mappings are parametrized using neural networks, and the bound is approximated by Monte Carlo sampling and optimized with stochastic gradient descent. Numerical results on real datasets are provided to support the efficiency of our method.

I Introduction

Clustering consists of partitioning a given dataset into various groups (clusters) based on some similarity metric, such as the Euclidean distance, $L_{1}$ norm, $L_{2}$ norm, $L_{\infty}$ norm, the popular logarithmic loss measure, or others. The principle is that each cluster should contain elements of the data that are closer to each other than to any other element outside that cluster, in the sense of the defined similarity measure. If the joint distribution of the clusters and data is not known, one should operate blindly in doing so, i.e., using only the data elements at hand; and the approach is called unsupervised clustering [1, 2]. Unsupervised clustering is perhaps one of the most important tasks of unsupervised machine learning algorithms currently, due to a variety of application needs and connections with other problems.

Clustering can be formulated as follows. Consider a dataset that is composed of $N$ samples $\{\mathbf{x}_{i}\}_{i=1}^{N}$ , which we wish to partition into $|\mathcal{C}|\geq 1$ clusters. Let $\mathcal{C}=\{1,\dots,|\mathcal{C}|\}$ be the set of all possible clusters and $C$ designate a categorical random variable that lies in $\mathcal{C}$ and stands for the index of the actual cluster. If $\mathbf{X}$ is a random variable that models elements of the dataset, then the observation $\mathbf{X}=\mathbf{x}_{i}$ induces a probability distribution on $C$ , which the learner should learn. Mathematically, the problem is thus that of estimating the values of the unknown conditional probability $P_{C|\mathbf{X}}(\cdot|\mathbf{x}_{i})$ for all elements $\mathbf{x}_{i}$ of the dataset. The estimates are sometimes referred to as the assignment probabilities.

Examples of unsupervised clustering algorithms include the very popular $K$ -means [3] and Expectation Maximization (EM) [4]. The $K$ -means algorithm partitions the data in a manner that the Euclidean distance among the members of each cluster is minimized. With the EM algorithm, the underlying assumption is that the data comprise a mixture of Gaussian samples, namely a Gaussian Mixture Model (GMM); and one estimates the parameters of each component of the GMM while simultaneously associating each data sample with one of those components. Although they offer some advantages in the context of clustering, these algorithms suffer from some strong limitations. For example, it is well known that the $K$ -means is highly sensitive to both the order of the data and scaling; and the obtained accuracy depends strongly on the initial seeds (in addition to that, it does not predict the number of clusters or $K$ -value). The EM algorithm suffers mainly from slow convergence, especially for high-dimensional data.

Recently, a new approach has emerged that seeks to perform inference on a transformed domain (generally referred to as latent space), not the data itself. The rationale is that because the latent space often has fewer dimensions, it is more convenient computationally to perform inference (clustering) on it rather than on the high-dimensional data directly. A key aspect then is how to design a latent space that is amenable to accurate low-complexity unsupervised clustering, i.e., one that preserves only those features of the observed high-dimensional data that are useful for clustering while removing all redundant or non-relevant information. Along this line of work, we can mention [5], which utilized Principal Component Analysis (PCA) [6, 7] for dimensionality reduction followed by $K$ -means for clustering the obtained reduced dimension data; or [8], which used a combination of PCA and the EM algorithm. Other works that used alternatives for the linear PCA include kernel PCA [9], which employs PCA in a non-linear fashion to maximize variance in the data.

Tishby’s Information Bottleneck (IB) method [10] formulates the problem of finding a good representation $\mathbf{U}$ that strikes the right balance between capturing all information about the categorical variable $C$ that is contained in the observation $\mathbf{X}$ and using the most concise representation for it. The IB problem can be written as the following Lagrangian optimization

$$\min_{P_{\mathbf{U}|\mathbf{X}}}\; I(\mathbf{X};\mathbf{U})-sI(C;\mathbf{U}),$$

where $I(\cdot\>;\>\cdot)$ denotes Shannon’s mutual information and $s$ is a Lagrange-type parameter, which controls the trade-off between accuracy and regularization. In [11, 12], a text clustering algorithm is introduced for the case in which the joint probability distribution of the input data is known. This text clustering algorithm uses the IB method with an annealing procedure, where the parameter $s$ is increased gradually. When $s\rightarrow 0$ , the representation $\mathbf{U}$ takes its most compact form, i.e., $|\mathcal{U}|=1$ , which corresponds to the maximum compression. As the parameter $s$ is gradually increased, the emphasis on the relevance term $I(C;\mathbf{U})$ increases, and at a critical value of $s$ , the optimization focuses not only on the compression, but also on the relevance term. To meet the demand on the relevance term, the cardinality of $\mathbf{U}$ bifurcates. This is referred to as a phase transition of the system. Further increases in the value of $s$ cause other phase transitions, hence additional splits of $|\mathcal{U}|$ until it reaches the desired level, e.g., $|\mathcal{U}|=|\mathcal{C}|$ .
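The trade-off described above can be made concrete on a small discrete example. The following is a minimal sketch, not the algorithm of [11, 12]: it evaluates the objective $I(\mathbf{X};\mathbf{U})-sI(C;\mathbf{U})$ for two hand-picked deterministic encoders on a toy joint distribution. The function names, the joint pmf `P_CX`, and both encoders are illustrative assumptions.

```python
import math


def mutual_information(joint):
    """Return I(A;B) in nats for a joint pmf stored as joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log(p / (p_a[a] * p_b[b]))
    return mi


def ib_objective(p_cx, encoder, s):
    """Return I(X;U) - s * I(C;U) for p_cx[c][x] and encoder[x][u] = P(u|x)."""
    n_c, n_x, n_u = len(p_cx), len(encoder), len(encoder[0])
    p_x = [sum(p_cx[c][x] for c in range(n_c)) for x in range(n_x)]
    # Joint of (X, U) and, using the Markov chain C - X - U, joint of (C, U).
    p_xu = [[p_x[x] * encoder[x][u] for u in range(n_u)] for x in range(n_x)]
    p_cu = [[sum(p_cx[c][x] * encoder[x][u] for x in range(n_x))
             for u in range(n_u)] for c in range(n_c)]
    return mutual_information(p_xu) - s * mutual_information(p_cu)


# Toy joint pmf P(C, X): two clusters, four data symbols (illustrative values).
P_CX = [[0.20, 0.20, 0.05, 0.05],
        [0.05, 0.05, 0.20, 0.20]]
TRIVIAL = [[1.0], [1.0], [1.0], [1.0]]                     # |U| = 1
SPLIT = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # |U| = 2

for s in (1.0, 5.0):
    print(s, ib_objective(P_CX, TRIVIAL, s), ib_objective(P_CX, SPLIT, s))
```

In this toy setting the single-symbol encoder attains the smaller objective at $s=1$, while the two-symbol encoder wins at $s=5$; the crossover sits at $s=I(\mathbf{X};\mathbf{U})/I(C;\mathbf{U})$ for the split encoder, roughly 3.6 here, which mimics the bifurcation of $|\mathcal{U}|$ as $s$ grows.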

However, in the real-world applications of clustering with large-scale datasets, the joint probability distributions of the datasets are unknown. In practice, the usage of Deep Neural Networks (DNN) for unsupervised clustering of high-dimensional data on a lower dimensional latent space has attracted considerable attention, especially with the advent of Autoencoder (AE) learning and the development of powerful tools to train them using standard backpropagation techniques [13, 14]. Advanced forms include Variational Autoencoders (VAE) [13, 14], which are generative variants of AE that regularize the structure of the latent space, and the more general Variational Information Bottleneck (VIB) of [15], which is a technique that is based on the Information Bottleneck method and seeks a better trade-off between accuracy and regularization than VAE via the introduction of a Lagrange-type parameter $s$ , which controls that trade-off and whose optimization is similar to deterministic annealing [12] or stochastic relaxation.

In this paper, we develop an unsupervised generative clustering framework that combines VIB and the Gaussian Mixture Model. Specifically, in our approach, we use the Variational Information Bottleneck method and model the latent space as a mixture of Gaussians. We derive a bound on the cost function of our model that generalizes the Evidence Lower Bound (ELBO) and provide a variational inference type algorithm that allows computing it. In the algorithm, the coders’ mappings are parametrized using Neural Networks (NN), and the bound is approximated by Monte Carlo sampling and optimized with stochastic gradient descent. Furthermore, we show how tuning the hyperparameter $s$ appropriately by gradually increasing its value with iterations (number of epochs) results in a better accuracy. Furthermore, the application of our algorithm to the unsupervised clustering of various datasets, including the MNIST [16], REUTERS [17], and STL-10 [18], allows a better clustering accuracy than previous state-of-the-art algorithms. For instance, we show that our algorithm performs better than the Variational Deep Embedding (VaDE) algorithm of [19], which is based on VAE and performs clustering by maximizing the ELBO. Our algorithm can be seen as a generalization of the VaDE, whose ELBO can be recovered by setting $s=1$ in our cost function. In addition, our algorithm also generalizes the VIB of [15], which models the latent space as an isotropic Gaussian, which is generally not expressive enough for the purpose of unsupervised clustering. Other related works, which are of lesser relevance to the contribution of this paper, are the Deep Embedded Clustering (DEC) of [20] and the Improved Deep Embedded Clustering (IDEC) of [21] and [22]. For a detailed survey of clustering with deep learning, the readers may refer to [23].

To the best of our knowledge, our algorithm achieves the best clustering accuracy among state-of-the-art algorithms of the unsupervised learning category that use deep neural networks without any prior knowledge regarding the labels (except the usual assumption that the number of classes is known). To achieve this accuracy: (i) we derive a cost function that contains the IB hyperparameter $s$ that controls optimal trade-offs between the accuracy and regularization of the model; (ii) we use a lower bound approximation for the KL term in the cost function, that does not depend on the clustering assignment probability (note that the clustering assignment is usually not accurate in the beginning of the training process); and (iii) we tune the hyperparameter $s$ by following an annealing approach that improves both the convergence and the accuracy of the proposed algorithm.

Throughout this paper, we use the following notation. Uppercase letters are used to denote random variables, e.g., $X$ ; lowercase letters are used to denote realizations of random variables, e.g., $x$ ; and calligraphic letters denote sets, e.g., $\mathcal{X}$ . The cardinality of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$ . Probability mass functions (pmfs) are denoted by $P_{X}(x)=\mathrm{Pr}\{X=x\}$ and, sometimes, for short, as $p(x)$ . Boldface uppercase letters denote vectors or matrices, e.g., $\mathbf{X}$ , where context should make the distinction clear. We denote the covariance of a zero mean, complex-valued, vector $\mathbf{X}$ by $\mathbf{\Sigma}_{\mathbf{x}}=\mathbb{E}[\mathbf{XX}^{{\dagger}}]$ , where $(\cdot)^{{\dagger}}$ indicates the conjugate transpose. For random variables $X$ and $Y$ , the entropy is denoted as $H(X)$ , i.e., $H(X)=\mathbb{E}_{P_{X}}[-\log P_{X}]$ , and the mutual information is denoted as $I(X;Y)$ , i.e., $I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)=\mathbb{E}_{P_{X,Y}}[\log\frac{P_{X,Y}}{P_{X}P_{Y}}]$ . Finally, for two probability measures $P_{X}$ and $Q_{X}$ on a random variable $X\in\mathcal{X}$ , the relative entropy or Kullback–Leibler divergence is denoted as $D_{\mathrm{KL}}(P_{X}\|Q_{X})$ , i.e., $D_{\mathrm{KL}}(P_{X}\|Q_{X})=\mathbb{E}_{P_{X}}[\log\frac{P_{X}}{Q_{X}}]$ .

II Proposed Model

In this section, we explain the proposed model, the so-called Variational Information Bottleneck with Gaussian Mixture Model (VIB-GMM), in which we use the VIB framework and model the latent space as a GMM. The proposed model is depicted in Figure 1, where the parameters $\pi_{c}$ , $\boldsymbol{\mu}_{c}$ , $\boldsymbol{\Sigma}_{c}$ , for all values of $c\in\mathcal{C}$ , are to be optimized jointly with those of the employed NNs as instantiation of the coders. Furthermore, the assignment probabilities are estimated based on the values of latent space vectors instead of the observations themselves, i.e., $P_{C|\mathbf{X}}=Q_{C|\mathbf{U}}$ . In the rest of this section, we elaborate on the inference and generative network models for our method, which are illustrated below.

II-A Inference Network Model

We assume that observed data $\mathbf{x}$ are generated from a GMM with $|\mathcal{C}|$ components. Then, the latent representation $\mathbf{u}$ is inferred according to the following procedure:

1. One of the components of the GMM is chosen according to a categorical variable $C$ .

2. The data $\mathbf{x}$ are generated from the $c^{\text{th}}$ component of the GMM, i.e., $P_{\mathbf{X}|C}\sim\mathcal{N}(\mathbf{x};\tilde{\boldsymbol{\mu}}_{c},\tilde{\boldsymbol{\Sigma}}_{c})$ .

3. Encoder maps $\mathbf{x}$ to a latent representation $\mathbf{u}$ according to $P_{\mathbf{U}|\mathbf{X}}\sim\mathcal{N}(\boldsymbol{\mu}_{\theta},\boldsymbol{\Sigma}_{\theta})$ .

3.1. The encoder is modeled with a DNN $f_{\theta}$ , which maps $\mathbf{x}$ to the parameters of a Gaussian distribution, i.e., $[\boldsymbol{\mu}_{\theta},\boldsymbol{\Sigma}_{\theta}]=f_{\theta}(\mathbf{x})$ .

3.2. The representation $\mathbf{u}$ is sampled from $\mathcal{N}(\boldsymbol{\mu}_{\theta},\boldsymbol{\Sigma}_{\theta})$ .

For the inference network, shown in Figure 3, the following Markov chain holds

$$C-\!\!\!\!\minuso\!\!\!\!-\mathbf{X}-\!\!\!\!\minuso\!\!\!\!-\mathbf{U}.$$

II-B Generative Network Model

Since the encoder extracts useful representations of the dataset and we assume that the dataset is generated from a GMM, we model our latent space also with a mixture of Gaussians. To do so, the categorical variable $C$ is embedded with the latent variable $\mathbf{U}$ . The reconstruction of the dataset is generated according to the following procedure:

1. One of the components of the GMM is chosen according to a categorical variable $C$ , with a prior $Q_{C}$ .

2. The representation $\mathbf{u}$ is generated from the $c^{\text{th}}$ component, i.e., $Q_{\mathbf{U}|C}\sim\mathcal{N}(\mathbf{u};\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_{c})$ .

3. The decoder maps the latent representation $\mathbf{u}$ to $\hat{\mathbf{x}}$ , which is the reconstruction of the source $\mathbf{x}$ by using the mapping $Q_{\mathbf{X}|\mathbf{U}}$ .

3.1. The decoder is modeled with a DNN $g_{\phi}$ that maps $\mathbf{u}$ to the estimate $\hat{\mathbf{x}}$ , i.e., $[\hat{\mathbf{x}}]=g_{\phi}(\mathbf{u})$ .

For the generative network, shown in Figure 3, the following Markov chain holds

$$C-\!\!\!\!\minuso\!\!\!\!-\mathbf{U}-\!\!\!\!\minuso\!\!\!\!-\mathbf{X}.$$
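The generative procedure of this subsection amounts to ancestral sampling along the chain $C$, $\mathbf{U}$, $\mathbf{X}$. The sketch below illustrates it under simplifying assumptions: diagonal covariances, and a fixed linear map `toy_decoder` standing in for the DNN $g_{\phi}$. All names and numerical values are hypothetical.

```python
import random


def sample_generative(pi, means, stds, decoder, rng):
    """Draw (c, u, x_hat): c ~ Q_C, u ~ N(mu_c, diag(std_c^2)), x_hat = decoder(u)."""
    c = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    u = [rng.gauss(m, sd) for m, sd in zip(means[c], stds[c])]
    return c, u, decoder(u)


def toy_decoder(u):
    """Hypothetical stand-in for the DNN g_phi: a fixed linear map R^2 -> R^3."""
    return [u[0] + u[1], u[0] - u[1], 2.0 * u[0]]


rng = random.Random(0)
pi = [0.3, 0.7]                      # prior Q_C (illustrative)
means = [[-2.0, 0.0], [2.0, 1.0]]    # component means mu_c
stds = [[0.5, 0.5], [0.3, 0.8]]      # per-dimension standard deviations
c, u, x_hat = sample_generative(pi, means, stds, toy_decoder, rng)
print(c, u, x_hat)
```

Because $\hat{\mathbf{x}}$ depends on $c$ only through $\mathbf{u}$, the sampler respects the Markov chain stated above.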

III Proposed Method

In this section, we present our clustering method. First, we provide a general cost function for the problem of the unsupervised clustering that we study here based on the variational IB framework; and we show that it generalizes the ELBO bound developed in [19]. We then parametrize our model using NNs whose parameters are optimized jointly with those of the GMM. Furthermore, we discuss the influence of the hyperparameter $s$ that controls optimal trade-offs between accuracy and regularization.

III-A Brief review of Variational Information Bottleneck for Unsupervised Learning

As described in Section II, the stochastic encoder $P_{\mathbf{U}|\mathbf{X}}$ maps the observed data $\mathbf{x}$ to a representation $\mathbf{u}$ . Similarly, the stochastic decoder $Q_{\mathbf{X}|\mathbf{U}}$ assigns an estimate $\hat{\mathbf{x}}$ of $\mathbf{x}$ based on the vector $\mathbf{u}$ . As per the IB method [10], a suitable representation $\mathbf{U}$ should strike the right balance between capturing all information about the categorical variable $C$ that is contained in the observation $\mathbf{X}$ and using the most concise representation for it. This leads to maximizing the following Lagrange problem

$$\mathcal{L}_{s}(\mathbf{P})=I(C;\mathbf{U})-sI(\mathbf{X};\mathbf{U}),$$

where $s\geq 0$ designates the Lagrange multiplier and, for convenience, $\mathbf{P}$ denotes the conditional distribution $P_{\mathbf{U}|\mathbf{X}}$ .

Instead of (4), which is not always computable in our unsupervised clustering setting, we find it convenient to maximize an upper bound of $\mathcal{L}_{s}(\mathbf{P})$ given by

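$$\tilde{\mathcal{L}}_{s}(\mathbf{P}):=I(\mathbf{X};\mathbf{U})-sI(\mathbf{X};\mathbf{U})\stackrel{(a)}{=}H(\mathbf{X})-H(\mathbf{X}|\mathbf{U})-s[H(\mathbf{U})-H(\mathbf{U}|\mathbf{X})],$$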

where $(a)$ is due to the definition of mutual information (using the Markov chain $C-\!\!\!\!\minuso\!\!\!\!-\mathbf{X}-\!\!\!\!\minuso\!\!\!\!-\mathbf{U}$ , it is easy to see that $\tilde{\mathcal{L}}_{s}(\mathbf{P})\geq\mathcal{L}_{s}(\mathbf{P})$ for all values of $\mathbf{P}$ ). Noting that $H(\mathbf{X})$ is constant with respect to $P_{\mathbf{U}|\mathbf{X}}$ , maximizing $\tilde{\mathcal{L}}_{s}(\mathbf{P})$ over $\mathbf{P}$ is equivalent to maximizing

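$$\mathcal{L}^{\prime}_{s}(\mathbf{P}):=-H(\mathbf{X}|\mathbf{U})-s[H(\mathbf{U})-H(\mathbf{U}|\mathbf{X})]=\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[\log P_{\mathbf{X}|\mathbf{U}}+s\log P_{\mathbf{U}}-s\log P_{\mathbf{U}|\mathbf{X}}]\Big].$$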

For a variational distribution $Q_{\mathbf{U}}$ on $\mathcal{U}$ (instead of the unknown $P_{\mathbf{U}}$ ) and a variational stochastic decoder $Q_{\mathbf{X}|\mathbf{U}}$ (instead of the unknown optimal decoder $P_{\mathbf{X}|\mathbf{U}}$ ), let $\mathbf{Q}:=\{Q_{\mathbf{X}|\mathbf{U}},Q_{\mathbf{U}}\}$ . Furthermore, let

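$$\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q}):=\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[\log Q_{\mathbf{X}|\mathbf{U}}]-sD_{\mathrm{KL}}(P_{\mathbf{U}|\mathbf{X}}\|Q_{\mathbf{U}})\Big].$$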

Lemma 1.

For given $\mathbf{P}$ , we have:

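$$\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})\leq\mathcal{L}^{\prime}_{s}(\mathbf{P}),\quad\text{for all }\mathbf{Q}.$$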

In addition, there exists a unique $\mathbf{Q}$ that achieves the maximum $\operatorname*{max}_{\mathbf{Q}}\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})=\mathcal{L}^{\prime}_{s}(\mathbf{P})$ and is given by

$$Q_{\mathbf{X}|\mathbf{U}}^{*}=P_{\mathbf{X}|\mathbf{U}},\quad Q_{\mathbf{U}}^{*}=P_{\mathbf{U}}.$$

Proof.

The proof of Lemma 1 is given in Appendix A. ∎
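Lemma 1 can be checked numerically on a small discrete example. The sketch below is an illustration rather than part of the proof: it evaluates $\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})$ for an arbitrary $\mathbf{Q}$, obtains $\mathcal{L}^{\prime}_{s}(\mathbf{P})$ by plugging in $Q^{*}_{\mathbf{X}|\mathbf{U}}=P_{\mathbf{X}|\mathbf{U}}$ and $Q^{*}_{\mathbf{U}}=P_{\mathbf{U}}$, and compares the gap with $\mathbb{E}_{P_{\mathbf{U}}}[D_{\mathrm{KL}}(P_{\mathbf{X}|\mathbf{U}}\|Q_{\mathbf{X}|\mathbf{U}})]+sD_{\mathrm{KL}}(P_{\mathbf{U}}\|Q_{\mathbf{U}})$. The alphabets, the pmfs, and the function names are assumptions chosen for illustration.

```python
import math


def optimal_q(p_x, encoder):
    """Return (P_{X|U}, P_U) induced by p_x[x] and encoder[x][u] = P(u|x)."""
    n_x, n_u = len(p_x), len(encoder[0])
    p_u = [sum(p_x[x] * encoder[x][u] for x in range(n_x)) for u in range(n_u)]
    p_x_given_u = [[p_x[x] * encoder[x][u] / p_u[u] for x in range(n_x)]
                   for u in range(n_u)]
    return p_x_given_u, p_u


def vb_cost(p_x, encoder, q_x_given_u, q_u, s):
    """L_s^VB(P, Q) = E[log Q(x|u)] - s * E[KL(P(.|x) || Q_U)], in nats."""
    total = 0.0
    for x, px in enumerate(p_x):
        for u, pux in enumerate(encoder[x]):
            if pux > 0.0:
                total += px * pux * (math.log(q_x_given_u[u][x])
                                     - s * math.log(pux / q_u[u]))
    return total


def lemma_gap(p_x, encoder, q_x_given_u, q_u, s):
    """E_{P_U}[KL(P_{X|U} || Q_{X|U})] + s * KL(P_U || Q_U), in nats."""
    p_x_given_u, p_u = optimal_q(p_x, encoder)
    gap = 0.0
    for u, pu in enumerate(p_u):
        gap += s * pu * math.log(pu / q_u[u])
        for x, pxu in enumerate(p_x_given_u[u]):
            if pxu > 0.0:
                gap += pu * pxu * math.log(pxu / q_x_given_u[u][x])
    return gap


# Illustrative pmfs: three data symbols, two latent symbols.
p_x = [0.5, 0.3, 0.2]
encoder = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
q_x_given_u = [[0.4, 0.3, 0.3], [0.2, 0.3, 0.5]]   # arbitrary decoder Q(x|u)
q_u = [0.5, 0.5]                                   # arbitrary prior Q_U
s = 2.0

q_star_x, q_star_u = optimal_q(p_x, encoder)
l_prime = vb_cost(p_x, encoder, q_star_x, q_star_u, s)   # equals L'_s(P)
l_vb = vb_cost(p_x, encoder, q_x_given_u, q_u, s)
print(l_prime - l_vb, lemma_gap(p_x, encoder, q_x_given_u, q_u, s))
```

The two printed numbers should agree up to floating-point error and be non-negative, in line with the statement that the bound is tight only at $\mathbf{Q}^{*}$.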

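Using Lemma 1, maximization of (6) can be written in terms of the variational IB cost as follows

$$\max_{\mathbf{P}}\mathcal{L}^{\prime}_{s}(\mathbf{P})=\max_{\mathbf{P}}\max_{\mathbf{Q}}\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q}).$$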

Next, we develop an algorithm that solves the maximization problem (9), where the encoding mapping $P_{\mathbf{U}|\mathbf{X}}$ , the decoding mapping $Q_{\mathbf{X}|\mathbf{U}}$ , as well as the prior distribution of the latent space $Q_{\mathbf{U}}$ are optimized jointly.

Remark 1.

As we already mentioned in the beginning of this section, the related work [19] performed unsupervised clustering by combining VAE with GMM. Specifically, it maximizes the following ELBO bound

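$$\mathcal{L}_{1}^{\mathrm{VaDE}}:=\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[\log Q_{\mathbf{X}|\mathbf{U}}]-D_{\mathrm{KL}}(P_{C|\mathbf{X}}\|Q_{C})-\mathbb{E}_{P_{C|\mathbf{X}}}[D_{\mathrm{KL}}(P_{\mathbf{U}|\mathbf{X}}\|Q_{\mathbf{U}|C})]\Big].$$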

Let, for an arbitrary non-negative parameter $s$ , $\mathcal{L}_{s}^{\mathrm{VaDE}}$ be a generalization of the ELBO bound in (10) of [19] given by

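$$\mathcal{L}_{s}^{\mathrm{VaDE}}:=\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[\log Q_{\mathbf{X}|\mathbf{U}}]-sD_{\mathrm{KL}}(P_{C|\mathbf{X}}\|Q_{C})-s\mathbb{E}_{P_{C|\mathbf{X}}}[D_{\mathrm{KL}}(P_{\mathbf{U}|\mathbf{X}}\|Q_{\mathbf{U}|C})]\Big].$$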

Investigating the right-hand side (RHS) of (11), we get

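$$\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})=\mathcal{L}_{s}^{\mathrm{VaDE}}+s\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[D_{\mathrm{KL}}(P_{C|\mathbf{X}}\|Q_{C|\mathbf{U}})]\Big],$$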

where the equality holds since

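$$\begin{aligned}\mathcal{L}_{s}^{\mathrm{VaDE}}&\stackrel{(a)}{=}\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}[\log Q_{\mathbf{X}|\mathbf{U}}]-sD_{\mathrm{KL}}(P_{\mathbf{U}|\mathbf{X}}\|Q_{\mathbf{U}})-s\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}\big[D_{\mathrm{KL}}(P_{C|\mathbf{X}}\|Q_{C|\mathbf{U}})\big]\Big]\\&\stackrel{(b)}{=}\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})-s\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\mathbf{U}|\mathbf{X}}}\big[D_{\mathrm{KL}}(P_{C|\mathbf{X}}\|Q_{C|\mathbf{U}})\big]\Big],\end{aligned}$$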

where $(a)$ can be obtained by expanding and re-arranging terms under the Markov chain $C-\!\!\!\!\minuso\!\!\!\!-\mathbf{X}-\!\!\!\!\minuso\!\!\!\!-\mathbf{U}$ (for a detailed treatment, please look at Appendix B); and $(b)$ follows from the definition of $\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})$ in (8).

Thus, by the non-negativity of relative entropy, it is clear that $\mathcal{L}_{s}^{\mathrm{VaDE}}$ is a lower bound on $\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})$ . Furthermore, if the variational distribution $\mathbf{Q}$ is such that the conditional marginal $Q_{C|\mathbf{U}}$ is equal to $P_{C|\mathbf{X}}$ , the bound is tight since the relative entropy term is zero in this case. $\blacksquare$

III-B Proposed Algorithm: VIB-GMM

In order to compute (9), we parametrize the distributions $P_{\mathbf{U}|\mathbf{X}}$ and $Q_{\mathbf{X}|\mathbf{U}}$ using DNNs. For instance, let the stochastic encoder $P_{\mathbf{U}|\mathbf{X}}$ be a DNN $f_{\theta}$ and the stochastic decoder $Q_{\mathbf{X}|\mathbf{U}}$ be a DNN $g_{\boldsymbol{\phi}}$ . That is

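$$P_{\theta}(\mathbf{u}|\mathbf{x})=\mathcal{N}(\mathbf{u};\boldsymbol{\mu}_{\theta},\boldsymbol{\Sigma}_{\theta}),\ \text{where }[\boldsymbol{\mu}_{\theta},\boldsymbol{\Sigma}_{\theta}]=f_{\theta}(\mathbf{x}),\qquad Q_{\phi}(\mathbf{x}|\mathbf{u})=g_{\phi}(\mathbf{u})=[\hat{\mathbf{x}}],$$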

where $\theta$ and $\phi$ are the weight and bias parameters of the DNNs. Furthermore, the latent space is modeled as a GMM with $|\mathcal{C}|$ components with parameters $\psi:=\{\pi_{c},\boldsymbol{\mu}_{c},\mathbf{\Sigma}_{c}\}_{c=1}^{|\mathcal{C}|}$ , i.e.,

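$$Q_{\psi}(\mathbf{u})=\sum_{c}\pi_{c}\,\mathcal{N}(\mathbf{u};\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_{c}).$$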

Using the parametrizations above, the optimization of (9) can be rewritten as

$$\max_{\theta,\phi,\psi}\mathcal{L}_{s}^{\mathrm{NN}}(\theta,\phi,\psi),$$

where the cost function $\mathcal{L}_{s}^{\mathrm{NN}}(\theta,\phi,\psi)$ is given by

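$$\mathcal{L}_{s}^{\mathrm{NN}}(\theta,\phi,\psi):=\mathbb{E}_{P_{\mathbf{X}}}\Big[\mathbb{E}_{P_{\theta}(\mathbf{U}|\mathbf{X})}[\log Q_{\phi}(\mathbf{X}|\mathbf{U})]-sD_{\mathrm{KL}}(P_{\theta}(\mathbf{U}|\mathbf{X})\|Q_{\psi}(\mathbf{U}))\Big].$$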

Then, for given observations of $N$ samples, i.e., $\{\mathbf{x}_{i}\}_{i=1}^{N}$ , (18) can be approximated in terms of an empirical cost as follows

$$\max_{\theta,\phi,\psi}\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{s,i}^{\mathrm{emp}}(\theta,\phi,\psi),$$

where $\mathcal{L}_{s,i}^{\mathrm{emp}}(\theta,\phi,\psi)$ is the empirical cost for the $i^{\text{th}}$ observation $\mathbf{x}_{i}$ and given by

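$$\mathcal{L}_{s,i}^{\mathrm{emp}}(\theta,\phi,\psi)=\mathbb{E}_{P_{\theta}(\mathbf{U}_{i}|\mathbf{X}_{i})}[\log Q_{\phi}(\mathbf{X}_{i}|\mathbf{U}_{i})]-sD_{\mathrm{KL}}(P_{\theta}(\mathbf{U}_{i}|\mathbf{X}_{i})\|Q_{\psi}(\mathbf{U}_{i})).$$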

Furthermore, the first term of the RHS of (21) can be computed using Monte Carlo sampling and the reparametrization trick [13]. In particular, $P_{\theta}(\mathbf{u}|\mathbf{x})$ can be sampled by first sampling a random variable $\mathbf{Z}$ with distribution $P_{\mathbf{Z}}$ , i.e., $P_{\mathbf{Z}}=\mathcal{N}(\mathbf{0},\mathbf{I})$ , then transforming the samples using some function $\tilde{f}_{\theta}:\mathcal{X}\times\mathcal{Z}\rightarrow\mathcal{U}$ , i.e., $\mathbf{u}=\tilde{f}_{\theta}(\mathbf{x},\mathbf{z})$ . Thus,

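$$\mathbb{E}_{P_{\theta}(\mathbf{U}_{i}|\mathbf{X}_{i})}[\log Q_{\phi}(\mathbf{X}_{i}|\mathbf{U}_{i})]=\frac{1}{M}\sum_{m=1}^{M}\log q(\mathbf{x}_{i}|\mathbf{u}_{i,m}),\quad\mathbf{u}_{i,m}=\boldsymbol{\mu}_{\theta,i}+\boldsymbol{\Sigma}_{\theta,i}^{\frac{1}{2}}\cdot\boldsymbol{\epsilon}_{m},\quad\boldsymbol{\epsilon}_{m}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),$$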

where $M$ is the number of samples for the Monte Carlo sampling step.
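The reparametrized Monte Carlo estimate of the reconstruction term can be sketched in a few lines. This is a minimal illustration assuming a diagonal encoder covariance (so that $\boldsymbol{\Sigma}^{1/2}$ reduces to per-dimension standard deviations) and a hypothetical decoder likelihood `toy_log_likelihood`; in the actual method the likelihood comes from the decoder DNN $g_{\phi}$.

```python
import math
import random


def reparameterize(mu, std, rng):
    """Return u = mu + std * eps with eps ~ N(0, I) (diagonal covariance)."""
    return [m + sd * rng.gauss(0.0, 1.0) for m, sd in zip(mu, std)]


def mc_reconstruction_term(x, mu, std, log_likelihood, num_samples, rng):
    """Average log q(x | u_m) over num_samples reparameterized draws u_m."""
    total = 0.0
    for _ in range(num_samples):
        total += log_likelihood(x, reparameterize(mu, std, rng))
    return total / num_samples


def toy_log_likelihood(x, u):
    """Hypothetical decoder: unit-variance Gaussian centred at u itself."""
    sq = sum((xj - uj) ** 2 for xj, uj in zip(x, u))
    return -0.5 * sq - 0.5 * len(x) * math.log(2.0 * math.pi)


rng = random.Random(0)
estimate = mc_reconstruction_term([0.5, -0.5], [0.0, 0.0], [0.5, 0.5],
                                  toy_log_likelihood, 1000, rng)
print(estimate)
```

Writing the sample as a deterministic function of $(\boldsymbol{\mu},\boldsymbol{\sigma})$ and an independent noise draw is what lets gradients flow to the encoder parameters when this estimate is computed inside an automatic-differentiation framework.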

The second term of the RHS of (21) is the KL divergence between a single component multivariate Gaussian and a GMM with $|\mathcal{C}|$ components. An exact closed-form solution for the calculation of this term does not exist. However, a variational lower bound approximation [24] of it can be obtained as

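$$D_{\mathrm{KL}}(P_{\theta}(\mathbf{U}_{i}|\mathbf{X}_{i})\|Q_{\psi}(\mathbf{U}_{i}))=-\log\sum_{c=1}^{|\mathcal{C}|}\pi_{c}\exp\Big(-D_{\mathrm{KL}}\big(\mathcal{N}(\boldsymbol{\mu}_{\theta,i},\boldsymbol{\Sigma}_{\theta,i})\|\mathcal{N}(\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_{c})\big)\Big).$$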

In particular, in the specific case in which the covariance matrices are diagonal, i.e., $\mathbf{\Sigma}_{\theta,i}:=\mathrm{diag}(\{\sigma_{\theta,i,j}^{2}\}_{j=1}^{n_{u}})$ and $\mathbf{\Sigma}_{c}:=\mathrm{diag}(\{\sigma_{c,j}^{2}\}_{j=1}^{n_{u}})$ , with $n_{u}$ denoting the latent space dimension, (22) can be computed as follows

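$$D_{\mathrm{KL}}(P_{\theta}(\mathbf{U}_{i}|\mathbf{X}_{i})\|Q_{\psi}(\mathbf{U}_{i}))=-\log\sum_{c=1}^{|\mathcal{C}|}\pi_{c}\exp\bigg(-\frac{1}{2}\sum_{j=1}^{n_{u}}\Big[\frac{(\mu_{\theta,i,j}-\mu_{c,j})^{2}}{\sigma_{c,j}^{2}}+\log\frac{\sigma_{c,j}^{2}}{\sigma_{\theta,i,j}^{2}}-1+\frac{\sigma_{\theta,i,j}^{2}}{\sigma_{c,j}^{2}}\Big]\bigg),$$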

where $\mu_{\theta,i,j}$ and $\sigma_{\theta,i,j}^{2}$ are the mean and variance of the $i^{\text{th}}$ representation in the $j^{\text{th}}$ dimension of the latent space. Furthermore, $\mu_{c,j}$ and $\sigma_{c,j}^{2}$ represent the mean and variance of the $c^{\text{th}}$ component of the GMM in the $j^{\text{th}}$ dimension of the latent space.
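For diagonal covariances, the approximation of the KL term is straightforward to evaluate. The following sketch implements the expression $-\log\sum_{c}\pi_{c}\exp(-D_{\mathrm{KL},c})$ with the per-component Gaussian KL in closed form; the log-sum-exp shift is an implementation choice added here for numerical stability, and the function names are illustrative. It assumes strictly positive mixture weights and variances.

```python
import math


def gaussian_kl_diag(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, diag(var_p)) || N(mu_q, diag(var_q))) in nats."""
    return 0.5 * sum((mp - mq) ** 2 / vq + math.log(vq / vp) - 1.0 + vp / vq
                     for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q))


def kl_gaussian_to_gmm(mu, var, weights, means, variances):
    """Approximate KL between N(mu, diag(var)) and a diagonal GMM.

    Evaluates -log sum_c pi_c * exp(-KL_c) using a log-sum-exp shift,
    where KL_c is the KL divergence to the c-th mixture component.
    """
    terms = [math.log(w) - gaussian_kl_diag(mu, var, m, v)
             for w, m, v in zip(weights, means, variances)]
    shift = max(terms)
    return -(shift + math.log(sum(math.exp(t - shift) for t in terms)))


# Encoder output for one sample and a two-component GMM (illustrative values).
value = kl_gaussian_to_gmm([0.0, 0.0], [1.0, 1.0],
                           [0.4, 0.6],
                           [[0.0, 0.0], [3.0, 3.0]],
                           [[1.0, 1.0], [0.5, 0.5]])
print(value)
```

With a single component of weight one, the expression reduces to the exact KL divergence between two Gaussians, which gives a convenient sanity check.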

Finally, we train NNs to maximize the cost function (19) over the parameters $\theta,\phi$ , as well as those $\psi$ of the GMM. For the training step, we use the ADAM optimization tool [25]. The training procedure is detailed in Algorithm 1.

Once our model is trained, we assign the given dataset into the clusters. As mentioned in Section II, we do the assignment from the latent representations, i.e., $Q_{C|\mathbf{U}}=P_{C|\mathbf{X}}$ . Hence, the probability that the observed data $\mathbf{x}_{i}$ belongs to the $c^{\text{th}}$ cluster is computed as follows

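$$p(c|\mathbf{x}_{i})=q(c|\mathbf{u}_{i})=\frac{q_{\psi^{\star}}(c)\,q_{\psi^{\star}}(\mathbf{u}_{i}|c)}{q_{\psi^{\star}}(\mathbf{u}_{i})}=\frac{\pi_{c}^{\star}\,\mathcal{N}(\mathbf{u}_{i};\boldsymbol{\mu}_{c}^{\star},\boldsymbol{\Sigma}_{c}^{\star})}{\sum_{c}\pi_{c}^{\star}\,\mathcal{N}(\mathbf{u}_{i};\boldsymbol{\mu}_{c}^{\star},\boldsymbol{\Sigma}_{c}^{\star})},$$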

where ⋆ indicates the optimal values of the parameters as found at the end of the training phase. Finally, the right cluster is picked based on the largest assignment probability value.
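The assignment step is Bayes' rule applied to the trained GMM in the latent space. A minimal sketch follows, assuming diagonal covariances and strictly positive mixture weights; the function names are illustrative, and which latent vector $\mathbf{u}_{i}$ is passed in (for example the encoder mean or a sample) is left to the caller.

```python
import math


def log_normal_diag(u, mean, var):
    """Log-density of N(mean, diag(var)) evaluated at u."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (uj - m) ** 2 / v
                      for uj, m, v in zip(u, mean, var))


def assignment_probabilities(u, weights, means, variances):
    """Return q(c | u) for every component c, normalised with log-sum-exp."""
    logs = [math.log(w) + log_normal_diag(u, m, v)
            for w, m, v in zip(weights, means, variances)]
    shift = max(logs)
    unnorm = [math.exp(value - shift) for value in logs]
    total = sum(unnorm)
    return [p / total for p in unnorm]


def assign_cluster(u, weights, means, variances):
    """Pick the component with the largest assignment probability."""
    probs = assignment_probabilities(u, weights, means, variances)
    return max(range(len(probs)), key=probs.__getitem__)


# Illustrative two-component GMM in a two-dimensional latent space.
weights = [0.3, 0.7]
means = [[-2.0, 0.0], [2.0, 1.0]]
variances = [[0.25, 0.25], [0.09, 0.64]]
print(assignment_probabilities([-1.9, 0.1], weights, means, variances))
print(assign_cluster([-1.9, 0.1], weights, means, variances))
```

Working with log-densities and subtracting the maximum before exponentiating avoids underflow when a latent vector lies far from every component.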

Remark 2.

It is worth mentioning that with the use of the KL approximation as given by (22), our algorithm does not require the assumption $P_{C|\mathbf{U}}=Q_{C|\mathbf{U}}$ to hold (which is different from [19]). Furthermore, the algorithm is guaranteed to converge. However, the convergence may be to (only) local minima; and this is due to the problem (18) being generally non-convex. Related to this aspect, we mention that while without a proper pre-training, the accuracy of the VaDE algorithm may not be satisfactory, in our case, the above assumption is only used in the final assignment after the training phase is completed. $\blacksquare$

Remark 3.

In [26], it is stated that optimizing the original IB problem with the assumption of independent latent representations amounts to disentangled representations. It is noteworthy that with such an assumption, the computational complexity can be reduced from $\mathcal{O}(n_{u}^{2})$ to $\mathcal{O}(n_{u})$ . Furthermore, as argued in [26], the assumption often results only in some marginal performance loss; and for this reason, it is adopted in many machine learning applications. $\blacksquare$

III-C Effect of the Hyperparameter

As already mentioned, the hyperparameter $s$ controls the trade-off between the relevance of the representation $\mathbf{U}$ and its complexity. As can be seen from (19), for small values of $s$ it is the cross-entropy term that dominates, i.e., the algorithm trains the parameters so as to reproduce $\mathbf{X}$ as accurately as possible. For large values of $s$, however, it is most important for the NN to produce an encoded version of $\mathbf{X}$ whose distribution matches the prior distribution of the latent space, i.e., the term $D_{\mathrm{KL}}(P_{\boldsymbol{\theta}}(\mathbf{U}|\mathbf{X})\|Q_{\boldsymbol{\psi}}(\mathbf{U}))$ is nearly zero.

In the beginning of the training process, the GMM components are randomly selected; and so, starting with a large value of the hyperparameter $s$ is likely to steer the solution towards an irrelevant prior. Hence, for the tuning of the hyperparameter $s$ in practice, it is more efficient to start with a small value of $s$ and gradually increase it with the number of epochs. This has the advantage of avoiding possible local minima, an aspect that is reminiscent of deterministic annealing [12], where $s$ plays the role of the temperature parameter. The experiments that will be reported in the next section show that proceeding in the above-described manner for the selection of the parameter $s$ helps in obtaining higher clustering accuracy and better robustness to the initialization (i.e., no need for a strong pretraining). The pseudocode for annealing is given in Algorithm 2.
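The annealing idea described above can be sketched as a simple outer loop around training. This is only a sketch under stated assumptions: the update rule $s\leftarrow(1+\epsilon_{s})\,s$ is one plausible choice (an additive step would serve equally well), and `train_epochs` is a hypothetical callback standing in for a fixed number of epochs of training at a given $s$.

```python
def annealed_s_values(s_min, s_max, eps_s):
    """Values of s visited when starting at s_min and growing up to s_max.

    Assumption: multiplicative update s <- (1 + eps_s) * s.
    """
    if s_min <= 0 or eps_s <= 0:
        raise ValueError("s_min and eps_s must be positive")
    values = []
    s = s_min
    while s <= s_max:
        values.append(s)
        s = (1.0 + eps_s) * s
    return values


def train_with_annealing(train_epochs, s_min, s_max, eps_s, n_epoch):
    """Run n_epoch epochs at each visited value of s, smallest first.

    train_epochs(s, n_epoch) is a hypothetical callback that is expected to
    keep the network and GMM parameters between calls (warm start).
    """
    for s in annealed_s_values(s_min, s_max, eps_s):
        train_epochs(s, n_epoch)
```

A smaller `eps_s` visits more values of $s$ and therefore costs more training time, which is the trade-off between computation and the risk of skipping a transition.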

Remark 4.

As mentioned before, Slonim et al. [12, 11] introduced a text clustering algorithm that uses the IB method with an annealing procedure in which the parameter $s$ is increased gradually. In [12], critical values of $s$ (so-called phase transitions) are observed: if these values are missed while $s$ is increased, the algorithm ends up with the wrong clusters. The choice of the step size in the update of $s$ is therefore very important. Tuning $s$ is equally critical in our algorithm: the step size $\epsilon_{s}$ in the update of $s$ should be chosen carefully, since otherwise phase transitions might be skipped, which would degrade the clustering accuracy. However, the choice of the appropriate step size (typically very small) is rather heuristic, and there exists no concrete method for choosing the right value. The choice of step size can be seen as a trade-off between the amount of computational resources spent on running the algorithm and the degree of confidence that the scanned values of $s$ do not miss the phase transitions. $\blacksquare$

IV Experiments

IV-A Description of the Datasets Used

In our empirical experiments, we apply our algorithm to the clustering of the following datasets.

MNIST: A dataset of 70,000 gray-scale images of handwritten digits, each of dimensions $28\times 28$ pixels.

STL-10: A dataset of color images collected from 10 categories. Each category consists of 1300 images of size $96\times 96$ (pixels) $\times 3$ (RGB code). Hence, the original input dimension $n_{x}$ is 27,648. For this dataset, we use a pretrained convolutional NN model, i.e., ResNet-50 [27], to reduce the dimensionality of the input. This preprocessing reduces the input dimension to 2048. Then, our algorithm and the other baselines are used for clustering.

REUTERS10K: A dataset that is composed of 810,000 English stories labeled with a category tree. As in [20], 4 root categories (corporate/industrial, government/social, markets, economics) are selected as labels, and all documents with multiple labels are discarded. Then, tf-idf features are computed on the 2000 most frequently occurring words. Finally, 10,000 samples are taken randomly, which are referred to as the REUTERS10K dataset.

IV-B Network Settings and Other Parameters

We use the following network architecture: the encoder is modeled with NNs with 3 hidden layers with dimensions $n_{x}-500-500-2000-n_{u}$, where $n_{x}$ is the input dimension and $n_{u}$ is the dimension of the latent space. The decoder consists of NNs with dimensions $n_{u}-2000-500-500-n_{x}$. All layers are fully connected. For comparison purposes, we chose the architecture of the hidden layers as well as the dimension of the latent space $n_{u}=10$ to coincide with those made for the DEC algorithm of [20] and the VaDE algorithm of [19]. All layers of the encoder and decoder except the last are activated with the ReLU function. For the last (i.e., latent) layer of the encoder we use a linear activation; and for the last (i.e., output) layer of the decoder we use the sigmoid function for MNIST and a linear activation for the remaining datasets. The batch size is 100 and the variational bound (20) is maximized by the Adam optimizer of [25]. The learning rate is initialized to 0.002 and decreased gradually every 20 epochs with a decay rate of 0.9 until it reaches a small value (0.0005 in our experiments). The reconstruction loss is calculated using the cross-entropy criterion for MNIST and the mean squared error function for the other datasets.
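The layer widths and the learning-rate schedule described above can be written down compactly. This is a minimal sketch, assuming that the decay is applied as a step function every 20 epochs and that the rate is clamped at the floor value once the decayed rate falls below it; the function names are hypothetical and no particular deep learning framework is implied.

```python
def layer_sizes(n_x, n_u=10):
    """Widths of the fully connected encoder and decoder stacks."""
    encoder = [n_x, 500, 500, 2000, n_u]
    decoder = encoder[::-1]  # mirrored: n_u - 2000 - 500 - 500 - n_x
    return encoder, decoder


def learning_rate(epoch, initial=0.002, decay=0.9, every=20, floor=0.0005):
    """Step-decayed learning rate at a given (0-based) epoch.

    Assumption: the rate is multiplied by `decay` once per `every` epochs
    and is not allowed to drop below `floor`.
    """
    return max(initial * decay ** (epoch // every), floor)
```

With these values the floor is reached after 280 epochs, since $0.002\times 0.9^{14}\approx 0.00046<0.0005$.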

IV-C Clustering Accuracy

We evaluate the performance of our algorithm in terms of the so-called unsupervised clustering accuracy (ACC), which is a widely used metric in the context of unsupervised learning [23]. For comparison purposes, we also report the ACC of previous state-of-the-art algorithms.
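For concreteness, ACC is commonly defined as the fraction of samples that are correctly labeled under the best one-to-one mapping between cluster indices and ground-truth labels. The sketch below assumes that definition and finds the best mapping by brute force over permutations, which is practical only for a handful of clusters; the function name is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best-case accuracy over one-to-one cluster-to-label mappings."""
    if len(true_labels) != len(cluster_ids) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    # Count how often each (cluster, label) pair occurs.
    counts = {}
    for t, c in zip(true_labels, cluster_ids):
        counts[(c, t)] = counts.get((c, t), 0) + 1
    # Pad the shorter side with dummy entries so the mapping is one-to-one.
    size = max(len(labels), len(clusters))
    labels += [None] * (size - len(labels))
    clusters += [None] * (size - len(clusters))
    best = 0
    for perm in permutations(labels):
        matched = sum(counts.get((c, t), 0) for c, t in zip(clusters, perm))
        best = max(best, matched)
    return best / len(true_labels)
```

For ten or more clusters the factorial search becomes slow; in practice the same maximum is usually obtained with the Hungarian algorithm, for instance via `scipy.optimize.linear_sum_assignment` applied to the negated count matrix.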

For each of the aforementioned datasets, we run our VIB-GMM algorithm for various values of the hyperparameter $s$ inside an interval $[s_{\mathrm{min}},s_{\mathrm{max}}]$, starting from the smaller value $s_{\mathrm{min}}$ and gradually increasing the value of $s$ every $n_{\text{epoch}}$ epochs. For the MNIST dataset, we set $(s_{\mathrm{min}},s_{\mathrm{max}},n_{\text{epoch}})=(1,5,500)$; and for the STL-10 dataset and the REUTERS10K dataset, we choose these parameters to be $(1,20,500)$ and $(1,5,100)$, respectively. The obtained ACC results are reported in Table I and Table II. It is important to note that the reported ACC results are obtained by running each algorithm ten times. For the case in which there is no pretraining (in [19] and [20], the DEC and VaDE algorithms are proposed to be used with pretraining; more specifically, the DNNs are initialized with a stacked autoencoder [28]), Table I states the accuracies of the best case run and average case run for the MNIST and STL-10 datasets. It is seen that our algorithm significantly outperforms the DEC algorithm of [20], as well as the VaDE algorithm of [19] and GMM, for both the best case run and the average case run. In Table I, the values in parentheses correspond to the standard deviations of the clustering accuracies. As seen, the standard deviation of our algorithm VIB-GMM is lower than that of VaDE, which can be explained by the robustness of VIB-GMM to the absence of pretraining. For the case in which there is pretraining, Table II states the accuracies of the best case run and average case run for the MNIST and REUTERS10K datasets. A stacked autoencoder is used to pretrain the DNNs of the encoder and decoder before running the algorithms (the DNNs are initialized with the same weights and biases as in [19]). Here too, our algorithm significantly outperforms the DEC algorithm of [20], as well as the VaDE algorithm of [19] and GMM, for both the best case run and the average case run.
The effect of pretraining can be observed comparing Table I and Table II for MNIST. Using a stacked autoencoder prior to running the VaDE and VIB-GMM algorithms results in a higher accuracy, as well as a lower standard deviation of accuracies; therefore, supporting the algorithms with a stacked autoencoder is beneficial for a more robust system. Finally, for the STL-10 dataset, Figure 5 depicts the evolution of the best case ACC with iterations (number of epochs) for the four compared algorithms.

Figure 5 shows the evolution of the reconstruction loss of our VIB-GMM algorithm for the STL-10 dataset, as a function of simultaneously varying the values of the hyperparameter $s$ and the number of epochs (recall that, as per the described methodology, we start with $s=s_{\mathrm{min}}$, and we increase its value gradually every $n_{\text{epoch}}=500$ epochs). As can be seen from the figure, the first few epochs are spent almost entirely on reducing the reconstruction loss (i.e., a fitting phase), and most of the remaining epochs are spent making the found representation more concise (i.e., smaller KL divergence). This is reminiscent of the two-phase behavior (fitting vs. compression) that was observed for supervised learning using VIB in [29].

Remark 5.

For a fair comparison, our algorithm VIB-GMM and the VaDE of [19] are run for the same number of epochs, e.g., $n_{\text{epoch}}$ . In the VaDE algorithm, the cost function (11) is optimized for a particular value of hyperparameter $s$ . Instead of running $n_{\text{epoch}}$ epochs for $s=1$ as in VaDE, we run $n_{\text{epoch}}$ epochs by gradually increasing $s$ to optimize the cost (21). In other words, the computational resources are distributed over a range of $s$ values. Therefore, the computational complexity of our algorithm and the VaDE are equivalent. $\blacksquare$

IV-D Visualization on the Latent Space

In this section, we investigate the evolution of the unsupervised clustering of the STL-10 dataset on the latent space using our VIB-GMM algorithm. For this purpose, we find it convenient to visualize the latent space through application of the t-SNE algorithm of [30] in order to generate meaningful representations in a two-dimensional space. Figure 6 shows $4000$ randomly chosen latent representations before the start of the training process and after $1$, $5$, and $500$ epochs, respectively. The shown points (with a $\boldsymbol{\cdot}$ marker in the figure) represent latent representations of data samples whose labels are identical. Colors are used to distinguish between clusters. Crosses (with an $\mathbf{x}$ marker in the figure) correspond to the centroids of the clusters. More specifically, Figure 6a shows the initial latent space before the training process. If the clustering is performed on the initial representations, it yields an ACC as low as $10\%$, i.e., as bad as a random assignment. Figure 6b shows the latent space after one epoch, from which a partition of some of the points already starts to be visible. With five epochs, that partitioning is significantly sharper, and the associated clusters can be recognized easily. Observe, however, that the cluster centers seem not to have converged yet. With $500$ epochs, the ACC of our algorithm reaches $91.6\%$, and the clusters and their centroids are neater, as is visible from Figure 6d.

V Conclusions and Future Work

In this paper, we propose and analyze the performance of an unsupervised algorithm for data clustering. The algorithm uses the Variational Information Bottleneck approach and models the latent space as a mixture of Gaussians. It is shown to outperform state-of-the-art algorithms such as the VaDE of [19] and the DEC of [20]. We note that although the number of classes is here assumed to be known beforehand (as is the case for almost all competing algorithms in its category), that number can be found (or estimated to within some accuracy) through inspection of the resulting bifurcations on the associated information plane, as was observed for the standard Information Bottleneck method. Finally, among interesting research directions in this line of work, one important question pertains to the distributed learning setting, i.e., the counterpart, for the unsupervised setting, of the recent work [31, 32, 33], which contains distributed IB algorithms for both discrete and vector Gaussian data models.

Appendix A The Proof of Lemma 1

First, we expand $\mathcal{L}^{\prime}_{s}(\mathbf{P})$ as follows

[TABLE]

Then, $\mathcal{L}_{s}^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})$ is defined as follows

[TABLE]

Hence, we have the following relation

[TABLE]

where equality holds when $Q_{\mathbf{X}|\mathbf{U}}=P_{\mathbf{X}|\mathbf{U}}$ and $Q_{\mathbf{U}}=P_{\mathbf{U}}$. We note that $s\geq 0$.

Now, we complete the proof by showing that (25) is equal to (8). To do so, we rewrite (25) as follows

[TABLE]

Appendix B Alternative Expression $\mathcal{L}_{s}^{\mathrm{VaDE}}$

Here, we show that (13) is equal to (14).

To do so, we start with (14) and proceed as follows

[TABLE]

where $(a)$ and $(b)$ follow due to the Markov chain $C-\!\!\!\!\minuso\!\!\!\!-\mathbf{X}-\!\!\!\!\minuso\!\!\!\!-\mathbf{U}$ .

Bibliography

[1] D. Sculley, “Web-scale $K$-means clustering,” in Proceedings of the 19th International Conference on World Wide Web, April 2010, pp. 1177–1178.
[2] Z. Huang, “Extensions to the $K$-means algorithm for clustering large datasets with categorical values,” Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283–304, September 1998.
[3] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A $K$-means clustering algorithm,” Journal of the Royal Statistical Society, vol. 28, no. 1, pp. 100–108, 1979.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1–38, 1977.
[5] C. Ding and X. He, “$K$-means clustering via principal component analysis,” in Proceedings of the 21st International Conference on Machine Learning, July 2004.
[6] K. Pearson, “On lines and planes of closest fit to systems of points in space,” Philosophical Magazine, vol. 2, no. 11, pp. 559–572, November 1901.
[7] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1–3, pp. 37–52, August 1987.
[8] S. Roweis, “EM algorithms for PCA and SPCA,” in Proceedings of Advances in Neural Information Processing Systems 10, December 1997, pp. 626–632.
