Generalized Dirichlet-process-means for $f$-separable distortion measures
Masahiro Kobayashi, Kazuho Watanabe

TL;DR
This paper extends DP-means clustering to $f$-separable distortion measures, providing a robust, unified algorithm that adapts to different functions and improves outlier resistance, validated through experiments.
Contribution
It introduces a generalized framework for DP-means using $f$-separable measures and analyzes robustness via influence functions, enhancing outlier handling.
Findings
Improved robustness against outliers demonstrated in experiments.
Unified algorithm effectively adapts to various $f$-separable measures.
Numerical results show competitive clustering performance.
Abstract
DP-means clustering was obtained as an extension of -means clustering. While it is implemented with a simple and efficient algorithm, it can estimate the number of clusters simultaneously. However, DP-means is specifically designed for the average distortion measure. Therefore, it is vulnerable to outliers in data, and can cause large maximum distortion in clusters. In this work, we extend the objective function of the DP-means to -separable distortion measures and propose a unified learning algorithm to overcome the above problems by selecting the function . Further, the influence function of the estimated cluster center is analyzed to evaluate the robustness against outliers. We demonstrate the performance of the generalized method by numerical experiments using real datasets.
| behavior | monotonic decrease | calculation order | ||
|---|---|---|---|---|
| average distortion minimization | yes | |||
| robustness against outliers | yes | |||
| approaches the maximum distortion minimization | *1 | *2 |
| behavior | |||||
| 1 | average distortion minimization | ||||
| robustness against outliers | |||||
| approaches the maximum distortion minimization | |||||
| maximum distortion minimization | |||||
| dataset | |||
|---|---|---|---|
| Breast Cancer Wisconsin555https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data | |||
| Heart666https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data | |||
| HeartK2 | |||
| HTRU2 | |||
| Iris | |||
| Mice Protein Expression | 552 | 8 | 77 |
| Pima | |||
| Seeds | |||
| Thyroid777http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/new-thyroid.data | |||
| Wine | |||
| Yeast |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Generalized Dirichlet-process-means for -separable distortion measures
Masahiro Kobayashi111Email:[email protected]
Kazuho Watanabe222Email:[email protected]
Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1, Hibarigaoka, Tempaku-cho, Toyohashi 441-8580, Japan
Abstract
DP-means clustering was obtained as an extension of -means clustering. While it is implemented with a simple and efficient algorithm, it can estimate the number of clusters simultaneously. However, DP-means is specifically designed for the average distortion measure. Therefore, it is vulnerable to outliers in data, and can cause large maximum distortion in clusters. In this work, we extend the objective function of the DP-means to -separable distortion measures and propose a unified learning algorithm to overcome the above problems by selecting the function . Further, the influence function of the estimated cluster center is analyzed to evaluate the robustness against outliers. We demonstrate the performance of the generalized method by numerical experiments using real datasets.
keywords:
Clustering , Dirichlet-process-means , -separable distortion measures , Bregman divergence , Influence function , Maximum distortion
††journal: Neurocomputing
{highlights}
The objective function of DP-means is generalized to -separable distortion measures.
It achieves maximum distortion minimization or obtains robustness.
Two functions families which have real parameter are introduced.
Monotonic decreasing property of the objective function is guaranteed.
The influence function is derived, which investigates robustness against outliers.
1 Introduction
-means is one of the most popular clustering methods. This is because its algorithm is simple and can be executed at high speed in linear time with respect to the number of data. However, it is necessary to specify the number of clusters in advance. Therefore, it is necessary to apply some heuristics or examine results with multiple cluster numbers. Although clustering methods that can estimate the number of clusters from given data have been proposed so far, they have problems such as long computation time and many parameters to be tuned. Affinity propagation [1] and Mean shift clustering [2] are difficult to apply to large scale data because they require the squared order of the number of data to execute the algorithms. Gamma-clust based on a -divergence has been proposed as a clustering method that is robust against scatted outliers [3]. It learns means and covariance matrices of -Gaussian mixtures based on -divergence. It requires four or five hyper parameters to be set.
In this paper, we focus on the Dirichlet-Process-means (DP-means) clustering [4] which is a simple algorithm, and an extension of -means that can be performed with a linear order of the number of data. Historically, to estimate the number of clusters, a learning method of Gaussian mixtures by the nonparametric Bayes approach was proposed [5]. DP-means was obtained as an extension of -means capable of estimating the number of clusters in the limit where the variances of the Gaussian components approach [math]. DP-means retains the advantages of -means. It can be executed in linear time with respect to the number of data, is easy to apply to large scale data, and is can be implemented using a simple algorithm.
In an attempt to further speed up DP-means, parallelization can be applied with optimistic concurrency control when a new cluster is created [6]. In addition, computation time has been significantly reduced by dividing data into weighted subsets called coresets, although at the expense of accuracy [7]. DP-means specialized for application to large scale genetic data has been devised, and shown to be superior to existing methods from the aspect both of accuracy and efficiency [8]. To improve the accuracy of clustering, studies have been made to avoid local minimum solutions [9]. An extension using Bregman divergence was also given, which introduces an appropriate distance measure when data has a special type such as binary or non-negative integer value [10, 11].
From the viewpoint of information theory, the algorithm of DP-means monotonically decreases the average distortion of the training data, whereas the penalty parameter, which controls the number of clusters, has been interpreted as the maximum distortion of the data [12]. This motivated us to consider modifying DP-means, where the maximum distortion would be minimized instead of the average distortion. This problem is also known as the -center problem [13]. It is also demonstrated in the smallest enclosing ball problem when the number of clusters is one [14], and a method to calculate the smallest enclosing Bregman ball with the radius of the smallest enclosing ball measured by Bregman divergence has been studied [15].
DP-means has a problem that it is prone to the influence of outliers because of the nature of the objective function, the average distortion. Therefore, we extended the objective function of DP-means and invented two objective functions, either of which could bridge maximum distortion and robust distortion measures, and constructed the algorithms to minimize them [16]. However, the degree of the robustness against outliers induced by these objective functions has yet to be clarified.
Further, in order to extend the linear distortion measure as the average distortion to nonlinear distortion measures with respect to the distortion of each data point, -separable distortion measures using -mean has been proposed. In particular, for this distortion measure, the rate-distortion function showing the limit of lossy compression was elucidated [17].
In this paper, we further capitalize on the above previous work and extend the objective function of DP-means to -separable distortion measure using a monotonically increasing function . We derive a cluster center update rule for a sufficiently wide class of the function , and show that the objective function monotonically decreases if is concave. As concrete examples of the function , we show that two kinds of functions including a parameter can unify the minimization of robust distortion measures and the minimization of the maximum distortion by adjusting the parameter . Furthermore, we derive the influence function and evaluate the robustness against outliers. Experiments using real datasets demonstrate that DP-means generalized by the function improve the performance of the original DP-means.
The paper is organized as follows. Section 2 introduces DP-means. Section 3 generalizes the objective function of DP-means to -separable distortion measures and explains the behavior of the objective function corresponding to the selection of the function . In Section 4, the generalized DP-means algorithm is constructed based on the generalized objective function. In Section 5, we derive the influence function and evaluate robustness against outliers. In Section 6, we present the results of numerical experiments using real datasets to demonstrate the effectiveness of the generalized DP-means. In Section 7, we discuss further modification of the objective function in terms of the pseudo-distance. Finally, Section 8 concludes this paper.
2 DP-means
-means is one of the most popular clustering methods, is a simple algorithm, and can be executed at high speed in linear time with respect to the number of data. However, it is necessary to specify the number of clusters in advance. DP-means can estimate the number of clusters and retains the advantages of -means. Figure 1 schematically explains the difference between -means, original DP-means and generalized DP-means. The generalized DP-means will be explained in next section. -means, a.k.a. Bregman hard clustering, is obtained in the limit of Expectation-Maximization algorithm for the mixture of regular exponential family distributions, a.k.a. Bregman soft clustering, as the variances of component distributions tend to [math] [10]. Precisely, while -means minimizes the sum of squared distance, Bregman hard clustering minimizes the sum of Bregman divergence, the resulting algorithm is identical to -means [10]. Thus, we refer to the Bregman hard clustering as -means. On the other hand, introducing the Dirichlet-process (DP) prior for the cluster assignments in the mixture of regular exponential family distributions, we obtain the DP mixture model [5]. The DP-means is obtained as the same small-variance limit as above from the DP mixture model [4, 11].
DP-means requires data and penalty parameter as inputs. Suppose that each data point is -dimensional, . The algorithm of DP-means is basically the same as -means. Let be cluster centers. DP-means executes a calculation of the cluster centers and assigns the data point to clusters until the following objective function converges:
[TABLE]
Here, denotes the cluster label of the data point . However, the following two points are different from -means. DP-means is initialized with one cluster, . When the pseudo-distance between a data point and its nearest cluster center is greater than the penalty parameter , a new cluster is created. In other words, a new cluster is generated when the following is satisfied:
[TABLE]
Also, the cluster center is calculated as the average of data points assigned to the -th cluster,
[TABLE]
Here, is given by
[TABLE]
In this paper, we assume the Bregman divergence as the pseudo-distance, which generalizes the squared distance. Specifically, the Bregman divergence is the pseudo-distance determined from a differentiable strictly convex function as
[TABLE]
where represents the gradient vector of and is the inner product. The Bregman divergence satisfies non-negativity and identity of indiscernibles in distance axioms. Moreover, the Bregman divergences are related to the probability distributions in the regular exponential family, bijectively. For example, if data are of a particular type such as a binary or non-negative integer, the Bregman divergences corresponding to the Bernoulli and Poisson distributions are used as the more suitable distance measures than the usual squared distance for real numbers [10, 11].
Under a fixed number of clusters, the only difference between the -means and original DP-means is the initialization. -means is initialized with random assignment of clusters while in the DP-means, the order of data can be considered as the random initialization.
The average distortion and the maximum distortion are defined by
[TABLE]
respectively. Note that the minimization of the objective function (1) with respect to is equivalent to that of the average distortion. On the other hand, the penalty parameter , which determines the number of clusters, can be interpreted as the maximum distortion [12]. Therefore, we can consider the maximum distortion as the measure for cluster increment.
3 Generalized objective function
3.1 Generalization with -separable distortion measures
In this paper, we propose an objective function that generalizes the objective function of DP-means to -separable distortion measures as follows:
[TABLE]
As we will discuss in Section 4.2, this objective function is guaranteed to decrease monotonically with respect to and . In this paper, we assume that the function is a differentiable and continuous monotonically increasing, where is the set of non-negative real numbers. The argument is given by the Bregman divergence or the penalty parameter . In particular, we consider the following three types:
- :
linear,
- :
concave,
- :
convex.
If the function is a differentiable and strictly monotonically increasing function, an inverse function exists, and (4) can be normalized to a distortion measure as -mean [18]. The -separable distortion measures using this inverse function correspond to that in the literature [17]. The generalized objective function (4) can be monotonically transformed with the inverse function . Therefore, minimizing (4) is equivalent to minimizing the -mean,
[TABLE]
Table 1 summarizes the behavior of the objective function, its monotonic improvement property, and the calculation order required to execute the learning algorithm. If is linear (), it becomes the original objective function (1), which corresponds to the average distortion in the original DP-means. If is convex (), a distortion with a larger pseudo-distance value tends to be minimized. In particular, the faster the function diverges to infinity, the more the objective function approaches the maximum distortion. Conversely, If is concave (), a distortion with a smaller pseudo-distance value will be prioritized to be minimized. That is, the influence of data points far from other data points, such as outliers, is weakened. In other words, there is a trade-off between the maximum distortion, which is the maximum radius of the cluster, and the robustness against outliers. Robustness against outliers is explained more in detail in Section 5.
Eguchi and Kano generalized the likelihood of a probabilistic model to -likelihood using a convex function like (4), and devised a -estimator as a robust estimator against outliers [19]. The -estimator focuses on the following two points. The first is to obtain robustness against outliers. We consider a wide class of functions, not only the robustness against outliers but also the maximum distortion minimization within our focus. The second point is to guarantee the unbiased estimation equation. For this, a bias correction term, whose calculation is complicated in general, is included in the objective function. The -likelihood assumes a probabilistic model, whereas in this study only the Bregman divergence is assumed. As we will discuss, the update rule of the cluster center derived from the combination of the function and Bregman divergence enables us to execute the learning algorithm in the linear order on the number of the training data as the original DP-means.
3.2 Examples of function
In this subsection, we show two concrete examples of functions with a parameter . When the parameter is changed, the generalized objective function changes its behavior as average distortion, maximum distortion, or robust distortion measures.
3.2.1 Power mean objective
For the function
[TABLE]
the corresponding -mean is given by111 It can also be expressed as by using Tsallis -function , for which [20].
[TABLE]
The first term of (6) is called power mean. The parameters are , . Parameter determines the effect of the objective function, and parameter is introduced to avoid an algorithmic disadvantage. The disadvantage is that the cluster center is not updated when it overlaps the nearest data point. If , this problem does not occur. If , some algorithmic modification is required. We will explain the details in Section 4.3. Table 3 shows the characteristics of the objective function (6) and the corresponding function for different choices of .
As shown in Table 3, the behavior of the objective function varies around : it shows robust characteristics when , and the smaller the , the smaller the influence of outliers. When , the larger the value of , the closer the objective function approaches the maximum distortion. In particular, implies maximum distortion minimization. When the Bregman divergence is the squared distance, the objective function using (6) with and corresponds to the objective function derived in the framework of MAP-based Asymptotic Derivations (MAD-Bayes) [21], in which the generalized Gaussian distribution is assumed as the component of the nonparametric mixture model. The proof for this is in A.1. Similarly, assuming a deformed -distribution as the component of a nonparametric mixture model, the same objective function when and is obtained. The proof is in A.2.
3.2.2 Log-sum-exp objective
For the function
[TABLE]
the corresponding -mean is given by222 It can also be expressed as .
[TABLE]
Equation (8) is a differentiable approximation of the maximum value function when , known as the log-sum-exp function [22]. As in the case of the power mean, determines the characteristics of the objective function as a parameter. Table 3 shows the characteristics of the objective function using (8) and the corresponding function for different choices of .
Table 3 also shows how the objective function behaves differently around , as in the case of the power mean. It becomes robust when , and approaches the maximum distortion when . In particular, when , its limit is maximum distortion. In addition, (8) corresponds to the objective function for the estimation of mixture models in [23] when the variance of each component approaches [math].
4 Construction of generalized algorithm
In this section, we construct a generalized DP-means algorithm based on the objective function proposed in Section 3. First, we derive the update rule of the cluster center in Section 4.1. Second, we show that the objective function decreases monotonically with the derived update rule in Section 4.2. Third, we discuss minor problems in the execution of the algorithm and offer solutions to them in Section 4.3. Finally, we discuss the execution time of the generalized algorithm in Section 4.4. The generalized algorithms are constructed from the original DP-means by replacing the update rule of cluster centers and the objective function used for convergence diagnosis (Algorithm 1). This algorithm differs only to the original algorithm in the update rule of the cluster center, and the computation time required for execution is of linear order with respect to the number of data.
4.1 Derivation of update rules
The updated equations for the cluster centers are derived from the stationary conditions when the gradient of the cluster center of the generalized objective function (4) is . Here, represents the derivative of the function . Thus, the update rule of the cluster center is
[TABLE]
[TABLE]
which is the weighted mean of weighted by with the Bregman divergence as its argument. However, the objective function monotonically decreases by this update rule (9) only when the function is concave (or linear). If function is convex, the cluster center is updated by gradient descent optimization such as the steepest gradient or Newton’s method and so on. The update rule with Newton’s method is put in B.
4.2 Guarantee of monotonic decreasing property
The DP-means algorithm iterates the cluster center updating step and the assignments of data points to clusters. Because of the monotonic decreasing property of the objective function in each step, the algorithm converges within finite iterations. Even in the generalized objective function (4), the cluster assignment step is same as the original DP-means. Therefore, only the monotonic decreasing property of the objective function in the cluster center updating step is considered. In the following, we explain the two cases, concave and convex.
4.2.1 Concave case
The following theorem applies when the function is concave.
Theorem 1
If the function is concave, the updating of the cluster center using (9) monotonically decreases the objective function for general Bregman divergence.
Proof 1
We show that the objective function (4) monotonically decreases when the -th cluster center is newly updated to by (9). More specifically, we prove that, , where is the sum of the terms related to in (4):
[TABLE]
Here, the tangent line of at the point is expressed by the following equation:
[TABLE]
Furthermore, since follows from the concavity of the function , the following inequality holds:
[TABLE]
From this inequality, the following holds:
[TABLE]
where
[TABLE]
derived from (9) and (10) was used in (12). \qed**
The following corollary immediately follows from Theorem 1.
Corollary 1
When the objective function is constructed by the power mean (6) or the log-sum-exp function (8), the following holds. When , the updating of the cluster center using (9) monotonically decreases the objective function for general Bregman divergence.
If the function satisfies , the algorithm converges within finite iterations. Let be the number of times the -th cluster center has been updated, and denote it by . The corresponding objective function is . The objective function sequence converges to a finite value, because it has monotonic decreasing property (Theorem 1 ) and lower bounded, that is, . Therefore, the following holds:
[TABLE]
In other words,
[TABLE]
That is, the algorithm converges in finite iterations for threshold . The proof of Theorem 1 is a generalization of the monotonic decreasing property of ei-means () proposed in [24], which corresponds to the case where and . The proof of Theorem 1 is also interpreted by the Majorization-Minimization algorithm [25].
4.2.2 Convex case
The function is convex when the objective function (4) exhibits a characteristic close to the maximum distortion minimization. If the cluster center is updated by (9), the value of the objective function can oscillate and may not decrease monotonically. Therefore, it is necessary to apply a gradient descent optimization such as the steepest descent method or Newton’s method. When the pseudo-distance is the general Bregman divergence, the problem of calculating the cluster centers is not generally a convex optimization problem, so the gradient is not necessarily in the descent direction. However, monotonic decreasing property is also guaranteed for the Bregman divergence in general by using the algorithm that updates to the descending direction of the gradient like the modified Newton’s method. In particular, when symmetry is satisfied among distance axioms such as the squared distance, the problem of calculating the cluster centers is reduced to a convex optimization problem, so the gradient direction is always in the descent direction [15][22, Section 3.2]. When the function is convex, the calculation time in the case of the steepest descent method is , and Newton’s method with Cholesky decomposition is , where is the dimension of data points. In this case, there is no change in the linear order with respect to the number of data.
4.3 Problem and solution
When the objective function shows robustness against outliers, that is, when the function is concave, if a cluster center overlaps a data point, subsequent updates are not performed. In the assignment step of DP-means, a new cluster center is generated exactly on the data point satisfying (2). We can see that a data point and a cluster center frequently overlap. We will now show the condition where cluster center updating does not take place and offer a solution.
When the function is a concave and satisfies
[TABLE]
if a cluster center overlaps a data point, updating of the cluster center does not occur. It is assumed that one point in overlaps the cluster center . That is, . Here, if the function satisfies (13),
[TABLE]
holds. Therefore, we have
[TABLE]
which means that the cluster center does not move from the data point.
Next, we examine the case where function satisfies (13) and the cluster center overlaps the nearest data point with new data points additionally assigned to the cluster. Here, updating can be performed by shifting the cluster center from the data point to some point such as the average point, and applying the update rule.
In the case of the power mean (5), when , since (13) is satisfied, updating of the cluster center does not occur. Therefore, in this case, the procedure described above is required. However, in the case of the power mean (5) with , (13) is not satisfied, so the update of the cluster center is performed without the above procedure. However, it must be smaller than pseudo-distance values. Similarly, in the case of the log-sum-exp function (7), the cluster center can be updated without the above procedure because (13) is not satisfied.
4.4 Computational time
Although the computational complexity of the generalized DP-means is as discussed in Section 4.2.2, its actual clustering time is proportional to the number of outer loops times the number of inner loops. Furthermore, the former depends on the estimated number of clusters. The latter is required for non-linear and it generally increases as goes away from the linear function. Hence, the generalized DP-means requires more clustering time than the original DP-means in general. This means that there is a trade-off between the clustering time and robustness.
5 Analysis of influence function
The influence function is one of the indicators of the robustness against outliers and shows how much the estimator is influenced by the contamination of a small number of outliers. In this section, we derive the influence function and evaluate how robust the generalized objective function is against outliers.
5.1 Influence function of general function
Derivation of the influence function in this subsection is based on the derivation of the influence function of the total Bregman divergence [26, 27] (see also Section 7). We consider the influence function when an outlier is mixed in the -th cluster. However, since the influence of mixed outliers is independent for each cluster, in the derivation of subsequent influence functions, the subscript of is omitted and expressed as . Suppose that the -th cluster contains samples of data , and the cluster center estimated from is . When an outlier is mixed into the data , a new cluster center is calculated for the data including outliers . If we let be the difference between the estimator including outliers and the estimator without outliers,
[TABLE]
The influence function is defined by
[TABLE]
The influence function (14) defined by a finite sample is also specifically called a sensitivity curve in the field of robust statistics [28]. Then, the new cluster center minimizes
[TABLE]
Now we compute the first three terms of the Taylor expansion around the old cluster center , which minimizes . Therefore, the first derivative with respect to of the first term of (15) becomes . The second derivative with respect to of the second term of (15) becomes very small when is large because
[TABLE]
where (14) was used. Here, expresses the gradient with respect to . From the above arguments, the term related to is given as follows:
[TABLE]
The that minimizes (16) is solved by completing the square with respect to . Then, the influence function in (14) is given by
[TABLE]
Essentially, (17) is the same as the influence function in M-estimation [28]. Since the matrix does not depend on the outlier, in order to investigate the boundedness of the influence function, we should evaluate the following term :
[TABLE]
This shows that the boundedness of the influence function depends on the function and the strictly convex function constituting the Bregman divergence. If (18) is bounded, the estimated cluster center is robust against outliers.
5.2 Necessary condition for boundedness
Theorem 2
The following condition of the function is necessary for its influence function to be bounded for all :
[TABLE]
Proof 2
As , when is large, is large. When the function is linear () or convex (), is constant or monotonically increasing with respect to . In such a case, is increased, and the norm of (18) becomes as large as possible, which means the influence function is not bounded. In order to reduce the norm of (18) when is large, must be monotonically decreasing. The type of the function that satisfies this condition is only concave (). When does not satisfy (19), the norm of (18) is divergent at , and hence the influence function is not bounded. Therefore, the function must satisfy (19). \qed
Remark 1
In this paper, the function is assumed to be one of three types: linear, convex, and concave. Therefore, from the proof of this theorem, we know that is restricted to a subclass of concave functions in order to obtain the robustness. However, even for non-concave functions, if the condition of Theorem 2 is satisfied, robustness can be obtained. For example, a sigmoid function is not a concave function although it can induce robustness.
Remark 2
In some cases, the factor in (18) can diverge to infinity around . Even in such a case, if we consider the region of satisfying for a constant , the condition of Theorem 2 provides a necessary condition for the boundedness of the influence function on the region.
Under the condition of Theorem 2, the norm of (18) is bounded. In the following, we consider whether the norm of the influence function vanishes as . In particular, if
[TABLE]
holds, the influence function is said to be redescending, and an outlier that is too large is automatically ignored.
The following Assumption 1 is assumed in the following discussion.
Assumption 1
The input dimension, the norm of the cluster center , and that of at are finite, that is, , , and . The Bregman divergence satisfies the followings for :
[TABLE]
Equations (21) and (22) hold true if the Bregman divergence is defined additively with respect to -dimensions. Below, we discuss the situation where holds under these assumptions. In some cases, such as the Bregman divergence corresponding to the binomial distribution, can not occur. However we can investigate the behavior of the influence function for finite . We illustrate the behavior of (18) for such a case in C.
5.3 Power mean
From (18) and (5), to evaluate the influence function in the case of the power mean, we evaluate the following term:
[TABLE]
Theorem 3
For the function of the power mean (5) with , the influence function is redescending for general Bregman divergences.
The proof of Theorem 3 is in D.2.1. However, when , the redescending or boundedness property of the influence function depends on the Bregman divergence. In the rest of this subsection, we investigate the influence functions for concrete examples of the Bregman divergences.
-divergence is a subclass of Bregman divergences, including the Itakura Saito divergence (), the generalized KL divergence (), and the squared distance () as special cases [29].333The -divergence here is usually termed as the -divergence with the parameter . We denote the parameter by in order not to be confused with the in (5). The -divergence and corresponding convex functions are given by
[TABLE]
respectively. In the convex function, when the parameter is a positive even number other than [math], its domain is defined as , otherwise it is . If the data is multidimensional, the Bregman divergence and the corresponding convex function are defined additively with respect to dimensions as follows :
[TABLE]
We calculated the influence function for the -divergence and found that it can be classified into divergent, bounded, and redescending types with respect to and according to the following conditions (the proof is in D.2.2) :
[TABLE]
Figure 2 shows the regions of corresponding to the three types. Specifically, we see the boundedness property holds at the boundary line between the divergent and the redescending properties of the influence function. This boundary line is continuous except for the point of , and gradually approaches when . The case of another divergence, the exp-loss is shown in D.2.3.
5.4 Log-sum-exp
From (18) and (7), to evaluate the influence function in the case of the log-sum-exp function, we evaluate the following term:
[TABLE]
Theorem 4
For the function of the log-sum-exp (7) with , the influence function is redescending for general Bregman divergences.
Theorem 4 shows that when the estimated cluster center is robust against outliers, the redescending property always holds for any Bregman divergences. (The proof of Theorem 4 is in D.3.)
From the above examples, it can be seen that the robustness property of -mean strongly depends on the function .
5.5 Discussion on the influence function
In Section 5, we have discussed the boundedness of the influence function when the norm of the outlier goes to infinity. As the property of the DP-means algorithm, the data point which satisfies (2) generates a new cluster. Hence, updating of the cluster center is not affected by such a data point. However, data points which satisfy
[TABLE]
may effectively work as outliers. Clustering performance may be badly affected by data points satisfying (23) for the fixed penalty parameter depending on the dataset. In fact, the clustering performance of DP-means depends on the selection of function as we will examine numerically in Section 6. Therefore, it is important to consider the robustness against outliers. On the other hand, efficiency and robustness trade-off: if priority is given to efficiency, lower robustness may be better. In such a case, we may choose a concave function similar to the linear function which corresponds to an unbounded influence function. In any case, it is important to examine the influence function, and it can be a criterion for the design of the function and the penalty parameter from the dataset.
6 Experiments
6.1 UCI experiment
We conducted experiments with benchmark datasets in UCI Machine Learning Repository444https://archive.ics.uci.edu/ml/datasets.html to investigate the influence of the objective function generalized by the monotonically increasing function . We focused on the power mean (5) and the log-sum-exp function (7). It is difficult to fairly compare DP-means with the other clustering methods introduced in Section 1 that can estimate the number of clusters because the calculation time of is necessary or many parameters are required, and the preconditions of the algorithms are different. Also, as discussed in Section 4.4, the clustering time of the generalized DP-means depends on the estimated number of clusters, which changes against for a fixed . Thus, we focused on the clustering performance of the generalized DP-means other than the computational time.
In Section 5.5, we discussed that outliers satisfying (23) and depending on can influence the performance of DP-means. In the experiments below, we assume that real datasets contain such outliers implicitly. In other words, the improvement of the performance of the generalized DP-means for reflects the fact that datasets contain such outliers deteriorating the original DP-means.
6.1.1 Dataset
The datasets used for the experiment, where , , and denote the number of data, the number of clusters, and the number of dimensions, respectively are summarized in Table 4, where links to the specific datasets are given as footnotes when there are multiple datasets. Datasets for classification problems were used assuming classes as the true clusters. Heart dataset consists of data on heart disease with five clusters. Four clusters represent heart disease and one cluster represents no heart disease. HeartK2 dataset was made of Heart dataset by coarsening the cluster labels. More specifically it was made from the two clusters, with and without heart disease. We deleted data points with missing values beforehand.
6.1.2 Evaluation criteria
We used the number of clusters and normalized mutual information (NMI) as the evaluation criteria to investigate the influence of the penalty parameter and control parameter for the objective function. In order to confirm the behavior of the objective function, we examined the behavior of the maximum distortion against the change of . NMI is criterion for evaluating the clustering result, and take values from [math] to . The closer the NMI is to , the better the result. NMI is defined by the following equation:
[TABLE]
for the label set of the clustering result and the label set of the correct cluster. Here and represent mutual information and entropy, respectively.
6.1.3 Method
For preprocessing of clustering, we standardized data so that the each dimension is transformed as . We chose the squared distance as the distortion measure, which is averaged with respect to the dimensions.
In the experiment, we investigated the change of each evaluation criterion when changing the parameter for each case of the power mean (6) with and log-sum-exp function (8). The range of examined was . DP-means returns a local minimum solution depending on the order of data. Hence, the order of data was shuffled times, and for each evaluation criteria, the average value on the number of shuffles was calculated and used as the result. The concrete procedure is shown below.
When the number of clusters is 1, find the maximum value of the maximum distortion in the range of . 2. 2.
Randomly rearrange the sequence of data. 3. 3.
Change from to in increments.
- (a)
Set 2. (b)
Algorithm is executed with the penalty parameter . 3. (c)
Repeat Step 3-1 and Step 3-2 until falls below the threshold. 4. 4.
Repeat Step 2 and Step 3 100 times. 5. 5.
Average the evaluation criteria over rearrangements of data for each and , and use them as the result.
The threshold was set so that the number of clusters was within about three times the number of correct clusters.
Note that when the Bregman divergence is defined as the average with respect to the dimension , in the case of the log-sum-exp function, the effective value of the parameter depends on . When the parameter to be given is , the effective value of the parameter is . Thus, the range of is and the step size is .
6.1.4 Result
Figures 4 through 24 show the number of clusters, NMI, and the maximum distortion by heat maps for the 11 datasets. We can see that the number of clusters, NMI, and maximum distortion are correlated. Depending on the dataset, NMI tends to be improved when the value of is lowered below 1. The tendency is remarkable in “Breast Cancer Wisconsin” (Figure 4(b), Figure 4(b)) and “HTRU2” (Figure 10(b), Figure 10(b)), where NMI increases monotonically as the value of is lowered. In other words, the outliers in the sense discussed in Section 5.5 exist implicitly in the datasets in which NMI was improved by lowering . However, NMI worsens if is lowered too much. Since the number of clusters and the maximum distortion are in a trade-off relationship, we can see the maximum distortion tends to decrease as increases when the number of clusters is fixed and the maximum distortion is compared. Theoretically, the larger , the closer the generalized DP-means to the maximum distortion minimization and the smaller , the it robust against outliers. The above result supports this fact.
On the other hand, the value of NMI may not change even if the value of is lowered than 1 as in the Iris dataset. It means that the dataset did not include the outlier which we assumed in Section 5.5. Therefore, we explicitly added outliers to the Iris dataset and conducted an additional experiment. The Iris dataset has 3 clusters, setosa, versicolor and virginica. The clusters of versicolor and virginica overlap. In such a situation, adding outliers to either cluster would adversely affect the estimation. Hence, outliers of of the total original data were randomly mixed at 0.95 to 1 times the true maximum distortion from the true cluster center of the virginica cluster. It is expected that this procedure simulates the outliers assumed in Section 5.5. Then, we applied the same preprocessing of data and experiment as described in Section 6.1.3. The result is shown in Figures 26 and 26. We can see that the value of NMI is improved when is lowered from 1 compared to when . This fact shows that the generalized DP-means is robust against the outliers assumed in Section 5.5. Although there are many other ways to mix outliers, it is beyond the scope of this paper and needs further study to examine comprehensively the influence of outliers to the clustering performance.
6.2 Image compression task
We conducted an experiment to see if generalized DP-means is more effective than the original DP-means () through the application of vector quantization to an image compression task. We used generalized DP-means with power mean (6) (). In particular, we considered the case where the minimization measure approaches maximum distortion minimization, that is, is sufficiently large. Although in this experiment, image compression was handled, the purpose was to examine the performance of the generalized DP-means. Tipping and Schölkopf compared the maximum distortion minimization and the average distortion minimization in an image compression task by using a clustering method called the kernel vector quantization [30]. In this experiment, the same comparison was carried out using the same image (Figure 27). This is a color image size . We obtained data points by dividing it into block images of . Each data point consisted of dimensions from the block size and the color information. The uncompressed image is represented by the dataset of points. Image compression was performed while increasing the penalty parameter and carrying out clustering using Algorithm 1 with given by (5) and as the average distortion minimization and as the approximation of the maximum distortion minimization. As the penalty parameter increases, the number of clusters decreases. The preprocessing for clustering was the same as the previous experiment. We chose the squared distance, as the distance measure. Newton’s method was used for gradient descent optimization to calculate cluster centers because the convergence speed is second order. The parameter was a relatively large value among , which did not cause any divergence in the calculation. Thus, was regarded as the maximum distortion minimization. In this experiment, we focused on whether or not the letter string of the license plate in Figure 27 was recognizable [30]. Figure 29 shows an image compressed to the limit at which the license plate letter string can be read for each of average distortion minimization and maximum distortion minimization. Further, when the letter string of the license plate can only be read partly, comparison under the same compression ratio is shown in Figure 29.
Figure 29 shows that the maximum distortion minimization achieved the better compression ratio compared to the average distortion minimization when the whole letter string can be read. In Figure 29, it is possible to read several characters of the license plate in the case of the maximum distortion minimization, while it is almost impossible to read in the case of the average distortion minimization. In the image compression task focusing on the letter string in the image, it was suggested that better performance was obtained by using the generalized DP-means with a large value of that approaches the maximum distortion minimization than the original DP-means. The reason why the generalized objective function with a large was effective may be as follows. In the average distortion minimization, the license plate consisting of a small number of patterns in the entire image tended to have large distortion from the cluster center to which it belongs, whereas in the case of the maximum distortion minimization, this led to the reduction in the distortion of the blocks from the license plate.
7 Discussion: total Bregman divergence
So far we have discussed with the Bregman divergence as a prerequisite. The same discussion can be made when the total Bregman divergence is used as the pseudo-distance. The total Bregman divergence is invariant to the rotation of the coordinate axes, and the cluster center obtained by minimizing the average distortion has been shown to be robust to outliers [26]. The Bregman divergence is known to have a bijective relationship with the exponential family, whereas the total Bregman divergence corresponds to the lifted exponential family [31]. The total Bregman divergence is defined by
[TABLE]
where [26, 27]. Note that the arguments of in the numerator are reversed compared to the in (3). In the case of , it coincides with the definition in [26], and when , it coincides with the case of the reversed Bregman divergence. When the total Bregman divergence is used for the pseudo-distance, as in Section 4.1, the update rule of the cluster center can be obtained as
[TABLE]
[TABLE]
Here, denotes the inverse function of . As in Section 4.2, when the function is concave, Theorem 6 (E.1) holds, claims the monotonic decreasing property of the objective function. When the function is convex, since the problem of updating the cluster center is a convex optimization problem, the gradient direction always becomes the descent direction by the gradient descent optimization. The update rules with Newton’s method are summarized in B. The influence function is derived in the same flow as in Section 5.1. As the theorem on the boundedness of the influence function, the following holds.
Theorem 5
The following Condition 1 on the function is a necessary and sufficient condition for its influence function to be bounded for all and Condition 2 provides a necessary and sufficient condition for it to be redescending:
* is monotonically decreasing function is a concave function or linear function,* 2. 2.
[TABLE]
The proof of this theorem is in E.3. (See Remark 2 for the discussion when around .) Recall that Condition 2 is a necessary condition for the boundedness of the influence function when the standard Bregman divergence is used as the pseudo-distance (Theorem 2).
8 Conclusion
In this paper, we generalized the average distortion of DP-means to -separable distortion measures by using a monotonically increasing function . If the function has an inverse function , the -separable distortion measure can be expressed by -mean. We classified the function into three types, linear, convex and concave. These three types correspond to the original average distortion, distortion measures approaching the maximum distortion, and those with robustness against outliers respectively. We showcased two kinds of functions including the parameter . The objective function constituted by these functions can change the characteristics according to the value of the parameter . Furthermore, based on this generalized objective function, an algorithm with guaranteed convergence was constructed. Like the original DP-means, this algorithm has the computational complexity of the linear order of the number of data. In order to evaluate the robustness against outliers, we derived the influence function on the general form of the function and showed the necessary condition for the influence function to be bounded. For each concrete example of the function , we examined the condition under which the boundedness of the influence function holds. We proved that the log-sum-exp function showed the robustness against outliers regardless of the Bregman divergence. Although the above discussion assumes the Bregman divergence as pseudo-distance, we also showed that the same argument holds true for the total Bregman divergence. In addition, experiments using real datasets demonstrated that the generalized DP-means improves the performance of the original DP-means. Although the generalized DP-means is scalable to high-dimensional data computationally, its clustering performance can be under question if the Bregman divergence is naively defined for example simply additively with respect to dimension. In such a case, it can be a remedy for high-dimensional data to combine dimension reduction techniques such as non-negative matrix factorization (NMF) [32], -distributed stochastic neighbor embedding (-SNE) [33], and deep autoencoders [34]. Our future research will include analysis of the generalization error consisting of the bias and variance in the estimation of the cluster centers. This will led to a principled design of a combination of the function and the Bregman divergence (or pseudo-distance) by investigating the trade-off between the generalization error and the robustness.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We are grateful of the anonymous referees for their helpful comments and suggestions. We would like to thank Professor Michael Tipping of Bath University for providing the image used for the experiment. This work was supported in part by JSPS KAKENHI Grant numbers 19J11776 and 19K11825.
Appendix A Derivation of objective functions through MAD-Bayes
A.1 Generalized Gaussian distribution
We focus on the objective function with the function in (5) (, ) and the Bregman divergence as the squared distance . We prove that the objective function of this case is derived from the framework of MAD-Bayes [21] when the generalized Gaussian distribution is assumed as a component. The generalized Gaussian distribution is given by
[TABLE]
where the normalization constant is
[TABLE]
and is the gamma function. The parameters are and . It includes the Laplace distribution (), the Gaussian distribution (), and the uniform distribution () as special cases. The likelihood is given by
[TABLE]
The Chinese restaurant process, an example of Dirichlet processes, is given by
[TABLE]
where and is the hyperparameter [5]. When an arbitrary distribution that creates the cluster center is defined as , the simultaneous distribution is expressed by . The simultaneous distribution of data, cluster assignments and cluster centers are expressed by the following equation:
[TABLE]
Then, setting , we consider the limit . We have
[TABLE]
It follows that
[TABLE]
Because as , we obtain the objective function as follows:
[TABLE]
This objective function is equivalent to the objective function (4) with the function in (5) (, ) as follows:
[TABLE]
Note, however, that must be positive in the generalized Gaussian distribution.
A.2 Deformed -distribution
We consider the same objective function as that in A.1 except that and instead of . We prove that the objective function of this case is derived from the framework of MAD-Bayes when the deformed -distribution is assumed as a component. The -distribution is
[TABLE]
where is the degree of freedom. is the dimension of data. It includes the Cauchy distribution () and the Gaussian distribution () as special cases. Here, we use the following distribution obtained by transforming this -distribution:
[TABLE]
where the normalization constant is
[TABLE]
The likelihood is given by
[TABLE]
As in A.1, the simultaneous distribution of data, cluster assignments, and cluster centers are expressed by the following equation:
[TABLE]
Then, setting , we consider the limit . We have
[TABLE]
It follows that
[TABLE]
Because as , we obtain the following objective function:
[TABLE]
We put . This objective function is equivalent to the objective function (4) with the function in (5) (, ) as follows:
[TABLE]
Note, however, that must not be equal to 0 in the deformed -distribution.
Appendix B Update rules with Newton’s method
The cluster center is updated with Newton’s method as follows:
[TABLE]
where the gradient vector and the hessian matrix are given by
[TABLE]
respectively. The gradient vector and the hessian matrix of the Bregman divergence are given by
[TABLE]
If the Bregman divergence is additive with respect to the dimension of data points, , the gradient (30) and hessian matrix (31) are expressed simply. The -th element of -dimensional column vector (30) is given by
[TABLE]
The hessian matrix (31) is diagonal matrix and its -th element is given by
[TABLE]
In the case of the total Bregman divergence, in (28) and (29) is replaced with and the cluster center is updated by (27). The gradient vector and the hessian matrix of the total Bregman divergence are given by
[TABLE]
Similarly, the total Bregman divergence is additive with respect to the dimension of data points, (32) and (33) are expressed simply. The -th element of -dimensional column vector (32) is given by
[TABLE]
The hessian matrix (33) is diagonal matrix and its -th element is given by
[TABLE]
Appendix C Plots of influence functions
The Bregman divergence corresponding to the binomial distribution is given by
[TABLE]
where is a non-negative integer value and in [10]. In this Section, we call equation (34) “binomial-loss”. In the following, (18) in the one-dimensional case is illustrated as a function of for each Bregman divergence for the power mean and the log-sum-exp (Figure 32-35). It is 0 at . The tendencies of the influence functions as discussed in Section 5.3 and Section 5.4 for different and Bregman divergences can be seen.
Appendix D Proof of bounded influence function
D.1 Proof of Lemma 1
Under Assumption 1, the following lemma holds.
Lemma 1
For , let
[TABLE]
be the influence function of the -th dimension. Then, it holds that
[TABLE]
Proof 3
Let the -th component of the matrix be , and . It follows that
[TABLE]
where
[TABLE]
If the norm of (18) is bounded, the norm of the influence function is bounded. Here, the norm of (18) is
[TABLE]
Hence, we have
[TABLE]
when . If the function satisfies (19) of Theorem 2, which implies the concavity of , the following holds:
[TABLE]
Thus, the bounded and redescending properties of the left hand side as imply those of the right hand side as , respectively. This means that if is bounded or converging to 0 for all , so is . If for some , putting and taking the limit , we have . \qed
D.2 Power mean
D.2.1 Proof of Theorem 3: redescending property for power mean
Evaluate the following expression (35) of Lemma 1 for the function in (5) (). It follows from l’Hopital’s rule that
[TABLE]
Therefore, from Lemma 1, the redescending property holds.
D.2.2 -divergence
Here, since the -divergence is additively defined, it holds that . Then,the -divergence is expressed as
[TABLE]
Evaluate the following expression (35) of Lemma 1 for the function in (5) :
[TABLE]
1.
It follows from (36) that
[TABLE]
Therefore, it holds that
[TABLE]
2. (generalized Kullback-Leibler divergence)
It follows from (36) that
[TABLE]
Therefore, it holds that
[TABLE]
3.
It follows from (36) that
[TABLE]
Therefore, it holds that
[TABLE]
D.2.3 Exp-loss
When the function is given by (5) and the convex function constituting the Bregman divergence is given by the exponential function, we investigate the boundedness of the influence function. For the convex function
[TABLE]
the corresponding Bregman divergence is given by [10],
[TABLE]
For multidimensional data, we additively define the divergence as follows:
[TABLE]
When , it follows from (37) that:
[TABLE]
When , if the -th elements of satisfies then, from (37). On the other hand, if the -th elements of satisfies then, from (38). That is, when , it follows that:
[TABLE]
When , is 0 from (37). The results are summarized as follows:
[TABLE]
D.3 Proof of Theorem 4: redscending property for log-sum-exp
Evaluate the following expression (35) of Lemma 1 for the function in (7) (). It follows from l’Hopital’s rule that
[TABLE]
Therefore, from Lemma 1, the redescending property holds.
Appendix E Properties of total Bregman divergence
E.1 Guarantee of monotonic decreasing property
Theorem 6
If the function is concave, the updating of the cluster center using (24) monotonically decreases the objective function for general total Bregman divergence.
Proof 4
The flow of the proof is almost the same as the proof of Theorem 1. We show that the objective function (4) monotonically decreases when the -th cluster center is newly updated to by (24). More specifically, we prove that, , where is the sum of the terms related to in (4), that is,
[TABLE]
From the inequality in (11), the following holds:
[TABLE]
where
[TABLE]
which is derived from (24) and (25) was used in (39). \qed
The following corollary immediately follows from Theorem 6.
Corollary 2
When the objective function is constructed by the power mean (6) or the log-sum-exp function (8), the following holds. When , the updating of the cluster center using (24) monotonically decreases the objective function for general total Bregman divergence.
E.2 Influence function
When we derive the influence function as in Section 5.1, it is given by
[TABLE]
Because the matrix does not depend on , the robustness against outliers is evaluated by
[TABLE]
E.3 Proof of Theorem 5
Evaluate the following expression, which is the norm of (40):
[TABLE]
Even if has any value, it is bounded [26]. Therefore, it is a necessary and sufficient condition for the influence function to be bounded that is bounded for all . As , when is large, is large. When the function is convex, is a monotonically increasing function. As is increased, (41) grows unboundedly, and the influence function is not bounded. In order to reduce (41) when is large, must be a monotonically decreasing function. The function that satisfies this condition is concave or linear (Condition 1). As , when , does not become 0. Therefore, the necessary and sufficient condition for satisfying the redescending property (20) is (26) (Condition 2).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] B. J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972–976.
- 2[2] Y. Cheng, Mean shift, mode seeking, and clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (8) (1995) 790–799.
- 3[3] A. Notsu, S. Eguchi, Robust clustering method in the presence of scattered observations, Neural Computation 28 (6) (2016) 1141–1162.
- 4[4] B. Kulis, M. I. Jordan, Revisiting k-means: New algorithms via Bayesian nonparametric, in: Proceedings of the 29th International Conference on Machine Learning (ICML), 2012, pp. 513–520.
- 5[5] S. J. Gershman, D. M. Blei, A tutorial on Bayesian nonparametric models, Journal of Mathematical Psychology 56 (1) (2012) 1–12.
- 6[6] X. Pan, J. E. Gonzalez, S. Jegelka, T. Broderick, M. I. Jordan, Optimistic concurrency control for distributed unsupervised learning, in: Advances in Neural Information Processing Systems 26 (NIPS), 2013, pp. 1403–1411.
- 7[7] O. Bachem, M. Lucic, A. Krause, Coresets for nonparametric estimation - the case of DP-means, in: Proceedings of the 32th International Conference on Machine Learning (ICML), 2015, pp. 209–217.
- 8[8] L. Jiang, Y. Dong, N. Chen, T. Chen, DACE: a scalable DP-means algorithm for clustering extremely large sequence data, Bioinformatics 33 (6) (2017) 834–842.
