Dynamic Visualization and Fast Computation for Convex Clustering via   Algorithmic Regularization

Michael Weylandt; John Nagorski; Genevera I. Allen

arXiv:1901.01477·stat.ML·November 3, 2021

Dynamic Visualization and Fast Computation for Convex Clustering via Algorithmic Regularization

Michael Weylandt, John Nagorski, Genevera I. Allen

PDF

1 Repo

TL;DR

This paper introduces Algorithmic Regularization for convex clustering, significantly improving computational speed and visualization, enabling practical and insightful clustering analysis with theoretical guarantees.

Contribution

It proposes a novel iterative approximation scheme for convex clustering that guarantees convergence and dramatically accelerates computation while enhancing visualization capabilities.

Findings

01

Over 100-fold speed-up over existing methods

02

Finer approximation grid for clustering paths

03

Enables dynamic visualization of clustering solutions

Abstract

Convex clustering is a promising new approach to the classical problem of clustering, combining strong performance in empirical studies with rigorous theoretical foundations. Despite these advantages, convex clustering has not been widely adopted, due to its computationally intensive nature and its lack of compelling visualizations. To address these impediments, we introduce Algorithmic Regularization, an innovative technique for obtaining high-quality estimates of regularization paths using an iterative one-step approximation scheme. We justify our approach with a novel theoretical result, guaranteeing global convergence of the approximate path to the exact solution under easily-checked non-data-dependent assumptions. The application of algorithmic regularization to convex clustering yields the Convex Clustering via Algorithmic Regularization Paths (CARP) algorithm for computing the…

Tables1

Table 1. Table 1 : Proportion of dendrogram recovered by CARP , CARP-VIZ , and standard fixed-grid methods. The back-tracking employed by CARP-VIZ is necessary for exact dendrogram recovery, but fixed step-size CARP is still able to recover the dendrogram more accurately than standard fixed-grid approaches.

Method	TCGA $(n = 438, p = 353)$	Authors $(n = 841, p = 69)$
CARP $(t = 1.1)$	5.93%	4.40%
CARP $(t = 1.05)$	11.18%	8.09%
CARP $(t = 1.01)$	41.55%	22.71%
CARP $(t = 1.005)$	60.27%	30.56%
CARP-VIZ	100%	100%
100-Point Grid	11.42%	5.00%
1000-Point Grid	37.44%	—

Equations149

U \in R^{n \times p} arg min \frac{1}{2} ∥ X - U ∥_{F}^{2} subject to 1 \leq i < j \leq n \sum \mathbbm 1_{U_{i \cdot} \neq = U_{j \cdot}} \leq t,

U \in R^{n \times p} arg min \frac{1}{2} ∥ X - U ∥_{F}^{2} subject to 1 \leq i < j \leq n \sum \mathbbm 1_{U_{i \cdot} \neq = U_{j \cdot}} \leq t,

\hat{U}_{λ} = U \in R^{n \times p} arg min \frac{1}{2} ∥ X - U ∥_{F}^{2} + λ (1 \leq i < j \leq n \sum w_{ij} ∥ U_{i \cdot} - U_{j \cdot} ∥_{q}) .

\hat{U}_{λ} = U \in R^{n \times p} arg min \frac{1}{2} ∥ X - U ∥_{F}^{2} + λ (1 \leq i < j \leq n \sum w_{ij} ∥ U_{i \cdot} - U_{j \cdot} ∥_{q}) .

U \in R^{n \times p} V \in R^{(2 n) \times p} arg min \frac{1}{2} ∥ X - U ∥_{F}^{2} + λ P (V; w, q) k = 1 \sum (2 n) w_{k} ∥ V_{k \cdot} ∥_{q} subject to V = D U .

U \in R^{n \times p} V \in R^{(2 n) \times p} arg min \frac{1}{2} ∥ X - U ∥_{F}^{2} + λ P (V; w, q) k = 1 \sum (2 n) w_{k} ∥ V_{k \cdot} ∥_{q} subject to V = D U .

d_{H} ({U^{(k)}}, {\hat{U}_{λ}}) \equiv max {λ sup k in f U^{(k)} - \hat{U}_{λ}, k sup λ in f U^{(k)} - \hat{U}_{λ}} (t, ϵ) \to (1, 0) 0

d_{H} ({U^{(k)}}, {\hat{U}_{λ}}) \equiv max {λ sup k in f U^{(k)} - \hat{U}_{λ}, k sup λ in f U^{(k)} - \hat{U}_{λ}} (t, ϵ) \to (1, 0) 0

d_{H} ({Z^{(k)}}, {\hat{Z}_{λ}}) \equiv max {λ sup k in f Z^{(k)} - \hat{Z}_{λ}, k sup λ in f Z^{(k)} - \hat{Z}_{λ}} (t, ϵ) \to (1, 0) 0

d_{H} ({U^{(k)}}, {\hat{U}_{λ}}) = \frac{max { sup _{λ} in f _{k} U ^{(k)} - U ^ _{λ} _{2} , sup _{k} in f _{λ} U ^{(k)} - U ^ _{λ} _{2} }}{n * p * ∥ D X ∥ _{2, \infty}}

d_{H} ({U^{(k)}}, {\hat{U}_{λ}}) = \frac{max { sup _{λ} in f _{k} U ^{(k)} - U ^ _{λ} _{2} , sup _{k} in f _{λ} U ^{(k)} - U ^ _{λ} _{2} }}{n * p * ∥ D X ∥ _{2, \infty}}

U \in R^{n \times p} arg min \frac{1}{2} ∥ X - U ∥_{F}^{2} + λ ((i, j), w) \in E \sum w_{ij} ∥ U_{i \cdot} - U_{j \cdot} ∥_{q}

U \in R^{n \times p} arg min \frac{1}{2} ∥ X - U ∥_{F}^{2} + λ ((i, j), w) \in E \sum w_{ij} ∥ U_{i \cdot} - U_{j \cdot} ∥_{q}

U \in R^{n \times p} arg min \frac{1}{2} ∥ U - X ∥_{F}^{2} + λ P (V; w, q) (e_{l}, w_{l}) \in E \sum w_{l} ∥ V_{l \cdot} ∥_{q} subject to D U - V = 0.

U \in R^{n \times p} arg min \frac{1}{2} ∥ U - X ∥_{F}^{2} + λ P (V; w, q) (e_{l}, w_{l}) \in E \sum w_{l} ∥ V_{l \cdot} ∥_{q} subject to D U - V = 0.

U^{(k + 1)}

U^{(k + 1)}

V^{(k + 1)}

Z^{(k + 1)}

U^{(k + 1)} = (I + ρ D^{T} D)^{- 1} (X + ρ D^{T} (V^{(k)} - Z^{(k)}))

U^{(k + 1)} = (I + ρ D^{T} D)^{- 1} (X + ρ D^{T} (V^{(k)} - Z^{(k)}))

U^{(k + 1)} = L^{- T} L^{- 1} (X + ρ D^{T} (V^{(k)} - Z^{(k)})) where L L^{T} = I + ρ D^{T} D

U^{(k + 1)} = L^{- T} L^{- 1} (X + ρ D^{T} (V^{(k)} - Z^{(k)})) where L L^{T} = I + ρ D^{T} D

V \in R^{∣ E ∣ \times n} arg min λ P (V; w, q) + \frac{ρ}{2} D U^{(k + 1)} - V + Z^{(k)}_{F}^{2}

V \in R^{∣ E ∣ \times n} arg min λ P (V; w, q) + \frac{ρ}{2} D U^{(k + 1)} - V + Z^{(k)}_{F}^{2}

= prox_{\frac{λ}{ρ} P (\cdot; w, q)} (D U^{(k + 1)} + Z^{(k)})

U^{(k + 1)}

U^{(k + 1)}

V^{(k + 1)}

Z^{(k + 1)}

U^{(k + 1)}

U^{(k + 1)}

V^{(k + 1)}

Z^{(k + 1)}

∥ U^{(k)} - \hat{U}_{γ^{(k)}} ∥ + ∥ Z^{(k)} - \hat{Z}_{γ^{(k)}} ∥ < c [∥ U^{(k - 1)} - \hat{U}_{γ^{(k)}} ∥ + ∥ Z^{(k - 1)} - \hat{Z}_{γ^{(k)}} ∥]

∥ U^{(k)} - \hat{U}_{γ^{(k)}} ∥ + ∥ Z^{(k)} - \hat{Z}_{γ^{(k)}} ∥ < c [∥ U^{(k - 1)} - \hat{U}_{γ^{(k)}} ∥ + ∥ Z^{(k - 1)} - \hat{Z}_{γ^{(k)}} ∥]

u \in R^{n p}, v \in R^{∣ E ∣ p} arg min \frac{∥ x - u ∥ _{2}^{2}}{2} + λ ∥ v ∥_{vec (ℓ_{q})} subject to (I \otimes D) u = v

u \in R^{n p}, v \in R^{∣ E ∣ p} arg min \frac{∥ x - u ∥ _{2}^{2}}{2} + λ ∥ v ∥_{vec (ℓ_{q})} subject to (I \otimes D) u = v

∥ \hat{U}_{λ_{1}} - \hat{U}_{λ_{2}} ∥ + ∥ \hat{Z}_{λ_{1}} - \hat{Z}_{λ_{2}} ∥ \leq L * ∣ λ_{1} - λ_{2} ∣

∥ \hat{U}_{λ_{1}} - \hat{U}_{λ_{2}} ∥ + ∥ \hat{Z}_{λ_{1}} - \hat{Z}_{λ_{2}} ∥ \leq L * ∣ λ_{1} - λ_{2} ∣

\hat{u}_{λ} = u \in R^{n p} arg min \frac{1}{2} ∥ u - x ∥_{2}^{2} + λ f_{q} (\tilde{D} u)

\hat{u}_{λ} = u \in R^{n p} arg min \frac{1}{2} ∥ u - x ∥_{2}^{2} + λ f_{q} (\tilde{D} u)

0 \in u_{λ} - x + λ \tilde{D}^{T} \partial f_{q} (\tilde{D} u_{λ})

0 \in u_{λ} - x + λ \tilde{D}^{T} \partial f_{q} (\tilde{D} u_{λ})

0 = u_{λ} - x + λ \tilde{D}^{T} f_{q}^{'} (\tilde{D} u_{λ})

0 = u_{λ} - x + λ \tilde{D}^{T} f_{q}^{'} (\tilde{D} u_{λ})

0

0

\frac{\partial}{\partial λ} [0]

0

0

⟹ \frac{\partial u}{\partial λ}

\frac{\partial u _{λ}}{\partial λ}_{\infty} = - [I + λ \tilde{D}^{T} f_{q}^{''} (\tilde{D} u_{λ}) \tilde{D}]^{- 1} \tilde{D}^{T} f_{q}^{'} (\tilde{D} u_{λ})_{\infty} \leq ∥ - [I + 0]^{- 1} \tilde{D}^{T} f_{q}^{'} (\tilde{D} u_{λ}) ∥_{\infty} = ∥ \tilde{D}^{T} f_{q}^{'} (\tilde{D} u_{λ}) ∥_{\infty} .

\frac{\partial u _{λ}}{\partial λ}_{\infty} = - [I + λ \tilde{D}^{T} f_{q}^{''} (\tilde{D} u_{λ}) \tilde{D}]^{- 1} \tilde{D}^{T} f_{q}^{'} (\tilde{D} u_{λ})_{\infty} \leq ∥ - [I + 0]^{- 1} \tilde{D}^{T} f_{q}^{'} (\tilde{D} u_{λ}) ∥_{\infty} = ∥ \tilde{D}^{T} f_{q}^{'} (\tilde{D} u_{λ}) ∥_{\infty} .

∥ U^{(k)} - \hat{U}_{γ^{(k)}} ∥ + ∥ Z^{(k)} - \hat{Z}_{γ^{(k)}} ∥ \leq c^{k} L ϵ + L (t - 1) ϵ t^{k} i = 1 \sum k - 1 (\frac{c}{t})^{i}

∥ U^{(k)} - \hat{U}_{γ^{(k)}} ∥ + ∥ Z^{(k)} - \hat{Z}_{γ^{(k)}} ∥ \leq c^{k} L ϵ + L (t - 1) ϵ t^{k} i = 1 \sum k - 1 (\frac{c}{t})^{i}

\hat{Υ}_{λ} = (\hat{U}_{λ} \hat{Z}_{λ}) and \hat{Υ}^{(k)} = (U^{(k)} Z^{(k)}) .

\hat{Υ}_{λ} = (\hat{U}_{λ} \hat{Z}_{λ}) and \hat{Υ}^{(k)} = (U^{(k)} Z^{(k)}) .

∥ Υ^{(0)} - \hat{Υ}_{ϵ} ∥ \leq L ϵ

∥ Υ^{(0)} - \hat{Υ}_{ϵ} ∥ \leq L ϵ

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

DataSlingers/clustRviz
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

11affiliationtext: Department of Statistics, Rice University22affiliationtext: Department of Computer Science, Rice University33affiliationtext: Department of Electrical and Computer Engineering, Rice University44affiliationtext: Jan and Dan Neurological Research Institute, Baylor College of Medicine

Dynamic Visualization and Fast Computation for Convex Clustering via Algorithmic Regularization

Michael Weylandt

John Nagorski

Genevera I. Allen

(Last Updated: )

Abstract

Convex clustering is a promising new approach to the classical problem of clustering, combining strong performance in empirical studies with rigorous theoretical foundations. Despite these advantages, convex clustering has not been widely adopted, due to its computationally intensive nature and its lack of compelling visualizations. To address these impediments, we introduce Algorithmic Regularization, an innovative technique for obtaining high-quality estimates of regularization paths using an iterative one-step approximation scheme. We justify our approach with a novel theoretical result, guaranteeing global convergence of the approximate path to the exact solution under easily-checked non-data-dependent assumptions. The application of algorithmic regularization to convex clustering yields the Convex Clustering via Algorithmic Regularization Paths (CARP) algorithm for computing the clustering solution path. On example data sets from genomics and text analysis, CARP delivers over a 100-fold speed-up over existing methods, while attaining a finer approximation grid than standard methods. Furthermore, CARP enables improved visualization of clustering solutions: the fine solution grid returned by CARP can be used to construct a convex clustering-based dendrogram, as well as forming the basis of a dynamic path-wise visualization based on modern web technologies. Our methods are implemented in the open-source R package clustRviz, available at https://github.com/DataSlingers/clustRviz.

Keywords: Clustering, Convex Clustering, Optimization, Algorithmic Regularization, Visualization, Dendrograms Introduction Clustering, the task of identifying meaningful sub-populations in unlabelled data, is a fundamental problem in applied statistics, with applications as varied as cancer subtyping, market segmentation, and topic modeling of text documents. A wide range of methods for clustering have been proposed and we do not attempt to make a full accounting here, instead referring the reader to the recent book of [11]. Perhaps the most popular clustering method, however, is hierarchical clustering [25]. Hierarchical clustering derives its popularity from an intuitive formulation, efficient computation, and powerful visualizations. Dendrogram plots, which display the family of clustering solutions simultaneously, provide an easily-understood summary of the global structure of the data, allowing the analyst to visually examine the nested group structure of the data. Despite its popularity, hierarchical clustering has several limitations: it is highly sensitive to the choice of distance metric and linkage used; it is a heuristic algorithm which lacks optimality guarantees; and the conditions under which hierarchical clustering recovers the true clustering are unknown.

To address these limitations, several authors have recently studied a convex formulation of clustering [65, 49, 56, 39]. This convex formulation guarantees global optimality of the clustering solution and allows analysis of its theoretical properties [77, 88, 66, 4]. Despite these advantages, convex clustering has not yet achieved widespread popularity, due to its computationally intensive nature and lack of dendrogram-based visualizations. In this paper, we address these problems with a efficient algorithm for computing convex clustering solutions with sufficient precision to construct interpretable accurate dendrograms and dynamic path-wise visualizations, thereby making convex clustering a practical tool for applied data analysis.

Our main theoretical contribution is the concept of Algorithmic Regularization, a novel computationally efficient approach for obtaining accurate approximations of regularization paths. We provide a theoretical justification for our proposed approach, showing that we can obtain a high-quality approximation simultaneously at all values of the regularization parameter. While we focus on the convex clustering problem, our proposed approach can be applied to a much wider range of problems arising in statistical learning.

Using algorithmic regularization, we make two methodological contributions related to clustering: first, we propose an efficient algorithm, CARP, for computing convex clustering solutions. CARP is typically over one-hundred times faster than existing approaches, while simultaneously computing a much finer set of solutions than commonly used in practice. Secondly, we propose new visualization strategies for convex clustering based on CARP: a new dendrogram construction based on convex clustering paths and a novel “path-wise” visualization, which provides more information about the structure of the estimated clusters. We hope that, thanks to our proposed computational and visualization strategies, convex clustering will become a viable tool for exploratory data analysis. CARP and our proposed visualizations are implemented in our clustRviz R package, available at https://github.com/DataSlingers/clustRviz.

The remainder of this paper is organized as follows: Section 1 reviews convex clustering in more detail and discusses the difficulties entailed in producing dendrograms from convex clustering. Section 2 introduces the concept of “Algorithmic Regularization,” uses it to develop the CARP clustering algorithm, and gives theoretical guarantees of global convergence. Section 3 compares CARP with existing approaches for convex clustering, demonstrating its impressive computational and statistical performance on several data sets. Section 4 describes several novel visualizations made possible by the CARP algorithms in the context of an extended text-mining example. Finally, Section 5 concludes the paper with a discussion and proposes possible future directions for investigation.

1 Convex Clustering and Dendrograms

We seek to represent the convex clustering solution path as a dendrogram, and in this section, discuss both the theoretical conditions and computational considerations for this task. We first review the basic properties of convex clustering in Section 1.1 and then discuss dendrogram construction from convex clustering in Section 1.2.

1.1 Convex Clustering

Let $\bm{X}\in\mathbb{R}^{n\times p}$ denote a data matrix, consisting of $n$ observations (rows of the matrix) in $p$ dimensions. The convex clustering problem, first discussed by [65] and later explored by [49] and [56], is a convex relaxation of the general clustering problem:

[TABLE]

where $\bm{U}_{i\cdot}$ is the $i^{\text{th}}$ row of $\bm{U}$ . Replacing the non-convex indicator function with an $\ell_{q}$ -norm of the difference, we obtain the convex clustering problem:

[TABLE]

Note that we have included non-negative fusion weights $\{w_{ij}\}$ in our convex relaxation. We will say more about the computational and statistical roles played by these weights below.

The squared Frobenius norm loss function favors solutions which minimize the Euclidean distance between observations and their estimated centroids, while the fusion penalty term encourages the differences in columns of $\hat{\bm{U}}_{\lambda}$ to be shrunk to zero. We interpret the solution $\hat{\bm{U}}_{\lambda}$ as a matrix of cluster centroids, where each observation $\bm{X}_{i\cdot}$ belongs to a cluster with centroid $(\hat{\bm{U}}_{\lambda})_{i\cdot}$ . For sufficiently large values of $\lambda$ , the columns of $\hat{\bm{U}}_{\lambda}$ will be shrunk together by the penalty term. We say that points with the same centroid belong to the same cluster; that is, the observations $\bm{X}_{i\cdot}$ and $\bm{X}_{j\cdot}$ are assigned to the same cluster if $(\hat{\bm{U}}_{\lambda})_{i\cdot}=(\hat{\bm{U}}_{\lambda})_{j\cdot}$

A major advantage of convex clustering is that the solution $\hat{\bm{U}}_{\lambda}$ smoothly interpolates between clustering solutions, yielding a continuous path of solutions indexed by $\lambda$ . At $\lambda=0$ , $\hat{\bm{U}}_{\lambda=0}=\bm{X}$ , resulting in a solution of $n$ distinct clusters, with each observation as the centroid of its own cluster. As $\lambda$ is increased, the fusion penalty encourages the columns of $\hat{\bm{U}}_{\lambda}$ to merge together, inducing a clustering behavior. Finally, when $\lambda$ is large, all columns of $\hat{\bm{U}}_{\lambda}$ are fully merged, yielding a single cluster centroid equal to the grand mean of the columns of $\bm{X}$ . Thus, the penalty parameter $\lambda$ determines both the number of clusters and the cluster assignments.

The choice of the fusion weights $\bm{w}\in\mathbb{R}^{\binom{n}{2}}_{\geq 0}$ has a large effect on the statistical accuracy and computational efficiency of convex clustering. When uniform weights are used, convex clustering has a close connection to single-linkage hierarchical clustering, as shown by [77]. More commonly, weights inversely proportional to the distances between observations are used and have been empirically demonstrated to yield superior performance [49, 39, 37]. Furthermore, setting many of the weights to zero dramatically reduces the computational cost associated with computing the convex clustering solution. We typically prefer using the rotation-invariant $\ell_{2}$ -norm in the fusion penalty ( $q=2$ ), but one could employ $\ell_{1}$ or $\ell_{\infty}$ norms as well.

By formulating clustering as a convex problem, it becomes possible to analyze its theoretical properties using standard techniques from the high-dimensional statistics literature. [77] show a form of prediction consistency and derive an unbiased estimator of the effective degrees of freedom associated with the solution. [88] give sufficient conditions for exact cluster recovery at a fixed value of $\lambda$ (“sparsistency”). Like us, [66] are interested in properties of the entire solution path and give conditions under which convex clustering solutions (1) asymptotically yield the true dendrogram.

1.2 Constructing Dendrograms from Convex Clustering Paths

In this paper, we propose to represent the convex clustering solution path as a dendrogram, an example of which is shown in Figure 1. The convex clustering dendrogram is interpreted in much the same way as the classical dendrogram. As individual observations or groups of observations are fused together by the fusion penalty, they are denoted by merges in the tree structure. The height of the merge in the tree structure is given by the value of the regularization parameter, $\lambda$ , or more precisely $\log(\lambda)$ , at which the fusion occurred. Thus, observations that fuse at small values of $\lambda$ are denoted by merges at the bottom of the dendrogram structure. As with hierarchical clustering, one can cut the dendrogram horizontally at a specific height to yield the associated clusters, and we can interpret the tree height at which merges occur as indicative of the similarity between groups. Before proceeding, however, it is natural to ask whether it is even possible to represent the convex clustering solution path as a dendrogram. This simple question turns out to have a rather subtle answer.

There are two possible impediments to finding the desired dendrogram representation: i) it may be impossible to represent the exact solution path as a dendrogram; and ii) it may be unrealistic to compute the solution path with enough precision to form a dendrogram. We first consider the question of whether the exact solution path admits a dendrogram representation. It is easy to observe that the exact solution path can only be written as a dendrogram if it is agglomerative, i.e., if the solution path consists of only fusions and no fissions. [49] showed that if an $\ell_{1}$ -norm is used for the fusion penalty and the weights are uniform then the solution path is agglomerative, but their analysis does not generalize to arbitrary norms or arbitrary weight schemes. [4] showed that it is possible to select weights to yield a agglomerative path, but their analysis applies only to a specific weighting scheme. In general, for an arbitrary data-driven choice of weights, there is no theoretical guarantee that the solution path will be agglomerative. In our experience, however, the solution path is agglomerative except in pathological situations.

Assuming that the exact solution path is agglomerative, we still must determine whether the solution path, as calculated, is agglomerative. While, in theory, this poses no additional challenge as the solution path is a continuous function of $\lambda$ , in practice this poses a nearly insurmountable computational challenge. To construct a dendrogram, we require the exact values of $\lambda$ at which each fusion occurs. Since these values of $\lambda$ are not known a priori, we are faced with a double burden: we must identify a critical set of values of $\lambda$ and solve the convex clustering problem (1) at those values. There are two widely-used approaches to finding the critical set of $\lambda$ ’s, path-wise algorithms and grid search, but, as we will show below, neither approach is sufficient in this case.

Path-wise algorithms, such as those proposed for the Lasso [16, 8] or generalized Lasso [80] problems, compute the entire solution path exactly, identifying each value of $\lambda$ at which a sparsity event, equivalent to a fusion in our case, occurs. These algorithms are typically based on piece-wise linearity of the underlying solution path, which allows for smooth interpolation between sparsity events. [70] studied the conditions under which solution paths are piece-wise linear and hence under which path-wise algorithms can be developed. It is easy to verify that these conditions do not hold with our preferred $\ell_{2}$ -norm fusion penalty, so the solution path is not piece-wise linear and hence a path-wise algorithm cannot be employed for the convex clustering problem (1). We note that several authors have proposed path-wise algorithms which can theoretically handle non-piece-wise-linear paths, but require solving an ordinary differential equation exactly [26, 29, 27]; in practice, these methods are computationally intensive and only approximate the path at a series of grid points, similar to the iterative methods we discuss next. For the $\ell_{1}$ -norm fusion case, it is possible to apply path-wise algorithms for weights with specific graphical structures (see the discussion in [49] or the examples considered in [80])), but not for arbitrary graph structures with arbitrary weights, again eliminating the possibility of a general-purpose path-wise algorithm.

Since there exists no exact path-wise algorithm, we might instead compute the convex clustering solution at a series of discrete points corresponding to a regular grid of $\lambda$ ’s. As we will show in Section 3, however, this strategy is still computationally burdensome even using state-of-the-art algorithms and warm-start techniques. For example, the fastest algorithm considered by [39], an Accelerated Alternating Minimization Algorithm, takes 6.87 hours and 19.81 hours to compute the solution path at 100 and 1000 regularly spaced $\lambda$ ’s, respectively, on a relatively small data set of dimension $n=438$ and $p=353$ . Furthermore, a grid of 100 or 1000 $\lambda$ ’s does not give us the value of $\lambda$ at which each fusion event occurs, which we need to construct a dendrogram. Even if one wants to construct a dendrogram using only the order in which each fusion event occurs and their associated approximate values of $\lambda$ , computing the path along a grid of 100 or 1000 $\lambda$ ’s only uniquely resolves 11.4% or 37.44% respectively of the fusion events needed to construct a dendrogram. (See Table 1 in Section 3 for complete results.)

In general, the computational cost of performing convex clustering is so high that it precludes its use as a practical tool for clustering and exploratory data analysis. Further, computing the entire convex clustering solution path with fine enough precision to construct a dendrogram is an all but insurmountable task given existing computational algorithms for convex clustering. We seek to address this problem in this paper, using a novel computational technique which provides a fine grid of high-quality estimates of the regularization path. We introduce our approach and the clustering algorithm it suggests, CARP, in the next section.

2 CARP: Convex Clustering via Algorithmic Regularization Paths

We now turn our attention to efficiently computing solutions to the convex clustering problem (1) for a fine grid of $\lambda$ , with a goal of dendrogram construction. Like many problems in the “loss + penalty” form, the convex clustering problem is particularly amenable to operator-splitting schemes such as the Alternating Direction Method of Multipliers (ADMM) [33]. In statistical learning, we are often interested in the solution to a regularized estimation problem at a large number of values of $\lambda$ . In this context, the performance of ADMMs is further improved by the use of “warm-starts:” if the ADMM is initialized near the solution, usually the solution at the previous value of $\lambda$ , only a few iterations are typically required to obtain a solution which is accurate up to the statistical uncertainty inherent in the problem.

We propose an extreme version of this approach which we call Algorithmic Regularization. Instead of running the ADMM to convergence, we take only a single ADMM step, after which we move to the next value of $\lambda$ . By taking only a single ADMM step, we can significantly reduce the computational cost associated with estimating a regularization path. We can then use these computational savings to solve for a much finer grid of $\lambda$ ’s than we would typically use if employing a standard scheme. In essence, Algorithmic Regularization allows us to exchange computing an exact solution for a small set of $\lambda$ ’s for calculating a highly accurate approximation at a large set of $\lambda$ ’s. Usefully, we can now use a $\lambda$ grid with sufficiently fine resolution that we can fully capture the desired dendrogram structure in a reasonable amount of time.

[39] first considered the use of the ADMM to solve the convex clustering problem (1). To apply the ADMM, we introduce $\bm{D}$ , the directed difference matrix used to calculate to the pairwise differences of rows of $\bm{U}$ , and an auxiliary variable $\bm{V}$ , corresponding to the matrix of between-observation differences:

[TABLE]

Applying the ADMM with warm-starts to the above, we obtain the following algorithm:

where $\bm{Z}$ is the dual variable with the same dimensions as $\bm{V}$ , $\bm{L}$ is Cholesky factorization of $\bm{I}+\bm{D}^{T}\bm{D}$ , and $\operatorname{\textsf{prox}}_{f(\cdot)}(\bm{x})=\operatorname*{arg\,min}_{\bm{z}}\frac{1}{2}\|\bm{x}-\bm{z}\|_{2}^{2}+f(\bm{z})$ is the proximal mapping of a general function $f$ . Note that, if sparse weights are used, the corresponding rows of $\bm{D}$ , $\bm{V}$ , and $\bm{Z}$ may be omitted, yielding more efficient updates. Additionally, note that, because we use a multiplicative update for $\lambda$ , we must initialize at $\lambda=\epsilon$ , for some small $\epsilon$ , rather than at $\lambda=0$ . A derivation and more detailed statement of this algorithm are given in Section References of the Supplementary Materials.

We now take Algorithm 1 as the basis for our extreme early stopping strategy of Algorithmic Regularization. Removing the the inner loop, we obtain the following scheme, which we refer to as CARP–Convex Clustering via Algorithmic Regularization Paths:

The fundamental difference between Algorithm 1 and Algorithm 2 is that Algorithm 2 does not have an “inner loop” in which ADMM iterates are repeated until convergence to the exact solution for a fixed value of the regularization parameter. As such, the CARP iterates $\{\bm{U}^{(k)}\}$ are not exact solutions to the convex clustering problem (1) for any value of $\lambda$ , though they are typically accurate approximations in a sense that Theorem 1 below makes precise. Our notation reflects this distinction and replaces $\lambda_{l}$ with $\gamma^{(k)}$ in the $\bm{V}$ -update to avoid suggesting any false equivalence. A more detailed formulation of the CARP algorithm is given in Section References of the Supplementary Materials.

The role of the step-size parameter $t$ in Algorithm 2 is particularly important in understanding CARP. The step size $t$ controls the fineness of the $\{\gamma^{(k)}\}$ grid used internally by CARP and, as such, serves as a computational tuning parameter controlling how well the CARP path approximates the true convex clustering path. Decreasing $t$ has benefits for both local and global accuracy of the CARP path: a smaller value of $t$ yields an approximate solution path which has a finer set of grid points $\{\gamma^{(k)}\}$ (improved global accuracy) and more accurate approximations $\{\bm{U}^{(k)}\}$ at each of those grid points (improved local accuracy). This is in contrast to standard approaches where the user has to pre-specify the $\{\lambda_{l}\}$ grid used and the stopping tolerance of the iterative algorithm to strike a balance between local accuracy and global accuracy obtainable in a given amount of time. As we will see in Section 3, the Algorithmic Regularization strategy of replacing an iterative algorithm with a one-step approximation thereof allows us to improve both local and global accuracy at a fraction of the cost of competing methods.

While this may all seem rather fishy, the following theorem shows that, in the limit of small changes to the regularization level (i.e., $(t,\epsilon)\to(1,0)$ ), there is indeed no loss in accuracy induced by the one-step approximation. In fact, we are able to show a very strong form of convergence, so-called Hausdorff convergence, in both the primal and dual variables. Hausdorff convergence implies two different convergence results hold simultaneously for both the primal and dual variables. The first, $\sup_{\lambda}\inf_{k}\left\|\bm{U}^{(k)}-\hat{\bm{U}}_{\lambda}\right\|\to 0$ , implies every convex clustering solution will be recovered by CARP as $(t,\epsilon)\to(1,0)$ . The second, $\sup_{k}\inf_{\lambda}\left\|\bm{U}^{(k)}-\hat{\bm{U}}_{\lambda}\right\|\to 0$ , implies that any clustering produced by CARP as $(t,\epsilon)\to(1,0)$ is a valid convex clustering solution for some $\lambda$ . More memorably, Theorem 1 shows that asymptotically CARP produces “the whole regularization path and nothing but the regularization path:”

Theorem 1.

As $(t,\epsilon)\to(1,0)$ , where $t$ is the multiplicative step-size update and $\epsilon$ is the initial regularization level, the primal and dual CARP paths converge to the primal and dual convex clustering paths in the Hausdorff metric: that is,

[TABLE]

where $\bm{U}^{(k)},\bm{Z}^{(k)}$ are the values of the $k^{\textrm{th}}$ CARP iterate and $\hat{\bm{U}}_{\lambda},\hat{\bm{Z}}_{\lambda}$ are the exact solutions to the convex clustering problem (1) and its dual at $\lambda$ .

A full proof of Theorem 1 is given in Section A of the Supplementary Materials, but we highlight the three essential elements here: i) we obtain a high-quality initialization at the first step by setting $\bm{U}^{(0)}=\bm{X}$ which is the exact solution at $\lambda=0$ ( $\hat{\bm{U}}_{\lambda=0}=\bm{X}$ ); ii) the convex clustering problem (1) is strongly convex due to the squared Frobenius norm loss, so the ADMM converges quickly (linearly); and iii) the solution path is Lipschitz as a function of $\lambda$ , so $\hat{\bm{U}}_{\lambda}$ does not vary too quickly. Putting these together, we show that CARP can “track” the exact solution path closely, with the approximation error at each step decreasing at a faster rate than the exact solution changes. We emphasize that both strong convexity and Lipschitz solution paths are features of the optimization problem, not the specific data, and are easily checked in practice. A careful reading of the proof of Theorem 1 will reveal that our analysis applies to a much wider class of problems than convex clustering. We consider the application of algorithmic regularization to the closely related problem of convex bi-clustering [37] in Section B of the Supplementary Materials, where we develop the CBASS (Convex Bi-Clustering via Algorithmic Regularization with Small Steps) algorithm, but we leave examination of the more general phenomenon of Algorithmic Regularization to future work.

While Theorem 1 implies that a sufficiently small choice of step size $t$ allows for exact dendrogram recovery, in practice it is often challenging to select $t$ sufficiently small without requiring excessive computation. Instead, we take a small, but not infinitesimal, value of $t$ and add a back-tracking step to ensure that fusions necessary for dendrogram construction are exactly identified. We refer to the back-tracking version of CARP as CARP-VIZ, for reasons which will be clarified in Section 3. Furthermore, a post-processing step can be used to find fusions that back-tracking is unable to isolate. Details of the back-tracking and post-processing rules, as implemented in clustRviz, are given in Section D of the Supplementary Materials.

Algorithmic Regularization, as used here, was first discussed in [13] without theoretical justification and was successfully applied to unmixing problems in hyperspectral imaging by [7]. We emphasize that, while grounded in standard optimization techniques, algorithmic regularization takes a different perspective than other commonly-used computational approaches, more akin to function approximation than standard optimization. The algorithmic regularization perspective is principally concerned with recovering the overall structure of the solution path than with obtaining the most accurate solution possible at a fixed value of $\lambda$ . As such, our Theorem 1 is of a different character than similar results appearing in the optimization literature, making a claim of global path-wise convergence rather than local point-wise convergence. This “holistic” viewpoint is necessary to recover dendrograms, a major goal of this paper, but has also recently been found useful for choosing tuning parameters in penalized regression problems [5].

2.1 Related Work

Several authors have considered path approximation algorithms not unlike CARP, often in the context of boosting algorithms. [20], [28], and [9] all consider iterative algorithms which approximate solution paths of regularized estimators. Of these, the approach of [28], who consider a path approximation algorithm for the Lasso, is most similar to our own. Assuming strong convexity, their BLasso algorithm exactly recovers the lasso solution path as the step-size goes to zero. Their algorithm can be viewed as an application of algorithmic regularization to greedy coordinate descent with an additional back-tracking step to help isolate events of interest (variables entering or leaving the active set). Our algorithmic regularization strategy is simpler than their approach, as it does not require the back-tracking step, and can be applied to more general penalty functions.

[6] and [10] consider the problem of obtaining approximate solutions for a set of parameterized problems subject to a simplex constraint, though their approach still requires running an optimization step until approximate convergence at each step, Building on this, [23] proposes a general framework for constructing “stagewise” solution paths, which can be interpreted as an application of algorithmic regularization to the Frank-Wolfe algorithm [14]. He shows that the stagewise estimators achieve the optimal objective value as the step-size is taken to zero; if a strong convexity assumption is added, it is not difficult to extend his Theorem 2 to recover a result similar to our Theorem 1. Our framework is more general than his, as we do not require require the gradient of the loss function to be Lipschitz and we admit more general regularizers.

3 Numerical and Timing Comparisons

Having introduced CARP and given some theoretical justification for its use, we now consider its performance on representative data sets from text analysis and genomics. As we will show, CARP achieves the superior clustering performance of convex clustering at a small fraction of the computational cost. Throughout this section, we use two example data sets: TCGA and Authors. The TCGA data set ( $n=438$ , $p=353$ ) contains log-transformed Level III RPKM gene expression levels for 438 breast-cancer patients from [22]. The Authors data set ( $n=841,p=69$ ) consists of word counts from texts written by four popular English-language authors (Austen, London, Shakespeare, and Milton). For all comparisons, we use clustRviz’s default sparse Gaussian kernel weighting scheme described in the package documentation. For timing comparisons, the Accelerated ADMM and AMA proposed by [39] and implemented in their cvxclustr package was used; our clustRviz package was used for CARP and CARP-VIZ. All comparisons were run on a 2013 iMac with a 3.2 GHz Intel i5 processor and 16 GB of 1600 MHz DDR3 memory.

While Theorem 1 strictly only applies for asymptotically small values of $t$ , CARP paths are high-quality approximations of the exact convex clustering solution path, even at moderate values of $t$ . We assess accuracy of the CARP paths by considering the normalized relative Hausdorff distance between the primal CARP path ( $\bm{U}^{(k)}$ ) and the exact solution ( $\hat{\bm{U}}_{\lambda}$ )

[TABLE]

where $\|\cdot\|_{2,\infty}$ is the maximum of the $\ell_{2}$ -norms of the rows of a matrix. (We include the normalization constants in the denominator so that our distance measure does not depend on the size or numerical scale of the data.) In order to calculate the Hausdorff distance, the CARP path with a very small step-size $t$ was used in lieu of the exact solution $\{\hat{\bm{U}}_{\lambda}\}$ . As can be seen in Figure 2, the CARP path is highly accurate even at moderate values of $t$ and converges quickly to the exact solution as $t\to 1$ . CARP-VIZ, which uses an adaptive choice of $t$ to isolate each individual fusion, performs even better than the fixed step-size CARP, attaining a very accurate approximation of the true clustering path.

Even though they are highly accurate, CARP paths are relatively cheap to compute. In Figure 3, we compare the computational cost of CARP with the algorithms proposed in [39]. As shown in Figure 3, CARP significantly outperforms the Accelerated AMA and ADMM algorithms. At large step-sizes ( $t=1.1,1.05$ ), CARP terminates in less than a minute and, even at finer grid-sizes ( $t=1.01,1.005$ ), CARP takes only a few minutes to run. The CARP-VIZ variant takes significantly longer than standard CARP, though it still outperforms the AMA, taking about an hour for TCGA, rather than the six and a half hours required to solve the AMA at 100 grid points. This improvement in computational performance is even more remarkable when we note that CARP and CARP-VIZ produce a fine grid of solutions by default: on the TCGA data, CARP with $t=1.01$ produces over 2047 distinct grid points in under five minutes.

The fine grid of solutions returned by CARP and CARP-VIZ result in much improved dendrogram recovery, as measured by the fraction of unique clustering assignments each method returns. Table 1 shows recovery results for CARP (at several values of $t$ ), CARP-VIZ, and standard fixed-grid methods. It is clear that back-tracking employed by CARP-VIZ is necessary for exact dendrogram recovery and that CARP-VIZ should be used if the exact dendrogram recovery is required for visualization. Even with moderate step-sizes ( $t=1.1,1.05$ ), however, CARP is still able to estimate the dendrogram far more accurately and more rapidly than standard iterative methods, making it a useful alternative for exploratory work.

On the TCGA data, CARP with $t=1.05$ is able to recover the dendrogram with the same accuracy in a minute that the AMA attains in six and a half hours. With $t=1.01$ , CARP recovers the dendrogram more accurately in under five minutes than the AMA does in nineteen hours (1000 grid points). CARP achieves these improvements by using its computation efficiently: while a standard optimization algorithm may spend several hundred iterations at a single value of the regularization parameter, CARP only spends a single iteration. By reducing the number of iterations at each grid point, it can take examine a much finer grid in less time. This trade-off is particularly well-suited for our goal of dendrogram recovery, which requires a fine grid of solutions to assess the order of fusions, but does not depend on the exact values of the estimated centroids. While not a primary focus of this paper, the high-quality dendrogram estimation allowed by CARP translates into improved statistical performance as well. We compare the statistical performance of CARP with other clustering methods in Section C of the Supplementary Materials.

4 Visualization of CARP Results

In this section, we discuss visualization of convex clustering results, emphasizing the role that CARP can play in exploratory data analysis. The visualizations illustrated in this section can all be produced using our clustRviz R package. Throughout this section, we will use the Presidents data set, ( $n=44,p=75$ ) which contains log-transformed word counts of the 75 most variable words taken from the aggregated major speeches (primarily Inaugural and State of the Union Addresses) of the 44 U.S. presidents through mid-2018. (We consolidate the two non-consecutive terms of Grover Cleveland.)

We begin by considering a dendrogram representation of this data, as shown in Figure 4. For each dendrogram, we have colored the observations by historical period: Founding Fathers, pre-Civil War, pre-World War II, and modern. Given the evolution of the English language and the changing political concerns of these periods, we would expect clustering methods to group the presidents according to historical period. With three exceptions, CARP clearly identifies the four historical periods, with the modern period being particularly well-separated. The performance of hierarchical clustering is highly sensitive to the choice of linkage: Ward’s linkage [25] does almost as well as CARP, but does not clearly separate the pre-Civil War and pre-World War II periods. Single linkage correctly identifies the modern period, but otherwise does not separate the pre-modern presidents. Complete linkage performs the worst, clustering Donald Trump with the Founding Fathers, Garfield, and Harrison instead of with other modern presidents. We note that Harrison is consistently misclustered by all methods considered: we believe this is due to the fact he died thirty-one days into his first term and did not leave a lengthy textual record.

Beyond allowing accurate dendrogram construction, the CARP paths are themselves interesting to visualize. By plotting the path traced by the CARP iterates $\bm{U}^{(k)}$ , we can observe exactly how CARP forms clusters from a given data set. For dimension reduction, we typically plot the projection of $\bm{U}^{(k)}$ onto the principal components of $\bm{X}$ , though clustRviz allows visualization of the raw features as well. Unlike the CARP-dendrogram, the path plot allows examination of the structure of the estimated clusters and not just their membership. By displaying the original observations on the path plot, we also enable comparison of the estimated centroids with the original data. Modern web technologies allow us to display these path plots dynamically, forming a movie with each CARP iterate as a separate frame. The fine solution grid returned by CARP is especially relevant here, enabling us to construct movies in which the observations move smoothly. We have found that the smoothness of the movie is a useful heuristic to assess whether a small enough step-size $t$ was used: if the paths “jump” conspicuously from one frame to the next, one should consider re-running CARP with a smaller step-size.

Figure 5 shows three frames of such a movie: in each frame, on the left side, we see the Founding Fathers cluster being merged to the other pre-modern presidents, while on the right, we see a clear cluster of modern presidents. A closer examination of the central frame reveals additional information not visible in the CARP-dendrogram: Harding, the last president to join the pre-modern cluster, is an outlier lying between two clusters rather than far to one side.

Because both the dendrogram and path visualizations are indexed by the regularization level, $\gamma^{(k)}$ , it is possible to display them in a “linked” fashion, highlighting clusterings on the dendrogram as they are fused in the path plot. Particularly when rendered dynamically, this combination gives the best of both visualizations, combining the global structure visible in the dendrogram with the structural information visible in the path plot. An example of this “linked” visualization is shown in Figure 6.

5 Discussion

We have introduced Algorithmic Regularization, an iterative one-step approximation scheme which can be used to efficiently obtain high-quality approximations of regularization paths. Algorithmic regularization focuses on accurate reconstruction of an entire regularization path and is particularly useful for obtaining path-wise information, such as a dendrogram or the order in which variables leave the active set in sparse regression. We have focused on the application of the ADMM to convex clustering, but the technique of iterative one step-approximations can be applied to any problem which lacks an efficient algorithm. We believe that algorithmic regularization can be fruitfully applied to a broader range of statistical learning problems and expect that similar computational improvements can be achieved for other difficult optimization problems.

Theorem 1 is a novel global convergence result, guaranteeing high-quality approximation at each point of the exact solution path. Despite this, there are still many open questions in the analysis of algorithmic regularization. We are particularly interested in determining optimal convergence rates for global path-wise approximation problems and showing that algorithmic regularization can attain those rates. Our proof of Theorem 1 depends on the strong convexity of the convex clustering problem to ensure linear convergence of the underlying ADMM steps. It would be interesting to explore the interplay between algorithmic regularization and optimization schemes which are linearly convergent without strong convexity, as this may extend the applicability of algorithmic regularization even further.

Using algorithmic regularization, we have introduced the CARP and CBASS algorithms for convex clustering and bi-clustering. On moderately sized problems, CARP and CBASS reduce the time necessary to obtain high-quality regularization paths from several hours to only a few minutes, typically attaining over one-hundred-fold improvements over existing algorithms. Because CARP and CBASS return solutions at a fine grid of the regularization parameter, they can be used to construct accurate convex (bi-)clustering dendrograms, particularly if the back-tracking CARP-VIZ and CBASS-VIZ variants are employed. Additionally, the fine-grained CARP and CBASS solution paths allow for path-wise dynamic visualizations, allowing the analyst to observe exactly how the estimated clusters are formed and structured.

We anticipate that the computational and visualization techniques proposed in this paper will make convex clustering and bi-clustering an attractive option for applied data analysis. Both CARP and CBASS, as well as the proposed visualizations, are implemented in our clustRviz software, available at https://github.com/DataSlingers/clustRviz.

Acknowledgements

We thank Eric Chi for helpful discussions about both convex clustering and algorithmic regularization. MW acknowledges support from the NSF Graduate Research Fellowship Program under grant number 1842494. GA acknowledges support from NSF DMS-1554821, NSF NeuroNex-1707400, and NSF DMS-1264058. JN acknowledges support from NSF DMS-124058, NSF DMS-1554821, and the National Institutes of Health National Cancer Institute T32 Training program in Biostatistics for Cancer Research, Grant Number: CA096520.

MW and JN jointly developed the clustRviz software. MW is responsible for the content and proof of Theorem 1 and prepared the manuscript. JN performed initial experiments and developed the back-tracking and post-processing schemes. GA supervised the research and edited the final manuscript.

Supplementary Materials

Supplementary materials published alongside the online version of this article contain a more detailed derivation of Algorithms 1 and 2, a detailed proof of Theorem 1, a discussion of our CBASS algorithm for convex bi-clustering, numerical comparisons of CARP to existing clustering approaches on real and synthetic data sets, details about the back-tracking, post-processing, and dendrogram construction strategies used in our CARP-VIZ algorithm and clustRviz software, and a discussion of additional related work.

References

[31] Heinz H. Bauschke and Patrick L. Combettes

“A Dykstra-Like Algorithm for Two Monotone Operators”

In Pacific Journal of Optimization 4.3, 2008, pp. 383–391

URL: http://www.ybook.co.jp/online2/oppjo/vol4/p383.html

[32] Dimitri P. Bertsekas

“Incremental proximal methods for large scale convex optimization”

In Mathematical Programming, Series B 129, 2011, pp. 163–195

DOI: 10.1007/s10107-011-0472-0

[33] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato and Jonathan Eckstein

“Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers”

In Foundations and Trends® in Machine Learning 3.1, 2011, pp. 1–122

DOI: 10.1561/2200000016

[34] Peter Bühlmann and Sara Geer

“Statistics for High-Dimensional Data: Methods, Theory, and Applications”, Springer Series in Statistics

Springer Verlag, 2011

DOI: 10.1007/978-3-642-20192-9

[35] Emmanuel J. Candès, Xiaodong Li, Yi Ma and John Wright

“Robust Principal Component Analysis?”

In Journal of the ACM 58.3, 2011, pp. 11:1–11:37

DOI: 10.1145/1970392.1970395

[36] Gary K. Chen, Eric C. Chi, John Michael O. Ranola and Kenneth Lange

“Convex Clustering: An Attractive Alternative to Hierarchical Clustering”

In PLOS Computational Biology 11.5, 2015, pp. e1004228

DOI: 10.1371/journal.pcbi.1004228

[37] Eric C. Chi, Genevera I. Allen and Richard G. Baraniuk

“Convex Biclustering”

In Biometrics 73.1

Wiley Online Library, 2017, pp. 10–19

DOI: 10.1111/biom.12540

[38] Eric C. Chi, Brian R. Gaines, Will Wei Sun, Hua Zhou and Jian Yang

“Provable Convex Co-Clustering of Tensors”

In ArXiv Pre-Print 1803.06518, 2018

URL: https://arxiv.org/abs/1803.06518

[39] Eric C. Chi and Kenneth Lange

“Splitting Methods for Convex Clustering”

In Journal of Computational and Graphical Statistics 24.4

Taylor & Francis, 2015, pp. 994–1013

DOI: 10.1080/10618600.2014.948181

[40] Patrick L. Combettes and Jean-Cristophe Pesquet

“Proximal Splitting Methods in Signal Processing”

In Fixed-Point Algorithms for Inverse Problems in Science and Engineering

Springer, 2011, pp. 185–212

DOI: 10.1007/978-1-4419-9569-8˙10

[41] Wei Deng and Wotao Yin

“On the Global and Linear Convergence of the Generalized Alternating Direction Method of Multipliers”

In Journal of Scientific Computing 66.3, 2016, pp. 889–916

DOI: 10.1007/s10915-015-0048-x

[42] John Duchi, Shai Shalev-Shwartz, Yoram Singer and Tushar Chandra

“Efficient Projections onto the $\ell_{1}$ -Ball for Learning in High Dimensions”

In ICML 2008: Proceedings of the 25th International Conference on Machine Learning

Helsinki, Finland: Omnipress, 2008, pp. 272–279

DOI: 10.1145/1390156.1390191

[43] Jianqing Fan and Runze Li

“Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties”

In Journal of the American Statistical Association 96.456, 2001, pp. 1348–1360

DOI: 10.1198/016214501753382273

[44] Daniel Gabay and Bertrand Mercier

“A Dual Algorithm for the Solution of Nonlinear Variational Problems via Finite Element Approximation”

In Computers & Mathematics with Applications 2.1, 1976, pp. 17–40

DOI: 10.1016/0898-1221(76)90003-1

[45] R. Glowinski and A. Marroco

“Sur l’Approximation, par Éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires”

In Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique 9.R2, 1975, pp. 41–76

[46] Tom Goldstein, Brendan O’Donoghue, Simon Setzer and Richard Baraniuk

“Fast Alternating Direction Optimization Methods”

In SIAM Journal on Imaging Sciences 7.3, 2014, pp. 1588–1623

DOI: 10.1137/120896219

[47] Trevor Hastie, Robert Tibshirani and Martin Wainwright

“Statistical Learning with Sparsity: The Lasso and Generalizations”, Monographs on Statistics and Applied Probability 143

ChapmanHall/CRC, 2015

[48] Nhat Ho, Tianyi Lin and Michael I. Jordan

“Global Error Bounds and Linear Convergence for Gradient-Based Algorithms for Trend Filtering and $\ell_{1}$ -Convex Clustering”

In ArXiv Pre-Print 1904.07462, 2019

URL: https://arxiv.org/abs/1904.07462

[49] Toby Dylan Hocking, Armand Joulin, Francis Bach and Jean-Philippe Vert

“Clusterpath: An Algorithm for Clustering using Convex Fusion Penalties”

In ICML 2011: Proceedings of the 28th International Conference on Machine Learning

Bellevue, Washington, USA: ACM, 2011, pp. 745–752

URL: http://www.icml-2011.org/papers/419_icmlpaper.pdf

[50] Holger Hoefling

“A Path Algorithm for the Fused Lasso Signal Approximator”

In Journal of Computational and Graphical Statistics 19.4, 2010, pp. 984–1006

DOI: 10.1198/jcgs.2010.09208

[51] Mingyi Hong and Zhi-Quan Luo

“On the linear convergence of the alternating direction method of multipliers”

In Mathematical Programming, Series A 162.1-2, 2017, pp. 165–1699

DOI: 10.1007/s10107-016-1034-2

[52] Lawrence Hubert and Phipps Arabie

“Comparing Partitions”

In Journal of Classification 2.1, 1985, pp. 193–218

DOI: 10.1007/BF01908075

[53] Nicholas A. Johnson

“A Dynamic Programming Algorithm for the Fused Lasso and $L_{0}$ -Segmentation”

In Journal of Computational and Graphical Statistics 22.2, 2013, pp. 246–260

DOI: 10.1080/10618600.2012.681238

[54] Kenneth Lange and Kevin L. Keys

“The Proximal Distance Algorithm”

In ICM 2014: Proceedings of the International Congress of Mathematicians 4, 2014, pp. 95–116

URL: http://www.icm2014.org/en/vod/proceedings.html

[55] Xudong Li, Defeng Sun and Kim-Chuan Toh

“A Highly Efficient Semismooth Newton Augmented Lagrangian Method for Solving Lasso Problems”

In SIAM Journal on Optimization 28.1, 2016, pp. 433–458

DOI: 10.1137/16M1097572

[56] Fredrik Lindsten, Henrik Ohlsson and Lennart Ljung

“Clustering using sum-of-norms regularization: With application to particle filter output computation”

In SSP 2011: Proceedings of the 2011 IEEE Statistical Signal Processing Workshop

Nice, France: Curran Associates, Inc., 2011, pp. 201–204

DOI: 10.1109/SSP.2011.5967659

[57] P.L. Lions and B. Mercier

“Splitting Algorithms for the Sum of Two Nonlinear Operators”

In SIAM Journal on Numerical Analysis 16.6, 1979, pp. 964–979

DOI: 10.1137/0716071

[58] Yuliya Marchetti and Qing Zhou

“Solution Path Clustering with Adaptive Concave Penalty”

In Electronic Journal of Statistics 8.1, 2014, pp. 1569–1603

DOI: 10.1214/14-EJS934

[59] Jean Jacques Moreau

“Décomposition Orthogonale d’un Espace Hilbertien Selon Deux Cônes Mutuellement Polaires”

In Comptes Rendus Hebdomadaires des Séances de l’Académie des Sciences 255.1, 1962, pp. 238–240

[60] John Nagorski and Genevera I. Allen

“Genomic Region Detection via Spatial Convex Clustering”

In PLoS One 13.9, 2018, pp. e0203007

DOI: 10.1371/journal.pone.0203007

[61] Robert Nishihara, Laurent Lessard, Ben Recht, Andrew Packard and Michael Jordan

“A General Analysis of the Convergence of ADMM”

In ICML:2015: Proceedings of the 32nd International Conference on Machine Learning

Lille, France: PMLR, 2015, pp. 343–352

URL: http://proceedings.mlr.press/v37/nishihara15.html

[62] Wei Pan, Xiaotong Shen and Binghui Liu

“Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty”

In Journal of Machine Learning Research 14, 2013, pp. 1865–1889

URL: http://www.jmlr.org/papers/v14/pan13a.html

[63] Ashkan Panahi, Devdatt Dubhashi, Fredrik D. Johansson and Chiranjib Bhattacharyya

“Clustering by Sum of Norms: Stochastic Incremental Algorithm, Convergence, and Cluster Recovery”

In ICML:2017: Proceedings of the 34th International Conference on Machine Learning 70

Sydney, Australia: PMLR, 2017, pp. 2769–2777

URL: http://proceedings.mlr.press/v70/panahi17a.html

[64] Cheolwoo Park, Hosik Choi, Chris Delcher, Yanning Wang and Young Joo Yoon

“Convex Clustering Analysis for Histogram‐Valued Data”

In Biometrics To appear, 2018+

DOI: 10.1111/biom.13004

[65] Kristiaan Pelckmans, Joseph Brabanter, Bart Moor and Johan Suykens

“Convex Clustering Shrinkage”

In PASCAL Workshop on Statistics and Optimization of Clustering, 2005

[66] Peter Radchenko and Gourab Mukherjee

“Convex Clustering via $\ell_{1}$ Fusion Penalization”

In Journal of the Royal Statistical Society, Series B: Statistical Methodology 79.5, 2017, pp. 1527–1546

DOI: 10.1111/rssb.12226

[67] William M. Rand

“Objective Criteria for the Evaluation of Clustering Methods”

In Journal of the American Statistical Association 66.366, 1971, pp. 846–850

DOI: 10.1080/01621459.1971.10482356

[68] Alessandro Rinaldo

“Properties and Refinements of the Fused Lasso”

In Annals of Statistics 37.5B, 2009, pp. 2922–2952

DOI: 10.1214/08-AOS665

[69] R. Rockafellar

“Convex Analysis”

Princeton University Press, 1970

[70] Saharon Rosset and Ji Zhu

“Piecewise Linear Regularized Solution Paths”

In Annals of Statistics 35.3, 2007, pp. 1012–1030

DOI: 10.1214/009053606000001370

[71] Leonid I. Rudin, Stanley Osher and Emad Fatemi

“Nonlinear Total Variation Based Noise Removal Algorithms”

In Physica D: Nonlinear Phenomena 60.1-4, 1992, pp. 259–268

DOI: 10.1016/0167-2789(92)90242-F

[72] Sohil Atul Shah and Vladlen Koltun

“Robust continuous clustering”

In Proceedings of the National Academy of Sciences of the United States 114.37, 2017, pp. 9814–9819

DOI: 10.1073/pnas.1700770114

[73] James Sharpnack, Aarti Singh and Alessandro Rinaldo

“Sparsistency of the Edge Lasso over Graphs”

In AISTATS 2012: Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics 22

La Palma, Canary Islands: PMLR, 2012, pp. 1028–1036

URL: http://proceedings.mlr.press/v22/sharpnack12.html

[74] Wei Shi, Qing Ling, Kun Yuan, Gang Wu and Wotao Yin

“On the Linear Convergence of the ADMM in Decentralized Consensus Optimization”

In IEEE Transactions on Signal Processing 62.7, 2014, pp. 1750–1761

DOI: 10.1109/TSP.2014.2304432

[75] Xiaopeng Lucia Sui, Li Xu, Xiaoning Qian and Tie Liu

“Convex Clustering with Metric Learning”

In Pattern Recognition 81, 2018, pp. 575–584

DOI: 10.1016/j.patcog.2018.04.019

[76] Defeng Sun, Kim-Chuan Toh and Yancheng Yuan

“Convex Clustering: Model, Theoretical Guarantee and Efficient Algorithm”

In ArXiv Pre-Print 1810.02677, 2018

URL: https://arxiv.org/abs/1810.02677

[77] Kean Ming Tan and Daniela Witten

“Statistical Properties of Convex Clustering”

In Electronic Journal of Statistics 9.2, 2015, pp. 2324–2347

DOI: 10.1214/15-EJS1074

[78] Robert Tibshirani

“Regression Shrinkage and Selection via the Lasso”

In Journal of the Royal Statistical Society, Series B (Methodological) 58.1, 1996, pp. 267–288

DOI: 10.1111/j.2517-6161.1996.tb02080.x

[79] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu and Keith Knight

“Sparsity and Smoothness via the Fused Lasso”

In Journal of the Royal Statistical Society, Series B: Statistical Methodology 67.1, 2005, pp. 91–108

DOI: 10.1111/j.1467-9868.2005.00490.x

[80] Ryan J. Tibshirani and Jonathan Taylor

“The Solution Path of the Generalized Lasso”

In Annals of Statistics 39.3, 2011, pp. 1335–1371

DOI: 10.1214/11-AOS878

[81] Paul Tseng

“Applications of a Splitting Algorithm to Decomposition in Convex Programming and Variational Inequalities”

In SIAM Journal on Control and Optimization 29.1, 1991, pp. 119–138

DOI: 10.1137/0329006

[82] Binhuan Wang, Yilong Zhang, Will Wei Sun and Yixin Fang

“Sparse Convex Clustering”

In Journal of Computational and Graphical Statistics 27.2, 2018, pp. 393–403

DOI: 10.1080/10618600.2017.1377081

[83] Qi Wang, Pinghua Gong, Shiyu Chang, Thomas S. Huang and Jiayu Zhou

“Robust Convex Clustering Analysis”

In ICDM 2016: Proceedings of the 16th IEEE International Conference on Data Mining, 2016, pp. 1263–1268

DOI: 10.1109/ICDM.2016.0170

[84] Leland Wilkinson and Michael Friendly

“The History of the Cluster Heat Map”

In The American Statistician 63.2, 2009, pp. 179–184

DOI: 10.1198/tas.2009.0033

[85] Chong Wu, Sunghoon Kown, Xiaotong Shen and Wei Pan

“A New Algorithm and Theory for Penalized Regression-based Clustering”

In Journal of Machine Learning Research 17.188, 2016, pp. 1–25

URL: http://jmlr.org/papers/v17/15-553.html

[86] Wei Hong Yang and Deren Han

“Linear Convergence of the Alternating Direction Method of Multipliers for a Class of Convex Optimization Problems”

In SIAM Journal on Numerical Analysis 54.2, 2016, pp. 625–640

DOI: 10.1137/140974237

[87] Cun-Hui Zhang

“Nearly Unbiased Variable Selection under Minimax Concave Penalty”

In Annals of Statistics 38.2, 2010, pp. 894–942

DOI: 10.1214/09-AOS729

[88] Changbo Zhu, Huan Xu, Chenlei Leng and Shuicheng Yan

“Convex Optimization Procedure for Clustering: Theoretical Revisit”

In NIPS 2014: Advances in Neural Information Processing Systems 27

Montréal, Canada: Curran Associates, Inc., 2014, pp. 1619–1627

URL: https://papers.nips.cc/paper/5307-convex-optimization-procedure-for-clustering-theoretical-revisit

Supplementary Materials

Operator Splitting Methods for Convex Clustering

.1 ADMM for Convex Clustering

In this section, we derive and give the full form of the ADMM presented in Algorithm 1 for the convex clustering problem (1). We begin by noting that, in typical applications, most of the weights $w_{ij}$ are zero and hence do not enter into the optimization problem. We can omit the $\binom{n}{2}$ -term sum and instead write the convex clustering problem (1) as

[TABLE]

where $\mathcal{E}$ is the set of directed edges with non-zero weights $w$ connecting $i$ to $j$ .

In this form, it is clear that the convex clustering problem is amenable to operator splitting methods; in particular, [39] showed that the Alternating Direction Method of Multipliers (ADMM) [45, 44, 33] works particularly well for this problem. Algorithm A1 differs from the ADMM derived in [39] in two significant ways: firstly, we only consider edges with non-zero weights, thereby greatly reducing storage requirements of the algorithm; and secondly, we implement the algorithm in “matrix-form” rather than in a fully vectorized form. These differences make the resulting algorithm both easier to derive and to read, as well as more able to take advantage of highly optimized numerical linear algebra libraries.

We note that while we are solving a matrix-valued problem, it is not a semi-definite program, and the additional complexity typically associated with matrix-valued optimization does not apply here. Because we are optimizing over the space of all matrices of a certain size, the underlying problem is essentially Euclidean in geometry and standard (vector-valued) optimization techniques can be applied, replacing the (squared) Euclidean norm with the (squared) Frobenius norm and the standard Euclidean inner product with the Frobenius inner product as necessary.

The derivation of the ADMM for convex clustering (1) is relatively straight-forward. We begin by introducing an auxiliary variable $\bm{V}$ containing the pairwise differences between connected rows of $\bm{U}$ . The problem then becomes

[TABLE]

From here, we use the scaled form of the ADMM as given by a matrix version of Equations (3.5) to (3.7) of [33]:

[TABLE]

where the dual variable is denoted by $\bm{Z}$ . The analytical solution to the first subproblem is given by:

[TABLE]

This update is the most expensive step in the ADMM, though it can be significantly sped up by pre-calculating caching the Cholesky factorization of $\bm{I}+\rho\bm{D}^{T}\bm{D}$ and using it at each $\bm{U}$ -update:

[TABLE]

To solve the second subproblem, we note that it can be written as a proximal operator:

[TABLE]

We note that, due to the row-wise structure of $P$ , this proximal operator can be computed separately across the rows of its argument. In the cases $q=1$ or $q=2$ , the proximal operator reduces to element-wise ( $q=1$ ) or group ( $q=2$ ) soft-thresholding row $l$ at the level $w_{l}\lambda/\rho$ . If $q=\infty$ , Moreau’s decomposition [59] can be combined with the the efficient projection onto the $\ell_{1}$ ball developed by [42] to evaluate the prox in $\mathcal{O}(p\log p)$ steps. For other values of $q$ , an iterative algorithm must be used.

Several stopping criteria for the ADMM have been proposed in the literature. We have found a simple stopping rule based on the change in $\bm{U}$ being small sufficient in all cases. While some authors report speed-ups due to varying the ADMM relaxation parameter $\rho$ , we have found that fixing $\rho$ and re-using the Cholesky factor $\bm{L}$ to be more efficient. Combining these steps, we obtain Algorithm A1.

.2 Algorithmic Regularization for Convex Clustering

In this section, we give a the full version of the CARP algorithm presented in Algorithm 2. CARP can be obtained from the standard ADMM for convex clustering (Algorithm A1) by replacing the inner ADMM loop with a single iteration. This modification gives CARP (Algorithm A2). As with Algorithm A1, we prefer to use a matrix formulation, instead of a fully vectorized formulation, to simplify the implementation and to take advantage of high-performance numerical linear algebra libraries.

.3 AMA for Convex Clustering

In addition to the AMA, [39] also show that the convex clustering problem (1) can be efficiently solved using the Alternating Minimization Algorithm (AMA) of [81]. In our notation, the AMA becomes

[TABLE]

(Note that we use the unscaled updates for $\bm{V},\bm{Z}$ here as the AMA uses different values of the relaxation parameter in the $\bm{U}$ and $\bm{V}$ updates. In particular, this means that the dual variables $\bm{Z}$ from the ADMM are not the same as those from the AMA.) Simplifying these updates as before, the AMA becomes:

[TABLE]

[39] note that a clever application of Moreau’s decomposition [59] allows the $\bm{V}$ -updates to be elided and for the AMA to be simplified into a two-step scheme. The $\bm{V}^{(k)}$ iterates are key to dendrogram reconstruction, however, so such a simplification could not be used in an AMA-based CARP variant.

In our experiments, this elision is necessary for the AMA to outperform the ADMM and so, without it, the AMA does not appear to be a promising basis for an algorithmic regularization scheme. In general, the ADMM appears to converge more rapidly per iteration than the AMA, while the simplified AMA has much faster updates, allowing better overall computational performance in a standard optimization scheme. Since CARP performs only a single iteration per regularization level, however, the faster per iteration convergence of the ADMM is more important to us than the faster calculation of the AMA.

Finally, [39] also discuss the use of accelerated variants of the ADMM and AMA [46] to improve convergence. Because CARP uses only a single iteration for each regularization level, it is not amenable to acceleration.

Appendix A Proof of Theorem 1

In this section we prove Theorem 1 on the Hausdorff convergence of CARP to the convex clustering regularization path. We begin with 3 technical lemmas which may be of independent interest: Lemma 1 provides a convergence rate for the optimization step embedded within a CARP iteration; Lemma 2 establishes a form of Lipschitz continuity for convex clustering regularization paths; Lemma 3 provides a global bound for the approximation error induced by CARP at any iteration. In one step, our results are stated and proven for CARP with an $\ell_{2}$ -fusion penalty, but can be easily extended to other $\ell_{q}$ -fusion penalties.

Lemma 1 (Q-Linear Error Decrease).

At each iteration $k$ , the CARP approximation error decreases by a factor $c<1$ not depending on $t$ or $\epsilon$ . That is,

[TABLE]

for some $c$ strictly less than 1.

Proof.

By construction, each CARP step is a single iteration of the ADMM for the convex clustering problem (Algorithm A1) initialized at $(\bm{U}^{(k-1)},\bm{V}^{(k-1)},\bm{Z}^{(k-1)})$ . Hence it suffices to analyze the convergence of the ADMM for the convex clustering problem and to establish linear convergence.

The convex clustering problem (1) is strongly convex due to squared Frobenius norm term. Linear convergence of the standard ADMM for strongly convex problems was first shown by [57] and has since been refined by several other authors including [74], [61], [41], and [86].

In vectorized form, with $\bm{x}=\operatorname{vec}(\bm{X})$ , $\bm{u}=\operatorname{vec}(\bm{U})$ , and $\bm{v}=\operatorname{vec}(\bm{V})$ , the convex clustering problem (1) can be expressed as:

[TABLE]

where $\|\cdot\|_{\operatorname{vec}(\ell_{q})}$ is an appropriately vectorized version of the row-wise $\ell_{q}$ norm (a standard $\ell_{1}$ norm in the case $q=1$ and a mixed $\ell_{q}/\ell_{1}$ norm otherwise) and we have omitted the fusion weights for brevity.

In the notation of [51], the constraint matrix for the convex clustering problem is given by $\bm{E}=\begin{pmatrix}\bm{I}\otimes\bm{D}&-\bm{I}\end{pmatrix}$ , for appropriately sized identity matrices, which is clearly row-independent, yielding linear convergence of the primal and dual variables at a rate $c_{\lambda}<1$ which may depend on $\lambda$ . (We do not need to verify their additional technical assumptions as we are only using a two-block ADMM instead of the more general multi-block ADMM which is the focus of their paper.) Taking $c=\sup_{\lambda\leq\lambda_{\max}}c_{\lambda}$ , we observe that the CARP iterates are uniformly Q-linearly convergent at a rate $c$ . ∎

*Remark**.*

Recently, [41] gave a readable and precise analysis of the linear convergence of the ADMM, including estimates of the convergence rate $c$ . The specific proof technique they employ does not strictly apply to the convex clustering problem 1, however, as the $\bm{D}$ matrix is rank-deficient (excluding their Scenario 1) and the norm used for the fusion penalty is non-differentiable at the origin (excluding their Scenario 3). If an estimate of the convergence rate is required, the analysis of [41] can be applied to the convex clustering problem 1 by re-parameterizing the problem to address the rank-deficiency of $\bm{D}$ . In particular, if the redundant rows of $\bm{D}$ are combined (eliminating the nullspace of $\bm{D}$ ), the resulting matrix will be full-row rank, allowing Scenario 1 and Case 2 of [41] to be applied. This re-parameterization results in different split and dual variables ( $\bm{V}$ and $\bm{Z}$ , corresponding to the $\bm{D}$ matrix), however, so we do not pursue that approach here. The primal variable, $\bm{U}$ , remains unchanged under this re-parameterization.

Lemma 2 (Lipschitz Continuity of Solution Paths).

$(\hat{\bm{U}}_{\lambda},\hat{\bm{Z}}_{\lambda})$ * is $L$ -Lipschitz with respect to $\lambda$ . That is,*

[TABLE]

for some $L>0$ .

We note that this not the only form of Lipschitz continuity commonly considered for regularized estimation problems. In particular, Lipschitz continuity of the solution with respect to the data is a key element of various consistency results, while Lipschitz continuity of the objective function with respect to the parameters is a key assumption used to prove convergence of many optimization schemes.

Proof.

It suffices to prove Lipschitz continuity of $\hat{\bm{U}}_{\lambda}$ and $\hat{\bm{Z}}_{\lambda}$ separately and then take the sum of their Lipschitz moduli as the joint Lipschitz modulus.

We first show that $\hat{\bm{U}}_{\lambda}$ is Lipschitz. In vectorized form, the convex clustering problem is

[TABLE]

where $\bm{u}=\operatorname{vec}(\bm{U})$ , $\bm{x}=\operatorname{vec}(\bm{X})$ , $f_{q}$ is a convex function, and $\tilde{\bm{D}}=\bm{I}\otimes\bm{D}$ is a fixed matrix [77, cf.].

The KKT conditions give

[TABLE]

where $\partial f_{q}(\cdot)$ is the subdifferential of $f_{q}$ . Since $f_{q}$ is convex, it is differentiable almost everywhere [69, Theorem 25.5], so the following holds for almost all $\bm{u}_{\lambda}$ :

[TABLE]

Differentiating with respect to $\lambda$ , we obtain [70, c.f.]

[TABLE]

Note that $\bm{u}_{\lambda}$ depends on $\lambda$ so the chain rule must be used here. From here, we note

[TABLE]

For the convex clustering problem, we recall that $f_{q}(\cdot)$ is a norm and hence has bounded gradient; hence $f^{\prime}_{q}(\tilde{\bm{D}}\bm{u}_{\lambda})$ is bounded so the gradient of the regularization path is bounded and exists almost everywhere. This implies that the regularization path is piecewise Lipschitz. Since the solution path is constant for $\lambda\geq\lambda_{\max}$ and is continuous [39, Proposition 2.1], the solution path is globally Lipschitz with a Lipschitz modulus equal to the maximum of the piecewise Lipschitz moduli.

A similar argument shows Lipschitz continuity of $\hat{\bm{Z}}_{\lambda}$ or one can use the relationships between $\hat{\bm{U}}_{\lambda}$ and $\hat{\bm{Z}}_{\lambda}$ discussed in Section 2.1 of [77].

∎

Lemma 3 (Global Error Bound).

The following error bound holds for all $k$ :

[TABLE]

Proof.

Throughout, we let

[TABLE]

Our proof proceeds by induction on $k$ . First note that, at initialization:

[TABLE]

by Lemma 2.

Next, at $k=1$ , we note that

[TABLE]

by Lemma 1. We now the triangle inequality to split the right hand side:

[TABLE]

From above, we have $\text{RHS-1}\leq L\epsilon$ . Using Lemma 2, RHS-2 can be bounded by

[TABLE]

Putting these together, we get

[TABLE]

Repeating this argument for $k=2$ , we see

[TABLE]

We use this as a base case for our inductive proof and prove the general case:

[TABLE]

Expanding the definitions of $\bm{\Upsilon}^{(k)},\hat{\bm{\Upsilon}}_{\lambda}$ , we get the desired result.

∎

With these results, we are now ready to prove Theorem 1:

See 1

Proof of Theorem 1.

It suffices to show that $\{\bm{\Upsilon}^{(k)}\},\{\hat{\bm{\Upsilon}}_{\lambda}\}$ converge in the Hausdorff metric to show that the primal and dual paths converge separately. We break our proof into three steps:

i.

$\sup_{\lambda}\inf_{k}\left\|\bm{\Upsilon}^{(k)}-\hat{\bm{\Upsilon}}_{\lambda}\right\|\to 0$ ; 2. ii.

$\epsilon t^{k^{*}}$ remains bounded as $t,\epsilon$ decrease and $k^{*}$ increases, where $k^{*}$ is the iteration at which CARP terminates; and 3. iii.

$\sup_{k}\inf_{\lambda}\left\|\bm{\Upsilon}^{(k)}-\hat{\bm{\Upsilon}}_{\lambda}\right\|\to 0$ .

Together, these give the desired result.

Step i. We first show that

[TABLE]

tends to zero. We begin by fixing temporarily $\lambda$ and bounding

[TABLE]

The infimum over all $k$ is less than the distance at any particular $k$ , so it suffices to choose a value of $k$ which gives convergence to 0. Let $\tilde{k}$ be the value of $k$ which gives the closest value of $\gamma^{(k)}$ to $\lambda$ along the CARP path; and let $\tilde{\lambda}=\gamma^{(\tilde{k})}=\epsilon t^{\tilde{k}}$ . That is,

[TABLE]

Then

[TABLE]

Using Lemma 2, we can bound RHS-2 as

[TABLE]

Using Lemma 3, we can bound RHS-1 as

[TABLE]

where $C=\sum_{i=1}^{\infty}\left(\frac{c}{1+t}\right)^{i}$ is large but finite. Since $c<1$ and $\tilde{\lambda}=\epsilon t^{\tilde{k}}\leq\lambda_{\max}$ , we can replace the $k$ -dependent quantities to get

[TABLE]

Putting these together, we have

[TABLE]

Since the right-hand side doesnt’ depend on $\lambda$ , we have

[TABLE]

As $(t,\epsilon)\to(1,0)$ , we have that the right-hand side converges to zero and hence

[TABLE]

as desired.

Step ii. Before showing the other half of Hausdorff convergence, we pause to prove an intermediate result: As $(t,\epsilon)\to(1,0)$ , $\epsilon t^{k^{*}}$ remains bounded, where $k^{*}=k^{*}(t,\epsilon)$ is the iteration at which CARP halts. For this step, we specialize to the $\ell_{2}$ -case for concreteness, though our results are easily generalized.

CARP terminates when $\|\bm{V}^{(k+1)}\|_{\infty,q}=\max_{i,j}\|\bm{U}^{(k+1)}_{i\cdot}-\bm{U}^{(k+1)}_{j\cdot}\|_{q}=0$ ; that is, CARP terminates when all of the pairwise differences have gone to zero and the data has been fused into a single cluster.

Note that the update

[TABLE]

will set $\bm{V}^{(k+1)}_{i}$ to zero when

[TABLE]

Letting $(j,k)$ be the endpoints of edge $i$ , we find

[TABLE]

where $\overline{\bm{x}}$ is the column-wise mean of $\bm{X}$ .

Our strategy will be to show that this quantity is less that $w_{i}\gamma^{(k)}$ for some $k>k^{*}$ small enough that $\epsilon t^{k}$ remains bounded and hence $\epsilon t^{k^{*}}$ remains bounded. Let

[TABLE]

be the first value of $k$ such that $\gamma^{(k)}>\lambda_{\max}$ , i.e., the value of $\lambda$ such that all of the pairwise differences have gone to zero and the data has been fused into a single cluster in the regularization path ( $\tilde{k}$ is to the regularization path as $k^{*}$ is to the CARP path).

Using the bound from Equation (A1), we have that

[TABLE]

so

[TABLE]

Bounding $\|\bm{Z}^{(k)}_{i}\|_{2}$ is more subtle, but a rough bound can be obtained again using Equation (A1) to obtain:

[TABLE]

so

[TABLE]

Putting these together, we obtain

[TABLE]

To stop, we require that

[TABLE]

which occurs when

[TABLE]

Taking the max over all $(l,m)$ -pairs we find

[TABLE]

Hence it suffices to note

[TABLE]

which clearly remains bounded as $t,\epsilon\to(1,0)$ .

Step iii. With this result in hand, the proof is similar to the first half. Again, we can invoke Lemma 3 to find that

[TABLE]

With the result from above, $\epsilon t^{k}$ remains bounded above by some $B<\infty$ , so

[TABLE]

As $(t,\epsilon)\to(1,0)$ , the right hand-side goes to zero so

[TABLE]

Combining this with step i, we have

[TABLE]

as desired. ∎

Appendix B CBASS: Algorithmic Regularization Paths for Convex Bi-Clustering

Having explored the computational, theoretical, and practical advantages of CARP, we now turn to the closely related problem of bi-clustering. Bi-clustering refers to the simultaneous clustering of rows and columns. Building on the convex clustering formulation (1), [37] propose the following convex formulation of bi-clustering:

[TABLE]

The second penalty term induces row fusions, similarly to how the first term induces column fusions. The resulting $\hat{\bm{U}}_{\lambda}$ has a ‘checkerboard’ pattern where groups of rows and columns are clustered together. Note that for bi-clustering the centroids are scalars instead of vectors as in the clustering case.

Despite their relatively similar appearances, the convex bi-clustering problem (A2) is significantly more complicated than the convex clustering problem (1) and cannot be directly solved directly using an operator splitting method. [37] propose the use of the Dykstra-Like Proximal Algorithm (DLPA) of [31] to solve the convex bi-clustering problem (A2) and refer to the resulting algorithm as COBRA (Convex Bi-ClusteRing Algorithm). COBRA works by alternating solving row- and column-wise convex clustering problems until convergence. As with convex clustering, calculating the bi-clustering solution path with sufficient accuracy to accurately reconstruct both row and column dendrograms poses significant computational burden, which is exacerbated by COBRA’s requirement to evaluate several convex clustering subproblems for each value of $\lambda$ . While CARP could be used to solve each subproblem quickly, we would still have to run CARP many times, incurring a non-trivial total cost.

Instead, we apply the technique of algorithmic regularization to COBRA directly: we take only a single DLPA step and, within that step, we take only a single ADMM step for each of the row- and column-subproblems. We refer to the resulting algorithm as CBASS–Convex Bi-clustering via Algorithmic Regularization with Small Steps. Details of the CBASS algorithm are given in Algorithm A6 below. Our clustRviz software implements CBASS with and without a back-tracking step to ensure exact recovery or both the row- and column-dendrograms.

B.1 Algorithms for Convex Bi-Clustering

The DLPA can be used to solve problems of the form

[TABLE]

where $f$ and $g$ are proximable but $f+g$ is not using the following iterative algorithm (see also Algorithm 10.18 in [40]):

To apply Algorithm A3 to convex bi-clustering (A2), we note that the problem can be rewritten as:

[TABLE]

and apply the DLPA with $f(\bm{U})=P_{\text{row}}(\bm{U};\bm{w}_{\text{row}},q)$ and $g(\bm{U})=P_{\text{col}}(\bm{U};\bm{w}_{\text{col}},q)$ . We note that $\operatorname{\textsf{prox}}_{f}$ is a standard convex clustering problem and can be evaluated using the ADMM or AMA approaches described above. To evaluate $\operatorname{\textsf{prox}}_{g}$ , we note that $\|(\bm{U}\bm{D}_{\text{col}})_{l\cdot}\|_{q}=\|(\bm{D}_{\text{col}}^{T}\bm{U}^{T})_{\cdot l}\|_{q}$ so we simply need to perform standard convex clustering on transposed data. The DLPA then becomes:

Expanding the Convex-Clustering steps with the ADMM from Algorithm A1 yields Algorithm A5. To obtain CBASS from Algorithm A5, we replace the inner row- and column-subproblem loops with a single iteration of the convex clustering ADMM. Additionally, we do not reset the auxiliary $\bm{U},\bm{P},\bm{Q}$ variables, instead carrying forward their values from each CBASS iteration to the next. These two modifications yield CBASS (Algorithm A6).

B.2 Visualizations for Convex Bi-Clustering

While it is possible to construct row- and column-wise CBASS analogues of the CARP dendrogram and path plots discussed above, the primary visualization associated with bi-clustering is the cluster heatmap, which combines a heatmap visualization of the raw data with independent row- and column-dendrograms [84]. We modify the standard cluster heatmap by creating dendrograms using the fusions identified by CBASS. As [37] argue, the joint estimation of dendrograms provided by convex bi-clustering often produces better results than independent dendrogram construction.

We applied CBASS to the Presidents data and show the resulting cluster heatmap in Figure A1. A close examination reveals several interesting patterns. This data clearly exhibits a bi-clustered structure, with certain words being strongly associated with certain groups of presidents. Examining the two clear bi-clusters on the left, we see that words such as “billion,” “soviet,” and “technology” are frequently used by modern presidents and rarely used by pre-modern presidents. Conversely, we see that words which may be considered somewhat antiquated, such as “vessel” or “shall,” are associated with pre-modern presidents. For data with less clear structure, the interpretability of the cluster heatmap can sometimes be increased by plotting the smoothed estimates $\bm{U}^{(k)}$ rather than the raw data.

In simulation studies, CBASS appears to converge to the exact regularization path as $t\to 1$ . While this is consistent with both our theory and observations for CARP, we leave the theoretical analysis of CBASS to future work. As far as we know, a rate of convergence has not been established for the DLPA in the optimization literature, without which the techniques used to prove Theorem 1 cannot be applied to CBASS.

Appendix C Additional Comparisons

Figure A2 compares the accuracy of CARP, CBASS, hierarchical clustering, and $K$ -means clustering on the TCGA and Authors data sets discussed in Section 3. While certain forms of hierarchical clustering perform well on this data, CARP achieves superior performance without requiring the user to select a distance or linkage.

Figure A3 compares the performance of CARP, hierarchical clustering with Euclidean distance and Ward’s, complete, and single linkage, and $K$ -means on data simulated from a Gaussian mixture model. The cluster centroids were equally spaced on a 2-dimensional subspace and $n=54$ observations were generated from a Gaussian distribution with unit variance centered at the cluster centroid. Each of the clustering methods exhibit similar behaviors, with improved performance as the inter-cluster distance increases and decreased performance with higher ambient dimensionality or more clusters. Because these data were generated from isotropic Gaussians, all methods except single linkage hierarchical clustering perform well.

Figure A4 compares the performance of the same methods on non-convex clusters. In particular, we consider a version of the “half-moons” example proposed by [49]. (See also Figure A5.) The data were generated on a two-dimensional subspace with $n=50$ observations from each cluster and Gaussian noise orthogonal to the signal subspace were added. Not surprisingly, the performance of all methods degrades as the degree of noise and the ambient dimensionality are increased. Despite this, we see that CARP and single-linkage hierarchical clustering clearly outperform other methods, with CARP being more robust to the presence of noise.

Comparing these two simulations, we see that only convex clustering (CARP) is able to consistently perform well on both the convex and non-convex simulated data without requiring the user to select a distance metric or linkage. This is in large part due to the sparse weighting scheme used in the clustRviz package, which is able to flexibly and robustly adapt to the observed data distribution. Our findings should be contrasted with those of [77] who focus only on the case of uniform weights and show that, without informative weights, convex clustering performs similarly to single linkage convex clustering.

Appendix D Back-Tracking, Post-Processing, and Dendrogram Construction

The CARP-VIZ variant of our CARP algorithm implements a back-tracking scheme in order to improve dendrogram recovery. Because a relatively large value of $\lambda$ is typically required for any fusions to occur in convex clustering (1), CARP-VIZ begin with a large step-size (by default, $t=1.1$ ) and performs standard CARP iterations until the first fusion is identified (i.e., a row of $\bm{V}^{(k)}$ is set to zero). After the first fusion is identified, CARP-VIZ switches to a smaller step-size (by default, $t=1.01$ ) for the remainder of the algorithm. At each iteration, CARP-VIZ counts the number of fusions that occur. If more than one fusion occurs, instead of proceeding, CARP-VIZ attempts to determine which fusion occurred first. It does so using a back-tracking scheme, similar to those used in optimization methods. CARP-VIZ discards the iteration with multiple fusions, halves the step-size, and performs another iteration. If this half-step iteration has only one fusion, CARP-VIZ accepts it and continues as before. Otherwise, CARP-VIZ again halves the step-size and repeats this process until the correct order of fusions is identified (or a limit on the number of back-tracking steps is hit). Once the first fusion is identified, CARP-VIZ resets $t$ and continues. CBASS-VIZ uses essentially the same scheme, though it checks for both row and column fusions. We have found that, because it only uses a small step-size at “interesting” parts of the solution space, this back-tracking scheme typically produces more accurate dendrogram recovery at less expense than running standard CARP with a very small step-size.

Once CARP or CARP-VIZ terminate, clustRviz performs an additional post-processing step to isolate individual fusions. clustRviz reviews the fusions at each iteration and, if an iteration has multiple fusions, linearly interpolates between $\bm{U}^{(k)}$ and $\bm{U}^{(k+1)}$ to determine the approximate regularization level at which each fusion occurred. The interpolated iterate is only approximate, but is necessary for dendrogram construction. We note that no interpolation is typically needed for CARP-VIZ results, due to the back-tracking step used to isolate individual fusions, but, by default, clustRviz post-processes both CARP and CARP-VIZ output. The same post-processing scheme is applied separately to the row and column fusions from CBASS.

Once post-processing is performed, a dendrogram is constructed from the interpolated iterates. The dendrogram construction proceeds in the opposite order as hierarchical clustering: we begin with the fully fused data and decrease $\gamma^{(k)}$ , noting the order in which centroids were fused. (We use the reverse ordering so that, in the rare case where the path contains fissions, the final fusion is reflected in the resulting dendrogram.) The dendrogram height associated with each fusion is the $\gamma^{(k)}$ at which that fusion is first observed. Finally, we check whether fusions are more uniformly distributed on the $\gamma^{(k)}$ scale or the $\log(\gamma^{(k)})$ and adjust the dendrogram height accordingly to provide less cluttered visualizations.

Since the weight selection, post-processing, and dendrogram reconstruction steps could potentially be applied to any convex clustering algorithm, they are omitted from all timing results shown in this paper.

Appendix E Additional Related Work

Following its original introduction by [65] and popularization by [49] and [56], convex clustering has been the subject of much methodological and theoretical research. In this section, we review some of this related work which, while not directly relevant to the computational or visualization strategies we propose, may be of interest to readers interested in convex clustering.

The convex clustering problem can be generalized as

[TABLE]

where $p(\cdot)$ is any sparsity-inducing function. The choice of an $\ell_{q}$ -norm ( $p(\cdot)=\|\cdot\|_{q}$ ) gives standard convex clustering as considered in this paper. [62], [58], [85], and [72] have all considered the use of non-convex choices of $p(\cdot)$ , typically using the popular SCAD or MCP penalty functions to reduce bias and improve estimation performance [43, 87]. We do not consider non-convex $p(\cdot)$ in this paper, though the computational techniques and visualizations we propose could be adapted to non-convex penalties in a relatively straightforward manner.

Restricting our attention to standard convex clustering ( $p(\cdot)=\|\cdot\|_{q}$ ), several useful methodological extensions have been proposed in the literature. For example, [83] augment the convex clustering problem (1) with an additional sparse component to add robustness to outliers, similar to the robust PCA formulation of [35], while [82] propose a variant which incorporates feature selection into the clustering objective using an $\ell_{1}$ penalty [78]. As discussed in Section B, [37] extend convex clustering to the bi-clustering setting, where rows and columns are simultaneously clustered. Building on this work, [38] extend bi-clustering to general co-clustering of $k$ -order tensors, where they note several surprising theoretical advantages. The recent paper by [64] extends convex clustering to histogram-valued data by replacing the Euclidean distance with an appropriate metric on the space of histograms.

The squared Frobenius loss function of the convex clustering problem may be interpreted as an isotropic Gaussian likelihood, suggesting another avenue for generalization. [75] replace the Frobenius loss with a squared Mahalanobis distance to improve performance on non-spherical clusters. If the metric (inverse covariance matrix) is known, simple variants on the techniques used in this paper may be used; if the metric must be estimated from the data, the resulting problem is bi-convex and an alternating minimization scheme must be used, only guaranteeing convergence to a stationary point.

The use of a convex formulation allows the sophisticated tools of modern high-dimensional statistics to be brought to bear [34, 47]. In addition to the work of [77] proving a form prediction consistency and of [66] proving asymptotic dendrogram recovery, [88] give sufficient conditions for exact cluster recovery in the two-cluster case. The results of [88] were later extended by [63] and by [76] to the more general multi-cluster case.

In addition to the general purpose operator-splitting algorithms proposed by [39], specialized algorithms have been proposed for convex clustering in the “large $n$ ” (many observations) setting. [63] propose a stochastic incremental algorithm based on the framework of [32], while [76] propose a semi-smooth Newton algorithm based on the framework of [55]. [36] propose a proximal distance-based algorithm [54] and provide a GPU-based implementation. Recently, [48] proposed a generalized dual gradient ascent algorithm with linear convergence, though their approach only works for the $q=1$ case; their approach is likely amenable to algorithmic regularization schemes similar to those we have propose for the ADMM.

The special case of convex clustering in $\mathbb{R}$ has been studied under various names, including total variation denoising [71], the edge lasso [73] and the graph-fused lasso [50], or as a special case of the generalized lasso [80]. When the underlying graph is a chain graph, convex clustering simplifies to the well-studied fused lasso problem [79, 68, 53].

Convex clustering has not yet seen significant adoption outside of the statistics and machine learning communities, though [36] discuss applications to human genomics. [60] propose an alternative weighting scheme based on genetic distances which they use to perform genomic region segmentation.

Bibliography88

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato and Jonathan Eckstein “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers” In Foundations and Trends ® in Machine Learning 3.1 , 2011, pp. 1–122 DOI: 10.1561/2200000016 · doi ↗
2[2] Eric C. Chi, Genevera I. Allen and Richard G. Baraniuk “Convex Biclustering” In Biometrics 73.1 Wiley Online Library, 2017, pp. 10–19 DOI: 10.1111/biom.12540 · doi ↗
3[3] Eric C. Chi and Kenneth Lange “Splitting Methods for Convex Clustering” In Journal of Computational and Graphical Statistics 24.4 Taylor & Francis, 2015, pp. 994–1013 DOI: 10.1080/10618600.2014.948181 · doi ↗
4[4] Eric C. Chi and Stefan Steinerberger “Recovering Trees with Convex Clustering” In Ar Xiv Pre-Print 1806.11096 , 2018 URL: https://arxiv.org/abs/1806.11096
5[5] Michael Chichignoud, Johannes Lederer and Martin J. Wainwright “A Practical Scheme and Fast Algorithm to Tune the Lasso With Optimality Guarantees” In Journal of Machine Learning Research 17.231 , 2016, pp. 1–20 URL: http://jmlr.org/papers/v 17/15-605.html
6[6] Kenneth L. Clarkson “Coresets, Sparse Greedy Approximation, and the Frank-Wolfe Algorithm” In ACM Transactions on Algorithms (TALG) 6.4 , 2010, pp. 63:1–63:30 DOI: 10.1145/1824777.1824783 · doi ↗
7[7] L. Drumetz, G. Tochon, M.A. Veganzones, J. Chanussot and C. Jutten “Improved Local Spectral Unmixing of Hyperspectral Data using an Algorithmic Regularization Path for Collaborative Sparse Regression” In ICASSP 2017: Proceedings of the 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing New Orleans, Louisiana: IEEE, 2017, pp. 6190–6194 DOI: 10.1109/ICASSP.2017.7953346 · doi ↗
8[8] Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani “Least Angle Regression” In Annals of Statistics 32.2 , 2004, pp. 407–451 DOI: 10.1214/009053604000000067 · doi ↗

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Dynamic Visualization and Fast Computation for Convex Clustering via Algorithmic Regularization

Abstract

1 Convex Clustering and Dendrograms

1.1 Convex Clustering

1.2 Constructing Dendrograms from Convex Clustering Paths

2 CARP: Convex Clustering via Algorithmic Regularization Paths

Theorem 1**.**

2.1 Related Work

3 Numerical and Timing Comparisons

4 Visualization of CARP Results

5 Discussion

Acknowledgements

Supplementary Materials

References

.1 ADMM for Convex Clustering

.2 Algorithmic Regularization for Convex Clustering

.3 AMA for Convex Clustering

Appendix A Proof of Theorem 1

Lemma 1** (Q-Linear Error Decrease).**

Proof.

Remark*.*

Lemma 2** (Lipschitz Continuity of Solution Paths).**

Proof.

Lemma 3** (Global Error Bound).**

Proof.

Proof of Theorem 1.

Appendix B CBASS: Algorithmic Regularization Paths for Convex Bi-Clustering

B.1 Algorithms for Convex Bi-Clustering

B.2 Visualizations for Convex Bi-Clustering

Appendix C Additional Comparisons

Appendix D Back-Tracking, Post-Processing, and Dendrogram Construction

Appendix E Additional Related Work

Theorem 1.

Lemma 1 (Q-Linear Error Decrease).

*Remark**.*

Lemma 2 (Lipschitz Continuity of Solution Paths).

Lemma 3 (Global Error Bound).