TL;DR
This paper introduces a volume sampling strategy for coordinate descent, selecting variable subsets based on determinants, which accelerates convergence for convex optimization problems.
Contribution
It proposes a novel volume sampling coordinate selection method and establishes convergence rates, demonstrating potential acceleration over traditional methods.
Findings
Convergence rates are established for convex and strongly convex problems.
Increasing subset size can significantly accelerate the coordinate descent method.
Numerical experiments confirm theoretical acceleration benefits.
Abstract
We analyze the coordinate descent method with a new coordinate selection strategy, called volume sampling. This strategy prescribes selecting subsets of variables of certain size proportionally to the determinants of principal submatrices of the matrix, that bounds the curvature of the objective function. In the particular case, when the size of the subsets equals one, volume sampling coincides with the well-known strategy of sampling coordinates proportionally to their Lipschitz constants. For the coordinate descent with volume sampling, we establish the convergence rates both for convex and strongly convex problems. Our theoretical results show that, by increasing the size of the subsets, it is possible to accelerate the method up to the factor which depends on the spectral gap between the corresponding largest eigenvalues of the curvature matrix. Several numerical experiments confirm…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\newsiamremark
remarkRemark \newsiamremarkhypothesisHypothesis
\newsiamthmclaimClaim \newsiamthmexampleExample \headersA Randomized Coordinate Descent Method with Volume SamplingA. Rodomanov and D. Kropotov
A Randomized Coordinate Descent Method with Volume Sampling††thanks: Submitted to the editors DATE.
\fundingThis research is in part based on the work supported by Samsung Research, Samsung Electronics. Results in Sections 3.2, 3.3 have been obtained by Dmitry Kropotov and supported by the Russian Science Foundation grant no.∼19-71-30020.
Anton Rodomanov Samsung-HSE Laboratory, National Research University Higher School of Economics, Moscow, Russia (, ). [email protected]
Dmitry Kropotov22footnotemark: 2 Lomonosov Moscow State University, Moscow, Russia.
Abstract
We analyze the coordinate descent method with a new coordinate selection strategy, called volume sampling. This strategy prescribes selecting subsets of variables of certain size proportionally to the determinants of principal submatrices of the matrix, that bounds the curvature of the objective function. In the particular case, when the size of the subsets equals one, volume sampling coincides with the well-known strategy of sampling coordinates proportionally to their Lipschitz constants. For the coordinate descent with volume sampling, we establish the convergence rates both for convex and strongly convex problems. Our theoretical results show that, by increasing the size of the subsets, it is possible to accelerate the method up to the factor which depends on the spectral gap between the corresponding largest eigenvalues of the curvature matrix. Several numerical experiments confirm our theoretical conclusions.
keywords:
Convex optimization, Unconstrained minimization, Coordinate descent methods, Randomized algorithms, Volume sampling, Convergence rate
{AMS}
90C25, 90C06, 68Q25
1 Introduction
Coordinate descent methods are minimization algorithms that are very popular for solving large-scale optimization problems. The main idea of these algorithms is to successively reduce the value of the objective function along certain subsets of coordinates that are selected at each iteration according to some rule. Coordinate descent has been successfully applied to a number of applications in various areas such as machine learning, compressed sensing, network problems etc.
One of the main parameters of a typical coordinate descent method is the number of coordinates , which are updated at every iteration. When choosing this parameter, one usually faces the following trade-off. On the one hand, the convergence rate of the method becomes faster with the increase of , but, on the other hand, each iteration becomes more expensive. Therefore, to obtain a speed-up of the method in terms of the total running time, it is necessary to ensure that the increase in the cost of each iteration is low, compared to the increase in the convergence rate. One possible way to achieve this is parallelization. This idea has been extensively discussed in the context of parallel coordinate descent [1, 2, 3, 4, 5, 6], where (under certain separability assumptions) the authors show that the convergence rate improves linearly in , and thus it is possible to achieve a linear speed-up by using independent processors instead of one. Note, however, that in the usual serial regime (without parallelization) the aforementioned results do not guarantee any decrease in the total running time, since each iteration becomes at least times more expensive. Clearly, to be able to ensure a speed-up in this regime, one needs some non-linear, in terms of , convergence rate estimates.
In this paper, we analyze the coordinate descent method with a new coordinate selection strategy, called volume sampling. This strategy prescribes selecting -element subsets of variables proportionally to the volumes (or determinants) of principal submatrices of the matrix , that bounds the curvature of the objective function (see Section 2). For the coordinate descent with volume sampling, we establish the worst-case iteration complexity bounds, that have a non-linear dependency on . In particular, we show that the increase in from to leads to the improvement of the convergence rate of the method up to the factor of
[TABLE]
where are the eigenvalues of . Note that can be arbitrarily big, depending on the spectral gap between and .
In addition to this, we also propose a new efficient algorithm for 2-element volume sampling from a sparse matrix. The preprocessing complexity of this algorithm is of the order of the number of non-zero elements in the matrix, and its sampling complexity is only logarithmic in the dimension.
1.1 Related work
There is a vast literature on coordinate descent methods. Most research in this area is usually focused on the rules for selecting coordinates [7, 8, 9], or on obtaining accelerated [7, 10, 11], parallel [1, 2, 3, 4, 5, 6], proximal [3, 12] and primal-dual [13, 14, 15, 16] variants of the already known methods. For a general overview of the topic, see the recent paper [17] and references therein. Below we discuss just several works that are most closely related to ours.
One of the most influential papers on coordinate descent is [7] by Nesterov, where he proposed a coordinate gradient method (which we will refer to as RCD) with a special randomized rule for selecting coordinates. In RCD, each coordinate is sampled with probability proportional to the corresponding coordinate Lipschitz constant. Nesterov then derived the complexity bound for RCD and showed that it can be even better than that of the standard gradient method. The method, that we consider in this work, generalizes RCD in the sense that it coincides with it in the special case when the number of coordinates, selected at each iteration, equals one.
The most relevant work to ours is [16], where the authors propose three different randomized methods for smooth unconstrained minimization. Their Method 1 is exactly the same method, that we analyze in this paper, with the only difference that, instead of volume sampling, they consider an arbitrary sampling. As a result, the convergence rate of their method is expressed quite abstractly (in terms of the minimal eigenvalue of the expectation of a certain matrix), and, in particular, it is not clear how exactly it depends on the number of coordinates , used at each iteration. Although the authors provide a particular example of a matrix, for which their method should be very efficient, they do not establish any general results. In this regard, our work can be considered a further development of [16], where we establish more interpretable iteration complexity bounds specifically for volume sampling. In addition to that, in [16], the authors work only with the strongly convex problems, while. in this paper, we allow the problem to be non-strongly convex.
Another relevant work to this one is [18], where the authors propose a new randomized optimization method for minimizing quadratic functions, given eigenvectors corresponding to the smallest eigenvalues of the Hessian. Although the complexity estimates for this method look similar to ours, there are several key differences. First, the results in [18] show that the increase in the number of coordinates from to in their method leads to the acceleration rate that depends on the spectral gap between the st and nd smallest eigenvalues. The acceleration rate for coordinate descent with volume sampling, depends, on the contrary, on the spectral gap between the st and nd largest eigenvalues. Second, their method is not, strictly speaking, a coordinate descent algorithm since it uses eigenvectors as search directions instead of the coordinate directions. Finally, it is also less practical. For example, even in the simplest non-trivial regime , their method requires an eigenvector, corresponding to the smallest eigenvalue; the complexity of obtaining such a vector is in general . In contrast, the simplest non-trivial choice for coordinate descent with volume sampling is , which requires to perform at the beginning one preprocessing step of (or even , see Section 4.2), and then each iteration of the method takes linear time in the dimension.
Finally, we should mention that, although volume sampling has not been previously considered in the context of coordinate descent methods, it is not a novel concept, and has already been known in the literature for some time. To our knowledge, it was first proposed in [19] for the problem of matrix approximation. Later on the same authors developed several efficient exact and approximate methods for doing volume sampling [20] based on the standard linear algebra algorithms. Some other polynomial-time sampling methods and their connection to the theory of Markov chains were considered in [21]. Recently volume sampling has also been applied to the problem of linear regression [22, 23].
1.2 Contents
This paper is organized as follows. In Section 2, we describe the randomized coordinate descent method with volume sampling. In Section 3, we present the convergence analysis of this method. We start with an auxiliary sufficient decrease lemma (Section 3.1) and then use it to derive the convergence rates both for convex functions (Section 3.2) and strongly convex ones (Section 3.3). In Section 4, we discuss how to generate a random variable according to volume sampling. First, we discuss a simple general approach (Section 4.1) and, after this, develop a special algorithm for 2-element volume sampling which is suitable for sparse matrices (Section 4.2). In Section 5, we consider several examples of possible applications: quadratic functions (Section 5.1), separable problems (Section 5.2) and the smoothing technique (Section 5.3). Finally, in Section 6, we present the results of several numerical experiments.
1.3 Notation
By we denote the Euclidean space of all -dimensional real column vectors with the standard inner product and the standard Euclidean norm . Given an real symmetric positive semidefinite matrix , we also use the seminorm ; recall that becomes a norm iff is positive definite.
For , by we denote the collection of all -element subsets of . For each , by we denote the matrix obtained from the identity matrix by retaining only those columns whose indices are in ; if the dimension is not specified directly, then it can be determined from the context. For an matrix and a subset , by we denote the principal submatrix located at the intersection of the rows and columns with indices from (i.e. ); similarly, for a vector , by we denote the subvector of size obtained from by retaining only the elements with indices from (i.e. ).
Finally, for a square matrix , by we denote its adjugate matrix (the transpose of the cofactor matrix).
2 Randomized coordinate descent with volume sampling
Consider the unconstrained optimization problem
[TABLE]
where is a differentiable function. We assume that is 1-smooth with respect to the seminorm induced by some real symmetric positive semidefinite matrix :
[TABLE]
for all (see Section 5 for examples). When is twice continuously differentiable, one sufficient condition for this is that the Hessian of is uniformly upper bounded by .
Remark 2.1**.**
The standard smoothness assumption in the context of coordinate descent methods is slightly different. Typically, it is assumed that, for each , there exists (called coordinate Lipschitz constant) such that
[TABLE]
*for all and all . Clearly, if satisfies (2), it also satisfies (3) for . However, (2) and (3) are not completely equivalent. For example, if and the function is twice continuously differentiable, (3) requires only the diagonal of the Hessian to be uniformly bounded. Nevertheless, for many practical applications, condition (2) holds (see Section 5). *
Let us fix a point and a -element subset of coordinates , where . According to (2), we have
[TABLE]
for all . A natural idea to obtain an update rule of a coordinate descent algorithm is to minimize the right-hand side of (4) in . It is possible to do so when the matrix is non-degenerate, and this leads to the following update rule:
[TABLE]
Now it remains to specify the procedure for selecting the coordinates . In view of the above remark, the probability of choosing a degenerate submatrix should be zero. One sampling scheme that naturally possesses this property is given by the following
Definition 2.2** (Volume sampling).**
Let be an real symmetric positive semidefinite matrix, let , and let be a random variable taking values in . We say that is generated according to -element volume sampling with respect to , denoted by , if for all , we have
[TABLE]
Observe that for volume sampling corresponds to picking indices with probabilities proportional to the coordinate Lipschitz constants (diagonal elements of ). Thus, volume sampling in fact generalizes the well-known coordinate Lipschitz constant sampling. We discuss its implementation in Section 4.
Combining the update rule (5) with the volume sampling of coordinates, we obtain the Randomized Coordinate Descent Method with Volume Sampling (RCDVS), see Algorithm 1. Note that for RCDVS coincides with the well-known RCD method from [7].
3 Convergence analysis
We now turn to analyzing the convergence rate of the RCDVS method. To keep the presentation concise, we only study the convergence rates of expectations, although it is not difficult to establish their high probability counterparts using standard techniques.
3.1 Sufficient decrease lemma
We start with the following simple result which directly follows from smoothness and does not yet take into account the particular strategy for sampling coordinates:
Lemma 3.1** (General sufficient decrease lemma).**
Let be a 1-smooth function with respect to the seminorm , where is an real symmetric positive semidefinite matrix. Let be deterministic, let , let be a random variable taking values in such that is non-degenerate almost surely, and let . Then almost surely, and
[TABLE]
In its current form, Lemma 3.1 is not very useful since it involves some general expectation which is not clear how to work with. Our task now is to estimate this expectation in a convenient yet non-trivial way for the particular case .
Assume that all submatrices of are non-degenerate, i.e. the -element volume sampling has full support (the other case will be considered later). Using Cramer’s rule , we can write
[TABLE]
Thus, to estimate the expectation, we need to estimate the following two sums:
The sum of principal minors . 2. 2.
The sum .
The first sum is rather well-known and a closed form expression for it can be found in many standard textbooks on linear algebra (see e.g. Chapter 7 [24]). To present the formula, let us introduce for each , the real elementary symmetric polynomial of degree , defined by
[TABLE]
i.e. the sum of all -ary products of , and put for convenience. The well-known result is
Lemma 3.2** (Sum of principal minors).**
Let be an real symmetric matrix with eigenvalues , where , and let . Then
[TABLE]
Now we turn to the second sum. To the best of our knowledge, this sum has not been previously considered in the literature. Nevertheless, it turns out that it can also be conveniently expressed in terms of the elementary symmetric polynomials of eigenvalues:
Lemma 3.3**.**
Let be an real symmetric matrix with eigenvalues , where , let be its spectral decomposition for some orthogonal matrix , and let . Then
[TABLE]
*where for each denotes the vector without the -th element. *
Let us accept this lemma for now and defer its proof to a separate Section 3.4.
Using Lemma 3.2 together with Lemma 3.3, we can rewrite (7) as follows:
[TABLE]
Thus, we have managed to express the expectation solely in terms of the eigenvectors and eigenvalues of . However, our new expression for the expectation is still difficult to work with because each elementary symmetric polynomial is in fact a very complex sum. Fortunately, recall that we do not need the expectation itself but only a suitable lower bound for it. To obtain such a bound, it is convenient to introduce
Definition 3.4** (-coordinate approximation).**
Let an real symmetric positive semidefinite matrix with eigenvalues , let , and let be a spectral decomposition of for some orthogonal matrix . The -coordinate approximation of , denoted by , is the real positive semidefinite matrix
[TABLE]
Observe that is non-degenerate since otherwise which, in view of positive semidefiniteness, contradicts the assumption that . Also note that does not depend on the particular orthogonal matrix in the spectral decomposition of . Indeed, the first term in the definition of can be written as , where is the function . It is well-known that such matrices do not depend on the choice of the diagonalizing matrix (see e.g. [24, Section 7.3]).
Using the -coordinate approximation, we can now lower bound (10) as follows:
Lemma 3.5**.**
Let an real symmetric positive semidefinite matrix with eigenvalues , let , and let be a spectral decomposition of for some orthogonal matrix . Then
[TABLE]
Proof 3.6**.**
Since the eigenvalues are non-negative, we have
[TABLE]
By the symmetry of elementary symmetric polynomials, this can strengthened to
[TABLE]
which in turn can be further generalized to
[TABLE]
*for all . The claim follows. *
To summarize, we have obtained that in the case when volume sampling has full support, we can replace (6) with
[TABLE]
It remains to show that exactly the same result holds even when some principal submatrices of are possibly degenerate. In this case, instead of (7) we should write more carefully that
[TABLE]
Unfortunately, we cannot use Lemma 3.3 anymore because now (9) overestimates (and not underestimates) the numerator due to the fact that the adjugate to a symmetric positive semidefinite matrix is also symmetric positive semidefinite. However, recall that we are not interested in the numerator itself, but only in how it acts on the gradient . For each , define the linear subspace
[TABLE]
where is the image space of . Observe that if and only if all principal submatrices of are non-degenerate. Our interest in the subspace lies in the following observation:
Lemma 3.7**.**
*Let be a 1-smooth function with respect to the seminorm , where is an real symmetric positive semidefinite matrix. If is bounded from below, then for each and each , we have . *
Proof 3.8**.**
*Let be such that (if there is no such , the claim is vacuously true). Then the kernel is non-trivial and hence there exists a non-zero . From smoothness of with respect to , it follows that for all . Hence, , otherwise is unbounded from below. *
According to Lemma 3.7 and the above remarks, we are interested only in the action of on the subspace . But one can easily see that on this subspace it acts exactly as the already studied matrix , and so the case of degenerate submatrices reduces to that of non-degenerate ones.
Thus, regardless of whether there are degenerate principal submatrices or not, we have proved
Lemma 3.9** (Sufficient decrease lemma for volume sampling).**
Let be a function which is bounded from below and 1-smooth with respect to the seminorm , where is an real symmetric positive semidefinite matrix. Let , let for some , and let . Then almost surely and
[TABLE]
To finish this section, let us establish the following relations between -coordinate approximations that will be central in the forthcoming convergence analysis.
Lemma 3.10**.**
Let be an real symmetric positive semidefinite matrix with eigenvalues , and let . Then
[TABLE]
*where is defined by (1). *
Proof 3.11**.**
Denote for . According to (11), we have , where
[TABLE]
Similarly, , where
[TABLE]
Hence, , where
[TABLE]
To finish the proof, it now remains to show that are bounded from below by 1 and bounded from above by .
*For , we have . Since , it follows from this representation that (recall that ). Similarly, for (including ), we have , hence . Finally, (including ). Thus, . *
3.2 Convex functions
Now we are ready to establish several results on the convergence rate of the RCDVS method. We start with the class of smooth convex functions.
Theorem 3.12** (Convergence rate for convex functions).**
Let be a convex function which is 1-smooth with respect to the seminorm , where is an real symmetric positive semidefinite matrix. Let be a deterministic point in and assume that the sublevel set is bounded. Let , , and let be the random points in generated by . Then
[TABLE]
*for all , where is the radius of the sublevel set measured in the norm . *
Proof 3.13**.**
*See Section A. *
According to Theorem 3.12, for achieving accuracy in terms of the expected value of the objective, one needs the following number of iterations:
[TABLE]
In particular, for , we have , where is the radius of the sublevel set measured in the standard Euclidean norm; this recovers the already known result for the RCD method [7].
Let us fix and compare the efficiency estimates of RCDVS with coordinates with that of coordinates. We obtain that
[TABLE]
Thus, we need to compare the quantities and . By Lemma 3.10, we have
[TABLE]
for all . Hence, by first minimizing in , and then maximizing in , we obtain
[TABLE]
This means that the method with a bigger number of coordinates is always not worse than the corresponding method with a smaller number of coordinates, but it can also be faster up to the ratio (1).
3.3 Strongly convex functions
Now let us consider the strongly convex case. For measuring the parameter of strong convexity, it is natural to use the norm . Recall that a differentiable function is called strongly convex with respect to the norm if there exists such that
[TABLE]
for all . The largest possible value of , satisfying (14), is called the modulus of strong convexity. Observe that if is additionally 1-smooth with respect to (see (2)), then we must have . Thus, cannot be degenerate in this situation.
Theorem 3.14** (Convergence rate for strongly convex functions).**
Let be a function which is 1-smooth with respect to the norm , where is an real symmetric positive definite matrix, let , and let be strongly convex with respect to the norm with modulus . Let be a deterministic point in , let , and let be the random points in generated by . Then
[TABLE]
*for all . *
Proof 3.15**.**
*See Section B. *
Since , this means that, given any , for achieving accuracy in terms of the expected value of the objective, one needs the following number of iterations:
[TABLE]
For , we have , where is the strong convexity parameter of in the standard Euclidean norm. This recovers the convergence rate of the RCD method for strongly convex functions from [7].
Similarly to the discussion in Section 3.2, let us compare the efficiency estimates for different values of . Fix . Then, the acceleration rate equals
[TABLE]
Let us compare the constants and . By Lemma 3.10, for all , we have
[TABLE]
Hence, by strong convexity of in the -norm, it follows that
[TABLE]
for all . This means that the modulus of strong convexity of with respect to is at least , i.e. . Similarly, combining the strong convexity of in the -norm with the second inequality in (16), we obtain
[TABLE]
for all . This means that . Hence, our reasoning shows that
[TABLE]
Thus, we obtain absolutely the same result as in the previous section: the efficiency of the method monotonically improves with and the acceleration factor can reach the ratio (1) (for example, one can verify that this is the case for a strictly convex quadratic function).
Remark 3.16**.**
*The results, presented in this section, can be straightforwardly extended from strongly convex functions to a more broader class of gradient dominated functions of degree 2, also known as the functions satisfying the Polyak–Łojasiewicz condition. For more information and different examples of such functions, see [25]. Note also that recently, in the context of coordinate descent methods, there has appeared an even more general condition, called Generalized Error Bound Property [4]. However, we do not know whether our results can be extended for this property. *
3.4 Proof of Lemma 3.3
In this section, we give the proof of Lemma 3.3 assuming that (otherwise the claim is trivial). We start with introducing a little new notation that will be used only inside this section. For a subset , by we denote the matrix obtained from the identity matrix by removing the columns with indices from (i.e. ). For an matrix and a subset , by we denote the submatrix obtained from by removing the rows and columns with indices from (i.e. ); similarly, for a vector , by we denote the subvector of size obtained from by removing the elements with indices from (i.e. ); for brevity, for each , we also use , and instead of more cumbersome , and respectively.
To prove Lemma 3.3, let us consider the matrix-valued polynomial
[TABLE]
and show that the left- and right-hand sides of (9) are, up to a constant multiplicative factor, different representations of the -th derivative of at zero.
We start with the easier right-hand side. Using the spectral decomposition and the definition of the adjugate matrix, for each we readily obtain the following spectral decomposition of :
[TABLE]
where for each is the polynomial
[TABLE]
Opening the parentheses and grouping the terms by the powers of , we see that
[TABLE]
for each , and hence
[TABLE]
Now we take another approach to calculating the derivative by directly differentiating the original expression (17). The key inductive step here is
Lemma 3.17** (Inductive step).**
Let be an () real symmetric matrix, let be the matrix-valued polynomial (17), and let . Then
[TABLE]
Suppose for the moment that Lemma 3.17 holds, and let be arbitrary. Differentiating both sides of (19) (each time applying Lemma 3.17 to the matrix ) and assuming that , we obtain
[TABLE]
Similarly, assuming that , we have
[TABLE]
and, more generally (by induction on ), that
[TABLE]
for all . In particular,
[TABLE]
Equating (18) and (20), we obtain the claim of Lemma 3.3.
All that remains is to prove Lemma 3.17.
Proof 3.18**.**
We begin with a couple of technical simplifications. First, it suffices to prove the claim only for ; the general case then follows by replacing with . Second, we can assume that and all its principal submatrices () are simultaneously non-degenerate; otherwise we can replace with for an arbitrary sufficiently small (such that satisfies the above requirement) and then pass to the limit as using the continuity of the both sides of (19) in .
Since is non-degenerate, there exists a sufficiently small neighborhood around zero such that is non-degenerate for all from this neighborhood and
[TABLE]
by Cramer’s rule. Differentiating and denoting , we obtain
[TABLE]
where are the columns of . Observe that the -th row and the -th column of the matrix for each consist entirely of zeros, so
[TABLE]
Thus, to obtain the claim, it suffices to demonstrate that
[TABLE]
for all . Replacing with if necessary (where is the permutation matrix obtained from the identity matrix by moving the -th column into the end), it is enough to consider only the case . Let
[TABLE]
where is the top-left principal submatrix of , is the right-most column of with the last element removed, and is the element in the lower right corner. Note that is non-degenerate as a principal submatrix of (by the technical assumption made at the very beginning). Using the formula for inverting a block matrix, we obtain that , and
[TABLE]
In particular, we see that
[TABLE]
Since by Cramer’s rule
[TABLE]
it remains to check whether
[TABLE]
*But this is exactly the formula for the determinant of a block matrix. *
4 Implementation of volume sampling
Let be an real symmetric positive semidefinite matrix, and let . In this section, we discuss how to generate a random variable according to volume sampling.
4.1 General algorithm
Recall that volume sampling is sampling with a finite number of outcomes. Thus, in principle, can be generated by any general method for generating random variables taking a finite number of values. Let us briefly review one such method which is based on the following result:
Proposition 4.1** (Generating a random variable taking a finite number of values).**
*Let be a finite set, and let be non-negative numbers such that . For each , let . Let be a random variable uniformly distributed on the interval , and let , where . Then is a well-defined random variable taking values in such that for all . *
Proof 4.2**.**
*For convenience, denote . Observe that . Thus, for some iff belongs to the interval . Since these intervals are disjoint and their union is , the variable is well-defined. Hence, is well-defined, and for each . *
Proposition 4.1 is in fact a two-stage algorithm for generating a random variable taking values in given the corresponding list of probabilities . At the first stage of this algorithm (called preprocessing), we compute the cumulative sums . This requires operations. At the second stage (called sampling), we first generate a random variable uniformly distributed on , then compute , and finally output . Since the cumulative sums are monotonically increasing, one can use binary search for efficiently finding . Thus, the complexity of sampling is just operations. Note that the preprocessing has to be done only once; after this, one can generate arbitrarily many independent samples using the sampling routine.
In the case of volume sampling, the above procedure looks as follows. At the preprocessing stage, we iterate over all possible -element subsets of , computing the corresponding principal minors of and corresponding cumulative sums. This requires operations in total assuming that the complexity of calculating a minor of size is . During sampling, we use binary search to find the number , and then return the set corresponding to . Each sampling thus requires operations.
Unfortunately, the above preprocessing time makes the general algorithm impractical for most values of . Nevertheless, it is still applicable for several very small values of . For example, when , the preprocessing time and memory complexities are both , while the sampling time and memory complexities are and respectively. Another interesting regime is . In this case, the preprocessing time and memory complexities are both , while the sampling time and memory complexities are the same as before. Note that in many applications the objective function has the form , where is a real matrix, and is a function that can be computed in time (see Section 5 for different examples). In these applications, the memory is comparable to the cost of storing , and the time is comparable to the cost of one computation of the objective and is often allowable (see also Section 4.2 for a possible treatment of sparsity).
While the general procedure described above is not polynomial, we note that there are more specialized methods for volume sampling that are polynomial (e.g. see [20] for an exact algorithm and several efficient approximate ones). However, they are designed for generating just one sample and therefore are not directly suited for using them inside optimization algorithms, where one needs a very fast sampling routine that can be called at each iteration. Perhaps, it is possible to modify these methods by properly splitting them into a polynomial preprocessing stage and an independent sampling stage which is much faster, but we have not investigated this direction.
4.2 Two-element volume sampling for sparse matrices
Now suppose that the matrix is sparse and our goal is to implement 2-element volume sampling. The general algorithm described above requires time and memory for preprocessing, which may be too expensive when is large. In this section, we present a special method that takes into account the sparsity of and whose preprocessing time and memory complexities are both , where is the number of non-zero elements of . When is dense, , but it can be much smaller than if is sufficiently sparse. Once the preprocessing is done, each sampling then has the time complexity and the memory complexity, which are exactly the same complexities as those of the general algorithm from the previous section (see Figure 1 for a comparison).
Assume that the matrix is given in the CSR (Compressed Sparse Row) format111Strictly speaking, the classical CSR format is different from the one we are describing here. In the classical CSR format, the index vectors are concatenated into one large vector, similarly the value vectors are concatenated into one large vector, and instead the numbers one stores their cumulative sums. Nevertheless, the format we are describing is algorithmically equivalent to the original CSR format and one can be transformed into the other in a straightforward manner without any time or memory overhead.: for each , we know the indices (possibly zero) of all non-zero elements in the -th row of which are located to the right of the diagonal (thus, ), as well as the corresponding values of these elements (thus, for all ). For notational convenience, for each we also define and
[TABLE]
where , are given, and .
Now the notation has been established and we are ready to present the algorithm. Similarly to the method from the previous section, it is a two-stage procedure that consists of an expensive preprocessing stage (Algorithm 2) and a cheap sampling stage (Algorithm 3) that can be executed as many times as one wishes once the preprocessing has terminated.
Clearly, the time and memory complexities of the preprocessing stage are both . The sampling stage consists of three successive binary searches over certain subsets of and thus has the time complexity and the memory complexity when properly implemented (the function should be computed on the fly inside each binary search).
We now prove the correctness of the algorithm.
Theorem 4.3**.**
*Let be an real symmetric positive semidefinite matrix with rank at least two. Let , and let . Then is a well-defined random variable distributed according to . *
Proof 4.4**.**
Observe that, for each and each , we have
[TABLE]
For each , denote
[TABLE]
Then, for each , it holds that
[TABLE]
From Proposition 4.1, it follows that is a well-defined random variable taking values in with probabilities
[TABLE]
Next observe that, for each and , we have
[TABLE]
for all . In particular, for all , from which we see that monotonically increases with (since does so and monotonically increases with ) until it reaches when . This shows that is well-defined. Similarly, we can write for all , from which it follows that monotonically increases with . Combining this with the definition of , which gives
[TABLE]
we conclude that is well-defined and in fact
[TABLE]
Applying Proposition 4.1 again, we obtain that, conditioned on , the random variable takes values in such that
[TABLE]
for all . Undoing the conditioning, we see that is a well-defined random variable taking values in with probabilities
[TABLE]
*and the claim follows. *
5 Examples of applications
Now we consider several examples of objective functions for which it is possible to apply the RCDVS method and discuss different implementation details.
5.1 Quadratic function
Our first example of an objective function is the convex quadratic , defined by
[TABLE]
where is a given real symmetric positive semidefinite matrix and . This function is 1-smooth with respect to the seminorm , so one can minimize it by RCDVS with
[TABLE]
For doing volume sampling, one can either use the general algorithm from Section 4.1 when the matrix is dense, or the special one from Section 4.2 when is sparse.
5.2 Separable problems
The second example gives rise to a whole family of objective functions that are admissible for the RCDVS method and can be obtained by composing some smooth separable function with a linear mapping:
[TABLE]
Here are given vectors and are univariate functions such that is -smooth for each , meaning that it is differentiable and satisfies
[TABLE]
for all . It is easy to see from the definitions that the resulting function turns out to be 1-smooth with respect to the seminorm , where
[TABLE]
Example 5.1** (Least squares).**
Let be the least squares function
[TABLE]
*where . In this case, is the quadratic function with for each . *
Example 5.2** (Logistic regression).**
Let be the logistic regression function
[TABLE]
*where . In this case, is the logistic function with for each . *
The matrix can be computed in operations. If is sufficiently small, this can be done rather efficiently, and then one can apply RCDVS for several small values of (e.g. , etc.).
If is large, one can still use RCDVS provided that the vectors are sparse. Indeed, observe that the number of non-zero elements in is bounded above by , where for each denotes the number of non-zero elements in . Furthermore, a sparse representation of (e.g. the commonly used sparse compressed row/column formats) can be obtained in
[TABLE]
operations (possibly with some logarithmic terms when a further sorting of indices is needed). After this, one can use the efficient algorithm for sparse two-element volume sampling from Section 4.2, whose preprocessing complexity is the same as (23). For example, if for all , we have , where is some sufficiently small integer, then both the computation of and the preprocessing procedure take
[TABLE]
operations, which is only linear in both and .
5.3 Smoothing technique
Another interesting and quite rich source of examples comes from the smoothing technique [26, 27], which we now briefly review. Let be a convex function. By the Fenchel–Moreau theorem, we can write
[TABLE]
for all , where is the Fenchel conjugate of with the effective domain (assume that is bounded). Let be a distance generating function on with respect to the standard Euclidean norm (i.e. a non-negative closed convex function with domain that is 1-strongly convex on with respect to ). Let , and let be the function
[TABLE]
It is known that satisfies for all , and moreover it is -smooth with respect to . Thus, can be seen as a smooth uniform approximation of , where the parameter controls both the accuracy of approximation and its level of smoothness.
Now let be an real matrix, let , and define by
[TABLE]
It is easy to see that for all . Furthermore, the function is -smooth with respect to the seminorm . Thus, the problem of minimizing can be replaced with the problem of minimizing its smooth approximation for some carefully chosen value of . This latter problem can be solved by RCDVS with
[TABLE]
This matrix has exactly the same structure as the one from (22).
Example 5.3**.**
The function
[TABLE]
*is obtained from using the negative entropy function with domain . *
Example 5.4**.**
The function
[TABLE]
where is the Huber function
[TABLE]
*is obtained from the -norm using the Euclidean distance generating function with domain . *
Example 5.5**.**
The function
[TABLE]
*is obtained from using the function with domain . *
5.4 Combinations of previous examples
Finally, one can take non-negative linear combinations of the already considered examples. Indeed, let be functions, where for each is 1-smooth with respect to an real symmetric positive semidefinite matrix , and let . Then the sum is 1-smooth with respect to the seminorm with .
Example 5.6**.**
Let be the -regularized logistic regression function
[TABLE]
where , , . In this case,
[TABLE]
6 Numerical experiments
In this section, we investigate the practical behavior of RCDVS and compare it with that of a couple of other already known methods.
The first one is the RCD method. Recall that it is in fact the same method as RCDVS with . In comparing RCDVS with RCD, we are interested in investigating how the actual acceleration ratio of RCDVS corresponds to our theoretical prediction (see (1))
[TABLE]
The difference between the actual acceleration ratio and the theoretical one is that the latter is the ratio of the theoretical upper bounds on the performance of the methods, while the former is the ratio of the real number of iterations performed by the methods on a particular problem instance.
The second one, denoted SDNA, is Method 1 from [16] which uses so-called -nice sampling. As was already discussed in Section 1.1, this method is exactly the same method as RCDVS with the only difference that it uses uniform sampling (without replacement) instead of volume sampling. In comparing RCDVS with SDNA, we are interested in seeing how important is the sampling strategy to the performance of the general coordinate descent scheme that we consider in this paper.
6.1 Quadratic function
For the first experiment, we have chosen the convex quadratic function from Section 5.1 and set . Our goal is to observe how the behavior of the methods changes when the spectral gap between the two largest eigenvalues of increases. For this, we construct the matrix as follows. First, we choose some and set . Then we successively perform 10 random Householder reflections on the rows and columns of , where each time the direction is sampled uniformly from the unit sphere in . Observe that, by construction, the eigenvalues of are exactly . Once is constructed, we set , where is generated from the uniform distribution on the hypercube and run each method from until the objective value becomes -close to the optimal one for . This procedure is repeated 10 times to take into account the randomness in the data.
The results of the experiment222All experimental results in this paper were obtained on a laptop with the Intel Core i7-8650U CPU (1.90GHz x 8) and 16 GB DDR4 RAM, no parallelism was used. The source code is available at https://github.com/arodomanov/rcdvs. for different values of and the eigenvalue ratio are shown in Figure 2. Each column in this table displays the median value of the corresponding statistic: “It” is the total number of iterations (in thousands) taken by the method until its termination; “T” is the corresponding total running time (in seconds); “Acc” is the actual acceleration ratio of the method over RCD in terms of the number of iterations (this ratio may be less than 1 when there is no acceleration); “%” expresses the “Acc” for RCDVS as a percent of the theoretical prediction (24).333To keep the table concise, we report only the first few significant digits for each statistic. For example, in the first row of the table, the value 2 in the column “It” for RCDVS may actually stand for 2.1, 2.53, or even 2.999. From this table, we can see that the number of iterations for RCD and SDNA grows significantly with the spectral gap between the two largest eigenvalues, while for RCDVS there is almost no growth at all. As a result, RCDVS dramatically outperforms the other two methods both in terms of iterations and total running time, especially for large values of . By inspecting the “Acc” column, we observe that the actual acceleration ratio of RCDVS with respect to RCD monotonically increases with the spectral gap, which is natural. What is more important, the “%” always takes values around 100, which means that our theoretical prediction (24) is quite accurate. SDNA, on the contrary, performs even worse than RCD in most cases.
6.2 Huber function
In the second experiment, we still use but now consider the Huber function from Example 5.4. In contrast to the previous one, this objective is non-strongly convex.
The design of the experiment is almost the same as before. To generate the matrix , we choose , set to be the diagonal matrix with elements , where , and then successively perform 10 random Householder reflections on the rows and columns of , where each time and are uniformly distributed on the unit spheres in and respectively. Such a construction ensures that the matrix (see Section 5.3) has eigenvalues (plus zeros when ). The vector , the starting point and the termination criterion for the methods are absolutely the same as in the previous experiment.
One additional remark should be made in the case when . Recall that in this situation is in fact degenerate and hence some of its principal submatrices may not be invertible. This does not cause any difficulties for RCD and RCDVS since the probability of choosing a degenerate submatrix in these methods is zero. However, this is not so for SDNA and thus, strictly speaking, SDNA is not well-defined in the case of a degenerate matrix . To fix this problem, we use the More–Penrose pseudoinverse instead of the usual inverse in this method. (We have also tried to simply skip the update when a degenerate submatrix has been chosen, but this strategy turned out to work somewhat worse.)
The results of the experiment are shown in Figure 3. In the same way as before, we see that RCDVS always significantly outperforms RCD and its actual acceleration ratio is quite close to the theoretical prediction. However, it is interesting that this time SDNA works quite well for problems with (almost comparably to RCDVS). In particular, its number of iterations is almost independent of the spectral gap. Nevertheless, for its behavior is the same as in the previous experiment.
Now let us consider the same problem but with much bigger dimensions. For this, we slightly change the way we construct . This time, we generate it as a sparse matrix using the procedure described above but with sparse directions and that are chosen as follows. First, we take some integer that controls the resulting sparsity level of . After this, we pick random uniformly distributed indices and fill the positions corresponding to these indices with a random vector uniformly distributed on the unit sphere in , the rest of the positions are set to zero. Of course, in order to work with a sparse problem, every method has to be properly modified. In particular, we should use the special algorithm from Section 4.2 for doing volume sampling in RCDVS.
The results for the bigger dimensions are shown in Figure 4. Somewhat surprisingly, the problems with are now even more difficult for SDNA than those with . Otherwise, the overall picture is the same as previously.
6.3 Logistic regression
Now we consider the -regularized logistic regression function from Example 5.6, which is very popular in the context of machine learning. The termination criterion for each method is the same as before with the difference that now the optimal objective value is unknown and we have to calculate it numerically in advance. Nevertheless, this auxiliary computation is needed only for our presentation and does not affect the actual performance of the methods in any way.
We set since this is a default value of the regularization parameter used in practice. However, instead of generating the data and artificially as we did before, now we take some real-world data from the LIBSVM website444https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html, which is summarized in Figure 5. Here is the number of observations and is the number of features. The next 4 columns display the four largest eigenvalues of the matrix (see Example 5.6), while the last 3 columns show the theoretical acceleration ratio (24) for the three corresponding values of . The main reason why we are presenting this table is to demonstrate that it is not uncommon for real data to have significant spectral gaps between the first largest eigenvalues (although these gaps are not as big as in our previous experiments with artificial data).
For the results of the experiment, see Figure 6, where, in contrast to the previous two experiments, we additionally consider several small values of for RCDVS and SDNA. We can see that, on the real data, the method SDNA looks much better than previously and in many cases it outperforms RCD. Nevertheless, RCDVS is still a winner, and its actual acceleration rate is usually even faster than predicted by theory.
7 Conclusion
We have analyzed the randomized coordinate descent method with volume sampling (RCDVS) for minimizing a function, that is smooth with respect to some positive semidefinite matrix . In its iterations, this method randomly selects -element subsets of coordinates with probabilities proportional to the determinants of principal submatrices of . We have shown, both theoretically and empirically, that the increase in from to improves the convergence rate up to the factor (1), which depends on the spectral gap between the st and nd eigenvalues of .
However, there are still many important directions for further research:
- •
Accelerated method. In addition to the basic randomized coordinate descent, there also exists the accelerated one [10, 11], where the coordinates are sampled with probabilities proportional to the square roots of the diagonal elements of . Is it possible to accelerate RCDVS in a similar manner, possibly using the square roots of the determinants as probabilities?
- •
Constrained and composite optimization. We have considered only the basic smooth unconstrained minimization. However, most optimization methods can often be extended to handle problems involving some simple constraints (e.g. box constraints or linear ones), or they can also be extended to working with composite functions (when the objective is the sum of a smooth function and a simple convex possibly non-smooth function), while still retaining the original convergence rate. Can we generalize RCDVS to these settings?555Currently, the main problem here is that, up to our knowledge, there are no corresponding results even for the case , that is for the RCD method.
- •
Special volume sampling algorithms or different kind of sampling. Although the results, that we have obtained, are true for any value of , from the practical point of view, currently there is only one choice that is suitable for large scale problems, namely (apart from previously known ). The problem is that currently there are no algorithms for volume sampling whose preprocessing/sampling complexity is appropriate for large scale applications (e.g. instead of ). However, this does not mean that it is not possible to devise such algorithms, especially when the matrix possesses special structure (e.g. sparse, banded, low-rank etc.). Another interesting question is whether volume sampling can be replaced with some other kind of sampling which is more practical but still gives similar results.
Acknowledgments
The authors are thankful to the anonymous reviewers for their attentive reading and valuable comments.
Appendix A Proof of Theorem 3.12
Proof A.1**.**
Note that is non-empty and compact by the Weierstrass theorem ( is continuous as a convex function with open domain, the sublevel set is bounded by the statement and is closed as the inverse image of a closed set under a continuous mapping). Hence, both and in the definition of are attained and, in particular, is finite.
Using Lemma 3.9, we obtain and
[TABLE]
for all .
Let . By the convexity of and Cauchy-Schwarz inequality, we have
[TABLE]
Hence, by the definition of , it follows that
[TABLE]
from which, by Jensen’s inequality, we conclude that
[TABLE]
Combining (25) and (26) and writing , we finally obtain
[TABLE]
for all . Now the claim follows by a standard argument. Indeed, we can assume without loss of generality that is strictly positive for each . Then, using (27) together with the monotonicity of , we obtain
[TABLE]
for all . By induction, it follows that
[TABLE]
*for all , where the last inequality is a consequence of (27) and the positivity of . *
Appendix B Proof of Theorem 3.14
Proof B.1**.**
Let . By the same argument as in the proof of Theorem 3.12,
[TABLE]
At the same time, by strong convexity of in the norm , we have666This is a standard inequality for strongly convex functions, and can be easily proved from the definition (14) by minimizing both sides in .
[TABLE]
Combining (28) and (29), we obtain that
[TABLE]
or, equivalently, after rearranging, that
[TABLE]
Thus, subtracting from both sides, we get
[TABLE]
*The claim now follows by induction. *
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] P. Richtárik and M. Takác. Parallel coordinate descent methods for big data optimization. Mathematical Programming , 156(1-2):433–484, 2016.
- 2[2] I. Necoara, Y. Nesterov, and F. Glineur. Random block coordinate descent methods for linearly constrained optimization over networks. Journal of Optimization Theory and Applications , 173(1):227–254, 2017.
- 3[3] O. Fercoq and P. Richtárik. Optimization in high dimensions via accelerated, parallel, and proximal coordinate descent. SIAM Review , 58(4):739–771, 2016.
- 4[4] I. Necoara and D. Clipici. Parallel random coordinate descent method for composite minimization: Convergence analysis and error bounds. SIAM Journal on Optimization , 26(1):197–226, 2016.
- 5[5] J. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l 1-regularized loss minimization. In Proceedings of the 28th International Conference on International Conference on Machine Learning , ICML’11, pages 321–328, USA, 2011. Omnipress.
- 6[6] Z. Peng, M. Yan, and W. Yin. Parallel and distributed sparse optimization. In 2013 Asilomar conference on signals, systems and computers , pages 659–646. IEEE, 2013.
- 7[7] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization , 22(2):341–362, 2012.
- 8[8] A. Beck and L. Tetruashviferli. On the convergence of block coordinate descent type methods. SIAM journal on Optimization , 23(4):2037–2060, 2013.
