A Randomized Coordinate Descent Method with Volume Sampling

Anton Rodomanov; Dmitry Kropotov

arXiv:1904.04587·math.OC·April 30, 2020·SIAM J. Optim.


TL;DR

This paper introduces a volume sampling strategy for coordinate descent, selecting variable subsets based on determinants, which accelerates convergence for convex optimization problems.

Contribution

It proposes a novel volume sampling coordinate selection method and establishes convergence rates, demonstrating potential acceleration over traditional methods.

Findings

1. Convergence rates are established for convex and strongly convex problems.

2. Increasing subset size can significantly accelerate the coordinate descent method.

3. Numerical experiments confirm theoretical acceleration benefits.


Equations

$$R_{\lambda}(\tau_{1},\tau_{2}):=\frac{\sum_{i=\tau_{1}}^{n}\lambda_{i}}{\sum_{i=\tau_{2}}^{n}\lambda_{i}},$$

$$\min_{x\in\mathbb{R}^{n}}f(x),$$

$$f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{1}{2}\|y-x\|_{B}^{2}$$

$$f(x+I_{S}h)\leq f(x)+\langle\nabla f(x)_{S},h\rangle+\frac{L_{S}}{2}\|h\|^{2}$$

$$f(x_{0}+I_{S_{0}}h)\leq f(x_{0})+\langle\nabla f(x_{0})_{S_{0}},h\rangle+\frac{1}{2}\|h\|_{B_{S_{0}\times S_{0}}}^{2}$$

$$x_{1}:=x_{0}-I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}\nabla f(x_{0})_{S_{0}}.$$

$$\mathbb{P}(S_{0}=S)=\frac{\operatorname*{Det}(B_{S\times S})}{\sum_{S'\in{[n]\choose\tau}}\operatorname*{Det}(B_{S'\times S'})}.$$

$$f(x_{0})-\mathbb{E}f(x_{1})\geq\frac{1}{2}\|\nabla f(x_{0})\|_{\mathbb{E}I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}I_{S_{0}}^{T}}^{2}.$$

$$\mathbb{E}I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}I_{S_{0}}^{T}=\frac{\sum_{S\in{[n]\choose\tau}}I_{S}\operatorname*{Adj}(B_{S\times S})I_{S}^{T}}{\sum_{S\in{[n]\choose\tau}}\operatorname*{Det}(B_{S\times S})}.$$

$$\sigma_{m}(x):=\sum_{1\leq i_{1}<\dots<i_{m}\leq n}x_{i_{1}}\dots x_{i_{m}},$$

$$\sum_{S\in{[n]\choose\tau}}\operatorname*{Det}(B_{S\times S})=\sigma_{\tau}(\lambda).$$

$$\sum_{S\in{[n]\choose\tau}}I_{S}\operatorname*{Adj}(B_{S\times S})I_{S}^{T}=Q\operatorname*{Diag}(\sigma_{\tau-1}(\lambda_{-1}),\dots,\sigma_{\tau-1}(\lambda_{-n}))Q^{T},$$

$$\mathbb{E}I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}I_{S_{0}}^{T}=\frac{Q\operatorname*{Diag}(\sigma_{\tau-1}(\lambda_{-1}),\dots,\sigma_{\tau-1}(\lambda_{-n}))Q^{T}}{\sigma_{\tau}(\lambda)}.$$

$$B_{\tau}:=Q\operatorname*{Diag}(\lambda_{1},\dots,\lambda_{\tau},\lambda_{\tau},\dots,\lambda_{\tau})Q^{T}+\sum_{i=\tau+1}^{n}\lambda_{i}I.$$

$$\frac{Q\operatorname*{Diag}(\sigma_{\tau-1}(\lambda_{-1}),\dots,\sigma_{\tau-1}(\lambda_{-n}))Q^{T}}{\sigma_{\tau}(\lambda)}\succeq(B_{\tau})^{-1}.$$

$$\sigma_{\tau}(\lambda)=\sum_{i_{1}=1}^{n-\tau+1}\lambda_{i_{1}}\sum_{i_{1}+1\leq i_{2}<\dots<i_{\tau}\leq n}\lambda_{i_{2}}\dots\lambda_{i_{\tau}}\leq\sigma_{\tau-1}(\lambda_{-1})\sum_{i=1}^{n-\tau+1}\lambda_{i}.$$

$$\sigma_{\tau}(\lambda)\leq\sigma_{\tau-1}(\lambda_{-1})\Bigl(\lambda_{1}+\sum_{j=\tau+1}^{n}\lambda_{j}\Bigr),$$

$$\sigma_{\tau}(\lambda)\leq\sigma_{\tau-1}(\lambda_{-i})\Bigl(\lambda_{\min\{i,\tau\}}+\sum_{j=\tau+1}^{n}\lambda_{j}\Bigr)$$

$$f(x_{0})-\mathbb{E}f(x_{1})\geq\frac{1}{2}\|\nabla f(x_{0})\|_{(B_{\tau})^{-1}}^{2}.$$

$$\mathbb{E}I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}I_{S_{0}}^{T}=\frac{\sum_{S\in{[n]\choose\tau}:\operatorname*{Det}(B_{S\times S})\neq 0}I_{S}\operatorname*{Adj}(B_{S\times S})I_{S}^{T}}{\sum_{S\in{[n]\choose\tau}}\operatorname*{Det}(B_{S\times S})}.$$

$$U_{\tau}(B):=\Bigl\{u\in\mathbb{R}^{n}:u_{S}\in\operatorname*{Im}(B_{S\times S})\text{ for all }S\in{[n]\choose\tau}\text{ with }\operatorname*{Det}(B_{S\times S})=0\Bigr\},$$

$$B_{\tau_{2}}\preceq B_{\tau_{1}}\preceq R_{\lambda}(\tau_{1},\tau_{2})B_{\tau_{2}},$$

$$s_{i}^{(1)}:=\begin{cases}\lambda_{i}+\Sigma_{\tau_{1}+1},&\text{if }1\leq i\leq\tau_{1},\\ \Sigma_{\tau_{1}},&\text{if }\tau_{1}+1\leq i\leq n.\end{cases}$$

$$s_{i}^{(2)}:=\begin{cases}\lambda_{i}+\Sigma_{\tau_{2}+1},&\text{if }1\leq i\leq\tau_{2},\\ \Sigma_{\tau_{2}},&\text{if }\tau_{2}+1\leq i\leq n.\end{cases}$$

$$s_{i}:=\begin{cases}\frac{\lambda_{i}+\Sigma_{\tau_{1}+1}}{\lambda_{i}+\Sigma_{\tau_{2}+1}},&\text{if }1\leq i\leq\tau_{1},\\ \frac{\Sigma_{\tau_{1}}}{\lambda_{i}+\Sigma_{\tau_{2}+1}},&\text{if }\tau_{1}+1\leq i\leq\tau_{2},\\ \frac{\Sigma_{\tau_{1}}}{\Sigma_{\tau_{2}}},&\text{if }\tau_{2}+1\leq i\leq n.\end{cases}$$

$$\mathbb{E}f(x_{k})-\min f\leq\frac{2D_{\tau}^{2}}{k+1},$$

$$K_{\tau}:=\frac{2D_{\tau}^{2}}{\epsilon}.$$

$$\frac{K_{\tau_{1}}}{K_{\tau_{2}}}=\frac{D_{\tau_{1}}^{2}}{D_{\tau_{2}}^{2}}.$$

$$\|x-x^{*}\|_{B_{\tau_{2}}}^{2}\leq\|x-x^{*}\|_{B_{\tau_{1}}}^{2}\leq R_{\lambda}(\tau_{1},\tau_{2})\|x-x^{*}\|_{B_{\tau_{2}}}^{2}$$


Code & Models

Repositories

arodomanov/rcdvs (Official)


Full text


Funding: This research is based in part on work supported by Samsung Research, Samsung Electronics. The results in Sections 3.2 and 3.3 were obtained by Dmitry Kropotov and supported by the Russian Science Foundation, grant no. 19-71-30020.

Anton Rodomanov: Samsung-HSE Laboratory, National Research University Higher School of Economics, Moscow, Russia.

Dmitry Kropotov: Lomonosov Moscow State University, Moscow, Russia.

Abstract

We analyze the coordinate descent method with a new coordinate selection strategy, called volume sampling. This strategy prescribes selecting subsets of variables of a certain size with probabilities proportional to the determinants of principal submatrices of the matrix that bounds the curvature of the objective function. In the particular case when the size of the subsets equals one, volume sampling coincides with the well-known strategy of sampling coordinates proportionally to their Lipschitz constants. For the coordinate descent with volume sampling, we establish the convergence rates both for convex and strongly convex problems. Our theoretical results show that, by increasing the size of the subsets, it is possible to accelerate the method by up to a factor that depends on the spectral gap between the corresponding largest eigenvalues of the curvature matrix. Several numerical experiments confirm our theoretical conclusions.

Keywords: Convex optimization, Unconstrained minimization, Coordinate descent methods, Randomized algorithms, Volume sampling, Convergence rate

AMS subject classifications: 90C25, 90C06, 68Q25

1 Introduction

Coordinate descent methods are minimization algorithms that are very popular for solving large-scale optimization problems. The main idea of these algorithms is to successively reduce the value of the objective function along certain subsets of coordinates that are selected at each iteration according to some rule. Coordinate descent has been successfully applied to a number of applications in various areas such as machine learning, compressed sensing, network problems etc.

One of the main parameters of a typical coordinate descent method is the number $\tau$ of coordinates that are updated at every iteration. When choosing this parameter, one usually faces the following trade-off. On the one hand, the convergence rate of the method improves as $\tau$ increases; on the other hand, each iteration becomes more expensive. Therefore, to obtain a speed-up of the method in terms of the total running time, it is necessary to ensure that the increase in the cost of each iteration is small compared to the improvement in the convergence rate. One possible way to achieve this is parallelization. This idea has been extensively discussed in the context of parallel coordinate descent [1, 2, 3, 4, 5, 6], where (under certain separability assumptions) the authors show that the convergence rate improves linearly in $\tau$, and thus it is possible to achieve a linear speed-up by using $\tau$ independent processors instead of one. Note, however, that in the usual serial regime (without parallelization) the aforementioned results do not guarantee any decrease in the total running time, since each iteration becomes at least $\tau$ times more expensive. Clearly, to be able to ensure a speed-up in this regime, one needs convergence rate estimates that are non-linear in $\tau$.

In this paper, we analyze the coordinate descent method with a new coordinate selection strategy, called volume sampling. This strategy prescribes selecting $\tau$-element subsets of variables with probabilities proportional to the volumes (or determinants) of principal submatrices of the matrix $B$ that bounds the curvature of the objective function (see Section 2). For the coordinate descent with volume sampling, we establish worst-case iteration complexity bounds that have a non-linear dependency on $\tau$. In particular, we show that increasing $\tau$ from $\tau_{1}$ to $\tau_{2}$ improves the convergence rate of the method by up to the factor of

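$$R_{\lambda}(\tau_{1},\tau_{2}):=\frac{\sum_{i=\tau_{1}}^{n}\lambda_{i}}{\sum_{i=\tau_{2}}^{n}\lambda_{i}},\tag{1}$$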

where $\lambda_{1}\geq\dots\geq\lambda_{n}$ are the eigenvalues of $B$. Note that $R_{\lambda}(\tau_{1},\tau_{2})$ can be arbitrarily large, depending on the spectral gap between $\lambda_{\tau_{1}}$ and $\lambda_{\tau_{2}}$.
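To make the ratio $R_{\lambda}(\tau_{1},\tau_{2})$ concrete, here is a minimal Python sketch that evaluates it for a given spectrum. The function name `spectral_ratio` and the example spectrum are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def spectral_ratio(eigs, tau1, tau2):
    """R_lambda(tau1, tau2): ratio of eigenvalue tail sums (tau1, tau2 are 1-based)."""
    lam = np.sort(np.asarray(eigs, dtype=float))[::-1]  # lambda_1 >= ... >= lambda_n
    return lam[tau1 - 1:].sum() / lam[tau2 - 1:].sum()

# Hypothetical spectrum: one dominant eigenvalue followed by a flat tail.
eigs = [1000.0, 1.0, 1.0, 1.0, 1.0]
ratio = spectral_ratio(eigs, 1, 2)  # (1000 + 4) / 4
print(ratio)  # 251.0
```

In this toy spectrum the large gap between $\lambda_{1}$ and $\lambda_{2}$ is what makes the ratio large; a flat spectrum would give a ratio close to one.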

In addition to this, we also propose a new efficient algorithm for 2-element volume sampling from a sparse matrix. The preprocessing complexity of this algorithm is of the order of the number of non-zero elements in the matrix, and its sampling complexity is only logarithmic in the dimension.

1.1 Related work

There is a vast literature on coordinate descent methods. Most research in this area is usually focused on the rules for selecting coordinates [7, 8, 9], or on obtaining accelerated [7, 10, 11], parallel [1, 2, 3, 4, 5, 6], proximal [3, 12] and primal-dual [13, 14, 15, 16] variants of the already known methods. For a general overview of the topic, see the recent paper [17] and references therein. Below we discuss just several works that are most closely related to ours.

One of the most influential papers on coordinate descent is [7] by Nesterov, who proposed a coordinate gradient method (which we will refer to as RCD) with a special randomized rule for selecting coordinates. In RCD, each coordinate is sampled with probability proportional to the corresponding coordinate Lipschitz constant. Nesterov then derived the complexity bound for RCD and showed that it can be even better than that of the standard gradient method. The method that we consider in this work generalizes RCD in the sense that it coincides with RCD in the special case when the number of coordinates selected at each iteration equals one.

The work most relevant to ours is [16], where the authors propose three different randomized methods for smooth unconstrained minimization. Their Method 1 is exactly the method that we analyze in this paper, with the only difference that, instead of volume sampling, they consider an arbitrary sampling. As a result, the convergence rate of their method is expressed quite abstractly (in terms of the minimal eigenvalue of the expectation of a certain matrix), and, in particular, it is not clear how exactly it depends on the number of coordinates $\tau$ used at each iteration. Although the authors provide a particular example of a $3\times 3$ matrix for which their method should be very efficient, they do not establish any general results. In this regard, our work can be considered a further development of [16], where we establish more interpretable iteration complexity bounds specifically for volume sampling. In addition, the authors of [16] work only with strongly convex problems, while in this paper we allow the problem to be non-strongly convex.

Another relevant work is [18], where the authors propose a new randomized optimization method for minimizing quadratic functions, given $\tau$ eigenvectors corresponding to the $\tau$ smallest eigenvalues of the Hessian. Although the complexity estimates for this method look similar to ours, there are several key differences. First, the results in [18] show that increasing the number of coordinates from $\tau_{1}$ to $\tau_{2}$ in their method leads to an acceleration rate that depends on the spectral gap between the $\tau_{1}$-th and $\tau_{2}$-th smallest eigenvalues. The acceleration rate for coordinate descent with volume sampling depends, on the contrary, on the spectral gap between the $\tau_{1}$-th and $\tau_{2}$-th largest eigenvalues. Second, their method is not, strictly speaking, a coordinate descent algorithm since it uses eigenvectors as search directions instead of the coordinate directions. Finally, it is also less practical. For example, even in the simplest non-trivial regime $\tau=1$, their method requires an eigenvector corresponding to the smallest eigenvalue; the complexity of obtaining such a vector is in general $O(n^{3})$. In contrast, the simplest non-trivial choice for coordinate descent with volume sampling is $\tau=2$, which requires a single preprocessing step of cost $O(n^{2})$ (or even $O(n+\operatorname{nnz}(B))$, see Section 4.2) at the beginning, after which each iteration of the method takes time linear in the dimension.

Finally, we should mention that, although volume sampling has not been previously considered in the context of coordinate descent methods, it is not a novel concept, and has already been known in the literature for some time. To our knowledge, it was first proposed in [19] for the problem of matrix approximation. Later on the same authors developed several efficient exact and approximate methods for doing volume sampling [20] based on the standard linear algebra algorithms. Some other polynomial-time sampling methods and their connection to the theory of Markov chains were considered in [21]. Recently volume sampling has also been applied to the problem of linear regression [22, 23].

1.2 Contents

This paper is organized as follows. In Section 2, we describe the randomized coordinate descent method with volume sampling. In Section 3, we present the convergence analysis of this method. We start with an auxiliary sufficient decrease lemma (Section 3.1) and then use it to derive the convergence rates both for convex functions (Section 3.2) and strongly convex ones (Section 3.3). In Section 4, we discuss how to generate a random variable according to volume sampling. First, we discuss a simple general approach (Section 4.1) and, after this, develop a special algorithm for 2-element volume sampling which is suitable for sparse matrices (Section 4.2). In Section 5, we consider several examples of possible applications: quadratic functions (Section 5.1), separable problems (Section 5.2) and the smoothing technique (Section 5.3). Finally, in Section 6, we present the results of several numerical experiments.

1.3 Notation

By $\mathbb{R}^{n}$ we denote the Euclidean space of all $n$ -dimensional real column vectors with the standard inner product $\langle u,v\rangle:=\sum_{i=1}^{n}u_{i}v_{i}$ and the standard Euclidean norm $\|v\|:=\langle v,v\rangle^{\frac{1}{2}}$ . Given an $n\times n$ real symmetric positive semidefinite matrix $B$ , we also use the seminorm $\|v\|_{B}:=\langle Bv,v\rangle^{\frac{1}{2}}$ ; recall that $\|\cdot\|_{B}$ becomes a norm iff $B$ is positive definite.

For $1\leq\tau\leq n$ , by $[n]\choose\tau$ we denote the collection of all $\tau$ -element subsets of $[n]:=\{1,\dots,n\}$ . For each $S\in{[n]\choose\tau}$ , by $I_{S}$ we denote the $n\times\tau$ matrix obtained from the $n\times n$ identity matrix $I$ by retaining only those columns whose indices are in $S$ ; if the dimension $n$ is not specified directly, then it can be determined from the context. For an $n\times n$ matrix $B$ and a subset $S\in{[n]\choose\tau}$ , by $B_{S\times S}$ we denote the $\tau\times\tau$ principal submatrix located at the intersection of the rows and columns with indices from $S$ (i.e. $B_{S\times S}:=I_{S}^{T}BI_{S}$ ); similarly, for a vector $v\in\mathbb{R}^{n}$ , by $v_{S}$ we denote the subvector of size $\tau$ obtained from $v$ by retaining only the elements with indices from $S$ (i.e. $v_{S}:=I_{S}^{T}v$ ).

Finally, for a square matrix $A$ , by $\operatorname*{Adj}(A)$ we denote its adjugate matrix (the transpose of the cofactor matrix).
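The notation above maps directly onto a few lines of NumPy. The sketch below is illustrative only: the helper names `selector` and `adjugate` are assumptions, indices are 0-based (the text uses 1-based indices), and the cofactor-based adjugate is meant for small matrices.

```python
import numpy as np

def selector(n, S):
    """I_S: the columns of the n x n identity whose (0-based) indices are in S."""
    return np.eye(n)[:, sorted(S)]

def adjugate(A):
    """Adjugate (transpose of the cofactor matrix); suitable for small matrices only."""
    m = A.shape[0]
    if m == 1:
        return np.ones((1, 1))  # the adjugate of a 1 x 1 matrix is [1]
    C = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C.T

# Hypothetical 3 x 3 example with S = {0, 2}.
B = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
v = np.array([10.0, 20.0, 30.0])
I_S = selector(3, {0, 2})
B_SS = I_S.T @ B @ I_S   # principal submatrix B_{S x S}
v_S = I_S.T @ v          # subvector v_S
adj = adjugate(B_SS)     # satisfies Det(B_SS) * inv(B_SS) == adj when B_SS is invertible
```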

2 Randomized coordinate descent with volume sampling

Consider the unconstrained optimization problem

$$\min_{x\in\mathbb{R}^{n}}f(x),$$

where $f:\mathbb{R}^{n}\to\mathbb{R}$ is a differentiable function. We assume that $f$ is 1-smooth with respect to the seminorm $\|\cdot\|_{B}$ induced by some $n\times n$ real symmetric positive semidefinite matrix $B$ :

$$f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{1}{2}\|y-x\|_{B}^{2}\tag{2}$$

for all $x,y\in\mathbb{R}^{n}$ (see Section 5 for examples). When $f$ is twice continuously differentiable, one sufficient condition for this is that the Hessian of $f$ is uniformly upper bounded by $B$ .

Remark 2.1.

The standard smoothness assumption in the context of coordinate descent methods is slightly different. Typically, it is assumed that, for each $S\in{[n]\choose\tau}$ , there exists $L_{S}\geq 0$ (called coordinate Lipschitz constant) such that

$$f(x+I_{S}h)\leq f(x)+\langle\nabla f(x)_{S},h\rangle+\frac{L_{S}}{2}\|h\|^{2}\tag{3}$$

for all $x\in\mathbb{R}^{n}$ and all $h\in\mathbb{R}^{\tau}$. Clearly, if $f$ satisfies (2), it also satisfies (3) for $L_{S}:=\|B_{S\times S}\|$. However, (2) and (3) are not completely equivalent. For example, if $\tau=1$ and the function $f$ is twice continuously differentiable, (3) requires only the diagonal of the Hessian to be uniformly bounded. Nevertheless, for many practical applications, condition (2) holds (see Section 5).

Let us fix a point $x_{0}\in\mathbb{R}^{n}$ and a $\tau$ -element subset of coordinates $S_{0}\in{[n]\choose\tau}$ , where $1\leq\tau\leq\operatorname*{Rank}(B)$ . According to (2), we have

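$$f(x_{0}+I_{S_{0}}h)\leq f(x_{0})+\langle\nabla f(x_{0})_{S_{0}},h\rangle+\frac{1}{2}\|h\|_{B_{S_{0}\times S_{0}}}^{2}\tag{4}$$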

for all $h\in\mathbb{R}^{\tau}$ . A natural idea to obtain an update rule of a coordinate descent algorithm is to minimize the right-hand side of (4) in $h$ . It is possible to do so when the matrix $B_{S_{0}\times S_{0}}$ is non-degenerate, and this leads to the following update rule:

$$x_{1}:=x_{0}-I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}\nabla f(x_{0})_{S_{0}}.\tag{5}$$

Now it remains to specify the procedure for selecting the coordinates $S_{0}$ . In view of the above remark, the probability of choosing a degenerate submatrix $B_{S_{0}\times S_{0}}$ should be zero. One sampling scheme that naturally possesses this property is given by the following

Definition 2.2 (Volume sampling).

Let $B$ be an $n\times n$ real symmetric positive semidefinite matrix, let $1\leq\tau\leq\operatorname*{Rank}(B)$ , and let $S_{0}$ be a random variable taking values in $[n]\choose\tau$ . We say that $S_{0}$ is generated according to $\tau$ -element volume sampling with respect to $B$ , denoted by $S_{0}\sim\operatorname*{Vol}_{\tau}(B)$ , if for all $S\in{[n]\choose\tau}$ , we have

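$$\mathbb{P}(S_{0}=S)=\frac{\operatorname*{Det}(B_{S\times S})}{\sum_{S'\in{[n]\choose\tau}}\operatorname*{Det}(B_{S'\times S'})}.$$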

Observe that for $\tau=1$ volume sampling corresponds to picking indices with probabilities proportional to the coordinate Lipschitz constants $B_{ii}$ (diagonal elements of $B$ ). Thus, volume sampling in fact generalizes the well-known coordinate Lipschitz constant sampling. We discuss its implementation in Section 4.
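Definition 2.2 can be illustrated by computing the volume sampling distribution exactly through enumeration of all subsets. This is only a brute-force sketch for small $n$ (it is not the efficient procedure of Section 4); the function name `volume_sampling_probs` and the example matrix are assumptions.

```python
import itertools
import numpy as np

def volume_sampling_probs(B, tau):
    """Exact tau-element volume sampling distribution, by enumerating all subsets."""
    n = B.shape[0]
    subsets = list(itertools.combinations(range(n), tau))  # 0-based index tuples
    dets = np.array([np.linalg.det(B[np.ix_(S, S)]) for S in subsets])
    dets = np.clip(dets, 0.0, None)  # guard against tiny negative round-off
    return subsets, dets / dets.sum()

# Hypothetical positive definite matrix.
B = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])

subsets1, probs1 = volume_sampling_probs(B, 1)  # proportional to the diagonal of B
subsets2, probs2 = volume_sampling_probs(B, 2)  # proportional to 2 x 2 principal minors

# Drawing one subset S_0 ~ Vol_2(B).
rng = np.random.default_rng(0)
S0 = subsets2[rng.choice(len(subsets2), p=probs2)]
```

For this matrix the $2\times 2$ principal minors are 5, 8 and 11, so the pairs $\{1,2\}$, $\{1,3\}$, $\{2,3\}$ (in the 1-based notation of the text) are drawn with probabilities $5/24$, $8/24$ and $11/24$.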

Combining the update rule (5) with the volume sampling of coordinates, we obtain the Randomized Coordinate Descent Method with Volume Sampling (RCDVS), see Algorithm 1. Note that for $\tau=1$ RCDVS coincides with the well-known RCD method from [7].
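A minimal sketch of how update (5) and volume sampling might be combined into a loop is given below. It assumes that the method simply repeats update (5) with a fresh $S_{k}\sim\operatorname*{Vol}_{\tau}(B)$ at each of $K$ iterations. The function name `rcdvs`, the `grad` callback, and the brute-force sampler are illustrative choices and not the authors' implementation.

```python
import itertools
import numpy as np

def rcdvs(grad, B, tau, x0, K, rng):
    """Sketch: K steps of update (5) with S_k ~ Vol_tau(B), sampled by enumeration."""
    n = B.shape[0]
    subsets = list(itertools.combinations(range(n), tau))
    dets = np.clip([np.linalg.det(B[np.ix_(S, S)]) for S in subsets], 0.0, None)
    probs = dets / dets.sum()
    x = np.array(x0, dtype=float)
    for _ in range(K):
        S = list(subsets[rng.choice(len(subsets), p=probs)])
        g_S = grad(x)[S]                                 # gradient restricted to S
        x[S] -= np.linalg.solve(B[np.ix_(S, S)], g_S)    # x <- x - I_S B_SS^{-1} g_S
    return x

# Hypothetical test problem: f(x) = 0.5 <Ax, x> - <b, x>, which satisfies (2) with B = A.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, -2.0, 3.0])

def grad(x):
    return A @ x - b

x_K = rcdvs(grad, A, tau=2, x0=np.zeros(3), K=500, rng=np.random.default_rng(0))
x_star = np.linalg.solve(A, b)  # exact minimizer, for comparison
```

Enumerating all subsets costs time exponential in $\tau$, so this sketch is only meant for checking small examples.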

3 Convergence analysis

We now turn to analyzing the convergence rate of the RCDVS method. To keep the presentation concise, we only study the convergence rates of expectations, although it is not difficult to establish their high probability counterparts using standard techniques.

3.1 Sufficient decrease lemma

We start with the following simple result which directly follows from smoothness and does not yet take into account the particular strategy for sampling coordinates:

Lemma 3.1 (General sufficient decrease lemma).

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a 1-smooth function with respect to the seminorm $\|\cdot\|_{B}$ , where $B$ is an $n\times n$ real symmetric positive semidefinite matrix. Let $x_{0}\in\mathbb{R}^{n}$ be deterministic, let $1\leq\tau\leq\operatorname*{Rank}(B)$ , let $S_{0}$ be a random variable taking values in ${[n]\choose\tau}$ such that $B_{S_{0}\times S_{0}}$ is non-degenerate almost surely, and let $x_{1}:=x_{0}-I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}\nabla f(x_{0})_{S_{0}}$ . Then $f(x_{1})\leq f(x_{0})$ almost surely, and

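$$f(x_{0})-\mathbb{E}f(x_{1})\geq\frac{1}{2}\|\nabla f(x_{0})\|_{\mathbb{E}I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}I_{S_{0}}^{T}}^{2}.\tag{6}$$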

In its current form, Lemma 3.1 is not very useful since it involves a general expectation $\mathbb{E}I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}I_{S_{0}}^{T}$ that is not easy to work with directly. Our task now is to estimate this expectation in a convenient yet non-trivial way for the particular case $S_{0}\sim\operatorname*{Vol}_{\tau}(B)$.

Assume that all $\tau\times\tau$ submatrices of $B$ are non-degenerate, i.e. the $\tau$ -element volume sampling has full support (the other case will be considered later). Using Cramer’s rule $\operatorname*{Det}(B_{S\times S})(B_{S\times S})^{-1}=\operatorname*{Adj}(B_{S\times S})$ , we can write

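$$\mathbb{E}I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}I_{S_{0}}^{T}=\frac{\sum_{S\in{[n]\choose\tau}}I_{S}\operatorname*{Adj}(B_{S\times S})I_{S}^{T}}{\sum_{S\in{[n]\choose\tau}}\operatorname*{Det}(B_{S\times S})}.\tag{7}$$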

Thus, to estimate the expectation, we need to estimate the following two sums:

1. The sum of principal minors $\sum_{S\in{[n]\choose\tau}}\operatorname*{Det}(B_{S\times S})$.

2. The sum $\sum_{S\in{[n]\choose\tau}}I_{S}\operatorname*{Adj}(B_{S\times S})I_{S}^{T}$.

The first sum is rather well-known, and a closed-form expression for it can be found in many standard textbooks on linear algebra (see e.g. [24, Chapter 7]). To present the formula, let us introduce, for each $1\leq m\leq n$, the real elementary symmetric polynomial $\sigma_{m}:\mathbb{R}^{n}\to\mathbb{R}$ of degree $m$, defined by

$$\sigma_{m}(x):=\sum_{1\leq i_{1}<\dots<i_{m}\leq n}x_{i_{1}}\dots x_{i_{m}},$$

i.e. the sum of all $m$ -ary products of $x_{1},\dots,x_{n}$ , and put $\sigma_{0}(x):=1$ for convenience. The well-known result is

Lemma 3.2 (Sum of principal minors).

Let $B$ be an $n\times n$ real symmetric matrix with eigenvalues $\lambda:=(\lambda_{1},\dots,\lambda_{n})$ , where $\lambda_{1}\geq\dots\geq\lambda_{n}$ , and let $1\leq\tau\leq n$ . Then

$$\sum_{S\in{[n]\choose\tau}}\operatorname*{Det}(B_{S\times S})=\sigma_{\tau}(\lambda).$$
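The identity of Lemma 3.2 is easy to check numerically on a small random symmetric matrix. The sketch below is only a sanity check under assumed names (`elem_sym`, `checks`); it enumerates all subsets and is therefore limited to small $n$.

```python
import itertools
import numpy as np

def elem_sym(x, m):
    """Elementary symmetric polynomial sigma_m(x), with sigma_0 = 1."""
    return sum(np.prod(c) for c in itertools.combinations(x, m))

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
B = M + M.T                    # symmetric, not necessarily positive semidefinite
lam = np.linalg.eigvalsh(B)    # eigenvalues of B

checks = []
for tau in range(1, n + 1):
    # Sum of all tau x tau principal minors of B.
    minors = sum(np.linalg.det(B[np.ix_(S, S)])
                 for S in itertools.combinations(range(n), tau))
    checks.append(bool(np.isclose(minors, elem_sym(lam, tau))))
print(checks)
```

The cases $\tau=1$ and $\tau=n$ recover the familiar facts that the trace is the sum of the eigenvalues and the determinant is their product.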

Now we turn to the second sum. To the best of our knowledge, this sum has not been previously considered in the literature. Nevertheless, it turns out that it can also be conveniently expressed in terms of the elementary symmetric polynomials of eigenvalues:

Lemma 3.3.

Let $B$ be an $n\times n$ real symmetric matrix with eigenvalues $\lambda:=(\lambda_{1},\dots,\lambda_{n})$ , where $\lambda_{1}\geq\dots\geq\lambda_{n}$ , let $B=Q\operatorname*{Diag}(\lambda)Q^{T}$ be its spectral decomposition for some $n\times n$ orthogonal matrix $Q$ , and let $1\leq\tau\leq n$ . Then

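$$\sum_{S\in{[n]\choose\tau}}I_{S}\operatorname*{Adj}(B_{S\times S})I_{S}^{T}=Q\operatorname*{Diag}(\sigma_{\tau-1}(\lambda_{-1}),\dots,\sigma_{\tau-1}(\lambda_{-n}))Q^{T},\tag{9}$$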

where $\lambda_{-i}$ for each $1\leq i\leq n$ denotes the vector $\lambda$ without the $i$-th element.

Let us accept this lemma for now and defer its proof to a separate Section 3.4.
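Before turning to the proof, the statement of Lemma 3.3 can be sanity-checked numerically. The following sketch compares both sides on a random $5\times 5$ positive semidefinite matrix with $\tau=3$; the helper names and the test matrix are assumptions made for illustration.

```python
import itertools
import numpy as np

def adjugate(A):
    """Adjugate (transpose of the cofactor matrix); for small matrices only."""
    m = A.shape[0]
    if m == 1:
        return np.ones((1, 1))
    C = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C.T

def elem_sym(x, m):
    """Elementary symmetric polynomial sigma_m(x), with sigma_0 = 1."""
    return sum(np.prod(c) for c in itertools.combinations(x, m))

rng = np.random.default_rng(0)
n, tau = 5, 3
M = rng.standard_normal((n, n))
B = M @ M.T
lam, Q = np.linalg.eigh(B)  # B = Q Diag(lam) Q^T (any consistent ordering works)

# Left-hand side: sum over S of I_S Adj(B_SS) I_S^T.
lhs = np.zeros((n, n))
for S in itertools.combinations(range(n), tau):
    idx = np.ix_(S, S)
    lhs[idx] += adjugate(B[idx])

# Right-hand side: Q Diag(sigma_{tau-1}(lambda_{-i})) Q^T.
d = [elem_sym(np.delete(lam, i), tau - 1) for i in range(n)]
rhs = Q @ np.diag(d) @ Q.T
print(np.allclose(lhs, rhs))
```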

Using Lemma 3.2 together with Lemma 3.3, we can rewrite (7) as follows:

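$$\mathbb{E}I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}I_{S_{0}}^{T}=\frac{Q\operatorname*{Diag}(\sigma_{\tau-1}(\lambda_{-1}),\dots,\sigma_{\tau-1}(\lambda_{-n}))Q^{T}}{\sigma_{\tau}(\lambda)}.\tag{10}$$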

Thus, we have managed to express the expectation solely in terms of the eigenvectors and eigenvalues of $B$ . However, our new expression for the expectation is still difficult to work with because each elementary symmetric polynomial is in fact a very complex sum. Fortunately, recall that we do not need the expectation itself but only a suitable lower bound for it. To obtain such a bound, it is convenient to introduce

Definition 3.4 ( $\tau$ -coordinate approximation).

Let $B$ be an $n\times n$ real symmetric positive semidefinite matrix with eigenvalues $\lambda_{1}\geq\dots\geq\lambda_{n}$, let $1\leq\tau\leq\operatorname*{Rank}(B)$, and let $B=Q\operatorname*{Diag}(\lambda_{1},\dots,\lambda_{n})Q^{T}$ be a spectral decomposition of $B$ for some $n\times n$ orthogonal matrix $Q$. The $\tau$-coordinate approximation of $B$, denoted by $B_{\tau}$, is the $n\times n$ real positive semidefinite matrix

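$$B_{\tau}:=Q\operatorname*{Diag}(\lambda_{1},\dots,\lambda_{\tau},\lambda_{\tau},\dots,\lambda_{\tau})Q^{T}+\sum_{i=\tau+1}^{n}\lambda_{i}I.\tag{11}$$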

Observe that $B_{\tau}$ is non-degenerate since otherwise $\lambda_{\tau}=\dots=\lambda_{n}=0$ which, in view of positive semidefiniteness, contradicts the assumption that $\tau\leq\operatorname*{Rank}(B)$ . Also note that $B_{\tau}$ does not depend on the particular orthogonal matrix $Q$ in the spectral decomposition of $B$ . Indeed, the first term in the definition of $B_{\tau}$ can be written as $Q\operatorname*{Diag}(q_{\tau}(\lambda_{1}),\dots,q_{\tau}(\lambda_{n}))Q^{T}$ , where $q_{\tau}:\mathbb{R}\to\mathbb{R}$ is the function $q_{\tau}(t):=\max\{t,\lambda_{\tau}\}$ . It is well-known that such matrices do not depend on the choice of the diagonalizing matrix $Q$ (see e.g. [24, Section 7.3]).
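Definition 3.4 translates into a short routine based on an eigendecomposition. The sketch below is an illustration under assumed names (`coord_approx` is not from the paper); it uses the representation via $q_{\tau}(t)=\max\{t,\lambda_{\tau}\}$ mentioned above.

```python
import numpy as np

def coord_approx(B, tau):
    """tau-coordinate approximation B_tau (tau is 1-based, 1 <= tau <= Rank(B))."""
    lam, Q = np.linalg.eigh(B)                  # eigenvalues in ascending order
    lam, Q = lam[::-1], Q[:, ::-1]              # reorder: lambda_1 >= ... >= lambda_n
    clipped = np.maximum(lam, lam[tau - 1])     # q_tau(lambda_i) = max{lambda_i, lambda_tau}
    tail = lam[tau:].sum()                      # sum of lambda_i over i > tau
    return (Q * clipped) @ Q.T + tail * np.eye(len(lam))

# Hypothetical diagonal example with eigenvalues 4, 2, 1.
B = np.diag([4.0, 2.0, 1.0])
B1 = coord_approx(B, 1)   # equals trace(B) * I
B2 = coord_approx(B, 2)   # equals Diag(5, 3, 3)
B3 = coord_approx(B, 3)   # equals B itself
```

Two boundary cases follow directly from the definition: for $\tau=1$ one obtains $B_{1}=\operatorname{tr}(B)\,I$, and for $\tau=n$ one obtains $B_{n}=B$.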

Using the $\tau$ -coordinate approximation, we can now lower bound (10) as follows:

Lemma 3.5.

Let $B$ be an $n\times n$ real symmetric positive semidefinite matrix with eigenvalues $\lambda_{1}\geq\dots\geq\lambda_{n}$, let $1\leq\tau\leq\operatorname*{Rank}(B)$, and let $B=Q\operatorname*{Diag}(\lambda_{1},\dots,\lambda_{n})Q^{T}$ be a spectral decomposition of $B$ for some $n\times n$ orthogonal matrix $Q$. Then

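$$\frac{Q\operatorname*{Diag}(\sigma_{\tau-1}(\lambda_{-1}),\dots,\sigma_{\tau-1}(\lambda_{-n}))Q^{T}}{\sigma_{\tau}(\lambda)}\succeq(B_{\tau})^{-1}.$$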

Proof 3.6.

Since the eigenvalues are non-negative, we have

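$$\sigma_{\tau}(\lambda)=\sum_{i_{1}=1}^{n-\tau+1}\lambda_{i_{1}}\sum_{i_{1}+1\leq i_{2}<\dots<i_{\tau}\leq n}\lambda_{i_{2}}\dots\lambda_{i_{\tau}}\leq\sigma_{\tau-1}(\lambda_{-1})\sum_{i=1}^{n-\tau+1}\lambda_{i}.$$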

By the symmetry of elementary symmetric polynomials, this can be strengthened to

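$$\sigma_{\tau}(\lambda)\leq\sigma_{\tau-1}(\lambda_{-1})\Bigl(\lambda_{1}+\sum_{j=\tau+1}^{n}\lambda_{j}\Bigr),$$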

which in turn can be further generalized to

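$$\sigma_{\tau}(\lambda)\leq\sigma_{\tau-1}(\lambda_{-i})\Bigl(\lambda_{\min\{i,\tau\}}+\sum_{j=\tau+1}^{n}\lambda_{j}\Bigr)$$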

for all $1\leq i\leq n$. The claim follows.
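The final step of Proof 3.6 can be spelled out as follows. This is a sketch of the reasoning; the symbol $\mu_{i}$ for the $i$-th eigenvalue of $B_{\tau}$ is introduced here for illustration and does not appear in the original.

```latex
% mu_i: the i-th eigenvalue of B_tau in the basis Q (notation assumed here).
% Since lambda_1 >= ... >= lambda_n, we have max{lambda_i, lambda_tau} = lambda_{min{i, tau}}.
\mu_{i} := \max\{\lambda_{i},\lambda_{\tau}\} + \sum_{j=\tau+1}^{n}\lambda_{j}
         = \lambda_{\min\{i,\tau\}} + \sum_{j=\tau+1}^{n}\lambda_{j},
\qquad 1\leq i\leq n.

% The last inequality of the proof reads sigma_tau(lambda) <= sigma_{tau-1}(lambda_{-i}) mu_i.
% Because tau <= Rank(B), sigma_tau(lambda) >= lambda_1 ... lambda_tau > 0 and mu_i >= lambda_tau > 0,
% so we may divide by sigma_tau(lambda) mu_i:
\frac{\sigma_{\tau-1}(\lambda_{-i})}{\sigma_{\tau}(\lambda)} \geq \frac{1}{\mu_{i}},
\qquad 1\leq i\leq n.

% Both matrices in the lemma are diagonal in the basis Q, so these n scalar
% inequalities are equivalent to the matrix inequality
\frac{Q\operatorname*{Diag}(\sigma_{\tau-1}(\lambda_{-1}),\dots,\sigma_{\tau-1}(\lambda_{-n}))Q^{T}}{\sigma_{\tau}(\lambda)}
\succeq Q\operatorname*{Diag}(\mu_{1}^{-1},\dots,\mu_{n}^{-1})Q^{T} = (B_{\tau})^{-1}.
```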

To summarize, we have obtained that in the case when volume sampling has full support, we can replace (6) with

$$f(x_{0})-\mathbb{E}f(x_{1})\geq\frac{1}{2}\|\nabla f(x_{0})\|_{(B_{\tau})^{-1}}^{2}.$$

It remains to show that exactly the same result holds even when some $\tau\times\tau$ principal submatrices of $B$ are possibly degenerate. In this case, instead of (7) we should write more carefully that

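$$\mathbb{E}I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}I_{S_{0}}^{T}=\frac{\sum_{S\in{[n]\choose\tau}:\operatorname*{Det}(B_{S\times S})\neq 0}I_{S}\operatorname*{Adj}(B_{S\times S})I_{S}^{T}}{\sum_{S\in{[n]\choose\tau}}\operatorname*{Det}(B_{S\times S})}.$$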

Unfortunately, we cannot use Lemma 3.3 anymore because now (9) overestimates (and not underestimates) the numerator due to the fact that the adjugate to a symmetric positive semidefinite matrix is also symmetric positive semidefinite. However, recall that we are not interested in the numerator itself, but only in how it acts on the gradient $\nabla f(x_{0})$ . For each $1\leq\tau\leq n$ , define the linear subspace

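$$U_{\tau}(B):=\Bigl\{u\in\mathbb{R}^{n}:u_{S}\in\operatorname*{Im}(B_{S\times S})\text{ for all }S\in{[n]\choose\tau}\text{ with }\operatorname*{Det}(B_{S\times S})=0\Bigr\},$$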

where $\operatorname*{Im}(B_{S\times S})$ is the image space of $B_{S\times S}$ . Observe that $U_{\tau}(B)=\mathbb{R}^{n}$ if and only if all $\tau\times\tau$ principal submatrices of $B$ are non-degenerate. Our interest in the subspace $U_{\tau}(B)$ lies in the following observation:

Lemma 3.7.

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a 1-smooth function with respect to the seminorm $\|\cdot\|_{B}$, where $B$ is an $n\times n$ real symmetric positive semidefinite matrix. If $f$ is bounded from below, then for each $x_{0}\in\mathbb{R}^{n}$ and each $1\leq\tau\leq n$, we have $\nabla f(x_{0})\in U_{\tau}(B)$.

Proof 3.8.

Let $S\in{[n]\choose\tau}$ be such that $\operatorname*{Det}(B_{S\times S})=0$ (if there is no such $S$, the claim is vacuously true). Then the kernel $\operatorname*{Ker}(B_{S\times S})$ is non-trivial and hence there exists a non-zero $h\in\operatorname*{Ker}(B_{S\times S})$. From smoothness of $f$ with respect to $\|\cdot\|_{B}$, it follows that $f(x_{0}+tI_{S}h)\leq f(x_{0})+t\langle\nabla f(x_{0})_{S},h\rangle$ for all $t\in\mathbb{R}$. Hence, $\nabla f(x_{0})_{S}\in\operatorname*{Ker}(B_{S\times S})^{\perp}=\operatorname*{Im}(B_{S\times S})$, otherwise $f$ is unbounded from below.

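According to Lemma 3.7 and the above remarks, we are interested only in the action of $\sum_{S\in{[n]\choose\tau}:\operatorname*{Det}(B_{S\times S})\neq 0}I_{S}\operatorname*{Adj}(B_{S\times S})I_{S}^{T}$ on the subspace $U_{\tau}(B)$. On this subspace, it acts exactly as the already studied matrix $\sum_{S\in{[n]\choose\tau}}I_{S}\operatorname*{Adj}(B_{S\times S})I_{S}^{T}$. Indeed, if $\operatorname*{Det}(B_{S\times S})=0$, then $B_{S\times S}\operatorname*{Adj}(B_{S\times S})=\operatorname*{Det}(B_{S\times S})I=0$, so the symmetric matrix $\operatorname*{Adj}(B_{S\times S})$ maps into $\operatorname*{Ker}(B_{S\times S})$ and therefore vanishes on $\operatorname*{Im}(B_{S\times S})=\operatorname*{Ker}(B_{S\times S})^{\perp}$; hence the terms with degenerate $S$ contribute nothing on $U_{\tau}(B)$. Thus, the case of degenerate submatrices reduces to that of non-degenerate ones.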

Thus, regardless of whether there are degenerate principal submatrices or not, we have proved

Lemma 3.9 (Sufficient decrease lemma for volume sampling).

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a function which is bounded from below and 1-smooth with respect to the seminorm $\|\cdot\|_{B}$ , where $B$ is an $n\times n$ real symmetric positive semidefinite matrix. Let $x_{0}\in\mathbb{R}^{n}$ , let $S_{0}\sim\operatorname*{Vol}_{\tau}(B)$ for some $1\leq\tau\leq\operatorname*{Rank}(B)$ , and let $x_{1}:=x_{0}-I_{S_{0}}(B_{S_{0}\times S_{0}})^{-1}\nabla f(x_{0})_{S_{0}}$ . Then $f(x_{1})\leq f(x_{0})$ almost surely and

[TABLE]

To finish this section, let us establish the following relations between $\tau$ -coordinate approximations that will be central in the forthcoming convergence analysis.

Lemma 3.10.

Let $B$ be an $n\times n$ real symmetric positive semidefinite matrix with eigenvalues $\lambda_{1}\geq\dots\geq\lambda_{n}$ , and let $1\leq\tau_{1}<\tau_{2}\leq\operatorname*{Rank}(B)$ . Then

$$B_{\tau_{2}}\preceq B_{\tau_{1}}\preceq R_{\lambda}(\tau_{1},\tau_{2})\,B_{\tau_{2}},$$

*where $R_{\lambda}(\tau_{1},\tau_{2})$ is defined by (1). *

Proof 3.11.

Denote $\Sigma_{i}:=\sum_{j=i}^{n}\lambda_{j}$ for $1\leq i\leq n+1$ (so that $\Sigma_{n+1}=0$). According to (11), we have $B_{\tau_{1}}=Q\operatorname*{Diag}(s^{(1)}_{1},\ldots,s^{(1)}_{n})Q^{T}$ , where

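$$s^{(1)}_{i}:=\begin{cases}\lambda_{i}+\Sigma_{\tau_{1}+1},&\text{if }1\leq i\leq\tau_{1},\\ \Sigma_{\tau_{1}},&\text{if }\tau_{1}\leq i\leq n.\end{cases}$$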

Similarly, $B_{\tau_{2}}=Q\operatorname*{Diag}(s^{(2)}_{1},\ldots,s^{(2)}_{n})Q^{T}$ , where

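$$s^{(2)}_{i}:=\begin{cases}\lambda_{i}+\Sigma_{\tau_{2}+1},&\text{if }1\leq i\leq\tau_{2},\\ \Sigma_{\tau_{2}},&\text{if }\tau_{2}\leq i\leq n.\end{cases}$$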

Hence, $(B_{\tau_{2}})^{-1/2}B_{\tau_{1}}(B_{\tau_{2}})^{-1/2}=Q\operatorname*{Diag}(s_{1},\ldots,s_{n})Q^{T}$ , where

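$$s_{i}:=\frac{s^{(1)}_{i}}{s^{(2)}_{i}}=\begin{cases}\dfrac{\lambda_{i}+\Sigma_{\tau_{1}+1}}{\lambda_{i}+\Sigma_{\tau_{2}+1}},&\text{if }1\leq i\leq\tau_{1},\\[2mm] \dfrac{\Sigma_{\tau_{1}}}{\lambda_{i}+\Sigma_{\tau_{2}+1}},&\text{if }\tau_{1}\leq i\leq\tau_{2},\\[2mm] \dfrac{\Sigma_{\tau_{1}}}{\Sigma_{\tau_{2}}},&\text{if }\tau_{2}\leq i\leq n.\end{cases}$$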

To finish the proof, it now remains to show that $s_{1},\ldots,s_{n}$ are bounded from below by 1 and bounded from above by $\frac{\Sigma_{\tau_{1}}}{\Sigma_{\tau_{2}}}$ .

*For $1\leq i\leq\tau_{1}$ , we have $s_{i}=1+\frac{\Sigma_{\tau_{1}+1}-\Sigma_{\tau_{2}+1}}{\lambda_{i}+\Sigma_{\tau_{2}+1}}$ . Since $\Sigma_{\tau_{1}+1}\geq\Sigma_{\tau_{2}+1}\geq 0$ , it follows from this representation that $1\leq s_{1}\leq\ldots\leq s_{\tau_{1}}$ (recall that $\lambda_{1}\geq\ldots\geq\lambda_{n}\geq 0$ ). Similarly, for $\tau_{1}\leq i\leq\tau_{2}$ (including $\tau_{1}$ ), we have $s_{i}=\frac{\Sigma_{\tau_{1}}}{\lambda_{i}+\Sigma_{\tau_{2}+1}}$ , hence $s_{\tau_{1}}\leq\ldots\leq s_{\tau_{2}}$ . Finally, $s_{\tau_{2}}=\ldots=s_{n}$ (including $\tau_{2}$ ). Thus, $1\leq s_{1}\leq\ldots\leq s_{n}=\frac{\Sigma_{\tau_{1}}}{\Sigma_{\tau_{2}}}$ . *
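The bounds in this proof are easy to check numerically. The following is a minimal sketch, not part of the original analysis: it assumes, as can be read off from the expressions for $s_{i}$ above, that the eigenvalues of $B_{\tau}$ are $\lambda_{i}+\Sigma_{\tau+1}$ for $i\leq\tau$ and $\Sigma_{\tau}$ otherwise. The function names and the test spectrum are hypothetical.

```python
from itertools import accumulate


def tail_sums(lams):
    """Return [Sigma_1, ..., Sigma_n, 0] with Sigma_i = lams[i-1] + ... + lams[-1]."""
    return list(accumulate(reversed(lams)))[::-1] + [0.0]


def b_tau_spectrum(lams, tau):
    """Assumed eigenvalues of B_tau: lambda_i + Sigma_{tau+1} for i <= tau, else Sigma_tau."""
    sigma = tail_sums(lams)  # sigma[j] holds Sigma_{j+1} (0-based list)
    return [lams[i] + sigma[tau] if i < tau else sigma[tau - 1]
            for i in range(len(lams))]


def ratio_spectrum(lams, tau1, tau2):
    """Eigenvalues s_i of B_{tau2}^{-1/2} B_{tau1} B_{tau2}^{-1/2}."""
    s1 = b_tau_spectrum(lams, tau1)
    s2 = b_tau_spectrum(lams, tau2)
    return [a / b for a, b in zip(s1, s2)]


lams = [50.0, 10.0, 3.0, 1.0, 1.0]  # hypothetical spectrum, sorted decreasingly
s = ratio_spectrum(lams, tau1=1, tau2=3)
upper = sum(lams[0:]) / sum(lams[2:])  # Sigma_1 / Sigma_3
print(s, upper)
```

For this spectrum the values $s_{i}$ are non-decreasing, start above 1, and end exactly at $\Sigma_{1}/\Sigma_{3}$, in agreement with the chain of inequalities established in the proof.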

3.2 Convex functions

Now we are ready to establish several results on the convergence rate of the RCDVS method. We start with the class of smooth convex functions.

Theorem 3.12 (Convergence rate for convex functions).

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a convex function which is 1-smooth with respect to the seminorm $\|\cdot\|_{B}$ , where $B$ is an $n\times n$ real symmetric positive semidefinite matrix. Let $x_{0}$ be a deterministic point in $\mathbb{R}^{n}$ and assume that the sublevel set $L_{f}(x_{0}):=\{x\in\mathbb{R}^{n}:f(x)\leq f(x_{0})\}$ is bounded. Let $1\leq\tau\leq\operatorname*{Rank}(B)$ , $K\geq 1$ , and let $(x_{k})_{k=1}^{K}$ be the random points in $\mathbb{R}^{n}$ generated by $\operatorname*{RCDVS}(f,B,\tau,x_{0},K)$ . Then

[TABLE]

*for all $0\leq k\leq K$ , where $D_{\tau}:=\max_{x\in L_{f}(x_{0})}\min_{x^{*}\in\operatorname*{Argmin}f}\|x-x^{*}\|_{B_{\tau}}$ is the radius of the sublevel set $L_{f}(x_{0})$ measured in the norm $\|\cdot\|_{B_{\tau}}$ . *

Proof 3.13.

*See Section A. *

According to Theorem 3.12, for achieving accuracy $\varepsilon>0$ in terms of the expected value of the objective, one needs the following number of iterations:

[TABLE]

In particular, for $\tau=1$ , we have $D_{\tau}^{2}=\operatorname{Tr}(B)D^{2}$ , where $D$ is the radius of the sublevel set $L_{f}(x_{0})$ measured in the standard Euclidean norm; this recovers the already known result for the RCD method [7].

Let us fix $1\leq\tau_{1}<\tau_{2}\leq\operatorname*{Rank}(B)$ and compare the efficiency estimates of RCDVS with $\tau_{1}$ coordinates with that of $\tau_{2}$ coordinates. We obtain that

[TABLE]

Thus, we need to compare the quantities $D_{\tau_{1}}^{2}$ and $D_{\tau_{2}}^{2}$ . By Lemma 3.10, we have

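$$\|x-x^{*}\|_{B_{\tau_{2}}}^{2}\leq\|x-x^{*}\|_{B_{\tau_{1}}}^{2}\leq R_{\lambda}(\tau_{1},\tau_{2})\,\|x-x^{*}\|_{B_{\tau_{2}}}^{2}$$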

for all $x,x^{*}\in\mathbb{R}^{n}$ . Hence, by first minimizing in $x^{*}\in\operatorname*{Argmin}f$ , and then maximizing in $x\in L_{f}(x_{0})$ , we obtain

$$D_{\tau_{2}}^{2}\leq D_{\tau_{1}}^{2}\leq R_{\lambda}(\tau_{1},\tau_{2})\,D_{\tau_{2}}^{2}.$$

This means that the method with a larger number of coordinates is never worse than the corresponding method with a smaller number of coordinates, and it can be faster by a factor of up to the ratio (1).
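To get a feeling for the size of the ratio (1), one can evaluate it for a concrete spectrum. The sketch below is only an illustration: the helper name `spectral_ratio` and the spectra (one dominant eigenvalue, one eigenvalue equal to 100, and the rest equal to 1) are assumptions chosen for this example.

```python
def spectral_ratio(lams, tau1, tau2):
    """R_lambda(tau1, tau2) = (sum_{i >= tau1} lam_i) / (sum_{i >= tau2} lam_i), 1-based."""
    lams = sorted(lams, reverse=True)  # the definition assumes decreasing order
    return sum(lams[tau1 - 1:]) / sum(lams[tau2 - 1:])


n = 100
ratios = []
for lam1 in (100.0, 1000.0, 10000.0):
    lams = [lam1, 100.0] + [1.0] * (n - 2)
    ratios.append(spectral_ratio(lams, 1, 2))
    print(lam1, ratios[-1])
```

The ratio grows roughly linearly with the leading eigenvalue once it dominates the trace, which is the regime in which passing from $\tau_{1}=1$ to $\tau_{2}=2$ promises the largest gain.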

3.3 Strongly convex functions

Now let us consider the strongly convex case. For measuring the parameter of strong convexity, it is natural to use the norm $\|\cdot\|_{B_{\tau}}$ . Recall that a differentiable function $f:\mathbb{R}^{n}\to\mathbb{R}$ is called strongly convex with respect to the norm $\|\cdot\|_{B_{\tau}}$ if there exists $\mu_{\tau}>0$ such that

[TABLE]

for all $x,y\in\mathbb{R}^{n}$ . The largest possible value of $\mu_{\tau}$ , satisfying (14), is called the modulus of strong convexity. Observe that if $f$ is additionally 1-smooth with respect to $\|\cdot\|_{B}$ (see (2)), then we must have $\mu_{\tau}B_{\tau}\preceq B$ . Thus, $B$ cannot be degenerate in this situation.
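The relation $\mu_{\tau}B_{\tau}\preceq B$ can be seen by sandwiching the linearization error of $f$ between the two quadratic bounds. The short derivation below is a sketch which assumes that the smoothness condition (2) has the standard form $f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{1}{2}\|y-x\|_{B}^{2}$.

```latex
\begin{align*}
\frac{\mu_{\tau}}{2}\|h\|_{B_{\tau}}^{2}
  &\leq f(x+h)-f(x)-\langle\nabla f(x),h\rangle
   \leq \frac{1}{2}\|h\|_{B}^{2}
   \qquad\text{for all } x,h\in\mathbb{R}^{n}, \\
\Longrightarrow\quad
\mu_{\tau}\langle B_{\tau}h,h\rangle
  &\leq \langle Bh,h\rangle
   \qquad\text{for all } h\in\mathbb{R}^{n}.
\end{align*}
```

Since $\|\cdot\|_{B_{\tau}}$ is a norm, the left-hand side is strictly positive for every $h\neq 0$, so $\langle Bh,h\rangle>0$ for all $h\neq 0$, i.e. $B$ is positive definite.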

Theorem 3.14 (Convergence rate for strongly convex functions).

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a function which is 1-smooth with respect to the norm $\|\cdot\|_{B}$ , where $B$ is an $n\times n$ real symmetric positive definite matrix, let $1\leq\tau\leq n$ , and let $f$ be strongly convex with respect to the norm $\|\cdot\|_{B_{\tau}}$ with modulus $\mu_{\tau}$ . Let $x_{0}$ be a deterministic point in $\mathbb{R}^{n}$ , let $K\geq 1$ , and let $(x_{k})_{k=1}^{K}$ be the random points in $\mathbb{R}^{n}$ generated by $\operatorname*{RCDVS}(f,B,\tau,x_{0},K)$ . Then

[TABLE]

*for all $0\leq k\leq K$ . *

Proof 3.15.

*See Section B. *

Since $1-\mu_{\tau}\leq e^{-\mu_{\tau}}$ , this means that, given any $\varepsilon>0$ , for achieving accuracy $\varepsilon$ in terms of the expected value of the objective, one needs the following number of iterations:

[TABLE]

For $\tau=1$, we have $B_{\tau}=\operatorname{Tr}(B)I$ and hence $\mu_{\tau}=\mu/\operatorname{Tr}(B)$, where $\mu$ is the strong convexity parameter of $f$ in the standard Euclidean norm. This recovers the convergence rate of the RCD method for strongly convex functions from [7].

Similarly to the discussion in Section 3.2, let us compare the efficiency estimates for different values of $\tau$ . Fix $1\leq\tau_{1}<\tau_{2}\leq n$ . Then, the acceleration rate equals

[TABLE]

Let us compare the constants $\mu_{\tau_{1}}$ and $\mu_{\tau_{2}}$ . By Lemma 3.10, for all $h\in\mathbb{R}^{n}$ , we have

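$$\|h\|_{B_{\tau_{2}}}^{2}\leq\|h\|_{B_{\tau_{1}}}^{2}\leq R_{\lambda}(\tau_{1},\tau_{2})\,\|h\|_{B_{\tau_{2}}}^{2}.$$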

Hence, by strong convexity of $f$ in the $B_{\tau_{1}}$ -norm, it follows that

[TABLE]

for all $x,y\in\mathbb{R}^{n}$ . This means that the modulus of strong convexity of $f$ with respect to $\|\cdot\|_{B_{\tau_{2}}}$ is at least $\mu_{\tau_{1}}$ , i.e. $\mu_{\tau_{2}}\geq\mu_{\tau_{1}}$ . Similarly, combining the strong convexity of $f$ in the $B_{\tau_{2}}$ -norm with the second inequality in (16), we obtain

[TABLE]

for all $x,y\in\mathbb{R}^{n}$ . This means that $\mu_{\tau_{1}}\geq\frac{\mu_{\tau_{2}}}{R_{\lambda}(\tau_{1},\tau_{2})}$ . Hence, our reasoning shows that

[TABLE]

Thus, we obtain absolutely the same result as in the previous section: the efficiency of the method monotonically improves with $\tau$ and the acceleration factor can reach the ratio (1) (for example, one can verify that this is the case for a strictly convex quadratic function).

Remark 3.16.

*The results presented in this section can be straightforwardly extended from strongly convex functions to the broader class of gradient dominated functions of degree 2, also known as functions satisfying the Polyak–Łojasiewicz condition. For more information and different examples of such functions, see [25]. Note also that, in the context of coordinate descent methods, an even more general condition has recently appeared, called the Generalized Error Bound Property [4]. However, we do not know whether our results can be extended to this property. *

3.4 Proof of Lemma 3.3

In this section, we give the proof of Lemma 3.3 assuming that $n\geq 2$ (otherwise the claim is trivial). We start with introducing a little new notation that will be used only inside this section. For a subset $S\in{[n]\choose\tau}$ , by $I_{-S}$ we denote the $n\times(n-\tau)$ matrix obtained from the $n\times n$ identity matrix $I$ by removing the columns with indices from $S$ (i.e. $I_{-S}:=I_{[n]\setminus S}$ ). For an $n\times n$ matrix $B$ and a subset $S\in{[n]\choose\tau}$ , by $B_{-S\times-S}$ we denote the $(n-\tau)\times(n-\tau)$ submatrix obtained from $B$ by removing the rows and columns with indices from $S$ (i.e. $B_{-S\times-S}:=I_{-S}^{T}BI_{-S}$ ); similarly, for a vector $v\in\mathbb{R}^{n}$ , by $v_{-S}$ we denote the subvector of size $n-\tau$ obtained from $v$ by removing the elements with indices from $S$ (i.e. $v_{-S}:=I_{-S}^{T}v$ ); for brevity, for each $1\leq i\leq n$ , we also use $I_{-i}$ , $B_{-i\times-i}$ and $v_{-i}$ instead of more cumbersome $I_{-\{i\}}$ , $B_{-\{i\}\times-\{i\}}$ and $v_{-\{i\}}$ respectively.

To prove Lemma 3.3, let us consider the matrix-valued polynomial

[TABLE]

and show that the left- and right-hand sides of (9) are, up to a constant multiplicative factor, different representations of the $(n-\tau)$ -th derivative of $P$ at zero.

We start with the easier right-hand side. Using the spectral decomposition $B=Q\operatorname*{Diag}(\lambda)Q^{T}$ and the definition of the adjugate matrix, for each $t\in\mathbb{R}$ we readily obtain the following spectral decomposition of $P$ :

[TABLE]

where $d_{i}:\mathbb{R}\to\mathbb{R}$ for each $1\leq i\leq n$ is the polynomial

[TABLE]

Opening the parentheses and grouping the terms by the powers of $t$ , we see that

[TABLE]

for each $1\leq i\leq n$ , and hence

[TABLE]

Now we take another approach to calculating the derivative $P^{(n-\tau)}(0)$ by directly differentiating the original expression (17). The key inductive step here is

Lemma 3.17 (Inductive step).

Let $B$ be an $n\times n$ ( $n\geq 2$ ) real symmetric matrix, let $P$ be the matrix-valued polynomial (17), and let $t\in\mathbb{R}$ . Then

[TABLE]

Suppose for the moment that Lemma 3.17 holds, and let $t\in\mathbb{R}$ be arbitrary. Differentiating both sides of (19) (each time applying Lemma 3.17 to the matrix $B_{-i\times-i}$ ) and assuming that $n\geq 3$ , we obtain

[TABLE]

Similarly, assuming that $n\geq 4$ , we have

[TABLE]

and, more generally (by induction on $r$ ), that

[TABLE]

for all $0\leq r\leq n-1$ . In particular,

[TABLE]

Equating (18) and (20), we obtain the claim of Lemma 3.3.

All that remains is to prove Lemma 3.17.

Proof 3.18.

We begin with a couple of technical simplifications. First, it suffices to prove the claim only for $t=0$ ; the general case then follows by replacing $B$ with $B-tI$ . Second, we can assume that $B$ and all its $(n-1)\times(n-1)$ principal submatrices ( $B_{-1\times-1},\dots,B_{-n\times-n}$ ) are simultaneously non-degenerate; otherwise we can replace $B$ with $B+\delta I$ for an arbitrary sufficiently small $\delta>0$ (such that $B+\delta I$ satisfies the above requirement) and then pass to the limit as $\delta\to 0$ using the continuity of both sides of (19) in $\delta$ .

Since $B$ is non-degenerate, there exists a sufficiently small neighborhood around zero such that $B-tI$ is non-degenerate for all $t$ from this neighborhood and

[TABLE]

by Cramer’s rule. Differentiating and denoting $C:=B^{-1}$ , we obtain

[TABLE]

where $c_{1},\dots,c_{n}\in\mathbb{R}^{n}$ are the columns of $C$ . Observe that the $i$ -th row and the $i$ -th column of the matrix $C_{ii}C-c_{i}c_{i}^{T}$ for each $1\leq i\leq n$ consist entirely of zeros, so

[TABLE]

Thus, to obtain the claim, it suffices to demonstrate that

[TABLE]

for all $1\leq i\leq n$ . Replacing $B$ with $P^{T}BP$ if necessary (where $P$ is the $n\times n$ permutation matrix obtained from the identity matrix by moving the $i$ -th column to the end), it is enough to consider only the case $i=n$ . Let

[TABLE]

where $F$ is the top-left $(n-1)\times(n-1)$ principal submatrix of $B$ , $z\in\mathbb{R}^{n-1}$ is the right-most column of $B$ with the last element removed, and $\alpha:=B_{nn}$ is the element in the lower right corner. Note that $F$ is non-degenerate as a principal $(n-1)\times(n-1)$ submatrix of $B$ (by the technical assumption made at the very beginning). Using the formula for inverting a block matrix, we obtain that $\alpha-\langle F^{-1}z,z\rangle\neq 0$ , and

[TABLE]

In particular, we see that

[TABLE]

Since by Cramer’s rule

[TABLE]

it remains to check that

[TABLE]

*But this is exactly the formula for the determinant of a block matrix. *

4 Implementation of volume sampling

Let $B$ be an $n\times n$ real symmetric positive semidefinite matrix, and let $1\leq\tau\leq\operatorname*{Rank}(B)$ . In this section, we discuss how to generate a random variable $S_{0}\sim\operatorname*{Vol}_{\tau}(B)$ according to volume sampling.

4.1 General algorithm

Recall that volume sampling is sampling with a finite number of outcomes. Thus, in principle, $S_{0}$ can be generated by any general method for generating random variables taking a finite number of values. Let us briefly review one such method which is based on the following result:

Proposition 4.1 (Generating a random variable taking a finite number of values).

*Let $X:=\{x_{1},\dots,x_{N}\}$ be a finite set, and let $p_{1},\dots,p_{N}$ be non-negative numbers such that $\sum_{k=1}^{N}p_{k}=1$ . For each $1\leq k\leq N$ , let $P_{k}:=\sum_{k^{\prime}=1}^{k}p_{k^{\prime}}$ . Let $u$ be a random variable uniformly distributed on the interval $(0,1)$ , and let $\xi:=x_{k_{0}}$ , where $k_{0}:=\min\{1\leq k\leq N:u\leq P_{k}\}$ . Then $\xi$ is a well-defined random variable taking values in $X$ such that $\mathbb{P}(\xi=x_{k})=p_{k}$ for all $1\leq k\leq N$ . *

Proof 4.2.

*For convenience, denote $P_{0}:=0$ . Observe that $0=P_{0}\leq P_{1}\leq\dots\leq P_{N}=1$ . Thus, $k_{0}=k$ for some $1\leq k\leq N$ iff $u$ belongs to the interval $(P_{k-1},P_{k}]$ . Since these intervals are disjoint and their union is $(0,1]$ , the variable $k_{0}$ is well-defined. Hence, $\xi$ is well-defined, and $\mathbb{P}(\xi=x_{k})=\mathbb{P}(k_{0}=k)=\mathbb{P}(P_{k-1}<u\leq P_{k})=P_{k}-P_{k-1}=p_{k}$ for each $1\leq k\leq N$ . *

Proposition 4.1 is in fact a two-stage algorithm for generating a random variable $\xi$ taking values in $\{x_{1},\dots,x_{N}\}$ given the corresponding list of probabilities $p_{1},\dots,p_{N}$ . At the first stage of this algorithm (called preprocessing), we compute the cumulative sums $P_{1},\dots,P_{N}$ . This requires $O(N)$ operations. At the second stage (called sampling), we first generate a random variable $u$ uniformly distributed on $(0,1)$ , then compute $k_{0}$ , and finally output $\xi:=x_{k_{0}}$ . Since the cumulative sums $(P_{k})_{1\leq k\leq N}$ are monotonically increasing, one can use binary search for efficiently finding $k_{0}$ . Thus, the complexity of sampling is just $O(\ln N)$ operations. Note that the preprocessing has to be done only once; after this, one can generate arbitrarily many independent samples using the sampling routine.
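The two stages described above can be sketched in a few lines. This is a minimal illustration under our own naming (`preprocess` and `sample_index` are hypothetical helpers, not taken from the authors' code); it uses 0-based indices, so the returned index is $k_{0}-1$.

```python
import bisect
import random
from itertools import accumulate


def preprocess(probs):
    """Preprocessing stage: cumulative sums P_1, ..., P_N in O(N) operations."""
    return list(accumulate(probs))


def sample_index(cum, u):
    """Sampling stage: smallest 0-based k with u <= cum[k], found by binary search."""
    k = bisect.bisect_left(cum, u)
    return min(k, len(cum) - 1)  # guards against round-off in the last cumulative sum


cum = preprocess([0.25, 0.5, 0.25])
u = random.random()  # one uniform draw per sample
print(sample_index(cum, u))
```

The preprocessing is done once, after which every call to `sample_index` costs $O(\ln N)$ comparisons.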

In the case of volume sampling, the above procedure looks as follows. At the preprocessing stage, we iterate over all $N={n\choose\tau}$ possible $\tau$ -element subsets of $[n]$ , computing the corresponding principal minors of $B$ and corresponding cumulative sums. This requires $O({n\choose\tau}\tau^{3})$ operations in total assuming that the complexity of calculating a minor of size $\tau$ is $O(\tau^{3})$ . During sampling, we use binary search to find the number $k_{0}$ , and then return the set $S_{0}\in{[n]\choose\tau}$ corresponding to $k_{0}$ . Each sampling thus requires $O(\ln{n\choose\tau})$ operations.
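As an illustration, here is a brute-force sketch of this general procedure. It is only practical for very small $\tau$, as discussed in the text. All function names are hypothetical; the determinant routine is a plain Gaussian elimination, and the cumulative sums are left unnormalized, so the uniform variable is scaled by the total sum instead. The sketch also uses half-open intervals $[P_{k-1},P_{k})$, a harmless variation of Proposition 4.1 that fits a uniform draw from $[0,1)$.

```python
import bisect
import random
from itertools import accumulate, combinations


def det(m):
    """Determinant by Gaussian elimination with partial pivoting, O(k^3)."""
    a = [row[:] for row in m]
    k, d = len(a), 1.0
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, k):
            f = a[r][c] / a[c][c]
            for j in range(c, k):
                a[r][j] -= f * a[c][j]
    return d


def vs_preprocess(B, tau):
    """Enumerate all tau-subsets and accumulate their principal minors."""
    subsets = list(combinations(range(len(B)), tau))
    minors = [det([[B[i][j] for j in S] for i in S]) for S in subsets]
    return subsets, list(accumulate(minors))


def vs_sample(subsets, cum, rng=random):
    """Draw one subset with probability proportional to its principal minor."""
    u = rng.random() * cum[-1]
    k = bisect.bisect_right(cum, u)  # first k with u < cum[k]
    return subsets[min(k, len(subsets) - 1)]


B = [[4.0, 1.0, 1.0], [1.0, 3.0, 0.0], [1.0, 0.0, 2.0]]  # small test matrix
subsets, cum = vs_preprocess(B, 2)
print(subsets, cum, vs_sample(subsets, cum))
```

For this matrix the three principal $2\times 2$ minors are 11, 7 and 6, so the pairs $\{1,2\}$, $\{1,3\}$, $\{2,3\}$ (in 1-based indexing) are drawn with probabilities $11/24$, $7/24$ and $6/24$.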

Unfortunately, the above $O({n\choose\tau}\tau^{3})$ preprocessing time makes the general algorithm impractical for most values of $\tau$ . Nevertheless, it is still applicable for several very small values of $\tau$ . For example, when $\tau=1$ , the preprocessing time and memory complexities are both $O(n)$ , while the sampling time and memory complexities are $O(\ln n)$ and $O(1)$ respectively. Another interesting regime is $\tau=2$ . In this case, the preprocessing time and memory complexities are both $O(n^{2})$ , while the sampling time and memory complexities are the same as before. Note that in many applications the objective function $f:\mathbb{R}^{n}\to\mathbb{R}$ has the form $f(x):=\phi(Ax,x)$ , where $A$ is a real $m\times n$ matrix, and $\phi:\mathbb{R}^{m}\times\mathbb{R}^{n}\to\mathbb{R}$ is a function that can be computed in time $O(m+n)$ (see Section 5 for different examples). In these applications, the $O(n^{2})$ memory is comparable to the cost of storing $A$ , and the $O(n^{2})$ time is comparable to the cost of one computation of the objective and is often allowable (see also Section 4.2 for a possible treatment of sparsity).

While the general procedure described above is not polynomial, we note that there are more specialized methods for volume sampling that are polynomial (e.g. see [20] for an exact algorithm and several efficient approximate ones). However, they are designed for generating just one sample and therefore are not directly suited for using them inside optimization algorithms, where one needs a very fast sampling routine that can be called at each iteration. Perhaps, it is possible to modify these methods by properly splitting them into a polynomial preprocessing stage and an independent sampling stage which is much faster, but we have not investigated this direction.

4.2 Two-element volume sampling for sparse matrices

Now suppose that the matrix $B$ is sparse and our goal is to implement 2-element volume sampling. The general algorithm described above requires $O(n^{2})$ time and $O(n^{2})$ memory for preprocessing, which may be too expensive when $n$ is large. In this section, we present a special method that takes into account the sparsity of $B$ and whose preprocessing time and memory complexities are both $O(\operatorname{nnz}(B)+n)$ , where $\operatorname{nnz}(B)$ is the number of non-zero elements of $B$ . When $B$ is dense, $\operatorname{nnz}(B)=n^{2}$ , but it can be much smaller than $n^{2}$ if $B$ is sufficiently sparse. Once the preprocessing is done, each sampling then has the $O(\ln n)$ time complexity and the $O(1)$ memory complexity, which are exactly the same complexities as those of the general algorithm from the previous section (see Figure 1 for a comparison).

Assume that the matrix $B$ is given in a CSR-like (Compressed Sparse Row) format. (Strictly speaking, the classical CSR format is different from the one we are describing here. In the classical CSR format, the index vectors $j^{(1)},\dots,j^{(n)}$ are concatenated into one large vector, similarly the value vectors $v^{(1)},\dots,v^{(n)}$ are concatenated into one large vector, and instead of the numbers $r_{1},\dots,r_{n}$ one stores their cumulative sums. Nevertheless, the format we are describing is algorithmically equivalent to the original CSR format, and one can be transformed into the other in a straightforward manner without any time or memory overhead.) Namely, for each $1\leq i\leq n$ , we know the $r_{i}$ indices (possibly zero) $i\leq j^{(i)}_{1}<\dots<j^{(i)}_{r_{i}}\leq n$ of all non-zero elements in the $i$ -th row of $B$ which are located to the right of the diagonal (thus, $\{j^{(i)}_{1},\dots,j^{(i)}_{r_{i}}\}=\{i\leq j\leq n:B_{ij}\neq 0\}$ ), as well as the corresponding values $v^{(i)}_{1},\dots,v^{(i)}_{r_{i}}$ of these elements (thus, $v^{(i)}_{k}:=B_{ij^{(i)}_{k}}$ for all $1\leq k\leq r_{i}$ ). For notational convenience, for each $1\leq i<j\leq n;1\leq k\leq r_{i}$ we also define $j^{(i)}_{r_{i}+1}:=n+1$ and

[TABLE]

where $h^{(1)}\in\mathbb{R}^{r_{1}},\dots,h^{(n-1)}\in\mathbb{R}^{r_{n-1}}$ , $t\in\mathbb{R}^{n+1}$ are given, and $h:=(h^{(1)},\dots,h^{(n-1)})$ .

Now the notation has been established and we are ready to present the algorithm. Similarly to the method from the previous section, it is a two-stage procedure that consists of an expensive preprocessing stage (Algorithm 2) and a cheap sampling stage (Algorithm 3) that can be executed as many times as one wishes once the preprocessing has terminated.

Clearly, the time and memory complexities of the preprocessing stage are both $O(\operatorname{nnz}(B)+n)$ . The sampling stage consists of three successive binary searches over certain subsets of $[n]$ and thus has the $O(\ln n)$ time complexity and the $O(1)$ memory complexity when properly implemented (the function $P_{v,h,t}$ should be computed on the fly inside each binary search).
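To see why $O(\operatorname{nnz}(B)+n)$ preprocessing is plausible, note that the total weight of all pairs $\{i,j\}$ with a fixed smaller index $i$ equals $\sum_{j>i}(B_{ii}B_{jj}-B_{ij}^{2})=B_{ii}\sum_{j>i}B_{jj}-\sum_{j>i}B_{ij}^{2}$, where the last sum only involves the stored non-zero elements. The sketch below computes these row weights in one backward pass. It illustrates this identity only and is not a transcription of Algorithms 2 and 3; the data layout (`diag` plus a list `upper` of strictly-upper-triangular `(column, value)` pairs) and the function name are our assumptions.

```python
def row_weights(diag, upper):
    """weights[i] = sum over j > i of det(B restricted to {i, j}).

    diag[i] is B_ii; upper[i] lists (j, B_ij) for the non-zero entries with j > i.
    Runs in O(nnz + n) time.
    """
    n = len(diag)
    weights = [0.0] * n
    tail = 0.0  # running sum of diag[j] over j > i
    for i in range(n - 1, -1, -1):
        off = sum(v * v for _, v in upper[i])
        weights[i] = diag[i] * tail - off
        tail += diag[i]
    return weights


# Same 3x3 test matrix as a dense array would give: [[4,1,1],[1,3,0],[1,0,2]].
diag = [4.0, 3.0, 2.0]
upper = [[(1, 1.0), (2, 1.0)], [], []]
weights = row_weights(diag, upper)
print(weights)
```

Sampling the first index with probability proportional to these weights, and then the second index conditionally, reproduces the two-level structure used in the correctness proof below.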

We now prove the correctness of the algorithm.

Theorem 4.3.

*Let $B$ be an $n\times n$ real symmetric positive semidefinite matrix with rank at least two. Let $(h^{(1)},\dots,h^{(n-1)},t,q):=\operatorname*{sparse2vs\_preprocess}(B)$ , and let $S_{0}:=\operatorname*{sparse2vs\_sample}(B,h^{(1)},\dots,h^{(n-1)},t,q)$ . Then $S_{0}$ is a well-defined random variable distributed according to $\operatorname*{Vol}_{2}(B)$ . *

Proof 4.4.

Observe that, for each $1\leq i\leq n-1$ and each $1\leq k\leq r_{i}$ , we have

[TABLE]

For each $1\leq i<j\leq n$ , denote

[TABLE]

Then, for each $1\leq i\leq n-1$ , it holds that

[TABLE]

From Proposition 4.1, it follows that $i_{0}$ is a well-defined random variable taking values in $\{1,\dots,n-1\}$ with probabilities

[TABLE]

Next observe that, for each $1\leq i\leq n-1$ and $1\leq k\leq r_{i}$ , we have

[TABLE]

for all $j^{(i)}_{k}\leq j\leq j^{(i)}_{k+1}-1$ . In particular, $P(i_{0},j^{(i_{0})}_{k+1}-1,k)=P(i_{0},j^{(i_{0})}_{k+1}-1)$ for all $1\leq k\leq r_{i_{0}}$ , from which we see that $P(i_{0},j^{(i_{0})}_{k+1}-1,k)$ monotonically increases with $k$ (since $j^{(i_{0})}_{k+1}$ does so and $P(i_{0},j)$ monotonically increases with $j$ ) until it reaches $P(i_{0},n,r_{i_{0}})=P(i_{0},n)$ when $k=r_{i_{0}}$ . This shows that $k_{l}$ is well-defined. Similarly, we can write $P(i_{0},j,k_{l})=P(i_{0},j)$ for all $j^{(i_{0})}_{k_{l}}\leq j\leq j^{(i_{0})}_{k_{l}+1}-1$ , from which it follows that $P(i_{0},j,k_{l})$ monotonically increases with $j$ . Combining this with the definition of $k_{l}$ , which gives

[TABLE]

we conclude that $j_{0}$ is well-defined and in fact

[TABLE]

Applying Proposition 4.1 again, we obtain that, conditioned on $i_{0}=i$ , the random variable $j_{0}$ takes values in $\{i+1,\dots,n\}$ such that

[TABLE]

for all $1\leq i<j\leq n$ . Undoing the conditioning, we see that $(i_{0},j_{0})$ is a well-defined random variable taking values in $\{(i,j):1\leq i<j\leq n\}$ with probabilities

[TABLE]

*and the claim follows. *

5 Examples of applications

Now we consider several examples of objective functions for which it is possible to apply the RCDVS method and discuss different implementation details.

5.1 Quadratic function

Our first example of an objective function is the convex quadratic $f:\mathbb{R}^{n}\to\mathbb{R}$ , defined by

[TABLE]

where $A$ is a given $n\times n$ real symmetric positive semidefinite matrix and $b\in\mathbb{R}^{n}$ . This function is 1-smooth with respect to the seminorm $\|\cdot\|_{A}$ , so one can minimize it by RCDVS with

[TABLE]

For doing volume sampling, one can either use the general algorithm from Section 4.1 when the matrix $A$ is dense, or the special one from Section 4.2 when $A$ is sparse.
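For concreteness, here is a small sketch of RCDVS with $\tau=2$ on a quadratic. It assumes the objective has the form $f(x)=\frac{1}{2}\langle Ax,x\rangle-\langle b,x\rangle$ and takes $B:=A$; the sampling is done by brute force over all pairs, and the function names, test matrix and iteration count are hypothetical choices for illustration.

```python
import bisect
import random
from itertools import accumulate, combinations


def objective(A, b, x):
    """f(x) = 0.5 <Ax, x> - <b, x> (assumed form of the quadratic)."""
    n = len(x)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return 0.5 * sum(Ax[i] * x[i] for i in range(n)) - sum(b[i] * x[i] for i in range(n))


def rcdvs_quadratic(A, b, x0, iters, seed=0):
    """RCDVS sketch with tau = 2 and B = A, using brute-force pair sampling."""
    rng = random.Random(seed)
    n = len(A)
    pairs = list(combinations(range(n), 2))
    dets = [A[i][i] * A[j][j] - A[i][j] ** 2 for i, j in pairs]
    cum = list(accumulate(dets))
    x = list(x0)
    history = [objective(A, b, x)]
    for _ in range(iters):
        u = rng.random() * cum[-1]
        k = min(bisect.bisect_right(cum, u), len(pairs) - 1)
        (i, j), d = pairs[k], dets[k]
        gi = sum(A[i][c] * x[c] for c in range(n)) - b[i]  # gradient entries on S
        gj = sum(A[j][c] * x[c] for c in range(n)) - b[j]
        # x_S <- x_S - (A_{SxS})^{-1} grad_S, with the 2x2 inverse written out
        x[i] -= (A[j][j] * gi - A[i][j] * gj) / d
        x[j] -= (A[i][i] * gj - A[i][j] * gi) / d
        history.append(objective(A, b, x))
    return x, history


A = [[4.0, 1.0, 1.0], [1.0, 3.0, 0.0], [1.0, 0.0, 2.0]]
x_star = [1.0, -2.0, 0.5]
b = [sum(A[i][j] * x_star[j] for j in range(3)) for i in range(3)]
x, history = rcdvs_quadratic(A, b, [0.0, 0.0, 0.0], iters=200)
print(x, history[-1])
```

The recorded objective values are non-increasing, as guaranteed by Lemma 3.9, and the iterates approach the minimizer `x_star`.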

5.2 Separable problems

The second example gives rise to a whole family of objective functions $f:\mathbb{R}^{n}\to\mathbb{R}$ that are admissible for the RCDVS method and can be obtained by composing some smooth separable function with a linear mapping:

[TABLE]

Here $a_{1},\dots,a_{m}\in\mathbb{R}^{n}$ are given vectors and $g_{1},\dots,g_{m}:\mathbb{R}\to\mathbb{R}$ are univariate functions such that $g_{i}$ is $L_{i}$ -smooth $(L_{i}\geq 0)$ for each $1\leq i\leq m$ , meaning that it is differentiable and satisfies

[TABLE]

for all $t,t_{0}\in\mathbb{R}$ . It is easy to see from the definitions that the resulting function $f$ turns out to be 1-smooth with respect to the seminorm $\|\cdot\|_{B}$ , where

[TABLE]

Example 5.1 (Least squares).

Let $f$ be the least squares function

[TABLE]

*where $b_{1},\dots,b_{m}\in\mathbb{R}$ . In this case, $g_{i}(t):=\frac{1}{2}(t-b_{i})^{2}$ is the quadratic function with $L_{i}=1$ for each $1\leq i\leq m$ . *

Example 5.2 (Logistic regression).

Let $f$ be the logistic regression function

[TABLE]

*where $b_{1},\dots,b_{m}\in\{-1,1\}$ . In this case, $g_{i}(t):=\ln(1+e^{-b_{i}t})$ is the logistic function with $L_{i}=\frac{1}{4}$ for each $1\leq i\leq m$ . *

The matrix $B$ can be computed in $O(mn^{2})$ operations. If $n$ is sufficiently small, this can be done rather efficiently, and then one can apply RCDVS for several small values of $\tau$ (e.g. $\tau=2,3$ , etc.).

If $n$ is large, one can still use RCDVS provided that the vectors $a_{1},\dots,a_{m}$ are sparse. Indeed, observe that the number of non-zero elements in $B$ is bounded above by $\sum_{i=1}^{m}p_{i}^{2}$ , where $p_{i}$ for each $1\leq i\leq m$ denotes the number of non-zero elements in $a_{i}$ (each rank-one term $a_{i}a_{i}^{T}$ has at most $p_{i}^{2}$ non-zero elements). Furthermore, a sparse representation of $B$ (e.g. the commonly used sparse compressed row/column formats) can be obtained in

[TABLE]

operations (possibly with some logarithmic terms when a further sorting of indices is needed). After this, one can use the efficient algorithm for sparse two-element volume sampling from Section 4.2, whose preprocessing complexity is the same as (23). For example, if for all $1\leq i\leq m$ , we have $p_{i}\leq p$ , where $p$ is some sufficiently small integer, then both the computation of $B$ and the preprocessing procedure take

[TABLE]

operations, which is only linear in both $m$ and $n$ .
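A sparse curvature matrix of this kind can be assembled by looping only over the non-zero entries of each $a_{i}$. The sketch below assumes that $B$ has the form $\sum_{i=1}^{m}L_{i}a_{i}a_{i}^{T}$ and stores each vector as a dictionary mapping an index to a value; the function name and the dictionary-of-dictionaries layout are hypothetical and would still need to be converted to a CSR-like format before sampling.

```python
from collections import defaultdict


def build_curvature(vectors, smoothness):
    """Assemble B = sum_i L_i a_i a_i^T as {row: {col: value}} (assumed form of B).

    Each vector is a dict {index: value} holding only its non-zero entries,
    so the inner double loop visits pairs of non-zero entries only.
    """
    B = defaultdict(lambda: defaultdict(float))
    for a, L in zip(vectors, smoothness):
        items = sorted(a.items())
        for j, aj in items:
            for k, ak in items:
                B[j][k] += L * aj * ak
    return B


vectors = [{0: 1.0, 2: 2.0}, {1: 3.0}]  # two sparse vectors in R^3
smoothness = [1.0, 0.25]                # e.g. L_i = 1/4 for logistic terms
B = build_curvature(vectors, smoothness)
nnz = sum(len(row) for row in B.values())
print(nnz, {j: dict(row) for j, row in B.items()})
```

With vectors of bounded sparsity, the work per vector is bounded by a constant, which is what makes the overall cost linear in $m$.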

5.3 Smoothing technique

Another interesting and quite rich source of examples comes from the smoothing technique [26, 27], which we now briefly review. Let $g:\mathbb{R}^{m}\to\mathbb{R}$ be a convex function. By the Fenchel–Moreau theorem, we can write

[TABLE]

for all $y\in\mathbb{R}^{m}$ , where $g^{*}:G_{*}\to\mathbb{R}$ is the Fenchel conjugate of $g$ with the effective domain $G_{*}$ (assume that $G_{*}$ is bounded). Let $\omega^{*}:\Omega_{*}\to\mathbb{R}$ be a distance generating function on $G_{*}$ with respect to the standard Euclidean norm $\|\cdot\|$ (i.e. a non-negative closed convex function with domain $\Omega_{*}\supseteq G_{*}$ that is 1-strongly convex on $G_{*}$ with respect to $\|\cdot\|$ ). Let $\mu>0$ , and let $g_{\mu}:\mathbb{R}^{m}\to\mathbb{R}$ be the function

[TABLE]

It is known that $g_{\mu}$ satisfies $g_{\mu}(y)\leq g(y)\leq g_{\mu}(y)+\mu\max_{G_{*}}\omega^{*}$ for all $y\in\mathbb{R}^{m}$ , and moreover it is $\frac{1}{\mu}$ -smooth with respect to $\|\cdot\|$ . Thus, $g_{\mu}$ can be seen as a smooth uniform approximation of $g$ , where the parameter $\mu$ controls both the accuracy of approximation and its level of smoothness.

Now let $A$ be an $m\times n$ real matrix, let $b\in\mathbb{R}^{m}$ , and define $f,f_{\mu}:\mathbb{R}^{n}\to\mathbb{R}$ by

[TABLE]

It is easy to see that $f_{\mu}(x)\leq f(x)\leq f_{\mu}(x)+\mu\max_{G_{*}}\omega^{*}$ for all $x\in\mathbb{R}^{n}$ . Furthermore, the function $f_{\mu}$ is $\frac{1}{\mu}$ -smooth with respect to the seminorm $\|\cdot\|_{A^{T}A}$ . Thus, the problem of minimizing $f$ can be replaced with the problem of minimizing its smooth approximation $f_{\mu}$ for some carefully chosen value of $\mu$ . This latter problem can be solved by RCDVS with

[TABLE]

This matrix has exactly the same structure as the one from (22).

Example 5.3.

The function

[TABLE]

*is obtained from $g(y):=\max\{y_{1},\dots,y_{m}\}$ using the negative entropy function $\omega^{*}(s):=\sum_{i=1}^{m}s_{i}\ln s_{i}$ with domain $\Omega_{*}:=\{s\in\mathbb{R}^{m}:\sum_{i=1}^{m}s_{i}=1;\,s_{1},\dots,s_{m}\geq 0\}$ . *

Example 5.4.

The function

[TABLE]

where $H_{\mu}:\mathbb{R}\to\mathbb{R}$ is the Huber function

[TABLE]

*is obtained from the $l^{1}$ -norm $g(y):=\|y\|_{1}$ using the Euclidean distance generating function $\omega^{*}(s):=\frac{1}{2}\|s\|^{2}$ with domain $\Omega_{*}:=\mathbb{R}^{m}$ . *
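The approximation guarantee can be checked directly in the scalar case. The sketch below assumes the standard form of the Huber function, $H_{\mu}(t)=\frac{t^{2}}{2\mu}$ for $|t|\leq\mu$ and $H_{\mu}(t)=|t|-\frac{\mu}{2}$ otherwise, which is what the smoothing construction yields for $g(t)=|t|$ with $\omega^{*}(s)=\frac{1}{2}s^{2}$ on $[-1,1]$. It then verifies the sandwich $H_{\mu}(t)\leq|t|\leq H_{\mu}(t)+\frac{\mu}{2}$ on a grid.

```python
def huber(t, mu):
    """Standard Huber function (assumed form): quadratic near zero, linear in the tails."""
    return t * t / (2.0 * mu) if abs(t) <= mu else abs(t) - mu / 2.0


mu = 0.01
grid = [k / 1000.0 for k in range(-50, 51)]  # points in [-0.05, 0.05]
gaps = [abs(t) - huber(t, mu) for t in grid]
print(min(gaps), max(gaps))
```

The gap $|t|-H_{\mu}(t)$ stays between $0$ and $\mu/2$, so summing over $m$ coordinates gives a uniform approximation error of at most $m\mu/2$ for the $l^{1}$ -norm.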

Example 5.5.

The function

[TABLE]

*is obtained from $g(y):=\|y\|$ using the function $\omega^{*}(s):=1-\sqrt{1-\|s\|^{2}}$ with domain $\Omega_{*}:=\{s\in\mathbb{R}^{m}:\|s\|\leq 1\}$ . *

5.4 Combinations of previous examples

Finally, one can take non-negative linear combinations of the already considered examples. Indeed, let $f_{1},\dots,f_{r}:\mathbb{R}^{n}\to\mathbb{R}$ be functions, where $f_{i}$ for each $1\leq i\leq r$ is 1-smooth with respect to the seminorm $\|\cdot\|_{B_{i}}$ for some $n\times n$ real symmetric positive semidefinite matrix $B_{i}$ , and let $\alpha_{1},\dots,\alpha_{r}>0$ . Then the sum $f:=\sum_{i=1}^{r}\alpha_{i}f_{i}$ is 1-smooth with respect to the seminorm $\|\cdot\|_{B}$ with $B:=\sum_{i=1}^{r}\alpha_{i}B_{i}$ .

Example 5.6.

Let $f$ be the $l^{2}$ -regularized logistic regression function

[TABLE]

where $a_{1},\dots,a_{m}\in\mathbb{R}^{n}$ , $b_{1},\dots,b_{m}\in\{-1,1\}$ , $\gamma>0$ . In this case,

[TABLE]

6 Numerical experiments

In this section, we investigate the practical behavior of RCDVS and compare it with that of a couple of other already known methods.

The first one is the RCD method. Recall that it is in fact the same method as RCDVS with $\tau=1$ . In comparing RCDVS with RCD, we are interested in investigating how the actual acceleration ratio of RCDVS corresponds to our theoretical prediction (see (1))

[TABLE]

The difference between the actual acceleration ratio and the theoretical one is that the latter is the ratio of the theoretical upper bounds on the performance of the methods, while the former is the ratio of the real number of iterations performed by the methods on a particular problem instance.

The second one, denoted SDNA, is Method 1 from [16], which uses so-called $\tau$ -nice sampling. As was already discussed in Section 1.1, this method is exactly the same as RCDVS, with the only difference that it uses uniform sampling (without replacement) instead of volume sampling. In comparing RCDVS with SDNA, we are interested in seeing how important the sampling strategy is to the performance of the general coordinate descent scheme that we consider in this paper.

6.1 Quadratic function

For the first experiment, we have chosen the convex quadratic function from Section 5.1 and set $\tau=2$ . Our goal is to observe how the behavior of the methods changes when the spectral gap between the two largest eigenvalues of $A$ increases. For this, we construct the matrix $A$ as follows. First, we choose some $\lambda_{1}\geq\lambda_{2}:=100$ and set $A:=\operatorname*{Diag}(\lambda_{1},\lambda_{2},1,\dots,1)$ . Then we successively perform 10 random Householder reflections $A\mapsto(I-2uu^{T})A(I-2uu^{T})$ on the rows and columns of $A$ , where each time the direction $u$ is sampled uniformly from the unit sphere in $\mathbb{R}^{n}$ . Observe that, by construction, the eigenvalues of $A$ are exactly $\lambda_{1},\lambda_{2},1,\dots,1$ . Once $A$ is constructed, we set $b:=Ax^{*}$ , where $x^{*}$ is generated from the uniform distribution on the hypercube $[-1,1]^{n}$ and run each method from $x_{0}:=0$ until the objective value becomes $\varepsilon$ -close to the optimal one for $\varepsilon:=0.01$ . This procedure is repeated 10 times to take into account the randomness in the data.
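The construction of the test matrix can be sketched as follows. This is an illustrative re-implementation under our own naming, not the authors' code from the repository. It uses the expansion $(I-2uu^{T})A(I-2uu^{T})=A-2u(Au)^{T}-2(Au)u^{T}+4(u^{T}Au)uu^{T}$, valid for symmetric $A$, and draws $u$ uniformly from the unit sphere by normalizing a standard Gaussian vector.

```python
import math
import random


def random_unit_vector(n, rng):
    """Uniform direction on the unit sphere: a normalized standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def reflect(A, u):
    """Return (I - 2uu^T) A (I - 2uu^T) for a symmetric matrix A and unit vector u."""
    n = len(A)
    Au = [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
    uAu = sum(u[i] * Au[i] for i in range(n))
    return [[A[i][j] - 2.0 * u[i] * Au[j] - 2.0 * Au[i] * u[j] + 4.0 * uAu * u[i] * u[j]
             for j in range(n)] for i in range(n)]


def make_test_matrix(n, lam1, lam2=100.0, reflections=10, seed=0):
    """Diag(lam1, lam2, 1, ..., 1) conjugated by random Householder reflections."""
    rng = random.Random(seed)
    A = [[0.0] * n for _ in range(n)]
    for i, lam in enumerate([lam1, lam2] + [1.0] * (n - 2)):
        A[i][i] = lam
    for _ in range(reflections):
        A = reflect(A, random_unit_vector(n, rng))
    return A


A = make_test_matrix(6, lam1=1000.0)
trace = sum(A[i][i] for i in range(6))
fro2 = sum(v * v for row in A for v in row)
print(trace, fro2)
```

Since each reflection is an orthogonal similarity transformation, the spectrum is unchanged; the trace and the squared Frobenius norm printed above therefore equal the sum and the sum of squares of the prescribed eigenvalues, up to round-off.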

The results of the experiment for different values of $n$ and the eigenvalue ratio $\lambda_{1}/\lambda_{2}$ are shown in Figure 2. (All experimental results in this paper were obtained on a laptop with the Intel Core i7-8650U CPU (1.90GHz x 8) and 16 GB DDR4 RAM; no parallelism was used. The source code is available at https://github.com/arodomanov/rcdvs.) Each column in this table displays the median value of the corresponding statistic: “It” is the total number of iterations (in thousands) taken by the method until its termination; “T” is the corresponding total running time (in seconds); “Acc” is the actual acceleration ratio of the method over RCD in terms of the number of iterations (this ratio may be less than 1 when there is no acceleration); “%” expresses the “Acc” for RCDVS as a percent of the theoretical prediction (24). (To keep the table concise, we report only the first few significant digits for each statistic. For example, in the first row of the table, the value 2 in the column “It” for RCDVS may actually stand for 2.1, 2.53, or even 2.999.) From this table, we can see that the number of iterations for RCD and SDNA grows significantly with the spectral gap between the two largest eigenvalues, while for RCDVS there is almost no growth at all. As a result, RCDVS dramatically outperforms the other two methods both in terms of iterations and total running time, especially for large values of $\lambda_{1}/\lambda_{2}$ . By inspecting the “Acc” column, we observe that the actual acceleration ratio of RCDVS with respect to RCD monotonically increases with the spectral gap, which is natural. More importantly, the “%” column always takes values around 100, which means that our theoretical prediction (24) is quite accurate. SDNA, on the contrary, performs even worse than RCD in most cases.

6.2 Huber function

In the second experiment, we still use $\tau=2$ but now consider the Huber function from Example 5.4. In contrast to the previous one, this objective is non-strongly convex.

The design of the experiment is almost the same as before. To generate the matrix $A$ , we choose $\lambda_{1}\geq\lambda_{2}:=100$ , set $A$ to be the $m\times n$ diagonal matrix with elements $\sqrt{\lambda_{1}/\mu},\sqrt{\lambda_{2}/\mu},\sqrt{1/\mu},\dots,\sqrt{1/\mu}$ , where $\mu:=0.01$ , and then successively perform 10 random Householder reflections $A\mapsto(I-2uu^{T})A(I-2vv^{T})$ on the rows and columns of $A$ , where each time $u$ and $v$ are uniformly distributed on the unit spheres in $\mathbb{R}^{m}$ and $\mathbb{R}^{n}$ respectively. Such a construction ensures that the matrix $B:=\frac{1}{\mu}A^{T}A$ (see Section 5.3) has eigenvalues $\lambda_{1},\lambda_{2},1,\dots,1$ (plus $n-m$ zeros when $m<n$ ). The vector $b$ , the starting point $x_{0}$ and the termination criterion for the methods are absolutely the same as in the previous experiment.

One additional remark should be made in the case when $m<n$ . Recall that in this situation $B$ is in fact degenerate and hence some of its principal submatrices may not be invertible. This does not cause any difficulties for RCD and RCDVS since the probability of choosing a degenerate submatrix in these methods is zero. However, this is not so for SDNA and thus, strictly speaking, SDNA is not well-defined in the case of a degenerate matrix $B$ . To fix this problem, we use the Moore–Penrose pseudoinverse instead of the usual inverse in this method. (We have also tried to simply skip the update when a degenerate submatrix has been chosen, but this strategy turned out to work somewhat worse.)
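For $\tau=2$ the pseudoinverse of a principal submatrix has a simple closed form, which the following sketch illustrates. It is an illustration only, not the implementation used in the experiments; the function name and the tolerance `tol` are assumptions. A symmetric positive semidefinite $2\times 2$ matrix of rank one equals $t\,vv^{T}$ with $t$ its trace and $v$ a unit vector, so its pseudoinverse is the matrix itself divided by $t^{2}$.

```python
def pinv_sym_psd_2x2(a, b, c, tol=1e-12):
    """Moore-Penrose pseudoinverse of the symmetric PSD matrix [[a, b], [b, c]].

    Returns the entries (p11, p12, p22) of the (symmetric) pseudoinverse.
    """
    trace = a + c
    if trace <= tol:
        # Zero matrix: its pseudoinverse is again the zero matrix.
        return (0.0, 0.0, 0.0)
    det = a * c - b * b
    if det > tol * trace * trace:
        # Nondegenerate case: the ordinary inverse.
        return (c / det, -b / det, a / det)
    # Rank one: M = trace * v v^T with a unit vector v, hence M^+ = M / trace^2.
    s = trace * trace
    return (a / s, b / s, c / s)

# Degenerate submatrix [[4, 2], [2, 1]] (determinant 0, trace 5).
p11, p12, p22 = pinv_sym_psd_2x2(4.0, 2.0, 1.0)
```

Under volume sampling such a submatrix would be selected with probability proportional to its determinant, that is, zero, which is why the issue arises only for SDNA.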

The results of the experiment are shown in Figure 3. In the same way as before, we see that RCDVS always significantly outperforms RCD and its actual acceleration ratio is quite close to the theoretical prediction. However, it is interesting that this time SDNA works quite well for problems with $m<n$ (almost comparably to RCDVS). In particular, its number of iterations is almost independent of the spectral gap. Nevertheless, for $m>n$ its behavior is the same as in the previous experiment.

Now let us consider the same problem but with much bigger dimensions. For this, we slightly change the way we construct $A$ . This time, we generate it as a sparse matrix using the procedure described above but with sparse directions $u$ and $v$ that are chosen as follows. First, we take some integer $1\leq p\leq\min\{m,n\}$ that controls the resulting sparsity level of $B$ . After this, we pick $p$ random uniformly distributed indices and fill the positions corresponding to these indices with a random vector uniformly distributed on the unit sphere in $\mathbb{R}^{p}$ , the rest of the positions are set to zero. Of course, in order to work with a sparse problem, every method has to be properly modified. In particular, we should use the special algorithm from Section 4.2 for doing volume sampling in RCDVS.
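The sparse directions described above can be generated as follows. This is a hedged sketch: the function name, the dimension 1000, the value $p=5$ and the seed are illustrative assumptions, and a dense list is used only for readability (a large-scale implementation would presumably store just the index-value pairs).

```python
import math
import random

def sparse_unit_direction(dim, p, rng):
    """Unit vector in R^dim supported on p uniformly chosen coordinates."""
    support = rng.sample(range(dim), p)           # p distinct random indices
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]   # normalized Gaussian = uniform on the sphere in R^p
    norm = math.sqrt(sum(x * x for x in g))
    u = [0.0] * dim                               # all remaining positions stay zero
    for idx, val in zip(support, g):
        u[idx] = val / norm
    return u

rng = random.Random(0)
u = sparse_unit_direction(dim=1000, p=5, rng=rng)
```

Because $u$ has only $p$ nonzero entries, the reflection $I-2uu^{T}$ changes at most $p$ rows (or columns) of the matrix it is applied to, which is what keeps the resulting matrix sparse.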

The results for the bigger dimensions are shown in Figure 4. Somewhat surprisingly, the problems with $m<n$ are now even more difficult for SDNA than those with $m>n$ . Otherwise, the overall picture is the same as previously.

6.3 Logistic regression

Now we consider the $l^{2}$ -regularized logistic regression function from Example 5.6, which is very popular in the context of machine learning. The termination criterion for each method is the same as before with the difference that now the optimal objective value is unknown and we have to calculate it numerically in advance. Nevertheless, this auxiliary computation is needed only for our presentation and does not affect the actual performance of the methods in any way.

We set $\gamma:=1$ since this is a default value of the regularization parameter used in practice. However, instead of generating the data $a_{1},\dots,a_{m}$ and $b_{1},\dots,b_{m}$ artificially as we did before, now we take some real-world data from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), which is summarized in Figure 5. Here $m$ is the number of observations and $n$ is the number of features. The next 4 columns display the four largest eigenvalues of the matrix $B:=\frac{1}{4}\sum_{i=1}^{m}a_{i}a_{i}^{T}+\gamma I$ (see Example 5.6), while the last 3 columns show the theoretical acceleration ratio (24) for the three corresponding values of $\tau$ . The main reason why we are presenting this table is to demonstrate that it is not uncommon for real data to have significant spectral gaps between the first few largest eigenvalues (although these gaps are not as big as in our previous experiments with artificial data).
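As an illustration of the formula for $B$ above, the following sketch assembles $B=\frac{1}{4}\sum_{i=1}^{m}a_{i}a_{i}^{T}+\gamma I$ from a list of feature vectors. The function name and the tiny two-row data set are assumptions made for the example; real LIBSVM data would first have to be parsed into such rows.

```python
def curvature_matrix(rows, gamma):
    """B = (1/4) * sum_i a_i a_i^T + gamma * I, where the a_i are the given rows."""
    n = len(rows[0])
    B = [[0.0] * n for _ in range(n)]
    for a in rows:
        for j in range(n):
            if a[j] == 0.0:
                continue  # a zero entry contributes nothing to row j of a a^T
            for k in range(n):
                B[j][k] += 0.25 * a[j] * a[k]
    for j in range(n):
        B[j][j] += gamma  # regularization term gamma * I
    return B

rows = [[1.0, 2.0], [0.0, 2.0]]  # toy data: m = 2 observations, n = 2 features
B = curvature_matrix(rows, gamma=1.0)
```

For $\tau=1$ only the diagonal of $B$ is needed, while $\tau=2$ additionally requires the off-diagonal entries that enter the $2\times 2$ principal minors.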

For the results of the experiment, see Figure 6, where, in contrast to the previous two experiments, we additionally consider several small values of $\tau$ for RCDVS and SDNA. We can see that, on the real data, the method SDNA looks much better than previously and in many cases it outperforms RCD. Nevertheless, RCDVS is still a winner, and its actual acceleration rate is usually even faster than predicted by theory.

7 Conclusion

We have analyzed the randomized coordinate descent method with volume sampling (RCDVS) for minimizing a function that is smooth with respect to some positive semidefinite matrix $B$ . In its iterations, this method randomly selects $\tau$ -element subsets of coordinates with probabilities proportional to the determinants of principal submatrices of $B$ . We have shown, both theoretically and empirically, that the increase in $\tau$ from $\tau_{1}$ to $\tau_{2}$ improves the convergence rate up to the factor (1), which depends on the spectral gap between the $\tau_{1}$ -th and $\tau_{2}$ -th eigenvalues of $B$ .
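To get a feeling for the size of this factor, the sketch below evaluates the ratio $R_{\lambda}(\tau_{1},\tau_{2}):=\sum_{i=\tau_{1}}^{n}\lambda_{i}/\sum_{i=\tau_{2}}^{n}\lambda_{i}$ on the spectrum $\lambda_{1},100,1,\dots,1$ used in Section 6.1. It assumes that the factor (1) is exactly this ratio with the eigenvalues sorted in decreasing order; the function name and the choice $n=100$ are illustrative.

```python
def spectral_ratio(eigs, tau1, tau2):
    """R_lambda(tau1, tau2) = sum_{i >= tau1} lambda_i / sum_{i >= tau2} lambda_i.

    Indices are 1-based and eigenvalues are assumed sorted in decreasing order.
    """
    lam = sorted(eigs, reverse=True)
    return sum(lam[tau1 - 1:]) / sum(lam[tau2 - 1:])

n = 100  # hypothetical dimension
ratios = {}
for lam1 in (100.0, 1000.0, 10000.0):
    eigs = [lam1, 100.0] + [1.0] * (n - 2)
    ratios[lam1] = spectral_ratio(eigs, 1, 2)
    print(f"lambda_1 / lambda_2 = {lam1 / 100.0:g}: R(1, 2) = {ratios[lam1]:.3f}")
```

The ratio grows roughly linearly in $\lambda_{1}$ once $\lambda_{1}$ dominates the rest of the spectrum, in line with the qualitative behavior reported in the experiments.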

However, there are still many important directions for further research:

• Accelerated method. In addition to the basic randomized coordinate descent, there also exists the accelerated one [10, 11], where the coordinates are sampled with probabilities proportional to the square roots of the diagonal elements of $B$ . Is it possible to accelerate RCDVS in a similar manner, possibly using the square roots of the determinants as probabilities?

• Constrained and composite optimization. We have considered only the basic smooth unconstrained minimization. However, most optimization methods can often be extended to handle problems involving some simple constraints (e.g. box constraints or linear ones), or to work with composite functions (when the objective is the sum of a smooth function and a simple convex, possibly non-smooth, function), while still retaining the original convergence rate. Can we generalize RCDVS to these settings? (Currently, the main problem here is that, to the best of our knowledge, there are no corresponding results even for the case $\tau=1$ , that is, for the RCD method.)

• Special volume sampling algorithms or different kinds of sampling. Although the results that we have obtained hold for any value of $\tau$ , from the practical point of view, currently there is only one choice that is suitable for large scale problems, namely $\tau=2$ (apart from the previously known $\tau=1$ ). The problem is that currently there are no algorithms for volume sampling whose preprocessing/sampling complexity is appropriate for large scale applications (e.g. $O(n^{2})$ instead of $O(n^{3})$ ). However, this does not mean that it is not possible to devise such algorithms, especially when the matrix possesses special structure (e.g. sparse, banded, low-rank etc.). Another interesting question is whether volume sampling can be replaced with some other kind of sampling which is more practical but still gives similar results.
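For reference, two-element volume sampling can be written down naively by enumerating all pairs. The sketch below is a baseline illustration only: it is not the special sparse algorithm of Section 4.2, it needs $O(n^{2})$ time and memory, and the function names and the small matrix are assumptions made for the example.

```python
import itertools
import random

def two_element_volume_weights(B):
    """Weights det(B_SS) = B_ii * B_jj - B_ij * B_ji for all pairs S = {i, j}, i < j."""
    n = len(B)
    pairs = list(itertools.combinations(range(n), 2))
    # Clamp at zero: for a PSD matrix the minors are nonnegative up to rounding.
    weights = [max(0.0, B[i][i] * B[j][j] - B[i][j] * B[j][i]) for i, j in pairs]
    return pairs, weights

def sample_pair(B, rng):
    """Draw one pair with probability proportional to its 2x2 principal minor."""
    pairs, weights = two_element_volume_weights(B)
    return rng.choices(pairs, weights=weights, k=1)[0]

B = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]
pairs, weights = two_element_volume_weights(B)
pair = sample_pair(B, random.Random(0))
```

In this toy example the pair $\{0,1\}$ has the largest minor, so it is the most likely choice. For general $\tau$ the number of subsets grows as $\binom{n}{\tau}$, so plain enumeration quickly becomes impractical, which is the difficulty raised in the last item above.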

Acknowledgments

The authors are thankful to the anonymous reviewers for their attentive reading and valuable comments.

Appendix A Proof of Theorem 3.12

Proof A.1.

Note that $\operatorname*{Argmin}f$ is non-empty and compact by the Weierstrass theorem ( $f$ is continuous as a convex function with open domain, the sublevel set $L_{f}(x_{0})$ is bounded by the statement and is closed as the inverse image of a closed set under a continuous mapping). Hence, both $\min$ and $\max$ in the definition of $D_{\tau}$ are attained and, in particular, $D_{\tau}$ is finite.

Using Lemma 3.9, we obtain $x_{k}\in L_{f}(x_{0})$ and

[TABLE]

for all $k\geq 0$ .

Let $k\geq 0$ . By the convexity of $f$ and the Cauchy–Schwarz inequality, we have

[TABLE]

Hence, by the definition of $D_{\tau}$ , it follows that

[TABLE]

from which, by Jensen’s inequality, we conclude that

[TABLE]

Combining (25) and (26) and writing $\delta_{k}:=\mathbb{E}f(x_{k})-\min f$ , we finally obtain

[TABLE]

for all $k\geq 0$ . Now the claim follows by a standard argument. Indeed, we can assume without loss of generality that $\delta_{k}$ is strictly positive for each $k\geq 0$ . Then, using (27) together with the monotonicity of $\delta_{k}$ , we obtain

[TABLE]

for all $k\geq 0$ . By induction, it follows that

[TABLE]

for all $k\geq 0$ , where the last inequality is a consequence of (27) and the positivity of $\delta_{1}$ .
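The standard argument used in this proof can be spelled out generically. The following sketch assumes that (27) has the form $\delta_{k}-\delta_{k+1}\geq c\,\delta_{k}^{2}$ for some constant $c>0$; the symbol $c$ is introduced here only for illustration and stands in for the problem-dependent constant of (27), which is not reproduced in this sketch.

```latex
% Step 1: divide the assumed recursion by \delta_k \delta_{k+1} > 0
% and use the monotonicity \delta_{k+1} \leq \delta_k.
\frac{1}{\delta_{k+1}} - \frac{1}{\delta_k}
  = \frac{\delta_k - \delta_{k+1}}{\delta_k \delta_{k+1}}
  \geq \frac{c\,\delta_k^2}{\delta_k \delta_{k+1}}
  = \frac{c\,\delta_k}{\delta_{k+1}}
  \geq c.

% Step 2: sum over the first k iterations (induction).
\frac{1}{\delta_k} \geq \frac{1}{\delta_0} + c\,k .

% Step 3: from the recursion at k = 0 and \delta_1 > 0 we get
% c\,\delta_0^2 \leq \delta_0 - \delta_1 < \delta_0, i.e. 1/\delta_0 > c, hence
\delta_k \leq \frac{1}{\frac{1}{\delta_0} + c\,k} \leq \frac{1}{c\,(k+1)} .
```

Under this assumption the method has an $O(1/k)$ rate whose constant depends only on $c$ and not on the initial residual $\delta_{0}$.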

Appendix B Proof of Theorem 3.14

Proof B.1.

Let $0\leq k\leq K$ . By the same argument as in the proof of Theorem 3.12,

[TABLE]

At the same time, by strong convexity of $f$ in the norm $\|\cdot\|_{B_{\tau}}$ , we have the following standard inequality for strongly convex functions, which can be easily proved from the definition (14) by minimizing both sides in $y\in\mathbb{R}^{n}$ :

[TABLE]

Combining (28) and (29), we obtain that

[TABLE]

or, equivalently, after rearranging, that

[TABLE]

Thus, subtracting $\min f$ from both sides, we get

[TABLE]

The claim now follows by induction.

Bibliography


[1] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.
[2] I. Necoara, Y. Nesterov, and F. Glineur. Random block coordinate descent methods for linearly constrained optimization over networks. Journal of Optimization Theory and Applications, 173(1):227–254, 2017.
[3] O. Fercoq and P. Richtárik. Optimization in high dimensions via accelerated, parallel, and proximal coordinate descent. SIAM Review, 58(4):739–771, 2016.
[4] I. Necoara and D. Clipici. Parallel random coordinate descent method for composite minimization: Convergence analysis and error bounds. SIAM Journal on Optimization, 26(1):197–226, 2016.
[5] J. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pages 321–328, USA, 2011. Omnipress.
[6] Z. Peng, M. Yan, and W. Yin. Parallel and distributed sparse optimization. In 2013 Asilomar Conference on Signals, Systems and Computers, pages 659–646. IEEE, 2013.
[7] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[8] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.


