Faster randomized block Kaczmarz algorithms

Ion Necoara

arXiv:1902.09946·math.OC·February 27, 2019·SIAM J. Matrix Anal. Appl.


TL;DR

This paper introduces a family of randomized block Kaczmarz algorithms optimized for distributed systems, with proven linear convergence and novel strategies for block selection and stepsize extrapolation.

Contribution

It develops a new framework for randomized block Kaczmarz algorithms with convergence guarantees and practical strategies for distributed implementation.

Findings

01. Algorithms converge linearly in expectation.

02. Performance depends on matrix conditioning and block sampling.

03. Resolves an open problem on the practical efficiency of extrapolated methods.


Tables

Table 1: The key convergence results obtained in this paper for algorithm RBK for the three choices of the extrapolated stepsize. Here, matrix $A$ is normalized and $\lambda_{\max}$ and $\lambda_{\min}$ ($\lambda_{\min}^{\text{nz}}$) denote the largest and smallest (non-zero) eigenvalue of $AA^{T}$, respectively.

RBK algorithm	Convergence rates	Remarks
constant stepsize, normalized $A$	$\mathbf{E}\left[\|x^{k}-x_{k}^{*}\|^{2}\right] \leq \left(1-\frac{\tau}{m}\frac{\lambda_{\min}^{\text{nz}}}{\lambda_{\max}^{\text{block}}}\right)^{k}\|x^{0}-x_{0}^{*}\|^{2}$	Theorem 4.1
adaptive stepsize, normalized $A$	$\mathbf{E}\left[\|x^{k}-x_{k}^{*}\|^{2}\right] \leq \left(1-\frac{\tau}{m}\frac{\lambda_{\min}^{\text{nz}}}{\lambda_{\max}^{\text{block}}}\right)^{k}\|x^{0}-x_{0}^{*}\|^{2}$	Theorem 4.3
Chebyshev stepsize, normalized $A$ and $\lambda_{\min}>0$	$\left\|\mathbf{E}\left[x^{k}-x_{k}^{*}\right]\right\|^{2} \leq \left(1-\sqrt{\frac{\lambda_{\min}}{\lambda_{\max}}}\right)^{2k}\|x^{0}-x_{0}^{*}\|^{2}$	Theorem 5.1

Equations

$$\text{Find } x \ \text{ s.t. } \ Ax=b. \tag{1}$$

$$x^{k+1} = x^{k} - \alpha_{k}\frac{a_{i_{k}}^{T}x^{k}-b_{i_{k}}}{\|a_{i_{k}}\|^{2}}\,a_{i_{k}}. \tag{2}$$

$$x^{k+1} = x^{k} - \alpha_{k}A_{J_{k}}^{\dagger}\left(A_{J_{k}}x^{k}-b_{J_{k}}\right), \tag{3}$$

$$x^{k+1} = x^{k} - \alpha_{k}\left(\sum_{i\in J_{k}}\omega_{i}\frac{a_{i}^{T}x^{k}-b_{i}}{\|a_{i}\|^{2}}\,a_{i}\right), \tag{4}$$

$$\alpha_{k} = \frac{2\sum_{i\in J_{k}}\bar{\omega}_{i}\left(a_{i}^{T}x^{k}-b_{i}\right)^{2}}{\left\|\sum_{i\in J_{k}}\bar{\omega}_{i}\left(a_{i}^{T}x^{k}-b_{i}\right)a_{i}\right\|^{2}}, \tag{5}$$

$$\lambda_{\max}^{\text{block}} = \max_{J\sim\mathbf{P}}\lambda_{\max}\left(A_{J}^{T}A_{J}\right)$$

$$(\text{RBK}):\ \text{Draw at each step a sample } J_{k}\sim\mathbf{P} \text{ and update: } \quad x^{k+1} = x^{k} - \alpha_{k}\left(\sum_{i\in J_{k}}\omega_{i}^{k}\frac{a_{i}^{T}x^{k}-b_{i}}{\|a_{i}\|^{2}}\,a_{i}\right),$$

$${\cal O}\left(\frac{m\,\lambda_{\max}^{\text{block}}}{\tau\,\lambda_{\min}^{\text{nz}}}\log\frac{1}{\varepsilon}\right),$$

$${\cal O}\left(\tau n\cdot\frac{m\,\lambda_{\max}^{\text{block}}}{\tau\,\lambda_{\min}^{\text{nz}}}\log\frac{1}{\varepsilon}\right) \quad\text{vs.}\quad {\cal O}\left(n\cdot\frac{m}{\lambda_{\min}^{\text{nz}}}\log\frac{1}{\varepsilon}\right).$$

$${\cal O}\left(\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}\log\frac{1}{\varepsilon}\right),$$

$$\min_{x\in\mathbb{R}^{n}}\ \frac{1}{2m}\|Ax-b\|^{2} \quad \left(:=\frac{1}{2m}\sum_{i=1}^{m}\left(a_{i}^{T}x-b_{i}\right)^{2}\right). \tag{7}$$

$$\min_{x\in\mathbb{R}^{n}}\ \frac{1}{2}\|x\|^{2} \quad\text{s.t.}\quad Ax=b. \tag{8}$$

$$\min_{y\in\mathbb{R}^{m}}\ \frac{1}{2}\|A^{T}y\|^{2}-b^{T}y, \tag{9}$$

$$\min_{x}\ \|x-x^{k}\|^{2} \quad\text{s.t.}\quad a_{i_{k}}^{T}x=b_{i_{k}}.$$

$$x^{k+1} = x^{k} - \frac{\alpha_{k}}{\|a_{i_{k}}\|^{2}}\nabla f_{i_{k}}(x^{k}).$$

$$y^{k+1} = y^{k} - \frac{\alpha_{k}}{\|a_{i_{k}}\|^{2}}\nabla_{i_{k}}g(y^{k})\cdot e_{i_{k}},$$

$$\mathbf{E}\left[\|x^{k}-x_{k}^{*}\|^{2}\right] \leq \left(1-\frac{\lambda_{\min}^{\text{nz}}(A^{T}A)}{\|A\|_{F}^{2}}\right)^{k}\|x^{0}-x_{0}^{*}\|^{2}, \tag{10}$$

$$\|x^{k+1}-x^{*}\|^{2} = \|x^{k}-x^{*}\|^{2} - \alpha(2-\alpha)\frac{\left(a_{i}^{T}x^{k}-b_{i}\right)^{2}}{\|a_{i}\|^{2}}.$$

$$\mathbf{E}_{i}\left[\|x^{k+1}-x^{*}\|^{2}\mid x^{k}\right]$$

$$\|Ax^{k}-b\|^{2} = \|A(x^{k}-x_{k}^{*})\|^{2} \geq \lambda_{\min}^{\text{nz}}(AA^{T})\,\|x^{k}-x_{k}^{*}\|^{2}.$$

$$\mathbf{E}\left[\|x^{k+1}-x_{k}^{*}\|^{2}\right] \leq \left(1-\frac{\alpha(2-\alpha)\lambda_{\min}^{\text{nz}}(AA^{T})}{\|A\|_{F}^{2}}\right)\mathbf{E}\left[\|x^{k}-x_{k}^{*}\|^{2}\right]. \tag{11}$$

$$p_{i} = \mathbf{P}(i\in J).$$

$$\mathbf{E}_{J}\left[\sum_{i\in J}\theta_{i}\right] = \sum_{J\subseteq[m]}\left(\sum_{i\in J}\theta_{i}\right)\mathbf{P}(J) = \sum_{i\in[m]}\theta_{i}\left(\sum_{J:\,i\in J}\mathbf{P}(J)\right) = \sum_{i\in[m]}p_{i}\theta_{i}.$$

$$p_{i} = \mathbf{P}(i\in J) = \sum_{J:\,i\in J}\mathbf{P}(J) = \frac{\binom{m-1}{\tau-1}}{\binom{m}{\tau}} = \frac{\tau}{m}.$$

$$p_{i} = \frac{1}{\ell}.$$

$$x^{k+1} = x^{k} - \alpha_{k}\frac{A_{J_{k}}^{T}\left(A_{J_{k}}x^{k}-b_{J_{k}}\right)}{\|A_{J_{k}}\|_{F}^{2}} \quad\text{or}\quad x^{k+1} = x^{k} - \alpha_{k}\frac{A_{J_{k}}^{T}D_{J_{k}}\left(A_{J_{k}}x^{k}-b_{J_{k}}\right)}{\tau_{k}},$$

$$x^{k+1} = x^{k} - \frac{\tau_{k}\alpha_{k}}{\sum_{i\in J_{k}}\|a_{i}\|^{2}}\left(\frac{1}{\tau_{k}}\sum_{i\in J_{k}}\left(a_{i}^{T}x^{k}-b_{i}\right)a_{i}\right).$$

$$x^{k+1} = x^{k} - \frac{\alpha_{k}}{\sum_{i\in J_{k}}\|a_{i}\|^{2}}\left(\sum_{i\in J_{k}}\left(a_{i}^{T}x^{k}-b_{i}\right)a_{i}\right).$$

$$x^{k+1} = x^{k} - \alpha\left(\sum_{i\in J_{k}}\omega_{i}\frac{a_{i}^{T}x^{k}-b_{i}}{\|a_{i}\|^{2}}\,a_{i}\right).$$

$$\lambda_{\max}^{\text{block}} = \max_{J\sim\mathbf{P}}\lambda_{\max}\left(A_{J}^{T}\,\text{diag}\left(\frac{1}{\|a_{i}\|^{2}},\ i\in J\right)A_{J}\right).$$


Full text


Funding: This work was supported by the Executive Agency for Higher Education, Research and Innovation Funding (UEFISCDI), Romania, PNIII-P4-PCE-2016-0731, project ScaleFreeNet, no. 39/2017. The author thanks Yu. Nesterov and F. Glineur from Universite Catholique de Louvain for useful discussions on the Chebyshev-based Kaczmarz scheme.

Ion Necoara, Department of Automatic Control and Systems Engineering, University Politehnica Bucharest, Splaiul Independentei 313, Bucharest, 060042, Romania.

Abstract

The Kaczmarz algorithm is a simple iterative scheme for solving consistent linear systems. At each step, the method projects the current iterate onto the solution space of a single constraint. Hence, it requires very low cost per iteration and storage, and it has a linear rate of convergence. Distributed implementations of Kaczmarz have become, in recent years, the de facto architectural choice for large-scale linear systems. Therefore, in this paper we develop a family of randomized block Kaczmarz algorithms that uses at each step a subset of the constraints and extrapolated stepsizes, and can be deployed on distributed computing units. Our approach is based on several new ideas and tools, including stochastic selection rule for the blocks of rows, stochastic conditioning of the linear system, and novel strategies for designing extrapolated stepsizes. We prove that randomized block Kaczmarz algorithm converges linearly in expectation, with a rate depending on the geometric properties of the matrix and its submatrices and on the size of the blocks. Our convergence analysis reveals that the algorithm is most effective when it is given a good sampling of the rows into well-conditioned blocks. Besides providing a general framework for the design and analysis of randomized block Kaczmarz methods, our results resolve an open problem in the literature related to the theoretical understanding of observed practical efficiency of extrapolated block Kaczmarz methods.

Keywords: Consistent linear systems, Kaczmarz algorithm, random blocks of rows, expected linear convergence.

AMS subject classifications: 15A06, 90C20, 90C06.

1 Introduction

Given a real matrix $A\in\mathbb{R}^{m\times n}$ and a real vector $b\in\mathbb{R}^{m}$ , in this paper we search for a solution of the linear system $Ax=b$ :

$$\text{Find } x \ \text{ s.t. } \ Ax=b. \tag{1}$$

We assume throughout the paper that the system is consistent, that is, there exists a vector $x^{*}\in\mathbb{R}^{n}$ for which $Ax^{*}=b$. Let us denote the set of solutions by ${\cal X}=\{x\in\mathbb{R}^{n}:\;Ax=b\}$. Linear systems represent a modeling paradigm for solving many engineering and physics problems: partial differential equations [19], sensor networks [27], filtering [11], signal processing [7], computerized tomography [9], machine learning and optimal control [20]. In these applications it is usually sufficient to find a point which is not too far from the solution set ${\cal X}$. In particular, one chooses the error tolerance $\varepsilon>0$ and aims to find a point $x$ satisfying $\|x-\Pi_{{\cal X}}(x)\|^{2}\leq\varepsilon$, where $\Pi_{{\cal X}}(\cdot)=\arg\min_{y\in{\cal X}}\|\cdot-y\|$ is the projection onto the solution set ${\cal X}$, and $\|\cdot\|$ is the standard Euclidean norm on $\mathbb{R}^{n}$. When a randomized algorithm is used to find $x$, which renders $x$ a random vector, one replaces this condition with $\mathbf{E}\left[\|x-\Pi_{{\cal X}}(x)\|^{2}\right]\leq\varepsilon$, where $\mathbf{E}\left[\cdot\right]$ denotes the expectation with respect to the randomness of the algorithm.

1.1 Iterative methods

In practice, $m$ and $n$ are usually large, so iterative methods, e.g. the so-called row-action methods, are preferred (in a row-action method only one block of rows of $A$ is used in a given iteration [2]). One of these methods is the iterative method of Kaczmarz [10, 23, 13]. In some situations, it is even more efficient than the conjugate gradient method, which is the most popular iterative algorithm for solving large linear systems [19]. In fact, the Kaczmarz algorithm was implemented by Hounsfield in the very first medical scanner [9]. At each step, the Kaczmarz algorithm projects the current iterate onto the solution space of a single row $a_{i_{k}}^{T}$ and then chooses the next iterate along the line connecting the current iterate and the projection, leading to the following iterative process:

$$x^{k+1} = x^{k} - \alpha_{k}\frac{a_{i_{k}}^{T}x^{k}-b_{i_{k}}}{\|a_{i_{k}}\|^{2}}\,a_{i_{k}}. \tag{2}$$

Usually, the stepsize $\alpha_{k}$ is chosen in the interval $(0,\;2)$. For $\alpha_{k}=1$ we recover the basic Kaczmarz algorithm [10]. Note that this update rule requires low cost per iteration and storage, of order ${\cal O}(n)$. In contrast, in block Kaczmarz methods a subset of rows $A_{J_{k}}$ is used at each iteration, with $J_{k}\subseteq[m]$ and $|J_{k}|>1$. We usually distinguish two approaches. The first variant is simply a block generalization of the basic Kaczmarz algorithm, that is, we project the current iterate onto the solution space of the entire block $A_{J_{k}}$ and then choose the next iterate along the line connecting the current iterate and the projection:

$$x^{k+1} = x^{k} - \alpha_{k}A_{J_{k}}^{\dagger}\left(A_{J_{k}}x^{k}-b_{J_{k}}\right), \tag{3}$$

where $A_{J_{k}}^{\dagger}$ denotes the pseudoinverse of $A_{J_{k}}$. Usually, the stepsize $\alpha_{k}$ is chosen as $1$. This is the approach followed e.g. in [5, 8, 16, 22], and we refer to this iterative process as the block projection Kaczmarz algorithm. The main drawback of (3) is that each iteration is expensive, since we need to apply the pseudoinverse to a vector, or equivalently, solve a least-squares problem at each iteration, with a cost per iteration of order ${\cal O}(\tau^{2}n)$, where $\tau=|J_{k}|$. Moreover, it is not well suited to distributed implementations. The second variant of block Kaczmarz avoids these issues by projecting the current estimate onto each individual row that forms the block matrix $A_{J_{k}}$ and averaging the resulting projections to form the next iterate. This leads to the following iteration:

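$$x^{k+1} = x^{k} - \alpha_{k}\left(\sum_{i\in J_{k}}\omega_{i}\frac{a_{i}^{T}x^{k}-b_{i}}{\|a_{i}\|^{2}}\,a_{i}\right), \tag{4}$$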

where the weights satisfy $\omega_{i}\in[0,\ 1]$ and $\sum_{i\in J_{k}}\omega_{i}=1$, and $\alpha_{k}\in(0,\;2)$. Note that update (4) is very easy to implement on distributed computing units and it is comparable in cost per iteration with the basic Kaczmarz update (2), namely of order ${\cal O}(\tau n)$. This is the scheme considered e.g. in [1, 2, 14, 21]; we also analyze it in this paper and refer to it as the block Kaczmarz algorithm. For $\alpha_{k}\in(0,\;2)$, the iterative process (2) is known to converge linearly [13, 23] (see also Section 3.3). Moreover, linear convergence results for the iteration (3), with the particular stepsize $\alpha_{k}=1$, were recently derived in [8, 16, 22]. However, we are not aware of any convergence rates for the iterative process (4) that depend on the size of the blocks $|J_{k}|$ and on the geometric properties of the matrix $A$ and its submatrices $A_{J_{k}}$.
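To make the averaged-projection update (4) concrete, here is a minimal NumPy sketch. The function name `block_kaczmarz_step`, the uniform default weights, and the small random test system are illustrative assumptions, not part of the paper.

```python
import numpy as np

def block_kaczmarz_step(A, b, x, J, weights=None, alpha=1.0):
    """One step of update (4): average the projections onto the rows in J."""
    J = np.asarray(J)
    if weights is None:
        # Assumption: uniform convex weights omega_i = 1/|J|.
        weights = np.full(len(J), 1.0 / len(J))
    A_J = A[J]
    residual = A_J @ x - b[J]               # a_i^T x - b_i for i in J
    row_norms_sq = np.sum(A_J**2, axis=1)   # ||a_i||^2 for i in J
    coeff = weights * residual / row_norms_sq
    return x - alpha * (A_J.T @ coeff)

# Hypothetical consistent system used only for illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
x_true = rng.standard_normal(3)
b = A @ x_true

x0 = np.zeros(3)
x_single = block_kaczmarz_step(A, b, x0, [2])        # |J| = 1: basic update (2)
x_block = block_kaczmarz_step(A, b, x0, [0, 2, 4])   # |J| = 3: update (4)
```

With a single row and `alpha=1.0` the step lands on the hyperplane of that row, and with several rows each per-row term can be computed on a separate worker before the weighted sum is formed.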

1.2 Extrapolation

It is well known that the practical performance of block Kaczmarz method (4) can be enhanced, and often dramatically so, using extrapolation. This refers to the practice of moving further along the line connecting the last iterate and the average of the projections by using a stepsize $\alpha_{k}\geq 2$ , see e.g. [1]. For example, since the iterative process (4) can be slow, in [14, 21] an extrapolated variant of (4) has been introduced with the following adaptive choice for the stepsize $\alpha_{k}$ :

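$$\alpha_{k} = \frac{2\sum_{i\in J_{k}}\bar{\omega}_{i}\left(a_{i}^{T}x^{k}-b_{i}\right)^{2}}{\left\|\sum_{i\in J_{k}}\bar{\omega}_{i}\left(a_{i}^{T}x^{k}-b_{i}\right)a_{i}\right\|^{2}}, \tag{5}$$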

where we use the notation $\bar{\omega}_{i}={\omega}_{i}/\|a_{i}\|^{2}$ and, for convenience, we define $0/0=1$. From Jensen's inequality it follows that $\alpha_{k}\geq 2$. Moreover, in numerical experiments it has been observed that the extrapolation parameter $\alpha_{k}$ from (5) can be much larger than $2$, and the sequence $x^{k}$ generated by the iterative process (4) using the extrapolated adaptive stepsize $\alpha_{k}$ from (5) usually converges much faster than the sequence generated by (4) with stepsize $\alpha_{k}\in(0,\ 2)$ [1, 2, 3, 14, 21]. However, despite more than $80$ years of research on block Kaczmarz methods, the empirical success of extrapolation schemes is not supported by theory. That is, to the best of our knowledge, there is no theory explaining why these methods with $\alpha_{k}\geq 2$ require fewer iterations than their non-extrapolated variants with $\alpha_{k}=1$.
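The adaptive stepsize (5) can be evaluated directly from the block residuals. The sketch below is one possible reading, assuming uniform weights by default; the function name `adaptive_stepsize` and the test data are hypothetical, and the handling of the $0/0$ case follows the convention stated in the text.

```python
import numpy as np

def adaptive_stepsize(A, b, x, J, weights=None):
    """Extrapolated stepsize (5) for the block J at the point x."""
    J = np.asarray(J)
    if weights is None:
        # Assumption: uniform convex weights omega_i = 1/|J|.
        weights = np.full(len(J), 1.0 / len(J))
    A_J = A[J]
    residual = A_J @ x - b[J]
    w_bar = weights / np.sum(A_J**2, axis=1)          # omega_i / ||a_i||^2
    numerator = 2.0 * np.sum(w_bar * residual**2)
    denominator = np.linalg.norm(A_J.T @ (w_bar * residual))**2
    if denominator == 0.0:
        # Convention 0/0 = 1 from the text; the step direction is zero here,
        # so the returned value does not change the iterate.
        return 1.0
    return numerator / denominator

# Hypothetical consistent system used only for illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
b = A @ rng.standard_normal(3)
x = rng.standard_normal(3)

alpha_block = adaptive_stepsize(A, b, x, [0, 1, 4])
alpha_single = adaptive_stepsize(A, b, x, [1])
```

The lower bound $\alpha_{k}\geq 2$ comes from applying $\|\sum_{i}\omega_{i}v_{i}\|^{2}\leq\sum_{i}\omega_{i}\|v_{i}\|^{2}$ to $v_{i}=(a_{i}^{T}x^{k}-b_{i})a_{i}/\|a_{i}\|^{2}$, which bounds the denominator of (5) by half the numerator. For a single row with weight one the two sides coincide and the stepsize equals $2$.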

1.3 Rows importance

While selecting the index set $J\subseteq[m]$ uniformly at random appears to be the most natural choice, it is likely the case that some blocks of rows of $A$ are more important than others. As an illustration, consider the scenario in which there exists $T\subset[m]$ such that ${\cal X}=\{x\in\mathbb{R}^{n}:A_{T}x=b_{T}\}$, where $A_{T}$ denotes the block matrix of $A$ whose rows are indexed in the set $T$. Clearly, the rows $a_{i}$ for $i\in T$ are more important than the rows $a_{i}$ for $i\notin T$. This is an extreme scenario: if $T$ is known, one should simply remove the non-important rows from the representation to begin with. However, even if none of the rows can be removed, it is often the case that some (blocks of) rows are more important than others in the sense that one should project on these more often. In fact, operator theory shows that some sampling strategies of the blocks of rows are more effective than others in terms of conditioning, see e.g. [16, 25]. We are not aware of any paper on the block Kaczmarz method (4) that takes the importance of blocks of rows into consideration. An exception is some recent work [16, 8, 22], but on the block projection Kaczmarz algorithm (3) (i.e., [16, 8, 22] analyze rows importance for the method that projects the current estimate on the entire solution space of $A_{J}x=b_{J}$, as opposed to our algorithm (4), where we only project on the individual rows of the submatrix $A_{J}$ and then average).

1.4 Outline

In Section 2 we summarize selected key contributions of this paper. In Section 3 we present some preliminary results for Kaczmarz algorithm. In Section 4 we define general random block Kaczmarz algorithms and derive new convergence rates. In Section 5 we present an acceleration of block Kaczmarz algorithm using Chebyshev-based stepsizes and derive the corresponding convergence rates.

1.5 Notation

For $x\in\mathbb{R}^{n}$ , the standard Euclidean norm is denoted by $\|x\|=\sqrt{x^{T}x}$ . For a positive integer $m$ , let $[m]=\{1,2,\dots,m\}$ . By $e_{i}$ we denote the $i$ th column of the identity matrix $I_{n}\in\mathbb{R}^{n\times n}$ . Let $A\in\mathbb{R}^{m\times n}$ be a matrix. By $\|A\|_{F}$ , $\|A\|$ , $\text{rank}(A)$ , $a_{i}^{T}$ , $\lambda_{\min}^{\text{nz}}(A)$ and $\lambda_{\max}(A)$ we denote its Frobenius norm, spectral norm, rank, $i$ th row, the smallest non-zero eigenvalue, and the largest eigenvalue, respectively. For an index set $J\subset[m]$ , by $A_{J}\in\mathbb{R}^{|J|\times n}$ we denote the matrix with the rows $a_{i}^{T}$ for $i\in J$ . The projection of a point $x$ onto a closed convex set $X$ is denoted by $\Pi_{X}(x)=\arg\min_{z}\{\|x-z\|:z\in X\}$ . A matrix is called normalized if all its rows have the Euclidean norm equal to $1$ .

2 Contributions

In this section we briefly review our key contributions and results, leaving the theoretical details to the rest of the paper.

2.1 General framework

We develop a unified framework for studying extrapolation and rows importance questions for consistent linear systems, together with randomized block Kaczmarz methods for solving such systems of linear equalities. We define a probability space $([m],{\cal F},\mathbf{P})$ . By sampling $J\sim\mathbf{P}$ , we are choosing a block of rows $A_{J}$ from the matrix $A$ . In this way we achieve two goals at the same time:

(i) First, this sampling defines a general stochastic selection rule which we shall use to design a randomized block Kaczmarz method, described in Section 2.2 below.

(ii) Second, the choice of probability measure is a natural way to assign importance to the blocks of $A$.

Note that the probability $\mathbf{P}$ is a parameter playing the dual role of controlling the representation of the solution set ${\cal X}$ as an intersection of blocks of rows of matrix $A$ , and defining the importance sampling procedure, which in turn defines the algorithm. For matrices with normalized rows (i.e. each row has norm $1$ ), we have identified the following stochastic conditioning parameter:

$$\lambda_{\max}^{\text{block}} = \max_{J\sim\mathbf{P}}\lambda_{\max}\left(A_{J}^{T}A_{J}\right)$$

as the key quantity characterizing importance sampling. In particular, our analysis reveals that the most effective importance rule is the one that makes $\lambda_{\max}^{\text{block}}$ small, i.e. there is a sampling of the blocks of the rows into well-conditioned blocks. Moreover, the operator theory literature provides detailed information about the existence and construction of such good sampling (see Section 4.3).
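As a small illustration of how the stochastic conditioning parameter depends on the sampling, the sketch below evaluates $\lambda_{\max}^{\text{block}}$ for two partitions of the same normalized matrix. It assumes that the maximum over $J\sim\mathbf{P}$ ranges over the blocks that have positive probability; the helper name `lambda_max_block` and the toy matrix are hypothetical.

```python
import numpy as np

def lambda_max_block(A, blocks):
    """Largest eigenvalue of A_J^T A_J over the listed blocks J."""
    # The spectral norm of A_J is its largest singular value, so its square
    # equals lambda_max(A_J^T A_J).
    return max(np.linalg.norm(A[np.asarray(J)], 2)**2 for J in blocks)

# Toy normalized matrix: rows 0 and 2 are identical, as are rows 1 and 3.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

well_conditioned = lambda_max_block(A, [[0, 1], [2, 3]])   # orthonormal rows per block
badly_conditioned = lambda_max_block(A, [[0, 2], [1, 3]])  # duplicated rows per block
```

Grouping orthogonal rows keeps the parameter at its smallest possible value for normalized rows, while grouping duplicated rows drives it up to the block size.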

2.2 Algorithms

We propose a block Kaczmarz algorithmic framework that uses a randomized scheme to choose a subset of the constraints at each iteration (see Sections 4 and 5):

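$$(\text{RBK}):\ \text{Draw at each step a sample } J_{k}\sim\mathbf{P} \text{ and update: } \quad x^{k+1} = x^{k} - \alpha_{k}\left(\sum_{i\in J_{k}}\omega_{i}^{k}\frac{a_{i}^{T}x^{k}-b_{i}}{\|a_{i}\|^{2}}\,a_{i}\right),$$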

where the weights satisfy $\omega_{i}^{k}\in[0,\ 1]$ and $\sum_{i\in J_{k}}\omega_{i}^{k}=1$. One important property of our algorithmic framework is the use of several extrapolated stepsizes $\alpha_{k}$ that, in general, are much larger than the stepsize $\alpha_{k}\in(0,\ 2)$ usually used in the literature. More precisely, we analyze three choices for the stepsize $\alpha_{k}$: (i) one depending on the geometric properties of the submatrices of $A$, of the form ${\cal O}(1/\lambda_{\max}^{\text{block}})$; (ii) one adaptive stepsize similar to (5); (iii) one stepsize using the roots of the Chebyshev polynomials. All three extrapolation procedures yield $\alpha_{k}\geq 2$ and hence they drastically accelerate the convergence of the RBK algorithm. Another feature of our algorithm is that it allows projecting in parallel onto several rows, thus providing flexibility in matching the implementation of the algorithm to the distributed architecture at hand. Moreover, the RBK algorithm can be interpreted, for some particular choices of the weights and stepsize, as a minibatch stochastic gradient descent or block coordinate descent method applied to a specific optimization problem.

2.3 Convergence rates

To the best of our knowledge, convergence rates of Kaczmarz type methods were previously derived only for stepsizes belonging to the interval $(0,\ 2)$ [8, 16, 22, 23]. Moreover, the existing convergence estimates for the block Kaczmarz algorithm (4) do not show any dependence on the size of the blocks $|J|$ or on the geometric properties of the block submatrices $A_{J}$ [1, 2, 3, 14, 21]. On the other hand, our convergence analysis for the randomized block Kaczmarz (RBK) algorithm is one of the first proving an (expected) linear rate of convergence that is expressed explicitly in terms of the geometric properties of the matrix and its submatrices and of the size of the blocks. Moreover, our analysis allows us to derive convergence estimates for all three choices of the extrapolated stepsize. To our knowledge, this is the first time the randomized block Kaczmarz algorithm with extrapolation ($|J|>1$ and $\alpha_{k}>2$) is shown to have a better convergence rate than its basic variant (2) ($|J|=1$ and $\alpha_{k}=1$). We have identified $\lambda_{\max}^{\text{block}}$ as the key quantity determining whether extrapolation helps or not, and how much (the smaller $\lambda_{\max}^{\text{block}}$, the more it helps). For example, for normalized matrices, RBK with the extrapolation rules (i)–(ii) has an expected linear rate for the square distance of the iterates to the optimal solution set of the form (see Table 1):

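$${\cal O}\left(\frac{m\,\lambda_{\max}^{\text{block}}}{\tau\,\lambda_{\min}^{\text{nz}}}\log\frac{1}{\varepsilon}\right),$$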

where $\lambda_{\min}^{\text{nz}}$ denotes the smallest non-zero eigenvalue of $AA^{T}$. Thus, we obtain a convergence rate that depends on the geometric properties of the matrix $A$ and its submatrices $A_{J}$ and on the size of the blocks $\tau=|J|$. When comparing RBK with basic Kaczmarz in terms of the total computational cost to achieve an $\varepsilon$-solution, we get:

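$${\cal O}\left(\tau n\cdot\frac{m\,\lambda_{\max}^{\text{block}}}{\tau\,\lambda_{\min}^{\text{nz}}}\log\frac{1}{\varepsilon}\right) \quad\text{vs.}\quad {\cal O}\left(n\cdot\frac{m}{\lambda_{\min}^{\text{nz}}}\log\frac{1}{\varepsilon}\right).$$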

Therefore, our convergence rate also explains why and when the randomized block Kaczmarz algorithm with the constant extrapolated stepsize (16) or adaptive extrapolated stepsize (5) works better compared to its basic counterpart. In particular, the analysis reveals that a distributed implementation of extrapolated RBK algorithm is most effective when the sampling of the blocks of rows yields a partition into well-conditioned blocks, that is, the stochastic conditioning parameter $\lambda_{\max}^{\text{block}}$ is small.
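As a hedged reading of this cost comparison (a worked simplification, not a statement taken from the paper), the factor $\tau$ cancels in the RBK total cost ${\cal O}(\tau n\cdot\frac{m\lambda_{\max}^{\text{block}}}{\tau\lambda_{\min}^{\text{nz}}}\log\frac{1}{\varepsilon})$, so it differs from the basic Kaczmarz cost ${\cal O}(n\cdot\frac{m}{\lambda_{\min}^{\text{nz}}}\log\frac{1}{\varepsilon})$ only by the factor $\lambda_{\max}^{\text{block}}$. The bounds on that factor assume a normalized $A$ and blocks of fixed size $|J|=\tau$.

```latex
\begin{align*}
\tau n\cdot\frac{m\,\lambda_{\max}^{\text{block}}}{\tau\,\lambda_{\min}^{\text{nz}}}\log\frac{1}{\varepsilon}
  &= \lambda_{\max}^{\text{block}}\cdot\left(n\cdot\frac{m}{\lambda_{\min}^{\text{nz}}}\log\frac{1}{\varepsilon}\right),\\
1=\max_{i\in J}\|a_{i}\|^{2}\;\le\;\lambda_{\max}(A_{J}A_{J}^{T})
  &=\lambda_{\max}(A_{J}^{T}A_{J})\;\le\;\operatorname{trace}(A_{J}A_{J}^{T})=\tau .
\end{align*}
```

Under these assumptions the total arithmetic work of RBK is between $1$ and $\tau$ times that of the basic method. If, in addition, the ${\cal O}(\tau n)$ work of each iteration is split over $\tau$ computing units and communication cost is ignored, the estimated parallel time relative to basic Kaczmarz scales like $\lambda_{\max}^{\text{block}}/\tau\leq 1$, which is smallest for well-conditioned blocks.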

For the third choice of the extrapolated stepsize, depending on the roots of Chebyshev polynomials, and for normalized matrices having $\lambda_{\min}>0$ we get a linear rate for the expected iterates of the form (see Table 1):

$${\cal O}\left(\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}\log\frac{1}{\varepsilon}\right),$$

where $\lambda_{\min}$ ($\lambda_{\max}$) denotes the smallest (largest) eigenvalue of $AA^{T}$. Note that this convergence estimate is the same as for the conjugate gradient method and it is optimal for this class of iterative schemes, as the condition number of the matrix enters under a square root.

3 Preliminaries

Note that the problem of finding a solution of the linear system $Ax=b$ can be posed as a quadratic optimization problem, the so-called linear least-squares problem:

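$$\min_{x\in\mathbb{R}^{n}}\ \frac{1}{2m}\|Ax-b\|^{2} \quad \left(:=\frac{1}{2m}\sum_{i=1}^{m}\left(a_{i}^{T}x-b_{i}\right)^{2}\right). \tag{7}$$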

A more particular formulation is to find the least-norm solution of the linear system:

$$\min_{x\in\mathbb{R}^{n}}\ \frac{1}{2}\|x\|^{2} \quad\text{s.t.}\quad Ax=b. \tag{8}$$

The dual of optimization problem (8) takes also the form of a quadratic program:

$$\min_{y\in\mathbb{R}^{m}}\ \frac{1}{2}\|A^{T}y\|^{2}-b^{T}y, \tag{9}$$

where the primal variable $x$ and the dual variable $y$ are related through the relation $x=A^{T}y$ . Let us define the primal and dual objective functions $f(x)=(1/2m)\|Ax-b\|^{2}$ and $g(y)=1/2\|A^{T}y\|^{2}-b^{T}y$ , respectively. Recall that the set of solutions is denoted ${\cal X}=\{x\in\mathbb{R}^{n}:\;Ax=b\}$ and for any given $x$ we define its projection onto ${\cal X}$ by $x^{*}=\Pi_{{\cal X}}(x)$ .
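For readers who want the missing step between the least-norm problem (8) and its dual (9), the standard Lagrangian argument can be sketched as follows. The sign convention for the multiplier $y$ is an assumption chosen so that the result matches $g(y)$ and the relation $x=A^{T}y$.

```latex
\begin{align*}
L(x,y) &= \tfrac{1}{2}\|x\|^{2} - y^{T}(Ax-b), \\
\nabla_{x}L(x,y) = x - A^{T}y = 0 \;&\Longrightarrow\; x = A^{T}y, \\
\min_{x} L(x,y) &= \tfrac{1}{2}\|A^{T}y\|^{2} - \|A^{T}y\|^{2} + b^{T}y
                 = -\left(\tfrac{1}{2}\|A^{T}y\|^{2} - b^{T}y\right) = -g(y).
\end{align*}
```

Maximizing the dual function $-g(y)$ over $y$ is therefore the same as minimizing $g(y)$, which is problem (9).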

3.1 Basic Kaczmarz algorithm

The Kaczmarz algorithm is an iterative scheme for solving the linear system $Ax=b$ that requires only ${\cal O}(n)$ cost per iteration and storage and has a linear rate of convergence. At each iteration $k$ , the algorithm selects (cyclically, randomly) a row $i_{k}\in[m]$ of the linear system and does an orthogonal projection of the current estimate vector $x^{k}$ onto the corresponding hyperplane $a_{i_{k}}^{T}x=b_{i_{k}}$ :

$$\min_{x}\ \|x-x^{k}\|^{2} \quad\text{s.t.}\quad a_{i_{k}}^{T}x=b_{i_{k}}.$$

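Then, we choose the next iterate along the line connecting the current iterate and the projection. This leads to the following iteration for the randomized/cyclic Kaczmarz algorithm [10, 23], referred to below as Algorithm 1, which coincides with update (2):

$$x^{k+1} = x^{k} - \alpha_{k}\frac{a_{i_{k}}^{T}x^{k}-b_{i_{k}}}{\|a_{i_{k}}\|^{2}}\,a_{i_{k}}.$$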

Usually, $\alpha_{k}$ is chosen constant in the interval $(0,\;2)$. For $\alpha_{k}=1$ we recover the basic Kaczmarz algorithm [10].
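A minimal sketch of the basic scheme with cyclic row selection is given below. The function name `cyclic_kaczmarz`, the zero starting point, and the small $3\times 2$ test system are illustrative assumptions; the randomized variant differs only in how the row index is drawn.

```python
import numpy as np

def cyclic_kaczmarz(A, b, num_sweeps, alpha=1.0):
    """Basic Kaczmarz iteration, sweeping the rows in order 0, 1, ..., m-1."""
    m, n = A.shape
    x = np.zeros(n)  # Assumption: start from the origin.
    for _ in range(num_sweeps):
        for i in range(m):
            a_i = A[i]
            # Move a fraction alpha of the way to the projection onto
            # the hyperplane a_i^T x = b_i.
            x = x - alpha * (a_i @ x - b[i]) / (a_i @ a_i) * a_i
    return x

# Hypothetical consistent system used only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [1.0, -1.0]])
x_true = np.array([1.0, -2.0])
b = A @ x_true

x_hat = cyclic_kaczmarz(A, b, num_sweeps=50)
```

Each inner step touches one row only, which is where the ${\cal O}(n)$ cost per iteration and storage comes from.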

3.2 Interpretations

We can view randomized Kaczmarz algorithm, i.e. when $i_{k}$ is chosen randomly, as an optimization method for solving a specific primal or dual optimization problem. More precisely, Kaczmarz algorithm is a particular case of:

SGD (Stochastic Gradient Descent): The randomized Kaczmarz (Algorithm 1) is equivalent to one step of the stochastic gradient descent method [17] applied to the finite sum problem (7). Specifically, a component function $i_{k}$ , $f_{i_{k}}(x)=1/2(a_{i_{k}}^{T}x-b_{i_{k}})^{2}$ , is chosen randomly and a negative gradient step (having $\nabla f_{i_{k}}(x)=(a_{i_{k}}^{T}x-b_{i_{k}})a_{i_{k}}$ ) of this partial function in $x^{k}$ with stepsize $\alpha_{k}/\|a_{i_{k}}\|^{2}$ is considered:

$$x^{k+1} = x^{k} - \frac{\alpha_{k}}{\|a_{i_{k}}\|^{2}}\nabla f_{i_{k}}(x^{k}).$$

RCD (Random Coordinate Descent): The randomized Kaczmarz (Algorithm 1) is equivalent to one step of randomized coordinate descent method [18] applied to the dual problem (9). Specifically, a negative gradient step in the random $i_{k}$ th component of $y$ (having the expression $\nabla_{i_{k}}g(y)=a_{i_{k}}^{T}A^{T}y-b_{i_{k}}$ ) with stepsize $\alpha_{k}/\|a_{i_{k}}\|^{2}$ is taken, yielding:

$$y^{k+1} = y^{k} - \frac{\alpha_{k}}{\|a_{i_{k}}\|^{2}}\nabla_{i_{k}}g(y^{k})\cdot e_{i_{k}},$$

where $e_{i}$ denotes the $i$th column of the identity matrix in $\mathbb{R}^{m\times m}$. We easily recover the iteration of Algorithm 1 by multiplying this update by $A^{T}$ and using the relation between the primal and dual variables, $x=A^{T}y$. Note that in both interpretations we need to choose a specific stepsize in order to prove convergence, see [17, 18].
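The equivalence between the dual coordinate step and the primal Kaczmarz step can be checked numerically. The snippet below is a small sanity check on hypothetical random data: it takes one coordinate descent step on $g$ and one Kaczmarz step from $x=A^{T}y$, both with the same row index and stepsize.

```python
import numpy as np

# Hypothetical data used only for illustration.
rng = np.random.default_rng(1)
m, n = 5, 3
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)

y = rng.standard_normal(m)   # dual point
x = A.T @ y                  # matching primal point, x = A^T y
i, alpha = 3, 1.0
norm_sq = A[i] @ A[i]        # ||a_i||^2

# Primal step: one Kaczmarz update on row i.
x_next = x - alpha / norm_sq * (A[i] @ x - b[i]) * A[i]

# Dual step: coordinate descent on g(y) = 0.5 * ||A^T y||^2 - b^T y.
grad_i = A[i] @ (A.T @ y) - b[i]   # i-th partial derivative of g at y
e_i = np.zeros(m)
e_i[i] = 1.0
y_next = y - alpha / norm_sq * grad_i * e_i

# Mapping the dual iterate back through A^T gives the primal iterate.
x_from_dual = A.T @ y_next
```

The check works because $A^{T}e_{i}=a_{i}$ and $\nabla_{i}g(y)=a_{i}^{T}x-b_{i}$ whenever $x=A^{T}y$.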

3.3 Convergence properties

It is known that Algorithm 1 converges to the minimum norm solution of $Ax=b$ when it is initialized with $x^{0}=0$ , but the speed of convergence is not simple to quantify, and especially, depends on the ordering of the rows [4]. The situation changes if one considers a randomization such that in each step one chooses a row of the system matrix at random, according to a probability P. In the seminal paper [23] it has been shown that sampling the rows of $A$ with probability $\textbf{P}(i=i_{k})=\frac{\|a_{i_{k}}\|^{2}}{\|A\|_{F}^{2}}$ for all $i\in[m]$ and using constant stepsize $\alpha=1$ , we get a linear convergence rate in expectation of the form:

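$$\mathbf{E}\left[\|x^{k}-x_{k}^{*}\|^{2}\right] \leq \left(1-\frac{\lambda_{\min}^{\text{nz}}(A^{T}A)}{\|A\|_{F}^{2}}\right)^{k}\|x^{0}-x_{0}^{*}\|^{2}, \tag{10}$$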

where $\lambda_{\min}^{\text{nz}}(\cdot)$ denotes the minimum non-zero eigenvalue of a given matrix and $x^{*}_{k}=\Pi_{{\cal X}}(x^{k})$ . For completeness, let us derive this convergence rate. Considering the stepsize $\alpha_{k}$ constant in the interval $(0,\;2)$ and using that $\langle x-x^{*},(a_{i}^{T}x-b_{i})a_{i}\rangle=(a_{i}^{T}x-b_{i})^{2}$ for any $x^{*}$ a solution of $Ax=b$ , we get:

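$$\|x^{k+1}-x^{*}\|^{2} = \|x^{k}-x^{*}\|^{2} - \alpha(2-\alpha)\frac{\left(a_{i}^{T}x^{k}-b_{i}\right)^{2}}{\|a_{i}\|^{2}}.$$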

Taking now the conditional expectation under the probability $\textbf{P}(i=i_{k})=\frac{\|a_{i_{k}}\|^{2}}{\|A\|_{F}^{2}}$ , we get:

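$$\mathbf{E}_{i}\left[\|x^{k+1}-x^{*}\|^{2}\mid x^{k}\right] = \|x^{k}-x^{*}\|^{2} - \frac{\alpha(2-\alpha)}{\|A\|_{F}^{2}}\|Ax^{k}-b\|^{2}.$$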

Further, it is well known from the Courant-Fischer theorem that for any matrix $A$ we have $\|Ax\|^{2}\geq\lambda_{\min}^{\text{nz}}(AA^{T})\|x\|^{2}$ for all $x\in\text{range}(A^{T})$ . Moreover, we have that $x-\Pi_{{\cal X}}(x)\in\text{range}(A^{T})$ for any $x$ . In conclusion, if we denote $x_{k}^{*}=\Pi_{{\cal X}}(x^{k})$ , we get:

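$$\|Ax^{k}-b\|^{2} = \|A(x^{k}-x_{k}^{*})\|^{2} \geq \lambda_{\min}^{\text{nz}}(AA^{T})\,\|x^{k}-x_{k}^{*}\|^{2}.$$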

Using this inequality in the recurrence above and taking expectation over the entire history we get the following linear convergence rate in expectation:

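$$\mathbf{E}\left[\|x^{k+1}-x_{k}^{*}\|^{2}\right] \leq \left(1-\frac{\alpha(2-\alpha)\lambda_{\min}^{\text{nz}}(AA^{T})}{\|A\|_{F}^{2}}\right)\mathbf{E}\left[\|x^{k}-x_{k}^{*}\|^{2}\right]. \tag{11}$$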

For the optimal choice $\alpha^{*}=1$ (i.e. $\alpha^{*}=\arg\max_{\alpha}\alpha(2-\alpha)$ ) we get the simpler convergence estimate (10) derived in [23]. Note that for ill-conditioned problems, i.e. $\lambda_{\min}^{\text{nz}}(AA^{T})$ small and $\|A\|_{F}$ large, this linear convergence is very slow using a constant stepsize $\alpha\in(0,\;2)$ . In the next sections we prove that block variants of randomized Kaczmarz (Algorithm 1) with properly chosen extrapolated stepsize $\alpha_{k}$ larger than $2$ can substantially accelerate the convergence rate (11).
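The one-step argument behind this rate can be verified exactly on a small example by summing over all rows with the sampling probabilities $\|a_{i}\|^{2}/\|A\|_{F}^{2}$ instead of drawing samples. The sketch below assumes a random system with full column rank, so that the solution is unique and the projection of any point onto ${\cal X}$ is that solution; the function name and the data are hypothetical.

```python
import numpy as np

def expected_sq_dist_after_step(A, b, x, x_star, alpha):
    """Exact E_i[||x^{k+1} - x*||^2 | x^k] under P(i) = ||a_i||^2 / ||A||_F^2."""
    row_norms_sq = np.sum(A**2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()
    total = 0.0
    for i in range(A.shape[0]):
        x_next = x - alpha * (A[i] @ x - b[i]) / row_norms_sq[i] * A[i]
        total += probs[i] * np.sum((x_next - x_star)**2)
    return total

# Hypothetical system; a Gaussian 8 x 4 matrix has full column rank here.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 4))
x_star = rng.standard_normal(4)
b = A @ x_star
x = rng.standard_normal(4)
alpha = 1.0

frob_sq = np.sum(A**2)
dist_sq = np.sum((x - x_star)**2)
lhs = expected_sq_dist_after_step(A, b, x, x_star, alpha)

# Expected decrease written with the full residual ||Ax - b||^2.
identity_rhs = dist_sq - alpha * (2 - alpha) / frob_sq * np.sum((A @ x - b)**2)

# Contraction bound using the smallest eigenvalue of A^T A (nonzero here).
lam_min = np.linalg.eigvalsh(A.T @ A)[0]
bound = (1 - alpha * (2 - alpha) * lam_min / frob_sq) * dist_sq
```

The factor $\alpha(2-\alpha)$ is at most $1$ on $(0,\;2)$, which is why non-extrapolated constant stepsizes cannot improve on the rate obtained for $\alpha=1$ in this analysis.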

3.4 Preliminary probability results

Let $J$ be a random set-valued map with values in $2^{[m]}$ . Any realization $J\subseteq[m]$ of this random variable, referred to as sampling and having the same notation as the random variable, is characterized by the probability distribution $\textbf{P}(J)$ . We also define the probability with which an index $i\in[m]$ can be found in $J$ as:

$$p_{i}=\textbf{P}(i\in J).$$

Then, for any scalars $\theta_{i}$ , with $i\in[m]$ , the following relation holds in expectation:

$$\mathbf{E}\left[\sum_{i\in J}\theta_{i}\right]=\sum_{i\in[m]}p_{i}\theta_{i}.\qquad(12)$$

The following examples for sampling blocks of rows of $A\in\mathbb{R}^{m\times n}$ will be used in our subsequent analysis.

Uniform sampling: One natural choice is the uniform sampling of $\tau$ unique indexes of rows that make up $J$ , i.e. $|J|=\tau$ for all samplings, with $1\leq\tau\leq m$ fixed. For this choice of the random variable $J$ , we observe that we have a total number of $\binom{m}{\tau}$ possible values that $J$ can take. Thus, for the uniform sampling we have $\textbf{P}(J)=1/\binom{m}{\tau}$ . We can also express $p_{i}$ for the uniform sampling as:

$$p_{i}=\frac{\binom{m-1}{\tau-1}}{\binom{m}{\tau}}=\frac{\tau}{m}.$$

Partition sampling: Another choice is the partition sampling, i.e. consider a partition of $[m]$ given by $\{J_{1},\cdots,J_{\ell}\}$ , and then take $\textbf{P}(J)=1/\ell$ or $\textbf{P}(J)=\|A_{J}\|_{F}^{2}/\|A\|_{F}^{2}$ for all $J\in\{J_{1},\cdots,J_{\ell}\}$ . For example, for the first probability choice of the partition sampling, $p_{i}$ is given by:

$$p_{i}=\frac{1}{\ell}.$$

In particular, if all the subsets in the partition have the same cardinality, i.e. $|J_{l}|=\tau$ for all $l\in[\ell]$ , and $A$ is normalized, then the two probabilities are the same and $\ell=m/\tau$ . Hence, $p_{i}=\frac{\tau}{m}$ . These preliminary results will help us in the convergence analysis of randomized block Kaczmarz algorithms we propose next.
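The inclusion probabilities for the two sampling examples can be checked by direct enumeration on a small instance. The sketch below uses exact rational arithmetic; the helper names `inclusion_probabilities` and `expected_block_sum`, the sizes $m=6$, $\tau=2$, and the scalars `theta` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

def inclusion_probabilities(m, samplings):
    """samplings: list of (J, prob) pairs, J a tuple of indices in range(m)."""
    p = [Fraction(0)] * m
    for J, prob in samplings:
        for i in J:
            p[i] += prob
    return p

def expected_block_sum(theta, samplings):
    """E[ sum_{i in J} theta_i ] computed by enumerating all samplings."""
    return sum(prob * sum(theta[i] for i in J) for J, prob in samplings)

m, tau = 6, 2

# Uniform sampling: every subset of size tau has probability 1 / C(m, tau).
subsets = list(combinations(range(m), tau))
uniform = [(J, Fraction(1, len(subsets))) for J in subsets]
p_uniform = inclusion_probabilities(m, uniform)

# Partition sampling with equal-size blocks, each picked with probability 1 / ell.
blocks = [(0, 1), (2, 3), (4, 5)]
partition = [(J, Fraction(1, len(blocks))) for J in blocks]
p_partition = inclusion_probabilities(m, partition)

# Expectation of a block sum versus the weighted sum over all indices.
theta = [Fraction(v) for v in (3, -1, 4, 1, -5, 9)]
lhs = expected_block_sum(theta, uniform)
rhs = sum(pi * ti for pi, ti in zip(p_uniform, theta))
```

In both cases every index is included with probability $\tau/m$, and the expected block sum equals the $p_{i}$-weighted sum over all indices.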

4 Randomized block Kaczmarz algorithms

In this section we design new variants of randomized Kaczmarz, Algorithm 1, considering at each step a block of rows of the linear system $Ax=b$ and different choices for the stepsize. For all these methods we prove expected linear convergence rates. Note that block Kaczmarz methods have also been considered in other works, see e.g. [1, 2, 14, 21] and the references therein. Nevertheless, to our knowledge, this paper is the first one that provides an expected linear rate of convergence that depends explicitly on geometric properties of the system matrix $A$ and its submatrices $A_{J}$. Moreover, the convergence estimates hold for several extrapolated stepsizes. In our Randomized Block Kaczmarz (RBK) algorithm, at each iteration, instead of projecting onto only one hyperplane, we consider projections onto several hyperplanes and then take as a new direction a convex combination of these projections with some stepsize (see Algorithm 2).

Here $J_{k}=\{i_{k}^{1},\cdots,i_{k}^{\tau_{k}}\}\subseteq[m]$ is the set of indexes corresponding to the rows selected at iteration $k$ of size $\tau_{k}\in[1,m]$ and P denotes the probability distribution over the collection of subsets of indexes of $[m]$ . Moreover, the weights $\omega_{k}=(\omega_{k}^{i})_{i\in J_{k}}$ are chosen positive and summing to 1. Thus, in our analysis we assume bounded weights satisfying $0<\omega_{\min}\leq\omega_{k}^{i}\leq\omega_{\max}<1$ for all $i\in J_{k}$ and $k\geq 0$ . Two simple choices for the weights are e.g. $\omega_{k}^{i}=\|a_{i}\|^{2}/\sum_{i\in J_{k}}\|a_{i}\|^{2}$ or $\omega_{k}^{i}=1/\tau_{k}$ for all $k\geq 0$ . In these two particular cases we get the following compact updates:

[TABLE]

respectively, where the diagonal matrix $D_{J}=\text{diag}(1/\|a_{i}\|^{2},\;i\in J)\in\mathbb{R}^{\tau\times\tau}$. Several choices for the stepsize will be given in the next sections, based on over-relaxations (extrapolations), i.e. $\alpha_{k}>2$. As for the basic Kaczmarz algorithm, RBK (Algorithm 2) admits the following interpretations:

BSGD (Batch Stochastic Gradient Descent): One iteration of RBK algorithm can be viewed as one step of the batch stochastic gradient descent [17] applied to the finite sum problem (7) when the weights $\omega_{k}$ are chosen in a particular fashion. Specifically, if we choose the particular weights $\omega_{k}^{i}=\|a_{i}\|^{2}/\sum_{i\in J_{k}}\|a_{i}\|^{2}$ and uniform probability, then we recover the batch stochastic gradient descent method with a certain choice of the stepsize:

[TABLE]

RBCD (Randomized Block Coordinate Descent): One iteration of RBK algorithm can be viewed as one step of the block coordinate descent method [15, 18] applied to the dual problem (9) when the weights $\omega_{k}$ are chosen in a particular fashion. Specifically, if we choose the particular weights $\omega_{k}^{i}=\|a_{i}\|^{2}/\sum_{i\in J_{k}}\|a_{i}\|^{2}$ , then we recover the block coordinate descent method with a certain choice of the stepsize:

[TABLE]

However, for general weights $\omega_{k}$ and stepsize $\alpha_{k}$ , RBK algorithm cannot be interpreted in these ways, thus our scheme is more general. In the following, we denote $x_{k}^{*}=\Pi_{{\cal X}}(x^{k})$ , that is the projection of $x^{k}$ onto the solution set ${\cal X}$ of the linear system $Ax=b$ .
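A sketch of one RBK iteration may help fix ideas. Algorithm 2 is not reproduced in this section, so the update used below, $x^{k+1}=x^{k}-\alpha_{k}\sum_{i\in J_{k}}\omega_{k}^{i}\frac{a_{i}^{T}x^{k}-b_{i}}{\|a_{i}\|^{2}}a_{i}$, is an assumption read off from the description of a stepsize applied to a convex combination of single-row projections. The names `rbk_step` and `rbk`, the uniform sampling, and the demo data are hypothetical.

```python
import random

def rbk_step(A, b, x, J, weights, alpha):
    """One RBK-style update over the block of row indices J (assumed form).

    x_new = x - alpha * sum_{i in J} w_i * (a_i^T x - b_i) / ||a_i||^2 * a_i
    """
    direction = [0.0] * len(x)
    for i, w in zip(J, weights):
        row = A[i]
        residual = sum(a * xi for a, xi in zip(row, x)) - b[i]
        scale = w * residual / sum(a * a for a in row)
        direction = [d + scale * a for d, a in zip(direction, row)]
    return [xi - alpha * d for xi, d in zip(x, direction)]

def rbk(A, b, tau, alpha, iters=500, seed=0):
    """RBK with uniform sampling of tau distinct rows and weights 1/tau."""
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        J = rng.sample(range(len(A)), tau)
        x = rbk_step(A, b, x, J, [1.0 / tau] * tau, alpha)
    return x

# Illustrative consistent system with normalized rows and solution (1, -2).
A_demo = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, -0.6]]
x_true = [1.0, -2.0]
b_demo = [sum(a * v for a, v in zip(row, x_true)) for row in A_demo]
x_hat = rbk(A_demo, b_demo, tau=2, alpha=1.0)

# With a single row, weight 1 and alpha = 1 the step is an exact projection.
x_proj = rbk_step(A_demo, b_demo, [0.0, 0.0], [2], [1.0], 1.0)
```

The demo uses the conservative stepsize $\alpha=1$; the extrapolated choices $\alpha_{k}>2$ discussed in the text would be passed through the same `alpha` argument.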

4.1 Randomized block Kaczmarz algorithm with constant stepsize

In this section we investigate the convergence rate of RBK algorithm for constant extrapolated stepsize $\alpha_{k}=\alpha>2$ and weights $\omega_{k}^{i}=\omega_{i}$ for all $k$ . Thus, the iteration of RBK (Algorithm 2) becomes in this case:

[TABLE]

The weights are chosen to satisfy $0<\omega_{\min}\leq\omega_{i}\leq\omega_{\max}<1$ for all $i$ and sum to $1$ . Let us also define the following stochastic conditioning parameter depending on the geometric properties of the submatrices $A_{J}$ :

[TABLE]

Then, we consider an extrapolated constant stepsize of the form:

[TABLE]

When we choose a random variable such that all the samplings satisfy $|J|=\tau$ , with $\tau\in[1,m]$ , then it is straightforward to see that $\lambda_{\max}^{\text{block}}<\tau$ provided that $\text{rank}(A_{J})\geq 2$ . Hence, in this case we use an over-relaxed (extrapolated) stepsize, since usually $2\omega_{\min}/\omega_{\max}^{2}\lambda_{\max}^{\text{block}}>2$ . For example, for $\omega_{i}=1/\tau$ , we get $2\tau/\lambda_{\max}^{\text{block}}>2$ . Using (12) we also define the positive semidefinite matrix $W$ as:

[TABLE]

To the best of our knowledge, the choice (16) for the stepsize in the block Kaczmarz algorithm seems to be new. The next theorem proves the convergence rate of this algorithm, which depends explicitly on the geometric properties of the system matrix $A$ and its submatrices $A_{J}$.

Theorem 4.1.

Let $\{x^{k}\}_{k\geq 0}$ be generated by RBK (Algorithm 2) with the particular update (15), i.e. the weights satisfy $0<\omega_{\min}\leq\omega_{i}\leq\omega_{\max}<1$ for all $i\in[m]$ and the stepsize $\alpha=\frac{(2-\delta)\omega_{\min}}{\omega_{\max}^{2}\lambda_{\max}^{\text{block}}}$ for some $\delta\in(0,1]$ . Then, we have the following linear convergence rate in expectation:

[TABLE]

Proof 4.2.

Since we assume a consistent linear system, that is there is $x^{*}$ such that $Ax^{*}=b$ , we have:

[TABLE]

We need to take conditional expectation over $J_{k}$ . However, for a general random sampling $J$ we have from (12) that the expectation over the first sum from above yields the lower bound:

[TABLE]

Thus, we obtained:

[TABLE]

Moreover, using that for any $Q\succeq 0$ we have $Q^{2}\preceq\lambda_{\text{max}}(Q)Q$ , the expectation over the second sum also yields the following upper bound:

[TABLE]

where recall that $\lambda_{\max}^{\text{block}}=\max_{J\sim\textbf{P}}\lambda_{\max}\left(A_{J}^{T}\text{diag}\left(\frac{1}{\|a_{i}\|^{2}},i\in J\right)A_{J}\right)$ . Therefore, taking conditional expectation w.r.t. the block $J_{k}$ over entire history ${\cal F}_{k}=\{J_{0},\cdots,J_{k-1}\}$ in the recurrence above, we get:

[TABLE]

In order to ensure decrease we need $2\alpha\omega_{\min}-\alpha^{2}\omega_{\max}^{2}\lambda_{\max}^{\text{block}}\geq 0$ , that is we get an extrapolated stepsize:

[TABLE]

and the optimal stepsize is obtained by maximizing $2\alpha\omega_{\min}-\alpha^{2}\omega_{\max}^{2}\lambda_{\max}^{\text{block}}$ in $\alpha$ which leads to:

[TABLE]

Hence, taking stepsize $\alpha=(2-\delta)\omega_{\min}/\omega_{\max}^{2}\lambda_{\max}^{\text{block}}$ for some $\delta\in(0,1]$ , we get:

[TABLE]

On the other hand, it is well-known from the Courant-Fischer theorem that for any matrix $A$ we have $\|Ax\|^{2}\geq\lambda_{\min}^{\text{nz}}(AA^{T})\|x\|^{2}$ for all $x\in\text{range}(A^{T})$ . Moreover, we have that $x-\Pi_{{\cal X}}(x)\in\text{range}(A^{T})$ for any $x$ . In conclusion, using that $W=A^{T}DA$ with the diagonal matrix $D=\text{diag}\left(\frac{p_{i}}{\|a_{i}\|^{2}},i\in[m]\right)$ invertible, we get that:

[TABLE]

Using this inequality in the recurrence above and taking expectation over the entire history we get:

[TABLE]

which shows an expected linear convergence rate for RBK depending on the parameters $\lambda_{\min}^{\text{nz}}(W)$ and $\lambda_{\max}^{\text{block}}$ associated to the system matrix $A$ and its submatrices $A_{J}$, respectively.

Now, let us consider the uniform and partition sampling examples of Section 3.4 where all the sampled blocks have the same size $|J|=\tau$. In this case we have $p_{i}=\frac{\tau}{m}$. Let us also consider the particular choices $\delta=1$, weights $\omega_{i}=1/\tau$, and matrices $A$ with normalized rows, i.e. $\|a_{i}\|=1$ for all $i\in[m]$. Hence, $\|A\|_{F}^{2}=m$. Then, our convergence rate (17) becomes:

[TABLE]

Comparing with the convergence rate (10) of the basic Kaczmarz method (recall that for normalized matrices $\|A\|_{F}^{2}=m$ ) we get an improvement $\frac{\tau}{\lambda_{\max}^{\text{block}}}>1,$ which shows that for RBK algorithm with the new extrapolated stepsize (16) we can get a speed-up even of order approximately $\tau$ compared to basic Kaczmarz algorithm on matrices with well-conditioned blocks (i.e. on matrices having $\lambda_{\max}^{\text{block}}\ll\tau$ ). Section 4.3 provides choices for the sampling that lead to a small stochastic conditioning parameter $\lambda_{\max}^{\text{block}}$ .
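For a small collection of blocks, the stochastic conditioning parameter and the resulting constant stepsize can be computed directly. The sketch below is an illustration under stated assumptions: it takes uniform weights $\omega_{i}=1/\tau$, so that the stepsize of Theorem 4.1 reduces to $\alpha=(2-\delta)\tau/\lambda_{\max}^{\text{block}}$, and it estimates each $\lambda_{\max}\left(A_{J}^{T}D_{J}A_{J}\right)$ by power iteration on the Gram matrix of the normalized rows, which has the same non-zero eigenvalues. The function names and the demo matrix are hypothetical.

```python
import math
import random

def lambda_max_block_matrix(A, J, iters=500, seed=0):
    """Largest eigenvalue of A_J^T D_J A_J, estimated by power iteration
    on the tau x tau Gram matrix of the normalized rows in J."""
    rows = []
    for i in J:
        norm = math.sqrt(sum(a * a for a in A[i]))
        rows.append([a / norm for a in A[i]])
    gram = [[sum(p * q for p, q in zip(r, s)) for s in rows] for r in rows]
    rng = random.Random(seed)
    v = [rng.uniform(0.5, 1.5) for _ in rows]
    lam = 0.0
    for _ in range(iters):
        w = [sum(g * vj for g, vj in zip(grow, v)) for grow in gram]
        norm_w = math.sqrt(sum(wi * wi for wi in w))
        if norm_w == 0.0:
            break
        lam = norm_w / math.sqrt(sum(vi * vi for vi in v))
        v = [wi / norm_w for wi in w]
    return lam

def constant_stepsize(A, blocks, delta=1.0):
    """alpha = (2 - delta) * tau / lambda_max_block, for weights 1/tau and
    equal-size blocks; the maximum runs over the listed blocks only."""
    lam_block = max(lambda_max_block_matrix(A, J) for J in blocks)
    tau = len(blocks[0])
    return (2.0 - delta) * tau / lam_block, lam_block

# Rows 0,1 are orthogonal; rows 2,3 are parallel (illustrative data).
A_demo = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [6.0, 0.0]]
alpha_good, lam_good = constant_stepsize(A_demo, [(0, 1)])
alpha_bad, lam_bad = constant_stepsize(A_demo, [(2, 3)])
```

A block of orthogonal rows gives $\lambda_{\max}^{\text{block}}=1$ and the largest admissible extrapolation, while a block of parallel rows gives $\lambda_{\max}^{\text{block}}=\tau$ and no gain over a single-row step. For uniform sampling the maximum would have to run over all $\binom{m}{\tau}$ subsets, which is why this direct computation is only practical for small instances or partition sampling.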

4.2 Randomized block Kaczmarz algorithm with adaptive stepsize

Since the previous algorithm involves a stepsize depending on $\lambda_{\max}^{\text{block}}$, which may be difficult to compute in large-scale settings (i.e. when the random variable $J$ is complicated and the number of rows $m$ is large), in this section we design a randomized block Kaczmarz algorithm with an adaptive stepsize, which does not require the computation of $\lambda_{\max}^{\text{block}}$. More precisely, we consider a variant of RBK (Algorithm 2) with variable weights and an adaptive stepsize approximating online $\lambda_{\max}^{\text{block}}$. For simplicity of the notation let us define $\bar{\omega}_{i}^{k}=\frac{\omega_{i}^{k}}{\|a_{i}\|^{2}}$. Then, we consider the iteration of RBK (Algorithm 2) with an adaptive extrapolated stepsize of the form:

[TABLE]

Note that we do not need to compute $L_{k}$ for the second case when implementing the algorithm. Recall that we consider weights satisfying $0<\omega_{\min}\leq\omega_{i}^{k}\leq\omega_{\max}<1$ for all $k,i$, and summing to $1$. Hence, from Jensen’s inequality we always have $L_{k}\geq 1$ and consequently $2L_{k}\geq 2$, i.e. we use extrapolation. Further, in our convergence analysis we take a stepsize of the form $\alpha_{k}=(2-\delta)L_{k}$ for some $\delta\in(0,\;1]$. Moreover, we denote $x_{k}^{*}=\Pi_{{\cal X}}(x^{k})$, that is the projection of $x^{k}$ onto the solution set of the linear system. It has been observed in practice that the block Kaczmarz iteration with this adaptive choice for the stepsize performs better than the same algorithm with stepsize $\alpha_{k}\in(0,2)$, see e.g. [1, 2, 3, 14, 21]. However, to our knowledge, there is no theory explaining why and when this adaptive method works. The next theorem proves the convergence rate of the adaptive algorithm, depending explicitly on the geometric properties of the system matrix $A$ and its submatrices $A_{J}$, and thereby answers the question of how to explain theoretically the observed practical efficiency of extrapolated block Kaczmarz methods.

Theorem 4.3.

Let $\{x^{k}\}_{k\geq 0}$ be generated by RBK (Algorithm 2) with the adaptive stepsize $\alpha_{k}=(2-\delta)L_{k}$ for some $\delta\in(0,1]$ and the weights satisfying $0<\omega_{\min}\leq\omega_{i}^{k}\leq\omega_{\max}<1$ for all $k,i$ . Then, we have the following linear convergence in expectation:

[TABLE]

Proof 4.4.

Using that $\langle x-x^{*},(a_{i}^{T}x-b_{i})a_{i}\rangle=(a_{i}^{T}x-b_{i})^{2}$ in the update of RBK, we get:

[TABLE]

Note that we get the same recurrence also for the trivial choice $L_{k}=1/\lambda_{\max}\left(A_{J_{k}}^{T}\text{diag}\left(\bar{\omega}_{i}^{k},i\in J_{k}\right)A_{J_{k}}\right)$ . Now let us bound $L_{k}$ . For the nontrivial case, using that $\lambda_{\max}(MN)=\lambda_{\max}(NM)$ for any two matrices $M$ and $N$ of appropriate dimensions, we have:

[TABLE]

This inequality holds trivially for the second choice (case) of $L_{k}$ . Therefore, we can further bound $L_{k}$ for all the cases as follows:

[TABLE]

Using this bound in the recurrence above we get:

[TABLE]

Taking now the conditional expectation and using again (12), we get:

[TABLE]

It is also known from the Courant-Fischer theorem that for any matrix $A$ we have $\|Ax\|^{2}\geq\lambda_{\min}^{\text{nz}}(AA^{T})\|x\|^{2}$ for all $x\in\text{range}(A^{T})$ . Moreover, we have that $x-\Pi_{{\cal X}}(x)\in\text{range}(A^{T})$ for any $x$ . In conclusion, using that $W=A^{T}DA$ , with the diagonal matrix $D=\text{diag}\left(\frac{p_{i}}{\|a_{i}\|^{2}},i\in[m]\right)$ invertible, we get that:

[TABLE]

Using this inequality in the recurrence above and taking expectation over the entire history we get:

[TABLE]

hence proving the statement of the theorem.

There is a tight connection between the constant stepsize (16) and the adaptive stepsize (20). Indeed, for simplicity let us consider uniform weights $\omega_{i}^{k}=1/\tau$ and normalized matrices ( $\|a_{i}\|=1$ for all $i,k$ ). Then, from (4.4) we obtain:

[TABLE]

Hence, $L_{k}$ represents an online approximation of $\tau/\lambda_{\max}^{\text{block}}$ and therefore:

[TABLE]

In conclusion, the adaptive stepsize (20) can be viewed as a practical online approximation of the constant extrapolated stepsize (16). Finally, let us simplify the convergence rate (21) for the uniform and partition sampling examples of Section 3.4 in which all the sampled blocks have the same size $|J|=\tau$. In this case we have $p_{i}=\frac{\tau}{m}$. Let us also consider the particular choices $\delta=1$, weights $\omega_{i}=1/\tau$, and normalized matrices $A$. Then, our convergence rate (21) becomes:

[TABLE]

We observe that this convergence rate coincides with (19). However, the adaptive block Kaczmarz scheme has more chances to accelerate, since from (23) the variable stepsize is, in general, larger than the constant stepsize counterpart.
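The adaptive extrapolation factor of this section can be sketched as follows. The displayed formula (20) is not reproduced here, so the expression used below is an assumption: the ratio familiar from extrapolated projection methods, $L_{k}=\sum_{i\in J_{k}}\bar{\omega}_{i}^{k}r_{i}^{2}\big/\big\|\sum_{i\in J_{k}}\bar{\omega}_{i}^{k}r_{i}a_{i}\big\|^{2}$ with $r_{i}=a_{i}^{T}x^{k}-b_{i}$, which satisfies $L_{k}\geq 1$ by Jensen's inequality. The fallback value `1.0` when the block direction vanishes is a placeholder (the step is zero in that case regardless), and all function names and data are hypothetical.

```python
def adaptive_extrapolation(A, b, x, J, weights):
    """Extrapolation factor L_k for block J at the point x (assumed form).

    L_k = sum_i wbar_i r_i^2 / || sum_i wbar_i r_i a_i ||^2,
    with r_i = a_i^T x - b_i and wbar_i = w_i / ||a_i||^2.
    Returns (L_k, direction); L_k is set to 1.0 if the direction is zero.
    """
    numerator = 0.0
    direction = [0.0] * len(x)
    for i, w in zip(J, weights):
        row = A[i]
        wbar = w / sum(a * a for a in row)
        r = sum(a * xi for a, xi in zip(row, x)) - b[i]
        numerator += wbar * r * r
        direction = [d + wbar * r * a for d, a in zip(direction, row)]
    denominator = sum(d * d for d in direction)
    if denominator == 0.0:
        return 1.0, direction  # placeholder: the update is zero anyway
    return numerator / denominator, direction

def adaptive_rbk_step(A, b, x, J, weights, delta=1.0):
    """One step with stepsize alpha_k = (2 - delta) * L_k."""
    L, direction = adaptive_extrapolation(A, b, x, J, weights)
    alpha = (2.0 - delta) * L
    return [xi - alpha * d for xi, d in zip(x, direction)], L

# Orthonormal rows: L_k equals tau = 2 and one step solves the block exactly.
A_orth = [[1.0, 0.0], [0.0, 1.0]]
x_new, L_orth = adaptive_rbk_step(A_orth, [0.0, 0.0], [3.0, 5.0], [0, 1], [0.5, 0.5])

# Correlated rows: L_k is smaller but still at least 1.
A_corr = [[1.0, 0.0], [0.6, 0.8]]
_, L_corr = adaptive_rbk_step(A_corr, [0.0, 0.0], [3.0, 5.0], [0, 1], [0.5, 0.5])
```

For uniform weights and normalized rows this ratio equals $\tau\|r\|^{2}/\|A_{J_{k}}^{T}r\|^{2}$, which is at least $\tau/\lambda_{\max}^{\text{block}}$, in line with the comparison between the adaptive and constant stepsizes made above.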

4.3 When block Kaczmarz works?

Comparing the convergence rates of RBK algorithm with the constant stepsize (16) and with the adaptive stepsize (20) given in (19) and (24), respectively, with the convergence rate of the basic Kaczmarz method given in (10), we obtain an improvement $\frac{\tau}{\lambda_{\max}^{\text{block}}}>1$ for the block variants. Recall that the stochastic conditioning parameter $\lambda_{\max}^{\text{block}}$ is defined as:

[TABLE]

Therefore, we can get a speed-up even of order approximately $\tau$ for well conditioned matrices, i.e. for matrices having $\lambda_{\max}^{\text{block}}\ll\tau$. This shows that the probability P plays a key role in defining the importance sampling procedure and consequently in the convergence behavior of RBK. Fortunately, the operator theory literature provides detailed information about the existence of such good probabilities defining the importance sampling. This is usually referred to in the literature as a good paving [16]. This section summarizes the main results from the literature on row paving and provides a technique for constructing a good paving. The idea is to find a random partition of the rows of the matrix $A$ such that each subset has approximately equal size. Results on the existence of good row pavings were derived e.g. in [26]:

Lemma 4.5.

Let $A$ be a normalized matrix with $m$ rows and $\theta\in(0,1)$. Then, there is a randomized partition $\{J_{1},\cdots\!,J_{\ell}\}$ of the row indices with $\ell\geq{\cal O}(\|A\|^{2}\log(1+m)/\theta^{2})$ such that $\lambda_{\max}^{\text{block}}\leq 1+\theta$.

Although this is only an existential result, the literature describes several efficient algorithms for constructing good row pavings. For example, assume that $\kappa$ is a permutation of the set $[m]=\{1,2,\cdots,m\}$ , chosen uniformly at random. For each $i=1:\ell$ , define the subsets:

[TABLE]

It is clear that $\{J_{1},\cdots,J_{\ell}\}$ is a random partition of $[m]$ into $\ell$ blocks of approximately equal size. For every normalized matrix, such a random partition leads to a row paving whose $\lambda_{\max}^{\text{block}}$ is relatively small.
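The random partition just described can be sketched as follows. The defining formula for the subsets is not reproduced above, so the chunk boundaries $\lfloor im/\ell\rfloor$ used here are an assumption that is consistent with blocks of approximately equal size; indices are 0-based and the name `random_row_partition` is hypothetical.

```python
import random

def random_row_partition(m, ell, seed=0):
    """Split a uniformly random permutation of range(m) into ell contiguous
    chunks whose sizes differ by at most one."""
    rng = random.Random(seed)
    kappa = list(range(m))
    rng.shuffle(kappa)  # kappa is a uniformly random permutation
    return [kappa[(i * m) // ell:((i + 1) * m) // ell] for i in range(ell)]

blocks = random_row_partition(m=10, ell=3)
sizes = [len(J) for J in blocks]
```

Each row index lands in exactly one block, so sampling a block uniformly from this list is an instance of the partition sampling of Section 3.4.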

Lemma 4.6.

Let $A$ be a normalized matrix with $m$ rows. Consider a randomized partition $\{J_{1},\cdots\!,J_{\ell}\}$ of the row indices with $\ell\geq\|A\|^{2}$ subsets. Then, $\{J_{1},\cdots,J_{\ell}\}$ is a row paving with the upper bound $\lambda_{\max}^{\text{block}}\leq 6\log(1+m)$ with probability at least $1-m^{-1}$.

A proof of this type of result appears in [25], see also [16]. By merging our theorems on the convergence of RBK algorithm with the previous result on the good paving, we obtain:

Theorem 4.7.

Let $A$ be a normalized matrix and $\{J_{1},\cdots,J_{\ell}\}$ be a random partition of the rows of $A$ , as given by Lemma 4.6, such that $\tau=m/\ell$ is a positive integer. Under the assumptions of Theorems 4.1 and 4.3, the randomized block Kaczmarz method, Algorithm 2, with weights $\omega_{i}^{k}=1/\tau=\ell/m$ for all $i,k$ , and constant stepsize (16) or adaptive stepsize (20) with $\delta=1$ , admits the convergence estimate:

[TABLE]

In conclusion, our new convergence analysis shows when a block variant of the Kaczmarz algorithm really works: choosing a block of $\tau>1$ rows at each step pays off when $\lambda_{\max}^{\text{block}}\ll\tau$. Hence, a distributed implementation of the RBK algorithm is most effective when the probability distribution P yields a partition of the rows into well-conditioned blocks. Otherwise, we can just apply the basic Kaczmarz algorithm with $\tau=1$. Moreover, our analysis shows that the optimal batchsize is of order $\tau\sim m/\|A\|^{2}$. Assuming, for simplicity, that $\tau=m/\ell$ is a positive integer, from Lemma 4.6

[TABLE]

holds with high probability, provided that the matrix $A$ satisfies the following inequality

[TABLE]

Recall that, for a normalized matrix $A$ with $m$ rows, the squared spectral norm $\|A\|^{2}$ attains its maximal value $m$ when $\text{rank}(A)=1$, i.e. its rows are identical. Therefore, the inequality (26) stipulates that the rows of $A$ must exhibit a large amount of diversity in order for the RBK algorithm with extrapolated stepsizes (16) or (20) to perform better than the basic Kaczmarz scheme. Note that convergence rates similar to (25) have been derived in [16] for the block projection Kaczmarz algorithm (3) with the particular stepsize $\alpha_{k}=1$. However, RBK requires the computation of $\tau$ scalar products in $\mathbb{R}^{n}$ at each iteration, so that its computational cost per iteration is ${\cal O}(\tau n)$, and thus cheaper than the one corresponding to block projection Kaczmarz (3), which requires solving a least-squares problem at each iteration in about ${\cal O}(\tau^{2}n)$.

5 Randomized block Kaczmarz algorithm with Chebyshev-based stepsize

Finally, we show that we can also choose extrapolated stepsizes in RBK (Algorithm 2) based on the roots of Chebyshev polynomials. For simplicity, we consider either the uniform or partition sampling of Section 3.4 having $|J|=\tau$. We also assume normalized matrices $A$ and constant weights $\omega_{k}^{i}=1/\tau$ for all $k,i$. Under these settings, for the RBK algorithm with Chebyshev-based stepsize we derive linear or sublinear convergence estimates depending on whether $\lambda_{\min}(AA^{T})\!>\!0$ or $\lambda_{\min}(AA^{T})\!=\!0$, respectively. Below we investigate these two cases.

5.1 Case 1: $\lambda_{\min}(AA^{T})>0$

We get the following linear convergence for this variant of RBK:

Theorem 5.1.

Assume a normalized matrix $A$ such that $\lambda_{\min}(AA^{T})>0$. Let $\{x^{k}\}_{k\geq 0}$ be generated by RBK (Algorithm 2) with the uniform or partition sampling and the weights $\omega_{k}^{i}=1/\tau$ for all $k,i$. Further, for a fixed number of iterations $k$, the stepsizes $\{\alpha_{j}\}_{j=0}^{k-1}$ depend on the roots of the Chebyshev polynomial of degree $k$ (see Appendix) as follows:

[TABLE]

where $\kappa$ is a permutation of $[0\!:\!k\!-\!1]$ . Then, we have the following linear convergence for expected iterates:

[TABLE]

Proof 5.2.

For the iteration of RBK (Algorithm 2) we have for any solution $x^{*}\in{\cal X}$ :

[TABLE]

Taking conditional expectation and using (12) with $p_{i}=\tau/m$ for uniform or partition sampling, we get:

[TABLE]

Multiplying from the left this recurrence with $A$ we get:

[TABLE]

or equivalently, using that $Ax^{*}=b$ and taking expectations over the entire history, we obtain:

[TABLE]

Iterating this recurrence and defining the matrix $G=\frac{1}{m}AA^{T}\in\mathbb{R}^{m\times m}$ we obtain:

[TABLE]

If we define the polynomial in the matrix $G$ as $P_{k}(G)=\prod_{j=0}^{k-1}\left(I_{m}-\alpha_{j}G\right)$ , then we can bound the norm of the expected residual by:

[TABLE]

Recall that we consider consistent linear system with $\lambda_{\min}(AA^{T})>0$ . Then, from standard reasoning the spectrum of $G=\frac{1}{m}AA^{T}$ satisfies $\Lambda(G)\subset\mathbb{R}_{++}$ . More precisely:

[TABLE]

Therefore, if we denote by $\lambda_{i}$ the $i$ th eigenvalue of $G$ , we have the following bound:

[TABLE]

In conclusion, we can choose the stepsizes $\alpha_{j}$ for $j=0:k-1$ such that $P_{k}(\lambda)=\prod_{j=0}^{k-1}\left(1-\alpha_{j}\lambda\right)$ is the polynomial least deviating from zero on the interval $[\ell,u]$ and satisfying $P_{k}(0)=1$ . It is well known that this is the polynomial given in terms of a Chebyshev polynomial (see Appendix for a brief review of the main properties of Chebyshev polynomials):

[TABLE]

Then, we can guarantee the following linear convergence in expectation (see Lemma 5.5 in Appendix):

[TABLE]

The stepsizes $\alpha_{j}$ , for $j=0:k-1$ , are chosen as the inverse roots of polynomial $P_{k}(\lambda)$ (see Appendix):

[TABLE]

where $\kappa$ is some fixed permutation of $[0:k-1]$ . We can also derive convergence rates in $\mathbf{E}\left[x^{k}-x_{k}^{*}\right]$ using that $\mathbf{E}\left[x^{k}-x_{k}^{*}\right]\in\text{range}(A^{T})$ , and consequently from Courant-Fischer lemma and (29) we have:

[TABLE]

thus proving the linear convergence estimate of the theorem.

From Jensen’s inequality we have $\|\mathbf{E}\left[\cdot\right]\|\leq\mathbf{E}\left[\|\cdot\|\right]$. In conclusion, $\|\mathbf{E}\left[\cdot\right]\|$ is a weaker criterion than $\mathbf{E}\left[\|\cdot\|\right]$. Note that convergence rates in the weaker criterion $\|\mathbf{E}\left[x^{k}-x_{k}^{*}\right]\|$ have also been given for another variant of the Kaczmarz algorithm in [22] and for the random coordinate descent method in [24]. Moreover, the convergence rate from Theorem 5.1 is the same as for the conjugate gradient method and it is optimal for this class of iterative schemes. However, since this rate does not depend on the size of the blocks $|J|$, we usually implement this accelerated variant of Kaczmarz by sampling single rows, that is, $|J|=1$.
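The Chebyshev-based stepsizes of Case 1 can be generated and checked numerically. The sketch below assumes the standard construction: on an interval $[\ell,u]$ with $0<\ell<u$, the stepsizes are the inverses of the roots of $T_{k}$ mapped affinely from $[-1,1]$ to $[\ell,u]$, and the resulting polynomial $P_{k}(\lambda)=\prod_{j}(1-\alpha_{j}\lambda)$ has maximal modulus $1/T_{k}\left(\frac{u+\ell}{u-\ell}\right)$ on $[\ell,u]$. The names `lo`, `hi`, the function names, and the numerical values are illustrative; in the setting of Theorem 5.1 the interval would be the spectral range of $G=\frac{1}{m}AA^{T}$.

```python
import math

def chebyshev_T(k, x):
    """Evaluate T_k(x) with the three-term recurrence (valid for any real x)."""
    t_prev, t_curr = 1.0, x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def chebyshev_stepsizes(lo, hi, k):
    """Inverse roots of T_k after mapping [-1, 1] to [lo, hi], 0 < lo < hi."""
    steps = []
    for j in range(k):
        c = math.cos((2 * j + 1) * math.pi / (2 * k))
        root = 0.5 * (hi + lo) + 0.5 * (hi - lo) * c
        steps.append(1.0 / root)
    return steps

def residual_polynomial(steps, lam):
    """P_k(lam) = prod_j (1 - alpha_j * lam); the order of the steps is irrelevant."""
    value = 1.0
    for alpha in steps:
        value *= 1.0 - alpha * lam
    return value

lo, hi, k = 0.1, 1.0, 8
steps = chebyshev_stepsizes(lo, hi, k)
grid = [lo + (hi - lo) * t / 1000 for t in range(1001)]
worst = max(abs(residual_polynomial(steps, lam)) for lam in grid)
bound = 1.0 / chebyshev_T(k, (hi + lo) / (hi - lo))
```

Because the product does not depend on the order of its factors, any permutation $\kappa$ of the stepsizes yields the same polynomial after $k$ steps; the ordering only affects intermediate iterates.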

5.2 Case 2: $\lambda_{\min}(AA^{T})=0$

In this case we get sublinear convergence for this variant of RBK:

Theorem 5.3.

Assume a normalized matrix $A$ such that $\lambda_{\min}(AA^{T})=0$. Let $\{x^{k}\}_{k\geq 0}$ be generated by RBK (Algorithm 2) with the uniform or partition sampling and the weights $\omega_{k}^{i}=1/\tau$ for all $k,i$. Further, for a fixed number of iterations $k$, the stepsizes $\{\alpha_{j}\}_{j=0}^{k-1}$ depend on the roots of the Chebyshev polynomial of degree $k$ as follows:

[TABLE]

where $\kappa$ is some permutation of $[0\!:\!k-1]$ . Then, we have the following sublinear convergence for the residual of the normal system in expectation:

[TABLE]

Proof 5.4.

From (28) we also get the relation:

[TABLE]

Now, if we consider the normal system $A^{T}Ax=A^{T}b$ , which coincides with $\nabla f(x)=0$ , we have:

[TABLE]

where $x^{*}$ denotes any solution of $Ax=b$ (recall that we consider consistent linear systems). If we define the matrix $G=\frac{1}{m}A^{T}A$ and the polynomial $Q_{k}(G)=G\prod_{j=0}^{k-1}\left(I_{n}-\alpha_{j}G\right)$ , then we obtain the following bound for the residual of the normal system in expectation:

[TABLE]

Since we assume $\lambda_{\min}(AA^{T})=\lambda_{\min}(A^{T}A)=0$ , then the spectrum of $G=\frac{1}{m}A^{T}A$ satisfies:

[TABLE]

Therefore, if we denote by $\lambda_{i}$ the $i$ th eigenvalue of $G$ , we have the following bound:

[TABLE]

In conclusion, we can choose the stepsizes $\alpha_{j}$ for $j=0:k-1$ such that $Q_{k}(\lambda)=\lambda\prod_{j=0}^{k-1}\left(1-\alpha_{j}\lambda\right)$ of degree $k+1$ is the polynomial least deviating from zero on the interval $[0,u]$ and satisfying $Q_{k}(0)=0$ and $Q_{k}^{\prime}(0)=1$ . We show below that this polynomial is also given in terms of a Chebyshev polynomial. Indeed, let us consider the closest root to $-1$ of the Chebyshev polynomial of degree $k+1$ (i.e. $T_{k+1}$ ):

[TABLE]

Then, we define the polynomial:

[TABLE]

Note that this polynomial satisfies the required properties: $\text{deg}(Q_{k})=k+1$, $Q_{k}(0)=\frac{uT_{k+1}(r_{k+1})}{(1-r_{k+1})T_{k+1}^{\prime}(r_{k+1})}=0$ (recall that $r_{k+1}$ is the $(k+1)$th root of $T_{k+1}$) and $Q_{k}^{\prime}(0)=\frac{T_{k+1}^{\prime}(r_{k+1})}{T_{k+1}^{\prime}(r_{k+1})}=1$. In conclusion, we get the following bound for this choice of $Q_{k}(\lambda)$:

[TABLE]

where in the inequality we used that $|T_{k+1}(x)|\leq 1$ for any $x\in[-1,\;1]$ and that the root $r_{k+1}\leq 0$ (see Appendix). Further, since $T_{k+1}(\cos(\theta))=\cos((k+1)\theta)$ , if we differentiate we get $\sin(\theta)T_{k+1}^{\prime}(\cos(\theta))=(k+1)\sin((k+1)\theta)$ . Now, for $r_{k+1}=\cos\left(\pi-\pi/(2k+2)\right)$ we obtain:

[TABLE]

for $k$ sufficiently large (we used that $\sin(\pi-\theta)\sim\theta$ for $\theta$ small). In conclusion, we get the following sublinear convergence (using the notation $\|u\|_{(AA^{T})}=\|A^{T}u\|$ ):

[TABLE]

for $k$ sufficiently large (i.e. for $k$ such that $\sin(\pi-\pi/(2k+2))\sim\pi/(2k+2)$ ). Finally, using that $\lambda_{\max}(A^{T}A)=\lambda_{\max}(AA^{T})$ we get (30). The stepsizes $\alpha_{j}$ , for $j=0:k-1$ , are chosen as the inverse roots of polynomial $Q_{k}(\lambda)$ (see Appendix):

[TABLE]

where $\kappa$ is some fixed permutation of $[0:k-1]$.
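The construction used in this proof can be checked numerically. The sketch below assumes the form $Q_{k}(\lambda)=\frac{u\,T_{k+1}\left(r_{k+1}+(1-r_{k+1})\lambda/u\right)}{(1-r_{k+1})T_{k+1}^{\prime}(r_{k+1})}$, which is reconstructed from the stated values of $Q_{k}(0)$ and $Q_{k}^{\prime}(0)$ rather than copied from the displayed formula. Under that assumption the stepsizes are the inverses of the non-zero roots of $Q_{k}$, and the maximum of $|Q_{k}|$ on $[0,u]$ equals $\frac{u\sin(\pi/(2k+2))}{(1-r_{k+1})(k+1)}$. Function names and the values of `u` and `k` are illustrative.

```python
import math

def case2_stepsizes(u, k):
    """Inverse non-zero roots of Q_k on [0, u], built from the roots of T_{k+1}."""
    deg = k + 1
    cheb_roots = [math.cos((2 * j + 1) * math.pi / (2 * deg)) for j in range(deg)]
    r = cheb_roots[-1]  # root of T_{k+1} closest to -1
    # lam is mapped to r + (1 - r) * lam / u, so lam = 0 corresponds to r.
    steps = [(1.0 - r) / (u * (c - r)) for c in cheb_roots[:-1]]
    return steps, r

def Q(steps, lam):
    """Q_k(lam) = lam * prod_j (1 - alpha_j * lam)."""
    value = lam
    for alpha in steps:
        value *= 1.0 - alpha * lam
    return value

u, k = 1.0, 10
steps, r = case2_stepsizes(u, k)
grid = [u * t / 2000 for t in range(2001)]
worst = max(abs(Q(steps, lam)) for lam in grid)
bound = u * math.sin(math.pi / (2 * k + 2)) / ((1.0 - r) * (k + 1))
```

Since $\sin(\pi/(2k+2))\approx\pi/(2k+2)$ for large $k$ and $1-r_{k+1}\geq 1$, the bound decays like $1/k^{2}$, which is the sublinear behavior stated in the theorem.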

Note that the RBK algorithm with Chebyshev-based stepsize belongs to the class of Chebyshev semi-iterative methods [6]. However, to our knowledge, this work is the first one that uses the properties of the Chebyshev polynomials in order to accelerate the convergence rate of the randomized block Kaczmarz (RBK) algorithm. Other types of acceleration of the Kaczmarz algorithm have been proposed e.g. in [7, 12, 22]. For example, in [22] two dependent steps of the basic randomized Kaczmarz algorithm are taken, one from $x^{k}$ and one from $x^{k-1}$, and then an affine combination of the results produces the next iterate $x^{k+1}$. For this scheme, [22] derives a similar convergence rate as in Theorem 5.1. In [12] Nesterov’s accelerated random coordinate descent method from [18] is applied to the dual problem (9), leading in the primal space to an accelerated randomized Kaczmarz scheme with momentum. For this accelerated Kaczmarz scheme [12] derives the convergence rate $\mathbf{E}\left[\|x^{k}-x_{k}^{*}\|^{2}\right]\leq(1-\sqrt{\lambda_{\min}(AA^{T})}/m)^{k}\|x^{0}-x_{0}^{*}\|^{2}$. Although this rate is worse than (27) in terms of constants, it is given in the stronger criterion $\mathbf{E}\left[\|x^{k}-x_{k}^{*}\|^{2}\right]$. It remains an open problem whether Theorem 5.1 can also be given in the stronger criterion $\mathbf{E}\left[\|x^{k}-x_{k}^{*}\|^{2}\right]$.

Appendix (Chebyshev polynomials)

In this section some properties of the Chebyshev polynomials are briefly reviewed. We refer to e.g. [19] for more details on Chebyshev polynomials. The Chebyshev polynomials $T_{k}(x)$ , where $\text{deg}(T_{k})=k$ and $k\geq 0$ , are defined by the recursive relation:

$$T_{0}(x)=1,\quad T_{1}(x)=x,\quad T_{k+1}(x)=2xT_{k}(x)-T_{k-1}(x)\quad\forall k\geq 1.$$

From the above recurrence we observe that the leading coefficient of $T_{k}(x)$ is $2^{k-1}$ , i.e. $T_{k}(x)=2^{k-1}x^{k}+\;\text{lower powers of}\;x$ . In particular, for $x\in[-1,\;1]$ , the Chebyshev polynomials can be written equivalently:

$$T_{k}(x)=\cos(k\arccos(x)).$$

The equivalence can be verified as follows using that $x=\cos(\theta)$ :

$$T_{k+1}(x)=\cos((k+1)\theta)=2\cos(\theta)\cos(k\theta)-\cos((k-1)\theta)=2xT_{k}(x)-T_{k-1}(x).$$

It follows that $T_{k}(1)=1$ . From this representation of $T_{k}(x)$ it also follows that:

[TABLE]

Moreover, all the $k$ roots of $T_{k}(x)$ are given by:

[TABLE]

In conclusion, we get also the following representation for $T_{k}(x)$ :

[TABLE]
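The three descriptions of $T_{k}$ reviewed above (recurrence, trigonometric form on $[-1,1]$, and product over the roots) can be compared numerically. This is a small sanity check under the standard indexing of the roots, $\cos\left(\frac{(2j+1)\pi}{2k}\right)$ for $j=0:k-1$, which is assumed here; the function names are hypothetical.

```python
import math

def cheb_recurrence(k, x):
    """T_k(x) from T_0 = 1, T_1 = x, T_{j+1} = 2 x T_j - T_{j-1}."""
    t_prev, t_curr = 1.0, x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def cheb_roots(k):
    """The k roots cos((2j + 1) pi / (2k)), j = 0, ..., k - 1."""
    return [math.cos((2 * j + 1) * math.pi / (2 * k)) for j in range(k)]

def cheb_product(k, x):
    """T_k(x) = 2^(k-1) * prod_j (x - root_j), for k >= 1."""
    value = 2.0 ** (k - 1)
    for root in cheb_roots(k):
        value *= x - root
    return value

k = 5
samples = [-1.0, -0.3, 0.0, 0.45, 1.0]
trig = [math.cos(k * math.acos(x)) for x in samples]
rec = [cheb_recurrence(k, x) for x in samples]
prod = [cheb_product(k, x) for x in samples]
at_roots = [cheb_recurrence(k, r) for r in cheb_roots(k)]
```

The recurrence is the form to use outside $[-1,1]$, where the trigonometric expression is not defined with real `acos`.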

It is also easy to see the following interval transformation $[\ell,\;u]\to[-1,\;1]$ through the relation:

[TABLE]

One important property of the Chebyshev polynomials is that $\frac{1}{2^{k-1}}T_{k}(x)$ has minimal deviation from $0$ among all polynomials of degree $k$ with leading coefficient $1$ on $[-1,\;1]$:

[TABLE]

An immediate consequence of the above property valid for Chebyshev polynomials is the following lemma:

Lemma 5.5.

Let $0<\ell<u$ and $T_{k}^{(\ell,u)}(x)=T_{k}\left(\frac{2x}{u-\ell}-\frac{u+\ell}{u-\ell}\right)$ . Then, the optimal value and the optimal polynomial $P_{k}^{*}$ of the following optimization problem are:

[TABLE]

Bibliography


[1] H. Bauschke, P. Combettes and S. Kruk, Extrapolation algorithm for affine-convex feasibility problems, Numerical Algorithms, 41(3):239–274, 2006.
[2] Y. Censor, Row action methods for huge sparse systems and their applications, SIAM Review, 23:444–466, 1981.
[3] Y. Censor, W. Chen, P. Combettes, R. Davidi and G. Herman, On the Effectiveness of Projection Methods for Convex Feasibility Problems with Linear Inequality Constraints, Computational Optimization and Applications, 51(3):1065–1088, 2012.
[4] F. Deutsch and H. Hundal, The rate of convergence for the method of alternating projections, Journal of Mathematical Analysis and Applications, 205(2):381–405, 1997.
[5] T. Elfving, Block-iterative methods for consistent and inconsistent linear equations, Numer. Math., 35(1):1–12, 1980.
[6] G. Golub and R. Varga, Chebyshev semi-iterative methods, successive over-relaxation methods and second-order Richardson iterative methods I, II, Numer. Math., 3:147–168, 1961.
[7] M. Hanke and W. Niethammer, On the acceleration of Kaczmarz’s method for inconsistent linear systems, Linear Algebra Applications, 130:83–98, 1990.
[8] R. Gower and P. Richtarik, Randomized iterative methods for linear systems, SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015.
