An Implicit Representation and Iterative Solution of Randomly Sketched Linear Systems

Vivak Patel; Mohammad Jahangoshahi; Daniel Adrian Maldonado

arXiv:1904.11919·math.NA·December 23, 2020·SIAM J. Matrix Anal. Appl.

TL;DR

This paper introduces an implicit, iterative approach to randomized sketching for linear systems that addresses practical issues like sketch size selection and storage costs, while enhancing convergence analysis and rates.

Contribution

It proposes an implicit method for solving sketched linear systems that eliminates the need for pre-determined sketch size and reduces storage costs, connecting sketching with randomized iterative solvers.

Findings

01

Improved convergence theory for randomized iterative solvers under various sampling schemes.

02

Enhanced convergence rates with controlled computational and storage costs.

03

Validated approach on forty-nine different linear systems.

Abstract

Randomized linear system solvers have become popular as they have the potential to reduce floating point complexity while still achieving desirable convergence rates. One particularly promising class of methods, random sketching solvers, has achieved the best known computational complexity bounds in theory, but is blunted by two practical considerations: there is no clear way of choosing the size of the sketching matrix a priori; and there is a nontrivial storage cost of the sketched system. In this work, we make progress towards addressing these issues by implicitly generating the sketched system and solving it simultaneously through an iterative procedure. As a result, we replace the question of the size of the sketching matrix with determining appropriate stopping criteria; we also avoid the costs of explicitly representing the sketched linear system; and our implicit representation…

Tables

Table 1: A summary of the time and total computational cost (effort) incurred by Algorithm 1 and its parallelized variants. We do not report any advantages that should be exploited when $A$ or $w$ are sparse. In the shared and distributed memory platforms, we assume that there are $p$ processors dedicated to computing $A^{\prime}w$ and $b^{\prime}w$, and one processor dedicated to computing the updates. The “Network” column refers to whether communication costs over a network are incurred.

Total Time and Effort Costs to Iteration $k$
Platform	Time to compute $[A\ b]^{\prime}w$	Total effort to compute $[A\ b]^{\prime}w$	Iterate update cost	Matrix update cost	Network
Sequential	$\mathcal{O}[knd]$	$\mathcal{O}[knd]$	$\mathcal{O}[kd^{2}]$	$\mathcal{O}[kd^{3}]$	No
Shared Memory	$\mathcal{O}[knd/p]$	$\mathcal{O}[knd]$	$\mathcal{O}[kd^{2}]$	$\mathcal{O}[kd^{3}]$	No
Distributed Memory	$\mathcal{O}[knd/p^{2}]$	$\mathcal{O}[knd/p]$	$\mathcal{O}[kd^{2}]$	$\mathcal{O}[kd^{3}]$	Yes

Table 2: A comparison of the estimated theoretical bounds on the rates of convergence of Gaussian-sketched base randomized methods in $\ell^{2}$ between this work and the results in Richtárik and Takác (2020). The estimates are made by simulation of the theoretical rates. The comparison is made on five different matrices available in MatrixDepot, as described in Section 5. The main message is that the results of Richtárik and Takác (2020) are tighter than our result, as they apply to the average case. This is expected, as our result applies to more than just the i.i.d. sampling case and holds with probability one (asymptotically).

Comparison of Estimated Theoretical Rates of Convergence
Matrix Name	Theorem 4.8 of Richtárik and Takác (2020)	Proposition 3
deriv2	$1 - \mathcal{O}[10^{-4}]$	$1 - \mathcal{O}[10^{-35}]$
heat	$1 - \mathcal{O}[10^{-15}]$	$1 - \mathcal{O}[10^{-34}]$
randsvd	$1 - \mathcal{O}[10^{-15}]$	$1 - \mathcal{O}[10^{-71}]$
ursell	$1 - \mathcal{O}[10^{-16}]$	$1 - \mathcal{O}[10^{-161}]$
wing	$1 - \mathcal{O}[10^{-16}]$	$1 - \mathcal{O}[10^{-163}]$

Equations

$$A x^{*} = b.$$

$$x_{k+1} = x_{k} + V_{k}(b - A x_{k}),$$

$$P[I = i] = \begin{cases} \dfrac{\|A_{i,}\|_{2}^{2}}{\|A\|_{F}^{2}} & i = 1,\ldots,n, \\ 0 & \text{otherwise}. \end{cases}$$

$$x_{k+1} = x_{k} + \frac{A_{I,} e_{I}^{\prime}(b - A x_{k})}{\|A_{I,}\|_{2}^{2}} = x_{k} + \frac{A_{I,}(b_{I} - A_{I,}^{\prime} x_{k})}{\|A_{I,}\|_{2}^{2}},$$

$$P[J = j] = \begin{cases} \dfrac{\|A_{,j}\|_{2}^{2}}{\|A\|_{F}^{2}} & j = 1,\ldots,d, \\ 0 & \text{otherwise}. \end{cases}$$

$$x_{k+1} = x_{k} + \frac{e_{J} A_{,J}^{\prime}(b - A x_{k})}{\|A_{,J}\|_{2}^{2}},$$

$$x_{k+1} = x_{k} + E_{T}(E_{T}^{\prime} A^{\prime} A E_{T})^{\dagger} E_{T}^{\prime} A^{\prime}(b - A x_{k}),$$

$$x_{k+1} = x_{k} + A^{\prime} N_{k}^{\prime}(N_{k} A A^{\prime} N_{k}^{\prime})^{\dagger} N_{k}(b - A x_{k}),$$

$$x_{k+1} = x_{k} + M_{k} V_{k}(b - A x_{k}).$$

$$\|x_{k+1} - x^{*}\|_{B}^{2},$$

$$S_{k+1} = (I - M_{k} V_{k} A) S_{k} (I - M_{k} V_{k} A)^{\prime},$$

$$\|x_{k+1} - x^{*}\|_{B}^{2} = \mathrm{tr}\left[B (I - M_{k} V_{k} A) S_{k} (I - M_{k} V_{k} A)^{\prime}\right].$$

$$M_{k}(V_{k} A S_{k} A^{\prime} V_{k}^{\prime}) - S_{k} A^{\prime} V_{k}^{\prime} = 0.$$

$$M_{k} = S_{k} A^{\prime} V_{k}^{\prime}(V_{k} A S_{k} A^{\prime} V_{k}^{\prime})^{\dagger},$$

$$S_{k+1} = S_{k} - S_{k} A^{\prime} V_{k}^{\prime}(V_{k} A S_{k} A^{\prime} V_{k}^{\prime})^{\dagger} V_{k} A S_{k}.$$

$$M_{k} = \begin{cases} \dfrac{S_{k} A^{\prime} w_{k} v_{k}^{\prime}}{w_{k}^{\prime} A S_{k} A^{\prime} w_{k} \, \|v_{k}\|_{2}^{2}} & S_{k} A^{\prime} w_{k} \neq 0, \\ 0 & \text{otherwise}, \end{cases}$$

$$S_{k+1} = \begin{cases} S_{k} - \dfrac{S_{k} A^{\prime} w_{k} w_{k}^{\prime} A S_{k}}{w_{k}^{\prime} A S_{k} A^{\prime} w_{k}} & S_{k} A^{\prime} w_{k} \neq 0, \\ S_{k} & \text{otherwise}. \end{cases}$$

$$x_{k+1} = \begin{cases} x_{k} + \dfrac{S_{k} A^{\prime} w_{k} w_{k}^{\prime}(b - A x_{k})}{w_{k}^{\prime} A S_{k} A^{\prime} w_{k}} & S_{k} A^{\prime} w_{k} \neq 0, \\ x_{k} & \text{otherwise}. \end{cases}$$

$$x_{k+1} = x_{k} + \frac{S_{k} A_{I,}(b_{I} - A_{I,}^{\prime} x_{k})}{A_{I,}^{\prime} S_{k} A_{I,}}, \qquad S_{k+1} = S_{k} - \frac{S_{k} A_{I,} A_{I,}^{\prime} S_{k}}{A_{I,}^{\prime} S_{k} A_{I,}},$$

$$x_{k+1} = x_{k} + \frac{S_{k} A^{\prime} A_{,J} A_{,J}^{\prime}(b - A x_{k})}{A_{,J}^{\prime} A S_{k} A^{\prime} A_{,J}}, \qquad S_{k+1} = S_{k} - \frac{S_{k} A^{\prime} A_{,J} A_{,J}^{\prime} A S_{k}}{A_{,J}^{\prime} A S_{k} A^{\prime} A_{,J}},$$

$$\begin{bmatrix} w_{0}^{\prime} \\ w_{1}^{\prime} \\ \vdots \\ w_{r-1}^{\prime} \end{bmatrix}$$

$$\begin{bmatrix} R_{1} E_{1} & R_{2} E_{2} & \cdots & R_{n} E_{n} \end{bmatrix},$$

$$U := \{i : u[i] \neq 0\} \subset \bigcup_{j \in Q} Z_{j} \cup Q,$$

$$(I_{d} - Z Z^{\prime})(I_{d} - Z Z^{\prime}) q = (I_{d} - 2 Z Z^{\prime} + Z Z^{\prime}) q = (I_{d} - Z Z^{\prime}) q = u,$$

$$u[l] = q[l] - \sum_{j=1}^{k} (q^{\prime} z_{j}) z_{j}[l] = q[l] - \sum_{j \in Q} (q^{\prime} z_{j}) z_{j}[l],$$

$$l \notin \bigcup_{j \in Q} Z_{j} \cup Q.$$

Full text

An Implicit Representation and Iterative Solution of Randomly Sketched Linear Systems

Vivak Patel, Mohammad Jahangoshahi & Daniel Adrian Maldonado

Abstract

Randomized linear system solvers have become popular as they have the potential to reduce floating point complexity while still achieving desirable convergence rates. One particularly promising class of methods, random sketching solvers, has achieved the best known computational complexity bounds in theory, but is blunted by two practical considerations: there is no clear way of choosing the size of the sketching matrix a priori; and there is a nontrivial storage cost of the sketched system. In this work, we make progress towards addressing these issues by implicitly generating the sketched system and solving it simultaneously through an iterative procedure. As a result, we replace the question of the size of the sketching matrix with determining appropriate stopping criteria; we also avoid the costs of explicitly representing the sketched linear system; and our implicit representation also solves the system at the same time, which controls the per-iteration computational costs.

Additionally, our approach allows us to generate a connection between random sketching methods and randomized iterative solvers (e.g., randomized Kaczmarz method, randomized Gauss-Seidel). As a consequence, we exploit this connection to (1) produce a stronger, more precise convergence theory for such randomized iterative solvers under arbitrary sampling schemes (i.i.d., adaptive, permutation, dependent, etc.), and (2) improve the rates of convergence of randomized iterative solvers at the expense of user-determined increases in per-iteration computational and storage costs. We demonstrate these concepts with numerical experiments on forty-nine distinct linear systems.

1 Introduction

Over the past few decades, randomized linear system solvers have become popular as they have the potential to reduce floating point complexity or maintain limited memory footprints, while still achieving desirable convergence rates (e.g., Strohmer and Vershynin, 2009; Woodruff, 2014). In particular, the noniterative class of randomized linear system solvers, based on random matrix sketching (see Woodruff, 2014), has exceptionally low computational complexities, at least in theory. Unfortunately, the theoretical promise of these random matrix sketching solvers is blunted by their practical limitations: there is no clear way of choosing the size of the sketching matrix and there is a nontrivial storage cost of the projected system (Mahoney, 2016). In fact, the practical challenges of random matrix sketching solvers have prevented them from being fully embraced by the numerical optimization community (e.g., Nocedal, 2018).

In this work, we begin to address these two primary practical issues of random matrix sketching, which we recall are: the challenge of choosing the size of the sketching matrix, and the challenge of storing the projected system. Our main insight is to recast the separate sketch-then-solve core of random sketching methods into an equivalent, iterative sketch-and-solve, in which the sketching matrix is generated incrementally without being explicitly stored and the system is incrementally solved from the implicitly derived sketched matrix. (Footnote 1: It is worth mentioning that the random sketch solvers have been used iteratively in a different sense (e.g., see Gower and Richtárik, 2015): the noniterative scheme is simply repeated in order to get better convergence properties. We are not doing this, but rather turning the noniterative scheme into an iterative one.) As a result of our approach, (1) we can implicitly grow the size of the sketching matrix until a user-determined stopping criterion is reached without having to determine the size of the sketching matrix a priori; (2) we implicitly represent the sketched system without having to explicitly store the projected system, which allows us to avoid the cost of storing the projected system; and (3) we can naturally implement random sketching solvers within distributed and parallel computing paradigms. Thus, our approach of converting the usual sketch-then-solve procedure to a sketch-and-solve procedure begins to address the aforementioned practical challenges of random matrix sketching.

Moreover, our approach provides a bridge between the newer concerns around sketching-based solvers and more classical areas of applied mathematics research such as stopping criteria. One such bridge is the placement of random sketching methods and (what we will call) base randomized iterative methods on a single spectrum of procedures, which has several immediate consequences. (Footnote 2: We will be more precise about what we refer to as base methods. For now, such methods are exemplified by randomized Kaczmarz (Strohmer and Vershynin, 2009) and randomized Gauss-Seidel (Leventhal and Lewis, 2010).)

First, the number of rows of the sketching matrix that results in the solution (this number is a random quantity) connects to an alternative rate-of-convergence result for general base randomized iterative methods that guarantees a rate of convergence less than one for arbitrary sampling schemes, even for underdetermined systems (Theorem 5). Consequently, our results complement and improve on previous results in several ways. In particular, we allow for arbitrary sampling schemes, not just sampling schemes that are independent and identically distributed as in Gower and Richtárik (2015) (Lemma 4.2), Richtárik and Takác (2020) (Theorem 4.8), Zouzias and Freris (2013) (Theorem 3.4), and Ma et al. (2015) (Equation 3.10). Moreover, our results do away with the exactness assumption (see Richtárik and Takác, 2020, Assumption 2), and precisely characterize the inexactness that can occur for arbitrary sampling schemes (Theorems 5 and 6). Additionally, our results define convergence on a maximal set (effectively, a set occurring with probability one for sampling schemes of interest), which builds on the work of Chen and Powell (2012). As example applications of our results, we supply rates of convergence with probability one for random permutation sampling methods (Proposition 2) and independent, identically distributed sampling schemes (asymptotically, see Proposition 3). As a more interesting application of our results, we specify generic conditions for the convergence of a broad class of adaptive schemes (see Subsection 4.5), which can account for the maximum residual scheme, the maximum distance scheme, schemes that randomize over a greedy subset, and schemes that are greedy over randomized subsets (Motzkin and Schoenberg, 1954; Gubin et al., 1967; Lent, 1976; Censor, 1981; Nutini et al., 2016; Bai and Wu, 2018; Haddock and Ma, 2019). We note that the rates that we provide as examples are rather loose in comparison to results that are specialized to each case, yet our results often supply information that is not available in these other results, as discussed above.

Second, we can generate a series of “intermediate” procedures between sketching methods and base methods that trade off computational resources (e.g., floating-point operations, storage) against rates of convergence. Thus, we can take a sketching method and reduce its computational footprint in exchange for a slower rate of convergence, or increase the computational footprint of base methods to improve their rate of convergence (Algorithm 2). Moreover, these “intermediate” procedures can be readily parallelized, as we discuss in Section 2.

Finally, by shifting our perspective from improving the sketch-then-solve procedure to improving the performance of base methods, we find that our approach is a randomized orthogonalization procedure in the row space of the coefficient matrix of the linear system. Thus, by presenting our approach from this latter perspective, we will simplify the introduction and the related theory of our approach. Now, before pursuing this further, we reiterate our main contributions.

1. First, we turn the typical sketch-then-solve noniterative random sketching solver into an iterative, sketch-and-solve method, which lays a foundation for addressing the previously enumerated practical challenges of random sketching solvers: there is no clear way of choosing the size of the sketching matrix a priori; and there is a nontrivial storage cost of the sketched system.

2. Second, through our approach, we place random sketching methods and base randomized iterative methods (e.g., randomized Kaczmarz, randomized Gauss-Seidel, and Sketch-and-Project (Gower and Richtárik, 2015)) on a single spectrum of methods.

3. Third, owing to this connection, we are able to generate “intermediate” methods between random sketching and base methods, which can trade off computational resources against rates of convergence.

4. Fourth, owing to this connection, we use the geometric implications of random sketching methods to develop an alternative rate-of-convergence result for general base methods for arbitrarily determined systems and arbitrary sampling schemes, which advances the with-probability-one results of Chen and Powell (2012), generalizes the deterministic cyclic results in Bai and Liu (2013); Galántai (2005); Wallace and Sekmen (2014), complements the mean-squared error results of Richtárik and Takác (2020), and accounts for a litany of adaptive methods considered in Motzkin and Schoenberg (1954); Gubin et al. (1967); Lent (1976); Censor (1981); Nutini et al. (2016); Bai and Wu (2018); Haddock and Ma (2019).

5. Finally, we provide a generic set of conditions for characterizing a broad class of adaptive methods, and, from these conditions, prove convergence and rate-of-convergence results for a number of classical and emerging adaptive methods in the literature under a unified framework (see Subsection 4.5).

The remainder of this paper is organized as follows. In Section 2, we introduce our procedure; we state the connection between our procedure and random sketching methods, which allows us to convert the less practical sketch-then-solve approach to our sketch-and-solve approach; and, finally, we introduce our general algorithm and variants for low-memory environments, shared memory environments, distributed memory environments, and large, sparse, structured linear systems. In Sections 3 and 4, we develop the convergence theory for the two methodological extremes—sketching and base methods—leaving the intermediate, more complex cases to future work, and discuss particular examples. In Section 5, we test our algorithms on forty-nine distinct linear systems. In Section 6, we conclude this work and preview future efforts.

2 Our Procedure

While our motivating application is to address the practicality of random sketching methods, our approach is best introduced from the perspective of base randomized iterative methods. Here, we review the basic formulation of randomized iterative methods (Subsection 2.1), which we then use to heuristically introduce our general procedure (Subsection 2.2). We then refine our procedure for the case of rank-one methods, such as Randomized Kaczmarz and Randomized Gauss-Seidel, which allows us to restate random sketching from a sketch-then-solve procedure to a sketch-and-solve procedure (Subsection 2.3). We conclude this section with comments on algorithmic refinements for parallel platforms (Subsection 2.4.1), limited memory platforms (Subsection 2.4.2), and, for structured linear systems, limited communication platforms (Subsection 2.4.3).

2.1 A Brief Overview

Let $A\in\mathbb{R}^{n\times d}$ and $b\in\mathbb{R}^{n}$ be the coefficient matrix and constant vector, respectively. Assuming consistency, our goal is to determine an $x^{*}\in\mathbb{R}^{d}$ , not necessarily unique, such that

$$A x^{*} = b. \tag{1}$$

In a base randomized iterative approach, a sequence of iterates $\{x_{k}:k+1\in\mathbb{N}\}$ is generated that has the form

$$x_{k+1} = x_{k} + V_{k}(b - A x_{k}), \tag{2}$$

where $V_{k}\in\mathbb{R}^{d\times n}$ are independent random variables, which we call residual projection matrices (RPM). The RPM defines the base technique which is being used. To make this formulation concrete, we give several examples of randomized iterative methods that have this formulation.

Randomized Kaczmarz.

Let $A_{i,}\in\mathbb{R}^{d}$ denote the $i^{\text{th}}$ row of $A$ and let $e_{i}$ denote the $i^{\text{th}}$ standard basis vector of dimension $n$ . Define the random variable $I$ such that

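$$P[I = i] = \begin{cases} \dfrac{\|A_{i,}\|_{2}^{2}}{\|A\|_{F}^{2}} & i = 1,\ldots,n, \\ 0 & \text{otherwise}. \end{cases}$$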

Now, given an independent copy of $I$ at each $k$ , define the RPM, $V_{k}=A_{I,}e_{I}^{\prime}/\left\|A_{I,}\right\|_{2}^{2}.$ Then, using Eq. 2,

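$$x_{k+1} = x_{k} + \frac{A_{I,} e_{I}^{\prime}(b - A x_{k})}{\|A_{I,}\|_{2}^{2}} = x_{k} + \frac{A_{I,}(b_{I} - A_{I,}^{\prime} x_{k})}{\|A_{I,}\|_{2}^{2}},$$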

which is the Randomized Kaczmarz method of Strohmer and Vershynin (2009). $\quad\blacksquare$
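A minimal Python sketch of the update above may help make the sampling and projection steps concrete. The function name `randomized_kaczmarz`, the small consistent test system, the iteration count, and the seed are illustrative assumptions rather than anything specified in the text; plain lists are used only to keep the example self-contained.

```python
import random

def randomized_kaczmarz(A, b, num_iters=2000, seed=0):
    """Randomized Kaczmarz: rows are drawn with probability
    proportional to their squared Euclidean norms."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    row_norms_sq = [sum(a * a for a in row) for row in A]
    x = [0.0] * d  # assumed starting point x_0 = 0
    for _ in range(num_iters):
        # Draw I with P[I = i] = ||A_i||^2 / ||A||_F^2.
        i = rng.choices(range(n), weights=row_norms_sq, k=1)[0]
        row = A[i]
        # Scalar residual of the selected equation: b_I - A_I' x_k.
        residual = b[i] - sum(a * xj for a, xj in zip(row, x))
        step = residual / row_norms_sq[i]
        # Project x_k onto the hyperplane {x : A_I' x = b_I}.
        x = [xj + step * a for xj, a in zip(x, row)]
    return x

# Hypothetical consistent system built from a known solution.
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, 2.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x_hat = randomized_kaczmarz(A, b)
```

Each pass touches a single row of $A$, which is why the per-iteration cost is proportional to $d$ rather than $nd$.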

Randomized Gauss-Seidel.

Let $A_{,j}\in\mathbb{R}^{n}$ denote the $j^{\text{th}}$ column of $A$ and let $e_{j}$ denote the $j^{\text{th}}$ standard basis vector of dimension $d$ (in this example, the basis vectors are $d$-dimensional rather than $n$-dimensional). Define a random variable $J$ such that

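$$P[J = j] = \begin{cases} \dfrac{\|A_{,j}\|_{2}^{2}}{\|A\|_{F}^{2}} & j = 1,\ldots,d, \\ 0 & \text{otherwise}. \end{cases}$$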

Now, given an independent copy of $J$ at each $k$ , define the RPM, $V_{k}=e_{J}A_{,J}^{\prime}/\left\|A_{,J}\right\|_{2}^{2}.$ Then, using Eq. 2,

$$x_{k+1} = x_{k} + \frac{e_{J} A_{,J}^{\prime}(b - A x_{k})}{\|A_{,J}\|_{2}^{2}},$$

which is the Randomized Gauss-Seidel method of Leventhal and Lewis (2010). $\quad\blacksquare$
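The column-action counterpart can be sketched in the same way. In this illustration the residual $b - Ax_{k}$ is stored and updated in place so that each iteration touches only one column; the function name `randomized_gauss_seidel`, the test system, and the iteration count are assumptions made for the example.

```python
import random

def randomized_gauss_seidel(A, b, num_iters=2000, seed=0):
    """Randomized Gauss-Seidel: columns are drawn with probability
    proportional to their squared Euclidean norms."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    col_norms_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(d)]
    x = [0.0] * d   # assumed starting point x_0 = 0
    r = list(b)     # residual b - A x for x = 0
    for _ in range(num_iters):
        # Draw J with P[J = j] = ||A_{,j}||^2 / ||A||_F^2.
        j = rng.choices(range(d), weights=col_norms_sq, k=1)[0]
        # Step along coordinate J: A_{,J}' (b - A x_k) / ||A_{,J}||^2.
        step = sum(A[i][j] * r[i] for i in range(n)) / col_norms_sq[j]
        x[j] += step
        # Keep the stored residual equal to b - A x.
        for i in range(n):
            r[i] -= step * A[i][j]
    return x

# Hypothetical consistent system with full column rank.
A_gs = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true_gs = [1.0, 2.0]
b_gs = [sum(a * t for a, t in zip(row, x_true_gs)) for row in A_gs]
x_hat_gs = randomized_gauss_seidel(A_gs, b_gs)
```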

Randomized Block Coordinate Descent.

Let $t$ be a subset of $\{1,\ldots,d\}$. Let $E_{t}\in\mathbb{R}^{d\times|t|}$ be the matrix whose columns are the $d$-dimensional standard basis vectors whose non-zero components correspond to the indices in $t$. Let $\mathcal{T}$ be a partition of $\{1,\ldots,d\}$, and define a random variable $T$ that randomly selects an element of $\mathcal{T}$. Given an independent copy of $T$ at each $k$, define the RPM, $V_{k}=E_{T}(E_{T}^{\prime}A^{\prime}AE_{T})^{\dagger}E_{T}^{\prime}A^{\prime}$, where the leading $E_{T}$ maps the block update back to $\mathbb{R}^{d}$ so that $V_{k}\in\mathbb{R}^{d\times n}$. Then, using Eq. 2,

$$x_{k+1} = x_{k} + E_{T}(E_{T}^{\prime} A^{\prime} A E_{T})^{\dagger} E_{T}^{\prime} A^{\prime}(b - A x_{k}),$$

which is a version of the randomized block coordinate descent method specified by (Gower and Richtárik, 2015, Equation 3.14). $\quad\blacksquare$

Sketch-and-Project.

Let $\{N_{0},N_{1},\ldots\}$ be a sequence of sketching matrices with $n$ columns. Define the $k^{\mathrm{th}}$ RPM to be $V_{k}=A^{\prime}N_{k}^{\prime}(N_{k}AA^{\prime}N_{k}^{\prime})^{\dagger}N_{k}$ . Then, using Eq. 2,

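$$x_{k+1} = x_{k} + A^{\prime} N_{k}^{\prime}(N_{k} A A^{\prime} N_{k}^{\prime})^{\dagger} N_{k}(b - A x_{k}),$$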

which is the general sketch-and-project method (Gower and Richtárik, 2015, Equation 2.2). $\quad\blacksquare$

2.2 A Heuristic Derivation

Here, given a strategy for defining $\{V_{k}:k+1\in\mathbb{N}\}$, we consider how to augment the randomized iterative method with prior information in order to improve convergence. For this purpose, we propose defining a sequence of matrices $\{M_{k}:k+1\in\mathbb{N}\}\subset\mathbb{R}^{d\times d}$ (discussed below) and modifying Eq. 2 to be

$$x_{k+1} = x_{k} + M_{k} V_{k}(b - A x_{k}). \tag{3}$$

Of course, $M_{k}$ can simply be absorbed by $V_{k}$ ; however, our goal is to augment a randomized iterative method. For this reason, we will keep these two quantities separate.

The main question now is how to choose $\{M_{k}:k+1\in\mathbb{N}\}$ . Our guiding principle is that $M_{k}$ should minimize some measure of error between $x_{k+1}$ and $x^{*}$ . However, implementing this guiding principle requires (1) choosing an appropriate error measure and (2) handling the fact that $x^{*}$ is unknown. In order to convey the intuition behind our procedure, we now state the heuristics that we use to make these choices.

Choosing an Error Measure.

Temporarily, suppose $x^{*}$ is known, and suppose we choose the $\ell^{1}$ error as our measure. Then, we must minimize the $\ell^{1}$ norm of the difference between the next iterate and $x^{*}$. While this error measure might have merit, minimizing it is a convex optimization problem that is as difficult to solve as the original linear system. Therefore, we need an error measure that gives an explicit representation for $M_{k}$. One sensible choice is the Mahalanobis norm,

$$\|x_{k+1} - x^{*}\|_{B}^{2}, \tag{4}$$

where $B$ is a symmetric positive definite matrix in $\mathbb{R}^{d\times d}$.

Compensating for the Unknown Solution.

Now, we consider the task of compensating for the unknown $x^{*}$ . For a fixed $x^{*}$ and for all $k+1\in\mathbb{N}$ , let $S_{k}=(x_{k}-x^{*})(x_{k}-x^{*})^{\prime}$ . Then, $S_{k+1}$ is related to $S_{k}$ by

$$S_{k+1} = (I - M_{k} V_{k} A) S_{k} (I - M_{k} V_{k} A)^{\prime}, \tag{5}$$

where we have made use of Eq. 3. Using Eq. 5, we can rewrite Eq. 4 as

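$$\|x_{k+1} - x^{*}\|_{B}^{2} = \mathrm{tr}\left[B (I - M_{k} V_{k} A) S_{k} (I - M_{k} V_{k} A)^{\prime}\right].$$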

To find an optimal $M_{k}$, we differentiate the right-hand side with respect to $M_{k}$ and set the result equal to zero, which, explicitly, is

$$M_{k}(V_{k} A S_{k} A^{\prime} V_{k}^{\prime}) - S_{k} A^{\prime} V_{k}^{\prime} = 0. \tag{6}$$

Clearly, $V_{k}A{S}_{k}A^{\prime}V_{k}^{\prime}$ is positive semi-definite, so the solution to such a system will be the minimizer of the original objective function. However, Eq. 6 may have many possible solutions or may fail to be consistent. In the case of nonunique solutions, we arbitrarily choose the solution with the smallest Frobenius norm. In the case of an inconsistent system, we arbitrarily choose the solution that minimizes the Frobenius norm of the residual and has the minimal Frobenius norm. In both cases, a straightforward calculation gives

$$M_{k} = S_{k} A^{\prime} V_{k}^{\prime}(V_{k} A S_{k} A^{\prime} V_{k}^{\prime})^{\dagger}, \tag{7}$$

where $\dagger$ represents the Moore-Penrose pseudoinverse. Using Eq. 7 with Eq. 5, we have the following recursion

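$$S_{k+1} = S_{k} - S_{k} A^{\prime} V_{k}^{\prime}(V_{k} A S_{k} A^{\prime} V_{k}^{\prime})^{\dagger} V_{k} A S_{k}. \tag{8}$$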

From Eq. 7 and Eq. 8, it is clear that if $S_{0}$ were known, then the remaining unknown quantities could be determined.
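The step from Eq. 7 to Eq. 8 can be spelled out as follows. This is only a sketch: the shorthand $G_{k}$ and $X_{k}$ is introduced here for convenience, and the calculation relies on $S_{k}$ being symmetric (so that $G_{k}$ and $G_{k}^{\dagger}$ are symmetric) and on the Moore-Penrose identity $G^{\dagger} G G^{\dagger} = G^{\dagger}$.

```latex
\begin{aligned}
G_{k} &:= V_{k} A S_{k} A^{\prime} V_{k}^{\prime},
\qquad M_{k} = S_{k} A^{\prime} V_{k}^{\prime} G_{k}^{\dagger},
\qquad X_{k} := S_{k} A^{\prime} V_{k}^{\prime} G_{k}^{\dagger} V_{k} A S_{k}, \\
M_{k} V_{k} A S_{k} &= X_{k},
\qquad S_{k} A^{\prime} V_{k}^{\prime} M_{k}^{\prime}
  = S_{k} A^{\prime} V_{k}^{\prime} G_{k}^{\dagger} V_{k} A S_{k} = X_{k}, \\
M_{k} G_{k} M_{k}^{\prime}
  &= S_{k} A^{\prime} V_{k}^{\prime} G_{k}^{\dagger} G_{k} G_{k}^{\dagger} V_{k} A S_{k} = X_{k}, \\
S_{k+1} &= (I - M_{k} V_{k} A) S_{k} (I - M_{k} V_{k} A)^{\prime} \\
  &= S_{k} - M_{k} V_{k} A S_{k} - S_{k} A^{\prime} V_{k}^{\prime} M_{k}^{\prime}
     + M_{k} G_{k} M_{k}^{\prime} \\
  &= S_{k} - X_{k} - X_{k} + X_{k}
   = S_{k} - S_{k} A^{\prime} V_{k}^{\prime} G_{k}^{\dagger} V_{k} A S_{k}.
\end{aligned}
```

The same calculation shows that symmetry is preserved: if $S_{k}$ is symmetric, then so is $S_{k+1}$.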

Our Procedure.

Since $S_{0}$ is unknown, we use the following heuristic procedure instead. First, we let $S_{0}=I_{d}$ , where $I_{d}$ is the $d$ -dimensional identity matrix. Then, we recursively define $M_{k}$ and $S_{k}$ according to Eqs. 7 and 8. To summarize, given $\{V_{k}:k+1\in\mathbb{N}\}$ , we let $S_{0}=I_{d}$ , let $x_{0}\in\mathbb{R}^{d}$ , and define

$$x_{k+1} = x_{k} + M_{k} V_{k}(b - A x_{k}), \tag{9}$$

where

$$M_{k} = S_{k} A^{\prime} V_{k}^{\prime}(V_{k} A S_{k} A^{\prime} V_{k}^{\prime})^{\dagger}; \tag{10}$$

and

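$$S_{k+1} = S_{k} - S_{k} A^{\prime} V_{k}^{\prime}(V_{k} A S_{k} A^{\prime} V_{k}^{\prime})^{\dagger} V_{k} A S_{k}. \tag{11}$$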

To interpret the terms in the above procedure, we begin by ignoring $S_{k}$ (i.e., set it to the identity). In this case, $M_{k}$ and its role in updating $x_{k}$ to $x_{k+1}$ is familiar: $M_{k}$ serves to map the residual onto the row space of $V_{k}A$, thereby ensuring that $x_{k+1}$ satisfies $V_{k}Ax_{k+1}=V_{k}b$. If we now consider the role of $S_{k}$, we see that it is an orthogonal projector that “weights” the behavior of $M_{k}$ to ensure that $x_{k+1}$ satisfies $V_{i}Ax_{k+1}=V_{i}b$ for $i\leq k$. We will see these interpretations clearly and formally when we focus on the case of rank-one $V_{k}$ next.

We pause here momentarily to discuss the relationship between our procedure, as specified by Eqs. 9, 10 and 11, and the sketch-and-project method in Gower and Richtárik (2015) and Richtárik and Takác (2020). At first glance, it may seem that our procedure is a special case of sketch-and-project with adaptive choices of the inner product at each iteration of the sketch-and-project update. Unfortunately, an effort to recast our approach as a special case of sketch-and-project breaks down at two fundamental points. First, the adaptive choices of the sketch-and-project inner product would have to be the inverses of the $S_{k}$, which are orthogonal projection matrices. As a result, the inverse is ill-defined and the inner product is ill-defined. Of course, this can be rectified by allowing for a pseudo-metric, but this then results in the second major point of difficulty: the theory presented in Gower and Richtárik (2015) and Richtárik and Takác (2020) relies on the determinism and invertibility of the matrix defining the metric space to prove convergence. Thus, sketch-and-project, without a substantial investment, cannot readily include our approach. On the other hand, we can state sketch-and-project as a base randomized iterative approach, as shown in Subsection 2.1, and then improve on it with our procedure via Eqs. 9, 10 and 11.

2.3 Rank-One Refinements and Random Sketching

By choosing $x_{0}\in\mathbb{R}^{d}$ and $S_{0}=I_{d}$ , Eqs. 9, 10 and 11 describe an orthogonal projection procedure for typical randomized iterative procedures. However, because our goal is to improve the practicality of random sketching methods, we will need to focus on a particular refinement of the general procedure that occurs when $\{V_{k}\}$ are rank-one matrices, that is, when there exist pairs of vectors $\{(v_{k},w_{k})\}$ such that $V_{k}=v_{k}w_{k}^{\prime}$ for each $k$ . In this case, Eqs. 10 and 11 become

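$$M_{k} = \begin{cases} \dfrac{S_{k} A^{\prime} w_{k} v_{k}^{\prime}}{w_{k}^{\prime} A S_{k} A^{\prime} w_{k} \, \|v_{k}\|_{2}^{2}} & S_{k} A^{\prime} w_{k} \neq 0, \\ 0 & \text{otherwise}, \end{cases} \tag{12}$$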

and

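$$S_{k+1} = \begin{cases} S_{k} - \dfrac{S_{k} A^{\prime} w_{k} w_{k}^{\prime} A S_{k}}{w_{k}^{\prime} A S_{k} A^{\prime} w_{k}} & S_{k} A^{\prime} w_{k} \neq 0, \\ S_{k} & \text{otherwise}. \end{cases} \tag{13}$$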

Moreover, if we substitute Eq. 12 into Eq. 9, we recover

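$$x_{k+1} = \begin{cases} x_{k} + \dfrac{S_{k} A^{\prime} w_{k} w_{k}^{\prime}(b - A x_{k})}{w_{k}^{\prime} A S_{k} A^{\prime} w_{k}} & S_{k} A^{\prime} w_{k} \neq 0, \\ x_{k} & \text{otherwise}. \end{cases} \tag{14}$$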

It follows from Eqs. 14 and 13 that in the case of a rank-one RPM, the left singular vector of the RPM is not important. To give some explicit examples, recall that rank-one RPM methods include the important special cases of randomized Kaczmarz and Gauss-Seidel.
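The claim that $v_{k}$ drops out can be checked directly. The following is a sketch under two assumptions: $v_{k}\neq 0$, and $S_{k}$ is an orthogonal projector (symmetric and idempotent), as in the interpretation given earlier. The scalar $\alpha_{k}$ is shorthand introduced here.

```latex
\begin{aligned}
V_{k} &= v_{k} w_{k}^{\prime},
\qquad \alpha_{k} := w_{k}^{\prime} A S_{k} A^{\prime} w_{k}
  = \| S_{k} A^{\prime} w_{k} \|_{2}^{2}, \\
V_{k} A S_{k} A^{\prime} V_{k}^{\prime} &= \alpha_{k} \, v_{k} v_{k}^{\prime},
\qquad (\alpha_{k} \, v_{k} v_{k}^{\prime})^{\dagger}
  = \frac{v_{k} v_{k}^{\prime}}{\alpha_{k} \| v_{k} \|_{2}^{4}}
  \quad \text{when } \alpha_{k} \neq 0, \\
M_{k} &= S_{k} A^{\prime} w_{k} v_{k}^{\prime}
  \cdot \frac{v_{k} v_{k}^{\prime}}{\alpha_{k} \| v_{k} \|_{2}^{4}}
  = \frac{S_{k} A^{\prime} w_{k} v_{k}^{\prime}}{\alpha_{k} \| v_{k} \|_{2}^{2}}, \\
M_{k} V_{k} &= \frac{S_{k} A^{\prime} w_{k} \, (v_{k}^{\prime} v_{k}) \, w_{k}^{\prime}}
  {\alpha_{k} \| v_{k} \|_{2}^{2}}
  = \frac{S_{k} A^{\prime} w_{k} w_{k}^{\prime}}{\alpha_{k}}.
\end{aligned}
```

Since only the product $M_{k}V_{k}$ enters the iterate and matrix updates, $v_{k}$ cancels. The identity $\alpha_{k} = \|S_{k}A^{\prime}w_{k}\|_{2}^{2}$ also explains why the case split is stated in terms of $S_{k}A^{\prime}w_{k}\neq 0$.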

Randomized Kaczmarz with Orthogonalization.

Let $A_{i,}\in\mathbb{R}^{d}$ denote the $i^{\text{th}}$ row of $A$ and let $e_{i}$ denote the $i^{\text{th}}$ standard basis vector of dimension $n$ . Define the random variable $I$ arbitrarily taking values in $\{1,\ldots,n\}$ . Now, given an independent copy of $I$ at each $k$ , the randomized Kaczmarz method has rank-one RPM, $V_{k}=A_{I,}e_{I}^{\prime}/\left\|A_{I,}\right\|_{2}^{2}.$ Then, using Eqs. 14 and 13, the randomized Kaczmarz method with orthogonalization is

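$$x_{k+1} = x_{k} + \frac{S_{k} A_{I,}(b_{I} - A_{I,}^{\prime} x_{k})}{A_{I,}^{\prime} S_{k} A_{I,}}, \qquad S_{k+1} = S_{k} - \frac{S_{k} A_{I,} A_{I,}^{\prime} S_{k}}{A_{I,}^{\prime} S_{k} A_{I,}},$$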

when $S_{k}A^{\prime}e_{I}\neq 0$ , or is $x_{k+1}=x_{k}$ and $S_{k+1}=S_{k}$ otherwise. $\quad\blacksquare$

Randomized Gauss-Seidel with Orthogonalization.

Let $A_{,j}\in\mathbb{R}^{n}$ denote the $j^{\text{th}}$ column of $A$ and let $e_{j}$ denote the $j^{\text{th}}$ standard basis vector of dimension $d$. Define a random variable $J$ arbitrarily taking values in $\{1,\ldots,d\}$. Now, given an independent copy of $J$ at each $k$, the randomized Gauss-Seidel method has rank-one RPM, $V_{k}=e_{J}A_{,J}^{\prime}/\left\|A_{,J}\right\|_{2}^{2}.$ Then, using Eqs. 14 and 13, the randomized Gauss-Seidel method with orthogonalization is

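$$x_{k+1} = x_{k} + \frac{S_{k} A^{\prime} A_{,J} A_{,J}^{\prime}(b - A x_{k})}{A_{,J}^{\prime} A S_{k} A^{\prime} A_{,J}}, \qquad S_{k+1} = S_{k} - \frac{S_{k} A^{\prime} A_{,J} A_{,J}^{\prime} A S_{k}}{A_{,J}^{\prime} A S_{k} A^{\prime} A_{,J}},$$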

when $S_{k}A^{\prime}A_{,J}\neq 0$ , or is $x_{k+1}=x_{k}$ and $S_{k+1}=S_{k}$ otherwise. $\quad\blacksquare$

Again, we see from the two preceding examples that the left singular vector of the rank-one RPM does not play a role in the updates for our procedure. As we now explain, this observation is critical for converting the impractical, noniterative randomized sketch-then-solve methods into iterative randomized sketch-and-solve methods.

Recall that the fundamental sketch-then-solve procedure is to construct a specialized matrix $N^{\mathrm{sketch}}\in\mathbb{R}^{k\times n}$, then generate and solve the smaller, sketched problem $(N^{\mathrm{sketch}}A)x=N^{\mathrm{sketch}}b$ (see Woodruff, 2014, Ch. 1). (Footnote 3: We note that the typical formulation considers linear regression rather than a linear system.) The special matrix $N^{\mathrm{sketch}}$, called the sketching matrix, can be generated in a variety of ways, such as making each entry an independent, identically distributed Gaussian random variable (Indyk and Motwani, 1998), or setting the columns of $N^{\mathrm{sketch}}$ as uniformly sampled columns (with replacement) of the appropriately-dimensioned identity matrix (Cormode and Muthukrishnan, 2005).

In order to convert the usual sketch-then-solve procedure into our sketch-and-solve procedure, we simply set $\{w_{k}:k+1\in\mathbb{N}\}\subset\mathbb{R}^{n}$ to the transposed rows of $N^{\mathrm{sketch}}$ , which we will rigorously demonstrate in Section 3. Of course, this requires that we have a streaming procedure for generating arbitrarily many rows of $N^{\mathrm{sketch}}$ . For concreteness, we show how to do this for the two sketching strategies just mentioned.

Random Gaussian Sketch.

In the random Gaussian sketch, the entries of the sketching matrix, $N^{\mathrm{sketch}}$ , are independent, standard normal random variables. Accordingly, we let $\{w_{k}\}$ be independent, $n$ -dimensional standard normal vectors. We see that if $N^{\mathrm{sketch}}$ has $r$ rows, then $N^{\mathrm{sketch}}$ and

$$\begin{bmatrix}w_{0}^{\prime}\\ w_{1}^{\prime}\\ \vdots\\ w_{r-1}^{\prime}\end{bmatrix}$$

have the same distribution. $\quad\blacksquare$
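The streaming Gaussian strategy can be sketched as a generator. This is a minimal illustration assuming NumPy; the names `gaussian_rpm_stream` and `stack_rows` and the `seed` argument are hypothetical and do not appear in the original.

```python
import numpy as np

def gaussian_rpm_stream(n, seed=None):
    """Yield independent n-dimensional standard normal vectors w_0, w_1, ..."""
    rng = np.random.default_rng(seed)
    while True:
        yield rng.standard_normal(n)

def stack_rows(stream, r):
    """Stack the first r vectors of a stream as the rows of an r-by-n matrix."""
    return np.vstack([next(stream) for _ in range(r)])
```

Stacking the first $r$ vectors of the stream gives a matrix with independent standard normal entries, which is how an $r$-row Gaussian sketching matrix is distributed.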

Count Sketch.

Fix $K\in\mathbb{N}$ , and let $\{E_{1},E_{2},\ldots\}$ be drawn from the $\mathbb{R}^{K}$ standard basis vectors with replacement. Define a sequence of Rademacher random variables $\{R_{1},R_{2},\ldots\}$ which are independent and independent of $\{E_{1},E_{2},\ldots\}$ . The count sketch sketching matrix, $N^{\mathrm{sketch}}$ , is specified by

[TABLE]

which is a matrix whose entries are either $-1$ , $0$ or $1$ . Generally, the choice of $K$ is the topic of substantial theory and consideration (Cormode and Muthukrishnan, 2005; Clarkson and Woodruff, 2017). Owing to the fact that we have a streaming procedure, we do not need to worry too much about $K$ . Therefore, we generate $\{w_{k}\}$ as follows:

1. Generate a count sketch matrix with $K$ small. In our experiments below, we used $K=10$ .

2. To generate a $w_{k}$ , pop a row of the matrix and set it to $w_{k}$ .

3. Once the count sketch matrix is exhausted, generate a new count sketch matrix with the same $K$ . Repeat.

Under this strategy, let (a) $\{N_{(i)}:i\in\mathbb{N}\}$ denote a sequence of independent $K\times n$ count sketch matrices, (b) $i_{k}$ denote the remainder of an integer $k$ divided by $K$ , incremented by one, and (c) $\{e_{i}\}$ denote the standard basis vectors of $\mathbb{R}^{K}$ . Then $w_{k}=N_{(i_{k})}^{\prime}e_{i_{k}}$ for all $k+1\in\mathbb{N}$ . $\quad\blacksquare$
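The three-step count sketch strategy can be sketched as follows. This is a hedged illustration assuming NumPy and assuming that column $j$ of the $K\times n$ count sketch matrix equals $R_{j}E_{j}$ (the usual construction); the function names are hypothetical.

```python
import numpy as np

def count_sketch_matrix(K, n, rng):
    """Return a K-by-n matrix whose j-th column is R_j * E_j (assumed form)."""
    rows = rng.integers(0, K, size=n)        # position of the nonzero in column j
    signs = rng.choice([-1.0, 1.0], size=n)  # Rademacher variables R_j
    N = np.zeros((K, n))
    N[rows, np.arange(n)] = signs
    return N

def count_sketch_rpm_stream(K, n, seed=None):
    """Yield rows of count sketch matrices, regenerating once K rows are used."""
    rng = np.random.default_rng(seed)
    while True:
        N = count_sketch_matrix(K, n, rng)
        for row in N:  # pop the rows one at a time
            yield row.copy()
```

Each block of $K$ consecutive vectors comes from one matrix, so vectors within a block are dependent while distinct blocks are independent.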

Thus, if we let RPMStrategy() define a generic user-defined procedure for choosing $\{w_{k}:k+1\in\mathbb{N}\}$ , then this observation gives us Algorithm 1 for (1) converting the sketch-then-solve procedure into a sketch-and-solve procedure, and (2) adding orthogonalization to such base methods as randomized Kaczmarz and randomized Gauss-Seidel.
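A sequential loop in the spirit of Algorithm 1 might look like the sketch below. It assumes NumPy and assumes the update forms $x_{k+1}=x_{k}+u_{k}(w_{k}^{\prime}b-q_{k}^{\prime}x_{k})/(q_{k}^{\prime}u_{k})$ and $S_{k+1}=S_{k}-u_{k}u_{k}^{\prime}/(q_{k}^{\prime}u_{k})$ with $q_{k}=A^{\prime}w_{k}$ and $u_{k}=S_{k}q_{k}$ , which agree with the properties proved in Section 3 but are not copied from the algorithm listing. The function names, the tolerance, and the residual-based stopping rule are illustrative choices.

```python
import numpy as np

def gaussian_stream(n, seed=None):
    """Hypothetical RPMStrategy(): independent standard normal w_k in R^n."""
    rng = np.random.default_rng(seed)
    while True:
        yield rng.standard_normal(n)

def sketch_and_solve(A, b, rpm_strategy, x0=None, max_iter=1000, tol=1e-10):
    """Sketch-and-solve loop with full orthogonalization (assumed updates)."""
    d = A.shape[1]
    x = np.zeros(d) if x0 is None else np.array(x0, dtype=float)
    S = np.eye(d)
    for _ in range(max_iter):
        w = next(rpm_strategy)
        q = A.T @ w            # sketched row of A
        u = S @ q              # search direction, orthogonal to earlier ones
        denom = q @ u
        if denom <= tol * max(1.0, q @ q):
            continue           # S q is numerically zero: x and S are unchanged
        x = x + u * ((w @ b - q @ x) / denom)
        S = S - np.outer(u, u) / denom
        # Illustrative stopping rule only; it costs a full residual evaluation.
        if np.linalg.norm(A @ x - b) <= tol * max(1.0, np.linalg.norm(b)):
            break
    return x, S
```

For a consistent system with $d$ unknowns and Gaussian $w_{k}$ , this loop is expected to stop after about $d$ productive iterations, since each one removes one dimension from the range of $S_{k}$ .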

2.4 Algorithmic Refinements Considering the Computing Platform

Algorithm 1 implicitly assumes the traditional sequential programming paradigm. However, the performance of the algorithm can be improved by taking advantage of parallel computing architectures. Here, we will consider a handful of important computing architecture abstractions and how our procedure can adapt to different configurations. In Subsection 2.4.1, we will consider the case of a parallel computing architecture for which the communication overhead, which is proportional to the dimension $d$ , is not a limiting factor. For this subsection, the problems that we have in mind come from data and imaging sciences, where $n\gg d$ and $d$ is reasonably sized. In Subsection 2.4.2, we consider a similar class of problems where the communication of $\mathcal{O}\left[d\right]$ -sized vectors is acceptable and $n\gg d$ , but $d$ is so large that storing and manipulating a matrix in $\mathbb{R}^{d\times d}$ is burdensome. Finally, in Subsection 2.4.3, we will consider problems in which computational overhead becomes a bottleneck for scalability, but whose structure allows us to circumvent this issue. For this final subsection, the problems that we have in mind come from the solution of systems of differential equations (e.g., Dongarra and Sørensen, 1986).

2.4.1 Asynchronous Parallelization on Shared and Distributed Memory Platforms

First, when we are using a matrix sketch for RPMStrategy(), one of the expensive components of the computation is determining $\begin{bmatrix}A&b\end{bmatrix}^{\prime}w_{k}$ . Fortunately, in our sketch-and-solve procedure, this expensive computation can be trivially asynchronously parallelized on a shared memory platform when

1. the data within the rows $\begin{bmatrix}A&b\end{bmatrix}$ are stored together, and

2. the RPMStrategy() generates $\{w_{k}:k+1\in\mathbb{N}\}$ that are either independent (e.g., the Gaussian Strategy) or can be grouped into independent subsets (e.g., the Count-Sketch strategy).

When these two requirements are met, each processor can generate its own $\{w_{k}:k+1\in\mathbb{N}\}$ independently of the other processors, and evaluate $\begin{bmatrix}A&b\end{bmatrix}^{\prime}w_{k}$ . It can then simply write the resulting row to an address reserved for performing the iterate and $S_{k}$ matrix updates by the master processor. Importantly, this procedure does not require locking any of the rows of $\begin{bmatrix}A&b\end{bmatrix}$ , and the reserved addresses can use fine grained locks to prevent any wasted calculations.

Similarly, in our sketch-and-solve procedure, computing $\begin{bmatrix}A&b\end{bmatrix}^{\prime}w_{k}$ can be trivially asynchronously parallelized on a distributed memory platform using a Fork-join model, when

1. the rows of $\begin{bmatrix}A&b\end{bmatrix}$ are distributed across the different storages, and

2. the RPMStrategy() generates $\{w_{k}:k+1\in\mathbb{N}\}$ such that $w_{k}$ have independent groups of components (e.g., the Gaussian Strategy and the Count-Sketch strategy).

When these two requirements are met, each processor can generate its own $\{w_{k}:k+1\in\mathbb{N}\}$ and operate on the local rows of $\begin{bmatrix}A&b\end{bmatrix}$ . It can then simply pass the resulting row to the master processor which performs the iterate and $S_{k}$ matrix updates. For each iteration, a scattering and gathering of the data is performed but no other data exchange is required.
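The distributed computation relies on the fact that $\begin{bmatrix}A&b\end{bmatrix}^{\prime}w_{k}$ splits into a sum over row blocks. The sketch below, assuming NumPy, simulates the $p$ workers with Python lists rather than real processes; the name `blockwise_sketch_row` is hypothetical.

```python
import numpy as np

def blockwise_sketch_row(A_blocks, b_blocks, w_blocks):
    """Sum the local products A_i' w_i and b_i' w_i over the row blocks.

    Each triple (A_i, b_i, w_i) stands for the data held by one worker;
    the two sums are what the master processor would gather.
    """
    q = sum(A_i.T @ w_i for A_i, w_i in zip(A_blocks, w_blocks))
    beta = sum(b_i @ w_i for b_i, w_i in zip(b_blocks, w_blocks))
    return q, beta
```

Only the $d$ -vector `q` and the scalar `beta` are returned to the master, which matches the $\mathcal{O}\left[d\right]$ communication assumed in this subsection.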

Table 1 summarizes the time and total computational costs of computing $x_{k}$ and $S_{k}$ from $x_{0}$ and $S_{0}$ in the following context: (1) the sequential platform refers to the case where there is a single processor with a sufficiently large memory to store the system, and perform the necessary operations in Algorithm 1; (2) the shared memory platform assumes that there are $p+1$ processors that share a sufficiently large memory. One of the processors is dedicated to performing the iterate and matrix updates, while the remaining $p$ processors compute $\begin{bmatrix}A&b\end{bmatrix}^{\prime}w_{k}$ ; (3) the distributed memory architecture assumes that there are $p+1$ processors each with a sufficient memory capacity. The rows of $\begin{bmatrix}A&b\end{bmatrix}$ are split evenly or nearly evenly amongst $p$ of the processors, and each processor only manipulates its local information about $A$ and $b$ . Finally, the master processor is dedicated to performing the iterate and matrix updates.

2.4.2 Memory-Reduced Procedure

Another notable aspect of Algorithm 1 (and its parallel variants described above) is that it must store and manipulate the matrix $S_{k}$ at each iteration, which is clearly expensive when $d$ is large, or excessive when $d^{3}$ is comparable to or greater than $n$ . This difficulty motivates a partial orthogonalization approach, as described in Algorithm 2. In this approach, a user-defined parameter $m<d$ specifies the number of $d$ -dimensional vectors needed to implicitly store an approximate representation of $S_{k}$ (based on Theorem 1). With this implicit representation, the cost of computing $u_{k}$ reduces to $\mathcal{O}\left[md\right]$ , which, consequently, reduces the overall cost of updating $x_{k}$ to $x_{k+1}$ to $\mathcal{O}\left[md\right]$ . (Footnote 4: If $q_{k}$ replaces $u_{k}$ in the calculation of $z_{k}$ , then the cost of computing $u_{k}$ is $\mathcal{O}\left[dm^{2}\right]$ ; see Golub and Van Loan, 2012, Ch. 5.2.) Moreover, because $S_{k}$ is implicitly represented by $m$ $d$ -dimensional vectors in $\mathcal{S}$ , there is no notable additional computational cost incurred for updating $S_{k}$ to $S_{k+1}$ . Thus, an entire iteration incurs a computational cost of $\mathcal{O}\left[md\right]$ plus the cost of computing $\begin{bmatrix}A&b\end{bmatrix}^{\prime}w_{k}$ , which can be mitigated under the strategies above on shared memory or distributed memory platforms.

Remark 1.

Algorithm 2 is an efficient implementation of the partial orthogonalization procedure and, as a result, at $m=0$ , seems to only recover row-action base randomized iterative methods as specified by Eq. 51. A less efficient algorithm based on directly applying Eqs. 12 and 13 with the appropriate low memory modification would recover all rank-one base randomized iterative methods when $m=0$ .
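One way to picture partial orthogonalization is the sketch below. It is not the authors' Algorithm 2: it assumes NumPy, keeps only the $m$ most recent normalized directions in a sliding window, and uses a single Gram-Schmidt pass; all names are hypothetical.

```python
import numpy as np
from collections import deque

def gaussian_stream(n, seed=None):
    """Hypothetical RPMStrategy(): independent standard normal w_k in R^n."""
    rng = np.random.default_rng(seed)
    while True:
        yield rng.standard_normal(n)

def partial_orthogonalization(A, b, rpm_strategy, m, x0=None, n_iter=100, tol=1e-12):
    """Iterate x while storing at most m orthonormal d-vectors instead of S."""
    d = A.shape[1]
    x = np.zeros(d) if x0 is None else np.array(x0, dtype=float)
    Z = deque(maxlen=m)  # sliding window; the oldest vector is dropped first
    for _ in range(n_iter):
        w = next(rpm_strategy)
        q = A.T @ w
        u = q.copy()
        for z in Z:                      # O(md) work per iteration
            u -= (z @ q) * z
        uu = u @ u
        if uu <= tol * max(1.0, q @ q):
            continue                     # direction already covered by the window
        x = x + u * ((w @ b - q @ x) / uu)
        Z.append(u / np.sqrt(uu))
    return x
```

With `m=0` the window is always empty and the update reduces to a projection onto a single sketched equation, which is the row-action behavior mentioned in Remark 1. With `m` at least `d - 1` no direction is dropped before the system is solved, so the loop behaves like full orthogonalization.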

2.4.3 Optimizing Communication Overhead. Structured Systems

In the above approaches, we take for granted that $d$ is not so large that communicating $\mathcal{O}\left[d\right]$ vectors during the procedure becomes unacceptable. However, for many problems coming from the solution of differential equations (e.g., see Dongarra and Sørensen, 1986), $d$ and $n$ are of the same order and are so large that communicating $\mathcal{O}\left[d\right]$ vectors at arbitrary points during the procedure is impossible. Fortunately, linear system problems in this class are highly sparse and structured (Saad, 2003, Ch. 2). A simple example is the case where $A$ is a square, banded system with nonzero bandwidth $\tilde{Q}+1$ for some $\tilde{Q}\ll n=d$ ; that is, $A_{ij}=0$ if $|i-j|>\tilde{Q}$ and the remaining $A_{ij}$ can take arbitrary values.

For such sparse and structured problems, our methodology can be efficiently implemented across a distributed memory platform with $p$ processors under some additional qualifications. However, to understand these qualifications, let us first introduce some notation and concepts that define the communication pattern across the $p$ nodes.

Suppose somehow that we distribute the equations of our linear system of interest across $p$ nodes. Figure 1 shows how the coefficient matrix of a $20\times 20$ banded system with bandwidth $5$ can be distributed across five nodes. Note, in this example, the entries of the constant vector would be stored on the same processor as the corresponding rows of the coefficient matrix. Moreover, we need a way of tracking which components of $x$ are manipulated by each node: let $\mathcal{X}_{i}$ be the set of indices of the components of $x$ with nonzero coefficients at node $i$ in the distributed system for $i=1,\ldots,p$ . In our example, $\mathcal{X}_{1}=\{1,\ldots,6\}$ , $\mathcal{X}_{2}=\{3,\ldots,10\}$ , $\mathcal{X}_{3}=\{7,\ldots,14\}$ , $\mathcal{X}_{4}=\{11,\ldots,18\}$ , and $\mathcal{X}_{5}=\{15,\ldots,20\}$ . Finally, for any vector $z$ and any set $\mathcal{X}$ over the indices of $z$ , let $z[\mathcal{X}]$ be the vector whose elements are the elements of $z$ indexed by $\mathcal{X}$ .
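The index sets of the example can be reproduced with a few lines of Python. The sketch assumes that each node holds a contiguous block of four rows and that $A_{ij}$ may be nonzero exactly when $|i-j|\leq 2$ , which is the reading of "bandwidth 5" that matches the listed sets; the function names are hypothetical.

```python
def column_index_sets(n, half_bandwidth, p):
    """Return the 1-based sets X_1, ..., X_p for contiguous row blocks."""
    rows_per_node = n // p
    sets = []
    for i in range(p):
        first_row = i * rows_per_node + 1
        last_row = (i + 1) * rows_per_node if i < p - 1 else n
        lo = max(1, first_row - half_bandwidth)   # leftmost column touched
        hi = min(n, last_row + half_bandwidth)    # rightmost column touched
        sets.append(set(range(lo, hi + 1)))
    return sets

def restrict(z, index_set):
    """Return z[X]: entries of z at the 1-based indices in X, in order."""
    return [z[i - 1] for i in sorted(index_set)]
```

Calling `column_index_sets(20, 2, 5)` returns the five sets $\mathcal{X}_{1},\ldots,\mathcal{X}_{5}$ given in the text.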

From this example and from our discussion in Subsection 2.4.1 of distributing the RPMStrategy(), we can use the local rows of $A$ at Node 1 and a Gaussian sketch to generate a $q_{1}\in\mathbb{R}^{d}$ such that $q_{1}[\{1,\ldots,6\}]$ are arbitrarily valued and $q_{1}[\{7,\ldots,20\}]=0$ . Thus, our vector $q_{k}$ is highly sparse and can be generated locally on the node. However, following Algorithm 1, the next step of computing $u_{k}$ requires computing the product between $S_{k}$ and $q_{k}$ , which, in a naive implementation, would require storing a dense $d\times d$ matrix $S_{k}$ and computing a global matrix-vector product. Such a required computation raises several concerns, which we detail and address in the following enumeration.

1. Given that $d$ is large relative to the computing environment, is storing a $d\times d$ matrix even feasible? Generally, the answer will be that storing such a matrix is infeasible. However, by exploiting the properties of $S_{k}$ (see Theorem 1), we will approximately and implicitly store $S_{k}$ as $\mathcal{S}$ , which is a collection of orthonormal vectors.

2. Even if we use $\mathcal{S}$ in place of $S_{k}$ , will the resulting implicit matrix-vector product and update of $\mathcal{S}$ incur prohibitive communication costs? To answer these questions completely, we will need to specify how the implicit matrix-vector product will be computed and how $\mathcal{S}$ will be stored. Here, we will compute the implicit matrix-vector product by using twice-iterated classical Gram-Schmidt (Algorithm 4), which was shown to be numerically stable in the seminal work of Giraud et al. (2005). Owing to this calculation pattern, we can store $\mathcal{S}$ in a distributed fashion across the $p$ processors, which we detail below along with the communication cost of the synchronization of $\mathcal{S}$ .

To understand the costs associated with computing $u$ from the orthonormal vectors in $\mathcal{S}$ and the vector $q$ , we will characterize the support of $u$ (i.e., index set of its nonzero entries).

Lemma 1.

Let $q\in\mathbb{R}^{d}$ and let $\mathcal{Q}=\{i:q[i]\neq 0\}\subset\{1,\ldots,d\}$ . Let $\{z_{1},\ldots,z_{m}\}\subset\mathbb{R}^{d}$ be a set of orthonormal vectors (hence, $m\leq d$ ), and let $\mathcal{Z}_{j}=\{i:z_{j}[i]\neq 0\}\subset\{1,\ldots,d\}$ for $j=1,\ldots,m$ . If $u$ denotes the result of Algorithm 4 applied to $q$ over the set $\{z_{1},\ldots,z_{m}\}$ then

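$$\mathcal{U}:=\{i:u[i]\neq 0\}\subset\mathcal{Q}\cup\left(\bigcup_{j\in\mathfrak{Q}}\mathcal{Z}_{j}\right),$$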

where $\mathfrak{Q}=\{j:\mathcal{Q}\cap\mathcal{Z}_{j}\neq\emptyset\}\subset\{1,\ldots,m\}$ .

Proof.

Letting $Z$ denote the matrix whose columns are elements of the orthonormal set, we recall that classical Gram-Schmidt generates $u=(I_{d}-ZZ^{\prime})q$ . Thus, twice iterated Gram-Schmidt can be written as

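$$u=(I_{d}-ZZ^{\prime})(I_{d}-ZZ^{\prime})q=(I_{d}-2ZZ^{\prime}+ZZ^{\prime}ZZ^{\prime})q=(I_{d}-ZZ^{\prime})q,$$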

which is expected in exact arithmetic. Thus, we can consider classical Gram-Schmidt and ignore the iteration to compute the support of $u$ . For any $l=1,\ldots,d$ ,

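$$u[l]=q[l]-\sum_{j=1}^{m}(q^{\prime}z_{j})z_{j}[l]=q[l]-\sum_{j\in\mathfrak{Q}}(q^{\prime}z_{j})z_{j}[l],$$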

where we use the fact that if $j\not\in\mathfrak{Q}$ then $q^{\prime}z_{j}=\sum_{l\in\mathcal{Q}\cap\mathcal{Z}_{j}}q[l]z_{j}[l]=\sum_{l\in\emptyset}q[l]z_{j}[l]=0$ . For a contradiction, suppose $l\in\mathcal{U}$ such that

$$l\not\in\mathcal{Q}\cup\left(\bigcup_{j\in\mathfrak{Q}}\mathcal{Z}_{j}\right).$$

Then, $q[l]=0$ and $z_{j}[l]=0$ for $j\in\mathfrak{Q}$ . Using the above formula for $u[l]$ , $u[l]=0-\sum_{j\in\mathfrak{Q}}(q^{\prime}z_{j})0=0$ , which is a contradiction. ∎
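The support bound can be checked numerically on a small example. The sketch assumes NumPy and reads Lemma 1 as the inclusion of the support of $u$ in $\mathcal{Q}\cup\bigcup_{j\in\mathfrak{Q}}\mathcal{Z}_{j}$ , as used in the proof; a small tolerance stands in for exact zeros, and all names are hypothetical.

```python
import numpy as np

def twice_iterated_cgs(q, Z):
    """Orthogonalize q against the orthonormal columns of Z, two CGS passes."""
    u = q - Z @ (Z.T @ q)
    return u - Z @ (Z.T @ u)

def support(v, tol=1e-12):
    """0-based indices of the entries of v whose magnitude exceeds tol."""
    return set(np.flatnonzero(np.abs(v) > tol).tolist())

def predicted_support(q, Z, tol=1e-12):
    """Union of supp(q) with supp(z_j) for every z_j whose support meets supp(q)."""
    Q = support(q, tol)
    bound = set(Q)
    for j in range(Z.shape[1]):
        Zj = support(Z[:, j], tol)
        if Q & Zj:
            bound |= Zj
    return bound
```

In the check that accompanies this sketch, only the first of three stored vectors overlaps the support of `q`, so the other two do not enlarge the bound.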

At iteration $k$ , Lemma 1 states that the support of $u_{k}$ will depend on the supports of $\{z_{m},\ldots,z_{1}\}$ , whose elements, in turn, have supports that depend on (a subset of) $\{u_{k-1},\ldots,u_{0}\}$ . Moreover, if the combined support of the elements of $\{u_{k-1},\ldots,u_{0}\}$ covers $\{1,\ldots,d\}$ , which will be necessary to solve the system, it is possible that the support of $u_{k}$ will be all of $\{1,\ldots,d\}$ (ignoring any trivial independence in the system). (Footnote 5: Note, if the combined supports of the elements of $\{u_{k-1},\ldots,u_{0}\}$ do not cover all of $\{1,\ldots,d\}$ , then some components of our iterates, $\{x_{k}\}$ , will not be updated.) Thus, it appears that we will eventually have to store vectors in $\mathcal{S}$ whose support is all of $\{1,\ldots,d\}$ . Naively, we may think that we need a faithful copy of $\mathcal{S}$ at each node in the system, which incurs prohibitive communication costs as the support of $u_{k}$ tends to $\{1,\ldots,d\}$ . While this is true, a careful inspection of Gram-Schmidt and the nonzero patterns of $q_{k}$ suggests a less naive approach, which we now detail.

We begin by supposing that on a processor $i\in\{1,\ldots,p\}$ , only $z_{j}[\mathcal{X}_{i}]$ are stored on the node for every $j=1,\ldots,k$ . Immediately, we have eliminated the need for synchronizing all of $\mathcal{S}$ on each processor. Instead, we need only to synchronize those components of $z_{j}$ in $\mathcal{X}_{i}\cap\mathcal{X}_{j}$ for all $i\neq j$ . Thus, we have that our synchronization costs will depend on the maximum overlap, $Q$ , between two processors, which, formally, is

$$Q=\max_{i\neq j}\left|\mathcal{X}_{i}\cap\mathcal{X}_{j}\right|.$$
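For the $20\times 20$ banded example, the maximum overlap between two nodes can be computed directly from the index sets. The sketch takes $Q$ to be the largest pairwise intersection size, as described in the sentence above; the function name is hypothetical.

```python
from itertools import combinations

def max_pairwise_overlap(index_sets):
    """Largest number of indices shared by any two distinct index sets."""
    return max(len(a & b) for a, b in combinations(index_sets, 2))

# Index sets X_1, ..., X_5 of the banded example, written out explicitly.
example_sets = [
    set(range(1, 7)),
    set(range(3, 11)),
    set(range(7, 15)),
    set(range(11, 19)),
    set(range(15, 21)),
]
```

Only neighboring nodes share indices in this example, and each neighboring pair shares four.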

Now, we can understand the precise nature of this synchronization by inspecting Algorithm 4. If for some $j=1,\ldots,p$ , $q_{k}[\mathcal{X}_{j}^{c}]=0$ , then

[TABLE]

From Eq. 20, we see that we must communicate the values of $q[l]$ to all nodes $i\in\{1,\ldots,p\}\setminus\{j\}$ such that $\mathcal{X}_{j}\cap\mathcal{X}_{i}\neq\emptyset$ , and we must communicate the $m$ inner products to all $p-1$ nodes. The resulting number of floating point values that must be communicated (counting each replicate to a node individually) during the first iteration of Algorithm 4 is

[TABLE]

where $\mathfrak{Q}_{j}=\{i:\mathcal{X}_{i}\cap\mathcal{X}_{j}\neq\emptyset\}$ for $j=1,\ldots,p$ (see the notation in Lemma 1). For the second iteration of Algorithm 4, we must broadcast $m$ inner products that are partially computed (using some ordering that respects the non-associative property of floating point arithmetic) on each node to the remaining $p-1$ nodes. Thus, the number of floating point values that must be communicated (counting each replicate to a processor individually) to ensure synchronization is

[TABLE]

which we can bound by

[TABLE]

Noting that $Q$ represents the maximum shared indices between two nodes and that $F$ represents the maximum number of nodes that overlap, the first term in the bound can be controlled by the ordering choice of the differential equations that generate the system, but a discussion of this topic is beyond the scope of this work. Algorithm 5 summarizes a simple version of the procedure described here. We can also modify this algorithm to the low memory context of Algorithm 2 by limiting the number of vectors that can be stored in $\mathcal{S}$ .

3 Convergence Theory for Orthogonalization

Here, we prove that the complete orthogonalization approach (i.e., Algorithm 1) converges to the solution under a variety of sampling RPM strategies. In Subsection 3.1, we establish a collection of core results that are useful in characterizing the behavior of our procedure. A key feature of these core results is that they will rely on a stopping time $T$ , which will depend on the random variables $\{w_{k}\}$ . Therefore, in Subsection 3.2, we characterize $T$ under common probabilistic relationships between the elements of $\{w_{k}\}$ . All statements hold with probability one unless stated otherwise.

3.1 Core Results

We establish two key results. First, we establish that our procedure is an orthogonalization procedure: that is, the matrices $\{S_{k}\}$ project the current search direction onto a subspace that is orthogonal to previous search directions. Second, we characterize the limit point of our iterates, $\{x_{k}\}$ , in terms of a true solution of the linear system and the subspace generated by the rank-one RPMs, $\{V_{k}\}$ .

Theorem 1.

Let $\{w_{l}:l+1\in\mathbb{N}\}\subset\mathbb{R}^{n}$ be an arbitrary sequence in $\mathbb{R}^{n}$ , and let $\mathcal{R}_{0}=\{0\}\subset\mathbb{R}^{d}$ and $\mathcal{R}_{l}=\mathrm{span}\left[A^{\prime}w_{0},\ldots,A^{\prime}w_{l-1}\right]$ for $l\in\mathbb{N}$ . Now, let $S_{0}=I_{d}$ and $\{S_{l}:l\in\mathbb{N}\}$ be defined recursively as in Eq. 13. Then, for $l\geq 0$ , $S_{l}$ is an orthogonal projection matrix onto $\mathcal{R}_{l}^{\perp}$ .

Proof.

We will prove the result by induction. For the base case, $l=0$ , $S_{0}=I_{d}$ . It follows that $S_{0}$ is an orthogonal projection onto $\mathcal{R}_{0}^{\perp}=\mathbb{R}^{d}$ since $S_{0}^{2}=I_{d}^{2}=I_{d}=S_{0}$ and $\mathrm{range}\left(I_{d}\right)=\mathbb{R}^{d}$ . Now suppose that the result holds for some $l\geq 0$ . If $S_{l}A^{\prime}w_{l}=0$ then $S_{l+1}=S_{l}$ and $\mathcal{R}_{l+1}=\mathcal{R}_{l}$ , so there is nothing to show. Therefore, for the remainder of this proof, suppose $S_{l}A^{\prime}w_{l}\neq 0$ .

First, we show that $S_{l+1}$ is a projection matrix by verifying that $S_{l+1}^{2}=S_{l+1}$ by direct calculation. Making use of the recursive definition of $S_{l+1}$ and the induction hypothesis that $S_{l}^{2}=S_{l}$ ,

[TABLE]

Second, we use the fact that a projection is orthogonal if and only if it is self-adjoint to show that $S_{l+1}$ is an orthogonal projection. By induction, because $S_{l}$ is an orthogonal projection, $S_{l}^{\prime}=S_{l}$ , and so

[TABLE]

Finally, let $v$ be in the range of $S_{l+1}$ and we can decompose $v$ into the components $u$ and $y$ such that $v=u+y$ , $0=u^{\prime}y$ and $y\in\mathcal{R}_{l+1}$ . We will show that $y=0$ , which characterizes the range of $S_{l+1}$ as being all vectors orthogonal to $\mathcal{R}_{l+1}$ . To show this note that because $S_{l+1}$ is a projection matrix, we have that

[TABLE]

By construction $\mathcal{R}_{l}\subset\mathcal{R}_{l+1}$ and so $u\in\mathcal{R}_{l}^{\perp}$ . Using the induction hypothesis, we then have that $S_{l}u=u$ . Moreover, because $u\in\mathcal{R}_{l+1}^{\perp}$ by construction, $u^{\prime}A^{\prime}w_{l}=0$ . Then, using the recursive definition of $S_{l+1}$ , we have that

[TABLE]

Therefore, $u=S_{l+1}u$ and, by Eq. 26, $y=S_{l+1}y$ . We now decompose $y$ into $y_{1}$ and $y_{2}$ where $y_{1}\in\mathcal{R}_{l}$ and $y_{2}\in\mathcal{R}_{l}^{\perp}\cap\mathcal{R}_{l+1}$ . By the induction hypothesis, $\mathcal{R}_{l}^{\perp}\cap\mathcal{R}_{l+1}=\mathrm{span}\left[S_{l}A^{\prime}w_{l}\right]$ . Therefore, $S_{l}y=y_{2}$ and $\exists\alpha\in\mathbb{R}$ such that $y_{2}=\alpha S_{l}A^{\prime}w_{l}$ . Finally, using the recursive formulation of $S_{l+1}$ and $S_{l}y=y_{2}=\alpha S_{l}A^{\prime}w_{l}$ ,

[TABLE]

Thus, we have shown that the range of $S_{l+1}$ is orthogonal to $\mathcal{R}_{l+1}$ . ∎
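Theorem 1 can be spot-checked numerically. The sketch assumes NumPy and assumes that Eq. 13 is the rank-one downdate $S_{l+1}=S_{l}-S_{l}A^{\prime}w_{l}w_{l}^{\prime}AS_{l}/(w_{l}^{\prime}AS_{l}A^{\prime}w_{l})$ , applied only when $S_{l}A^{\prime}w_{l}\neq 0$ ; the function names and tolerances are hypothetical.

```python
import numpy as np

def projector_sequence(A, ws, tol=1e-12):
    """Return [S_0, S_1, ..., S_len(ws)] under the assumed downdate of S."""
    S = np.eye(A.shape[1])
    out = [S.copy()]
    for w in ws:
        q = A.T @ w
        u = S @ q
        denom = q @ u
        if denom > tol * max(1.0, q @ q):   # skip when S q is numerically zero
            S = S - np.outer(u, u) / denom
        out.append(S.copy())
    return out

def is_orthogonal_projector_onto_complement(S, R, atol=1e-8):
    """Check S = S', S^2 = S, S R = 0, and trace(S) = d - rank(R)."""
    d = S.shape[0]
    rank_R = np.linalg.matrix_rank(R) if R.size else 0
    return bool(
        np.allclose(S, S.T, atol=atol)
        and np.allclose(S @ S, S, atol=atol)
        and (R.size == 0 or np.allclose(S @ R, 0, atol=atol))
        and int(round(np.trace(S))) == d - rank_R
    )
```

Here `R` holds the vectors $A^{\prime}w_{0},\ldots,A^{\prime}w_{l-1}$ as columns, so the four conditions together say that `S` is the orthogonal projection onto $\mathcal{R}_{l}^{\perp}$ .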

From Theorem 1, we see that our procedure is an orthogonalization procedure just like quasi-Newton methods (Nocedal and Wright, 2006, Ch. 8) and conjugate direction methods (Hestenes, 2012). As a consequence, we have the following common and insightful characterization of the iterates of such an orthogonalization procedure.

Corollary 1.

In addition to the setting of Theorem 1, let $x_{0}\in\mathbb{R}^{d}$ be arbitrary and let $\{x_{l}:l\in\mathbb{N}\}$ be defined according to Eq. 14. For any $l\geq 0$ , $x_{l+1}\in\mathrm{span}\left[x_{0},A^{\prime}w_{0},\ldots,A^{\prime}w_{l}\right]$ .

Proof.

We again proceed by induction. Because $S_{0}=I_{d}$ , the case of $x_{1}$ follows by the recursion formula, Eq. 14. Now suppose that the result holds up to some $l>0$ . Note, by the recursion formula

[TABLE]

Therefore, $x_{l+1}\in\mathrm{span}\left[x_{l},S_{l}A^{\prime}w_{l}\right]$ . Now, using the induction hypothesis,

[TABLE]

Second, when $S_{l}A^{\prime}w_{l}=0$ , then $A^{\prime}w_{l}\in\mathcal{R}_{l}$ . Consequently,

[TABLE]

Now suppose $S_{l}A^{\prime}w_{l}\neq 0$ . By Theorem 1, $S_{l}$ is an orthogonal projection onto $\mathrm{span}\left[A^{\prime}w_{0},A^{\prime}w_{1},\ldots,A^{\prime}w_{l-1}\right]^{\perp}$ . Hence, $x_{l+1}\in\mathrm{span}\left[x_{l},S_{l}A^{\prime}w_{l}\right]$ , which is contained in $\mathrm{span}\left[x_{0},A^{\prime}w_{0},\ldots,A^{\prime}w_{l}\right].$ ∎

Corollary 1 demonstrates that, as is common with orthogonalization procedures, the iterates are in a subspace generated by the initial iterate and the search directions $\{A^{\prime}w_{0},\ldots,A^{\prime}w_{l}\}$ . For deterministic procedures, such a characterization is usually sufficient and the next step would be to demonstrate that the iterates are the closest points to the true solutions within the given subspace. However, for a procedure in which the subspace is randomly generated, there is substantially more nuance. In order to be conscientious of space, we will not go through the litany of issues, but rather skip to the appropriate definitions and characterizations.

First, we begin by defining the maximal possible subspace that can be generated by a random quantity $A^{\prime}w$ . Let $w\in\mathbb{R}^{n}$ be a random variable defined on a space $\Omega$ , and let

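$$\mathcal{N}(w)=\left\{z\in\mathbb{R}^{d}:\mathbb{P}\left[z^{\prime}A^{\prime}w=0\right]=1\right\}\quad\text{and}\quad\mathcal{R}(w)=\mathcal{N}(w)^{\perp}.\tag{32}$$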

Moreover, we define the subspace $\mathcal{V}(w)$ such that $\mathcal{V}(w)\perp\mathcal{R}(w)$ and $\mathcal{V}(w)+\mathcal{R}(w)=\mathrm{row}(A)$ (hence, $\mathcal{V}(w)\oplus\mathcal{R}(w)=\mathrm{row}(A)$ ). Correspondingly, let $P_{W}$ denote the orthogonal projection matrix onto a subspace $W\subset\mathbb{R}^{d}$ . The following result characterizes $\mathcal{R}(w)$ .

Lemma 2.

For $\mathcal{R}(w)$ as defined in Eq. 32, $\mathcal{R}(w)$ is the smallest subspace of $\mathbb{R}^{d}$ such that $\mathbb{P}\left[A^{\prime}w\in\mathcal{R}(w)\right]=1$ .

Proof.

First, we verify that $\mathbb{P}\left[A^{\prime}w\in\mathcal{R}(w)\right]=1$ . Suppose that $\mathbb{P}\left[A^{\prime}w\in\mathcal{R}(w)\right]<1$ . Then,

[TABLE]

However, we know that for any $z$ such that $z\perp\mathcal{R}(w)$ , $z\in\mathcal{N}(w)$ and $z^{\prime}A^{\prime}w=0$ with probability one, which is a contradiction. Hence, $\mathbb{P}\left[A^{\prime}w\in\mathcal{R}(w)\right]=1$ .

Now suppose there is a proper subspace of $\mathcal{R}(w)$ , $U$ , such that $\mathbb{P}\left[A^{\prime}w\in U\right]=1$ . Let $U^{\perp\mathcal{R}(w)}$ denote the subspace orthogonal to $U$ relative to $\mathcal{R}(w)$ . Then, $\mathbb{P}\left[z^{\prime}A^{\prime}w=0\right]=1$ for any $z\in U^{\perp\mathcal{R}(w)}$ , which implies that $U^{\perp\mathcal{R}(w)}\subset\mathcal{N}(w)$ . However, since $U^{\perp\mathcal{R}(w)}\subset\mathcal{R}(w)\perp\mathcal{N}(w)$ , $U^{\perp\mathcal{R}(w)}=\{0\}$ . Thus, $\mathcal{R}(w)$ is the smallest subspace such that $\mathbb{P}\left[A^{\prime}w\in\mathcal{R}(w)\right]=1$ . ∎

Second, we must define when the maximal possible subspace of $A^{\prime}w$ can be achieved by a sequence of random variables $\{A^{\prime}w_{0},\ldots,A^{\prime}w_{l}\}$ , which may or may not be related to $A^{\prime}w$ . Note, by not requiring a relationship between $\{A^{\prime}w_{0},\ldots,A^{\prime}w_{l}\}$ and $A^{\prime}w$ our next result is particularly general and applies to a variety of situations, from the case in which $\{w_{l}\}$ are independent copies of $w$ to the case where $\{w_{l}\}$ have complex dependencies. Now, let $\{w_{l}:l+1\in\mathbb{N}\}\subset\mathbb{R}^{n}$ be random variables defined on $\Omega$ , and let $T$ be a stopping time defined by

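$$T=\min\left\{k\geq 0:\mathrm{span}\left[A^{\prime}w_{0},\ldots,A^{\prime}w_{k}\right]=\mathcal{R}(w)\right\}.\tag{34}$$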

Using this notation, we have the following fundamental characterization result of the limit points of $\{x_{l}\}$ .

Theorem 2.

Let $w$ be a random variable, and let $\mathcal{R}(w)$ , $\mathcal{N}(w)$ and $\mathcal{V}(w)$ be as defined above (see Eq. 32). Moreover, let $w_{0},w_{1},\ldots\in\mathbb{R}^{n}$ be random variables such that $\mathbb{P}\left[A^{\prime}w_{l}\in\mathcal{R}(w)\right]=1$ for all $l+1\in\mathbb{N}$ , and let $T$ be as defined in Eq. 34. Let $x_{0}\in\mathbb{R}^{d}$ be arbitrary and $S_{0}=I_{d}$ , and let $\{x_{l}:l\in\mathbb{N}\}$ and $\{S_{l}:l\in\mathbb{N}\}$ be defined as in Eqs. 14 and 13. On the event $\{T<\infty\}$ ,

1. For any $s\geq T+1$ , $S_{T+1}=S_{s}$ and $x_{T+1}=x_{s}$ .

2. If $Ax=b$ admits a solution $x^{*}$ (not necessarily unique), then

$$x_{T+1}=P_{\mathcal{R}(w)}x^{*}+P_{\mathcal{N}(w)}x_{0}.$$

Proof.

Recall that $\mathcal{R}_{k+1}=\mathrm{span}\left[A^{\prime}w_{0},\ldots,A^{\prime}w_{k}\right].$ Therefore, by the definition of $T$ , $\mathcal{R}_{T+1}=\mathcal{R}(w)$ on the event that $\{T<\infty\}$ . Therefore, by Theorem 1, $S_{T+1}$ is an orthogonal projection onto $\mathcal{N}(w)$ and its null space is $\mathcal{R}(w)$ .

We now proceed by induction. Because $\mathrm{ker}\left(S_{T+1}\right)=\mathcal{R}(w)$ and $A^{\prime}w_{T+1}\in\mathcal{R}(w)$ with probability one (by hypothesis), $S_{T+1}A^{\prime}w_{T+1}=0$ . Therefore, by the recursion equations, Eqs. 14 and 13, $S_{T+2}=S_{T+1}$ and $x_{T+2}=x_{T+1}$ . Suppose now that $S_{T+l}=S_{T+1}$ and $x_{T+l}=x_{T+1}$ for $l>1$ . Again, by hypothesis, $A^{\prime}w_{T+l}\in\mathcal{R}(w)=\mathrm{ker}\left(S_{T+l}\right)$ . Therefore, $S_{T+l}A^{\prime}w_{T+l}=0$ . By the recursion equations, Eqs. 14 and 13, $S_{T+l+1}=S_{T+l}=S_{T+1}$ and $x_{T+l+1}=x_{T+l}=x_{T+1}$ .

To establish the second part of the result, we must first establish that for any $l\geq 0$ ,

[TABLE]

We will prove this by induction. For $l=0$ ,

[TABLE]

by the recursion equations, Eq. 14. Noting that $S_{0}=I_{d}$ and by using Eq. 13, we conclude that $x_{1}-x^{*}=S_{1}(x_{0}-x^{*})$ . Now suppose that this relationship holds for some $l>0$ . Again, using Eq. 14,

[TABLE]

Using the induction hypothesis, $x_{l}-x^{*}=S_{l}(x_{0}-x^{*})$ and Eq. 13,

[TABLE]

With this result established and noting that $S_{T+1}$ is a projection onto $\mathcal{N}(w)$ (i.e., $P_{\mathcal{N}(w)}=S_{T+1}$ ), on the event $\{T<\infty\}$ ,

[TABLE]

∎
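The relation $x_{l}-x^{*}=S_{l}(x_{0}-x^{*})$ used in the proof can be illustrated on a case where the sampled directions do not span $\mathrm{row}(A)$ . The sketch assumes NumPy and the same assumed forms of Eqs. 13 and 14 as in the proof's recursion; it samples only the first two equations of a system with four unknowns, and all names are hypothetical.

```python
import numpy as np

def run_updates(A, b, ws, x0, tol=1e-12):
    """Apply the assumed x and S recursions for a finite list of vectors w."""
    x, S = np.array(x0, dtype=float), np.eye(A.shape[1])
    for w in ws:
        q = A.T @ w
        u = S @ q
        denom = q @ u
        if denom > tol * max(1.0, q @ q):   # otherwise x and S stay unchanged
            x = x + u * ((w @ b - q @ x) / denom)
            S = S - np.outer(u, u) / denom
    return x, S

def projector_onto_span(V):
    """Orthogonal projector onto the column space of V (full column rank assumed)."""
    Qmat, _ = np.linalg.qr(V)
    return Qmat @ Qmat.T
```

When only two equations are sampled, the final iterate satisfies those two equations, and the part of the initial error lying in $\mathcal{N}(w)$ is left untouched.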

With Theorem 2 in hand, the natural subsequent question is when the limit point of the iterates is actually a solution to the original system. This question is addressed in the following corollary.

Corollary 2.

Under the setting of Theorem 2, on the event $\{T<\infty\}$ , $Ax_{T+1}=b$ if and only if $P_{\mathcal{V}(w)}x_{0}=P_{\mathcal{V}(w)}x^{*}$ .

Proof.

Recall that $\mathrm{row}(A)\perp\mathrm{ker}\left(A\right)$ . Because $\mathcal{R}(w)\subset\mathrm{row}(A)$ , $\mathcal{N}(w)=\mathcal{V}(w)+\mathrm{ker}\left(A\right)$ . Moreover, by the definition of $\mathcal{V}(w)\subset\mathrm{row}(A)$ , $\mathcal{V}(w)\perp\mathrm{ker}\left(A\right)$ . Therefore, $P_{\mathcal{N}(w)}=P_{\mathrm{ker}\left(A\right)}+P_{\mathcal{V}(w)}$ . Now, using the characterization in Theorem 2,

[TABLE]

Similarly, because $I_{d}=P_{\mathrm{ker}\left(A\right)}+P_{\mathcal{V}(w)}+P_{\mathcal{R}(w)}$ ,

[TABLE]

Setting these two quantities equal to each other, we conclude that $Ax_{T+1}=b$ if and only if $AP_{\mathcal{V}(w)}x^{*}=AP_{\mathcal{V}(w)}x_{0}$ . Clearly, if $P_{\mathcal{V}(w)}x_{0}=P_{\mathcal{V}(w)}x^{*}$ then $Ax_{T+1}=b$ . So, what we have left to show is that $AP_{\mathcal{V}(w)}x^{*}=AP_{\mathcal{V}(w)}x_{0}$ implies $P_{\mathcal{V}(w)}x_{0}=P_{\mathcal{V}(w)}x^{*}$ .

Let $A^{+}$ denote the Moore-Penrose pseudo-inverse of $A$ , and recall that $A^{+}A$ is the orthogonal projection onto $\mathrm{row}(A)$ . Moreover, $\mathrm{range}(P_{\mathcal{V}(w)})\subset\mathrm{row}(A)$ . Therefore, if $Ax_{T+1}=b$ , so that $AP_{\mathcal{V}(w)}x_{0}=AP_{\mathcal{V}(w)}x^{*}$ , then

[TABLE]

∎

Corollary 2 provides criteria on the initial condition and on $\mathcal{V}(w)$ to determine when our procedure will solve the linear system. However, we would rarely have a way of choosing the initial condition a priori such that the requirement of Corollary 2 holds. Thus, the alternative is to design $w$ and $\{w_{l}\}$ so that $\mathcal{V}(w)=\{0\}$ , which would guarantee that $Ax_{T+1}=b$ on the event $\{T<\infty\}$ . It is worth reiterating that we have made very limited assumptions about the relationships between $w$ and $\{w_{l}\}$ and amongst $\{w_{l}\}$ . This is important because it allows us to apply the preceding results to a variety of common relationship patterns between $w$ and $\{w_{l}\}$ . In the next subsection, we explore some specific relationships and whether these relationships will result in $\mathcal{V}(w)=\{0\}$ .

3.2 Common Sampling Patterns

Theorem 2 supplies a general result about the behavior of any sampling methodology on the solution of the system using Eqs. 13 and 14, yet it does not suggest a precise sampling methodology. Generally, the sampling methodology choice will depend on both the hardware environment and the nature of the problem. For example, a random permutation sampling methodology will limit the parallelism achievable in Algorithm 5. On the other hand, a random permutation sampling methodology might be well-advised in a sequential setting where very little is known about the coefficient matrix $A$ . Thus, the precise sampling scheme should depend on the hardware environment and should exploit the structure of the problem.

Despite this, in practice, there are two general sampling schemes that form a basis for more problem and hardware specific sampling schemes: random permutation sampling and independent and identically distributed sampling. The former sampling pattern is exemplified by randomly permuting the equations of the linear system. More concretely, let $e_{1},\ldots,e_{n}\in\mathbb{R}^{n}$ be the standard basis; let $w$ be a random variable with nonzero probability on each element of the basis; let $\{w_{l}\}$ be random variables sampled from $\{e_{1},\ldots,e_{n}\}$ without replacement (until the set is exhausted, then we repopulate the set with its original elements and repeat the sampling without replacement). The following statement provides a simple characterization of this sampling scheme.

Lemma 3.

Let $\{W_{1},\ldots,W_{N}\}\subset\mathbb{R}^{n}$ . Let $w$ be a random variable such that

[TABLE]

Moreover, let $\{w_{l}:l+1\in\mathbb{N}\}$ be random variables sampled from $\{W_{1},\ldots,W_{N}\}$ without replacement (and once the set is exhausted, we repopulate the set with its original elements and repeat sampling without replacement). Then $T\leq N-1$ . Moreover, $Ax_{T+1}=b$ for every initialization if $\mathrm{span}\left[A^{\prime}W_{1},\ldots,A^{\prime}W_{N}\right]=\mathrm{row}(A)$ , which holds if $\mathrm{span}\left[W_{1},\ldots,W_{N}\right]=\mathbb{R}^{n}$ .

Proof.

First, note that $\mathcal{N}(w)=\{z\in\mathbb{R}^{d}:z^{\prime}A^{\prime}W_{j}=0,~{}\forall j=1,\ldots,N\}$ . Therefore,

[TABLE]

In turn, because $\{w_{0},\ldots,w_{N-1}\}=\{W_{1},\ldots,W_{N}\}$ , $T$ is at most $N-1$ .

By Corollary 2, $Ax_{T+1}=b$ if and only if $P_{\mathcal{V}(w)}x_{0}=P_{\mathcal{V}(w)}x^{*}$ where $x^{*}$ satisfies $Ax^{*}=b$ . Now, given that $\mathcal{R}(w)+\mathcal{V}(w)=\mathrm{row}(A)$ and $\mathcal{R}(w)=\mathrm{span}\left[A^{\prime}W_{1},\ldots,A^{\prime}W_{N}\right]$ , if $\mathrm{span}\left[A^{\prime}W_{1},\ldots,A^{\prime}W_{N}\right]=\mathrm{row}(A)$ then $\mathcal{V}(w)=\{0\}$ . Therefore, $Ax_{T+1}=b$ for any initialization. The final claim is straightforward. ∎
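As an illustration of the sampling pattern in Lemma 3, the following minimal sketch draws indices without replacement and reshuffles once the population is exhausted. The names `permutation_sampler`, `first_epoch`, and `second_epoch` are hypothetical, and the choice $W_{j}=e_{j}$ (selecting equations) is an assumption made only for the example.

```python
import numpy as np

def permutation_sampler(N, rng):
    """Yield indices 0..N-1 without replacement; reshuffle once exhausted."""
    while True:
        for j in rng.permutation(N):
            yield int(j)

rng = np.random.default_rng(0)
N, d = 5, 3
sampler = permutation_sampler(N, rng)
first_epoch = [next(sampler) for _ in range(N)]
second_epoch = [next(sampler) for _ in range(N)]

# Assumed example: W_j = e_j, so A'W_j is the j-th row of A.
A = rng.standard_normal((N, d))
W = np.eye(N)
# After one full pass, the sampled vectors A'w_l span the row space of A.
rank_sketched = np.linalg.matrix_rank(A.T @ W[:, first_epoch])
```

Each epoch visits every element exactly once, which is why the first $N$ samples exhaust $\{W_{1},\ldots,W_{N}\}$.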

The second sampling scheme, independent and identically distributed sampling, is exemplified by randomly sampling equations from the system with uniform discrete probability. However, we do not need to limit ourselves to sampling from a finite population of elements. As the next result shows, we can do much more.

Proposition 1.

Suppose that $w,w_{0},w_{1},\ldots$ are independent, identically distributed random variables. There exists a $\pi\in(0,1)$ such that

[TABLE]

Moreover, $T<\infty$ and $\mathbb{P}\left[T=k\right]\leq(k-r)^{r-1}(1-\pi)^{k-r}$ where $r=\dim(\mathcal{R}(w))$ and $k\geq r$ .

Proof.

First, we show that there exists $\pi>0$ such that for any nontrivial, proper subspace $V\subsetneq\mathcal{R}(w)$ , $\mathbb{P}[A^{\prime}w\not\in V]\geq\pi$ , which implies Eq. 46 when we take $V$ to be the relative orthogonal complement to the span of a unit vector $v\in\mathcal{R}(w)$ . Suppose there is no such $\pi$ . Then, for every $p\in(0,1)$ , there is a nontrivial subspace $V\subsetneq\mathcal{R}(w)$ such that $\mathbb{P}\left[A^{\prime}w\in V\right]\geq 1-p$ . Let $r$ be the smallest integer between [math] and $\dim(\mathcal{R}(w))$ such that

[TABLE]

For $\epsilon>0$ , let $V_{1}\subsetneq\mathcal{R}(w)$ be an $r$-dimensional subspace with $\mathbb{P}[A^{\prime}w\in V_{1}]\geq 1-\epsilon/2$ . Note, by Lemma 2, $\mathbb{P}[A^{\prime}w\in V_{1}]<1$ . Therefore, let $V_{2}\subsetneq\mathcal{R}(w)$ be an $r$-dimensional subspace with $\mathbb{P}[A^{\prime}w\in V_{2}]>\mathbb{P}[A^{\prime}w\in V_{1}]\geq 1-\epsilon/2$ . Because $V_{1}$ and $V_{2}$ are distinct, the inclusion-exclusion principle gives

[TABLE]

However, this contradicts the minimality of $r$ since $\epsilon>0$ is arbitrary and $\dim(V_{1}\cap V_{2})<r$ . Thus, we conclude that such a $\pi$ exists.

It follows from Eq. 46 that for any $k$ ,

[TABLE]

Therefore, we can bound $\mathbb{P}[T=k]$ by a negative binomial distribution. In particular,

[TABLE]

∎

In light of the two preceding results, we may be convinced that there is a gap between the convergence properties of random permutation sampling and those of independent and identically distributed sampling. However, by modifying the structure of the rank-one RPM, we can find more intermediate cases. The next result demonstrates this behavior with a somewhat contrived example, and we will leave more complex cases to future work.

Theorem 3.

Suppose $w,w_{0},w_{1},\ldots$ are i.i.d. random variables such that the entries of $A^{\prime}w$ are independent, identically distributed subgaussian random variables with mean zero and unit variance. Then, there exists a $\pi\in(0,1)$ depending only on the distribution of the entries of $A^{\prime}w$ such that $\mathbb{P}\left[T=k\right]\geq 1-\pi^{k}$ for $k\geq d$ .

Proof.

Let $H_{k}$ denote a $k\times d$ ( $k\geq d$ ) random matrix whose entries are independent and identically distributed subgaussian random variables with zero mean and unit variance. As a consequence of (Rudelson and Vershynin, 2009, Theorem 1.1), there exists a $\pi$ that depends on the distribution of the entries such that for all $k\geq d$ , $\mathbb{P}\left[\sigma_{\min}(H_{k})>0\right]\geq 1-\pi^{k}$ . At iteration $k$ , let $N_{k}$ denote the matrix whose rows are given by $w_{0},w_{1},\ldots$ . Then, by hypothesis, $N_{k}A$ has entries that are independent, identically distributed subgaussian random variables with zero mean and unit variance. Therefore, there exists a $\pi\in(0,1)$ depending only on the distribution of the entries in $A^{\prime}w$ such that $\mathbb{P}\left[T=k\right]=\mathbb{P}\left[\sigma_{\min}(N_{k}A)>0\right]\geq 1-\pi^{k}$ for $k\geq d$ . ∎

4 Convergence Theory for Base Methods

In the previous section, we proved convergence for the complete orthogonalization method (i.e., Algorithm 1) and explored some specific sampling patterns. Here, we will consider the extreme opposite of the complete orthogonalization method: the “base” randomized iterative approach (e.g., Randomized Kaczmarz). That is, we consider when $V_{k}$ is a rank one matrix of one of two general classes.

In the first class, we consider Algorithm 2 in the case $m=0$ . In this case, Eq. 14 supplies the simplified iteration scheme,

[TABLE]

which encompasses randomized Kaczmarz, when $w_{k}$ is a random draw from the standard basis vectors in $\mathbb{R}^{n}$ , as shown in Subsection 2.1.
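To make the row-action class concrete, here is a minimal sketch under the assumption that Eq. 51 takes the usual projection form $x_{k+1}=x_{k}+A^{\prime}w_{k}\,w_{k}^{\prime}(b-Ax_{k})/\|A^{\prime}w_{k}\|_{2}^{2}$, which is consistent with the error recursion used in the proof of Theorem 5. The function name `row_action_step`, the test problem, and the iteration count are hypothetical choices for illustration.

```python
import numpy as np

def row_action_step(A, b, x, w):
    """One projection of x onto the hyperplane {z : w'Az = w'b}."""
    a = A.T @ w                          # A'w, a vector in R^d
    return x + a * (w @ (b - A @ x)) / (a @ a)

rng = np.random.default_rng(1)
n, d = 8, 3
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                           # consistent system by construction

x = np.zeros(d)
for _ in range(5000):
    w = np.zeros(n)
    w[rng.integers(n)] = 1.0             # uniform draw from the standard basis of R^n
    x = row_action_step(A, b, x, w)

error = np.linalg.norm(x - x_star)
```

With $w_{k}$ drawn from the standard basis, each step projects onto a single equation, which is the randomized Kaczmarz update.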

Unfortunately, Eq. 51 would not include randomized Gauss-Seidel. This motivates the second class, which has the closely related iteration

[TABLE]

In this class, we recover randomized Gauss-Seidel if we choose $w_{k}$ randomly from the standard basis vectors in $\mathbb{R}^{d}$ , as shown in Subsection 2.1.
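A companion sketch for the column-action class follows, assuming Eq. 52 has the form $x_{k+1}=x_{k}+w_{k}\,w_{k}^{\prime}A^{\prime}(b-Ax_{k})/\|Aw_{k}\|_{2}^{2}$, which matches the residual recursion in Subsection 4.3. The name `column_action_step` and the test problem are hypothetical.

```python
import numpy as np

def column_action_step(A, b, x, w):
    """One column-action update; the residual is projected orthogonally to A w."""
    c = A @ w                            # A w, a vector in R^n
    return x + w * (c @ (b - A @ x)) / (c @ c)

rng = np.random.default_rng(5)
n, d = 8, 3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)           # consistent system by construction

x = np.zeros(d)
residual_norms = [np.linalg.norm(A @ x - b)]
for _ in range(5000):
    w = np.zeros(d)
    w[rng.integers(d)] = 1.0             # uniform draw from the standard basis of R^d
    x = column_action_step(A, b, x, w)
    residual_norms.append(np.linalg.norm(A @ x - b))
```

Because each step applies an orthogonal projection to the residual, the recorded residual norms never increase.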

While these two classes are distinct, we will see that their analysis is nearly identical and is intimately related to the analysis of the complete orthogonalization method. Our analysis offers two highlights: (1) we can prove convergence with probability one for arbitrary sampling schemes—only the i.i.d. case is considered in Zouzias and Freris (2013); Gower and Richtárik (2015); Richtárik and Takác (2020); and (2) we can provide rates of convergence with probability one which complements the mean-squared-error results of Zouzias and Freris (2013); Gower and Richtárik (2015); Richtárik and Takác (2020). Our main approach is an extension of Meany’s inequality (see Subsection 4.1) combined with stopping time arguments, as derived in Subsections 4.2 and 4.3. We then explore some common, non-adaptive sampling patterns in Subsection 4.4. To conclude, we develop a general framework for the analysis of adaptive sampling schemes, and provide concrete examples from the literature (see Subsection 4.5).

4.1 An Extension of Meany’s Inequality

Here, we will derive an extension of Meany’s Inequality (Meany, 1969), which, under a different extension, has recently been used to study the convergence rate of row-action solvers, including a block variant of the Kaczmarz method (Bai and Liu, 2013). We begin by stating a geometric lemma derived by Meany (1969), and follow it with the extension, whose proof closely follows Meany’s original proof with several modifications.

Lemma 4 (Meany (1969)).

Let $f_{1},\ldots,f_{k}\in\mathbb{R}^{n}$ with $k\leq n$ . Write $f_{k}=f^{S}+f^{N}$ where $f^{S}$ belongs to the space $S$ spanned by $f_{1},\ldots,f_{k-1}$ and $f^{N}$ is perpendicular to $S$ . Let $\bar{F}$ be the matrix whose columns are $f_{1},\ldots,f_{k-1}$ , and let $F$ be the matrix whose columns are $f_{1},\ldots,f_{k}$ . Then,

[TABLE]

Theorem 4.

Let $v_{1},\ldots,v_{k}$ be unit vectors in $\mathbb{R}^{n}$ for some $k\in\mathbb{N}$ . Let $S=\mathrm{span}\left[v_{1},\ldots,v_{k}\right]$ . Let $\mathcal{F}$ denote all matrices $F$ where the columns of $F$ are the vectors $\{f_{1},\ldots,f_{r}\}\subset\{v_{1},\ldots,v_{k}\}$ that are a maximal linearly independent subset. Then

[TABLE]

where

[TABLE]

Proof.

The proof proceeds by induction. For the case $k=1$ , both sides of the inequality are zero and so the result holds. Now suppose that the result holds for $k=j-1$ . To prove the case $k=j$ , we need the following additional notation.

Let $\bar{S}=\mathrm{span}\left[v_{1},\ldots,v_{j-1}\right]$ ; let $\{f_{1},\ldots,f_{\bar{r}}\}$ denote a maximal linearly independent subset of the unit vectors $\{v_{1},\ldots,v_{j-1}\}$ that achieve the minimum determinant; let $\bar{F}$ be the matrix whose columns are $f_{1},\ldots,f_{\bar{r}}$ ; and let

[TABLE]

For a unit vector $y\in S$ , let $y^{\bar{S}}$ denote the component of $y$ in $\bar{S}$ , and let $y^{N}$ denote the component of $y$ orthogonal to $\bar{S}$ . Moreover, let $z=\bar{Q}y^{\bar{S}}$ . Then, by the induction hypothesis,

[TABLE]

Similarly, write $v_{j}=v^{\bar{S}}+v^{N}$ where $v^{\bar{S}}\in\bar{S}$ and $v^{N}$ is perpendicular to $\bar{S}$ .

Case A: Suppose that $S=\bar{S}$ . Then $y=y^{\bar{S}}$ . Moreover, since $\bar{F}\in\mathcal{F}$ ,

[TABLE]

Thus, the result holds when $S=\bar{S}$ .

Case B: Suppose that $S\supsetneq\bar{S}$ . Then,

[TABLE]

where we have used the fact that $v^{N}$ and $y^{N}$ are collinear, implying that their inner product is equal to the product of their norms. Finally, since $-2z^{\prime}v^{\bar{S}}\leq 2|z^{\prime}v^{\bar{S}}|$ ,

[TABLE]

Case B(1): Suppose that $\left\|v^{N}\right\|_{2}\leq\left\|y^{\bar{S}}\right\|_{2}$ . Then,

[TABLE]

where, in the last line, we use Lemma 4 and, since $S\neq\bar{S}$ , $f_{\bar{r}+1}=v_{j}$ , which, in turn, implies $f^{N}=v^{N}$ .

Case B(2): Suppose that $\left\|v^{N}\right\|_{2}>\left\|y^{\bar{S}}\right\|_{2}$ . Since $\left\|v_{j}\right\|_{2}=\left\|y\right\|_{2}=1$ , then $\left\|v^{\bar{S}}\right\|_{2}\leq\left\|y^{N}\right\|_{2}$ . Using these inequalities and Eq. 57,

[TABLE]

Therefore,

[TABLE]

Applying this relationship to Eq. 62,

[TABLE]

where, in the last line, we use Lemma 4 and, since $S\neq\bar{S}$ , $f_{\bar{r}+1}=v_{j}$ , which, in turn, implies $f^{N}=v^{N}$ .

Therefore, from Cases A, B(1) and B(2), we conclude that the result holds. ∎
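For intuition, the classical form of Meany's inequality can be checked numerically. The sketch below assumes linearly independent unit vectors $f_{1},\ldots,f_{k}$ and the classical bound $\|(I-f_{k}f_{k}^{\prime})\cdots(I-f_{1}f_{1}^{\prime})y\|_{2}\leq\sqrt{1-\det(F^{\prime}F)}$ for unit $y$ in their span; it does not reproduce the exact statement of Theorem 4, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 4
F = rng.standard_normal((n, k))
F /= np.linalg.norm(F, axis=0)           # unit-norm columns f_1, ..., f_k

# Q = (I - f_k f_k') ... (I - f_1 f_1')
Q = np.eye(n)
for j in range(k):
    f = F[:, j]
    Q = (np.eye(n) - np.outer(f, f)) @ Q

bound = np.sqrt(1.0 - np.linalg.det(F.T @ F))

# Sample unit vectors in span(F) and record the largest observed norm of Q y.
worst = 0.0
for _ in range(1000):
    y = F @ rng.standard_normal(k)
    y /= np.linalg.norm(y)
    worst = max(worst, np.linalg.norm(Q @ y))
```

The sampled maximum `worst` only probes the bound from below, so it is a sanity check rather than a verification of tightness.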

4.2 Main Convergence Result for Row-Action Methods

Recall that $w\in\mathbb{R}^{n}$ is a random variable and $\{w_{\ell}:\ell+1\in\mathbb{N}\}$ is a sequence of random variables taking value in $\mathbb{R}^{n}$ chosen such that $A^{\prime}w_{\ell}\in\mathcal{R}(w)$ . [Footnote 7: Again, we can avoid this requirement and consider set inclusions below. However, this generalization will require additional, cumbersome notation and there is no practical reason for considering this case.] We will now define a sequence of stopping times $\{\tau_{\ell}:\ell+1\in\mathbb{N}\}$ where $\tau_{0}=0$ ,

[TABLE]

and, if $\tau_{\ell-1}<\infty$ , we define

[TABLE]

else $\tau_{\ell}=\infty$ . As an aside, it is worthwhile to note the commonalities between the definition of $\{\tau_{\ell}\}$ and the stopping time $T$ from Eq. 34.

Moreover, whenever the stopping times are finite, we will define the collection, $\mathcal{F}_{\ell}$ , for $\ell\in\mathbb{N}$ , that contains all matrices $F$ whose columns are maximal linearly independent subsets of

[TABLE]

Moreover, define

[TABLE]

Note, it follows by Hadamard’s inequality that $\gamma_{\ell}\in[0,1)$ .
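The quantity $\gamma_{\ell}$ can be computed by brute force for small examples. The sketch below assumes $\gamma=1-\min_{F}\det(F^{\prime}F)$, where $F$ ranges over matrices whose columns are normalized, maximal linearly independent subsets of the given vectors; this form is inferred from the expectation stated in Proposition 3 rather than copied from Eq. 69. The function name `gamma_from_vectors` and the tolerance are hypothetical.

```python
import itertools
import numpy as np

def gamma_from_vectors(V, tol=1e-10):
    """Return 1 - min det(F'F) over maximal linearly independent column subsets of V."""
    V = V / np.linalg.norm(V, axis=0)    # normalize the columns
    r = np.linalg.matrix_rank(V, tol=tol)
    dets = []
    for cols in itertools.combinations(range(V.shape[1]), r):
        F = V[:, list(cols)]
        if np.linalg.matrix_rank(F, tol=tol) == r:
            dets.append(np.linalg.det(F.T @ F))
    return 1.0 - min(dets)

# Three unit vectors in R^2: e_1, e_2 and (e_1 + e_2)/sqrt(2).
s = 1.0 / np.sqrt(2.0)
V = np.array([[1.0, 0.0, s],
              [0.0, 1.0, s]])
gamma = gamma_from_vectors(V)
```

In this example the pair $\{e_{1},e_{2}\}$ has determinant $1$, while either pair containing $(e_{1}+e_{2})/\sqrt{2}$ has determinant $1/2$, so the minimum is $1/2$ and $\gamma=1/2$. The enumeration is exponential in the number of vectors and is only meant for small illustrations.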

Theorem 5.

Suppose $Ax=b$ admits a solution $x^{*}$ (not necessarily unique). Let $w$ be a random variable valued in $\mathbb{R}^{n}$ , and let $\mathcal{R}(w),$ $\mathcal{N}(w)$ and $\mathcal{V}(w)$ be defined as above (see Eq. 32). Moreover, let $\{w_{\ell}:\ell+1\in\mathbb{N}\}$ be random variables such that $\mathbb{P}\left[A^{\prime}w_{\ell}\in\mathcal{R}(w)\right]=1$ for all $\ell+1\in\mathbb{N}$ . Let $x_{0}\in\mathbb{R}^{d}$ be arbitrary and let $\{x_{k}:k\in\mathbb{N}\}$ be defined as in Eq. 51. Then, for any $\ell$ , on the event $\{\tau_{\ell}<\infty\}$ ,

[TABLE]

where $\gamma_{j}$ are defined in Eq. 69 and $\gamma_{j}\in[0,1)$ . Therefore, for any $k$ ,

[TABLE]

where $L(k)=\max\{\ell:k\geq\tau_{\ell}+1\}$ ; and where we are on the event $\{\tau_{L(k)}<\infty\}$ .

Proof.

From the basic iteration stated in Eq. 51, we have

[TABLE]

Iterating on this relationship, we conclude

[TABLE]

Moreover, by assumption, $A^{\prime}w_{\ell}\in\mathcal{R}(w)$ with probability one, which implies that $A^{\prime}w_{\ell}\perp\mathcal{N}(w)$ . Therefore,

[TABLE]

and $P_{\mathcal{N}(w)}(x_{k}-x^{*})=P_{\mathcal{N}(w)}(x_{0}-x^{*})$ .

Note, when $\tau_{1}$ is finite, then the span of $\{A^{\prime}w_{0},\ldots,A^{\prime}w_{\tau_{1}}\}$ is $\mathcal{R}(w)$ . Therefore, on the event $\tau_{1}<\infty$ , Theorem 4 implies that

[TABLE]

We now proceed by induction. Suppose Eq. 70 holds for some $\ell\in\mathbb{N}$ . Using Eq. 74, for $k>\tau_{\ell}$ ,

[TABLE]

Now, when $k=\tau_{\ell+1}+1$ , the conditions of Theorem 4 are satisfied. Therefore,

[TABLE]

By applying the induction hypothesis, we conclude that Eq. 70 holds on the event $\{\tau_{\ell+1}<\infty\}$ .

Now, for an orthogonal projection matrix, $I-vv^{\prime}$ , $\left\|I-vv^{\prime}\right\|_{2}=1$ . The bound on $x_{k}-x^{*}-P_{\mathcal{N}}(x_{0}-x^{*})$ follows by applying this fact and the definition of $L(k)$ . ∎

As an analogue of Corollary 2, we have the following characterization of whether $\lim_{k\to\infty}x_{k}$ solves the system $Ax=b$ .

Corollary 3.

Under the setting of Theorem 5, on the events $\bigcap_{\ell=0}^{\infty}\{\tau_{\ell}<\infty\}$ and $\{\lim_{\ell\to\infty}\prod_{j=0}^{\ell}\gamma_{j}=0\}$ , $\lim_{k\to\infty}Ax_{k}=b$ if and only if $P_{\mathcal{V}(w)}x_{0}=P_{\mathcal{V}(w)}x^{*}$ .

Proof.

By Theorem 5, and on the events $\bigcap_{\ell=0}^{\infty}\{\tau_{\ell}<\infty\}$ and $\{\lim_{\ell\to\infty}\prod_{j=1}^{\ell}\gamma_{j}=0\}$ ,

[TABLE]

Therefore, $\lim_{k\to\infty}Ax_{k}=b+AP_{\mathcal{V}(w)}(x_{0}-x^{*})$ , which implies $\lim_{k\to\infty}Ax_{k}=b$ if and only if $AP_{\mathcal{V}(w)}x_{0}=AP_{\mathcal{V}(w)}x^{*}$ . Clearly, if $P_{\mathcal{V}(w)}x_{0}=P_{\mathcal{V}(w)}x^{*}$ , then $AP_{\mathcal{V}(w)}x_{0}=AP_{\mathcal{V}(w)}x^{*}$ . Now, since $\mathcal{V}(w)\subset\mathrm{row}(A)$ , if $AP_{\mathcal{V}(w)}x_{0}=AP_{\mathcal{V}(w)}x^{*}$ , then $P_{\mathcal{V}(w)}x_{0}=P_{\mathcal{V}(w)}x^{*}$ follows from Eq. 43. ∎

4.3 Main Convergence Result for Column-Action Methods

For the family of methods specified by Eq. 52, we will follow an almost identical proof except on the residual rather than the error. Specifically, if we let $r_{k}=Ax_{k}-b$ , then Eq. 52 implies

[TABLE]

Thus, we will see two changes in the proof. First, we will see that $r_{k}$ for column-action methods will take the place of $x_{k}-x^{*}$ for row-action methods. Second, we already see that $Aw_{k}$ in Eq. 79 has taken the place of $A^{\prime}w_{k}$ in Eq. 72. Owing to this latter issue, we will need to specify analogues of $\mathcal{R}(w)$ , $\mathcal{N}(w)$ and $\mathcal{V}(w)$ .

Let $w\in\mathbb{R}^{d}$ be a random variable, and let

[TABLE]

Just as $\mathcal{N}(w)$ generalized the null space of $A$ under the action of an $n$ -dimensional random variable from the left, we see that $\mathcal{L}(w)$ is a generalization of the left null space of $A$ under the action of a $d$ -dimensional random variable from the right. Analogously, just as $\mathcal{R}(w)$ restricted the row space of $A$ under the action of an $n$ -dimensional random variable from the left, we see that $\mathcal{C}(w)$ is a restriction of the column space of $A$ under the action of a $d$ -dimensional random variable from the right. Finally, we let $\mathcal{E}(w)$ denote the subspace that is orthogonal to $\mathcal{C}(w)$ such that $\mathcal{E}(w)\oplus\mathcal{C}(w)$ is the column space of $A$ .

With these new definitions, we may proceed just as we do in Subsection 4.2. For a random variable $w\in\mathbb{R}^{d}$ , let $\{w_{\ell}:\ell+1\in\mathbb{N}\}$ be a sequence of random variables in $\mathbb{R}^{d}$ such that $Aw_{\ell}\in\mathcal{C}(w)$ . We will now define a sequence of stopping times $\{\tau_{\ell}:\ell+1\in\mathbb{N}\}$ where $\tau_{0}=0$ ,

[TABLE]

and, if $\tau_{\ell-1}<\infty$ , we define

[TABLE]

else $\tau_{\ell}=\infty$ .

Moreover, whenever the stopping times are finite, we will define a collection, $\mathcal{F}_{\ell},$ for $\ell\in\mathbb{N}$ , that contains all matrices $F$ whose columns are maximal linearly independent subsets of

[TABLE]

We can then define $\gamma_{\ell}$ just as we do in Eq. 69. For completeness, we will define it again here so that we reference the appropriate definitions. Define

[TABLE]

Theorem 6.

Suppose $Ax=b$ admits a solution $x^{*}$ (not necessarily unique). Let $w$ be a random variable valued in $\mathbb{R}^{d}$ , and let $\mathcal{C}(w),\mathcal{L}(w)$ and $\mathcal{E}(w)$ be defined as above (see Eq. 80). Moreover, let $\{w_{\ell}:\ell+1\in\mathbb{N}\}$ be random variables such that $\mathbb{P}\left[Aw_{\ell}\in\mathcal{C}(w)\right]=1$ for all $\ell+1\in\mathbb{N}$ . Let $x_{0}\in\mathbb{R}^{d}$ be arbitrary, let $\{x_{k}:k\in\mathbb{N}\}$ be defined as in Eq. 52, and define $r_{k}=Ax_{k}-b$ for $k+1\in\mathbb{N}$ . Then, for any $\ell$ , on the event $\{\tau_{\ell}<\infty\}$ ,

[TABLE]

where $\gamma_{j}$ are defined in Eq. 84 and $\gamma_{j}\in[0,1)$ . Therefore, for any $k$ ,

[TABLE]

where $L(k)=\max\{\ell:k\geq\tau_{\ell}+1\}$ ; and where we are on the event $\{\tau_{L(k)}<\infty\}$ .

Proof.

Iterating on Eq. 79, we conclude

[TABLE]

Moreover, by assumption, $Aw_{\ell}\in\mathcal{C}(w)$ with probability one, which implies $Aw_{\ell}\perp\mathcal{L}(w)$ . Therefore,

[TABLE]

and $P_{\mathcal{L}(w)}r_{k}=P_{\mathcal{L}(w)}r_{0}.$

Note, when $\tau_{1}$ is finite, then the span of $\{Aw_{0},\ldots,Aw_{\tau_{1}}\}$ is $\mathcal{C}(w)$ . Therefore, on the event $\tau_{1}<\infty$ , Theorem 4 implies that

[TABLE]

We now proceed by induction. Suppose Eq. 85 holds for some $\ell\in\mathbb{N}$ . Using Eq. 88, for $k>\tau_{\ell}$ ,

[TABLE]

Now, when $k=\tau_{\ell+1}+1$ , the conditions of Theorem 4 are satisfied. Therefore,

[TABLE]

By applying the induction hypothesis, we conclude that Eq. 85 holds on the event $\{\tau_{\ell+1}<\infty\}$ . The second part of the result follows readily. ∎

We have the following characterization of whether $\lim_{k\to\infty}x_{k}$ solves the system $Ax=b$ .

Corollary 4.

Under the setting of Theorem 6, on the events $\bigcap_{\ell=0}^{\infty}\{\tau_{\ell}<\infty\}$ and $\{\lim_{\ell\to\infty}\prod_{j=0}^{\ell}\gamma_{j}=0\}$ , $\lim_{k\to\infty}Ax_{k}=b$ if and only if $P_{\mathcal{E}(w)}r_{0}=0$ .

Proof.

On the events $\bigcap_{\ell=0}^{\infty}\{\tau_{\ell}<\infty\}$ and $\{\lim_{\ell\to\infty}\prod_{j=0}^{\ell}\gamma_{j}=0\}$ , Theorem 6 implies

[TABLE]

It straightforwardly follows that $\lim_{k\to\infty}Ax_{k}=b$ if and only if $P_{\mathcal{L}(w)}r_{0}=0$ .

Moreover, by construction of $\mathcal{L}(w)$ , we have that $\mathcal{L}(w)=\mathcal{E}(w)\oplus\ker(A^{\prime})$ . Thus,

[TABLE]

Since the left null space of $A$ is orthogonal to the column space of $A$ , and $r_{0}$ is in the column space of $A$ because $Ax=b$ is consistent, we have that $P_{\mathcal{L}(w)}r_{0}=P_{\mathcal{E}(w)}r_{0}$ . ∎

4.4 Common, Non-Adaptive Sampling Patterns

Just as for Theorem 2, Theorems 5 and 6 are general results that characterize convergence for any sampling scheme. Following the discussion in Subsection 3.2, the sampling scheme should depend on the hardware environment and the problem setting. Despite this, the two sampling patterns studied in Subsection 3.2 form a foundation for most sampling schemes in practice and warrant a precise analysis. After this analysis, we turn to certain adaptive schemes that have become popular and analyze them in a generic manner (see Subsection 4.5). We will focus on the case of row-action methods (corresponding to Theorem 5) as the column-action results (corresponding to Theorem 6) are nearly identical.

The first result provides a proof of convergence when we sample without replacement from a finite population. We note that the result is quite general and does not depend on the nature of the sampling without replacement or the dependency of the samples whenever the finite population is exhausted. As a result, the bounds are loose, which may be unsatisfying. Should particular sampling patterns become sufficiently important to warrant a more detailed analysis, we will do so in future work.

Proposition 2.

Let $w$ and $\{w_{\ell}:\ell+1\in\mathbb{N}\}$ be defined as in Lemma 3. Then, under the setting of Theorem 5,

1. $\tau_{\ell}-\tau_{\ell-1}\leq 2N$ for all $\ell\in\mathbb{N}$ , and

2. $\lim_{\ell\to\infty}\prod_{j=1}^{\ell}\gamma_{j}=0$ .

Moreover, $\gamma_{j}$ are uniformly bounded by $\gamma\in[0,1)$ that depends on $\{A^{\prime}W_{1},\ldots,A^{\prime}W_{N}\}$ . Therefore, with probability one,

[TABLE]

Proof.

By the definition of $w$ in Lemma 3, $\mathcal{R}(w)=\mathrm{span}\left[A^{\prime}W_{1},\ldots,A^{\prime}W_{N}\right]$ . Moreover, by the definitions of $\{w_{\ell}\}$ , we are sampling from $W_{1},\ldots,W_{N}$ without replacement. Then, we are guaranteed that $\{A^{\prime}w_{\tau_{\ell-1}+1},\ldots,A^{\prime}w_{\tau_{\ell}}\}$ spans $\mathcal{R}(w)$ if $\{W_{1},\ldots,W_{N}\}\subset\{w_{\tau_{\ell-1}+1},\ldots,w_{\tau_{\ell}}\}$ . Now, suppose that at iteration $\tau_{\ell-1}$ , $\mathcal{W}\subset\{W_{1},\ldots,W_{N}\}$ are exhausted. Then, to ensure that $\{W_{1},\ldots,W_{N}\}$ is contained in $\{w_{\tau_{\ell-1}+1},\ldots,w_{\tau_{\ell}}\}$ , we need to exhaust $\mathcal{W}^{c}$ and then the entire set $\{W_{1},\ldots,W_{N}\}$ . Since $|\mathcal{W}^{c}|\leq N$ , we need at most $2N$ more iterations from $\tau_{\ell-1}$ to achieve $\tau_{\ell}$ . Therefore, $\tau_{\ell}-\tau_{\ell-1}\leq 2N$ . Now, let $\mathcal{F}$ denote all matrices whose columns are maximal linearly independent subsets of

[TABLE]

Then, $\mathcal{F}_{\ell}\subset\mathcal{F}$ . Therefore,

[TABLE]

It is clear, by Hadamard’s inequality, that $\gamma\in[0,1)$ . Hence, $\lim_{\ell\to\infty}\prod_{j=1}^{\ell}\gamma_{j}\leq\lim_{\ell\to\infty}\gamma^{\ell}=0$ . The result follows by Theorem 5. ∎
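The bound on the gaps between stopping times can be observed in simulation. The sketch below assumes that $\tau_{\ell}$ is the first index at which the vectors sampled since the previous stopping time span $\mathcal{R}(w)$, as suggested by the proofs of Theorem 5 and Proposition 2, and it takes $W_{j}=e_{j}$ so that $\mathcal{R}(w)=\mathrm{row}(A)$. The helpers `make_sampler` and `stopping_times` are hypothetical.

```python
import numpy as np

def make_sampler(N, rng):
    """Sample indices without replacement; repopulate when the pool is empty."""
    pool = []
    def draw():
        if not pool:
            pool.extend(rng.permutation(N).tolist())
        return pool.pop()
    return draw

def stopping_times(A, draw, num_blocks):
    """Record the index k at which each block of sampled rows first spans row(A)."""
    target = np.linalg.matrix_rank(A)
    taus, block, k = [], [], 0
    while len(taus) < num_blocks:
        block.append(A[draw(), :])       # A'w_k for w_k = e_j is row j of A
        if np.linalg.matrix_rank(np.array(block)) == target:
            taus.append(k)
            block = []                   # start collecting the next block
        k += 1
    return taus

rng = np.random.default_rng(3)
N, d = 6, 3
A = rng.standard_normal((N, d))
taus = stopping_times(A, make_sampler(N, rng), num_blocks=50)
gaps = np.diff([0] + taus)               # tau_l - tau_{l-1} with tau_0 = 0
```

For a generic matrix the observed gaps are usually far smaller than $2N$, which reflects the looseness of the bound noted before Proposition 2.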

It is worth pausing here to compare our approach in Proposition 2 to previous results for cyclic row-action methods (e.g., Kaczmarz (1993) [Footnote 8: This is a translated copy of Kaczmarz’s original article, which was published in German (Kaczmarz, 1937).], the algebraic reconstruction technique (Gordon et al., 1970), and cyclic block Kaczmarz). Our use of Meany’s inequality to analyze such methods is not novel: Meany’s inequality has been used previously to analyze deterministic row-action methods (Galántai, 2005; Bai and Liu, 2013; Wallace and Sekmen, 2014) with even more sophisticated refinements of Meany’s inequality than what we have here, and a detailed comparison of Meany’s inequality and other approaches to analyzing these deterministic variants can be found in Dai and Schön (2015). However, our use of Meany’s inequality generalizes these deterministic approaches as it (1) allows for an arbitrary transformation (via $\{W_{1},\ldots,W_{N}\}$ ) of the original system, which has proven to be a fruitful approach vis-à-vis matrix sketching (Woodruff, 2014); and (2) allows for the benefits of random cyclic sampling, which many have observed to be the most productive route in practice, and for which there is mounting evidence in adjacent fields of practical benefits (Lee and Wright, 2019; Wright and Lee, 2020). While our generalizations are valuable, further improvements are to be found by marrying our randomization framework with the more nuanced refinements of Meany’s inequality found in Galántai (2005) and Bai and Liu (2013), which we leave to future efforts.

The next result revisits the case of independent and identically distributed sampling. The result makes intuitive sense as, for such a situation, we should expect the differences in the stopping times to be independent and identically distributed, which results in the natural conclusion that $\gamma_{\ell}$ are also independent and identically distributed. Moreover, we show that eventually, the rate of convergence is almost controlled by $\mathbb{E}\left[\gamma_{1}\right]$ with probability one. We again stress here that the generality of the results naturally makes them quite loose, and we discuss this further after the result.

Proposition 3.

Let $w$ and $\{w_{\ell}:\ell+1\in\mathbb{N}\}$ be defined as in Proposition 1. Then, under the setting of Theorem 5, $\tau_{\ell}<\infty$ almost surely for all $\ell\in\mathbb{N}$ , and $\{\gamma_{\ell}:\ell\in\mathbb{N}\}$ are independent and identically distributed such that $\mathbb{E}\left[\gamma_{1}\right]=1-\mathbb{E}\left[\min_{F\in\mathcal{F}_{1}}\det(F^{\prime}F)\right]<1$ . Hence, for all $\ell\in\mathbb{N}$ and $\delta>1$ ,

[TABLE]

where $\mathbb{E}\left[\gamma_{\ell}\right]\in[0,1)$ . Moreover, $\lim_{\ell\to\infty}\tau_{\ell}/\ell=\mathbb{E}\left[\tau_{1}\right]$ .

Remark 2.

In the proof below, we also compute the probability for each $j$ for which the conclusion of the preceding result holds. Thus, we can also make the usual “high-probability” statements without any additional effort.

Proof.

Again, our main workhorse will be (Durrett, 2010, Theorem 4.1.3). By this result, conditioned on $\tau_{\ell-1}$ , $\{A^{\prime}w_{\tau_{\ell-1}+1},A^{\prime}w_{\tau_{\ell-1}+2},\ldots\}$ are independent and identically distributed. By this property, conditioned on $\tau_{\ell-1}$ , $\tau_{\ell}-\tau_{\ell-1}$ is independent of $\tau_{\ell-1}$ and has the same distribution for all $\ell\in\mathbb{N}$ . We conclude then that since $\gamma_{\ell}$ is a function of $\{A^{\prime}w_{\tau_{\ell-1}+1},\ldots,A^{\prime}w_{\tau_{\ell}}\}$ , then $\gamma_{\ell}$ are independent and identically distributed. We now conclude that Eq. 70 holds with probability one by applying Theorem 5. For any $\delta>1$ , by Markov’s inequality and independence,

[TABLE]

Since $\mathbb{E}\left[\gamma_{1}\right]^{1-\frac{1}{\delta}}<1$ , the Borel-Cantelli lemma implies that the probability that the product of $\gamma_{j}$ is eventually less than $\mathbb{E}\left[\gamma_{1}\right]^{k/\delta}$ is one. ∎
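One way to spell out the Markov and Borel-Cantelli step is sketched below. It assumes $\mathbb{E}\left[\gamma_{1}\right]>0$ (otherwise the product is zero almost surely) and indexes the product by $\ell$; it is a worked version of the standard argument and not necessarily the exact display used in the proof.

```latex
\mathbb{P}\left[\prod_{j=1}^{\ell}\gamma_{j}\geq\mathbb{E}\left[\gamma_{1}\right]^{\ell/\delta}\right]
\leq\frac{\mathbb{E}\left[\prod_{j=1}^{\ell}\gamma_{j}\right]}{\mathbb{E}\left[\gamma_{1}\right]^{\ell/\delta}}
=\frac{\mathbb{E}\left[\gamma_{1}\right]^{\ell}}{\mathbb{E}\left[\gamma_{1}\right]^{\ell/\delta}}
=\left(\mathbb{E}\left[\gamma_{1}\right]^{1-\frac{1}{\delta}}\right)^{\ell},
\qquad
\sum_{\ell=1}^{\infty}\left(\mathbb{E}\left[\gamma_{1}\right]^{1-\frac{1}{\delta}}\right)^{\ell}<\infty .
```

The first inequality is Markov's inequality, the first equality uses the independence and identical distribution of $\gamma_{j}$, and the summability of the resulting geometric series is what the Borel-Cantelli lemma requires.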

Here, we again take a moment to compare this result to the results of Richtárik and Takác (2020). Namely, we are interested in how the rate of convergence of Proposition 3 compares with the rate of convergence result in Richtárik and Takác (2020). To make this comparison, we numerically estimate the theoretical rates of convergence proposed by our result and the result of Richtárik and Takác (2020) on five matrices from the MatrixDepot (as described in Section 5). We show these comparisons in Table 2. As expected, the results of Richtárik and Takác (2020), which are specialized to the i.i.d. case and apply on average, are much tighter than our general results, which apply to more than just the i.i.d. case and hold with probability one.

4.5 Adaptive Sampling Schemes

To bookend this section, we discuss how our results can be applied to a broad set of adaptive methods that make use of the residual information at a given iterate whether deterministically (e.g., Motzkin and Schoenberg (1954); Gubin et al. (1967); Lent (1976); Censor (1981)) or randomly (e.g., Nutini et al. (2016); Bai and Wu (2018); Haddock and Ma (2019)). In Subsection 4.5.1, we will begin with some formalism to establish a general class of adaptive methods, and we then prove convergence and a rate of convergence for such methods. In Subsection 4.5.2, we provide concrete examples.

4.5.1 A General Class and Analysis of Adaptive Methods

To be rigorous, let $x_{0}\in\mathbb{R}^{d}$ and let $\varphi:(A,b,\{x_{j}:j\leq k\})\mapsto w_{k}$ be an adaptive procedure for generating $\{w_{k}\}$ according to the following procedure: for $k+1\in\mathbb{N}$ ,

[TABLE]

Remark 3.

While we will focus on the base methods of type Eq. 51, methods of the type Eq. 52 can be handled analogously.

While Eq. 99 is quite general, the vast majority of adaptive schemes make further restrictions that we abstract in the following definitions.

Definition 1 (Markovian).

For a fixed integer $\eta$ , an adaptive procedure, $\varphi$ , is $\eta$ -Markovian if the conditional distribution of $\varphi(A,b,\{x_{j}:j\leq k\})$ given $\{x_{j}:j\leq k\}$ is equal to the conditional distribution of $\varphi(A,b,\{x_{j}:j\leq k\})$ given $\{x_{j}:k-\eta<j\leq k\}$ . If a procedure is $1$ -Markovian, we will frequently call it Markovian.

A consequence of the $\eta$ -Markovian property is that we can write $\varphi(A,b,\{x_{j}:j\leq k\})$ as $\varphi(A,b,\{x_{j}:k-\eta<j\leq k\})$ . In the case of a $1$ -Markovian adaptive procedure, we will simply write $\varphi(A,b,x_{k})$ . The $1$ -Markovian property is readily satisfied for a number of common procedures analyzed in the literature (e.g., maximum residual, maximum distance, etc.), which may suggest that the $\eta$ -Markovian notion is irrelevant for general $\eta$ . We contend though, that procedures that are memory-sensitive may be more apt to make use of the $\eta$ -Markovian property for $\eta>1$ . For example, to demonstrate its potential value, consider a procedure that selects the equations with the top $\eta$ residuals, pulls them into memory, and simply cycles through them deterministically or randomly. Then this simple procedure would be $\eta$ -Markovian. However, owing to the lack of such procedures in the literature, we will focus on the $1$ -Markovian case for which we can write $\varphi(A,b,x)$ , and note that the results and definitions are readily extendable.

The next definition establishes another key property of these adaptive schemes that rely on residuals.

Definition 2 (Magnitude Invariance).

Let $H$ represent the set of solutions to $Ax=b$ , and let $P_{H}:\mathbb{R}^{d}\to H$ represent the projection of a vector onto $H$ . [Footnote 9: Since $H$ is a flat, $P_{H}$ is not guaranteed to be a linear operator.] An adaptive procedure, $\varphi$ , is magnitude invariant if, for any $x\not\in H$ and any $\lambda>0$ , the distribution of $\varphi(A,b,x)$ is equal to the distribution of

[TABLE]

The magnitude invariance of a number of adaptive methods often follows from the following simple calculation that we state as a lemma for future reference.

Lemma 5.

Let $x\in\mathbb{R}^{d}$ and let $v_{1},v_{2}\in\mathbb{R}^{n}$ . Then, for any $\lambda>0$ , if $|v_{1}^{\prime}(Ax-b)|\geq|v_{2}^{\prime}(Ax-b)|$ then

[TABLE]

If the hypothesis holds with a strict inequality, then so does the conclusion.

Proof.

Note, $AP_{H}(x)=b$ . Therefore, $A(P_{H}(x)+\lambda[x-P_{H}(x)])-b=\lambda(Ax-b)$ . From the hypothesis and $\lambda>0$ , $\lambda|v_{1}^{\prime}(Ax-b)|\geq\lambda|v_{2}^{\prime}(Ax-b)|$ . Also owing to $\lambda>0$ , we can replace the inequalities with strict inequalities. ∎
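The calculation in the proof can be checked numerically. The sketch below assumes a consistent system, computes $P_{H}(x)$ as $x-A^{+}(Ax-b)$ with the Moore-Penrose pseudoinverse, and uses a maximum-residual selection rule as an example of a procedure whose choice is unchanged by the rescaling. The helper names and the value of $\lambda$ are hypothetical.

```python
import numpy as np

def project_onto_solutions(A, b, x):
    """Euclidean projection of x onto H = {z : Az = b} for a consistent system."""
    return x - np.linalg.pinv(A) @ (A @ x - b)

def max_residual_index(A, b, x):
    """Index of the equation with the largest absolute residual at x."""
    return int(np.argmax(np.abs(A @ x - b)))

rng = np.random.default_rng(4)
n, d = 7, 3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)           # consistent system by construction

x = rng.standard_normal(d)
p = project_onto_solutions(A, b, x)
lam = 0.01
x_scaled = p + lam * (x - p)             # P_H(x) + lambda [x - P_H(x)]

residual = A @ x - b
residual_scaled = A @ x_scaled - b       # equals lam * residual up to rounding
```

Since the scaled residual is a positive multiple of the original one, the ordering of the absolute residuals, and hence the selected equation, is the same at `x` and `x_scaled`.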

Furthermore, the magnitude invariance property has hidden within it an additional feature: the projection of $x$ onto the null space is irrelevant (as we might expect for a procedure depending on the residual). As a result, we can, without losing generality, restrict our discussion to $x$ that are in the row space of $A$ , which has a unique intersection with $H$ at a point that we denote $x_{\operatorname{row}}^{*}$ . Furthermore, the magnitude invariance property allows us to focus specifically on the Euclidean unit sphere around $x_{\operatorname{row}}^{*}$ , which we denote by $\mathbb{S}(x_{\operatorname{row}}^{*})$ . This will be essential to the next definition.

The final definition ensures that if Eq. 99 makes too much progress along one particular subspace, then it must have a nonzero probability of exploring an orthogonal subspace relative to, roughly, the row space of $A$ . Before stating this definition, we need to be slightly careful here with using the row space of $A$ : if the rows of $A$ can be partitioned into two sets that are mutually orthogonal and $x_{0}$ is initialized in the span of one of these subsets, then we will never need to visit the other set and, consequently, we will never observe the entire row space of $A$ . To account for this, we can focus on the restricted row space,

[TABLE]

This definition may seem unnecessary as we can account for this (more generally) via $\mathcal{R}(w)$ by an appropriate choice of $w$ . However, in our previous statements, we defined $w$ before specifying $x_{0}$ . Here, we would need to know $x_{0}$ in order to define $w$ and, thus, $\mathcal{R}(w)$ appropriately. Fortunately, an examination of the preceding results shows that this ordering is not important and the results hold even if $w$ is defined given $x_{0}$ or even future iterates. With this explanation in hand, we can now state the final definition.

Definition 3 (Exploratory).

Let $x_{0}\in\mathbb{R}^{d}$ and define $\operatorname{rrow}(A)$ accordingly. An adaptive procedure, $\varphi$ , is exploratory if for any proper subspace $V\subsetneq\operatorname{rrow}(A)$ , there exists $\pi\in(0,1]$ such that

[TABLE]

Remark 4.

If magnitude invariance does not hold, then we could specify the exploratory property to hold for any point in $V$ that is distinct from $x_{\operatorname{row}}^{*}$. For this modified definition of the exploratory property, the results below would still hold. Why, then, should we keep the magnitude invariance property? The reason is practicality: the magnitude invariance property allows us to restrict the verification of the exploratory property to the unit sphere, and then we can apply it to any iterate regardless of its distance to the solution.

For a Markovian, magnitude invariant and exploratory adaptive scheme, $\varphi$ , we will need one assumption before stating the result.

Assumption 1.

Let $\mathcal{F}$ denote the set of matrices whose columns are normalized, maximal linearly independent subsets of

[TABLE]

where $x_{1},\ldots,x_{d}\in\mathbb{R}^{d}$ are arbitrary vectors. Suppose, for this choice of $\varphi$ ,

[TABLE]

Remark 5.

As we will see, Assumption 1 is sufficient for us to uniformly treat the many examples in the literature that are selecting equations or, more generally, are of the form in Lemma 3, rather than generating linear combinations of them. In the case of linear combinations, we could refine this assumption to account for the nature of the linear combinations as we do in Proposition 3.

Theorem 7.

Suppose $Ax=b$ admits a solution $x^{*}$ (not necessarily unique); let $H$ denote the set of all solutions, and let $P_{H}$ be the projection onto this flat. Let $x_{0}\in\mathbb{R}^{d}$ and let $\operatorname{rrow}(A)$ be defined as above (see Eq. 102). Moreover, let $\varphi$ be a $1$-Markovian, magnitude invariant and exploratory adaptive procedure satisfying Assumption 1 that generates $\{x_{k}\}$ and $\{w_{k}\}$ according to Eq. 99 and so that $\mathbb{P}\left[A^{\prime}w_{k}\in\operatorname{rrow}(A)\right]=1$ for all $k+1\in\mathbb{N}$. Then, there exists an increasing sequence of stopping times $\{\tau_{\ell}:\ell\in\mathbb{N}\}$ such that $\mathbb{P}\left[E_{1}\cup E_{2}\right]=1$, where:

1. $E_{1}$ is the event of iterates that terminate finitely to a solution of $Ax=b$; that is,

[TABLE]

2. $E_{2}$ is the event of iterates that infinitely converge to a solution of $Ax=b$; that is,

[TABLE]

Moreover, on $E_{1}$ , $\tau_{\ell}$ has finite expectation for $\ell$ such that $x_{\tau_{\ell}+1}\in H$ . Similarly, on $E_{2}$ , $\tau_{\ell}$ has finite expectation for all $\ell$ .

Proof.

Without loss of generality, we will assume $x_{0}\in\operatorname{row}(A)$. We will consider the nontrivial case where $x_{0}\neq x_{\operatorname{row}}^{*}$. Note, by the construction of $\operatorname{rrow}(A)$, it must then hold that $x_{0}-x_{\operatorname{row}}^{*}\in\operatorname{rrow}(A)$. To prove the result, we will make three claims of the following rough nature and purpose, which we will make precise below.

1. Finite termination occurs at a point $x_{k+1}$ if and only if $A^{\prime}\varphi(A,b,x_{k})$ is parallel to $x_{k}-x_{\operatorname{row}}^{*}$. We will use this claim to specify the set $E_{1}$.

2. The first time that the span of the iterate errors, $\mathrm{span}[\{x_{k}-x_{\operatorname{row}}^{*}\}]$, fails to (non-trivially) increase in dimension, the corresponding $\{A^{\prime}w_{k}\}$ up to this iterate span this subspace. As a result, with an appropriate definition of $\mathcal{R}(w)$, we will apply Theorem 5 to prove a multiplicative decrease in the iterate errors by a factor of $\gamma$.

3. Finally, we show that the first time that the span of the iterate errors fails to (non-trivially) increase in dimension must be finite with probability one and have bounded expectation. By combining the first claim with this claim, we have the property specified by the event $E_{1}$. By combining this claim with the second claim, we have the property specified by the event $E_{2}$. By this claim alone, we have that $\mathbb{P}\left[E_{1}\cup E_{2}\right]=1$.

To establish our claims, we need some additional notation. Let $\xi$ be an arbitrary finite stopping time and define

[TABLE]

and $V_{k}^{0}=\mathrm{span}\left[x_{\xi+k}-x_{\operatorname{row}}^{*}\right]$ . Furthermore, define

[TABLE]

Note, $\nu$ corresponds to the first time that the span of the iterate errors, starting at $\xi$ , fails to non-trivially increase in dimension. It will often be more succinct to specify the non-trivial cases by an indicator variable given by

[TABLE]

By Eq. 99, we can readily replace $x_{\xi+k+1}\neq x_{\xi+k}$ in the definition of $\nu$ with $\chi_{\xi+k}=1$ . We now state and prove our claims precisely.

Claim 1: Suppose $x_{\xi}-x_{\operatorname{row}}^{*}\neq 0$ . We claim that $x_{\xi+1}=x_{\operatorname{row}}^{*}$ if and only if $A^{\prime}\varphi(A,b,x_{\xi})\in V_{0}\setminus\{0\}$ .

Note, this claim readily follows from

[TABLE]

which, in turn, follows from Eq. 99.

Claim 2: Suppose $\nu$ is finite and define $V_{\nu}$ . We claim that

[TABLE]

We first note that $A^{\prime}\varphi(A,b,x_{\xi+k})\chi_{\xi+k}\in V_{\nu}$ for any $k\in[0,\nu]$ by Eq. 99. Therefore, we see that the span of $\Phi=\{A^{\prime}\varphi(A,b,x_{\xi})\chi_{\xi},\ldots,A^{\prime}\varphi(A,b,x_{\xi+\nu})\chi_{\xi+\nu}\}$ is contained in $V_{\nu}$ . To show that $V_{\nu}$ is included in the span of $\Phi$ , note that, by the definition of $V_{\nu}$ and by Eq. 99,

[TABLE]

Moreover, the nonzero terms on the generating set on the right hand side of Eq. 113 must be linearly independent, as anything else would contradict the minimality of $\nu$ . We are left to show that $x_{\xi+\nu}-x_{\operatorname{row}}^{*}$ is in the span of $\Phi$ . To do this, we perform Gram-Schmidt on the generating set in Eq. 113 starting with $x_{\xi+\nu}-x_{\operatorname{row}}^{*}$ . Denote the remaining vectors in this set $\phi_{1},\ldots,\phi_{r-1}$ where $r=\dim(V_{\nu})$ . Then, by the definition of $\nu$ , $x_{\xi+\nu+1}-x_{\operatorname{row}}^{*}\in V_{\nu}$ . Therefore, there exist constants $c_{0},\ldots,c_{r-1}$ such that

[TABLE]

If $c_{0}\neq 1$ , we see that the claim follows. For a contradiction, suppose that $c_{0}=1$ . Then $A^{\prime}\varphi(A,b,x_{\xi+\nu})$ can be written as a linear combination of vectors that are orthogonal to $x_{\xi+\nu}-x_{\operatorname{row}}^{*}$ . This would imply then that $\chi_{\xi+\nu}=0$ , which contradicts the definition of $\nu$ . Hence, we see that the claim holds.

Claim 3: For any finite stopping time $\xi$ , $\nu$ is finite with probability one and has bounded expectation.

To show this, we define a sequence of stopping times. Define

[TABLE]

and

[TABLE]

By the definition of $\nu$ , $\nu$ can only take values in $\{\sum_{i=1}^{j}s_{i}:j\in\mathbb{N}\}$ . Moreover, at each $s_{j}$ , we must either observe $\{\dim(V_{\xi+s_{1}+\cdots+s_{j}+1})=\dim(V_{\xi+s_{1}+\cdots+s_{j}})+1\}$ or $\{\nu\leq\sum_{i=1}^{j}s_{i}\}$ . Hence, at most, we see that $\nu$ can only take values in $\{\sum_{i=1}^{j}s_{i}:j=1,\ldots,r\}$ where $r=\dim(\operatorname{rrow}(A))$ . Thus, if we show that each $s_{j}$ is finite and has bounded expectation, then $\nu$ must be finite and have bounded expectation. By the magnitude invariance, Markovian and exploratory properties, we conclude that

[TABLE]

Therefore, we see that $s_{j}$ is finite and has bounded expectation.

Conclusion: From these three claims we can now prove the result by induction.

Base Case

Define $\mathfrak{E}_{0}^{c}=\{x_{0}\neq x_{\operatorname{row}}^{*}\}$. On this event, we take $\xi=0$ and define $\tau_{1}$ to be the corresponding $\nu$. On $\mathfrak{E}_{0}^{c}$, $\tau_{1}$ is finite and has finite expectation by Claim 3. Then, we can define, as a subset of $\mathfrak{E}_{0}^{c}$,

[TABLE]

and $\mathfrak{E}_{1}^{c}$ to be its relative complement on $\mathfrak{E}_{0}^{c}$.

Note,

1. By Claim 1, $\mathfrak{E}_{1}$ is equivalent to the event $x_{\tau_{1}+1}=x_{\operatorname{row}}^{*}$ up to a measure zero set.

2. By Claim 2, Theorem 5 with $\mathcal{R}(w)=V_{\tau_{1}}$, and Assumption 1, $\mathfrak{E}_{1}^{c}$ is contained in the event on which

[TABLE]

up to a measure zero set.

Induction Hypothesis

Let $\ell\in\mathbb{N}$ . On the event $\mathfrak{E}_{\ell-1}^{c}$ , we let $\xi=\tau_{\ell-1}+1$ and, for the correspondingly defined $\nu$ , we can define $\tau_{\ell}=\tau_{\ell-1}+1+\nu$ . Furthermore, on $\mathfrak{E}_{\ell-1}^{c}$ , $\tau_{\ell}$ is finite and has finite expectation. We can define, as a subset of $\mathfrak{E}_{\ell-1}^{c}$ ,

[TABLE]

and $\mathfrak{E}_{\ell}^{c}$ to be its relative complement on $\mathfrak{E}_{\ell-1}^{c}$ .

Further,

1. $\mathfrak{E}_{\ell}$ is equivalent to the event $x_{\tau_{\ell}+1}=x_{\operatorname{row}}^{*}$ up to a measure zero set.

2. $\mathfrak{E}_{\ell}^{c}$ is contained in the event on which

[TABLE]

up to a measure zero set.

Generalization

On the event $\mathfrak{E}_{\ell}^{c}$ , we let $\xi=\tau_{\ell}+1$ and, for the correspondingly defined $\nu$ , we can define $\tau_{\ell+1}=\tau_{\ell}+1+\nu$ . On $\mathfrak{E}_{\ell}^{c}$ , $\tau_{\ell+1}$ is finite and has finite expectation by Claim 3. Then, we can define, as a subset of $\mathfrak{E}_{\ell}^{c}$ ,

[TABLE]

and $\mathfrak{E}_{\ell+1}^{c}$ to be its relative complement on $\mathfrak{E}_{\ell}^{c}$ .

1. By Claim 1, $\mathfrak{E}_{\ell+1}$ is equivalent to the event $x_{\tau_{\ell+1}+1}=x_{\operatorname{row}}^{*}$ up to a measure zero set.

2. By Claim 2, Theorem 5 with $\mathcal{R}(w)=V_{\tau_{\ell+1}}$, and Assumption 1, $\mathfrak{E}_{\ell+1}^{c}$ is contained in the event on which

[TABLE]

up to a measure zero set.

Therefore, by the induction claims,

[TABLE]

and

[TABLE]

and $\mathbb{P}\left[E_{1}\cup E_{2}\right]=1$ . ∎

4.5.2 Applying our General Theory to Specific Adaptive Schemes

To demonstrate the utility of Theorem 7, we show that a number of classical and recent methods satisfy Definitions 1, 2 and 3 and Assumption 1. In fact, we will show that a stronger version of Definition 3 holds for these methods, which allows us to explicitly upper bound the elements of $\{\mathbb{E}\left[\tau_{\ell}\right]:\ell\in\mathbb{N}\}$ (when they are defined).

Proposition 4.

Suppose $Ax=b$ admits a solution $x^{*}$. Let $x_{0}\in\mathbb{R}^{d}$ and let $\operatorname{rrow}(A)$ be defined as above (see Eq. 102). Suppose that we define $\{x_{k}\}$ and $\{w_{k}\}$ according to Eq. 99 for the following adaptive methods:

1. the maximum residual method (see Agmon, 1954, Section 4);

2. the maximum distance method (see Agmon, 1954, Section 3);

3. the Greedy Randomized Kaczmarz method (see Bai and Wu, 2018, Method 2);

4. the Sampling Kaczmarz-Motzkin method (see Haddock and Ma, 2019, Page 4).

Then, for each of the above methods, there exists a $\gamma\in[0,1)$ such that the conclusions of Theorem 7 hold. Moreover, there exists a constant $\kappa$ such that for any finite $\tau_{\ell}$ (as specified in Theorem 7), $\mathbb{E}\left[\tau_{\ell}\right]\leq\ell\kappa$ .

Remark 6.

Greedy Randomized Kaczmarz is an example of methods that deterministically determine a threshold over residuals; select the equations whose residuals surpass this threshold; and then randomly select from this set. The result applies to this more general class so long as the threshold satisfies the magnitude invariance property and the random selection does not assign zero probability to any equation in the set. Similarly, Sampling Kaczmarz-Motzkin is an example of methods that randomly determine a set of equations; and then deterministically select from this subset of equations based on the residual values. The result applies to this more general class as well, so long as the random subset does not assign zero probability to any equation that is not already satisfied.

Remark 7.

Our partial orthogonalization methods (see Algorithm 2) do not satisfy the $\eta$ -Markovian property, as the partial orthogonalizations have a dependence on every preceding iterate.

For each method, we show that it satisfies Definitions 1, 2 and 3 and Assumption 1. In fact, for each method, we will show that a stronger version of Definition 3 holds. We will start by establishing several general facts that will be useful in the discussion of each method.

Lemma 6.

Let $x_{0}\in\operatorname{row}(A)$ and define $\operatorname{rrow}(A)$ as in Eq. 102. Then,

[TABLE]

where $\mathbb{S}(0)$ is the Euclidean unit sphere around the zero vector.

Proof.

For each $v\in\operatorname{rrow}(A)\cap\mathbb{S}(0)$ , we see that

[TABLE]

else $v\perp\operatorname{rrow}(A)$ and $v\in\operatorname{rrow}(A)\cap\mathbb{S}(0)\subset\operatorname{rrow}(A)$ and we would have a contradiction since $v\neq 0$ . By continuity, we see that we can construct an open ball around each $v\in\operatorname{rrow}(A)\cap\mathbb{S}(0)$ , $D_{v}$ , such that

[TABLE]

for all $\tilde{v}\in D_{v}\cap\mathbb{S}(0)$ . Now, $\{D_{v}:v\in\operatorname{rrow}(A)\cap\mathbb{S}(0)\}$ is an open cover of $\operatorname{rrow}(A)\cap\mathbb{S}(0)$ , which is a compact space. Hence, there is a finite subcover given by $\{D_{v_{1}},\ldots,D_{v_{K}}\}$ . It follows that since each $v\in\operatorname{rrow}(A)\cap\mathbb{S}(0)$ belongs to one of the elements in the subcover, then

[TABLE]

Therefore $c>0$ . ∎

Lemma 7.

Let $x_{0}\in\operatorname{row}(A)$ and define $\operatorname{rrow}(A)$ as in Eq. 102. Let $\Phi=\left\{A_{i,\cdot}:A_{i,\cdot}\in\operatorname{rrow}(A)\right\}$ . Let $\mathcal{F}$ be the matrices whose columns are normalized, maximal linearly independent vectors from $\Phi$ . Then

[TABLE]

Proof.

There are only a finite number of matrices in $\mathcal{F}$ up to column permutations. Therefore, we can choose the $F\in\mathcal{F}$ that minimizes $\det(F^{\prime}F)$. By Hadamard’s inequality, $\det(F^{\prime}F)\in(0,1]$, which implies that $\gamma\in[0,1)$. ∎
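To make the role of Hadamard's inequality concrete, the following Python sketch computes $\det(F^{\prime}F)$ for a small matrix $F$ with normalized, linearly independent columns and confirms that the value lies in $(0,1]$. The vectors and the helper names (`normalize`, `gram`, `det`) are hypothetical and chosen only for illustration; the sketch does not compute $\gamma$ itself.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(v_i * v_i for v_i in v))
    return [v_i / norm for v_i in v]

def gram(columns):
    """Gram matrix F'F, where `columns` lists the columns of F."""
    return [[sum(u_i * v_i for u_i, v_i in zip(u, v)) for v in columns]
            for u in columns]

def det(M):
    """Determinant via Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n = len(M)
    d = 1.0
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            return 0.0
        if p != k:
            M[k], M[p] = M[p], M[k]
            d = -d
        d *= M[k][k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n):
                M[i][j] -= f * M[k][j]
    return d

# Hypothetical linearly independent vectors, normalized to form the columns of F.
vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
F_columns = [normalize(v) for v in vectors]
gram_det = det(gram(F_columns))              # equals 1/6 for this choice

# Orthonormal columns attain the upper bound of Hadamard's inequality.
orthonormal_det = det(gram([[1.0, 0.0], [0.0, 1.0]]))
```

The closer the columns are to being linearly dependent, the closer $\det(F^{\prime}F)$ is to zero; orthonormal columns give the value one.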

Maximum Residual Method.

In the maximum residual method, $\varphi(A,b,x)$ is the element of the standard basis of $\mathbb{R}^{n}$, $\{e_{1},\ldots,e_{n}\}$, that solves

[TABLE]

$1$ -Markovian: It follows from the definition of the maximum residual method that it only relies on the current iterate to evaluate $\varphi$ . Therefore, it is $1$ -Markovian.

Magnitude Invariance: By Lemma 5, it follows that $\varphi(A,b,x)$ is magnitude invariant.

Exploratory: Consider any $A_{i,\cdot}\perp V$ . Then, $0=A_{i,\cdot}^{\prime}(x-x_{\operatorname{row}}^{*})=A_{i,\cdot}^{\prime}x-b_{i}$ . Therefore, we have that the only equations whose residuals are non-zero are the ones such that $P_{V}A_{i,\cdot}\neq 0$ , and there is at least one such equation by Lemma 6. Therefore,

[TABLE]

That is, we satisfy the exploratory property in a stronger manner:

[TABLE]

With these three properties verified and by Lemma 7, the conditions of Theorem 7 are satisfied and the result holds. The only thing left to show is that $\mathbb{E}\left[\tau_{\ell}\right]$ are bounded by some $\ell\kappa$ . By the proof of Theorem 7, it is enough to bound the conditional expectations of $s_{j}$ in Eq. 117. Given that $\pi=1$ for all $V\subsetneq\operatorname{rrow}(A)$ ,

[TABLE]

Hence, $\nu\leq\dim(\operatorname{rrow}(A))$ . Thus, $\mathbb{E}\left[\tau_{\ell}\right]\leq\ell\dim(\operatorname{rrow}(A))$ . $\quad\blacksquare$
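For concreteness, the maximum residual selection rule can be sketched in Python as follows. The projection step in `kaczmarz_step` is the standard Kaczmarz update, which we assume matches the form of Eq. 99 when $w$ is a standard basis vector; the test system and function names are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def max_residual_index(A, b, x):
    """Index i maximizing |A_i x - b_i| (smallest index on ties)."""
    residuals = [dot(row, x) - b_i for row, b_i in zip(A, b)]
    return max(range(len(A)), key=lambda i: abs(residuals[i]))

def kaczmarz_step(A, b, x, i):
    """Project x onto {z : A_i z = b_i} (assumed form of the update)."""
    row = A[i]
    scale = (dot(row, x) - b[i]) / dot(row, row)
    return [x_j - scale * a_j for x_j, a_j in zip(x, row)]

# Hypothetical consistent system with solution x_star.
A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
x_star = [1.0, 2.0]
b = [dot(row, x_star) for row in A]

x = [0.0, 0.0]
for _ in range(50):
    i = max_residual_index(A, b, x)
    x = kaczmarz_step(A, b, x, i)
```

The selection depends only on the current iterate, which mirrors the $1$-Markovian property verified above.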

Maximum Distance Method.

In the maximum distance method, $\varphi(A,b,x)$ is the standard basis vector in $\mathbb{R}^{n}$ that solves

[TABLE]

$1$ -Markovian: It follows from the definition of the maximum distance method that it only relies on the current iterate to evaluate $\varphi$ . Therefore, it is $1$ -Markovian.

Magnitude Invariance: Note, Lemma 5 still holds if we were to divide by the norm squared of $A_{i,\cdot}$ . It follows that the maximum distance method is magnitude invariant.

Exploratory: Just as in the maximum residual method, if $A_{i,\cdot}$ is orthogonal to a subspace $V$, then $A_{i,\cdot}^{\prime}x-b_{i}=0$ for any $x\in V\cap\mathbb{S}(x_{\operatorname{row}}^{*})$. Moreover, by Lemma 6, there is at least one equation such that $A_{j,\cdot}^{\prime}x-b_{j}\neq 0$ for all $x\in V\cap\mathbb{S}(x_{\operatorname{row}}^{*})$. Hence, the maximum distance method satisfies a stronger version of the exploratory condition, namely,

[TABLE]

By the same argument as above, Theorem 7 follows. Similarly, $\mathbb{E}\left[\tau_{\ell}\right]\leq\ell\dim(\operatorname{rrow}(A))$ . $\quad\blacksquare$
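The difference between the maximum residual and maximum distance rules is only the normalization by the row norm. The Python sketch below, on a hypothetical system with badly scaled rows, shows that the two rules can select different equations from the same iterate.

```python
import math

def dot(u, v):
    """Euclidean inner product."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def max_residual_index(A, b, x):
    """Index maximizing the absolute residual |A_i x - b_i|."""
    return max(range(len(A)), key=lambda i: abs(dot(A[i], x) - b[i]))

def max_distance_index(A, b, x):
    """Index maximizing |A_i x - b_i| / ||A_i||, the distance from x to hyperplane i."""
    return max(
        range(len(A)),
        key=lambda i: abs(dot(A[i], x) - b[i]) / math.sqrt(dot(A[i], A[i])),
    )

# Hypothetical system: the first row is scaled by ten relative to the second.
A = [[10.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
x = [1.0, 5.0]

by_residual = max_residual_index(A, b, x)   # residuals are 10 and 5
by_distance = max_distance_index(A, b, x)   # distances are 1 and 5
```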

Greedy Randomized Kaczmarz.

In Bai and Wu (2018) (Method 2), a residual threshold is selected given by

[TABLE]

Then, from the set of equations whose residual surpasses this threshold (which is shown to contain at least the equation selected by the maximum distance method), an equation is selected with probability proportional to the equation’s squared residual.

$1$-Markovian: Given that the threshold relies only on the current iterate value and that the random selection criterion relies only on the current residual, it follows that the Greedy Randomized Kaczmarz method is $1$-Markovian.

Magnitude Invariance: Suppose $x\not\in H$ . For $\lambda>0$ , let $x(\lambda)=P_{H}(x)+\lambda(x-P_{H}(x))$ . Then, by Lemma 5,

[TABLE]

which implies that the threshold is magnitude invariant. Similarly, we can show that the selection probabilities are magnitude invariant (we look at the preceding calculation, but only for a nonempty subset of the equations).

Exploratory: Let $V\subsetneq\operatorname{rrow}(A)$ be a nontrivial subspace. Then for any $x\in\mathbb{S}(x_{\operatorname{row}}^{*})\cap V$, we saw that any equations for which $P_{V}A_{i,\cdot}=0$ have a zero residual. Therefore, the only equations with nonzero residuals are those that are not orthogonal to $V$. Since the threshold is bounded away from zero, only equations that are not orthogonal to $V$ can be in the subset. Therefore,

[TABLE]

By the same argument as above, Theorem 7 follows. Similarly, $\mathbb{E}\left[\tau_{\ell}\right]\leq\ell\dim(\operatorname{rrow}(A))$ . $\quad\blacksquare$
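A minimal Python sketch of the threshold-then-sample structure of Greedy Randomized Kaczmarz follows. The threshold formula in `grk_candidates` is written from our recollection of Bai and Wu (2018) and should be checked against that paper before use; the function names and the test system are hypothetical, and the sketch assumes that `x` is not already a solution.

```python
import random

def dot(u, v):
    """Euclidean inner product."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def grk_candidates(A, b, x):
    """Return the indices that pass the threshold and their selection probabilities."""
    r = [b_i - dot(row, x) for row, b_i in zip(A, b)]
    row_norms_sq = [dot(row, row) for row in A]
    res_norm_sq = dot(r, r)                 # assumed nonzero: x is not a solution
    frob_sq = sum(row_norms_sq)
    # Assumed threshold (recalled from Bai and Wu, 2018; verify before relying on it).
    eps = 0.5 * (
        max(r_i * r_i / n_i for r_i, n_i in zip(r, row_norms_sq)) / res_norm_sq
        + 1.0 / frob_sq
    )
    keep = [i for i in range(len(A))
            if r[i] ** 2 >= eps * res_norm_sq * row_norms_sq[i]]
    total = sum(r[i] ** 2 for i in keep)
    probs = [r[i] ** 2 / total for i in keep]
    return keep, probs

def grk_select(A, b, x, rng):
    """Sample one index from the candidates, proportionally to squared residual."""
    keep, probs = grk_candidates(A, b, x)
    return rng.choices(keep, weights=probs, k=1)[0]

# Hypothetical system: identity matrix with solution (3, 3, 1), iterate at the origin.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [3.0, 3.0, 1.0]
x = [0.0, 0.0, 0.0]

keep, probs = grk_candidates(A, b, x)
chosen = grk_select(A, b, x, random.Random(0))
```

In this example the third equation has a small residual and falls below the threshold, so the random selection is made between the first two equations only.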

Sampling Kaczmarz-Motzkin.

In Haddock and Ma (2019) (Page 4), a subset of equations is randomly selected, and then the equation with the maximum residual is selected from this subset.

$1$ -Markovian: The Sampling Kaczmarz-Motzkin method only relies on the current residual to sample. As a result, it is $1$ -Markovian.

Magnitude Invariance: The distribution of the initial subsetting is independent and identical at each iteration. Therefore, conditioned on a given subset, we choose the maximum residual. By Lemma 5, this last step is magnitude invariant. Moreover, since the random subsetting is independent and identical at each iteration, it too is magnitude invariant. Therefore, the entire procedure is magnitude invariant.

Exploratory: Let $V\subsetneq\operatorname{rrow}(A)$ be a nontrivial subspace. Then, for any $x\in\mathbb{S}(x_{\operatorname{row}}^{*})\cap V$ , we have shown that there exists a $j$ such that $A_{j,\cdot}^{\prime}x-b_{j}\neq 0$ . Therefore, so long as the probability of selecting this equation is nonzero, then we are guaranteed that there is some choice of $\varphi(A,b,x)$ such that

[TABLE]

Let $\pi$ be the smallest inclusion probability for any equation in the random subset. Then, it follows that

[TABLE]

For the Sampling Kaczmarz-Motzkin method, the minimum inclusion probability is at least $\psi/n$ , which corresponds to random sampling without replacement of subsets of size $\psi$ .

With these three properties verified and by Lemma 7, the conditions of Theorem 7 are satisfied and the result holds. The only thing left to show is that $\mathbb{E}\left[\tau_{\ell}\right]$ are bounded by some $\ell\kappa$. By the proof of Theorem 7, it is enough to bound the conditional expectations of $s_{j}$ in Eq. 117. Given that $\pi\geq\psi/n$ for all $V\subsetneq\operatorname{rrow}(A)$,

[TABLE]

Hence, $\mathbb{E}\left[\tau_{\ell}\right]\leq\ell n\dim(\operatorname{rrow}(A))/\psi$ . $\quad\blacksquare$
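The sample-then-maximize structure of Sampling Kaczmarz-Motzkin can be sketched in Python as follows, assuming that subsets of size $\psi$ are drawn uniformly without replacement. The function names and the test system are hypothetical; `expected_tau_bound` simply evaluates the bound stated above.

```python
import random

def dot(u, v):
    """Euclidean inner product."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def skm_select(A, b, x, psi, rng):
    """Draw psi row indices uniformly without replacement, then return the sampled
    index with the largest absolute residual together with the sample itself."""
    sample = rng.sample(range(len(A)), psi)
    best = max(sample, key=lambda i: abs(dot(A[i], x) - b[i]))
    return best, sample

def expected_tau_bound(ell, n, rank, psi):
    """Evaluate ell * n * dim(rrow(A)) / psi, where `rank` stands for dim(rrow(A))."""
    return ell * n * rank / psi

# Hypothetical consistent system with solution (1, 2), iterate at the origin.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
x_star = [1.0, 2.0]
b = [dot(row, x_star) for row in A]
x = [0.0, 0.0]

best, sample = skm_select(A, b, x, psi=2, rng=random.Random(0))
bound = expected_tau_bound(ell=2, n=6, rank=3, psi=2)
```

Larger values of `psi` raise the minimum inclusion probability and therefore tighten the bound, at the cost of evaluating more residuals per iteration.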

5 Numerical Experiments

Here, we present a variety of numerical experiments to study the practicality of our approach in a sequential computing environment. Specifically, we test forty-nine systems with five hundred equations and five hundred unknowns. The coefficients are generated from forty-nine built-in matrices found in the MatrixDepot package for the Julia programming language (Zhang and Higham, 2016). The solution to each system is then drawn from a standard multivariate normal distribution, and the constant vector is the product of the coefficient matrix and this solution. Then, using the generated coefficient matrix and the generated constant vector, we solve the systems by varying the sample-generation method (i.e., the generation of $w$ and $\{w_{l}\}$) and the solver. The samples are generated by one of four approaches: the count-sketch approach, the Gaussian approach, uniform sampling of the equations of the matrix with replacement, or uniform sampling of the equations of the matrix without replacement. The solver is either a base method, the complete method, an intermediate method with $m=5$, or an intermediate method with $m=10$. Finally, we initialize $x_{0}=0$.

We recorded the wall clock time to improve the initial residual norm by a factor of ten with an upper bound of three seconds (note, a single iteration of a base method requires approximately $10^{-6}$ seconds, which allows the base method on the order of $10^{6}$ iterations on a $500\times 500$ system). If the temporal upper bound is reached before a tenfold improvement in the initial residual norm is observed, the wall clock time is reported as $10^{99}$. Inherently, this metric results in a disadvantage for complete orthogonalization methods as such methods pay more for marginal improvements, but generate precise solutions with fewer iterations. However, with an eye towards solving much larger systems that require using a parallel or distributed environment, this metric of time-to-tenfold improvement is the appropriate choice as the complete method would not be appropriate in such environments owing to the high communication costs that would be incurred. For the count-sketch sampling method, the wall clock times are reported in Section 5. For the remaining sampling approaches, the wall clock times are reported in the appendix.
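The timing protocol can be illustrated with the following Python sketch. It is not the code used for the reported experiments, which were run in Julia: the $20\times 20$ diagonally dominant random matrix is a stand-in for a MatrixDepot matrix, the solver is a plain Kaczmarz-type base method with uniform sampling with replacement, and the names `time_to_tenfold` and `uniform_with_replacement` are hypothetical.

```python
import math
import random
import time

def dot(u, v):
    """Euclidean inner product."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def uniform_with_replacement(residual, rng):
    """Select an equation uniformly at random, ignoring the residual."""
    return rng.randrange(len(residual))

def time_to_tenfold(A, b, select, rng, time_limit=3.0, max_iters=1_000_000):
    """Run projection steps from x0 = 0 until ||Ax - b|| drops by a factor of ten.

    Returns the elapsed wall clock time in seconds, or 1e99 if the time or
    iteration budget is exhausted first.
    """
    x = [0.0] * len(A[0])
    target = 0.1 * math.sqrt(dot(b, b))      # the residual at x0 = 0 is -b
    start = time.perf_counter()
    for _ in range(max_iters):
        r = [dot(row, x) - b_i for row, b_i in zip(A, b)]
        if math.sqrt(dot(r, r)) <= target:
            return time.perf_counter() - start
        if time.perf_counter() - start > time_limit:
            break
        i = select(r, rng)
        scale = r[i] / dot(A[i], A[i])
        x = [x_j - scale * a_j for x_j, a_j in zip(x, A[i])]
    return 1e99

rng = random.Random(0)
n = 20
# Stand-in coefficient matrix (assumption): 5 * identity plus small Gaussian noise.
A = [[(5.0 if i == j else 0.0) + 0.1 * rng.gauss(0.0, 1.0) for j in range(n)]
     for i in range(n)]
x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]   # standard normal solution
b = [dot(row, x_true) for row in A]                # constant vector b = A x_true

elapsed = time_to_tenfold(A, b, uniform_with_replacement, rng)
```

Recomputing the full residual at every iteration keeps the sketch short but would dominate the cost on a $500\times 500$ system; a practical implementation would check the stopping criterion less frequently.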

Bibliography


1. Agmon [1954] Shmuel Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6:382–392, 1954.
2. Bai and Liu [2013] Zhong-Zhi Bai and Xin-Guo Liu. On the Meany inequality with applications to convergence analysis of several row-action iteration methods. Numerische Mathematik, 124(2):215–236, 2013.
3. Bai and Wu [2018] Zhong-Zhi Bai and Wen-Ting Wu. On greedy randomized Kaczmarz method for solving large sparse linear systems. SIAM Journal on Scientific Computing, 40(1):A592–A606, 2018.
4. Censor [1981] Yair Censor. Row-action methods for huge and sparse systems and their applications. SIAM Review, 23(4):444–466, 1981.
5. Chen and Powell [2012] Xuemei Chen and Alexander M Powell. Almost sure convergence of the Kaczmarz algorithm with random measurements. Journal of Fourier Analysis and Applications, 18(6):1195–1214, 2012.
6. Clarkson and Woodruff [2017] Kenneth L Clarkson and David P Woodruff. Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM), 63(6):1–45, 2017.
7. Cormode and Muthukrishnan [2005] Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
8. Dai and Schön [2015] Liang Dai and Thomas B Schön. On the exponential convergence of the Kaczmarz algorithm. IEEE Signal Processing Letters, 22(10):1571–1574, 2015.
