A twin error gauge for Kaczmarz's iterations

Bart S. van Lith; Per Christian Hansen; Michiel E. Hochstenbach

arXiv:1906.07470·math.NA·January 29, 2021

A twin error gauge for Kaczmarz's iterations

Bart S. van Lith, Per Christian Hansen, Michiel E. Hochstenbach

PDF

TL;DR

This paper introduces two novel algebraic reconstruction methods based on Kaczmarz's iterations that utilize a twin error gauge to improve regularized solutions in noisy tomography, outperforming standard stopping rules.

Contribution

The paper presents a new error gauge using paired up- and down-sweep Kaczmarz methods, enabling better stopping criteria and improved reconstruction accuracy.

Findings

01

Lower two-norm error in most test cases compared to standard methods.

02

Computational cost is slightly less than standard Kaczmarz with statistical stopping.

03

Effective regularization in noisy tomography problems.

Abstract

We propose two new algebraic reconstruction techniques based on Kaczmarz's method that produce a regularized solution to noisy tomography problems. Tomography problems exhibit semi-convergence when iterative methods are employed, and the aim is therefore to stop near the semi-convergence point. Our approach is based on an error gauge that is constructed by pairing standard down-sweep Kaczmarz's method with its up-sweep version; we stop the iterations when this error gauge is minimal. The reconstructions of the new methods differ from standard Kaczmarz iterates in that our final result is the average of the stopped up- and down-sweeps. Even when Kaczmarz's method is supplied with an oracle that provides the exact error -- and is thereby able to stop at the best possible iterate -- our methods have a lower two-norm error in the vast majority of our test cases. In terms of computational…

Tables2

Table 1. Table 1: Work load of the various methods for a 2D problem; for a 3D problem replace n 𝑛 \sqrt{n} with n 3 3 𝑛 \sqrt[3]{n} . A work unit is defined as the work in a single Kaczmarz sweep.

Method	Operations	Work units
Standard Kaczmarz	$4 m \sqrt{n}$	1
Idem + trace-estimate stopping rules	$10 m \sqrt{n} + 2 n + 3 m$	$\frac{5}{2} + \frac{1}{2} \frac{\sqrt{n}}{m} + \frac{3}{4} \frac{1}{\sqrt{n}}$
Twin Method	$8 m \sqrt{n} + 3 n$	$2 + \frac{3}{4} \frac{\sqrt{n}}{m}$
Mutual-Step Method	$8 m \sqrt{n} + 11 n$	$2 + \frac{11}{4} \frac{\sqrt{n}}{m}$

Table 2. Table 2: Average relative errors, work units and score of 100 instances for each of seven phantoms. TA is the Twin Algorithm, MSA is the Mutual-Step Algorithm and KO is the Kaczmarz+Oracle algorithm.

	Relative errors			Work units			Score
Phantom	TA	MSA	KO	TA	MSA	KO	TA	MSA	KO
shepplogan	0.166	0.175	0.169	36.6	16.0	20.7	74.5	14.0	61.5
smooth	0.194	0.105	0.163	28.9	17.1	17.4	12.5	99.0	38.5
binary	0.215	0.222	0.209	26.5	15.6	23.7	54.5	12.0	83.5
threephases	0.147	0.140	0.156	35.1	15.8	10.9	47.0	88.5	14.5
threephasessmooth	0.132	0.110	0.142	34.0	17.2	10.4	32.0	100	18.0
fourphases	0.190	0.202	0.193	38.0	16.2	20.3	81.5	7.5	61.0
grains	0.134	0.092	0.149	40.1	16.4	15.3	41.5	100	8.5
Average	0.168	0.149	0.169	34.2	16.3	17.0	49.1	60.1	40.8

Equations112

A x = b, A \in R^{m \times n}, b \in R^{m}, x \in R^{n} .

A x = b, A \in R^{m \times n}, b \in R^{m}, x \in R^{n} .

x_{k}^{(0)}

x_{k}^{(0)}

x_{k}^{(i)}

x_{k}

x_{k + 1} = x_{k} + A^{T} L^{- 1} (b - A x_{k}), L = slt (A A^{T}) + \frac{1}{ω} D,

x_{k + 1} = x_{k} + A^{T} L^{- 1} (b - A x_{k}), L = slt (A A^{T}) + \frac{1}{ω} D,

A^{T} L^{- 1} A x = A^{T} L^{- 1} b,

A^{T} L^{- 1} A x = A^{T} L^{- 1} b,

(A^{T} L^{- 1} A)^{†} = (A^{T} L^{- 1} A ∣_{V})^{- 1} P_{V},

(A^{T} L^{- 1} A)^{†} = (A^{T} L^{- 1} A ∣_{V})^{- 1} P_{V},

x := (A^{T} L^{- 1} A ∣_{V})^{- 1} A^{T} L^{- 1} b .

x := (A^{T} L^{- 1} A ∣_{V})^{- 1} A^{T} L^{- 1} b .

x_{k + 1} = G x_{k} + A^{T} L^{- 1} b, G := I - A^{T} L^{- 1} A,

x_{k + 1} = G x_{k} + A^{T} L^{- 1} b, G := I - A^{T} L^{- 1} A,

x_{k + 1} - x = G x_{k} - x + A^{T} L^{- 1} A x = G (x_{k} - x),

x_{k + 1} - x = G x_{k} - x + A^{T} L^{- 1} A x = G (x_{k} - x),

e_{k} = x_{k} - x^{⋆} = noise error x_{k} - x_{k} + iteration error x_{k} - x^{⋆} .

e_{k} = x_{k} - x^{⋆} = noise error x_{k} - x_{k} + iteration error x_{k} - x^{⋆} .

G = G^{T} .

G = G^{T} .

D = diag (J A A^{T} J) = J diag (A A^{T}) J,

D = diag (J A A^{T} J) = J diag (A A^{T}) J,

slt (J A A^{T} J) = J sut (A A^{T}) J = J slt (A A^{T})^{T} J .

slt (J A A^{T} J) = J sut (A A^{T}) J = J slt (A A^{T})^{T} J .

\widetilde{\boldsymbol{L}}=\boldsymbol{J}\big{(}\textsf{slt}(\boldsymbol{A}\boldsymbol{A}^{T})^{T}+\tfrac{1}{\omega}\boldsymbol{D}\big{)}\boldsymbol{J}=\boldsymbol{J}\boldsymbol{L}^{T}\boldsymbol{J},

\widetilde{\boldsymbol{L}}=\boldsymbol{J}\big{(}\textsf{slt}(\boldsymbol{A}\boldsymbol{A}^{T})^{T}+\tfrac{1}{\omega}\boldsymbol{D}\big{)}\boldsymbol{J}=\boldsymbol{J}\boldsymbol{L}^{T}\boldsymbol{J},

\widetilde{\boldsymbol{G}}:=\boldsymbol{I}-(\boldsymbol{J}\boldsymbol{A})^{T}\big{(}\boldsymbol{J}\boldsymbol{L}^{-T}\boldsymbol{J}\big{)}(\boldsymbol{J}\boldsymbol{A})=\boldsymbol{I}-\boldsymbol{A}^{T}\boldsymbol{L}^{-T}\boldsymbol{A}.

\widetilde{\boldsymbol{G}}:=\boldsymbol{I}-(\boldsymbol{J}\boldsymbol{A})^{T}\big{(}\boldsymbol{J}\boldsymbol{L}^{-T}\boldsymbol{J}\big{)}(\boldsymbol{J}\boldsymbol{A})=\boldsymbol{I}-\boldsymbol{A}^{T}\boldsymbol{L}^{-T}\boldsymbol{A}.

p_{k} = b^{⋆} - A x_{k} .

p_{k} = b^{⋆} - A x_{k} .

E (ξ_{0}^{T} ξ_{k}) = m - tr (A A_{k}^{#}) .

E (ξ_{0}^{T} ξ_{k}) = m - tr (A A_{k}^{#}) .

∥ b - A x_{k} ∥^{2} + 2 σ^{2} tr (A A_{k}^{#}) - σ^{2} m .

∥ b - A x_{k} ∥^{2} + 2 σ^{2} tr (A A_{k}^{#}) - σ^{2} m .

\frac{\|\boldsymbol{b}-\boldsymbol{A}\,\boldsymbol{x}_{k}\|^{2}}{\bigl{(}m-\mathrm{tr}(\boldsymbol{A}\boldsymbol{A}^{\#}_{k})\bigr{)}^{2}}.

\frac{\|\boldsymbol{b}-\boldsymbol{A}\,\boldsymbol{x}_{k}\|^{2}}{\bigl{(}m-\mathrm{tr}(\boldsymbol{A}\boldsymbol{A}^{\#}_{k})\bigr{)}^{2}}.

\|\boldsymbol{b}-\boldsymbol{A}\,\boldsymbol{x}_{k}\|^{2}\leq\sigma^{2}\,\bigl{(}m-\mathrm{tr}(\boldsymbol{A}\boldsymbol{A}^{\#}_{k})\bigr{)}\ .

\|\boldsymbol{b}-\boldsymbol{A}\,\boldsymbol{x}_{k}\|^{2}\leq\sigma^{2}\,\bigl{(}m-\mathrm{tr}(\boldsymbol{A}\boldsymbol{A}^{\#}_{k})\bigr{)}\ .

∥ x_{k} - x_{k} ∥^{2}

∥ x_{k} - x_{k} ∥^{2}

= ∥ e_{k} - e_{k} ∥^{2}

= ∥ e_{k} ∥^{2} + ∥ e_{k} ∥^{2} - 2 cos (φ_{k}) ∥ e_{k} ∥ ∥ e_{k} ∥,

κ (λ) = ∣ v^{H} v ∣^{- 1},

κ (λ) = ∣ v^{H} v ∣^{- 1},

∣ λ_{n} ∣ \leq \dots \leq ∣ λ_{2} ∣ < ∣ λ_{1} ∣ < 1.

∣ λ_{n} ∣ \leq \dots \leq ∣ λ_{2} ∣ < ∣ λ_{1} ∣ < 1.

\frac{∥ x _{k} - x _{k} ∥}{∥ e _{k} ∥}

\frac{∥ x _{k} - x _{k} ∥}{∥ e _{k} ∥}

\frac{∥ x _{k} - x _{k} ∥}{∥ e _{k} ∥}

e_{k + 1} = G e_{k} and e_{k + 1} = G^{T} e_{k} .

e_{k + 1} = G e_{k} and e_{k + 1} = G^{T} e_{k} .

\boldsymbol{e}_{k}=\lambda_{1}^{k}\,\gamma_{1}\,\boldsymbol{v}_{1}+\sum_{j=2}^{n}\lambda_{j}^{k}\,\gamma_{j}\,\boldsymbol{v}_{j}=\lambda_{1}^{k}\,\Big{(}\gamma_{1}\boldsymbol{v}_{1}+\sum_{j=2}^{n}\left(\frac{\lambda_{j}}{\lambda_{1}}\right)^{k}\gamma_{j}\boldsymbol{v}_{j}\Big{)}.

\boldsymbol{e}_{k}=\lambda_{1}^{k}\,\gamma_{1}\,\boldsymbol{v}_{1}+\sum_{j=2}^{n}\lambda_{j}^{k}\,\gamma_{j}\,\boldsymbol{v}_{j}=\lambda_{1}^{k}\,\Big{(}\gamma_{1}\boldsymbol{v}_{1}+\sum_{j=2}^{n}\left(\frac{\lambda_{j}}{\lambda_{1}}\right)^{k}\gamma_{j}\boldsymbol{v}_{j}\Big{)}.

\|\boldsymbol{e}_{k}\|=|\lambda_{1}|^{k}\,\big{(}|\gamma_{1}|+\mathcal{O}\big{(}\big{|}\tfrac{\lambda_{2}}{\lambda_{1}}\big{|}^{k}\big{)}\big{)}.

\|\boldsymbol{e}_{k}\|=|\lambda_{1}|^{k}\,\big{(}|\gamma_{1}|+\mathcal{O}\big{(}\big{|}\tfrac{\lambda_{2}}{\lambda_{1}}\big{|}^{k}\big{)}\big{)}.

\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|=\|\boldsymbol{e}_{k}-\widetilde{\boldsymbol{e}}_{k}\|=|\lambda_{1}|^{k}\,\left(\|\gamma_{1}\boldsymbol{v}_{1}-\widetilde{\gamma_{1}}\widetilde{\boldsymbol{v}}_{1}\|+\mathcal{O}\big{(}\big{|}\tfrac{\lambda_{2}}{\lambda_{1}}\big{|}^{k}\big{)}\right).

\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|=\|\boldsymbol{e}_{k}-\widetilde{\boldsymbol{e}}_{k}\|=|\lambda_{1}|^{k}\,\left(\|\gamma_{1}\boldsymbol{v}_{1}-\widetilde{\gamma_{1}}\widetilde{\boldsymbol{v}}_{1}\|+\mathcal{O}\big{(}\big{|}\tfrac{\lambda_{2}}{\lambda_{1}}\big{|}^{k}\big{)}\right).

\frac{∥ x _{k} - x _{k} ∥}{∥ e _{k} ∥}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

\newsiamremark

remarkRemark \newsiamthmassumptionsAssumption

\headersA twin error gauge for Kaczmarz’s iterationsVan Lith, Hansen, and Hochstenbach

A twin error gauge for Kaczmarz’s iterations††thanks: Version .

B. S. van Lith Department of Applied Mathematics and Computer Science, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark (). [email protected], [email protected]

P. C. Hansen22footnotemark: 2

M. E. Hochstenbach Department of Mathematics and Computer Science, TU Eindhoven, http://www.win.tue.nl/h̃ochsten/.

Abstract

We propose two new algebraic reconstruction techniques based on Kaczmarz’s method that produce a regularized solution to noisy tomography problems. Tomography problems exhibit semi-convergence when iterative methods are employed, and the aim is therefore to stop near the semi-convergence point. Our approach is based on an error gauge that is constructed by pairing standard down-sweep Kaczmarz’s method with its up-sweep version; we stop the iterations when this error gauge is minimal. The reconstructions of the new methods differ from standard Kaczmarz iterates in that our final result is the average of the stopped up- and down-sweeps. Even when Kaczmarz’s method is supplied with an oracle that provides the exact error–and is therefore able to stop at the best possible iterate – our methods have a lower two-norm error in the vast majority of our test cases. In terms of computational cost, our methods are a little cheaper than standard Kaczmarz equipped with a statistical stopping rule.

keywords:

Computed tomography, ART, Kaczmarz, stopping rules, error estimation, semi-convergence.

{AMS}

65F22, 65F10, 65R32, 65F15

1 Introduction

The image reconstruction problem in X-ray tomography can be formulated as a large, sparse linear system of equations, i.e.,

[TABLE]

Here, the vectors $\boldsymbol{b}$ and $\boldsymbol{x}$ represent the measured data (the sinogram) and the image to be reconstructed, respectively. The system matrix $\boldsymbol{A}$ represents a discretization of the forward problem [12]; there are no restrictions on its dimensions $m$ and $n$ . There is inherently some noise present in the data $\boldsymbol{b}$ , and one of the key challenges in tomographic reconstruction is to compute a good reconstruction in the presence of these errors.

Tomographic reconstruction problems are a type of inverse problems where the forward operator, in the continuous formulation, is a smoothing operation known as the Radon transform in the case of 2D parallel-beam scanning. The continuous problem is mildly ill-posed [2], which leads to a poorly conditioned matrix $\boldsymbol{A}$ , especially when the system is large.

The system (1) is usually too large to solve by factorization methods, and iterative linear solvers are used. These solvers exhibit semi-convergence [17] in the presence of noise, meaning that initially the reconstruction error decreases but eventually it increases. The error consists of two parts, the iteration error and the noise error. The iteration error decreases steadily, and in the case of error-free data the classical asymptotic convergence theory applies. Initially the noise error is small but it steadily increases until the iterative method has “inverted the noisy data” rather than the clean data; see, e.g., [6, 7] for more details.

To obtain meaningful solutions to noisy problems we need to stop the iterations at the semi-convergence point where the reconstruction error is at a minimum. The iteration number therefore acts as a regularization parameter. There are various methods that estimate a good optimal regularization parameter and these can be used as stopping rules for the iterations [1, 20].

Many of the parameter-choice/stopping rules are based on statistical properties of the noise, for instance when using generalized cross-validation [30] and unbiased predictive risk estimation [29, Sec. 7.1] and when using variations of the discrepancy principle [13, Sec. 7.2].

All these stopping rules may work well for simultaneous iterative reconstruction techniques such as Landweber or Cimmino [14]. The reason is that these methods tend to produce error histories that are very flat around the minimum. A few hundred iterations more or less often does not really matter, and the quality of the reconstruction is only little affected. This is not so for Kaczmarz’s method which tends to converge much faster [7] and thus has a fairly narrow window of opportunity around the minimum of the error history; eager readers see Figure 2 in Section 5.

We propose a completely different approach, one without any statistical assumptions on the noise, which is based on error estimation. Using several numerical examples, we show that our stopping rules and algorithms perform very well. To illustrate the point, we compare our methods with an oracle that provides the true error and is therefore able to stop at the best possible iterate.

In our analysis of the proposed methods, we show that our approach is theoretically sound for consistent systems. In the numerical examples, however, we only consider the noisy case.

The rest of the paper is organized as follows. Section 2 introduces the necessary background theory for Kaczmarz’s method. In Section 3 we analyze the errors and use this analysis to propose a new way to estimate the error. The insight then leads to a new algorithm that is presented in Section 4. Finally, in Section 5 we present numerical examples that illustrate our theory and compare our new algorithm with existing ones. Throughout this work, we will use the following common notations:

•

A column vector is denoted by a bold lower-case character, while a bold upper-case character is a matrix. Normal font is used for scalars.

•

We will use the two-norm and denote it by $\|\cdot\|$ .

•

The expectation operator is written as $\mathbb{E}$ .

•

Exact quantities are marked with a superscript $\star$ , while objects furnished with a tilde ( $\sim$ ) are related to an alternative (up-sweep) version of Kaczmarz’s methods, cf. Section 3.1.

•

The symbol $\leftarrow$ in pseudocode means assignment.

2 Background theory

Here we set the stage by summarizing some basic results pertaining to Kaczmarz’s method, also known as the algebraic reconstruction technique (ART) [11, 28].

2.1 Kaczmarz and its convergence

Let $k=1,2,3,\dots$ denote full sweeps through the rows of $\boldsymbol{A}$ , and let $\boldsymbol{x}_{0}$ denote the starting vector. Let $\omega\in(0,2)$ be a relaxation parameter, let $\boldsymbol{a}_{j}^{T}$ denote the $j$ th row of $\boldsymbol{A}$ , and let $b_{j}$ denote the $j$ th element of $\boldsymbol{b}$ . For every update $i$ , a row index $j$ needs to be chosen. There are various strategies for picking the row index $j$ , for instance randomized or cyclic [19]; here we consider only cyclic down-sweeps, $j=i$ , and up-sweeps, $j=m-i+1$ . Starting from the common choice $\boldsymbol{x}_{0}=\boldsymbol{0}$ , in the $k$ th sweep we sequentially perform the updates

[TABLE]

If there are rows with zero norm, it is most natural to skip them. This is effectively the same as deleting zero rows from $\boldsymbol{A}$ and the corresponding entries from $\boldsymbol{b}$ . When $\omega=1$ , Kaczmarz’s method has a nice geometrical interpretation: each update projects onto the hyperplane represented by the $j$ th equation.

For the down-sweep version, an entire sweep (2) of all equations in Kaczmarz’s method can be written as

[TABLE]

where $\textsf{slt}(\cdot)$ extracts the strictly lower triangular part and $\boldsymbol{D}=\mathrm{diag}(\boldsymbol{A}\boldsymbol{A}^{T})$ ; as proved for the first time by Elfving and Nikazad [8].

There are some more elaborate versions of Kaczmarz method available, for instance using block row partitioning or variable relaxation parameters [7, 10]. In this paper we propose two new methods where we exploit the sequential version (2) of Kaczmarz with a fixed relaxation parameter. We believe the techniques may be generalized to the mentioned Kaczmarz schemes, but this is outside the scope of this work. In contrast, we will see in Section 5.4 that it is not obvious to combine the proposed methods with a randomized Kaczmarz scheme.

From (3), Kaczmarz’s method can be seen as an attempt to solve the system

[TABLE]

which is always consistent, whether or not (1) is. Indeed, supposing $\boldsymbol{x}_{k+1}$ approaches $\boldsymbol{x}_{k}$ in the limit of large $k$ , we end up with $\boldsymbol{A}^{T}\boldsymbol{L}^{-1}(\boldsymbol{b}-\boldsymbol{A}\boldsymbol{x})=\boldsymbol{0}$ , which is equivalent to (4). We have not come across of the following two lemmas in the literature.

Lemma 2.1.

*The nullspaces of $\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{A}$ and its transpose are identical, and equal to the nullspace of $\boldsymbol{A}$ . *

Proof 2.2.

*While it is straightforward that $\mathcal{N}(\boldsymbol{A})\subseteq\mathcal{N}(\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{A})$ , it is also true that $\mathcal{N}(\boldsymbol{A})\supseteq\mathcal{N}(\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{A})$ by the following argument. In view of (3), $\boldsymbol{L}+\boldsymbol{L}^{T}$ is symmetric positive definite (SPD) for $\omega\in(0,2)$ , and therefore so is the congruence transform $\boldsymbol{L}^{-1}(\boldsymbol{L}+\boldsymbol{L}^{T})\boldsymbol{L}^{-T}=\boldsymbol{L}^{-1}+\boldsymbol{L}^{-T}$ ; it follows from Sylvester’s law of inertia that all eigenvalues remain positive under such transformations. Let $\boldsymbol{x}\neq\boldsymbol{0}$ be such that $\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{A}\boldsymbol{x}=\boldsymbol{0}$ . Left-multiplication by $\boldsymbol{x}^{T}$ and using the fact that the symmetric part of $\boldsymbol{L}^{-1}$ is SPD shows that $\boldsymbol{A}\boldsymbol{x}=\boldsymbol{0}$ . The statement for the transpose follows from an identical reasoning with $\boldsymbol{L}^{T}$ taking the role of $\boldsymbol{L}$ . *

Lemma 2.3.

*If (1) is consistent, (4) is equivalent to (1). *

Proof 2.4.

*It is obvious that (1) implies (4), the opposite follows the fact that there exists some $\boldsymbol{y}$ such that $\boldsymbol{A}\boldsymbol{y}=\boldsymbol{b}$ . From (4) we find $\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{y})=\boldsymbol{0}$ . Applying Lemma 2.1 shows that $\boldsymbol{A}\boldsymbol{x}=\boldsymbol{A}\boldsymbol{y}=\boldsymbol{b}$ . *

Let $\mathcal{V}$ be the row space of $\boldsymbol{A}$ , which is the orthogonal complement of the nullspace of $\boldsymbol{A}$ . Lemma 2.1 asserts that $\mathcal{V}^{\perp}$ is also the nullspace $\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{A}$ , so that $\mathcal{V}$ is the row space, while any linear operator is nonsingular when the kernel is removed. The restriction is related to the pseudoinverse in the following way

[TABLE]

where $P_{\mathcal{V}}$ denotes the orthogonal projection onto $\mathcal{V}$ . The pseudoinverse is defined for any vector, while the domain of the restriction is only $\mathcal{V}$ . We will use the restriction for clarity and preciseness. Kaczmarz’s method is a row-action scheme: all iterates as well their limit are constructed as a linear combination of the rows of $\boldsymbol{A}$ , i.e., they are in $\mathcal{V}$ . This means that Kaczmarz’s method provides the minimum-norm solution subject to the constraint that (4) holds. This solution can be expressed in terms of the inverse of the restricted operator:

[TABLE]

Since $\boldsymbol{\Pi}:=\boldsymbol{A}(\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{A}|_{\mathcal{V}})^{-1}\boldsymbol{A}^{T}\boldsymbol{L}^{-1}$ is an oblique projection onto the span of $\boldsymbol{A}$ , we observe that $\boldsymbol{A}\boldsymbol{x}=\boldsymbol{\Pi}\boldsymbol{b}$ , the image of $\boldsymbol{b}$ under this oblique projection. In the case of a consistent system (1), $\boldsymbol{b}$ is in the span of $\boldsymbol{A}$ and $\boldsymbol{\Pi}\boldsymbol{b}=\boldsymbol{b}$ , reaffirming Lemma 2.3 from a different viewpoint. We note that $\boldsymbol{x}$ can also be seen as the orthogonal projection onto the row space of $\boldsymbol{A}$ of an arbitrary $\boldsymbol{y}$ satisfying $\boldsymbol{A}\boldsymbol{y}=\boldsymbol{\Pi}\boldsymbol{b}$ .

Using (3) we can interpret Kaczmarz’s method as a fixed-point method, since we have

[TABLE]

where $\boldsymbol{I}$ is the identity matrix. Note that the iteration matrix $\boldsymbol{G}$ is independent of the row scaling of $\boldsymbol{A}$ ; if we premultiply $\boldsymbol{A}$ by any diagonal matrix, $\boldsymbol{G}$ is unaltered. Moreover, we point out that $\boldsymbol{G}$ is the identity on $\mathcal{V}^{\perp}$ , the nullspace of $\boldsymbol{A}$ . For any initial guess $\boldsymbol{x}_{0}\in\mathcal{V}$ , we can without loss of generality restrict $\boldsymbol{G}$ to $\mathcal{V}$ . In particular, our initial guess is always $\boldsymbol{x}_{0}=\boldsymbol{0}$ , so that we can always restrict $\boldsymbol{G}$ to $\mathcal{V}$ with impunity. We emphasize that $\boldsymbol{G}$ depends on the relaxation parameter $\omega$ via $\boldsymbol{L}$ ; cf. (3). Using (4), we find

[TABLE]

where $\boldsymbol{x}$ is given by (5). For fixed-point methods, we need the following well-established result, which we adapt from [22, Thm. 4.1].

Lemma 2.5.

*Kaczmarz’s method with $\boldsymbol{x}_{0}\in\mathcal{V}$ (so particularly for our choice $\boldsymbol{x}_{0}=\boldsymbol{0}$ ) converges to (5) if and only if the spectral radius of the restricted iteration matrix is strictly smaller than 1, i.e., $\rho(\boldsymbol{G}|_{\mathcal{V}})<1$ . *

Of course, the convergence of Kaczmarz’s method has been studied since its introduction; we recall the following well-known result from [8].

Lemma 2.6.

*Kaczmarz’s method is convergent for linear system (1) for any $\omega\in(0,2)$ . If $\boldsymbol{b}$ lies in the span of $\boldsymbol{A}$ , then the method converges to a solution of $\eqref{eq:Axb}$ . If in addition $\boldsymbol{x}_{0}\in\mathcal{V}$ , then the Kaczmarz converges to the minimal norm solution to (1). *

The two lemmas combined show that Kaczmarz’s method always converges to a solution of (4), and if $\boldsymbol{x}_{0}\in\mathcal{V}$ , then it provides the minimum-norm solution. From Lemma 2.3, we see that if (1) is consistent, (4) is equivalent to (1) and Kaczmarz provides a solution to (1).

Additionally, the two lemmas also combine to inform us on the eigenvalues of $\boldsymbol{G}$ . The following result, which follows from the fact that Kaczmarz’s method converges, is in the line of [26], but stated in terms of the restricted operator $\boldsymbol{G}$ .

Theorem 2.7.

*The restricted iteration matrix of Kaczmarz’s method $\boldsymbol{G}|_{\mathcal{V}}=(\boldsymbol{I}-\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{A})|_{\mathcal{V}}$ satisfies $\rho(\boldsymbol{G}|_{\mathcal{V}})<1$ for any $0<\omega<2$ and any $\boldsymbol{A}$ . *

Aside from any consistency of (1), we also have to deal with noise, which is a slightly more subtle concept. Noise in the right-hand side $\boldsymbol{b}$ will generally have components both inside and outside of the span of $\boldsymbol{A}$ . The reconstruction $\boldsymbol{x}$ may be greatly affected by the noise, but, of course, only by the noise component in the span of $\boldsymbol{A}$ . We now introduce several concepts to analyze this situation. Denote the noise-free image by $\boldsymbol{x}_{\mathrm{nf}}$ , and define $\boldsymbol{b}^{\star}:=\boldsymbol{A}\boldsymbol{x}_{\mathrm{nf}}$ . If we would execute the Kaczmarz process until convergence using $\boldsymbol{b}^{\star}$ as right-hand side, we would obtain a solution $\boldsymbol{x}^{\star}$ satisfying $\boldsymbol{x}^{\star}=(\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{A}|_{\mathcal{V}})^{-1}\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{b}^{\star}$ . We refer to $\boldsymbol{x}^{\star}$ as the noise-free solution. As $\boldsymbol{b}^{\star}$ is in the span of $\boldsymbol{A}$ , we conclude from Lemma 2.3 that $\boldsymbol{A}\boldsymbol{x}^{\star}=\boldsymbol{b}^{\star}$ , and that $\boldsymbol{x}^{\star}$ is the orthogonal projection of $\boldsymbol{x}_{\mathrm{nf}}$ onto the row space of $\boldsymbol{A}$ . Given a measured data vector $\boldsymbol{b}$ , the noise vector is defined by $\delta\boldsymbol{b}:=\boldsymbol{b}-\boldsymbol{b}^{\star}$ . In our numerical experiments in Section 5, we simulate $\delta\boldsymbol{b}$ as a Gaußian random vector with zero mean and standard deviation $\sigma$ . Unlike the stopping rules we review in Section 2.3, which require normally distributed noise to work, the methods we propose do not need any assumptions on the statistical nature of the noise.

We also make a corresponding splitting in the reconstruction iterates by introducing $\widehat{\boldsymbol{x}}_{k}$ as the iterates of Kaczmarz’s method with right-hand side $\boldsymbol{b}^{\star}$ . The iteration error is then defined as $\widehat{\boldsymbol{x}}_{k}-\boldsymbol{x}^{\star}$ , while the noise error is defined as $\boldsymbol{x}_{k}-\widehat{\boldsymbol{x}}_{k}$ . Obviously, the error with respect to the noise-free solution is the sum of the iteration error and the noise error, i.e.,

[TABLE]

As we will see in Section 3.2.2, the iteration error diminishes while the noise error grows as the iteration progresses.

2.2 Up-sweep Kaczmarz

The up-sweep version of Kaczmarz’s method relates in an appealing way to the down-sweep method. The following result is implicit in Elfving and Nikazad [8]; we merely provide an explicit demonstration.

Proposition 2.8.

The up-sweep iteration matrix $\widetilde{\boldsymbol{G}}$ is related to the down-sweep iteration matrix $\boldsymbol{G}$ (3) by transposition, i.e.,

[TABLE]

Proof 2.9.

Let $\boldsymbol{J}$ denote the reverse identity, i.e., the permutation matrix that reverses the ordering. We investigate what happens when we apply the cyclical down-sweep Kaczmarz’s method to the system $\boldsymbol{J}\boldsymbol{A}\boldsymbol{x}=\boldsymbol{J}\boldsymbol{b}$ . Note first that $\boldsymbol{J}^{T}=\boldsymbol{J}=\boldsymbol{J}^{-1}$ , so that $\boldsymbol{J}\boldsymbol{A}\boldsymbol{A}^{T}\boldsymbol{J}^{T}=\boldsymbol{J}\boldsymbol{A}\boldsymbol{A}^{T}\boldsymbol{J}$ , where pre- and post-multiplying $\boldsymbol{A}$ by $\boldsymbol{J}$ results in the flipping over of both the columns and the rows respectively. From this, we see that

[TABLE]

since flipping over both columns and rows of a diagonal matrix yields a diagonal matrix with the entries reversed. Here, $\widetilde{\boldsymbol{D}}$ is the up-sweep analogue of $\boldsymbol{D}$ . Next, it may be checked that the strictly lower triangular part of some matrix $\boldsymbol{J}\boldsymbol{M}\boldsymbol{J}$ is the strictly upper triangular part of $\boldsymbol{M}$ , flipped over both columns and rows. Hence, the relation is exactly $\textsf{slt}(\boldsymbol{J}\boldsymbol{M}\boldsymbol{J})=\boldsymbol{J}\,\textsf{sut}(\boldsymbol{M})\,\boldsymbol{J}$ , where sut takes the strictly upper triangular part. Applying this to the symmetric matrix $\boldsymbol{A}\boldsymbol{A}^{T}$ , we obtain

[TABLE]

Putting (a) and (b) together, we find that

[TABLE]

where $\widetilde{\boldsymbol{L}}$ is the up-sweep analogue of $\boldsymbol{L}$ . Thus, when we inspect the up-sweep iteration matrix $\widetilde{\boldsymbol{G}}$ , we find

[TABLE]

*We identify this last matrix as $\boldsymbol{G}^{T}$ , which completes the proof. *

Key Point 1.

*The iteration matrix of the up-sweep Kaczmarz method is the transpose of the down-sweep iteration matrix. Consequently, they have the same eigenvalues. *

2.3 Statistical stopping rules

To use the semi-convergence of Kaczmarz’s method for noisy data we need a stopping rule for terminating the iterations near the point of semi-convergence. Ideally we prefer a stopping rule based on an estimate of the reconstruction error $\|\boldsymbol{x}_{k}-\boldsymbol{x}^{\star}\|$ . Several such rules have been proposed for regularization methods that can be implicitly expressed as a filtered SVD expansion [13, Sec. 7.3]; they do not apply to Kaczmarz’s method, as Kaczmarz cannot be represented in this way. The stopping rule presented in this work does not have this limitation.

Instead of estimating the reconstruction error, several statistical stopping rules based on the prediction error have been proposed. These rules, which indeed apply to Kaczmarz’s method, seek to minimize the norm of the prediction error for the $k$ th iteration, defined as

[TABLE]

However, since this is unavailable, the methods work instead with the norm of the residual vector $\boldsymbol{b}-\boldsymbol{A}\,\boldsymbol{x}_{k}$ . One way to do so involves the trace of the influence matrix $\boldsymbol{A}\boldsymbol{A}^{\#}_{k}$ , where $\boldsymbol{x}_{k}=\boldsymbol{A}^{\#}_{k}\boldsymbol{b}$ and $\boldsymbol{A}^{\#}_{k}$ is the action of Kaczmarz’s method. The trace of this matrix features in many stopping rules. For iterative regularization methods the trace is often estimated by means of a Monte Carlo approach as proposed in [9] and [23]. This is done using a normally distributed random vector $\boldsymbol{\xi}_{0}$ , which is used as an initial guess for Kaczmarz’s method applied to the system $\boldsymbol{A}\boldsymbol{\xi}=\boldsymbol{0}$ . The inner product of $\boldsymbol{\xi}_{0}$ with the iterate $\boldsymbol{\xi}_{k}$ provides the trace estimate. Employing the quadratic form identity [21, Thm. 5.2a], it can be shown that

[TABLE]

A Monte Carlo method typically draws many samples to estimate a quantity. However, each sample would require another instance of Kaczmarz’s method, quickly becoming very expensive. In practice often only a single sample is drawn.

We can now summarize the three statistical stopping rules that we compare with in this work. In the unbiased predictive risk estimation (UPRE) method we find the $k$ that minimizes the expected prediction estimation error norm $\mathbb{E}(\|\boldsymbol{p}_{k}\|^{2})$ . This is done by minimizing the quantity

[TABLE]

The generalized cross validation (GCV) method also seeks to minimize the expected prediction error, and it does so without the need for the noise’s standard deviation $\sigma$ . Here we find the $k$ that minimizes

[TABLE]

The compensated discrepancy principle (CDP) was defined by Turchin [27] for Tikhonov regularization. The underlying idea is to determine the largest iteration number for which we cannot reject $\boldsymbol{x}_{k}$ – computed from the noisy data – as a possible solution to the noise-free system, cf. [27, p. 93]. Here we stop at the first iteration $k$ for which

[TABLE]

A derivation of UPRE is given in [29, Sec. 7.1] while summaries of GCV and CDP can be found in [13, Secs. 7.2 and 7.4]. The methods we reviewed here require the assumption that $\delta\boldsymbol{b}$ is white Gaußian noise, while the methods we propose do not require any assumptions on the statistical nature of the noise.

3 The Error Gauge and the Twin Method

Given two numerical methods designed to solve the same problem, a general approach to error estimation is to take the difference of the two numerical solutions $\boldsymbol{x}_{k}$ and $\widetilde{\boldsymbol{x}}_{k}$ . One way to reason about this is by adding and subtracting the noise-free solution $\boldsymbol{x}^{\star}$ .

[TABLE]

where $\boldsymbol{e}_{k}$ and $\widetilde{\boldsymbol{e}}_{k}$ are their respective errors, and $\varphi_{k}$ is the angle between the two errors $\boldsymbol{e}_{k}$ and $\widetilde{\boldsymbol{e}}_{k}$ . We will exploit the expression involving the cosine in Section 3.2.2.

3.1 Analysis of an error gauge: the noise-free consistent case

Here, we propose to employ (15) as an error gauge, which estimates the accuracy of the iterates. When the approximations are different, but converge to the same solution at the same rate, the difference between them will vanish at the same rate. Therefore this gives us a simple error gauge.

To obtain two different iterates $\boldsymbol{x}_{k}$ and $\widetilde{\boldsymbol{x}}_{k}$ of Kaczmarz’s method to be used in (15), we use down-sweeps and up-sweeps, respectively.

Key Point 2.

*For Kaczmarz’s method, the iteration matrix $\boldsymbol{G}$ is generally not symmetric, even for symmetric $\boldsymbol{A}$ . *

Before stating the results, we recall that the condition number of a simple eigenvalue $\lambda$ is given by

[TABLE]

where $\widetilde{\boldsymbol{v}}$ and $\boldsymbol{v}$ are the normalized left and right eigenvectors, respectively [25, Chap. 1, Sec. 3.2], and $\boldsymbol{\cdot}^{H}$ denotes the conjugate transpose. A simple eigenvalue $\lambda$ is called normal if $\kappa(\lambda)=1$ , which is the case if and only if $\boldsymbol{v}$ and $\widetilde{\boldsymbol{v}}$ coincide. In the following proposition we make two assumptions. First, we assume that $\boldsymbol{G}$ is diagonalizable, which means that it has an eigenvalue decomposition; second, that $\lambda_{1}$ is a simple and nonnormal eigenvalue. Note that these two assumptions are generic properties, holding almost always. 111A necessary condition for matrices to be nondiagonalizable is to have zero discriminant, defined by $\mathrm{dis}(\boldsymbol{G})=\prod_{i<j}(\lambda_{i}-\lambda_{j})^{2}$ [15, 2.4.P21]. The set of matrices with $\mathrm{dis}(\boldsymbol{G})=0$ constitutes a set of zero measure in the $n^{2}$ -dimensional space, meaning that the first assumption indeed is a generic property. For the second assumption, given $\boldsymbol{G}$ with its eigenpair $(\lambda,\boldsymbol{v})$ consider the Schur decomposition $\left[\!\!\begin{array}[]{cc}\lambda_{1}&\boldsymbol{y}^{*}\\ \boldsymbol{0}&\boldsymbol{M}\end{array}\!\!\right]$ , corresponding to any basis $[\boldsymbol{v}\ \,\boldsymbol{V}_{\perp}]$ , where $\boldsymbol{V}_{\perp}\in\mathbb{C}^{n\times(n-1)}$ , $\boldsymbol{V}_{\perp}^{*}\boldsymbol{v}=\boldsymbol{0}$ , $\boldsymbol{M}\in\mathbb{C}^{(n-1)\times(n-1)}$ , and $\boldsymbol{y}\in\mathbb{C}^{n-1}$ . Then $\lambda_{1}$ is a normal eigenvalue (so with $\widetilde{\boldsymbol{v}}=\boldsymbol{v}$ ) if and only if $\boldsymbol{y}=\boldsymbol{0}$ . This means that the property of being a simple and nonnormal eigenvalue is a generic property, since the constraint $\boldsymbol{y}=\boldsymbol{0}$ also results in a set of zero measure. .

We now present two propositions. For the first, we assume that the eigenvalues of $\boldsymbol{G}$ can be labeled according to their modulus as

[TABLE]

This implies that $\lambda_{1}$ is real. This assumption holds for many of the experiments that we carry out in Section 5.

Proposition 3.1.

Suppose the iteration matrix $\boldsymbol{G}\in\mathbb{R}^{n\times n}$ has an eigendecomposition and the largest eigenvalue in modulus $\lambda_{1}$ is isolated, with $|\lambda_{1}|>|\lambda_{i}|$ for all $i>1$ . We furthermore assume that $\lambda_{1}$ is simple and nonnormal and we denote its right and left eigenvector by $\boldsymbol{v}_{1}$ and $\widetilde{\boldsymbol{v}}_{1}$ , respectively. Suppose that $\boldsymbol{e}_{0}$ , the initial error for the down-sweep iterates, has a nonzero component $\gamma_{1}$ in the direction of $\boldsymbol{v}_{1}$ , and that the same holds for $\widetilde{\gamma_{1}}$ , the component of $\widetilde{\boldsymbol{e}}_{0}$ in the direction of $\widetilde{\boldsymbol{v}}_{1}$ .

Then, for consistent systems, the error gauge $\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|$ is asymptotically proportional to the true error norms $\|\boldsymbol{e}_{k}\|$ and $\|\widetilde{\boldsymbol{e}}_{k}\|$ , i.e.,

[TABLE]

*for large $k$ . Moreover, the right-hand sides are nonzero. *

Proof 3.2.

The system is consistent, so that (7) holds, i.e.,

[TABLE]

The eigendecomposition allows us to write

[TABLE]

The error of the up-sweeps admits an analogous expression. From the assumption that $\lambda_{1}$ is isolated, it follows that

[TABLE]

However, we also have

[TABLE]

*The assumption that $\lambda_{1}$ is nonnormal is equivalent to $\boldsymbol{v}_{1}$ and $\widetilde{\boldsymbol{v}}_{1}$ being linearly independent. Consequently, we have $\|\gamma_{1}\boldsymbol{v}_{1}-\widetilde{\gamma_{1}}\widetilde{\boldsymbol{v}}_{1}\|>0$ , since the only way to produce a zero constant is to have $\gamma_{1}=\widetilde{\gamma_{1}}=0$ , which by assumption does not occur. Therefore we can conclude (18). *

Next, we consider the situation that $\lambda_{1}$ is non-real and therefore $\lambda_{2}=\overline{\lambda_{1}}$ , the complex conjugate of $\lambda_{1}$ , so that $|\lambda_{2}|=|\lambda_{1}|$ . This case may also arise in our numerical experiments for some choices of the problem parameters. Since $\lambda_{1}$ is usually (very) close to 1 and inside the complex unit circle, the imaginary part is very modest. We make the generic assumption that $|\lambda_{1}|>|\lambda_{3}|$ .

Proposition 3.3.

Given the context of Proposition 3.1, but now with $\lambda_{2}=\overline{\lambda_{1}}$ , and $|\lambda_{1}|>|\lambda_{3}|$ , we have

[TABLE]

Proof 3.4.

*This follows easily from the fact that the asymptotically dominant term of $\boldsymbol{e}_{k}$ is $\lambda_{1}^{k}\gamma_{1}\boldsymbol{v}_{1}+\overline{\lambda}_{1}^{k}\,\overline{\gamma}_{1}\overline{\boldsymbol{v}_{1}}$ , with a similar expression for $\widetilde{e}_{k}$ . *

Key Point 3.

*The error gauge $\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|$ depends crucially on the fact that $\boldsymbol{G}$ is not symmetric. *

Put in colloquial terms, Propositions 3.1 and 3.3 assert that the error gauge $\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|$ is a good estimate of the iteration error of a consistent system. The intuitive argument is simple: both methods converge to the same solution at the same rate, but along different paths, so that the difference vanishes at the same rate as the errors.

3.2 Analysis of the noisy case

Having established the error gauge $\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|$ we now turn to an analysis of the behavior of Kaczmarz’s method for noisy data.

3.2.1 Semi-convergence

Kaczmarz’s method applied to noisy CT problems exhibits semi-convergence: the error first decreases and the iterates approach the noise-free solution, after which the noise component starts to dominate and the iterates move away from the noise-free solution. The minimal error is known as the semi-convergence point. The semi-convergence behavior is sketched in Figure 1. The nature of the first part of this subsection may be considered folklore of Kaczmarz type methods; we try to present a thorough analysis here.

Recall that the data vector is split according to $\boldsymbol{b}=\boldsymbol{b}^{\star}+\delta\boldsymbol{b}$ with a noise-free consistent part $\boldsymbol{b}^{\star}$ and a noise part $\delta\boldsymbol{b}$ . We also recall that analogously the error is split into an iteration error $\widehat{\boldsymbol{x}}_{k}-\boldsymbol{x}^{\star}$ and a noise error $\boldsymbol{x}_{k}-\widehat{\boldsymbol{x}}_{k}$ , where $\widehat{\boldsymbol{x}}_{k}$ are the iterates of the noise-free consistent system using $\boldsymbol{b}^{\star}$ as the right-hand side. When we use an empirical data vector $\boldsymbol{b}$ instead of a noise-free consistent data vector $\boldsymbol{b}^{\star}$ in Kaczmarz’s down-sweep method, we obtain from (6)

[TABLE]

Since $\boldsymbol{A}^{T}\boldsymbol{L}^{-1}(\boldsymbol{b}^{\star}+\delta\boldsymbol{b})\in\mathcal{V}$ , we can restrict the geometric sum to $\mathcal{V}$ and get

[TABLE]

We now employ the identity $\boldsymbol{b}^{\star}=\boldsymbol{A}\boldsymbol{x}^{\star}$ and use the fact that $\boldsymbol{x}^{\star}$ is in the row space of $\boldsymbol{A}$ , yielding

[TABLE]

We thus arrive at an exact expression for the error

[TABLE]

Hence, the splitting in the empirical data vector leads to the splitting of the iteration error and the noise error, i.e.,

[TABLE]

where we have defined (cf. (5))

[TABLE]

This is the limiting vector of the noise error iterates for $k\to\infty$ , and it therefore may be called the “inverted noise.” The expression (23) should be compared with (5); it represents exactly the part that one would like to suppress in a regularized solution.

We will now show that not only $\rho(\boldsymbol{G}|_{\mathcal{V}})<1$ , but also the stronger result that $\|\boldsymbol{G}|_{\mathcal{V}}\|<1$ . (It is stronger since any consistent norm is an upper bound for the spectral radius.) As for Theorem 2.7, we derive this property from a known fact of Kaczmarz type methods.

Lemma 3.5.

*The iteration matrix $\boldsymbol{G}$ satisfies $\|\boldsymbol{G}|_{\mathcal{V}}\|<1$ . *

Proof 3.6.

*Both $\boldsymbol{G}$ and its transpose maps $\mathcal{V}$ onto itself. The two-norm of $\boldsymbol{G}|_{\mathcal{V}}$ is given by $\sqrt{\rho(\boldsymbol{G}^{T}\boldsymbol{G}|_{\mathcal{V}})}$ , but $\boldsymbol{G}^{T}\boldsymbol{G}$ is the iteration matrix of the method known as the symmetric Kaczmarz method from Elfving and Nikazad [8, Prop. 11]. This method is convergent and applying Lemma 2.5 concludes the proof. *

Kaczmarz’s method is convergent, and Lemma 3.5 shows that it is even monotonically convergent, as $\|\boldsymbol{e}_{k+1}\|=\|\boldsymbol{G}\boldsymbol{e}_{k}\|\leq\|\boldsymbol{G}|_{\mathcal{V}}\|\,\|\boldsymbol{e}_{k}\|<\|\boldsymbol{e}_{k}\|$ . This also implies that the noise error forms a monotonic sequence with respect to its limiting vector $\boldsymbol{f}$ .

Proposition 3.7.

The noise error satisfies

[TABLE]

with $\boldsymbol{f}$ from (23). Moreover, $\|\boldsymbol{x}_{k+1}-\widehat{\boldsymbol{x}}_{k+1}-\boldsymbol{f}\|$ constitutes a monotonically decreasing sequence, so that

[TABLE]

Proof 3.8.

From (6), the noise error is given by

[TABLE]

Since $\boldsymbol{G}=\boldsymbol{I}-\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\boldsymbol{A}$ , we can write

[TABLE]

*which resolves into (24). After taking norms, applying Lemma 3.5 completes the proof. *

The following proposition is, to the best of our knowledge, new. It shows that the noise error must increase, which is a fundamental insight into the semi-convergence behavior of Kaczmarz’s method.

Proposition 3.9.

The norm of the noise error is bounded from below by

[TABLE]

*where $\boldsymbol{f}$ is defined in (23). Moreover, if $\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\delta\boldsymbol{b}\neq\boldsymbol{0}$ , then the lower bound forms a monotonically increasing sequence. *

Proof 3.10.

We take the norm and apply the reverse triangle inequality

[TABLE]

*We can now use the fact that $\|\boldsymbol{G}^{k}\boldsymbol{f}\|\leq\|\boldsymbol{G}|_{\mathcal{V}}\|^{k}\|\boldsymbol{f}\|$ , with Lemma 3.5 asserting that $\|\boldsymbol{G}|_{\mathcal{V}}\|<1$ , to find (26). Finally, the right-hand side of (26) is a monotonically increasing sequence if $\boldsymbol{f}\neq\boldsymbol{0}$ , which is equivalent to $\boldsymbol{A}^{T}\boldsymbol{L}^{-1}\delta\boldsymbol{b}\neq\boldsymbol{0}$ . *

Propositions 3.7 and 3.9 show roughly where semi-convergence comes from: the iteration error decreases while the noise error increases. When that happens in just the right way, the error goes through a minimum. In the following, we employ a more phenomenological approach.

3.2.2 The error gauge and semi-convergence

We can use (3.2.1) to define pseudo-iterates for any real number $\tau\geq 0$ , using the spectral decomposition (or Jordan normal form) to define $\boldsymbol{G}^{\tau}$ . Let $f$ be the error of the pseudo-up-sweeps and $g$ be the error of the pseudo-down-sweeps, then the errors of the iterates are discrete samples of the continuous functions $f$ and $g$ , i.e.,

[TABLE]

The discrete error behavior of Kaczmarz’s method is typically very benign, with the error being convex up to an inflection point, after which it is concave and reaches its asymptote from below. We assume that $f$ and $g$ exhibit these features in a continuous way.

{assumptions}

We assume that $f$ and $g$ behave in the following way. There exists a $T>0$ , such that $f:[0,T]\to\mathbb{R}$ and $g:[0,T]\to\mathbb{R}$ satisfy:

$f\geq f_{\mathrm{min}}>0$ and $g\geq g_{\mathrm{min}}>0$ . 2. 2.

$f^{\prime\prime}\geq c>0$ and $g^{\prime\prime}\geq c>0$ . 3. 3.

Both $f$ and $g$ attain their global minima in $(0,T)$ .

The minimizers of $f$ and $g$ give us directly the minimizers of $\|\boldsymbol{e}_{k}\|$ and $\|\widetilde{\boldsymbol{e}}_{k}\|$ by rounding to the nearest integer.

The following is a well-known result from convex analysis (see, e.g., [5, Sec. 3.2.1]) We provide the proof, which is elementary, for completeness.

Lemma 3.11.

*Let $\phi:[a,b]\to\mathbb{R}$ and $\psi:[a,b]\to\mathbb{R}$ be strictly convex with unique minimizers in $(a,b)$ . Let $\Phi$ be a nonnegative, nonzero weighted sum of $\phi$ and $\psi$ , i.e., $\Phi=\alpha\phi+\beta\psi$ , with $\alpha\geq 0$ , $\beta\geq 0$ while $\alpha+\beta>0$ . Then, $\Phi$ has a unique minimum which lies in between the minimizers of $\phi$ and $\psi$ . *

Proof 3.12.

*Let us say the minimum of $\phi$ occurs at $t_{\phi}$ and the minimum of $\psi$ occurs at $t_{\psi}$ . Assume without loss of generality that $t_{\phi}\leq t_{\psi}$ . Then, for $t<t_{\phi}$ we have $\phi^{\prime}(t)<0$ and $\psi^{\prime}(t)<0$ , so that $\Phi^{\prime}(t)<0$ . Likewise, for $t>t_{\psi}$ , we have $\phi^{\prime}(t)>0$ and $\psi^{\prime}(t)>0$ so that $\phi^{\prime}(t)>0$ . Hence, $\Phi$ must have at least one minimum in $[t_{\phi},t_{\psi}]$ by the intermediate value theorem. However, $\Phi$ is also strictly convex so that this minimum is unique. *

We now employ some general results from perturbation analysis of optimization problems to show that the minimizer of the error gauge will be close to the minimizers of the two continuous errors $f$ and $g$ . The proof of the following result, which provides an upper bound, largely follows the line that of [4, Prop. 4.32].

Proposition 3.13.

Let $S:[a,b]\to\mathbb{R}$ be a strongly convex function with convexity modulus $c$ and a perturbation $p:[a,b]\to\mathbb{R}$ be continuously differentiable. Let $t^{\star}$ be the minimizer of $S$ and $\tilde{t}$ be the global minimizer of $h=S+p$ . Then, $p(t^{\star})-p(\tilde{t})\geq 0$ and

[TABLE]

Proof 3.14.

Let us consider the difference in $S$ between $\tilde{t}$ and $t^{\star}$ , i.e.,

[TABLE]

Note that this is a positive quantity since $t^{\star}$ is the minimizer of $S$ . However, since $\tilde{t}$ is assumed to be the global minimizer of $h$ , we have $h(\tilde{t})\leq h(t^{\star})$ , so that

[TABLE]

which also shows that $p(t^{\star})-p(\tilde{t})\geq 0$ . Since $S$ is strongly convex, we have $S(\tilde{t})-S(t^{\star})\geq c\,(\tilde{t}-t^{\star})^{2}$ , so that

[TABLE]

*This leads to (28). *

We apply Proposition 3.13 to our problem by using the functions $S=f^{2}+g^{2}$ and $p=-2fg\cos(\varphi)$ , where $\varphi$ is the continuous angle between the errors, see (15). It is easy to verify that $S$ is strongly convex if both $f$ and $g$ are. The proposition guarantees us that the shift in the location of the minimum will be bounded, but we can also argue that it must be relatively small. To see this, consider that the perturbation $p$ will be smallest around the minima of $f$ and $g$ . Moreover, Lemma 3.11 tells us that the minimizer of $S=f^{2}+g^{2}$ is between those of $f$ and $g$ . Therefore, the minimizer of the error gauge will be close to the minimizers of both true errors. We will demonstrate this in Section 5, see Figure 3.

3.3 The Twin Algorithm

We now propose a new method that utilizes the error gauge as a stopping rule. We use here the shorthand $\mathcal{K}^{\downarrow}(\boldsymbol{x})$ for a Kaczmarz down-sweep starting with $\boldsymbol{x}$ and fixed parameter $0<\omega<2$ . Likewise, $\mathcal{K}^{\uparrow}(\widetilde{\boldsymbol{x}})$ denotes the twin Kaczmarz up-sweep. A possible implementation of the Twin Method in pseudocode is presented in Algorithm 1.

The algorithm requires some maximum number of iterations maxits, which is simply a convenience. The output of the Twin Algorithm is defined as the average of the stopped down- and up-sweeps. This is because we do not have any preference for one over the other (this step can be skipped if multiple reconstructions are desirable).

As with the statistical stopping rules, we have a sequence of positive real values whose minimum indicates where we should stop. Any strategy or method that is used to find the minimum for the statistical stopping rules can therefore be used for the error gauge. Typically, the functions occurring from the statistical stopping rules will not be smooth, and neither will the error gauge. As such, we opt to find the minimum by introducing a user-specified slack $s$ , i.e., a number of iterations to keep running after a (local) minimum has been found to accommodate any oscillations. The best approximation so far is stored.

As an example, suppose that Kaczmarz is running and we come upon iteration $q$ and the slack is (re)initialized, and furthermore suppose that $e_{q+1}>e_{q}$ , but $e_{q+2}<e_{q}$ . In this case, the slack would reinitialize at $q+2$ . We observe: a slack of size $s$ ignores oscillations of size $s-1$ . Continuing the example, suppose that the smallest value of the error gauge has occurred at iteration $p$ , then the algorithm will keep running at least until iteration $p+s$ . If $e_{p}<e_{j}$ for $j=p+1,\ldots,p+s$ , then the algorithm terminates at iteration $p+s$ and returns $\tfrac{1}{2}(\boldsymbol{x}_{p}+\widetilde{\boldsymbol{x}}_{p})$ as the reconstruction. Hence, we see that if $s$ is sufficiently large, the algorithm is guaranteed to find the global minimum of the error gauge.

We use $s=7$ , which is long enough for most oscillations we encountered in our test problems, but not so long that it wastes a lot of computational resources.

4 The Mutual-Step Method

Up till now we employed our error gauge to select the best iteration while leaving the iterative method unaltered. Alternatively, we can adopt the error gauge to modify the method in such a way that it precludes the necessity of a stopping rule altogether. Specifically, we will exploit the error gauge to determine step lengths that eventually diminish, causing the method to converge to a good approximation of the noise-free solution.

4.1 Motivation for a new method

Let us define $\boldsymbol{s}_{k}$ and $\widetilde{\boldsymbol{s}}_{k}$ as the search directions in the down-sweep and up-sweep versions of Kaczmarz’s method, respectively; that is, they formally satisfy

[TABLE]

Recall that this definition includes the relaxation parameter $\omega\in(0,2)$ via the matrix $\boldsymbol{L}$ . Iteration $k{+}1$ of the down-sweep method is then given by $\boldsymbol{x}_{k}+\boldsymbol{s}_{k}$ . A simple modification to the method allows for an iteration-dependent step size $\alpha_{k}$ , so that $\boldsymbol{x}_{k}+\alpha_{k}\boldsymbol{s}_{k}$ is the next iteration. Similarly, we define $\widetilde{\boldsymbol{x}}_{k}+\beta_{k}\widetilde{\boldsymbol{s}}_{k}$ as the next up-sweep iterate. Ideally, one wishes to plug these expressions into the exact error to find the optimal step sizes. However, the exact error is not available, so we use the error gauge instead. Hence, we aim to compute the step length parameters that solve the minimization problem

[TABLE]

which minimizes the error gauge of iteration $k+1$ . We can find the minimum of (30) by setting the derivatives with respect to $\alpha$ and $\beta$ to zero, yielding

[TABLE]

We solve this linear system for $\alpha_{k}$ and $\beta_{k}$ to obtain the step sizes. It is important to note that this system is nonsingular when the vectors $\boldsymbol{s}_{k}$ and $\widetilde{\boldsymbol{s}}_{k}$ are linearly independent. In this case, the $2\times 2$ coefficient matrix in (31) is symmetric positive definite. Of course, since the two search directions come from up-sweeps and down-sweeps, they will generally be linearly independent. If the search directions do happen to be dependent, we set $\alpha_{k}=0$ and only solve for $\beta_{k}$ .

Because we are minimizing the distance between the up- and down-sweep iterates at every iteration by choosing suitable step sizes, the distance cannot increase. Therefore the error gauge, which is this distance, will form a monotonically decreasing sequence. This immediately leads to the following result.

Proposition 4.1.

*Suppose that $\boldsymbol{s}_{k}$ and $\widetilde{\boldsymbol{s}}_{k}$ are linearly independent for all $k$ . By using the step sizes from (31), the error gauge $\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|$ converges to a local minimum. *

Proof 4.2.

*The error gauge is a monotonically decreasing sequence, while it is trivially bounded from below, i.e., $\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|\geq 0$ . *

Corollary 4.3.

If $\boldsymbol{s}_{k}$ and $\widetilde{\boldsymbol{s}}_{k}$ are linearly independent for all $k$ , the step sizes $\alpha_{k}$ and $\beta_{k}$ converge to zero. Moreover, the asymptotic search directions $\boldsymbol{s}$ and $\widetilde{\boldsymbol{s}}$ are related to the limiting approximations $\boldsymbol{x}$ and $\widetilde{\boldsymbol{x}}$ by

[TABLE]

Proof 4.4.

*By assumption, $\boldsymbol{s}_{k}$ and $\widetilde{\boldsymbol{s}}_{k}$ are linearly independent, so that there is no linear combination resulting in the zero vector other than $\alpha_{k}=\beta_{k}=0$ . Proposition 4.1 asserts that the error gauge converges. Therefore, the step sizes must vanish as well, otherwise the error gauge would change. Second, since the system (31) is symmetric positive definite, again by the assumption that $\boldsymbol{s}_{k}$ and $\widetilde{\boldsymbol{s}}_{k}$ are linearly independent, the zero solution can only occur when the right-hand side vanishes. *

Corollary 4.3 provides us the possibility to design a stopping criteria. According to (32), we can stop when

[TABLE]

which is to say that the angles with the error gauge and the search directions are close enough to orthogonal.

An additional stopping criterion also follows from Proposition 4.3; a naive approach might be to stop simply when $|\alpha|+|\beta|$ falls below a specified threshold. This is guaranteed to happen since the method is convergent. It is worth pausing here to appreciate this remarkable result. The semi-convergence property has been completely circumvented, and yet, the Mutual-Step Method will converge to an approximation of the semi-convergence point.

Key Point 4.

*The Mutual-Step Method is convergent. *

The naive stopping criterion $|\alpha|+|\beta|\leq\varepsilon_{2}$ can be somewhat hard to interpret, as we have no control over the norm of the search directions $\boldsymbol{s}_{k}$ and $\widetilde{\boldsymbol{s}}_{k}$ . Hence, we suggest an equivalent stopping criterion that has an easy interpretation as the relative change in the reconstructions. We propose to stop when

[TABLE]

Clearly, $|\alpha_{k}|\frac{\|\boldsymbol{s}_{k}\|}{\|\boldsymbol{x}_{k}\|}=\frac{\|\boldsymbol{x}_{k+1}-\boldsymbol{x}_{k}\|}{\|\boldsymbol{x}_{k}\|}$ , which is indeed the relative change in the down-sweep reconstruction. Hence, when the sum of the relative changes falls below the threshold $\varepsilon_{2}$ , we stop. This is guaranteed to happen by Corollary 4.3. To summarize: we test for both criteria (33) and (34), if either triggers, we stop.

The threshold values $\varepsilon_{1}$ and $\varepsilon_{2}$ can be chosen as large as the minimum relative error that can be achieved. This will largely come down to experience and educated guesses. However, since the method is convergent, it is always possible to use the final up- and down-sweeps as input. With the Mutual-Step Method, it is possible to run a certain number of iterations and inspect the reconstructions. If it is suspected that a better reconstruction is possible, the up- and down-sweeps can be reinserted as input. In the worst case, the images are unaltered. This is a distinct practical advantage that the Twin Algorithm, or any statistical stopping rule for that matter, does not have.

4.2 The algorithm

We present here a possible implementation of the Mutual-Step Method in pseudocode, see Algorithm 2. Similar to the Twin Algorithm, the Mutual-Step Algorithm also computes two reconstructions. Since we have no bias towards one or the other, we again define the final reconstruction as the average of the two. Again, this step is not crucial for the algorithm, and it can be skipped if desired.

The Mutual-Step Algorithm requires two different starting vectors; if we start with the same vector, the initial error gauge would be zero, causing the step sizes to come out zero. Consequently, the algorithm would exit immediately. Furthermore, according to [7], the starting vector should lie in the row space of $\boldsymbol{A}$ . Our approach is simply to use a single down-sweep and up-sweep starting from the zero vector. These vectors are different and lie in the correct space.

4.3 Computational cost

We now turn to the computational cost of the Mutual-Step Method. At this point, it is convenient to introduce a work unit: a certain number of operations so that we can easily compare the cost of the various methods. The most convenient work unit in our context is a single Kaczmarz sweep. For X-ray tomography problems, the average number of nonzero elements in a row of $\boldsymbol{A}$ is $\sqrt{n}$ for a 2D problem and $\sqrt[3]{n}$ for a 3D problem. Carefully going through the operations in (2) reveals that there are $4m\sqrt{n}$ or $4m\sqrt[3]{n}$ operations in a sweep. We omit operations that scale as constants as they will be negligible. Note that the cost of one matrix-vector multiplication is half a work unit.

For each method, we examine the total cost and express it in terms of work units in Table 1. The Twin Method requires two Kaczmarz sweeps together with determination of the difference between the two iterates. The Mutual-Step Method requires two Kaczmarz sweeps and the solution of the system (31). All the stopping rules (see Section 5 for details) require a trace estimate which means that, per iteration, the additional amount of work is one Kaczmarz sweep, the determination of the residual and its norm, and an inner product. There are a small number of additional operations which we ignore here.

Naturally, as $m$ and $n$ grow, the cost of each method becomes dominated by the cost of the Kaczmarz sweeps and the determination of the residual, if needed. The cost of inner products are evidently negligible in the larger scheme of things. Our proposed methods do not require the residual, so that their asymptotic cost is 2 work units per iteration. The stopping rules have an asymptotic cost of $\tfrac{5}{2}$ work units per iteration. Our methods are therefore slightly cheaper for large systems. As an example, for a small $128\times 128$ image we have a system size of $n=128^{2}=16384$ and $m\approx 1.2\,n$ . We should remark that this choice of $m$ is entirely arbitrary. In practice, a whole range of $m$ is used, from vastly underdetermined systems to extremely overdetermined. This amounts to a cost of about $2.005$ work units per iteration for the Twin Method and $2.02$ work units per iteration for the Mutual-Step Method.

In terms of storage, the Twin Algorithm as we use it costs the same as the statistical stopping rules, while the Mutual-Step Algorithm is more expensive. The statistical stopping rules require three vectors of length $n$ to be stored: the actual image, the initial random Gaußian noise vector and the Kaczmarz iterate that uses the random noise vector as its initial guess. The Mutual-Step Algorithm requires the search vectors and image vectors to be stored, which means it requires four $\mathbb{R}^{n}$ -vectors of storage. The Twin Algorithm with slack period requires storing the best reconstruction so far, together with two image vectors, resulting in 3 $\mathbb{R}^{n}$ -vectors – exactly the same as the statistical stopping rules. However, we must point out that it is possible in the Twin Algorithm to employ other methods of detecting the minimum in the error gauge. The method presented here was chosen for performance reasons; it is guaranteed to find the global minimum provided the slack is sufficiently large. It is possible to use local estimators to find the minimum, which would bring the memory requirements down to only two vectors of length $n$ .

5 Numerical experiments

To conduct our numerical experiments, we use the AIR Tools II package for MATLAB which contains various codes for the creation and solution of tomographic problems [14]. The package also contains a function phantomgallery that creates various phantoms with different features. We use the parallel beam set-up with $128\times 128$ pixels per phantom, projection angles $0^{\circ},1.5^{\circ},3^{\circ},\dots,178.5^{\circ}$ and $\textsf{round}(\sqrt{2}\cdot 128)=181$ rays per projection. Furthermore, as a tolerance for the Mutual-Step Algorithm we use $\textsf{tol1}=\textsf{tol2}=10^{-4}$ .

Key Point 5.

*We will use $\boldsymbol{x}_{0}=\boldsymbol{0}$ throughout this work, unless mentioned otherwise. This choice simplifies the expressions somewhat, but more importantly it serves as a good initial guess for noisy inverse problems. *

To simulate noise, we add white Gaußian noise $\delta\boldsymbol{b}\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}_{m})$ scaled such that we can specify the expected relative noise level to be $\eta$ , i.e.,

[TABLE]

To demonstrate the effect of noise on the reconstruction, Figure 2 shows error histories of the relative error $\|\boldsymbol{x}_{k}-\boldsymbol{x}^{\star}\|\ /\,\|\boldsymbol{x}^{\star}\|$ for various noise levels. Our phantom of choice is the grains phantom, which simulates the polycrystalline structure found in many metals, rocks, and bones. We used the standard Kaczmarz (down-sweep) method with $\omega=1$ . We see that as $\eta$ increases the whole curve moves up and the minimum becomes less flat.

For the lowest noise level, $\eta=10^{-3}$ , any iteration between $k=70$ and 100 gives almost the same relative error. However, for the highest noise level it is more critical to find the right number of iterations. The general trend is clear: for higher noise levels the stopping rule needs to be more accurate.

5.1 Casing the competition

As the closest competitors of our proposed algorithms, we consider the standard Kaczmarz algorithm with either of the three statistical stopping rules from Section 2.3: UPRE, GCV, and CDP. Our experience is that the performance of these stopping rules for Kaczmarz’s algorithm is generally quite poor, especially for high noise levels, and to demonstrate this we show a representative error history for the grains phantom in Figure 3.

It is clear that all three statistical stopping rules overshoot the mark by quite a margin. GCV and UPRE overshoot by roughly 100 iterations, while CDP did not stop for maxits $=300$ . As already mentioned in Section 1, this may not matter very much for the simultaneous iterative methods where the minimum is very flat. For Kaczmarz’s algorithm, on the other hand, there is a significant difference. For the current example, our error gauge stops within two iterations of the minimum and produces an image that is roughly 60% better compared to the statistical stopping rules.

As we have emphasised before, the output of the Twin Algorithm is not equal to any iteration of Kaczmarz’s method. For ease of presentation, we have indicated in Figure 3 the iteration at which the Twin Algorithm stops. The error of the output – the average of the up- and down-sweeps – typically has a smaller error and lies below the exact error curve. The outputs of Kaczmarz equipped with GCV and UPRE are in fact on the curve, and this would also be the case for CDP be if it had stopped within maxits iterations. Instead of comparing the proposed algorithms with statistical stopping rules, we therefore from now on compare with the exact minimum.

We will suppose that we have an oracle222The oracle is a concept borrowed from computational complexity theory [18]: oracle machines “are machines that are given access to an “oracle” that can magically solve the decision problem for some language”. Here, we will take the oracle to magically provide the exact error of a reconstruction. that can tell you the exact error of a reconstruction, but importantly, not anything else. Having access to the exact error allows one to pick the best iteration from the sequence of reconstructions generated by an iterative method. This is what we will compare our algorithms with.

5.2 Effect of the relaxation parameter

Throughout this work, we have ignored the choice of a relaxation parameter. However, for many applications this is a crucial point and it is a fair question to ask whether or not the error gauge works for relaxation parameters other than $\omega=1$ . To demonstrate the affirmative, we present Figure 4, which is produced using the test problem as previously defined with the grains phantom and a relative noise level of $8\cdot 10^{-3}$ .

The figure shows that for a range of values $0<\omega<2$ , the error gauge picks out an iteration close to the oracle’s choice, i.e., the iteration with the smallest error. The error gauge typically produces a good reconstruction, only being a couple of iterations off in most cases.

An interesting point to note about Figure 4 is that a smaller relaxation parameter produces a smaller minimal error. This might be somewhat surprising, as asymptotic convergence theory shows that the optimal relaxation parameter satisfies $\omega>1$ : it is known that Kaczmarz is equivalent to Successive Over-Relaxation applied to the system $\boldsymbol{A}\boldsymbol{A}^{T}\boldsymbol{y}=\boldsymbol{b}$ with $\boldsymbol{x}=\boldsymbol{A}^{T}\boldsymbol{y}$ [3], while it can be shown that $\omega>1$ results in the asymptotically optimal convergence rate for Successive Over-Relaxation [22]. However, we must note that we are optimizing different objectives: in the asymptotic case we optimize the convergence rate, while in the noisy case we are optimizing the minimal error at the semi-convergence point. The observed behavior of a smaller relaxation parameter producing a smaller error is quite consistent for the phantoms we tested with. Moreover, typically a smaller relaxation parameter also leads to slower convergence. As a compromise between computation time and performance, we suggest a relaxation parameter in the range of $0.4\leq\omega\leq 0.7$ , which produces good results for all phantoms in a decent time.

We must also note that the error gauge does not seem to work quite as well for very small relaxation parameters. However, as we have already pointed out, these cases are likely to be avoided for reasons of speed. In Figure 4, the error gauge only has difficulties with $\omega=0.1$ .

5.3 Variants of row-by-row Kaczmarz methods

We will also briefly demonstrate that our error gauge approach also works on some variants of Kaczmarz’s method. Specifically, the symmetric Kazcmarz method and a special family of block-sequential Kaczmarz methods.

The symmetric Kaczmarz method consists of performing an up-sweep after a down-sweep or the other way around. The iteration matrices are given by $\boldsymbol{G}^{T}\boldsymbol{G}$ and $\boldsymbol{G}\boldsymbol{G}^{T}$ , which therefore share eigenvalues – the squared singular values of $\boldsymbol{G}$ – but not eigenvectors. Indeed, the eigenvectors of $\boldsymbol{G}^{T}\boldsymbol{G}$ are the right-singular vectors of $\boldsymbol{G}$ , while for $\boldsymbol{G}\boldsymbol{G}^{T}$ we have the left-singular vectors as eigenvectors. We can therefore again construct an error gauge by comparing the up-down version with the down-up version.

Following Elfving and Nikazad, we refer to a collection of rows of $A$ as a block, and each block is processed by computing the pseudoinverse [8]. From the fact that Kaczmarz’s method computes a minimum-norm solution to (4), it can be seen that Kaczmarz’s method actually computes the pseudoinverse when the rows of $\boldsymbol{A}$ are orthogonal and a starting guess $\boldsymbol{x}_{0}=\boldsymbol{0}$ is used. Thus, if all blocks consist of sets of orthogonal rows, each can be treated by using Kaczmarz’s method. The conclusion is that a block-sequential method with blocks of orthogonal rows is equivalent to a row-by-row Kaczmarz method. Moreover, when such blocks consist of structurally orthogonal rows, meaning each term in the inner product vanishes, the result can be computed in a parallelized way. Sørensen and Hansen experimentally confirmed that such block-sequential methods perform very well and can be efficiently implemented in parallel [24].

The results are plotted in Figure 5. The methods are displayed in terms of iterations of Kaczmarz’s method; note that one iteration of symmetric Kaczmarz takes two standard Kaczmarz iterations. The error gauge does not perform quite as well for the symmetric Kaczmarz method.

The PART (parallel ART) blocking strategy has been first proposed by Gordon [10], who uses blocks whose rows correspond to parallel beams, ensuring each block consists of structurally orthogonal rows. Structural orthogonality between rows roughly corresponds to rays that do not intersect inside the measurement volume, hence it is not very difficult to construct other parallelizable versions of Kaczmarz. To demonstrate that this is indeed fairly easy, we have constructed a variant of PART that groups rays into fans with their shared intersection point outside the measurement volume. As a consequence, each block consists of structurally orthogonal rows. We have labeled this “FAN” in Figure 5. All methods were run with a relaxation parameter of $\omega=0.4$ .

5.4 Randomized Kaczmarz methods

The idea of the error gauge does not seem to easily combine with randomized methods. We believe this is due to the fact that the error gauge requires a fixed set of eigenvectors to work. We can arbitrarily call $m$ updates an iteration and set up the iteration matrix, however, it would clearly change with every iteration. To test randomized Kaczmarz in conjunction with the error gauge, we employ a popular randomization: a random shuffle of the rows each sweep. The popularity stems from the fact that each row is used exactly once in an iteration, thereby making sure all of the data is used. The shuffle can be generated by, e.g., the Fisher–Yates algorithm [16, Vol. 2, Sec. 3.4.2, Algo. P]. The iterate $\boldsymbol{x}_{k}$ is computed using the shuffled ordering, while $\widetilde{\boldsymbol{x}}_{k}$ is computed using the reversed shuffled ordering. The results are plotted in Figure 6. A fixed ordering is also shown in the figure for comparison.

It is interesting to note that many of the peaks occurring in the exact error are reflected in $\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|_{2}$ . However, most notably should be the fact that $\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|_{2}$ does not show an increasing trend after the semi-convergence point. By contrast, in the cyclic case $\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|_{2}$ follows the behavior of the exact error rather well. We must conclude that $\|\boldsymbol{x}_{k}-\widetilde{\boldsymbol{x}}_{k}\|_{2}$ is not an error gauge for randomized methods, and depends on a fixed ordering. Extending the error gauge to work with randomized methods is not a simple modification.

Any randomized Kaczmarz method is going to give a distribution of outcomes for a fixed right-hand side. If we consider the infinite sequence of iterates as the outcome, any instance of a randomized Kaczmarz method will have a vanishing probability mass. At the same time, we should only compare events that have a nonzero probability. We can, for instance, compare randomized methods with cyclic methods by their average behavior. Alternatively, we can estimate the probability that an outcome of a randomized method produces a better result than a cyclic method. It turns out that for the CT problem, at least for our test problem, simply choosing a good relaxation parameter $\omega$ makes the cyclic method better than the randomized method in these senses.

For randomized Kaczmarz, we have observed the general trend that a smaller relaxation parameter leads to a smaller minimal error. Hence, the relaxation parameter $\omega=0.4$ , which is on the lower end of range we suggested in Section 5.2, is also a good choice for randomized Kaczmarz. In Figure 7, we have performed 3000 runs of our test problem. For each run, Gaußian noise was generated with a relative noise level of $8\cdot 10^{-3}$ . The results are therefore both averaged over the noise instances and the random orderings.

As should be clear from the figure, for this example the cyclic Kaczmarz method converges faster and provides a smaller error on average. In 1891 out of 3000 cases the cyclic method produces a lower minimal error, which corresponds to roughly $63\%$ of the cases. This demonstrates that randomized methods do not always outperform cyclic Kaczmarz.

It is, of course, possible that different randomized methods do perform better in some sense; how to choose the row-selection probabilities is a whole topic in and of itself. But then again, it is also possible to find better row orderings for the cyclic methods. Pursuing this quickly leads to a completely different topic that is outside the scope of the current work: we are concerned here with demonstrating the effectiveness of our error gauge.

5.5 Haunted house: a selection of phantoms

From now on, we use the relative noise level $\eta=8\cdot 10^{-3}$ in our experiments, as this may be seen as realistic; it is also the second-largest noise level from Figure 2. We also fix $\omega=0.7$ , which is in the upper end of the range we suggested in Section 5.2. To illustrate the point of Section 5.1, we plot the results of the various algorithms under consideration for two phantoms from phantomgallery, namely, the already-mentioned grains phantom and shepplogan which implements the Shepp–Logan phantom. The corresponding reconstructions are shown in Figure 8. Both phantoms have pixel values between 0 and 1. When we compute the relative error with respect to each phantom we use the solutions as produced by the algorithms, with some negative pixels and some pixels greater than one. The figures, on the other hand, show the reconstructions with the pixel values limited to the range $[0,1]$ . For both phantoms the Mutual-Step Algorithm produces the best reconstruction.

First consider the results for the shepplogan phantom. The Twin Algorithm produces the best reconstruction: it is the least grainy picture, while the small details can be clearly distinguished. Both GCV and UPRE produce a very grainy image and the smallest details are harder to identify. For reference, we have also included a result using the oracle as a stopping rule. We refer to this algorithm as “Kaczmarz+Oracle”, which is indicated by KO in the figure.

The results for the grains phantom are slightly more interesting to examine. This phantom consists of a collection of piecewise constant regions (which are Voronoi regions belonging to a random collection of points). The relative performance of the algorithms can be visually evaluated by looking especially at the contrast, e.g., between two neighboring regions that have close intensity values. Whether or not one can distinguish two such regions is a matter of opinion, but generally speaking it does appear that the Mutual-Step Algorithm produces the best results as well as the least grainy picture. GCV and UPRE perform particularly badly, it seems, where neighboring regions are sometimes very hard, if not impossible, to discern.

To demonstrate that the observed behavior is not an oddity or statistical fluke, we run the reconstructions of the grains phantom 1000 times and keep track of the relative errors. From this point on, we compare our algorithms only to Kaczmarz+Oracle, as the statistical stopping rules can never do better.

The results are shown in the histograms in Figure 9. The histograms show several interesting features. First off, the Mutual-Step Algorithm is overall best, with the lowest average error and the smallest spread. In fact, the outcomes from this algorithm has very little or no overlap with the other algorithms. The Twin Algorithm gives a slightly better average performance than Kaczmarz+Oracle, though their histograms overlap almost completely. Kaczmarz+Oracle has the largest spread, and is on average slightly worse than the Twin Algorithm. At first, it may seem odd that the Twin Algorithm produces a better result on average than the Kaczmarz+Oracle. Yet, we should recall that the output of the Twin Algorithm is the average of the up-sweeps and down-sweep iterates, while the Kaczmarz+Oracle algorithm is associated with the down-sweep iterates only. Evidently the averaging often gives a better result than a down-sweep or up-sweep algorithm separately.

To provide further support for the quality of our methods, we ran the proposed algorithms 100 times on each of seven phantoms available from phantomgallery. Additionally, for each run we assign points based on which method produces the best result. Our point system assigns a score of 1 point to the method with the best reconstruction, half a point for the second best, and no point for the worst. Hence, a score of 100 means the algorithm produced the best result each time, while a score of 0 means it produced the worst result each time. This allows us to see roughly how well the methods behave relative to each other. The final tally is contained in Table 2.

The Mutual-Step Algorithm is probably the best algorithm under consideration when it comes to producing a high-quality image. For 4 out of 7 phantoms it produces the best image in the vast majority of tests. The Twin Algorithm also produces good reconstructions, being the best for two phantoms. Combined, the Twin Algorithm and the Mutual-Step Algorithm produce the best image in the majority of test cases for 6 out of 7 phantoms. Interestingly, only for the binary phantom does Kaczmarz+Oracle have the highest score.

To provide more insight, we also display the average errors for each particular phantom and the total average error in the first three columns of Table 2. Here we see another remarkable aspect of the proposed algorithms that was also observed in Figure 9: averaged over all phantoms, both proposed algorithms are at least as good as Kaczmarz+Oracle; the Mutual-Step Algorithm is significantly better.

Finally, in the middle three columns of Table 2 we also compare the methods when it comes to computing cost expressed in our work units (recall that one work unit is the work required to complete one sweep of Kaczmarz’s method). We should point out that Kaczmarz+Oracle has a very low work load, which is due to the fact that consulting the oracle is assumed to be free. The corresponding workload should therefore be read as a lower bound on the work required for Kaczmarz’s method equipped with any pure stopping rule. Interestingly, the Mutual-Step Algorithm is not too far off, with less work required when averaged over all 7 phantoms. The Twin Algorithm usually requires much more work than the oracle, which is to be expected of course: if the Twin Algorithm would stop at the same point as the oracle, it would have done roughly twice the work plus the slack.

6 Conclusion and future work

We presented a new approach to noisy CT reconstruction based on Kaczmarz’s method. The regularizing property of the proposed methods stems from the semi-convergence of Kaczmarz’s method, where the iteration number is used as the regularization parameter. The problem of choosing the regularization parameter takes the form of a stopping rule, typically based on a statistical analysis of the noise and the error.

Our key idea is to combine two Kaczmarz iterations with different row orderings and the same convergence rate, which allows us to compute an error gauge, i.e., an estimate of the reconstruction error. We then stop the iterations when the error gauge is minimum, providing an alternative to existing stopping rules and avoiding assumptions about the noise statistics.

When the original linear system (1) is consistent, we can prove rigorously that the error gauge estimates the reconstruction error. For noisy systems we argued that the error gauge represents semi-convergence with a reasonable fidelity.

We suggested two algorithms that utilize the error gauge: the Twin Algorithm and the Mutual-Step Algorithm. The former uses the error gauge directly as a stopping rule, stopping when the error gauge is minimal. The latter uses the error gauge to determine approximately optimal step sizes for every iteration. We showed that the Mutual-Step Algorithm converges monotonically to a locally optimal pair of approximations. It therefore converges to an approximation of the semi-convergence point, precluding the need for a stopping rule.

Using several numerical experiments from parallel-beam X-ray CT, we demonstrated that the proposed algorithms perform very well indeed. As a reference, we used an oracle for the standard Kaczmarz algorithm that provides the exact error, which is therefore able to pick the best possible reconstruction from the sequence of iterates. Our proposed algorithms – whose output is the average of the up- and down-sweeps – perform on average better than Kaczmarz’s method equipped with the oracle.

For four out of the seven phantoms that we consider, the Mutual-Step Algorithm produce the best image for a large majority of cases. For two other phantoms, the Twin Algorithm produces the best image in the majority of the test cases. Only for a single phantom our algorithms fail to produce the best reconstruction in the majority of the cases. Clearly, the Mutual-Step Algorithm is a solid choice for a reconstruction algorithm. However, it does require additional storage. Depending on what method is employed to detect a minimum in the error gauge, the Twin Algorithm has the same memory requirements as the statistical stopping rules. Therefore, if memory is a concern, the Twin Algorithm is a good alternative.

Acknowledgements

We would like to thank the referees for their excellent comments and suggestions. B.S. van Lith is supported by the EuroTech Postdoc Programme, co-funded by the European Commission under its framework programme Horizon 2020. Grant Agreement number 754462.

Bibliography30

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] J. M. Bardsley , Applications of a nonnegatively constrained iterative method with statistically based stopping rules to CT, PET, and SPECT imaging , Electron. Trans. Numer. Anal., 38 (2011), pp. 34–43.
2[2] M. Bertero and P. Boccacci , Introduction to Inverse Problems in Imaging , CRC Press, 1998.
3[3] Å. Björck and T. Elfving , Accelerated projection methods for computing pseudoinverse solutions of systems of linear equations , BIT Numer. Math., 19 (1979), pp. 145–163.
4[4] J. F. Bonnans and A. Shapiro , Perturbation Analysis of Optimization Problems , Springer, Berlin Heidelberg, 2000.
5[5] S. P. Boyd and L. Vandenberghe , Convex Optimization , Cambridge University Press, 2004.
6[6] T. Elfving, P. C. Hansen, and T. Nikazad , Semiconvergence and relaxation parameters for projected SIRT algorithms , SIAM J. Sci. Comput., 34 (2012), pp. A 2000–A 2017.
7[7] , Semi-convergence properties of Kaczmarz’s method , Inverse Problems, 30 (2014), p. 055007.
8[8] T. Elfving and T. Nikazad , Properties of a class of block-iterative methods , Inverse Problems, 25 (2009), p. 115011.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

A twin error gauge for Kaczmarz’s iterations††thanks: Version .

Abstract

keywords:

1 Introduction

2 Background theory

2.1 Kaczmarz and its convergence

Lemma 2.1**.**

Proof 2.2**.**

Lemma 2.3**.**

Proof 2.4**.**

Lemma 2.5**.**

Lemma 2.6**.**

Theorem 2.7**.**

2.2 Up-sweep Kaczmarz

Proposition 2.8**.**

Proof 2.9**.**

Key Point 1**.**

2.3 Statistical stopping rules

3 The Error Gauge and the Twin Method

3.1 Analysis of an error gauge: the noise-free consistent case

Key Point 2**.**

Proposition 3.1**.**

Proof 3.2**.**

Proposition 3.3**.**

Proof 3.4**.**

Key Point 3**.**

3.2 Analysis of the noisy case

3.2.1 Semi-convergence

Lemma 3.5**.**

Proof 3.6**.**

Proposition 3.7**.**

Proof 3.8**.**

Proposition 3.9**.**

Proof 3.10**.**

3.2.2 The error gauge and semi-convergence

Lemma 3.11**.**

Proof 3.12**.**

Proposition 3.13**.**

Proof 3.14**.**

3.3 The Twin Algorithm

4 The Mutual-Step Method

4.1 Motivation for a new method

Proposition 4.1**.**

Proof 4.2**.**

Corollary 4.3**.**

Proof 4.4**.**

Key Point 4**.**

4.2 The algorithm

4.3 Computational cost

5 Numerical experiments

Key Point 5**.**

5.1 Casing the competition

5.2 Effect of the relaxation parameter

5.3 Variants of row-by-row Kaczmarz methods

5.4 Randomized Kaczmarz methods

5.5 Haunted house: a selection of phantoms

6 Conclusion and future work

Acknowledgements

Lemma 2.1.

Proof 2.2.

Lemma 2.3.

Proof 2.4.

Lemma 2.5.

Lemma 2.6.

Theorem 2.7.

Proposition 2.8.

Proof 2.9.

Key Point 1.

Key Point 2.

Proposition 3.1.

Proof 3.2.

Proposition 3.3.

Proof 3.4.

Key Point 3.

Lemma 3.5.

Proof 3.6.

Proposition 3.7.

Proof 3.8.

Proposition 3.9.

Proof 3.10.

Lemma 3.11.

Proof 3.12.

Proposition 3.13.

Proof 3.14.

Proposition 4.1.

Proof 4.2.

Corollary 4.3.

Proof 4.4.

Key Point 4.

Key Point 5.