Accelerated Sampling Kaczmarz Motzkin Algorithm for The Linear Feasibility Problem

Md Sarowar Morshed; Md Saiful Islam; Md. Noor-E-Alam

arXiv:1902.03502·math.OC·August 16, 2022·J. Glob. Optim.


TL;DR

This paper introduces an Accelerated Sampling Kaczmarz Motzkin (ASKM) algorithm that improves convergence for large-scale linear feasibility problems, especially in ill-conditioned cases, outperforming existing methods.

Contribution

The paper proposes a novel accelerated version of the SKM algorithm with proven convergence improvements for solving large-scale linear inequalities.

Findings

01. ASKM outperforms SKM, IPM, and ASM on most test instances.

02. ASKM converges faster than SKM on ill-conditioned problems.

03. Numerical experiments on random instances and Netlib LPs validate the effectiveness of ASKM.

Abstract

The Sampling Kaczmarz Motzkin (SKM) algorithm is a generalized method for solving large scale linear systems of inequalities. Having its root in the relaxation method of Agmon, Schoenberg, and Motzkin and the randomized Kaczmarz method, SKM outperforms the state of the art methods in solving large-scale Linear Feasibility (LF) problems. Motivated by SKM's success, in this work, we propose an Accelerated Sampling Kaczmarz Motzkin (ASKM) algorithm which achieves better convergence compared to the standard SKM algorithm on ill conditioned problems. We provide a thorough convergence analysis for the proposed accelerated algorithm and validate the results with various numerical experiments. We compare the performance and effectiveness of ASKM algorithm with SKM, Interior Point Method (IPM) and Active Set Method (ASM) on randomly generated instances as well as Netlib LPs. In most of the test…


Tables

Table 1: CPU time comparisons among MATLAB methods solving LP, and SKM and ASKM solving LF. ∗ indicates that the solver did not solve the problem to the desired accuracy due to reaching a predetermined upper limit on function evaluations of 100000.

| Instance | Dimensions | ASKM | SKM | Interior Point | Active set | ϵ | β |
|---|---|---|---|---|---|---|---|
| lp_brandy | $1047\times 303$ | 0.007 | 0.0117 | 16.97 | 63.11 | 0.1 | 50 |
| lp_blend | $337\times 114$ | 0.41 | 0.56 | 2.28 | 4.62 | 0.001 | 20 |
| lp_agg | $2207\times 615$ | 0.059 | 0.088 | 66.54∗ | 315.91∗ | 0.01 | 50 |
| lp_adlittle | $389\times 138$ | 0.0008 | 0.002 | 2.16 | 4.96 | 0.01 | 10 |
| lp_bandm | $1555\times 472$ | 0.28 | 0.24 | 14.57 | 529.43∗ | 0.01 | 70 |
| lp_degen2 | $2403\times 757$ | 8.29 | 10.16 | 7.13 | 21038 | 0.01 | 200 |
| lp_finnis | $3123\times 1064$ | 0.13 | 0.15 | 66.16∗ | 237750∗ | 0.005 | 100 |
| lp_recipe | $591\times 204$ | 0.19 | 0.27 | 0.89 | 63.24 | 0.002 | 30 |
| lp_scorpion | $1709\times 466$ | 6.83 | 11.86 | 17.68 | 8.02 | 0.005 | 200 |
| lp_stocfor1 | $565\times 165$ | 0.31 | 0.37 | 2.13 | 2.52 | 0.001 | 50 |

Equations

$Ax\leq b,\quad b\in\mathbb{R}^{m},\ A\in\mathbb{R}^{m\times n}$

$x_{k+1}=x_{k}-\delta\,\frac{\left(a_{i^{*}}^{T}x_{k}-b_{i^{*}}\right)^{+}}{\|a_{i^{*}}\|^{2}}\,a_{i^{*}}$

$\gamma_{k}^{2}-\frac{\zeta}{m}\gamma_{k}=\frac{d}{\beta}\left(1-\frac{\lambda\beta}{m}\gamma_{k}\right)\gamma_{k-1}^{2}$


$y_{k}=\alpha_{k}v_{k}+(1-\alpha_{k})x_{k}$

$x_{k+1}=y_{k}-\theta_{k}\nabla f(y_{k})$

$v_{k+1}=\beta_{k}v_{k}+(1-\beta_{k})y_{k}-\gamma_{k}\nabla f(y_{k})$

$\mathbb{E}\left[\|x_{k+1}-\mathcal{P}_{A,b}(x_{k+1})\|^{2}\right]\leq\left(1-\frac{2\delta-\delta^{2}}{mL^{2}}\right)^{k+1}\|x_{0}-\mathcal{P}_{A,b}(x_{0})\|^{2}$

$\mathbb{E}\left[\big\|v_{k+1}-x^{*}\big\|^{2}_{(A^{T}A)^{\dagger}}\right]\leq\frac{4\big\|x_{0}-x^{*}\big\|^{2}_{(A^{T}A)^{\dagger}}}{\left(\sigma_{1}^{k+1}+\sigma_{2}^{k+1}\right)^{2}}$

$\mathbb{E}\left[\big\|x_{k+1}-x^{*}\big\|^{2}\right]\leq\frac{4\lambda\big\|x_{0}-x^{*}\big\|^{2}_{(A^{T}A)^{\dagger}}}{\zeta\left(\sigma_{1}^{k+1}-\sigma_{2}^{k+1}\right)^{2}}$

$\mathbb{E}_{\beta_{i^{*}}}\left[\big\|\left(a_{i}^{T}y-b_{i}\right)^{+}_{i^{*}}\big\|^{2}\right]=\frac{1}{\binom{m}{\beta}}\sum\limits_{k=0}^{m-\beta}\binom{\beta-1+k}{\beta-1}\big|\left(a_{i}^{T}y-b_{i}\right)^{+}_{i_{k}}\big|^{2}$

$\mathbb{E}_{\beta_{i^{*}}}\left[\big\|a_{i^{*}}\left(a_{i}^{T}y-b_{i}\right)^{+}_{i^{*}}\big\|^{2}_{(A^{T}A)^{\dagger}}\right]\leq\frac{\beta}{m}\|(Ay-b)^{+}\|^{2}$

$$\begin{aligned}
\mathbb{E}_{\beta_{i^{*}}}\left[\big\|a_{i^{*}}\left(a_{i}^{T}y-b_{i}\right)^{+}_{i^{*}}\big\|^{2}_{(A^{T}A)^{\dagger}}\right]
&=\frac{1}{\binom{m}{\beta}}\sum\limits_{k=0}^{m-\beta}\binom{\beta-1+k}{\beta-1}\big\|a_{i_{k}}\big\|^{2}_{(A^{T}A)^{\dagger}}\big|\left(a_{i}^{T}y-b_{i}\right)^{+}_{i_{k}}\big|^{2}\\
&\leq\frac{\binom{m-1}{\beta-1}}{\binom{m}{\beta}}\sum\limits_{k=0}^{m-\beta}\big\|a_{i_{k}}\big\|^{2}_{(A^{T}A)^{\dagger}}\big|\left(a_{i_{k}}^{T}y-b_{i_{k}}\right)^{+}\big|^{2}\\
&\leq\frac{\beta}{m}\sum\limits_{j=1}^{m}\big\|a_{j}\big\|^{2}_{(A^{T}A)^{\dagger}}\big|\left(a_{j}^{T}y-b_{j}\right)^{+}\big|^{2}\\
&=\frac{\beta}{m}\sum\limits_{j=1}^{m}\Big\langle(A^{T}A)^{\dagger}a_{j}\left(a_{j}^{T}y-b_{j}\right)^{+},\ a_{j}\left(a_{j}^{T}y-b_{j}\right)^{+}\Big\rangle\\
&=\frac{\beta}{m}\operatorname{Tr}\Big[(A^{T}A)^{\dagger}\sum\limits_{j=1}^{m}a_{j}\big[\left(a_{j}^{T}y-b_{j}\right)^{+}\big]^{2}a_{j}^{T}\Big]\\
&=\frac{\beta}{m}\operatorname{Tr}\Big[(A^{T}A)^{\dagger}A^{T}D^{2}\left[(Ay-b)^{+}\right]A\Big]\\
&=\frac{\beta}{m}\operatorname{Tr}\Big[V\Sigma^{-2}V^{T}V\Sigma U^{T}D^{2}\left[(Ay-b)^{+}\right]U\Sigma V^{T}\Big]\\
&=\frac{\beta}{m}\operatorname{Tr}\Big[U^{T}D^{2}\left[(Ay-b)^{+}\right]U\Big]\\
&=\frac{\beta}{m}\big\|D\left[(Ay-b)^{+}\right]U\big\|^{2}_{F}\\
&=\frac{\beta}{m}\sum\limits_{j=1}^{m}\big|\left(a_{j}^{T}y-b_{j}\right)^{+}\big|^{2}\|U_{j}\|^{2}_{2}\\
&\leq\frac{\beta}{m}\sum\limits_{j=1}^{m}\big|\left(a_{j}^{T}y-b_{j}\right)^{+}\big|^{2}=\frac{\beta}{m}\|(Ay-b)^{+}\|^{2}
\end{aligned}$$

$\mathbb{E}_{\beta_{i^{*}}}\left[\big\|a_{i^{*}}\left(a_{i}^{T}y-b_{i}\right)^{+}_{i^{*}}\big\|^{2}\right]\leq\frac{\beta}{m}\|(Ay-b)^{+}\|^{2}$

$$\begin{aligned}
\mathbb{E}_{\beta_{i^{*}}}\left[\big\|a_{i^{*}}\left(a_{i}^{T}y-b_{i}\right)^{+}_{i^{*}}\big\|^{2}\right]
&=\frac{1}{\binom{m}{\beta}}\sum\limits_{k=0}^{m-\beta}\binom{\beta-1+k}{\beta-1}\big\|a_{i_{k}}\big\|^{2}\big|\left(a_{i}^{T}y-b_{i}\right)^{+}_{i_{k}}\big|^{2}\\
&\leq\frac{\binom{m-1}{\beta-1}}{\binom{m}{\beta}}\sum\limits_{k=0}^{m-\beta}\big|\left(a_{i_{k}}^{T}y-b_{i_{k}}\right)^{+}\big|^{2}\\
&\leq\frac{\beta}{m}\sum\limits_{j=1}^{m}\big|\left(a_{j}^{T}y-b_{j}\right)^{+}\big|^{2}=\frac{\beta}{m}\|(Ay-b)^{+}\|^{2}
\end{aligned}$$

$\frac{1}{m}\|(Ay-b)^{+}\|^{2}\leq\|y-x^{*}\|^{2}-\mathbb{E}_{\beta_{i^{*}}}\left[\big\|\mathcal{P}_{a_{i^{*}},b_{i^{*}}}(y)-x^{*}\big\|^{2}\right]$

$i^{*}=\operatorname*{arg\,max}_{i\in\tau_{k}}\{a_{i}^{T}x_{k}-b_{i},0\}=\operatorname*{arg\,max}_{i\in\tau_{k}}\left(A_{\tau_{k}}x_{k}-b_{\tau_{k}}\right)^{+}_{i}$

$$\begin{aligned}
\|x_{k+1}-\mathcal{P}(y_{k})\|^{2}
&=\Big\|y_{k}-\frac{(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}}{\|a_{i^{*}}\|^{2}}a_{i^{*}}-\mathcal{P}(y_{k})\Big\|^{2}\\
&=\|y_{k}-\mathcal{P}(y_{k})\|^{2}+\big[(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\big]^{2}-2(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\,a_{i^{*}}^{T}\left(y_{k}-\mathcal{P}(y_{k})\right)\\
&\leq\|y_{k}-\mathcal{P}(y_{k})\|^{2}+\big[(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\big]^{2}-2(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\left(a_{i^{*}}^{T}y_{k}-b_{i^{*}}\right)\\
&=\|y_{k}-\mathcal{P}(y_{k})\|^{2}-\big[(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\big]^{2}\\
&=\|y_{k}-\mathcal{P}(y_{k})\|^{2}-\big\|(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\big\|_{\infty}^{2}
\end{aligned}$$



Full text

Accelerated Sampling Kaczmarz Motzkin Algorithm for the Linear Feasibility Problem

Md Sarowar Morshed, Md Saiful Islam, Md Noor-E-Alam (corresponding author: [email protected])

Department of Mechanical & Industrial Engineering, Northeastern University, Boston, MA 02115, USA

Abstract

The Sampling Kaczmarz Motzkin (SKM) algorithm is a generalized method for solving large-scale linear systems of inequalities. Having its root in the relaxation method of Agmon, Schoenberg, and Motzkin and the randomized Kaczmarz method, SKM outperforms the state-of-the-art methods in solving large-scale Linear Feasibility (LF) problems. Motivated by SKM’s success, in this work, we propose an Accelerated Sampling Kaczmarz Motzkin (ASKM) algorithm which achieves better convergence compared to the standard SKM algorithm on ill-conditioned problems. We provide a thorough convergence analysis for the proposed accelerated algorithm and validate the results with various numerical experiments. We compare the performance and effectiveness of ASKM algorithm with SKM, Interior Point Method (IPM) and Active Set Method (ASM) on randomly generated instances as well as Netlib LPs. In most of the test instances, the proposed ASKM algorithm outperforms the other state-of-the-art methods.

Keywords: Kaczmarz Method, Nesterov’s Acceleration, Motzkin Method, Sampling Kaczmarz Motzkin Algorithm

MSC 2010: 90C05, 65F10, 90C25, 15A39, 68W20

1 Introduction

We consider the problem of solving large-scale systems of linear inequalities:

$$Ax\leq b,\qquad b\in\mathbb{R}^{m},\ A\in\mathbb{R}^{m\times n}. \tag{1}$$

Since iterative methods are usually better suited to problems with many more constraints than variables, we confine the scope of this work to the $m\gg n$ regime. We denote the rows of matrix $A$ by $a_{i}^{T}$ for $i=1,2,\ldots,m$. In addition, we make the following assumptions: (1) the system is consistent, (2) matrix $A$ has no zero rows, and (3) the rows of $A$ are normalized (i.e., $\|a_{i}\|=1$). The last assumption is not essential for algorithmic efficiency, but it simplifies the convergence analysis.

While most classical iterative methods are deterministic, recent works [1, 2, 3, 4, 5, 6, 7, 8] suggest that randomization can play a huge role in the design of efficient algorithms for solving LF problems and randomized algorithms often perform better than existing deterministic methods. As shown in [9], randomized iterative methods can outperform state-of-the-art methods (i.e., IPM, ASM) for large-scale LF. In the field of large-scale optimization, mainly IPMs, there is a growing interest in approximate Newton-type methods ([10, 11, 12, 13, 14, 15, 16]) which use fast sub-schemes for calculating approximate solutions of large-scale Linear System (LS).

The Kaczmarz method for solving LS, discovered in 1937 [17], remained largely unnoticed by the Western research community until the early 1980s, when it found an important application in the area of Algebraic Reconstruction Techniques (ART) for image reconstruction [18]. Since then, it has been used in several other areas such as digital signal processing and computed tomography, and it belongs to a general category of methods that includes row-action, component solution, cyclic projection, and successive projection methods (see [19]). It gained immense popularity in the research community after the convergence analysis of the randomized version in 2009 [1]. The convergence analysis of Strohmer [1] encouraged numerous extensions and generalizations of the randomized Kaczmarz method (see [2, 3, 5, 6, 7, 20]). For instance, replacing the equality constraints with inequality constraints yields a variant of the original problem.

Motzkin’s relaxation method is a variation of the Kaczmarz method which was introduced in the early 1950s [21, 22] for solving systems of linear inequalities. Since then, it has been rediscovered several times. For instance, the famous perceptron algorithm in machine learning [23, 24, 25] can be thought of as a member of this family of methods. Additionally, the relaxation method has been referred to as the Kaczmarz method with the “most violated constraint control” or the “maximal-residual control” [19, 26, 27]. The rate of convergence of Motzkin’s method depends on the step lengths and the so-called Hoffman constants [21, 28].

Combining the Kaczmarz and Motzkin methods, the SKM algorithm proposed in [9] for solving the LF problem given in (1) requires only $O(n)$ memory storage and has a linear convergence rate. As shown by the authors, SKM is much more efficient than state-of-the-art techniques such as IPMs, ASMs, and Kaczmarz methods. Roughly, the SKM algorithm selects a row out of $\beta$ rows (sampled from $A$) by the maximum violation criterion (i.e., it chooses the row $i^{*}$ with $i^{*}=\operatorname*{arg\,max}_{i\in\tau_{k}}\{a_{i}^{T}x_{k}-b_{i},0\},\ \beta=|\tau_{k}|$) and then updates the next point as follows:

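$$x_{k+1}=x_{k}-\delta\,\frac{\left(a_{i^{*}}^{T}x_{k}-b_{i^{*}}\right)^{+}}{\|a_{i^{*}}\|^{2}}\,a_{i^{*}}. \tag{2}$$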

In equation (2), $\delta$ can take any value in $0\leq\delta\leq 2$. Without loss of generality, we consider $\delta=1$ in this work. The SKM method described in [9] overcomes the drawbacks of the individual methods (Kaczmarz, Motzkin) and combines their strengths. By selecting the most violated hyperplane from a sample, SKM achieves faster convergence than the randomized Kaczmarz method. In addition, its per-iteration computational cost is lower than that of Motzkin’s method. Recently, Wright et al. [20] applied the acceleration scheme of Nesterov to the randomized Kaczmarz method. In a different work, Xu et al. [29] investigated the acceleration scheme in the context of the extended randomized Kaczmarz method for least-squares problems. Moreover, there is recent work on applying Nesterov’s scheme in IPMs for solving large-scale linear programming problems [30]. The above-mentioned works showed that introducing Nesterov’s acceleration scheme speeds up the convergence of the original method.
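The sampling, maximum-violation selection, and update in (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' MATLAB implementation: the function name `skm`, its arguments, the stopping rule on $\|(Ax-b)^{+}\|$, and the small box-constraint example are all assumptions made here. For clarity the sketch recomputes the full residual at every iteration, which a practical implementation would avoid.

```python
import random

def skm(A, b, beta, x0, delta=1.0, tol=1e-8, max_iter=10_000, seed=0):
    """Illustrative SKM loop for A x <= b (A is a list of rows, b a list)."""
    rng = random.Random(seed)
    m, n = len(A), len(x0)
    x = list(x0)
    for _ in range(max_iter):
        # Positive residual (A x - b)^+ over all rows; used for stopping
        # and reused for the selection step (assumption made for brevity).
        viol = [max(sum(A[i][j] * x[j] for j in range(n)) - b[i], 0.0)
                for i in range(m)]
        if sum(v * v for v in viol) ** 0.5 <= tol:
            break
        # Sample beta rows uniformly at random without replacement.
        tau = rng.sample(range(m), beta)
        # Maximum-violation criterion within the sample.
        i_star = max(tau, key=lambda i: viol[i])
        r = viol[i_star]
        if r > 0.0:
            norm_sq = sum(a * a for a in A[i_star])
            # Update (2): step of relative length delta toward the hyperplane.
            x = [x[j] - delta * r / norm_sq * A[i_star][j] for j in range(n)]
    return x

# Hypothetical example: the unit box 0 <= x, y <= 1 written as A x <= b.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 0.0, 1.0, 0.0]
x_final = skm(A, b, beta=2, x0=[5.0, -3.0])
```

Setting `beta=1` gives a randomized Kaczmarz-style step for inequalities, while `beta=len(A)` always picks the most violated row, as in Motzkin's method.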

In this work, we apply Nesterov’s acceleration scheme [31, 32, 33, 34, 35] to the generalized SKM algorithm. This can be seen as a generalized accelerated scheme for the randomized Kaczmarz method, covering both linear systems and linear systems of inequalities. With some modification, like the one stated in the work of Lewis et al. [2], this method can be applied to linear systems with both equality and inequality constraints. The overarching goal of this paper is to incorporate the ideas of the Kaczmarz method [1, 17, 36] for LS and Motzkin’s relaxation [9] for the LF problem and to develop an accelerated randomized scheme for solving large-scale LF problems. The paper is organized as follows. The proposed algorithm is discussed in Section 2, and its convergence analysis is given in Section 3. Extensive numerical experiments performed on random and Netlib LP instances are provided in Section 4. Finally, Section 5 concludes the paper.

2 ASKM Algorithm

2.1 Notation:

We follow the standard notation in this work. For example, $\mathbb{R}$ will be used to denote the set of real numbers. Matrix $A$ with $m$ rows and $n$ columns belongs to $\mathbb{R}^{m\times n}$, with $A_{ij}$ denoting the real-valued element in row $i$ and column $j$. $A^{T}$ will be used to denote the transpose of matrix $A$, with $tr(A)$, $det(A)$, and $diag(A)$ denoting the trace, determinant, and diagonal of matrix $A$, respectively. $I_{n}$ will be used as the $n\times n$ identity matrix.

Furthermore, we use $\mathbf{1}=\left[1\ 1\ \ldots\ 1\right]^{T}$ to denote the all-ones vector and $e_{i}$ to denote the standard $i$-th basis vector. A function $f:X\mapsto Y$ maps its domain, $dom(f)\subseteq X$, into the set $Y$. As is customary, we use $\nabla f$ and $\nabla^{2}f$ to represent the gradient and Hessian of $f$. Finally, $\langle x,y\rangle=x^{T}y$ denotes the standard inner product and $\|x\|=\sqrt{\langle x,x\rangle}$ the Euclidean ($L_{2}$) norm. $\lambda_{min},\lambda_{max}$ denote the minimum and maximum nonzero eigenvalues of $A^{T}A$, respectively. $\|A\|$ is the spectral norm of the matrix $A$ and $\|A\|_{F}$ denotes the Frobenius norm. Moreover, $A^{\dagger}$ is the Moore-Penrose pseudoinverse of $A$, and the compact singular value decomposition of $A\in\mathbb{R}^{m\times n}$ is $A=U\Sigma V^{T}$, where $U,V$ are unitary matrices of appropriate size and $\Sigma$ is a non-singular diagonal matrix with the singular values on its diagonal. Throughout the paper, $\zeta$ denotes the condition number of matrix $A$. The notation $\mathcal{P}_{A,b}(x)$ denotes the Euclidean projection of $x$ onto the feasible region of $Ax\leq b$.

In this section, we review the SKM algorithm proposed in [9] and then, motivated by the accelerated randomized Kaczmarz algorithm in [20] and the accelerated extended Kaczmarz algorithm in [29], we develop the ASKM algorithm.

In the above algorithm, we propose to use the acceleration scheme discovered by Nesterov [31, 32, 33, 34, 35] in the SKM algorithm framework to achieve a second-order convergence rate, as compared to the linear rate shown in [9]. The ASKM algorithm uses the acceleration procedure of [33], which is better known in the context of the gradient descent algorithm. Note that Nesterov’s acceleration scheme uses two new sequences $\{y_{k}\}$ and $\{v_{k}\}$ and updates the sequences as follows:

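$$\begin{aligned}
y_{k}&=\alpha_{k}v_{k}+(1-\alpha_{k})x_{k},\\
x_{k+1}&=y_{k}-\theta_{k}\nabla f(y_{k}),\\
v_{k+1}&=\beta_{k}v_{k}+(1-\beta_{k})y_{k}-\gamma_{k}\nabla f(y_{k}).
\end{aligned}$$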

In equation (2.1), $\nabla f$ is the gradient of the given function and $\theta_{k}$ is the step size. The main contribution of the above scheme is that it uses appropriate values for the parameters $\alpha_{k},\beta_{k},\gamma_{k}$, which in turn yield better convergence in the context of standard gradient descent. Now, using the general setup of Nesterov’s scheme [35] for coordinate descent and the idea in [20], we developed the ASKM algorithm shown above (Algorithm 2).

3 Convergence Analysis

In this section, we analyze the convergence of the proposed ASKM algorithm (Algorithm 2). Throughout the analysis, we make the following assumptions: 1) $\|a_{i}\|=1$ for all $i=1,\ldots,m$, which implies $\|A\|_{F}^{2}=m$, and 2) $\mathcal{P}_{A,b}$ is full dimensional. The following convergence result was proven in [9] for the SKM algorithm (Algorithm 1):

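$$\mathbb{E}\left[\|x_{k+1}-\mathcal{P}_{A,b}(x_{k+1})\|^{2}\right]\leq\left(1-\frac{2\delta-\delta^{2}}{mL^{2}}\right)^{k+1}\|x_{0}-\mathcal{P}_{A,b}(x_{0})\|^{2}. \tag{3}$$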

In the above equation, $L$ is the Hoffman constant and $V_{k}$ is defined in the proof of Lemma 4. For the ASKM algorithm (Algorithm 2) shown above, we prove a better convergence result as stated in Theorem 1 compared to the one in (3) (we consider the case $\delta=1$ ).

Remark 1.

This framework for convergence in the context of acceleration follows the general idea developed by Nesterov [32] for the Gradient Descent method. The proof of Theorem 1 follows the generalized sketch developed by Nesterov [35] for proving the convergence of the Coordinate Descent method. Because the acceleration method derived in [20] for the randomized Kaczmarz method is similar to our proposed method, we use the same standard notation. In addition, the following results generalize the acceleration results for Kaczmarz-type methods (i.e., if we select $\beta=1$ and use linear systems, we get the same results shown in [20]).

Theorem 1.

For the ASKM algorithm defined above with $\lambda\in[0,\lambda_{min}]$, $\sigma_{1}=1+\frac{\sqrt{\lambda\beta\zeta}}{2m}$, and $\sigma_{2}=1-\frac{\sqrt{\lambda\beta\zeta}}{2m}$, the following bounds hold for all $k\geq 0$:

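$$\mathbb{E}\left[\big\|v_{k+1}-x^{*}\big\|^{2}_{(A^{T}A)^{\dagger}}\right]\leq\frac{4\big\|x_{0}-x^{*}\big\|^{2}_{(A^{T}A)^{\dagger}}}{\left(\sigma_{1}^{k+1}+\sigma_{2}^{k+1}\right)^{2}},$$

$$\mathbb{E}\left[\big\|x_{k+1}-x^{*}\big\|^{2}\right]\leq\frac{4\lambda\big\|x_{0}-x^{*}\big\|^{2}_{(A^{T}A)^{\dagger}}}{\zeta\left(\sigma_{1}^{k+1}-\sigma_{2}^{k+1}\right)^{2}}.$$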

Here, $x^{*}\in\mathbb{R}^{n}$ is a unique limit point of the ASKM iterates (for the uniqueness of $x^{*}$ see Lemma 2.2-2.4 in [9]), $\zeta$ is the condition number of matrix $A$ and $\beta$ is the sample size of the random sampling process.

Before delving into the proof of Theorem 1, we prove some useful lemmas. For the expectation calculation of the random process described in our algorithm, we need a certain setup. Let $(Ax_{k}-b)^{+}_{i_{j}}$ denote the $(\beta+j)^{th}$ smallest entry of the residual vector (i.e., if we order the entries of $(Ax_{k}-b)^{+}$ from smallest to largest, $(Ax_{k}-b)^{+}_{i_{j}}$ is the entry in the $(\beta+j)^{th}$ position). Now, if we consider the size of all the entries of the residual vector $(Ax_{k}-b)^{+}$, we can calculate the probability that a particular entry of the residual vector is selected. In this case, each sample has equal probability of selection (i.e., $\frac{1}{\binom{m}{\beta}}$). Moreover, the magnitude of each residual entry determines how often that entry is expected to be selected (Algorithm 2, sample of constraints selection). For example, the $\beta^{th}$ smallest entry is selected by exactly one sample, whereas the $m^{th}$ smallest (i.e., largest) entry is selected by every sample that contains it. Therefore, if we expand the expectation of the residual (with respect to the probabilistic choice of sample constraints, $\tau_{j}$, of size $\beta$), we get the following:

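$$\mathbb{E}_{\beta_{i^{*}}}\left[\big\|\left(a_{i}^{T}y-b_{i}\right)^{+}_{i^{*}}\big\|^{2}\right]=\frac{1}{\binom{m}{\beta}}\sum\limits_{k=0}^{m-\beta}\binom{\beta-1+k}{\beta-1}\big|\left(a_{i}^{T}y-b_{i}\right)^{+}_{i_{k}}\big|^{2}, \tag{12}$$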

where $\operatorname{\mathbb{E}}_{\beta_{i^{*}}}$ denotes the required expectation in accordance with the above sampling process ( $\beta$ is the sample size).
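The counting argument behind this expansion can be checked by brute force on a small residual vector. The sketch below is illustrative only: the function names and the sample residual are assumptions made here. It enumerates all $\binom{m}{\beta}$ samples, takes the largest entry in each, and compares the average of its square with the closed-form sum, in which the $(\beta+k)^{th}$ smallest entry is the maximum of exactly $\binom{\beta-1+k}{\beta-1}$ samples.

```python
from itertools import combinations
from math import comb, isclose

def expected_max_sq_bruteforce(r, beta):
    """Average over all beta-subsets of the squared largest entry in the subset."""
    subsets = list(combinations(range(len(r)), beta))
    return sum(max(r[i] for i in s) ** 2 for s in subsets) / len(subsets)

def expected_max_sq_formula(r, beta):
    """Closed form: entries sorted ascending, weighted by binomial counts."""
    m = len(r)
    s = sorted(r)
    # s[beta - 1 + k] is the (beta + k)-th smallest entry (1-indexed).
    total = sum(comb(beta - 1 + k, beta - 1) * s[beta - 1 + k] ** 2
                for k in range(m - beta + 1))
    return total / comb(m, beta)

# Hypothetical positive residual (A y - b)^+ with two satisfied constraints.
residual = [0.0, 0.0, 0.3, 1.2, 0.7, 2.0]
beta = 3
brute = expected_max_sq_bruteforce(residual, beta)
closed = expected_max_sq_formula(residual, beta)
```

Because every weight satisfies $\binom{\beta-1+k}{\beta-1}\leq\binom{m-1}{\beta-1}$ and $\binom{m-1}{\beta-1}/\binom{m}{\beta}=\beta/m$, the same quantity is bounded by $\frac{\beta}{m}\|(Ay-b)^{+}\|^{2}$, which is the step used in Lemmas 2 and 3.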

Lemma 2.

For any $y\in\mathbb{R}^{n},$ we have the following:

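$$\mathbb{E}_{\beta_{i^{*}}}\left[\big\|a_{i^{*}}\left(a_{i}^{T}y-b_{i}\right)^{+}_{i^{*}}\big\|^{2}_{(A^{T}A)^{\dagger}}\right]\leq\frac{\beta}{m}\|(Ay-b)^{+}\|^{2}.$$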

Proof.

Let us define the singular value decomposition of $A$ as $A=U\Sigma V^{T}$, where both $U$ and $V$ are unitary matrices of appropriate dimension and $\Sigma$ is a positive diagonal matrix. We can easily show that $(A^{T}A)^{\dagger}=V\Sigma^{-2}V^{T}$. Then, with the setup defined above, we have the following:

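$$\begin{aligned}
\mathbb{E}_{\beta_{i^{*}}}\left[\big\|a_{i^{*}}\left(a_{i}^{T}y-b_{i}\right)^{+}_{i^{*}}\big\|^{2}_{(A^{T}A)^{\dagger}}\right]
&=\frac{1}{\binom{m}{\beta}}\sum\limits_{k=0}^{m-\beta}\binom{\beta-1+k}{\beta-1}\big\|a_{i_{k}}\big\|^{2}_{(A^{T}A)^{\dagger}}\big|\left(a_{i}^{T}y-b_{i}\right)^{+}_{i_{k}}\big|^{2}\\
&\leq\frac{\binom{m-1}{\beta-1}}{\binom{m}{\beta}}\sum\limits_{k=0}^{m-\beta}\big\|a_{i_{k}}\big\|^{2}_{(A^{T}A)^{\dagger}}\big|\left(a_{i_{k}}^{T}y-b_{i_{k}}\right)^{+}\big|^{2}\\
&\leq\frac{\beta}{m}\sum\limits_{j=1}^{m}\big\|a_{j}\big\|^{2}_{(A^{T}A)^{\dagger}}\big|\left(a_{j}^{T}y-b_{j}\right)^{+}\big|^{2}\\
&=\frac{\beta}{m}\sum\limits_{j=1}^{m}\Big\langle(A^{T}A)^{\dagger}a_{j}\left(a_{j}^{T}y-b_{j}\right)^{+},\ a_{j}\left(a_{j}^{T}y-b_{j}\right)^{+}\Big\rangle\\
&=\frac{\beta}{m}\operatorname{Tr}\Big[(A^{T}A)^{\dagger}\sum\limits_{j=1}^{m}a_{j}\big[\left(a_{j}^{T}y-b_{j}\right)^{+}\big]^{2}a_{j}^{T}\Big]\\
&=\frac{\beta}{m}\operatorname{Tr}\Big[(A^{T}A)^{\dagger}A^{T}D^{2}\left[(Ay-b)^{+}\right]A\Big]\\
&=\frac{\beta}{m}\operatorname{Tr}\Big[V\Sigma^{-2}V^{T}V\Sigma U^{T}D^{2}\left[(Ay-b)^{+}\right]U\Sigma V^{T}\Big]\\
&=\frac{\beta}{m}\operatorname{Tr}\Big[U^{T}D^{2}\left[(Ay-b)^{+}\right]U\Big]\\
&=\frac{\beta}{m}\big\|D\left[(Ay-b)^{+}\right]U\big\|^{2}_{F}\\
&=\frac{\beta}{m}\sum\limits_{j=1}^{m}\big|\left(a_{j}^{T}y-b_{j}\right)^{+}\big|^{2}\|U_{j}\|^{2}_{2}\\
&\leq\frac{\beta}{m}\sum\limits_{j=1}^{m}\big|\left(a_{j}^{T}y-b_{j}\right)^{+}\big|^{2}=\frac{\beta}{m}\|(Ay-b)^{+}\|^{2}.
\end{aligned}$$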

This proves Lemma 2. ∎

Lemma 3.

For any $y\in\mathbb{R}^{n},$ we have the following:

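$$\mathbb{E}_{\beta_{i^{*}}}\left[\big\|a_{i^{*}}\left(a_{i}^{T}y-b_{i}\right)^{+}_{i^{*}}\big\|^{2}\right]\leq\frac{\beta}{m}\|(Ay-b)^{+}\|^{2}.$$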

Proof.

With the expression of expectation defined in (12) we have,

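$$\begin{aligned}
\mathbb{E}_{\beta_{i^{*}}}\left[\big\|a_{i^{*}}\left(a_{i}^{T}y-b_{i}\right)^{+}_{i^{*}}\big\|^{2}\right]
&=\frac{1}{\binom{m}{\beta}}\sum\limits_{k=0}^{m-\beta}\binom{\beta-1+k}{\beta-1}\big\|a_{i_{k}}\big\|^{2}\big|\left(a_{i}^{T}y-b_{i}\right)^{+}_{i_{k}}\big|^{2}\\
&\leq\frac{\binom{m-1}{\beta-1}}{\binom{m}{\beta}}\sum\limits_{k=0}^{m-\beta}\big|\left(a_{i_{k}}^{T}y-b_{i_{k}}\right)^{+}\big|^{2}\\
&\leq\frac{\beta}{m}\sum\limits_{j=1}^{m}\big|\left(a_{j}^{T}y-b_{j}\right)^{+}\big|^{2}=\frac{\beta}{m}\|(Ay-b)^{+}\|^{2}.
\end{aligned}$$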

This proves Lemma 3. ∎

Lemma 4.

For any $y\in\mathbb{R}^{n}$ and $x^{*}$ that satisfies $Ax^{*}\leq b,$ we have the following:

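$$\frac{1}{m}\|(Ay-b)^{+}\|^{2}\leq\|y-x^{*}\|^{2}-\mathbb{E}_{\beta_{i^{*}}}\left[\big\|\mathcal{P}_{a_{i^{*}},b_{i^{*}}}(y)-x^{*}\big\|^{2}\right].$$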

Proof.

Let us define $\mathcal{P}$ as the projection operator onto the feasible region $P=\{x\in\mathbb{R}^{n}\ |\ Ax\leq b\}$, and denote by $s_{k}$ the number of zero entries in the residual $(Ax_{k}-b)^{+}$, which corresponds to the number of satisfied constraints. We also define $V_{j}=\max\{m-s_{j},m-\beta+1\}$. Now, from the update formula shown in Algorithm 2, we know that $x_{k+1}=y_{k}-\frac{(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}}{\|a_{i^{*}}\|^{2}}a_{i^{*}}$, where

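$$i^{*}=\operatorname*{arg\,max}_{i\in\tau_{k}}\{a_{i}^{T}x_{k}-b_{i},0\}=\operatorname*{arg\,max}_{i\in\tau_{k}}\left(A_{\tau_{k}}x_{k}-b_{\tau_{k}}\right)^{+}_{i}.$$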

Then we have,

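$$\begin{aligned}
\|x_{k+1}-\mathcal{P}(y_{k})\|^{2}
&=\Big\|y_{k}-\frac{(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}}{\|a_{i^{*}}\|^{2}}a_{i^{*}}-\mathcal{P}(y_{k})\Big\|^{2}\\
&=\|y_{k}-\mathcal{P}(y_{k})\|^{2}+\big[(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\big]^{2}-2(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\,a_{i^{*}}^{T}\left(y_{k}-\mathcal{P}(y_{k})\right)\\
&\leq\|y_{k}-\mathcal{P}(y_{k})\|^{2}+\big[(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\big]^{2}-2(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\left(a_{i^{*}}^{T}y_{k}-b_{i^{*}}\right)\\
&=\|y_{k}-\mathcal{P}(y_{k})\|^{2}-\big[(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\big]^{2}\\
&=\|y_{k}-\mathcal{P}(y_{k})\|^{2}-\big\|(A_{\tau_{k}}y_{k}-b_{\tau_{k}})^{+}_{i^{*}}\big\|_{\infty}^{2}.
\end{aligned} \tag{17}$$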

Now, taking the expectation on both sides of equation (17), we have,

[TABLE]

The expectation above proves Lemma 4. ∎
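As a sanity check on the statement of Lemma 4, the inequality can be evaluated exactly on a tiny system by enumerating every sample $\tau$ of size $\beta$. The following sketch is an illustration under assumptions made here (the helper names, the box-constraint system with unit-norm rows, the point $y$, and the feasible point $x^{*}$ are all hypothetical); it is not part of the authors' analysis.

```python
from itertools import combinations

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def lemma4_sides(A, b, y, x_star, beta):
    """Return (lhs, rhs) of Lemma 4 for unit-norm rows and feasible x_star."""
    m = len(A)
    res = [max(dot(a, y) - bi, 0.0) for a, bi in zip(A, b)]

    def dist_sq(p):
        return sum((pi - xi) ** 2 for pi, xi in zip(p, x_star))

    subsets = list(combinations(range(m), beta))
    expected_after = 0.0
    for tau in subsets:
        # Most violated constraint within the sample.
        i_star = max(tau, key=lambda j: res[j])
        # Projection of y onto the half-space a_i^T x <= b_i (||a_i|| = 1).
        proj = [yj - res[i_star] * aj for yj, aj in zip(y, A[i_star])]
        expected_after += dist_sq(proj)
    expected_after /= len(subsets)
    lhs = sum(r * r for r in res) / m
    rhs = dist_sq(y) - expected_after
    return lhs, rhs

# Hypothetical data: unit box, an infeasible y, and a feasible x_star.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 0.0, 1.0, 0.0]
lhs, rhs = lemma4_sides(A, b, y=[5.0, -3.0], x_star=[0.5, 0.5], beta=2)
```

For this data the left-hand side is $25/4$ and the right-hand side is $14$, so the inequality holds with room to spare; with `beta=1` the expected decrease is smaller but still dominates the left-hand side.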

Definition: Let us define a function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ as follows:

[TABLE]

The gradient of the function is given by:

[TABLE]

Lemma 5.

For any $x,y\in\mathbb{R}^{n}$ and the condition number of the matrix $A$, $\zeta=\frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}=\frac{\lambda^{1/2}_{\max}(A^{T}A)}{\lambda^{1/2}_{\min}(A^{T}A)}$, we have the following:

[TABLE]

Proof.

We first prove that $\nabla f$ is Lipschitz continuous with the constant $\zeta$ . Using the definition given in (18), for any $x,y\in\mathbb{R}^{n}$ we have,

[TABLE]

The above equation shows that $\nabla f$ is Lipschitz continuous with the constant $\zeta$. Here, we use the standard identity $(A^{T}A)^{\dagger}A^{T}=A^{\dagger}$. Now, using Lemma 1.2.3 proven in [33], since $\nabla f$ is Lipschitz continuous, for any $x,y\in\mathbb{R}^{n}$ we can write the following:

[TABLE]

Now, by simplifying (20), we get the following bound:

[TABLE]

The bound mentioned above proves the Lemma 5. ∎

Lemma 6.

For any $m^{2}>\zeta\lambda\beta$ and with the following definitions:

[TABLE]

both sequences $\{\alpha_{k}\},\ \{\beta_{k}\}$ lie in the interval $[0,1]$ if and only if $\gamma_{k}$ satisfies the following property:

[TABLE]

Proof.

The proof of Lemma 6 is straightforward. If we consider the definitions of the sequences $\{\alpha_{k}\},\ \{\beta_{k}\}$ with the given condition, we find that $\alpha_{k},\beta_{k}\in[0,1]$ implies that the following bound must hold:

[TABLE]

Conversely, if we assume the bound holds for $\gamma_{k}$, then it follows directly that the sequences $\{\alpha_{k}\}$ and $\{\beta_{k}\}$ lie in the interval $[0,1]$. ∎

Lemma 7.

For any $d\geq\beta$ , if $\gamma_{k-1}\leq\sqrt{\frac{\zeta}{\lambda d}}$ holds, then $\gamma_{k}$ satisfies the bound in Lemma 6 and also $\gamma_{k}$ lies in the interval $[\gamma_{k-1},\sqrt{\frac{\zeta}{\lambda d}}]$ .

Proof.

Let us define the function $g:\mathbb{R}\rightarrow\mathbb{R}$ as follows:

[TABLE]

Since $\gamma_{k}$ is, by definition, the largest root of $g(\gamma)$, it satisfies $g(\gamma_{k})=0$. Now we have,

[TABLE]

Similarly,

[TABLE]

Therefore, we can write,

[TABLE]

This proves the first part of the Lemma. For the second part, notice that, assuming $\beta\leq d$ we have the following:

[TABLE]

Here, the last inequality follows from the assumed condition $\gamma_{k-1}\leq\sqrt{\frac{\zeta}{\lambda d}}$ . In a similar fashion we have,

[TABLE]

In this case, we use the inequality $\sqrt{\frac{\zeta}{\lambda d}}>\frac{\zeta}{m}$ together with $\gamma_{k-1}\leq\sqrt{\frac{\zeta}{\lambda d}}$. This proves the statement $\gamma_{k}\in[\gamma_{k-1},\sqrt{\frac{\zeta}{\lambda d}}]$. ∎
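The behaviour described in Lemma 7 can be observed numerically. The sketch below assumes that $\gamma_{k}$ is the largest root of the quadratic recurrence $\gamma_{k}^{2}-\frac{\zeta}{m}\gamma_{k}=\frac{d}{\beta}\left(1-\frac{\lambda\beta}{m}\gamma_{k}\right)\gamma_{k-1}^{2}$ listed among the paper's equations, started from $\gamma_{-1}=0$ as in the proof of Theorem 1. The function name and the parameter values are hypothetical, chosen so that $d=\beta$, $m^{2}>\zeta\lambda\beta$, and $\sqrt{\zeta/(\lambda d)}>\zeta/m$.

```python
import math

def gamma_sequence(m, zeta, lam, beta, d, num_iters):
    """Iterate the gamma_k recurrence, taking the largest root at each step."""
    gammas = []
    g_prev = 0.0  # gamma_{-1} = 0
    for _ in range(num_iters):
        # Rearranged recurrence: gamma^2 + B * gamma + C = 0 with
        # B = (d * lam * g_prev^2 - zeta) / m and C = -(d / beta) * g_prev^2.
        B = (d * lam * g_prev ** 2 - zeta) / m
        C = -(d / beta) * g_prev ** 2
        g = (-B + math.sqrt(B * B - 4.0 * C)) / 2.0
        gammas.append(g)
        g_prev = g
    return gammas

# Hypothetical parameters satisfying the assumptions stated above.
m, zeta, lam, beta, d = 100, 4.0, 0.5, 10, 10
gammas = gamma_sequence(m, zeta, lam, beta, d, num_iters=30)
upper = math.sqrt(zeta / (lam * d))
```

With these values the first term is $\gamma_{0}=\zeta/m$, and the sequence increases toward the upper limit $\sqrt{\zeta/(\lambda d)}$ without crossing it, in line with the interval stated in Lemma 7.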

Remark 2.

Note that by taking limits as $\lambda\rightarrow 0^{+}$ in Theorem 1 we have,

[TABLE]

Therefore, we can conclude that when $\lambda>0$ , the ASKM algorithm converges with a linear rate. When $\lambda=0$ , we get a sublinear convergence. But for the case of $\lambda\rightarrow 0^{+}$ , we get a quadratic convergence, which is consistent with the convergence rate of the original accelerated algorithm of Nesterov [33] and also with the Accelerated Randomized Kaczmarz algorithm proposed in [20]. Furthermore, if we take $\beta=1$ , we get exactly the same convergence theorem proven in [20].
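To get a feel for the two bounds in Theorem 1, the factors multiplying $\|x_{0}-x^{*}\|^{2}_{(A^{T}A)^{\dagger}}$ can be evaluated directly. The sketch below is illustrative: the function names and parameter values are assumptions made here. The helper `small_lambda_limit` uses the first-order expansion $\sigma_{1}^{k+1}-\sigma_{2}^{k+1}\approx(k+1)\frac{\sqrt{\lambda\beta\zeta}}{m}$ for small $\lambda$, which turns the second factor into $\frac{4m^{2}}{\beta\zeta^{2}(k+1)^{2}}$; this is a derivation from the stated bound, not a formula quoted from the paper.

```python
import math

def theorem1_factors(m, zeta, lam, beta, k):
    """Factors of ||x0 - x*||^2 in the two bounds of Theorem 1 (lam > 0)."""
    root = math.sqrt(lam * beta * zeta) / (2.0 * m)
    s1, s2 = 1.0 + root, 1.0 - root
    v_factor = 4.0 / (s1 ** (k + 1) + s2 ** (k + 1)) ** 2
    x_factor = 4.0 * lam / (zeta * (s1 ** (k + 1) - s2 ** (k + 1)) ** 2)
    return v_factor, x_factor

def small_lambda_limit(m, zeta, beta, k):
    """Limit of the x-bound factor as lam -> 0+ (first-order expansion)."""
    return 4.0 * m ** 2 / (beta * zeta ** 2 * (k + 1) ** 2)

# Hypothetical parameters.
v_early, _ = theorem1_factors(100, 4.0, 0.5, 10, k=10)
v_late, _ = theorem1_factors(100, 4.0, 0.5, 10, k=200)
_, x_small_lam = theorem1_factors(100, 4.0, 1e-10, 10, k=10)
x_limit = small_lambda_limit(100, 4.0, 10, k=10)
```

For fixed $\lambda>0$ the first factor shrinks geometrically in $k$ because $\sigma_{1}>1$, while for very small $\lambda$ the second factor is governed by the $(k+1)^{-2}$ term.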

Proof.

(Theorem 1) The proof of Theorem 1 is general in the context of acceleration. We follow the standard notation and steps shown in [35] and [20]. Using the definitions given in Lemma 6, we note that the following relation holds:

[TABLE]

Now, let us define $r_{k}^{2}=\|v_{k}-x^{*}\|^{2}_{(A^{T}A)^{\dagger}}$ . We can write,

[TABLE]

Now, we divide the RHS of equation (24) into three parts and simplify them separately. Since $\|\cdot\|^{2}_{(A^{T}A)^{\dagger}}$ is a convex function and $0\leq\beta_{k}\leq 1$, the first part of (24) satisfies the following inequality:

[TABLE]

Let $i(k)$ denote the index selected at random at iteration $k$, and let $I(k)$ denote all random indices that occurred before or at iteration $k$, i.e.,

[TABLE]

The sequences $x_{k+1},y_{k+1},v_{k+1}$ are dependent on $I(k)$ . In the next part of the proof, we use $\operatorname{\mathbb{E}}_{i(k)|I(k-1)}[.]$ to represent the expectation of a random variable conditioned on $I(k-1)$ with respect to the index $i(k)$ . Note that,

[TABLE]

Also note that, from now on, we use $\operatorname{\mathbb{E}}$ instead of $\operatorname{\mathbb{E}}_{\beta_{i^{*}}}$ to denote the expectation. Now, based on Lemma 3 and Lemma 4, we can write the second part of (24) as follows:

[TABLE]

Now, by using the definitions of the sequences $\{\alpha_{k}\},\{\beta_{k}\}$ and $\{\gamma_{k}\}$ , we can simply show that the following identity holds:

[TABLE]

We use the identity of (29) in the next part of our proof. After taking expectation in the third term of equation (24), we get,

[TABLE]

Using the definition of the function $f(.)$ defined in (18) and denoting $z_{k}=\beta_{k}\frac{1-\alpha_{k}}{\alpha_{k}}$ , we get,

[TABLE]

Now, substituting equation (31) in (30) with the known identity, $\frac{\zeta\beta\gamma_{k}}{m}z_{k}=d\beta_{k}\gamma_{k-1}^{2}$ , we have,

[TABLE]

Now by substituting all three parts of (2), (28) and (32) in equation (24), we get,

[TABLE]

From now on, we will assume $d=\beta$ , which will simplify our algorithm. Let us define two sequences $\{A_{k}\}$ and $\{B_{k}\}$ as follows:

[TABLE]

Without loss of generality, we assume $A_{0}=0$ to be consistent with the definition $\gamma_{-1}=0$ . Also note that since $\beta_{k}\in(0,1]$ , we have $B_{k+1}\geq B_{k}$ . Now using the definition of the sequence $\{\gamma_{k}\},\{\alpha_{k}\}$ , we have,

[TABLE]

Equation (35) also implies that the sequence $\{A_{k}\}$ is an increasing sequence. Now, it is straightforward to check that the following identities hold.

[TABLE]

Now, multiplying both sides of (33) by $B_{k+1}^{2}$ and using the above identities we have,

[TABLE]

Furthermore, we have,

[TABLE]

Therefore, using (38) we can conclude the following bound,

[TABLE]

Now, we need to estimate the growth of the defined sequences $\{A_{k}\}$ and $\{B_{k}\}$. Here, we follow the proofs for the Accelerated Coordinate Descent method of Nesterov [35] and the accelerated randomized Kaczmarz algorithm of Wright et al. [20], as they are more general in the context of acceleration. We have,

[TABLE]

Simplifying (40) we get,

[TABLE]

Here, we used the inequality $B_{k+1}\geq B_{k}$, which simplifies to:

[TABLE]

Similarly, note that,

[TABLE]

The above equation simplifies to the following:

[TABLE]

In this case, we used the inequality $A_{k+1}\geq A_{k}$, which leads to the following bound:

[TABLE]

By combining the two expressions (41) and (42) into a linear system, we get,

[TABLE]

The Jordan decomposition of the matrix in the above expression is given by,

[TABLE]

Here, $\sigma_{1}=1+\frac{\sqrt{\lambda\beta\zeta}}{2m}$ and $\sigma_{2}=1-\frac{\sqrt{\lambda\beta\zeta}}{2m}$ . Using $A_{0}=0$ and the decomposition of (44), from equation (43) we have,

[TABLE]

The above gives us the following growth bound for the sequences $\{A_{k}\}$ and $\{B_{k}\}$ as follows:

[TABLE]

Substituting these above bounds of (45) and (46) in equation (39), we get the following bounds:

[TABLE]

The above equations complete the proof of Theorem 1. ∎

4 Numerical Experiments

We implemented the ASKM algorithm in MATLAB and performed the numerical experiments on a Dell Precision 7510 workstation with 32GB RAM and an Intel Core i7-6820HQ CPU running at 2.70 GHz. We divided the numerical experiments into three categories: experiments on randomly generated problems, experiments on real-world non-random problems, and comparisons among different methods. In these experiments, we compared ASKM with SKM and other state-of-the-art methods (i.e., IPM and ASM). As mentioned earlier, our main focus is on the over-determined systems regime (i.e., $m\gg n$), where iterative methods are applied in general. For all of the experiments, we ran the algorithms 10 times and report the averaged performance.

4.1 Comparison of SKM and ASKM on random data

We considered systems $Ax\leq b$ where the entries of $A$ and $b$ are chosen randomly from the corresponding distribution. To make sure that $b\in\mathcal{R}(\mathbf{A})$ , we generated two vectors $x_{1},x_{2}\in\mathbb{R}^{n}$ at random from the corresponding distributions, multiplied them by $A$ , and set $b$ as a convex combination of the two resulting products. We considered two types of random data sets: highly correlated systems and Gaussian systems. In the highly correlated systems, the entries of $A$ are chosen uniformly at random from $[0.9,1.0]$ and $b$ is chosen accordingly such that the system $Ax\leq b$ has a feasible solution. The entries of $A$ in the Gaussian systems are chosen from the standard normal distribution and $b$ is chosen as before.
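The construction above can be sketched in a few lines. This is a minimal illustration, not the authors' MATLAB code: the function name `make_feasible_system`, the seed handling, the uniform draw of the convex weight `t`, and the choice to draw $x_{1},x_{2}$ from the same distribution as the entries of $A$ are all assumptions made for the example.

```python
import random


def make_feasible_system(m, n, kind="gaussian", seed=0):
    """Build a random system Ax <= b that is feasible by construction.

    kind="gaussian":   entries of A drawn from the standard normal distribution.
    kind="correlated": entries of A drawn uniformly from [0.9, 1.0].
    (Hypothetical helper; names and defaults are illustrative only.)
    """
    rng = random.Random(seed)
    if kind == "gaussian":
        draw = lambda: rng.gauss(0.0, 1.0)
    elif kind == "correlated":
        draw = lambda: rng.uniform(0.9, 1.0)
    else:
        raise ValueError("kind must be 'gaussian' or 'correlated'")

    A = [[draw() for _ in range(n)] for _ in range(m)]
    x1 = [draw() for _ in range(n)]
    x2 = [draw() for _ in range(n)]

    # b is a convex combination of A*x1 and A*x2 (weight t is an assumption).
    t = rng.random()
    Ax1 = [sum(a * x for a, x in zip(row, x1)) for row in A]
    Ax2 = [sum(a * x for a, x in zip(row, x2)) for row in A]
    b = [t * p + (1.0 - t) * q for p, q in zip(Ax1, Ax2)]

    # By linearity, x_star = t*x1 + (1-t)*x2 satisfies A*x_star = b,
    # so it is a feasible point of Ax <= b.
    x_star = [t * p + (1.0 - t) * q for p, q in zip(x1, x2)]
    return A, b, x_star
```

Returning `x_star` is only a convenience for checking feasibility; a solver would of course not be given this point.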

In Figure 1, we provide a comparison between SKM and ASKM for three randomly generated correlated systems. We compare the average computational time that SKM and ASKM need, for several choices of sample size $\beta$ , to reach a positive residual error of $10^{-5}$ (i.e., $\|\left(Ax_{k}-b\right)^{+}\|_{2}\leq 10^{-5}$ ). We compare the two algorithms for the choice $\delta=1$ (here $\delta$ is the same as $\lambda$ in De Loera et al. [9]). For the three test cases, we see that for any $1\leq\beta\leq m$ , ASKM significantly outperforms SKM in terms of average computation time.
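To make the stopping rule concrete, the sketch below implements the positive residual $\|(Ax-b)^{+}\|_{2}$ together with the baseline SKM iteration as it is usually stated (sample $\beta$ rows uniformly, pick the most violated one, project with relaxation $\delta$). It does not implement the accelerated ASKM update. The function names, the uniform sampling without replacement, and the full residual check at every iteration are assumptions chosen for clarity rather than speed.

```python
import math
import random


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def positive_residual(A, b, x):
    """Return ||(Ax - b)^+||_2, the norm of the violated part of Ax <= b."""
    return math.sqrt(sum(max(_dot(row, x) - bi, 0.0) ** 2 for row, bi in zip(A, b)))


def skm(A, b, x0, beta, delta=1.0, tol=1e-5, max_iter=100000, seed=0):
    """Baseline SKM sketch (not ASKM). Returns (x, iterations used)."""
    rng = random.Random(seed)
    m = len(A)
    x = list(x0)
    for k in range(max_iter):
        # Stopping rule from the text: positive residual below the tolerance.
        if positive_residual(A, b, x) <= tol:
            return x, k
        # Sample beta rows uniformly at random (without replacement here).
        sample = rng.sample(range(m), beta)
        # Among the sampled rows, find the most violated constraint.
        best_i, best_v = None, 0.0
        for i in sample:
            v = _dot(A[i], x) - b[i]
            if v > best_v:
                best_i, best_v = i, v
        # Relaxed projection onto that half-space; no move if none is violated.
        if best_i is not None:
            row = A[best_i]
            step = delta * best_v / _dot(row, row)
            x = [xi - step * a for xi, a in zip(x, row)]
    return x, max_iter
```

For example, on the box $-1\leq x_{1},x_{2}\leq 1$ written as four inequalities, starting from $(5,-7)$ with $\beta=4$ and $\delta=1$, two projections land on the corner $(1,-1)$.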

In Figure 2, we show the same comparison experiments for randomly generated Gaussian systems. Similar to the correlated systems, the ASKM algorithm solves the Gaussian systems much faster than the SKM algorithm. Notice that in Figure 2(b) the computational time of the SKM algorithm stays at 1000 seconds for sample sizes $\beta\geq 2000$ . This happens due to an additional terminating condition of a maximum run time set at 1000 seconds. While the SKM algorithm fails to converge within the time limit for larger sample sizes ( $\beta\geq 2000$ ), the ASKM algorithm finds a feasible solution for any sample size. Moreover, if we analyze the trend of ASKM's average computational time in both figures (Figures 1 and 2), we see that ASKM accelerates the SKM algorithm and that the nature of the acceleration is quadratic, which validates our claim of Theorem 1.

In Figures 3 and 4, we compare the positive residual error for SKM and ASKM for different sample sizes ( $\beta=1,100,1000$ ). We plot iteration versus residual error and time versus residual error for random Gaussian systems. Based on the findings of Figures 3 and 4, we can conclude that irrespective of sample size selection, $\|(Ax_{k}-b)^{+}\|_{2}$ converges to zero much faster for ASKM than for SKM. The convergence of $\|(Ax_{k}-b)^{+}\|_{2}$ for both ASKM and SKM is much slower for the choice of $\beta=1$ , as expected.

For $\beta=100$ and $\beta=1000$ , the convergence rate of ASKM overtakes that of SKM after a small amount of time. In addition, the convergence rate remains similar for both test problems ( $5000\times 1000$ and $8000\times 2000$ ). To investigate the solution quality of both SKM and ASKM, we measure the number of satisfied constraints after each iteration and the corresponding computational time for both algorithms. We summarize our findings in Figures 5 and 6 for the above test sets. For simplicity, we used the fraction of satisfied constraints (FSC) as a measure of the quality of the solution generated by the SKM and ASKM algorithms. After analyzing Figures 5 and 6, we can conclude that $\beta=1$ is the worst choice, as both SKM and ASKM take much more time to satisfy all the constraints. However, for $\beta=100$ and $\beta=1000$ , ASKM takes much less time than SKM to find a solution within the error margin. For example, in Figure 5, for the choice $\beta=1000$ , ASKM takes approximately 37 seconds to satisfy all 5000 constraints whereas SKM takes up to 75 seconds.
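The FSC measure used above is simply the share of rows of $Ax\leq b$ that the current iterate satisfies. A minimal sketch follows; the function name `fraction_satisfied` and the optional tolerance `tol` are assumptions, since the text does not state whether a numerical tolerance was applied.

```python
def fraction_satisfied(A, b, x, tol=0.0):
    """Fraction of constraints a_i^T x <= b_i (+ tol) satisfied by x."""
    m = len(A)
    satisfied = 0
    for row, bi in zip(A, b):
        value = sum(a * xi for a, xi in zip(row, x))
        if value <= bi + tol:
            satisfied += 1
    return satisfied / m
```

An FSC of 1.0 means every constraint holds, which corresponds to the point at which the curves in Figures 5 and 6 reach their maximum.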

4.2 Comparison of SKM and ASKM for real-world non-random data

In this subsection, we consider two real-world non-random problems. We consider Support Vector Machine (SVM) instances with linear classification and feasibility problems arising in benchmark libraries. We considered the standard test cases given in [37, 38, 9].

We compare the SKM and ASKM methods on the linear classification problem with SVM for 1) the Wisconsin (diagnostic) breast cancer data set and 2) the credit card data set. The breast cancer data set includes data points whose features are computed from digitized images. Each data point is classified either as malignant or as benign. Our goal is to find a solution of the homogeneous system of inequalities $Ax\leq 0$ , which represents the separating hyperplane between malignant and benign data points. The system of inequalities has 569 constraints (data points) and 30 variables (features). Since the data set is not separable, we set SKM and ASKM to find the minimized residual norm $\|Ax_{k}\|_{2}$ . For our setup, we consider the thresholds $\|Ax_{k}\|_{2}\leq 0.0005$ and $10^{-6}$ .

The credit card data set described in [39, 9] consists of features describing the payment profile of a user and a binary variable for on-time or default payment in a certain billing cycle. Similar to the breast cancer data set, this problem can be solved by finding a solution to the corresponding homogeneous system of inequalities $Ax\leq 0$ , which represents the separating hyperplane between the given on-time and default data points. The resulting system of inequalities has 30000 constraints (30000 user profiles) and 23 variables (22 profile features). Since the data set is not separable, we set SKM and ASKM to find the minimized residual norm $\|Ax_{k}\|_{2}$ . For our setup, we considered the thresholds $\frac{\|Ax_{k}\|_{2}}{\|Ax_{0}\|_{2}}\leq 0.1$ and $0.001$ .

Based on the comparison graphs shown in Figure 7, we can conclude that ASKM performs much better than SKM for the breast cancer data set (Figure 7(a)). For the credit card data set, ASKM performs marginally better than SKM for the smaller error threshold. Also note that the computation time curve for the credit card data is not as smooth as the previous curves, which we can attribute to the irregularity of the coefficients. Such irregularity in the coefficients creates a dependence bias between the residual error and the actual constraints.

4.3 Comparison among SKM, ASKM and existing methods for Netlib LP

In this subsection, we compare the performance of the proposed ASKM algorithm with SKM and benchmark algorithms such as IPM and ASM on several Netlib LPs. For the implementation of SKM and ASKM on the Netlib LPs, we follow the framework given by De Loera et al. [9]. Each of these problems was formulated as a standard LP problem ( $\min c^{T}x$ subject to $Ax=b,\ l\leq x\leq u$ with optimum value $p^{*}$ ). De Loera et al. [9] transformed them into an equivalent LF problem $\bar{A}x\leq\bar{b}$ , where $\bar{A}=[A\ -A\ I\ -I\ c^{T}]^{T}$ and $\bar{b}=[b\ -b\ u\ -l\ p^{*}]^{T}$ . We used this setup for all the experiments on Netlib LPs.
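The LP-to-LF transformation above amounts to stacking five blocks of rows. The sketch below builds $\bar{A}$ and $\bar{b}$ from dense Python lists; the function name `lp_to_lf` is hypothetical, and the code assumes all bounds $l,u$ are finite and that $p^{*}$ is known in advance, as in the setup described in the text.

```python
def lp_to_lf(A, b, c, l, u, p_star):
    """Stack the LP data into the feasibility system A_bar x <= b_bar.

    Blocks, in order:  A x <= b,  -A x <= -b,  x <= u,  -x <= -l,  c^T x <= p*.
    Assumes finite bounds l, u (an assumption of this sketch).
    """
    n = len(c)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    neg_identity = [[-v for v in row] for row in identity]

    A_bar = (
        [list(row) for row in A]
        + [[-a for a in row] for row in A]
        + identity
        + neg_identity
        + [list(c)]
    )
    b_bar = list(b) + [-bi for bi in b] + list(u) + [-li for li in l] + [p_star]
    return A_bar, b_bar
```

As a small check, for $\min x_{1}+2x_{2}$ subject to $x_{1}+x_{2}=1,\ 0\leq x\leq 1$ (optimum $p^{*}=1$ at $x=(1,0)$ ), the resulting system has 7 rows and is satisfied by $(1,0)$ but not by $(0,1)$ .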

In Table 1, we provide the performance (computation time in seconds) of ASKM, SKM, IPM and ASM on the Netlib LPs. For a fair comparison, we coded ASKM and SKM in MATLAB and compared them with the MATLAB Optimization Toolbox function fmincon. Note that fmincon allows us to select both the IPM and ASM methods.

First, we solved the feasibility problem ( $\bar{A}x\leq\bar{b}$ ) using SKM and ASKM and recorded the CPU time in Table 1. We did not solve the feasibility problem ( $\min 0\ \text{s.t.}\ \bar{A}x\leq\bar{b}$ ) directly using fmincon's IPM and ASM algorithms, since both of these methods fail on it: in IPM, the Karush-Kuhn-Tucker (KKT) system becomes singular in each iteration, and ASM halts in the initial step of finding a feasible point.

For fairness of comparison, in Table 1, we list the CPU time as follows: for the SKM and ASKM methods we used the feasibility problem ( $\bar{A}x\leq\bar{b}$ ) and for the fmincon algorithms we used the original optimization LPs ( $\min c^{T}x\ \text{s.t.}\ Ax\leq b,\ l\leq x\leq u$ ). As noted in [9], this is not an obvious comparison. For a better comparison, following [9], we set the halting criterion for SKM and ASKM as $\frac{\max(\bar{A}x_{k}-\bar{b})}{\max(\bar{A}x_{0}-\bar{b})}\leq\epsilon$ and the halting criteria for the fmincon algorithms as $\frac{\max(Ax_{k}-b,l-x_{k},x_{k}-u)}{\max(Ax_{0}-b,l-x_{0},x_{0}-u)}\leq\epsilon$ and $\frac{c^{T}x_{k}}{c^{T}x_{0}}\leq\epsilon$ , where $\epsilon$ is listed in Table 1. For each problem, every method started from the same initial solution far from the feasible region.
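The relative halting criterion for SKM and ASKM compares the largest constraint violation at the current iterate with the one at the starting point. A minimal sketch is given below; the function names are hypothetical, and the early return for an already feasible $x_{0}$ is an assumption added to avoid dividing by a non-positive number (the text states that the initial point is far from the feasible region, so this case does not arise there).

```python
def max_violation(A_bar, b_bar, x):
    """Return max_i (a_i^T x - b_i); positive means some constraint is violated."""
    return max(
        sum(a * xi for a, xi in zip(row, x)) - bi
        for row, bi in zip(A_bar, b_bar)
    )


def should_halt(A_bar, b_bar, x_k, x_0, eps):
    """Relative criterion max(A_bar x_k - b_bar) / max(A_bar x_0 - b_bar) <= eps."""
    initial = max_violation(A_bar, b_bar, x_0)
    if initial <= 0.0:
        # Assumption of this sketch: x_0 is already feasible, nothing to do.
        return True
    return max_violation(A_bar, b_bar, x_k) / initial <= eps
```

Because the criterion is relative, the same $\epsilon$ corresponds to different absolute accuracies on problems whose starting violations differ.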

The experiments show that the proposed ASKM method compares favorably with the IPM and ASM methods. Notice that the improvement of ASKM over SKM is marginal for some problems, as the analyzed instances contain sparse matrices while our proposed ASKM is explicitly designed for dense problems. Following the method provided by Liu and Wright [20], we believe one can develop a sparse version of the ASKM algorithm. The trick is to aggregate several steps and exploit the sparsity of the instances to reduce the amount of calculation. For example, after calculating $x_{k},y_{k}$ and $v_{k}$ , instead of updating to $x_{k+1},y_{k+1}$ and $v_{k+1}$ , for $T\gg 1$ we can update directly to $x_{k+T},y_{k+T}$ and $v_{k+T}$ using the recurrence relation, which will reduce the computational effort significantly.

5 Conclusion

In this work, we have proposed an accelerated version of the SKM algorithm for solving the LF problem using the celebrated Nesterov acceleration of the Gradient Descent method. The proposed algorithm also generalizes the accelerated randomized Kaczmarz algorithm for solving LS problems in the context of sample size $\beta$ . We have performed a series of numerical experiments to show the performance and effectiveness of our proposed algorithm in comparison with the IPM and ASM methods. The ASKM algorithm performs favourably in comparison with the original SKM method and the IPM and ASM methods for a wide range of test instances. The proposed algorithm as it is, including the convergence analysis, can be adopted effectively for both dense and sparse systems. However, we believe a more efficient algorithm is possible for the sparse case. In the future, we plan to extend this work to solve large-scale real-world problems with greater sparsity in the constraint matrix. Furthermore, due to the introduction of acceleration to the SKM algorithm, we have a set of parameters (i.e., $\beta,d$ etc.) which we plan to optimize based on the problem structure to further improve the efficiency of the proposed algorithm.

Bibliography

[1] Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, Apr 2008.
[2] Dennis Leventhal and Adrian S. Lewis. Randomized methods for linear constraints: Convergence rates and conditioning. Mathematics of Operations Research, 35(3):641–654, 2010.
[3] Deanna Needell. Randomized Kaczmarz solver for noisy linear systems. BIT Numerical Mathematics, 50(2):395–403, Jun 2010.
[4] Petros Drineas, Michael W. Mahoney, Shan Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, Feb 2011.
[5] Anastasios Zouzias and Nikolaos M. Freris. Randomized extended Kaczmarz for solving least squares. SIAM Journal on Matrix Analysis and Applications, 34(2):773–793, 2013.
[6] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, FOCS '13, pages 147–156, Washington, DC, USA, 2013. IEEE Computer Society.
[7] Anna Ma, Deanna Needell, and Aaditya Ramdas. Convergence properties of the randomized extended Gauss-Seidel and Kaczmarz methods. SIAM Journal on Matrix Analysis and Applications, 36(4):1590–1604, Jan 2015.
[8] Zheng Qu, Peter Richtarik, Martin Takac, and Olivier Fercoq. SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1823–1832, New York, USA, 20–22 Jun 2016. PMLR.
