Randomized Kaczmarz in Adversarial Distributed Setting

Longxiu Huang; Xia Li; Deanna Needell

arXiv:2302.14615·math.OC·March 14, 2024

TL;DR

This paper introduces an adversary-tolerant distributed optimization method based on randomized Kaczmarz, demonstrating its effectiveness in convex problems with adversarial workers through simulations.

Contribution

It proposes a novel iterative approach that ensures convergence and robustness in distributed convex optimization under adversarial conditions.

Findings

01. Method converges despite adversarial workers

02. High accuracy in identifying adversarial workers

03. Effective in various adversary rate scenarios

Abstract

Developing large-scale distributed methods that are robust to the presence of adversarial or corrupted workers is an important part of making such methods practical for real-world problems. In this paper, we propose an iterative approach that is adversary-tolerant for convex optimization problems. By leveraging simple statistics, our method ensures convergence and is capable of adapting to adversarial distributions. Through simulations, we demonstrate the efficiency of our approach for solving convex problems in the presence of adversaries, its ability to identify adversarial workers with high accuracy, and its tolerance of varying adversary rates.

Tables (6)

Table 1: Notation table

$A$	Data matrix $A$ , $A \in ℝ^{d_{1} \times d_{2}}$
$\tilde{A}$	Row normalized version of matrix $A$
$N$	Number of workers in total
$N_{r}$	Number of workers holding row $r$
$n_{r}$	Number of workers chosen for row $r$
$C_{ℓ}$	$ℓ$ -th error category
$k$	Number of error categories in total
$e_{r, ℓ}$	Error of the $ℓ$ -th error category for a row $r$
$e_{r}$	Vector form of errors in all error categories of row $r$ , $e_{r} ≔ {(e_{r, 1}, \dots, e_{r, k})}^{⊤}$
$e$	Matrix form of errors in all error categories of all rows, $e ≔ {(e_{r, ℓ})}_{r, ℓ}$
$d_{0}$	Number of chosen rows per iteration
$p_{r, ℓ}$	Fraction of workers holding row $r$ in category $ℓ$
$\hat{q}_{\mathrm{mode}}^{ℓ, r}$	Probability that there is a mode among the outputs of chosen workers of row $r$ and the mode is in the category $ℓ$ (see Lemma 3.2)
$q^{r}$	Probability that there is a mode among the outputs of chosen workers for row $r$ (see Lemma 3.2)
$[d_{1}]$	Set of the integers from 1 to $d_{1}$, $[d_{1}] ≔ \{1, \dots, d_{1}\}$
$\binom{[d_{1}]}{d_{0}}$	Collection of the subsets of $[d_{1}]$ with $d_{0}$ elements
$\mathrm{unif}(\binom{[d_{1}]}{d_{0}})$	Uniform sampling from the collection $\binom{[d_{1}]}{d_{0}}$
$τ_{i}$	Index set of chosen rows at the $i$-th iteration, $|τ_{i}| = d_{0}$
$τ_{i}^{'}$	Index set of chosen rows that have a mode, $τ_{i}^{'} \subset τ_{i}$
$t_{i}$	Row index that has the largest mode number, $t_{i} = t (x_{i - 1}, τ_{i})$
$ℙ(r\text{ mode}, g, ℓ)$	Probability that the mode is in the category $ℓ$ with mode number $g$ for row $r$ (see Lemma 3.1)
$ℙ(t_{i}, ℓ, g \mid τ_{i}, x_{i - 1})$	Probability that a mode is from row $t_{i}$ among rows $τ_{i}$ and the mode is in category $C_{ℓ}$ with mode number $g$, given the previous estimate $x_{i - 1}$. It is also denoted by $ℙ(t_{i}, ℓ, g)$
$ℙ(t_{i}, g)$	Probability that the mode is from row $t_{i}$ among rows $τ_{i}$ with mode number $g$, given the previous estimate $x_{i - 1}$ (see Corollary 3.4)

Table 2: Total number of workers $N_{r}\equiv 10$, number of error categories $k=3$, the adversarial rate $p_{r}=p/k$.

$p$	$n_{r}$	$d_{0}$	$Q$	$β_{t}$
$0.6$	5	2	$4.25 \times 10^{- 3}$	$8.5 \times 10^{- 4}$
	5	3	$3.3 \times 10^{- 4}$	$9.9 \times 10^{- 5}$
	5	5	$3.63 \times 10^{- 6}$	$1.82 \times 10^{- 6}$

Table 3: Conditional probability of being in the block-list.

$S$	$5$	$10$	$50$	$100$
$ℙ_{bl}^{1}$	0.403	$0.452$	$0.5$	$0.5$
$ℙ_{bl}^{0}$	$0.065$	$0.032$	$\sim 0$	$\sim 0$

Table 4: The accuracy of recognizing the block-list with different updating cycles $S$, fixing $k=3$, $N_{r}=20$, and $n_{r}=4$.

$S$	$200$	$500$	$1000$	$2000$
$p = 0.6, d_{0} = 8$	$0.75$	$0.792$	$0.875$	$0.875$
$p = 0.4, d_{0} = 6$	$0.75$	$0.9375$	$1$	$1$

Table 5: Total number of workers $N=100$, number of chosen workers $n=5$.

$p$	$k$	$\hat{q}_{\mathrm{mode}}^{ℓ}$	$\hat{q}_{\mathrm{mode}}^{0}$	$q$	$q_{0}$
$0.8$	$5$	$0.1$	$0.16$	$0.67$	$0.15$
	$10$	$0.04$	$0.21$	$0.57$	$0.36$
	$15$	$0.02$	$0.23$	$0.48$	$0.46$
$0.2$	$3$	$0.002$	$0.63$	$0.64$	$0.98$
	$5$	$8 \times 10^{- 4}$	$0.65$	$0.65$	$0.99$
	$10$	$2 \times 10^{- 4}$	$0.66$	$0.67$	$0.99$
	$15$	$2 \times 10^{- 4}$	$0.685$	$0.689$	$0.99$

Table 6: Total number of workers $N=100$, number of error categories $k=5$.

$p$	$n$	$\hat{q}_{\mathrm{mode}}^{ℓ}$	$\hat{q}_{\mathrm{mode}}^{0}$	$q$	$q_{0}$
$0.8$	10	0.099	0.18	0.67	0.26
	15	0.099	0.2	0.7	0.29
	20	0.097	0.23	0.71	0.31
$0.2$	10	$7 \times 10^{- 6}$	0.904	0.90	$1 - 5 \times 10^{- 6}$
	15	$5 \times 10^{- 7}$	0.97	0.97	$1 - 3 \times 10^{- 6}$
	20	$1 \times 10^{- 7}$	0.99	0.99	$1 - 6 \times 10^{- 7}$

Equations

$$\min_{x\in\mathbb{R}^{d_{2}}}F(x)=\sum_{i=1}^{d_{1}}f_{i}(x)$$

$$x_{j+1}=x_{j}+\gamma_{j}\sum_{i=1}^{d_{1}}\nabla f_{i}(x_{j})$$

$$\prod_{\ell^{\prime}=0,\,\ell^{\prime}\neq\ell}^{k}\left(\sum_{j=0}^{g-1}\binom{N_{r}p_{r,\ell^{\prime}}}{j}x^{j}\right).$$

$$\prod_{\ell=0}^{k}\left(\sum_{j=0}^{g-1}\binom{N_{r}p_{r,\ell}}{j}x^{j}\right)$$

$$\hat{q}_{\mathrm{mode}}^{\ell,r}=\sum_{g=g_{0}(r)}^{n_{r}}\frac{\binom{N_{r}p_{r,\ell}}{g}a^{r}_{g,\ell}}{\binom{N_{r}}{n_{r}}},$$

$$q_{g}^{r}=\sum_{\ell=0}^{k}\frac{\binom{N_{r}p_{r,\ell}}{g}a^{r}_{g,\ell}}{\binom{N_{r}}{n_{r}}}.$$

$$q^{r}=\sum_{\ell=0}^{k}\hat{q}_{\mathrm{mode}}^{\ell,r}=\sum_{g=g_{0}(r)}^{n_{r}}\sum_{\ell=0}^{k}\frac{\binom{N_{r}p_{r,\ell}}{g}a^{r}_{g,\ell}}{\binom{N_{r}}{n_{r}}},$$

$$\mathbb{P}(t_{i},\ell,g)=\frac{\binom{N_{t_{i}}p_{t_{i},\ell}}{g}a^{t_{i}}_{g,\ell}}{\binom{N_{t_{i}}}{n_{t_{i}}}}\prod_{s\in\tau_{i}\setminus\{t_{i}\}}\frac{b^{s}_{g}}{\binom{N_{s}}{n_{s}}}.$$

$$\alpha=1-Q_{\min}\frac{d_{0}}{d_{1}}\sigma_{\min}^{2}(\tilde{A}),$$

$$Q_{\min}=\min_{g,t_{i},\tau_{i}}\sum_{g=g_{0}(t_{i})}^{n_{t_{i}}}q^{t_{i}}_{g}\prod_{s\in\tau_{i}\setminus\{t_{i}\}}\frac{b^{s}_{g}}{\binom{N_{s}}{n_{s}}},$$

$$Q_{\max}(t,g,\tau_{i})=\max_{\ell}\frac{\binom{N_{t}p_{t,\ell}}{g}a^{t}_{g,\ell}}{\binom{N_{t}}{n_{t}}}\prod_{s\in\tau_{i}}\frac{b^{s}_{g}}{\binom{N_{s}}{n_{s}}},$$

$$\|\tilde{e}_{t}\|_{2}^{2}=\sum_{\ell=0}^{k}\tilde{e}_{t,\ell}^{2},\qquad\tilde{e}_{t,\ell}^{2}=\frac{e_{t,\ell}^{2}}{\|A_{t}\|_{2}^{2}},$$

$$\beta_{t}=\sum_{\tilde{\tau}_{t,i}\in\binom{[d_{1}-1]}{d_{0}-1}}\sum_{g=g_{0}(t)}^{n_{t}}\frac{1}{\binom{d_{1}}{d_{0}}}Q_{\max}(t,g,\tilde{\tau}_{t,i}),$$

$$x_{i}=x_{i-1}-\frac{\langle A_{t_{i}},x_{i-1}\rangle-(b_{t_{i}}+e_{t_{i},\ell})}{\|A_{t_{i}}\|^{2}}A_{t_{i}}^{\top},$$

$$\begin{aligned}
\|x_{i}-x^{*}\|_{2}^{2}
&=\|x_{i-1}-x^{*}\|_{2}^{2}+\frac{\left(\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle-e_{t_{i},\ell_{t_{i}}}\right)^{2}}{\|A_{t_{i}}\|_{2}^{2}}\\
&\quad-\frac{2}{\|A_{t_{i}}\|_{2}^{2}}\langle x_{i-1}-x^{*},A_{t_{i}}^{T}\rangle\left(\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle-e_{t_{i},\ell_{t_{i}}}\right)\\
&=\|x_{i-1}-x^{*}\|_{2}^{2}-\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}+\frac{e_{t_{i},\ell_{t_{i}}}^{2}}{\|A_{t_{i}}\|_{2}^{2}}
\end{aligned}$$

$$\mathbb{E}\|x_{i}-x^{*}\|_{2}^{2}=\|x_{i-1}-x^{*}\|^{2}+\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\mathbb{E}_{\ell_{t_{i}}}\frac{e_{t_{i},\ell_{t_{i}}}^{2}}{\|A_{t_{i}}\|_{2}^{2}}-\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}.$$

$$\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}\geq Q_{\min}\sigma_{\min}^{2}(\tilde{A})\frac{d_{0}}{d_{1}}\|x_{i-1}-x^{*}\|^{2},$$

$$\|x_{i-1}-x^{*}\|^{2}-\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}\leq\left(1-Q_{\min}\sigma_{\min}^{2}(\tilde{A})\frac{d_{0}}{d_{1}}\right)\|x_{i-1}-x^{*}\|^{2},$$

$$\begin{aligned}
\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}
&\geq Q_{\min}\cdot\frac{d_{0}!\,(d_{1}-d_{0})!}{d_{1}!}\cdot\binom{d_{1}-1}{d_{0}-1}\cdot\|\tilde{A}(x_{i-1}-x^{*})\|_{2}^{2}\\
&\geq Q_{\min}\cdot\frac{d_{0}}{d_{1}}\cdot\sigma_{\min}^{2}(\tilde{A})\|x_{i-1}-x^{*}\|_{2}^{2},
\end{aligned}$$

$$\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\mathbb{E}_{\ell}\frac{e_{t_{i},\ell}^{2}}{\|A_{t_{i}}\|_{2}^{2}}\leq\sum_{t_{i}\in[d_{1}]}\beta_{t_{i}}\|\tilde{e}_{t_{i}}\|_{2}^{2},$$

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Distributed Control Multi-Agent Systems

Full text

Distributed Randomized Kaczmarz for the Adversarial Workers

Longxiu Huang, Department of Computational Mathematics, Science and Engineering and Department of Mathematics, Michigan State University, MI.

Xia Li, Microsoft, WA (corresponding author).

Deanna Needell, Department of Mathematics, University of California Los Angeles, CA.

1 Introduction

As machine-learning algorithms gain popularity in industrial applications, it is critical to make them and their optimization subroutines robust and adversary-tolerant. Attacks on such systems can take various forms, including evasion [9], data poisoning [8], and model extraction [26, 16]. In large-scale machine learning problems, which are often run on distributed systems, attacks can come in the form of Byzantine attacks [17], where individual computing units, also known as ‘worker machines’ or simply ‘workers’, may produce adversarial results. A commonly used approach to mitigate such attacks is to use redundancy; that is, to request the same computation from multiple workers. The main challenge with such an approach is how to leverage the outputs from these workers efficiently, and in such a way that even seemingly catastrophic adversarial outputs can be identified and tolerated. We consider optimization problems of the following form:

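$$\min_{x\in\mathbb{R}^{d_{2}}}F(x)=\sum_{i=1}^{d_{1}}f_{i}(x),\tag{1}$$

where $d_{1}$ is a positive integer. To solve the problem iteratively, we use the gradient descent method to update the estimate:

$$x_{j+1}=x_{j}+\gamma_{j}\sum_{i=1}^{d_{1}}\nabla f_{i}(x_{j}),\tag{2}$$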

with some step-size $\gamma_{j}$. Such objective functions lend themselves naturally to distributed algorithms. In the distributed setting, the central worker distributes the $f_{i}$ among the workers. Each worker returns the corresponding gradient $\nabla f_{i}(x_{j})$, and the central worker aggregates those returns to compute or approximate the updating step (2). In particular, we illustrate our method on solving an over-determined linear system $Ax=b$; however, the algorithms can be easily adapted for (1). The linear system can be modeled as a least squares problem $\min_{x}\|Ax-b\|_{2}^{2}$, and the least squares problem can be rewritten in the form of (1) with $f_{i}(x_{j})=\frac{1}{2}(A_{i}x_{j}-b_{i})^{2}$, where $A\in\mathbb{R}^{d_{1}\times d_{2}}$, $b\in\mathbb{R}^{d_{1}}$, $A_{i}$ is the $i$-th row of $A$, and $b_{i}$ is the $i$-th component of $b$. The central worker partitions the data matrix $A$ into rows $A_{i}$, and the rows are distributed among the workers. In the linear setting, each worker only needs to return the scalar $A_{i}x_{j}-b_{i}$ instead of the gradient $(A_{i}x_{j}-b_{i})A_{i}^{\top}$. Then the central worker aggregates those returns and approximates the updates in (2).
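As a minimal numerical sketch of this decomposition (the random data and all variable names are illustrative assumptions, not taken from the paper), the per-row scalars $A_{i}x-b_{i}$ are enough for the central worker to assemble the full gradient of $\frac{1}{2}\|Ax-b\|_{2}^{2}$:

```python
import numpy as np

rng = np.random.default_rng(0)          # illustrative random data (assumption)
d1, d2 = 8, 3
A = rng.standard_normal((d1, d2))
b = rng.standard_normal(d1)
x = rng.standard_normal(d2)

# Each "worker" i returns only the scalar residual A_i x - b_i.
scalars = np.array([A[i] @ x - b[i] for i in range(d1)])

# The central worker rebuilds sum_i grad f_i(x) = sum_i (A_i x - b_i) A_i^T.
grad_from_workers = sum(scalars[i] * A[i] for i in range(d1))

# Direct gradient of F(x) = 1/2 * ||Ax - b||_2^2 for comparison.
grad_direct = A.T @ (A @ x - b)
```

Sending one scalar per row instead of a length-$d_{2}$ vector is what keeps the communication cost low in the linear setting.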

In this work, we consider the setting where some of the workers are adversarial, i.e., the workers return noisy results or enormously large results. Our goal is to develop a variant of the randomized Kaczmarz (RK) method [25] for adversarial workers to solve the linear system $Ax=b$ . For readers’ convenience, we restate the RK method in Alg. 1.
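Since Alg. 1 is not reproduced in this text, the following is a minimal sketch of the standard RK iteration as described for [25], with rows sampled with probability proportional to their squared Euclidean norm. The function name, the iteration count, and the test system are assumptions made for illustration only.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=2000, seed=0):
    """Standard randomized Kaczmarz for a consistent system Ax = b."""
    rng = np.random.default_rng(seed)
    d1, d2 = A.shape
    row_norms_sq = np.sum(A ** 2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()   # P(row i) proportional to ||A_i||^2
    x = np.zeros(d2)
    for _ in range(num_iters):
        i = rng.choice(d1, p=probs)
        # Project the current iterate onto the hyperplane {x : A_i x = b_i}.
        x = x + (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

# Illustrative consistent over-determined system (assumed data).
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
x_hat = randomized_kaczmarz(A, b)
```

This sketch has no protection against corrupted right-hand sides; it only fixes the baseline that the adversary-tolerant variant modifies.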

We assume that there is one central worker $w_{c}$ and $N$ workers in total, among which a fraction $p$ of the workers, whose identities are unknown, is adversarial, and that there are $k$ error categories in total. During the initial data distribution, each row $A_{r}$ is distributed to $N_{r}$ workers. Among those $N_{r}$ workers, the workers in the $\ell$-th category $C_{\ell}$ make up a fraction $p_{r,\ell}$. We assume $C_{0}$ contains all reliable workers. The total adversarial rate for row $r$ is $p_{r}=\sum_{\ell=1}^{k}p_{r,\ell}$ and the fraction of reliable workers for row $r$ is $p_{r,0}$. We assume $p_{r,\ell}<1-p_{r}=p_{r,0}$ for all $r$ and all $\ell\neq 0$. In particular, we assume that an adversarial worker $w_{s}^{r}$ in category $C_{\ell}$ returns the residual $c_{s}^{r}=b_{r}+e_{r,\ell}-\langle x_{j},A_{r}^{T}\rangle$ with $e_{r,\ell}\in\mathbb{R}$, and a reliable worker returns $c_{s}^{r}=b_{r}-\langle x_{j},A_{r}^{T}\rangle$. Our approach utilizes simple statistics to identify and ignore adversarial results, and thus the setting in which the adversarial workers communicate and select among $k$ types of errors to output is the most challenging for our approach.

1.1 Contribution

Our key contributions are threefold: (i) we develop efficient methods and algorithms that guarantee accurate estimates of the true solution in the presence of adversarial workers; (ii) we identify the adversarial workers efficiently; (iii) we provide a theoretical convergence analysis for solving linear systems when a portion of the workers is adversarial.

1.2 Related work

**Kaczmarz method.** The Kaczmarz method is an iterative technique for solving linear systems that was first introduced in 1937 by Kaczmarz [14]. In computed tomography, the method is also referred to as the Algebraic Reconstruction Technique (ART) [11, 13, 20]. The method has a broad range of applications, from computed tomography to digital signal processing. Later, Strohmer and Vershynin proposed a randomized version of the Kaczmarz method (RK) [25] in the context of consistent linear systems. They proved that RK has an exponential bound on the expected rate of convergence when each row is selected with probability proportional to its squared Euclidean norm. The method has also been adapted to handle inconsistent linear systems [23, 24, 4, 19]. For example, Needell proved in [21] that RK converges for inconsistent linear systems to a horizon that depends on the size of the largest entry of the noise. An adaptive maximum-residual sampling strategy has also been analyzed for the inconsistent extension [23]. Additionally, RK has been studied in the context of solving systems of linear inequalities [18, 1, 3].

**Robust optimization.** In optimization problems, practical challenges often arise due to various factors such as errors in data collection and transmission, adversarial or non-responsive workers (also known as stragglers), and corruptions in modern storage systems. To address these challenges, researchers have proposed various mitigation strategies. For instance, to tackle the issue of straggling workers, several encoding schemes have been proposed in the literature: Gordon et al. [10] and Karakus et al. [15] introduced methods to embed redundancy directly in the data, while Bitar et al. [5] proposed a gradient-coding scheme for straggler mitigation when stragglers are uniformly random.

Another important branch in the analysis of SGD-type methods is to deal with robustness to adversaries from the data. Chi et al. [6] and Haddock et al. [12] designed quantile-based methods to solve corrupted linear equations. Yang et al. [27] proposed a variant of the gradient descent method based on the geometric median to deal with adversarial workers, while Alistarh et al. [2] discussed the problem of stochastic optimization in an adversarial setting where the workers sample data from a distribution and an $\alpha$ fraction of them may adversarially return any vector. However, these methods are limited to scenarios where the adversary rate is less than $\frac{1}{2}$ . Our proposed algorithm, on the other hand, can converge to the exact solution even with an adversary rate higher than $\frac{1}{2}$ by utilizing redundancy.

2 Method

In this section, we present a simple and efficient mode-based method for solving linear systems in the presence of adversarial workers, as well as identifying potential adversarial workers which may be placed in a block-list (more details will be provided later). The method detects the mode category based on the size of the returned result groups. More specifically, for each row, the central worker groups similar results and selects the result from the group with the largest size, referred to as the mode. From these modes across all selected rows, the central worker then updates the guess with the mode with the largest size. If there is only one row, the central worker updates the guess with the mode.
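The selection rule described above can be sketched as follows. This is a simplified illustration and not Alg. 2 itself: grouping "similar" results by rounding, breaking ties by first occurrence, and the helper names `row_mode` and `mode_based_step` are all assumptions of the sketch.

```python
from collections import Counter
import numpy as np

def row_mode(results, decimals=8):
    """Group near-identical scalars; return (mode_value, group_size).
    Grouping by rounding is an assumption of this sketch."""
    counts = Counter(np.round(results, decimals).tolist())
    return counts.most_common(1)[0]   # ties: first-seen value wins (assumption)

def mode_based_step(x, A, row_results):
    """row_results maps a row index r to the residuals b_r (+ e) - <x, A_r>
    returned by the workers chosen for that row."""
    modes = {r: row_mode(res) for r, res in row_results.items()}
    best_row = max(modes, key=lambda r: modes[r][1])   # row with the largest mode size
    value = modes[best_row][0]
    a = A[best_row]
    return x + value / (a @ a) * a, best_row

# Toy data (illustrative): 3 rows, consistent system A x* = b.
A = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]])
x_star = np.array([1.0, -1.0])
b = A @ x_star
x0 = np.zeros(2)

clean = b - A @ x0   # residuals that reliable workers would return
row_results = {
    0: [clean[0], clean[0], clean[0], clean[0] + 5.0],        # mode size 3
    2: [clean[2], clean[2] + 5.0, clean[2], clean[2] - 2.0],  # mode size 2
}
x1, used_row = mode_based_step(x0, A, row_results)
```

In this toy run the reliable group of row 0 is the largest, so the update is an ordinary Kaczmarz projection onto row 0 and the corrupted returns are ignored.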

Given the number of used workers $n_{r}$ for a specific row $r$, the expected number of workers from category $C_{\ell}$ is $n_{r}p_{r,\ell}$ (footnote: the central worker determines the number of different result groups during the first $m$ iterations and takes the maximum number), and the number of non-adversarial workers is $n_{r}(1-p_{r})$ with $p_{r}=\sum_{\ell=1}^{k}p_{r,\ell}$. In practice, a group with the maximum size is randomly selected and used to update the guess as long as its size is greater than $n_{r}(1-p_{r})$ (as shown in Alg. 2, Line 13). If the algorithm is implemented with a block-list, the block-list is updated through a frequency-based approach throughout the iterations: each row has a counter that records, in each iteration, whether a worker is selected but fails to be in the mode. At the end of every updating cycle of $S$ iterations, the worker with the largest count in each counter is identified as a potential adversarial worker and placed in the block-list (as shown in Alg. 2, Lines 2-20). Once a worker is in the block-list, it will not be considered in future iterations. The full details of the algorithm can be found in Alg. 2 and the related theoretical results are provided in the following section.
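The frequency-based bookkeeping can be illustrated with a small sketch for a single row. The helper names, the rounding used to compare results, and the choice to reset the counter after each cycle are assumptions; Alg. 2 may differ in these details.

```python
from collections import Counter

def record_non_mode(counter, chosen_workers, results, mode_value, decimals=8):
    """Increment the count of every chosen worker whose result is not the mode."""
    for worker, res in zip(chosen_workers, results):
        if round(res, decimals) != round(mode_value, decimals):
            counter[worker] += 1

def end_of_cycle(counter, block_list):
    """After S iterations, block the worker with the largest non-mode count."""
    if counter:
        worker, count = counter.most_common(1)[0]
        if count > 0:
            block_list.add(worker)
    counter.clear()   # resetting the counter each cycle is an assumption

# Hypothetical cycle of S = 3 iterations; worker names are illustrative.
counter, block_list = Counter(), set()
record_non_mode(counter, ["w1", "w2", "w3"], [0.5, 0.5, 9.0], mode_value=0.5)
record_non_mode(counter, ["w2", "w3", "w4"], [0.1, 9.0, 0.1], mode_value=0.1)
record_non_mode(counter, ["w1", "w3", "w4"], [0.3, 0.3, 0.3], mode_value=0.3)
end_of_cycle(counter, block_list)
```

Here `w3` disagrees with the mode twice and is the only worker placed in `block_list`; in later iterations the central worker would simply skip any worker found in that set.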

3 Theoretical results

In this section, we provide a rigorous theoretical analysis of the mode distributions and convergence behavior of our mode-based method. To simplify the presentation, we provide a summary of the key notation used in our analysis in Table 1.

3.1 Mode distribution

Algorithm 2 utilizes the mode to identify adversarial workers and achieve convergence. In this section, we discuss the calculation of the probability of a specific category $\ell$ being the mode of a given row $r$ during each iteration of the algorithm. For simplicity, let $C_{0}$ denote the category of “reliable” workers (workers that return correct results). For each row $r$, the fraction of reliable workers holding row $r$ is $p_{r,0}=1-p_{r}$. We use $d_{0}$ rows for the computation per iteration. Recall that each row $r$ is held by a fixed number $N_{r}$ of workers. Among those $N_{r}$ workers, the workers in category $\ell$ make up a fraction $p_{r,\ell}$. At each iteration, the central worker chooses a set of row indices of size $d_{0}$ uniformly at random and requests the corresponding workers to return their results. More specifically, given a set of row indices $\tau_{i}$ at the $i$-th iteration, the central worker first finds the modes among the results from each row $r\in\tau_{i}$ and, among those modes, chooses the mode with the largest group size (“the majority vote”).

For any row $r$ , let $a^{r}_{g,\ell}$ be the coefficient of the monomial $x^{n_{r}-g}$ of the polynomial

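$$\prod_{\ell^{\prime}=0,\,\ell^{\prime}\neq\ell}^{k}\left(\sum_{j=0}^{g-1}\binom{N_{r}p_{r,\ell^{\prime}}}{j}x^{j}\right).$$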

Let $b_{g}^{r}$ be the coefficient of the term $x^{n_{r}}$ of the polynomial

$$\prod_{\ell=0}^{k}\left(\sum_{j=0}^{g-1}\binom{N_{r}p_{r,\ell}}{j}x^{j}\right).$$

Lemma 3.1.

For row $r$ , the probability that the mode is in the category $\ell$ with mode number $g$ is $\mathbb{P}(r\text{ mode},g,\ell)=\frac{\binom{N_{r}p_{r,\ell}}{g}a^{r}_{g,\ell}}{\binom{N_{r}}{n_{r}}}$ .

Proof.

See Appendix C. ∎
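The formula in Lemma 3.1 can be evaluated exactly with integer polynomial arithmetic and compared against brute-force enumeration for a small case. The sketch below assumes that every $N_{r}p_{r,\ell}$ is an integer and that $g\geq 1$; the function names and the example counts are illustrative.

```python
from itertools import combinations
from math import comb

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def mode_probability(counts, n_r, ell, g):
    """P(r mode, g, ell) from Lemma 3.1, where counts[l] = N_r * p_{r,l}."""
    poly = [1]
    for l, c in enumerate(counts):
        if l != ell:
            poly = poly_mul(poly, [comb(c, j) for j in range(g)])
    deg = n_r - g
    a = poly[deg] if 0 <= deg < len(poly) else 0   # coefficient a^r_{g,ell} of x^{n_r-g}
    return comb(counts[ell], g) * a / comb(sum(counts), n_r)

def brute_force(counts, n_r, ell, g):
    """Enumerate all worker subsets: category ell has g members, all others fewer."""
    labels = [l for l, c in enumerate(counts) for _ in range(c)]
    hits = total = 0
    for subset in combinations(range(len(labels)), n_r):
        sizes = [0] * len(counts)
        for w in subset:
            sizes[labels[w]] += 1
        total += 1
        if sizes[ell] == g and all(s < g for l, s in enumerate(sizes) if l != ell):
            hits += 1
    return hits / total

counts, n_r = [4, 3, 3], 5   # 4 reliable workers, two error categories of 3 each
p_closed = mode_probability(counts, n_r, ell=0, g=3)
p_enum = brute_force(counts, n_r, ell=0, g=3)
```

The polynomial product counts the ways to fill the remaining $n_{r}-g$ slots so that every other category stays strictly below $g$, which is why the relevant coefficient is that of $x^{n_{r}-g}$.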

Using Lemma 3.1, we obtain the following conclusions by going over all possible mode numbers and all error categories.

Lemma 3.2.

For row $r$ , the probability that the category $\ell$ is the mode is

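$$\hat{q}_{\mathrm{mode}}^{\ell,r}=\sum_{g=g_{0}(r)}^{n_{r}}\frac{\binom{N_{r}p_{r,\ell}}{g}a^{r}_{g,\ell}}{\binom{N_{r}}{n_{r}}},$$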

where $g_{0}(r)=\max(\lceil\frac{n_{r}}{k+1}\rceil,\lceil n_{r}p_{r,0}\rceil)$ , and the probability that there is a mode with mode number $g$ for the calculation of row $r$ is

$$q_{g}^{r}=\sum_{\ell=0}^{k}\frac{\binom{N_{r}p_{r,\ell}}{g}a^{r}_{g,\ell}}{\binom{N_{r}}{n_{r}}}.$$

Additionally, the probability that there is a mode for the calculation of row $r$ is

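$$q^{r}=\sum_{\ell=0}^{k}\hat{q}_{\mathrm{mode}}^{\ell,r}=\sum_{g=g_{0}(r)}^{n_{r}}\sum_{\ell=0}^{k}\frac{\binom{N_{r}p_{r,\ell}}{g}a^{r}_{g,\ell}}{\binom{N_{r}}{n_{r}}},$$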

where $\binom{n}{g}=0$ if $n<g$ , for any integer $n$ .

In the following lemma, we also calculate the probability $\mathbb{P}(t_{i},\ell,g|\tau_{i},x_{i-1})$ that a mode is from row $t_{i}$ in the category $C_{\ell}$ with a mode number $g$ when rows $\tau_{i}$ are used in the computation and the previous estimate $x_{i-1}$ is given. For simplicity, we omit the conditioning on $\tau_{i},x_{i-1}$ in the notation and denote $\mathbb{P}(t_{i},\ell,g|\tau_{i},x_{i-1})$ by $\mathbb{P}(t_{i},\ell,g)$.

Lemma 3.3.

Given the previous estimate $x_{i-1}$ and row indices $\tau_{i}$ , we have

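$$\mathbb{P}(t_{i},\ell,g)=\frac{\binom{N_{t_{i}}p_{t_{i},\ell}}{g}a^{t_{i}}_{g,\ell}}{\binom{N_{t_{i}}}{n_{t_{i}}}}\prod_{s\in\tau_{i}\setminus\{t_{i}\}}\frac{b^{s}_{g}}{\binom{N_{s}}{n_{s}}}.$$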

Proof.

The probability that the mode is produced from category $\ell$ of row $t_{i}$ with the mode number $g$ can be expressed as

[TABLE]

∎

Taking the modes produced from different categories into account, we can easily obtain the following result.

Corollary 3.4.

Given the previous estimate $x_{i-1}$ and the row indices $\tau_{i}$ , the probability that a mode is from row $t_{i}$ with mode number $g$ is

[TABLE]

3.2 Convergence without block-list

In this section, our main goal is to provide a theoretical error bound for the method without the block-list (i.e., Alg. 2 without block-list). The main result of this section is presented below.

Theorem 3.5.

Let $A\in\mathbb{R}^{d_{1}\times d_{2}}$ with $d_{1}\geq d_{2}$ and $b,e_{1},\ldots,e_{k}\in\mathbb{R}^{d_{1}}$ . Assume that we solve $Ax^{*}=b$ via Alg. 2 without block-list; then

[TABLE]

where

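$$\alpha=1-Q_{\min}\frac{d_{0}}{d_{1}}\sigma_{\min}^{2}(\tilde{A}),$$

$$Q_{\min}=\min_{g,t_{i},\tau_{i}}\sum_{g=g_{0}(t_{i})}^{n_{t_{i}}}q^{t_{i}}_{g}\prod_{s\in\tau_{i}\setminus\{t_{i}\}}\frac{b^{s}_{g}}{\binom{N_{s}}{n_{s}}},$$

$$Q_{\max}(t,g,\tau_{i})=\max_{\ell}\frac{\binom{N_{t}p_{t,\ell}}{g}a^{t}_{g,\ell}}{\binom{N_{t}}{n_{t}}}\prod_{s\in\tau_{i}}\frac{b^{s}_{g}}{\binom{N_{s}}{n_{s}}},$$

$$\|\tilde{e}_{t}\|_{2}^{2}=\sum_{\ell=0}^{k}\tilde{e}_{t,\ell}^{2},\qquad\tilde{e}_{t,\ell}^{2}=\frac{e_{t,\ell}^{2}}{\|A_{t}\|_{2}^{2}},$$

$$\beta_{t}=\sum_{\tilde{\tau}_{t,i}\in\binom{[d_{1}-1]}{d_{0}-1}}\sum_{g=g_{0}(t)}^{n_{t}}\frac{1}{\binom{d_{1}}{d_{0}}}Q_{\max}(t,g,\tilde{\tau}_{t,i}),$$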

and $\tilde{A}$ is the row normalized matrix of $A$ and $\sigma_{\min}(\tilde{A})$ is $\tilde{A}$ ’s smallest singular value.

Before we prove Theorem 3.5, we let $t_{i}$ be the row selected at $i$ -th iteration to update the guess $x$ and consider solving

[TABLE]

according to some probability distribution. Thus, we have the iteration

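$$x_{i}=x_{i-1}-\frac{\langle A_{t_{i}},x_{i-1}\rangle-(b_{t_{i}}+e_{t_{i},\ell})}{\|A_{t_{i}}\|^{2}}A_{t_{i}}^{\top},$$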

at $i$ -th iteration, where $\ell\in\{0,1,\ldots,k\}$ , and $A_{t_{i}}$ is the $t_{i}$ -th row of matrix $A$ .

In the following analysis, let $\mathbb{E}_{\tau_{i}}$ denote the expectation with respect to the uniformly random sample ${\tau_{i}}$ conditioned upon the sampled $\tau_{j}$ for $j<i$ , and let $\mathbb{E}$ denote expectation with respect to all random samples ${\tau_{j}}$ for $1\leq j\leq i$ , where $i$ is the last iteration in the context in which $\mathbb{E}$ is applied.

We start our analysis by decomposing the squared error

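$$\begin{aligned}
\|x_{i}-x^{*}\|_{2}^{2}
&=\|x_{i-1}-x^{*}\|_{2}^{2}+\frac{\left(\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle-e_{t_{i},\ell_{t_{i}}}\right)^{2}}{\|A_{t_{i}}\|_{2}^{2}}\\
&\quad-\frac{2}{\|A_{t_{i}}\|_{2}^{2}}\langle x_{i-1}-x^{*},A_{t_{i}}^{T}\rangle\left(\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle-e_{t_{i},\ell_{t_{i}}}\right)\\
&=\|x_{i-1}-x^{*}\|_{2}^{2}-\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}+\frac{e_{t_{i},\ell_{t_{i}}}^{2}}{\|A_{t_{i}}\|_{2}^{2}}.
\end{aligned}$$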

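Taking the expectation of the above equation, we obtain

$$\mathbb{E}\|x_{i}-x^{*}\|_{2}^{2}=\|x_{i-1}-x^{*}\|^{2}+\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\mathbb{E}_{\ell_{t_{i}}}\frac{e_{t_{i},\ell_{t_{i}}}^{2}}{\|A_{t_{i}}\|_{2}^{2}}-\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}.$$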

Therefore, the proof of Theorem 3.5 can be divided into the computations of the conditional expectation of the squared error from the adversarial workers $\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\mathbb{E}_{\ell_{t_{i}}}\frac{e_{t_{i},\ell_{t_{i}}}^{2}}{\|A_{t_{i}}\|_{2}^{2}}$ and the residual part $\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}$ separately which are provided in the following lemmas.
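The one-step identity behind this split, $\|x_{i}-x^{*}\|_{2}^{2}=\|x_{i-1}-x^{*}\|_{2}^{2}-\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}/\|A_{t_{i}}\|_{2}^{2}+e_{t_{i},\ell}^{2}/\|A_{t_{i}}\|_{2}^{2}$, can be checked numerically. The random vectors and the error value below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d2 = 4
a = rng.standard_normal(d2)        # a row A_{t_i} (illustrative)
x_star = rng.standard_normal(d2)   # true solution x^*
x_prev = rng.standard_normal(d2)   # previous estimate x_{i-1}
b = a @ x_star                     # consistent right-hand side b_{t_i}
e = 0.7                            # error e_{t_i, ell} carried by the selected mode

# One update with the (possibly corrupted) right-hand side b + e.
x_next = x_prev - (a @ x_prev - (b + e)) / (a @ a) * a

lhs = np.sum((x_next - x_star) ** 2)
rhs = (np.sum((x_prev - x_star) ** 2)
       - (a @ (x_prev - x_star)) ** 2 / (a @ a)
       + e ** 2 / (a @ a))
```

The cross terms in $e$ cancel exactly, so the error contributes only the additive term $e^{2}/\|A_{t_{i}}\|_{2}^{2}$ while the projection term still reduces the distance to $x^{*}$.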

Lemma 3.6.

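The conditional expectation of the squared residual can be bounded from below:

$$\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}\geq Q_{\min}\sigma_{\min}^{2}(\tilde{A})\frac{d_{0}}{d_{1}}\|x_{i-1}-x^{*}\|^{2},\tag{7}$$

where $\tilde{A}$ is the row normalized version of $A$. Thus, we have

$$\|x_{i-1}-x^{*}\|^{2}-\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}\leq\left(1-Q_{\min}\sigma_{\min}^{2}(\tilde{A})\frac{d_{0}}{d_{1}}\right)\|x_{i-1}-x^{*}\|^{2},\tag{8}$$

where $Q_{\min}=\min_{g,t_{i},\tau_{i}}\sum_{g=g_{0}({t_{i}})}^{n_{t_{i}}}q^{t_{i}}_{g}\prod_{s\in\tau_{i}\setminus\{t_{i}\}}\frac{b^{s}_{g}}{\binom{N_{s}}{n_{s}}}.$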

Proof.

The expectation of the squared residual can be represented as

[TABLE]

Recall that $\tau_{i}\sim\text{unif}(\binom{[d_{1}]}{d_{0}})$ . We thus have $p_{x_{i-1}}(\tau_{i})=1/\binom{d_{1}}{d_{0}}=\frac{d_{0}!(d_{1}-d_{0})!}{d_{1}!}$ . Therefore,

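$$\begin{aligned}
\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\frac{\langle A_{t_{i}}^{T},x_{i-1}-x^{*}\rangle^{2}}{\|A_{t_{i}}\|_{2}^{2}}
&\geq Q_{\min}\cdot\frac{d_{0}!\,(d_{1}-d_{0})!}{d_{1}!}\cdot\binom{d_{1}-1}{d_{0}-1}\cdot\|\tilde{A}(x_{i-1}-x^{*})\|_{2}^{2}\\
&\geq Q_{\min}\cdot\frac{d_{0}}{d_{1}}\cdot\sigma_{\min}^{2}(\tilde{A})\|x_{i-1}-x^{*}\|_{2}^{2},
\end{aligned}$$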

i.e., (7) is derived. Hence, we also have (8). ∎
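The factor $\frac{d_{0}}{d_{1}}$ in (7) comes from two elementary facts, written out here as a worked step. The factor $\binom{d_{1}-1}{d_{0}-1}$ is read as the number of subsets in $\binom{[d_{1}]}{d_{0}}$ that contain a fixed row, and $v$ is shorthand introduced only for this sketch.

```latex
\frac{d_{0}!\,(d_{1}-d_{0})!}{d_{1}!}\binom{d_{1}-1}{d_{0}-1}
  =\frac{d_{0}!\,(d_{1}-d_{0})!}{d_{1}!}\cdot\frac{(d_{1}-1)!}{(d_{0}-1)!\,(d_{1}-d_{0})!}
  =\frac{d_{0}}{d_{1}},
\qquad
\|\tilde{A}v\|_{2}^{2}\;\geq\;\sigma_{\min}^{2}(\tilde{A})\,\|v\|_{2}^{2},
\quad v=x_{i-1}-x^{*}.
```

The second inequality is the usual singular value bound, which holds for every $v$ because $d_{1}\geq d_{2}$.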

Lemma 3.7.

The expectation of the squared error from the adversarial workers can be bounded above by $\sum\limits_{t_{i}\in[d_{1}]}\beta_{t_{i}}\|\tilde{e}_{t_{i}}\|_{2}^{2}$ , i.e.,

$$\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\mathbb{E}_{\ell}\frac{e_{t_{i},\ell}^{2}}{\|A_{t_{i}}\|_{2}^{2}}\leq\sum_{t_{i}\in[d_{1}]}\beta_{t_{i}}\|\tilde{e}_{t_{i}}\|_{2}^{2},\tag{9}$$

where $\tilde{e}_{t_{i},\ell}^{2}=\frac{e_{t_{i},\ell}^{2}}{\|A_{t_{i}}\|_{2}^{2}}$ and $\beta_{t_{i}}=\sum\limits_{\tilde{\tau}_{t_{i},i}\in\binom{[d_{1}-1]}{d_{0}-1}}\sum\limits_{g=g_{0}({t_{i}})}^{n_{t_{i}}}\frac{1}{\binom{d_{1}}{d_{0}}}Q_{\max}(t_{i},g,\tilde{\tau}_{t_{i},i})$ with

$Q_{\max}(t_{i},g,\tau_{i})=\max\limits_{\ell}\frac{\binom{N_{t_{i}}p_{t_{i},\ell}}{g}a^{t_{i}}_{g,\ell}}{\binom{N_{t_{i}}}{n_{t_{i}}}}\prod\limits_{s\in\tau_{i}}\frac{b^{s}_{g}}{\binom{N_{s}}{n_{s}}}$.

Proof.

The expectation of the squared error from the adversarial workers can be represented as $\mathbb{E}_{\tau_{i}}\mathbb{E}_{t_{i}}\mathbb{E}_{\ell}\frac{e_{t_{i},\ell}^{2}}{\|A_{t_{i}}\|_{2}^{2}}=\mathbb{E}_{\tau_{i}}\sum\limits_{{t_{i}}\in\tau_{i}}\sum\limits_{g=g_{0}({t_{i}})}^{n_{t_{i}}}\sum\limits_{\ell=0}^{k}\frac{\binom{N_{t_{i}}p_{t_{i},\ell}}{g}a^{t_{i}}_{g,\ell}}{\binom{N_{t_{i}}}{n_{t_{i}}}}\prod\limits_{s\in\tau_{i}\setminus\{{t_{i}}\}}\frac{b^{s}_{g}}{\binom{N_{s}}{n_{s}}}\frac{e_{{t_{i}},\ell}^{2}}{\|A_{{t_{i}}}\|_{2}^{2}}$ . To simplify the expressions, we let

$Q_{\max}({t_{i}},g,\tau_{i}\setminus\{{t_{i}}\})=\max\limits_{\ell}\frac{\binom{N_{t_{i}}p_{{t_{i}},\ell}}{g}a^{t_{i}}_{g,\ell}}{\binom{N_{t_{i}}}{n_{t_{i}}}}\prod\limits_{s\in\tau_{i}\setminus\{{t_{i}}\}}\frac{b^{s}_{g}}{\binom{N_{s}}{n_{s}}}$,

$\tilde{e}_{{t_{i}},\ell}^{2}=\frac{e_{{t_{i}},\ell}^{2}}{\|A_{{t_{i}}}\|_{2}^{2}}$, and $\tilde{e}_{{t_{i}}}=(\tilde{e}_{{t_{i}},0},\tilde{e}_{{t_{i}},1},\ldots,\tilde{e}_{{t_{i}},k})$. Then (9) can be achieved by

[TABLE]

with $\beta_{t}=\sum\limits_{\tilde{\tau}_{t,i}\in\binom{[d_{1}-1]}{d_{0}-1}}\sum\limits_{g=g_{0}({t})}^{n_{t}}\frac{1}{\binom{d_{1}}{d_{0}}}Q_{\max}(t,g,\tilde{\tau}_{t,i})$. ∎

Notice that

[TABLE]

we thus have $\sum_{g=g_{0}(t)}^{n_{t}}Q_{\max}(t,g,\tau_{i}\setminus\{t\})\leq 1$ , and

[TABLE]

The proof of Theorem 3.5.

Combining Lemma 3.6 and Lemma 3.7, we thus have Theorem 3.5. ∎

Next we provide some remarks for our main result Theorem 3.5.

Remark 3.8.

(i) $0<1-\frac{d_{0}}{d_{1}}Q_{\min}\sigma_{\min}^{2}(\tilde{A})<1$, provided that $d_{0}\leq d_{2}$. This results from the facts that $\sigma_{\min}^{2}(\tilde{A})\leq\frac{d_{1}}{d_{2}}$ (since $\sum_{i=1}^{d_{2}}\sigma_{i}^{2}(\tilde{A})=d_{1}$) and $0\leq Q_{\min}\leq 1$ (by (10)).

(ii) From (6), one can see that the relation between $\beta_{t}$ and $d_{0}$ is complicated. An example in Table 2 shows that increasing $d_{0}$ can, to some extent, decrease $\beta_{t}$ and thus improve the speed of convergence. For more details about the optimal choice of $d_{0}$, one can refer to Appendix B. Meanwhile, one should be aware that a larger $d_{0}$ leads to more communication cost. Thus, in practice, finding an optimal $d_{0}$ means not just minimizing $\beta_{t}$ but also reducing the communication cost.

(iii) When the adversarial errors $\{\widetilde{e}_{t}\}_{t}$ are relatively small, the method without the block-list can guarantee a convergence error of the same or smaller magnitude as $\{\widetilde{e}_{t}\}_{t}$ using the right parameters. However, one can conclude from (6) that when the errors $\{\widetilde{e}_{t}\}_{t}$ are relatively large, convergence is not guaranteed. Therefore, it is crucial to introduce the block-list method for excluding the adversarial workers.
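The two facts used in remark (i) can be checked numerically on a random row-normalized matrix. The matrix sizes are illustrative, and the value of `Q_min` below is a hypothetical placeholder rather than a quantity computed from the mode distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 20, 5
A = rng.standard_normal((d1, d2))
A_tilde = A / np.linalg.norm(A, axis=1, keepdims=True)   # row-normalized matrix

sing_vals = np.linalg.svd(A_tilde, compute_uv=False)
sum_sq = np.sum(sing_vals ** 2)        # equals ||A_tilde||_F^2 = d1 (unit-norm rows)
sigma_min_sq = sing_vals.min() ** 2    # at most the average d1 / d2

Q_min = 0.3                            # hypothetical value in [0, 1] (assumption)
d0 = 3                                 # number of rows per iteration, d0 <= d2
alpha = 1 - Q_min * d0 / d1 * sigma_min_sq
```

Because the smallest squared singular value cannot exceed the average $d_{1}/d_{2}$, the product $\frac{d_{0}}{d_{1}}Q_{\min}\sigma_{\min}^{2}(\tilde{A})$ is at most $\frac{d_{0}}{d_{2}}Q_{\min}\leq 1$ whenever $d_{0}\leq d_{2}$.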

3.3 Block-list method

In this section, we use $P_{\ell}$ to denote the proportion of workers in category $\ell$ among the total of $N$ workers. Thus, the number of workers in category $\ell$ is $NP_{\ell}$ . As per Algorithm 2, after $S$ iterations, the worker $w^{*}$ with the highest count of non-mode selections, $c^{+}_{w^{*}}$ , will be added to the block-list. To assess the effectiveness of the proposed method, it is crucial to calculate the probability that a bad or reliable worker is included in the block-list. This problem can be mathematically reformulated as follows.

Problem 3.9.

Let $c^{+}_{w}(S)\coloneqq c^{+}_{w}$ (resp. $c^{0}_{w}(S)\coloneqq c^{0}_{w}$) be the counter of the worker $w$ being non-mode (resp. mode or in the no-mode case) over $S$ iterations. Then we have $0\leq c^{+}_{w},c^{0}_{w}\leq S$ and $\sum_{w=1}^{N}(c^{+}_{w}+c^{0}_{w})=nS$. The probability that $w^{*}$ is in the block-list after $S$ iterations can be calculated as follows:

[TABLE]

Note that this probability can be calculated by using integer dynamic programming or estimated by Monte Carlo simulations.

Lemma 3.10.

Run Alg. 2 for $S$ iterations. Then the conditional probability that a reliable worker $w_{0}$ is in the block-list is $\frac{P_{0}\mathbb{P}_{\textit{bl}}(w_{0})}{\sum_{i=0}^{k}P_{i}\mathbb{P}_{\textit{bl}}(w_{i})}$ . Similarly, the conditional probability that a bad worker $w_{\ell}$ in category $\ell$ is in the block-list is $\frac{P_{\ell}\mathbb{P}_{\textit{bl}}(w_{\ell})}{\sum_{i=0}^{k}P_{i}\mathbb{P}_{\textit{bl}}(w_{i})}$ .

Proof.

First notice that we have the following two facts: (i) The probability that a reliable worker is in the block-list is $P_{0}N\mathbb{P}_{\textit{bl}}(w_{0})$ . (ii) The probability that a bad worker $w_{\ell}$ in category $\ell$ is in the block-list is $P_{\ell}N\mathbb{P}_{\textit{bl}}(w_{\ell})$ . Therefore, the probability that a worker, either reliable or bad, is in the block-list is $\sum_{i=0}^{k}P_{i}N\mathbb{P}_{\textit{bl}}(w_{i})$ . The conditional probabilities follow by taking the ratios $\frac{P_{0}N\mathbb{P}_{\textit{bl}}(w_{0})}{\sum_{i=0}^{k}P_{i}N\mathbb{P}_{\textit{bl}}(w_{i})}$ and $\frac{P_{\ell}N\mathbb{P}_{\textit{bl}}(w_{\ell})}{\sum_{i=0}^{k}P_{i}N\mathbb{P}_{\textit{bl}}(w_{i})}$ , in which the common factor $N$ cancels. ∎
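The ratios in Lemma 3.10 are a simple normalization, which the following minimal sketch makes concrete. The function name and the numerical inputs are hypothetical and chosen only for illustration; they are not values reported in the paper.

```python
def blocklist_category_posterior(P, P_bl):
    """Normalize P[l] * P_bl[l] over the categories l = 0, ..., k.

    P[l]    : proportion of workers in category l (l = 0 is reliable).
    P_bl[l] : probability that one given worker of category l is block-listed.
    Returns the conditional probability, per category, that the
    block-listed worker belongs to that category.
    """
    weights = [p * q for p, q in zip(P, P_bl)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one category must have positive weight")
    return [w / total for w in weights]


# Hypothetical inputs: 3 reliable and 2 bad workers out of 5.
posterior = blocklist_category_posterior([0.6, 0.4], [0.05, 0.425])
```

The first entry of `posterior` corresponds to the reliable category and the second to the bad category; the factor $N$ is omitted because it cancels in the ratio.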

To illustrate how the quantities in Lemma 3.10 change with respect to $S$ , we consider the following example.

Example.

Assume that there are two categories of workers, i.e., $\ell=0,1$ , and 5 workers $w_{i},i=1,\ldots,5$ in total with $w_{1},w_{2}\in C_{1},w_{3},w_{4},w_{5}\in C_{0}$ . Let $n_{r}\equiv 3$ . Note that $\mathbb{P}_{\textit{bl}}(w_{1})=\mathbb{P}_{\textit{bl}}(w_{2})\coloneqq\mathbb{P}_{\textit{bl}}^{1}$ and $\mathbb{P}_{\textit{bl}}(w_{3})=\mathbb{P}_{\textit{bl}}(w_{4})=\mathbb{P}_{\textit{bl}}(w_{5})\coloneqq\mathbb{P}_{\textit{bl}}^{0}$ . The probability is estimated by Monte Carlo simulations. We ran the experiment 100 times and counted the number of experiments in which each worker ended up in the block-list. These counts give the frequencies used to estimate the probabilities. The estimated results are summarized in Table 3.

Table 3 shows that the probability of an adversarial worker being in the block-list increases as the number of iterations $S$ increases. Meanwhile, the probability of a reliable worker being in the block-list decreases. Using the method with the block-list, we are able to avoid choosing the results from the adversarial workers. As a result, the probability of using the adversarial workers decreases, i.e., $\beta_{t}$ decreases.
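A Monte Carlo estimate of the kind used in the example can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: workers in the same category are assumed to report identical values (so the category label stands in for the reported value), ties for the largest non-mode count are broken uniformly at random, and all function names are hypothetical.

```python
import random
from collections import Counter


def simulate_blocklist(categories, n, S, rng):
    """Run S rounds and return the worker with the most non-mode selections."""
    non_mode = [0] * len(categories)
    workers = range(len(categories))
    for _ in range(S):
        chosen = rng.sample(workers, n)
        ranked = Counter(categories[w] for w in chosen).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # no unique mode in this round, nothing is counted
        mode_category = ranked[0][0]
        for w in chosen:
            if categories[w] != mode_category:
                non_mode[w] += 1
    top = max(non_mode)
    # Assumption: ties for the highest count are broken uniformly at random.
    return rng.choice([w for w, c in enumerate(non_mode) if c == top])


def estimate_blocklist_probabilities(categories, n, S, trials, seed=0):
    """Frequency with which each worker is block-listed over `trials` runs."""
    rng = random.Random(seed)
    hits = Counter(
        simulate_blocklist(categories, n, S, rng) for _ in range(trials)
    )
    return [hits[w] / trials for w in range(len(categories))]


# Setting of the example: w1, w2 in C_1 (bad), w3, w4, w5 in C_0, n_r = 3.
freqs = estimate_blocklist_probabilities([1, 1, 0, 0, 0], n=3, S=50, trials=200)
```

In this setting a given bad worker is the lone non-mode worker in 3 of the 10 possible draws, while a given reliable worker is non-mode in only 1 of 10, which is why the bad workers accumulate non-mode counts faster as $S$ grows.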

4 Simulations

In this section, we evaluate the performance of our approaches for solving consistent linear systems through simulations. We randomly generate a row-normalized matrix $A\in\mathbb{R}^{1200\times 50}$ and a vector $x\in\mathbb{R}^{50}$ , both from a normalized Gaussian distribution, and set $b=Ax$ . For simplicity, we assume that each row has the same number of error categories $k$ and the same adversarial rate $p_{r}=p/k$ , where $p$ is the total adversarial rate. The linear system $Ax=b$ is solved using Algorithm 2 with and without the block-list. At each iteration, $d_{0}$ rows of $A$ are chosen uniformly at random, and for each row, $n_{r}$ workers are selected from the $N_{r}$ workers to participate in the calculation. We further assume $N_{r}$ are the same and $n_{r}$ are equal to $n$ for all $r$ . The simulation results show how the number of used rows $d_{0}$ , the number of used workers $n$ , the total adversary rate $p$ , and the number of error categories $k$ affect the performance.
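The experimental loop described above can be sketched roughly as follows. This is a simplified illustration and not a reproduction of Algorithm 2: it has no block-list, worker categories are drawn independently with probabilities $(1-p, p/k, \ldots, p/k)$ instead of sampling $n$ workers out of $N_{r}$ without replacement, the per-row errors are hypothetical uniform draws, and the update averages the mode step-sizes over the $d_{0}$ chosen rows (an assumed aggregation rule).

```python
from collections import Counter

import numpy as np


def unique_mode(values):
    """Return the strictly most frequent value, or None if the top count is tied."""
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def mode_rk(A, b, x_true, n_iter, d0, n, p, k, err_scale, seed=0):
    """Mode-aggregated Kaczmarz on a row-normalized A; returns ||x - x_true||."""
    rng = np.random.default_rng(seed)
    d1, d2 = A.shape
    # Hypothetical errors e[r, l]; column 0 is the reliable category (zero error).
    e = np.zeros((d1, k + 1))
    e[:, 1:] = err_scale * rng.uniform(-1.0, 1.0, size=(d1, k))
    cat_probs = np.r_[1.0 - p, np.full(k, p / k)]
    x = np.zeros(d2)
    for _ in range(n_iter):
        rows = rng.choice(d1, size=d0, replace=False)
        step = np.zeros(d2)
        for r in rows:
            residual = A[r] @ x - b[r]
            cats = rng.choice(k + 1, size=n, p=cat_probs)
            # A worker in category l computes its step-size with b_r + e[r, l].
            reported = residual - e[r, cats]
            m = unique_mode(reported.tolist())
            if m is not None:  # rows without a unique mode are skipped
                step += m * A[r]
        x = x - step / d0
    return float(np.linalg.norm(x - x_true))
```

Workers in the same category report bit-identical floats here, so exact equality is enough to find the mode; a real deployment would presumably need a tolerance.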

Figs. 1 and 2 illustrate the impact of the number of used rows $d_{0}$ on the convergence results of our distributed RK method with and without the block-list. In this example, increasing the number of used rows $d_{0}$ from $2$ to $4$ improves the convergence rate for both with and without block-list scenarios, regardless of the magnitude of the adversaries $e$ . Fig. 1 demonstrates the convergence results when $\|e\|_{\infty}=10^{-3}$ . From the figures, Alg. 2 with the block-list shows fast convergence over all choices of $d_{0}$ when the adversarial rate $p=0.2$ (Fig. 1(a)); the larger the number of used rows $d_{0}$ , the faster the convergence when the adversarial rate $p=0.6$ (Fig. 1(c)). Fig. 1(b) reveals that, when $d_{0}=2,4$ , the central worker may use a corrupted step-size, resulting in oscillations around the solution and the error converging to a range of magnitude between $10^{-3}$ and $10^{-5}$ . On the other hand, when $d_{0}=6,8$ , the error goes to 0 after 4000 iterations. As seen in Fig. 1(d), the convergence errors of all $d_{0}$ values lie in the range of $(10^{-4},10^{-2})$ . With a small $\|e\|_{\infty}$ , the method without the block-list can converge by increasing $d_{0}$ . However, when the magnitude of the adversaries and the adversarial rate are large (as shown in Fig. 2(d)), the convergence is not guaranteed without a block-list. In comparison, the method with the block-list converges quickly to an accuracy of $10^{-10}$ when $d_{0}$ is sufficiently large (as seen in Fig. 2(a)). This highlights the importance of using the block-list in an environment with larger outliers.

Figure 3 examines the impact of the number of chosen workers $n_{r}$ on the convergence, with adversary rates of $0.2$ and $0.6$ . As the number of workers increases from $3$ to $7$ , the convergence becomes faster both with and without the block-list. However, the method using the block-list generally provides better convergence than the method without it. The block-list method requires extra storage, but this cost is worth the improved convergence. Without the block-list, oscillations can be observed in the convergence when $p=0.2$ and $n_{r}\equiv n=3$ , and when $p=0.6$ for all $n_{r}$ .

Figure 4 demonstrates the effect of the adversary rate on the convergence. As the adversary rate $p$ increases, the accuracy decreases. Even when the adversary rate is large, the final results using the block-list method are still satisfactory. Without the block-list, when the adversarial rate $p>0.5$ , the central worker fails to approach the true solution due to the adversarial workers. This again shows the importance and effectiveness of using the block-list, especially in a highly hostile environment with a higher adversarial rate and a higher magnitude of the adversary. In addition, Fig. 5 shows the effect of the number $k$ of error categories with the block-list. One can see that our method converges when $k$ is big enough for adversarial rates of $0.2$ and $0.6$ .

In Fig. 6, we use the Wisconsin (Diagnostic) Breast Cancer data set, which includes data points whose features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image (see [7] for more details). Similar to the setup in [12], we set the simulations in the following way: the collection of data points forms matrix $A\in\mathbb{R}^{569\times 10}$ . We then normalize $A$ and construct $x$ and $b$ using a Gaussian distribution to form a consistent system. The convergence results in Fig. 6 show the effectiveness of our method in solving this linear system in a relatively safe environment with an adversarial rate $p=0.3$ (Fig. 6(a)) and in a more hostile environment with an adversarial rate $p=0.6$ (Fig. 6(b)). When $p=0.3$ , the method converges within $1000$ iterations, and as $d_{0}$ increases, the convergence speed becomes faster. Meanwhile, when $p=0.6$ , the method converges within $1500$ iterations, and $d_{0}=8$ has the fastest convergence speed among all choices of $d_{0}$ .

Finally, we investigate the impact of the number of updating cycles $S$ on the accuracy of block-list recognition. In Table 4, we calculate the accuracy of the block-list method when $S=200,500,1000,2000$ . The two examples in the table show that the accuracy increases with $S$ .

5 Conclusion and future work

It is of great significance for optimization algorithms to be robust and resistant to adversaries. In this work, we propose efficient mode-based algorithms for solving large-scale linear systems in the presence of adversarial workers. This kind of adversary arises in many real-world applications, e.g., the Internet of Things (IoT). We provide theoretical convergence guarantees, and the theory is supported by our experiments. The methods are capable of handling various levels of adversarial rates. In particular, the method with the block-list is able to provide an accurate estimate of the solution when the adversarial rate $p>0.5$ , and at the same time, our method can identify the adversarial workers. Our experiments also highlight the impact of several important parameters of the adversaries and of the anti-adversary strategies, namely, the number of rows $d_{0}$ used to update the solution per iteration, the number of error categories $k$ , the adversary rate $p$ , and the number of chosen workers $n_{r}$ at each iteration.

Our method can also be adapted to solve nonlinear problems. In Fig. 7, we applied the method with the block-list to solve the optimization problem with $\ell_{1}$ regularization $\frac{1}{2}\|Ax-b\|_{2}^{2}+\gamma\|x\|_{1}$ with $\gamma=1$ . In the distributed setting, the problem can be formulated as $\sum_{i=1}^{d_{1}}f_{i}(x)$ , where $f_{i}(x)=\frac{1}{2}(A_{i}x-b_{i})^{2}+\frac{\gamma}{d_{1}}\|x\|_{1}$ . Fig. 7(a) shows the distance between the solution obtained with our method and the solution obtained with the LASSO solver from the Python package scikit-learn [22] for different numbers of rows $d_{0}$ . When the number of used rows is $d_{0}=4$ , the method takes the fewest iterations to converge. All choices of $d_{0}$ end up with a relative error around $10^{-5}$ with the block-list. The objective function $F(x)$ versus the number of iterations is shown in Fig. 7(b).
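The row-wise splitting of the $\ell_{1}$-regularized objective can be checked numerically with a short sketch. The function names are hypothetical, and the code only evaluates the objective; it does not implement the distributed solver.

```python
import numpy as np


def lasso_objective(A, b, x, gamma):
    """F(x) = 0.5 * ||Ax - b||_2^2 + gamma * ||x||_1."""
    return 0.5 * np.sum((A @ x - b) ** 2) + gamma * np.sum(np.abs(x))


def lasso_component(A, b, x, gamma, i):
    """f_i(x) = 0.5 * (A_i x - b_i)^2 + (gamma / d_1) * ||x||_1 for row i."""
    d1 = A.shape[0]
    return 0.5 * (A[i] @ x - b[i]) ** 2 + (gamma / d1) * np.sum(np.abs(x))


rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
x = rng.standard_normal(5)
total = sum(lasso_component(A, b, x, 1.0, i) for i in range(A.shape[0]))
```

Summing the $d_{1}$ components recovers $F(x)$ because the regularizer is split evenly across rows. When comparing against scikit-learn's `Lasso`, note that its documented objective scales the quadratic term by $1/(2\,n_{\text{samples}})$, so the matching penalty would be `alpha = gamma / d1`; whether the authors used this scaling is not stated in the text.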

Additionally, we provide convergence analysis for methods when multiple rows are selected uniformly at each iteration. We also provide the proof of a more general sampling scheme where the row is sampled according to its squared Euclidean norm, although the proof only applies to the case where only a single row is used for computation at each iteration. Interesting avenues for future work include proving convergence under different sampling rules and rigorously generalizing this method to non-linear optimization problems.

Acknowledgements

Authors are listed in alphabetical order. Some of the work for this article was done while Longxiu Huang was Assistant Adjunct Professor and Xia Li was a graduate student at UCLA.

Appendix A Single row convergence without block-list

In this section, we present an algorithm and its corresponding theory for a specific case in which only one row ( $d_{0}=1$ ) is utilized for updating the solution at each iteration. Therefore, $N_{r}\equiv N$ . The algorithm is based on the assumption that the probability of selecting row $i\in[d_{1}]$ for the updating process is proportional to the squared length of the corresponding row. Further details can be found in Algorithm 3. We will then proceed to analyze the convergence of Algorithm 3.

Theorem A.1.

Let $A\in\mathbb{R}^{d_{1}\times d_{2}}$ with $d_{1}\geq d_{2}$ and $b,e_{1},\ldots,e_{k}\in\mathbb{R}^{d_{1}}$ . Assume that we solve $Ax^{*}=b$ via Algorithm 3, then

[TABLE]

where $\alpha=1-\frac{\sigma_{\min}^{2}(A)}{\|A\|_{F}^{2}}$ , $q_{\ell}=\frac{\hat{q}^{\ell}_{\text{mode}}}{q}$ and $\sigma_{\min}(A)$ is the smallest singular value of $A$ .

Additionally, if $\|e_{\ell}\|\leq C$ , we have

[TABLE]

Remark A.2.

To provide a quantitative understanding of Theorem A.1, we present several examples in Tables 5 and 6. For simplicity, assume that each error category has the same fraction $p_{\ell}=p/k$ . Thus, all $\hat{q}_{\text{mode}}^{\ell}$ are equal. Here $q_{0}$ is the probability that the algorithm chooses the right mode and $q$ is the probability that there is a mode. In these two tables, we present the values for $\hat{q}_{\text{mode}}^{\ell},\hat{q}_{\text{mode}}^{0},q$ and $q_{0}$ by varying the number of error categories $k$ , the number of chosen workers $n$ and the adversarial rate $p$ . These two tables are generated by solving a linear system with a row-normalized matrix $A\in\mathbb{R}^{1000\times 100}$ .

As $k$ increases, $q_{\ell}$ decreases and $q_{0}$ increases. Therefore, the error bound in (13) decreases with respect to $k$ , which gives better convergence. When $k$ is large enough, $q_{\ell}\approx 0$ . Therefore, when the noise is uniformly random and there is a mode for the step-size, the mode will be the correct mode. As $n$ increases, the error bound decreases in a similar way, again giving better convergence.
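For orientation, the adversary-free baseline behind Algorithm 3, namely randomized Kaczmarz with rows sampled proportionally to their squared norms, can be sketched as below together with the factor $\alpha=1-\sigma_{\min}^{2}(A)/\|A\|_{F}^{2}$ from Theorem A.1. The sketch assumes no adversarial workers at all, and the function names are hypothetical.

```python
import numpy as np


def rk_squared_norm_sampling(A, b, n_iter, rng):
    """Randomized Kaczmarz; row j is drawn with probability ||A_j||^2 / ||A||_F^2."""
    row_norms_sq = np.sum(A ** 2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        j = rng.choice(A.shape[0], p=probs)
        # Project the iterate onto the hyperplane <A_j, x> = b_j.
        x = x - (A[j] @ x - b[j]) / row_norms_sq[j] * A[j]
    return x


def contraction_factor(A):
    """alpha = 1 - sigma_min(A)^2 / ||A||_F^2."""
    sigma_min = np.linalg.svd(A, compute_uv=False).min()
    return 1.0 - sigma_min ** 2 / np.linalg.norm(A, "fro") ** 2
```

For a full-column-rank matrix, `contraction_factor` lies strictly between 0 and 1, and a smaller value corresponds to a faster expected decay of the squared error in the adversary-free case.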

Proof of Theorem A.1.

To prove (12), at each iteration, we consider solving $Ax=b$ , $Ax=b+e_{1}$ , $\ldots$ , $Ax=b+e_{k}$ with probability $q_{0},q_{1},\cdots,q_{k}$ , respectively. Therefore, for the $(i+1)$ -th step, we have the iteration

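$$x_{i+1}=x_{i}-\frac{\langle A_{j},x_{i}\rangle-b_{j}}{\|A_{j}\|^{2}}(A_{j})^{\top}$$

or

$$x_{i+1}=x_{i}-\frac{\langle A_{j},x_{i}\rangle-(b_{j}+e_{\ell}(j))}{\|A_{j}\|^{2}}(A_{j})^{\top}$$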

for $\ell=1,\cdots,k$ , where $A_{j}$ is the $j$ -th row of matrix $A$ .

Notice that when $x_{i+1}=x_{i}-\frac{\langle A_{j},x_{i}\rangle-b_{j}}{\|A_{j}\|^{2}}(A_{j})^{\top}$ , we have

[TABLE]

When $x_{i+1}=x_{i}-\frac{\langle A_{j},x_{i}\rangle-(b_{j}+e_{\ell}(j))}{\|A_{j}\|^{2}}(A_{j})^{\top}$ , we have

[TABLE]

Combining (14) and (15), we have

[TABLE]

Set $\alpha=1-\frac{\sigma_{\min}^{2}(A)}{\|A\|_{F}^{2}}$ . Therefore,

[TABLE]

∎

Appendix B Discussion of the optimal number of used rows

Assume that row $r$ is distributed to $N_{r}$ workers and $n_{r}$ workers are selected to take part in the computation of row $r$ with $N_{r}\equiv N,n_{r}\equiv n$ for all $r$ . Additionally, we assume that the probability of each error category $\ell$ ( $\ell\neq 0$ ) for each row is the same with $p_{r,\ell}\equiv p/k.$ Moreover, we have the restricted minimal mode number $g_{0}(r)\equiv g_{0}=\max(\lceil\frac{n}{k+1}\rceil,\lceil n(1-p)\rceil)$ , and

[TABLE]

Let $\alpha(d_{0})=\alpha=1-Q\frac{d_{0}}{d_{1}}\sigma_{\min}^{2}(\tilde{A})$ . To study the relation between $d_{0}$ and the convergence, consider

[TABLE]

If $d_{0}\geq-\frac{1}{\log(b_{g}/\binom{N}{n})}$ for all $g$ , then $\frac{\partial\alpha(d_{0})}{\partial d_{0}}\geq 0$ . This implies that as $d_{0}$ increases, $\alpha(d_{0})$ increases. When $g_{0}=n$ , we have

[TABLE]

and the fastest convergence rate is reached at $d_{0}=-\frac{1}{\log\left({b_{g}}/{\binom{N}{n}}\right)}$ . One can explore the minimizers of $\alpha$ in more general cases, where multiple local minimizers could be present in the landscape.

Appendix C Proof of Lemma 3.1

Proof of Lemma 3.1.

Given a row $r$ , the number of combinations where the mode belongs to category $\ell$ with mode count $g$ can be divided into two parts: the combinations of workers in category $\ell$ and the combinations of workers in all other categories excluding $\ell$ .

We start by calculating the number of combinations of workers in the remaining categories. This is equivalent to selecting $(n_{r}-g)$ balls from $k$ bins subject to constraints that the $\tilde{\ell}$ -th bin contains $N_{r}p_{r,\tilde{\ell}}$ balls, with a maximum of $g$ balls that can be chosen from this bin for all $\tilde{\ell}\neq\ell$ . With these constraints, there are $\binom{N_{r}p_{r,\tilde{\ell}}}{j}$ ways to choose $j$ balls from bin $\tilde{\ell}$ . By letting $j$ vary and considering the $k$ bins, the total number of valid combinations is equal to the coefficient of the term $x^{n_{r}-g}$ in the polynomial $\prod\limits_{\tilde{\ell}=0,\tilde{\ell}\neq\ell}^{k}\sum_{j=0}^{g-1}\binom{N_{r}p_{r,\tilde{\ell}}}{j}x^{j}$ . The number of combinations of workers in category $\ell$ with mode count $g$ is $\binom{N_{r}p_{r,\ell}}{g}$ . Finally, the total number of combinations is given by $\binom{N_{r}}{n_{r}}$ . This concludes the proof of Lemma 3.1. ∎
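The counting argument in this proof can be turned into a small exact computation. The sketch below is an illustration only: `sizes[l]` plays the role of $N_{r}p_{r,\ell}$ (assumed to be an integer), the function names are hypothetical, and every other category is capped at $g-1$ workers, as in the polynomial above.

```python
from math import comb


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def mode_probability(sizes, n, ell, g):
    """Probability that category `ell` is the mode with count g among n workers.

    sizes[l] is the number of workers in category l (N_r * p_{r,l}); the n
    workers are drawn uniformly without replacement from all sum(sizes) workers.
    """
    if g < 1 or g > n:
        raise ValueError("mode count g must satisfy 1 <= g <= n")
    poly = [1]
    for category, size in enumerate(sizes):
        if category == ell:
            continue
        # Each other category contributes at most g - 1 workers.
        poly = poly_mul(poly, [comb(size, j) for j in range(g)])
    rest = n - g
    coeff = poly[rest] if rest < len(poly) else 0
    return comb(sizes[ell], g) * coeff / comb(sum(sizes), n)


# Three reliable workers, two bad workers, n_r = 3 workers chosen.
p_reliable_mode = mode_probability([3, 2], 3, 0, 2) + mode_probability([3, 2], 3, 0, 3)
```

Exact integer arithmetic is used for the coefficients, so the only rounding happens in the final division by $\binom{N_{r}}{n_{r}}$.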

Bibliography


[1] Shmuel Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6:382–392, 1954.
[2] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
[3] Zhong-Zhi Bai and Wen-Ting Wu. On partially randomized extended Kaczmarz method for solving large sparse overdetermined inconsistent linear systems. Linear Algebra and Its Applications, 578:225–250, 2019.
[4] Zhong-Zhi Bai and Wen-Ting Wu. On greedy randomized augmented Kaczmarz method for solving large sparse inconsistent linear systems. SIAM Journal on Scientific Computing, 43(6):A3892–A3911, 2021.
[5] Rawad Bitar, Mary Wootters, and Salim El Rouayheb. Stochastic gradient coding for flexible straggler mitigation in distributed learning. pages 1–5, 2019.
[6] Yuejie Chi, Yuanxin Li, Huishuai Zhang, and Yingbin Liang. Median-truncated gradient descent: A robust and scalable nonconvex approach for signal estimation. 2019.
[7] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[8] Jonas Geiping, Liam Fowl, W. Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, and Tom Goldstein. Witches' brew: Industrial scale data poisoning via gradient matching. CoRR, abs/2009.02276, 2020.
