Randomized Kaczmarz in Adversarial Distributed Setting
Longxiu Huang, Xia Li, Deanna Needell

TL;DR
This paper introduces an adversary-tolerant distributed optimization method based on randomized Kaczmarz, demonstrating its effectiveness in convex problems with adversarial workers through simulations.
Contribution
It proposes a novel iterative approach that ensures convergence and robustness in distributed convex optimization under adversarial conditions.
Findings
Method converges despite adversarial workers
High accuracy in identifying adversarial workers
Effective in various adversary rate scenarios
Abstract
Developing large-scale distributed methods that are robust to the presence of adversarial or corrupted workers is an important part of making such methods practical for real-world problems. In this paper, we propose an iterative approach that is adversary-tolerant for convex optimization problems. By leveraging simple statistics, our method ensures convergence and is capable of adapting to adversarial distributions. Through simulations, we demonstrate the efficiency of our approach for solving convex problems in the presence of adversaries, its ability to identify adversarial workers with high accuracy, and its tolerance of varying adversary rates.
| Data matrix , | |
| Row normalized version of matrix | |
| Number of workers in total | |
| Number of workers holding row | |
| Number of workers chosen for row | |
| -th error category | |
| Number of error categories in total | |
| Error of the -th error category for a row | |
| Vector form of errors in all error categories of row , | |
| Matrix form of errors in all error categories of all rows, | |
| Number of chosen rows per iteration | |
| Fraction of workers holding row in category | |
| Probability that there is a mode among the outputs of chosen workers of row and the mode is in the category (see Lemma 3.2) | |
| Probability that there is a mode among the outputs of chosen workers for row (see Lemma 3.2) | |
| Set of the integers from 1 to , | |
| Collection of the subsets of with elements | |
| Uniform sampling from the collection . | |
| Index set of chosen rows at -th iteration, | |
| Index set of chosen rows that have a mode, | |
| Row index that has the largest mode number, | |
| Probability that the mode is in the category with mode number for row (see Lemma 3.1) | |
| Probability that a mode is from row among rows and the mode is in category with mode number , given the previous estimate . It is also denoted by , | |
| Probability that the mode is from row among rows with mode number provided the previous estimate , more details refer to Corollary 3.4. |
| 5 | 2 | |||
| 5 | 3 | |||
| 5 | 5 |
| 0.403 | ||||
| 10 | 0.099 | 0.18 | 0.67 | 0.26 | |
| 15 | 0.099 | 0.2 | 0.7 | 0.29 | |
| 20 | 0.097 | 0.23 | 0.71 | 0.31 | |
| 10 | 0.904 | 0.90 | |||
| 15 | 0.97 | 0.97 | |||
| 20 | 0.99 | 0.99 |
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Distributed Control Multi-Agent Systems
Distributed Randomized Kaczmarz for the Adversarial Workers
Longxiu Huang, Department of Computational Mathematics, Science and Engineering and Department of Mathematics, Michigan State University, MI.
Xia Li (corresponding author), Microsoft, WA.
Deanna Needell, Department of Mathematics, University of California Los Angeles, CA.
1 Introduction
As machine-learning algorithms gain popularity in industrial applications, it is critical to make them and their optimization subroutines robust and adversary-tolerant. Attacks on such systems can take various forms, including evasion [9], data poisoning [8] and model extraction [26, 16]. In large-scale machine learning problems, which are often run on distributed systems, attacks can come in the form of Byzantine attacks [17], where individual computing units, also known as ‘worker machines’ or simply ‘workers’, may produce adversarial results. A commonly used approach to mitigate such attacks is to use redundancy; that is, to request the same computation from multiple workers. The main challenge with such an approach is how to leverage the outputs from these workers efficiently, and in such a way that even seemingly catastrophic adversarial outputs can be identified and tolerated. Consider an optimization problem of the following form:
[TABLE]
where is a positive integer. To solve the problem iteratively, we use the gradient descent method to update the estimate:
[TABLE]
with some step-size . Such objective functions lend themselves naturally to distributed algorithms. In the distributed setting, the central worker distributes among the workers. Each worker returns the corresponding gradient and the central worker aggregates those returns to compute or approximate the updating step (2). In particular, we illustrate our method by solving an over-determined linear system . However, the algorithms can be easily adapted for (1). The linear system can be modeled as a least squares problem and the least squares problem can be rewritten in the form of (1) with , where , is the -th row of , and is the -th component of . The central worker partitions the data matrix into rows and the rows are distributed among the workers. In the linear setting, each worker only needs to return the scalar instead of the gradient . Then the central worker aggregates those returns and approximates the updates in (2).
In this work, we consider the setting where some of the workers are adversarial, i.e., the workers return noisy results or enormously large results. Our goal is to develop a variant of the randomized Kaczmarz (RK) method [25] for adversarial workers to solve the linear system . For readers’ convenience, we restate the RK method in Alg. 1.
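Algorithm 1 is not reproduced in this text. As a reference point, the standard RK iteration can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code; the function name `randomized_kaczmarz` and its parameters are chosen here for illustration, and rows are sampled with probability proportional to their squared Euclidean norm, as in the RK analysis cited above.

```python
import random

def randomized_kaczmarz(A, b, iters=2000, seed=0):
    """Standard randomized Kaczmarz for a consistent system A x = b.

    A is a list of rows (lists of floats), b a list of floats.
    Rows are sampled with probability proportional to their squared norm.
    """
    rng = random.Random(seed)
    n = len(A[0])
    sq_norms = [sum(v * v for v in row) for row in A]
    x = [0.0] * n
    for _ in range(iters):
        # Sample a row index with probability ||a_i||^2 / ||A||_F^2.
        i = rng.choices(range(len(A)), weights=sq_norms, k=1)[0]
        # Residual of the chosen equation at the current iterate.
        r = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
        # Project x onto the hyperplane {z : <a_i, z> = b_i}.
        step = r / sq_norms[i]
        x = [xj - step * a for xj, a in zip(x, A[i])]
    return x
```

In the distributed variant studied here, the scalar residual `r` is the quantity each worker returns, which is why a corrupted residual directly corrupts the step.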
We assume that there is one central worker and workers in total, among which fraction of the unknown workers are adversarial and there are error categories in total. During the initial data distribution, each row is distributed to workers. Among those workers, workers in the -th category consist of fraction of all workers. We assume contains all reliable workers. The total adversarial rate for row is and the fraction of reliable workers for row is . We assume , for all , and . In particular, we assume that an adversarial worker in category returns the residual , , and a reliable worker returns . Our approach utilizes simple statistics to identify and ignore adversarial results, and thus the setting in which the adversarial workers communicate and select among types of errors to output is the most challenging for our approach.
1.1 Contribution
Our key contributions are threefold: (i) develop efficient methods and algorithms to guarantee accurate estimates for the true solution in the presence of adversarial workers; (ii) identify the adversarial workers efficiently; (iii) provide theoretical convergence analysis for solving linear systems with a portion of workers being adversarial.
1.2 Related work
**Kaczmarz method.** The Kaczmarz method is an iterative technique for solving linear systems that was first introduced in 1937 by Kaczmarz [14]. In computed tomography, the method is also referred to as the Algebraic Reconstruction Technique (ART) [11, 13, 20]. The method has a broad range of applications, from computed tomography to digital signal processing. Later, Strohmer et al. proposed a randomized version of the Kaczmarz method (RK) [25] in the context of consistent linear systems. They proved that RK has an exponential bound on the expected rate of convergence, with the probability of selecting each row proportional to the squared Euclidean norm of that row. The method has also been adapted to handle inconsistent linear systems [23, 24, 4, 19]. For example, Needell proved in [21] that RK converges for inconsistent linear systems to a horizon that depends on the size of the largest entry of the noise. An adaptive maximum-residual sampling strategy has also been analyzed for the inconsistent extension [23]. Additionally, RK has been studied in the context of solving systems of linear inequalities [18, 1, 3].
**Robust optimization.** In optimization problems, practical challenges often arise due to various factors such as errors in data collection and transmission, adversarial or non-responsive workers (also known as stragglers), and corruptions in modern storage systems. To address these challenges, researchers have proposed various mitigation strategies. For instance, to tackle the issue of straggling workers, several encoding schemes have been proposed in the literature: Gordon et al. [10] and Karakus et al. [15] introduced methods to embed redundancy directly in the data, while Bitar et al. [5] proposed a gradient-coding scheme for straggler mitigation when stragglers are uniformly random.
Another important branch in the analysis of SGD-type methods is to deal with robustness to adversaries from the data. Chi et al. [6] and Haddock et al. [12] designed quantile-based methods to solve corrupted linear equations. Yang et al. [27] proposed a variant of the gradient descent method based on the geometric median to deal with adversarial workers, while Alistarh et al. [2] discussed the problem of stochastic optimization in an adversarial setting where the workers sample data from a distribution and an fraction of them may adversarially return any vector. However, these methods are limited to scenarios where the adversary rate is less than . Our proposed algorithm, on the other hand, can converge to the exact solution even with an adversary rate higher than by utilizing redundancy.
2 Method
In this section, we present a simple and efficient mode-based method for solving linear systems in the presence of adversarial workers, as well as identifying potential adversarial workers which may be placed in a block-list (more details will be provided later). The method detects the mode category based on the size of the returned result groups. More specifically, for each row, the central worker groups similar results and selects the result from the group with the largest size, referred to as the mode. From these modes across all selected rows, the central worker then updates the guess with the mode with the largest size. If there is only one row, the central worker updates the guess with the mode.
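The selection rule described above can be sketched as follows. This is an illustrative sketch rather than Alg. 2 itself: grouping results by rounding, the `min_size` threshold, and reporting "no mode" on a tie are assumptions made here for determinism (the source notes that, in practice, a maximum-size group is picked at random). The names `row_mode` and `pick_update` are hypothetical.

```python
from collections import Counter

def row_mode(values, decimals=8):
    """Group near-identical worker outputs and return (value, group_size).

    Grouping by rounding is an assumption; the source only says that
    similar results are grouped.
    """
    groups = Counter(round(v, decimals) for v in values)
    value, size = groups.most_common(1)[0]
    # Report no mode when the largest group is tied with another group.
    tied = sum(1 for s in groups.values() if s == size) > 1
    return (None, 0) if tied else (value, size)

def pick_update(results_by_row, min_size=2):
    """Pick the row whose mode has the largest group size.

    results_by_row maps a row index to the list of scalars returned by
    the workers chosen for that row. Returns (row, value) or None.
    """
    best = None
    for row, values in results_by_row.items():
        value, size = row_mode(values)
        if size >= min_size and (best is None or size > best[2]):
            best = (row, value, size)
    return None if best is None else (best[0], best[1])
```

For example, with three workers per row, a row on which all three workers agree is preferred over a row on which only two agree, and a row with three distinct outputs contributes no mode.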
Given the number of used workers for a specific row , the expected number of workers from category is and the number of non-adversarial workers is with . (Footnote 1: The central worker determines the number of different result groups during the first iterations and takes the maximum number.) In practice, a group with the maximum size is randomly selected and used to update the guess as long as its size is greater than (as shown in Alg. 2, Line 13). If the algorithm is implemented with a block-list, the block-list is updated through a frequency-based approach throughout the iterations: each row has a counter that records whether a worker is selected but fails to be the mode during each iteration. For every updating cycle , the worker with the largest count in each counter is identified as a potential adversarial worker and placed in the block-list (as shown in Alg. 2, Lines 2–20). Once a worker is in the block-list, it will not be considered in future iterations. The full details of the algorithm can be found in Alg. 2 and the related theoretical results are provided in the following section.
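The frequency-based block-list update can be sketched for a single row as follows. The class name `BlockList`, the exact-equality comparison against the mode, and the choice to discard a worker's counter once it is blocked are assumptions made for illustration; Alg. 2 may differ in these details.

```python
from collections import Counter

class BlockList:
    """Frequency-based block-list for the workers of one row (a sketch)."""

    def __init__(self, cycle):
        self.cycle = cycle        # updating cycle length, in iterations
        self.misses = Counter()   # worker id -> times selected but not in the mode
        self.blocked = set()
        self.iteration = 0

    def record(self, outputs, mode_value):
        """outputs maps worker id -> returned scalar for this iteration."""
        self.iteration += 1
        for worker, value in outputs.items():
            # No counter is incremented when there is no mode.
            if mode_value is not None and value != mode_value:
                self.misses[worker] += 1
        # At the end of each cycle, block the worker with the largest count.
        if self.iteration % self.cycle == 0 and self.misses:
            worst, _ = self.misses.most_common(1)[0]
            self.blocked.add(worst)
            del self.misses[worst]

    def eligible(self, workers):
        """Workers that may still be selected in future iterations."""
        return [w for w in workers if w not in self.blocked]
```

A worker that repeatedly disagrees with the mode accumulates the largest count and is excluded at the end of the cycle, after which `eligible` no longer returns it.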
3 Theoretical results
In this section, we provide a rigorous theoretical analysis of the mode distributions and convergence behavior of our mode-based method. To simplify the presentation, we provide a summary of the key notation used in our analysis in Table 1.
3.1 Mode distribution
Algorithm 2 utilizes the mode to identify adversarial workers and achieve convergence. In this section, we discuss the calculation of the probability of a specific category being the mode of a given row during each iteration of the algorithm. For simplicity, let denote the category of “reliable” workers (workers return correct results). For each row , the fraction of reliable workers holding row is . We use rows for the computation per iteration. Recall that each row is held by workers (fixed). Among those workers, workers in the category take up a fraction of . At each iteration, the central worker chooses a set of row indices of size uniformly at random and requests the corresponding workers to return their results. More specifically, given a set of row indices at -th iteration, the central worker first finds the modes among the results from each row and among those modes, chooses the mode with the largest group size (“the majority vote”).
For any row , let be the coefficient of the monomial of the polynomial
[TABLE]
Let be the coefficient of the term of the polynomial
[TABLE]
**Lemma 3.1.**
For row , the probability that the mode is in the category with mode number is .
Proof.
See Appendix C. ∎
Using Lemma 3.1, we obtain the following conclusions by going over all possible mode numbers and all error categories.
**Lemma 3.2.**
For row , the probability that the category is the mode is
[TABLE]
where , and the probability that there is a mode with mode number for the calculation of row is
[TABLE]
Additionally, the probability that there is a mode for the calculation of row is
[TABLE]
where if , for any integer .
In the following lemma, we also calculate the probability that a mode is from row in the category with a mode number when rows are used in the computation and the previous estimate is given. For simplicity, we omit the condition of in the notation and denote by .
**Lemma 3.3.**
Given the previous estimate and row indices , we have
[TABLE]
Proof.
The probability that the mode is produced from category of row with the mode number can be expressed as
[TABLE]
∎
Taking the modes produced from different categories into account, we can easily obtain the following result.
**Corollary 3.4.**
Given the previous estimate and the row indices , the probability that a mode is from row with mode number is
[TABLE]
3.2 Convergence without block-list
In this section, our main goal is to provide a theoretical error bound for the method without block-list (i.e., Alg. 2 without block-list). The main result for this section is presented below.
**Theorem 3.5.**
Let with and . Assume that we solve via Alg. 2 without block-list; then
[TABLE]
where
[TABLE]
and is the row normalized matrix of and is ’s smallest singular value.
Before we prove Theorem 3.5, we let be the row selected at -th iteration to update the guess and consider solving
[TABLE]
according to some probability distribution. Thus, we have the iteration
[TABLE]
at -th iteration, where , and is the -th row of matrix .
In the following analysis, let denote the expectation with respect to the uniformly random sample conditioned upon the sampled for , and let denote expectation with respect to all random samples for , where is the last iteration in the context in which is applied.
We start our analysis by decomposing the squared error
[TABLE]
Taking the expectation of the above equation, we obtain
[TABLE]
Therefore, the proof of Theorem 3.5 reduces to computing the conditional expectations of the squared error from the adversarial workers and of the residual part separately; these are provided in the following lemmas.
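As a sanity check on the decomposition above, the following sketch works through a single update for a unit-norm row. The notation is assumed here for illustration and may differ from the paper's: $a$ is the selected row of the row-normalized matrix, $b$ the matching entry of the right-hand side, $x^\ast$ the solution, and $e$ the additive error in the returned residual ($e = 0$ for a reliable worker).

```latex
\begin{aligned}
x_{k+1} &= x_k - \bigl(\langle a, x_k\rangle - b + e\bigr)\,a,
  \qquad \|a\|_2 = 1,\quad \langle a, x^\ast\rangle = b,\\
x_{k+1} - x^\ast &= \bigl(I - a a^{\top}\bigr)(x_k - x^\ast) - e\,a,\\
\|x_{k+1} - x^\ast\|_2^2 &= \|x_k - x^\ast\|_2^2
  - \langle a, x_k - x^\ast\rangle^2 + e^2 .
\end{aligned}
```

The last line uses the fact that $(I - a a^{\top})(x_k - x^\ast)$ is orthogonal to $a$, so the cross term vanishes. The squared residual term reduces the error and the $e^2$ term increases it, which is why the two contributions can be bounded separately.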
**Lemma 3.6.**
The conditional expectation of squared residual can be bounded below:
[TABLE]
where is the row normalized version of . Thus, we have
[TABLE]
where
Proof.
The expectation of the squared residual can be represented as
[TABLE]
Recall that . We thus have . Therefore,
[TABLE]
i.e., (7) is derived. Hence, we also have (8). ∎
**Lemma 3.7.**
The expectation of the squared error from the adversarial workers can be bounded above by , i.e.,
[TABLE]
where and with .
Proof.
The expectation of the squared error from the adversarial workers can be represented as . To simplify the expressions, we let , , and . Then (9) follows from
[TABLE]
with . ∎
Notice that
[TABLE]
we thus have , and
[TABLE]
The proof of Theorem 3.5.
Combining Lemma 3.6 and Lemma 3.7, we thus have Theorem 3.5. ∎
Next we provide some remarks for our main result Theorem 3.5.
**Remark 3.8.**
- (i) , provided that . This results from the facts that (since ) and (by (10)).
- (ii) From (6), one can see that the relation between and is complicated. An example in Table 2 shows that increasing , to some extent, can decrease and thus improve the speed of convergence. For more details about the optimal choice of , one can refer to Appendix B. Meanwhile, one should be aware that a larger leads to more communication cost. Thus, in practice, finding an optimal means not only minimizing but also reducing the communication cost.
- (iii) When adversaries are relatively small, the method without the block-list can guarantee a convergence error with the same or smaller magnitude as using the right parameters. However, one can conclude from (6) that, when adversaries are relatively large, convergence is not guaranteed. Therefore, it is crucial to introduce the block-list method for excluding the adversarial workers.
3.3 Block-list method
In this section, we use to denote the proportion of workers in category among the total of workers. Thus, the number of workers in category is . As per Algorithm 2, after iterations, the worker with the highest count of non-mode selections, , will be added to the block-list. To assess the effectiveness of the proposed method, it is crucial to calculate the probability that a bad or reliable worker is included in the block-list. This problem can be mathematically reformulated as follows.
**Problem 3.9.**
Let (resp. ) be the counters of the worker being non-mode (resp. mode or in no mode case) among iterations. Then we have and . The probability that is in the block-list after S iterations can be calculated as follows:
[TABLE]
Note that this probability can be calculated by using integer dynamic programming or estimated by Monte Carlo simulations.
**Lemma 3.10.**
Run Alg. 2 for iterations. Then the conditional probability that a reliable worker is in the block-list is . Similarly, the conditional probability that a bad worker in category is in the block-list is .
Proof.
First notice that we have the following two facts: (i) The probability that a reliable worker is in the block-list is . (ii) The probability that a bad worker in category is in the block-list is . Therefore, the probability that a worker, either reliable or bad, is in the block-list is . The conditional probabilities can be easily computed by considering the ratios and .
To illustrate how the quantities in Lemma 3.10 change with respect to , we consider the following example.
**Example.**
Assume that there are two categories of workers i.e. , and 5 workers in total with . Let . Note that and . The probability is estimated by Monte Carlo simulations. We simulated the experiment 100 times and counted the number of experiments in which each worker is listed in the block-list. Those numbers are used to calculate the frequency and estimate the probability. The estimated results are summarized in Table 3.
Table 3 shows that the probability of an adversarial worker in the block-list increases as the number of iterations increases. Meanwhile, the probability of a reliable worker in the block-list decreases. Using the method with the block-list, we are able to avoid choosing the results from the adversarial workers. As a result, the probability of using the adversarial workers decreases, i.e., decreases.
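A Monte Carlo estimate of this kind might look like the sketch below. It is not the authors' experiment: the parameter values, the single error category, the rule that a tie counts as "no mode", and the random tie-break for the largest counter are all assumptions made for illustration, and `blocklist_frequencies` is a hypothetical name.

```python
import random
from collections import Counter

def blocklist_frequencies(num_workers=5, adversarial=(4,), d=3, S=10,
                          trials=1000, seed=0):
    """Estimate how often each worker is block-listed after S iterations.

    Reliable workers all return 0.0; every adversarial worker returns 1.0
    (a single error category). All parameter values are illustrative.
    """
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(trials):
        misses = Counter()
        for _ in range(S):
            chosen = rng.sample(range(num_workers), d)
            outputs = {w: (1.0 if w in adversarial else 0.0) for w in chosen}
            counts = Counter(outputs.values())
            (top, size), *rest = counts.most_common()
            if rest and rest[0][1] == size:
                continue  # tie: no mode, so no counter is updated
            for w, v in outputs.items():
                if v != top:
                    misses[w] += 1
        if misses:
            # Block the worker with the largest non-mode count.
            top_count = max(misses.values())
            worst = rng.choice([w for w, c in misses.items() if c == top_count])
            hits[worst] += 1
    return {w: hits[w] / trials for w in range(num_workers)}
```

With the default values, one adversarial worker among five and three workers chosen per iteration, reliable workers always form the mode, so only the adversarial worker can accumulate non-mode counts. Raising the number of adversarial workers makes wrong modes possible and lets reliable workers be block-listed by mistake.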
4 Simulations
In this section, we evaluate the performance of our approaches for solving consistent linear systems through simulations. We randomly generate a row-normalized matrix and a vector , both from a normalized Gaussian distribution, and set . For simplicity, we assume that each row has the same number of error categories and the same adversarial rate , where is the total adversarial rate. The linear system is solved using Algorithm 2 with and without the block-list. At each iteration, rows of are chosen uniformly at random, and for each row, workers are selected from the workers to participate in the calculation. We further assume are the same and are equal to for all . The simulation results show how the number of used rows , the number of used workers , the total adversary rate , and the number of error categories affect the performance.
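A scaled-down version of this setup, without the block-list, can be sketched in pure Python as follows. The sizes, the single error category with a fixed additive error of 100, and the function name `simulate` are assumptions made for illustration and do not reproduce the paper's experiments.

```python
import random
from collections import Counter

def simulate(m=30, n=5, workers=10, bad=2, d=5, k=3, iters=3000, seed=1):
    """Mode-based distributed RK on a small consistent system (a sketch).

    Returns the Euclidean distance between the final iterate and the
    true solution.
    """
    rng = random.Random(seed)
    # Row-normalized Gaussian matrix and a consistent right-hand side.
    A = []
    for _ in range(m):
        row = [rng.gauss(0, 1) for _ in range(n)]
        norm = sum(v * v for v in row) ** 0.5
        A.append([v / norm for v in row])
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
    x = [0.0] * n
    for _ in range(iters):
        best = None  # (group size, row index, residual)
        for i in rng.sample(range(m), k):
            r = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
            chosen = rng.sample(range(workers), d)
            # Workers 0..bad-1 are adversarial and add a large error.
            outputs = [r + 100.0 if w < bad else r for w in chosen]
            value, size = Counter(outputs).most_common(1)[0]
            if best is None or size > best[0]:
                best = (size, i, value)
        _, i, r = best
        # Rows have unit norm, so the Kaczmarz step is x - r * a_i.
        x = [xj - r * a for xj, a in zip(x, A[i])]
    return sum((xj - t) ** 2 for xj, t in zip(x, x_true)) ** 0.5
```

With two adversarial workers out of ten and five workers chosen per row, at least three of the chosen workers are always reliable, so the mode is always the correct residual and the iteration behaves like ordinary RK. Increasing `bad` beyond half of `d` allows wrong modes and illustrates why the block-list matters at higher adversarial rates.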
Figs. 1 and 2 illustrate the impact of the number of used rows on the convergence results of our distributed RK method with and without the block-list. In this example, increasing the number of used rows from to improves the convergence rate for both with and without block-list scenarios, regardless of the magnitude of the adversaries . Fig. 1 demonstrates the convergence results when . From the figures, Alg. 2 with the block-list shows fast convergence over all choices of when the adversarial rate (Fig. 1(a)); the larger the number of used rows , the faster the convergence when the adversarial rate (Fig. 1(c)). Fig. 1(b) reveals that, when , the central worker may use a corrupted step-size, resulting in oscillations around the solution and the error converging to a range of magnitude between and . On the other hand, when , the error goes to 0 after 4000 iterations. As seen in Fig. 1(d), the convergence errors of all values lie in the range of . With a small , the method without the block-list can converge by increasing . However, when the magnitude of the adversaries and the adversarial rate are large (as shown in Fig. 2(d)), the convergence is not guaranteed without a block-list. In comparison, the method with the block-list converges quickly to an accuracy of when is sufficiently large (as seen in Fig. 2(a)). This highlights the importance of using the block-list in an environment with larger outliers.
Figure 3 examines the impact of the number of chosen workers on the convergence, with adversary rates of and . As the number of workers increases from to , the convergence becomes faster in both with and without block-list methods. However, the method using the block-list generally provides better convergence compared to the method without block-list. The block-list method requires extra storage, but it is worth the trade-off in terms of improved convergence. Without the block-list, oscillations can be observed in the convergence when and and when and all .
Figure 4 demonstrates the effect of the adversary rate on the convergence. As the adversary rate increases, the accuracy decreases. Even when the adversary rate is large, the final results using the block-list method are still satisfactory. Without the block-list, when the adversarial rate , the central worker fails to approach the true solution due to the adversarial workers. This again shows the importance and effectiveness of using the block-list, especially in a highly hostile environment with a higher adversarial rate and a higher magnitude of the adversary. In addition, Fig. 5 shows the effect of the number of error categories with the block-list. One can see that our method converges when is big enough for adversarial rates and .
In Fig. 6, we use the Wisconsin (Diagnostic) Breast Cancer data set, which includes data points whose features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image (see [7] for more details). Similar to the setup in [12], we set the simulations in the following way: the collection of data points forms matrix . We then normalize and construct and using a Gaussian distribution to form a consistent system. The convergence results in Fig. 6 show the effectiveness of our method in solving this linear system in a relatively safe environment with an adversarial rate (Fig. 6(a)) and a more hostile environment with an adversarial rate (Fig. 6(b)). When , the method converges within iterations, and as increases, the convergence speed becomes faster. Meanwhile, when , the method converges within iterations, and has the fastest convergence speed among all choices of .
Finally, we have investigated the impact of updating cycles on the accuracy of block-list recognition. In Table 4, we calculate the accuracy of the block-list method when . The two examples in the table show that as increases, the accuracy is higher.
5 Conclusion and future work
It is of great significance for optimization algorithms to be robust and resistant to adversaries. In this work, we propose efficient algorithms based on the mode for solving large-scale linear systems in the presence of adversarial workers. This kind of adversary arises in many real-world applications, e.g. the Internet of Things (IoT). We provide theoretical convergence guarantees, and the theory is supported by our experiments. The methods are capable of handling various levels of adversarial rates. In particular, the method with the block-list is able to provide an accurate estimate of the solution when the adversarial rate , and at the same time, our method can identify the adversarial workers. Our experiments also highlight the impact of several important parameters of the adversaries and of anti-adversary strategies, namely, the involved row number to update the solution per iteration, the number of error categories , the adversary rate , and the number of chosen workers at each iteration.
Our method can also be adapted to solve nonlinear problems. In Fig. 7, we applied the method with the block-list to solve the optimization problem with regularization with . In the distributed setting, the problem can be formulated as , where . Fig. 7(a) shows the distance between the solution using our method and the solution using the LASSO solver from the Python package scikit-learn [22] when choosing different numbers of rows . When the number of used rows , it takes the fewest iterations to converge. All choices of end up with a relative error around with the block-list. The objective function versus the number of iterations is shown in Fig. 7(b).
Additionally, we provide convergence analysis for methods when multiple rows are selected uniformly at each iteration. We also provide the proof of a more general sampling scheme where the row is sampled according to its squared Euclidean norm, although the proof only applies to the case where only a single row is used for computation at each iteration. Interesting avenues for future work include proving convergence under different sampling rules and rigorously generalizing this method to non-linear optimization problems.
Acknowledgements
Authors are listed in alphabetical order. Some of the work for this article was done while Longxiu Huang was Assistant Adjunct Professor and Xia Li was a graduate student at UCLA.
Appendix A Single row convergence without block-list
In this section, we present an algorithm and its corresponding theory for a specific case in which only one row () is utilized for updating the solution at each iteration. Therefore, . The algorithm is based on the assumption that the probability of selecting row for the updating process is proportional to the squared length of the corresponding row. Further details can be found in Algorithm 3. We will then proceed to analyze the convergence of Algorithm 3.
**Theorem A.1.**
Let with and . Assume that we solve via Algorithm 3, then
[TABLE]
where , and is the smallest singular value of .
Additionally, if , we have
[TABLE]
**Remark A.2.**
To provide a quantitative understanding of Theorem A.1, we present several examples in Tables 5 and 6. For simplicity, assume that each error category has the same fraction . Thus, all are equal. Here is the probability that the algorithm chooses the right mode and is the probability that there is a mode. In these two tables, we present the values for and by varying the number of error categories , the number of chosen workers and the adversarial rate . These two tables are generated by solving a linear system with a row-normalized matrix .
As increases, decreases and increases. Therefore, the error bound in (13) decreases with respect to and thus reaches better convergence. When is large enough, . Therefore, when the noise is uniformly random error and there is a mode for the step-size, the mode will be the correct mode. As increases, there is a similar decrease effect and therefore a better convergence.
Proof of Theorem A.1.
To prove (12), at each iteration, we consider solving , ,, with probability , respectively. Therefore, for the -th step, we have the iteration
[TABLE]
or
[TABLE]
for , is the -th row of matrix .
Notice that when , we have
[TABLE]
When , we have
[TABLE]
Combining (14) and (15), we have
[TABLE]
Set . Therefore,
[TABLE]
∎
Appendix B Discussion of the optimal number of used rows
Assume that row is distributed to workers and workers are selected to take part in the computation of row with for all . Additionally, we assume that the probability of each error category () for each row is the same with . Moreover, we have the restricted minimal mode number , and
[TABLE]
Let . To study the relation between and the convergence, consider
[TABLE]
If for all , then . This implies that as increases, increases. When , we have
[TABLE]
and to reach the fastest convergence rate, . One can explore the minimizers for in more general cases, where multiple local minimizers could be present in the landscape.
Appendix C Proof of Lemma 3.1
Proof of Lemma 3.1.
Given a row , the number of combinations where the mode belongs to category with mode count can be divided into two parts: the combinations of workers in category and the combinations of workers in all other categories excluding .
We start by calculating the number of combinations of workers in the remaining categories. This is equivalent to selecting balls from bins subject to constraints that the -th bin contains balls, with a maximum of balls that can be chosen from this bin for all . With these constraints, there are ways to choose balls from bin . By letting vary and considering the bins, the total number of valid combinations is equal to the coefficient of the term in the polynomial . The number of combinations of workers in category with mode count is . Finally, the total number of combinations is given by . This concludes the proof of Lemma 3.1. ∎
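The counting argument can be checked numerically with a short sketch. Because the symbols in the statement are not recoverable from this text, the sketch assumes that the mode must be unique, so each other category contributes at most one worker fewer than the mode count. The names `poly_mul` and `mode_probability` and the argument layout are hypothetical.

```python
from math import comb

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def mode_probability(sizes, k, m, d):
    """P(category k is the unique mode with exactly m of the d chosen workers).

    sizes[j] is the number of workers in category j holding the row;
    d workers are chosen uniformly without replacement.
    """
    if m > sizes[k] or m > d:
        return 0.0
    poly = [1]
    for j, n_j in enumerate(sizes):
        if j != k:
            # Other categories may contribute 0..m-1 workers each.
            poly = poly_mul(poly, [comb(n_j, s) for s in range(m)])
    rest = d - m
    # Coefficient of x^(d-m): ways to fill the remaining slots.
    ways_rest = poly[rest] if rest < len(poly) else 0
    return comb(sizes[k], m) * ways_rest / comb(sum(sizes), d)
```

For category sizes 3, 2 and 1 with three workers chosen, there are 20 equally likely subsets; 9 of them contain exactly two workers from the first category and one from another category, which gives a probability of 0.45 for that case.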
References
- [1] Shmuel Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6:382–392, 1954.
- [2] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- [3] Zhong-Zhi Bai and Wen-Ting Wu. On partially randomized extended Kaczmarz method for solving large sparse overdetermined inconsistent linear systems. Linear Algebra and Its Applications, 578:225–250, 2019.
- [4] Zhong-Zhi Bai and Wen-Ting Wu. On greedy randomized augmented Kaczmarz method for solving large sparse inconsistent linear systems. SIAM Journal on Scientific Computing, 43(6):A3892–A3911, 2021.
- [5] Rawad Bitar, Mary Wootters, and Salim El Rouayheb. Stochastic gradient coding for flexible straggler mitigation in distributed learning. pages 1–5, 2019.
- [6] Yuejie Chi, Yuanxin Li, Huishuai Zhang, and Yingbin Liang. Median-truncated gradient descent: A robust and scalable nonconvex approach for signal estimation. 2019.
- [7] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
- [8] Jonas Geiping, Liam Fowl, W. Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, and Tom Goldstein. Witches’ brew: Industrial scale data poisoning via gradient matching. CoRR, abs/2009.02276, 2020.
