Coded Matrix Computations for D2D-enabled Linearized Federated Learning
Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton

TL;DR
This paper introduces a novel coded matrix computation method for federated learning that reduces communication delays, enhances privacy, and improves local computation speed, especially with sparse data matrices.
Contribution
It presents a straggler-optimal coded matrix computation approach tailored for D2D-enabled federated learning, addressing privacy and efficiency issues.
Findings
Reduces communication delay in federated learning.
Enhances privacy by minimizing D2D data transmissions.
Improves local computation speed with sparse data matrices.
Abstract
Federated learning (FL) is a popular technique for training a global model on data distributed across client devices. Like other distributed training techniques, FL is susceptible to straggler (slower or failed) clients. Recent work has proposed to address this through device-to-device (D2D) offloading, which introduces privacy concerns. In this paper, we propose a novel straggler-optimal approach for coded matrix computations which can significantly reduce the communication delay and privacy issues introduced from D2D data transmissions in FL. Moreover, our proposed approach leads to a considerable improvement of the local computation speed when the generated data matrix is sparse. Numerical evaluations confirm the superiority of our proposed method over baseline approaches.
**Index Terms:** Distributed Computing, Federated Learning, Stragglers, Heterogeneous Edge Computing, Privacy.
1 Introduction
Contemporary computing platforms are hard-pressed to support the growing demands for AI/ML model training at the network edge. While advances in hardware serve as part of the solution, the increasing complexity of data tasks and the growing volumes of data will continue to impede scalability. In this regard, federated learning (FL) has become a popular technique for training machine learning models in a distributed manner [1, 2, 3]. In FL, the edge devices carry out the local computations, and the server collects and aggregates their results to update the global model.
Recent approaches have looked at linearizing the training operations in FL [1, 4]. This is advantageous as it opens the possibility for coded matrix computing techniques that can improve operating efficiency. Specifically, in distributed settings like FL, the overall job execution time is often dominated by slower (or failed) worker nodes, which are referred to as stragglers. Recently, a number of coding theory techniques [5, 6, 7, 8, 9, 10, 11, 12, 13, 14] have been proposed to mitigate stragglers in distributed matrix multiplications. A toy example [5] of such a technique for computing across three clients is to partition as , and to assign them the job of computing , and , respectively. In a linearized FL setting, is the data matrix and is the model parameter vector. While each client has half of the total computational load, the server can recover if any two clients return their results, i.e., the system is resilient to one straggler. If each of clients computes fraction of the whole job of computing , the number of stragglers that the system can be resilient to is upper bounded by [7].
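The toy example above can be made concrete with a short sketch. The symbols are missing from this copy of the text, so the notation here is an assumption: the data matrix `A` is split into two block-columns `A0` and `A1`, the three clients compute `A0^T x`, `A1^T x` and `(A0 + A1)^T x`, and the server recovers `A^T x` from any two of the three results. The numerical values are hypothetical.

```python
from itertools import combinations

def matT_vec(M, x):
    """Return M^T x for a matrix M stored as a list of rows."""
    return [sum(M[r][c] * x[r] for r in range(len(M))) for c in range(len(M[0]))]

# Hypothetical data: A = [A0 | A1], two block-columns with two columns each.
A0 = [[1, 2], [3, 4], [5, 6]]
A1 = [[7, 8], [9, 10], [11, 12]]
x = [1, -1, 2]

# Jobs of the three clients; the third job is the coded (parity) one.
A_sum = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(A0, A1)]
results = {0: matT_vec(A0, x), 1: matT_vec(A1, x), 2: matT_vec(A_sum, x)}

def decode(returned):
    """Recover A^T x = [A0^T x ; A1^T x] from the results of any two clients."""
    if 0 in returned and 1 in returned:      # the coded client straggles
        u0, u1 = returned[0], returned[1]
    elif 0 in returned:                      # client 1 straggles
        u0 = returned[0]
        u1 = [p - a for p, a in zip(returned[2], u0)]
    else:                                    # client 0 straggles
        u1 = returned[1]
        u0 = [p - b for p, b in zip(returned[2], u1)]
    return u0 + u1

expected = results[0] + results[1]
# Try every pair of non-straggling clients.
recovered = [decode({i: results[i] for i in pair})
             for pair in combinations(range(3), 2)]
```

Each client handles half of the columns, yet every entry of `recovered` equals `expected`, which is the one-straggler resilience described above.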
In contemporary edge computing systems, task offloading via device-to-device (D2D) communications has also been proposed for straggler mitigation. D2D-enabled FL has recently been studied [2, 15, 16], but it can add considerable communication overhead and compromise data privacy. In this work, we exploit matrix coding in linearized FL to mitigate these challenges. Our straggler-optimal matrix computation scheme reduces the communication delay significantly compared to the techniques in [7, 9, 12]. Moreover, unlike [7, 9, 12, 13, 17], our scheme allows a client to access only a limited fraction of the data matrix, and thus provides considerable protection against information leakage. In addition, our scheme is particularly well suited to sparse matrices, for which it yields a significant gain in computation speed.
2 Network and Learning Architecture
We consider a D2D-enabled FL architecture consisting of clients, denoted as for . The first of them are active clients (responsible for both data generation and local computation) and the next are passive clients (responsible for local computation only).
Assume that the -th device has local data , where and are the block-rows of full system dataset . Under a linear regression-based ML model, the global loss function is quadratic, i.e., , where the model parameter after iteration is obtained through gradient methods as and is the stepsize. Based on the form of , the FL local model update at each device includes multiplying the local data matrix with parameter . For this reason, recent work has also investigated linearizing non-linear models for FL by leveraging kernel embedding techniques [1]. Thus, our aim is to compute – an arbitrary matrix operation during FL training – in a distributed fashion such that the system is resilient to stragglers. Our assumption is that any active client generates a block-column of matrix , denoted as , , such that
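To illustrate why the local update reduces to matrix-vector products, here is a minimal sketch of gradient descent on a quadratic loss. The formulas are stripped from this copy, so the exact form is an assumption: the loss is taken to be the unnormalized squared error `||D beta - y||^2`, and the names `D_i`, `y_i`, `beta` and `mu` are stand-ins for the per-device block-rows, the model parameter and the stepsize.

```python
def matvec(M, v):
    """Return M v for a matrix M stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matT_vec(M, v):
    """Return M^T v for a matrix M stored as a list of rows."""
    return [sum(M[r][c] * v[r] for r in range(len(M))) for c in range(len(M[0]))]

def local_gradient(D_i, y_i, beta):
    """Gradient of ||D_i beta - y_i||^2, i.e. 2 * D_i^T (D_i beta - y_i)."""
    residual = [p - t for p, t in zip(matvec(D_i, beta), y_i)]
    return [2 * g for g in matT_vec(D_i, residual)]

def global_step(blocks, beta, mu):
    """One gradient step: sum the local gradients, then update beta."""
    grad = [0.0] * len(beta)
    for D_i, y_i in blocks:
        grad = [g + l for g, l in zip(grad, local_gradient(D_i, y_i, beta))]
    return [b - mu * g for b, g in zip(beta, grad)]

# Hypothetical data held by two devices; the exact minimizer is beta = [1, 2].
blocks = [
    ([[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0]),
    ([[1.0, 1.0]], [3.0]),
]
beta = [0.0, 0.0]
for _ in range(200):
    beta = global_step(blocks, beta, mu=0.1)
```

Every operation inside `local_gradient` is a product of a data block (or its transpose) with a vector, which is the kind of computation the coded scheme distributes.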
[TABLE]
In our approach, every client is responsible for computing the product of a coded submatrix (a linear combination of some block-columns of the data matrix) and the parameter vector. In practice, stragglers arise from variations in computing speed or from failures experienced by the clients at particular times [8, 17, 18]. As in [15, 16, 19], we assume that every device has a set of trusted neighbor clients to which it can transmit its data via D2D communications. The passive clients receive coded submatrices only from active clients. Unlike the approaches in [1, 3, 4, 20], we assume that the server cannot access any uncoded or coded local data generated at the edge devices; it is responsible only for transmitting the parameter vector and for decoding the final result once the fastest clients return their computed submatrix-vector products.
3 Homogeneous Edge Computing
Here we assume that each active client generates an equal number of columns of the data matrix (i.e., all block-columns in (1) have the same size) and that all clients have the same rated computation speed. In this scenario, we propose a distributed matrix-vector multiplication scheme, given in Alg. 1, which is resilient to any stragglers.
The main idea is that any active client generates , for and sends it to another active client , if ( modulo . Here we set , thus, any data matrix needs to be sent to only other clients. Then, active client computes a linear combination of (indices modulo ) where the coefficients are chosen randomly from a continuous distribution. Next, active client sends another random linear combination of the same submatrices to (a passive client), when . Note that all clients receive the vector from the server. Now the job of each client is to compute the product of their respective coded submatrix and the vector . Once the fastest clients finish and send their computation results to the server, it decodes using the corresponding random coefficients. The following theorem establishes the resiliency of Alg. 1 to stragglers.
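The assignment and decoding steps described above can be sketched as follows. Several specifics are missing from this copy, so the sketch rests on labeled assumptions: `k_a` active and `s` passive clients with `s <= k_a`, a cyclic window of `delta = s + 1` block-columns per client, passive client `k_a + j` reusing the window of active client `j`, and unknowns of the form `A_j^T x`. Large random integers with exact rational arithmetic stand in for coefficients drawn from a continuous distribution, so that decoding can be checked exactly.

```python
import random
from fractions import Fraction
from itertools import combinations

def encode_coefficients(k_a, s, rng):
    """Coefficient rows of all k_a + s clients over the k_a block-columns."""
    delta = s + 1                                  # assumed window size
    rows = []
    for i in range(k_a + s):
        base = i if i < k_a else i - k_a           # passive client reuses an active window
        row = [Fraction(0)] * k_a
        for j in range(delta):
            row[(base + j) % k_a] = Fraction(rng.randint(1, 10**6))
        rows.append(row)
    return rows

def solve(G, rhs):
    """Solve G u = rhs exactly by Gauss-Jordan elimination (rhs rows are vectors)."""
    n = len(G)
    M = [list(G[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

k_a, s = 4, 2                                      # hypothetical system size
rng = random.Random(0)
# Hypothetical unknowns u_j = A_j^T x, given directly as short vectors.
unknowns = [[Fraction(rng.randint(-9, 9)) for _ in range(3)] for _ in range(k_a)]
G = encode_coefficients(k_a, s, rng)
# By linearity, client i returns (sum_j G[i][j] A_j)^T x = sum_j G[i][j] u_j.
results = [[sum(G[i][j] * unknowns[j][t] for j in range(k_a)) for t in range(3)]
           for i in range(k_a + s)]

# The server decodes from every possible set of k_a fastest clients.
all_recovered = all(
    solve([G[i] for i in fastest], [results[i] for i in fastest]) == unknowns
    for fastest in combinations(range(k_a + s), k_a)
)
```

Under these assumptions, `all_recovered` being true means that any `s` clients may straggle without affecting recovery.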
**Theorem 1.**
Assume that a system has clients including active and passive clients. If we assign the jobs according to Alg. 1, we achieve resilience to any stragglers.
Proof.
In order to recover , according to (1), we need to decode all vector unknowns, ; we denote the set of these unknowns as . Now we choose an arbitrary set of clients each of which corresponds to an equation in terms of of those unknowns. Denoting the set of equations as , we have .
Now we consider a bipartite graph , where any vertex (equation) in is connected to the vertices (unknowns) in that participate in the corresponding equation. Thus, each vertex in has a neighborhood of cardinality in . Our goal is to show that there exists a perfect matching among the vertices of and . We argue this using Hall’s marriage theorem [21], for which we need to show that for any , the cardinality of the neighborhood of , denoted as , is at least as large as . Thus, for , we need to show that .
Case 1: First we consider the case that . We assume that where . Now according to Alg. 1, the participating unknowns are shifted in a cyclic manner among the equations. If we choose any clients out of the first clients , according to the proof of the cyclic scheme in Appendix C of [8], the minimum number of total participating unknowns is , where . Now according to Alg. 1, the same unknowns participate in two different equations corresponding to two different clients, and , where . Thus, for any , we have
[TABLE]
Case 2: Now we consider the case where , . We need to find the minimum number of unknowns which participate in any set of equations. Now, the same unknowns participate in two different equations corresponding to two different clients, and , where . Thus, the additional equations correspond to at least additional unknowns until the total number of participating unknowns is . Therefore, in this case
[TABLE]
Thus, for any (where ), we have shown that . So, there exists a perfect matching among the vertices of and according to Hall’s marriage theorem.
Now we consider the largest matching where vertex is matched to vertex , which indicates that participates in the equation corresponding to . Let us consider a system matrix where row corresponds to the equation associated to . Now we replace this row by , a unit row-vector of length whose -th entry is 1 and whose other entries are 0. Thus we obtain a matrix in which each row has exactly one non-zero entry, equal to 1. Since we have a perfect matching, this matrix also has exactly one non-zero entry in every column. It is therefore a permutation of the identity matrix and is full rank. The determinant of the system matrix is a polynomial in its non-zero coefficients, and the argument above exhibits one assignment of values for which it does not vanish, so this polynomial is not identically zero. By the Schwartz-Zippel lemma [22], the matrix is then full rank with high probability for random choices of the non-zero entries (with probability 1 when they are drawn from a continuous distribution). Thus, the server can recover all unknowns from any clients, hence the system is resilient to any stragglers. ∎
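The combinatorial core of the proof can be checked by brute force for small systems. The sketch below tests Hall's condition for every set of at most `k_a` equations, assuming (since the parameter values are missing from this copy) that each equation involves a cyclic window of `delta` unknowns, that `delta` defaults to `s + 1`, and that passive client `k_a + j` shares the window of active client `j`. All function names are hypothetical.

```python
from itertools import combinations

def supports(k_a, s, delta):
    """Set of unknowns (block-column indices) appearing in each client's equation."""
    return [{(i % k_a + j) % k_a for j in range(delta)} for i in range(k_a + s)]

def hall_condition_holds(k_a, s, delta=None):
    """Check |N(E')| >= |E'| for every set E' of at most k_a equations."""
    if delta is None:
        delta = s + 1                      # assumed window size
    sup = supports(k_a, s, delta)
    for m in range(1, k_a + 1):
        for subset in combinations(range(k_a + s), m):
            neighborhood = set().union(*(sup[i] for i in subset))
            if len(neighborhood) < m:
                return False
    return True

ok_default = hall_condition_holds(6, 3)            # windows of 4 unknowns
ok_narrow = hall_condition_holds(6, 3, delta=3)    # windows that are too narrow
```

With `k_a = 6` and `s = 3`, the check passes for windows of four unknowns and fails for windows of three: the first three active clients and their passive copies then give six equations that involve only five unknowns.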
**Example 1.**
Consider a homogeneous system of active clients and passive clients. According to Alg. 1, , and client () has a random linear combination of and as shown in Fig. 1. Thus, according to Theorem 1, this system is resilient to stragglers. Note that our scheme requires any active client to send its local data matrix to only up to other clients, and thus incurs a significantly lower communication cost than the approaches in [7, 9].
**Remark 1.**
In comparison to [7, 9, 13], our proposed approach is particularly well suited to sparse data matrices, i.e., those in which most entries are zero. The approaches in [7, 9, 13] assign dense linear combinations of the submatrices, which can destroy the inherent sparsity of the data matrix and slow down the clients' computations. In contrast, our approach assigns linear combinations of a limited number of submatrices, which preserves much of the sparsity and leads to faster computation.
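The effect described in this remark is easy to see numerically. The following sketch compares the density of a combination of a few sparse blocks with that of a combination of all blocks; the block sizes, the 98% sparsity level, the window of three blocks and the helper names are all hypothetical choices, not values taken from the paper.

```python
import random

def random_sparse_block(rows, cols, zero_fraction, rng):
    """Random block in which each entry is zero with probability zero_fraction."""
    return [[0.0 if rng.random() < zero_fraction else rng.uniform(-1, 1)
             for _ in range(cols)] for _ in range(rows)]

def combine(blocks, rng):
    """Random linear combination of equally sized blocks."""
    coeffs = [rng.uniform(1, 2) for _ in blocks]
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(c * B[r][q] for c, B in zip(coeffs, blocks)) for q in range(cols)]
            for r in range(rows)]

def density(M):
    """Fraction of non-zero entries in M."""
    total = sum(len(row) for row in M)
    return sum(1 for row in M for v in row if v != 0.0) / total

rng = random.Random(1)
blocks = [random_sparse_block(60, 40, 0.98, rng) for _ in range(10)]

density_few = density(combine(blocks[:3], rng))   # combination of 3 blocks
density_all = density(combine(blocks, rng))       # dense combination of all 10 blocks
```

With independent 98%-sparse blocks, a combination of three blocks is expected to be about 1 - 0.98^3 ≈ 6% dense, against about 1 - 0.98^10 ≈ 18% for a combination of all ten, so the matrix-vector product over the former touches roughly a third as many non-zero entries.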
4 Heterogeneous Edge Computing
In this section, we extend our approach in Alg. 1 to a heterogeneous system in which the clients may have different data generation capabilities and different computation speeds. We assume that we have different types of devices in the system, with client type . Moreover, we assume that any active client generates columns of data matrix and any client has a computation speed , where is of client type and is an integer. Thus, a higher indicates a “stronger” type client which can process at a times higher computation speed than the “weakest” type device, where is the number of the assigned columns and is the number of processed columns per unit time in the “weakest” type device. Note that and all lead us to the homogeneous system discussed in Sec. 3 where and .
Now, we have clients including active and passive clients in the heterogeneous system. As in the homogeneous system, we assume that the number of passive clients of any type is less than the number of active clients of the same type. Next, without loss of generality, we sort the indices of the active clients so that, if , for . We sort the passive clients similarly so that if , for . Now if a client is of client type , it requires the same time to process block-columns (each consisting of columns) of as the “weakest” device requires to process such block-column. Moreover, if it is an active client, it also generates columns of data matrix . Thus, client can be thought of as a collection of homogeneous clients of the “weakest” type, where each of the active “weakest” clients generates equally columns of and each of the “weakest” clients processes equally columns.
**Theorem 2.**
(a) A heterogeneous system of active and passive clients of different types can be considered as a homogeneous system of active and passive clients of the “weakest” type. Next (b) if the jobs are assigned according to Alg. 1 in the modified homogeneous system of “weakest” clients, the system can be resilient to such clients.
Proof.
Each (generated in ) in (1) is a block-column consisting of columns of when client is of client type . Thus, for any , we can partition as , where and each is a block-column consisting of columns of , . Thus using (1), we can write , where . Now from the matrix generation perspective, active clients in a heterogeneous system generating block-columns can be considered as the same as active clients in a homogeneous system generating one block-column each.
Similarly, any client of type can process columns in the same time that the “weakest” type device takes to process columns. Thus, from the computation speed perspective, active clients and passive clients in the heterogeneous system can be thought of as active clients and passive clients, respectively, in a homogeneous system by assigning coded block-columns to each client. This completes the proof of part (a). Part (b) follows directly from Theorem 1 when we have active and passive clients. ∎
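The reduction in part (a) can be sketched as a simple bookkeeping step. Because the type-to-speed relation is missing from this copy, the sketch assumes each real client carries an integer weight: the number of “weakest” virtual clients it is equivalent to. The weights below are illustrative and are not taken from Fig. 2, and the function names are hypothetical.

```python
def expand_to_virtual_clients(active_weights, passive_weights):
    """Map each real client to the indices of its virtual 'weakest' clients."""
    mapping, next_id = {}, 0
    for role, weights in (("active", active_weights), ("passive", passive_weights)):
        for idx, w in enumerate(weights):
            mapping[(role, idx)] = list(range(next_id, next_id + w))
            next_id += w
    return mapping

def tolerates(straggler_weights, passive_weights):
    """True if the straggling real clients cover at most as many virtual
    clients as there are virtual passive clients (the resilience of part (b))."""
    return sum(straggler_weights) <= sum(passive_weights)

# Hypothetical system: two stronger active clients (weight 2), two weakest
# active clients (weight 1) and two weakest passive clients.
active_weights = [2, 2, 1, 1]
passive_weights = [1, 1]

mapping = expand_to_virtual_clients(active_weights, passive_weights)
k_virtual = sum(active_weights)     # active clients of the equivalent homogeneous system
s_virtual = sum(passive_weights)    # passive clients of the equivalent homogeneous system
```

In this hypothetical system the equivalent homogeneous system has six active and two passive virtual clients, so it tolerates two weight-1 stragglers or one weight-2 straggler, but not both at once.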
**Remark 2.**
The heterogeneous system is resilient to block-column processing. The number of straggler clients that the system is resilient to can vary depending on the client types.
**Example 2.**
Consider the example in Fig. 2 consisting of clients. There are active clients, which are responsible for data matrix generation. Let us assume that and are type clients, which generate twice as many columns of as and , which are type 0 clients. The jobs are assigned to all clients (including passive clients) according to Fig. 2(b). It can be verified that this scheme is resilient to two type 0 clients or one type client.
5 Numerical Evaluation
In this section, we compare the performance of our proposed approach against competing methods [7, 9, 13] on several metrics for distributed matrix computations from a federated learning perspective. Note that the approaches in [1, 4] require the edge devices to transmit some coded columns of the data matrix to the server, which is not aligned with our assumptions. In addition, the approaches in [8] and [11] do not follow the same network learning architecture as ours. Therefore, we do not include them in our comparison.
Communication Delay: We consider a homogeneous system of clients, each of which is a t2.small machine in an AWS (Amazon Web Services) cluster. Here, each of active clients generates of size , thus the size of is . The server sends the parameter vector of length to all clients including passive clients. Once the preprocessing and computations are carried out according to Alg. 1, the server recovers as soon as it receives results from the fastest clients, thus the system is resilient to any stragglers.
Table 1 compares the corresponding communication delays (caused by data matrix transmission) of the different approaches. The approaches in [7, 9] require all active clients to transmit their generated submatrices to all other edge devices. Thus, they incur much higher communication delay than our proposed method, which requires an edge device to transmit data to only up to other devices. Note that the methods in [13, 17] incur communication delays similar to ours; however, they have other limitations in terms of privacy and computation time, as discussed next.
Privacy: Transmitting local data matrices to other edge devices introduces information leakage in FL. To limit this leakage, any particular client should have access to only a limited portion of the whole data matrix. Consider the heterogeneous system in Example 2, where the clients are honest but curious. In this scenario, the approaches in [7, 9, 13, 17] would allow clients to access the whole data matrix. In our approach, as shown in Fig. 2, clients and only have access to -th fraction of and clients , and have access to -th fraction of . This provides significant protection against privacy leakage.
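The communication and privacy comparison can be quantified for a homogeneous system with a short sketch. It assumes, as the specific numbers are missing from this copy, cyclic windows of `s + 1` block-columns and passive clients that receive only coded submatrices; the system size and function names are hypothetical.

```python
from fractions import Fraction

def window(i, k_a, s):
    """Block-columns that influence the coded submatrix held by client i."""
    delta = s + 1                              # assumed window size
    return {(i % k_a + j) % k_a for j in range(delta)}

def access_fractions(k_a, s):
    """Fraction of the block-columns each client's submatrix depends on."""
    return [Fraction(len(window(i, k_a, s)), k_a) for i in range(k_a + s)]

def raw_receivers(k_a, s):
    """Active clients, other than the generator, that need block-column j uncoded."""
    return [sorted(i for i in range(k_a) if i != j and j in window(i, k_a, s))
            for j in range(k_a)]

def full_replication_receivers(k_a, s):
    """Baseline in which every block-column is sent to all other devices."""
    return [sorted(i for i in range(k_a + s) if i != j) for j in range(k_a)]

k_a, s = 5, 2                                  # hypothetical system size
fractions_seen = access_fractions(k_a, s)
ours = raw_receivers(k_a, s)
baseline = full_replication_receivers(k_a, s)
```

For this hypothetical system, each client's submatrix depends on 3/5 of the block-columns and each block-column is sent uncoded to two other active clients, compared with six receivers per block-column under full replication.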
Product Computation Time for Sparse Matrices: Consider a system with clients where and . We assume that is sparse, where each active client generates a sparse submatrix of size . We consider three different scenarios with three different sparsity levels for where randomly chosen , and entries of are zero. In Table 2, we compare our proposed Alg. 1 against the other methods in terms of per-client product computation time (the time a client requires to compute its assigned submatrix-vector product). The methods in [7, 9, 13, 17] assign linear combinations of submatrices to the clients. Hence, the inherent sparsity of the data matrix is destroyed in the encoded submatrices. In contrast, our approach combines only submatrices to obtain the coded submatrices. Thus, the clients require significantly less time to finish their respective tasks than in [7, 9, 13, 17].
References
- [1] Saurav Prakash, Sagar Dhakal, Mustafa Riza Akdeniz, Yair Yona, Shilpa Talwar, Salman Avestimehr, and Nageen Himayat, “Coded computing for low-latency federated learning over wireless edge networks,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 233–250, 2020.
- [2] Su Wang, Seyyedali Hosseinalipour, Maria Gorlatova, Christopher G. Brinton, and Mung Chiang, “UAV-assisted online machine learning over multi-tiered networks: A hierarchical nested personalized federated learning approach,” IEEE Trans. on Net. and Serv. Manag., 2022.
- [3] Jer Shyuan Ng, Wei Yang Bryan Lim, Zehui Xiong, Xianbin Cao, Dusit Niyato, Cyril Leung, and Dong In Kim, “A hierarchical incentive design toward motivating participation in coded federated learning,” IEEE J. Sel. Areas Commun., vol. 40, no. 1, pp. 359–375, 2022.
- [4] Sagar Dhakal, Saurav Prakash, Yair Yona, Shilpa Talwar, and Nageen Himayat, “Coded federated learning,” in IEEE Globecom Workshop, 2019, pp. 1–6.
- [5] Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Trans. on Info. Th., vol. 64, no. 3, pp. 1514–1529, 2018.
- [6] Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover, “Short-dot: Computing large linear transforms distributedly using coded short dot products,” in Proc. of Adv. in Neur. Inf. Proc. Syst., 2016, pp. 2100–2108.
- [7] Qian Yu, Mohammad Maddah-Ali, and Salman Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” in Proc. of Adv. in Neur. Inf. Proc. Syst., 2017, pp. 4403–4413.
- [8] Anindya Bijoy Das and Aditya Ramamoorthy, “Coded sparse matrix computation schemes that leverage partial stragglers,” IEEE Trans. on Info. Th., vol. 68, no. 6, pp. 4156–4181, 2022.
