On Model Coding for Distributed Inference and Transmission in Mobile Edge Computing Systems
Jingjing Zhang, Osvaldo Simeone

TL;DR
This paper explores how coding model information in mobile edge computing can reduce latency in distributed linear inference tasks despite non-deterministic EN processing times.
Contribution
It provides an information-theoretic analysis showing that coding model data can significantly lower total latency in distributed inference systems.
Findings
Coding reduces overall latency in distributed inference.
Cooperative transmission benefits are limited by coding.
Coding is crucial for latency reduction despite non-deterministic EN times.
Abstract
Consider a mobile edge computing system in which users wish to obtain the result of a linear inference operation on locally measured input data. Unlike the offloaded input data, the model weight matrix is distributed across wireless Edge Nodes (ENs). ENs have non-deterministic computing times, and they can transmit any shared computed output back to the users cooperatively. This letter investigates the potential advantages obtained by coding model information prior to ENs' storage. Through an information-theoretic analysis, it is concluded that, while generally limiting cooperation opportunities, coding is instrumental in reducing the overall computation-plus-communication latency.
Taxonomy
Topics: IoT and Edge/Fog Computing · Privacy-Preserving Technologies in Data · Age of Information Optimization
The authors are with the Department of Informatics at King’s College London, UK (emails: [email protected], [email protected]). The authors have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 725731).
I Introduction
Introduced by the European Telecommunications Standards Institute (ETSI), the concept of mobile edge computing is by now established as a pillar of the 5G network architecture as an enabler of computation-intensive applications on mobile devices [1]. As illustrated in Fig. 1, with mobile edge computing, users offload local data to edge servers connected to wireless Edge Nodes (ENs). These in turn carry out the necessary computations and return the desired output to the users on the wireless downlink. Most academic work on mobile edge computing has focused on the complex resource allocation problem of orchestrating computing and communication resources at the mobiles and at the ENs (see, e.g., [2] and references therein).
Papers in the line of work introduced above either assume generic applications characterized by given input-output rate requirements (e.g., [2]) or optimize the partition of the computing graph of the applications between local and edge computing. Moreover, this body of research has shown the importance of jointly designing the physical-layer transmission strategy and the computing schedule. Importantly, computing the same output at multiple ENs, while generally increasing the computation time, enables cooperation opportunities in the downlink transmission from the ENs to the users [2].
More recently, in a parallel development in the information-theoretic literature, it has been demonstrated that, if the computation of interest has specific properties, coding of either inputs or outputs can help decrease the overall latency. In particular, reference [3] demonstrated the advantages of Maximum Distance Separable (MDS) coding of input matrices in reducing the latency for distributed matrix-vector multiplication in master-worker systems. The impact of coding computational outputs was instead investigated in [4] for Map-Reduce computing tasks.
In this letter, we investigate the role of coding in the mobile edge computing system illustrated in Fig. 1. In the system, each user wishes to compute a linear inference on a local data vector given a network-side model matrix via offloading. The matrix is generally large and hence it requires splitting across the servers of multiple ENs. Linear operations are practically important, e.g., for the implementation of recommendation systems based on collaborative filtering [5] or similarity search based on the cosine distance [6]. In both cases, the user-side data is a vector that embeds the user profile [5] or a query [6], and the goal is to search through the matrix of all items on the basis of the inner products between each row of the matrix and the user data vector. This letter presents an information-theoretic framework that enables the potential advantages of model coding and associated performance trade-offs to be quantified.
II System Model and Performance Criteria
II-A System Model
We consider the distributed edge computing model illustrated in Fig. 1, where $N$ users are connected to $K$ ENs through a shared wireless channel. For a given input vector provided by a user, the system aims at computing a linear inference operation, namely the product of the weight, or model, matrix with the input vector, where the model matrix is static for a sufficiently long period of time. Each EN can store a number of bits equivalent to a given fraction of the rows of the model matrix. Storage of information from the model matrix takes place offline given the static nature of the model.
Each user $n$, with $n\in[N]$, has its own personal data vector, which is collected online by the user, and it wishes to obtain the result $\boldsymbol{y}_n$ of the linear operation on this vector. The task is offloaded to the ENs as shown in Fig. 1. First, the ENs acquire the user data through uplink transmission. Second, the ENs carry out computations on the received users’ data and on the stored data about the model matrix. Finally, via downlink communication, the ENs deliver the results of the computations to the users, so that each user $n$ can recover the required output $\boldsymbol{y}_n$.
In this letter, we make the simplifying assumption that the time needed to upload the users’ data to all ENs is fixed and that each EN receives the data of all users. This allows us to focus on the challenging problem of jointly designing offline model coding and storage at the ENs, as well as online edge computing and downlink transmission phases. The problem is formulated as follows.
Model Coding and Storage: In an offline phase, the model matrix is linearly encoded [7] by multiplying it with a coding matrix, which yields a set of coded rows. Each EN $k$ stores a subset of these coded rows, subject to its storage constraint.
Edge Computing: In the online phase, each EN computes inner products between all users’ data received in the uplink and the coded model rows it has stored. As in [8], the order in which such computations are carried out at EN $k$ is specified by a scheduling vector $\mathbf{s}_k$, each element of which is selected from the set of coded rows available at EN $k$. In particular, each EN $k$ starts from the coded row listed first in $\mathbf{s}_k$ and continues with the subsequent entries in order. As in the literature on distributed computing, we refer to each computation as an Intermediate Value (IV) [9]. A computation policy is hence defined by the coding matrix, by the scheduling matrix, whose $k$th column is the vector $\mathbf{s}_k$, as well as by a stopping criterion, which is used by the ENs to decide when to stop the computing phase and start downlink transmission.
To formulate the stopping criterion, we define $\mathbf{m}(t)=(m_1(t),\ldots,m_K(t))$ as the vector that indicates how many IVs have been computed by the ENs by time $t$, with $t=0$ indicating the start of the computing phase and $m_k(t)$ denoting the number of computations completed at EN $k$. Note that each $m_k(t)$ cannot exceed the number of coded rows stored at EN $k$ due to the storage constraint. We also define $\mathcal{I}_k(m_k,\mathbf{s}_k)$ as
[TABLE]
the set of the first $m_k$ IVs computed by EN $k$ for a given choice of the scheduling vector $\mathbf{s}_k$. A computation vector $\mathbf{m}=(m_1,\ldots,m_K)$ is said to be feasible if the union of all computed IVs across all ENs contains enough information to enable the recovery of all the outputs $\{\boldsymbol{y}_n\}_{n=1}^{N}$, i.e., if the conditional entropy $H\big(\{\boldsymbol{y}_{n}\}_{n=1}^{N}\,\big|\,\bigcup_{k\in[K]}\mathcal{I}_{k}(m_{k},\mathbf{s}_{k})\big)$ equals zero. Note that, if $\mathbf{m}$ is feasible, then any $\mathbf{m}'\geq\mathbf{m}$, where the inequality is element-wise, is also feasible.
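The feasibility notion above can be illustrated with a minimal Python sketch. It assumes that rows are identified by integer indices, that an uncoded policy is feasible when every model row has been computed somewhere, and that an MDS-coded policy is feasible when at least as many distinct coded rows as model rows have been computed. The function names and the toy schedule are hypothetical and are not taken from the letter.

```python
def computed_rows(m_k, s_k):
    """Indices of the first m_k rows in the schedule s_k of one EN."""
    return set(s_k[:m_k])


def union_of_computed_rows(m, schedule):
    """Union over all ENs of the rows computed under computation vector m."""
    covered = set()
    for m_k, s_k in zip(m, schedule):
        covered |= computed_rows(m_k, s_k)
    return covered


def feasible_uncoded(m, schedule, num_rows):
    """Assumed uncoded rule: every model row is computed by some EN."""
    return union_of_computed_rows(m, schedule) == set(range(num_rows))


def feasible_mds(m, schedule, num_rows):
    """Assumed MDS rule: num_rows distinct coded rows are enough."""
    return len(union_of_computed_rows(m, schedule)) >= num_rows


# Toy uncoded schedule: 3 ENs, 3 model rows, 2 rows stored per EN.
uncoded_schedule = [[0, 1], [1, 2], [2, 0]]
print(feasible_uncoded((1, 1, 1), uncoded_schedule, 3))
print(feasible_uncoded((2, 1, 0), uncoded_schedule, 3))

# Toy coded schedule: 3 ENs with 2 distinct coded rows each, 4 model rows.
coded_schedule = [[0, 1], [2, 3], [4, 5]]
print(feasible_mds((2, 2, 0), coded_schedule, 4))
```

In this sketch, increasing any entry of a feasible vector can only enlarge the union of computed rows, which matches the monotonicity property noted above.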
A stopping criterion for a given computation policy is defined by a stopping set of feasible computation vectors, in the sense that the ENs stop computing at the first time $t$ at which the computation vector $\mathbf{m}(t)$ belongs to the stopping set, i.e.,
[TABLE]
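As a result, the IVs computed at each EN $k$ by the end of the edge computing phase are the first entries of its schedule $\mathbf{s}_k$, up to the number of computations it has completed by the stopping time. As a simple example, a computation policy may require that all ENs complete all local computations.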
Downlink Communication: In this phase, the ENs send the computed IVs to the users on the downlink so that each user $n$ can recover the desired output $\boldsymbol{y}_n$. To this end, the ENs apply conventional one-shot linear precoding as in [10, 11]. Accordingly, in each downlink transmission block, the signal transmitted by each EN $k$ is a linear combination of symbols, each of which encodes a subset of the IVs computed at EN $k$ and is weighted by the corresponding beamforming coefficient. All the ENs that have computed the same IVs can transmit them cooperatively via joint beamforming [10, 11]. We impose a per-EN power constraint. In each downlink block, the signal received by each user $n$ is given as
[TABLE]
Here, the received signal is the superposition of the signals transmitted by all ENs as defined above, each weighted by the channel coefficient from the corresponding EN $k$ to user $n$, plus unit-power additive complex Gaussian noise. The fading channels are drawn from a continuous distribution, constant in each block, and known to all ENs.
II-B Performance Analysis
As in [12], we assume that the computing time needed by each EN $k$ to perform $m_k$ computations is given as
[TABLE]
where the first term, independent across ENs, is an exponential random variable that models the time needed for setup at each EN $k$; and the second term accounts for the (deterministic) time required for each computation. Under model (4), given a stopping set, the random duration in (2) of the computation phase can be written as the optimization
[TABLE]
where we have defined the stopping vector for a given vector of setup times as
[TABLE]
This follows since the time needed to realize a computation vector is determined by the last EN to complete the computations assigned to it by that vector.
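As a hedged illustration of the computing-time model and of the stopping time, the following Python sketch assumes that the per-EN computing time equals an exponential setup time plus a fixed time `tau` per computation, that an EN assigned zero computations contributes no delay, and that the stopping time is the smallest realization time over the vectors in the stopping set. The names `computing_time`, `time_to_realize`, `stopping_time`, and `tau` are assumptions introduced here.

```python
import random


def computing_time(setup, m_k, tau):
    """Assumed model: setup time plus tau per computation (0 if no work)."""
    return 0.0 if m_k == 0 else setup + m_k * tau


def time_to_realize(setups, m, tau):
    """Time at which every EN k has completed its m_k computations."""
    return max(computing_time(t0, m_k, tau) for t0, m_k in zip(setups, m))


def stopping_time(setups, stopping_set, tau):
    """First time at which some vector in the stopping set is realized."""
    return min(time_to_realize(setups, m, tau) for m in stopping_set)


def sample_setups(num_ens, mean_setup, rng):
    """Independent exponential setup times with the given mean."""
    return [rng.expovariate(1.0 / mean_setup) for _ in range(num_ens)]


# Toy stopping set: wait for any two of three ENs to finish two computations.
toy_stopping_set = [(2, 2, 0), (2, 0, 2), (0, 2, 2)]
fixed_setups = [0.5, 2.0, 0.1]
print(stopping_time(fixed_setups, toy_stopping_set, tau=1.0))

# The same policy evaluated on one random draw of the setup times.
rng = random.Random(0)
print(stopping_time(sample_setups(3, 1.0, rng), toy_stopping_set, tau=1.0))
```

With the fixed setup times above, the two fastest ENs are the first and the third, so the stopping time is set by the slower of those two.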
In the high-SNR regime of interest, we evaluate the downlink phase duration by normalizing it by the time needed to deliver one IV to all users in the absence of mutual interference. Hence, the normalized communication delay is given as
[TABLE]
For comparison, we also normalize the computation time by the time to compute one IV for all users, obtaining the normalized computation delay. Finally, the average total normalized latency of the edge computing system is given as
[TABLE]
where the remaining parameter is the ratio between the average time (in seconds) needed to compute one IV at an EN and the average time needed to transmit one IV on an interference-free channel.
III Uncoded vs. Coded Computing
III-A Uncoded Storage and Computing (UC)
Consider first a standard uncoded strategy whereby each EN stores its allotted fraction of rows directly from the model matrix. Following, e.g., [8], the scheduling matrix is designed in a cyclic manner, so that each row is repeated the same number of times across all ENs. As an example, if , and , then the scheduling vectors are , , and . The stopping set is defined as the set of all feasible computation vectors, so that every vector in the stopping set ensures that each IV has been computed by some EN.
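A cyclic schedule of this kind can be sketched in Python as follows. The sketch assumes that the number of rows is a multiple of the number of ENs and that consecutive ENs start at offsets spaced evenly around the row indices; this is one plausible reading of a cyclic design, not necessarily the construction in [8]. The function name and the numerical values are hypothetical.

```python
from collections import Counter


def cyclic_schedule(num_ens, num_rows, rows_per_en):
    """Schedule in which EN k starts at row k * (num_rows // num_ens)
    and proceeds cyclically through rows_per_en consecutive rows."""
    if num_rows % num_ens != 0:
        raise ValueError("this sketch assumes num_rows is a multiple of num_ens")
    shift = num_rows // num_ens
    return [
        [(k * shift + j) % num_rows for j in range(rows_per_en)]
        for k in range(num_ens)
    ]


def repetitions(schedule):
    """How many ENs hold each row under the given schedule."""
    return Counter(row for s_k in schedule for row in s_k)


# Illustrative values only: 3 ENs, 6 rows, 4 rows stored per EN.
toy_schedule = cyclic_schedule(num_ens=3, num_rows=6, rows_per_en=4)
for k, s_k in enumerate(toy_schedule):
    print(k, s_k)
print(dict(repetitions(toy_schedule)))
```

In this toy setting every row ends up at the same number of ENs, which is the property the cyclic design is meant to provide.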
For each IV and a given feasible computation vector, we define the redundancy of the IV as the number of times that the IV has been computed across the ENs, i.e., the number of ENs whose set of computed IVs contains it. Since every IV must be computed at least once, the redundancy is at least one and at most $K$. To deliver a single IV computed at $r$ ENs, cooperative Zero-Forcing (ZF) precoding allows $\min(r,N)$ users to be served at the same time at the maximum high-SNR rate, where $\min(a,b)$ represents the minimum between the two arguments $a$ and $b$. This is done by choosing the precoding matrix across the transmitting ENs to equal the inverse of the (square) channel matrix, upon appropriate power scaling. Hence, the normalized downlink latency (7) for this IV decreases in proportion to the number of users served simultaneously [10, 11]. As a result, the total latency can be characterized as follows.
Proposition 1
With the described uncoded strategy, the average total normalized latency (8) is given as
[TABLE]
where the stopping vector is given in (6), and the expectation is taken over the distribution of the random vector of setup times.
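The communication part of this latency can be sketched numerically. The Python snippet below assumes that an IV with redundancy `r` takes `N / min(r, N)` normalized time units to reach all `N` users, which follows from serving `min(r, N)` users at once under the normalization of Section II-B; this per-IV expression is a reconstruction under that assumption, and the function names are hypothetical.

```python
from collections import Counter


def redundancy_per_row(m, schedule):
    """Number of ENs that have computed each row under computation vector m."""
    counts = Counter()
    for m_k, s_k in zip(m, schedule):
        counts.update(s_k[:m_k])
    return counts


def zf_latency_per_iv(r, num_users):
    """Assumed normalized time to deliver one IV held by r ENs to all users."""
    return num_users / min(r, num_users)


def uncoded_downlink_latency(m, schedule, num_users):
    """Sum of per-IV latencies over all rows computed at least once."""
    counts = redundancy_per_row(m, schedule)
    return sum(zf_latency_per_iv(r, num_users) for r in counts.values())


# Toy cyclic schedule with 3 ENs and 3 rows, serving 2 users.
toy = [[0, 1], [1, 2], [2, 0]]
print(uncoded_downlink_latency((1, 1, 1), toy, num_users=2))
print(uncoded_downlink_latency((2, 2, 2), toy, num_users=2))
```

The second call shows the cooperation gain: once every row is available at two ENs, both users can be served at once and the downlink latency halves.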
III-B MDS coded Storage and Computing (MC)
We proceed to consider an MDS-coded scheme that aims at enhancing robustness to straggling ENs [9, 12, 7]. In this scheme, the coding matrix is selected as the generator matrix of an MDS code; each EN stores distinct coded rows; and the computing order at each EN is arbitrary. Furthermore, the stopping set is defined such that, given the fractional cache size, the system waits only for the fastest ENs needed to collect as many distinct coded IVs as there are rows in the model matrix, each of these ENs finishing all its computations. By definition of an MDS code, this guarantees that all the required output elements can be obtained from the IVs computed at these ENs by treating the missing IVs from the slower ENs as erasures.
With this scheme, there is no redundancy in the set of IVs computed at the ENs and hence no cooperation opportunities are available for downlink transmission. It follows that the IVs need to be sent sequentially to each user in the downlink using orthogonal transmission, and thus the communication latency is that of fully orthogonal delivery.
Proposition 2
With the described MDS coded scheme, the average total latency (8) is given as
[TABLE]
Proof:
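Since only the fastest ENs are required to execute their full computations, the average computation time is given by the mean of the corresponding order statistic of the exponential setup times across the ENs, plus the deterministic time for the local computations; the former can be expressed in terms of harmonic numbers (see [12]). ∎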
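The role of the harmonic numbers can be checked numerically. For independent exponential variables with a common mean, the expected value of the $q$th smallest among $K$ equals the mean times $H_K - H_{K-q}$, which is a standard property of exponential order statistics. The Python sketch below compares this expression with a Monte Carlo estimate; the parameter values and function names are illustrative assumptions.

```python
import random


def harmonic(n):
    """Harmonic number H_n, with H_0 = 0."""
    return sum(1.0 / i for i in range(1, n + 1))


def expected_qth_fastest(num_ens, q, mean_setup):
    """Mean of the q-th smallest of num_ens i.i.d. exponential setup times."""
    return mean_setup * (harmonic(num_ens) - harmonic(num_ens - q))


def simulate_qth_fastest(num_ens, q, mean_setup, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        setups = sorted(rng.expovariate(1.0 / mean_setup) for _ in range(num_ens))
        total += setups[q - 1]
    return total / trials


analytic = expected_qth_fastest(num_ens=3, q=2, mean_setup=1.0)
estimate = simulate_qth_fastest(num_ens=3, q=2, mean_setup=1.0)
print(analytic, estimate)
```

Under the computing-time model of Section II-B, the average computation time of the MDS-coded scheme would then be this quantity plus the deterministic time for the rows stored at one EN.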
III-C Hybrid Scheme (HS)
We now propose a hybrid scheme whose aim is to combine the robustness to stragglers afforded by the MDS-coded scheme and the cooperative downlink transmission advantages of the uncoded scheme. The proposed hybrid scheme allows the reduction in computing time via MDS coding to be traded off for savings in communication time via EN cooperation. To this end, we concatenate an MDS code with a repetition code that replicates each coded vector at a given number of ENs. By controlling the corresponding design parameters, the scheme ranges from uncoded storage to MDS coding.
More precisely, following [7], in order to ensure an even distribution of coded rows, the coded rows are split into disjoint subsets of equal size. Each subset is indexed by a subset of ENs whose size equals the replication factor. Each EN $k$ stores all the rows in the subsets whose index contains $k$. Due to the storage constraint at each EN, we have the constraint
[TABLE]
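The placement just described can be sketched in Python. The snippet assumes the common combinatorial construction in which one group of coded rows is assigned to every subset of ENs of size equal to the replication factor, and each EN stores the groups whose index subset contains it. The function name `hybrid_placement`, the argument names, and the toy values are hypothetical.

```python
from itertools import combinations
from math import comb


def hybrid_placement(num_ens, replication, rows_per_subset):
    """Assign consecutive coded-row indices to every EN subset of the given
    size and return the list of coded rows stored at each EN."""
    storage = {k: [] for k in range(num_ens)}
    next_row = 0
    for subset in combinations(range(num_ens), replication):
        rows = list(range(next_row, next_row + rows_per_subset))
        next_row += rows_per_subset
        for k in subset:
            storage[k].extend(rows)
    return storage


# Illustrative values only: 4 ENs, each coded row replicated at 2 ENs.
toy_storage = hybrid_placement(num_ens=4, replication=2, rows_per_subset=1)
for k, rows in toy_storage.items():
    print(k, rows)
```

In this construction each EN stores `comb(num_ens - 1, replication - 1) * rows_per_subset` coded rows, which is the quantity a per-EN storage constraint would bound.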
We select the stopping set in a manner similar to the MDS coded strategy, so that the computing phase is completed as soon as a given number of ENs complete all their computations, where this number is a design parameter. Following [7, Proposition 1], the three design parameters need to satisfy the constraint
[TABLE]
in order to ensure that enough distinct coded IVs are computed across the ENs and hence all desired outputs can be recovered. It can be observed that the choice of the parameters depends on two system parameters, which are constant, and on one design parameter. These parameters are expected to be constant for long periods of time and hence frequent re-encoding is not necessary.
At the end of the computing phase, each computed IV is available at $r$ ENs, where $r$ can be shown to lie in an interval $[r_{\min}, r_{\max}]$, in a manner similar to [7]. Moreover, for any redundancy level $r_i$ in this interval, the number $B_i$ of computed IVs with that redundancy follows from counting the subsets of ENs that have computed the same IVs. For downlink transmission, in order to maximize cooperative opportunities, the computed IVs are sent in descending order of redundancy by using cooperative ZF precoding to serve multiple users simultaneously.
Proposition 3
With the described hybrid scheme, the average total latency (8) is given as
[TABLE]
where we have defined $r_{q}=\inf\big\{r:\sum_{r_{i}=r}^{r_{\max}}B_{i}\leq m\big\}$; and the optimization over the design parameters is constrained by Conditions (11) and (12).
Proof:
Given any choice of the design parameters, the average computation time is evaluated as in Proposition 2, with the computing latency given as in (10). For downlink transmission, the IVs with a given redundancy are delivered using cooperative ZF, with the per-IV communication latency explained in Section III-A. In order to deliver $m$ IVs, the IVs with redundancy at least $r_q$ are sent in full, while only the remaining $m-\sum_{r_i=r_q}^{r_{\max}}B_i$ IVs need to be delivered from the next lower redundancy level. The corresponding total communication latency is optimized over all design parameters that satisfy Conditions (11) and (12). ∎
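The delivery order used in the proof can be illustrated with a short Python sketch. It assumes that the computed IVs are summarized as pairs of redundancy level and number of IVs at that level, that `num_needed` IVs must be delivered in total, and that each IV with redundancy `r` costs `N / min(r, N)` normalized time units as in the uncoded case. All names and numbers are assumptions made for illustration.

```python
def hybrid_downlink_latency(levels, num_needed, num_users):
    """Deliver num_needed IVs in descending order of redundancy.

    levels: list of (redundancy, number_of_ivs) pairs.
    Returns the total normalized downlink latency.
    """
    remaining = num_needed
    latency = 0.0
    for redundancy, count in sorted(levels, reverse=True):
        if remaining <= 0:
            break
        sent = min(count, remaining)  # a level may be used only partially
        latency += sent * num_users / min(redundancy, num_users)
        remaining -= sent
    if remaining > 0:
        raise ValueError("not enough computed IVs to deliver num_needed")
    return latency


# Toy profile: 2 IVs at redundancy 3, 3 IVs at redundancy 2, 4 at redundancy 1.
toy_levels = [(1, 4), (3, 2), (2, 3)]
print(hybrid_downlink_latency(toy_levels, num_needed=6, num_users=2))
print(hybrid_downlink_latency(toy_levels, num_needed=6, num_users=3))
```

In the toy profile, the two highest redundancy levels are sent in full and only one IV is taken from the lowest level, mirroring the split between fully and partially delivered levels in the proof.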
IV Example and Discussion
In this section, we present a numerical example for a system with given numbers of ENs and users, a given number of row vectors in the model matrix, and a given fractional cache size. We also fix the per-IV computation time and consider different values of the average set-up time. In Fig. 2, we plot the overall average latency as a function of the ratio between normalized computation and communication times.
As seen in Fig. 2, as this ratio increases, the total latencies of both UC in (9) and MC in (10) grow linearly, and the relative performance depends on the value of the ratio and on the distribution of the set-up times. When the average set-up time is large, the variability in the computing times of the ENs is high, and MDS coding for the most part outperforms the UC scheme due to its robustness to stragglers. This is unless the ratio is large enough, in which case downlink transmission latency becomes dominant and the UC scheme can benefit from redundant computations via cooperative EN communication. In contrast, when the average set-up time is small, the computing times have low variability and MDS coding is uniformly outperformed by the UC scheme.
We also observe that the proposed hybrid coding strategy is effective in trading off computation and communication latencies by controlling the balance between robustness to stragglers and cooperative opportunities via the choice of its design parameters. In fact, by increasing the redundancy of the computed IVs, this approach can decrease the communication latency at the cost of a larger computing latency. Apart from a limited corner of the parameter range considered in Fig. 2, the scheme is seen to outperform both MDS and UC strategies.
An interesting open problem is to design a hybrid strategy that generalizes both the proposed MDS and UC schemes by properly optimizing the scheduling matrix in a manner akin to UC. Other aspects that are left for future work include the investigation of coding schemes that enable the use of ENs’ partial computations [12]; of transmission strategies that carry out simultaneous edge computing and downlink communications; of the impact of partial uplink connectivity; and of protocols able to accommodate an arbitrary number of computing tasks.
References
[1] T. Taleb et al., “On multi-access edge computing: A survey of the emerging 5G network edge cloud architecture and orchestration,” IEEE Commun. Surveys Tutorials, vol. 19, no. 3, pp. 1657–1681, May 2017.
[2] S. Sardellitti, G. Scutari, and S. Barbarossa, “Joint optimization of radio and computational resources for multicell mobile-edge computing,” IEEE Trans. Signal Inf. Process. Over Netw., vol. 1, no. 2, pp. 89–103, June 2015.
[3] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1514–1529, March 2018.
[4] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coding for distributed fog computing,” IEEE Commun. Magazine, vol. 55, no. 4, pp. 34–40, April 2017.
[5] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen, “Collaborative filtering recommender systems,” in The Adaptive Web. Springer Berlin/Heidelberg, 2007, pp. 291–324.
[6] R. J. Bayardo, Y. Ma, and R. Srikant, “Scaling up all pairs similarity search,” in WWW, 2007, pp. 131–140.
[7] J. Zhang and O. Simeone, “Improved latency-communication trade-off for map-shuffle-reduce systems with stragglers.” [Online]. Available: http://arxiv.org/abs/1808.06583
[8] E. Ozfatura, S. Ulukus, and D. Gündüz, “Distributed gradient descent with coded partial gradient computations.” [Online]. Available: https://arxiv.org/abs/1811.09271
