Exploiting Network Loss for Distributed Approximate Computing with NetApprox

Ke Liu; Jinmou Li; Shin-Yeh Tsai; Theophilus Benson; Yiying Zhang

arXiv:1901.01632·cs.NI·July 1, 2022

TL;DR

NetApprox introduces an approximate-aware network layer that leverages the tolerance of approximate applications to loss, significantly improving job completion times and co-running workload performance in data center environments.

Contribution

It is the first to design an approximate-aware network layer integrating transport protocols, resource allocation, and scheduling policies for data center applications.

Findings

1. Up to 80% faster job completion times.

2. 79% improvement in non-approximate workload performance.

3. Effective in both simulation and real-world deployment.

Equations

$$TLR_{i+1} = \frac{(1-MSR)\times N - TotalNumLostMsg}{N - TotalNumDeliveredMsg}$$

$$R_{i+1} = (1-m)\times R_{i} + m\times R_{max}$$

$$R_{i+1} = R_{i}\times\left(1-\frac{\rho_{i}}{2}\right)$$

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices

Full text

Ke Liu#, Jinmou Li¶, Shin-Yeh Tsai§, Theophilus Benson‡, Yiying Zhang¶

#Institute of Computing Technology, ¶UC San Diego, §Facebook, ‡Brown University. Most work done while at Purdue University. Work done while at Purdue University.

Abstract

Many data center applications such as machine learning and big data analytics can complete their analysis without processing the complete set of data. Extensive approximate-aware optimizations have been proposed at the hardware, programming language, and application levels; however, to date, approximate computing optimizations have ignored the network layer.

We propose NetApprox, which, to the best of our knowledge, is the first approximate-aware network layer, comprising a transport-layer protocol, network resource allocation schemes, and scheduling/priority-assignment policies. Building on the observation that approximate applications can tolerate loss, NetApprox’s main insights are to aggressively send approximate traffic (which improves the performance of approximate applications) and to minimize the network resources allocated to approximate traffic (which simultaneously limits the impact of aggressive approximate traffic while freeing up resources that, in turn, improve non-approximate applications’ performance). We ported Flink, Kafka, Spark, and PyTorch to NetApprox and evaluated NetApprox with both large-scale simulation and a real implementation. Our evaluation results show that NetApprox improves job completion times by up to 80% compared to network-oblivious approximation solutions, and improves the performance of co-running non-approximate workloads by 79%.

1 Introduction

Today, big-data workloads, such as stream data processing, data analytics, and machine learning training, dominate network communication in data centers and clouds. Despite significant efforts in designing and tailoring data-center networks to optimize their performance [15, 28, 45, 40, 20, 30, 44, 65, 61, 63, 3, 33, 38, 27, 41, 16, 14], network systems continue to overlook a key property of these big-data workloads: their approximate nature. These workloads (often called approximate computing workloads) are capable of completing a job with incomplete or imprecise input data or intermediate data, a property we call slack tolerance. For example, ML training can achieve high accuracy even without certain (e.g., tiny or zero) gradients when updating weights [42, 17]. Similarly, data-analytics jobs like MapReduce workloads can achieve satisfactory accuracy even when some input data or intermediate data between mappers and reducers are lost [25, 52].

Recent efforts to leverage approximation slack tolerance have demonstrated significant performance and energy benefits. However, these efforts have focused on the programming-language [13, 18, 55, 56], hardware [56], and application [4, 8, 19, 22, 25, 32, 48, 52, 53] levels, all within a single-server setting. Most data-center approximate applications, in contrast, run in a distributed manner, and it is imperative to exploit slack tolerance in a distributed setting. Existing distributed approximate applications treat the network as a black box by either dropping some data before sending the rest to the network [25] or sending all the data and then dropping some at the receiver before the computation [52]. These existing solutions reuse existing networking protocols, scheduling principles, and queuing disciplines, missing the opportunity to incorporate slack tolerance at the network layer.

Motivated by this observation, our work aims to answer these questions: “Instead of treating the network as a black box, is it beneficial to make approximate computing network-aware and the network approximate-computing aware? If so, what design choices are required to effectively integrate approximate computing with production networks?”

Our answer to these questions is a network system called NetApprox that is designed for data-center approximate computing. Behind NetApprox, our insight is that existing reliable transports and queuing disciplines are conservative and thus generally underutilize the network [44]; they aim to deliver all packets while avoiding network loss. These design choices are suboptimal for approximate workloads which can tolerate data loss and can potentially run on a more aggressive network layer. We explore and propose two distinct points in the network design space for approximate computing: first, approximate transports which can send at a rate that slightly exceeds network capacity, and second, a network which can allocate significantly fewer queuing resources to approximate traffic.

Although these high-level design choices are simple, there are admittedly several critical challenges in realizing them in today’s production networks in a manner that generalizes across the broad spectrum of approximate computing applications ranging from streaming and batch to machine learning. First, although approximate applications can tolerate some slack, dropping data arbitrarily is not acceptable as these applications still need to meet certain accuracy requirements. Thus, NetApprox needs to accurately infer the approximation slack and deliver optimal performance without violating accuracy requirements. Second, an application’s slack varies over time due to packet drops and network dynamics. Thus, NetApprox needs to automatically and effectively adjust approximate traffic’s behavior to react to evolving network dynamics and variability. Third, the fundamental shift in approximate traffic’s behavior towards more aggressive rates potentially incurs queuing delays and build-up. Thus, NetApprox must intelligently allocate network resources to approximate and non-approximate traffic in a manner that is efficient and starvation-free while simultaneously addressing the spectrum of accuracy requirements for different approximate jobs. Finally, NetApprox should offer an abstraction that is generic and expressive enough to support different types of approximate applications while being easy to use.

Intuitively, addressing the above challenges in an application-specific manner may be trivial. However, addressing them in an application-agnostic manner requires synthesizing a set of novel slack-based network techniques across several layers of the network stack, including a slack-based transport layer, a slack-based switch queue design, and a slack-based abstraction and accompanying libraries. At the core of NetApprox is a dynamically adjusted, slack-based metric, Target message Loss Rate, or $TLR_{t}$ , which captures an application’s instantaneous slack tolerance at time $t$ . Crucially, $TLR_{t}$ evolves dynamically as the network state changes, with the goal that the final target loss rate is equal to or slightly smaller than the application’s overall tolerable amount of loss.

NetApprox exposes a simple yet powerful interface to application frameworks such as Spark and PyTorch. NetApprox includes a developer-friendly user-level library, ApproxLib, which acts as a bridge between application frameworks and our network system. In §6, we demonstrate NetApprox’s ease of use by porting four big-data application frameworks.

At the transport layer, we propose a new Approximate Transmission Protocol, or ATP, which adapts the application’s sending rate to utilize all available network bandwidth opportunistically based on the instantaneous target loss rate. Internally, ATP further exploits approximate applications’ tolerance of un-ordered data to minimize their job completion time with a new approximate-aware scheduling policy.

At the network layer, our slack-based network resource allocation uses the target loss rate (and implied aggressiveness) to determine an application’s priority and appropriately queues its traffic. To support this, NetApprox configures lower-priority switch queues to a tiny queue size and associates approximate flows with more loss tolerance to these queues. The highest priority queue and almost all switch buffer space are reserved for non-approximate traffic. Together, these techniques ensure minimal retransmission, efficient bandwidth utilization, and fast job completion. In particular, the switch settings improve the performance of non-approximate workloads while not affecting the performance or accuracy of approximate workloads.

We implemented NetApprox as a user-level library at end hosts and by changing existing switches’ configurations. We ported four data-processing/ML frameworks, Kafka [10], Flink [9], Spark [11], and PyTorch [24], to NetApprox. In addition, we implemented NetApprox on the ns2 simulator [47] to understand our techniques at production scale. In our simulation, we evaluated NetApprox using large Fat-tree [5] and 2-layer CLOS topologies and two large-scale real-world traces [12, 54]. We compared NetApprox to UDP, DCTCP [6], pFabric [7], Aeolus [33], and two sender-drop approaches, one that samples packets to be sent uniformly throughout a job and one that sends data as early as possible. We evaluated our real implementation on a small cluster with two real-world workloads and three types of approximate computation. Our evaluation results show that NetApprox improves approximate application job completion time by up to 80%. Meanwhile, NetApprox’s measured loss rate is small and always below the application-specified maximum loss rate. Moreover, it achieves fairness across different (approximate and non-approximate) jobs and improves the performance of co-running non-approximate traffic by 79%.

This paper makes the following contributions:

• Analysis of the limitations of existing approximation systems that are network oblivious.

• The first proposal of incorporating approximate computing at the network layer.

• ATP, a new transport protocol designed for distributed approximate computing, and a set of approximate-aware switch resource allocation policies.

• Implementation of NetApprox in both simulation and a real system, with four real datacenter distributed frameworks ported to NetApprox.

We will make our simulation and real implementation source code publicly available upon publication.

2 Today’s Approximate Computing

This section provides background on approximate computing and briefly describes the state of the art.

2.1 Datacenter Approximate Computing

Today, approximate computing paradigms in datacenters mainly target data analytics and machine learning applications running on batch and streaming platforms where end-users do not need a precise answer and are perfectly content with an approximate answer. For example, an online service provider often needs to determine, in real-time, the top viewed webpages and runs a stream-processing job in their datacenter to do so. This service provider does not care about the actual number of views, just their relative frequencies.

Figure 2 shows the accuracy of running a Top-K (top 15) calculation on pick-up locations from the NYC taxi itinerary dataset [21]. Accuracy approaches 100% when the sampling rate exceeds one half. Uniform sampling further improves accuracy over a sampling scheme that takes only the first set of data points.

As another example, a large-scale machine-learning training job running on TensorFlow [26] or PyTorch [24] can generate an accurate-enough model even with incomplete or imprecise training data or gradients. We train the Resnet50 model [31] with the MNIST [39] dataset and drop gradients that are smaller than a threshold. Figure 2 shows that approximate training has a convergence rate similar to that of non-approximate training if the threshold is not too large, and the test-set accuracy is similar.

Although the designs of different approximate computing systems differ, most of them follow a similar high-level workflow [51, 52, 36, 25, 49, 35]. First, users specify their approximate workloads’ requirements, e.g., a maximum error rate. Then, the system chooses its own way of carrying out the approximation. Finally, the system calculates or estimates the actual error from the execution and reports it to the user.

2.2 Network-Oblivious Approximation

Existing distributed approximate computing proposals [4, 8, 25, 52] all build on a reliable transport (e.g., TCP) and treat the network as a black box. They either sample data before sending them out or drop data after receiving them, both of which have their own problems. Figure 3 illustrates the high-level flow of these approaches and NetApprox.

Receiver-based sampling. This type of system sends all the data through the network, and afterwards, the receiver discards a part of the data before processing it. The receiver could either drop some sampled data after all data is received (SampleRecv, Figure 3(A)) or discard all data received beyond a threshold (DiscardRecv, Figure 3(B)). For example, StreamApprox [52] improves Spark by dropping parts of the received input workload data and then starting a Spark job to process the remaining data. With receiver-based sampling, all data is transferred over the (reliable) network, resulting in significant resource waste in the network. Moreover, with the scheme of Figure 3(A), computation at the receiver is delayed until all data has arrived, impacting the entire application’s performance.

Sender-based sampling. Opposite to receiver-based sampling, sender-based sampling performs sampling at the sender before sending data out. The sender could sample data as they are made ready (e.g., read from disk or received from another server) and send only the sampled ones, a scheme we call SampleSend (Figure 3(C)). The sender could also send data as soon as they are ready and stop sending once enough data has been received, what we call EarlySend (Figure 3(D)). ApproxHadoop [25] is an example of EarlySend. It optimizes Hadoop’s map phase by sending map results to reducers as soon as mappers generate them and stops the mapper phase when user-specified accuracy is met.

By sending only a subset of data, SampleSend reduces network bandwidth consumption. However, it is as slow as the speed of the entire data production process. It misses opportunities to more aggressively use network bandwidth. EarlySend improves application performance over SampleSend by sending data as soon as they are ready. However, by treating the network as a black box (e.g., using a reliable transport with a congestion-control algorithm that is approximate oblivious), data could be sent at a conservative rate (because today’s congestion control algorithms have the goal of minimizing packet loss) and/or unnecessary retransmission would be involved (because today’s reliable transports deliver every packet). Both these scenarios would delay the completion of application workloads.

Takeaway. Both receiver- and sender-based approximate computing schemes are network oblivious, resulting in excessive network bandwidth consumption, inefficient bandwidth consumption, floods of retransmission, and/or delayed job completion time. We argue that network support for approximate computing can offer new opportunities in improving application performance and achieving better network resource allocation. However, the lack of network awareness of today’s approximate computing paradigms minimizes their ability to achieve these goals.

3 NetApprox Overview

This section presents an overview of our proposed framework, NetApprox. The intuition behind NetApprox’s design is to use the slack allowed by approximate applications to determine (1) an appropriate sending rate (by our new transport protocol), (2) a policy to selectively retransmit lost packets (by our new packet scheduling policy), and (3) priority assignment and queue buffer space (by our new switch resource allocation mechanism).

3.1 Overall Architecture

NetApprox consists of several components (Figure 4), all of which are approximate-aware and have slack-centric designs. ATP, the core of NetApprox, is a new transport protocol that is neither unreliable (like UDP) nor completely reliable (like TCP). Its core idea is to allow approximate traffic to send at rates that exceed network capacity (in order to aggressively try completing approximate jobs sooner), but doing so in a controlled manner (so as to still meet application-specific accuracy requirements).

We also provide auxiliary support for certain approximate applications’ sampling requirements (§3.3.3). Together with ATP’s end-host logic, this support forms a user-level library called ApproxLib running at end hosts.

At the network layer, to provide isolation and prevent starvation, we propose a new slack-based switch queue configuration and scheduling algorithm (§5). The core idea is to categorize flows based on their current slack requirements and to assign them to switch priority queues accordingly.

3.2 NetApprox Workflow

To leverage NetApprox, framework developers port an approximate framework such as PyTorch or Spark to run on ApproxLib. End users use these ported frameworks in almost the same way as they did before. Before a workload starts, an end user specifies the overall accuracy requirement of their job to the approximate framework, which translates it to workload approximation configurations for ApproxLib. At run time, the framework sends messages to the network via ApproxLib and may dynamically adjust the approximation configurations. ApproxLib determines the dynamic target loss rate, packet sending rate, sending order, and retransmission criteria based on the approximation configurations given by the framework and the current network congestion. At the receiver side, ApproxLib reports lost messages to the framework, which uses this information to adjust its sampling criteria. After a job completes, the framework reports the application-level accuracy, profiling statistics such as the measured loss rate, and the computation results to the user.

3.3 NetApprox Interfaces

ApproxLib provides two distinct interfaces: one for end users to specify acceptable parameters for their specific job, and one for approximate framework developers to integrate NetApprox into their framework. A fundamental challenge in designing the framework abstraction and interface lies in ensuring that the interface is simple but powerful enough to support the broad set of approximate applications that exist today. In particular, while batch and streaming often adhere to similar principles, deep learning explores a different point in the design space.

3.3.1 User Interface

Users carry out approximate computation on top of ported frameworks by first defining an accuracy requirement: the maximum acceptable error rate for a job ( $e_{max}$ ). Error bounds are applicable to almost all approximate applications such as ML, batch, and streaming data-processing systems. When a user specifies $e_{max}$ , NetApprox tries to finish the job as soon as possible while ensuring that the error is within $e_{max}$ . Afterwards, users submit jobs to the ported frameworks in the same way they would on the original framework. For example, a job can be a word count Spark job, one minibatch of a machine-learning training task executed by PyTorch, or the processing of a window of streaming data.

3.3.2 Framework Interface

Per-job description. The approximate application framework ported to ApproxLib first informs ApproxLib about the size of a job $N$ before starting the job. $N$ can be the total dataset size for batch systems or window size for streaming systems, both of which are known by the corresponding frameworks. The framework is also responsible for translating user-specified accuracy requirements (i.e., $e_{max}$ ) to a minimal message sampling rate ( $MSR$ ) that will be passed to ApproxLib. Different frameworks would have their own ways of doing this translation. A common practice we suggest is to keep estimating the potential error based on the received and the dropped data, compare it with the user-specified $e_{max}$ , and adjust $MSR$ in a way that would minimize the error difference. We elaborate in § 6.

Per-message description. At run time, when the framework executes a job, it sends multiple messages to ApproxLib, each of which in turn includes one or more packets. Only when all the packets in a message are successfully delivered does NetApprox treat the message as being delivered, and NetApprox guarantees to deliver at least $MSR\times\ N$ messages for a job. Frameworks that desire fine-grained sampling could make each data unit an individual message (with the tradeoff of potential performance overhead). Frameworks running on ApproxLib can use unmodified POSIX socket APIs to send and receive messages in a job. At the beginning of the socket API payload, we leave a header field for frameworks to specify per-message approximate information, which we describe next.
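The message-level delivery rule can be sketched as follows. This is only an illustration, assuming a hypothetical `JobTracker` helper that is not part of ApproxLib's described interface: a message counts as delivered only once every one of its packets has been acknowledged, and the job-level guarantee holds once at least $MSR\times N$ messages are delivered. Rounding that threshold up with `math.ceil` is an assumption.

```python
import math
from typing import Dict, Set


class JobTracker:
    """Hypothetical per-job bookkeeping; all names are illustrative only."""

    def __init__(self, n_messages: int, msr: float) -> None:
        # Minimum number of fully delivered messages (MSR x N).
        # Rounding up is an assumption made for this sketch.
        self.required = math.ceil(msr * n_messages)
        self.packets_in_msg: Dict[int, int] = {}
        self.acked: Dict[int, Set[int]] = {}

    def register(self, msg_id: int, n_packets: int) -> None:
        """Record how many packets make up a message."""
        self.packets_in_msg[msg_id] = n_packets
        self.acked[msg_id] = set()

    def ack(self, msg_id: int, seq_num: int) -> None:
        """Record an acknowledged packet of a message."""
        self.acked[msg_id].add(seq_num)

    def delivered_messages(self) -> int:
        # A message counts only when all of its packets are acknowledged.
        return sum(
            1
            for msg_id, n_packets in self.packets_in_msg.items()
            if len(self.acked[msg_id]) == n_packets
        )

    def guarantee_met(self) -> bool:
        return self.delivered_messages() >= self.required
```

With four two-packet messages and an $MSR$ of 0.5, the guarantee is met once any two messages have both packets acknowledged; a message with only one acknowledged packet does not count.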

3.3.3 Expressing Sampling Criteria

By default, ApproxLib treats every message in the same way, i.e., ApproxLib is free to choose any messages to drop. However, many approximate computing frameworks have specific requirements on how data should be sampled, and for these frameworks, not all messages can be treated in the same way. For example, frameworks like BlinkDB [4] perform queries using stratified sampling, where data is separated into different subsets of data (e.g., based on certain database fields) and samples are drawn with an equal probability within each subset. Other frameworks could choose quota sampling, where the sampled data size (not probability) is the same across subsets of data. Yet, some frameworks may desire to precisely control what data can and cannot be dropped, e.g., a machine-learning framework may want to ensure that all messages carrying large gradients are delivered and is OK with the loss of small gradients.

We introduce the concept of a group to support a broad range of sampling patterns and requirements effectively. To use this construct, the framework specifies a GroupID in the header of each message, and for different groups, the framework could specify different $MSR$ s (e.g., 0.8 for GroupID 1, 0.5 for GroupID 2, 1 for GroupID 3). NetApprox separates messages into groups based on their GroupID and guarantees to deliver the minimal samples within each group based on their $MSR$ s. If $MSR$ is 1, then NetApprox will treat the group as a non-approximate workload and deliver every message in order. The ability to employ different $MSR$ s for different groups allows NetApprox to implement a broad range of sampling techniques. For example, to express stratified sampling, a framework could separate its dataset into different groups and use the same $MSR$ for each group, while for quota sampling, a framework would use different $MSR$ s to achieve the same sampled data size across groups. Frameworks could also put data that cannot be lost (e.g., certain metadata) into a group with $MSR$ 1. If no GroupID is specified, ApproxLib will treat all messages as part of one global group that uses the global $MSR$ specified by the framework.
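As a rough illustration of how a framework might derive per-group $MSR$ values for the two sampling styles mentioned above, consider the sketch below. The function names, the dictionary layout keyed by GroupID, and the capping of $MSR$ at 1 are assumptions made for this example; they are not part of ApproxLib's described interface.

```python
from typing import Dict


def stratified_msrs(group_sizes: Dict[int, int], msr: float) -> Dict[int, float]:
    """Stratified sampling: every group uses the same MSR."""
    return {group_id: msr for group_id in group_sizes}


def quota_msrs(group_sizes: Dict[int, int], quota: int) -> Dict[int, float]:
    """Quota sampling: choose each group's MSR so that every group
    delivers the same number of messages (capped at the group size)."""
    return {
        group_id: min(1.0, quota / size)
        for group_id, size in group_sizes.items()
    }


def min_delivered(group_sizes: Dict[int, int], msrs: Dict[int, float]) -> Dict[int, int]:
    """Minimum number of delivered messages implied per group (MSR x size).
    Rounding to the nearest integer is an assumption of this sketch."""
    return {
        group_id: round(msrs[group_id] * size)
        for group_id, size in group_sizes.items()
    }
```

For groups of 1000, 500, and 200 messages and a quota of 100, `quota_msrs` assigns $MSR$ s of 0.1, 0.2, and 0.5, so each group delivers at least 100 messages; `stratified_msrs` with 0.8 instead yields minimum deliveries proportional to group size.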

4 Approximate Transport Protocol (ATP)

Here we discuss the principles and design choices underlying our Approximate Transport Protocol (ATP). Our core contribution in designing ATP is to introduce a method for enhancing existing rate-control algorithms with the notion of slack. Specifically, while existing algorithms focus on detecting and reacting to network state, we provide principles for showing how to integrate the application’s approximate nature – thus allowing rate-control and packet scheduling to react both to network loss and application’s inherent slack – and illustrate these principles by enhancing DCTCP.

Our design for ATP centers around several principles: first, our protocol should seamlessly co-exist with other transport protocols, e.g., DCTCP; second, to easily interoperate with modern approximate applications (e.g., streaming, batch, and ML frameworks), it should be message-oriented, with each message consisting of one or more packets (traditionally, each packet’s header includes metadata required by the receiver to reconstruct the message, e.g., SeqNum and MsgID); and, third, our protocol should support jobs with varying levels of slack (even those that cannot tolerate any loss).

To optimize application job completion time while preserving accuracy requirements (i.e., successfully sending an $MSR$ fraction of a job’s data as specified by the approximate framework), an ideal approximate transport should send data at as aggressive a rate as possible, essentially at a rate that results in the network dropping at most the number of messages that the application can tolerate, i.e., $(1-MSR)\times N$ .

This differs distinctly from traditional transports, which attempt to prevent packet loss. The core fact that approximate transport embraces loss significantly transforms and simplifies congestion control. First, unlike existing congestion control, which includes complex mechanisms for loss recovery, e.g., timeout and dupACKs, an approximate transport does not need to include these. Instead, the transport comprises a simple but more aggressive rate control algorithm whose aggressiveness arises because it reacts to the application’s approximate nature in addition to network state (§ 4.2). Second, the fact that in-order delivery is not crucial implies that the protocol can also eliminate the mechanism required to pace and ensure in-order delivery (§ 4.3).

4.1 Strawman Protocol

Based on this intuition, our strawman protocol, which we call ATP_FixTLR, sets the target message loss rate, or $TLR$ , for a job to $1-MSR$ (where $MSR$ is specified by the framework). Given this $TLR$ , ATP_FixTLR alters the rate control of an existing transport to adjust the sending rate in response to $TLR$ instead of a traditional congestion signal (e.g., packet loss). Specifically, it measures the rate of actual data loss and increases (decreases) the sending rate if the measured loss rate is lower (higher) than $TLR$ . Moreover, if the measured loss rate is still too high after decreasing the sending rate, it retransmits lost packets. This minor change to rate control enables an existing transport to be approximate aware.

The key issue with this strawman approach is that $TLR$ is statically defined; however, in production networks, there is no way to perfectly predict how much data the network will drop. Additionally, fluctuations in background traffic will lead to excessive or insufficient loss. As a result, the application’s slack tolerance changes over time as more packets are delivered or lost. Thus, we need to dynamically adjust $TLR$ to capture the application’s instantaneous slack.

4.2 Dynamic Slack-Based Rate Control

We propose a novel approximate-aware, slack-based rate control algorithm that simultaneously adapts to network states, controls the magnitude of packet loss, and ensures efficient network resource utilization for both approximate and non-approximate workloads. Unlike existing congestion-control algorithms that use network signals (i.e., ECN, loss, or RTT) as the sole input for their congestion control algorithm (approximation-unaware), ATP uses workload-specific (instantaneous) slack-information in addition to network signals to inform its congestion control algorithm.

Our congestion-control intelligence revolves around the notion of a dynamically-adjusted $TLR$ . Intuitively, $TLR_{t}$ at time $t$ should account for how the actual loss rate in the past has deviated from the $TLR$ s at those times. However, it is difficult to incorporate all past deviations in an equation to deduce a new $TLR$ . Instead, we approach the problem from a different angle: how much data could potentially be lost in the future? With this idea, we have the following equation, which determines $TLR_{i+1}$ at the end of every epoch $i$ (by default, an epoch is one RTT in our algorithm):

$$TLR_{i+1} = \frac{(1-MSR)\times N - TotalNumLostMsg}{N - TotalNumDeliveredMsg} \qquad (1)$$

where $N$ is the total number of messages in a job and $(1-MSR)\times\ N$ is the maximum number of messages that could be lost for the entire job. The numerator of this equation is thus the number of messages that can still be lost in the future. The denominator is the total number of messages that could potentially be sent, which include both unacknowledged messages and new (i.e., future) messages that have never been sent out before. Thus, $TLR_{i+1}$ is the potential message loss rate that the job could tolerate in the future. Note that with this equation, $TLR_{0}$ is initialized to $1-MSR$ , the same as the static value used in ATP_FixTLR.
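The update described above (remaining loss budget divided by the number of messages not yet delivered) can be sketched in a few lines. The function name `next_tlr`, its parameter names, the clamping of the result to [0, 1], and the handling of the case where nothing remains to be delivered are assumptions made for illustration; they are not specified by the text.

```python
def next_tlr(n_total: int, msr: float, total_lost: int, total_delivered: int) -> float:
    """Target message loss rate for the next epoch.

    Numerator: messages that may still be lost, (1 - MSR) * N - lost so far.
    Denominator: messages that could still be sent, N - delivered so far
    (this covers both unacknowledged and never-sent messages).
    """
    loss_budget_left = (1.0 - msr) * n_total - total_lost
    still_to_deliver = n_total - total_delivered
    if still_to_deliver <= 0:
        # Assumption: once everything is delivered, no further loss is targeted.
        return 0.0
    # Assumption: clamp to a valid rate; an exhausted budget yields zero.
    return max(0.0, min(1.0, loss_budget_left / still_to_deliver))
```

For example, with $N=1000$ and $MSR=0.8$, the initial value is about 0.2 (that is, $1-MSR$). After 50 messages are lost and 400 are delivered, the remaining budget of 150 messages is spread over 600 undelivered messages, giving a target of roughly 0.25; once 200 messages have been lost, the target drops to 0.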

Enhancing DCTCP: Recall that existing rate-control algorithms for today’s transport protocols, e.g., DCTCP, adjust the sending rate by comparing the current loss against a predefined threshold. Instead, with ATP, current loss is compared against a dynamic target threshold, essentially $TLR_{i}$ . More specifically, ATP continuously monitors the current loss rate, $\rho_{i}$ , by comparing the ACKs received and the total number of packets sent in an epoch $i$ , with $\rho_{i}=1-\frac{N_{ack}^{i}}{N_{sent}^{i}}$ . If $\rho_{i}$ is less than $TLR_{i}$ , then ATP increases its sending rate:

[TABLE]

where $m$ is a parameter which controls ATP’s reaction speed, and it trades off between convergence speed and network utilization. $R_{max}$ is the highest rate that the host can send at, i.e., the NIC port capacity.

On the other hand, if the current loss rate, $\rho_{i}$ , is higher than the $TLR_{i}$ , then ATP decreases its sending rate:

[TABLE]

Note that the above rate-adjustment rules follow DCTCP’s (although the criteria for invoking them are different and are approximation-aware). We chose DCTCP over other contemporary algorithms because of its proven stability and convergence properties, and because it does not explicitly focus on short flows.

Preventing Starvation: One caveat applies to the above algorithm. In a highly congested network, ACKs may not be delivered promptly. Without ACKs, ATP would decrease both the sending rate (Equation 3) and $TLR$ (Equation 1); the latter would cause the sending rate to decrease further. With a sending rate that is too small, ATP cannot detect or react to changes in network state, causing a starvation problem. To prevent this, ATP mandates a minimum sending rate ($R_{min}$), which allows flows to probe the network periodically. We set $R_{min}$ to 1 packet per RTT.
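The control flow described in this subsection (measure $\rho_{i}$, compare it with $TLR_{i}$, then keep the rate within $[R_{min}, R_{max}]$) can be sketched in Python as follows. The increase and decrease rules in `adjust_rate` are placeholder assumptions chosen only so the example runs; they are not the paper's rate-update equations. The function names and the default value of `m` are likewise hypothetical.

```python
def measured_loss_rate(n_ack: int, n_sent: int) -> float:
    """rho_i = 1 - N_ack / N_sent for one epoch."""
    if n_sent == 0:
        return 0.0  # assumption: no packets sent means no observed loss
    return 1.0 - n_ack / n_sent


def adjust_rate(rate: float, rho: float, tlr: float,
                r_min: float, r_max: float, m: float = 0.5) -> float:
    """Compare the measured loss rate with the target and adjust the rate."""
    if rho < tlr:
        # Placeholder increase rule (assumption): move a fraction m
        # of the way toward the NIC line rate R_max.
        rate = rate + m * (r_max - rate)
    elif rho > tlr:
        # Placeholder decrease rule (assumption): back off in proportion
        # to how far the loss rate exceeds the target.
        rate = rate * (1.0 - m * (rho - tlr))
    # Starvation guard from the text: never drop below R_min,
    # and never exceed the line rate R_max.
    return min(r_max, max(r_min, rate))


if __name__ == "__main__":
    rho = measured_loss_rate(n_ack=90, n_sent=100)
    print(adjust_rate(rate=10.0, rho=rho, tlr=0.5, r_min=1.0, r_max=100.0))
```

Whatever update rule is used, the final clamp is what lets a heavily congested flow keep probing the network at $R_{min}$.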

4.3 Approximate-Aware Packet Scheduling

Apart from tolerating lost data, most approximate applications can also tolerate re-ordering of data, e.g., $Avg(A,B,C)$ is the same as $Avg(C,A,B)$. (Certain approximate applications, like live video analytics [64, 57], have dependencies across messages; they can disable our MRDF scheduling.) We exploit this feature by introducing a new scheduling policy called minimal-remaining-data-first (MRDF), which uses approximate applications’ tolerance of unordered data to send messages out of order for better job completion time. Specifically, an ATP sender (its ApproxLib) calculates the amount of data that is left to be sent for each message (i.e., the total size of the message minus the size of data successfully delivered to the receiver). Given information about message sizes, ATP then sends packets using the shortest-remaining-data policy. Note that in this process ATP can choose both unacknowledged packets and new packets that have never been sent out before. Unlike transports that either always retransmit lost packets or always wait for new packets to arrive, MRDF is free to pick whichever of the two leaves the least data remaining, which could achieve the best performance.
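A minimal Python sketch of the MRDF selection step is shown below. The `Message` class, its field names, and the tie-breaking by message ID are assumptions for illustration; the sketch only captures the idea of picking the message with the least remaining data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    msg_id: int
    size: int            # total bytes in the message
    delivered: int = 0   # bytes already delivered to the receiver

    @property
    def remaining(self) -> int:
        # Total size minus data successfully delivered, as in the text.
        return self.size - self.delivered


def pick_next(messages: List[Message]) -> Optional[Message]:
    """Return the unfinished message with the least remaining data."""
    pending = [m for m in messages if m.remaining > 0]
    if not pending:
        return None
    # Assumption: ties are broken by the lower message ID.
    return min(pending, key=lambda m: (m.remaining, m.msg_id))


if __name__ == "__main__":
    queue = [
        Message(msg_id=1, size=1000, delivered=900),  # 100 bytes left
        Message(msg_id=2, size=300),                  # new, 300 bytes left
        Message(msg_id=3, size=500, delivered=500),   # finished
    ]
    print(pick_next(queue).msg_id)
```

Here the partially delivered message 1 is chosen ahead of the new message 2, because finishing it requires less data.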

5 Slack-Centric Network Resource Allocation

Fundamentally, approximate applications place different demands on the network than traditional applications: they require fewer network resources (because of their loss tolerance). However, if not controlled, they could end up unfairly consuming too many network resources (because of their aggressive sending rates). These features motivate us to rethink switch resource allocation mechanisms.

5.1 Strawman Approach

Approximate workloads could work with fewer network resources (e.g., a lower switch priority queue, less switch queue buffer), but their more aggressive sending rates would unfairly impact non-approximate traffic. Based on this intuition, our strawman approach separates approximate and non-approximate traffic into two switch priority queues. It devotes the highest priority and most (or even all) of the switch buffer space to non-approximate traffic so as to improve the performance of non-approximate workloads. This seemingly feasible approach has a key issue: all approximate traffic shares the same queue, but not all approximate traffic has the same slack tolerance. An approximate application that tolerates more loss could send its traffic more aggressively than one that tolerates little loss, and starve the latter.

An improved solution (what we call ATP_FixPrio) is to use multiple queues to differentiate approximate workloads with different slack tolerance and statically assign a job to a priority queue based on its $MSR$ . However, as network states and application states change, a job’s tolerance to loss is dynamic, as discussed in §4.2. Thus, we need a mechanism that could adjust a job’s switch queue assignment dynamically.

Insight. Our insight is that it is beneficial to: (1) allocate more buffer space and assign higher priority to applications that can tolerate less or zero slack, (2) assign applications with similar slack to the same priority so that only applications with a similar level of aggressiveness compete with each other, and (3) dynamically adjust an application’s priority based on its instantaneous loss tolerance.

5.2 Dynamic Priority and Queue Allocation

We propose a dynamic mechanism that assigns different network resources based on a job’s instantaneous slack tolerance. Below, we discuss the detailed design.

Slack-based priority assignment. First, like the strawman approach, we reserve the highest-priority queue only for non-approximate jobs. We use the remaining switch queues (existing commodity switching chips support multiple queues, typically 4 to 8 [15, 16, 14], per egress port) for different approximate traffic and associate a different $TLR$ threshold with each of them, i.e., $ThreshTLR_{k}$ for the $k$th priority queue, with $k=0$ being the highest-priority queue. Intuitively, lower-priority queues could result in more packet loss and should be used for jobs with more slack tolerance. Thus, $ThreshTLR_{k}$ should be higher for larger $k$. We empirically evaluated different approaches to assigning thresholds to the queues using the data center traces and workloads described in § 7.1. We observed that a simple heuristic that evenly distributes threshold values between 0 and 1 performed best, and that it was comparable to the exponential assignment used in prior works. For example, with 8 switch queues, we have $ThreshTLR_{0}=0$, $ThreshTLR_{1}=0.125$, …, $ThreshTLR_{7}=0.875$.

Instead of statically assigning jobs to switch queues, we dynamically direct a job’s traffic to an appropriate queue based on its current $TLR$. Specifically, for each epoch $i$, we direct flows whose $TLR_{i}$ is greater than $ThreshTLR_{k-1}$ but no greater than $ThreshTLR_{k}$ to the $k$th queue. For example, with the eight-queue thresholds above, a flow with $TLR_{i}=0.5$ goes to queue 4.
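The queue-selection rule can be sketched in Python as follows, reading it as $ThreshTLR_{k-1} < TLR_{i} \le ThreshTLR_{k}$, which matches the evaluation example in which a job with an initial $TLR$ of 0.5 is placed in queue 4. The function names and the treatment of a $TLR$ above the last threshold (mapped to the lowest-priority queue) are assumptions.

```python
from typing import List


def thresholds(num_queues: int = 8) -> List[float]:
    """Evenly spaced thresholds: ThreshTLR_k = k / num_queues."""
    return [k / num_queues for k in range(num_queues)]


def queue_for_tlr(tlr: float, num_queues: int = 8) -> int:
    """Return k such that ThreshTLR_{k-1} < tlr <= ThreshTLR_k.

    Queue 0 is the highest priority; a TLR of 0 (no tolerable loss)
    maps to it, since ThreshTLR_0 = 0.
    """
    for k, thresh in enumerate(thresholds(num_queues)):
        if tlr <= thresh:
            return k
    # Assumption: a TLR above the last threshold uses the lowest-priority queue.
    return num_queues - 1


if __name__ == "__main__":
    for tlr in (0.0, 0.1, 0.375, 0.5, 0.95):
        print(tlr, "-> queue", queue_for_tlr(tlr))
```

Because the mapping is recomputed every epoch, a job whose $TLR$ shrinks moves to a higher-priority queue without any change at the application.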

Slack-based queue buffer space allocation. To determine the buffer size for the different queues, we empirically evaluated different allocation strategies and found the following simple scheme that works well across workloads. We set the switch’s lowest priority queue to use a 1-MTU buffer size and increase the buffer size by 1 MTU for every higher priority (i.e., the $k$ th lowest priority queue size is $k$ MTU), until the second highest priority queue. Then, we leave all the remaining switch buffer to the highest priority queue (non-approximate messages).
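As a worked illustration of this allocation scheme, the Python sketch below computes per-queue buffer sizes. The 1500-byte MTU, the function name, and the decimal reading of 1.5 MB used in the example are assumptions; the 1.5 MB total and eight queues are the values used in the simulation setup.

```python
from typing import List


def queue_buffers(total_bytes: int, num_queues: int = 8,
                  mtu: int = 1500) -> List[int]:
    """Per-queue buffer sizes in bytes; index 0 is the highest priority."""
    sizes = [0] * num_queues
    # The k-th lowest-priority queue gets k MTUs, up to the second-highest queue.
    for k in range(1, num_queues):
        sizes[num_queues - k] = k * mtu
    reserved = sum(sizes[1:])
    if reserved > total_bytes:
        raise ValueError("total buffer is too small for this scheme")
    # All remaining buffer goes to the highest-priority (non-approximate) queue.
    sizes[0] = total_bytes - reserved
    return sizes


if __name__ == "__main__":
    # Assumption: 1.5 MB read as 1,500,000 bytes and an MTU of 1500 bytes.
    for k, size in enumerate(queue_buffers(1_500_000)):
        print(f"queue {k}: {size} bytes")
```

Under these assumptions the seven approximate queues together take 28 MTUs (42,000 bytes), leaving about 97% of the buffer for non-approximate traffic.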

A smaller switch queue size for lower-priority queues would cause more data loss, which is acceptable for flows with larger $TLR_{i}$ because they can tolerate more loss. On the other hand, if we were to assign a larger queue to these flows, they would fill up the queues and increase the queuing delay, causing packet loss signals to propagate slowly to the receiver and back to the sender. This delayed signal further prevents the ATP rate control from quickly making the right decision, causing more packet loss and retransmission. Lastly, a large switch buffer allocation for approximate flows leads ATP to underestimate the loss rate, because the buffer saves packets that would otherwise be dropped, which in turn makes ATP overestimate its sending rate. By using small queues for approximate traffic, switches can save most of their buffer space for non-approximate traffic. Doing so improves non-approximate traffic’s performance (i.e., reduces delay) and/or enables switches to handle more non-approximate traffic.

Allocating buffers to queues and assigning thresholds are currently empirically derived parameters. We believe that these parameters will need to be re-evaluated in drastically different topologies or workloads. As part of future work, we plan to employ existing system approaches that use machine learning to tune system configurations.

6 Porting Frameworks to NetApprox

To demonstrate NetApprox’s ease of use and to evaluate NetApprox with real applications, we ported four popular datacenter big data frameworks to NetApprox: one batch system (Spark [11]), two streaming systems (Kafka [10], Flink [9]), and one ML system (PyTorch [24]). Below we elaborate on how these frameworks leverage the ApproxLib interface.

Batch data-processing frameworks. Batch data-processing frameworks like Spark process data in batches, which are large sets of data whose total size is known a priori [23]. To port batch frameworks to ApproxLib, their developers modify the framework to first specify the total job size (i.e., batch size) to ApproxLib. Then, at runtime, the user specifies an acceptable error, $e_{max}$, when a job is submitted, and the framework translates this into an $MSR$, which is also passed to ApproxLib alongside the job size. While different computations require different amounts of data to reach $e_{max}$, we can deduce a good sampling rate based on sampling theory for most computations, like Avg and Rank, in the following way. We first set the initial $MSR$ based on heuristics, e.g., $MSR=0.9$ for $e_{max}=10\%$. As more data is received, we use well-understood sampling theory [60, 43], similar to prior work on approximation [36, 25, 51], to estimate the error based on the mean and variance of the received data. We then adjust $MSR$ to bring the error closer to $e_{max}$. ApproxLib determines $TLR$ by applying this adjusted $MSR$ in Equation 1.
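One way to picture the error-driven adjustment of $MSR$ is the Python sketch below. It assumes an Avg-style computation and uses a standard normal-approximation confidence interval for the mean as the error estimate; the 1.96 z-value, the fixed step size, and both function names are illustrative assumptions, not the estimator or tuning rule used by NetApprox.

```python
import math


def estimated_relative_error(mean: float, variance: float, n: int,
                             z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for the
    mean of n received samples, relative to the mean (assumed estimator)."""
    if n == 0 or mean == 0:
        return float("inf")
    return z * math.sqrt(variance / n) / abs(mean)


def adjust_msr(msr: float, err: float, e_max: float,
               step: float = 0.05) -> float:
    """Nudge MSR so the estimated error moves toward e_max.

    The fixed step size is a hypothetical tuning knob.
    """
    if err > e_max:
        return min(1.0, msr + step)   # too much error: require more data
    return max(0.0, msr - step)       # error within bound: allow more loss


if __name__ == "__main__":
    err = estimated_relative_error(mean=10.0, variance=4.0, n=100)
    print(err, adjust_msr(msr=0.9, err=err, e_max=0.10))
```

The adjusted $MSR$ would then feed into Equation 1 to produce the next $TLR$.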

Batch frameworks often employ grouping (e.g., grouping on a data field). NetApprox supports grouping and allows the user to specify acceptable error for individual groups. At the framework level, developers associate each message with a GroupID. NetApprox optimizes the job completion time while meeting this group-based sampling specification.

We ported Spark to NetApprox by performing approximation at two stages: 1) the data input stage, i.e., network communication when Spark workers read input data, and 2) the intermediate stage where mappers send data to reducers. NetApprox allows users to enable approximation at either or both stages. In total, porting Spark took 320 lines of code and 3 engineering days.

Streaming data-processing frameworks. Unlike batch frameworks, streaming frameworks like Kafka and Flink process data over rolling windows (e.g., time-based or count-based). At the end of a window, streaming frameworks usually perform some computation on the data in the window, which often allows some approximation, e.g., counting the number of web page views per minute. A NetApprox job for streaming frameworks is thus the data to be sent in a window. Streaming frameworks specify the number of messages in a window (i.e., job size) to ApproxLib, together with $e_{max}$. Streaming frameworks that work with time-based windows could estimate the total size of the data in a window based on statistics collected from previous windows (e.g., data arrival rate).
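For time-based windows, the job-size estimate could look like the following Python sketch, which multiplies the average arrival rate of recent windows by the length of the next window. The function name, its parameters, and the use of a plain average are assumptions; the text only states that statistics from previous windows, such as the data arrival rate, can be used.

```python
from typing import Sequence


def estimate_window_messages(prev_window_counts: Sequence[int],
                             prev_window_secs: float,
                             next_window_secs: float) -> int:
    """Estimate the number of messages in the next time-based window."""
    total_secs = prev_window_secs * len(prev_window_counts)
    if total_secs <= 0:
        return 0  # assumption: no history yet, so no estimate
    # Average arrival rate (messages per second) over the previous windows.
    rate = sum(prev_window_counts) / total_secs
    return round(rate * next_window_secs)


if __name__ == "__main__":
    # Three previous one-minute windows with 600, 660, and 540 messages.
    print(estimate_window_messages([600, 660, 540], 60.0, 60.0))
```

The estimate plays the role of $N$ for the window, so an inaccurate estimate would shift the resulting $TLR$.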

We ported Kafka and Flink by performing approximation at the stages where Kafka feeds data to Flink workers and where Flink performs the flat-map and reduce-by-key operations. In total, porting these frameworks took 473 lines of code and 3 engineering days.

Machine-learning frameworks. Large-scale machine-learning training usually involves a distributed set of servers, each equipped with one or more GPUs [50, 29, 34]. A significant performance overhead in distributed ML training frameworks is the communication cost across servers [46, 58]. Specifically, one of the most costly communication tasks is the allreduce step in data-parallel ML training frameworks [2, 24], which involves different servers exchanging their locally computed gradients. However, not all gradients are of the same importance: smaller gradients (whose values are close or equal to zero) are less important than big gradients. Thus, smaller gradients could potentially be dropped with little or no impact on training accuracy [17, 42, 62].

We ported PyTorch to NetApprox by changing the gloo ring-allreduce functionality. We use different groups to categorize gradients (e.g., a group of gradients that are smaller than 0.001, a group in between 0.001 and 0.0025, and a group that is larger than 0.0025) and use different $MSR$s for them. The PyTorch ring-allreduce groups gradients into segments and then sends one segment at a time over the network. When all the gradients in a segment fall within the threshold of a group, we mark the segment to use the corresponding $MSR$ of the group. When a receiver is notified (by ApproxLib) of a lost segment, it uses zero as the value for all the gradients in it. In total, porting PyTorch took us 811 lines of code and eight engineering days.
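The segment-marking rule can be sketched in plain Python as follows. The group boundaries and $MSR$ values are the ones used for the PyTorch experiments in the evaluation; comparing gradient magnitudes, the handling of boundary values, and sending mixed or empty segments with $MSR=1$ are assumptions, as are all function and variable names.

```python
from typing import List, Sequence, Tuple

# (lower bound, upper bound, MSR) per group, on gradient magnitude.
GROUPS: List[Tuple[float, float, float]] = [
    (0.0, 0.001, 0.0),
    (0.001, 0.0025, 0.125),
    (0.0025, float("inf"), 0.925),
]


def segment_msr(segment: Sequence[float],
                groups: Sequence[Tuple[float, float, float]] = GROUPS,
                default_msr: float = 1.0) -> float:
    """Return the MSR of the group that contains every gradient in the segment."""
    if not segment:
        return default_msr
    for low, high, msr in groups:
        if all(low <= abs(g) < high for g in segment):
            return msr
    # Assumption: a segment spanning several groups is sent without loss.
    return default_msr


def fill_lost_segment(length: int) -> List[float]:
    """Receiver side: a lost segment is replaced by zeros."""
    return [0.0] * length


if __name__ == "__main__":
    print(segment_msr([0.0001, -0.0005]))   # all tiny gradients
    print(segment_msr([0.002, 0.0015]))     # all mid-sized gradients
    print(segment_msr([0.0001, 0.5]))       # mixed segment
    print(fill_lost_segment(4))
```

Substituting zeros for a lost segment is equivalent to skipping the update for those gradients, which is why the smallest-gradient group can tolerate an $MSR$ of 0.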

With our ported PyTorch, instead of directly generating grouping criteria (gradient thresholds and $MSR$ for each group) from a user-specified accuracy target which is extremely hard if not impossible, we view them as hyper-parameters that can be tuned in a similar way as traditional hyper-parameters like mini-batch size and learning rate.

7 Evaluation

We evaluate NetApprox using a combination of large scale simulations and testbed experiments which allows us to understand how our system works at a large scale and in practical settings with popular approximate applications.

Implementation details. We implemented ApproxLib as a user-space library in 1554 lines of C++ code. Applications can dynamically link ApproxLib using LD_PRELOAD, which intercepts POSIX system calls. ApproxLib supports common POSIX socket APIs like connect, accept, close, write, read, writev, sendfile64, send, sendmsg, and recv. Our current implementation of ApproxLib transparently intercepts network system calls and calls the corresponding ATP network APIs. We implement ATP as a user-space network stack using DPDK in 5386 lines of C++ code.

7.1 Simulation Results

Simulation details. For our simulator, we extended ns-2 [47] to include NetApprox’s end-host and switch functionalities. Our simulations are performed on two common datacenter topologies: a k=12 fat-tree topology [5] with 192 hosts and a traditional two-tier Clos topology [15, 16, 14] with 144 hosts. For each topology, we evaluated our system with 40 Gbps and 100 Gbps line rates. Because of space constraints, we only present the 40 Gbps fat-tree results in the paper. The results for the other configurations are qualitatively similar.

We set the switches to use eight priority queues [15, 16, 14] with a total buffer space of 1.5 MB [20]. NetApprox configures the eight queues in an approximation-aware way as described in § 5. The non-approximate queue applies the ECN marking scheme with a marking threshold of 65 [6, 20], while the approximate queues start to drop packets when they are full.

We compare NetApprox with two sender-side approximation schemes on three reliable transports and with simple UDP (which provides a performance baseline). The two sender-side approximation schemes are EarlySend (ES) and SampleSend (SS). ES sends messages as early as possible over a reliable transport and stops sending once an $MSR$ fraction of the data has been received. SS samples messages uniformly as they become ready (initiated by the application) with a sampling rate of $MSR$; it sends the sampled messages over a reliable transport. The three reliable transports are DCTCP [6], a widely used data-center transport that performs ECN-based adaptive congestion control; pFabric [7], a priority-based transport that schedules packets based on their priority and performs simple rate control; and Aeolous [33], a recently proposed transport aimed at high-speed networks, which initially sends a BDP’s worth of packets at line rate and then switches to a credit-based rate control algorithm [20]. For Aeolous and pFabric, we reuse their open-source codebases.

Simulation workloads. We simulate two workloads using the Facebook key-value-store trace (FBKV) [12] and the Facebook Hadoop trace (FBHD) [54]. We use the distributions of these traces to determine the size and inter-arrival times of messages in the jobs that we simulate. These two traces represent two extreme points in the spectrum of workloads: at one end FBKV consists of jobs with small message sizes and number of messages, whereas FBHD provides the other end of the spectrum with larger jobs.

7.1.1 Overall Performance

We begin by evaluating the general performance of NetApprox (Figures 8 and 8). For these experiments, we use an all-to-all traffic pattern (i.e., all hosts in the network are senders, each randomly selecting a host as the receiver). Unsurprisingly, NetApprox outperforms all reliable-transport-based techniques for both workloads. NetApprox outperforms SampleSend-based techniques because they drop packets at the sender at a constant rate even when the network has sufficient bandwidth to support more approximate traffic. NetApprox’s sending rate is adaptive to the network’s status, resulting in better network utilization and shorter JCT.

NetApprox also outperforms EarlySend-based schemes because EarlySend sends data aggressively in the initial period, which could cause congestion and retransmission overhead. Moreover, NetApprox saves more network resources, since it actively works to ensure that approximate flows use less than their network fair share, whereas pFabric, DCTCP, and Aeolous try to ensure that flows use their “fair” share of network bandwidth. Aeolous has the worst performance for FBKV. Although Aeolous’ credit-based rate control for non-first-RTT traffic reduces buffer occupancy, it introduces a new challenge: when a sender has credits but no data to send in a timely manner, the allocated bandwidth is wasted instead of being given to other senders [44]. FBKV has a lighter load with smaller messages, causing this problem to happen more often.

Less intuitively, NetApprox also outperforms UDP. This is because 1) UDP unnecessarily creates congestion in the network, and 2) when loss happens, NetApprox can retransmit unacknowledged packets if necessary, while UDP has to wait for new packets to be produced by the application. Moreover, UDP does not provide accuracy guarantees. We observe that UDP’s loss rate is 35%, while NetApprox’s loss rate is significantly lower (8.8%).

To better understand NetApprox’s performance improvements, we examine the queuing effect of NetApprox and the other four schemes in Figures 8 and 8. As expected, NetApprox has the smallest queue length (less than 1 MTU). pFabric causes significant queuing, and the queue size grows as $MSR$ gets higher. This is because a higher $MSR$ implies that more data will be sent, and pFabric always starts each flow by sending at line rate, which results in significant queuing, congestion, and ultimately timeouts. While not as severe as pFabric, DCTCP also has high queuing because its queue can build up when multiple flows compete for a switch output port (i.e., incast), a well-studied drawback of DCTCP [30, 44]. UDP’s queuing is high, especially with FBHD (for which UDP is similar to pFabric), because UDP also sends at line rate and FBHD has a more intensive data arrival rate than FBKV. Aeolous has a much smaller queue length than the other non-ATP schemes, because after the first RTT, senders only send messages based on credits received from the receiver, which allows Aeolous to control the queue size.

In addition to the all-to-all traffic pattern, we evaluate an all-to-one pattern, i.e., incast, where all hosts in the network send to the same host. Figure 12 shows the JCT of NetApprox, DCTCP-SS, and DCTCP-ES. We do not include pFabric or Aeolous results in this figure because their performance is much worse than DCTCP’s (in fact, Aeolous fails to complete the test). In particular, the incast pattern introduces a bottleneck and significant queuing at a specific Top-of-Rack switch. With pFabric, we observed that this incast resulted in significant queue build-up and ultimately triggered timeouts, which hurt performance. While Aeolous was able to avoid this level of build-up with its credit-based mechanism, we observed that it performs slowly because of a significant amount of credit waste. Digging deeper, we observed that the combination of short flows and inter-arrival times in FBKV resulted in scenarios where some flows received credits but were unable to use them. NetApprox’s limited buffer allocation prevents packet build-up but incurs loss, which approximate applications can tolerate and which NetApprox exploits by avoiding retransmissions and delivering packets out of order.

7.1.2 Dynamic Adjustment of $TLR$ and Priority

A key technical contribution of NetApprox is its ability to dynamically adjust the target loss rate (and, as a result, the sending rate) and priority based on network states and a job’s changing slack tolerance. To evaluate the effectiveness of NetApprox’s dynamic mechanisms, we create synthetic, controlled background traffic while running the foreground job of FBKV with $MSR$ 0.5. Figure 12 shows the timeline of how the measured loss rate, $TLR$, sending rate, and priority assignment change for the foreground job. In the beginning, the foreground job is assigned to switch queue 4 (because its initial $TLR$ is 0.5). We create the background traffic to initially also go to queue 4. Because of the competing background traffic, the foreground job’s actual loss rate gets higher than its $TLR$ even though ATP quickly slows down the sending rate. ATP then reduces its $TLR$ and its sending rate, and shifts the job to queue 3 (one priority higher than queue 4). The job’s loss rate then drops (since there is no other competing traffic in queue 3) and stays close to its $TLR$; as a result, it sends at a more aggressive rate. Then, at time 0.59, we shift the background traffic to queue 3, where it starts to compete with the foreground job again. With the loss rate again higher than $TLR$, ATP adjusts the $TLR$ and moves the job to queue 2, which then results in a drop in the loss rate. Towards the end of the test, we shift the background traffic to queue 2, and ATP adjusts the $TLR$ and moves the job to queue 1.

To understand how effective our dynamic mechanisms are, we perform a set of experiments similar to the one above, i.e., with background traffic that starts in the same queue as the foreground job and changes to one priority higher halfway through. For the foreground job, we use FBKV with different $MSR$s (thus, the jobs fall into different priority queues initially). Figure 12 plots the JCT of four schemes: DCTCP-ES; ATP_FixTLR, which statically sets $TLR$ to $1-MSR$; ATP_FixPrio, which dynamically sets $TLR$ but never moves the job from its initially assigned queue; and ATP_Full, which includes both dynamic settings. As expected, ATP_Full performs the best, and both dynamic mechanisms improve JCT. Dynamic priority setting (i.e., the difference between ATP_FixPrio and ATP_Full) greatly improves JCT, especially for small $MSR$. This is because ATP_Full can use a much higher sending rate (especially for smaller $MSR$) after it shifts the foreground traffic to higher-priority queues and avoids being in the same queue as the background traffic. Dynamic setting of $TLR$ (i.e., the difference between ATP_FixTLR and ATP_FixPrio) is more effective with larger $MSR$, because a poorly set $TLR$, as in ATP_FixTLR, could demand more retransmission to achieve a higher $MSR$ than when the $MSR$ is smaller and more slack can be tolerated. Finally, even without dynamic mechanisms, ATP still outperforms DCTCP-ES because of its approximate-aware rate control.

7.1.3 Effect of ATP Techniques

To further understand where ATP’s performance gain comes from, we dissect the effect of ATP’s various components besides dynamic $TLR$ and priority-queue setting. We compare (1) ATP_Base, a base protocol that uses raw UDP and, after sending out all packets, retransmits lost packets until $MSR\times N$ messages are delivered, (2) ATP_RC, which adds approximate-aware rate control to ATP_Base with $TLR$ statically set to $1-MSR$, (3) ATP_MRDF, which adds the MRDF scheduling policy on top of ATP_RC, and (4) ATP_Full, the final ATP protocol with dynamic $TLR$ and priority settings. Figure 12 shows the JCT of these schemes when running the FBKV workload with different $MSR$. As expected, rate control, i.e., ATP_RC, largely improves the basic protocol (by up to 42%); the inclusion of our scheduling algorithm, i.e., ATP_MRDF, improves performance by up to 39% over ATP_RC; and our dynamic mechanisms, i.e., ATP_Full, further improve performance by up to 47%. When $MSR$ is small, the effect of the different techniques is not obvious, as there is more slack tolerance with a small $MSR$ and a simple technique can work well when there is no competing traffic in the network.

7.1.4 Impact on Non-Approximate Traffic

One of the key benefits of NetApprox is its improvement of non-approximate traffic performance when co-running with approximate traffic. To demonstrate this benefit, we vary the ratio of non-approximate to approximate traffic and run the approximate traffic (FBKV) using ATP and DCTCP-ES, with different $MSR$s. As shown in Figure 16, ATP largely improves the JCT of non-approximate traffic for all the $MSR$s and ratios, especially when there is more approximate traffic in the network and when $MSR$ is larger.

7.1.5 Job-Level Fairness

A key benefit of NetApprox’s slack-based switch priority queue design is the avoidance of starvation, i.e., workloads with more slack (and thus more aggressive sending rate) should not starve workloads with little or no slack. To evaluate this design, we perform an experiment that runs three approximate jobs together: a FBKV job with $MSR=0.875$ , a FBHD job with $MSR=0.75$ , and a FBHD job with $MSR=0.5$ . The FBKV job starts after the two FBHD jobs, and the FBHD workload is more intensive than FBKV.

Figure 16 plots the JCT of the FBKV workload under four schemes. ATP_Full effectively prevents the light FBKV workload from being affected by the data-intensive FBHD traffic and comes closest to the performance of running FBKV alone, since it moves the FBKV traffic to a higher-priority queue. ATP_FixPrio and ATP_FixTLR both result in worse FBKV performance, because they do not move the traffic to another priority queue. ATP_FixPrio performs worse than ATP_FixTLR for lighter FBKV traffic loads (e.g., 0.25), because ATP_FixPrio keeps decreasing its rate to minimize retransmissions when heavy traffic occupies the same queue.

7.2 Real Implementation Results

Environments. We evaluate NetApprox on a lab cluster with five servers and one 100 Gbps, 32-port N8560-32C Ethernet switch. Each server has two 12-core CPUs and 64 GB of memory. Three of the servers are each equipped with one Nvidia A6000 GPU. Our switch is configured to use a 32 MB buffer shared by all 32 ports. To model a real switch’s load, we generate background traffic according to production data-center network traffic distributions [66].

7.2.1 Spark Results

We use four servers as Spark workers and three servers as Kafka brokers, which are the data sources. We use the NYT taxi dataset [21] for our Spark and Kafka/Flink experiments. It consists of itinerary information for all yellow and green taxi rides in New York City in 2017 and 2018. The total volume of raw data is about 21 GB. We evaluated two workloads on Spark: Avg, which computes the hourly average ride distance of the NYT trace, and Rank, which finds the top 15 taxi pickup locations within every hour of the NYT trace. Because of space constraints, we only present the results of Avg; the Rank results are qualitatively similar.

Sampling without group. We first use simple sampling for these jobs, i.e., NetApprox treats all messages as coming from a single group with the same importance. Users specify an $e_{max}$ , which NetApprox iteratively adjusts its $MSR$ to meet. Once it is met, the job is considered finished. Figure 16 shows the JCT and error rate of the Avg Spark job with different $e_{max}$ using DCTCP-ES, DCTCP-SS, NetApprox applied only at the input stage (denoted by NetApprox-S1), and NetApprox applied at both the input and mapper-to-reducer stages (denoted by NetApprox-S1&2).

Performing approximation with NetApprox largely improves Spark jobs’ overall performance while incurring less than 1% accuracy reduction. NetApprox outperforms DCTCP-SS and DCTCP-ES by 21.1% to 54.8%. This is because when switch ports are congested (with background traffic), DCTCP incurs excessive timeouts and retransmissions, while NetApprox adapts to congestion well with our slack-tolerant transport. As expected, using NetApprox for both stages improves JCT over applying NetApprox only to the input stage. NetApprox also achieves better accuracy than DCTCP-ES, because DCTCP-ES only sends the exact amount of data needed to reach the user-specified $e_{max}$ and then stops. In contrast, NetApprox can send more data if doing so would not affect JCT (when the network has extra room). As a result, NetApprox can achieve better accuracy than $e_{max}$ requires. NetApprox’s accuracy is slightly worse than DCTCP-SS’s because DCTCP-SS performs exact uniform sampling, which results in the best sampling quality for this problem. NetApprox trades some sampling uniformity for better performance, while still keeping errors within $e_{max}$.

Sampling with groups. The above results treat all data points the same when sampling. As discussed in §3.3.3, users can have many different sampling criteria, for which we offer the group semantics. To demonstrate this usage, we run a Spark data analytics job that computes the Avg of data with two different keys (i.e., using ReducebyKey), as shown in Figure 16. When using groups, NetApprox achieves the same sampling quality within each key, thereby improving the overall accuracy. Without groups, NetApprox samples across all the data from all keys and can end up not meeting the $e_{max}$ requirement within each key.

7.2.2 Kafka and Flink Results

We run three Kafka producers, three Kafka brokers, and four Flink workers. Each producer generates input data and streams it to a broker. The brokers forward the received data to the Flink workers. We run a window-based workload on this platform that calculates the Avg of the ride distance data received in every one-minute window, using the NYT taxi dataset. Since streaming jobs like this do not have a fixed “job” or JCT, we report the average goodput achieved during the streaming.

Figure 18 plots the average goodput and measured accuracy with NetApprox applied only to the Kafka-Flink-data-feeding stage, NetApprox applied to both this stage and the data-processing stage within Flink, DCTCP-ES, and DCTCP-SS. Similar to the Spark results, NetApprox outperforms both DCTCP-SS and DCTCP-ES (higher goodput), and applying NetApprox to both stages further improves goodput. For this workload, NetApprox’s accuracy is better than both DCTCP-SS and DCTCP-ES.

7.2.3 PyTorch Results

We use the three GPU-equipped servers in our cluster to run distributed DNN training on PyTorch. We train the VGG19 model [59] on the CIFAR-10 dataset [37] with minibatch size 80, learning rate 0.00002, cosine annealing, and the Adam optimizer. Figure 18 shows the convergence timeline, with wall-clock time on the X-axis and training-set accuracy on the Y-axis, when using NetApprox and DCTCP-ES. For NetApprox, we use a group policy where gradients less than 0.001 have $MSR$ 0, i.e., can all be dropped, gradients between 0.001 and 0.0025 have $MSR$ 0.125, and all gradients above 0.0025 have $MSR$ 0.925. We obtained these hyper-parameters from a few rounds of tuning. We also tested a simpler group policy, where all gradients below 0.003 have $MSR$ 0 and all others have $MSR$ 1; the results are only slightly worse than with the more complex group policy, demonstrating the robustness of approximation.

As seen, NetApprox reaches 90% training-set accuracy at 46 minutes, while DCTCP-ES takes 75 minutes to reach the same accuracy. The test-accuracy convergence (not shown for space reasons) exhibits a similar trend: NetApprox reaches the same test-set accuracy 45% faster than DCTCP-ES.
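As a quick check on the training-set numbers, the 46-minute and 75-minute times to 90% accuracy can be turned into a relative time reduction and a speedup factor. The arithmetic below uses only those two figures; it says nothing about the test-set result, which is reported separately.

```latex
\frac{75 - 46}{75} = \frac{29}{75} \approx 0.387
\qquad\text{and}\qquad
\frac{75}{46} \approx 1.63
```

That is, on the training set NetApprox needs roughly 39% less wall-clock time than DCTCP-ES to reach 90% accuracy, a speedup of about 1.63x.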

8 Conclusion

This paper presents NetApprox, the first network system designed for approximate computing. NetApprox leverages the inherent slack in approximate applications by embracing loss in the network and by assigning minimal switch resources to approximate traffic. Our large-scale simulation and real-implementation evaluations demonstrate that NetApprox simultaneously improves the performance of both approximate and non-approximate applications while guaranteeing user-specified accuracy requirements.

Bibliography

2. Abadi et al. [2015] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke,
3. Addanki et al. [2022] Vamsi Addanki, Oliver Michel, and Stefan Schmid. 2022. PowerTCP: Pushing the Performance Limits of Datacenter Networks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). Renton, WA, 51–70.
4. Agarwal et al. [2013] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys ’13). Prague, Czech Republic.
5. Al-Fares et al. [2008] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. 2008. A Scalable, Commodity Data Center Network Architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication (SIGCOMM ’08). Seattle, WA, USA.
6. Alizadeh et al. [2010] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. 2010. Data Center TCP (DCTCP). In Proceedings of the ACM SIGCOMM 2010 Conference (SIGCOMM ’10). New Delhi, India.
7. Alizadeh et al. [2013] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. 2013. pFabric: Minimal Near-optimal Datacenter Transport. In Proceedings of the ACM SIGCOMM 2013 Conference on Data Communication (SIGCOMM ’13). Hong Kong, China.
8. Ananthanarayanan et al. [2014] Ganesh Ananthanarayanan, Michael Chien-Chun Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, and Minlan Yu. 2014. GRASS: Trimming Stragglers in Approximation Analytics. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI ’14). Seattle, WA, USA.
