Reduced I/O Latency with Futures
Kyle Singer, Kunal Agrawal, I-Ting Angelina Lee

TL;DR
This paper introduces a method using futures to hide I/O latency in interactive applications, providing theoretical guarantees and an efficient implementation that outperforms prior approaches.
Contribution
It proposes a novel futures-based algorithm for hiding I/O latency, with theoretical analysis and a practical, efficient implementation on the Cilk-F runtime.
Findings
The algorithm offers better execution time guarantees than previous methods.
The prototype implementation demonstrates significant efficiency improvements.
Experimental results validate the effectiveness of the approach.
Abstract
Task parallelism research has traditionally focused on optimizing computation-intensive applications. Due to the proliferation of commodity parallel processors, there has been recent interest in supporting interactive applications. Such interactive applications frequently rely on I/O operations that may incur significant latency. In order to increase performance, when a particular thread of control is blocked on an I/O operation, ideally we would like to hide this latency by using the processing resources to do other ready work instead of blocking or spin waiting on this I/O. There has been limited prior work on hiding this latency and only one result that provides a theoretical bound for interactive applications that use I/Os. In this work, we propose a method for hiding the latency of I/O operations by using the futures abstraction. We provide a theoretical analysis of our algorithm…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Distributed systems and fault tolerance
Reduced I/O Latency with Futures
Kyle Singer
Washington University in St. Louis
,
Kunal Agrawal
Washington University in St. Louis
and
I-Ting Angelina Lee
Washington University in St. Louis
Abstract.
Task parallelism research has traditionally focused on optimizing computation-intensive applications. Due to the proliferation of commodity parallel processors, there has been recent interest in supporting interactive applications. Such interactive applications frequently rely on I/O operations that may incur significant latency. In order to increase performance, when a particular thread of control is blocked on an I/O operation, ideally we would like to hide this latency by using the processing resources to do other ready work instead of blocking or spin waiting on this I/O. There has been limited prior work on hiding this latency and only one result that provides a theoretical bound for interactive applications that use I/Os.
In this work, we propose a method for hiding the latency of I/O operations by using the futures abstraction. We provide a theoretical analysis of our algorithm that shows our algorithm provides better execution time guarantees than prior work. We also implemented the algorithm in a practically efficient prototype library that runs on top of the Cilk-F runtime, a runtime system that supports futures within the context of the Cilk Plus language, and performed experiments that demonstrate the efficiency of our implementation.
scheduling, work stealing, futures, performance bounds
††copyright: none
1. Introduction
With the prevalence of multicore processors, task parallelism has become increasingly popular. With task parallelism, the programmer expresses the logical parallelism of the computation and let an underlying runtime system handles the necessary load balancing and synchronizations. Modern parallel platforms that implement task parallelism include but are not limited to OpenMP (OpenMP 4.0, 2013), Intel TBB (Intel Corporation, 2012), various dialects of Cilk (Intel Corporation, 2013; Lee et al., 2010; Leiserson, 2010; Danaher et al., 2008) and Habanero (Barik et al., 2009; Cavé et al., 2011), X10 (Charles et al., 2005), and Java Fork/Join framework (Lea, 2000). These platforms often schedule parallel computations using work stealing, which provides provable bounds on execution time (Blumofe and Leiserson, 1994, 1999; Arora et al., 1998, 2001), good space bounds (Blumofe and Leiserson, 1999), good cache locality (Acar et al., 2000, 2002a), and allows for an efficient implementation (Frigo et al., 1998).
Research on task parallel platforms has traditionally focused on optimizing for compute-intensive and throughput-oriented applications, such as ones found in the domain of high-performance and scientific computing. Multicore processors have become commonplace and used in personal computers and servers, however, and a fundamental component of desktop software is its frequent interactions with the external world, done in the form of input/output (I/O), such as obtaining user input through key strokes or mouse clicks, waiting for a data packet to arrive on a network connection, or writing output to a display terminal or network.
The classic work stealing algorithm does not account for I/Os. I/O operations are typically done via low-level system libraries (e.g., the GNU C library) or through system calls provided by the Operating Systems (OS). While one can directly invoke functions provided by these libraries within a task parallel program, doing so has performance implications. In particular, when a worker thread — surrogate of a processing core managed by the scheduler — encounters an I/O operation, it can block for an extended period of time, leaving one of the physical cores underutilized. 111Low-level system support for asynchronous (non-blocking) I/O exists, but resuming the context on getting the I/O completion (typically a signal) can be complex.
In this work, we design a scheduler that hides the I/O latency — when a worker encounters a blocking I/O, it simply suspends the current execution context and works elsewhere in the computation. When the I/O completes, some worker (not necessarily the worker that suspended it) picks up the suspended context and resumes it. Moreover, the programming model seamlessly integrates both blocking and nonblocking I/Os into the task parallel programming model. Finally, the scheduler provides provably good performance bounds and efficient implementation.
As far as we know, only one prior result provides provably efficient scheduling bound of task parallel programs with I/Os. Muller and Acar (2016) present a cost model for reasoning about latency incurring operations (such as I/Os) in task parallel programs. In their work, given a computation with work — the total computation time on one core — and span222The term span is sometimes called “critical-path length” and “computation depth” in the literature. — the execution time of the computation on infinitely many cores — the scheduler executes the computation in expected time , where the is the maximum number of latency incurring operations that are logically in parallel. Their bound is latency-hiding in that, the latencies of I/O only appear in the span term and not the work term. If no latency-incurring operations are used, their bound is asymptotically equal to the standard work stealing bound of .
In this work, we improve the latency-hiding bound by using a scheduling algorithm based on ProWS (Singer et al., 2019), a recently developed work-stealing scheduler that efficiently supports futures. We implement I/O operations seamlessly within task parallel code using futures while getting nearly asymptotically optimal completion time. In particular, we were able to prove that our latency-hiding scheduler provides an execution time bound of in expectation; this bound is independent of the number of I/Os in the system. In particular, compared to the standard work-stealing bound, it just has an additional term of on the span term. This implies that while the standard work-stealing scheduler provides linear speedup when , our scheduler provides linear speedup when . ProWS has the same bound, but the analysis does not directly apply here due to the latency of the I/Os. We extend ProWS’s bound to futures with I/Os.
Once we extend the analysis, we essentially inherit the bound on “deviations” from ProWS. Intuitively, deviations are points at which the parallel execution of a program differs from its sequential execution. Spoonhower et al. (2009) argue that the number of deviations provides a good metric for evaluating practical performance because it is highly correlated to scheduling overheads and cache misses during parallel executions. ProWS (and we) guarantee that the expected number of deviations is where is the total number of futures which are logically in parallel.
The high-level intuition on why using futures to do I/Os and then using ProWS to schedule these futures provides better bounds is as follows: The work-stealing algorithm by Muller and Acar is parsimonious — a worker never steals unless it runs out of work to do. In contrast, ProWS’s and our work-stealing algorithm is proactive — whenever a worker encounters a blocking I/O operation, it suspends the entire execution context and finds something else to do by work stealing. This behavior may seen counter-intuitive since it potentially increases the number of steal attempts. It turns out, however, that that by carefully managing deques, one can amortize the steal cost against the work term sometimes, thereby obtaining a better bound. More importantly, we can get good bounds on deviations for the following reason: In the earliest result on deviations, Acar et al. (2002b) related the number of deviations to the number of steal attempts for fork-join programs. However, this relationship does not hold in parsimonious work-stealing if the program uses unstructured blocking operations like futures or I/Os making it difficult to bound the number of deviations. In proactive work-stealing, we can again bound deviations using the number of steals, allowing us to bounds deviations.
Our prototype system Cilk-L is based on Cilk-F (Singer et al., 2019), an extension of Intel Cilk Plus (Intel, 2013) that supports futures and implements ProWS. Cilk-L is able to defines a special type of futures, called IO futures, which utilize the parallelism abstraction provided futures to schedule I/Os in a latency-hiding manner which is composable with the rest of parallel constructs supported in Cilk-F (spawn, sync, fut-create, and get; we will briefly discuss these in Section 2). When a worker invokes an I/O operation using IO futures, a handle is returned, and the I/O can be done either synchronously by calling get on the handle immediately, or asynchronously, calling get at a later time when the result is needed in order for the control to proceed.
We empirically evaluated Cilk-L with microbenchmarks that interleave compute-intensive kernels with operations that incur I/O latencies. The empirical results indicate that, Cilk-L is effective at latency hiding. When we compare the execution times of Cilk-L with the “idealized’ execution times (where the I/O does not incur latency), we find that Cilk-L incurs little overhead, indicating the the I/O latencies are mostly hidden and occur in the background. In order to support future I/O, Cilk-L necessarily needs to incorporate additional system support for scheduling I/O asynchronously. We also provide an detailed breakdown of overhead.
Summary of contributions:
We extend the scheduling algorithm in Cilk-F to incorporate the latency-hiding cost model, and show that with I/O latency, the algorithm can schedule the computation in time on cores, independent of the number of I/O operations active in parallel. This bound is an improvement over the prior state-of-the-art by Muller and Acar. Since is a lower bound on the execution of this program on processors, this bound is nearly asymptotically optimal except for the overhead on the span. Moreover, our algorithm provides bounds on stack space and deviations, whereas the algorithm by Muller and Acar does not (Section 4).
We developed Cilk-L by extending Cilk-F to incorporate support for scheduling I/O in a latency-hiding way. By utilizing the abstraction of futures, one can perform asynchronous I/Os in task parallel code in a way that is composable with other parallel constructs (Section 3).
We empirically evaluated Cilk-L using microbenchmarks. The empirical results indicate that Cilk-L hides I/O latencies effectively and incurs little scheduling overhead in doing so (Section 5).
2. Preliminaries
This section provides the necessary background. We first discuss the syntax and semantics for the parallel control constructs supported by Cilk-F (Singer et al., 2019) and how one can represent a computation expressed with these keywords abstractly as a DAG. We then discuss how a parsimonious work stealing runtime schedules the computation assuming no latency-incurring operations are present.
Parallel Control Constructs:
Cilk-F, and by extension Cilk-L, support a small set of parallel control constructs: spawn, sync, fut-create, and get.333The keyword cilk_for also exists to indicate parallel loops, but it is just a syntactic sugar that translates to binary spawning of iteration space using spawn and sync. In Cilk-F, these keywords operate at the level of function calls. When a function spawns off a function by prefixing the call with the spawn keyword, may execute in parallel with the continuation of (the statements after the spawn). The keyword sync is the counterpart of spawn; it indicates that control cannot pass beyond the sync statement until all previously spawned children have returned. In Cilk-F, there is an implicit sync at the end of every function, ensuring that all children spawned via spawn return before this function returns.
The keyword fut-create works in a similar fashion as spawn. When a function spawns off a function by prefixing the call with the fut-create call, may execute in parallel with the continuation of . Unlike spawn, however, the execution of a sync has no effect on fut-create. The control can pass beyond sync even if a function previously spawned off via fut-create has not returned. Moreover, a fut-create returns a handle , which is an object that the execution of is associated with. The handle can later be used to ensure termination of and retrieve its result. In particular, when finishes execution, the last instruction is implicitly a put call which puts the result of into and marks the future as ready. By invoking get on the handle, the control cannot pass beyond the get until the execution of terminates and the future is marked as ready.
Execution DAG:
Parallel computations generated by programs written with these primitives can be represented using a directed acyclic graph (DAG). Vertices of the DAG represent a unit time computation task444This is an assumption of convenience — longer operations can be represented as a chain of unit time operations. and edges represent dependences between nodes. We make the standard assumptions: there is a single root node and the out-degree is at most 2.
We classify nodes into a few different categories. Regular nodes are simply computation nodes. A spawn node executes a spawn and has two children — the left child is the first node of the spawned function and the right child is the continuation node. A join node represents the continuation after a sync call and has multiple parents — sync node itself and the last nodes of all the functions being synced. The fut-create keyword behaves similarly to spawn and generates a future spawn node that has two children: the left child is the first node of the future task and right is the continuation. A future join node is the node immediately after the invocation of get and has two parents — the get node (called the local parent) and the future put node — the last node of the corresponding future task that puts the result of the future in the future handle.
We say that a node is ready or enabled if all its predecessors have executed. The work of the computation DAG is the total number of nodes in the DAG and is represented by — it is the total time to execute the DAG on 1 processor. The span of the weighted DAG is the longest path in the DAG and is represented by .
Parsimonious Work Stealing:
As mentioned in Section 1, parsimonious work-stealing works by doing local work first. In computations with no latency forming blocking operations, each worker maintains a single double ended-queue (or deque) of ready nodes. For the most part, a worker operates on its deque. In particular, when a worker finishes executing a node, it may enable 0, 1 or 2 of its children. If it enables one child, the worker next executes the child. If it enables two children, it puts the right child on the deque and executes the left child. If it enables no children, it pops the bottom node from its deque and executes that node. Only when a worker runs out of work (its deque comes empty), does it turn into a thief. At this time, it randomly chooses a victim to steal work from. Upon steal, the thief steals the ready node from the top of the victim’s deque and executes it. If victim deque has no ready nodes, then the worker tries another random steal.
3. The System Implementation
This section describes Cilk-L, a prototype system that extends Cilk-F (Singer et al., 2019) to incorporate support for performing I/Os with latency hiding. The I/O support in Cilk-L consists of two main components: the IO futures library and runtime support for doing asynchronous I/Os. We first discuss the programmer API for using IO futures, its implementation, and then the runtime support for asynchronous I/Os. Since the I/O operations are typically supported via low-level system libraries and by the underlying Operation System, currently Cilk-L only targets Linux platform and utilizes various file-I/O related facilities from Linux.
The IO Futures Library
We use an example to illustrate the programming API provided by the library. Figure 1 shows the distributed map-and-reduce example used by Muller and Acar (2016). Our microbenchmark effectively uses the same parallel structure to generate workload with I/O latencies; doing so allows us to indirectly compare our results with the empirical results in Muller and Acar (2016) (discussed in Section 5).
In this example, the function distMapReduce takes in five parameters: , , , , and . The computation works as follow. The code obtains a set of input values from different network connections. In Linux, all I/O devices are presented as files, including network connections, which allows for a uniform interface for performing I/O (Bryant and O’Hallaron, 2015, Chp. 10). The call to openConnection in line 1 abstracts away the sequence of steps to open a network connection, which returns a file descriptor representing the connection once it’s open. For each value in the set the code applies the map function , and then combine the resulting values of applying using a binary reduction operation .
The IO futures library exposes one data type to the programmer, the handle for IO futures io_future, and two I/O functions, cilk_read and cilk_write. The cilk_read and cilk_write functions are analogous to the Unix read and write system calls, except that they are asynchronous, i.e., non-blocking. Upon calling, both functions return an io_future handle representing the on-going I/O operation once it’s set up but the function itself does not block on the I/O. However, when the result is needed in order to proceed, the programmer can invoke get on the io_future handle. Like get on an ordinary future, control cannot pass beyond get on the returned io_future until the corresponding I/O operation completes.555The worker itself does not block when this happens — the worker takes actions according to the proactive work-stealing strategy described in the next section.
Every call to a Cilk-L I/O function first creates an io_future to represent the non-blocking I/O request, and bundles it with the corresponding data required to carry out the I/O request (such as the file descriptor and the buffer to store input). This data bundle is then inserted into a lock-free single-producer/single-consumer queue, which we refer to as the communication queue, to be processed by the runtime. The io_future is then returned to the caller. If the user needs the result from the future or wants to ensure that it has completed, it can perform a get on this handle io_future. The instruction immediately after this get (the continuation of get, in other words, the future join node) can not execute until the I/O has completed.
Runtime Support for Hiding I/O Latencies
At runtime startup, normally Cilk-F creates persistent threads for workers. In Cilk-L, persistent threads are created — for every worker a corresponding I/O thread is created, and this persistent thread is pinned to the same core as its worker.666If the hardware has hyperthreading enabled, Cilk-L pins them to separate hardware threads (hyperthreads) associated with the same physical core. The I/O thread is only used to process I/O requests (via the IO futures library) generated by the worker’s execution of user code. Thus, in the library implementation described above, the communication queue is used as means for the worker thread to communicate I/O requests to its I/O thread.
When an I/O thread runs, it dequeues items from the communication queue and attempts to perform an I/O operation as soon as it is received. If an I/O operation cannot be completed immediately (e.g., the next package has not arrived on the network channel yet), however, we would like to put the request aside and process it later when the I/O device becomes ready (has more input to be consumed).
In order to describe how the actual mechanism works, we need to briefly discuss how I/O works on Linux. As mentioned earlier, any I/O device on Linux is represented as a file descriptor. Obtaining an input (read) from a file descriptor is effectively copying data from the corresponding device into memory (e.g., the in the example). If the device is not ready to be read (e.g., the next package has not arrived on the network channel yet), the system call read will block. One could make the system call with the correct flag so that the system call would simply return instead of blocking, with a return value indicating input not ready. However, in this case, we must make the system call to check back periodically in order to know when a device becomes ready.
On possibility is to periodically wake up the I/O thread and have it poll the device via non-blocking read. This scheme is not ideal, as a system call can be expensive. Moreover, if the device is not ready, checking would simply cause the I/O thread to take up processor cycles that could be better used by its worker working on the actual computation. Thus, we would like to avoid the periodic wake up and the unnecessary system calls. Ideally, we would like the I/O thread to simply sleep and not use any processor cycles unless one of the following conditions happen: a) one of the file descriptors with pending operations becomes ready; or b) its worker inserts a new I/O request into the communication queue.
To achieve part a), we use the Linux epoll (Lin, 2019a) facility which allows the I/O thread to monitor a set of file descriptors (an epoll set). Adding a file to be monitored is an operation, where is the number of file descriptors currently in the epoll set. The I/O thread can go to sleep by calling epoll_wait, and it will be woken up when one of the monitored file descriptors become ready. Determining which file descriptors monitored has become ready is a operation — adding a file descriptor to the epoll set registers a callback with the file’s underlying system driver; this callback will move the file into a ready list and wake the monitoring thread when I/O on that file becomes possible. Once the I/O thread is woken up, it can query epoll to obtain the list of ready file descriptors, which allows the I/O thread to determine which pending I/O operations can continue. In summary, each I/O thread maintains its own epoll set. When an I/O thread receives an I/O request but the corresponding file descriptor is not ready, the I/O thread adds the file descriptor to its epoll set to be monitored. Once an I/O thread has processed all I/O requests in the communication queue, it goes to sleep via epoll_wait. Doing so achieves part a).
One last piece puzzle is how one to avoid having the I/O thread check the communication queue periodically and yet still allow submitted I/O requests to be processed quickly whenever it is received. We solve this by using an event wait/notify mechanism called eventfd provided by Linux (Lin, 2019b). The eventfd mechanism is used to create a file descriptor that can read by a I/O thread and written to by its worker. This file descriptor can be opened with semaphore-like semantics, in which writes will increment a backing counter and reads will decrement the same counter. When used with epoll, a write to an eventfd file descriptor will cause the I/O thread to wake up whenever the backing counter is incremented from 0 to 1. By writing to an eventfd file descriptor associated with a communication queue whenever an I/O operation is enqueued, and by symmetrically reading from the same file descriptor whenever an operation is dequeued, epoll can also be used to monitor the state of the communication queue. Thus, we use this combination of eventfd and epoll to achieve part b).
By combinations of these mechanism, we achieve the effect that, an I/O thread is only woken up and take up processor cycles when either there is a new I/O requests from the worker or when one of the previously processed I/O that was pending now becomes ready. When a I/O thread completes an I/O operation, it performs a put on the corresponding io_future handle. From its worker’s perspective, a call to get can cause the current execution to suspend, but the worker would just go find something else to do. Cilk-L schedules the execution of the IO futures just as how Cilk-F schedules ordinary futures, which we briefly review in Section 4.
4. Algorithm and Analysis
In this section, we will describe how to represent a program with I/Os abstractly, the high level scheduling algorithm, and the runtime analysis. For scheduling, we will essentially use ProWS, the proactive work-stealing algorithm described by Singer et al. (2019). The algorithm described in that paper schedules programs with futures in a time and space efficient manner. For completeness, we will briefly describe the algorithm here. However, the analysis in that paper handles futures but not I/Os. Here we will show how that analysis can be extended to also appropriately handle I/O latencies.
Execution DAG
We will extend the model from Section 2 and add weighted edges in a manner similar to Muller and Acar (2016). In our model, I/O operations are performed within future tasks. The invocation of an I/O function (cilk_read and cilk_write) creates an io_future implicitly, sets up the necessary data for the I/O request, inserts the request into the communication queue (discussed in Section 3), and returns. We will call the last node of this future task before it returns the I/O setup node. However, unlike in non-I/O future tasks, this future itself is not ready. The future is ready when the I/O thread and executes put upon the I/O completion — we will call the put node of an I/O future an I/O put node. We will have a heavy edge between the I/O setup node and the corresponding I/O put node — the weight on this edge represents the amount of time it took for the I/O to complete. All other edges are light in that they have weight of one.
We can define work and span — the work is unchanged: it is the total number of nodes in the DAG. Therefore, it is unaffected by the latencies on the edges. The span of the weighted DAG is the longest weighted path in the DAG and is the only parameter affected by the latencies.
Again a node is ready if all its predecessors have executed, except for the I/O put node. An I/O put node is suspended once its predecessor (the corresponding I/O setup node) finishes executing. If is the weight of the incoming edge to the put node, it remains suspended for time steps. After these time steps, it is considered to have finished executing since the I/O thread will write the result into the future handle after these time steps. This definition of suspension of a put node is simply for the ease of analysis and has no impact on the scheduler since the put node is executed by an I/O thread and not by the worker thread.
Proactive Work Stealing
We use Singer et al. (2019)’s proactive work stealing scheduler unchanged. The main difference between proactive and parsimonious work stealing is the handling of a blocked future get. In parsimonious work stealing, when a worker’s current node executes a get and the future is not ready, the subsequent future join node is not enabled. Therefore, the current node enables 0 children and (as described in Section 2) the worker pops the next node from the deque and continues working on it. Muller and Acar (2016)’s algorithm is a variant of this — when a worker blocks on an I/O, it pops the next node off its deque and keeps working on it.
A proactive work-stealer behaves differently on executing a get where the future handle which is not ready.777There are other circumstances where the execution of the node enables no other nodes such as when a worker returns from a spawned or future function — in all these circumstances proactive work stealng behaves as the parsimonious one and pops the bottom node from its deque. instead of popping the next node from its active deque , the worker work steals. In particular, the worker (1) marks the current deque suspended; (2) it randomly picks another worker and donates this suspended deque with this worker; and (3) allocates a new active deque for itself and randomly work steals. When the handle become ready (the future finishes), then the corresponding put node marks the deque resumable and pushes the future join node to the bottom of .
Therefore, in a proactive work-stealing scheduler, each worker has potentially many deques. One of these is active — this is the deque the worker is currently working on. In addition, it many have many suspended and resumable deques — collectively, the suspended and resumable deques are called the worker ’s inactive deques. In addition, any suspended deques that have no ready nodes are unstealable; all other deques are stealable. The reason for this distinction is that unstealable deques have no ready nodes, so stealing from them is a waste of time. Note that any resumable deques with no ready nodes are simply de-allocated. However, a suspended deque with no ready nodes cannot be deallocated for the following reason. Deque is suspended since some get executed, but the corresponding future has not completed. When this future completes, the corresponding put node will enable the future join node and push it at the bottom of and mark it resumable. Therefore, if we deallocate it, we would not have a targeted place to push this future join node.
A steal attempt also works slightly differently compared to traditional work stealing. When work stealing, a thief first picks a random victim and then picks a random stealable deque to steal from among the deques that the victim has. If the target deque is suspended, then the worker simply steals the top node from the deque. If the deque has no more ready nodes, then this deque is marked unstealable. There are additional details on how to handle resumable deques in order to get the correct bounds on running time and deviations — however, these details do not change in our analysis and we refer the reader to Singer et al. (2019) for those details.
The important bits from the perspective of our understanding are the following: (1) Every worker has potentially many deques: one deque is active, and there are potentially many inactive deques (either suspended or resumable and some of the suspended deques may be unstealable); and (2) due to random throws when the deques are suspended, all workers have approximately equal number of deques. We will use these two facts in the analysis.
Analysis
The analysis of the proactive work-stealing scheduler is, to a large extent, an extension of the analysis by Singer et al. (2019) (henceforth, we will call them SXL) with proper accounting for latency edges. Muller and Acar do account for latency on edges, but do not use futures for I/O’s, use parsimonious work stealing, and do not rebalance deques between workers. Therefore, the running time on processors is , where is the maximum suspension width — the number of I/O’s that can be pending at the same time in the DAG. There is no bound on the number of deviations. The way they handle the potential function in order to hide the latency is a little different from our method.
Here, we are using parsimonious work stealing and want to get a running time bound of and the deviation bound of where is the total number of futures logically in parallel. For the special case where all futures are I/O futures, .888SXL provide separate deviation bounds for structured and general futures and they both carry over. If all our I/Os are done using structured futures, then the bound is better than this — here we only provide the general bound. Their analysis doesn’t work out of the box, however, since it does not consider the latency on heavy edges. Therefore, here, we will rely on the lemmas proved in that paper, but modify the potential function in order to handle the heavy edges appropriately.
In general, in work stealing, a worker is always either working or stealing. The main point of the analysis is to bound the total number of steal attempts, say by . Since the total work is , the total running time is . In addition, bounding the total number of successful steals also gives us a bound on the total number of deviations (for proactive work stealing, though not for parsimonious work stealing).
ProWS Potential Function and Analysis
SXL’s analysis uses a potential function similar to the one used by Arora et al. (1998) (henceforth called ABP) to bound the number of steal attempts. The potential function there is based on the enabling tree — we say that enables if is the last parent of to execute and, in this case, we add an edge between and in the enabling tree. It turns that, for technical reasons, when using proactive work stealing, we cannot use the enabling tree. Instead, we will use the DAG itself to decide the potential of the node.
The potential function is based on the depth of nodes in the DAG. The depth of the node with one parent is . The depth of a node with multiple parents is similar, except that we add 1 to the depth of the deepest parent. The weight of node is .
We say that a node is the assigned node for deque if is the active deque for some worker and is currently executing . The potential of a node is defined as follows: if is assigned and is ready. For technical reasons, we will say that the assigned node for deque is at the bottom of deque even though it can not be stolen. The total potential of a deque is the sum of the potential of all nodes on including the assigned node if is active. The total potential of the computation is the sum of the potentials of all the ready and assigned nodes on all the deques.
Some of the key results from ABP carry over with these changes in definitions.
Lemma 4.1.
The initial potential is and the final potential is [math]. In addition, the potential never increases.
Lemma 4.2.
Top Heavy Deques The top most node in the deque has a constant fraction of the total potential of the deque.
The intuition is that the top of the deque contains the node that was pushed on the deque farthest in the past and therefore, it is the shallowest node in the DAG. Since the potential decreases geometrically with the depth, this node contains most of the potential of the deque.
The following lemma is a straightforward generalization of Lemmas 7 and 8 in ABP (Arora et al., 1998). The high-level intuition is that since the top node of each deque contains constant fraction of its potential, if we steal and execute the top node from each deque with reasonable probability, the overall potential is likely to reduce by a constant fraction.
Lemma 4.3.
Let denote the potential on deques at time and say that the probability of each deque being a victim of a steal attempt is at least . Then after steal attempts, the potential of deques is at most with probability at least .∎
In ABP, since there are only deques, one for each worker, this lemma shows that random steal attempts reduce the potential by a constant factor with constant probability. However, in ProWS, there are potentially many deques. Therefore, we may need many more steal attempts to reduce the potential. In addition, it is difficult to design a way to steal from all deques with equal probability if the deques are distributed across many workers.
In ProWS, however, recall that that when a deque is suspended, the worker picks a random worker and donates the deque to that worker. Therefore, even if one worker suspends many deques, it does not hold on to them — the deques are approximately evenly distributed among all workers. When a worker steals, it picks a random victim worker and then a random stealable deque from the victim. Therefore, each deque has an approximately equal chance of being a victim of a steal attempt. In particular, SXL show the following:
Lemma 4.4.
Given workers and stealable deques in the system, each worker has at most stealable deques with probability at least .
Another insight SXL uses is that a steal attempt from a stealable deque is generally successful if it is not an active deque. Therefore, if there are many (more than ) stealable deques in the system (and only of them are active), then most processors have at least one stealable deque and most steal attempts are successful. These periods are called work-bounded phases and SXL argue that the total number of steal attempts in work-bounded phases can be bounded by in expectation.
Therefore, we only need worry about decreasing the potential when there are not too many stealable deques — these times are called steal-bounded phases. SXL use Lemma 4.4 to argue that each stealable deque during a steal-bounded phase has at this times has at least chance of be of being a victim of a steal attempt (for some constant ) since no worker has more than stealable deques. Therefore, using Lemma 4.3, the potential of deques reduces by a constant factor after steal attempts (since unstealable deques are empty and have no potential). Given that the initial potential is , the expected number of steal attempts during steal bounded phases is . Therefore, considering both work- and steal-bounded phases, the total number of steal attempts is . In addition, they also separately bound the expected number of successful steals in work bounded phases is . This allows them to bound the deviations by .
Changes to potential and analysis to handle weighted edges
We want to show the same bounds when we use futures for I/O. The bounds on steals in work-bounded phases carry over unchanged. In particular, the expected number of steal attempts in work-bounded phases is still and the expected number of successful steals is still . However, for steal bounded phases, where we rely on potential to bound the number of steal attempts does not apply directly for somewhat technical reasons.
Consider the following scenario. Some worker with active deque executes a get on an I/O future handle and blocks since the future is not ready. It suspends this deque and steals. At some point, the I/O thread completes the I/O corresponding to , executes the put, enables the future join node (say ) for , and puts it at the bottom of . Note that this deque’s potential now suddenly increases, and our analysis strongly depends on the potential never increasing.
This is not a problem for SXL for the following reason: If a particular future join node is not ready, then some deque must have some ancestor of on its deque (either as a ready or an assigned node). Therefore, does not appear on from nowhere — some ancestor executes, this ancestor’s potential is larger than ’s potential and therefore, even though becomes ready, the overall potential of the computation does not increase. In our case, no ancestor of is ready or assigned anywhere in the system since the reason is not ready is due to the latency on an I/O edge. This is problematic since being enabled increases the potential of the system.
To fix this problem, we have to give potential to put nodes for I/O futures (even though they are executed by I/O threads) and handle them in a special way. In particular, recall that the only heavy edges in our DAG are between I/O setup nodes and the corresponding I/O put nodes. For I/O put nodes, we will define two notions of depth: the initial depth of an I/O put node with enabling parent (which is always an I/O setup node) is . The depth starts out as and reduces on every time step while the I/O is pending and this I/O put node is suspended. If the weight of the heavy edge (the latency of the corresponding I/O) between and is , then is suspended for steps. Therefore, ’s final depth is . When the I/O completes, this put node completes.
Now consider the child node of the I/O put node — is always a future join node. When deciding the depth of , we always use . That is, if is ’s other parent (the node generated by the get operation), then .
The potential of a pending put node is defined just like other ready nodes. At any time, if the depth of the put node is , its weight is and potential is . However, since the depth of the node changes over time, so does its weight and potential. The total potential of the computation is the sum of the potentials of all the ready and assigned nodes on all the deques as well as the potentials of all the put nodes (which are not on any deque).
We now get back the following lemma:
Lemma 4.5.
The potential never increases.
Proof.
We only need consider the case when a future join node is enabled by a I/O put node . By definition, has lower depth and therefore higher potential than . is only enabled once finishes. Therefore, the potential of the system does not increase. ∎
However, adding these put nodes creates a problem. These put nodes are not on any deque; therefore, steal attempts do not reduce the potential associated with these nodes directly. We must also now argue that the potential of I/O put nodes decreases appropriately during steal-bounded phases. This is the reason why we designed the potential of these put nodes in the funny way where their potential starts out high and reduces on every time step.
Lemma 4.6.
During steal-bounded phases, if the total potential at time is (including potential of assigned, ready and suspended put nodes), then after steal attempts, for some constant , the potential is at most with probability at least .
Proof.
SXL already argued that the potential of deques reduces appropriately. Therefore, we only need to consider the suspended I/O put nodes. Note that if the latency of a weighted edge is , then the corresponding put node remains suspended for time steps. Its potential starts at and decreases by a factor of on every time step. When the I/O completes and the future is ready, the potential of the put node is . Since it takes at least time steps to do steal attempts, the potential of this put node reduces by a large fraction during this time. This is true for all put nodes, giving the desired result. ∎
This lemma allows us to bound the expected number of steal attempts (and therefore, expected number of successful steals) during steal bounded phases by — the same result as SXL. Since the expected number of steal attempts and successful steals for work-bounded phases remains unchanged, we get the same time and deviation bounds as SXL.
Theorem 4.7.
The expected number of steal attempts is . Therefore, the expected running time is . In addition, expected number of deviations is .
5. Empirical Evaluation
This section empirically evaluates our prototype implementation of Cilk-L using a microbenchmark map-reduce that closely resembles the example shown in Section 3 Figure 1. We would like to answer the following three questions in the evaluation: 1) how well does Cilk-L hide latency; 2) how much latency-hiding Cilk-L can do compared to an “idealized” system that hides all the latency but incurs no additional overhead; and 3) how much each mechanism used to hide latency contributes to its overhead. How we measure each is explained in its respective subsections. Overall, the empirical results indicate that, Cilk-L hides latency well and can obtain significant speedup compared to running on Cilk-F. When the latency is short, one could use the oversubscribing strategy (allocating more workers than number of cores and let the OS scheduling hide latencies) and obtain benefit. However, Cilk-L breaks even compared to the oversubscription strategy at a latency of milliseconds using map-reduce, and outperforms the oversubscription strategy after that. The implementation of Cilk-L has been demonstrated to be lightweight, incurring minimal overhead comparing to an idealized version (that incurs zero latency with no system mechanism overhead).
Experimental setup.
We ran our experiments on a machine with two Intel Xeon Gold 6148 processors, each with 20 2.40-GHz cores, with a total of 40 cores. Each core has a 32 KB L1 data cache, 32 KB L1 instruction cache, and a 1 MB L2 cache. Hyperthreading is enabled. Both sockets have a 27.5 MB shared L3 cache, and 768 GB of main memory. Cilk-L and map-reduce are compiled with LLVM/Clang 3.4.1 with -O3 -flto running on Linux kernel version 4.15. Each data point is the average of runs. All data points have standard deviation less than except for a couple data points at .
Benchmark
We use a microbenchmark with very similar code structure to the map and reduce example (map-reduce) described in Section 3 Figure 1), which is also used by Muller and Acar (2016). Like Muller and Acar, we emulate remote server connections with simulated delays. At line 1 in Figure 1, rather than opening a true network connection, we used a timed file descriptor which becomes ready for I/O when the timer expires.999This functionality is provided by the Linux timerfd (Lin, 2019c). We replace the parameter with a parallel version of the naive recursive implementation of Fibonacci with a serial base case of 15, and used it to compute the 30th Fibonacci number. In place of calling function g (line 1), we return the sum of r1 and r2.
How Well Can Cilk-L Hide I/O Latencies
To answer question 1), we compare Cilk-L with two versions of Cilk-F:101010Cilk-F extends Cilk Plus to support futures. Singer et al. (2019) empirically evaluated Cilk-F and showed that it performs comparably to Cilk Plus. one uses the same number of workers (Cilk-F) and one uses twice as many workers so as to oversubscribe the system (Cilk-F (O)) and let the underlying OS perform scheduling to hide latency. The Cilk-L executes map-reduce that uses IO futures to hide latencies whereas the two versions of Cilk-F execute the baseline code that simply uses blocking read. Note that, since read is used in place of IO futures, the baseline version contains only spawn and sync.
We ran map-reduce with simulated I/O latencies of millisecond, milliseconds, milliseconds, and milliseconds. Figure 1 shows the speedup of Cilk-L compared to the one-worker execution time of running the baseline version of map-reduce on Cilk-F. Similar to what was observed by Muller and Acar (2016), we see little advantage to using Cilk-L to hide I/O latency at ms. In fact, for a latency of ms, we see that it may be preferable to utilize both hyperthread contexts as traditional workers rather than using one for I/O thread. By oversubscribing, at least for map-reduce, Cilk-F can hide latency better than Cilk-L. We took separate measurements to see where the “break even” points are, and Cilk-L and oversubscribing Cilk-F break even at around latency of ms.
However, by the time latency hits ms, there is a significant advantage to using the I/O functions provided by Cilk-L. With , we already see a speedup greater than , which increases to over at . On the other hand, oversubscribing Cilk-F achieves less than speedup. This pattern continues as we increase the latency to ms and ms, reaching speedups greater than and respectively.
One reason why we chose this benchmark and these latency configurations is so that we could indirectly compare to prior work by Muller and Acar (2016). They measured their latency hiding prototype implementation in parallel ML using similar experimental setup and were only able to achieve speedups around with the latency of ms. Note that, however, this indirect comparison may is not apple-to-apple, since their implementation is based on parallel ML, which has its inherent overhead in memory management.
Cilk-L’s Proximity to Ideal
Now we evaluate how close Cilk-L is to an “idealized” version at hiding I/O latencies. We obtain the ideal measurement by running Cilk-F with a timed file descriptor with zero delay.111111Technically we used nanosecond latency, which is the smallest latency one could specify with the timed file descriptor on Linux, but it effectively causes the read to become ready immediately. Moreover, since it is run with Cilk-F, it also does not incur any overhead of setting up IO futures, epoll, nor waking up and context switching to I/O threads.
Figure 1 shows the raw execution time of the ideal version and that of Cilk-L with different latencies. The overhead ( of Cilk-L divided by of ideal) starts out small with small and increases as gets larger. This is in part due to the fact that, the relative ratio between I/O latencies and compute time increases as the latency increases. By measuring fib of running on one worker, the total amount of work is about milliseconds, which is small compared to the millisecond latency. As the theoretical bound predicts, if there is high latency on the span, the execution time could be dominated by the span. We believe this is what happens in the case of millisecond latency with high number of cores. We confirm this hypothesis by running the same experiments with fib of , and the gap between Cilk-L with latency ms on cores decreases to be within .
Overhead in Latency-Hiding
The use of IO futures in Cilk-L has some inherent overhead: 1) setting up and tear down of IO futures, 2) invoking the epoll mechanism, which has its inherent system call overheads, and 3) waking up and context switching into I/O threads. Likely these overheads contribute to both the less preferable performance comparing to oversubscribing Cilk-F when the latency is small and the additional overhead comparing to the ideal version. To figure out how much overhead contributed by each source, we measure different versions of Cilk-L and compare that to the ideal version (Cilk-F running the baseline map-reduce with nanosecond latency). The +future version is similar to ideal except with the overhead of using IO futures (fut-create, placing the result into future handles, and get). Building on the +future version, the +epoll version then adds the overhead of using epoll. Finally, the +IO Thread version adds the overhead of using a separate thread to handle the I/O. Note that, however, since the latency is effectively zero, the I/O thread will be woken up only once per request when the request is inserted into communication queue. Figure 2 shows the comparison. The empirical results show that, the overhead from use of futures is negligible. The overhead from epoll and I/O thread are comparable, but both are small.
6. Related Work
Interesting Use of Futures:
Since its proposal in the late 70th (Friedman and Wise, 1978; Baker and Hewitt, 1977), the use of futures has been incorporated into various task parallel platforms (Chandra et al., 1994; Kranz et al., 1989; Halstead, 1985; Charles et al., 2005; Spoonhower et al., 2008; Fluet et al., 2010; Taşırlar and Sarkar, 2011; Cavé et al., 2011; Lu et al., 2014). Futures are typically used as a high-level synchronization construct to allow parallel tasks to coordinate with one another in a way that is more flexible than pure fork-join parallelism.
Researchers have proposed interesting uses of futures. Blelloch and Reid-Miller (1997) used futures to generate “non-linear” pipeline. Using futures to pipeline the split and merge of binary trees, they developed a parallel algorithm of tree merge with better span than a fork-join parallel marge algorithm. Surendran and Sarkar (2016) proposed using futures to automatically parallelize pure function calls in programs and developed the corresponding compiler analyses. Kogan and Herlihy (2014) described a use of futures in the concurrent programming setting, called linearizable futures, that allows a concurrent data structure to be shared among threads via the use of futures and formalized the correctness guarantees for such use. Milman et al. (2018) proposed an algorithm for batched lock-free queue concurrent data structure using futures.
Supporting synchronization primitives in work-stealing:
Researchers have also proposed runtime schedulers for scheduling programs with blocking synchronization. For instance, Agrawal et al. (2010) proposed a runtime system for helper locks where when a worker tries to get a lock which is not available, it tries to help complete the critical section that is currently holding the lock. They proved that this scheduler was efficient if large critical sections had sufficient internal parallelism.
X10 (Charles et al., 2005) and Habanero (Cavé et al., 2011) variants support synchronization primitives such as conditional blocks, clocks and phasers are supported. Most of these implementations do not have provably efficient performance bounds. Initially, in X10 (Charles et al., 2005) and Habanero Java, synchronization primitives (e.g., conditional atomic blocks or barriers) may cause the worker to simply block, and the runtime compensated by creating a new worker thread to replace the blocked worker. Later, Tardieu et al. (2012) proposed better compiler and runtime support for X10 for suspending a task blocked on synchronizations. However, the suspended tasks are stored in a centralized queue. For Habanero Java, (Imam and Sarkar, 2014) describe an alternative support: when suspended tasks become resumable, they are pushed onto the deque of the worker that executed the operation to unblock the tasks.
Zakian et al. (2016) extend Intel Cilk Plus (Intel, 2013) to provide support for a low-level library which allows a worker to suspend the current execution context upon encountering a blocking I/O and find something else to do. In this case, the multiple suspended contexts (deques) are stored with the worker which suspended them. However, only the active deque is exposed to be stolen from. Thus, a worker may end up with many suspended deques with high potential nodes that cannot be stolen.
Work-stealing schedulers with multiple deques per worker:
Various work-stealing runtime systems have used multiple deques per worker for different reasons. The runtime system for helper locks (Agrawal et al., 2010) (discussed above) used multiple deques per worker. When a worker is blocked on a lock, it is only allowed to work on the critical section, that is holding the lock (assuming this critical section has internal parallelism) and does so by allocating another deque specifically for this critical section . Therefore, in a program with nested locks with nesting depth , workers could have as many as deques each. However, the scheduler is designed so that each worker can steal from at most one deque of each of the other workers. In a similar vein, Agrawal et al. (2014) proposed Batcher runtime system to handle parallel programs that make data structure accesses. In this case, workers can be working on either the program work or the data structure work and this work is kept on different deques. But again, at any given time, a worker steals randomly among deques.
The closest work to this work is Porridge processor-oblivious record and replay system for dynamic multithreaded programs using work stealing (Utterback et al., 2017). Porridge allows multithreaded program with locks to be executed on some number of processors while recording all the happens-before relationship between critical sections. Later the execution can be replayed on a different number of processors but guarantees the same happens-before relationships. During record, the vanilla workstealing algorithm that just blocks on un unavailable lock can be used. However, during replay, a vanilla work-stealing scheduler can lead to deadlocks. Therefore, if a critical section tries to acquire a lock and can not acquire it since the critical section with a happens-before edge to has not finished, the processor must find something else to do. The runtime system there also uses proactive work-stealing and achieves similar bounds.
7. Conclusion
In order to support modern desktop and server software, I/O operations should be supported as a fundamental component of task parallel platforms. In this paper, we show how one one may incorporate I/O into a task parallel platform seamlessly with efficient scheduling to hide I/O latencies. In particular, our platform, Cilk-L, provides a programming API for performing I/O that works harmoneously with existing parallel control constructs. In addition, the underlying runtime system efficiently schedules both the computation and the I/O operations to provide nearly optimal execution time guarantee and a bound on the number of deviations. We achieve this by using proactive work stealing schedulers recently developed for scheduling computations with futures. Empirical evaluation of our prototype system shows that, I/O can be supported efficiently with effective latency hiding.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1(1)
- 2Lin (2019 a) 2019 a. Linux Programmer’s Manual EPOLL(7). http://man 7.org/linux/man-pages/man 7/epoll.7.html . (2019). Accessed in January 2019.
- 3Lin (2019 b) 2019 b. Linux Programmer’s Manual EVENTFD(2). http://man 7.org/linux/man-pages/man 2/eventfd.2.html . (2019). Accessed in January 2019.
- 4Lin (2019 c) 2019 c. Linux Programmer’s Manual TIMERFD _ CREATE(2). http://man 7.org/linux/man-pages/man 2/timerfd_create.2.html . (2019). Accessed in January 2019.
- 5Acar et al . (2002 a) Umut Acar, Guy E. Blelloch, and Robert Blumofe. 2002 a. The Data Locality of Work Stealing. Theory of Computing Systems 35, 3 (2002).
- 6Acar et al . (2000) Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. 2000. The Data Locality of Work Stealing. In Proc. of the 12th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA 2000) . 1–12.
- 7Acar et al . (2002 b) Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. 2002 b. The Data Locality of Work Stealing. Theory Comput. Syst. 35, 3 (2002), 321–347.
- 8Agrawal et al . (2014) Kunal Agrawal, Jeremy T. Fineman, Kefu Lu, Brendan Sheridan, Jim Sukha, and Robert Utterback. 2014. Provably Good Scheduling for Parallel Programs That Use Data Structures Through Implicit Batching. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’14) . ACM, New York, NY, USA, 84–95.
