Efficient Race Detection with Futures
Robert Utterback, Kunal Agrawal, Jeremy Fineman, I-Ting Angelina Lee

TL;DR
This paper introduces efficient algorithms, MultiBags and MultiBags+, for on-the-fly race detection in task parallel programs using futures, with theoretical guarantees and empirical validation.
Contribution
It presents two new algorithms that improve race detection efficiency for programs with restricted and general futures, respectively.
Findings
MultiBags runs in near-constant time for restricted futures.
MultiBags+ extends to general futures with acceptable overhead.
Empirical results confirm the practical efficiency of both algorithms.
Abstract
This paper addresses the problem of provably efficient and practically good on-the-fly determinacy race detection in task parallel programs that use futures. Prior works determinacy race detection have mostly focused on either task parallel programs that follow a series-parallel dependence structure or ones with unrestricted use of futures that generate arbitrary dependences. In this work, we consider a restricted use of futures and show that it can be race detected more efficiently than general use of futures. Specifically, we present two algorithms: MultiBags and MultiBags+. MultiBags targets programs that use futures in a restricted fashion and runs in time , where is the sequential running time of the program, is the inverse Ackermann's function, is the total number of memory accesses, is the dynamic count of places at which parallelism…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
1
Efficient Race Detection with Futures
Robert Utterback
Monmouth College
,
Kunal Agrawal
Washington University in St. Louis
,
Jeremy Fineman
Georgetown University
and
I-Ting Angelina Lee
Washington University in St. Louis
(2019)
Abstract.
This paper addresses the problem of provably efficient and practically good on-the-fly determinacy race detection in task parallel programs that use futures. Prior works on determinacy race detection have mostly focused on either task parallel programs that follow a series-parallel dependence structure or ones with unrestricted use of futures that generate arbitrary dependences. In this work, we consider a restricted use of futures and show that we can detect races more efficiently than with general use of futures.
Specifically, we present two algorithms: MultiBags and MultiBags+. MultiBags targets programs that use futures in a restricted fashion and runs in time , where is the sequential running time of the program, is the inverse Ackermann’s function, is the total number of memory accesses, is the dynamic count of places at which parallelism is created. Since is a very slowly growing function (upper bounded by for all practical purposes), it can be treated as a close-to-constant overhead. MultiBags+ is an extension of MultiBags that target programs with general use of futures. It runs in time where , , and are defined as before, and is the number of future operations in the computation. We implemented both algorithms and empirically demonstrate their efficiency.
††journalyear: 2019††copyright: acmcopyright††conference: 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming; February 16–20, 2019; Washington, DC, USA††price: 15.00††doi: 10.1145/3293883.3295732††isbn: 978-1-4503-6225-2/19/02
1. Introduction
Races constitute a major source of errors in parallel programs. Since they lead to nondeterministic program behaviors, they are extremely challenging to detect and debug. In this work, we focus on the problem of race detection for task-parallel programs, where the programmer denotes the logical parallelism of the computation using high-level parallel control constructs provided by the platform, and lets the underlying runtime system perform the necessary scheduling and synchronization. Examples of task parallel platforms include OpenMP (OpenMP 4.0, 2013), Intel’s TBB (Reinders, 2007; Intel Corporation, 2012), IBM’s X10 (Charles et al., 2005), various Cilk dialects (Frigo et al., 1998; Danaher et al., 2008; Leiserson, 2010; Intel Corporation, 2013), and Habanero dialects (Barik et al., 2009; Cavé et al., 2011).
In the context of task parallel programs, the focus is typically on detecting determinacy races (Feng and Leiserson, 1997) (also called general races (Netzer and Miller, 1992)), which occur when two or more logically parallel instructions access the same memory location and at least one access is a write. In the absence of a determinacy race, a task parallel program for a given input behaves deterministically.
Over the years, researchers have proposed several determinacy race algorithms (Mellor-Crummey, 1991; Feng and Leiserson, 1997, 1999; Raman et al., 2010, 2012; Bender et al., 2004; Fineman, 2005; Utterback et al., 2016; Surendran and Sarkar, 2016b; Xu et al., 2018) for task parallel code. These algorithms perform race detection on the fly as the program executes, and consist of two main components: (1) an access history that keeps track of previous readers and writers for each memory location; and (2) a reachability data structure for maintaining and querying whether two instructions are logically in parallel. On each memory access, the detector checks whether the current access is logically parallel with the previous accessors (stored in the access history) to determine whether a race exists.
Most prior work focuses on a restricted set of computations, namely computations that can be represented as series-parallel dags (SP dags) (Valdes, 1978) with nice structural properties, such as ones generated using fork-join parallelism (i.e.,spawn/sync or async/finish). Prior works show that one can race detect computations that are SP dags efficiently by exploiting the nice structural properties. In particular, the reachability data structure can be maintained and queried with no asymptotic overhead for both serial (Bender et al., 2004; Fineman, 2005) and parallel executions (Utterback et al., 2016). Moreover, the access history needs to store only a constant number of accessors per memory location to correctly race detect for such computations (Feng and Leiserson, 1997, 1999; Mellor-Crummey, 1991).
The use of futures has become a popular way to extend fork-join parallelism. Since their proposal (Friedman and Wise, 1978; Baker and Hewitt, 1977) in the late 70s, futures have has been incorporated into various parallel platforms (Lu et al., 2014; Cavé et al., 2011; Fluet et al., 2010; Charles et al., 2005; Chandra et al., 1994; Kranz et al., 1989; Arvind et al., 1986; Halstead, 1985). Researchers have studied scheduling bounds (Blelloch et al., 1997; Arora et al., 1998) and cache efficiency (Spoonhower et al., 2009; Herlihy and Liu, 2014) for using futures with fork-join computations. Kogan and Herlihy (2014) study linearizability of concurrent data structures accessed using futures. Surendran and Sarkar (2016a) proposed using futures to automatically parallelize programs.
The use of futures can form arbitrary dependencies, and thus computations generated by a parallel program that uses futures are no longer series-parallel. However, not much work has been done on race detecting programs with more general dependence structures.
Two prior works exist on race detection for programs that use futures and both are sequential (no known parallel algorithms exist). An algorithm proposed by Surendran and Sarkar (2016c) has high overheads — the running time is where is the work, or sequential running time of the program without race detection, is number of future objects and is the number of future operations. That is, the running time of the race detection algorithm increases quadratically with the total number of futures used in the program. More recently, Agrawal et al. (2018) present a sequential algorithm to perform race detection on SP dags with added non-series-parallel edges in time, which is the best known running time. The algorithm is difficult to implement however, since it requires storing all the nodes in the computation graph and traversing the graph during execution to update labels. Thus, no actual implementation of the algorithm exists to date.
Contributions
While prior work on race detection has focused on either structured SP dags or unrestricted use of futures that generates arbitrary dependences, we consider a restricted use of futures. Researchers have observed in other contexts (Herlihy and Liu, 2014) that using futures in a restricted manner can reduce scheduling and cache overheads. We define a specific structured use-case of futures that allows us to perform race detection much more efficiently than general use of futures. This class of futures is quite natural and can be checked with program analysis. We provide the precise definition in Section 2; informally, it requires that the instruction that creates the future is sequentially before the instruction that uses the handle.
We present two practical algorithms for race detecting programs with futures: MultiBags and MultiBags+. The main contribution for both algorithms is a novel reachability data structure. Both algorithms run the program sequentially for a given input and report a race if and only if one exists, following the same correctness criteria as prior work. MultiBags focuses on structured use of futures and incurs very little overhead — a multiplicative overhead in the inverse Ackermann’s function, which is upper bounded by for all practical purposes (Cormen et al., 2009). MultiBags+ is an extension of MultiBags, which handles general use of futures and has overhead comparable to the state-of-the-art theoretical algorithm (Agrawal et al., 2018) (i.e., multiplicative overhead of the inverse Ackermann’s function) and can be implemented efficiently. We have implemented both algorithms and empirically evaluated them. The empirical results show that both algorithms can maintain reachability efficiently for their designated use cases.
Specifically, we make the following contributions:
MultiBags: We propose MultiBags, an algorithm to race detect programs that use structured futures (Section 4). We prove its correctness and show that it race detects in time where is the work, is the inverse Ackermann’s function (upper bounded by ), is the number of memory accesses, and is the dynamic count of places at which parallelism is created. Since is a very slow-growing function, this bound is essentially for all intents and purposes.
MultiBags+: We propose MultiBags+, an algorithm to race detect programs that use general futures (Section 5). We prove its correctness and show that it race detects in time where , , , and are defined as above, and is the number of future operations in the computation (Section 5). Again, since the inverse Ackermann’s function is slow growing function, the running time is for all intents and purposes. Compared to the state-of-the-art proposed by Agrawal et al. (2018), MultiBags+’s running time has a multiplicative overhead of the inverse Ackermann’s function. Unlike the state-of-the-art, however, MultiBags+’s relative simplicity allows it to be implemented efficiently in practice. We provide a more detailed comparison between our MultiBags+ algorithm and the state-of-the-art (Agrawal et al., 2018) in Section 5.
FutureRD: We have built a prototype race detector called FutureRD based on MultiBags and MultiBags+. Empirical evaluation with FutureRD shows that our algorithms allow reachability to be maintained efficiently, incurring almost no overhead (geometric means of and overhead for MultiBags and MultiBags+, respectively). The overall race detection incurs geometric means of and overhead, respectively.
2. Preliminaries and Definitions
Parallel control constructs:
Our algorithms are described assuming parallelism in programs is generated using four primitives: spawn, sync, create_fut and get_fut. The algorithms themselves are general and can be applied to platforms that use other constructs that generate similar types of dags. We assume that spawn and sync are used to generate fork/join or series/parallel structures. In particular, for the purposes of this paper, function can spawn off a child function , invoking without suspending the continuation of , thereby creating parallelism; similarly, can invoke sync, joining together all previously spawned children within the functional scope.111Some constructs, such as async/finish primitives have slightly different restrictions, they still generate SP dags and our algorithms can be modified to apply to these programs. We assume create_fut and get_fut primitives are used to create and join futures, respectively. Like spawn, one can precede a function call to in with create_fut, which allows to execute without suspending . Unlike spawn, however, parallel function calls created with create_fut can escape the scope of a sync — a subsequent sync joins together previously spawned functions but does not wait for function calls preceded by create_fut to return. Instead, create_fut returns a future handle , which the program must explicitly invoke get_fut on to join with the corresponding computation (i.e., ). If has not completed when get_fut is called, then get_fut blocks until finishes and a result is obtained.
Modeling parallel computations:
One can model the execution of a parallel program for a given input as a dag (directed acyclic graph) , whose nodes are strands — sequence of instructions containing no parallel control – and edges are control dependencies among strands. The dag unfolds dynamically as the program executes. A strand is sequentially before another strand (denoted by ) if there is a path from to in the dag; two nodes and are logically parallel if there is no path from one to the other. The performance of a computation can be measured in two terms: the work of the computation is the execution time of the computation on a single processor; the span (also called depth or critical-path length) of the computation is its execution time on an infinite number of processors (or, longest sequential path through the dag).
Series-parallel dags:
Computations which use only spawn and sync can be modeled as series-parallel dags (SP-dag) (Valdes, 1978) that have a single source node with no incoming edges and a single sink node with no out-going edges. Upon the execution of a spawn, a fork node is created with two outgoing edges: one leads to the first strand in the spawned child function and one leads to the continuation of the parent. Upon the execution of a sync, a join node is created, that has two or more incoming edges, joining the previously spawned subcomputations.222Technically, a sync can join multiple children; therefore, a join node can have more than two parents. For simplicity, in this paper, we assume each join has exactly two incoming edges. One can modify all our algorithms to the more general case easily.
SP dags can be constructed recursively as follows.
Base Case: the dag consists of a single node that is both the source and the sink.
Series Composition: let and be SP-dags on distinct nodes. Then a series composition is formed by adding an edge from to with and .
Parallel Composition: let and be SP-dags on distinct nodes. Then the parallel composition is formed as follows: add a fork node with edges from to both sources, and a join node with edges from both sinks to . and . We refer to and as the left subdag and right subdag, respectively, of both the fork and join .
Adding futures:
We model computations that employ futures in addition to spawn and sync as a set of independent SP dags connected to each other via non-SP edges due to create_fut and get_fut calls. If a function spawns a function , then the strands of and are part of the same SP dag. However, if function calls using a create_fut call, then the first strand, say , of is the source of a different SP dag. The last strand of will be the sink node of this SP dag. Therefore, if a program calls create_fut times — that is, it creates total futures in addition to the main program — then it has SP dags which are connected to each other via non-SP edges.
These non-SP edges are incident on strands that end with create_fut and ones that immediately follow strands that ended with get_fut. A strand in function that ends with has two outgoing edges — one non-SP edge to the first strand in , and one SP edge to the continuation in . We say that is the creator of denoted by . The first strand of is the source of a new SP-dag which contains all strands of and the functions it calls (recursively) using spawn and the last strand of is the sink of this SP-dag. Similarly, strand in immediately follows a call where is the future handle for future has two incoming edges — one SP edge from the strand that ended with the get_fut call in the current function and one non-SP edge from the last strand of the . We say is the getter of denoted as . We say that if there is a path from to using only SP edges.
This model is quite general and subsumes computations that can arise from futures (Lu et al., 2014; Cavé et al., 2011; Fluet et al., 2010; Charles et al., 2005; Chandra et al., 1994; Kranz et al., 1989; Arvind et al., 1986; Halstead, 1985) or other future-like (such as “put” and “get” (Budimlić et al., 2010; Taşırlar and Sarkar, 2011)) parallel constructs proposed in the literature. Therefore, our algorithm would work on all of these primitives.
Structured futures:
We place the following restrictions on structured futures: (1) Single-touch: Every future handle is called with get_fut at most once. (2) No race on future handles: There is a sequential dependence in the program from the point where a future is created (via create_fut which initializes a future handle) to the point where it is read (via get_fut).More precisely, if strand terminates with a call and strand terminates with a call, then in the computation.
Eager execution:
Both our algorithms execute the computation sequentially and execute the program in depth-first eager order. When the execution reaches (after executing ) call or a call, it always executes the function . When returns, then the next node of the parent function (the one after the continuation edge) is executed. This execution order automatically has the property that all functions that must join at a sync point have already returned when the execution reaches the sync; therefore, the execution never blocks at a sync. Similarly, for structured futures, this execution has the property that the execution will never block at a get_fut. For general futures, we restrict our attention to computations where the use of futures is forward-pointing: for every future , executes before in the depth-first eager execution. Without this restriction, sequential execution of the original program could deadlock, in which case our algorithm race detects up to the point where it deadlocks.
3. Managing Access History
As mentioned in Section 1, there are two important components in a race detector: access history and reachability data structure. MultiBags and MultiBags+ differ in how they maintain reachability but manage access history similarly. This section discusses how they manage access history — for each memory location , the access history maintains enough information about the previous accesses to so that future accesses to can detect races.
When race detecting a series parallel program, it is sufficient to store a constant number of previous reader strands and a single previous writer strand in the access history(Feng and Leiserson, 1997; Mellor-Crummey, 1991). When a strand accesses a memory location , it checks if some subset (based on whether is reading or writing) of ’s previous accessors are in parallel with . Therefore, each memory access leads to at most a constant number of queries into the reachability data structure.
This property no longer holds for programs with futures, however. In particular, the access history for a memory location still holds only one writer strand, namely the most recent writer strand, . However, it must now store an arbitrarily large reader-list. Race detection proceeds as follows. Whenever a strand reads from a memory location , the detector checks the reachability data structure to determine whether is logically parallel with ; if so, a race is reported. Otherwise, is added to . When a strand writes to a memory location , the race detector must check against all readers in and with . If is in parallel with any of them, then it declares a race. Otherwise, the is set to and is stored as . We can empty the reader list without missing any races because anything that executes later that would be in parallel with these readers must also be in parallel with (the new ), and a race will be reported with .
A key thing to notice here is the following: the total number of queries into the reachability data structure (i.e., checking one access against another for race) is bounded by the total number of memory accesses in the computation. Since we can empty all the readers whenever we encounter a , a reader is checked against some writer at most twice: when it is inserted into the reader list, and right before it is removed. Thus, the total number of queries to the reachability data structure is bounded by , where is the work of the computation. We shall relate this observation formally to the performance bound of MultiBags and MultiBags+ later in Sections LABEL:sec:structuredOmitted and A.
4. MultiBags for Structured Futures
We now describe MultiBags, which can race-detect programs with structured futures in time where is the work of the program, is the inverse Ackermann’s function, is the number of memory accesses in the program and is the number of spawn and create_fut calls. Since the inverse Ackermann’s function is a very slowly growing function, the bound is close to optimal.
Notation:
Note that programmatically, spawn and sync are subsumed by create_fut and get_fut since we can convert a spawn to create_fut and sync to a series of get_fut calls, one on each function spawned in the current function scope. In the case of general use of futures discussed in Section 5, we distinguish between SP edges (generated by spawn and sync) and non-SP edges generated by create_fut and get_fut since the bound depends on , the number of get_fut calls, and converting all sync calls to get_fut calls will increase this number. For structured futures, however, the bound does not depend on ; therefore, for simplicity in this section, we assume that we only have create_fut and get_fut constructs to create parallelism.
The computation dag consists of three kinds of nodes — regular strands with one incoming and one outgoing edge, creator strands which end with a create_fut call with two outgoing edges, and getter strands (that come immediately after a get_fut call) with two incoming edges. It also consists of three kinds of edges: spawn edges are edges from creator nodes to the first strand of the future; join edges are edges from last strand of a future to getter nodes; all other edges (that go between strands of the same function instance) are continue edges.
4.1. Algorithm
This algorithm is similar to the SP-Bags algorithm for detecting races for series-parallel programs (Feng and Leiserson, 1997). As with that algorithm, we will use a the fast disjoint-set data structure (Tarjan, 1975). The data structure maintains a dynamic collection of disjoint sets and provides three operations:
Creates a new set and adds it to the disjoint sets data structure .
Unions the set into and destroys . We will sometimes overload notation and say where and are elements in the set instead of sets. This means that we union the sets containing and into the set containing .
returns the set that contains the element .
In this section, we only have one disjoint set data structure; therefore, is implicit.
As mentioned in Section 2, and like SP-Bags, MultiBags depends on the depth-first eager execution of the computation. MultiBags maintains a bag (a set in the union-find data structure) for each function instance which has been created and for which get_fut has not yet been called (these bags can be stored with the future handle). This bag is labeled either an -bag, represented by or a -bag, represented by . The algorithm maintains these bags as shown in Figure 1. The strands of a particular function are always added to before they execute.
The algorithm looks similar to SP-Bags (Feng and Leiserson, 1997). The main difference is that when the function returns, its -bag is renamed as bag; in SP-bags, would be unioned with , the parent function of . However, this small difference is crucial for handling programs with structured futures rather than series-parallel programs. In addition, the proof for correctness of this algorithm for structured futures is significantly different than the proof of correctness of SP-bags for series-parallel programs.
Figure 2 shows the operation of this algorithm on an example program which uses structured futures. Each rectangle is a function instance and nodes are strands. The straight dashed lines going towards the left represent create_fut edges while the curved dashed lines represent get_fut edges. Consider step 12 when the first node of function is executing. All nodes except node 4 are sequentially before this strand and are correspondingly in some -bag. Node 4 is in parallel with this strand and is in a -bag.
4.2. Proof of Performance and Correctness
First, we state the performance bound — the proof is essentially identical to the proof of performance of SP-Bags, since the algorithm is quite similar.
Proof of Performance of MultiBags
Theorem 4.1.
The running time of MultiBags when detecting races for a program with work is where is the number of memory accesses and is the number of create_fut calls.
Proof.
The fast disjoint-sets data structure provides the bound of amortized time per operation where is the number of operations and is the number of sets. For our program, is at most the number of memory accesses and is the number of create_fut calls. Note here, again, that unlike series parallel computations, each write may generate multiple queries; however, for the reason as explained in Section 3, the total number of queries is bounded by the two times the total number of memory accesses since each writer removes the entire reader-list. Therefore, the total running time is . ∎
Intuition for the proof
In order to argue that MultiBags is correct, we must prove the following theorem.
Theorem 4.2.
If the currently executing strand is , then a previously executed strand is currently in an bag iff .
In order to prove this theorem, we define two more terms. A node is a spawn predecessor of a node if there is a path from to which consists of only spawn and continue edges. A node is a join predecessor of if there is a path from to that consists of only join and continue edges. Spawn and join successors are defined in the symmetric way. We will overload notation and say that a strand is a spawn predecessor of a function if there is a path from to the first strand of that consists of only spawn and continue edges and similarly a strand is a join successor of if there is a path from the last strand of to . Each node is its own spawn and join predecessor and successor.
The algorithm works due to the following observations.333We implicitly assume that when we refer to any strand (or function), it is either currently executing or has already executed (we have no knowledge of strands or functions that are still to execute). We say that a function is active if it has started executing, but has not completed (returned). While a function is active, exists and doesn’t because is only created upon ’s return. All strands of an active function are in . After a function has returned, is destroyed; exists if get_fut has not been called on ’s future handle and if exists, then all strands of are in . After a function has joined (get_fut has been called) then neither nor exist.
Property 1 is a property of eager executions.
Property 1.
When a strand is currently executing, all spawn predecessors of are part of some active function. The converse is also true; all strands that are part of active functions are spawn predecessors of .
We need two other properties. The first one is a static property of paths in a program with structured futures as follows. If there is a path from a strand to stand ,444In general, there can be many paths from to . then there must be a path where the first (possibly empty) part of the path consists of only join and continue edges, while the second (possibly empty) part of the path contains only spawn and continue edges. In other words, there is a path where no join edges follow spawn edges. More formally, for any two nodes and , if , then we can find a node where is a join-predecessor of and is a spawn predecessor of . Combining with Property 1, we get:
Observation 1.
Consider a completed strand and a currently-executing strand where . The furthest join successor of , say , must be part of an active function.
This observation allows us to concern ourselves only with paths that go through nodes of active functions. In particular, to detect races, it is sufficient to try to check if the furthest join successor of any previously executed node is part of an active function. If so, the observation implies that and we already know from Property 1 that ; therefore, . If not, then .
The second property is a dynamic property of MultiBags which allows to precisely check this. In particular, it states the following (which combined with the previous observation gives us the theorem):
Observation 2.
Consider an already completed node . Say, at time , ’s furthest join successor is part of a function . If is active, then is in ’s bag, otherwise is in ’s bag.
An informal argument about why this property is true follows: In Line 1, MultiBags unions into when joins with an active function . This suggests the following: Consider a particular strand in function . Say at time , is the furthest join successor strand of which has executed (or is executing). (This strand is well defined since there is no branching in a path that contains only join and continue edges.) Say is part of function . Then at time , is in ’s bag. ’s bag is an bag if is active. Therefore, to check if ’s furthest join successor is active, it suffices to check if is in an bag, which is precisely what MultiBags does.
If we combine Observations 1 and 2, we get the theorem. Say is executing at time and consider a previously executed node and say is the furthest join successor of . If , then by the first observation, is part of an active function and therefore, by the second observation, is in an bag. On the other hand, if , then can not be part of an active function — if it was, then by Property 1, and therefore, (a contradiction). Therefore, is in a bag.
When the computation does a memory access, it simply checks if the previous accesses are in an In order to do queries, the algorithm operates as follows:
- (1)
Currently executing strand reads location : check if is in a bag; if so, declare a race. Otherwise, append to 2. (2)
Currently executing strand writes to location : check if any reader is in a bag; if so, declare a race. Otherwise, empty the reader list and set .
Proof of Correctness of MultiBags
First, we prove the static property stated in Observation 1. In order to prove it, we first define a canonical order on the futures of the program and show that we can always order the functions in this canonical order.
Lemma 4.3.
We can always find a canonical order.
Proof.
The order starts with the main function; we then progressively add futures to the computation. A future can only be added if its creator strand and getter strand have already both been added.
We will induct on adding new futures. We can always start since the main function is added first. At some point, we have already added some futures, say a set . Next, we will add the future such that there is no future such that where both creator nodes are in . There must be some such future (if there are more than one, we can choose arbitrarily). Since no future was created sequentially after was created, and the strand before the is sequentially after , must also be in ; therefore, it is legal for us to pick as the next future in our canonical order. ∎
We now use this canonical order to induct on the futures. In particular, we can show that the static property stated in Observation 1 (stated more formally and completely in the following lemma) by inducting on the futures in the canonical order.
Lemma 4.4.
If , then there exists a path from to that contains two sections: the first path (possibly empty) contains only join and continue edges and the second part (possibly empty) contains only spawn and continue edges. In other words there is never a spawn edge followed by a join edge on this path. In addition, this path is unique. Therefore, If , then there is some node (possibly or ) which is a join successor of and a spawn predecessor of .
Proof.
Induct on futures in the canonical order (which we can always find according to Lemma 4.3) and show that this is true as we add futures one by one.
Base case: We first have only the main strand, so this is true trivially.
Inductive case: Assume that after we have added a set of futures, the statement is true. We now add a new future .
Consider any nodes and in this new dag where . If neither nor are in , then the addition of does not add any new paths between and (since the only new path added is between and and there was already a path between them before we added ). In addition, any new path added does have a spawn followed by a join — therefore, the uniqueness is preserved. Therefore, we only need consider pairs where either or are in . If is in and is not, then the path from to must go from the last strand of to and then to . By inductive hypothesis, the path from already follows the desired property and the path from to only contains join and continue edges. Therefore, the property still holds. A symmetric argument applies when is in . ∎
We then prove the dynamic property stated in Observation 2 by looking at the execution as it unfolds. For each function , we define its operating function as the function containing the “furthest join descendant” of the last executed strand of . An active function is its own operating function. If a function is not active (it has returned), it may be confluent or non-confluent. It is confluent if its operating function is active; otherwise it is non-confluent. By definition, the operating function of a non-confluent function is always non-confluent. A confluent function can never be its own operating function, but a non-confluent function may be its own operating function if its has not yet executed.
The following lemma is proved by induction on the program as it executes.
Lemma 4.5.
(a) When a function is active, all its strands are in its bag. (b) If a function is confluent, then all its strands are in its operating function ’s bag. (c) If a function is non-confluent, then all its strands are in its operating function ’s bag.
Proof.
When a function is first called, it has an bag and its strands are placed in the bag. They remain in this bag while it is active. Once the function returns, all its items move to a bag. For the other two statements, we induct on time after returns.
Base Case: When returns, has not yet been called. Therefore, it is its own operating function; it is non-confluent; and all its strands are in its own bag.
Inductive Case: We will do this by two cases:
Case 1: is non-confluent and is its (non-confluent) operating function; by inductive hypothesis, all strands of are in ’s bag. The only thing that can make changes the location of its strands is if executes, say by function . At this point, (which is currently active) becomes the operating function for both and — therefore, is now confluent. All strands of (and incidentally ) move to ’s bag.
Case 2: is confluent and is its (active) operating function; by inductive hypothesis, all strands of are in ’s bag. The only thing that changes the location of ’s strands is if returns. At this point becomes non-confluent (since it is no longer active); therefore also becomes non-confluent. All of ’s strands move to ’s bag. ∎
The combination of static and dynamic properties leads to the proof of correctness. The intuition is that if a function is confluent, then there is some strand in its (active) operating procedure which is a join successor of all strands of and a spawn predecessor of currently executing strand.
Theorem 4.2 0.
If the currently executing strand is , then a previously executed strand is currently in an bag iff .
Proof.
By Lemma 4.4, we know that if , then we can find a node such that is a join predecessor of and is a spawn predecessor of . By Property 1, since is executing, the function containing , say , is still active. Therefore, by definition, the function containing is confluent. Therefore, by Lemma 4.5, is an bag.
If does not precede , then there is no path from to . Therefore can not have a path to any strand in any active function (otherwise by the second statement of Property 1, since has path to , will also have a path to ). Therefore, by definition, is non confluent. By Lemma 4.5, is in a bag. ∎
5. MultiBags+ for General Futures
We now consider general use of futures for programs that use both spawn/sync constructs and also futures. In particular, we consider programs where most of the parallelism is created using spawn and sync, but there are also future get_fut operations. For these programs, we provide a race detection algorithm that runs in total time , where is the work of the program, is the inverse Ackermann’s function, is the number of memory accesses in the program and is the number of spawn and create_fut calls. To put this bound in context, a series-parallel program has — in this case (and in fact, for any program where ), the MultiBags+ runs in time . Since the inverse Ackermann’s function grows slowly (upper bounded by ), this bound is close to asymptotically optimal.
As mentioned in Section 2, MultiBags+ depends on eager execution of the computation and we assume that our futures are forward-pointing. Therefore, the depth-first execution never blocks on a get_fut call since the corresponding future has already finished executing.
Notation:
Unlike in Section 4, we must distinguish between spawn and create_fut (similarly, between sync and get_fut) for MultiBags+. The computation dag consists of five kinds of nodes: (1) regular strands with one incoming and one outgoing edge; (2) spawn strands which end with a spawn instruction and have with two outgoing edges; (3) creator strands which end with a create_fut instruction and have with two outgoing edges; (4) sync strands which begin immediately after a instruction and have with two incoming edges; and (5) getter strands which begin immediately after get_fut instruction and have with two incoming edges. Some strands can have two incoming and two outgoing edges (if they start immediately after a get_fut or sync instruction and end with a spawn or create_fut); these strands are correspondingly in both categories.
The computation dag also consists of five kinds of edges: spawn edges are from spawn nodes to the first strand of the corresponding spawned function; join edges are from last strand of a spawned function to the corresponding sync node; create edges from the creator strand to the first stand of the future function; and get edges from the last strand of a future function to the corresponding getter node. Each future can have multiple get edges if it is a multi-touch future.555For context, in Section 4, both spawn and create edges were called spawn edges and both join and get edges were called join edges.
Reachability data structures:
Recall that we can model computations that employ futures as a set of series-parallel dags (SP dags) plus some non-SP edges (Section 2). When we need to check if , if they are already in the same SP dag (i.e., , as defined in Section 2), the disjoint-sets data structure maintained by MultiBags can readily answer the reachability query correctly. We only run into trouble due to use of general futures when but a path exists between them via (possibly more than one) non-SP edges.
Thus, MultiBags+ maintains two data structures, and both utilize fast disjoint-sets data structure described in Section 4. The first disjoint-sets data structure, called , is virtually identical to the data structure used in Section 4: we maintain and bags for each function, and the operations performed on are identical to those described in Section 4 in the case of create_fut, and spawn is treated in the same way as create_fut. In addition, sync is treated like get_fut. The only difference is that we do not perform any operation on this data structure on get_fut (since we allow multi-touch futures). We can use to correctly answer reachability query between two strands if they are in the same SP dag. Intuitively, for reasons similar to the structured case, all strands that are currently stored in an bag are sequentially before the currently executing strand. Note that all strands that are stored in a bag are not necessarily in parallel with the current strand, due to non-SP edges — we will use the second data structure to answer that query.
The second data structure handles the additional complication in reachability query when two strands are connected via non-SP edges. This data structure has two components: (1) a disjoint-sets data structure called that maintains a collection of disjoint-sets, and each strand is added to when encountered; and (2) a separate dag called that contains some of the sets from . The high-level idea is that these sets are made of connected series-parallel subdags of the original dag . For any two nodes and in different SP dags, MultiBags+ ensures that in iff in (the sets they are in are connected in ).
We call sets also in as the attached sets, which store nodes that are subdags which start and/or end with creator or getter strands.666This is not quite accurate; for technical reasons, some attached sets start/end with regular, spawn, and join nodes as well. explicitly maintains reachability relationship that arises due to non-SP edges between nodes in the attached sets. is simply a dag (with each node being an attached set), but it is not series-parallel. Thus, to answer reachability queries quickly between nodes in , MultiBags+ maintains a full transitive closure of all sets in — whenever a set is added to , its reachability from all sets already added to is explicitly computed and stored. Therefore, one can check if in in constant time.
If every set could be in we would be done. We must keep small, however, since every time we add a set to we compute a full transitive closure, which is expensive. It turns out that it is difficult to simultaneously put all strands in attached sets and keep small. In order to cope with this, some strands are in unattached sets, which are only stored in . Intuitively, an unattached set contains nodes of a complete series-parallel subdag which have no incident non-SP edges. Each unattached set has two additional fields, attached predecessor and attached successor, which point to attached sets that act as ’s proxies when querying . ’s attached predecessor, denoted as , is set when is created; therefore, it always points to some attached set. ’s attached successor, denoted as , is set at some later point; it either points to some attached set or may be null. An attached set is always its own attached predecessor and successor. We will overload notation and say that node ’s attached predecessor is ’s attached predecessor (and similarly for attached successor).
Answering queries:
Figure 3 shows how the reachability data structures are queried to find out if a path exists between some previously executed node and the currently executing node . In the first part of the query (lines 3–3), we query and if is in the bag, then we can conclude that and return. If is in the bag, then we check if attached successor of precedes the attached predecessor of in ; if so, we say that . Otherwise is in parallel with .
Maintaining , , and :
Figure 4 shows the code for maintaining reachability relationships between nodes in the computation. The first thing we do during a spawn, create_fut, return and sync is to manipulate (lines 4, 4, 4, and 4) in a manner identical to Section 4.777Since we are assuming binary forking, we sync with one function at a time.
Now lets consider the manipulations of and . It uses an auxiliary function , which simply checks if is an unattached set, and if so, converts it into an attached set by adding it to and adding an edge from to in .
The attached and unattached sets change as the execution continues. MultiBags+ unions sets in growing both attached and unattached sets. Two attached sets are never unioned together. Whenever we union an attached set and an unattached set, we always union the unattached set into the attached set; therefore, the resulting set is attached and remains in . On the other hand, an unattached set contains nodes of a complete series-parallel subdag which have no incident non-SP edges. In particular, consider a parallel composition of two series-parallel subdags and . Say has no incident non-SP edges. Then all nodes of constitute an unattached set if either the join node that joins and has not executed yet, or if has an incident non-SP edge. At a high-level, this design allows us to have the property that each non-SP edge only leads to a constant number of attached sets. Since only attached sets are in , this idea allows us to keep small, allowing us to get good performance.
Figure 5 shows a computation dag with futures and its corresponding at the end of the computation. As can be seen in this example, sets and are all unattached and each have the above property. In Appendix A, Figure 9 shows the dag when it has been partially executed while nodes 23 and 31 are executing, respectively and all unattached sets in those examples also have this property.
Due to space limitations, all the proofs are omitted to Appendix A. Here we simply state the performance and correctness theorems.
Theorem 5.1.
MultiBags+ detects races in time for programs with work, get_fut calls, memory accesses, and number of strands.
Theorem 5.2.
If the program is executed in a depth first eager execution order, returns true iff in .
Comparison to algorithm by Agrawal et al. (Agrawal et al., 2018):
As we have fully described the MultiBags+ algorithm, we discuss the differences between MultiBags+ and the the-of-the-art algorithm by Agrawal et al. (Agrawal et al., 2018) and provide an analytical analysis as to why the algorithm by Agrawal et al. is much more challenging to implement in practice.
The algorithm by Agrawal et al. utilizes the following data structures to answer reachability queries: 1) an order-maintenance data structure for answering series-parallel queries; 2) the full computation DAG to update and maintain “anchor-predecessors” and “proxies” used to infer “anchor-successors;” and 3) a reachability matrix which contains anchor nodes to answer reachability queries involving non-SP edges. The functionalities served by these data structures are similar to that of , , and in MultiBags+; in particular, their algorithm utilize anchor-predecessors and anchor-successors to allow for correct reachability queries involving non-SP edges, sharing similar roles as the attached predecessors and attached successors in MultiBags+. The main difference is in the second data structure and how the anchor-predecessors and anchor-successors are maintained.
In the algorithm by Agrawal et al., the mechanism for maintaining anchor-predecessors and proxies (which are used to infer anchor-successors) are more complex. In particular, to maintain anchor-predecessors, the algorithm maintains the full computation dag, and each strand (a node in the dag) explicitly stores its anchor-predecessor. However, anchor-predecessors can sometimes change as the program executes. When that occurs, the algorithm must explicitly traverse subpart of the dag and update some of the predecessors explicitly. The asymptotic complexity of such updates is still ok because the paper argues that a strand’s anchor predecessor can only change a constant number of times.
Similarly, the algorithm maintains a proxy per strand, used to infer a strand’s anchor-successor. A proxy for a strand is stored instead of its anchor-successor is because, while an anchor-predecessor of a node can change a constant number of times, its anchor-successor can change many times. Thus instead, the algorithm maintains a proxy, which indirectly allows the algorithm to deduce its anchor-successor. Like the anchor-predecessor, a proxy of a node can only change a constant number of times, and when that occurs, the algorithm again explicitly traverses the relevant subdag and updates the proxies explicitly.
We argue that this algorithm is harder to implement and likely has higher overheads due to the following reasons. First, explicitly maintaining the entire program dag and also storing each strand’s anchor-predecessor and proxy would be more memory intensive than keeping these strands in union-find data structures which are tagged appropriately. Second, explicit dag traversals in order to update proxies and anchor-predecessors of nodes would be expensive (even though the asymptotic complexity is manageable). This prior work establishes the state-of-the art time bound for race detecting programs that use general futures, but no implementation exists.
6. Experimental Evaluation
This section empirically evaluates FutureRD that implements MultiBags and MultiBags+ described earlier. We first evaluate the practical efficiency of these algorithms and then the performance difference between them, focusing on the impact of the additional overhead that MultiBags+ incurs, where is the number of get_fut operations.
Implementation of FutureRD
FutureRD works by instrumenting parallel program executions: upon the execution of a parallel construct (i.e., spawn, sync, create_fut, and get_fut), it invokes the necessary operations to update the reachability data structures; likewise, upon the execution of a memory access, it invokes the necessary operations to update the access history data structure and query both data structures.
We use Intel Cilk Plus (Intel, 2013) as our language front end, which is a C/C++ based task parallel platform that readily supports fork-join parallelism. Cilk Plus does not currently support the use of futures, however, so we have implemented our own future library. Since our race detector executes the program sequentially with eager evaluation of futures, the future library never actually interacts with Cilk Plus runtime during race detection.
Both MultiBags and MultiBags+ utilize disjoint-sets data structures to maintain reachability as described in Section 4). MultiBags+ additionally needs to maintain as part of its reachability data structure (defined in Section 5). Conceptually, is simply a boolean reachability matrix where each cell indicates whether there is a path from attached set to attached set . FutureRD maintains as a vector of bit vectors, representing the reachability between any two sets using a single bit. Whenever an edge is added to , reachability is transitively propagated via parallel bit operations.
FutureRD maintains the access history like a two-level direct-mapped cache, and keeps track of the reader list and last writer at four-byte granularity (all our benchmarks perform four-byte or larger accesses). That is, to query or update readers/writers for an address , the more significant bits of are used to index into the top-level table and the rest of the bits are used to index into the second-level table.
Experimental setup
We evaluate FutureRD using six benchmarks: longest-common subsequence (lcs), Smith-Waterman (sw), matrix multiplication without temporary matrices (mm), binary tree merge (bst) as described by Blelloch and Reid-Miller (1997), Heart Wall Tracking (heartwall), and Dedup (dedup). Heart Wall Tracking and Dedup both contain parallel patterns that cannot be easily implemented using fork-join constructs alone. The Heart Wall Tracking algorithm is adapted from the Rodinia benchmark suite (Che et al., 2009) that tracks the movement of a mouse heart over a sequence of ultrasound images. Dedup is a compression program that exhibits pipeline parallelism (Bienia and Li, 2010), taken from the Parsec benchmark suite (Bienia et al., 2008). All but dedup have two implementations: structured and general futures; dedup does not utilize the flexibility of general futures. We use the following input sizes: lcs uses , mm and sw use , heartwall uses images, dedup uses input large, and bst uses input tree sizes and . For Figures 6 and 7, we use base case for lcs, mm, and sw to keep the work the same for the baseline, MultiBags, and MultiBags+ (since MultiBags+ has additional overhead). We then vary the base case size for Figure 8.
We ran our experiments on an Intel Xeon E5-4620 with 2.20-GHz cores on four sockets. Each core has a 32-KByte L1 data cache, 32-KByte L1 instruction cache, a 256-KByte L2 cache. There is a total of 500 GB of memory, and each socket shares a 16-MByte L3-cache. All benchmarks are compiled with LLVM/Clang 3.4.1 with -O3 -flto running on Linux kernel version 3.10. Each data point is the average of runs with standard deviation less than with the exception of running dedup with full race detection, which sees a standard deviation under .
Practical efficiency of FutureRD
First, we evaluate the overhead of FutureRD and show that the algorithms can be implemented efficiently. To get the sense of where the overhead comes from, we ran the application benchmarks with four configurations:
baseline: running time without race detection;
reachability: running time with only the reachability components, including the instrumentation overhead to capture parallel control constructs;
instrumentation: running time with memory-access instrumentation overhead on top of the reachability configuration, but does not maintain or query the access history;
full: running time with the full race detection overhead.
Figure 6 shows the list of programs that employ structured futures running with different configurations, where FutureRD maintains reachability using the MultiBags algorithm. First, observe that the reachability configuration incurs almost no overhead, except for bst, which has very little work per parallel construct. Since the operations on the disjoint-sets data structure are very efficient, as long as there is sufficient work per parallel construct, the overhead of maintaining reachability in MultiBags should be low. These program contains large number of memory accesses, however, and thus adding instrumentation for memory accesses alone incurs additional – overhead.
Going from the instrumentation configuration to the full race detection incurs another – overhead, with the exception of dedup. We expect the additional overhead incurred to be about – because the full configuration transforms every memory access into updates to access history and queries to both access history and reachability data structures. Thus, each memory access is translated into a few function calls and several pointer chases to multiple data structures. The benchmark heartwall only incurs additional , because it spends non-negligible amount of time performing I/O (reading in image files). Finally, dedup is an outlier because dedup calls into a dynamic library to perform compression, which we could not recompile to include instrumentation. Thus, any memory accesses performed within the library do not incur additional overhead. Since the compression takes up a substantial amount of execution time, the additional overhead is small.
Figure 7 shows the runtime of programs that employ general futures where FutureRD maintains reachability using the MultiBags+ algorithm. The additional overhead incurred going from one configuration to the next is similar to Figure 6 except the higher overhead from MultiBags+ is evident in the reachability configuration.
Over five benchmarks (excluding dedup, since we could not instrument its compression library), we see a geometric mean overhead of and to maintain reachability using MultiBags and MultiBags+, respectively. Full race detection exhibits and overhead, respectively.
Comparison between MultiBags and MultiBags+
Next, we compare the performance difference between MultiBags and MultiBags+. To evaluate the overhead difference between them, we run the same programs (i.e., with structured futures) with both algorithms. Although MultiBags+ is designed for general futures, it also works with programs that use structured futures, albeit with an additional overhead, where is the number of get_fut calls.
For lcs, sw, and mm, is dictated by how much the base case is coarsened — the smaller the base case, the more get_fut calls, and the higher is (which leads to higher overhead). Runtimes shown before used base case of to keep the work asymptotically the same across baseline, MultiBags, and MultiBags+. Now we decrease the base case size below (i.e., increase ) to see how the overhead of MultiBags+ changes compared with the overhead of MultiBags.
Figure 8 shows the measurements for running programs with structured futures using MultiBags and MultiBags+ in the reachability configuration with different base cases. The overhead difference between MultiBags and MultiBags+ can readily be observed in Figures 6 and 7 — compared to MultiBags, MultiBags+ incurs more overhead running dedup and more running bst for maintaining reachability. Here we show additional numbers for benchmarks where varying base case sizes changes .
The measurements with lcs and mm bear out the extra overhead of MultiBags+. The lcs benchmark has work versus futures, while mm has more work (), but also requires futures. With a higher ratio of futures to total work, the overhead is more apparent. Moreover, the memory required for the reachability matrix becomes substantial for small base cases, adding more overhead. The sw benchmark, however, has work compared to futures, so the effect of smaller base cases is small.
7. Related Work
Besides works discussed in Section 1, researchers have considered race detection for other structured computations. Dimitrov et al. (2015) propose a sequential near-optimal race detection algorithm for two-dimensional dags which also exhibit nice structural properties. Subsequently, Xu et al. (2018) propose a race detector for two-dimensional dags with asymptotically optimal parallel running time. Lee and Schardl (2015) propose a sequential race detector for fork-join computations with reductions, where the computation dag is almost series-parallel except when reductions are performed.
Beyond task parallel code, there is a rich literature on race detection for programming models that generate nondeterministic computations, such as ones that employ persistent threads and locks. For such models, since the dag necessarily depends on the schedule, the best correctness guarantee that a race detector can provide is for a given program, for a given input, and for a given schedule. Early work (Savage et al., 1997; von Praun and Gross, 2001) employs lock-set algorithm, which provides wide coverage but can lead to many false positives, because it cannot precisely capture happens-before (HB) relations formed between threads.
A vector-clock (VC) based algorithm such as one proposed by Flanagan and Freund (2009) can capture HB precisely for a given schedule. Such algorithm can be used on computation with arbitrary dependences, but naively applying it to task parallel code would be impractical, since it requires storing a VC of length with each each memory location querying against it per access, incurring a multiplicative factor of overhead on top of the work, where is the number of strands, which can be on the order of millions.
In the context of race detecting nondeterministic code, researchers have investigated hybrid approaches incorporating VC and lock-set (O’Callahan and Choi, 2003; Pozniansky and Schuster, 2003; Yu et al., 2005; Serebryany and Iskhodzhanov, 2009) to trade-off precisions and coverage. More recently, researchers have proposed predictive analysis to explore alternative feasible schedules among close by instructions to increase the coverage (e.g. (Smaragdakis et al., 2012; Said et al., 2011; Liu et al., 2016; Kini et al., 2017)) while keeping the precision.
Acknowledgements
This research was supported in part by National Science Foundation grants CCF-1150036, CCF-1527692, CCF-1733873, and XPS-1439062. We thank the referees and our shepherd for their excellent comments.
Appendix A Proofs from Section 5
Proof of Performance of MultiBagsPlus
Theorem 5.1 0.
MultiBags+ detects races in time for programs with work, get_fut calls, memory accesses, and number of strands.
Proof.
We create three new attached sets when we encounter a create_fut edge and 2 new sets when we encounter a get_fut edge. The only interesting part is when we encounter a sync (lines 4–4). When neither of the component SP-dags have a non-SP edge (lines 4–4) or if only one of them has a non-SP edge(lines 4–4), no new attached sets are created. The only case where (at most two) additional attached sets are created is if both subcomponents have non-SP edges (lines 4–4). The total number of such sync nodes is . Therefore, MultiBags+ creates attached sets. Each time an attached set is created, it takes time to insert it into since MultiBags+ maintains a transitive closure. Other than this, each operation (Make-Set, Union and Find) into and runs in . As we argued in Section 3, each memory access generates a constant number of queries. We see from the code that each query leads to a single Find into and a constant number of Finds into . Therefore, the total cost of race detection is . ∎
Proof of Correctness of MultiBags+
We will use the example of Figure 5 to illustrate the proof. Figure 9 shows the same dag and the corresponding with all the attached and unattached sets when it has been partially executed while nodes 23 and 31 are executing, respectively.
We will redefine the terminology a little bit differently than Section 4. A node is a spawn predecessor of a node if there is a path from to which consists of only spawn, create and continue edges. A node is a join predecessor of if there is a path from to that consists of only join and continue edges. Notice the asymmetry here — a creator node is the corresponding future’s spawn successor, but the future is not the getter node’s join predecessor. This mimics the actions of the algorithm on , since spawn and create_fut behave identically while sync and get_fut do not.
We can now define operating function and confluence in the manner identical to Section 4 and it should be clear that Property 1 and Lemma 4.5 still hold.
First consider the first part of the query, where we just check if is in the -bag. As shown in lines 4 and 4, when a function calls either of , the algorithm exactly mimics MultiBags, simply creating an -bag for containing the first node of . Similarly, on function ’s return, becomes . Finally, when a function calls sync on function , it mimics get_fut in structured future and is unioned in . The only difference is that nothing happens on get_fut to . Therefore:
Lemma A.1.
Consider the currently executing strand and a previously executed strand . The following is true: (a) If is in an bag, then ; further more, there is a path from to consisting of only spawn, create, join and continue edges. (b) If there is a path from to consisting of only spawn, create, join and continue edges (no get edges), then is in an bag.
Proof.
Since moves to -bags happen in a more restricted setting here than in Section 4, statement (a) follows somewhat intuitively from Section 4. More formally, from Lemma 4.5, if in an bag, then either the function containing is active, or ’s operating function is active. Therefore, either or some join successor of is part of an active function — therefore, is sequentially before by Property 1.
For (b), say and let be the last node in on the path from to (if is in , then ). Since the path from to does not contain any get_fut edges, this path can not go through any create_fut edges either (since a create_fut edge will take execution out of and one would need a get_fut edge to come back). Therefore, since SP dags are a special case of structure futures, and is sequentially before , must have been in some bag when executed (from Theorem 4.2). After executed, no strand of was executed, therefore can not have moved to a bag since then. ∎
We now consider the case when returns false and use . Here, we must prove that if is in a bag when executes (otherwise, the query returned), then iff there is a path in from to . We have to make a few observations in order to see this.
The following two lemmas state some structural properties of attached and unattached sets and can be proven by inducting on set unions done by the algorithm.
Lemma A.2.
Each unattached set consists of the nodes belonging to a maximal series-parallel subdag such that (1) all nodes in the subdag have been executed, (2) there are no create_fut or get_fut edges in incident on nodes in , (3) there are at most two arcs in incident on nodes in —one directed towards ’s source, and one directed out from ’s sink.
Proof.
Induct on the construction. When we create an unattached set, it consists of a single node. When we merge sets without marking them attached, they are always complete series parallel dags (lines 4–4) with no incident non-SP edges. ∎
Lemma A.3.
The nodes in each attached set induce a series-parallel subdag such that if we contract all maximal series parallel dags within it, it forms a chain.(1) if there is a get_fut or a create_fut edge directed towards a node in , then it is directed towards the source, (2) if there is a get_fut or create_fut edge directed from a node in , then it originates on the sink. Unlike the unattached sets, attached sets do not necessarily match the series-parallel decomposition of —an attached set may have many incoming or outgoing edges.
Proof.
We induct on the growth of sets. The sets always start by containing single nodes. We union sets at two places: On lines 4–4, we union an entire series-parallel subdag together. In this case, the sets containing and are unattached and have no incident non-SP edges by Lemma A.2. Therefore, only the set containing may be attached. By induction, it can only have an incident non-SP edge at its source node. Before the unions, Its sync node is and it clearly does not have an outgoing non-SP arc since it is a spawn node, not a create_fut node.
On line 4, we union the set which contains a spawn node with a set containing one of spawn’s successors . By induction, the set containing only has an incident non-SP edge towards its source. Similarly, the set containing can only have a non-SP edge directed away from its sink node (again is its source node and it does not have a non-SP edge directed towards it since its previous instruction is not a get_fut or sync call).
Similarly, on line 4, we union a sync node with a set containing sync nodes successor. Again, we induct in a similar manner. ∎
We can see that the above lemmas are true for examples shown in Figures 5 and 9. From the above two lemmas, we can see that if and are in the same attached or unattached set, and then since and are in the same SP-dag and there are no incoming or outgoing non-SP edges to nodes within the dag except at the source and sink. Therefore, will answer the query between them correctly.
The following key lemma says that the relationship between nodes in two different attached sets is always correctly represented in .
Lemma A.4.
Consider nodes and , where and are distinct attached sets. Then iff .
The proof of the lemma is complicated and depends on many structural properties of the sets; however, there are two main intuitions. First, if we only consider only attached sets containing nodes of a single series parallel dag (ignoring all create_fut and get_fut edges) then the attached sets form the same series parallel relationship as . This can be seen from in Figure 5 where and form a series parallel dag and induce the same dependencies as the original dag. Therefore, if the path from to doesn’t contain any create_fut or get_fut edges, then we get the correct relationship.
Second, say there is a path from to which contains create_fut and get_fut edges. Walk along the path from until we encounter the first such edge. Let the source of this edge and be the destination. If both and are in attached sets, then there is a path from to in (from the previous paragraph). In addition, there is an edge from to in (since is the creator node and is the first node of a future and we explicitly add this edge on line 4). Therefore, there is a path from to in . We can then find the next create_fut or get_fut edge on this path and continue with the induction. Again, we can see this in Figure 5 where there is a path from to via and also a path from to via and .
In order to prove Lemma A.4 formally, we will define two kinds of attached sets. Intuitively, prefix-complete sets are those where the first node added to the set is the source node of the set and the set can grow in the forward direction; and suffix-complete sets are those where the first node added is the sink node of the set and the set can grow in the backward direction by unions with unattached sets.
We first understand the structure of prefix-complete attached sets. Consider an attached set where the first node added to . is prefix complete if is either (1) the strand immediately after a get_fut node (line 4); or (2) the first node in the continuation of a creator node (line 4); or (3) is the first node of a future function (line 4); or (4) a sync node where both its subdags that are joining are attached (line 4).
Lemma A.5.
If is the first node added to a prefix-complete attached set , then for all other nodes , we have . In addition, consider a node where is a prefix-complete attached set. (1) If there is an edge in such that is not in , then must be the sink node of . (2) If there is any edge in where is not in , then is either the source or is in an unattached set.
Proof.
The first statement is clear by construction. Any attached set constructed in one of the ways described above will never union with an unattached set that has nodes that precede . In particular, we can induct on the growth of a prefix-complete set to show all the properties described above. ∎
Consider an attached set where the first node added to . is suffix-complete if is either (1) the strand ending with a create_fut instruction (line 4); or (2) the node immediately preceding the getter node in the SP dag of the getter node (line 4); or (3) the spawn node where both spawned subdags are attached (line 4).
Lemma A.6.
If is the first node added to a prefix-complete attached set , then for all other nodes , we have . Consider a node ; if there is an edge in such that is not in , then must be the source node of . In addition, if there is any edge in where is not in , then is either the sink or is in an unattached set.
Proof.
Again, we can induct on the growth of sets created by the above methods. ∎
The following is a surprising lemma. Basically, a suffix-complete attached set is never an unattached set’s attached successor or predecessor.
Lemma A.7.
If an attached set is an attached predecessor or an attached successor of an unattached set , then must be prefix-complete.
Proof.
A set is set as an attached successor (on line 4) when a join node unions with . In this case, is clearly growing in the forward direction and must be prefix complete. The first node of the entire computation, is by definition, prefix-complete. After this, we can see that attached predecessor is always prefix-complete by inducting on the execution. ∎
We now consider nodes in unattached sets and argue that they have the correct relationship with their predecessors. If is in an unattached set, then we use as a proxy for when we do the query. The following lemma argues that it is always correct to use this proxy.
Lemma A.8.
At any point during the execution, for an unattached set , if , then for all and , we have . In addition, there is no incoming create_fut or get_fut edge on any node in the path from to (not including ).
Proof.
First, recall that must be a prefix-complete set (Lemma A.7), and the only edges leaving in can be from its sink node. Therefore, if any node in precedes , then they all must.
The fact that some node in must precede can be seen by induction. If has only one immediate predecessor , then ’s attached predecessor is set as ’s attached predecessor (lines 4 and 4). If has two immediate predecessors ( is a join node), then is either in an attached set (lines 4 and 4) or it is in the same set with both its predecessors. ∎
We can also see this from our examples. The example of set is particularly interesting. Note that its attached predecessor is not — this is because was not an attached set when executed — it only became an attached set later. However, note that all nodes in are in fact before all nodes in and there is no intervening incident non-SP edge. Similar observations can be made for set whose attached predecessor is instead of since node 26 was not in when executed (as seen in Figure 9 (right)). However, it is still correct for ’s attached predecessor to be since every node in precedes every node in and there is no non-SP edge on any path from a node in to a node in .
We can now prove that has the correct relationships between attached sets.
Lemma A.4 0.
Consider nodes and , where and are distinct attached sets. Then iff .
Proof.
We first argue that if , then . We can induct on order in which edges are added in . Let and . If is the first node added to via , then all the incoming edges to are from — therefore, from Lemma A.8, all paths into are correct. If is directly created (lines 4, 4, and 4) then we explicitly only add the edges into in which are correct in . Also, at this point, either has no outgoing edges or the correct outgoing edges are explicitly added.
When we execute a sync, we may union unattached sets into attached sets, and we must ensure that the property still holds. (1) When both subdags of an SP dag are unattached, they are both unioned into the set containing the source and the join is also unioned into the same set (lines 4–4). In this case, the only things that can precede these unattached sets also precede the source. (2) When both subdags are attached, no unions happen. (3) When one subdag is attached, the join node is added to the set that contains one of the predecessor of (line 4) — therefore, it must be the case that a node that precedes the predecessor of the join node also precedes . In addition, no node succeeds yet, so the other direction is trivial. The source node union (line 4) is the interesting case. Here is unioned into a suffix-complete attached set . Therefore, has only one incoming edge and by definition, it is from . Therefore, anything that precedes nodes in must also precede and anything that succeeds nodes in must also succeed .
Now we argue that if , then either and are in the same attached set or the set containing precedes the set containing in . First observation is that is always a connected dag — this is easy to see since whenever we add a node in , we also add an arc to it.
Second, if and are in the same SP-dag, and they are not in the same attached set, then there is a path from to in . We can again see this by induction on composition of SP dags. In the base case, and are in different attached sets, we immediately add an edge from to . After this, assuming there is always an edge from the source to sink in the smaller sp dag, we always either merge the entire sp-dag into the same set (lines 4–4), or add edges from source to both sub dag sources (lines 4–4) and from subdag sinks to the sink (lines 4–4) or we merge the source into one of the subdag sources (line 4) and sink to one of the subdag sinks (line 4).
If and are in different SP dags, we induct on the path from to . Let be the last node in and be the node immediately after in the path from to . Both and are in attached sets and there is a path from to in (from the previous paragraph) and an edge from to in (since is the creator node and is the first node of a future and we explicitly add this edge on line 4). Therefore, there is a path from to . We can then induct on this path and keep moving forward until we get to the dag containing . ∎
Finally, we must make claims about attached successors of unattached sets. We first show a structural property of unattached sets.
Lemma A.9.
For a node , let be the closest completed series-parallel dag which is a parallel composition and whose sink node has already executed. If belongs to an unattached set with no attached successor, then all nodes of belong to this same unattached set.
Proof.
When a sink node of a parallel composition following an unattached set executes, the unattached set either gets an attached successor line 4, or all the nodes of the parallel composition are unioned into the same set lines 4–4. ∎
The following lemma claims that if a node has an attached successor, then there is a path from to the last node of the attached successor.
Lemma A.10.
Consider a node where is unattached. If has an attached successor , then where is the current sink of this attached successor . In addition, consider any node . If then . Finally, we have for any node added to after it becomes ’s attached successor.
Proof.
Attached successor is set on line 4 where the attached successor always contains the sync node following the . Since an unattached set is a complete series-parallel dag with no incident create_fut or get_fut edges (Lemma A.2), any path from must go through this node from the property of series-parallel dags. At this point is the current sink of . Any nodes subsequently added to must have the property that from the construction of attached sets(Lemma A.3). Therefore, the property remains true by induction. ∎
The important subtlety here is that not all nodes of ’s attached successor have to be after — consider the set in our example with attached successor . Node 28 does not follow , but becomes ’s attached successor only after node 33 is added to . (Notice that is not ’s attached successor in Figure 9 (right) since 33 has not yet executed.)
Now we can argue about the correctness of the query where both and may be parts of unattached sets. Say has an attached successor and has an attached predecessor and there is no path from to that contains only series-parallel edges (otherwise, the first part of the query will give the correct answer), but there is a path from to containing create_fut and get_fut edges. This path must go through the last node of (Lemma A.10. In addition, since there are no incident non-SP edge between nodes in and (Lemma A.8), this path must also go through . Therefore, it is sufficient to check the relationship between and to check the relationship between and . In a similar manner, we can also show that we get the correct answer when only one of and are in unattached sets.
This final lemma handles the case where does not have an attached successor. This is where we utilize the condition that the program is executing in depth-first eager order and that is the currently executing node. The intuition for this lemma is as follows: Since is in an unattached set, no node in this unattached set has any outgoing non-SP edges. In addition, the nearest join after this set hasn’t executed since it doesn’t have an attached successor. Therefore, all no node where can be executing right now since all such nodes are in the same bag as . 888We also use this fact when we are querying the data structure since the correctness of Lemma A.1 also depends on eager execution.
Lemma A.11.
Consider a node where is unattached and does not have an attached successor. If is an bag, is in parallel with the currently executing node .
Proof.
Say was a strand of function and its operating function is ( and could be the same). Therefore, is in ’s bag and is not active, but has returned. Moreover, has not synced with its parent function. In addition ’s last strand is in the same unattached set as (Lemma A.9). Since no node of this unattached set has outgoing non-SP edges, no node that is sequentially after has any incident non-SP edges. Therefore, from Lemma A.1, there can not be any get_fut edges on any path after at this point. Therefore is in parallel with . ∎
We can now prove the main theorem by combining Lemmas A.1, A.4, A.8, A.10, and A.11.
Theorem 5.2 0.
If the program is executed in a depth first eager execution order, returns true iff in .
If the path from to has no get_fut edge, then Lemmas A.1 applies. In addition, Lemma A.4 argues that the second part of the query (lines 3–3) answers all questions correctly between two nodes in attached sets. Finally, Lemmas A.8, A.10, and A.11 show that using attached predecessors and successors for nodes in unattached sets gives the correct answer when the first part of the query (using ) returns false.
Appendix B Artifact Appendix
B.1. Abstract
This artifact contains source code for the compiler, runtime system, and benchmarks used in the PPoPP 2019 paper Efficient Race Detection with Futures, plus shell scripts that compile everything and run the benchmarks. The hardware requirements are any modern multicore CPU, while the software requirements include a relatively recent Linux distribution (tested on Ubuntu 16.04), the datamash package, and the GNU gold linker. To validate the results, run the test scripts and compare the results to figures 6, 7, and 8 in the paper.
B.2. Artifact check-list (meta-information)
- •
**Program: ** C/C++ code.
- •
**Compilation: ** Modified fork of clang++ with -O3 -flto flags. To fully reproduce the reproduce the results, we recommend installing the GNU gold linker as ld.
- •
**Data set: ** The dedup benchmark uses publicly available data sets. Scripts in the repository will download and setup all data sets.
- •
**Run-time environment: ** Tested on Ubuntu 16.04, but expected to work on any modern Linux.
- •
**Hardware: ** Any modern multicore CPU; tested on an Intel®Xeon® CPU E5-2665 with hyperthreading disabled. Enabling hyperthreading may change results.
- •
**Metrics: ** Runtime (in seconds).
- •
**Output: ** Runtime and standard deviation for all benchmarks, each run with 12 configurations which determine what kind of futures and which race detection algorithm are used and what level of instrumentation/race detection is turned on — baseline, reachability only, reachability + memory instrumentation, and full race detection.
- •
**How much disk space required (approximately)?: ** 13GB.
- •
**How much time is needed to prepare workflow (approximately)?: ** 1.5 hours.
- •
How much time is needed to complete experiments (approximately)?: 4 hours.
- •
**Publicly available?: ** Yes
- •
**Code/data licenses (if publicly available)?: ** MIT.
B.3. Description
B.3.1. How delivered
The project is available on Gitlab at https://gitlab.com/wustl-pctg-pub/futurerd2.git.
B.3.2. Hardware dependencies
Any modern multicore CPU. It was tested on an Intel®Xeon®CPU E5-2665.
B.3.3. Software dependencies
The project was tested on Ubuntu 16.04, but it is expected to run correctly in other Linux distributions. To fully reproduce the results, link-time optimization should be used (-flto) with the GNU gold linker installed as ld. On our system we make /usr/bin/ld a shell script that forwards its arguments to gold whenever the USE_GOLD environment variable is set and the original ld otherwise.
The benchmark script requires GNU datamash, which can be installed using apt-get in Ubuntu 14+ or can be obtained from https://www.gnu.org/software/datamash. Bash 4+ should be used to run the scripts.
B.3.4. Data sets
All required datasets are downloaded by scripts included in the distribution.
B.4. Installation
The setup.sh script in the project repository will build our modified compiler, the modified Cilk Plus runtime, and all the benchmarks.
B.5. Experiment workflow
- •
Clone the source code to your machine:
1$ git clone
2> https://gitlab.com/wustl-pctg-pub/futurerd2.git
3$ cd futurerd2
- •
Install GNU gold as your linker. Modern versions of the GNU binutils package include gold, though for our purposes the system ld should point to gold. Installing gold also installs a header called plugin-api.h, usually in either /usr/include or /usr/local/include. Find this file and replace the BINUTILS_PLUGIN_DIR variable in build-llvm-linux.sh with this path.
- •
Install other software dependencies. In Ubuntu 14+, this is as simple as
1$ sudo apt-get install datamash zlib1g zlib1g-dev openssl
and making sure you have Bash 4+.
- •
Build the necessary components. The setup.sh script will build the compiler and download and unpack the necessary data sets.
- •
Run the benchmark script (bench/run.sh). The script compiles the runtime library and race detection library, and compiles and runs each configuration of each benchmark. Tuning parameters can be found in bench/time.sh (which the run.sh script uses) — feel free to examine the script and change parameters, such as the number of iterations for each benchmark.
1$ cd bench
2$ ./run.sh
- •
Observe the results. Once completed, full results can be found in the files times.ss.csv (benchmarks used MultiBags race detection algorithm with structured futures), times.ns.csv (benchmarks used MultiBags+ algorithm with structured futures), and times.nn.csv (benchmarks used MultiBags+ algorithm with general futures).
B.6. Evaluation and expected result
Although absolute times will differ on your machine, you should see similar relative overhead for the benchmarks. Compare the results to figures 6, 7, and 8 in the paper.
B.7. Notes
Please send feedback or file issues at our gitlab repository (https://gitlab.com/wustl-pctg-pub/futurerd2).
B.8. Methodology
Submission, reviewing and badging methodology:
- •
http://cTuning.org/ae/submission-20180713.html
- •
http://cTuning.org/ae/reviewing-20180713.html
- •
https://www.acm.org/publications/policies/artifact-review-badging
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1(1)
- 2Agrawal et al . (2018) Kunal Agrawal, Joseph Devietti, Jeremy T. Fineman, I-Ting Angelina Lee, Robert Utterback, and Changming Xu. 2018. Race Detection and Reachability in Nearly Series-parallel DA Gs. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’18) . Society for Industrial and Applied Mathematics, New Orleans, Louisiana, 156–171. http://dl.acm.org/citation.cfm?id=3174304.3175277
- 3Arora et al . (1998) Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. 1998. Thread Scheduling for Multiprogrammed Multiprocessors. In 10th Annual ACM Symposium on Parallel Algorithms and Architectures . 119–129.
- 4Arvind et al . (1986) Arvind, R.S. Nikhil, and K.K. Pingali. 1986. I-structures: Data Structures for Parallel Computing. In Proceedings of the Graph Reduction Workshop .
- 5Baker and Hewitt (1977) Henry C. Baker, Jr. and Carl Hewitt. 1977. The incremental garbage collection of processes. SIGPLAN Notices 12, 8 (1977), 55–59.
- 6Barik et al . (2009) Rajkishore Barik, Zoran Budimlić, Vincent Cavè, Sanjay Chatterjee, Yi Guo, David Peixotto, Raghavan Raman, Jun Shirako, Sağnak Taşırlar, Yonghong Yan, Yisheng Zhao, and Vivek Sarkar. 2009. The Habanero Multicore Software Research Project. In Proceedings of the 24th ACM SIGPLAN Conference Companion on Object Oriented Programming Systems Languages and Applications (OOPSLA ’09) . ACM, Orlando, Florida, USA, 735–736.
- 7Bender et al . (2004) Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Charles E. Leiserson. 2004. On-the-Fly Maintenance of Series-Parallel Relationships in Fork-Join Multithreaded Programs. In 16th Annual ACM Symposium on Parallel Algorithms and Architectures . 133–144.
- 8Bienia et al . (2008) Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In PACT . ACM, 72–81.
