Parallel Finger Search Structures
Seth Gilbert, Wei Quan Lim

TL;DR
This paper introduces two parallel finger search structures that are work-optimal and highly parallelizable, supporting efficient searches, insertions, and deletions with theoretical guarantees and practical implementation considerations.
Contribution
First implementation of a parallel finger structure that is work-optimal with respect to the finger bound and offers good parallelism within a factor of O((log p)^2).
Findings
Supports searches, insertions, deletions efficiently.
Achieves work-optimality with respect to the finger bound.
Provides bounds on parallel execution time and discusses practical implementation.
Abstract
In this paper we present two versions of a parallel finger structure FS on p processors that supports searches, insertions and deletions, and has a finger at each end. This is to our knowledge the first implementation of a parallel search structure that is work-optimal with respect to the finger bound and yet has very good parallelism (within a factor of O( (log p)^2 ) of optimal). We utilize an extended implicit batching framework that transparently facilitates the use of FS by any parallel program P that is modelled by a dynamically generated DAG D where each node is either a unit-time instruction or a call to FS. The total work done by either version of FS is bounded by the finger bound F[L] (for some linearization L of D ), i.e. each operation on an item with finger distance r takes O( log r + 1 ) amortized work; it is cheaper for items closer to a finger. Running P using the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\setlistdepth
14
Parallel Finger Search Structures
[TABLE]
Keywords
Parallel data structures, multithreading, dictionaries, comparison-based search, distribution-sensitive algorithms
Abstract
In this paper 111This is the full version of a paper published in the 33rd International Symposium on Distributed Computing (DISC 2019). It is posted here for your personal or classroom use. Not for redistribution. © 2019 Copyright is held by the owner/author(s). we present two versions of a parallel finger structure on processors that supports searches, insertions and deletions, and has a finger at each end. This is to our knowledge the first implementation of a parallel search structure that is work-optimal with respect to the finger bound and yet has very good parallelism (within a factor of O\mathopen{}\mathclose{{}\left((\log p)^{2}}\right) of optimal). We utilize an extended implicit batching framework that transparently facilitates the use of by any parallel program that is modelled by a dynamically generated DAG where each node is either a unit-time instruction or a call to .
The total work done by either version of is bounded by the finger bound (for some linearization of ), i.e. each operation on an item with distance from a finger takes amortized work. Running using the simpler version takes O\mathopen{}\mathclose{{}\left(\frac{T_{1}+F_{L}}{p}+T_{\infty}+d\cdot\mathopen{}\mathclose{{}\left((\log p)^{2}+\log n}\right)}\right) time on a greedy scheduler, where are the size and span of respectively, and is the maximum number of items in , and is the maximum number of calls to along any path in . Using the faster version, this is reduced to O\mathopen{}\mathclose{{}\left(\frac{T_{1}+F_{L}}{p}+T_{\infty}+d\cdot(\log p)^{2}+s_{L}}\right) time, where is the weighted span of where each call to is weighted by its cost according to . We also sketch how to extend to support a fixed number of movable fingers.
The data structures in our paper fit into the dynamic multithreading paradigm, and their performance bounds are directly composable with other data structures given in the same paradigm. Also, the results can be translated to practical implementations using work-stealing schedulers.
Acknowledgements
We would like to express our gratitude to our families and friends for their wholehearted support, to the kind reviewers who provided helpful feedback, and to all others who have given us valuable comments and advice. This research was supported in part by Singapore MOE AcRF Tier 1 grant T1 251RES1719.
1 Introduction
There has been much research on designing parallel programs and parallel data structures. The dynamic multithreading paradigm (see [14] chap. 27) is one common parallel programming model, in which algorithmic parallelism is expressed through parallel programming primitives such as fork/join (also spawn/sync), parallel loops and synchronized methods, but the program cannot stipulate any mapping from subcomputations to processors. This is the case with many parallel languages and libraries, such as Cilk dialects [20, 25], Intel TBB [34], Microsoft Task Parallel Library [37] and subsets of OpenMP [31].
Recently, Agrawal et al. [3] introduced the exciting modular design approach of implicit batching, in which the programmer writes a multithreaded parallel program that uses a black box data structure, treating calls to the data structure as basic operations, and also provides a data structure that supports batched operations. Given these, the runtime system automatically combines these two components together, buffering data structure operations generated by the program, and executing them in batches on the data structure.
This idea was extended in [4] to data structures that do not process only one batch at a time (to improve parallelism). In this extended implicit batching framework, the runtime system not only holds the data structure operations in a parallel buffer, to form the next input batch, but also notifies the data structure on receiving the first operation in each batch. Independently, the data structure can at any point flush the parallel buffer to get the next batch.
This framework nicely supports pipelined batched data structures, since the data structure can decide when it is ready to get the next input batch from the parallel buffer, which may be even before it has finished processing the previous batch. Furthermore, this framework makes it easy for us to build composable parallel algorithms and data structures with composable performance bounds. This is demonstrated by both the parallel working-set map in [4] and the parallel finger structure in this paper.
Finger Structures
The map (or dictionary) data structure, which supports inserts, deletes and searches/updates, collectively referred to as accesses, comes in many different kinds. A common implementation of a map is a balanced binary search tree such as an AVL tree or a red-black tree, which (in the comparison model) takes worst-case cost per access for a tree with items. There are also maps such as splay trees [36] that have amortized rather than worst-case performance bounds.
A finger structure is a special kind of map that comes with a fixed finger at each end and a (fixed) number of movable fingers, each of which has a key (possibly or or between adjacent items in the map) that determines its position in the map, such that accessing items nearer the fingers is cheaper. For instance, the finger tree [27] was designed to have the finger property in the worst case; it takes steps per operation with finger distance (1), so its total cost satisfies the finger bound (2).
Definition 1 (Finger Distance).
Define the finger distance of accessing an item on a finger structure to be the number of items from to the nearest finger in (including ), and the finger distance of moving a finger to be the distance moved.
Definition 2 (Finger Bound).
Given any sequence of operations on a finger structure , let denote the finger bound for , defined by F_{L}=\sum_{i=1}^{N}\mathopen{}\mathclose{{}\left(\log r_{i}+1}\right) where is the finger distance of the -th operation in when is performed on .
Main Results
We present in this paper, to the best of our knowledge, the first parallel finger structure. In particular, we design two parallel maps that are work-optimal with respect to the Definition 2 (Finger Bound). (i.e. it takes work) for some linearization of the operations (that is consistent with the results), while having very good parallelism. (We assume that each key comparison takes steps.)
These parallel finger structures can be used by any parallel program , whose actual execution is captured by a program DAG , where each node is an instruction that finishes in time or a call to the finger structure , called an -call, that blocks until the result is returned, and each edge represents a dependency due to the parallel programming primitives.
The first design, called , is a simpler data structure that processes operations one batch at a time.
Theorem 3 ( Performance).
If uses (as ), then its running time on processes using any greedy scheduler (i.e. at each step, as many tasks are executed as are available, up to ) is
[TABLE]
for some linearization of -calls in , where is the number of nodes in , and is the number of nodes on the longest path in , and is the maximum number of -calls on any path in , and is the maximum size of . 222To cater to instructions that may not finish in time (e.g. due to memory contention), it suffices to define and to be the (weighted) work and span (5) respectively of the program DAG where each -call is assumed to take time.
Notice that if is an ideal concurrent finger structure (i.e. one that takes O\mathopen{}\mathclose{{}\left(F_{L}}\right) work), then running using on processors according to the linearization takes worst-case time where . Thus gives an essentially optimal time bound except for the ‘span term’ d\cdot\mathopen{}\mathclose{{}\left((\log p)^{2}+\log n}\right), which adds O\mathopen{}\mathclose{{}\left((\log p)^{2}+\log n}\right) time per -call along some path in .
The second design, called , uses a complex internal pipeline to reduce the ‘span term’.
Theorem 4 ( Performance).
If uses , then its running time on processes using any greedy scheduler is
[TABLE]
for some linearization of -calls in , where is the maximum number of -calls on any path in , and is the weighted span of where each -call is weighted by its cost according to , except that each finger-move operation is weighted by . Specifically, each access -call that is an access with finger distance according to is given the weight , and each -call that is a finger-move is given the weight , and is the maximum weight of any path in . Thus, ignoring finger-move operations, gives an essentially optimal time bound up to an extra O\mathopen{}\mathclose{{}\left((\log p)^{2}}\right) time per -call along some path in .
We shall first focus on basic finger structures with just one fixed finger at each end, since we can implement the general finger structure with movable fingers by essentially concatenating basic finger structures, as we shall explain later in Section 6. We will also discuss later in Section 7 how to adapt our results for work-stealing schedulers that can actually be provided by a real runtime system.
Challenges & Key Ideas
The sequential finger structure in [22] (essentially a B-tree with carefully staggered rebalancing) takes worst-case time per access with finger distance , but seems impossible to parallelize efficiently. It turns out that relaxing this bound to amortized time admits a simple sequential finger structure (Section 3) that can be parallelized. In , the items are stored in order in a list of segments , where each segment is a balanced binary search tree with size at most but at least unless , where . This ensures that has height O\mathopen{}\mathclose{{}\left(2^{k}}\right), and that the least items are in the first segments and the greatest items are in the last segments. Thus for each operation with finger distance , it takes time to search through the segments from both ends simultaneously to find the correct segment and perform the operation in it. After that, we rebalance the segments to preserve the size invariant, in such a way that each imbalanced segment will have new size . This double-exponential segment sizes and the reset-to-middle rebalancing is critical in ensuring that all the rebalancing takes amortized time per operation, even if each rebalancing cascade may take up to time.
The challenge is to parallelize while preserving the total work. Naturally, we want to process operations in batches, and use a batch-parallel search structure in place of each binary search tree. This may seem superficially similar to the parallel working-set map in [4], but the techniques in the earlier paper cannot be applied in the same way, for three main reasons.
Firstly, searches and deletions for items not in the map must still be cheap if they have small finger distance, so we have to eliminate these operation in a separate preliminary phase by an unsorted search of the smaller segments, before sorting and executing the other operations.
Secondly, insertions and deletions must be cheap if they have small finger distance (e.g. deleting an item from the first segment must have cost), so we cannot enforce a tight segment size invariant, otherwise rebalancing would be too costly.
This is unlike the parallel working-set map, where we not only have a budget of for each insertion or deletion or failed search, but also must shift accessed items sufficiently near to the front to achieve the desired span bound. The rebalancing in the parallel finger structures in this paper is hence completely different from that in the parallel working-set map.
Thirdly, for the faster version where the larger segments are pipelined, in order to keep all segments sufficiently balanced, the pipelined segments must never be too underfull, so we must carefully restrict when a batch is allowed to be processed at a segment. Due to this, we cannot even guarantee that a batch of operations will proceed at a consistent pace through the pipeline, but we can use an accounting argument to bound the ‘excess delay’ by the number of -calls divided by .
Other Related Work
There are many approaches for designing efficient parallel data structures, so as to make maximal use of parallelism in a multi-processor system, whether with empirical or theoretical efficiency.
For example, Ellen et al. [17] show how to design a non-blocking concurrent binary search tree, with later work analyzing the amortized complexity [16] and generalizing this technique [13]. Another notable concurrent search tree is the CBTree [2, 1], which is based on the splay tree. But despite experimental success, the theoretical access cost for these tree structures may increase with the number of concurrent operations due to contention near the root, and some of them do not even maintain balance (i.e., the height may get large).
Another method is software combining [19, 23, 32], where each process inserts a request into a shared queue and at any time one process is sequentially executing the outstanding requests. This generalizes to parallel combining [6], where outstanding requests are executed in batches on a suitable batch-parallel data structure (similar to implicit batching). These methods were shown to yield empirically efficient concurrent implementations of various common abstract data structures including stacks, queues and priority queues.
In the PRAM model, Paul et al. [33] devised a parallel 2-3 tree where synchronous processors can perform a sorted batch of operations on a parallel 2-3 tree of size in time. Blelloch et al. [10] show how to increase parallelism of tree operations via pipelining. Other similar data structures include parallel treaps [11] and a variety of work-optimal parallel ordered sets [8] supporting unions and intersections with optimal work, but these do not have optimal span. As it turns out, we can in fact have parallel ordered sets with optimal work and span [5, 28].
Nevertheless, the programmer cannot use this kind of parallel data structure as a black box with atomic operations in a high-level parallel program, but must instead carefully coordinate access to it. This difficulty can be eliminated by designing a suitable batch-parallel data structure and using implicit batching [3] or extended implicit batching as presented in [4] and more fully in this paper. Batch-parallel implementations have been designed for various data structures including weight-balanced B-trees [18], priority queues [6], working-set maps [4] and euler-tour trees [38].
2 Parallel Computation Model
In this section, we describe parallel programming primitives in our model, how a parallel program generates an execution DAG, and how we measure the cost of an execution DAG.
2.1 Parallel Primitives
The parallel finger structures and in this paper are described and explained as multithreaded data structures that can be used as composable building blocks in a larger parallel program. In this paper we shall focus on the abstract algorithms behind and , relying merely on the following parallel programming primitives (rather than model-specific implementation details, but see Appendix Section A.6 for those):
Threads: A thread can at any point terminate itself (i.e. finish running). Or it can fork a new thread, obtaining a pointer to that thread, or join to another thread (i.e. wait until that thread terminates). Or it can suspend itself (i.e. temporarily stop running), after which a thread with a pointer to it can resume it (i.e. make it continue running from where it left off). Each of these takes time. 2. 2.
Non-blocking locks: Attempts to acquire a non-blocking lock are serialized but do not block. Acquiring the lock succeeds if the lock is not currently held but fails otherwise, and releasing always succeeds. If threads concurrently access the lock, then each access finishes within time. 3. 3.
Dedicated lock: A dedicated lock is a blocking lock initialized with a constant number of keys, where concurrent threads must use different keys to acquire it, but releasing does not require a key. Each attempt to acquire the lock takes time, and the thread will acquire the lock after at most subsequent acquisitions of that lock. 4. 4.
Reactivation calls: A procedure with no input/output can be encapsulated by a reactivation wrapper, in which it can be run only via reactivations. If there are always at most concurrent reactivations of , then whenever a thread reactivates , if is not currently running then it will start running (in another thread forked in time), otherwise it will run within time after its current run finishes.
We also make use of basic batch operations, namely filtering, sorted partitioning, joining and merging (see Appendix Section A.2), which have easy implementations using arrays in the binary forking model in [9]. So and (using a work-stealing scheduler) can be implemented in the Arbitrary CRCW PRAM model with fetch-and-add, achieving the claimed performance bounds. Actually, and were also designed to function correctly with the same performance bounds in a much stricter computation model called the QRMW parallel pointer machine model (see Appendix Section A.1 for details).
2.2 Execution DAG
The program DAG captures the high-level execution of , but the actual complete execution of (including interaction between data structure calls) is captured by the execution DAG (which may be schedule-dependent), in which each node is a basic instruction and the directed edges represent the computation dependencies (such as constrained by forking/joining of threads and acquiring/releasing of blocking locks). At any point during the execution of , a node in the program/execution DAG is said to be ready if its parent nodes have been executed. At any point in the execution, an active thread is simply a ready node in , while a terminated/suspended thread is an executed node in that has no child nodes.
The execution DAG consists of program nodes (specifically -nodes) and ds (data-structure) nodes, which are dynamically generated as follows. At the start has a single program node, corresponding to the start of the program . Each node could be a normal instruction (i.e. basic arithmetic/memory operation) or a parallel primitive (see Section 2.1). Each program node could also be a data structure call.
When a (ready) node is executed, it may generate child nodes or terminate. A normal instruction generates one child node and no extra edges. A join generates a child node with an extra edge to it from the terminate node of the joined thread. A resume generates an extra child node (the resumed thread) with an edge to it from the suspend node of the originally suspended thread. Accesses to locks and reactivation calls would each expand to a subDAG comprised of normal instructions and possibly fork/suspend/resume.
The program nodes correspond to nodes in the program DAG , and except for data structure calls they generate only program nodes. A call to a data structure is called an -call. If is an ordinary (non-batched) data structure, then an -call generates an -node (and every -node is a ds node), which thereafter generates only -nodes except for calls to other data structures (external to ) or returning the result of some operation (generating a program node with an edge to it from the original -call).
However, if is an (implicitly) batched data structure, then all -calls are automatically passed to the parallel buffer for (see Appendix Section A.4). So an -call generates a buffer node corresponding to passing the call to the parallel buffer, as if the parallel buffer for is itself another data structure and not part of . Buffer nodes generate only buffer nodes until it notifies of the buffered -calls or passes the input batch to , which generates an -node. In short, -nodes exclude all nodes generated as part of the buffer subcomputations (i.e. buffering the -calls, and notifying , and flushing the buffer).
2.3 Data Structure Costs
We shall now define work and span of any (terminating) subcomputation of a multithreaded program, i.e. any subset of the nodes in its execution DAG. This allows us to capture the intrinsic costs incurred by a data structure, separate from the costs of a parallel program using it.
Definition 5 (Subcomputation Work/Span/Cost).
Take any execution of a parallel program (on processors), and take any subset of nodes in its execution DAG . The work taken by is the total weight of where each node is weighted by the time taken to execute it. The span taken by is the maximum weight of nodes in on any (directed) path in . The cost of is .
Definition 6 (Data Structure Work/Span/Cost).
Take any parallel program using a data structure . The work/span/cost of (as used by ) is the work/span/cost of the -nodes in the execution DAG for .
Note that the cost of the entire execution DAG is in fact an upper bound on the actual time taken to run it on a greedy scheduler, which on each step assigns as many unassigned ready nodes (i.e. nodes that have been generated but have not been assigned) as possible to available processors (i.e. processors that are not executing any nodes) to be executed.
Moreover, the subcomputation cost is subadditive across subcomputations. Thus our results are composable with other algorithms and data structures in this model, since we actually show the following for some linearization (where are as defined in Section 1 Main Results, and is the total number of calls to the parallel finger structure).
Theorem 7 ( Work/Span Bounds).
- ✧
(12 and 14) takes O\mathopen{}\mathclose{{}\left(F_{L}}\right) work and O\mathopen{}\mathclose{{}\left(\frac{N}{p}+d\cdot\mathopen{}\mathclose{{}\left((\log p)^{2}+\log n}\right)}\right) span.
- ✧
(16 and 21) takes O\mathopen{}\mathclose{{}\left(F_{L}}\right) work and O\mathopen{}\mathclose{{}\left(\frac{N}{p}+d\cdot(\log p)^{2}+s_{L}}\right) span.
Note that the bounds for the work/span of and are independent of the scheduler. In addition, using any greedy scheduler, the parallel buffer for either finger structure has cost O\mathopen{}\mathclose{{}\left(\frac{T_{1}+F_{L}}{p}+d\cdot\log p}\right) (Appendix 24). Therefore our main results (3 and 4) follow from these composable bounds (7).
In general, if a program uses a fixed number of implicitly batched data structures, then running it using a greedy scheduler takes O\mathopen{}\mathclose{{}\left(\frac{T_{1}+w^{*}}{p}+T_{\infty}+s^{*}+d^{*}\cdot\log p}\right) time, where is the total work of all the data structures, and is the total span of all the data structures, and is the maximum number of data structure calls on any path in the program DAG.
3 Amortized Sequential Finger Structure
In this section we explain a sequential finger structure with a fixed finger at each end, which (unlike finger structures based on balanced binary trees) is amenable to parallelization and pipelining due to its doubly-exponential segmented structure (which was partially inspired by Iacono’s working-set structure [24]).
ABCDEgjpqy
keeps the items in order in two halves, the front half stored in a chain of segments , and the back half stored in reverse order in a chain of segments . Let for each . Each segment has a target size , and a target capacity defined to be if but if . Each segment stores its items in order in a 2-3 tree. We say that a segment is balanced iff its size is within of its target capacity, and** overfull** iff it has more than items above target capacity, and underfull iff it has more than items below target capacity. At any time we associate every item to a unique segment that it fits in; fits in if is the minimum such that , and that fits in if is the minimum such that , and that fits in if . We shall maintain the invariant that every segment is balanced after each operation is finished.
For each operation on an item , we find the segment that fits in, by checking the range of items in and for each from [math] to and stopping once is found, and then perform the desired operation on the 2-3 tree in . This takes O(k+\log(t(k)+c(k)))\subseteq O\mathopen{}\mathclose{{}\left(2^{k}}\right) steps, and where is the finger distance of the operation.
After that, if becomes imbalanced, we rebalance it by shifting (appropriate) items to or from (after creating empty segment if it does not exist) to make have target size or as close as possible (via a suitable split then join of the 2-3 trees), and then is removed if it is the last segment and is now empty. After the rebalancing, will not only be balanced but also have size within its target capacity. But now may become imbalanced, so the rebalancing may cascade.
Finally, if one chain is longer than the other chain , it must be that , so we rebalance the chains as follows: If is below target size, shift items from to to fill it up to target size. If is (still) below target size, remove the now empty , otherwise add a new empty segment .
Rebalancing may cascade throughout the whole chain and take steps. But we shall show below that the rebalancing costs can be amortized away completely, and hence each operation with finger distance takes amortized steps, giving us the finger bound for . We will later use the same technique in analyzing and as well.
Lemma 8 ( Rebalancing Cost).
All the rebalancing takes amortized steps per operation.
Proof.
We shall maintain the invariant that each segment with items beyond (i.e. above or below) its target capacity has at least stored credits. Each operation is given credit, and we use it to pay for any needed extra stored credits at the segment where we perform the operation. Whenever a segment is rebalanced, it must have had items beyond its target capacity for some , and so had at least stored credits. Also, the rebalancing itself takes O(\log(t(k)+q)+\log(t(k+1)+c(k+1)+q))\subseteq O(\log q)\subseteq O\mathopen{}\mathclose{{}\left(q\cdot 2^{-k}}\right) steps, after which needs at most extra stored credits. Thus the stored credits at can be used to pay for both the rebalancing and any extra stored credits needed by . Whenever the chains are rebalanced, it can be paid for by the last segment rebalancing (which created or removed a segment), and no extra stored credits are needed. Therefore the total rebalancing cost amounts to per operation.
4 Simpler Parallel Finger Structure
We now present our simpler parallel finger structure . The idea is to use the amortized sequential finger structure (Section 3) and execute operations in batches. We group each pair of segments and into one section , and we say that an item fits in the sections iff fits in some segment in .
The items in each segment are stored in a batch-parallel map (Appendix Section A.3), which supports:
- ✧
Unsorted batch search: Search for an unsorted batch of items within work and span, tagging each search with the result, where is the map size.
- ✧
Sorted batch access: Perform an item-sorted batch of operations on distinct items within O\mathopen{}\mathclose{{}\left(b\cdot\log n}\right) work and span, tagging each operation with the result, where is the map size.
- ✧
Split: Split a map of size around a given pivot rank (into lower+upper parts) within work/span.
- ✧
Join: Join maps of total size separated by a pivot (i.e. lower+upper parts) within work/span.
For each section , we can perform a batch of operations on it within work and span if we have the batch sorted. Excluding sorting, the total work would satisfy the finger bound for the same reason as in . However, we cannot afford to sort the input batch right at the start, because if the batch had searches of distinct items all with finger distance , then it would take work and exceed our finger bound budget of .
We can solve this by splitting the sections into two slabs, where the first slab comprises the first sections, and passing the batch through a preliminary phase in which we merely perform an unsorted search of the relevant items in the first slab, and eliminate operations on items that fit in the first slab but are neither found nor to be inserted.
This preliminary phase takes work per operation and span at each section . We then sort the uneliminated operations and execute them on the appropriate slab. For this, ordinary sorting still takes too much work as there can be many operations on the same item, but it turns out that the finger bound budget is enough to pay for entropy-sorting (Appendix 31), which takes O\mathopen{}\mathclose{{}\left(\log\frac{b}{q}+1}\right) work for each item that occurs times in the batch. Rebalancing the segments and chains is a little tricky, but if done correctly it takes amortized work per operation. Therefore we achieve work-optimality while being able to process each batch within O\mathopen{}\mathclose{{}\left((\log b)^{2}+\log n}\right) span. The details are below.
4.1 Description of
ABCDEgjpqy
-calls are put into the parallel buffer (Section 2) for . Whenever the previous batch is done, flushes the parallel buffer to obtain the next batch . Let be the size of , and we can assume . Based on , the sections in are conceptually divided into two slabs, the first slab comprising sections and the final slab comprising sections , where m=\mathopen{}\mathclose{{}\left\lceil\log\log(2b)}\right\rceil+1 (where is the binary logarithm). The items in each segment are stored in a batch-parallel map (Appendix Section A.3).
processes the input batch in four phases:
Preliminary phase: For each first slab section in order (i.e. from [math] to ) do as follows:
- (a)
Perform an unsorted search in each segment in for all the items relevant to the remaining batch (of direct pointers into ), and tag the operations in the original batch with the results. 2. (b)
Remove all operations on items that fit in from the remaining batch . 3. (c)
Skip the rest of the first slab if becomes empty. 2. 2.
Separation phase: Partition based on the tags into three parts and handle each part separately as follows:
- (a)
Ineffectual operations (on items that fit in the first slab but are neither found nor to be inserted): Return the results. 2. (b)
Effectual operations (on items found in or to be inserted into the first slab): Entropy-sort (Appendix 31) them in order of access type (search, update, insertion, deletion) with deletions last, followed by item, combining operations of the same access type on the same item into one group-operation that is treated as a single operation whose effect is the last operation in that group. Each group-operation is stored in a leaf-based binary tree with height (but not necessarily balanced), and the combining is done during the entropy-sorting itself. 3. (c)
Residual operations (on items that do not fit in the first slab): Sort them while combining operations in the same manner as for effectual operations. 333This does not require entropy-sorting, but combining merge-sort essentially achieves the entropy bound anyway. 3. 3.
Execution phase: Execute the effectual operations as a batch on the first slab, and then execute the residual operations as a batch on the final slab, namely for each slab doing the following at each section in order (small to big):
- (a)
Let be the partition of the batch of operations into the access types (deletions last), each sorted by item. 2. (b)
For each segment in , and for each from to , cut out the operations that fit in from , and perform those operations (as a sorted batch) on , and then return their results. 3. (c)
Skip the rest of the slab if the batch becomes empty. 4. 4.
Rebalancing phase: Rebalance all the segments and chains by doing the following:
- (a)
Segment rebalancing: For each chain , for each segment in in order (small to big):
- i.
If and is overfull, shift items from to to make have target size. 2. ii.
If and is underfull and either has at least items or is the last segment in , let be the first underfull segment in , and fill using as follows: for each from down to , shift items from to to make have total size or as close as possible, and then remove if it is emptied. 3. iii.
If is (still) overfull and is the last segment in , create a new (empty) segment . 4. iv.
Skip the rest of the current slab if is (now) balanced and the execution phase had skipped . 2. (b)
Chain rebalancing: After that, if one chain is longer than the other chain , repeat the following until the chains are the same length:
- i.
Let the current chains be and . Create new (empty) segments , and shift all items from to , and then fill the underfull segments in using (as in item 4aii). 2. ii.
If is (now) empty again, remove .
4.2 Analysis of
First we establish that the rebalancing phase works, by proving the following two lemmas.
Lemma 9 ( Segment Rebalancing Invariant).
During the segment rebalancing (item 4a), just after the iteration for segment , for any imbalanced segment in , either or are all underfull.
Proof.
The invariant clearly holds for . Consider each iteration for segment during the segment rebalancing where . If was overfull, then by the invariant it was the only imbalanced segment in , and would be rebalanced in item 4ai, preserving the invariant. If was underfull and had at least items or was the last segment in , then in item 4aii would be filled using , which had at least items unless it was the last segment in , and hence after that every segment in (that is not removed) would be balanced, preserving the invariant. If item 4ai and item 4aii do not apply, then is balanced or is underfull, so the invariant is preserved. Finally, if is balanced at the end of that iteration, and had been skipped by the execution phase, then by the invariant all segments in are balanced, and all segments skipped by the rebalancing phase are also balanced, so the invariant is preserved.
Lemma 10 ( Chain Rebalancing Iterations).
The chain rebalancing (item 4b) takes at most two iterations, after which both chains and will have equal length and all their segments will be balanced.
Proof.
By 9, all segments in each chain will be balanced after the segment rebalancing (item 4a). After that, if one chain is longer than the other chain , the first chain rebalancing iteration transfers all items in to the other chain (item 4bi), leaving empty. If remains non-empty, then both chains have length and we are done. Otherwise, would be removed, and then the second chain rebalancing iteration transfers all items in to the other chain, which is at least items, so every segment in would be filled to target size, and hence both chains would have length .
Next we bound the work done by .
Definition 11 (Inward Order).
Take any sequence of map operations and let be the set of items accessed by operations in . Define the inward distance of an operation in on an item to be . We say that is in inward order iff its operations are in order of (non-strict) increasing inward distance. Naturally, we say that is in outward order iff its reverse is in inward order.
Theorem 12 ( Work).
takes work for some linearization of -calls in .
Proof.
Let be a linearization of -calls in such that:
- ✧
Operations on in earlier input batches are before those in later input batches.
- ✧
The operations within each batch are ordered as follows:
Ineffectual operations are before effectual/residual operations. 2. 2.
Effectual/residual operations are in order of access type (deletions last). 3. 3.
Effectual insertions are in inward order, and effectual deletions are in outward order. 4. 4.
Operations in each group-operation are consecutive and in the same order as in that group.
Let be the same as except that in item 3 effectual deletions are ordered so that those on items in earlier sections are later (instead of outward order). Now consider each input batch of operations on .
In the preliminary and execution phases, each section takes O\mathopen{}\mathclose{{}\left(2^{a}}\right) work per operation. Thus each operation in with finger distance according to on an item that was found to fit in section takes O\mathopen{}\mathclose{{}\left(\sum_{a=0}^{k}2^{a}}\right)=O\mathopen{}\mathclose{{}\left(2^{k}}\right)\subseteq O(\log r+1) work, because if is in the first slab (since earlier effectual operations in did not delete items in ), and if is in the final slab (since ). Therefore these phases take work in total.
Let be the effectual operations in as a subsequence of . Entropy-sorting takes work (Appendix 32), where is the entropy of (i.e. where is the number of occurrences of the -th operation in ). Partition into parts: searches/updates and insertions and deletions . And let be the entropy of . Then where is the number of operations in the same part of as the -th operation in , and \sum_{i=1}^{b}\log\frac{b}{b_{i}}\leq b\cdot\log\mathopen{}\mathclose{{}\left(\frac{1}{b}\sum_{i=1}^{b}\frac{b}{b_{i}}}\right)=b\cdot\log 3 by Jensen’s inequality. Thus entropy-sorting takes O\mathopen{}\mathclose{{}\left(\sum_{j=1}^{3}H_{j}+b}\right) work. Let be the cost of according to . Since each operation in has inward distance (with respect to ) at most its finger distance according to , we have (Appendix 28), and hence entropy-sorting takes work in total.
Sorting the residual operations in the batch (that do not fit in the first slab) takes work per operation with finger distance according to , since .
Therefore the separation phase takes work in total. Finally, the rebalancing phase takes amortized work per operation, as we shall prove in the next lemma. Thus takes total work.
Lemma 13 ( Rebalancing Work).
The rebalancing phase of takes amortized work per operation.
Proof.
We shall maintain the credit invariant that each segment with items beyond its target capacity has at least stored credits. The execution phase clearly increases the total stored credits needed by at most per operation, which we can pay for. We now show that the invariant can be preserved after the segment rebalancing and the chain rebalancing.
During the segment rebalancing (item 4a), each shift is performed between some neighbouring segments and , where has items and has items just before the shift, and . The shift clearly takes work. If then this is obviously just work. But if , then will also be rebalanced in item 4ai of the next segment balancing iteration, since at most items will be shifted from to in item 4aii, and hence will still have at least items. In that case, the second term in the work bound for this shift can be bounded by the first term of the work bound for the subsequent shift from to , since . Therefore in any case we can treat this shift as taking only O(\log t(k)+\log|q|)\subseteq O(\log|q|)\subseteq O\mathopen{}\mathclose{{}\left(|q|\cdot 2^{-k}}\right) work.
Now consider the two kinds of segment rebalancing:
- ✧
Overflow: item 4ai shifts items from overfull to . Suppose that has items just before the shift. After the shift, has target size and needs no stored credits, and would need at most extra stored credits. Thus the credits stored at can pay for both the shift and the needed extra stored credits.
- ✧
Fill: item 4aii fills some underfull segments using . Suppose that has items just before the fill, for each . After the fill, every segment in has size within target capacity and needs no stored credits, and needs at most \mathopen{}\mathclose{{}\left(\sum_{j=k^{\prime}}^{k}u_{i}(j)}\right)\cdot 2^{-(k+1)}\leq\frac{1}{2}\sum_{j=k^{\prime}}^{k}\mathopen{}\mathclose{{}\left(u_{i}(j)\cdot 2^{-j}}\right) extra stored credits, which can be paid for by using half the credits stored at each segment in . The other half of the credits stored at suffices to pay for the shift from to , for each .
The chain rebalancing (item 4b) is performed only when segment rebalancing creates or removes a segment and makes one chain longer than the other. Consider the biggest segment that was created or removed. If was created, it must be due to overflowing to in item 4ai, and hence the shift from to already took \Theta\mathopen{}\mathclose{{}\left(2^{k}}\right) work. If was removed, it must be due to filling some segments using in item 4aii, but must have had at least items before the execution phase, and at least half of them were either deleted or shifted to , and hence either the deletions can pay \Theta\mathopen{}\mathclose{{}\left(2^{k}}\right) credits, or the shift to already took \Theta\mathopen{}\mathclose{{}\left(2^{k}}\right) work. Therefore in any case we can afford to ignore up to \Theta\mathopen{}\mathclose{{}\left(2^{k}}\right) work done by chain rebalancing.
Now observe that the chain rebalancing performs at most two transfers (item 4bi) of items from the last segment of the longer chain to the shorter chain , by the Lemma 10 ( Chain Rebalancing Iterations). (10). Each transfer takes work to create the new segments and work to shift over to , and then fills underfull segments in using . The fill takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) work for the shift from to , and takes O\mathopen{}\mathclose{{}\left(2^{j}}\right) work for each shift from to for each , since has at most items just before the shift. Therefore each transfer takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) work in total, and hence we can ignore all the work done by the chain rebalancing.
And now we turn to bounding the span of .
Theorem 14 ( Span).
takes O\mathopen{}\mathclose{{}\left(\frac{N}{p}+d\cdot\mathopen{}\mathclose{{}\left((\log p)^{2}+\log n}\right)}\right) span, where is the number of operations on , and is the maximum size of , and is the maximum number of -calls on any path in the program DAG .
Proof.
Let denote the maximum span of processing an input batch of size (that has been flushed from the parallel buffer). Take any input batch of size . We shall bound the span taken by in each phase.
The preliminary phase takes O\mathopen{}\mathclose{{}\left(\log b\cdot 2^{k}}\right) span in each first slab segment , adding up to O\mathopen{}\mathclose{{}\left((\log b)^{2}}\right) span. The separation phase also takes O\mathopen{}\mathclose{{}\left((\log b)^{2}}\right) span, by Theorem 32 ( Costs). (32). The execution phase takes O\mathopen{}\mathclose{{}\left(\log b+2^{k}}\right) span in each segment , adding up to O\mathopen{}\mathclose{{}\left(\log b\cdot\log\log b+\log n}\right) span. Returning the results for each group-operation takes span.
The rebalancing phase also takes O\mathopen{}\mathclose{{}\left(\log b+2^{k}}\right) span for each segment processed in item 4a, because each shift between segments with total size takes span, and filling using in item 4aii takes O\mathopen{}\mathclose{{}\left(\log\mathopen{}\mathclose{{}\left(b+\sum_{a=k^{\prime}}^{k}t(a)}\right)}\right)\subseteq O\mathopen{}\mathclose{{}\left(\log b+2^{k}}\right) span for the first shift from to and then O\mathopen{}\mathclose{{}\left(\log\sum_{a=k^{\prime}}^{j}t(a)}\right)\subseteq O\mathopen{}\mathclose{{}\left(2^{j}}\right) span for each subsequent shift from to . Similarly, the chain rebalancing in item 4b takes span, because it performs at most two iterations by Lemma 10 ( Chain Rebalancing Iterations). (10), each of which takes span to fill the underfull segments of the shorter chain using its last segment.
Therefore s(b)\in O\mathopen{}\mathclose{{}\left((\log b)^{2}+\log n}\right)\subseteq O\mathopen{}\mathclose{{}\left(\frac{b}{p}+(\log p)^{2}+\log n}\right), since (\log b)^{2}\in O\mathopen{}\mathclose{{}\left(\frac{b}{p}}\right) if .
Each batch of size waits in the buffer for the preceding batch of size to be processed, taking O\mathopen{}\mathclose{{}\left(s(b^{\prime})}\right) span, and then itself is processed, taking span, taking span in total. Since over all batches each of will sum up to at most the total number of -calls, and there are at most -calls on any path in the program DAG , the span of is O\mathopen{}\mathclose{{}\left(\frac{N}{p}+d\cdot\mathopen{}\mathclose{{}\left((\log p)^{2}+\log n}\right)}\right).
5 Faster Parallel Finger Structure
Although has optimal work and a small span, it is possible to reduce the span even further, intuitively by pipelining the batches in some fashion so that an expensive access in a batch does not hold up the next batch.
As with , we need to split the sections into two slabs, but this time we fix the first slab at sections where so that we can pipeline just the final slab. We need to allow big enough batches so that operations that are delayed because earlier batches are full can count their delay against the total work divided by . But to keep the span of the sorting phase down to O\mathopen{}\mathclose{{}\left((\log p)^{2}}\right), we need to restrict the batch size. It turns out that restricting to batches of size at most works.
We cannot pipeline the first slab (particularly the rebalancing), but the preliminary phase and separation phase would only take O\mathopen{}\mathclose{{}\left((\log p)^{2}}\right) span. The execution phase and rebalancing phases are still carried out as before on the first slab, taking O\mathopen{}\mathclose{{}\left((\log p)^{2}}\right) span, but execution and rebalancing on the final slab are pipelined, by having each final slab section process the batch passed to it and rebalance the preceding segments and if necessary.
To guarantee that this local rebalancing is possible, we do not allow to proceed if it is imbalanced or if there are more than pending operations in the buffer to . In such a situation, must stop and reactivate , which would clear its buffer and rebalance before restarting . It may be that also cannot proceed for the same reason and is stopped in the same manner, and so may be delayed by such a stop for a long time. But by a suitable accounting argument we can bound the total delay due to all such stops by the total work divided by . Similarly, we do not allow the first slab to run (on a new batch) if is imbalanced or there are more than pending operations in the buffer to .
Finally, we use an odd-even locking scheme to ensure that the segments in the final slab do not interfere with each other yet can proceed at a consistent pace. The details are below.
5.1 Description of
ABCDEgjpqy
We shall now give the details (see Figure 3). We will need the bunch structure (Appendix 23) for aggregating batches, which is an unsorted set supporting both addition of a batch of new elements within work/span and conversion to a batch within work and span if it has size .
has the same sections as in , with the first slab comprising the first m=\mathopen{}\mathclose{{}\left\lceil\log\log\mathopen{}\mathclose{{}\left(5p^{2}}\right)}\right\rceil sections, and the final slab comprising the other sections. uses a feed buffer, which is a queue of bunches of operations each of size exactly except the last (which can be empty). Whenever is notified of input (by the parallel buffer), it reactivates the first slab.
Each section in the final slab has a buffer before it (for pending operations from ), which for each access type uses an optimal batch-parallel map (Appendix Section A.3) to store bunches of group-operations of that type, where operations on the same item are in the same bunch. When a batch of group-operations on an item is inserted into the buffer, it is simply added to the correct bunch. Whenever we count operations in the buffer, we shall count them individually even if they are on the same item. The first slab and each final slab section also has a deferred flag, which indicates whether its run is deferred until the next section has run. Between every pair of consecutive sections starting from after is a neighbour-lock, which is a dedicated lock (see Section 2.1) with key for each arrow to it in Figure 3.
Whenever the first slab is reactivated, it runs as follows:
If the parallel buffer and feed buffer are both empty, terminate. 2. 2.
Acquire the neighbour-lock between and . (Skip steps 2 to 4 and steps 8 to 10 if does not exist.) 3. 3.
If has any imbalanced segment or has more than operations in its buffer, set the first slab’s deferred flag and release the neighbour-lock, and then reactivate and terminate. 4. 4.
Release the neighbour-lock. 5. 5.
Let be the size of the last bunch in the feed buffer. Flush the parallel buffer (if it is non-empty) and cut the input batch of size into small batches of size except possibly the first and last, where the first has size \min\mathopen{}\mathclose{{}\left(b,p^{2}-q}\right). Add that first small batch to , and append the rest as bunches to the feed buffer. 6. 6.
Remove the first bunch from the feed buffer and convert it into a batch , which we call a cut batch. 7. 7.
Process using the same four phases as in (Figure 2), but restricted to the first slab (i.e. execute only the effectual operations on the first slab, and do segment rebalancing only on the first slab, and do chain rebalancing only if had not existed before this processing). Furthermore, do not update ’s segments’ sizes until after this processing (so that in item 4 will not find any of ’s segments imbalanced until the first slab rebalancing phase has finished). 8. 8.
Acquire the neighbour-lock between and . 9. 9.
Insert the residual group-operations (on items that do not fit in the first slab) into the buffer of , and then reactivate . 10. 10.
Release the neighbour-lock. 11. 11.
Reactivate itself.
Whenever a final slab section is reactivated, it runs as follows:
Acquire the neighbour-locks (between and its neighbours) in the order given by the arrow number in Figure 3. 2. 2.
If has any imbalanced segment or (exists and) has more than operations in its buffer, set ’s deferred flag and release the neighbour-locks, and then reactivate and terminate. 3. 3.
For each access type, flush and process the (sorted) batch of bunches of group-operations of that type in its buffer as follows:
- (a)
Convert each bunch in to a batch of group-operations. 2. (b)
For each segment in , cut out the group-operations on items that fit in from , and perform them (as a sorted batch) on , and then fork to return the results of the operations (according to the order within each group-operation). 3. (c)
If is non-empty (i.e. has leftover group-operations), insert into the buffer of and then reactivate . 4. 4.
Rebalance locally as follows (essentially like in ):
- (a)
For each segment in :
- i.
If is overfull, shift items from to to make have target size. 2. ii.
If is underfull, shift items from to to make have target size, and then remove if it is emptied. 3. iii.
If is (still) overfull and is the last segment in , create a new segment and reactivate it. 2. (b)
If is (still) the last section, but chain is longer than chain :
- i.
Create a new segment and shift all items from to . 2. ii.
If is (now) underfull, shift items from to to make have target size. 3. iii.
If is (now) empty again, remove . 5. 5.
If , and the first slab is deferred, clear its deferred flag then reactivate it. 6. 6.
If , and is deferred, clear its defered flag then reactivate it. 7. 7.
Release the neighbour-locks.
5.2 Analysis of
For each computation, we shall define its delay to intuitively capture the minimum time it needs, including all potential waiting on locks. Each blocked acquire of a dedicated lock corresponds to an acquire-stall node in the execution DAG whose child node is created by the release just before the successful acquisition of the lock. Let be the ancestor nodes of that have not yet executed at the point when is executed. Then the delay of a computation is recursively defined as the weighted span of , where each acquire-stall node in is weighted by the delay of (to capture the total waiting at ), and every other node is weighted by its cost. 444The delay of depends on the actual execution, due to the definition of for each acquire-stall node in . But it captures the minimum time needed to run in the following sense: For any computation , on any step that executes all ready nodes in the remaining computation (i.e. the unexecuted nodes in ), the delay of is reduced. (So if a greedy scheduler is used, the number of steps in which some processor is idle is bounded by the delay.)
Whenever the first slab or a final slab section runs, we say that it defers if it terminates with its deferred flag set (i.e. at step 2), otherwise we say that it proceeds (i.e. to step 3) and eventually finishes (i.e. reaches the last step) with its deferred flag cleared. We now establish some invariants, which guarantee that is always sufficiently balanced.
Lemma 15 ( Balance Invariants).
satisfies the following invariants:
When the first slab is not running, every segment in is balanced and has at most items. 2. 2.
When a final slab section rebalances a segment in (in item 4a), it will make that segment have size . 3. 3.
Just after the last section finishes without creating new sections, the segments in are balanced and both chains have the same length. 4. 4.
Each final slab section always has at most operations in its buffer. 5. 5.
Each final slab segment always has at most items, and at least items unless is the last section.
Proof.
Invariant 1 holds as follows: The first slab proceeds only if ’s segments are balanced, and from that point until after the rebalancing phase, its segments are modified only by itself (since will not modify ), and thereafter all its sections except remain unmodified until it processes the next cut batch. Thus the same proof as for Lemma 9 ( Segment Rebalancing Invariant). (9) shows that just before the segment rebalancing (item 4a) iteration for , for any imbalanced first slab segment , either or are underfull. But note that the cut batch had at most operations, and so after the execution phase, had at least items unless it was the last segment in its chain. Thus will be made balanced (by item 4ai or item 4aii in the iteration for , or by item 4b). Similarly, will have at most items in each segment, since .
Invariant 2 holds as follows. Each final slab section proceeds only if its segments each has at least items unless it is the last segment in its chain, and its buffer had at most operations by Invariant 4. Since , rebalancing a segment in (item 4a) will make it have size .
Invariant 3 holds as follows. The last section proceeds only if its segments each has at most items, and its buffer had at most operations by Invariant 4. Thus if any of its segments becomes overfull and it creates a new section , it will subsequently be deferred until runs. And during that run of , it will proceed and shift at most items from to , after which will not be overfull, and so will not create another new section . Therefore we can assume that the chains’ lengths never differ by more than one segment, and so the chain rebalancing (item 4b) will make the chains the same length while ensuring the segments in and are balanced.
Invariant 4 holds for , because the first slab proceeds only if ’s buffer has at most operations, and only processes a cut batch of size at most , hence after that ’s buffer will have at most operations. Invariant 4 holds for for each , because proceeds only if ’s buffer has at most operations, and only processes a buffered batch of size at most by Invariant 4 for , hence after that ’s buffer will have at most operations.
Invariant 5 holds as follows. Each final slab segment is modified only when either or runs, and the latter never makes imbalanced. Consider each run. It proceeds only if has at most items and at least items unless is the last section, and its buffer had at most operations by Invariant 4, and had at most items by Invariant 5 for . So at most items were inserted into , and at most items were shifted from to . Also, at most items were deleted from , and at most items were shifted from to . Thus after that run, has at most items and at least items unless was the last section, since .
With these invariants, we are ready to bound the work done by .
Theorem 16 ( Work).
takes work for some linearization of -calls in .
Proof.
We shall use a similar proof outline as for Theorem 12 ( Work). (12). Let be a linearization of -calls in such that:
- ✧
Operations on that finish during the first slab run or some final slab section run are ordered by when that run finished.
- ✧
Operations on that finish during the same first slab run are ordered as follows:
Ineffectual operations are before effectual operations. 2. 2.
Effectual operations are in order of access type (deletions last). 3. 3.
Effectual insertions are in inward order, and effectual deletions are in outward order (11).
- ✧
Operations on in each group-operation are in the same order as in that group.
As before, let be the same as except that in item 3 effectual deletions are ordered so that those on items in earlier sections are later (instead of outward order).
Consider each cut batch of operations processed by the first slab. By Lemma 15 ( Balance Invariants). (15), just before that processing, every segment in is balanced, and has at most items. Thus in both the preliminary phase and the execution phase, each section takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) work per operation. And this amounts to work per operation in with finger distance according to , because the operation reaches only if .
As with , the separation phase takes work in total (see 12’s proof).
Now consider each batch of operations processed by a final slab section . By Lemma 15 ( Balance Invariants). (15), has at most operations, and each segment in always has at most items. So inserting the operations in into the buffer took O\mathopen{}\mathclose{{}\left(2^{k}}\right) work per operation. Converting each bunch in to a group-operation takes work per operation. Cutting out and performing and returning the results of the group-operations that fit takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) work per group-operation. And the local rebalancing takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) work. Therefore each run that proceeds to process its buffered operations takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) work per operation. This again amounts to work per operation in with finger distance according to as follows:
- ✧
If finishes in : At that point the first slab has at least items in each chain, because was balanced just before processing the last cut batch. Thus and hence costs O\mathopen{}\mathclose{{}\left(2^{m}}\right)\subseteq O(\log r) work.
- ✧
If finishes in for some : At that point has at least items in each segment by Lemma 15 ( Balance Invariants). (15). Thus and hence costs O\mathopen{}\mathclose{{}\left(2^{k}}\right)\subseteq O(\log r) work.
Finally, all the rebalancing takes amortized work per operation, which we shall leave to the next lemma.
Lemma 17 ( Rebalancing Work).
All the rebalancing steps of take amortized work per operation.
Proof.
We shall maintain the credit invariant that each segment with items beyond its target capacity has at least stored credits. Also, each unfinished operation carries credit with it. As with (see 13’s proof), the invariant can be preserved after rebalancing in the first slab. By the same reasoning, the invariant can also be preserved after segment rebalancing in the final slab (item 4a), because any shift between segments and where is performed only when is imbalanced, and after that has size by Lemma 15 ( Balance Invariants). (15). Similarly, the invariant can be preserved after chain rebalancing in the final slab (item 4b), because it takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) work, which can be ignored since the last segment rebalancing shift already took O\mathopen{}\mathclose{{}\left(2^{k}}\right) work.
To tackle the span of , we need some lemmas concerning the span of cutting the input batch and the delay in each slab.
Lemma 18 ( Input Cutting Span).
The first slab cuts an input batch of size (i.e. cutting it into small batches and storing them in the feed buffer) within O\mathopen{}\mathclose{{}\left(\frac{b}{p}+\log p}\right) span.
Proof.
Cutting the input batch into small batches takes span. Adding them to the feed buffer takes span. This amounts to O\mathopen{}\mathclose{{}\left(\frac{b}{p}+\log p}\right) span because \log b\in O\mathopen{}\mathclose{{}\left(\frac{b}{p}}\right) if .
Lemma 19 ( Final Slab Delay).
Each section in the final slab runs within O\mathopen{}\mathclose{{}\left(2^{k}}\right) delay (whether it defers or finishes).
Proof.
Consider any final slab section that has acquired the second neighbour-lock. Checking whether it has an imbalanced segment and checking ’s buffer size takes only delay. By Lemma 15 ( Balance Invariants). (15), has at most operations in its buffer, and always has at most items in each segment, and has at most items in each segment. Thus converting each bunch in the buffer takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) span, and performing the operations that fit in takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) span, and rebalancing the segments in takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) span.
Now consider any final slab section that has acquired the first neighbour-lock. It waits O\mathopen{}\mathclose{{}\left(2^{k}}\right) delay for the current holder (if any) of the second neighbour-lock to release it, and then itself takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) more delay to complete its run.
Finally consider any final slab section that starts running. If , it waits O\mathopen{}\mathclose{{}\left(2^{k}}\right) delay for the first slab to release the shared neighbour-lock, since the first slab takes only O\mathopen{}\mathclose{{}\left(2^{m}}\right) span on each access to . If , it waits O\mathopen{}\mathclose{{}\left(2^{k}}\right) delay for the current holder of the first neighbour-lock to release it, and then itself takes O\mathopen{}\mathclose{{}\left(2^{k}}\right) more delay to complete its run.
Lemma 20 ( First Slab Delay).
The first slab takes delay for each acquiring of the neighbour-lock, and it processes each cut batch within O\mathopen{}\mathclose{{}\left((\log p)^{2}}\right) delay.
Proof.
Each acquiring of the neighbour-lock takes O\mathopen{}\mathclose{{}\left(2^{m}}\right)=O(\log p) delay by Lemma 19 ( Final Slab Delay). (19). Checking whether has an imbalanced segment and checking ’s buffer size takes only delay. Obtaining the cut batch (whose size is at most ) from the first bunch from the feed buffer takes delay. The four phases take O\mathopen{}\mathclose{{}\left((\log p)^{2}}\right) delay in total, as in (see 14). Inserting the residual group-operations into the buffer of takes O\mathopen{}\mathclose{{}\left(\log p+2^{m}}\right)=O(\log p) delay, since ’s buffer had at most items by Lemma 15 ( Balance Invariants). (15).
With these lemmas, we can finally bound the span of .
Theorem 21 ( Span).
takes O\mathopen{}\mathclose{{}\left(\frac{N}{p}+d\cdot(\log p)^{2}+s_{L}}\right) span for some linearization of . ( is the maximum number of -calls on any path in , and is the weighted span of with -calls weighted according to .)
Proof.
Take any path through the program DAG . Let be the linearization in the proof of Theorem 16 ( Work). (16). Consider any -call along with finger distance according to . We shall trace the journey of from the parallel buffer in an input batch to a cut batch and then through the slabs, and bound the delay taken by relative to , meaning that in the computation of the delay we only count -nodes. Along the way, we shall partition that delay into the normal delay and the deferment delay, where the latter comprises all waiting at the first slab or a section that defers from the point it sets the deferred flag until it is reactivated and clears the deferred flag (and proceeds).
Normal delay
At the start, waits in the parallel buffer for the first slab to finish running on the previous input batch of size , taking O\mathopen{}\mathclose{{}\left(\frac{b^{\prime}}{p}+(\log p)^{2}}\right) delay by Lemma 18 ( Input Cutting Span). (18) and Lemma 20 ( First Slab Delay). (20). Next waits for the first slab to process some cut batches of size in the feed buffer, each taking O\mathopen{}\mathclose{{}\left((\log p)^{2}}\right)\subseteq O(p) normal delay. Then is flushed from the parallel buffer in some input batch of size , which is cut within O\mathopen{}\mathclose{{}\left(\frac{b}{p}+(\log p)^{2}}\right) normal delay, and next waits for another cut batches of size that come before in the feed buffer, each taking normal delay. (Note that we are ignoring all waiting while the first slab is deferred.)
If finishes in the final slab, it waits a further O\mathopen{}\mathclose{{}\left(2^{k}}\right) normal delay at each final slab section that it passes through by Lemma 19 ( Final Slab Delay). (19). And when finishes in a section , at that point has at least items in each segment by Lemma 15 ( Balance Invariants). (15). Thus and hence takes O\mathopen{}\mathclose{{}\left(\sum_{a=m}^{k}2^{a}}\right)=O\mathopen{}\mathclose{{}\left(2^{k}}\right)\subseteq O(\log r) normal delay in the final slab. Finally when is returned, it is in some group-operation with operations, so returning the results takes O(\log g)\subseteq O\mathopen{}\mathclose{{}\left(\frac{g}{p}+\log p}\right) span.
Therefore in total takes O\mathopen{}\mathclose{{}\left(\frac{b^{\prime}}{p}+\frac{b}{p}+\frac{g}{p}+i\cdot p+j\cdot p+(\log p)^{2}+\log r}\right) normal delay.
Deferment delay
To bound the deferment delay, we shall use a similar credit invariant as in Lemma 17 ( Rebalancing Work). (17), but instead of paying for rebalancing work we shall use the credits to pay for times the deferment delay. This would imply that the deferment delay is at most O\mathopen{}\mathclose{{}\left(\frac{1}{p}}\right) per operation on . The invariant is that for , each segment with items beyond its target capacity has at least stored credits, and that each operation in ’s buffer carries credits with it.
Consider each deferment of a section for (where deferment of the first slab is treated as deferment of ). At that point either one of its segment is imbalanced or ’s buffer has more than items, and reactivates , which may either defer or proceed. In any case, from that point until proceeds, will never proceed (even if reactivated), because its segments and ’s buffer remain untouched. But once proceeds, it will empty its buffer and make ’s segments balanced by Lemma 15 ( Balance Invariants). (15), and then reactivate on finishing, so will proceed within O\mathopen{}\mathclose{{}\left(2^{k}}\right) subsequent delay.
Thus if is waiting at due to consecutive sections being deferred, and proceeding, the deferment at lasts O\mathopen{}\mathclose{{}\left(\sum_{a=k}^{j+1}2^{a}}\right)=O\mathopen{}\mathclose{{}\left(2^{j}}\right) delay (by 15 again), and p\cdot 2^{j}\leq{\!\!\sqrt[\raisebox{0.70004pt}{\scalebox{0.7}{\ ,,}}]{c(m)}\,}\cdot 2^{j}\in O\mathopen{}\mathclose{{}\left(c(j)\cdot 2^{-j}}\right) since 2^{2j}\in O\mathopen{}\mathclose{{}\left({\!\!\sqrt[\raisebox{0.70004pt}{\scalebox{0.7}{\ ,,}}]{c(j)}\,}}\right). If had an imbalanced segment, it would have at least stored credits, and we can use half of it to pay for any needed extra stored credits at due to the shift. If ’s buffer had more than items, then they carry credits, and we can use half to pay for any needed extra stored credits at and for any credits carried by operations that go on to . In both cases, we can use the other half of those credits to pay for times the deferment delay that takes at .
Total delay
There are at most -calls along , and over all , each of above will sum up to at most the total number of -calls, and the total deferment delay of all -calls along is O\mathopen{}\mathclose{{}\left(\frac{N}{p}}\right). Therefore the span of is O\mathopen{}\mathclose{{}\left(\frac{N}{p}+d\cdot(\log p)^{2}+s_{L}}\right).
6 General Parallel Finger Structures
To support an arbitrary but fixed number of movable fingers (besides the fingers at the ends), while retaining both work-optimality with respect to the finger bound and good parallelism, we essentially use a basic parallel finger structure for each sector between adjacent fingers.
It is easier to do this with , because we are processing the operations in batches. The finger-move operations are all done first in a finger phase before the rest of the batch, and of course we combine finger-move operations on the same finger. Consider any finger that is between two sectors and . This finger is sandwiched between the nearest chain of and the nearest chain of . To move this finger into chain of past an item in segment , we move all the items between the old and new finger position from to , roughly as follows:
Cut out the items in from sector ’s segments and join them (from small to big) into a single batch . 2. 2.
Join the items in sector ’s segments (from small to big) and shift them into (by a single join). 3. 3.
Use to fill sector ’s sections to target size except perhaps . 4. 4.
Rebalance and as in ’s rebalancing phase (Figure 2).
This essentially contributes O\mathopen{}\mathclose{{}\left(2^{k}}\right) work and span, because we can preserve the same credit invariant to bound the rebalancing work and span. It is similar but messier for moving a finger so far that it goes over the nearer chain of and into its further chain.
After that, we can simply partition the map operations around the fingers and perform each part on the correct sector in parallel. This partitioning takes work and span for each batch of operations (see Appendix Section A.2), and O(\log b)\subseteq O\mathopen{}\mathclose{{}\left(\frac{b}{p}+\log p}\right), and each sector takes span. Thus we will obtain the desired work/span bounds (7).
It is much harder for , and considerably complicated, so we shall not attempt to explain it here.
7 Work-Stealing Schedulers
The bounds on the work and span of and in Section 4 and Section 5 hold regardless of the scheduler. The performance bounds for and in Section 1 require a greedy scheduler, in order to bound the parallel buffer cost. In practice, we do not have such schedulers. But we can design a suitable work-stealing scheduler in the QRMW pointer machine model that yields the desired time bounds (3 and 4) on average, as we shall explain below.
We make the modest assumption that each processor (in the QRMW pointer machine) can generate a uniformly random integer in and convert it to a pointer given by a constant lookup-table within steps. For instance, this can be done if each processor has local RAM of size (i.e. sole access to its own local memory with cells and random access).
The blocking work-stealing scheduler in [12] is for an atomic message passing model, in which multiple concurrent accesses to each deque are arbitrarily queued and serviced one at a time. This can be supported by guarding each deque with a CLH lock [29], and the analysis carries over.
The non-blocking work-stealing scheduler in [7] assumes memory contention cost, which is contrary to the QRMW contention model. Nevertheless, the combinatorial techniques in that paper can be adapted to prove the desired performance bounds for our implementation (22).
Definition 22 (Non-Blocking Work-Stealing Scheduler).
The non-blocking work-stealing scheduler can be implemented in the QRMW pointer machine model as follows:
- ✧
Each processor has:
- ✧
A global deque of DAG nodes, shared between owner and stealer using Dekker’s algorithm.
- ✧
A global non-blocking lock (see Appendix 36).
- ✧
A local array where stores a pointer to and a pointer to . // Used implicitly wherever needed.
- ✧
Each processor does the following repeatedly:
-
Access as owner, removing the node at the bottom if it is non-empty.
-
If exists (i.e. was non-empty):
-
Execute .
-
Access as owner, inserting all the child nodes generated by at the bottom.
-
Otherwise:
-
Create Int uniformly randomly chosen from .
-
If TryLock():
-
Access as stealer, removing the node at the top if it is non-empty.
-
Unlock().
-
If exists (i.e. was non-empty):
-
Execute .
-
Access as owner, inserting all the child nodes generated by at the bottom.
8 Conclusions
This paper presents two parallel finger structures that are work-optimal with respect to the finger bound, and the faster version has a lower span by using careful pipelining. Pipelining techniques to reduce the span of data structure operations have been explored before [10, 4]. As indicated by our results, the extended implicit batching framework combines nicely with pipelining and is a promising approach in the design of parallel data structures.
Nevertheless, despite the common framework, the parallel finger structures in this paper and the parallel working-set map in [4] rely on different ad-hoc techniques and analysis, and it raises the obvious interesting question of whether there is a way to obtain a batch-parallel splay tree in the same framework, that satisfies both the working-set property and the finger property.
Appendix
Here we spell out the model details, building blocks and supporting theorems used in our paper.
A.1 QRMW Pointer Machine Model
QRMW stands for queued read-modify-write, as described in [15]. In this contention model, asynchronous processors perform memory accesses via read-modify-write (RMW) operations (including read, write, test-and-set, fetch-and-add, compare-and-swap), which are supported by almost all modern architectures. Also, to capture contention costs, multiple memory requests to the same memory cell are FIFO-queued and serviced one at a time, and the processor making each memory request is blocked until the request has been serviced.
In the parallel pointer machine, each processor has a fixed number of local registers and memory accesses are done only via pointers, which can be locally stored or tested for equality (but no pointer arithmetic). The QRMW pointer machine model, introduced in [4], extends the parallel pointer machine model in [21] to RMW operations. In this model, each memory node has a fixed number of memory cells, and each memory cell can hold a single field, which is either an integer or a pointer. Each processor also has a fixed number of local registers, each of which can hold a single field. The basic operations that a processor can perform include arithmetic operations on integers in its registers, equality-test between pointers in its registers, creating a new memory node and obtaining a pointer to it, and RMW operations. An RMW operation can be performed on any memory cell via a pointer to the memory node that it belongs to.
All operations except for RMW operations take one step each. RMW operations on each memory cell are FIFO-queued to be serviced, and the first RMW operation in the queue (if any) is serviced at each time step. The processor making each memory request is blocked until the request has been serviced.
A.2 Parallel Batch Operations
We rely on the following basic operations on batches:
- ✧
Split a given batch of items into left and right parts around a given position, within work/span.
- ✧
Partition a given batch of items into lower and upper parts around a given pivot, within work and span.
- ✧
Partition a sorted batch of items around a sorted batch of pivots, within work and span.
- ✧
Join a batch of batches with total items, within work and span.
- ✧
Merge two sorted batches with total items, optionally combining duplicates, within work and span if the combining procedure takes work/span.
These can be implemented in the QRMW pointer machine model [28] with each batch stored as a** BBT** (leaf-based height-balanced binary tree with an item at each leaf). They can also be implemented (more easily) in the binary forking model in [9] with each batch stored in an array. For instance, joining a batch of arrays can be done by using the standard prefix-sum technique to compute the total size of the first arrays, and hence we can copy each array in parallel into the final output array, and merging two sorted arrays can be done by the algorithm given in [26] (section 2.4) and [35].
A related data structure that we also rely on is the bunch data structure, which is defined as follows.
Definition 23 (Bunch Structure).
A bunch is an unsorted set supporting addition of any batch of new elements within work/span and conversion to a batch within work and span if it has size . A bunch can be implemented using a complete binary tree with batches at the leaves, with a linked list threaded through each level to support adding a new batch as a leaf in work/span. To convert a bunch to a batch, we treat the bunch as a batch of batches and parallel join all the batches.
A.3 Batch-Parallel Map
In this paper we rely on a parallel map that supports the following operations:
- ✧
Unsorted batch search: Search for an unsorted input batch of items (not necessarily distinct), tagging each search item with the result, all within work and span, where is the map size.
- ✧
Sorted batch access: Perform an item-sorted input batch of operations on distinct items, tagging each operation with the result, all within O\mathopen{}\mathclose{{}\left(b\cdot\log n}\right) work and span, where is the map size before the batch access.
- ✧
Split: Split a map of size around a given pivot rank into two maps , where contains the items with ranks at most in , and contains the items with ranks more than in , within work/span.
- ✧
Join: Join maps of total size where every item in is less than every item in , within work/span.
This can be achieved in the QRMW pointer machine model [28], and also (more easily) in the binary forking model [9].
A.4 Parallel Buffer
To facilitate extended implicit batching, we can use any parallel buffer implementation that takes work and span per batch of size (on processors), any operation that arrives is (regardless of the scheduler) within span included in the batch that is being flushed or in the next batch, and there are always at most ready buffer nodes (active threads of the buffer) where is the number of operations that are currently buffered or being flushed. This would entail the following parallel buffer overhead [4] (and we reproduce the proof here).
Theorem 24 (Parallel Buffer Cost).
Take any program using an implicitly batched data structure that is run using any greedy scheduler. Then the cost (6) of the parallel buffer for is O\mathopen{}\mathclose{{}\left(\frac{T_{1}+w}{p}+d\cdot\log p}\right), where is the work of all the -nodes, and is the work taken by , and is the maximum number of -calls on any path in the program DAG .
Proof.
Let and be the total work and span (6) respectively of the parallel buffer for . Let be the total number of operations on . Consider each batch of operations on . Let be span taken by the buffer on . If , then . If , then t_{B}\in O(\log b)\subseteq O\mathopen{}\mathclose{{}\left(\frac{b}{p}}\right). Thus t_{B}\in O\mathopen{}\mathclose{{}\left(\frac{b}{p}+\log p}\right) and hence t_{\infty}\in O\mathopen{}\mathclose{{}\left(\frac{N}{p}+d\cdot\log p}\right).
Now consider the actual execution of the execution DAG of the program using . At each time step, the buffer is processing at most two consecutive batches, so we shall analyze the buffer work done during the time interval for each pair of consecutive batches and , where has operations and has operations.
If , then the buffer work done on and is .
If , then there are at most ready buffer nodes in , so at least one of the following holds at each time step in this interval:
- ✧
At least ready -nodes in are being executed. These steps take at most work over all intervals.
- ✧
At least ready -nodes in are being executed. These steps take at most work over all intervals.
- ✧
At most ready nodes in are being executed. All ready buffer nodes in are being executed (by greedy scheduling), so over all intervals there are such steps, taking work.
Therefore , and hence the buffer’s cost is \frac{t_{1}}{p}+t_{\infty}\in O\mathopen{}\mathclose{{}\left(\frac{T_{1}+w}{p}+d\cdot\log p}\right) since .
The parallel buffer for each data structure can be implemented using a static BBT (leaf-based balanced binary tree), with a sub-buffer at each leaf node, one for each processor, and a flag at each internal node. Each sub-buffer stores its operations as the leaves of a complete binary tree with a linked list threaded through each level. Whenever a thread makes a call to , the processor running suspends it and inserts the call together with a callback (i.e. a structure with a pointer to and a field for the result) into the sub-buffer for that processor. Then the processor walks up the BBT from leaf to root, test-and-setting each flag along the way, terminating if it was already set. On reaching the root, the processor notifies (by reactivating it), which can decide when to flush the buffer. can also query whether the parallel buffer is non-empty, defined as whether the flag at the root is set. can eventually return the result of the call via the callback (i.e. by updating the result field and then resuming ).
Whenever the buffer is flushed (by ), all sub-buffers are swapped out by a parallel recursion on the BBT, replaced by new sub-buffers in a newly constructed static BBT. We then wait for all pending insertions into the old sub-buffers to be completed, before joining their contents into a single batch to be returned (to ). To do so, each processor has a flag initialized to , and a thread field initialized to . Whenever it inserts an -call , it sets , then inserts into the (current) sub-buffer, then resumes if . To wait for pending insertions into the old sub-buffer for processor , we store a pointer to the current thread in and then suspend it if .
Inserting into each sub-buffer can be done in time. Test-and-setting each flag in the BBT also takes time, because at most three processors ever access it. Each static BBT takes work and span to initialize. Each data structure call takes work and span for a processor to reach the root, because the flags ensure that only work is done per node in traversing the BBT. Joining the contents of the sub-buffers takes work and span if the resulting joined batch is of size . It is also easy to ensure that flushing uses at most threads where is the size of the flushed batch. Thus this parallel buffer implementation has the desired properties that support extended implicit batching.
It is worth noting that the parallel buffer can be implemented in the dynamic multithreading paradigm, like all other data structures and algorithms in this paper, but it requires the ability for a thread to have -time access to the sub-buffer for the processor running it, so that it can insert each data structure-call into the sub-buffer in work/span. This can be done if each processor has a local array of size (i.e. it is accessible only by that processor but supports random access) for each implicitly batched data structure, and each thread can retrieve the id of the processor running it. But in the QRMW pointer machine model this is not necessary if the program uses a fixed set of implicitly batched data structures, since each processor can be initialized with a (constant) pointer to a structure that always points to the current sub-buffer for that processor.
A.5 Sorting Theorems
The items in the search problem can come from any arbitrary set that is linearly ordered by a given comparison function, and we shall assume that has at least two items. As is standard, let be the set of all length- sequences from . Search structures can often be adapted to implement sorting algorithms 555A sorting algorithm is a procedure that given any input sequence will output a sequence of pointers to the input items in sorted order., in which case any lower bound on complexity of sorting typically implies a lower bound on the costs of the search structure. For the proofs of Theorem 12 ( Work). and Theorem 16 ( Work). we need a crucial lemma that the entropy bound is a lower bound for (comparison-based) sorting, as precisely stated below.
Lemma 25 (Sorting Entropy Bound).
For any sequence in with item frequencies (i.e. ), any sorting algorithm requires comparisons on average over all (distinct) rearrangements of , where H=\sum_{i=1}^{u}\mathopen{}\mathclose{{}\left(q_{i}\cdot\log\frac{n}{q_{i}}}\right) is the entropy of . [30]
From this we immediately get a relation (28) between the entropy bound and the maximum finger bound (i.e. the maximum finger bound over all permutations), because we can use a finger-tree to perform sorting.
Definition 26 (Finger-Tree Sort).
Let be the sequential algorithm that sorts an input sequence as follows:
- Create an empty finger-tree (with one finger at each end) that stores linked lists of items. For each item in , if already has a linked list of copies of , then append to that linked list, otherwise insert a linked list containing just into . At the end iterate through to produce the desired sorted sequence.
Definition 27 (In-order Item Frequencies).
A sequence in is said to have in-order item frequencies if the -th smallest item in occurs times in .
Theorem 28 (Maximum Finger Bound).
Take any sequence in with in-order item frequencies . Then the maximum finger bound for , defined as , satisfies where H=\sum_{i=1}^{u}\mathopen{}\mathclose{{}\left(q_{i}\cdot\log\frac{n}{q_{i}}}\right).
Proof.
By the Lemma 25 (Sorting Entropy Bound). (25) let be a rearrangement of such that takes comparisons. Clearly also takes O\mathopen{}\mathclose{{}\left(MF_{J}}\right)=O\mathopen{}\mathclose{{}\left(MF_{I}}\right) comparisons, and hence .
Finally we give a parallel sorting algorithm that achieves the entropy bound for work but yet takes only O\mathopen{}\mathclose{{}\left((\log n)^{2}}\right) span on a list of items, which we need in our parallel finger structure. For comparison, we also give the simpler parallel merge-sort . The input and output lists are each stored in a batch (leaf-based balanced binary tree), and these algorithms work in the QRMW pointer machine model.
We shall use the following notation for every binary tree : is its root, and for each node of , and are its child nodes, and is the height of the subtree at , and is the number of leaves of the subtree at .
Definition 29 (Parallel Merge-Sort).
Let be the procedure that does the following on an input batch of items:
- If , return . Otherwise, compute in parallel and , and then parallel merge (Section A.2) and into an item-sorted batch , and then return .
Theorem 30 ( Costs).
sorts every sequence in within work and O\mathopen{}\mathclose{{}\left((\log n)^{2}}\right) span.
Proof.
The claim follows directly from the work/span bounds for parallel merging (Section A.2) and .
Definition 31 (Parallel Entropy-Sort).
Define a bundle of an item to be a BT (binary tree) in which every leaf has a tagged copy of . Let be the parallel merge-sort variant that does the following on an input batch of items:
- If , return . Otherwise, compute in parallel and , and then parallel merge (Section A.2) and into an item-sorted batch of bundles, combining bundles of the same item into one by simply making them the child subtrees of a new bundle, and then return .
Then returns an item-sorted batch of bundles, with one bundle (of all the tagged copies) for each distinct item in , and clearly each bundle has height at most .
Theorem 32 ( Costs).
sorts every sequence in with item frequencies within work and O\mathopen{}\mathclose{{}\left((\log n)^{2}}\right) span, where H=\sum_{i=1}^{u}\mathopen{}\mathclose{{}\left(q_{i}\cdot\ln\frac{n}{q_{i}}}\right).
Proof.
Consider the merge-tree , in which each node is the result of parallel merging its child nodes. Note that , and that each item in occurs in at most one bundle in each node of . Clearly the work done is times the total length of all the parallel merged batches (Section A.2). Thus the work done can be divided per item; work done on item takes times the number of nodes of that contain a bundle of , and there are O\mathopen{}\mathclose{{}\left(k\cdot\log\frac{n}{k}+k}\right) such nodes where is the frequency of in , by 33 below. Therefore takes O\mathopen{}\mathclose{{}\left(\sum_{i=1}^{u}\mathopen{}\mathclose{{}\left(q_{i}\cdot\log\frac{n}{q_{i}}+q_{i}}\right)}\right)\subseteq O(H+n) work. The span bound on is immediate from the span bound on parallel merging (Section A.2).
Lemma 33 (BBT Subtree Size Bound).
Given any BBT with leaves of which are marked with , and with each internal node marked iff it is on a path from the root to a marked leaf, the number of marked nodes of is O\mathopen{}\mathclose{{}\left(k\cdot\log\frac{n}{k}+k}\right).
Proof.
We shall iteratively change the set of marked leaves of , and accordingly update the internal nodes so that each of them is marked iff it is on a path from the root to a marked leaf. At each step, if there is a marked node with a marked child and an unmarked child such that has two marked children, then unmark the rightmost marked leaf in the subtree at and mark the deepest leaf in the subtree at . This will not decrease the number of marked nodes, because unmarking results in unmarking at most internal nodes, and marking results in marking at least internal nodes, and since is a BBT.
Note that each step decreases the sum of the lengths of all the paths from the root to the marked nodes with two marked children, so this iterative procedure terminates after finitely many steps. After that, for every node with only one marked child, there is only one marked leaf in the subtree at . Let be the set of marked nodes with two marked children, and be the set of marked nodes not in but with a parent in . Then there are exactly nodes in , and exactly nodes in , and the subtrees at nodes in are disjoint, so . Since every marked node is either in or on the downward path of marked nodes from some node in , the number of marked nodes is at most (k-1)+\sum_{v\in B}(v.\text{height}+1)\in O\mathopen{}\mathclose{{}\left(k+\sum_{v\in B}\log v.\text{size}}\right)\subseteq O\mathopen{}\mathclose{{}\left(k+k\cdot\log\frac{n}{k}}\right) by Jensen’s inequality.
Remark 0.
See [28] (Subtree Size Bound) for a generalization of 33 with a different proof, but if we want a bound with explicit constants then the above proof yields a tighter bound for a BBT.
is all we need for the parallel finger search structures and , but we can in fact obtain a full parallel entropy-sorting algorithm, namely one that outputs a single item-sorted batch of all the (tagged copies of) items in the input sequence from and satisfies the entropy bound for work. Specifically, we can convert each bundle in to a batch (34), and then parallel join (Section A.2) all those batches to obtain the desired output.
Definition 34 (Bundle Balancing).
A bundle of size and height is balanced as follows:
- Recursively construct a linked list through all the leaves of , and mark the leaves of with (-based) rank of the form , and then extract those marked leaves as a batch (by parallel filtering as described in [28]). Then at each leaf in , construct and store at a batch of the items in with ranks to , obtained by traversing the linked list forward. Now is essentially a batch of size- batches (except perhaps the last smaller batch), which we then recursively join to obtain the batch of all items in (alternatively, but less efficiently, simply parallel join ).
Theorem 35 (Bundle Balancing Costs).
Balancing a bundle of size and height takes work and span.
Proof.
Note that has less internal nodes than leaves, and so constructing the linked list takes work and span. Extracting the batch of items of with ranks at intervals of takes O\mathopen{}\mathclose{{}\left(b+P.\text{size}\cdot h}\right)=O(b) work and span. Constructing the batches of items in-between those in takes work and span, and recursively joining them takes work and span per node of (except span for the first joining involving the last batch).
A.6 Locking Mechanisms
Here we give pseudo-code implementations of the various locking mechanisms used as primitives in this paper (Section 2.1), which have the claimed properties under the QRMW memory contention model.
The non-blocking lock is trivially implemented using test-and-set as shown in TryLock/Unlock below.
Definition 36 (Non-Blocking Lock).
-
TryLock( Bool ):
-
Return .
-
Unlock( Bool ):
-
Set .
Next is the reactivation wrapper for a procedure , which can be implemented using fetch-and-add and guarantees the following according to some linearization [28]:
Whenever is reactivated, there will be a complete run of that starts after that reactivation. 2. 2.
If is run only via reactivations, then no runs of overlap, and there are at most as many runs of as reactivations of . 3. 3.
If is reactivated by only threads at any time, then each reactivation call finishes within span, and some run of starts within span after the start of or the end of the last run of that overlaps .
Definition 37 (Reactivation Wrapper).
( is the procedure to be guarded by the wrapper.)
-
Private Procedure .
-
Private Int .
-
Public Reactivate():
-
If :
-
Fork the following:
-
Do:
-
Set .
-
.
-
While .
The dedicated lock with keys , where threads must use distinct keys to acquire it, can be implemented using fetch-and-add as shown below and guarantees the following according to some linearization [4]:
Mutual exclusion: Only one thread can hold the lock at any point in time; a thread becomes the lock holder when it successfully acquires the lock, and must release the lock before the next successful acquisition. 2. 2.
Fairness and bounded latency: When any thread attempts to acquire the dedicated lock, it will become a pending holder within span, and each pending holder will successfully acquire the lock after at most subsequent successful acquisition per key (if every lock holder eventually releases the lock). And whenever the lock is released, if there is at least one pending holder then within span the lock would be successfully acquired again.
Definition 38 (Dedicated Lock).
( is the number of keys.)
-
Private Int .
-
Private Int .
-
Private Array initialized with .
-
Public Acquire( Int ):
-
If :
-
Set .
-
Return.
-
Otherwise:
-
Write pointer to current thread into .
-
Suspend current thread.
-
Public Release():
-
If :
-
Create Int .
-
Create Pointer .
-
While :
-
Set .
-
If , then swap .
-
Set .
-
Resume .
It is worth mentioning that we can easily replace the array in the above implementation by a cyclic linked list, and use the linked list nodes instead of integers as the keys.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Yehuda Afek, Haim Kaplan, Boris Korenfeld, Adam Morrison, and Robert E Tarjan. The cb tree: a practical concurrent self-adjusting search tree. Distributed computing , 27(6):393–417, 2014.
- 2[2] Yehuda Afek, Haim Kaplan, Boris Korenfeld, Adam Morrison, and Robert Endre Tarjan. Cbtree: A practical concurrent self-adjusting search tree. In DISC , volume 7611 of Lecture Notes in Computer Science , pages 1–15. Springer, 2012.
- 3[3] Kunal Agrawal, Jeremy T Fineman, Kefu Lu, Brendan Sheridan, Jim Sukha, and Robert Utterback. Provably good scheduling for parallel programs that use data structures through implicit batching. In Proceedings of the 26th ACM symposium on Parallelism in algorithms and architectures , pages 84–95. ACM, 2014.
- 4[4] Kunal Agrawal, Seth Gilbert, and Wei Quan Lim. Parallel working-set search structures. In Proceedings of the 30th ACM symposium on Parallelism in algorithms and architectures , pages 321–332. ACM, 2018.
- 5[5] Yaroslav Akhremtsev and Peter Sanders. Fast parallel operations on search trees. In 2016 IEEE 23rd International Conference on High Performance Computing (Hi PC) , pages 291–300. IEEE, 2016.
- 6[6] Vitaly Aksenov, Petr Kuznetsov, and Anatoly Shalyto. Parallel Combining: Benefits of Explicit Synchronization. In Jiannong Cao, Faith Ellen, Luis Rodrigues, and Bernardo Ferreira, editors, 22nd International Conference on Principles of Distributed Systems (OPODIS 2018) , volume 125 of Leibniz International Proceedings in Informatics (LIP Ics) , pages 11:1–11:16, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
- 7[7] Nimar S Arora, Robert D Blumofe, and C Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. Theory of computing systems , 34(2):115–144, 2001.
- 8[8] Guy E Blelloch, Daniel Ferizovic, and Yihan Sun. Just join for parallel ordered sets. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures , pages 253–264. ACM, 2016.
