Parallel Sort-Based Matching for Data Distribution Management on Shared-Memory Multiprocessors

Moreno Marzolla; Gabriele D'Angelo

arXiv:1703.06680·cs.DC·August 8, 2018


TL;DR

This paper introduces a parallel algorithm for the Data Distribution Management problem in agent-based simulations, improving efficiency on shared-memory multicore systems by overcoming the sequential limitations of previous methods.

Contribution

It presents a parallel version of the efficient Sort-Based Matching algorithm, enabling scalable DDM processing on multicore architectures.

Findings

1. Achieves significant speedup on multicore systems

2. Demonstrates good scalability with increasing cores

3. Outperforms existing sequential algorithms

Abstract

In this paper we consider the problem of identifying intersections between two sets of d-dimensional axis-parallel rectangles. This is a common problem that arises in many agent-based simulation studies, and is of central importance in the context of High Level Architecture (HLA), where it is at the core of the Data Distribution Management (DDM) service. Several realizations of the DDM service have been proposed; however, many of them are either inefficient or inherently sequential. These are serious limitations since multicore processors are now ubiquitous, and DDM algorithms -- being CPU-intensive -- could benefit from additional computing power. We propose a parallel version of the Sort-Based Matching algorithm for shared-memory multiprocessors. Sort-Based Matching is one of the most efficient serial algorithms for the DDM problem, but is quite difficult to parallelize due to data…

Tables

Table 1: Hardware specifications of the machines used for the experimental evaluation.

	solaris	titan
CPU	Intel Xeon E5-2640	Intel Core i7-5820K
Clock frequency	2.00 GHz	3.30 GHz
Processors	2	1
Total cores	16	6
HyperThreading	Yes	Yes
RAM	128 GB	64 GB
L3 cache size	20480 KB	15360 KB



Full text


Parallel Sort-Based Matching for Data Distribution Management on Shared-Memory Multiprocessors

The publisher version of this paper is available at https://doi.org/10.1109/DISTRA.2017.8167660.

Please cite this paper as: “Moreno Marzolla, Gabriele D’Angelo. Parallel Sort-Based Matching for Data Distribution Management on Shared-Memory Multiprocessors. Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2017)”. Best Paper Award @DS-RT 2017.

Moreno Marzolla

Gabriele D’Angelo

Dept. of Computer Science and Engineering

University of Bologna, Italy

[email protected]

Dept. of Computer Science and Engineering

University of Bologna, Italy

[email protected]

Abstract

In this paper we consider the problem of identifying intersections between two sets of $d$ -dimensional axis-parallel rectangles. This is a common problem that arises in many agent-based simulation studies, and is of central importance in the context of High Level Architecture (HLA), where it is at the core of the Data Distribution Management (DDM) service. Several realizations of the DDM service have been proposed; however, many of them are either inefficient or inherently sequential. These are serious limitations since multicore processors are now ubiquitous, and DDM algorithms – being CPU-intensive – could benefit from additional computing power. We propose a parallel version of the Sort-Based Matching algorithm for shared-memory multiprocessors. Sort-Based Matching is one of the most efficient serial algorithms for the DDM problem, but is quite difficult to parallelize due to data dependencies. We describe the algorithm and compute its asymptotic running time; we complete the analysis by assessing its performance and scalability through extensive experiments on two commodity multicore systems based on a dual socket Intel Xeon processor, and a single socket Intel Core i7 processor.

keywords:

Data Distribution Management (DDM), Parallel And Distributed Simulation (PADS), High Level Architecture (HLA), Parallel Algorithms

CCS Concepts: Computing methodologies → Massively parallel and high-performance simulations; Computing methodologies → Shared memory algorithms

1 Introduction

Agent-based simulations involve a possibly large number of agents that interact in a virtual environment. Generally, the environment may represent a two- or three-dimensional space. For example, in a large-scale road traffic simulation, agents may represent vehicles moving in a two-dimensional, “flat” road network (the third dimension can be ignored since vehicles are concerned about obstacles on their plane of movement only). Molecular models or air traffic simulations, on the other hand, involve agents moving in a three-dimensional world.

Agents must be made aware of events happening in their area of interest, so that they can promptly react if necessary. For example, in the road traffic scenario above, each car should be made aware of the behavior of neighboring vehicles only, since distant vehicles cannot produce immediate observable effects. For simplicity, an agent’s area of interest is often represented as a $d$ -dimensional rectangle (region), centered at the agent coordinates, with the sides parallel to the axes of a $d$ -dimensional space (usually, $d=2$ or $d=3$ ). A simulation event that is generated by an agent $A$ should then be forwarded to all agents whose area of interest intersects that of $A$ .

Managing areas of interest in agent-based simulations is so common that the High Level Architecture (HLA) specification [1] defines Data Distribution Management (DDM) services to handle the problem. Specifically, DDM services are responsible for sending events generated on update regions to a set of subscription regions.

Identifying all pairs of intersecting rectangles is a well-known computational geometry problem with applications in such diverse areas as VLSI design and geographic information systems. Spatial data structures that can solve the region intersection problem have been developed: examples include the $k$ - $d$ tree [25] and R-tree [13]. However, it turns out that DDM implementations tend to rely on less efficient but simpler solutions. The reason is that spatial data structures can be difficult to implement and their manipulation incurs a significant overhead which is not evident from their asymptotic complexities.

The increasingly large size of agent-based simulations is posing a challenge to the existing implementations of the DDM service. As the number of regions increases, so does the execution time of the intersection-finding algorithms. A possible solution comes from the computer architectures domain. The current trend in microprocessor design is to put more execution units (cores) in the same processor; the result is that multi-core processors are now ubiquitous, so it makes sense to try to exploit the increased computational power to speed up the DDM service [11]. An obvious parallelization strategy for the intersection-finding problem is to distribute the rectangles across the processor cores, so that each core can work on a smaller problem. Interestingly, this approach fails on all but the most trivial (and inefficient) algorithms.

In this paper we present a parallel implementation of Sort-based Matching (SBM) for shared-memory processors. SBM [23] is an efficient solution to the $d$ -dimensional rectangle intersection problem for the special case $d=1$ . Since any algorithm that can solve the intersection problem in $d=1$ dimensions can be extended to $d>1$ dimensions, SBM is widely used to implement DDM services. Unfortunately, data dependencies in the SBM algorithm make it difficult to exploit parallelism.

This paper is organized as follows. In Section 2, we review the state of the art concerning the DDM service. In Section 3, we describe some of the existing DDM algorithms: brute force, grid-based, sequential sort-based, and interval-tree matching. In Section 4, we present the main contribution of this work, i.e., a parallel version of the SBM algorithm. In Section 5 we experimentally evaluate the performance of parallel SBM on two multicore processors. Finally, conclusions and future work are discussed in Section 6.

2 Related Work

The matching part of DDM is a more specific instance of the problem of identifying the intersecting pairs of (hyper) rectangles in a multidimensional metric space.

Data structures such as $k$ - $d$ trees [25] and R-trees [13] are able to efficiently store volumetric objects and identify intersections. Such data structures are quite complex to implement and, in many real-world situations, slower than less efficient but simpler solutions [22]. For example, in [12] the authors introduced a rectangle-intersection algorithm that is implemented using only simple data structures (i.e., arrays) and that can enumerate all $K$ intersections among $n$ rectangles with complexity $O(n\log n+K)$ time and $O(n)$ space.

Among the many matching algorithms that have been proposed for enumerating all intersections among subscription and update extents, SBM [23] proved to be very efficient. SBM solves the region matching problem in one dimension; it first sorts the endpoints, and then scans the sorted set. In [20], SBM has been extended to deal with dynamic environments in which extents are dynamic (both in terms of placement and size). On the other hand, SBM has the drawback that it cannot be trivially parallelized, because its scan phase is intrinsically serial. This is a serious limitation since most modern processing architectures are multi- or many-core.

Only a few parallel solutions for DDM and interest matching [15] have been proposed. Among them, the authors of this paper have proposed the Interval Tree Matching (ITM) algorithm for computing intersections among $d$ -rectangles [18]. ITM is based on an interval tree data structure and, after the tree is built, exhibits an embarrassingly parallel structure. The performance evaluation reported in [18] shows that the sequential implementation of ITM is competitive with the sequential implementation of SBM.

In [16], a parallel ordered-relation-based matching algorithm is proposed. The algorithm is composed of five phases: projection, sorting, task decomposition, internal matching and external matching. In the experimental evaluation, a MATLAB implementation is compared with the sequential SBM. The results show that, with a high number of extents, the proposed algorithm is faster than SBM.

In [24] the performance of parallel versions of Brute Force (BF) and grid-based matching (fixed, dynamic and hierarchical) is compared. The preliminary results show that parallel BF has limited scalability and that, in this specific case, hierarchical grid-based matching has the best performance.

3 The Region Matching Problem

In this section we define the DDM problem, and describe three matching algorithms that have been thoroughly investigated in the literature (brute-force, grid-based and sort-based), in addition to one that has been introduced recently (interval-tree matching).

Given two sets $\mathbf{S}=\{S_{1},\ldots,S_{n}\}$ and $\mathbf{U}=\{U_{1},\ldots,U_{m}\}$ of $d$ -dimensional rectangles with sides parallel to the axes (called subscription extents and update extents, respectively), the DDM problem consists of identifying all intersections between a subscription extent and an update extent. Formally, a DDM algorithm must return the list of all pairs $(S_{i},U_{j})$ such that $S_{i}\cap U_{j}\neq\emptyset$ , $1\leq i\leq n$ , $1\leq j\leq m$ .

Figure 1 shows an instance of the DDM problem in $d=2$ dimensions with three subscription extents $\{S_{1},S_{2},S_{3}\}$ and two update extents $\{U_{1},U_{2}\}$ . There are four overlaps (intersections) between a subscription and an update extent, namely $(S_{1},U_{1})$ , $(S_{2},U_{2})$ , $(S_{3},U_{1})$ , and $(S_{3},U_{2})$ . Note that $S_{1}$ and $S_{2}$ overlap, but this intersection is ignored since it involves subscription extents only.

The time complexity of any DDM algorithm is output-sensitive, since it depends on the size of the output in addition to the size of the input. Therefore, every DDM algorithm that explicitly enumerates all the $K$ intersections requires time $\Omega(K)$ . Since there can be at most $n\times m$ intersections, the worst-case complexity of the DDM problem is $O(n\times m)$ .

One of the key steps of any DDM algorithm is testing whether two $d$ -rectangles overlap. The special case $d=1$ is quite simple, as it reduces to testing whether two closed intervals $x=[x.\textit{low},x.\textit{high}]$ , $y=[y.\textit{low},y.\textit{high}]$ intersect; this can be done in constant time: $x$ and $y$ overlap if and only if

$$x.\textit{low}\leq y.\textit{high}\;\land\;y.\textit{low}\leq x.\textit{high}$$

(see Algorithm 1).

The general case $d>1$ can be reduced to the base case $d=1$ by observing that two $d$ -rectangles overlap if and only if all their projections along each dimension overlap. Therefore, we can invoke Algorithm 1 $d$ times, and compute the logical “and” of the results. Using this property, an algorithm that enumerates all intersections among two sets of $n$ and $m$ one-dimensional segments in time $O\left(f(n,m)\right)$ can be readily extended to an $O\left(d\times f(n,m)\right)$ algorithm for reporting intersections among two sets of $d$ -rectangles. For this reason, it is common practice in the DDM research community to focus on the simpler one-dimensional case.
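The reduction from $d$ dimensions to one dimension can be sketched in C++ as follows. This is a minimal illustration, not the authors' code: the `Interval` and `Rect` types and the function names are hypothetical, and a $d$ -rectangle is assumed to be stored as one closed interval per dimension.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Closed one-dimensional interval [low, high] (hypothetical type).
struct Interval {
    double low;
    double high;
};

// Constant-time overlap test for two closed intervals.
bool overlap_1d(const Interval& x, const Interval& y) {
    assert(x.low <= x.high && y.low <= y.high);
    return x.low <= y.high && y.low <= x.high;
}

// A d-rectangle stored as one interval per dimension (assumed layout).
template <std::size_t D>
using Rect = std::array<Interval, D>;

// Two d-rectangles overlap iff their projections overlap on every axis,
// i.e. the logical "and" of D one-dimensional tests.
template <std::size_t D>
bool overlap(const Rect<D>& a, const Rect<D>& b) {
    for (std::size_t k = 0; k < D; ++k) {
        if (!overlap_1d(a[k], b[k])) return false;
    }
    return true;
}
```

The loop exits as soon as one projection fails to overlap, so the $d$ one-dimensional tests are an upper bound on the work per pair.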

3.1 Brute-Force Matching

The simplest solution to the $1$ -dimensional segment intersection problem is the BF approach, also called Region-Based matching (Algorithm 2). The BF algorithm, as the name suggests, checks all $n\times m$ subscription-update pairs and inserts every intersection into a list $L$ .

Despite its simplicity, the BF algorithm is extremely inefficient since it requires time $O(nm)$ . However, it exhibits an embarrassingly parallel structure since the loop iterations (lines 2–5) are independent. This makes parallelization of the BF algorithm trivial; when $P$ processors are available, the amount of work performed by each processor is $O\left(nm/P\right)$ .
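A possible shape for parallel brute-force matching in one dimension is sketched below. It is an illustration under stated assumptions rather than the code used in the paper: the `Interval` and `Match` types are hypothetical, and each outer iteration writes to its own result bucket so that no synchronization on a shared list is needed. When compiled without OpenMP support the pragma is ignored and the function runs serially with the same result.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Interval { double low, high; };   // closed interval (hypothetical type)
using Match = std::pair<int, int>;       // (subscription index, update index)

std::vector<Match> brute_force_match(const std::vector<Interval>& S,
                                     const std::vector<Interval>& U) {
    const long n = static_cast<long>(S.size());
    const long m = static_cast<long>(U.size());
    // One private bucket per subscription: iterations never share a bucket.
    std::vector<std::vector<Match>> rows(S.size());

    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        for (long j = 0; j < m; ++j) {
            if (S[i].low <= U[j].high && U[j].low <= S[i].high) {
                rows[i].emplace_back(static_cast<int>(i), static_cast<int>(j));
            }
        }
    }

    // Serial concatenation of the per-subscription buckets.
    std::vector<Match> result;
    for (const auto& row : rows) {
        result.insert(result.end(), row.begin(), row.end());
    }
    return result;
}
```

Collecting matches in per-iteration buckets trades some memory for the absence of locks; serializing updates to a single shared list is the other obvious option.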

3.2 Grid-Based Matching

The Grid Based (GB) matching algorithm proposed by Boukerche and Dzermajko [5] improves over BF matching. GB works by partitioning the routing space into a regular mesh of $d$ -dimensional cells. Each subscription or update extent is mapped to the grid cells it overlaps with. Events generated by an update extent $U_{j}$ are sent to all subscription extents that share at least one cell with $U_{j}$ . A filtering mechanism must then be applied to avoid delivering spurious events. For example, in Figure 2 we see that $S_{2}$ shares the hatched grid cells with $U_{1}$ , but does not overlap with $U_{1}$ . Hence, the GB matching algorithm would send notifications from $U_{1}$ to $S_{2}$ that will need to be filtered out.

A simple filtering mechanism consists of applying the BF algorithm to each grid cell. If the routing space is partitioned into $G$ cells and all extents are evenly distributed, each cell will overlap with $n/G$ subscription and $m/G$ update extents on average. Therefore, the brute force approach applied to each cell will require $O(nm/G^{2})$ operations; since there are $G$ cells, the overall worst-case complexity of GB matching is $O(nm/G)$ . Therefore, in the ideal case GB can decrease the matching complexity by a factor $G$ with respect to BF. Unfortunately, when cells are small (and therefore $G$ is large) each extent is mapped to a larger number of cells, which increases the computation time.
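The grid-based scheme with brute-force filtering can be sketched in one dimension as follows. This is only an illustration: the routing space is assumed to be the interval $[0, L)$ divided into $G$ cells of equal width, extents that stick out of it are clamped to the boundary cells, and the names `grid_match`, `Interval` and `Match` are hypothetical. A `std::set` is used for the result so that a pair found in several cells is reported once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

struct Interval { double low, high; };   // closed interval (hypothetical type)
using Match = std::pair<int, int>;       // (subscription index, update index)

std::set<Match> grid_match(const std::vector<Interval>& S,
                           const std::vector<Interval>& U,
                           double L, int G) {
    assert(L > 0.0 && G > 0);
    const double width = L / G;
    // Index of the cell containing coordinate x, clamped to [0, G-1].
    auto cell = [&](double x) {
        int c = static_cast<int>(x / width);
        return std::min(std::max(c, 0), G - 1);
    };

    // Map every extent to all the cells it overlaps.
    std::vector<std::vector<int>> subs(G), upds(G);
    for (std::size_t i = 0; i < S.size(); ++i)
        for (int c = cell(S[i].low); c <= cell(S[i].high); ++c)
            subs[c].push_back(static_cast<int>(i));
    for (std::size_t j = 0; j < U.size(); ++j)
        for (int c = cell(U[j].low); c <= cell(U[j].high); ++c)
            upds[c].push_back(static_cast<int>(j));

    // Brute-force filtering inside each cell removes spurious candidates.
    std::set<Match> result;
    for (int c = 0; c < G; ++c)
        for (int i : subs[c])
            for (int j : upds[c])
                if (S[i].low <= U[j].high && U[j].low <= S[i].high)
                    result.insert(Match(i, j));
    return result;
}
```

The mapping loops make the cost of small cells visible: an extent of length $l$ is inserted into roughly $l/\textit{width}$ cells, so increasing $G$ increases the mapping work.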

3.3 Interval-Tree Matching

The Interval Tree Matching (ITM) algorithm [18] is based on the interval tree data structure that solves the matching problem in one dimension. An interval tree is a balanced search tree that stores a dynamic set of intervals, supporting insertions, deletions, and queries to get the list of segments intersecting a given interval $q$ . Different implementations of interval trees are possible, depending on the structure of the underlying search tree; the implementation described in [18] is based on AVL trees [2].

Each node $x$ of the AVL tree holds three fields: (i) an interval $x.\textit{in}$ , represented by its lower and upper bounds; (ii) the minimum lower bound $x.\textit{minlower}$ among all intervals stored at the subtree rooted at $x$ ; (iii) the maximum upper bound $x.\textit{maxupper}$ among all intervals stored at the subtree rooted at $x$ . Nodes are kept sorted according to the interval lower bounds. Figure 3 shows a set of intervals and the corresponding interval tree representation.

Insertions and deletions are handled according to the normal rules for AVL trees, with the additional requirement that any update of the values of maxupper and minlower must be propagated up to the tree root. Since the height of an AVL tree is $O(\log n)$ , insertions and deletions in the augmented data structure require $O(\log n)$ time in the worst case. The storage requirement is $O(n)$ .

Function IntTree-Matching-1D (Algorithm 3) returns the list of intersections among the set $\mathbf{S}$ of subscription intervals and the set $\mathbf{U}$ of update intervals. This is done by first building an interval tree $T$ containing all elements in $\mathbf{S}$ (line 13); then, for each update interval $U_{j}\in\mathbf{U}$ , the algorithm calls function $\textsc{Interval-Query}(x,q)$ to identify all subscriptions that intersect $U_{j}$ (lines 14–15). The function returns the list of intersections of the update interval $q$ with the segments stored in the subtree rooted at $x$ ( $T.\textit{root}$ is the root of $T$ ). Function Interval-Query performs a visit of the interval tree data structure, using the values of attributes $x.\textit{minlower}$ and $x.\textit{maxupper}$ of each node $x$ to steer the visit out of the subtrees that would yield no matches.

An interval tree can be created in time $O(n\log n)$ ; the total query time is $O\left(\min\{mn,(K+1)\log n\}\right)$ , $K\leq nm$ being the number of intersections involving all subscription and all update intervals [18]. When executed on a shared-memory multiprocessor with $P$ cores, the iterations of the for loop in Algorithm 3, lines 14–15 can be split across the cores, with the provision that updates to the result list $L$ are serialized. The only remaining serial part is the construction of the interval tree; while concurrent balanced search trees have been proposed in the literature [19, 21] it is unclear whether they can be used as drop-in replacements.

3.4 Sort-Based Matching

The Sort-based Matching algorithm [14, 23] is an efficient solution to the DDM problem. Algorithm 4 illustrates SBM in its basic form: given a set $\mathbf{S}$ of $n$ subscription intervals, and a set $\mathbf{U}$ of $m$ update intervals, the algorithm considers each of the $2\times(n+m)$ endpoints in non-decreasing order; two sets SubSet and UpdSet are used to keep track of the active subscription and update intervals at every point $t$ ; we say that an interval is active at $t$ if its lower endpoint has time $\leq t$ , and its upper endpoint has time $>t$ . For example, Figure 4 shows the values of SubSet while the SBM sweeps through a set of subscription intervals (update intervals are handled in exactly the same way). When the upper bound of an interval is encountered, the list of intersections $L$ is updated accordingly.

Let $N=n+m$ be the total number of extents, so that there are $2N$ endpoints; then, the SBM algorithm uses simple data structures and requires $O\left(N\log N\right)$ time to sort the vector of endpoints, plus $O(N)$ time to scan the sorted vector. During the scan phase, $O(K)$ time is spent in total to transfer the information from the sets SubSet and UpdSet to the intersection list $L$ . The overall computational cost of SBM is $O\left(N\log N+K\right)$ ( $K$ is the number of intersections).
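The sequential sort-and-sweep procedure described above can be sketched as follows. This is an illustration and not a transcription of Algorithm 4: the types and names are hypothetical, and the tie-breaking rule (a lower endpoint sorts before an upper endpoint with the same coordinate, so that closed intervals touching at a single point are reported) is an assumption made here to agree with the closed-interval overlap test.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

struct Interval { double low, high; };   // closed interval (hypothetical type)
using Match = std::pair<int, int>;       // (subscription index, update index)

struct Endpoint {
    double value;
    bool is_upper;   // false: lower endpoint, true: upper endpoint
    bool is_sub;     // true: subscription extent, false: update extent
    int id;          // index of the extent inside S or U
};

// Assumed tie-break: lower endpoints precede upper ones at equal coordinates.
bool endpoint_less(const Endpoint& a, const Endpoint& b) {
    if (a.value != b.value) return a.value < b.value;
    return !a.is_upper && b.is_upper;
}

std::vector<Match> sort_based_match(const std::vector<Interval>& S,
                                    const std::vector<Interval>& U) {
    std::vector<Endpoint> T;
    for (std::size_t i = 0; i < S.size(); ++i) {
        T.push_back({S[i].low, false, true, static_cast<int>(i)});
        T.push_back({S[i].high, true, true, static_cast<int>(i)});
    }
    for (std::size_t j = 0; j < U.size(); ++j) {
        T.push_back({U[j].low, false, false, static_cast<int>(j)});
        T.push_back({U[j].high, true, false, static_cast<int>(j)});
    }
    std::sort(T.begin(), T.end(), endpoint_less);

    std::set<int> sub_set, upd_set;      // currently active extents
    std::vector<Match> L;
    for (const Endpoint& t : T) {
        std::set<int>& own = t.is_sub ? sub_set : upd_set;
        const std::set<int>& other = t.is_sub ? upd_set : sub_set;
        if (!t.is_upper) { own.insert(t.id); continue; }
        // Upper endpoint: the extent ends here and overlaps every extent
        // of the other kind that is still active.
        own.erase(t.id);
        for (int k : other)
            L.push_back(t.is_sub ? Match(t.id, k) : Match(k, t.id));
    }
    return L;
}
```

Each intersecting pair is reported once, at the first of its two upper endpoints; by the time the second one is reached, the first extent has already left its active set.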

4 Parallel Sort-based Matching

In this section we describe a parallel version of the SBM algorithm, using Algorithm 4 as the starting point.

We have seen that SBM operates in two phases: first, the list $T$ of endpoints is sorted; then, the sorted list is traversed to compute the values of the SubSet and UpdSet variables, from which the list of overlaps is derived. On a shared-memory architecture with $P$ processors, the sorting phase can be realized using a parallel sorting algorithm [27, 9]. The traversal of the sorted list of endpoints (Algorithm 4 lines 6–20) is, however, more challenging to execute in parallel. Ideally, we would like to split the list $T$ into $P$ segments of equal size $T_{0},\ldots,T_{P-1}$ , and assign each segment to a processor. Unfortunately, this is made difficult by the loop-carried dependencies caused by the variables SubSet and UpdSet, whose values are modified at each iteration.

Let us pretend that the scan phase can be parallelized somehow. Then, a parallel version of SBM would look like Algorithm 5 (line 6 will be explained shortly). The major difference between Algorithm 5 and its sequential counterpart is that the former uses two arrays $\texttt{SubSet}[p]$ and $\texttt{UpdSet}[p]$ instead of the scalar variables SubSet and UpdSet. This allows each core to operate on its private copy of the subscription and update sets, achieving the maximum level of parallelism.

It is not difficult to see that Algorithm 5 is equivalent to the sequential SBM (i.e., they return the same result) if and only if $\texttt{SubSet}[0..P-1]$ and $\texttt{UpdSet}[0..P-1]$ are properly initialized. Specifically, $\texttt{SubSet}[p]$ and $\texttt{UpdSet}[p]$ must be initialized with the values that the sequential SBM algorithm assigns to SubSet and UpdSet right after the last endpoint of $T_{p-1}$ is processed, $p=1,\ldots,P-1$ ; $\texttt{SubSet}[0]$ and $\texttt{UpdSet}[0]$ must be initialized to the empty set.

It turns out that the content of the arrays $\texttt{SubSet}[0..P-1]$ and $\texttt{UpdSet}[0..P-1]$ can be computed efficiently using a prefix computation (also called scan or prefix-sum). To make this paper self-contained, we provide details on prefix computations before illustrating the missing part of the parallel SBM algorithm.

Prefix computations

A prefix computation consists of a sequence of $N>0$ data items $x_{0},\ldots,x_{N-1}$ and an associative operator $\oplus$ . There are two types of prefix computations: the inclusive scan operation produces a new sequence of $N$ data items $y_{0},\ldots,y_{N-1}$ such that:

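$$\begin{aligned}y_{0}&=x_{0}\\ y_{1}&=x_{0}\oplus x_{1}\\ y_{2}&=x_{0}\oplus x_{1}\oplus x_{2}\\ &\;\;\vdots\\ y_{N-1}&=x_{0}\oplus x_{1}\oplus\ldots\oplus x_{N-1}\end{aligned}$$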

while the exclusive scan operation produces the sequence $z_{0},z_{1},\ldots z_{N-1}$ such that:

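$$\begin{aligned}z_{0}&=0\\ z_{1}&=x_{0}\\ z_{2}&=x_{0}\oplus x_{1}\\ &\;\;\vdots\\ z_{N-1}&=x_{0}\oplus x_{1}\oplus\ldots\oplus x_{N-2}\end{aligned}$$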

where $0$ is the neutral element of operator $\oplus$ , i.e., $0\oplus x=x$ .
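The two kinds of scan can be illustrated with a short sequential sketch. The function names are hypothetical; the operator is passed as a callable so that any associative $\oplus$ can be used, and the exclusive variant additionally takes the neutral element.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Inclusive scan: y[i] = x[0] (+) x[1] (+) ... (+) x[i].
template <class T, class Op>
std::vector<T> inclusive_scan_seq(const std::vector<T>& x, Op op) {
    std::vector<T> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = (i == 0) ? x[0] : op(y[i - 1], x[i]);
    }
    return y;
}

// Exclusive scan: z[0] is the neutral element and
// z[i] = x[0] (+) ... (+) x[i-1] for i > 0.
template <class T, class Op>
std::vector<T> exclusive_scan_seq(const std::vector<T>& x, T neutral, Op op) {
    std::vector<T> z(x.size());
    T acc = neutral;
    for (std::size_t i = 0; i < x.size(); ++i) {
        z[i] = acc;
        acc = op(acc, x[i]);
    }
    return z;
}
```

With numeric addition and the input 1, 2, 3, 4, the inclusive scan yields 1, 3, 6, 10 and the exclusive scan yields 0, 1, 3, 6.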

Blelloch [4] showed that the prefix sums of $N$ items can be computed in time $O(N/P+\log P)$ using $P<N$ processors on a shared-memory multiprocessor by organizing the computation as a tree. The $O(N/P+\log P)$ time is optimal when $N/P>\log P$ . In our algorithm we use a simpler two-level mechanism that achieves running time $O(N/P+P)$ , which is still optimal when $N/P>P$ . This is usually the case, since the current generation of CPUs has a small number of cores (e.g., $P\leq 72$ for the Intel Xeon Phi) and the number of extents $N$ is usually very large. We remark that the parallel SBM algorithm can be readily implemented with the tree-structured reduction operation, and therefore will still be competitive on future generations of processors with a higher number of cores.

Figure 5 illustrates an example of parallel (inclusive) scan with $P=4$ processors, assuming that the $\oplus$ operator is the numeric addition. The computation involves two parallel steps, and one serial step which is executed by a single processor that we call the master. ① The input sequence is split across the processors, and each processor computes the prefix sum of the elements in its portion. ② The master computes the prefix sum of the $P$ last local sums. ③ The master scatters the first $(P-1)$ computed values (prefix sums of the last local sums) to the last $(P-1)$ processors. Each processor, except the first one, adds (more precisely, applies the $\oplus$ operator) the received value to the prefix sums from step ①, producing a portion of the output sequence. Steps ① and ③ require time $O(N/P)$ , while step ② is executed by the master only in time $O(P)$ , yielding a total cost of $O(N/P+P)$ .
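The three steps of the two-level scan can be sketched as follows for numeric addition. This is a minimal illustration with hypothetical names: the input is cut into $P$ contiguous blocks, the two parallel steps are expressed as OpenMP loops over blocks, and the master step is a plain serial loop. Without OpenMP support the pragmas are ignored and the result is unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<long> two_level_scan(const std::vector<long>& x, int P) {
    assert(P > 0);
    const std::size_t N = x.size();
    std::vector<long> y(x);

    // Block p covers the index range [start[p], start[p+1]).
    std::vector<std::size_t> start(static_cast<std::size_t>(P) + 1);
    for (int p = 0; p <= P; ++p) {
        start[p] = N * static_cast<std::size_t>(p) / static_cast<std::size_t>(P);
    }

    // Step 1 (parallel): local inclusive scan inside each block.
    #pragma omp parallel for
    for (int p = 0; p < P; ++p) {
        for (std::size_t i = start[p] + 1; i < start[p + 1]; ++i) {
            y[i] += y[i - 1];
        }
    }

    // Step 2 (master): offset[p] is the sum of the totals of blocks 0..p-1;
    // the total of a non-empty block is the last value of its local scan.
    std::vector<long> offset(P, 0);
    for (int p = 1; p < P; ++p) {
        offset[p] = offset[p - 1];
        if (start[p] > start[p - 1]) offset[p] += y[start[p] - 1];
    }

    // Step 3 (parallel): every block except the first adds its offset.
    #pragma omp parallel for
    for (int p = 1; p < P; ++p) {
        for (std::size_t i = start[p]; i < start[p + 1]; ++i) {
            y[i] += offset[p];
        }
    }
    return y;
}
```

Steps 1 and 3 touch each element once, while step 2 performs $P-1$ additions, which matches the $O(N/P+P)$ bound when each block is handled by its own processor.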

Parallel Sort Matching

We are now ready to complete the description of the parallel SBM algorithm by showing how to fill the arrays $\texttt{SubSet}[p]$ and $\texttt{UpdSet}[p]$ in parallel. To better illustrate the steps involved, we refer to the example in Figure 6. In the figure, we consider subscription extents only, since the procedure for update extents is the same.

The sorted list of endpoints $T$ is evenly split into $P$ segments $T_{0},\ldots,T_{P-1}$ . Processor $p$ scans the endpoints $t\in T_{p}$ in non-decreasing order, updating four auxiliary variables $\texttt{Sadd}[p]$ , $\texttt{Sdel}[p]$ , $\texttt{Uadd}[p]$ , and $\texttt{Udel}[p]$ . Informally, $\texttt{Sadd}[p]$ and $\texttt{Sdel}[p]$ (resp. $\texttt{Uadd}[p]$ and $\texttt{Udel}[p]$ ) contain the endpoints that the sequential SBM algorithm would add/remove from SubSet (resp. UpdSet) while scanning the endpoints belonging to segment $T_{p}$ . More formally, at the end of each local scan the following invariants hold:

1. $\texttt{Sadd}[p]$ (resp. $\texttt{Uadd}[p]$ ) contains the subscription (resp. update) intervals whose lower endpoint belongs to $T_{p}$ , and whose upper endpoint does not belong to $T_{p}$ ;

2. $\texttt{Sdel}[p]$ (resp. $\texttt{Udel}[p]$ ) contains the subscription (resp. update) intervals whose upper endpoint belongs to $T_{p}$ , and whose lower endpoint does not belong to $T_{p}$ .

This step is realized by lines 2–18 of Algorithm 6, and its effects are shown in Figure 6 ①. The figure reports the values of $\texttt{Sadd}[p]$ and $\texttt{Sdel}[p]$ after each endpoint has been processed; the algorithm does not store every intermediate value, since only the last ones (within thick boxes) will be needed by the next step.

Once all $\texttt{Sadd}[p]$ and $\texttt{Sdel}[p]$ are available, the next step is executed by the master and consists of computing the values of $\texttt{SubSet}[p]$ and $\texttt{UpdSet}[p]$ , $p=0,\ldots,P-1$ . Recall from the discussion above that $\texttt{SubSet}[p]$ (resp. $\texttt{UpdSet}[p]$ ) is the set of active subscription (resp. update) intervals that would be identified by the sequential SBM algorithm right after the end of segment $T_{0}\cup\ldots\cup T_{p-1}$ . The values of $\texttt{SubSet}[p]$ and $\texttt{UpdSet}[p]$ are related to $\texttt{Sadd}[p]$ , $\texttt{Sdel}[p]$ , $\texttt{Uadd}[p]$ and $\texttt{Udel}[p]$ as follows:

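$$\texttt{SubSet}[p]=\begin{cases}\emptyset&\text{if }p=0\\ \left(\texttt{SubSet}[p-1]\cup\texttt{Sadd}[p-1]\right)\setminus\texttt{Sdel}[p-1]&\text{if }p>0\end{cases}$$

$$\texttt{UpdSet}[p]=\begin{cases}\emptyset&\text{if }p=0\\ \left(\texttt{UpdSet}[p-1]\cup\texttt{Uadd}[p-1]\right)\setminus\texttt{Udel}[p-1]&\text{if }p>0\end{cases}$$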

Intuitively, the set of active intervals at the end of $T_{p}$ can be computed from those active at the end of $T_{p-1}$ , plus the intervals that became active in $T_{p}$ , minus those that ceased to be active in $T_{p}$ .

Lines 20–23 of Algorithm 6 take care of this computation; see also Figure 6 ② for an example. Once the initial values of $\texttt{SubSet}[p]$ and $\texttt{UpdSet}[p]$ have been computed, Algorithm 5 can be resumed to identify the list of overlaps.
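The initialization of the per-segment active sets can be sketched as follows for a single kind of extent (subscriptions, say; updates are handled identically). This is an illustration under stated assumptions, not a transcription of Algorithm 6: the `Endpoint` type and the function name are hypothetical, the input is assumed to be already sorted with the lower endpoint of every interval preceding its upper endpoint, and `std::set<int>` stands in for whatever set representation is used.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Endpoint {
    double value;
    bool is_upper;   // false: lower endpoint, true: upper endpoint
    int id;          // index of the interval this endpoint belongs to
};

// Returns active[p]: the intervals that are active right after the
// segments T_0, ..., T_{p-1} have been scanned (active[0] is empty).
std::vector<std::set<int>> initial_active_sets(const std::vector<Endpoint>& T,
                                               int P) {
    assert(P > 0);
    const std::size_t N = T.size();
    std::vector<std::set<int>> add(P), del(P), active(P);

    // Step 1 (parallel): each segment builds its private add/del sets.
    #pragma omp parallel for
    for (int p = 0; p < P; ++p) {
        const std::size_t first = N * static_cast<std::size_t>(p) / P;
        const std::size_t last = N * (static_cast<std::size_t>(p) + 1) / P;
        for (std::size_t i = first; i < last; ++i) {
            const Endpoint& t = T[i];
            if (!t.is_upper) {
                add[p].insert(t.id);
            } else if (add[p].erase(t.id) == 0) {
                // The lower endpoint lies in an earlier segment.
                del[p].insert(t.id);
            }
        }
    }

    // Step 2 (master): active[p] = (active[p-1] U add[p-1]) \ del[p-1].
    for (int p = 1; p < P; ++p) {
        active[p] = active[p - 1];
        active[p].insert(add[p - 1].begin(), add[p - 1].end());
        for (int id : del[p - 1]) active[p].erase(id);
    }
    return active;
}
```

For example, with the four intervals [0,5], [1,2], [3,8], [6,7] and $P=4$ , the eight sorted endpoints are split into segments of two, and the active sets at the start of segments 0 to 3 are {}, {0,1}, {0,2} and {2,3}.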

Asymptotic Analysis

We now analyze the asymptotic cost of parallel SBM. Algorithm 5 consists of three phases:

1. Fill the array of endpoints $T$ , and sort $T$ in non-decreasing order; if $P$ processors are available, this step requires total time $O\left(N\log N/P\right)$ , where $N$ is the total number of subscription and update extents, using a suitable sorting algorithm such as parallel merge sort [9].

2. Compute the initial values of $\texttt{SubSet}[p]$ and $\texttt{UpdSet}[p]$ , for each $p=0,\ldots,P-1$ ; this phase requires $O\left(N/P+P\right)$ steps using the two-level scan shown in Algorithm 6; the time can be further reduced to $O\left(N/P+\log P\right)$ steps using a tree-structured reduction [4].

3. Perform the final local scans. Each scan can be completed in $O(N/P)$ steps.

Note, however, that phases 2 and 3 require the manipulation of data structures to hold sets of endpoints, supporting insertions and removals of single elements and whole sets. Therefore, a single step of the algorithm has a non-constant time complexity that depends on the actual implementation of sets and the number of elements they contain. Furthermore, during phase 3 total time $O(K)$ is spent cumulatively by all processors to push all $K$ intersections into the result list $L$ .

5 Experimental Evaluation

In this section we evaluate the performance and scalability of parallel SBM with respect to parallel versions of the BF and ITM algorithms. BF and ITM are considered because both exhibit an embarrassingly parallel structure, and ITM has already been shown to be more computationally efficient than BF [18]. In the present study we do not consider the GB algorithm: while it can be very fast and contains easily exploitable parallelism, its efficiency depends on the grid size $G$ that should either be judiciously selected, or adaptively defined by means of non-trivial heuristics [6]. Therefore, to reduce the number of degrees of freedom we restrict our study to algorithms that have no tunable parameters, postponing a more complete study to a forthcoming paper. To foster the reproducibility of our experiments, all the source code used in this performance evaluation, and the raw data obtained in the experiments execution, are freely available on the Web (http://pads.cs.unibo.it).

The BF and ITM algorithms have been implemented in C, and the parallel SBM algorithm has been implemented in C++. We used the GNU C Compiler (GCC) version 4.8.4 with the -O3 -fopenmp -D_GLIBCXX_PARALLEL flags to turn on optimization and to enable parallel constructs at the compiler and library levels. Specifically, the -fopenmp flag allows the compiler to process OpenMP directives in the source code [10]. OpenMP is an open interface supporting shared memory parallelism in the C, C++ and FORTRAN programming languages. OpenMP allows the programmer to label specific sections of the source code as parallel regions; the compiler takes care of dispatching portions of these regions to separate threads, that the Operating System (OS) can schedule on separate processors or cores. In the C/C++ languages, OpenMP directives are specified using #pragma compiler hints. The OpenMP standard also defines a set of library functions that can be called by the programmer to query and control the execution environment programmatically.

Both the BF and ITM algorithms required a single omp parallel for directive to parallelize their inner loop. The parallel SBM algorithm was more complex, and its implementation benefited from the use of some of the data structures and algorithms provided by the C++ Standard Template Library (STL) [26]. Specifically, to sort the endpoints we used the parallel std::sort function provided by the STL extensions for parallelism [8]. Indeed, the GNU STL provides several parallel sort algorithms (multiway mergesort and quicksort with various splitting heuristics) that are automatically selected at compile time when the -D_GLIBCXX_PARALLEL compiler flag is given. The remaining part of the SBM algorithm has been parallelized using explicit OpenMP directives.

The Sort-based Matching (SBM) algorithm requires a suitable data structure to store the sets of endpoints SubSet and UpdSet (see Algorithms 5 and 6). Parallel SBM puts a higher strain on this data structure than its sequential counterpart, since it requires efficient support for unions and differences between sets, in addition to insertions and deletions of single elements. We have experimented with three implementations for sets: (i) bit vectors based on the std::vector<bool> STL container (note that std::bitset cannot be used, since it requires the set size to be known at compile time); (ii) an ad-hoc implementation of bit vectors based on raw memory manipulation; (iii) the std::set container, which in the case of the GNU STL is based on Red-Black trees [3]. The last option turned out to be the most efficient, so the performance results reported in this section refer to the std::set container.
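The role of the two sets can be seen in a serial, one-dimensional sketch of sort-based matching. This is an illustration in Python under stated assumptions, not the parallel algorithm evaluated here: intervals are closed `(lo, hi)` tuples, Python's built-in `set` replaces std::set, lower endpoints are ordered before upper endpoints at equal coordinates so that touching intervals count as intersecting, and intersections are counted rather than stored. The function name `sbm_count` is hypothetical.

```python
def sbm_count(subs, updates):
    """Count intersecting (subscription, update) pairs of closed 1-D intervals."""
    LOWER, UPPER = 0, 1  # LOWER < UPPER, so lower endpoints sort first on ties

    # Build one list with both endpoints of every extent.
    endpoints = []
    for i, (lo, hi) in enumerate(subs):
        endpoints.append((lo, LOWER, "S", i))
        endpoints.append((hi, UPPER, "S", i))
    for j, (lo, hi) in enumerate(updates):
        endpoints.append((lo, LOWER, "U", j))
        endpoints.append((hi, UPPER, "U", j))
    endpoints.sort()

    sub_set, upd_set = set(), set()  # extents currently "open" in the sweep
    count = 0
    for _, bound, kind, idx in endpoints:
        own, other = (sub_set, upd_set) if kind == "S" else (upd_set, sub_set)
        if bound == LOWER:
            own.add(idx)
        else:
            # The extent closes: it intersects every open extent of the
            # other kind. Each pair is counted once, by whichever of the
            # two extents closes first.
            own.remove(idx)
            count += len(other)
    return count
```

The sweep only needs single-element insertions and deletions; the unions and differences mentioned above arise when the sweep is split among threads and the per-thread sets have to be combined.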

Experimental setup

The experiments have been carried out on two different machines, called solaris and titan, both running the 64 bit version of the Ubuntu 14.04.05 LTS OS. The hardware specifications are reported in Table 1: solaris has two Intel Xeon processors with 8 cores each (16 cores total); titan has a single Intel Core i7 processor with 6 cores. Both types of processors employ Hyper-Threading (HT) technology [17]. In HT-enabled CPUs some functional components are duplicated, but there is a single main execution unit per physical core. From the point of view of the OS, HT provides two “logical” processors for each physical core. Studies from Intel and others have shown that HT contributes a performance boost between $16\%$ and $28\%$ [17]. In other words, when two processes are executed on the same core, they compete for the shared hardware resources, resulting in lower efficiency.

When running an OpenMP program it is possible to choose the number $P$ of threads to use, either in the source code or through the OMP_NUM_THREADS environment variable. In our experiments below, $P$ never exceeds twice the number of physical cores provided by the processor, so that the OS will be able to assign each thread to a separate (logical) core. Unless configured differently, the Linux scheduler tries to spread processes to different cores as far as possible; only when there are more runnable processes than cores does HT come into effect.

For better comparability of our results with those reported in the literature we consider the case of $d=1$ dimension and use the methodology and parameters described in [23]. The first parameter is the total number of extents $N$, which comprises $n=N/2$ subscription and $m=N/2$ update extents. All extents are randomly placed on a segment of total length $L=10^{6}$ and have the same length $l$. The extent length $l$ is chosen so that a given overlapping degree $\alpha$ is obtained, where

$$\alpha=\frac{N\times l}{L}$$

Therefore, given $\alpha$ and $N$, the length $l$ of each extent is set to $l=\alpha L/N$. The overlapping degree is an indirect measure of the total number of intersections among subscription and update extents. While the cost of BF and SBM is not affected by the number of intersections, this is not the case for ITM, as will be shown below. We considered the same values for $\alpha$ as in [23], namely $\alpha\in\{0.01,1,100\}$. Finally, each measurement is the average of $30$ independent runs to get statistically valid results. Our implementations do not explicitly store the list of intersections, but only count them. We did so to ensure that the algorithms' run time is not affected by the choice of the data structure used to store the intersections.
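The workload can be sketched in a few lines of Python. This is an illustration, not the authors' test driver: the function name `make_extents` and the `seed` parameter are hypothetical, and drawing the lower endpoint uniformly from $[0, L-l]$ is an assumption about what "randomly placed" means.

```python
import random


def make_extents(N, alpha, L=1_000_000, seed=None):
    """Generate N/2 subscription and N/2 update extents of equal length.

    The common length is l = alpha * L / N, so that N * l / L == alpha.
    Placement is an assumption: lower endpoints are uniform in [0, L - l].
    """
    rng = random.Random(seed)
    l = alpha * L / N

    def extent():
        lo = rng.uniform(0.0, L - l)
        return (lo, lo + l)

    subs = [extent() for _ in range(N // 2)]
    upds = [extent() for _ in range(N // 2)]
    return subs, upds, l
```

For example, with $N=10^{6}$ and $\alpha=100$ the formula gives $l=100\times 10^{6}/10^{6}=100$, while $\alpha=0.01$ gives $l=0.01$.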

Wall clock time

The first performance metric we analyze is the Wall Clock Time (WCT) of the algorithms. Figure 7(a) shows the WCT for the parallel versions of BF, ITM and SBM as a function of the number $P$ of OpenMP threads used, given $N=10^{6}$ extents and overlapping degree $\alpha=100$ . Dashed lines indicate when $P$ exceeds the number of CPU cores.

With those parameters, the parallel BF algorithm is about three orders of magnitude slower than SBM on both the titan and solaris machines. For larger values of $N$ the gap widens further, since BF is asymptotically slower than the other two algorithms. Indeed, the computational cost of BF grows quadratically with the number of extents (see Section 3), while that of SBM and ITM grows only quasi-linearly, that is, linearly up to logarithmic factors. ITM performs better than BF, but worse than SBM.

In Figure 8 we study how the WCT of the parallel ITM and SBM algorithms depends on the number of extents $N$ and the overlapping degree $\alpha$. The measurements were taken on both machines (solaris and titan) with as many OpenMP threads as physical cores. Figure 8(a) shows that the WCT grows quasi-linearly with $N$ for both ITM and SBM, confirming the asymptotic analysis in Section 4; however, the parallel SBM algorithm is faster than ITM on both machines, suggesting that its cost has smaller constant factors and lower-order terms.

In Figure 8(b) we report the WCT as a function of $\alpha$, for a fixed $N=10^{8}$. We observe that, unlike ITM, the execution time of SBM is essentially independent of the overlapping degree.

Speedup

The relative speedup measures the increase in speed that a parallel program achieves when more processors are employed to solve a problem of the same size. This metric can be computed from the WCT as follows. Let $T(N,P)$ be the WCT required to process an input of size $N$ using $P$ processes (OpenMP threads). Then, for a given $N$ , the relative speedup $S_{N}(P)$ is defined as $S_{N}(P)=T(N,1)/T(N,P)$ . Ideally, the maximum value of $S_{N}(P)$ is $P$ , which means that solving a problem with $P$ processors requires $1/P$ the time needed by a single processor. In practice, however, several factors limit the speedup, such as the presence of serial regions in the parallel program, uneven load distribution, scheduling overhead, and heterogeneity in the execution hardware.
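As a small worked example of the definition $S_{N}(P)=T(N,1)/T(N,P)$, the helper below turns a table of wall clock times into relative speedups. The function name `relative_speedup` is hypothetical, and the timings in the usage note are made-up numbers chosen for easy arithmetic, not measurements from this paper.

```python
def relative_speedup(wct):
    """Map {P: T(N, P)} to {P: S_N(P)}, where S_N(P) = T(N, 1) / T(N, P).

    `wct` must contain an entry for P = 1, which serves as the baseline.
    """
    t1 = wct[1]
    return {p: t1 / t for p, t in sorted(wct.items())}
```

With the illustrative timings `{1: 12.0, 2: 8.0, 4: 6.0}` (seconds), the speedups are $1.0$, $1.5$ and $2.0$: doubling the threads from $2$ to $4$ yields less than a doubling in speed, which is the kind of sub-linear behavior discussed in the rest of this section.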

Figure 7(b) shows the speedups of the parallel versions of BF, ITM and SBM as a function of the number of OpenMP threads $P$; the speedups have been computed using the wall clock times of Figure 7(a). Line colors denote the algorithm, while the shape of the data points denotes the host where the tests have been executed (square = solaris, circle = titan). Dashed lines indicate data points where $P$ exceeds the number of physical processor cores available on that machine.

The BF algorithm, despite being the least efficient, is the most scalable. This can be attributed to its embarrassingly parallel structure and lack of any serial part. SBM, on the other hand, is the most efficient but the least scalable. Interestingly, with an equal number of OpenMP threads, SBM and ITM scale better on the i7 machine (titan) than on the Xeon machine (solaris), while BF seems unaffected by the processor type. SBM achieves a $2.6\times$ speedup with $16$ OpenMP threads on the dual Xeon machine, and a $2.9\times$ speedup with $6$ OpenMP threads on the Core i7 machine. When all “virtual” cores are used, the speedup grows to $3.6\times$ on the Xeon machine and $4.1\times$ on the i7.

The effect of HT (dashed lines) is clearly visible in Figure 7(b). The speedup degrades when $P$ exceeds the number of cores, as can be seen from the different slopes for BF on titan. When HT kicks in, load imbalance arises due to contention for the shared control units of the processor cores, and this limits the scalability. The bizarre behavior of BF on solaris around $P=24$ (the speedup drops and then rises again) is likely caused by OpenMP scheduling issues related to Non Uniform Memory Access (NUMA) [7]. Considering the high wall clock time of the BF algorithm, we do not address this issue in this paper.

The speedup of SBM improves slightly if we increase the work performed by the algorithm. Figure 9 shows the speedup of parallel ITM and SBM with $N=10^{8}$ extents and overlapping degree $\alpha=100$ (in this scenario BF takes so long that it has been omitted). The SBM algorithm behaves better, especially on the dual socket Xeon machine, achieving a $4.5\times$ speedup with $16$ OpenMP threads, and $7\times$ speedup with $32$ threads. On the Core i7 machine the speedup is $3.6\times$ with $6$ OpenMP threads (one per core), and $5.1\times$ with $12$ threads (two per core).

Scaling Efficiency

The scaling efficiency measures how well a parallel application exploits the available computational resources. Two formulations of scaling efficiency are given in the literature: strong scaling and weak scaling. Given an input of size $N$ and $P$ processors, the strong scaling efficiency $E_{N,\textrm{strong}}(P)$ and weak scaling efficiency $E_{N,\textrm{weak}}(P)$ are defined as:

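$$E_{N,\textrm{strong}}(P)=\frac{T(N,1)}{P\times T(N,P)},\qquad E_{N,\textrm{weak}}(P)=\frac{T(N,1)}{T(N\times P,P)}$$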

Scaling efficiencies are real numbers in the range $[0,1]$ . An efficiency of, say, $0.8$ indicates that the application spends $80\%$ of the time doing actual work, the rest being communication and synchronization overhead. Therefore, higher efficiencies denote better scaling behavior. Strong scaling measures how well a parallel application exploits the processors, assuming constant total problem size. Weak scaling measures how well the application exploits the processors under constant per-processor work.
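The following Python sketch makes the two notions concrete. It assumes the usual definitions, namely strong efficiency $T(N,1)/(P\times T(N,P))$, which equals $S_{N}(P)/P$, and weak efficiency $T(N,1)/T(N\times P,P)$; the function names are hypothetical.

```python
def strong_efficiency(t1, tp, p):
    """Strong scaling: T(N,1) / (P * T(N,P)), same total problem size."""
    return t1 / (p * tp)


def weak_efficiency(t1, tp_scaled):
    """Weak scaling: T(N,1) / T(N*P, P), same per-processor work."""
    return t1 / tp_scaled


def efficiency_from_speedup(speedup, p):
    """Strong scaling efficiency expressed through the relative speedup."""
    return speedup / p
```

Applied to the speedups reported above for $N=10^{6}$ and $\alpha=100$, a $2.6\times$ speedup with $16$ threads corresponds to a strong scaling efficiency of $2.6/16\approx 0.16$, and a $2.9\times$ speedup with $6$ threads to $2.9/6\approx 0.48$.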

Strong and weak scaling are investigated in Figure 10, assuming overlapping degree $\alpha=100$ . We observe that both efficiencies sharply drop when going from $P=1$ to $P=2$ and $P=4$ OpenMP threads. Looking at the strong scaling behavior of SBM and ITM (Figure 10(a)), SBM scales better than ITM on the Intel i7 machine. On the other hand, ITM is more efficient than SBM on the Xeon machine up to $P=8$ OpenMP threads; after that, SBM becomes more efficient. Weak scaling (Figure 10(b)) shows a similar behavior: SBM scales consistently better than ITM on titan, while on solaris ITM is better than SBM up to $P=6$ OpenMP threads.

Figure 10 confirms that the OpenMP SBM implementation exhibits efficiency issues. The reason is still being investigated, since it is unclear whether NUMA memory issues can explain this behavior.

Memory Usage

We conclude our experimental evaluation with an assessment of the memory usage of the parallel BF, ITM and SBM algorithms. Figure 11 shows the peak Resident Set Size (RSS) of the three algorithms as a function of the number of extents and OpenMP threads; the data have been collected on the Xeon machine solaris. The RSS is the portion of a process memory that is kept in RAM. Care has been taken to ensure that all experiments reported in this section fit comfortably in the main memory of the available machines, so that the RSS represents an actual upper bound of the amount of memory required by the algorithms. Note that the data reported in Figure 11 includes the code for the test driver and the input arrays of intervals.
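The text does not say how the peak RSS was collected. As one possible way to obtain the same quantity for a running process, the Python sketch below reads the `ru_maxrss` field returned by `getrusage`; the helper name `peak_rss_bytes` is hypothetical, and the `resource` module is available on Unix-like systems only.

```python
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of the current process, in bytes.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS,
    so the value is normalized here.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024
```

Calling such a helper at the end of a run reports the high-water mark reached at any point during execution, which is why it includes the test driver and input arrays as noted above.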

Figure 11(a) shows that the resident set size grows linearly with the number of extents $N$. BF has the smallest memory footprint, since it requires only a tiny bounded amount of additional memory for a few local variables; SBM uses more memory since it allocates larger data structures, namely the list of endpoints to be sorted, and a few arrays of sets. SBM requires approximately $7$ GB of memory to process $N=10^{8}$ intervals, about three times the amount of memory required by BF.

In Figure 11(b) we study the RSS as a function of the number of OpenMP threads $P$. The RSS for BF and ITM grows very slowly with $P$, since they do not explicitly use additional per-thread variables; therefore, the marginal increase of RSS that we observe is due to the normal overhead of the OpenMP threading system. On the other hand, the RSS for the parallel SBM algorithm is strongly influenced by the number of threads, although the variability is so high that it is not possible to observe a smooth correlation (despite each data point being computed over multiple runs). Such variability is likely caused by memory fragmentation induced by the STL data structures used by the algorithm. In any case, the RSS for SBM shows only a threefold increase when moving from $P=2$ to $P=16$ OpenMP threads; we can therefore postulate that the RSS will not become a bottleneck for any reasonable number of OpenMP threads.

6 Conclusions and Future Works

In this paper we described a parallel version of the SBM algorithm for solving the $d$-rectangle intersection problem for Data Distribution Management. Our algorithm is targeted at shared-memory multicore architectures, which constitute the vast majority of current processors.

We have implemented the parallel SBM algorithm in C++ with OpenMP directives. Performance measurements show a $2.6\times$ speedup with $16$ OpenMP threads on a dual Intel Xeon processor, and a $2.9\times$ speedup with $6$ threads on an Intel Core i7 processor. The memory usage of parallel SBM grows linearly with the number of extents $N$; memory usage also depends on the number of OpenMP threads used. In any case, SBM uses about $7$ GB of memory to handle $100$ million extents, making it attractive for large scenarios.

We are currently extending the present work in two directions. First, we are studying how the choice of parallel sorting algorithm and of dynamic set data structure influences the scalability of parallel SBM. Indeed, while the results presented in Section 5 are encouraging, scalability is lower than what asymptotic analysis predicts, suggesting the presence of bottlenecks in the implementation that should be identified and removed. Scheduling issues and NUMA memory conflicts are suspected to play a significant role in the loss of efficiency that we observe on the dual-socket test machine. Second, we are extending the SBM algorithm to solve the dynamic DDM matching problem, where extents can be moved or resized dynamically. This problem has already been investigated in the context of serial SBM [20], so it is important to assess if and how it can be solved in a parallel environment.

Notation

[TABLE]

Acronyms

BF Brute Force
DDM Data Distribution Management
GB Grid Based
HLA High Level Architecture
HT Hyper-Threading
ITM Interval Tree Matching
NUMA Non Uniform Memory Access
OS Operating System
RSS Resident Set Size
SBM Sort-based Matching
STL Standard Template Library
WCT Wall Clock Time

Bibliography


[1] IEEE Standard for Modeling and Simulation (M&S) High Level Architecture (HLA)–Framework and Rules. IEEE Std 1516-2010 (Rev. of IEEE Std 1516-2000), 2010.
[2] G. Adelson-Velskii and E. M. Landis. An Algorithm for the Organization of Information. Doklady Akademii Nauk USSR, 146(2):263–266, 1962.
[3] R. Bayer. Symmetric binary B-Trees: Data structure and maintenance algorithms. Acta Informatica, 1(4):290–306, 1972.
[4] G. E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, 38(11):1526–1538, Nov 1989.
[5] A. Boukerche and C. Dzermajko. Performance comparison of data distribution management strategies. In Proc. 5th IEEE Int. Workshop on Distributed Simulation and Real-Time Applications, DS-RT ’01, pages 67–, Washington, DC, USA, 2001. IEEE Computer Society.
[6] A. Boukerche and A. Roy. Dynamic grid-based approach to data distribution management. Journal of Parallel and Distributed Computing, 62(3):366–392, 2002.
[7] F. Broquedis, F. Diakhaté, S. Thibault, O. Aumage, R. Namyst, and P.-A. Wacrenier. Scheduling dynamic OpenMP applications over multicore architectures. In R. Eigenmann and B. R. de Supinski, editors, OpenMP in a New Era of Parallelism: 4th International Workshop, IWOMP 2008, West Lafayette, IN, USA, May 12-14, 2008, Proceedings, pages 170–180, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
[8] Programming languages – technical specification for C++ extensions for parallelism. ISO/IEC TS 19570:2015, 2015.