Engineering Massively Parallel MST Algorithms
Peter Sanders, Matthias Schimek

TL;DR
This paper presents scalable distributed-memory algorithms for minimum spanning trees, including a variant of Boruvka's algorithm and a parallel Filter-Boruvka, achieving significant speedups on large core counts.
Contribution
It introduces new scalable distributed MST algorithms, including a contracting approach and a parallel Filter-Boruvka, outperforming previous methods.
Findings
Algorithms scale up to 65,536 cores
Up to 800 times faster than prior algorithms
Effective for graphs with poor locality and high degree
Abstract
We develop and extensively evaluate highly scalable distributed-memory algorithms for computing minimum spanning trees (MSTs). At the heart of our solutions is a scalable variant of Boruvka's algorithm. For partitioned graphs with many local edges, we improve this with an effective form of contracting local parts of the graph during a preprocessing step. We also adapt the filtering concept of the best practical sequential algorithm to develop a massively parallel Filter-Boruvka algorithm that is very useful for graphs with poor locality and high average degree. Our experiments indicate that our algorithms scale well up to at least 65 536 cores and are up to 800 times faster than previous distributed MST algorithms.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Optical Network Technologies · Complex Network Analysis Techniques · Graph Theory and Algorithms
spacing=nonfrench
Engineering Massively Parallel MST Algorithms
††thanks:
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 882500).
Peter Sanders
Institute of Theoretical Informatics
*Karlsruhe Institute of Technology
*Karlsruhe, Germany
Matthias Schimek
Institute of Theoretical Informatics
*Karlsruhe Institute of Technology
*Karlsruhe, Germany
Abstract
We develop and extensively evaluate highly scalable distributed-memory algorithms for computing minimum spanning trees (MSTs). At the heart of our solutions is a scalable variant of Borůvka’s algorithm. For partitioned graphs with many local edges we improve this with an effective form of contracting local parts of the graph during a preprocessing step. We also adapt the filtering concept of the best practical sequential algorithm to develop a massively parallel Filter-Borůvka algorithm that is very useful for graphs with poor locality and high average degree. Our experiments indicate that our algorithms scale well up to at least 65 536 cores and are up to 800 times faster than previous distributed MST algorithms.
Index Terms:
graph algorithms, distributed algorithms, minimum spanning tree, MPI
I Introduction
Graphs are a universal way to model relations between objects and are thus prominently needed everywhere in computing. Hence, it is not surprising that processing huge graphs is eminently interesting for parallel processing. For example, this interest is reflected in the graph500 benchmark (graph500.org) that has led to the development of massively scalable algorithms for breadth-first traversal of RMAT random graphs. However, much less work has been done on other graph problems or even other graph families.
This paper wants to change this for the fundamental problem of computing minimum spanning trees (MSTs). Given an undirected and connected graph with weighted edges, the MST problem asks for a subset of the edges that connects all vertices and has minimum total weight among all such sets.
In the search for graph problems which can be solved massively in parallel, the MST problem is a prime candidate since it is known to have scalable PRAM algorithms (e.g. [2, 3]) with time complexity (poly)logarithmic in the number of vertices of the input graph. The MST problem is not only of theoretical interest, but also has many applications, e.g., clustering, image segmentation, and network design [4, 5, 6].
After discussing basic terms and tools in Section II and previous work in Section III, we describe a scalable distributed variant of Borůvka’s algorithm [7] in Section IV. Borůvka’s algorithm is a promising candidate for the distributed setting as it is conceptually simple and has a high potential for parallelism.
Our implementation represents the graph as a distributed sequence of edges that is lexicographically sorted. This approach has a large design space that we explored in an experimental, performance-driven way with theoretical analysis as a background consideration. Essentially, a Borůvka round is reduced to a number of sparse all-to-all communications which are notoriously difficult to scale on large distributed machines due to contention and high message-startup overheads. We mitigate this by using a two-level sparse all-to-all on a logical grid and a fast distributed sorter. We also switch to a base case with replicated vertex set as soon as the number of vertices is small enough.
Many graphs have a numbering of the vertices that assigns significant parts of a vertex’s neighborhood to the same processing element (PE) yielding local edges. In Section IV-A, we develop a preprocessing algorithm that contracts local edges in such a way that the only remaining vertices have nonlocal incident edges that are lighter than any of their local incident edge. This reduces processing time by up to a factor .
A disadvantage of Borůvka’s algorithm is that it may have to process all the edges a logarithmic number of times. In Section V, we therefore present the Filter-Borůvka algorithm that combines Borůvka’s algorithm with the filtering approach of the Filter-Kruskal MST algorithm [8] that in many respects is currently the best practical sequential algorithm. The idea is to first compute the minimum spanning forest of the globally lightest edges, then drop edges that are within components of , and only then compute the MST of the graph that includes the surviving edges. We prove that is has work linear in the number of edges and polylogarithmic span.
The implementation outlined in Section VI uses hybrid parallelism with multiple OpenMP threads operating within each process of the message passing interface (MPI) which turns out to be crucial for being able to support a large number of cores for many inputs.
In Section VII, we discuss experiments on a supercomputer using up to cores. This includes weak scaling experiments using 6 families of graphs (grid, 2D/3D random geometric, hyperbolic, Erdős-Renyi, and RMAT). Our best algorithm scales on all these families all the way to cores with a potential for considerable further scaling on the families that have some locality. In many cases our implementation is one or two orders of magnitude faster than implementations of previous distributed MST algorithms.
We also perform strong scaling experiments on six large real world graphs with good scaling up to cores on the largest available inputs as well as order-of-magnitude speedups over the implementations of previous algorithms.
In Section VIII, we summarize our result and discuss possible future improvements.
Summary of Contributions
- •
Design and analysis of a practical and highly scalable distributed variant of Borůvka’s MST algorithm.
- •
Design and analysis of a Filter-Borůvka algorithm with constant expected work per edge and polylogarithmic span.
- •
Scalable sparse all-to-all communication using indirect communication.
- •
Fast base case with replicated vertex set and communication-efficient preprocessing by contracting local MST edges.
- •
Hybrid implementation effectively using multithreading on each compute node.
- •
Extensive experimental evaluation on up to cores on a large variety of synthetic and real-world instances.
II Preliminaries
II-A Machine Model and Communication Primitives
In this work, we assume a distributed-memory computing model consisting of processing elements (PE) numbered 111 is shorthand for the sequence allowing single-ported point-to-point communication between arbitrary communication partners [9]. In this model, it takes time to send a message of length between two PEs. The parameter denotes the startup overhead to initiate a message, while quantifies the time needed to communicate a data unit.
There are algorithms for the collective operations broadcast, (all)reduce and prefix-sum with a time complexity in . For the allgather-operation (a.k.a as all-to-all broadcast or gossiping) we get the same complexity when is the sum of all message lengths.
In addition to these operations with very clear-cut complexity, in distributed graph algorithms we need more complex primitives where the complexity highly depends on the implementation and the concrete communication patterns. (Personalized, sparse) all-to-all communication delivers arbitrary sets of messages in a batched way. This can be achieved in when is the bottleneck communication volume, i.e., the maximum amount of data sent or received by a PE. The large startup term can be reduced at the cost of more and more indirect data delivery. For example, using a hypercube communication scheme, can be achieved. In our implementations, we go part of this way and reduce the startup term to by adding one indirection to the communication.
Similarly, comparison based sorting of elements can be done in expected time , e.g., by sample sort and direct data delivery, or in time using an algorithm that moves the data a logarithmic number of times. Once more, an intermediate solution moving data a constant number of times turns out to be optimal for large data sets on large distributed machines [10].
Since our algorithms crucially depend on the performance of sparse all-to-all and sorting, we therefore use several implementations depending on the number of PEs used and the amount of data involved.
II-B Graph and Input/Output Format
The input is an undirected weighted graph with vertex labels in . We use and . We say that is connected if for all vertices , contains a path from to . For a connected graph, the output is a minimum spanning tree (MST). If consists of multiple connected components, the output is an MST for each component. Such a set of MSTs is called a minimum spanning forest (MSF).
We assume to be represented as a lexicographically sorted sequence of directed edges with source vertex , destination vertex and weight . For each edge the back edge is also in . Note that we sometimes omit the edge weight writing instead of . The edge sequence is 1D-partitioned among the PEs, i.e., PE obtains a subsequence of of size . By we denote the lexicographically smallest edge in . Lexicographical means that we sort with respect to source vertex, destination vertex and then edge weight. We refer to the local input to PE as with . may share its first vertex with and its last vertex with . Such a vertex is called a shared vertex. For vertices that are not shared and edges , we refer to PE as the home PE of and .
The following definitions are from the point of view of a specific PE : A vertex is local to PE . A vertex that is not local and appears in is a ghost vertex. An edge is local if and are both local. Otherwise is a cut-edge. Fig. 1 visualizes the different vertex and edge types. Let and . Then is the subgraph of induced by .
We replicate an array of size containing for on each PE as part of our distributed graph data structure. This allows localization of the home PE of a vertex or edge by binary search.
For each MSF edge with weight , we either find or on their respective home PE as output.
II-C MST Properties and Borůvka’s Algorithm
Since our distributed MST algorithms conceptually follow Borůvka’s algorithm [7], we outline its structure in a sequential setting and state some properties of MSTs we will use later on. Let be the input graph. For a graph with multiple connected components, Borůvka’s algorithm (and our distributed variants) can be applied to each component independently yielding an MSF. We can therefore assume to be connected without loss of generality. Since one can use vertex labels to consistently break ties for equal edge weights, we also can assume our input graphs to have distinct edge weights resulting in a unique MST [9]. Borůvka’s algorithm runs in so called (Borůvka) rounds: First, for each vertex the lightest incident edge is determined. These edges induce connected components in . These components are trees with one 2-cycle, i.e., one additional edge , so called pseudo trees [9]. By some tie breaking rule, such a pseudo tree can be converted into a rooted tree with root . By the min-cut property, all edges of are MST edges [9]. These trees are then contracted into a single vertex – the so called component root, for which the vertex label of the tree root is used and edges are relabeled accordingly. Subsequently, self-loops are deleted. Optionally, parallel edges can be deleted keeping only the lightest among these. This results in a new graph with being the set of the component roots on which we subsequently apply a Borůvka round until Since we find , the computation finishes after at most rounds. Note that we do not necessarily need to contract the rooted trees into exactly one component. Instead we can also split into multiple subtrees and define as the set of the roots of all such subtrees. We will use this property in our distributed MST algorithms to handle shared vertices.
III Related Work
The computation of minimum spanning trees has been extensively studied in many different models of computation. The earliest MST algorithms in the sequential setting are due to Borůvka [7], Jarnik-Prim [11] and Kruskal [12]. Many of the later developed algorithms use ideas from these three algorithms. The KKT-algorithm [13] combines Borůvka’s algorithm with randomized sampling and edge filtering to obtain an algorithm with linear expected running time. The Filter-Kruskal algorithm [8] combines Kruskal’s algorithm with a simplified approach to edge filtering that works provably well for random edge weights and arguably is the fastest practical sequential algorithm.
In the PRAM model, there are multiple algorithms with polylogarithmic span (critical path length of the computation DAG) [2, 14]. More practical algorithms have been devised for the shared-memory (single-node) parallel setting [15, 16, 17, 8]. The shared-memory Borůvka-variant described in [15] has some similarities to our distributed Filter-Borůvka described in Section V but uses iterative skewed partitioning rather than recursive symmetric partitioning. We are also not aware of an analysis comparable to ours. Very recently, Esfahani et al. [18] proposed a structure-aware algorithm that outperforms the previous shared-memory state-of-the-art MST algorithm [15] for graphs with skewed degree distributions. We discuss their results in Section VII. For an overview of GPU algorithms, we refer to the related work section of [19].
External MST algorithms [20, 21] take memory locality into account but have inherently sequential components. An MST algorithm for the parallel external memory model (PEM) [22] works in a small number of rounds using Euler tour construction and list-ranking as subroutines. We did not follow that approach as it is relatively complicated and does not address contention and constant factors in the number of synchronizations in the way we need it.
Finding MSTs has also been extensively examined in distributed computing models. Chung and Condon [23] were the first to present a distributed implementation of Borůvka’s algorithm. They use distributed pointer doubling to contract components and evaluate their algorithm on up to cores. Dehne et al. [24] provide several practical algorithms and an evaluation on up to cores for dense graphs with . This condition essentially ensures that the memory of each machine is large enough to hold all vertices of the graph. Loncar et al. [25] propose distributed variants of the Kruskal and Jarnik-Prim algorithm that also rely on replicated vertices.
Many distributed graph processing frameworks [26, 27, 28, 29, 19] include a (mostly Borůvka-based) MST algorithm to evaluate their performance. ReGraph [29] and MND-MST [19] exploit locality by MST computations on locally available subgraphs. However, none of these approaches have been evaluated on more than cores. Furthermore, many of the above-mentioned graph processing frameworks need considerable time to load and prepare/partition a graph before the actual computation starts [30]. In Section VII, we directly compare ourselves with a version of MND-MST. Since MND-MST outperforms other graph tools in previous studies [19], this also allows some transitive comparisons.
The MST problem has also received considerable attention in MapReduce/MPC models [31, 32, 33, 4, 34, 35, 36]. Algorithms in these models aim at the reduction of communication rounds while ensuring that the respective model’s upper bound on communication volume per round is met. In [31, 32, 33], the proposed algorithms use similar ideas as previous parallelizations of Borůvka’s algorithm [23, 24]. While [31] and [32] present only theoretical results, the practical implementation of [33] on up to cores reveals scalability problems whereas our algorithms scale up to cores on the same graphs and are much faster when using a comparable number of cores.
Better performance than plain MapReduce can be achieved by adding a distributed hash-table [4, 36, 35]. An implementation [35] using between and cores reports running times on the friendster and twitter graph, which are also part of our benchmark set. Unfortunately, the number of cores actually used is not given. Our code on cores is times faster on friendster and times faster on twitter. Using cores, the speedup grows to and , respectively.
Recently, Baer et al.[37] used sparse matrix kernels to adapt the Awerbuch-Shiloach-PRAM algorithm [2] to the distributed setting leveraging a distributed library for sparse tensor algebra. To our knowledge this work was the first to evaluate the scalability of an MST algorithm on systems with more than a few hundred to thousand cores using up to cores. Therefore, we include their algorithm in our experimental evaluation.
Large scale experiments using up to 6000 cores have also been done for complete graphs stemming from geometric MST-based clustering problems [38, 39, 40]. However, processing dense graph is much easier to parallelize and the algorithms used there would not scale for the large sparse graphs (with three orders of magnitude more vertices) we are considering.
IV Scalable Distributed-Memory Borůvka
Algorithm 1 shows pseudocode for our distributed Borůvka-MST algorithm where PE works on the local subgraph . We give a high level overview before describing subroutines in more detail. We first exploit locality by determining local MST edges without communication (see Section IV-A). These edges are then contracted yielding smaller subgraphs which have to be processed in a distributed manner. In every distributed Borůvka round, each PE first determines the lightest edges incident to its local vertices . Shared vertices are only considered in the base case discussed below. This simplifies several parts of the algorithm and has no significant effect on the running time as it can be shown that the number of local vertices shrinks by at least a factor of two in every round. The components induced by the edges are then contracted and the component roots are determined (see Section IV-B). Afterwards, each PE retrieves the new labels for its ghost vertices . Using this information, the contracted graph is built using the operations Relabel and Redistribute eliminating self-loops and parallel edges on the way (see Section IV-C).
As soon as the global number of vertices is small enough to be stored on one PE, we switch to our base case algorithm (see Section IV-D). By choosing the size threshold , we take into account that up to shared vertices are not contracted in our distributed Borůvka rounds. As a very last step, we send each identified MST edge back to its original home PE in RedistributeMST.
IV-A Local Preprocessing
The key observation behind local preprocessing is that we can contract edges that can be proven to be MST edges using only locally available information. This results in a graph where the only remaining vertices have cut edges as lightest incident edges. We implement this using a variant of Borůvka’s algorithm that works on local edges only and only contracts local edges when no lighter cut edge is incident.
After local contraction, we need to update the labels of ghost vertices as these might have changed. This can be achieved with the label exchange method described in Section IV-B. We also have to reestablish the invariant that edges are globally sorted in lexicographic order. Since we only contract local edges, this can “almost” be done by locally resorting the edges. What needs to be done nonlocally is sorting the edges incident to shared vertices. This can be achieved very fast for the frequent case that these are short subsequences allocated to two subsequent PEs.
IV-B Component Contraction and Label Exchange
Recall that the identified minimum incident edges define pseudo trees connecting the components to be contracted. We first convert them to rooted trees by tie breaking for 2-cycles and by declaring shared vertices as component roots. The rooted trees are then converted to rooted stars by pointer doubling along the minimum weight edges [23]. This algorithm iteratively halves the depth of the rooted trees by replacing paths of length two by a direct shortcut. This can be implemented for a tree edge by requesting the next edge from the home PE of vertex and subsequently replacing edge by An important special case is when is a shared vertex. This property can be determined locally from the distributed graph data structure. No communication is necessary in this case as is known to be a component root. This eliminates a case that would otherwise induce contention at high degree vertices.
When pointer doubling terminates, each PE possesses the labels of the component roots of its local vertices. In order to obtain the new labels for ghost vertices, for each cut edge the new label of is sent to the home PE of . If there are multiple edges with the same home PE and , the new label of is only sent once. Note that all requests and replies are implemented with the bulk-operations discussed in Section II-A.
IV-C Relabelling and Redistribution
In Relabel, each PE scans through its edges retrieving their labels and respectively. If , the edge is discarded as a self-loop. Otherwise is stored as an edge of the contracted graph.
In operation Redistribute, the resulting edges are first sorted lexicographically (depending on the number of edges different sorting algorithms turn out to be useful (see Section II-A). Afterwards, edges between the same pair of vertices are consecutive in the sorted result and can be replaced by a single edge with the smallest occurring weight. Finally, the distributed graph data structure is reestablished using an allgather-operation on the first edge on each PE.
IV-D Base Case Algorithm
We stop the distributed Borůvka rounds when the remaining number of vertices is small enough to be stored on a single PE. Then we use yet another variant of Borůvka’s algorithm which was initially proposed by Adler et al. [41]. Here, the vertices are replicated over all PEs while the edges are distributed arbitrarily, without need to sort them. To streamline this, we remap vertex labels to the dense range . The lightest edge for each vertex can then be computed using an allReduce-operation with vector length . Subsequent contraction is then possible by local replicated computations as in the plain Borůvka algorithm (see Section II-C).
IV-E Analysis
We essentially port a PRAM algorithm to distributed memory. The former can be proven to run in (expected) time [9, Theorem 11.8]. Using PRAM emulation [42], we would achieve a bound for distributed memory that is a factor larger. Our implementation deviates from this in a way driven by experimental performance evaluation. This reduces overhead at least for benign instances while accepting asymptotically larger overheads in the worst case. We abstain from a full analysis for reasons of space and because this would yield highly complex formulas.
We note however that the operations involving all edges – finding locally minimal edges (segmented min-reduction), label exchange (balanced sparse all-to-all), and building the contracted graph (sorting) all use primitives that can avoid the log-factor overhead of the PRAM emulation.
The component contraction using pointer doubling is a more complicated issue. In the worst case, it needs a logarithmic number of iterations and heavy contention requiring general PRAM emulation can occur. However, pointer doubling only works on vertices rather than on all edges, in practice a small number of iterations suffices and also contention usually remains manageable. Thus, we get away with a relatively simple implementation based on sparse all-to-all that in our experiments never dominates the running time.
V Filter-Borůvka-Algorithm
In the worst case, Borůvka’s algorithm looks at all edges a logarithmic number of times. Asymptotically better algorithms are known whose running time depends only linearly on the number of edges . The most practical of those may still have running time superlinear in but work very well for sufficiently dense graphs. To go into that direction also for massively parallel algorithms, we look into Filter-Kruskal[8]. This algorithm is based on the observation that for many instances, most MST edges are quite light. To exploit this, Filter-Kruskal partitions the edges into light edges and heavy edges similar to quicksort. It first recurses on the light edges computing an MSF of . Before also recursing on , it filters them by eliminating those inside a connected component of . Filter-Kruskal is also a reasonable shared-memory parallel algorithm [8]. However, Filter-Kruskal has an critical path length since MST edges are found sequentially. To eliminate this bottleneck, we propose Filter-Borůvka which replaces Kruskal’s algorithm by Borůvka’s algorithm in the base case. To facilitate subsequent filtering, we modify the output specification of the underlying Borůvka algorithm to also provide component representatives within the MSF for each vertex. Changing the base case algorithm does not affect the performed work but reduces the span from linear to polylogarithmic:
Theorem 1**.**
For random edge weights, sequential Filter-Borůvka has expected sequential running time . When used as a parallel algorithm its expected span is times the span of the underlying parallel Borůvka implementation.
Proof.
The time bound immediately follows from the bound for Filter-Kruskal [8] since the only difference is in the base case. This does not affect the asymptotic running time since both Kruskal’s and Borůvka’s algorithm have running time for inputs with edges.
Since filtering and partitioning have an (expected) span in and there are at most recursion levels prior to a base case Borůvka call, it suffices to count the expected number of base case Borůvka calls to get an upper bound on the overall span. To facilitate the analysis, we assume and are in the range . If the chosen pivot does not yield such a balanced partition, one can simply repeat the pivot selection until it does. This process will need repetitions in expectation. Using a balanced partitioning, the number of edges which survive filtering is dominated by a random variable with negative binomial distribution[13]. Assume that we stop the recursion when the number of input edges is with . The probability that we recurse on then is at most . This is equivalent to the probability , where is a binomial distributed random variable. The mean of is . Let . Now, we can apply a Chernoff-bound [43, Theorem 4.5] to obtain an upper bound on the probability that the right recursion is not stopped directly, , which is in .
There are right recursive calls associated with the leftmost path in the computation tree. For each of these right recursive calls, the recursion either stops directly resulting in one additional base case call or continues with an exponentially small probability which yields at most additional base case calls. Therefore, by the union bound argument, the expected number of base case Borůvka calls is in with and being positive constants. 222This is an improved version of the proof in the conference version.
∎
Our distributed implementation of Filter-Borůvka is shown in Algorithm 2. We perform local preprocessing as described in Section IV-A once at the beginning. When the graph is sufficiently sparse, i.e., the number of input edges is in , we stop the recursion and use our distributed Borůvka-MST algorithm as base case. There are three differences when calling distributed Borůvka-MST to its description given in Algorithm 1:
- •
We refrain from performing local preprocessing as graph locality has already been exploited beforehand.
- •
We do not redistribute the computed MST edges as this is done once at the end of Filter-Borůvka.
- •
There is a distributed array of size , where PE holds the elements . After a Borůvka round, each PE stores the component root for its local vertices in . In the end, the implicitly constructed trees in are contracted using pointer doubling rounds (see Section IV-B).
For PivotSelection , randomly sampled edges are sorted with a distributed sorting algorithm (see Section II-A). Afterwards, the median of the sample is broadcasted as pivot element .
For the filtering step, each PE first requests the labels of the component representatives for the local vertices in its part of from the distributed array (RequestLabels). We then use the label exchange and relabelling routines from Section IV-B and Section IV-C to rename the edges according to these new labels. For edges that are within one component of the previously computed partial MSF , we now find . Thus, these can be discarded easily. In a last step, we redistribute the edges and eliminate parallel ones as described in Section IV-C.
VI Engineering and Implementation Details
Here, we describe some engineering refinements we made to obtain a scalable implementation of our algorithms in practice. Our algorithms are implemented in C++ using MPI for distributed communication in funneled mode, i.e., using only one dedicated thread per process for MPI communication. Multithreading is realized with OpenMP. For shared-memory parallel algorithmic building blocks like prefix-sums or filtering, we use the parlay library [44].
VI-A Reducing Startup Overhead of All-To-All Exchanges
Many steps of our two distributed MST algorithms involve multiple all-to-all exchanges with often only few bytes per message. During our experiments we noticed that the startup overhead of the built-in MPI routine for the personalized all-to-all exchange MPI_Alltoallv becomes prohibitive with increasing number of PEs. Therefore, we use an indirect grid based all-to-all variant if the average number of bytes sent per message is below a certain threshold (we use on our system). The PEs are arranged in a (virtual) two dimensional grid with columns and rows. Note that . PE resides in column and . Instead of sending messages directly, we exchange them in two steps. A message from PE to PE is first sent to the intermediate PE in row and column and from there to its final destination . Since is in the same column as and the same row as , the message exchanges can be realized with two standard MPI_Alltoallv exchanges with at most participating PEs, thus reducing the startup overhead to at the cost of a doubled communication volume. When and is a member of the last (incomplete) row of the grid, the intermediate PE is the one in row and column . For the second message exchange such a is (virtually) appended to row . For larger , the grid approach can easily be generalized to dimensions . For , we basically get the hypercube all-to-all algorithm from [45].
In particular, we found two-level all-to-all to be crucial for pointer doubling during component contraction. Fig. 2 shows the accumulated running time of the component contraction phases for an Erdős-Renyi graph with vertices and edges per core. (One-level-)pointer doubling using the built-in MPI_Alltoallv routine directly, exhibits significantly increasing running times with a growing number of cores whereas our two-level approach scales very well.
VI-B Local Preprocessing Optimizations
We enhance the modified Borůvka variant used for the local preprocessing step with a recursive edge-filtering approach as in Filter-Borůvka. Our multithreaded implementation uses the Min-Priority-Write-approach for minimum edge computation (as well as some other building blocks) from a fast shared-memory MST algorithm [15]. We apply the preprocessing only if at least 10% of the edges are local.
In preliminary experiments, we found that local preprocessing considerably reduces the number of vertices leaving many parallel edges. Instead of sorting all edges to remove parallel ones, we determine a pivot weight such that the set of edges lighter than is small. The edges from are then inserted into a hash table omitting their weight. In a subsequent scan over we can filter out all edges that are in . We then only have to sort the remaining edges and remove parallel ones therein in a final scan. This variant outperforms the pure sorting approach by up to a factor of 2.5 if the hash table remains small enough to fit into the cache.
VI-C Further Remarks
Regarding distributed sorting we use distributed hypercube quicksort [10] if the average number of elements to sort per PE is below . For larger inputs we use our own implementation of distributed two-level sample sort (similar to AMS-sort [46, 10]) applying the hypercube algorithm to sort the samples.
In order to be able to output the original source and destination vertices of an MST edge, we add an id to every edge prior to the actual MST computation. The id of an MST edge is then looked up in a copy of the initial edge list. As main memory on compute cluster nodes is notoriously scarce, this copy is stored with -bit variable length encoding on the differences of consecutive vertices. Note that the time for encoding is not included in our experiments, however, we account for decoding the compressed edge list twice (before and after the actual MST computation).333Also note that encoding takes at most 30% of the running time and that our competitors in Section VII do not produce a similarly prepared output.
For switching to the base case in our distributed Borůvka algorithm, we use the maximum of two times the number of MPI processes and . In distributed Filter-Borůvka, we stop the recursive partitioning and employ our distributed Borůvka algorithm as base case when the average degree of the graph is four or less. We also refrain from further partitioning if the graph is too small (less than 1000 edges per MPI processes). Furthermore, if the number of filtered edges is too small, these are not processed directly but propagated back to the previous recursion level and merged with the heavy edges there.
VII Experiments
We now discuss the experimental evaluation of our algorithms. An implementation is available at https://github.com/mschimek/kamsta. We compare our algorithms (boruvka and filterBoruvka) against two state-of-the-art competitors: sparseMatrix by Baer et al. [37] and MND-MST by Panja et al. [19]. All algorithms are written in C++ and compiled with g++ version 10.2.0 using optimizations -O3 and -march=native. We use OpenMPI version 4.0.4 for interprocess communication. All algorithms use multithreading with OpenMP.
SparseMatrix adapts the Awerbuch-Shiloach PRAM algorithm[2] to the distributed setting using linear algebra primitives and leveraging Cyclops, a library for generalized sparse tensor algebra. They use a 2D-partitioning scheme where the graph’s adjacency matrix is divided into blocks which are then distributed among the PEs. The source code is publicly available444https://github.com/raghavendrak/algebraicMSF. MND-MST is a multinode GPU-CPU algorithm, which also comprises a CPU-only version. It uses Borůvka’s algorithm to compute local MST edges and to contract the incident vertices. Afterwards, fixed size groups of PEs exchange parts of the previously contracted vertices and iteratively apply Borůvka’s algorithm on their local input. Once a threshold on the size of the reduced graph is reached, all group members send their contracted graphs to the group leader. Then, the whole process starts again with only the group leaders performing computations. As in our algorithms, they use 1D-partitioning. However, they do not share vertices beyond process boundaries which can lead to load imbalances for graphs with very skewed degree distributions. The source code was provided by the authors directly. However, it seems to deviate somewhat from the algorithm described in the paper [19].
For our evaluation, we mainly555We also conducted experiments on the smaller HoreKa supercomputer. As the results obtained there are in line with our findings from SuperMUC-NG, we omit them due to space limitations. use the SuperMUC-NG supercomputer. The thin node cluster consists of 6 336 compute nodes with a total of 304 128 cores. A compute nodes is equipped with an Intel Xeon Platinum 8174 processor with 48 cores and 96 GByte of main memory and runs SUSE Linux Enterprise Server 15 SP3. The nodes are connected via an OmniPath network with 100 Gbit/s bandwidth.
Instances and Methodology
For our strong scaling experiments, we use the six real-world graphs listed in Table I. The number of (symmetric, directed) edges of these graphs ranges from 57 millions to 123 billions. In our weak scaling experiments, we use instances of six different graph families generated with KaGen666https://github.com/sebalamm/KaGen [53] and a fast RMAT generator [54]. We use two-dimensional grids (2D-GRID), two- and three-dimensional random geometric graphs (2D/3D-RGG), random hyperbolic graphs (RHG), Erdős-Renyi graphs (GNM) and RMAT graphs.
RGGs are constructed by placing vertices uniformly at random in the unit square (unit cube for 3D) and each MPI process is assigned a part thereof. Vertices are connected if the Euclidean distance is below a threshold . RHG construction is conceptually similar, as vertices are placed on a disk with radius that depends on the average degree and power-law exponent , where the disk is again evenly divided among the MPI processes. Two vertices are adjacent, if the (hyperbolic) distance is smaller than . In Erdős-Renyi graphs, each edge is inserted with a probability given as an input parameter. RMAT graphs are generated by recursively partition the adjacency matrix. An edge is inserted as soon as a partition size is reached. We use the default probabilities from the Graph500 benchmark.
Grids and RGGs have a high degree of locality. GNMs and RMAT graphs consist almost exclusively of cut-edges. RHGs are somewhere in between. RHGs (for which we use power-law exponent ) and RMAT graphs possess a power-law degree distribution.
All graphs are scaled such that the number of vertices and number of (symmetric, directed) edges are proportional to the number of cores used (). For RGG/GNM the threshold distance/edge probability is chosen accordingly. KaGen ensures that the generated edges are globally lexicographically sorted and thus do not produce shared vertices for the input. Regarding the RMAT generator [54], we first globally sort the generated edges and then redistribute them equally over all PEs. For MND-MST, the edges incident to a shared vertex are moved completely to one MPI process to meet their input format. Following the experimental setup in [37], we assign a weight drawn uniformly at random from to each edge. On every benchmark set, all algorithms are run at least three times with one warm-up round which we discard. As the variances in running time are usually very low, we report only the mean running time.
VII-A Weak Scaling
Fig. 3 shows the throughput (measured in edges per second) achieved in our weak scaling experiments with vertices and edges per core on up to cores. Due to the high running time we ran our competitors only up to cores to save computation time. MND-MST crashed beyond cores on all instances. On the grid and RMAT instances, sparseMatrix also crashed beyond and respectively. Our two algorithms (boruvka and filterBoruvka) outperform the competitors clearly on all instances.
Especially on graphs with high locality, we achieve speedups of two orders of magnitude over our competitors peaking at a factor for grid graphs. For sparseMatrix we attribute this to the fact that the algorithm does not exploit locality. Furthermore, their 2D-partitioning makes exploiting graph locality more challenging from a structural point of view as only the processors on the diagonal of the matrix possess local edges. MND-MST is faster on these inputs as its design aims at (and relies on) using locality in graphs. We believe that its scalability problems are to some extent due to weaknesses in its implementation which is not designed to scale to several thousands of cores.
For GNM and RMAT, local preprocessing is not effective as there are only very few local edges. Nevertheless, we are up to times faster than our competitors. Moreover, we see – especially for GNM – the effectiveness of our filter approach being up to times faster than our non-filter variant. In additional weak scaling experiments on denser graphs with edges per core, which we omit due to space limitations, this effect is even stronger. For graphs with high locality, our filter approach performs slightly worse since the locally contracted graphs are significantly smaller than the input and the overhead introduced by the filtering step does not pay off.
Our 8-thread variants are faster and scale better than their 1-thread counterparts for highly local graphs. This is as expected as each MPI process obtains a larger part of the graph which enables a more effective local contraction yielding a smaller remaining graph. Surprisingly, the 1-thread variant is noticeably faster for GNM (and on some configurations for RMAT). More detailed measurements reveal that one reason for this is the time spent within MPI_Alltoallv exchanges. Here, we seem to be limited by the single-threaded execution within MPI.
Overall, our algorithms prove to be scalable on up to cores even on RMAT graphs that are notorious for their highly skewed degree distribution which often imposes scalability problems. We attribute our good scalability on RMAT instances to our edge-based 1D-partitioning as it implicitly splits (very) high degree vertices (and their incident edges) in shared vertices over multiple processors preventing load imbalances.
Fig. 4 shows our algorithms with disabled local preprocessing on grids and random geometric/hyperbolic graphs with vertices and edges per core. We can see that local contraction makes our algorithms up to times faster. Moreover, it turns out that the filtering approach is also beneficial for graphs with many local edges if the graph size is not too small. This indicates that on machines with more memory per compute node, which would enable us to tackle larger graphs, filtering could also be beneficial on these graph families with local preprocessing enabled.
Fig. 6 shows the normalized running time distribution of the different phases of our four variants (boruvka-{1,8}, filterBoruvka-{1,8}). We see that for 3D-RGG (prototypical for the other graphs with high locality) a considerable amount of time is consumed by the local preprocessing. For GNM and RMAT, the local preprocessing time is negligible as these graphs possess many cut-edges and we skip this step after a quick check if the number of cut-edges exceeds . For these two graph families, most of the running time is spent in label exchange and the redistribution of the edges. The running time distribution of Filter-Borůvka shows that the time spent in these communication intense phases can be significantly reduced with filtering, which in turn becomes dominant for GNM and RMAT. We further see that by successfully applying our two-level all-to-all, the time spent for pointer doubling during component contraction does only contribute a minor factor to the running time for all graphs. This also justifies our focus on per-edge computation in the analysis (Section IV-E).
VII-B Strong Scaling
We now discuss the strong scaling experiments presented in Fig. 5. Due to memory limitations, our competitors could not process all graphs on all configurations. Our own algorithms are able to process all graphs on to processors except for wdc-14 for which we also need at least cores. Our algorithms exhibit good scalability and are to times faster than our competitors, which also scale worse for all graphs but US-road. For US-road, which is relatively small with only 57 million edges, we achieve our best running time for cores.
Furthermore, we see that the 8-thread variants tend to outperform their 1-thread counterparts with increasing number of cores used. This is not surprising as the number of vertices and edges per MPI process decreases and the latency-overhead introduced is less and less compensated by local work and communication volume.
For the social instances, our filtering approach tends to be faster than our non-filter algorithm. For all other graphs, our non-filter approach performs better.
VII-C Comparisons with Shared-Memory Algorithms
For their algorithm MASTIFF, Esfahani et al. [18] provide measurements on a -core shared-memory server with TB main memory for twitter, friendster, US-road and wdc-14. Comparing their running times for the first three graphs with the fastest of our algorithms for each graph on cores ( compute nodes with a total of GB main memory) yields an average speedup of MASTIFF over our algorithms of . From cores on, we are faster on friendster and US-road. For twitter, we need cores. Due to memory limitations, we need cores ( compute nodes with TB main memory) to be able to process wdc-14 in seconds while MASTIFF processes this graph in seconds on their machine. Although the evaluation setting is not identical, this rough comparison indicates that our algorithms are only a modest factor slower than state-of-the-art shared-memory algorithms.
VIII Conclusion and Future Work
We have demonstrated that MSTs on huge networks can be computed on large supercomputers in a scalable way. Improvements over previous approaches are typically one or two orders of magnitude, approaching three orders of magnitude for some large configurations. We achieve that by maximizing local computations and by using efficient primitives for global communication like two-phase sparse all-to-all and scalable sorters. We see this as an important step in a larger effort to obtain efficient massively parallel graph algorithms on a larger range of problems.
While this was generally successful, our results also demonstrate several challenges that may also apply to other graph problems. Perhaps most obviously, global communication remains the main bottleneck which might be mitigated by further refining the used primitives and by devising new algorithms that use less global synchronization. From a more theoretical perspective we also want algorithms that reduce the gap between theory and practice, e.g., by giving practical implementations with asymptotic worst-case performance at least as good as using PRAM emulation.
More surprisingly, we observe a complicated tradeoff regarding the most efficient number of threads per process. Arriving at an implementation of the shared-memory parts that efficiently use all hardware threads of a compute node would significantly improve also overall scalability since it enables more local contraction and more coarse-grained communication. Since similar effects show up in other problems we study, this seems to be an important and nontrivial area of further investigation.
From a more practical perspective, our approach of a sophisticated, relatively low-level implementation for one of the most basic graph problems makes sense for basic research on parallel algorithms but is untenable for more complex problems. Transferring our techniques in a generic way into more high-level tools like sparse matrix libraries or graph tools therefore seems an important direction of further research in order to reduce the observed performance and scalability gap.
Interestingly, single Borůvka rounds are also an important part of more sophisticated MST algorithms with better performance guarantees like the expected linear time algorithm[13] and the related PRAM algorithm[3]. Therefore, we believe that the algorithmic building blocks developed in this work can also be of interest for distributed implementations of such more complex MST algorithms.
Acknowledgment
The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer SuperMUC-NG at Leibniz Supercomputing Centre (www.lrz.de). This work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] P. Sanders and M. Schimek, “Engineering massively parallel MST algorithms,” in 37th Int. Parallel and Distributed Processing Symposium, (IPDPS) . IEEE, 2023, pp. 691–701.
- 2[2] B. Awerbuch and Y. Shiloach, “New connectivity and MSF algorithms for shuffle-exchange network and PRAM,” IEEE Transactions on Computers , vol. 36, no. 10, pp. 1258–1263, 1987.
- 3[3] R. Cole, P. N. Klein, and R. E. Tarjan, “Finding minimum spanning forests in logarithmic time and linear work using random sampling,” in 8th Symposium on Parallel Algorithms and Architectures (SPAA) . ACM, 1996, pp. 243–250.
- 4[4] M. Bateni, S. Behnezhad, M. Derakhshan et al. , “Affinity clustering: Hierarchical clustering at scale,” Advances in Neural Information Processing Systems , vol. 30, 2017.
- 5[5] J. Wassenberg, W. Middelmann, and P. Sanders, “An efficient parallel algorithm for graph-based image segmentation,” in Conf. on Computer Analysis of Images and Patterns . Springer, 2009, pp. 1003–1010.
- 6[6] N. Li, J. C. Hou, and L. Sha, “Design and analysis of an mst-based topology control algorithm,” IEEE Transactions on wireless communications , vol. 4, no. 3, pp. 1195–1206, 2005.
- 7[7] O. Borůvka, “O jistém problému minimálním,” Práce moravské přírodovědecké společnosti 6 , 1926.
- 8[8] V. Osipov, P. Sanders, and J. Singler, “The filter-kruskal minimum spanning tree algorithm,” in 11th Workshop on Algorithm Engineering and Experiments (ALENEX) . SIAM, 2009, pp. 52–61.
