State-of-The-Art Sparse Direct Solvers
Matthias Bollh\"ofer, Olaf Schenk, Radim Janal\'ik, Steve, Hamm, Kiran Gullapalli

TL;DR
This paper reviews modern sparse direct solvers, highlighting advances in preprocessing, parallelization, and dense submatrix handling that significantly improve efficiency for circuit simulation problems.
Contribution
It provides an overview of recent developments in sparse elimination methods, emphasizing preprocessing, parallelism, and dense submatrix detection techniques.
Findings
Enhanced preprocessing improves diagonal dominance and reduces fill-in.
Parallel processing accelerates sparse elimination.
Detection of dense submatrices enables efficient dense kernel computations.
Abstract
In this chapter we will give an insight into modern sparse elimination methods. These are driven by a preprocessing phase based on combinatorial algorithms which improve diagonal dominance, reduce fill-in, and improve concurrency to allow for parallel treatment. Moreover, these methods detect dense submatrices which can be handled by dense matrix kernels based on multithreaded level-3 BLAS. We will demonstrate for problems arising from circuit simulation, how the improvements in recent years have advanced direct solution methods significantly.
| Matrix | N | nnz | nnz | Selected | |
|---|---|---|---|---|---|
| circuit5M_DC | 3,523,317 | 19,194,193 | 2.87 | 82.3 h. | 1.3 s. |
| circuit5M | 5,558,326 | 59,524,291 | 1.04 | 371.1 h. | 2.1 s. |
| Freescale | 3,428,755 | 18,920,347 | 2.94 | 89.8 h. | 1.0 s. |
| Freescale2 | 2,999,349 | 23,042,677 | 2.92 | 8.5 h. | 1.2 s. |
| FullChip | 2,987,012 | 26,621,990 | 7.41 | 162.9 h. | 11.9 s. |
| memchip | 2,707,524 | 14,810,202 | 4.40 | 62.5 h. | 0.9 s. |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Data Storage Technologies · Electromagnetic Scattering and Analysis · Parallel Computing and Optimization Techniques
11institutetext: Matthias Bollhöfer 22institutetext: Institute for Computational Mathematics, TU Braunschweig, Germany, 22email: [email protected] 33institutetext: Olaf Schenk 44institutetext: Institute of Computational Science, Faculty of Informatics, Università della Svizzera italiana, Switzerland, 44email: [email protected] 55institutetext: Radim Janalík 66institutetext: Institute of Computational Science, Faculty of Informatics, Università della Svizzera italiana, Switzerland, 66email: [email protected] 77institutetext: Steve Hamm 88institutetext: NXP, United States of America, 88email: [email protected] 99institutetext: Kiran Gullapalli 1010institutetext: NXP, United States of America, 1010email: [email protected]
State-of-The-Art Sparse Direct Solvers
Matthias Bollhöfer
Olaf Schenk
Radim Janalík
Steve Hamm
Kiran Gullapalli
Abstract
In this chapter we will give an insight into modern sparse elimination methods. These are driven by a preprocessing phase based on combinatorial algorithms which improve diagonal dominance, reduce fill–in and improve concurrency to allow for parallel treatment. Moreover, these methods detect dense submatrices which can be handled by dense matrix kernels based on multi-threaded level–3 BLAS. We will demonstrate for problems arising from circuit simulation how the improvement in recent years have advanced direct solution methods significantly.
1 Introduction
Solving large sparse linear systems is at the heart of many application problems arising from engineering problems. Advances in combinatorial methods in combination with modern computer architectures have massively influenced the design of state-of-the-art direct solvers that are feasible for solving larger systems efficiently in a computational environment with rapidly increasing memory resources and cores. Among these advances are novel combinatorial algorithms for improving diagonal dominance which pave the way to a static pivoting approach, thus improving the efficiency of the factorization phase dramatically. Besides, partitioning and reordering the system such that a high level of concurrency is achieved, the objective is to simultaneously achieve the reduction of fill-in and the parallel concurrency. While these achievements already significantly improve the factorization phase, modern computer architectures require one to compute as many operations as possible in the cache of the CPU. This in turn can be achieved when dense subblocks that show up during the factorization can be grouped together into dense submatrices which are handled by multithreaded and cache-optimized dense matrix kernels using level-3 BLAS and LAPACK AndBBDDDGHMOS95 .
This chapter will review some of the basic technologies together with the latest developments for sparse direct solution methods that have led to state-of-the-art decomposition methods. The paper is organized as follows. In Section 2 we will start with maximum weighted matchings which is one of the key tools in combinatorial optimization to dramatically improve the diagonal dominance of the underlying system. Next, Section 3 will review multilevel nested dissection as a combinatorial method to reorder a system symmetrically such that fill-in and parallelization can are improved simultaneously, once pivoting can be more or less ignored. After that, we will review established graph-theoretical approaches in Section 0.4, in particular the elimination tree, from which most of the properties of the factorization can be concluded. Among these properties is the prediction of dense submatrices in the factorization. In this way several subsequent columns of the factors and are collected in a single dense block. This is the basis for the use of dense matrix kernels using optimized level-3 BLAS as well to exploit fast computation using the cache hierarchy which is discussed in Section 0.5. Finally we will demonstrate in Section 0.6, for examples from circuit simulation, how the ongoing developments in sparse direct solution methods have accelerated state-of-the-art solution techniques. We assume that the reader is familiar with some elementary knowledge from graph theory, see; e.g., DufER86 ; GeoL81 and some simple computational algorithms based on graphs AhoHU83 .
2 Maximum weight matching
In modern sparse elimination methods the key to success is ability to work with efficient data structures and their underlying numerical templates. If we can increase the size of the diagonal entries as much as possible in advance, pivoting during Gaussian elimination can often be bypassed and we may work with static data structures and the numerical method will be significantly accelerated. A popular method to achieve this goal is the maximum weight matching method DufK99S ; olschowka:1996 which permutes (e.g.) the rows of a given nonsingular matrix by a permutation matrix such that has a non-zero diagonal. Moreover, it maximizes the product of the absolute diagonal values and yields diagonal scaling matrices such that satisfies and for all . The original idea on which these nonsymmetric permutations and scalings are based is to find a maximum weighted matching of a bipartite graphs. Finding a maximum weighted matching is a well known assignment problem in operation research and combinatorial analysis.
Definition 1
A graph with vertices and edges is called bipartite if can be partitioned into two sets and , such that no edge has both ends in or both ends in . In this case we denote by .
Definition 2
Given a matrix , then we can associate with it a canonical bipartite graph by assigning the labels of with the row indices of and being labeled by the column indices. In this case is defined via .
For the bipartite graph we see immediately that If , then we have that from the row set is connected by an edge to the column , but neither rows are connected with each other nor do the columns have inter connections.
Definition 3
A matching of a given graph is a subset of edges such that no two of which share the same vertex.
If is a matching of a bipartite graph , then each edge corresponds to a row and a column and there exists no other edge that has the same vertices, neither nor .
Definition 4
A matching of is called maximal, if no other edge from can be added to .
If, for an matrix a matching of with maximum cardinality is found, then by definition the edges must be with being the numbers in a suitable order and therefore we obtain , …. In this case we have established that the matrix is at least structurally nonsingular and we can use a row permutation matrix associated with row ordering to place a nonzero entry on each diagonal location of .
Definition 5
A perfect matching is a maximal matching with cardinality .
It can be shown that for a structurally nonsingular matrix there always exists a perfect matching .
Example 1
Perfect Matching In Figure 1, the set of edges represents a perfect maximum matching of the bipartite graph .
The most efficient combinatorial methods for finding maximum matchings in bipartite graphs make use of an augmenting path. We will introduce some graph terminology for the construction of perfect matchings.
Definition 6
If an edge in a graph joins a vertices , then we denote it as . A path then consists of edges , where each , .
If is a bipartite graph, then by definition of a path, any path is alternating between the vertices of and , e.g., pathes in could be such as .
Definition 7
Given a graph , a vertex is called free if it is not incident to any other edge in a matching of . An alternating path relative to a matching is a path where its edges are alternating between and . An augmenting path relative to a matching is an alternating path of odd length and both of it vertex endpoints are free.
Example 2
Augmenting Path Consider Figure 1. To better distinguish between row and column vertices we use \framebox{1},\framebox{2},\dots,\framebox{6} for the rows and \bigcirc\mkern-12.0mu$$1\;,\bigcirc\mkern-12.0mu$$2\;,…,\bigcirc\mkern-12.0mu$$6\; for the columns. A non-perfect but maximal matching is given by M=\{(\framebox{4},$$\bigcirc\mkern-12.0mu$$5\;$$),(\framebox{1},$$\bigcirc\mkern-12.0mu$$1\;$$),(\framebox{6},$$\bigcirc\mkern-12.0mu$$2\;$$),(\framebox{2},$$\bigcirc\mkern-12.0mu$$6\;$$),(\framebox{5},$$\bigcirc\mkern-12.0mu$$4\;$$)\}. We can easily see that an augmenting path alternating between rows and columns is given by 3$$\bigcirc\mkern-12.0mu$$5\; , \bigcirc\mkern-12.0mu$$5\;$$4 , 4$$\bigcirc\mkern-12.0mu$$1\; , \bigcirc\mkern-12.0mu$$1\;$$1 , 1$$\bigcirc\mkern-12.0mu$$2\; , \bigcirc\mkern-12.0mu$$2\;$$6 , 6$$\bigcirc\mkern-12.0mu$$6\; , \bigcirc\mkern-12.0mu$$6\;$$2 , 2$$\bigcirc\mkern-12.0mu$$4\; , \bigcirc\mkern-12.0mu$$4\;$$5 , 5$$\bigcirc\mkern-12.0mu$$3\;. Both endpoints and \bigcirc\mkern-12.0mu$$3\; of this augmenting path are free.
In a bipartite graph one vertex endpoint of any augmenting path must be in whereas the other one must be in . The symmetric difference, of two edge sets , is defined to be .
Using these definitions and notations, the following theorem Berge gives an constructive algorithm for finding perfect matchings in bipartite graphs.
Theorem 2.1
If is non-maximum matching of a bipartite graph , then there exists an augmenting path relative to such that and is a matching with cardinality .
According to this theorem, a combinatorial method of finding perfect matching in a bipartite graph is to seek for augmenting paths.
The perfect matching as discussed so far only takes the nonzero structure of the matrix into account. For their use as static pivoting methods prior to the decomposition one requires in addition to maximize the absolute value of the product of the diagonal entries. This is referred to as maximum weighted matching. In this case a permutation has to be found, which maximizes
[TABLE]
The maximization of this product is transferred into a minimization of a sum as follows. We define a matrix via
[TABLE]
where is the maximum element in row of matrix . A permutation which minimizes the sum
[TABLE]
also maximizes the product (1). The minimization problem is known as linear-sum assignment problem or bipartite weighted matching problem in combinatorial optimization. The problem is solved by a sparse variant of the Hungarian method. The complexity is for sparse matrices with entries. For matrices, whose associated graph fulfill special requirements, this bound can be reduced further to with . All graphs arising from finite-difference or finite element discretizations meet the conditions gupta:99 . As before, we finally get a perfect matching which in turn defines a nonsymmetric permutation.
When solving the assignment problem, two dual vectors and are computed which satisfy
[TABLE]
Using the exponential function these vectors can be used to scale the initial matrix. To do so define two diagonal matrices and through
[TABLE]
Using equations (2) and (3) and the definition of , it immediately follows that satisfies
[TABLE]
The permuted and scaled system has been observed to have significantly better numerical properties when being used for direct methods or for preconditioned iterative methods, cf. e.g. benzi:2000:phi ; DufK99S . Olschowka and Neumaier olschowka:1996 introduced these scalings and permutation for reducing pivoting in Gaussian elimination of full matrices. The first implementation for sparse matrix problems was introduced by Duff and Koster DufK99S . For symmetric matrices , these nonsymmetric matchings can be converted to a symmetric permutation and a symmetric scaling such that consists mostly of diagonal blocks of size and satisfying a similar condition as (6) and (7), where in practice it rarely happens that blocks are identical to [math] dupr:04a . Recently, successful parallel approaches to compute maximum weighted matchings have been proposed LanPM11 ; LanAM14 .
Example 3
Maximum Weight Matching To conclude this section we demonstrate the effectiveness of maximum weight matchings using a simple sample matrix “west0479” from the SuiteSparse Matrix Collection. The matrix can also directly be loaded in Matlab using load west0479. In Figure 2 we display the matrix before and after applying maximum weighted matchings. To illustrate the improved diagonal dominance we further compute for each row of and , . can be read as relative diagonal dominance of row and yields a number between [math] and . Moreover, whenever , the row is strictly diagonal dominant, i.e., . In Figure 2 we display for both matrices by sorting its values in increasing order and taking as reference line. We can see the dramatic impact of maximum weighted matchings in improving the diagonal dominance of the given matrix and thus paving its way to a static pivoting approach in incomplete or complete decomposition methods.
3 Symbolic symmetric reordering techniques
When dealing with large sparse matrices a crucial factor that determines the computation time is the amount of fill tat is produced during the factorization of the underlying matrix. To reduce the complexity there exist many mainly symmetric reordering techniques that attempt to reduce the fill–in heuristically. Here we will demonstrate only one of these methods, the so–called nested dissection method. The main reason for selecting this method is that it can be easily used for parallel computations.
3.1 Multilevel nested dissection
Recursive multilevel nested dissection methods for direct decomposition methods were firstly introduced in the context of multiprocessing. If parallel direct methods are used to solve a sparse systems of equations, then a graph partitioning algorithm can be used to compute a fill reducing ordering that leads to a high degree of concurrency in the factorization phase.
Definition 8
For a matrix we define the associated (directed) graph , where and the set of edges . The (undirected) graph is given by and is denoted simply by .
In graph terminology for a sparse matrix we simply have a directed edge for any nonzero entry in whereas the orientation of the edge is ignored in .
The research on graph-partitioning methods in the mid-nineteens has resulted in high-quality software packages, e.g. Metis karypis:98 . These methods often compute orderings that on the one hand lead small fill–in for (incomplete) factorization methods while on the other hand they provide a high level of concurrency. We will briefly review the main idea of multilevel nested dissection in terms graph-partitioning.
Definition 9
Let and consider its graph . A -way graph partitioning consists of partitionining into disjoint subsets such that for . The subset is called edge separator.
Typically we want a -way partitioning to be balanced, i.e., each should satisfy . The edge separator refers to the edges that have to be taken away from the graph in order to have separate subgraphs associated with and the number of elements of is usually referred to as edge-cut.
Definition 10
Given , a vertex separator of is a set of vertices such that there exists a -way partitioning of having no edge for .
A useful vertex separator should not only separate into independent subgraps associated with , it is intended that the numbers of edges is also small.
Nested dissection recursively splits a graph into almost equal parts by constructing a vertex separator until the desired number of partitionings are obtained. If is a power of , then a natural way of obtaining a vertex separator is to first obtain a -way partitioning of the graph, a so called graph bisection with its associated edge separator . After that a vertex separator is computed from , which gives a -way partitioning of . This process is then repeated separately for the subgraphs associated with until eventually a -way partitioning is obtained. For the reordering of the underlying matrix , the vertices associated with are taken first followed by and . This reordering is repeated similarly during repeated bisection of each . In general, vertex separators of small size result in low fill-in.
Example 4
Vertex Separators To illustrate vertex separators, we consider the reordered matrix from Figure 1 after a matching is applied. In Figure 4 we display its graph ignoring the orientation of the edges. A -way partitioning is obtained with , and a vertex separator . The associated reordering refers to taking the rows and the columns of in the order .
Since a naive approach to compute a recursive graph bisection is typically computationally expensive, combinatorial multilevel graph bisection has been used to accelerate the process. The basic structure is simple. The multilevel approach consists of three phases: at first there is a coarsening phase which compresses the given graph successively level by level by about half of its size. When the coarsest graph with about a few hundred vertices is reached, the second phase, namely the so–called bisection is applied. This is a high quality partitioning algorithm. After that, during the uncoarsening phase, the given bisection is successively refined as it is prolongated towards the original graph.
Coarsening Phase
The initial graph of is transformed during the coarsening phase into a sequence of graphs of decreasing size such that . Given the graph , the next coarser graph is obtained from by collapsing adjacent vertices. This can be done e.g. using a maximal matching of (cf. Definitions 3 and 4). Using , the next coarser graph is constructed from collapsing the vertices being matched into multinodes, i.e., the elements of together with the unmatched vertices of become the new vertices of . The new edges are the remaining edges from connected with the collapsed vertices. There are various differences in the construction of maximal matchings karypis:98 ; CheP08 . One of the most popular and efficient methods is heavy edge matching karypis:98 .
Partitioning Phase
At the coarsest level , a -way partitioning of is computed, each of them containing about half of the vertices of . This specific partitioning of can be obtained by using various algorithms such as spectral bisection fiedler:75 or combinatorial methods based on Kernighan-Lin variants KerL70 ; FidM97 . It is demonstrated in karypis:98 that for the coarsest graph, combinatorial methods typically compute smaller edge-cut separators compared with spectral bisection methods. However, since the size of the coarsest graph is small (typically , this step is negligible with respect to the total amount of computation time.
Uncoarsening Phase
Suppose that at the coarsest level , an edge separator of associated with the -way partitioning has been computed that has lead to a sufficient edge-cut of with , of almost equal size. Then is prolongated to by reversing the process of collapsing matched vertices. This leads to an initial edge separator for . But since is finer, is sub-optimal and one usually decreases the edge-cut of the partitioning by local refinement heuristics such as the Kernighan–Lin partitioning algorithm KerL70 or the Fiduccia–Mattheyses method FidM97 . Repeating this refinement procedure level–by-level we obtain a sequence of edge separators and eventually and edge separator of the initial graph is obtained. If one is seeking for a vertex separator of , then one usually computes from at the end.
There have been a number of methods that are used for graph partitioning, e.g. Metis karypis:98 , a parallel MPI version ParMetis KarSK99 , or a recent multithreaded approach MT-MetisLasK13 . Another example for a parallel partitioning algorithm is ScotchCheP08 .
Example 5
Multilevel Nested Dissection We will continue Example 3 using the matrix that has been rescaled and permuted using maximum weight matching. We illustrate in Figure 3.1 how multilevel nested dissection changes the pattern , where refers to the permutation matrix associated with the partitioning of .
Application of multilevel nested dissection after the matrix is already rescaled and permuted using maximum weight matching.
0.3.2 Other reordering methods
One of the first methods to reorder the system was the reverse Cuthill–McKee (Rcm) methods cm:69 ; LiuS76 which attempts to reduce the bandwidth of a given matrix. Though this algorithm is still attractive for sequential methods and incomplete factorization methods, its use for direct solvers is considered as obsolete. An attractive alternative to nested dissection as reordering method for direct factorization methods is the minimum degree algorithm (Mmd) Ros72 ; GeoL89 and its recent variants, in particular the approximate minimum degree algorithm (Amd) AmeDD96 ; Dav06 with or without constraints. The main objective of the minimum degree algorithm is to simulate the Gaussian elimination process symbolically by investigating the update process by means of graph theory, at least in the case of the undirected graph. The name-giving degree refers to the number of edges connected to a vertex and how the graph and therefore the degrees of its vertices change during the factorization process. Over the years this has lead to an evolution of the underlying minimum degree algorithm using the so-called external degree for selecting vertices as pivots and further techniques like incomplete degree update, element absorption and multiple elimination as well as data structures based on cliques. For an overview see GeoL89 . One of the most costly parts in the minimum degree algorithm is to update of the degrees. Instead of computing the exact external degree, in the approximate minimum degree algorithm AmeDD96 an approximate external degree is computed that significantly saves time while producing comparable fill in the decomposition. We like to conclude this section mentioning that if nested dissection is computed to produce a vertex separator and a related -way partitioning for the remaining vertices of of which allow for parallel computations, then the entries of each , could be taken in any order. Certainly, inside one could use nested dissection as well, which is the default choice in multilevel nested dissection methods. However, ass soon as the coarsest graph is small enough (typically about vertices), not only the separator is computed, but in addition the remaining entries of are reordered to lead to a fill-reducing ordering. In both cases, for as well as one could alternatively use different reordering methods such as variants of the minimum degree algorithm. Indeed, for this is what the Metissoftware is doing. Furthermore, a reordering method such as the constrained approximate minimum degree algorithm is also suitable as local reordering for as alternative to nested dissection, taking into account the edges connected with (also referred to as HALO structure), see e.g. PelRA00 .
0.4 Sparse Decomposition
In this section we will assume that the given matrix is nonsingular and that it can be factorized as , where is a lower triangular matrix with unit diagonal and is an upper triangular matrix. It is well–known GeoL81 , if , where and are lower triangular matrices, then in the generic case we will have , i.e., we will only get additional edges unless some entries cancel by “accident” during the elimination. In the sequel we will ignore cancellations. Throughout this section we will always assume that the diagonal entries of are nonzero as well. We also assume that is connected. In the preceding sections we have argued that maximum weight matching often leads to a rescaled and reordered matrix such that static pivoting is likely to be enough, i.e., pivoting is restricted to some dense blocks inside the factorization. Furthermore, reordering strategies such as multilevel nested dissection have further symmetrically permuted the system such that the fill–in that occurs during Gaussian elimination is acceptable and even parallel approaches could be drawn from this reordering. Thus assuming that does not need further reordering and a factorization exists is a realistic scenario in what follows.
0.4.1 The Elimination Tree
The basis of determining the fill-in in the triangular factors and as by-product of the Gaussian elimination can be characterized as follows (see Gil94 in the references therein).
Theorem 0.4.1
Given with the aforementioned assumptions, there exists an edge in if and only if there exists a path
[TABLE]
in such that .
In other words, during Gaussian elimination we obtain a fill edge for every path from to through vertices less than .
Example 6
Fill–in We will use the matrix from Example 4 and sketch the fill–in obtained during Gaussian elimination in Figure 0.4.1.
Fill-in with respect to is denoted by .
A problem in general is to predict the filled graph and the fastest known method to compute it, is Gaussian elimination. The situation simplifies if the graph is undirected. In the sequel we ignore the orientation of the edges and simply consider the undirected graph and , respectively.
Definition 11
The undirected graph that is derived from the undirected graph by applying Theorem 0.4.1 is called the filled graph and it will be denoted by .
Example 7
Fill-in with respect to the undirected graph When we consider the undirected graph in Example 6, the pattern of and its filled graph now equals (cf. Figure 0.4.1).
Entries of are denoted by .
The key tool to predict the fill–in easily for the undirected graph is the elimination tree Liu90 . Recall that an undirected and connected graph is called a tree, if it does not contain any cycle. Furthermore, one vertex is identified as root. As usual we call a vertex parent of , if there exists an edge in the tree such that is closer to the root. In this case is called child of . The subtree rooted at vertex is denoted by and the vertices of this subtree are called descendants of whereas is called their ancestor. Initially we will define the elimination tree algorithmically using the depth–first–search algorithm AhoHU83 . Later we will state a much simplified algorithm.
Definition 12
Given the filled graph the elimination tree is defined by the following algorithm.
Perform a depth–first–search in starting from vertex .
When vertex is visited, choose from its unvisited neighbors the index with the largest number and continue the search with .
A leaf of the tree is reached, when all neighbors have already been visited.
We like to point out that the application of the depth–first–search to starting at vertex behaves significantly different from other graphs. By Theorem 0.4.1 it follows that as soon as we visit a vertex , all its neighbors must have been visited prior to vertex . Thus the labels of the vertices are strictly decreasing until we reach a leaf node.
Example 8
Depth–first–search We illustrate the depth–first–search using the filled graph in Figure 5 and the pattern from Example 7. The extra fill edge is marked by the bold line. The ongoing depth–first search visits the vertices in the order . Since at vertex , all neighbors of are visited (and indeed have a larger number), the algorithm backtracks to and continues the search in the order . Again all neighbors of vertex are visited (and have larger number), thus the algorithm backtracks to and to and continues by . Then the algorithm terminates. Note that vertex is isolated and if the graph of is not connected one has to proceed for each connected component separately.
Remark 1
It follows immediately from the construction of and Theorem 0.4.1 that additional edges of which are not covered by the elimination tree can only show up between a vertex and some of its ancestors (referred to as “back–edges”). In contrast to that, “cross–edges” between unrelated vertices do not exist.
Remark 2
One immediate consequence of Remark 1 is that triangular factors can be computed independently starting from the leaves until the vertices meet a common parent, i.e., column of and only depend on those columns of and such that is a descendant of in the elimination tree .
Example 9
Elimination tree We use the matrix “west0479” from Example 5, after maximum weight matching and multilevel nested dissection have been applied. We use Matlab’s etreeplot to display its elimination tree (see Figure 6). The elimination tree displays the high level of concurrency that is induced by nested dissection, since by Remark 2 the computations can be executed independently at each leaf node towards to the root until a common parent vertex is reached.
Further conclusions can be easily derived from the elimination tree, in particular Remark 2 in conjunction with Theorem 0.4.1.
Remark 3
Consider some . Then there exists a (fill) edge with if and only if there exists a common descendant of in such that . This follows from the fact that once , by Theorem 0.4.1 this induces (fill) edges in the filled graph for all nodes between and in the elimination tree , i.e., for all ancestors of that are also descendents of . This way, propagates fill–edges along the branch from to in and the information can be used as path compression to advance from towards to along the elimination tree.
Example 10
Path compression Consider the graph and the elimination tree from Figure 5. Since there exists the edge in , therefore another (fill) edge must exist (here not a fill edge, but a regular edge). Similarly, the same conclusion can be drawn from the existence of the edge . Via we can advance from vertex to vertex bypassing vertex .
The elimination tree itself can be easily described by a vector of length such that for any , denotes the parent node while corresponds to the root. Consider some step with , for some . By Remark 3, must be a descendent of and there could be further ancestors of which are also descendents of . Possibly not all ancestors of have been assigned a parent node so far. Thus we can replace by until we end up with or . This way we traverse from towards to until we have found the child node of . If the parent of has not been assigned to yet, then and must be the parent of . If some were the parent of , then we would have assigned as parent of in an earlier step . In this case we set . Otherwise, if , then we have already assigned ’s parent in an earlier step .
Example 11
Computation of parent nodes Consider the elimination tree from Figure 5. Unless , no parents have been assigned, i.e. for all . Now for we have and using the fact that implies that we have to set . Next, and again requires to set . Finally, if , we have , we advance from to and since we set . Next, and imply that we have to set . will traverse from to to , which requires no further setting. Last but not least, requires no further action, since we already have . In total we have which perfectly reveals the parent properties of the elimination trees in Figure 5.
By Remark 3 (cf. Tar83 ; Dav06 ), we can also make use of path compression. Since our goal is to traverse the branch of the elimination tree from to as fast as possible, any ancestor of would be sufficient. With the same argument as before, an ancestor would refer to a vertex that does not have a parent yet. In this case we can again set . Moreover, is always an ancestor of . The algorithm including path compression can be summarized as follows (see also Liu90 ; Dav06 ). {programcode}Computation of the elimination tree
1: such that has the same pattern as .
2:vector such that is the parent of , , except .
3:let be an auxiliary vector used for path compression.
4:
5:for do
6: for all such that do
7: while and do
8:
9:
10: if then
11:
12: end if
13:
14: end while
15: end for
16:end for
0.4.2 The supernodal approach
We have already seen that the elimination tree reveals information about concurrency. It is further useful to determine the fill– in and . This information can be computed from the elimination tree together with . The basis for determining the fill–in in each column is again Remark 3. Suppose we are interested in the nonzero entries of column of and . Then for all decedents of , i.e. the nodes of the subtree rooted at vertex , a nonzero entry also implies . Thus, starting at any leaf , we obtain its fill by all such that and when we move forward from to its parent , vertex will inherit the fill from node for all plus the nonzero entries given by such that . When we reach a common parent node with multiple children, the same argument applies using the union of fill–in greater than from its children together with the nonzero entries such that . We summarize this result in a very simple algorithm {programcode}Computation of fill–in
1: such that has the same pattern as .
2:sparse strict lower triangular pattern with same pattern as , .
3:compute parent array of the elimination tree
4:for do
5: supplement nonzeros of column of with all such that
6:
7: if then
8: supplement nonzeros of column of with nonzeros of column of greater than
9: end if
10:end for
Algorithm 0.4.2 only deals with the fill pattern. One additional aspect that allows to raise efficiency and to speed up the numerical factorization significantly is to detect dense submatrices in the factorization. Block structures allow to collect parts of the matrix in dense blocks and to treat them commonly using dense matrix kernels such as level–3 BLAS and LAPACK DodL85 ; DonDHH88 . Dense blocks can be read off from the elimination tree employing Algorithm 0.4.2.
Definition 13
Denote by the nonzero indices of column of as computed by Algorithm 0.4.2. A sequence is called supernode of size if the columns of for all .
In simple words, Definition 13 states that for a supernode subsequent columns can be grouped together in one dense block with a triangular diagonal block and a dense subdiagonal block since they perfectly match the associated trapezoidal shape. We can thus easily supplement Algorithm 0.4.2 with a supernode detection. {programcode}Computation of fill–in and supernodes
1: such that has the same pattern as .
2:sparse strict lower triangular pattern with same pattern as , as well as column size of each supernode.
3:compute parent array of the elimination tree
4:
5:for do
6: supplement nonzeros of column of with all such that
7: denote by the number of entries in column of
8: if and and then
9: continue current supernode
10: else
11: , , start new supernode
12: end if
13:
14: if then
15: supplement nonzeros of column of with nonzeros of column of greater than
16: end if
17:end for
Example 12
Supernode Computation To illustrate the use of supernodes, we consider the matrix pattern from Figure 0.4.1 and illustrate the underlying dense block structure in Figure 0.4.2. Supernodes are the columns , , as scalar columns as well as columns – as one single supernode.
Supernodes in the triangular factor.
Supernodes form the basis of several improvements, e.g., a supernode can be stored as one or two dense matrices. Beside the storage scheme as dense matrices, the nonzero row indices for these blocks need only be stored once. Next the use of dense submatrices allows the usage of dense matrix kernels using level–3 BLAS DodL85 ; DonDHH88 .
Example 13
Supernodes We use the matrix “west0479” from Example 5, after maximum weight matching and multilevel nested dissection have been applied. We use its undirected graph to compute the supernodal structure. Certainly, since the matrix is nonsymmetric, the block structure is only sub-optimal. We display the supernodal structure for the associated Cholesky factor, i.e., for the Cholesky factor of an symmetric positive definite matrix with same undirected graph as our matrix (see left part of Figure 7). Furthermore, we display the supernodal structure for the factors and computed from the nonsymmetric matrix without pivoting (see right part of Figure 7).
While the construction of supernodes is fairly easy in the symmetric case, its generalization to the general case is significantly harder, since one has to deal with pivoting in each step of Gaussian elimination. In this case one uses the column elimination tree GeoN85 .
0.5 Sparse Direct Solvers — Supernodal Data Structures
High-performance sparse solver libraries have been a very important part of scientific and engineering computing for years, and their importance continues to grow as microprocessor architectures become more complex and software libraries become better designed to integrate easily within applications. Despite the fact that there are various science and engineering applications, the underlying algorithms typically have remarkable similarities, especially those algorithms that are most challenging to implement well in parallel. It is not too strong a statement to say that these software libraries are essential to the broad success of scalable high-performance computing in computational sciences. In this section we demonstrate the benefit of supernodal data structures within the sparse solver package PARDISO schenk-2004 . We illustrate it by using the triangular solution process. The forward and backward substitution is performed column wise with respect to the columns of , starting with the first column, as depicted in Figure 8. The data dependencies here allow to store vectors , , , and in only one vector . When column is reached, contains the solution for . All other elements of in this column, i. e. with , are used to update the remaining entries in by
[TABLE]
The backward substitution with will take place row wise, since we use and perform the substitution column wise with respect to , as shown in the lower part of Figure 8. In contrast to the forward substitution the iteration over columns starts at the last column and proceeds to the first one. If column is reached, then , which contains the -component of the solution vector , is computed by subtracting the dot-product of the remaining elements in the column and the corresponding elements of with from it:
[TABLE]
After all columns have been processed contains the required solution of . It is important to note that line 5 represents in both substitutions an indexed DAXPY and indexed DDOT kernel operations that has to be computed during the streaming operations of the vector and the column of the numerical factor . As we are dealing with sparse matrices it makes no sense to store the lower triangular matrix as a dense matrix. Hence PARDISO uses its own data structure to store , as shown in Figure 9.
Adjacent columns exhibiting the same row sparsity structure form a panel, also known as supernode. A panel’s column count is called the panel size . The columns of a panel are stored consecutively in memory excluding the zero entries. Note that columns of panels are padded in the front with zeros so they get the same length as the first column inside their panel. The padding is of utmost performance for the PARDISO solver to use Level-3 BLAS and LAPACK functionalities 20.500.11850/144477 . Furthermore panels are stored consecutively in the l array. Row and column information is now stored in accompanying arrays. The xsuper array stores for each panel the index of its first column. Also note that here column indices are the running count of nonzero columns. Column indices are used as indices into xl array to lookup the start of the column in the l array which contains the numerical values of the factor . To determine the row index of a column’s element an additional array id is used, which holds for each panel the row indices. The start of a panel inside id is found via xid array. The first row index of panel is id[xid[p]]. For serial execution this information is enough. However, during parallel forward/backward substitution concurrent updates to the same entry of r must be avoided. The parts structure contains the start (and end) indices of the panels which can be updated independently as they do not touch the same entries of . Two parts, colored blue and orange, are shown in Figure 9. The last part in the bottom right corner of is special and is called the separator and is colored green. Parts which would touch entries of r in the range of the separator perform their updates into separate temporary arrays t. Before the separator is then serially updated, the results of the temporary arrays are gathered back into r. The backward substitution works the same, just reversed and only updates to different temporary arrays are not required. The complete forward substitution and backward substitution is listed in Algorithm 1 and 2.
0.6 Application – Circuit Simulation
In this section we demonstrate how these developments in sparse direct linear solvers have advanced integrated circuit simulations. Integrated circuits are composed of interconnected transistors. The interconnects are modeled primarily with resistors, capacitors, and inductors. The interconnects route signals through the circuit, and also deliver power. Circuit equations arise out of Kirchhoff’s current law, applied at each node, and are generally nonlinear differential-algebraic equations. In transient simulation of the circuit, the differential portion is handled by discretizing the time derivative of the node charge by an implicit integration formula. The associated set of nonlinear equations is handled through use of quasi-Newton methods or continuation methods, which change the nonlinear problem into a series of linear algebraic solutions. Each component in the circuit contributes only to a few equations. Hence the resulting systems of linear algebraic equations are extremely sparse, and most reliably solved by using direct sparse matrix techniques. Circuit simulation matrices are peculiar in the universe of matrices, having the following characteristics davis:klu :
- •
they are unsymmetric, although often nearly structurally symmetric;
- •
they have a few dense rows and columns (e.g., power and ground connections);
- •
they are very sparse and the straightforward usage of BLAS routines (as in SuperLUsuperlu ) may be ineffective;
- •
their LU factors remain sparse if well-ordered;
- •
they can have high fill-in if ordered with typical strategies;
- •
and being unstructured, the highly irregular memory access causes factorization to proceed only at a few percent of the peak flop-rate.
Circuit simulation matrices also vary from being positive definite to being extremely ill-conditioned, making pivoting for stability important also. As circuit size increases, and depending on how much of the interconnect is modeled, sparse matrix factorization is the dominant cost in the transient analysis. To overcome the complexity of matrix factorization a new class of simulators arose in the 1990s, called fast-SPICE Rewienski2011APO . These simulators partition the circuit into subcircuits and use a variety of techniques, including model order reduction and multirate integration, to overcome the matrix bottleneck. However, the resulting simulation methods generally incur unacceptable errors for analog and tightly coupled circuits. As accuracy demands increase, these techniques become much slower than traditional SPICE methods. Even so, since much of the research effort was directed at fast-SPICE simulators, it brought some relief from impossibly slow simulations when some accuracy trade-off was acceptable. Because these simulators partitioned the circuit, and did not require the simultaneous solution of the entire system of linear equations at any given time, they did not push the state-of-the-art in sparse matrix solvers. Starting in the mid-2000s, increasing demands on accuracy, due to advancing semiconductor technology, brought attention back to traditional SPICE techniques. This was aided by the proliferation of multicore CPUs. Parallel circuit simulation, an area of much research focus in the 1980s and 1990s, but not particularly in practice, received renewed interest as a way to speed up simulation without sacrificing accuracy. Along with improved implementations to avoid cache misses, rearchitecture of code for parallel computing, and better techniques for exploitation of circuit latency, improved sparse matrix solvers, most notably the release of KLU davis:klu , played a crucial role in expanding the utility of SPICE. Along with the ability to simulate ever larger circuits with full SPICE accuracy came the opportunity to further improve sparse matrix techniques. A sparse matrix package for transient simulation needs to have the following features:
- •
must be parallel;
- •
fast matrix reordering
- •
incremental update of the and factors when only a few nonzeros change;
- •
fast computation of the diagonal entries of the inverse matrix;
- •
fast computation of Schur-complements for a submatrix;
- •
allow for multiple factors of the same structure to be stored;
- •
use the best-in-class method across the spectrum of sparsity;
- •
use iterative solvers with fast construction of sparse preconditioners;
- •
run on various hardware platforms (e.g. GPU acceleration).
Some of these features must be available in a single package. Others, such as iterative solvers and construction of preconditioners, can be implemented with a combination of different packages. The PARDISO solver111The PARDISO solver is available from http://www.pardiso-project.org. stands out as a package that does most of these very well. Here we touch on a few of these features. When applied in the simulation of very large circuits, the difference between a “good” and a “bad” matrix ordering can be the difference between seconds and days. PARDISO offers AMD and nested-dissection methods for matrix ordering, as well as permitting user-defined ordering. Because the matrix re-ordering method which has been used most often in circuit simulation is due to Markowitz markowitz , and because modern sparse matrix packages do not include this ordering method, we briefly describe it here. The Markowitz method is quite well-adapted for circuit simulation. Some desirable aspects of the typical implementation of the Markowitz method, as opposed to the MD variants, are that it works for nonsymmetric matrices and combines pivot choice with numerical decomposition, such that a pivot choice is a numerically “good” pivot which generates in a local sense the least fill-in at that step of the decomposition. Choosing pivots based on the Markowitz score often produces very good results: near-minimal fill-in, unfortunately at the cost of an algorithm (for dense blocks). Even though the Markowitz algorithm has some good properties when applied to circuit matrices, the complexity of the algorithm has become quite burdensome. When SPICE nagel:spice2 was originally conceived, a hundred-node circuit was huge and the Markowitz algorithm was not a problem. Now we routinely see netlists with hundreds of thousands of nodes and postlayout netlists with millions of elements. As matrix order and element counts increase, Markowitz reordering time can become an obstruction. Even as improved implementations of the Markowitz method have extended its reach, AMD and nested-dissection have become the mainstay of simulation of large denser-than-usual matrices. Next we turn our attention to parallel performance. While KLU remains a benchmark for serial solvers, for parallel solvers, MKL-PARDISO is often cited as the benchmark Booth2017 ; Chen2013 . To give the reader a sense of the progress in parallel sparse matrix methods, in Figure 10 we compare KLU, PARDISO (Version 6.2) to MKL-PARDISO on up to 16 cores on an Intel Xeon E7-4880 architecture with 2.5 GHz processors.
Some of the matrices here can be obtained from the SuiteSparse Matrix Collection, and arise in transistor level full-chip and memory array simulations. It is clear that implementation of sparse matrix solvers has improved significantly over the years. Exploiting latency in all parts of the SPICE algorithm is very important in enabling accurate circuit simulation, especially as the circuit size increases. By latency, we mean that only a few entries in the matrix change from one
Newton iteration to the next, and from one timepoint to the next. As the matrix depends on the time-step, some simulators hold the time-steps constant as much as feasible to allow increased reuse of matrix factorizations. The nonzero entries of a matrix change only when the transistors and other nonlinear devices change their operation point. In most circuits, very few devices change state from one iteration to the next and from one time-step to the next. Nonzeros contributed by entirely linear components do not change value during the simulation. This makes incremental LU factorization a very useful feature of any matrix solver used in circuit simulation. As of April 2019 the version PARDISO 6.2 has a very efficient exploitation of incremental LU factorization, both serial and parallel. In Figure 11 we show that PARDISO scales linearly with number of updated columns, and also scales well with number of cores. Here, the series of matrices were obtained from a full simulation of a post-layout circuit that includes all interconnects, power- and ground-networks). The factorization time is plotted against the number of columns that changed compared to the previous factorization. The scatter plot shows the number of rank- update and the corresponding factorization time in milliseconds. The regression analysis clearly demonstrates a linear trend both for the single and the multiple core versions. The dashed line shows the time for the full factorization. Another recent useful feature in PARDISO is parallel selective inverse matrix computation as demonstrated in Table 1. In circuit simulation, the diagonal of the inverse matrix is the driving point impedance. It is often required to flag nodes in the circuit with very high driving point impedance. Such nodes would indicate failed interfaces between different subcircuits, leading to undefined state and high current leakage and power dissipation. A naive approach to this is to solve for the driving point impedance, the diagonal of the inverse matrix, by triangular solves. This is sometimes unacceptably expensive even with exploiting the sparsity of the right hand side, and minimizing the number of entries needed in the diagonal of the inverse. To bypass this complexity, heuristics to compute the impedance of connected components are used. But this is error prone with many false positives and also false negatives. In the circuit Freescale, PARDISO, e.g., finished the required impedance calculations in 11.9 seconds compared to the traditional computation that consumed 162.9 hours.
The productivity gap in simulation continues to grow, and challenges remain. Signoff simulations demand 10X speedup in sparse matrix factorization. Simply using more cores does not help unless the matrices are very large and complex. For a majority of simulations, scaling beyond 8 cores is difficult. As a result, some of these simulations can take a few months to complete, making them essentially impossible. Some of the problems in parallelizing sparse matrix operations for circuit simulation are fundamental. Others may be related to implementation. Research on sparse matrix factorization for circuit simulation continues to draw attention, especially in the area of acceleration with Intel’s many integrated core (MIC) architecture Booth2017 and GPUs Chen2015 ; Nakhla2018 . Other techniques for acceleration include improved preconditioners for iterative solvers Feng2015 . We are presently addressing the need for runtime selection of optimal strategies for factorization, and also GPU acceleration. Given that circuits present a wide spectrum of matrices, no matter how we categorize them, it is possible to obtain a solver that is 2–10X better on a given problem. Improvements in parallel sparse matrix factorization targeted at circuit simulation is more necessary today than ever and will continue to drive applicability of traditional SPICE simulation methods. Availability of sparse matrix packages such as PARDISO that completely satisfy the needs of various circuit simulation methods is necessary for continued performance gains.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] A. Aho, J. Hopcroft, and J. Ullman. Data structures and algorithms . Addison-Wesley, 1983.
- 2[2] P. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl. , 17(4):886–905, 1996.
- 3[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. D. Croz, A. Greenbaum, S. Hammarling, A. Mc Kenney, S. Ostrouchov, and D. Sorensen. LAPACK Users’ Guide, Second Edition . SIAM Publications, 1995.
- 4[4] M. Benzi, J. Haws, and M. Tuma. Preconditioning highly indefinite and nonsymmetric matrices. SIAM J. Sci. Comput. , 22(4):1333–1353, 2000.
- 5[5] C. Berge. Two theorems in graph theory. In Proceedings of National Academy of Science , pages 842–844, USA, 1957.
- 6[6] J. D. Booth, N. D. Ellingwoodb, and S. R. Heidi K. Thornquist. Basker: Parallel sparse lu factorization utilizing hierarchical parallelism and data layouts. Parallel Computing , 68:17–31, 2017.
- 7[7] X. Chen, L. Ren, Y. Wang, and H. Yang. Gpu-accelerated sparse lu factorization for circuit simulation with performance modeling. IEEE Tr Ansactions on Parallel and Distributed Systems , 26:786–795, 2015.
- 8[8] X. Chen, Y. Wang, and H. Yang. Nicslu: An adaptive sparse matrix solver for parallel circuit simulation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , 32:261–274, 2013.
