State-of-The-Art Sparse Direct Solvers

Matthias Bollh\"ofer; Olaf Schenk; Radim Janal\'ik; Steve; Hamm; Kiran Gullapalli

arXiv:1907.05309·cs.DS·July 12, 2019

State-of-The-Art Sparse Direct Solvers

Matthias Bollh\"ofer, Olaf Schenk, Radim Janal\'ik, Steve, Hamm, Kiran Gullapalli

PDF

Open Access

TL;DR

This paper reviews modern sparse direct solvers, highlighting advances in preprocessing, parallelization, and dense submatrix handling that significantly improve efficiency for circuit simulation problems.

Contribution

It provides an overview of recent developments in sparse elimination methods, emphasizing preprocessing, parallelism, and dense submatrix detection techniques.

Findings

01

Enhanced preprocessing improves diagonal dominance and reduces fill-in.

02

Parallel processing accelerates sparse elimination.

03

Detection of dense submatrices enables efficient dense kernel computations.

Abstract

In this chapter we will give an insight into modern sparse elimination methods. These are driven by a preprocessing phase based on combinatorial algorithms which improve diagonal dominance, reduce fill-in, and improve concurrency to allow for parallel treatment. Moreover, these methods detect dense submatrices which can be handled by dense matrix kernels based on multithreaded level-3 BLAS. We will demonstrate for problems arising from circuit simulation, how the improvements in recent years have advanced direct solution methods significantly.

Tables1

Table 1. Table 1: Details of the benchmark matrices. ’N’ is the number of matrix rows and ’nnz’ is the number of nonzeros. The table shows the fill-in factor related to the numbers of nonzerors in L + U A 𝐿 𝑈 𝐴 \frac{L+U}{A} , the time for computing all diagonal elements of the inverse A − 1 superscript 𝐴 1 A^{-1} using N 𝑁 N multiple forward/backward substitution in hours, and using the selected inverse method in PARDISO for computing all diagonal elements of the inverse A − 1 superscript 𝐴 1 A^{-1} in seconds.

Matrix	N	nnz $(A)$	nnz $(\frac{L + U}{A})$	$A^{- 1}$	Selected $A^{- 1}$
circuit5M_DC	3,523,317	19,194,193	2.87	82.3 h.	1.3 s.
circuit5M	5,558,326	59,524,291	1.04	371.1 h.	2.1 s.
Freescale	3,428,755	18,920,347	2.94	89.8 h.	1.0 s.
Freescale2	2,999,349	23,042,677	2.92	8.5 h.	1.2 s.
FullChip	2,987,012	26,621,990	7.41	162.9 h.	11.9 s.
memchip	2,707,524	14,810,202	4.40	62.5 h.	0.9 s.

Equations21

i = 1 \prod n ∣ a_{π (i) i} ∣.

i = 1 \prod n ∣ a_{π (i) i} ∣.

c_{ij} = {lo g a_{i} - lo g ∣ a_{ij} ∣ \infty a_{ij} \neq = 0 otherwise,

c_{ij} = {lo g a_{i} - lo g ∣ a_{ij} ∣ \infty a_{ij} \neq = 0 otherwise,

i = 1 \sum n c_{π (i) i}

i = 1 \sum n c_{π (i) i}

u_{i} + v_{j}

u_{i} + v_{j}

u_{i} + v_{j}

D_{r}

D_{r}

D_{c}

∣ \tilde{a}_{ii} ∣

∣ \tilde{a}_{ii} ∣

∣ \tilde{a}_{ij} ∣

i x_{1}, x_{2} x_{3}, \dots, x_{k} j

i x_{1}, x_{2} x_{3}, \dots, x_{k} j

r_{i} = r_{i} - r_{j} L_{ij} .

r_{i} = r_{i} - r_{j} L_{ij} .

r_{j} = r_{j} - r_{i} L_{ij} .

r_{j} = r_{j} - r_{i} L_{ij} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Data Storage Technologies · Electromagnetic Scattering and Analysis · Parallel Computing and Optimization Techniques

Full text

11institutetext: Matthias Bollhöfer 22institutetext: Institute for Computational Mathematics, TU Braunschweig, Germany, 22email: [email protected] 33institutetext: Olaf Schenk 44institutetext: Institute of Computational Science, Faculty of Informatics, Università della Svizzera italiana, Switzerland, 44email: [email protected] 55institutetext: Radim Janalík 66institutetext: Institute of Computational Science, Faculty of Informatics, Università della Svizzera italiana, Switzerland, 66email: [email protected] 77institutetext: Steve Hamm 88institutetext: NXP, United States of America, 88email: [email protected] 99institutetext: Kiran Gullapalli 1010institutetext: NXP, United States of America, 1010email: [email protected]

State-of-The-Art Sparse Direct Solvers

Matthias Bollhöfer

Olaf Schenk

Radim Janalík

Steve Hamm

Kiran Gullapalli

Abstract

In this chapter we will give an insight into modern sparse elimination methods. These are driven by a preprocessing phase based on combinatorial algorithms which improve diagonal dominance, reduce fill–in and improve concurrency to allow for parallel treatment. Moreover, these methods detect dense submatrices which can be handled by dense matrix kernels based on multi-threaded level–3 BLAS. We will demonstrate for problems arising from circuit simulation how the improvement in recent years have advanced direct solution methods significantly.

1 Introduction

Solving large sparse linear systems is at the heart of many application problems arising from engineering problems. Advances in combinatorial methods in combination with modern computer architectures have massively influenced the design of state-of-the-art direct solvers that are feasible for solving larger systems efficiently in a computational environment with rapidly increasing memory resources and cores. Among these advances are novel combinatorial algorithms for improving diagonal dominance which pave the way to a static pivoting approach, thus improving the efficiency of the factorization phase dramatically. Besides, partitioning and reordering the system such that a high level of concurrency is achieved, the objective is to simultaneously achieve the reduction of fill-in and the parallel concurrency. While these achievements already significantly improve the factorization phase, modern computer architectures require one to compute as many operations as possible in the cache of the CPU. This in turn can be achieved when dense subblocks that show up during the factorization can be grouped together into dense submatrices which are handled by multithreaded and cache-optimized dense matrix kernels using level-3 BLAS and LAPACK AndBBDDDGHMOS95 .

This chapter will review some of the basic technologies together with the latest developments for sparse direct solution methods that have led to state-of-the-art $LU$ decomposition methods. The paper is organized as follows. In Section 2 we will start with maximum weighted matchings which is one of the key tools in combinatorial optimization to dramatically improve the diagonal dominance of the underlying system. Next, Section 3 will review multilevel nested dissection as a combinatorial method to reorder a system symmetrically such that fill-in and parallelization can are improved simultaneously, once pivoting can be more or less ignored. After that, we will review established graph-theoretical approaches in Section 0.4, in particular the elimination tree, from which most of the properties of the $LU$ factorization can be concluded. Among these properties is the prediction of dense submatrices in the factorization. In this way several subsequent columns of the factors $L$ and $U^{T}$ are collected in a single dense block. This is the basis for the use of dense matrix kernels using optimized level-3 BLAS as well to exploit fast computation using the cache hierarchy which is discussed in Section 0.5. Finally we will demonstrate in Section 0.6, for examples from circuit simulation, how the ongoing developments in sparse direct solution methods have accelerated state-of-the-art solution techniques. We assume that the reader is familiar with some elementary knowledge from graph theory, see; e.g., DufER86 ; GeoL81 and some simple computational algorithms based on graphs AhoHU83 .

2 Maximum weight matching

In modern sparse elimination methods the key to success is ability to work with efficient data structures and their underlying numerical templates. If we can increase the size of the diagonal entries as much as possible in advance, pivoting during Gaussian elimination can often be bypassed and we may work with static data structures and the numerical method will be significantly accelerated. A popular method to achieve this goal is the maximum weight matching method DufK99S ; olschowka:1996 which permutes (e.g.) the rows of a given nonsingular matrix $A\in\mathbb{R}^{n,n}$ by a permutation matrix $\Pi\in\mathbb{R}^{n,n}$ such that $\Pi^{T}A$ has a non-zero diagonal. Moreover, it maximizes the product of the absolute diagonal values and yields diagonal scaling matrices $D_{r},D_{c}\in\mathbb{R}^{n,n}$ such that $\tilde{A}=\Pi^{T}D_{r}AD_{c}$ satisfies $|\tilde{a}_{ij}|\leqslant 1$ and $|\tilde{a}_{ii}|=1$ for all $i,j=1,\dots,n$ . The original idea on which these nonsymmetric permutations and scalings are based is to find a maximum weighted matching of a bipartite graphs. Finding a maximum weighted matching is a well known assignment problem in operation research and combinatorial analysis.

Definition 1

A graph $G=(V,E)$ with vertices $V$ and edges $E\subset V^{2}$ is called bipartite if $V$ can be partitioned into two sets $V_{r}$ and $V_{c}$ , such that no edge $e=(v_{1},v_{2})\in E$ has both ends $v_{1},v_{2}$ in $V_{r}$ or both ends $v_{1},v_{2}$ in $V_{c}$ . In this case we denote $G$ by $G_{b}=(V_{r},V_{c},E)$ .

Definition 2

Given a matrix $A$ , then we can associate with it a canonical bipartite graph $G_{b}(A)=(V_{r},V_{c},E)$ by assigning the labels of $V_{r}=\{r_{1},\dots,r_{n}\}$ with the row indices of $A$ and $V_{c}=\{c_{1},\dots,c_{n}\}$ being labeled by the column indices. In this case $E$ is defined via $E=\{(r_{i},c_{j})|\;a_{ij}\not=0\}$ .

For the bipartite graph $G_{b}(A)$ we see immediately that If $a_{ij}\not=0$ , then we have that $r_{i}\in V_{r}$ from the row set is connected by an edge $(r_{i},c_{j})\in E$ to the column $c_{j}\in V_{c}$ , but neither rows are connected with each other nor do the columns have inter connections.

Definition 3

A matching $\mathcal{M}$ of a given graph $G=(V,E)$ is a subset of edges $e\in E$ such that no two of which share the same vertex.

If $\mathcal{M}$ is a matching of a bipartite graph $G_{b}(A)$ , then each edge $e=(r_{i},c_{j})\in\mathcal{M}$ corresponds to a row $i$ and a column $j$ and there exists no other edge $\hat{e}=(r_{k},c_{l})\in\mathcal{M}$ that has the same vertices, neither $r_{k}=r_{i}$ nor $c_{l}=c_{j}$ .

Definition 4

A matching $\mathcal{M}$ of $G=(V,E)$ is called maximal, if no other edge from $E$ can be added to $\mathcal{M}$ .

If, for an $n\times n$ matrix $A$ a matching $\mathcal{M}$ of $G_{b}(A)$ with maximum cardinality $n$ is found, then by definition the edges must be $(i_{1},1),\dots,(i_{n},n)$ with $i_{1},\dots,i_{n}$ being the numbers $1,\dots,n$ in a suitable order and therefore we obtain $a_{i_{1},1}\not=0$ , … $a_{i_{n},n}\not=0$ . In this case we have established that the matrix $A$ is at least structurally nonsingular and we can use a row permutation matrix $\Pi^{T}$ associated with row ordering $i_{1},\dots,i_{n}$ to place a nonzero entry on each diagonal location of $\Pi^{T}A$ .

Definition 5

A perfect matching is a maximal matching with cardinality $n$ .

It can be shown that for a structurally nonsingular matrix $A$ there always exists a perfect matching $\mathcal{M}$ .

Example 1

Perfect Matching In Figure 1, the set of edges $\mathcal{M}=\{(1,2),(2,4),(3,5),(4,1),(5,3),(6,6)\}$ represents a perfect maximum matching of the bipartite graph $G_{b}(A)$ .

The most efficient combinatorial methods for finding maximum matchings in bipartite graphs make use of an augmenting path. We will introduce some graph terminology for the construction of perfect matchings.

Definition 6

If an edge $e=(u,v)$ in a graph $G=(V,E)$ joins a vertices $u,v\in V$ , then we denote it as $uv$ . A path then consists of edges $u_{1}u_{2},u_{2}u_{3},u_{3}u_{4}\ldots,u_{k-1}u_{k}$ , where each $(u_{i},u_{i+1})\in E$ , $i=1,\dots,k-1$ .

If $G_{b}=(V_{r},V_{c},E)$ is a bipartite graph, then by definition of a path, any path is alternating between the vertices of $V_{r}$ and $V_{c}$ , e.g., pathes in $G_{b}$ could be such as $r_{1}c_{2},c_{2}r_{3},r_{3}c_{4},\dots$ .

Definition 7

Given a graph $G=(V,E)$ , a vertex is called free if it is not incident to any other edge in a matching $\mathcal{M}$ of $G$ . An alternating path relative to a matching $\mathcal{M}$ is a path $P=u_{1}u_{2},u_{2}u_{3},\ldots,u_{s-1}u_{s}$ where its edges are alternating between $E\setminus\mathcal{M}$ and $\mathcal{M}$ . An augmenting path relative to a matching $\mathcal{M}$ is an alternating path of odd length and both of it vertex endpoints are free.

Example 2

Augmenting Path Consider Figure 1. To better distinguish between row and column vertices we use $\framebox{$ 1 $},\framebox{$ 2 $},\dots,\framebox{$ 6 $}$ for the rows and $\bigcirc\mkern-12.0mu$$1\;$ , $\bigcirc\mkern-12.0mu$$2\;$ ,…, $\bigcirc\mkern-12.0mu$$6\;$ for the columns. A non-perfect but maximal matching is given by $M=\{(\framebox{$ 4 $},$$\bigcirc\mkern-12.0mu$$5\;$$),(\framebox{$ 1 $},$$\bigcirc\mkern-12.0mu$$1\;$$),(\framebox{$ 6 $},$$\bigcirc\mkern-12.0mu$$2\;$$),(\framebox{$ 2 $},$$\bigcirc\mkern-12.0mu$$6\;$$),(\framebox{$ 5 $},$$\bigcirc\mkern-12.0mu$$4\;$$)\}$ . We can easily see that an augmenting path alternating between rows and columns is given by $3$$\bigcirc\mkern-12.0mu$$5\;$ , $\bigcirc\mkern-12.0mu$$5\;$$4$ , $4$$\bigcirc\mkern-12.0mu$$1\;$ , $\bigcirc\mkern-12.0mu$$1\;$$1$ , $1$$\bigcirc\mkern-12.0mu$$2\;$ , $\bigcirc\mkern-12.0mu$$2\;$$6$ , $6$$\bigcirc\mkern-12.0mu$$6\;$ , $\bigcirc\mkern-12.0mu$$6\;$$2$ , $2$$\bigcirc\mkern-12.0mu$$4\;$ , $\bigcirc\mkern-12.0mu$$4\;$$5$ , $5$$\bigcirc\mkern-12.0mu$$3\;$ . Both endpoints $3$ and $\bigcirc\mkern-12.0mu$$3\;$ of this augmenting path are free.

In a bipartite graph $G_{b}=(V_{r},V_{c},E)$ one vertex endpoint of any augmenting path must be in $V_{r}$ whereas the other one must be in $V_{c}$ . The symmetric difference, $A\oplus B$ of two edge sets $A$ , $B$ is defined to be $(A\setminus B)\cup(B\setminus A)$ .

Using these definitions and notations, the following theorem Berge gives an constructive algorithm for finding perfect matchings in bipartite graphs.

Theorem 2.1

If $\mathcal{M}$ is non-maximum matching of a bipartite graph $G_{b}=(V_{r},V_{c},E)$ , then there exists an augmenting path $P$ relative to $\mathcal{M}$ such that $P=\tilde{\mathcal{M}}\oplus\mathcal{M}$ and $\tilde{\mathcal{M}}$ is a matching with cardinality $|\mathcal{M}|+1$ .

According to this theorem, a combinatorial method of finding perfect matching in a bipartite graph is to seek for augmenting paths.

The perfect matching as discussed so far only takes the nonzero structure of the matrix into account. For their use as static pivoting methods prior to the $LU$ decomposition one requires in addition to maximize the absolute value of the product of the diagonal entries. This is referred to as maximum weighted matching. In this case a permutation $\pi$ has to be found, which maximizes

[TABLE]

The maximization of this product is transferred into a minimization of a sum as follows. We define a matrix $C=(c_{ij})$ via

[TABLE]

where $a_{i}=\max_{j}|a_{ij}|$ is the maximum element in row $i$ of matrix $A$ . A permutation $\pi$ which minimizes the sum

[TABLE]

also maximizes the product (1). The minimization problem is known as linear-sum assignment problem or bipartite weighted matching problem in combinatorial optimization. The problem is solved by a sparse variant of the Hungarian method. The complexity is $\mathcal{O}(n\tau\log n)$ for sparse matrices with $\tau$ entries. For matrices, whose associated graph fulfill special requirements, this bound can be reduced further to $\mathcal{O}(n^{\alpha}(\tau+n\log n)$ with $\alpha<1$ . All graphs arising from finite-difference or finite element discretizations meet the conditions gupta:99 . As before, we finally get a perfect matching which in turn defines a nonsymmetric permutation.

When solving the assignment problem, two dual vectors $u=(u_{i})$ and $v=(v_{i})$ are computed which satisfy

[TABLE]

Using the exponential function these vectors can be used to scale the initial matrix. To do so define two diagonal matrices $D_{r}$ and $D_{c}$ through

[TABLE]

Using equations (2) and (3) and the definition of $C$ , it immediately follows that $\tilde{A}=\Pi^{T}D_{r}AD_{c}$ satisfies

[TABLE]

The permuted and scaled system $\tilde{A}$ has been observed to have significantly better numerical properties when being used for direct methods or for preconditioned iterative methods, cf. e.g. benzi:2000:phi ; DufK99S . Olschowka and Neumaier olschowka:1996 introduced these scalings and permutation for reducing pivoting in Gaussian elimination of full matrices. The first implementation for sparse matrix problems was introduced by Duff and Koster DufK99S . For symmetric matrices $|A|$ , these nonsymmetric matchings can be converted to a symmetric permutation $P$ and a symmetric scaling $D_{s}=(D_{r}D_{c})^{1/2}$ such that $P^{T}D_{s}AD_{s}P$ consists mostly of diagonal blocks of size $1\times 1$ and $2\times 2$ satisfying a similar condition as (6) and (7), where in practice it rarely happens that $1\times 1$ blocks are identical to [math] dupr:04a . Recently, successful parallel approaches to compute maximum weighted matchings have been proposed LanPM11 ; LanAM14 .

Example 3

Maximum Weight Matching To conclude this section we demonstrate the effectiveness of maximum weight matchings using a simple sample matrix “west0479” from the SuiteSparse Matrix Collection. The matrix can also directly be loaded in Matlab using load west0479. In Figure 2 we display the matrix before and after applying maximum weighted matchings. To illustrate the improved diagonal dominance we further compute $r_{i}=|a_{ii}|/\sum_{j=1}^{n}|a_{ij}|$ for each row of $A$ and $\tilde{A}=\Pi^{T}D_{r}AD_{s}$ , $i=1,\dots,n$ . $r_{i}$ can be read as relative diagonal dominance of row $i$ and yields a number between [math] and $1$ . Moreover, whenever $r_{i}>\frac{1}{2}$ , the row is strictly diagonal dominant, i.e., $|a_{ii}|>\sum_{j:j\not=i}|a_{ij}|$ . In Figure 2 we display for both matrices $r_{i}$ by sorting its values in increasing order and taking $\frac{1}{2}$ as reference line. We can see the dramatic impact of maximum weighted matchings in improving the diagonal dominance of the given matrix and thus paving its way to a static pivoting approach in incomplete or complete $LU$ decomposition methods.

3 Symbolic symmetric reordering techniques

When dealing with large sparse matrices a crucial factor that determines the computation time is the amount of fill tat is produced during the factorization of the underlying matrix. To reduce the complexity there exist many mainly symmetric reordering techniques that attempt to reduce the fill–in heuristically. Here we will demonstrate only one of these methods, the so–called nested dissection method. The main reason for selecting this method is that it can be easily used for parallel computations.

3.1 Multilevel nested dissection

Recursive multilevel nested dissection methods for direct decomposition methods were firstly introduced in the context of multiprocessing. If parallel direct methods are used to solve a sparse systems of equations, then a graph partitioning algorithm can be used to compute a fill reducing ordering that leads to a high degree of concurrency in the factorization phase.

Definition 8

For a matrix $A\in\mathbb{R}^{n,n}$ we define the associated (directed) graph $G_{d}(A)=(V,E)$ , where $V=\{1,\dots,n\}$ and the set of edges $E=\left\{(i,j)|\,a_{ij}\not=0\right\}$ . The (undirected) graph is given by $G_{d}(|A|+|A|^{T})$ and is denoted simply by $G(A)$ .

In graph terminology for a sparse matrix $A$ we simply have a directed edge $(i,j)$ for any nonzero entry $a_{ij}$ in $G_{d}(A)$ whereas the orientation of the edge is ignored in $G(A)$ .

The research on graph-partitioning methods in the mid-nineteens has resulted in high-quality software packages, e.g. Metis karypis:98 . These methods often compute orderings that on the one hand lead small fill–in for (incomplete) factorization methods while on the other hand they provide a high level of concurrency. We will briefly review the main idea of multilevel nested dissection in terms graph-partitioning.

Definition 9

Let $A\in\mathbb{R}^{n,n}$ and consider its graph $G(A)=(V,E)$ . A $k$ -way graph partitioning consists of partitionining $V$ into $k$ disjoint subsets $V_{1},V_{2},\ldots,V_{k}$ such that $V_{i}\cap V_{j}=\emptyset$ for $i\neq j$ $\cup_{i}V_{i}=V$ . The subset $E_{s}=E\cap\bigcup_{i\not=j}(V_{i}\times V_{j})$ is called edge separator.

Typically we want a $k$ -way partitioning to be balanced, i.e., each $V_{i}$ should satisfy $|V_{i}|\approx n/k$ . The edge separator $E_{s}$ refers to the edges that have to be taken away from the graph in order to have $k$ separate subgraphs associated with $V_{1},\dots,V_{k}$ and the number of elements of $E_{s}$ is usually referred to as edge-cut.

Definition 10

Given $A\in\mathbb{R}^{n,n}$ , a vertex separator $V_{s}$ of $G(A)=(V,E)$ is a set of vertices such that there exists a $k$ -way partitioning $V_{1},V_{2},\ldots,V_{k}$ of $V\setminus V_{s}$ having no edge $e\in V_{i}\times V_{j}$ for $i\neq j$ .

A useful vertex separator $V_{s}$ should not only separate $G(A)$ into $k$ independent subgraps associated with $V_{1},\dots,V_{k}$ , it is intended that the numbers of edges $\cup_{i=1}^{k}|\{e_{is}\in V_{i},s\in V_{s}\}|$ is also small.

Nested dissection recursively splits a graph $G(A)=(V,E)$ into almost equal parts by constructing a vertex separator $V_{s}$ until the desired number $k$ of partitionings are obtained. If $k$ is a power of $2$ , then a natural way of obtaining a vertex separator is to first obtain a $2$ -way partitioning of the graph, a so called graph bisection with its associated edge separator $E_{s}$ . After that a vertex separator $V_{s}$ is computed from $E_{s}$ , which gives a $2$ -way partitioning $V_{1},V_{2}$ of $V\setminus V_{s}$ . This process is then repeated separately for the subgraphs associated with $V_{1},V_{2}$ until eventually a $k=2^{l}$ -way partitioning is obtained. For the reordering of the underlying matrix $A$ , the vertices associated with $V_{1}$ are taken first followed by $V_{2}$ and $V_{s}$ . This reordering is repeated similarly during repeated bisection of each $V_{i}$ . In general, vertex separators of small size result in low fill-in.

Example 4

Vertex Separators To illustrate vertex separators, we consider the reordered matrix $\Pi^{T}A$ from Figure 1 after a matching is applied. In Figure 4 we display its graph $G(\Pi^{T}A)$ ignoring the orientation of the edges. A $2$ -way partitioning is obtained with $V_{1}=\{1,3\}$ , $V_{2}=\{5,6\}$ and a vertex separator $V_{s}=\{2,4\}$ . The associated reordering refers to taking the rows and the columns of $\Pi^{T}A$ in the order $1,3,5,6,2,4$ .

Since a naive approach to compute a recursive graph bisection is typically computationally expensive, combinatorial multilevel graph bisection has been used to accelerate the process. The basic structure is simple. The multilevel approach consists of three phases: at first there is a coarsening phase which compresses the given graph successively level by level by about half of its size. When the coarsest graph with about a few hundred vertices is reached, the second phase, namely the so–called bisection is applied. This is a high quality partitioning algorithm. After that, during the uncoarsening phase, the given bisection is successively refined as it is prolongated towards the original graph.

Coarsening Phase

The initial graph $G_{0}=(V_{0},E_{0})=G(A)$ of $A\in\mathbb{R}^{n,n}$ is transformed during the coarsening phase into a sequence of graphs $G_{1},G_{2},\ldots,G_{m}$ of decreasing size such that $|V_{0}|\gg|V_{1}|\gg|V_{2}|\gg\cdots\gg|V_{m}|$ . Given the graph $G_{i}=(V_{i},E_{i})$ , the next coarser graph $G_{i+1}$ is obtained from $G_{i}$ by collapsing adjacent vertices. This can be done e.g. using a maximal matching $\mathcal{M}_{i}$ of $G_{i}$ (cf. Definitions 3 and 4). Using $\mathcal{M}_{i}$ , the next coarser graph $G_{i+1}$ is constructed from $G_{i}$ collapsing the vertices being matched into multinodes, i.e., the elements of $\mathcal{M}_{i}$ together with the unmatched vertices of $G_{i}$ become the new vertices $V_{i+1}$ of $G_{i+1}$ . The new edges $E_{i+1}$ are the remaining edges from $E_{i}$ connected with the collapsed vertices. There are various differences in the construction of maximal matchings karypis:98 ; CheP08 . One of the most popular and efficient methods is heavy edge matching karypis:98 .

Partitioning Phase

At the coarsest level $m$ , a $2$ -way partitioning $V_{m,1}\dot{\cup}V_{m,2}=V_{m}$ of $G_{m}=(V_{m},E_{m})$ is computed, each of them containing about half of the vertices of $G_{m}$ . This specific partitioning of $G_{m}$ can be obtained by using various algorithms such as spectral bisection fiedler:75 or combinatorial methods based on Kernighan-Lin variants KerL70 ; FidM97 . It is demonstrated in karypis:98 that for the coarsest graph, combinatorial methods typically compute smaller edge-cut separators compared with spectral bisection methods. However, since the size of the coarsest graph $G_{m}$ is small (typically $|V_{m}|<100)$ , this step is negligible with respect to the total amount of computation time.

Uncoarsening Phase

Suppose that at the coarsest level $m$ , an edge separator $E_{m,s}$ of $G_{m}$ associated with the $2$ -way partitioning has been computed that has lead to a sufficient edge-cut of $G_{m}$ with $V_{m,1}$ , $V_{m,2}$ of almost equal size. Then $E_{m,s}$ is prolongated to $G_{m-1}$ by reversing the process of collapsing matched vertices. This leads to an initial edge separator $E_{m-1,s}$ for $G_{m-1}$ . But since $G_{m-1}$ is finer, $E_{m-1,s}$ is sub-optimal and one usually decreases the edge-cut of the partitioning by local refinement heuristics such as the Kernighan–Lin partitioning algorithm KerL70 or the Fiduccia–Mattheyses method FidM97 . Repeating this refinement procedure level–by-level we obtain a sequence of edge separators $E_{m,s},E_{m-1,s},\dots,E_{0,s}$ and eventually and edge separator $E_{s}=E_{0,s}$ of the initial graph $G(A)$ is obtained. If one is seeking for a vertex separator $V_{s}$ of $G(A)$ , then one usually computes $V_{s}$ from $E_{s}$ at the end.

There have been a number of methods that are used for graph partitioning, e.g. Metis karypis:98 , a parallel MPI version ParMetis KarSK99 , or a recent multithreaded approach MT-MetisLasK13 . Another example for a parallel partitioning algorithm is ScotchCheP08 .

Example 5

Multilevel Nested Dissection We will continue Example 3 using the matrix $\tilde{A}=\Pi^{T}D_{r}AD_{s}$ that has been rescaled and permuted using maximum weight matching. We illustrate in Figure 3.1 how multilevel nested dissection changes the pattern $\hat{A}=P^{T}\tilde{A}P$ , where $P$ refers to the permutation matrix associated with the partitioning of $G(\tilde{A})$ .

Application of multilevel nested dissection after the matrix is already rescaled and permuted using maximum weight matching.

0.3.2 Other reordering methods

One of the first methods to reorder the system was the reverse Cuthill–McKee (Rcm) methods cm:69 ; LiuS76 which attempts to reduce the bandwidth of a given matrix. Though this algorithm is still attractive for sequential methods and incomplete factorization methods, its use for direct solvers is considered as obsolete. An attractive alternative to nested dissection as reordering method for direct factorization methods is the minimum degree algorithm (Mmd) Ros72 ; GeoL89 and its recent variants, in particular the approximate minimum degree algorithm (Amd) AmeDD96 ; Dav06 with or without constraints. The main objective of the minimum degree algorithm is to simulate the Gaussian elimination process symbolically by investigating the update process $a_{ij}\to a_{ij}-a_{ik}a_{kk}^{-1}a_{kj}$ by means of graph theory, at least in the case of the undirected graph. The name-giving degree refers to the number of edges connected to a vertex and how the graph and therefore the degrees of its vertices change during the factorization process. Over the years this has lead to an evolution of the underlying minimum degree algorithm using the so-called external degree for selecting vertices as pivots and further techniques like incomplete degree update, element absorption and multiple elimination as well as data structures based on cliques. For an overview see GeoL89 . One of the most costly parts in the minimum degree algorithm is to update of the degrees. Instead of computing the exact external degree, in the approximate minimum degree algorithm AmeDD96 an approximate external degree is computed that significantly saves time while producing comparable fill in the $LU$ decomposition. We like to conclude this section mentioning that if nested dissection is computed to produce a vertex separator $V_{s}$ and a related $k$ -way partitioning $V_{1},\dots,V-k$ for the remaining vertices of $V\setminus V_{s}$ of $G(A)=(V,E)$ which allow for parallel computations, then the entries of each $V_{i}$ , $i,\dots,k$ could be taken in any order. Certainly, inside $V_{i}$ one could use nested dissection as well, which is the default choice in multilevel nested dissection methods. However, ass soon as the coarsest graph $G_{m}$ is small enough (typically about $100$ vertices), not only the separator is computed, but in addition the remaining entries of $G_{m}$ are reordered to lead to a fill-reducing ordering. In both cases, for $G_{m}$ as well as $V_{1},\dots,V_{k}$ one could alternatively use different reordering methods such as variants of the minimum degree algorithm. Indeed, for $G_{m}$ this is what the Metissoftware is doing. Furthermore, a reordering method such as the constrained approximate minimum degree algorithm is also suitable as local reordering for $V_{1},\dots,V_{k}$ as alternative to nested dissection, taking into account the edges connected with $V_{s}$ (also referred to as HALO structure), see e.g. PelRA00 .

0.4 Sparse $LU$ Decomposition

In this section we will assume that the given matrix $A\in\mathbb{R}^{n,n}$ is nonsingular and that it can be factorized as $A=LU$ , where $L$ is a lower triangular matrix with unit diagonal and $U$ is an upper triangular matrix. It is well–known GeoL81 , if $A=LU$ , where $L$ and $U^{\top}$ are lower triangular matrices, then in the generic case we will have $G_{d}(L+U)\supset G_{d}(A)$ , i.e., we will only get additional edges unless some entries cancel by “accident” during the elimination. In the sequel we will ignore cancellations. Throughout this section we will always assume that the diagonal entries of $A$ are nonzero as well. We also assume that $G_{d}(A)$ is connected. In the preceding sections we have argued that maximum weight matching often leads to a rescaled and reordered matrix such that static pivoting is likely to be enough, i.e., pivoting is restricted to some dense blocks inside the $LU$ factorization. Furthermore, reordering strategies such as multilevel nested dissection have further symmetrically permuted the system such that the fill–in that occurs during Gaussian elimination is acceptable and even parallel approaches could be drawn from this reordering. Thus assuming that $A$ does not need further reordering and a factorization $A=LU$ exists is a realistic scenario in what follows.

0.4.1 The Elimination Tree

The basis of determining the fill-in in the triangular factors $L$ and $U$ as by-product of the Gaussian elimination can be characterized as follows (see Gil94 in the references therein).

Theorem 0.4.1

Given $A=LU$ with the aforementioned assumptions, there exists an edge $(i,j)$ in $G_{d}(L+U)$ if and only if there exists a path

[TABLE]

in $G_{d}(A)$ such that $x_{1},\dots,x_{k}<\min(i,j)$ .

In other words, during Gaussian elimination we obtain a fill edge $(i,j)$ for every path from $i$ to $j$ through vertices less than $\min(i,j)$ .

Example 6

Fill–in We will use the matrix $\Pi^{T}A$ from Example 4 and sketch the fill–in obtained during Gaussian elimination in Figure 0.4.1.

Fill-in with respect to $L+U$ is denoted by $\times$ .

A problem in general is to predict the filled graph $G_{d}(L+U)$ and the fastest known method to compute it, is Gaussian elimination. The situation simplifies if the graph is undirected. In the sequel we ignore the orientation of the edges and simply consider the undirected graph $G(A)$ and $G(L+U)$ , respectively.

Definition 11

The undirected graph $G(L+U)$ that is derived from the undirected graph $G(A)$ by applying Theorem 0.4.1 is called the filled graph and it will be denoted by $G_{f}(A)$ .

Example 7

Fill-in with respect to the undirected graph When we consider the undirected graph $G(A)$ in Example 6, the pattern of $|\Pi^{T}A|+|\Pi^{T}A|^{T}$ and its filled graph $G_{f}(A)$ now equals $G(A)$ (cf. Figure 0.4.1).

Entries of $G(A)$ are denoted by $\bullet$ .

The key tool to predict the fill–in easily for the undirected graph is the elimination tree Liu90 . Recall that an undirected and connected graph is called a tree, if it does not contain any cycle. Furthermore, one vertex is identified as root. As usual we call a vertex $j$ parent of $i$ , if there exists an edge $(i,j)$ in the tree such that $j$ is closer to the root. In this case $i$ is called child of $j$ . The subtree rooted at vertex $j$ is denoted by $T(j)$ and the vertices of this subtree are called descendants of $j$ whereas $j$ is called their ancestor. Initially we will define the elimination tree algorithmically using the depth–first–search algorithm AhoHU83 . Later we will state a much simplified algorithm.

Definition 12

Given the filled graph $G_{f}(A)$ the elimination tree $T(A)$ is defined by the following algorithm.

Perform a depth–first–search in $G_{f}(A)$ starting from vertex $n$ .

When vertex $m$ is visited, choose from its unvisited neighbors $i_{1},\dots,i_{k}$ the index $j$ with the largest number $j=\max\{i_{1},\dots,i_{k}\}$ and continue the search with $j$ .

A leaf of the tree is reached, when all neighbors have already been visited.

We like to point out that the application of the depth–first–search to $G_{f}(A)$ starting at vertex $n$ behaves significantly different from other graphs. By Theorem 0.4.1 it follows that as soon as we visit a vertex $m$ , all its neighbors $j>m$ must have been visited prior to vertex $m$ . Thus the labels of the vertices are strictly decreasing until we reach a leaf node.

Example 8

Depth–first–search We illustrate the depth–first–search using the filled graph in Figure 5 and the pattern from Example 7. The extra fill edge is marked by the bold line. The ongoing depth–first search visits the vertices in the order $6\to 5\to 4$ . Since at vertex $4$ , all neighbors of $4$ are visited (and indeed have a larger number), the algorithm backtracks to $5$ and continues the search in the order $5\to 1$ . Again all neighbors of vertex $1$ are visited (and have larger number), thus the algorithm backtracks to $5$ and to $6$ and continues by $6\to 2$ . Then the algorithm terminates. Note that vertex $3$ is isolated and if the graph of $A$ is not connected one has to proceed for each connected component separately.

Remark 1

It follows immediately from the construction of $T(A)$ and Theorem 0.4.1 that additional edges of $G_{f}(A)$ which are not covered by the elimination tree can only show up between a vertex and some of its ancestors (referred to as “back–edges”). In contrast to that, “cross–edges” between unrelated vertices do not exist.

Remark 2

One immediate consequence of Remark 1 is that triangular factors can be computed independently starting from the leaves until the vertices meet a common parent, i.e., column $j$ of $L$ and $U^{T}$ only depend on those columns $s$ of $L$ and $U^{T}$ such that $s$ is a descendant of $j$ in the elimination tree $T(A)$ .

Example 9

Elimination tree We use the matrix “west0479” from Example 5, after maximum weight matching and multilevel nested dissection have been applied. We use Matlab’s etreeplot to display its elimination tree (see Figure 6). The elimination tree displays the high level of concurrency that is induced by nested dissection, since by Remark 2 the computations can be executed independently at each leaf node towards to the root until a common parent vertex is reached.

Further conclusions can be easily derived from the elimination tree, in particular Remark 2 in conjunction with Theorem 0.4.1.

Remark 3

Consider some $k\in\{1,\dots,n\}$ . Then there exists a (fill) edge $(j,k)$ with $j<k$ if and only if there exists a common descendant $i$ of $k,j$ in $T(A)$ such that $a_{ik}\not=0$ . This follows from the fact that once $a_{ik}\not=0$ , by Theorem 0.4.1 this induces (fill) edges $(j,k)$ in the filled graph $G_{f}(A)$ for all nodes $j$ between $i$ and $k$ in the elimination tree $T(A)$ , i.e., for all ancestors of $i$ that are also descendents of $k$ . This way, $i$ propagates fill–edges along the branch from $i$ to $k$ in $T(A)$ and the information $a_{ik}\not=0$ can be used as path compression to advance from $i$ towards to $k$ along the elimination tree.

Example 10

Path compression Consider the graph and the elimination tree from Figure 5. Since there exists the edge $(1,6)$ in $G(A)$ , therefore another (fill) edge $(5,6)$ must exist (here not a fill edge, but a regular edge). Similarly, the same conclusion can be drawn from the existence of the edge $(4,6)$ . Via $a_{16}\not=0$ we can advance from vertex $1$ to vertex $6$ bypassing vertex $5$ .

The elimination tree itself can be easily described by a vector $p$ of length $n$ such that for any $i<n$ , $p_{i}$ denotes the parent node while $p_{n}=0$ corresponds to the root. Consider some step $k$ with $a_{ik}\not=0$ , for some $i<k$ . By Remark 3, $i$ must be a descendent of $k$ and there could be further ancestors $j$ of $i$ which are also descendents of $k$ . Possibly not all ancestors of $i$ have been assigned a parent node so far. Thus we can replace $i$ by $j=p_{i}$ until we end up with $p_{j}=0$ or $p_{j}\geqslant k$ . This way we traverse $T(A)$ from $i$ towards to $k$ until we have found the child node $j$ of $k$ . If the parent of $j$ has not been assigned to $j$ yet, then $p_{j}=0$ and $k$ must be the parent of $j$ . If some $l<k$ were the parent of $j$ , then we would have assigned $l$ as parent of $j$ in an earlier step $l<k$ . In this case we set $p_{j}\leftarrow k$ . Otherwise, if $p_{j}\geqslant k$ , then we have already assigned $j$ ’s parent in an earlier step $l<k$ .

Example 11

Computation of parent nodes Consider the elimination tree $T(A)$ from Figure 5. Unless $k=5$ , no parents have been assigned, i.e. $p_{i}=0$ for all $i$ . Now for $k=5$ we have $a_{15}\not=0$ and using the fact that $p_{1}=0$ implies that we have to set $p_{1}\leftarrow 5$ . Next, $a_{45}\not=0$ and again $p_{4}=0$ requires to set $p_{4}\leftarrow 5$ . Finally, if $k=6$ , we have $a_{16}\not=0$ , we advance from $1$ to $p_{1}=5$ and since $p_{5}=0$ we set $p_{5}\leftarrow 6$ . Next, $a_{26}\not=0$ and $p_{2}=0$ imply that we have to set $p_{2}\leftarrow 6$ . $a_{46}\not=0$ will traverse from $4$ to $p_{4}=5$ to $p_{5}=6$ , which requires no further setting. Last but not least, $a_{56}\not=0$ requires no further action, since we already have $p_{5}=6$ . In total we have $p=[5,6,0,5,6,0]$ which perfectly reveals the parent properties of the elimination trees in Figure 5.

By Remark 3 (cf. Tar83 ; Dav06 ), we can also make use of path compression. Since our goal is to traverse the branch of the elimination tree from $i$ to $k$ as fast as possible, any ancestor $j=a_{i}$ of $i$ would be sufficient. With the same argument as before, an ancestor $a_{j}=0$ would refer to a vertex that does not have a parent yet. In this case we can again set $p_{j}\leftarrow k$ . Moreover, $k$ is always an ancestor of $a_{i}$ . The algorithm including path compression can be summarized as follows (see also Liu90 ; Dav06 ). {programcode}Computation of the elimination tree

1: $A\in\mathbb{R}^{n,n}$ such that $A$ has the same pattern as $|A|+|A|^{T}$ .

2:vector $p\in\mathbb{R}^{n}$ such that $p_{i}$ is the parent of $i$ , $i=1,\dots,n-1$ , except $p_{n}=0$ .

3:let $a\in\mathbb{R}^{n}$ be an auxiliary vector used for path compression.

4: $p\leftarrow 0,a\leftarrow 0$

5:for $k=2,\dots,n$ do

6: for all $i<k$ such that $a_{ik}\not=0$ do

7: while $i\not=0$ and $i<k$ do

8: $j\leftarrow a_{i}$

9: $a_{i}\leftarrow k$

10: if $j=0$ then

11: $p_{i}\leftarrow k$

12: end if

13: $i\leftarrow j$

14: end while

15: end for

16:end for

0.4.2 The supernodal approach

We have already seen that the elimination tree reveals information about concurrency. It is further useful to determine the fill– in $L$ and $U^{T}$ . This information can be computed from the elimination tree $T(A)$ together with $G(A)$ . The basis for determining the fill–in in each column is again Remark 3. Suppose we are interested in the nonzero entries of column $j$ of $L$ and $U^{T}$ . Then for all decedents of $j$ , i.e. the nodes of the subtree $T(j)$ rooted at vertex $j$ , a nonzero entry $a_{ik}\not=0$ also implies $l_{kj}\not=0$ . Thus, starting at any leaf $i$ , we obtain its fill by all $a_{ik}\not=0$ such that $k>i$ and when we move forward from $i$ to its parent $j$ , vertex $j$ will inherit the fill from node $i$ for all $k>j$ plus the nonzero entries given by $a_{jk}\not=0$ such that $k>j$ . When we reach a common parent node $k$ with multiple children, the same argument applies using the union of fill–in greater than $k$ from its children together with the nonzero entries $a_{kl}\not=0$ such that $l>k$ . We summarize this result in a very simple algorithm {programcode}Computation of fill–in

1: $A\in\mathbb{R}^{n,n}$ such that $A$ has the same pattern as $|A|+|A|^{T}$ .

2:sparse strict lower triangular pattern $P\in\mathbb{R}^{n,n}$ with same pattern as $L$ , $U^{T}$ .

3:compute parent array $p$ of the elimination tree $T(A)$

4:for $j=1,\dots,n$ do

5: supplement nonzeros of column $j$ of $P$ with all $i>j$ such that $a_{ij}\not=0$

6: $k=p_{j}$

7: if $k>0$ then

8: supplement nonzeros of column $k$ of $P$ with nonzeros of column $j$ of $P$ greater than $p$

9: end if

10:end for

Algorithm 0.4.2 only deals with the fill pattern. One additional aspect that allows to raise efficiency and to speed up the numerical factorization significantly is to detect dense submatrices in the factorization. Block structures allow to collect parts of the matrix in dense blocks and to treat them commonly using dense matrix kernels such as level–3 BLAS and LAPACK DodL85 ; DonDHH88 . Dense blocks can be read off from the elimination tree employing Algorithm 0.4.2.

Definition 13

Denote by $\mathcal{P}_{j}$ the nonzero indices of column $j$ of $P$ as computed by Algorithm 0.4.2. A sequence $k,k+1,\dots,k+s-1$ is called supernode of size $s$ if the columns of $\mathcal{P}_{j}=\mathcal{P}_{j+1}\cup\{j+1\}$ for all $j=k,\dots,k+s-2$ .

In simple words, Definition 13 states that for a supernode $s$ subsequent columns can be grouped together in one dense block with a triangular diagonal block and a dense subdiagonal block since they perfectly match the associated trapezoidal shape. We can thus easily supplement Algorithm 0.4.2 with a supernode detection. {programcode}Computation of fill–in and supernodes

1: $A\in\mathbb{R}^{n,n}$ such that $A$ has the same pattern as $|A|+|A|^{T}$ .

2:sparse strict lower triangular pattern $P\in\mathbb{R}^{n,n}$ with same pattern as $L$ , $U^{T}$ as well as column size $s\in\mathbb{R}^{m}$ of each supernode.

3:compute parent array $p$ of the elimination tree $T(A)$

4: $m\leftarrow 0$

5:for $j=1,\dots,n$ do

6: supplement nonzeros of column $j$ of $P$ with all $i>j$ such that $a_{ij}\not=0$

7: denote by $r$ the number of entries in column $j$ of $P$

8: if $j>1$ and $j=p_{j-1}$ and $s_{m}+r=l$ then

9: $s_{m}\leftarrow s_{m}+1$ $\triangleright$ continue current supernode

10: else

11: $m\leftarrow m+1$ , $s_{m}\leftarrow 1$ , $l\leftarrow r$ $\triangleright$ start new supernode

12: end if

13: $k=p_{j}$

14: if $k>0$ then

15: supplement nonzeros of column $k$ of $P$ with nonzeros of column $j$ of $P$ greater than $p$

16: end if

17:end for

Example 12

Supernode Computation To illustrate the use of supernodes, we consider the matrix pattern from Figure 0.4.1 and illustrate the underlying dense block structure in Figure 0.4.2. Supernodes are the columns $1$ , $2$ , $3$ as scalar columns as well as columns $4$ – $6$ as one single supernode.

Supernodes in the triangular factor.

Supernodes form the basis of several improvements, e.g., a supernode can be stored as one or two dense matrices. Beside the storage scheme as dense matrices, the nonzero row indices for these blocks need only be stored once. Next the use of dense submatrices allows the usage of dense matrix kernels using level–3 BLAS DodL85 ; DonDHH88 .

Example 13

Supernodes We use the matrix “west0479” from Example 5, after maximum weight matching and multilevel nested dissection have been applied. We use its undirected graph to compute the supernodal structure. Certainly, since the matrix is nonsymmetric, the block structure is only sub-optimal. We display the supernodal structure for the associated Cholesky factor, i.e., for the Cholesky factor of an symmetric positive definite matrix with same undirected graph as our matrix (see left part of Figure 7). Furthermore, we display the supernodal structure for the factors $L$ and $U$ computed from the nonsymmetric matrix without pivoting (see right part of Figure 7).

While the construction of supernodes is fairly easy in the symmetric case, its generalization to the general case is significantly harder, since one has to deal with pivoting in each step of Gaussian elimination. In this case one uses the column elimination tree GeoN85 .

0.5 Sparse Direct Solvers — Supernodal Data Structures

High-performance sparse solver libraries have been a very important part of scientific and engineering computing for years, and their importance continues to grow as microprocessor architectures become more complex and software libraries become better designed to integrate easily within applications. Despite the fact that there are various science and engineering applications, the underlying algorithms typically have remarkable similarities, especially those algorithms that are most challenging to implement well in parallel. It is not too strong a statement to say that these software libraries are essential to the broad success of scalable high-performance computing in computational sciences. In this section we demonstrate the benefit of supernodal data structures within the sparse solver package PARDISO schenk-2004 . We illustrate it by using the triangular solution process. The forward and backward substitution is performed column wise with respect to the columns of $L$ , starting with the first column, as depicted in Figure 8. The data dependencies here allow to store vectors $y$ , $z$ , $b$ , and $x$ in only one vector $r$ . When column $j$ is reached, $r_{j}$ contains the solution for $y_{j}$ . All other elements of $L$ in this column, i. e. $L_{ij}$ with $i=j+1,\ldots,N$ , are used to update the remaining entries in $r$ by

[TABLE]

The backward substitution with $L^{T}$ will take place row wise, since we use $L$ and perform the substitution column wise with respect to $L$ , as shown in the lower part of Figure 8. In contrast to the forward substitution the iteration over columns starts at the last column $N$ and proceeds to the first one. If column $j$ is reached, then $r_{j}$ , which contains the $j$ -component of the solution vector $x_{j}$ , is computed by subtracting the dot-product of the remaining elements in the column $L_{ij}$ and the corresponding elements of $r_{i}$ with $i=j+1,\ldots,N$ from it:

[TABLE]

After all columns have been processed $r$ contains the required solution of $x$ . It is important to note that line 5 represents in both substitutions an indexed DAXPY and indexed DDOT kernel operations that has to be computed during the streaming operations of the vector $r$ and the column $j$ of the numerical factor $L$ . As we are dealing with sparse matrices it makes no sense to store the lower triangular matrix $L$ as a dense matrix. Hence PARDISO uses its own data structure to store $L$ , as shown in Figure 9.

Adjacent columns exhibiting the same row sparsity structure form a panel, also known as supernode. A panel’s column count is called the panel size $n_{p}$ . The columns of a panel are stored consecutively in memory excluding the zero entries. Note that columns of panels are padded in the front with zeros so they get the same length as the first column inside their panel. The padding is of utmost performance for the PARDISO solver to use Level-3 BLAS and LAPACK functionalities 20.500.11850/144477 . Furthermore panels are stored consecutively in the l array. Row and column information is now stored in accompanying arrays. The xsuper array stores for each panel the index of its first column. Also note that here column indices are the running count of nonzero columns. Column indices are used as indices into xl array to lookup the start of the column in the l array which contains the numerical values of the factor $L$ . To determine the row index of a column’s element an additional array id is used, which holds for each panel the row indices. The start of a panel inside id is found via xid array. The first row index of panel $p$ is id[xid[p]]. For serial execution this information is enough. However, during parallel forward/backward substitution concurrent updates to the same entry of r must be avoided. The parts structure contains the start (and end) indices of the panels which can be updated independently as they do not touch the same entries of $r$ . Two parts, colored blue and orange, are shown in Figure 9. The last part in the bottom right corner of $L$ is special and is called the separator and is colored green. Parts which would touch entries of r in the range of the separator perform their updates into separate temporary arrays t. Before the separator is then serially updated, the results of the temporary arrays are gathered back into r. The backward substitution works the same, just reversed and only updates to different temporary arrays are not required. The complete forward substitution and backward substitution is listed in Algorithm 1 and 2.

0.6 Application – Circuit Simulation

In this section we demonstrate how these developments in sparse direct linear solvers have advanced integrated circuit simulations. Integrated circuits are composed of interconnected transistors. The interconnects are modeled primarily with resistors, capacitors, and inductors. The interconnects route signals through the circuit, and also deliver power. Circuit equations arise out of Kirchhoff’s current law, applied at each node, and are generally nonlinear differential-algebraic equations. In transient simulation of the circuit, the differential portion is handled by discretizing the time derivative of the node charge by an implicit integration formula. The associated set of nonlinear equations is handled through use of quasi-Newton methods or continuation methods, which change the nonlinear problem into a series of linear algebraic solutions. Each component in the circuit contributes only to a few equations. Hence the resulting systems of linear algebraic equations are extremely sparse, and most reliably solved by using direct sparse matrix techniques. Circuit simulation matrices are peculiar in the universe of matrices, having the following characteristics davis:klu :

•

they are unsymmetric, although often nearly structurally symmetric;

•

they have a few dense rows and columns (e.g., power and ground connections);

•

they are very sparse and the straightforward usage of BLAS routines (as in SuperLUsuperlu ) may be ineffective;

•

their LU factors remain sparse if well-ordered;

•

they can have high fill-in if ordered with typical strategies;

•

and being unstructured, the highly irregular memory access causes factorization to proceed only at a few percent of the peak flop-rate.

Circuit simulation matrices also vary from being positive definite to being extremely ill-conditioned, making pivoting for stability important also. As circuit size increases, and depending on how much of the interconnect is modeled, sparse matrix factorization is the dominant cost in the transient analysis. To overcome the complexity of matrix factorization a new class of simulators arose in the 1990s, called fast-SPICE Rewienski2011APO . These simulators partition the circuit into subcircuits and use a variety of techniques, including model order reduction and multirate integration, to overcome the matrix bottleneck. However, the resulting simulation methods generally incur unacceptable errors for analog and tightly coupled circuits. As accuracy demands increase, these techniques become much slower than traditional SPICE methods. Even so, since much of the research effort was directed at fast-SPICE simulators, it brought some relief from impossibly slow simulations when some accuracy trade-off was acceptable. Because these simulators partitioned the circuit, and did not require the simultaneous solution of the entire system of linear equations at any given time, they did not push the state-of-the-art in sparse matrix solvers. Starting in the mid-2000s, increasing demands on accuracy, due to advancing semiconductor technology, brought attention back to traditional SPICE techniques. This was aided by the proliferation of multicore CPUs. Parallel circuit simulation, an area of much research focus in the 1980s and 1990s, but not particularly in practice, received renewed interest as a way to speed up simulation without sacrificing accuracy. Along with improved implementations to avoid cache misses, rearchitecture of code for parallel computing, and better techniques for exploitation of circuit latency, improved sparse matrix solvers, most notably the release of KLU davis:klu , played a crucial role in expanding the utility of SPICE. Along with the ability to simulate ever larger circuits with full SPICE accuracy came the opportunity to further improve sparse matrix techniques. A sparse matrix package for transient simulation needs to have the following features:

•

must be parallel;

•

fast matrix reordering

•

incremental update of the $L$ and $U$ factors when only a few nonzeros change;

•

fast computation of the diagonal entries of the inverse matrix;

•

fast computation of Schur-complements for a submatrix;

•

allow for multiple $LU$ factors of the same structure to be stored;

•

use the best-in-class method across the spectrum of sparsity;

•

use iterative solvers with fast construction of sparse preconditioners;

•

run on various hardware platforms (e.g. GPU acceleration).

Some of these features must be available in a single package. Others, such as iterative solvers and construction of preconditioners, can be implemented with a combination of different packages. The PARDISO solver111The PARDISO solver is available from http://www.pardiso-project.org. stands out as a package that does most of these very well. Here we touch on a few of these features. When applied in the simulation of very large circuits, the difference between a “good” and a “bad” matrix ordering can be the difference between seconds and days. PARDISO offers AMD and nested-dissection methods for matrix ordering, as well as permitting user-defined ordering. Because the matrix re-ordering method which has been used most often in circuit simulation is due to Markowitz markowitz , and because modern sparse matrix packages do not include this ordering method, we briefly describe it here. The Markowitz method is quite well-adapted for circuit simulation. Some desirable aspects of the typical implementation of the Markowitz method, as opposed to the MD variants, are that it works for nonsymmetric matrices and combines pivot choice with numerical decomposition, such that a pivot choice is a numerically “good” pivot which generates in a local sense the least fill-in at that step of the decomposition. Choosing pivots based on the Markowitz score often produces very good results: near-minimal fill-in, unfortunately at the cost of an $O(n^{3})$ algorithm (for dense blocks). Even though the Markowitz algorithm has some good properties when applied to circuit matrices, the complexity of the algorithm has become quite burdensome. When SPICE nagel:spice2 was originally conceived, a hundred-node circuit was huge and the Markowitz algorithm was not a problem. Now we routinely see netlists with hundreds of thousands of nodes and postlayout netlists with millions of elements. As matrix order and element counts increase, Markowitz reordering time can become an obstruction. Even as improved implementations of the Markowitz method have extended its reach, AMD and nested-dissection have become the mainstay of simulation of large denser-than-usual matrices. Next we turn our attention to parallel performance. While KLU remains a benchmark for serial solvers, for parallel solvers, MKL-PARDISO is often cited as the benchmark Booth2017 ; Chen2013 . To give the reader a sense of the progress in parallel sparse matrix methods, in Figure 10 we compare KLU, PARDISO (Version 6.2) to MKL-PARDISO on up to 16 cores on an Intel Xeon E7-4880 architecture with 2.5 GHz processors.

Some of the matrices here can be obtained from the SuiteSparse Matrix Collection, and arise in transistor level full-chip and memory array simulations. It is clear that implementation of sparse matrix solvers has improved significantly over the years. Exploiting latency in all parts of the SPICE algorithm is very important in enabling accurate circuit simulation, especially as the circuit size increases. By latency, we mean that only a few entries in the matrix change from one

Newton iteration to the next, and from one timepoint to the next. As the matrix depends on the time-step, some simulators hold the time-steps constant as much as feasible to allow increased reuse of matrix factorizations. The nonzero entries of a matrix change only when the transistors and other nonlinear devices change their operation point. In most circuits, very few devices change state from one iteration to the next and from one time-step to the next. Nonzeros contributed by entirely linear components do not change value during the simulation. This makes incremental LU factorization a very useful feature of any matrix solver used in circuit simulation. As of April 2019 the version PARDISO 6.2 has a very efficient exploitation of incremental LU factorization, both serial and parallel. In Figure 11 we show that PARDISO scales linearly with number of updated columns, and also scales well with number of cores. Here, the series of matrices were obtained from a full simulation of a post-layout circuit that includes all interconnects, power- and ground-networks). The factorization time is plotted against the number of columns that changed compared to the previous factorization. The scatter plot shows the number of rank- $k$ update and the corresponding factorization time in milliseconds. The regression analysis clearly demonstrates a linear trend both for the single and the multiple core versions. The dashed line shows the time for the full factorization. Another recent useful feature in PARDISO is parallel selective inverse matrix computation as demonstrated in Table 1. In circuit simulation, the diagonal of the inverse matrix is the driving point impedance. It is often required to flag nodes in the circuit with very high driving point impedance. Such nodes would indicate failed interfaces between different subcircuits, leading to undefined state and high current leakage and power dissipation. A naive approach to this is to solve for the driving point impedance, the diagonal of the inverse matrix, by $N$ triangular solves. This is sometimes unacceptably expensive even with exploiting the sparsity of the right hand side, and minimizing the number of entries needed in the diagonal of the inverse. To bypass this complexity, heuristics to compute the impedance of connected components are used. But this is error prone with many false positives and also false negatives. In the circuit Freescale, PARDISO, e.g., finished the required impedance calculations in 11.9 seconds compared to the traditional computation that consumed 162.9 hours.

The productivity gap in simulation continues to grow, and challenges remain. Signoff simulations demand 10X speedup in sparse matrix factorization. Simply using more cores does not help unless the matrices are very large and complex. For a majority of simulations, scaling beyond 8 cores is difficult. As a result, some of these simulations can take a few months to complete, making them essentially impossible. Some of the problems in parallelizing sparse matrix operations for circuit simulation are fundamental. Others may be related to implementation. Research on sparse matrix factorization for circuit simulation continues to draw attention, especially in the area of acceleration with Intel’s many integrated core (MIC) architecture Booth2017 and GPUs Chen2015 ; Nakhla2018 . Other techniques for acceleration include improved preconditioners for iterative solvers Feng2015 . We are presently addressing the need for runtime selection of optimal strategies for factorization, and also GPU acceleration. Given that circuits present a wide spectrum of matrices, no matter how we categorize them, it is possible to obtain a solver that is 2–10X better on a given problem. Improvements in parallel sparse matrix factorization targeted at circuit simulation is more necessary today than ever and will continue to drive applicability of traditional SPICE simulation methods. Availability of sparse matrix packages such as PARDISO that completely satisfy the needs of various circuit simulation methods is necessary for continued performance gains.

Bibliography44

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] A. Aho, J. Hopcroft, and J. Ullman. Data structures and algorithms . Addison-Wesley, 1983.
2[2] P. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl. , 17(4):886–905, 1996.
3[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. D. Croz, A. Greenbaum, S. Hammarling, A. Mc Kenney, S. Ostrouchov, and D. Sorensen. LAPACK Users’ Guide, Second Edition . SIAM Publications, 1995.
4[4] M. Benzi, J. Haws, and M. Tuma. Preconditioning highly indefinite and nonsymmetric matrices. SIAM J. Sci. Comput. , 22(4):1333–1353, 2000.
5[5] C. Berge. Two theorems in graph theory. In Proceedings of National Academy of Science , pages 842–844, USA, 1957.
6[6] J. D. Booth, N. D. Ellingwoodb, and S. R. Heidi K. Thornquist. Basker: Parallel sparse lu factorization utilizing hierarchical parallelism and data layouts. Parallel Computing , 68:17–31, 2017.
7[7] X. Chen, L. Ren, Y. Wang, and H. Yang. Gpu-accelerated sparse lu factorization for circuit simulation with performance modeling. IEEE Tr Ansactions on Parallel and Distributed Systems , 26:786–795, 2015.
8[8] X. Chen, Y. Wang, and H. Yang. Nicslu: An adaptive sparse matrix solver for parallel circuit simulation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , 32:261–274, 2013.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

State-of-The-Art Sparse Direct Solvers

Abstract

1 Introduction

2 Maximum weight matching

Definition 1

Definition 2

Definition 3

Definition 4

Definition 5

Example 1

Definition 6

Definition 7

Example 2

Theorem 2.1

Example 3

3 Symbolic symmetric reordering techniques

3.1 Multilevel nested dissection

Definition 8

Definition 9

Definition 10

Example 4

Coarsening Phase

Partitioning Phase

Uncoarsening Phase

Example 5

0.3.2 Other reordering methods

0.4 Sparse LULULU Decomposition

0.4.1 The Elimination Tree

Theorem 0.4.1

Example 6

Definition 11

Example 7

Definition 12

Example 8

Remark 1

Remark 2

Example 9

Remark 3

Example 10

Example 11

0.4.2 The supernodal approach

Definition 13

Example 12

Example 13

0.5 Sparse Direct Solvers — Supernodal Data Structures

0.6 Application – Circuit Simulation

0.4 Sparse $LU$ Decomposition