Algorithms and data structures for matrix-free finite element operators   with MPI-parallel sparse multi-vectors

Denis Davydov; Martin Kronbichler

arXiv:1907.01005·cs.MS·July 3, 2019

Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors

Denis Davydov, Martin Kronbichler

PDF

TL;DR

This paper presents an efficient, matrix-free, parallel implementation of finite element operators with sparse multivectors in deal.II, enabling scalable quantum mechanical simulations with improved performance.

Contribution

It introduces novel algorithms and data structures for matrix-free sparse multivector operations within deal.II, optimizing performance and scalability for quantum mechanical finite element problems.

Findings

01

Achieves 157 GFlop/s performance on Intel Cascade Lake.

02

Demonstrates strong and weak scaling on benchmark problems.

03

Provides efficient algorithms for sparse multivector operations.

Abstract

Traditional solution approaches for problems in quantum mechanics scale as $O (M^{3})$ , where $M$ is the number of electrons. Various methods have been proposed to address this issue and obtain linear scaling $O (M)$ . One promising formulation is the direct minimization of energy. Such methods take advantage of physical localization of the solution, namely that the solution can be sought in terms of non-orthogonal orbitals with local support. In this work a numerically efficient implementation of sparse parallel vectors within the open-source finite element library deal.II is proposed. The main algorithmic ingredient is the matrix-free evaluation of the Hamiltonian operator by cell-wise quadrature. Based on an a-priori chosen support for each vector we develop algorithms and data structures to perform (i) matrix-free sparse matrix multivector products (SpMM), (ii) the…

Tables3

Table 1. Table 1 . Statistics for the smallest example with 160 160 160 atoms depicted in Figure 6 run on 10 MPI processes.

	quadratic	quartic
DoFs per cell	27	125
number of cells	102400	12800
DoFs (=rows)	846369	846369
columns	320	320
number of row blocks	12800	1600
number of column blocks	38	38
row blocks size	$66.12$ $\pm$ $18.13$	$528.98$ $\pm$ $71.56$
column blocks size	$8.42$ $\pm$ $1.15$	$8.42$ $\pm$ $1.06$
locally owned row blocks in $\bar{𝑯}$	$3.8$ $\pm$ $1.32$	$3.8$ $\pm$ $1.32$
non-empty column blocks for FE cell op.	$97429.8$ $\pm$ $16796$	$14877.4$ $\pm$ $2685.25$

Table 2. Table 2 . Comparison of the BCSR-based implementation to the dense column vectors compiled with Intel compiler.

	quadratic		quartic
	BCSR	Full	BCSR	Full
FE operator [s]	$0.58$	$2.18$	$0.44$	$1.36$
mmult [s]	$0.88$	$11.26$	$0.94$	$12.34$
Tmmult [s]	$0.2$	$16.1$	$0.45$	$16.02$
Memory [Mb]	$607.79$	$2292.35$	$717.72$	$2292.35$

Table 3. Table 3 . Statistics for the example with 320 320 320 atoms run on 20 MPI processes.

	quadratic	quartic
DoFs per cell	27	125
number of cells	204800	25600
DoFs/rows	1686177	1686177
columns	640	640
number of row blocks	25600	3200
number of column blocks	69	71
row blocks size	$65.87$ $\pm$ $18.21$	$526.93$ $\pm$ $71.24$
column blocks size	$9.28$ $\pm$ $2.27$	$9.01$ $\pm$ $2.05$
locally owned row blocks in $\bar{𝑯}$	$3.45$ $\pm$ $1.23$	$3.55$ $\pm$ $1.32$
non-empty column blocks for FE cell op.	$1.01 \cdot 10^{5}$ $\pm$ $21082.6$	$15728.4$ $\pm$ $3551.67$

Equations27

E (Φ) = tr ([2 \overline{I} - \overline{M} Φ^{T} M Φ] \overline{H} Φ^{T} H Φ) .

E (Φ) = tr ([2 \overline{I} - \overline{M} Φ^{T} M Φ] \overline{H} Φ^{T} H Φ) .

G (Φ) = - 2 M Φ \overline{H} + 2 H Φ [2 \overline{I} - \overline{M}] .

G (Φ) = - 2 M Φ \overline{H} + 2 H Φ [2 \overline{I} - \overline{M}] .

\displaystyle\frac{\mathrm{d}E(\boldsymbol{\Phi}+\epsilon\boldsymbol{D})}{\mathrm{d}\epsilon}\Bigr{\rvert}_{\epsilon=0}={\rm{tr}}(\boldsymbol{D}^{T}\boldsymbol{G})=\sum_{\alpha}\boldsymbol{D}_{\alpha}\cdot\boldsymbol{G}_{\alpha}\,.

\displaystyle\frac{\mathrm{d}E(\boldsymbol{\Phi}+\epsilon\boldsymbol{D})}{\mathrm{d}\epsilon}\Bigr{\rvert}_{\epsilon=0}={\rm{tr}}(\boldsymbol{D}^{T}\boldsymbol{G})=\sum_{\alpha}\boldsymbol{D}_{\alpha}\cdot\boldsymbol{G}_{\alpha}\,.

Φ_{M} = M Φ,

Φ_{M} = M Φ,

Φ_{H} = H Φ,

\overline{M}

\overline{M}

\overline{H}

G

G

G

g = tr (D^{T} G) = α \sum D_{α} \cdot G_{α},

g = tr (D^{T} G) = α \sum D_{α} \cdot G_{α},

ϕ_{α}^{h} (x)

ϕ_{α}^{h} (x)

M_{ij} H_{ij} = K \sum \int_{Ω_{K}^{h}} N_{i} (x) N_{j} (x) d x, = K \sum \int_{Ω_{K}^{h}} [\frac{1}{2} \nabla N_{i} (x) \cdot \nabla N_{j} (x) + N_{i} (x) V (x) N_{j} (x)] d x,

M_{ij} H_{ij} = K \sum \int_{Ω_{K}^{h}} N_{i} (x) N_{j} (x) d x, = K \sum \int_{Ω_{K}^{h}} [\frac{1}{2} \nabla N_{i} (x) \cdot \nabla N_{j} (x) + N_{i} (x) V (x) N_{j} (x)] d x,

M = K \sum P_{K}^{T} M_{K} P_{K}, H = K \sum P_{K}^{T} H_{K} P_{K} .

M = K \sum P_{K}^{T} M_{K} P_{K}, H = K \sum P_{K}^{T} H_{K} P_{K} .

\exists M_{s} \in N \forall x \in Ω^{h} card {α ∣ x \in Ω_{α}^{h}} \leq M_{s} .

\exists M_{s} \in N \forall x \in Ω^{h} card {α ∣ x \in Ω_{α}^{h}} \leq M_{s} .

M_{s} ≪ M,

M_{s} ≪ M,

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors

Denis Davydov

0000-0002-3779-0101

Chair of Applied Mechanics, Friedrich-Alexander-Universität Erlangen-NürnbergEgerlandstr. 5Erlangen91058Germany

[email protected]

and

Martin Kronbichler

0000-0001-8406-835X

Institute for Computational Mechanics, Technical University of MunichBoltzmannstr. 15Garching85748Germany

[email protected]

Abstract.

Traditional solution approaches for problems in quantum mechanics scale as $\mathcal{O}(M^{3})$ , where $M$ is the number of electrons. Various methods have been proposed to address this issue and obtain linear scaling $\mathcal{O}(M)$ . One promising formulation is the direct minimization of energy. Such methods take advantage of physical localization of the solution, namely that the solution can be sought in terms of non-orthogonal orbitals with local support.

In this work a numerically efficient implementation of sparse parallel vectors within the open-source finite element library deal.II is proposed. The main algorithmic ingredient is the matrix-free evaluation of the Hamiltonian operator by cell-wise quadrature. Based on an a-priori chosen support for each vector we develop algorithms and data structures to perform (i) matrix-free sparse matrix multivector products (SpMM), (ii) the projection of an operator onto a sparse sub-space (inner products), and (iii) post-multiplication of a sparse multivector with a square matrix. The node-level performance is analyzed using a roofline model. Our matrix-free implementation of finite element operators with sparse multivectors achieves the performance of $157$ GFlop/s on Intel Cascade Lake architecture. Strong and weak scaling results are reported for a typical benchmark problem using quadratic and quartic finite element bases.

finite element method, matrix-free method, density functional theory

††journal: TOPC††journalvolume: X††journalnumber: 2††article: 3††journalyear: 2019††publicationmonth: 6††copyright: acmcopyright††doi: 0000001.0000001††ccs: Mathematics of computing Partial differential equations††ccs: Mathematics of computing Mathematical software performance

1. Introduction

In recent years there has been a growing interest in using finite element (FE) discretizations for problems in quantum mechanics (Pask et al., 2001; Tsuchida and Tsukada, 1996, 1998; Fattebert et al., 2007; Motamarri et al., 2012; Pask and Sukumar, 2017; Ram-Mohan, 2002; Davydov et al., 2017, 2018; Bao et al., 2012; Fang et al., 2012; Young et al., 2013; Zhang et al., 2008; Bylaska et al., 2009; Schauer and Linder, 2015), including an open-source production-ready FE-based implementation (Motamarri et al., 2019) of the Density Functional Theory (DFT) (Hohenberg and Kohn, 1964; Kohn and Sham, 1965). An FE basis has the following advantages: (i) it is a locally adaptive basis with variational convergence; (ii) it has well-developed rigorous error estimates that can be used to efficiently drive locally adaptive refinement; (iii) a hierarchy of nested functional spaces can be used to formulate geometric multigrid (GMG) preconditioners and solvers (Davydov et al., 2018; Beck, 2009), that are important for the computationally efficient solution of source problems and can also improve convergence of direct minimization methods (Davydov et al., 2018; Fattebert and Bernholc, 2000); (iv) an MPI-parallel implementation is achieved by domain decomposition of the mesh and does not require global communication; (v) the FE formulation can be equipped with both periodic and non-periodic boundary conditions; (vi) high order polynomial bases can be efficiently employed using matrix-free sum factorization approaches (Kronbichler and Kormann, 2012, 2019); (vii) pseudo-potentials and all electron calculations can be treated within the same framework. Some of those advantages were employed to develop an open-source DFT-FE code (Motamarri et al., 2019). The code was shown to outperform several commonly used plane-wave codes for systems with more than a few thousands of electrons.

Being a real-space method, an FE basis is also suitable for orbital minimization (OM) approaches (Hirose and Ono, 2001; Hirose et al., 2005; Mauri and Galli, 1994; Ordejón et al., 1995; Tsuchida and Tsukada, 1998; King-Smith and Vanderbilt, 1994; Peng et al., 2013; Burger and Yang, 2008; Fattebert and Bernholc, 2000; Fattebert and Gygi, 2004; Galli and Parrinello, 1992; Fattebert et al., 2007; Bowler et al., 2000; Goringe et al., 1997; Davydov et al., 2018) which can be combined with localization in real space to achieve linear scaling with respect to the number of unknown vectors $M$ in electronic structure calculations. In this case the solution is sought in terms of non-orthogonal orbitals with local support. Note that traditional eigensolver-based solution strategies have an $\mathcal{O}(M^{3})$ computational complexity and an $\mathcal{O}(M^{2})$ memory requirement.

Although finite elements have been used with localized orbitals (Tsuchida and Tsukada, 1998), to the best of our knowledge the implementation details of the underlying linear algebra have not been discussed. In the context of finite differences (FD), the authors of (Fattebert and Gygi, 2006) address this problem by “packing together several nonoverlapping localized orbitals into a single global array that represents a grid-based function over the whole discretization grid” using a graph-coloring algorithm. However implementational details, performance and scaling studies were not presented.

In this contribution we propose algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multivectors within the deal.II finite element library (Alzetta et al., 2018). Implementational details are thoroughly described for three key operations: (i) application of a matrix-free operator to a sparse multivector; (ii) multivector inner products; (iii) post-multiplication of a sparse multivector with a sparse square matrix. Node level performance is analyzed using the roofline performance model. Performance of the matrix-free operator is compared to the element stiffness-matrix based operators, adopted in (Motamarri et al., 2019). Strong and weak scaling are studied on a typical example in $\mathbb{R}^{3}$ .

2. Required operations

Within the OM solution approach of DFT, the typical problem is to find the minimum of the functional

[TABLE]

Here $\boldsymbol{\Phi}=\{\boldsymbol{\phi}_{1},\boldsymbol{\phi}_{2},\dots,\boldsymbol{\phi}_{M}\}\in\mathbb{R}^{N\times M}$ denote a set of $M$ vectors (a “multivector”) where each vector is of size $N$ . $\boldsymbol{M}\in\mathbb{R}^{N\times N}$ and $\boldsymbol{H}\in\mathbb{R}^{N\times N}$ are symmetric matrices representing the mass and Hamiltonian operators in the FE basis. For the current study we limit ourselves to a local potential. In this case the Hamiltonian operator has the form $-\frac{1}{2}\Delta+V(\boldsymbol{x})$ , where $V(\boldsymbol{x})$ is a given scalar-valued potential field. Given that FE shape functions have local support, both $\boldsymbol{M}$ and $\boldsymbol{H}$ are sparse. $\overline{\boldsymbol{H}}\in\mathbb{R}^{M\times M}$ and $\overline{\boldsymbol{M}}\in\mathbb{R}^{M\times M}$ are projections of mass and Hamiltonian operators onto the subspace spanned by $\boldsymbol{\Phi}$ . The number of columns $M$ represents the number of unknown vectors (on the order of the total number of electrons in the systems), whereas the number of rows $N$ is related to the spatial discretization using the FE basis. Within the FE context applied to such problems, the multivector $\boldsymbol{\Phi}$ is tall and skinny, i.e., $N\gg M$ . Here and below we denote by $\overline{\{\bullet\}}$ comparatively small matrices that do not contain $N$ as either of their dimensions.

The gradient of (1) reads

[TABLE]

Additionally the directional derivative of the functional (1) is required for line searches and is given by the sum of inner products for each column $\alpha$ ,

[TABLE]

In order to perform direct minimization (i.e., by adopting the steepest descent method or BFGS (Davydov et al., 2018)), the following operations are required:

•

sparse-matrix multivector multiplications:

[TABLE]

•

multivector inner products:

[TABLE]

where $\overline{\boldsymbol{M}}$ and $\overline{\boldsymbol{H}}$ are symmetric Hermitian (thanks to the properties of $\boldsymbol{M}$ and $\boldsymbol{H}$ ).

•

right-multiplication with a square matrix:

[TABLE]

where $\boldsymbol{G}\in\mathbb{R}^{N\times M}$ .

•

column-wise inner product:

[TABLE]

where $\boldsymbol{D}\in\mathbb{R}^{N\times M}$ is the search direction.

Our goal is to implement these operations for the case when the minimum of (1) is sought in terms of an a-priori chosen111An alternative approach is to augment (1) to favor sparsity (e.g. using the $l^{1}$ -regularization (Lu and Thicke, 2017)) sparsity of an unknown multivector $\boldsymbol{\Phi}$ . Note that in this case $\overline{\boldsymbol{H}}$ and $\overline{\boldsymbol{M}}$ are also sparse. We are interested in the case when the number of unknown multivectors $M$ is on the order of a few thousands, whereas the number of basis functions $N$ in the FE basis is several orders of magnitude larger. Consequently, the operations we focus on are those that contain the largest dimension $N$ , namely application of mass and Hamiltonian operators, multivector inner products and post-multiplication.

2.1. Available parallel sparse linear algebra packages

Given the sparse structure of $\boldsymbol{\Phi}$ , the operations listed above are essentially sparse matrix-matrix multiplications (SpMM). Below we give a brief overview of existing open source libraries focusing on the required operations and explain why they are not well-suited for our application.

GHOST (Kreutzer et al., 2017) is a relatively new high-performance sparse linear algebra package aimed at heterogeneous systems. Sparse matrices are stored in the SELL-C- $\sigma$ format that allows to employ SIMD vectorization over a chunk of rows in a matrix and achieve very competitive implementation of sparse matrix vector (SpMV) products (Kreutzer et al., 2014). The storage format, however, is not well suited for random access to rows of the matrix, which is the case of matrix-free FE operators applied to sparse multivectors (that we shall discuss in the next sections). Also, at the time of writing, GHOST does not provide any kernels for SpMM multiplication, the ghost_tsmttsm kernel only supports dense tall and skinny matrices.

A commonly used linear algebra package within the FE community is TPetra222We do not consider EPetra as all ongoing efforts on linear algebra packages within Trilinos are concentrated on TPetra. (Nusbaum, 2011) from the Trilinos (Heroux et al., 2005) suite. It provides sparse matrices in either compressed row storage (CRS) or compressed column storage (CCS). However, sparse matrix-matrix multiplication kernels provided by Tpetra::MatrixMatrix::Multiply are not flexible enough for our needs. More precisely, when SpMM is to be performed multiple times for the same structure of matrices, the sparsity of the destination matrix should be consistent to the product. For the case of non-local potentials, this will not hold for (8, 9), where each of those products will result in different sparsity patterns and the sparsity pattern of $\boldsymbol{G}$ is their union. SpMM in TPetra also does not take into account the symmetry of (6, 7), originating from the symmetry of mass and Hamiltonian operators. Additionally, we mention that $\boldsymbol{A}^{T}\boldsymbol{B}$ creates an actual transpose of the matrix prior to evaluation of the product (Nusbaum, 2011). Finally, within the matrix-free context of FE operators source (multi)vectors are required to keep information not only about locally owned rows (assuming 1D row-wise partitioning of $\boldsymbol{\Phi}$ ), but also know values for a relatively small number of rows owned by other MPI processes. Within the TPetra implementation model that would require two instances of $\boldsymbol{\Phi}$ , which would effectively double the memory footprint.

Another commonly used alternative is the PETSc (Balay et al., 2015) package. Similar limitations regarding the sparsity of the destination matrix, evaluation of transpose and symmetry also apply to PETSc’s MatTransposeMatMult implementation of SpMM multiplication kernel. Additionally, at the time of writing, PETSc does not support shared memory parallelization.

Finally, the sparsity structure of $\boldsymbol{\Phi}$ is relatively dense compared to the sparsities originating from FEM or FD discretizations of local operators in PDEs, which are one of the main application scenarios for the packages discussed above.

3. Implementation details

In this section we detail the main algorithmic building blocks for high-performance DFT calculations with finite elements using sparse vectors.

3.1. Finite element method and MPI parallelization

We start by summarizing the MPI parallelization of the finite element method with standard dense vectors and related concepts needed for matrix-free operator evaluation (Kronbichler and Kormann, 2012). First we introduce a FE triangulation $\mathcal{T}^{h}$ of the domain $\Omega$ and an associated FE space of continuous piecewise shape functions of a fixed polynomial degree $p$ in $d$ spatial dimension $\phi_{\alpha}^{h}\in V^{h}\subset H^{1}_{0}(\Omega)$ . The unknown orbital fields (a multivector) are given in a vector space spanned by standard FE basis functions $\{N_{i}(\boldsymbol{x})\}$ (e.g. polynomials with local support):

[TABLE]

where the superscript $h$ indicates that this representation is related to the FE mesh with size function $h(\boldsymbol{x})$ . In this work, we consider shape functions that are Lagrange polynomials on the elements and continuous over element boundaries. The interpolant $\phi_{\alpha}^{h}$ can be described in terms of its values $\Phi_{i\alpha}$ in the points $\boldsymbol{x}_{i}$ via the interpolation property $N_{j}(\boldsymbol{x}_{i})=\delta_{ij}$ . These interpolation points are called “nodes” in the following. It is common within the FE community to refer to node values as degrees of freedom (DoFs) since the discretization is typically applied to source problems, in which case the number of unknowns is the same as the number of basis functions. Below we follow the same notation and effectively denote by DoFs the rows in the unknown multivector $\boldsymbol{\Phi}$ .

The entries of the mass and Hamiltonian matrices are given by

[TABLE]

where $\Omega^{h}_{K}$ is an element $K$ of the triangulation of the domain $\Omega^{h}=\bigcup_{K}\Omega^{h}_{K}$ . In this work, we consider hexahedral elements.

As an example consider a setup with quadratic elements on a mesh partitioned into 2 MPI processes, see Figure 1. Elements are partitioned one-to-one into a given number of MPI processes (e.g., using graph coloring or space-filling curves). Based on the mesh partitioning, index sets of locally “owned” DoFs is constructed. Note that the relationship between the “owned” DoFs and MPI processes is a one-to-one map. In Figure 1(a) and 1(b) such DoFs are marked with crosses. Clearly DoFs that correspond to nodes at the interface between two MPI domains can be assigned to either process. In deal.II the process with a lower rank will claim the ownership. Figure 1(c) depicts two shape functions for the given triangulation and a quadratic Lagrange basis. Given the local support of the shape functions $N_{i}(\boldsymbol{x})$ , the matrices $M_{ij}$ and $H_{ij}$ are sparse, with a sparsity pattern implied by the element connectivity and the adopted basis. A common realization of FE algorithms is to numerically evaluate the integrals in (12) and store non-zero elements in a sparse matrix organized in e.g. the compressed sparse row (CSR) format. Sparse matrix-vector products can then be evaluated in parallel based on a one-to-one partitioning of DoFs and matrix rows.

3.2. Matrix-free approach

As an alternative to the global assembled view of FE matrices with global DoF numbers $i$ and $j$ , one can take an element-based view of the operators,

[TABLE]

Here $\boldsymbol{H}_{K}\in\mathbb{R}^{[p+1]^{d}\times[p+1]^{d}}$ and $\boldsymbol{M}_{K}\in\mathbb{R}^{[p+1]^{d}\times[p+1]^{d}}$ are the element mass and Hamiltonian matrices, which represent integrals of the basis functions inserted into the bilinear forms integrating over each element $K$ , and $p$ is the polynomial degree of the shape functions. $\boldsymbol{P}_{K}$ is a sparse matrix which extracts local values on element $K$ , whereas $\boldsymbol{P}_{K}^{T}$ is its inverse (i.e. $\boldsymbol{P}_{K}^{T}\boldsymbol{P}=\boldsymbol{I}$ ) that maps DoFs on a cell $K$ to global DoFs, typically implemented by an index array and indirect addressing. Here and below we assume absence of constraints (e.g., due to hanging nodes). This case is not persued here, but it can be handled straight-forwardly by adopting algorithms and data structures developed for standard SpMV multiplication in (Kronbichler and Kormann, 2012).

This view of the problem allows to evaluate the matrix-vector product, i.e., the discretized differential operator in terms of the finite element shape functions, in a matrix-free fashion (Kronbichler and Kormann, 2012, 2019). Entries of the sparse matrices are neither explicitly calculated nor stored, and only the operator action on a (multi-)vector is requested in (iterative) solution algorithm. The local multiplications by $\boldsymbol{M}_{K}$ and $\boldsymbol{H}_{K}$ on local (multi-)vector are computed by a fast integration using sum factorization. Sum factorization utilizes the tensor product structure of shape functions and quadrature points on hexahedra and expresses interpolations and derivatives in reference coordinate by a series of one-dimensional operations. The theoretical complexity of operator evaluation in terms of the polynomial degree $p$ is $\mathcal{O}(p)$ per DoF. The runtime often behaves as $\mathcal{O}(1)$ because memory bandwidth is the limiting factor rather than arithmetic. Matrix-based CSR approaches lose the tensor-product structure as soon as non-affine meshes are considered or operators with non-constant coefficients are involved, and give rise to arithmetic and memory complexities of $\mathcal{O}(p^{3})$ in three space dimensions. The arithmetic work, expressed as the number of floating point operations per DoF, to evaluate the Hamiltonian operator with a matrix-free approach using fast integration is compared to a matrix-based CSR approach in Figure 2 using data from (Kronbichler and Kormann, 2019). Even though computing integrals on the fly clearly increases work compared to the final sparse matrix for $p=1$ , sum factorization eventually reverses the picture. Note that the initial reduction of the work per unknown in the matrix-free case is due to the reduced proportion of DoFs shared by several elements, that contribute with separate arithmetic operations on each element.

The matrix-free method also comes along with a significantly reduced memory consumption. In classical finite element implementations where the operator is applied to a single vector, the lower memory transfer implies that the case $p=2$ is up to five times faster than the sparse matrix-vector product, despite a higher arithmetic work (Kronbichler and Kormann, 2019), given the fact that SpMV is memory-bandwidth limited on all current hardware architectures. In the context of the DFT, the operator is applied to a multivector and the matrix entries can be reused for several vectors. SpMMV implementations with a considerably higher utilization of floating point units have been developed, see e.g. the recent work by Anzt et al. (Anzt et al., 2015) and Hong et al. (Hong et al., 2018) for GPUs. An alternative approach adopted in the DFT context for SpMMV FE operators is to store element matrices $\boldsymbol{M}_{K}$ and $\boldsymbol{H}_{K}$ and resort to dense matrix-matrix products on the element level (Motamarri et al., 2019). In this work, we consider methods with moderate polynomial degrees $p=2$ and $p=4$ which balance the needs for localization (favoring many low-degree elements) and better accuracy per DoF of higher-degree elements in terms of the underlying orbitals. According to Figure 2, matrix-free approaches are the most favorable setup in this case with around 170 Flop/DoF.

The element-based matrix-free approach requires knowledge of the vector values associated with the locally owned elements. At the interface between MPI domains this requires access to so-called “ghost” DoFs, that are owned by another MPI processes (depicted by squares in Figure 1(b)). Within the deal.II library, the required MPI communication is termed “update ghost values” . This is a synchronization operation which copies some values from process which “owns” a DoF to all processes which access that DoF as a “ghost” DoF, ensuring access to the replicated data.

On the other hand, the loop over cells according to (13) implies that the operator on cell $K$ will contribute to all DoFs on this cell in the destination (multi-)vector. This necessitates a second MPI communication operation upon the completion of the cell loop in (13) which sends accumulated data (representing partial integrals with the local variant of the test function) to the owning MPI process, which then adds it to the local integral contribution. We refer to that operation as “compress”.

In order to ensure a direct array access for reading and writing to each locally owned DoF, MPI-parallel vectors in deal.II perform all index access using a process-local enumeration of DoFs, including first all locally owned DoFs and then “ghost” DoFs. The latter are grouped according to the rank of the MPI process owning them so that the non-blocking “MPI_Irecv” in “update ghost values” and “MPI_Isend” in “compress” can use this memory directly. Note that an additional temporary array is needed to receive contributions to the locally owned DoFs during “compress” operation or send them during “update ghost values” operation in an unpack/pack fashion. This way, memory for the vector can be allocated as a single large block. Global operations on the vectors then only operate on a vector view of the locally owned DoFs, whereas cell loops operate on a view including the owned and ghost DoFs, without a deep copy between the two states. By contrast, in the TPetra linear algebra one would need to create a fully distributed vector with one-to-one correspondence between DoFs and MPI ranks, and a vector with one-to-many relationship between DoFs and MPI ranks. The operation labeled “compress” corresponds to the “export” operation from the vector with one-to-many map to vector with one-to-one map, whereas the “update ghost values” corresponds to the “import” operation in the other direction. This “view” scenario obviates allocating two vectors and will be especially important for the sparse multivectors considered below. In order to avoid ambiguity in terms of the data in the ghosted arrays, the MPI-parallel vectors natively provided by deal.II are either in the state where it is allowed to write into the “ghost” DoFs followed by the “compress” operation, or when “update ghost values” is called and ghost DoFs contain values consistent with the owning process. In the latter case only read operation is allowed.

3.3. Sparse multivectors

Let us now introduce sparse FE multivectors. Consider a set of overlapping subdomains $\{\Omega_{\alpha}^{h}\}$ covering the domain $\Omega^{h}$ that satisfy the point-wise overlap condition

[TABLE]

In other words, not more than $M_{s}$ patches overlap at any point in the domain. We are interested in the case when

[TABLE]

where each sparse FE vector is typically non-zero only on a handful of MPI processes irrespective of the number $P$ of MPI procsses that the domain $\Omega^{h}$ is partitioned into.

As an example, consider three different support domains $\Omega_{\alpha}^{h}\in\Omega^{h}$ for such vectors as illustrated in Figure 3. We assume a-priori knowledge about the sparsity of the FE multivectors, which is usually based on proximity to a given set of points in space (e.g., location of atoms). In such applications it is either assumed or it can be shown that the solution can be represented with functions localized in space with exponential decay (Ismail-Beigi and Arias, 1999; Kohn, 1959).

Our implementation of sparse FE vectors is based on the blocked compressed sparse row (BCSR) format with 1D row-wise MPI partitioning. BCSR has been adopted for problems in quantum mechanics (Challacombe, 2000; Saravanan et al., 2003; Borštnik et al., 2014) with Gaussian bases. BCSR matrix can be considered as an extension of CSR matrices where each element is a dense matrix. Their sizes are specified by (independent) row and column blocking. As a result, operations like SpMM products will eventually perform dense matrix-matrix multiplications for non-zero blocks, typically using BLAS dgemm routines. For matrices with a large number of non-zeroes, such a storage format is expected to be more efficient and have a higher performance.

An important step of BCSR is the re-ordering of rows (DoFs in the FEM context) and columns (unknown vectors) of sparse matrices. To that end, Hilbert space filling curves (HSFC) are often adopted (Challacombe, 2000; Saravanan et al., 2003), which can be considered as a heuristic solution to the travelling salesman problem.

In order to perform the renumbering of locally owned DoFs we propose Algorithm 1. An illustrative example of this algorithm is given in Figures 4(a) and 4(b), producing a DoF numbering adopted in Figure 3. We emphasize that HSFC are not applied on the finesh mesh level (which could be constructed from adaptive mesh refinement (AMR)), but possibly on a coarser level $L$ . This gives additional control over the blocking of rows and combines well with algorithms that construct support domains for sparse multivectors also by starting from some coarse level $L$ . Although AMR is not considered in this manuscript, we are convinced that support domains (in terms of element patches) for sparse multivectors should be constructed based on a mesh without hanging nodes. In this case, AMR will lead to geometrically equivalent support (i.e., a patch of elements will cover exactly the same domain) and thus the variational property of the FE solution will be preserved. The enumeration of DoFs based on the cell order (see step 7 in Algorithm 1) is done using DoFRenumbering::cell_wise function of deal.II, which assigns new DoF indices based on the ordering of cells. DoFs are assigned by going through the vertex, line, surface, and volume DoFs of each cell. DoFs associated to several cells are numbered on the cell where they are encountered first. Based on this idea, we can construct a blocking of DoFs using the ordered cells as well. That is, each block consists of all DoFs in a group of active cells $N_{K}$ (see step 5 in Algorithm 1) that have not yet been assigned to other blocks. The resulting row blocking is shown in Figure 4(c).

Renumbering of vectors (columns) is also done using HSFC according to Algorithm 2. First we assign an MPI process to each column based on the location of the center of the localization domain within the MPI-partitioned mesh. Given that the MPI partitioning of the mesh is less likely to produce disjoint subdomains, this distribution approach is similar to the one adopted in (Bowler et al., 2001), where several spatially compact groups of atoms are assigned to a single MPI process. In order to speed up the search for a cell that contains the localization center (step 1 in Algorithm 2), we precompute the bounding box of all locally owned cells. This allows to skip the search for points $\boldsymbol{L}_{\alpha}$ that are outside of this box. In order to further speed up the search for a cell, an rtree provided by boost::index::rtree is employed through the GridTools::Cache class of deal.II. For $\boldsymbol{L}_{\alpha}$ that are located exactly at the interface between MPI domains of the mesh, we may find $K_{\alpha}$ it belongs to on different MPI processes. To circumvent this ambiguity we assign the process with the lower MPI rank as the owner. Note that the blocking of columns is typically physically motivated (e.g., orbitals belonging to the same atom can be grouped together). For the current study and the numerical examples below, we perform the blocking based on a pre-defined size $B$ (step 4 in Algorithm 2), but the algorithm is flexible in terms of the blocking parameters. Algorithm 2 implies that the MPI partitioning of the triangulation together with the location of the localization centers completely determine the MPI partitioning of the columns in the multivector. Note that the MPI partitioning of the solution vectors is not affected and rather only the projected matrices $\overline{\boldsymbol{M}}$ and $\overline{\boldsymbol{H}}$ . In the case of a large number of MPI processes and relatively few columns, some processes may not “own” any localization domains, but they may still contribute to projected matrices, unless a locally owned subdomain does not overlap with any support of sparse multivectors. An illustration of the column enumeration is given in Figure 4(c), where we assume two sparse vectors per localization domain from Figure 3.

3.4. Memory allocation and MPI communication

In this section we discuss the implementation of the BCSR matrix BlockCSRMatrix in terms of the specifics of the deal.II library. We emphasize that the ingredients are common building blocks of parallel finite element algorithms and could thus easily adopted in a different context as well. Non-empty dense blocks in the BCSR matrix are indexed using standard CSR indexing. In our implementation we separate the storage of matrix elements from its sparsity pattern. The sparsity pattern is implemented as a derived class from deal.II’s SparsityPatternBase. For each block row non-empty column block indices are stored in ascending order, which is crucial for the algorithms discussed below. Row and column blocks are stored using the BlockIndices class of deal.II, which has been equipped with a binary search using std::upper_bound when translating a global index to a block index and an index within the block. The memory to store the non-empty blocks of sparse FE multivectors is allocated as a single 64-bytes aligned array. Different from CSR, BCSR matrices require an additional data structure to map from the linear CSR index of the block sparsity to the beginning of the data array for this block, which we store using std::vector. Similar to the design of parallel vectors, briefly introduced in Section 3.2, BlockCSRMatrix adopts an ascending partitioning of the DoFs/rows with respect to the MPI rank of each process. Local to each MPI process, “ghost rows” are arranged according to the rank of the MPI owner process (see Figure 5(a)). The “update ghost values” and “compress” operations are implemented similar to dense vectors using non-blocking MPI communication. The key difference is the blocking of the ghost exchange within the matrix-free loops, i.e., the indices marked by squares on Figure 1(b)). Given that ghost rows are only relevant to the access pattern of current MPI process without influencing the global enumeration of the unknowns, we can block them arbitrarily. We implement a blocking by grouping the ghost rows within the block rows of the owning MPI process, see Figure 5(a) for an example. Algorithm 3 lists the key ingredients to achieve this numbering. Note that a similar strategy is applicable to the MPI partitioning of projected matrices $\overline{\boldsymbol{M}}$ and $\overline{\boldsymbol{H}}$ . In this case, however, in order to execute a local sparse matrix-matrix product we need to know not about single rows owned by other MPI processes, but about the complete row blocks.

3.5. Matrix-free operators with sparse multivectors

The evaluation of discretized mass and Hamiltonian operators is implemented by a matrix-free strategy. In this section we discuss data structures and algorithms to extend the algorithm described in Sec. 3.2 to sparse FE multivectors stored in BCSR format. Consider an element with DoFs as depicted in Figure 5(c). To identify the local degrees of freedom, we traverse the block-sparsity pattern for a given set of rows, see Figure 5(b). The local operator $\boldsymbol{H}_{K}$ must be evaluated for each column block that has a non-zero value for at least one DoF on this cell. The identification of the relevant columns is done as follows. We pre-compute the association from cell DoFs to row blocks and indices within the block and group this according to the block index. This information is stored in three vectors. The first std::vector<std::pair<uint, uint>> dof_indices stores the row index within the block and the index of this DoF within the cell for all locally owned cells, grouped by row blocks. With the reference to Figure 5(b) this vector will contain $\dots\{0,0\},\{1,1\},\{2,6\},\{1,2\},\{3,4\},\{0,3\},\{1,7\},\{0,5\},\{1,8\}\dots$ . The second std::vector<std::pair<uint, uint>> block_indices stores the row block number and the number of elements within this block for all cells, stored in an order consistent with the first one. With the reference to Figure 5(b) this vector will contain $\dots\{8,3\},\{0,2\},\{2,2\},\{3,2\}\dots$ . Finally, we need a third vector std::vector<std::pair<uint, uint>> which stores the CSR-like indices that link each cell $K$ to the beginning of the data in the other two vectors.

This cache-friendly data structure is used to implement the cell level read/write access to BCSR vectors in an auxiliary RowsBlockAccessor class. There are two main operations: initialization of the read/write access on a cell $K$ (see Algorithm 4) and advancement of iterators to the next non-empty column block (see Algorithm 5). Given the row-block numbers on a cell $K$ , we store iterators to the beginning of each block row in BCSR multivector (step 2 in Algorithm 4). Clearly not all iterators will point to the same column block (e.g. beginning iterator for row block $3$ in Figure 5(b) will point to the second column block) and some may even point to the end iterator for this row (e.g. row blocks [math] and $2$ in Figure5(b) are empty). Next we determine the lowest column block number among chosen row blocks and take it as the currently active block-column $C$ (step 6 in Algorithm 4). We then equip each row-block iterator with a flag that is ‘true’ if its column block is the same as currently active block-column and ‘false’ otherwise (step 8 in Algorithm 4).

Given that we store the sparsity of BCSR with strictly increasing block-column indices, the iteration through non-empty column blocks for a given set of block rows is implemented straightforwardly by advancing iterators until all iterators point to the “end” of this block row, see Algorithm 5. Note that incrementing such iterators is a relatively cheap operation: a pointer to the beginning of each dense block is moved using an auxiliary vector which stores an offset for linear CSR index.

The matrix-free framework in deal.II has originally been developed for single vectors (i.e., matrix-vector products) (Kronbichler and Kormann, 2012), utilizing SIMD vectorization by evaluating the operator on several cells at once via wrapper classes around intrinsics. While providing the best performance for low and moderate polynomial degrees $p\leq 10$ in the single-vector scenario (Kronbichler and Kormann, 2019), this setup is clearly not optimal in the context of sparse FE multivectors: when considering a batch of cells, the number of active row blocks will often be larger than on an individual cell, which implies unnecessary computations. A natural choice for the current application is to instead perform SIMD vectorization over columns within each column block. To achieve this goal, a row-major storage of dense blocks is desired. Furthermore, in order to perform aligned SIMD read/write we further pad column blocks to the size of a cache line (i.e., 64 bytes). This means that each row within each dense block is also aligned to cache boundaries. The rationale behind this choice is to minimize the cache line transfer when reading and writing data. As a result, we can read/write within each dense block using SIMD aligned intrinsic instructions. An additional benefit is that we avoid a special treatment of loop remainders when reading/writing data for each row using SIMD intrinsics. The user interface to each active row block can be conveniently combined with C++11 lambda functions and provided for the currently active block column $C$ by RowsBlockAccessor class via

⬇

RowsBlockAccessor::process_active_rows_vectorized([&](

const ArrayView<const std::pair<uint,uint>> &dof_view,

vectorized_pointer val,

const uint stride) {

//…

});

where vectorized_pointer is a pointer to the beginning of the dense block in terms of deal.II’s wrapper class around SIMD intrinsics, called VectorizedArray<Number>, dof_view is a view-like object (similar to std::span) to the dof_indices data structure described above that stores rows within the block and their index within the current cell; and stride is a column stride counted in SIMD-vectorized numbers. That is, the beginning of a row within the block can be obtained via val[dof_view[i].first * stride]. Algorithm 6 gives a complete overview of the matrix-free operator evaluation for sparse multivectors. Note that while applying the operator on each element, the row numbering and consequently rows within dense blocks do not have a particular pattern. Nonetheless, no software prefetching is used as the hardware is capable of hiding access latencies.

The evaluation of the FE operator on a sparse multivector will produce a result vector with more non-zero elements than there are in the source vector. This is the direct consequence of a fact that the element-wise finite element operator couples all DoF on a cell. Given a-priori knowledge about the support of the source vector, it is relatively straight forward to deduce the support for the destination vector, especially in the absence of hanging nodes constraints.

3.6. Sparse matrix-matrix multiplication

For completeness, this section presents the algorithms for “mmult” and “Tmmult” multiplications between BCSR matrices. The rationale is to document straightforward implementations including their limitations that we have adopted for the numerical examples below, rather than novel algorithms.

SpMM is a central operation for many fields, such as graph algorithms (Kepner and Gilbert, 2011), quantum mechanics calculations (Borštnik et al., 2014; Rubensson et al., 2006; Challacombe, 2000; Rudberg et al., 2018) and algebraic multigrid preconditioners (McCourt et al., 2013). Irregular memory access and poor data locality make SpMM a difficult kernel to optimize. Various algorithms have been proposed in the past (Buluc and Gilbert, 2008; Akbudak et al., 2018; McCourt et al., 2013; Borštnik et al., 2014; Patwary et al., 2015; Gustavson, 1978; Challacombe, 2000). In (McCourt et al., 2013) sparse vectors were compressed into full vectors using graph coloring and used in sparse matrix dense multivector product. A modification of the Canon algorithm for 2D MPI-partitioned matrices in doubly compressed sparse column format was proposed in (Buluc and Gilbert, 2008). BCSR matrices have been used for quantum mechanics calculations with atom-centered basis functions in (Challacombe, 2000; Saravanan et al., 2003; Borštnik et al., 2014).

In the present context, multiplication between tall and skinny sparse multivectors is necessary. The unknown solution vector $\boldsymbol{\Phi}$ employs a 1D row-wise MPI partitioning associated with the FE discretization of the domain $\Omega$ , as discussed above. Furthermore, the efficient implementation of the matrix-free operator evaluation with fast read/write access to each non-zero DoF on a cell motivated our choice of BCSR matrix format as opposed to quadtree representation of matrices (Bock et al., 2016; Bock and Challacombe, 2013) or hierarchical matrices (Rubensson et al., 2006, 2007; Rudberg et al., 2018). We note that projected matrices (denoted by $\overline{\{\bullet\}}$ in Section 2) are relatively small compared to the unknown solution vectors $\boldsymbol{\Phi}$ . This suggests adopting 1D row-wise MPI partitioning for these matrices as well.

A matrix-matrix product $\boldsymbol{C}=\boldsymbol{A}\boldsymbol{B}$ involves three nested loops. An outer loop runs over the rows of $\boldsymbol{A}$ , for each row we then loop over all the columns, and then we need to multiply each element with all the elements in that row in $\boldsymbol{B}$ , see Algorithm 7. Note that $\boldsymbol{A}$ and $\boldsymbol{C}$ are assumed to have the same row partitioning, whereas $\boldsymbol{B}$ should have ghost values consistent with the sparsity of $\boldsymbol{A}$ on this MPI process. That is, step 7 of Algorithm 7 may result in access outside of locally owned row blocks of $\boldsymbol{B}$ . Therefore ghost values of $\boldsymbol{B}$ have to be communicated before the evaluation of the product. Additionally we also allow for an inexact product, that is, when the sparsity of $\boldsymbol{C}$ is not enough to store the product and some entries are discarded. This is a typical scenario for OM presented in Section 2 as gradients of the functional $\boldsymbol{G}$ are more dense than the unknown multivectors $\boldsymbol{\Phi}$ .

Algorithm 8 presents our implementation of the transpose matrix-matrix multiplication. The product involves three nested loops in analogy to the non-transposed multiplication. Note that block rows of $\boldsymbol{C}$ are accessed in a non-uniform way, which requires synchronization at the end of the local products, a step we call (“compress”).

3.7. A summary of new functionality added to deal.II

In this section we present a list of classes and methods that are being added to the open source library deal.II

•

Utilities::inverse_Hilbert_space_filling_curve() assigns to each point Point<dim,double> in dim spacial dimensions an index std::arraystd::uint64_t,dim using the Hilbert space filling curve;

•

BlockCSRMatrix that implements BCSR matrix, as well as matrix-matrix products and MPI communication for “compress” and “update ghosts” operations;

•

RowsBlockAccessor an auxiliary class that provides read/write access to rows in BlockCSRMatrix during matrix-free FE operator evaluation;

•

various improvements and extensions to DynamicSparsityPattern and SparsityPattern classes, which we rely upon in implementing BCSR matrix and its iterators;

The algorithms developed here will be made publicly available as a part of the deal.II library.

4. Numerical results

As a benchmark problem, we look at carbon nanotubes of different lengths. Note that within this contribution we do not solve the DFT but only benchmark and study the required operations for the OM approach. Carbon nanotubes are chosen as a benchmark since these systems are suitable for OM methods and they are convenient for scaling studies in a strictly reproducible way. Nanotubes with $(10,0)$ chirality were generated by the TubeGen (Frey and Doren, 2011) tool and consist of $160$ , $320$ , $640$ , $1280$ , $2560$ , $5120$ , $10240$ atoms. We assign two sparse vectors to each Carbon atom. Each vector has support within the distance of $8$ Bohr from the corresponding atom (Fattebert and Bernholc, 2000). In order to have full control over the meshing, we choose a simple structured mesh and make sure that the number of elements (and therefore the number of DoFs) is proportional to the number of atoms. The sparsity of multivectors in terms of cell patches is constructed on a mesh level one coarser than the finest level based on the proximity of cell centers to each orbital. For each patch we can construct the sparsity in terms of DoFs by taking shape functions $N_{i}(\boldsymbol{x})$ that have support only within the patch (see Figure 3 for a 2D example). The same mesh level is used for the enumeration and blocking of the DoFs using HSFC. The blocking parameter $B$ for orbitals in Algorithm 2 is set to $8$ . Judging from the arithmetic complexity of the matrix-free approach (see Figure 2), one can argue that for scalar operators the method is especially attractive for degrees between two and six, where the cost per unknown is lowest. For this benchmark we choose quadratic and quartic polynomial bases. The smallest element size (in Bohr) for the two cases are $0.5625\times 0.5\times 0.5$ and $1.125\times 1\times 1$ , respectively. The length of the mesh for each nanotube is scaled proportionally to the number of atoms. As an example, Figure 6 shows the mesh for the $\rm C160$ nanotube used with quadratic FEs. The crossection of the mesh is $40\times 40$ square.

Unless noted otherwise, the numerical experiments are performed on a cluster with each node having two Xeon 2660v2 “Ivy Bridge” chips (10 cores per chip + SMT) running at 2.2 GHz (turbo boost disabled) with 25 MB Shared Cache per chip and 64 GB of RAM. The code has been run with

•

the Intel compiler version 18.03 with flags “-O3 -march=native”, Intel MPI version 18.03 and Intel MKL 2018.3; and

•

the GCC compiler version 8.2.0 with flags “-O3 -march=native”, OpenMPI version 3.1.3 and Intel MKL 2018.3.

The memory bandwidth and the peak performance for a single socket are 47.2 GBytes/s and 176 GFlop/s, respectively.

4.1. Comparison of sparse multivectors to dense column vectors

First we compare our implementation in the BCSR format to a setup using full column vectors for the smallest example with $160$ atoms run on a single socket (i.e. with 10 MPI processes). Table 1 reports various information for this setup. Clearly tall and skinny sparse multivectors result in dense blocks of the same character. Note that most of the column blocks are of size $8$ (as expected), which indicates that the matrix-free cell operator will require two SIMD swipes per column block on average (see Step 8 in Algorithm 6). Additionally the table reports the number of non-empty column blocks accumulated over all cells during application of the matrix-free operator to sparse multivector $\boldsymbol{\Phi}$ , see Step 6 of Algorithm 6. The variation in this number can be considered as a measure of work imbalance for this kernel.

Table 2 compares the run time and memory consumption of an implementation in the BCSR format versus a setup using full column vectors. Besides a reduction in memory of over $3\times$ (measured as the resident memory reported by Linux), the run times for all components are also considerably reduced. Clearly, this is most obvious for the matrix-matrix multiplications. However, also the matrix-multivector product is more than 3.5 times faster, verifying the efficiency of the implementation. For larger configurations with more atoms, the advantage of the BCSR-based implementation grows, and a setup with full column vector would soon become intractable.

4.2. Node-level performance

Next we analyze the node-level performance of our kernels for BCSR sparse FE vectors using the roofline model $p=\rm{min}(P,BI)$ , where $P$ is the peak arithmetic performance, $B$ is the peak main memory bandwidth and $I$ is the computational intensity. The sustained memory bandwidth (MBytes/s) and the number of floating point operations per second (MFLOP/s) are measured using the MEM DP group of the LIKWID (Treibig et al., 2010) hardware performance counter tool, version 4.3.3. The measurements are done on a single socket of the RRZE cluster by pinning all 10 MPI processes to this socket. The smallest example with $160$ atoms is considered. All results are reported for double precision numbers with explicit SIMD vectorization over 4 lanes organized according to Section 3.5. Walltime is reported as an average of each kernel executed ten times in a row.

Figure 7 shows results of the roofline model for the considered kernels for quadratic and quartic FE bases. The mesh for quadratic FEs is depicted in Figure 6, whereas the quartic mesh is coarser by one level, which results in the same number of total DoFs, see Table 1. Additionally we compare the matrix-free approach to cell-level dgemm operation using BLAS, labeled “FE op. (BCSR gemm)”, according to the recent open-source implementation of DFT using the deal.II library (Motamarri et al., 2019), as well as dense column vectors with filtering of the matrix-free FE operator based on support for each vector, labeled “FE op. (full filtered)”. In order to facilitate comparison we also report the wallclock time for the three flavors of the matrix-free operators in the roofline diagram.

The sparse matrix-matrix products (“mmult”, “Tmmult” and “Tr_tmmult”) are in the core bound regime. By profiling these kernels we observe that more than 90% of time is spent in the BLAS dgemm function, showing that no significant overhead is introduced by our implementation. For the matrix-vector products, the matrix-free operator evaluation is considerably faster than the cell-wise full matrices for the quartic basis. This is expected as there is a large reduction in arithmetic complexity for higher order elements by sum factorization, see Figure 2. However, both operators are relatively far away from the arithmetic peak performance. We also observe that a basic matrix-free evaluation with filtered column vectors, labeled “FE op. (full filtered)”, involves a lower arithmetic throughput as well as a lower computational intensity, albeit for a comparable run time to the BCSR matrix-free implementation. For this setup the memory for dense column vectors is allocated regardless of the underlying sparsity and thus the indirections in the index access are avoided for “FE op (full filtered)”. Apart from the filter and some associated additional memory traffic (which can be related to overhead in the index storage as well as eager prefetching of filtered-out vector data), this data point represents the best case for the single-vector matrix-free operation evaluation provided by the deal.II library. The increased arithmetic intensity confirms the beneficial properties of the column-oriented algorithm in the BCSR format in general and our implementation in particular for matrix-free computations on sparse multivectors.

When comparing results obtained by using different compilers, we observe that the GCC compiler gives 30-40% higher performance for the matrix-free FE operator ( $34.56$ vs $24.03$ GFLOP/s for quadratic FEs and $41.76$ vs $32.89$ for quartic FEs ). This aligns with our previous observations that the GCC compiler generates better machine code for the sum factorization algorithms implemented within deal.II (Kronbichler and Kormann, 2019, Sec. 3). Note that Intel’s MKL is used for the BLAS and LAPACK interface with both compilers, so the performance of the matrix-matrix products, as expected, is independent of the utilized compiler. GCC compiler leads to a higher a memory throughput (e.g., $10.86$ GB/s vs $7.38$ GB/s for quadratic FEs).

Details about the various stages of the BCSR matrix-free algorithm in “FE op. (BCSR MF-col)” of Figure 7 are listed in Figure 8 by a stack plot. Since adding timers inside the element loop (step 3 of the algorithm) can severely change the total runtime due to timer overheads, we measured the time indirectly as follows. Using a template parameter in C++, we selectively disable various stages of the algorithm. The following variations of Algorithm 6 are considered: (i) element loop without read/write operations (steps 9 and 11); (ii) element loop without evaluating local integrals (step 10); (iii) element loop (step 3); (iv) the complete algorithm without resetting destination vector to zero (step 1); (v) the complete algorithm; and (vi) the element loop without read/write operations and local integrals (step 9, 10 and 11)333We prevented the compiler from optimizing away the loop at step 8 using inline assembly language. .

From these measurements we can reliably deduce the wall-clock time for five key steps plotted in Figure 8: (i) resetting destination vector (step 1); (ii) MPI communication (steps 2 and 16); (iii) read from and write into BCSR vectors (steps 9 and 11); (iv) evaluation of local integrals (step 10); (v) cell loop and BCSR accessors overhead (steps 3-8, 13). The measurements show that for quadratic elements and the code compiled with Intel compiler 47.2% of the time is spent in the local evaluation of integrals, i.e., the product $\boldsymbol{H}_{K}\boldsymbol{U}_{K}$ done via quadrature with sum factorization, whereas 22.1% is spent reading from/writing to BCSR vectors. 21.3% of the time is spent in MPI communication. Note that this step also includes reading from and writing to auxiliary vectors in terms of pack/unpack operations within the non-blocking MPI communication. While this proportion might seem high, it is a consequence of the code optimizations in the other parts, in particular for the local integrals and the BCSR data access, see also (Kronbichler and Kormann, 2019) where a similar observation was made. We also emphasize that overlapping of communication and computation is not applicable here as the slowdown is in fact observed within the shared-memory region of a single socket where both the pack/unpack as well as the actual exchange between MPI contribute with similar shares. Most importantly, the loop overhead for BCSR multivectors only takes 3.6% of the walltime, which indicates that our implementation of row iterators for BCSR in RowsBlockAccessor adds little overhead to the overall algorithm. Results for quartic elements are largely similar. When comparing the measurements for quadratic FE obtained with the Intel compiler to those obtained by employing the GCC compiler (see Figure 8(b)), we observe that the local evaluation of integrals takes a smaller fraction of time (42.6%). Note that the BCSR loop overhead for this case is still remarkably low – only 4.8% of the walltime. Finally, we emphasize that the share of the “Loop” and “MPI” parts in Figure 8 is also affected by the slight load imbalance as some MPI ranks must wait for neighbors that still perform local computations in the ghost exchange.

In order to identify the influence of the sparse vectors on the sub-optimal performance, we plot two additional data points in the roofline diagram, plotted as dashed horizontal lines because the arithmetic intensity is partly outside of the picture. The lower line reports the performance of the matrix-free operator without reading from and writing into the sparse vectors (steps 9 and 11 in Algorithm 6). The measured actual matrix-free kernel is relatively close to this roof of $32.33$ GFLOP/s (Intel compiler). In other words, data movement for BCSR multivectors is shown to be efficient. The second line is obtained by additionally disabling the zero-out of the source vector and the MPI communication (steps 1,2 and 16 in Algorithm 6). Thus, this line represents the local cell operations for BCSR vectors without any data access except the quadrature point data. For quadratic elements its performance value is $45.54$ GFLOP/s with the Intel compiler, corresponding to one fourth of the peak performance. The minimum and maximum walltime for this stage of the algorithm are $0.22$ s and $0.38$ s, respectively. This work imbalance is caused by the variation in the total number of non-empty column blocks for all cells during application of the matrix-free operator, see the last row of Table 1. Therefore, future studies with a focus on better balancing between the cost of the MPI data exchange, the local integrals as well as the read/write operations could bring the overall algorithm somewhat closer to this arithmetic.

Matrix-free FE operators with BCSR multivectors potentially perform a different amount of work on each cell. Consequently it is not possible to fuse work over multiple FE cells. As an alternative we tried to pipeline work within the inner most loop in the matrix-free kernel (step 8 in Algorithm 6) for column blocks of sizes larger than or equal to 8 (SIMD vectorization is over 4 doubles). In particular one can read/write vector data within the a single step into two FEEvaluation objects, which are part of the Matrix-free framework in deal.II. However this had no observable (positive) influence on the performance, neither for quadratic nor for quartic FEs. Most likely because the second part of the row is already in the cache thanks to row-major layout and alignment of each dense row to cache boundaries.

4.3. Comparison between Intel Xeon Ivy Bridge and Intel Xeon Cascade Lake

As a next experiment, we evaluate the algorithm on a newer architecture, the Intel Xeon Gold 6230 (codenamed Cascade Lake) with $20$ cores per socket. The arithmetic peak performance is 1280 GFlop/s (turbo mode enabled, maximal AVX-512 frequency is 2.0 GHz) and the peak memory bandwidth is 140 GB/s (6 channels of DDR4-2933). Since 96 GB/s are measured for a stream copy benchmark, this number is used as a memory bandwidth limit on this architecture. The code is compiled with the GCC compiler version 8.1 with flags “-O3 -march=native”, OpenMPI version 3.1.4 and Intel MKL 2018.2. We run an example with $320$ atoms on 20 MPI processes, which results in very similar row and column block sizes to those considered in the previous section (compare Table 3 and Table 1). Figure 9 plots the achieved performance on the Cascade Lake architecture in terms of a roofline plot. The algorithm “FE op. (BCSR MF-col)” reaches an arithmetic throughput of $136.45$ GFlop/s for $p=2$ and $156.97$ GFlop/s for $p=4$ , as well as a memory throughput of $38.61$ GB/s for $p=2$ and $58.09$ GB/s for $p=4$ , respectively. Compared to the Ivy Bridge architecture, the run time per core on Cascade Lake is 40% faster (e.g. $0.22$ s versus $0.34$ s for $p=4$ ), which shows the gains from the newer microarchitecture with AVX-512 SIMD vectorization (8-wide) and fused multiply-add (FMA) instructions.

The roofline plot of Figure 9 shows that the two empirical rooflines, the lower one without the read/write operations into the sparse vectors and the higher one only involving the arithmetic work of sum factorization, are farther away from the achieved performance. This increasing gap can be explained by the fact that the in-core performance of the sum factorization kernels in the “w/o vec” part is almost four times faster per core (or eight times faster per socket). As a consequence, the cost of data access increases in importance and represents the main bottleneck on Cascade Lake. Especially the memory-intensive stage of zeroing the destination vector and the MPI communication contribute $43.6$ % of the run time for $p=4$ (comapred to $36$ % on Ivy Bridge with GCC compiler). Since the arithmetically intense local integrals of “w/o vec” can only be overlapped with the read/write access but not the zeroing of the vector and the MPI exchange, the overall memory throughput of $58.09$ GB/s ends up clearly below the theoretical value of 96 GB/s.

Taking also the algorithmic alternatives into account, the comparison between Ivy Bridge and Cascade Lake shows that the proposed algorithm is especially attractive on newer architectures with wider vectorization and higher Flop/Byte ratio.

4.4. Inter-node scaling

Finally, we show results for the inter-node scaling of the proposed algorithms. In this section, we present results for the setup with Intel compiler and Intel MPI. Even though the GCC compiler delivers slightly faster binaries on the node level, we found the combination GCC+OpenMPI to lead to inferior scaling results. We speculate that this is related to the fact that projected matrices (e.g. $\overline{\boldsymbol{H}}$ ) are too small, which makes very few MPI processes own row blocks (see Tables 1 and 3 for examples with ten and twenty MPI processes) and consequently results in many point-to-point MPI communication in flight during Tmmult “compress” stage, for which the Intel MPI library is more optimized than OpenMPI. Alternative libraries such as Intel MPI or MPICH combined with GCC were not available on the Emmy HPC cluster at the time of writing.

In our experiments, we record how the wallclock time of the required operations scales with respect to

(i)

the number of sparse vectors $M$ for a fixed number of processes $P$ , see Figure 10;

(ii)

the number processes $P$ for a fixed number of vectors $M$ , known as strong scaling, see Figure 11;

(iii)

the number of processes $P$ with the number of vectors $M$ proportional to $P$ , known as weak scaling, see Figure 12.

The results from Figures 10 and 12 demonstrate an optimal algorithmic complexity with a fixed work per column vector. Furthermore, the strong scaling results in Figure 11 illustrate that the optimal algorithmic complexity is combined with a good distribution of work and that no communication bottlenecks are present. The proposed column-based matrix-free operator evaluation shows very good scalability all the way to 1280 MPI ranks or around $2.5\cdot 10^{-2}$ seconds for a single operator evaluation on 640 vectors. This number can be compared to the scaling limit of around $2\cdot 10^{-4}$ seconds for an operator evaluation on a single vector with matrix-free algorithms (Kronbichler and Kormann, 2019). Thus, the proposed algorithms with block columns allow to overcome this limit and efficiently address large-scale atomic systems. The parallel speedup of the Tmmult and mmult operations is excellent for enough parallelism as well, but it levels off for very large sizes. This is expected given the more global communication nature along the long and skinny matrix dimensions.

Further improvements could probably be made by overlapping the MPI communication with computations both for the matrix-free FE operator as well as the matrix-matrix products. The matrix-free framework of deal.II supports ordering of cells in such a way that each MPI process first loops through cells that only contain locally owned DoFs and thus does not need to wait for “ghost” values from other MPI processes. In order to support such a setup, the MPI communication (i.e., “update ghost values” and “compress”) in the BCSR matrix structure has been implemented as a two-stage procedure, that involves initiation of the non-blocking point-to-point communication and waiting for it to complete. This shall allow us to overlap computation and communication in FE operator through deal.II’s matrix-free framework. We deem it possible to also hide MPI communication for matrix-matrix products in Algorithm 7. To that end BCSR row iterators could be extended so that one can iterate only over locally owned or ghost column blocks. Note that this is similar to the approach adopted in PETSc (Balay et al., 2015) for matrix-matrix multiplication, where each MPI process stores the locally owned part of a sparse matrix as two blocks: one square diagonal block with column indices corresponding to locally owned rows in the domain vector, and one block which stores all the other columns corresponding to ghosted entries. We leave investigation of these (and possibly other) aspects of MPI scaling to further studies. The results of this section should rather be taken as a proof of concept, that the BCSR storage format with 1D row-wise MPI partitioning is a promissing approach for the required operations on tall and skinny sparse multivectors with the FE matrix-free operators.

5. Summary

In this contribution we have proposed algorithms for sparse finite element multivectors suitable for matrix-free implementations using BCSR matrices with row-wise partitioning to utilize MPI and SIMD vectorization over columns. In addition to standard matrix-matrix products, we developed tailored algorithms and data structures to efficiently support the matrix-free evaluation of operators in the sparse vector context. To that end we proposed a way to integrate ghost DoFs into the BCSR structure and implemented the required MPI non-blocking communication. Our single-node performance studies demonstrate that the loop overhead due to BCSR format consumes less than 5% of the walltime, while enabling the study of big atomic systems in linear time. For higher polynomial degrees the BCSR matrix-free operator outperforms both the cell matrix based dgemm variant as well as column vectors based matrix-free counterparts. We were able to achieve around on fourth of the maximum arithmetic performance of a 10-core Intel Ivy Bridge processor and around an eighth of the arithmetic peak on a 20-core Intel Cascade Lake processor. The latter also saturates up to 60% of the available memory bandwidth. We have shown that the performance gap can be explained by the cost of data access in the matrix-free operator evaluation, such as the MPI communication or as simple operations as zeroing the destination vector. The BCSR-specific components within the matrix-free operator have been shown to only marginally increasing the run time, showing that the proposed algorithms are efficient. The inter-node scaling results confirm that the advocated BCSR storage format is a good candidate for efficient and scalable operations on sparse multivectors within the context of the FE method applied to problems in quantum mechanics.

In order to enable the large-scale finite element solution of DFT via the orbital minimization approach with BCSR multivectors, a few additional steps are necessary, such as the extension of the proposed algorithms for adaptive mesh refinement for which ideas from (Kronbichler and Kormann, 2012) can be used, a more transparent handling of column-based evaluation in terms of the metric terms within the deal.II library, as well as geometric multigrid preconditioners to improve the convergence of minimizers (Davydov et al., 2018). For the latter, efficient intergrid matrix-free transfer operators would need to be developed. This will be the focus of our future work in this direction. Eventually, performance of the proposed matrix-free kernels for BCSR setups could be improved by limiting message-passing communication of ghosts only between the nodes, and use shared memory concepts within the nodes.

Acknowledgements.

DD is grateful to Georg Haager (FAU), Jan Eitzinger(FAU) and Thomas Gruber (FAU) for fruitful discussions on node level performance measurements using LIKWID tool. DD acknowledge the financial support of the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG), grant DA 1664/2-1 and the Bayerisches Kompetenznetzwerk für Technisch-Wissenschaftliches Hoch- und Höchstleistungsrechnen (KONWIHR). MK was partly supported by the German Research Foundation (DFG) under the project ”High-order discontinuous Galerkin for the EXA-scale” (ExaDG) within the priority program ”Software for Exascale Computing” (SPPEXA), grant agreement no. KR4661/2-1.

Bibliography63

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1)
2Akbudak et al . (2018) Kadir Akbudak, Oguz Selvitopi, and Cevdet Aykanat. 2018. Partitioning models for scaling parallel sparse matrix-matrix multiplication. ACM Transactions on Parallel Computing (TOPC) 4, 3 (2018), 13.
3Alzetta et al . (2018) Giovanni Alzetta, Daniel Arndt, Wolfgang Bangerth, Vishal Boddu, Benjamin Brands, Denis Davydov, Rene Gassmöller, Timo Heister, Luca Heltai, Katharina Kormann, Martin Kronbichler, Matthias Maier, Jean-Paul Pelteret, Bruno Turcksin, and David Wells. 2018. The deal.II Library, Version 9.0. Journal of Numerical Mathematics 26, 4 (2018), 173–183. https://doi.org/10.1515/jnma-2018-0054 · doi ↗
4Anzt et al . (2015) Hartwig Anzt, Stanimire Tomov, and Jack Dongarra. 2015. Accelerating the LOBPCG Method on GP Us Using a Blocked Sparse Matrix Vector Product. In Proceedings of the Symposium on High Performance Computing (HPC ’15) . Society for Computer Simulation International, 75–82. http://dl.acm.org/citation.cfm?id=2872599.2872609
5Balay et al . (2015) Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman Mc Innes, Karl Rupp, Barry F. Smith, Stefano Zampini, and Hong Zhang. 2015. PET Sc Users Manual . Technical Report ANL-95/11 - Revision 3.6. Argonne National Laboratory.
6Bao et al . (2012) Gang Bao, Guanghui Hu, and Di Liu. 2012. Numerical Solution of the Kohn-Sham Equation by Finite Element Methods with an Adaptive Mesh Redistribution Technique. Journal of Scientific Computing 55, 2 (#sep# 2012), 372–391. https://doi.org/10.1007/s 10915-012-9636-1 · doi ↗
7Beck (2009) Thomas L. Beck. 2009. Real-Space and Multigrid Methods in Computational Chemistry . John Wiley & Sons, Inc., Hoboken, New Jersey, Chapter 5, 223–285. https://doi.org/10.1002/9780470399545.ch 5 · doi ↗
8Bock and Challacombe (2013) Nicolas Bock and Matt Challacombe. 2013. An optimized sparse approximate matrix multiply for matrices with decay. SIAM Journal on Scientific Computing 35, 1 (2013), C 72–C 98.