Fast electrostatic solvers for kinetic Monte Carlo simulations
William Robert Saunders, James Grant, Eike Hermann M\"uller, Ian, Thompson

TL;DR
This paper introduces a linear-scaling electrostatic solver based on the Fast Multipole Method for kinetic Monte Carlo simulations, significantly improving efficiency in modeling charged particle systems in energy materials.
Contribution
It adapts the Fast Multipole Method for KMC, enabling linear scaling electrostatic calculations and overcoming limitations of existing solvers.
Findings
Achieves linear computational complexity per KMC step.
Demonstrates high performance and scalability in simulations.
Provides a user-friendly Python interface for the solver.
Abstract
Kinetic Monte Carlo (KMC) is an important computational tool in physics and chemistry. In contrast to standard Monte Carlo, KMC permits the description of time dependent dynamical processes and is not restricted to systems in equilibrium. Recently KMC has been applied successfully in modelling of novel energy materials such as Lithium-ion batteries and solar cells. We consider general solid state systems which contain free, interacting particles which can hop between localised sites in the material. The KMC transition rates for those hops depend on the change in total potential energy of the system. For charged particles this requires the frequent calculation of electrostatic interactions, which is usually the bottleneck of the simulation. To avoid this issue and obtain results in reasonable times, many studies replace the long-range potential by a short range approximation. This,…
| machine | chip | sockets | cores per | cores per | MPI | OpenMP threads | |
| socket | node | ranks | per MPI rank | ||||
| Balena | Intel Ivy Bridge E5-2650v2 | 2 | 8 | 16 | 4 | 4 | |
| Intel Skylake Gold 6126 | 2 | 12 | 24 | 4 | 6 | ||
| Isambard | Cavium ThunderX2 | 2 | 32 | 64 | 8 | 8 | |
| CSD3 Peta4 | Intel Skylake Gold 6142 | 2 | 16 | 32 | 8 | 4 |
| Time per KMC step [s] | ||||||||
| CSD3 Peta4 | Isambard | Ivy Bridge | Skylake | |||||
| 1 | 7.43 | (100.0%) | 11.93 | (100.0%) | 21.33 | (100.0%) | 10.26 | (100.0%) |
| 2 | 3.72 | (99.9%) | 5.96 | (100.1%) | 10.55 | (101.1%) | 5.09 | (100.8%) |
| 4 | 1.65 | (112.8%) | 3.03 | (98.4%) | 5.36 | (99.5%) | 2.57 | (99.7%) |
| 8 | 0.94 | (99.1%) | 1.48 | (100.7%) | 2.70 | (98.7%) | 1.30 | (98.3%) |
| 16 | 0.51 | (91.5%) | 0.77 | (96.3%) | 1.33 | (100.2%) | ||
| 32 | 0.27 | (86.8%) | 0.42 | (88.3%) | 0.71 | (94.2%) | ||
| 64 | 0.13 | (89.3%) | 0.23 | (79.4%) | 0.39 | (86.3%) | ||
| 128 | 0.08 | (72.5%) | 0.14 | (65.3%) | 0.21 | (79.2%) | ||
| Time per KMC step [s] | ||||||||
| CSD3 Peta4 | Isambard | Ivy Bridge | Skylake | |||||
| 1 | 4.30 | (100.0%) | 6.24 | (100.0%) | 12.35 | (100.0%) | 6.65 | (100.0%) |
| 2 | 4.42 | (97.3%) | 7.01 | (89.1%) | 13.05 | (94.7%) | 7.59 | (87.6%) |
| 4 | 4.51 | (95.5%) | 8.76 | (71.2%) | 12.92 | (95.6%) | 8.52 | (78.0%) |
| 8 | 4.68 | (92.1%) | 6.25 | (99.9%) | 12.45 | (99.2%) | 7.55 | (88.1%) |
| 16 | 4.66 | (92.3%) | 7.05 | (88.5%) | 13.10 | (94.3%) | ||
| 32 | 4.56 | (94.3%) | 8.80 | (70.9%) | 12.99 | (95.1%) | ||
| 64 | 4.79 | (89.9%) | 6.31 | (98.9%) | 12.52 | (98.6%) | ||
| 128 | 4.52 | (95.2%) | 7.09 | (88.1%) | 13.24 | (93.3%) | ||
| doping | charges | Time per KMC step [s] | |
| % | () | Ivy Bridge | Skylake |
| 0.01 | 102 | 0.058 | |
| 0.05 | 510 | 0.073 | |
| 0.1 | 1020 | 0.096 | |
| 0.5 | 5103 | 0.68 | 0.12 |
| 1.0 | 10206 | 1.37 | 0.19 |
| 2.0 | 20412 | 2.74 | 0.35 |
| # particles | |||||||
| implementation | VV-step | FMM | VV-step | FMM | |||
| PPMD | (MPI only) | 3.03 | 2.91 | [96%] | 9.78 | 8.96 | [92%] |
| (MPI+OpenMP) | 3.96 | 3.78 | [95%] | 12.39 | 11.48 | [93%] | |
| ScalFMM | (OpenMP only) | 1.16 | 1.14 | [99%] | 4.00 | 3.93 | [98%] |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Fast electrostatic solvers for kinetic Monte Carlo simulations
William Robert Saunders
Department of Mathematical Sciences
Department of Physics
James Grant
Computing Services
Eike Hermann Müller
Department of Mathematical Sciences
Email: [email protected]
Ian Thompson
Department of Physics
Abstract
Kinetic Monte Carlo (KMC) is an important computational tool in theoretical physics and chemistry. In contrast to standard Monte Carlo, KMC permits the description of time dependent dynamical processes and is not restricted to systems in equilibrium. Compared to Molecular Dynamics, it allows simulations over significantly longer timescales. Recently KMC has been applied successfully in modelling of novel energy materials such as Lithium-ion batteries and organic/perovskite solar cells. Motivated by this, we consider general solid state systems which contain free, interacting particles which can hop between localised sites in the material. The KMC transition rates for those hops depend on the change in total potential energy of the system. For charged particles this requires the frequent calculation of electrostatic interactions, which is usually the bottleneck of the simulation. To avoid this issue and obtain results in reasonable times, many studies replace the long-range potential by a phenomenological short range approximation. This, however, leads to systematic errors and unphysical results. On the other hand standard electrostatic solvers such as Ewald summation or fast Poisson solvers are highly inefficient in the KMC setup or introduce uncontrollable systematic errors at high resolution.
In this paper we describe how the Fast Multipole Method by Greengard and Rokhlin can be adapted to overcome this issue by dramatically reducing computational costs. We exploit the fact that each update in the transition rate calculation corresponds to a single particle move and changes the configuration only by a small amount. This allows us to construct an algorithm which scales linearly in the number of charges for each KMC step, something which had not been deemed to be possible before.
We demonstrate the performance and parallel scalability of the method by implementing it in a performance portable software library, which was recently developed in our group. We describe the high-level Python interface of the code which makes it easy to adapt to specific use cases.
keywords:
kinetic Monte Carlo, electrostatics, Fast Multipole Method, parallel computing, Domain Specific Language
1 Introduction
The kinetic Monte Carlo (KMC) method [1, 2, 3, 4] was originally developed for the simulation of time dependent statistical processes in chemical reaction dynamics. More recently, the method has been applied in computational physics and chemistry to model processes on grain surfaces [5], in electrolytes [6] and organic devices [7, 8]. In contrast to standard Monte Carlo methods, such as the Metropolis Hastings algorithm [9, 10] for the simulation of Markov processes, KMC is not limited to systems in equilibrium. Instead, it allows the representation of dynamical processes in physical materials, while not being limited by the restrictive timestep constraints in Molecular Dynamics (MD) simulations which arise due to fast, but dynamically irrelevant, oscillations around semi-stable configurations.
Kinetic Monte Carlo for energy materials.
An important application of the KMC method, which is currently attracting significant interest, is the simulation of transport processes in energy materials. This includes solid state electrolytes such as Lithium-ion batteries [6] and semiconductors in full-device simulations [7, 8]. A particularly promising application are organic- and perovskite- based solar cells, which can achieve remarkable power efficiencies [11, 12]. All those applications require the modelling of dynamic transport processes in a three dimensional volume to predict material properties such charge mobilities and the current-voltage characteristics.
In addition to making direct physical predictions, KMC simulations are also important to adjust parameters in large-scale drift-diffusion models via upscaling. Continuum models of this type are widely used in industry for full-device simulations.
The dynamics of a wide class of systems can be described by hopping processes of particles between sites of a static background matrix. To resolve physically relevant macroscopic features, such as grain boundaries, the simulation domain has to be large and simulated systems have to contain particles and hopping sites. The hopping rates (commonly referred to as “propensities” in the KMC literature), which serve as an input to the KMC algorithm, depend on the total potential energy of the system. This energy is given by the sum of classical interaction potentials for all particle pairs.
Electrostatic interactions.
For charged particles the electrostatic contribution to the inter-particle potential is long-range. The Coulomb interactions between all particles need to be computed, resulting naively in an expensive calculation per potential hopping event. This computational complexity can be reduced to with Ewald summation [13] and with particle mesh methods [14, 15]. However, even if the Fast Multipole Method (FMM) [16, 17, 18] is used (in its standard form), the computational complexity per hop is still . Since potential hopping events have to be considered in each KMC step, the total cost per step is at least . Because of this quadratic growth in complexity, the electrostatic calculation is typically the bottleneck of the KMC simulation. As argued in [19], this appears to make large-scale KMC simulations with accurate electrostatics computationally infeasible in principle. To overcome this issue, the Coulomb potential is often replaced by an ad-hoc truncated short range interaction, see e.g. [6]. Since only the interactions with a small number of neighbours need to be calculated in this case, the cost of each potential hopping event is reduced to , resulting in a total cost of per KMC step. However, this truncation introduces uncontrollable systematic errors, which limit the predictive power of the model [20]. For example, the authors of [21] find that neglecting long range interactions when modelling protonic diffusion and conduction in doped perovskites changes the predicted diffusion coefficient by compared to the “correct” results obtained with the very expensive Ewald method. As we will discuss below, other approximation methods such as mapping the charges to a grid and solving the Poisson equation [22], possibly in lower dimensions [23, 24], also introduces uncontrollable systematic errors.
An algorithmically optimal Fast Multipole Method.
In this paper we introduce a modification of FMM for KMC. We show that this overcomes the fundamental issues described in [19]. With our modified FMM algorithm the cost per KMC step grows linearly in the number of charges (and not quadratically as claimed in [19]) and accurate electrostatic interactions can be included in large-scale KMC simulations. The key observation is that - since FMM describes the long-range contribution as a continuous field - the change in the electrostatic potential energy can be evaluated at a cost of (i.e. independent of the particle number) for each proposed hopping event. As there are potential hopping events per KMC step, the total computational complexity of the propensity calculation is . Updating the FMM field after one hop is accepted carries an additional cost of , resulting in a total computational complexity per KMC step which scales linearly in the number of charges.
While not the topic of this paper, we remark that it is also possible to improve the computational complexity of standard Monte Carlo (MC) by similar methods. As will be argued at the end of this paper, we believe that changes to the electrostatic energy for each individual attempted MC move can be calculated at a computational cost with a suitably modified version of FMM.
To simulate large physical systems, an efficient, parallel implementation of the algorithm is important to obtain meaningful results in a reasonable time. Easy integration into existing simulation packages and workflows can be achieved by providing a minimal yet flexible user-interface. With the recent diversification of the hardware landscape, the code should be performance portable and run on different chip architectures, including, for example, traditional CPUs and GPUs. The implementation described in this paper is based on the performance portable framework first introduced in [25]. By providing a Python interface and using code generation techniques, the code is fast, yet allows the user to express their algorithms at a high abstraction level.
For an idealised setup we find that our FMM-KMC algorithm can be used to carry out simulations with exact electrostatics on problems with charges in s per KMC step when running on a parallel computer with cores. In a physically realistic configuration the hopping processes of 20412 particles in a -NPD problem doped with F6TCNNQ at a concentration of could be simulated at a rate of per KMC step on a single 12-core Skylake CPU.
Structure.
This paper is organised as follows: After reviewing the key concepts of KMC and FMM in Section 2 we describe our adaptation of the FMM algorithm for KMC simulations in Section 3 and review related work in Section 4. An efficient implementation of our method based on the performance portable framework in [25] is described in Section 5, where we discuss the user-interface in detail. Numerical results which demonstrate the accuracy, computational efficiency and parallel scalability of the algorithm for idealised model systems and a physically relevant setup are presented in Section 6. Finally, we conclude and discuss possible future directions of our work in Section 7. Some more technical aspects are relegated to the appendices. The standard FMM algorithm is written down in Appendix A and the correction term for charge distributions with a non-vanishing dipole-moment is derived in Appendix B. An improved user interface for proposing moves, which is optimised for efficiency, is described in Appendix C. Previously, we reported on the performance of Ewald-based long range electrostatics in the same code base [26]. To complement this work we discuss the performance and scalability of the standard FMM implementation in Appendix D.
2 Review of Methods
To put our new algorithm into context and establish necessary notation, we first review the KMC method and the standard FMM algorithm.
2.1 Kinetic Monte Carlo
Molecular Dynamics (MD) and Monte Carlo (MC) are the standard computational tools for predicting the properties of physical and chemical systems from first principles (see e.g. [27, 28]). Typically it is assumed that the molecular constituents interact via phenomenological classical potentials; for charged systems this includes long range electrostatic interactions. While MC can be used to study systems in equilibrium by sampling from the steady state distribution, MD allows the simulation of dynamical processes such as time dependent charge propagation in batteries and solar cells. In solid state systems at moderate temperatures and pressures there are often two types of processes which occur at very different time scales: fast oscillations around local minima of the energy landscape, which are separated by large energy barriers, and much slower transitions between those minima. In a system with these properties MD is highly inefficient for extracting quantities such as charge mobilities and voltage characteristics. This is because the MD timestep needs to be small enough to resolve the fast oscillations, yet the trajectories have to be sufficiently long to include dynamically relevant transitions between local minima. In fact, the rapid oscillations do not contain interesting physical information on charge transport processes, and should be integrated out. Kinetic Monte Carlo (KMC) [1, 2, 3, 4] overcomes this problem by treating each of the local minima as an independent configuration or state (here and in the following indices are used to label states; particles are indexed with Roman letters and Greek letters are used for cells in the computational grid). The dynamics are approximated by probabilistic transitions between those configurations. The generated probabilistic trajectory is equivalent to snapshots of the full MD simulation at discrete times. For example, free charge carriers in a crystal at room temperature are bound to specific local sites, and the states correspond to particular distributions of the particles, such that every site is either empty or occupied. KMC assumes that there is a fixed rate for a configuration to transition from state to state in a given time, i.e. each transition is modelled as a Poisson process. The rates are known as “propensities” in the KMC literature. This is a valid approximation if we assume that - compared to the state-transition time scale - the fast oscillations around the local equilibria occur so rapidly that the particles “forget” their previous history, and the transition probability is the same at each point in time. The propensities are inputs to the KMC algorithm and it typically assumed that depends on the energy difference between states and . Since transitions between states are modelled by a Poisson process, the probability distribution function for the time of the first escape from the state is
[TABLE]
Here is the total number of states. This information is used to work out the physical time interval between subsequent snapshots. The following two steps occur at each transition between states:
Starting from state , pick a new state such that the the probability of transitioning from state to is proportional to . 2. 2.
Increment the current time by drawing from the distribution in Eq. (1). This can be achieved by choosing a uniform random number and setting the time increment .
Note that while the method is written down for a finite number of states in Algorithm 1, it also works for systems with an infinite number of configurations, if it is assumed that in each step of the algorithm only a finite number of other states can be reached. This is often a sensible assumption since particles can only hop to nearby sites.
For a particular problem the propensities are an input for the algorithm and need to be calculated, for example by working out the Boltzmann factors of different configurations. Crucially, this calculation requires knowledge of the change in system energy induced by the hop. Including the contribution of the electrostatic interaction to this energy difference is very expensive and requires efficient algorithms.
2.2 The Fast Multipole Method
To allow an in-depth understanding of the proposed new algorithm for electrostatic interactions in KMC, we first describe the classical Fast Multipole Method (FMM) introduced in [16], before discussing its adaptation in Section 3. For further technical details we refer the reader to the original literature [16, 17, 18]. In three dimensions the FMM algorithm uses a hierarchical grid with levels for the computational domain (which is assumed to be a equilateral cube of width ) such that the number of cells on each level is for . The number of cells on the finest level is , and typically is chosen such that there there are particles in each fine level cell. Each cell on level is subdivided into 8 child-cells on the next-finer level; conversely each cell on level has a unique parent cell. The Fast Multipole Algorithm now computes the electrostatic potential by splitting it into two contributions. First, the long range part is calculated by working out the multipole expansion of all charges in a fine level cell, followed by an upward- and downward traversal of the grid hierarchy (see Fig. 2). In the upward pass of the algorithm the multipole expansions around the centre of a cell are recursively combined and converted to multipole expansions around the centre of the parent cell, obtaining a single multipole expansion around the centre of the computational domain on the coarsest level . In the downward pass the multipole expansions on each level are transformed into local expansions around the centre of a cell. Those are then recursively combined into local expansions in the child cells. By only considering the contribution from multipole expansions in a fixed number of well-separated cells on each level, the contribution from distant charges are resolved at the appropriate level of accuracy, while including the contribution from closer charges in finer levels.
The -term multipole expansions and the local expansions which play a central role in the FMM algorithm can be expressed in terms of the spherical harmonics , i.e.
[TABLE]
where are spherical coordinates relative to a suitable origin.
Overall it can be shown [16, 17, 18] that at leading order the computational cost of the method is
[TABLE]
where is the order where the multipole and local expansions in Eq. (2) are truncated. To bound the error by some , the value of needs to be chosen such that .
The calculation of the local expansions on the finest level is written down explicitly in Algorithm 2 in Appendix A and requires the following definitions, which will be used in the subsequent discussion of the FMM method for KMC simulations:
the -term multipole expansion (see Eq. (2)) about the centre of cell on level that describes the potential induced by all charges contained in cell .
the -term local expansion (see Eq. (2)) about the centre of cell on level that describes the potential induced by all charges outside the cell and its 26 nearest neighbours.
the linear operator translating multipole moments to multipole moments around a different origin.
the linear operator converting multipole moments to coefficients of the local expansion.
the linear operator translating local expansion coefficients to local expansion coefficients around a different origin.
Finally, the short range contribution of the electrostatic potential is obtained by calculating the field generated by charges in neighbouring cells directly. Fig. 3 illustrates how the total calculation is split up into the long- and short-range contributions discussed above. The algorithm for calculating the total electrostatic energy from the local expansions on the finest level is given in Algorithm 3 in Appendix A.
2.2.1 Boundary conditions
So far we assumed free-space boundary conditions, i.e. we consider a finite charge distribution contained inside an unbounded physical domain. In this case the interaction list is empty on the two coarsest levels, and we can set or equivalently skip those two levels in the downward pass.
When simulating large physical systems, however, the computational domain is typically replicated in one or several space dimensions to avoid spurious surface effects from the finite computational domain. The FMM algorithm is readily modified to account for this, as we discuss in the following for two important cases. Note that care has to be taken if the system has a net charge or a non-zero dipole moment - in those cases the lowest-order sums over periodic copies is conditionally convergent and need to be fixed using physical conditions, see appendix 4.1 of [16] and Appendix B of this paper.
Periodic.
In the simplest case the computational domain is replicated periodically. To account for this, the operator has to be modified on the coarsest level. On this level the local expansion receives contributions from the multipole expansions in an infinite number of well-separated periodic copies. In [29] the contributions from those copies are summed with an Ewald-like method and it is shown that they can be accounted for by simply replacing the spherical harmonics which appear in the linear operator by an infinite sum . In other words, the local expansion on the coarsest level can be obtained from the multipole expansion on the same level through multiplication by a known linear operator: . The sum and the linear operator can be calculated once at the beginning of the simulation.
Dirichlet.
In some cases it is desirable to apply homogeneous Dirichlet boundary conditions on some or all boundaries of the computational domain. This allows the inclusion of an external electric field which is present for example in batteries. As discussed in Appendix 4.2 of [16], this can be achieved by adding appropriate virtual mirror charges, effectively replacing the infinite grid of periodic copies by suitably modified reflected and inverted copies of the primary charge distribution. Apart from adjusting the value of the sum this will require no further modifications of the algorithm. An alternative approach, which we pursue here (and which is also described in [20]), is to simply extend the computational domain with the first mirror charge image and replicating the extended domain periodically, as shown in Fig. 4. We then apply the above algorithm for periodic boundary conditions to this extended domain. In the physical applications we consider here the potential is fixed at the top and bottom of the domain, which is assumed to be periodic in the other two space dimensions. This will only lead to a potential loss of performance by at most a factor two from doubling the number of charges in the system.
2.2.2 Dipole correction
In the case of non-trivial boundary conditions, the contribution of the dipole terms to the infinite sum which determines the local expansion is conditionally convergent. As discussed in [16], the value of the sum needs to be fixed by physical considerations. For example, following section 4.1 of [16], one could require that for a configuration which consists of a pure dipole pointing in the direction the difference in the potential between the points and vanishes. This is not the case for the treatment in [29], which induces a constant electric field in the -direction in the presence of a dipole; physically this corresponds to a non-zero surface charge at infinity. In our calculations we choose to require , i.e. no surface charge at infinity. As explained in Appendix B, this can be achieved by adding a compensating external electric field where is the dipole moment in the simulation cell. In practice this amounts to modifying the local expansion coefficients defined in Eq. (2) on the coarsest level.
3 FMM for KMC
We now describe how the FMM algorithm can be used to compute electrostatic energy differences required for the calculation of propensities in KMC simulations. In contrast to a naive approach, where the classical FMM algorithm described in Section 2.2 is simply used as a black-box solver to calculate the charge distribution after each proposed move, here the multipole method is closely integrated with the KMC update. For simplicity, we initially consider free-space boundary conditions, before discussing the modifications which are required to adapt the algorithm to periodic- and Dirichlet- boundary conditions in Section 3.1. As we will show, using the correctly modified FMM results in a total computational complexity of per KMC step (the complexity is for non-trivial boundary conditions). In each step, the electrostatic calculation can be split into two parts:
- •
Propose moves: calculate the change in electrostatic energy for all potential moves (line 5 of Algorithm 1)
- •
Accept move: update the electrostatic potential, i.e. the local expansions , once a move has been accepted (line 10 of Algorithm 1).
We further assume that the standard FMM algorithm in Algorithm 2 has been used to calculate an expansion of the local field from long-range contributions before the first KMC step. Since the number of KMC moves is very large (compared to ), this startup cost can be safely neglected. In the following the calculation of electrostatic energy differences in the proposal stage and the update of FMM data structures are discussed separately.
Propose moves.
Let be the proposed new position of a particle with charge currently located at position in cell on the finest level of the FMM grid hierarchy. Further denote the 26 direct neighbours of a cell as and define . Assuming that the new position is in cell (which might be identical to ), the total change in the electrostatic energy due to the move is given by
[TABLE]
The first two brackets in Eq. (3) describe the difference in electrostatic energy of the particle at the new and old position, split into a long range part (given by the local expansions and ) and direct interactions with all particles in the cell which contains the particle and its direct neighbours. Note that the terms in the first bracket, which describes the potential energy after the move, contain the potential induced by the particle at position before the move. This contribution is contained either implicitly in , if the cells and are well separated, or included in the direct contribution if this is not the case, because one of the is identical to . Clearly this is incorrect since the particle has moved to in this proposal and is no longer at . This is fixed by removing the spurious self-interaction in the final term of Eq. (3).
Since the local expansion defined in Eq. (2) consists of terms, and each cell contains particles on average, the total cost per proposal is
[TABLE]
independent of the total number of charges.
Accept move.
Once a move has been accepted, the local expansion on the finest level has to be updated in all cells to account for this. Naively, this could be done by re-calculating the entire field using Algorithm 2, at a cost of . However, since the change in the charge distribution is only very small, the change in can be computed much more efficiently by subtracting the contribution of the charge at the original position and adding it back on at the new position . For this, loop over all cells on the finest level. If cell is well separated from cell (i.e. , add the local expansion around the centre of cell which is induced by a monopole with charge at position to , using a multipole-to-local translation . Similarly, if is well separated from , add the local expansion induced by a monopole of charge at the new position . Since the local expansion contains terms and the total number of cells on the finest level is , this reduces the cost of the accept step to
[TABLE]
For higher degrees this leads to significant savings of a factor relative to the naive re-calculation with Algorithm 2.
The above method is readily extended to non-trivial boundary conditions introduced in Section 2.2.1 as follows.
3.1 Boundary conditions
Periodic boundary conditions.
When periodic boundary conditions are applied, the simulation cell is surrounded by an infinite lattice of periodic images of the domain indexed by integer valued offsets (where corresponds to the primary image). This infinite lattice is split into two disjoint sets as shown in Fig. 5. The finite set
[TABLE]
contains the primary image and the surrounding 26 nearest neighbours and
[TABLE]
consists of all other periodic copies.
Due to linearity, the system energy is given by a contribution from periodic images in plus the contribution from images in . For a proposed move we consider the contribution to system energy from each of these two sets separately and sum them to obtain the total change in system energy.
We refer to the contribution to the system energy from images in as the far-field component and the contribution from images in as the near-field component. As before, let be the proposed new position of a particle with charge currently located at in cell on the finest level of the FMM grid hierarchy and assume that the new position is in cell .
Near-field component.
To compute the near-field component we simply extend the method described for free-space boundary conditions to include all images in ; instead of moving a single particle, we also move its 26 copies in when proposing a move.
While this could be achieved by working with all charges in the 27 cells in and employing free-space boundary conditions, in practice it is more efficient to work with the primary image only and implicitly include the 26 copies. To achieve this, first the FMM algorithm in Algorithm 2 is modified such that at the end of the downward pass a local expansion represents the potential induced by all charges in . This can be realised by setting at the beginning of the downward pass. Secondly, the interaction lists used in the downward pass are not truncated at the boundary of the primary domain and instead are wrapped around the boundary to account for the neighbours in . In a similar manner the neighbour cells of a boundary cell include cells over the boundary in . With these modifications the contribution to the system energy from charges in is
[TABLE]
where is the index of the fine-level cell which contains particle . The asterisk () on indicates that neighbours are wrapped periodically across the boundary of the computational domain, as discussed above. The factor is required to avoid double-counting in the total energy.
Consequently, the change in the near-field system energy for the proposed move is
[TABLE]
Eq. (6) is identical to Eq. (3) except for the final two terms. The penultimate term is a generalisation of the self-energy correction in Eq. (3), which also includes the periodic copies of . The final term accounts for the fact that all periodic copies of the particle are moved to new positions with , and those charges contribute to the energy of the particle at . The sum is independent of and it is readily evaluated to
[TABLE]
Far-field component.
We compute the potential induced by charges in by following the approach used for the standard FMM algorithm [29]. In the setup phase of the simulation the multipole expansion is computed directly from the initial configuration. The multipole coefficient is
[TABLE]
This is converted into the local expansion with expansion coefficients where is the operator introduced in Section 2.2.1. More specifically, the local expansion of the potential induced by the periodic images in is (c.f. Eq. (2))
[TABLE]
Hence the energy of charges interacting with a far-field which is expressed as the local multipole expansion in Eq. (7) is obtained by evaluating Eq. (7) at the particle positions , multiplying by and summing over all particles.
[TABLE]
The double sum in Eq. (9) is readily computed as a dot product in dimensions at a cost of . Now consider a proposed move of a particle with charge from to . To evaluate the far-field energy of the modified charge distribution we need to do two things: first, we need to add to the expression for in Eq. (9) and simultaneously subtract since the position of the particle has changed in the sum in Eq. (8). Secondly, the far-field multipole coefficients and hence the local expansion coefficients in Eq. (7) change, since we assume that this far-field is generated by the periodic copies of the cell in . This can be accounted for by adding a correction to the multipole expansion which describes a monopole of charge at and a monopole of charge at the old position . Those two modifications are achieved by setting
[TABLE]
and
[TABLE]
This gives the modified local expansion coefficients
[TABLE]
The new far-field contribution to the system energy for the proposed move is
[TABLE]
Total change in energy.
Combining the near-field and far-field components the change in system energy of a proposed move is given by
[TABLE]
The near-field component retains the computational complexity of the free-space case as the additional correction terms have a constant cost per proposal. For the far-field energy evaluation, the creation of and involves the computation of coefficients with complexity . The multipole-to-local operator can be applied as a matrix-vector product with computational complexity . Finally, evaluation of the far-field local expansion exhibits a computational complexity. Hence the overall cost of proposing a move is
[TABLE]
independent of the number of charges. This should be compared to the complexity for free-space boundary conditions given in Eq. (4).
Accept move.
For an accepted move the local expansions on the finest level are updated as in the free-space case. This operation has a computational complexity of . To account for the far-field component, we store the quantities and . Whenever a move is accepted, the values are updated according to Eqs. (10) and (11), i.e.
[TABLE]
This update requires the computation of expansion terms, and hence has an cost. For an accept operation in the fully periodic case the dominant cost is the update of the local expansions required for the near-field energy computation. We conclude that the total cost for accepting a move is
[TABLE]
which has the same computational complexity as the accept-step for free-space boundary conditions111To keep the interface in the code general and allow transitions to states which have not been previously proposed, the change in energy for the new state will be re-computed when accepting a move. This adds an additional cost, which can be safely neglected as long as . in Eq. (5).
We conclude that the total complexity of the algorithm is per KMC step even for non-trivial boundary conditions. The estimates in Eqs. (12) and (13) are confirmed numerically in Fig. 9 below.
4 Related work
To highlight the impact of the novel FMM-based algorithm presented in this paper, we review existing approaches for including electrostatic interactions in KMC simulations.
As argued in [20], most recent KMC studies truncate or neglect long range electrostatic interactions. The few exceptions reported in the literature typically include some results obtained with the Ewald method, which serves as a computationally expensive reference implementation to quantify systematic errors introduced by those approximations. For example, the authors of [21] do not include the effect of electrostatic interactions in most of their results for a KMC study in doped perovskites. This leads to a change in the protonic diffusion coefficient by , compared to the “correct” result obtained with Ewald summation. A systematic comparison of photovoltaic simulations with truncated potentials and the Ewald method is also presented in [20]. The authors find that introducing a cutoff potential underestimates the device performance and overestimates average charge carrier densities. The Ewald summation is optimised by precomputing the mutual potential between all pairs of charges. This reduces the cost of one proposal to . On the structured lattice which is used in [20] it requires the storage of terms, but for large systems with an irregular arrangement of the hopping sites the memory requirements would make the method infeasible. In [30] perovskite crystal growth is modelled with a KMC method which uses Ewald summation for long range electrostatic interactions. However, the setup is very different to the problem considered here, since the system is described by a series of growing stacks on a 2d surface.
Electrostatic interactions can also be included by mapping the charge distribution to a grid and solving the Poisson equation. Since a solve of this three-dimensional partial differential equation is expensive, a lower dimensional approximation is used in some cases to make the computation feasible. For example, in [23, 24] the one dimensional Poisson equation is solved for a layer averaged charge density, while including the electrostatic field of nearby charges and periodic mirror images exactly. Naturally, this approach introduces uncontrollable systematic errors.
As discussed in [22], including electrostatics by solving the three dimensional Poisson equation naively requires an expensive re-calculation of the potential for every potential hop since the charge has moved and its contribution can not be included in the current potential. To address this issue, the self-interaction error in the naive approach can be suppressed or removed by adding and subtracting the field of a single charge. While efficient methods (such as multigrid [31]) exist for solving the Poisson equation in time on a grid with cells (and it is reasonable to assume that the number of grid cells is at the same order as the number of lattice sites), discretising the Poisson equation is likely to lead to uncontrollable errors due to the peaked distribution of point particles, which can not be represented accurately on a grid. A massively parallel GPU implementation of a KMC method is described in [32]. While the authors still employ a cutoff for the electrostatic interactions, after each KMC step the local potential is updated by adding a dipole correction term, instead of recalculating the value of the field.
Existing FMM implementations and KMC libraries.
Not surprisingly, there is a plethora of existing FMM implementations for the standard method written down in Algorithms 2 and 3. Some of those, such as ScalFMM [33] and ExaFMM [34] are specifically designed for performance and massively parallel scalability; ExaFMM also targets GPU systems [35]. Similarly, a parallel FMM implementation for heterogeneous systems is described in [36]. Other actively developed parallel libraries are FMMlib3d [37], RECFMM [38] and DashMM [39]. Often those libraries can treat more general potentials, for example, ScalFMM is kernel-free and allows the user to implement their own interaction potentials. Since ScalFMM has been highly optimised for modern multicore processors, we quantify the absolute performance of our standard FMM implementation by comparing it to ScalFMM below. The PVFMM code [40] includes several improvements, for example an FFT-based multipole-to-local translation and performance optisation through threading and vectorisation. In FMM implementations the complexity of the multipole to local translation is of particular interest. In [18] Greengard et al. propose that, through a particular choice of the number of charges per cell on the finest level, all required multipole to local translations can be applied with a total cost that is . For a comparison of different established FMM codes see also [41]. However, as far as we are aware, none of the above FMM libraries have been used in KMC simulations. Conversely, existing KMC libraries such as DL_AKMC [42], SPPARKS [43] and KMCLib [44] do not include support for long range electrostatic interactions or rely on one of the approximations described above (see e.g. the study in [45], which uses the commercial Bumblebee library [46]).
5 Implementation and user interface
Algorithmically optimal methods such as the FMM-KMC scheme introduced in Section 3 have to be implemented efficiently on massively parallel hardware. With the recent diversification of the hardware landscape, it is equally important that the code has a simple, intuitive user interface and runs on different chip architectures, such as manycore CPUs and multithreaded GPUs. To ensure the reproducibility of our results and allow others to benefit from the library, we now describe the design of the code in some detail.
5.1 Performance portable framework
The FMM-KMC algorithms introduced in this paper were implemented on top of the performance portable framework described in [25], which we refer to as “PPMD” in the following. The PPMD code is freely available at
Our FMM-KMC implementation is provided in the coulomb_kmc Python package and can be downloaded from
https://github.com/ppmd/coulomb_kmc.
A recent snapshot of the code, which can be used to reproduce the results in the paper, is also provided as [47].
The overall design principle of PPMD is to provide a high-level Python user interface which is flexible enough to express fundamental looping mechanisms for interacting particles, while automatically generating highly efficient code on different hardware platforms. As discussed in Appendix C.1, using this Python interface also allows the easy and efficient implementation of the user-specific handling of the KMC data structures, such as masking forbidden moves based on the problem-dependent matrix of hopping sites.
Efficiency is achieved by using code generation techniques; once the user has expressed the fundamental interaction kernel as a short snippet of C-code, dedicated looping code which executes the entire particle- or particle-pair loop over all particles is generated. Depending in the platform, this might be realised with MPI, OpenMP or with CUDA on GPUs. High-level operations, such as the main time stepping loop, are implemented in Python and orchestrate the calls to the computationally expensive particle- or particle-pair loops. As shown in [25], the performance of the PPMD code is on a par with monolithic MD codes written in C and Fortran, such as LAMMPS [48] and DL_POLY [49, 50].
While the main application of PPMD is molecular dynamics, the PPMD interface and data structures are abstract enough to also implement KMC algorithms which are the topic of this paper. As described in [26], PPMD also includes a library for calculating electrostatic interactions via Ewald summation; this is very useful for testing the FMM algorithms developed in this paper.
Fundamental data structures and interfaces.
The most important data structure in PPMD is the ParticleDat class. This is a distributed storage space for data associated with individual particles in the simulation. Example properties which can be stored as ParticleDats are the positions, velocities and charges of all particles in the system. However, in principle any property, such as the number of local neighbours of a particle, could be stored in a ParticleDat. Under the hood, ParticleDats are realised as wrappers to numpy arrays. In addition, global properties shared by all particles, such as the total energy of the system, can be stored in GlobalArray or ScalarArray objects, where the latter is read-only when accessed from inside a kernel.
To manipulate data, the user writes a short kernel in C which is executed over all particles or all pairs of particles. A particle-pair loop is specified by this C-kernel and a list of all ParticleDats, ScalarArrays and GlobalArrays which are modified by the kernel. Access descriptors specify whether the data is read or written. Based on this specification of the particle-loop, the code generation system automatically generates efficient code for executing the kernel over all pairs of particles. Depending on the access descriptors, it also performs suitable communication calls to synchronise parallel data access and avoids write conflicts in threaded implementations.
To illustrate the idea, we show a short Python code snippet for naively calculating the total potential energy for particles interacting via a Coulomb potential
[TABLE]
in code listing 1. Note that the key operation
[TABLE]
which is executed for all particle pairs is encoded in the C-code stored in the string kernel_code. Positions and charges are stored in the ParticleDats r and q, while the total energy is stored in the GlobalArray U. Interested readers are referred to [25] for more details on PPMD.
5.2 FMM-KMC user interface
To implement the FMM algorithm for KMC simulations described in Section 3, we introduce a new KMCFMM Python class in the coulomb_kmc Python package. This class provides three key operations:
Initialisation of the FMM fields and calculation of the system’s full electrostatic energy at the start of the simulation 2. 2.
Evaluation of the electrostatic energy difference for all proposals 3. 3.
Acceptance of a move selected by the user
Note that the rest of the KMC algorithm, such as the selection of a move for acceptance based on the propensities and working out the set of allowed potential moves is still the responsibility of the user. Section C.1 discusses an example of how the latter can be realised efficiently with the PPMD data structures and parallel loops. Relegating high-level control over the algorithm to the user also allows the easy extension of the basic KMC method to more advanced techniques, such as multilevel methods, or the inclusion of post-processing steps to extract physically meaningful information.
The constructor of the KMCFMM class is passed the initial positions and charges of all particles, stored in ParticleDat objects. It allows the user to choose the parameters of the FMM solver, such as number of FMM levels and expansion terms. The initialise() method computes the system’s electrostatic energy for the initial positions of all charges based on the standard FMM algorithm described in Algorithms 2 and 3. The method also creates and initialises all data structures required for the subsequent proposed and accepted moves.
5.2.1 Proposing moves
The KMCFMM class provides two interfaces for calculating the change in system energy of proposed moves. The simple propose() method assumes that for a particle with index located at position and carrying charge , there is a finite set of potential new positions which this particle could move to. The potential moves of all particles are stored as a list of tuples of the form for , which are passed to the propose() method in the form
[TABLE]
The change in total electrostatic energy for the move with is denoted by . The changes in energy for all potential moves of particle are collected in the list
[TABLE]
The propose() method returns a tuple of arrays of electrostatic energy differences , where each is of the form described in Eq. (16).
To encode three-dimensional vectors, the proposed new positions are passed to the propose() method as a dimensional numpy array for each charge. The method returns a numpy array with the change in electrostatic energy for each proposed move. An example is shown in code listing 2. In this case particle 3 can move to two potential new positions, namely and , whereas there is only one potential hop to the new position for particle 7. For this setup the tuple is given by
[TABLE]
which should be compared to the Python code in Listing 2. After the call to propose(), the variable U will contain the corresponding energy differences as a tuple of numpy arrays in the format
[TABLE]
While the interface for proposing moves described here is intuitive, it is not optimal in terms of efficiency. The improved, but more sophisticated, propose_with_dats() interface is described in Appendix C, where we also illustrate its use for a practically relevant example.
5.2.2 Accepting moves
The KMCFMM instance provides a accept method that accepts a proposed move, updates the system energy and updates internal data structures. To accept a move, the accept() is called with a tuple consisting of a charge index and a new position . An example is given in code listing 3, which assumes that particle moves to the new position .
5.3 Parallelisation and optimisation
On distributed memory machines domain-decomposition is used to parallelise the KMC-FMM algorithm described in Section 3. For this the computational domain is divided between MPI ranks such that each rank “owns” the charges in its local subdomain . Proposals for particles owned by different MPI ranks are handled concurrently. is augmented by a halo region to obtain an extended local domain . It is assumed that the particles are evenly distributed and hops are limited by some maximal distance which determines the size of this halo and hence . Under those conditions, which seem reasonable for many physical systems, domain decomposition will result in good load balancing and – as will be described in the following – requires very little parallel communication. The results in Section 6.3 confirm the excellent parallel scaling on up to 128 nodes.
To store local- and multipole expansions and on all levels of the grid hierarchy, a distributed “Octal Tree” (OT) data structure is set up at the beginning of the simulation. As described in Section 5.3 of [51], data can be attached to the OT using different parallel access modes. This allows suitable halo-exchanges between neighbouring MPI ranks and ensures that all children of a particular coarse level cell are owned by the same rank during the upward- and downward pass of Algorithm 2. The local expansions are calculated for all cells on the finest level at the beginning of the simulation. The OT cells on the finest level are not necessarily aligned with the local subdomains . However, each MPI rank keeps copies of the local expansions for all fine level OT cells which cover its extended domain . It also maintains copies of all particle positions and charges in those cells.
When a potential move is proposed, calculating the change in energy in Eq. (3) for free-space boundary conditions requires the evaluation of the local expansion on the finest level at the old and new positions, and knowledge of particle positions in neighbouring cells. This data is available in local copies, provided the halo is chosen large enough. Upon accepting a move, ownership is transferred if the moved particle crosses a subdomain boundary. The local expansions are updated for all cells which are affected by this move. Note that neither evaluating the energy change for the proposals nor the update of at the end of a KMC step requires any parallel communication apart from sharing the details of the accepted move between all processors.
For non-trivial boundary conditions the near-field change in electrostatic energy given by Eq. (6) can be handled in the same way as just described for free-space boundary conditions, both when proposing and accepting a move. Calculating the change in the far-field contribution given by Eq. (8) requires an update to the expansion coefficients on the coarsest level and defined in Eq. (9). Identical copies of both and are stored on all MPI ranks. When a move is accepted, they are updated locally on each MPI rank by removing the contribution from the particle at and adding the contribution from the new position .
We use OpenMP as a shared memory programming method to distribute the proposed moves on each MPI rank over available cores. This reduces load imbalances due to variations in the number of proposed moves per particle. Furthermore, in this hybrid MPI+OpenMP approach the volumes of the subdomains handled by individual MPI ranks are larger than in an MPI only execution. In addition to improved load-balancing, this increases the ratio of subdomain volume to halo region volume, which reduces the computational work of accepting a move as fewer expansions and particles must be updated.
All performance critical operations which manipulate multipole and local expansions are implemented as auto-generated C code. This allows fixing the number of expansion coefficients at compile-time, which enables important optimisations such as loop-unrolling and auto-vectorisation. In particular, evaluation of the spherical harmonics at a particular position with recurrence relations (see e.g. [52, 53]) is carried out with a two-level loop nest. The bounds of the inner loop with index depend on the current iteration of the outer loop over which makes it virtually impossible to vectorise the code if the outer loop bound is only fixed at runtime. However, if is known at compile time, the loop can be unrolled to generate a long sequential list of updates, which contain the same algebraic operations and can are readily vectorised. In addition, when generating this code combinatorial factors which depend on the loop indices and can be pre-computed at compile time. This reduces the number of floating point operations and further improves performance.
Finally, note that any further book-keeping operations required in a KMC step, such as those described in Algorithm 4, are implemented as ParticleLoop and PairLoop constructs in PPMD. They are therefore automatically parallelised on any parallel architecture, as described in detail in [51, 25].
6 Results
The KMC simulations reported in this section were carried out on the CPU nodes of the “Balena” cluster at the University of Bath, with some additional weak- and strong- parallel scaling runs on the ARM-based “Isambard” supercomputer and the Intel-based CSD3 Peta4 computer. On Balena, Intel Ivy Bridge and Skylake nodes with 16 and 24 cores in total were used. On Isambard, one full Cavium ThunderX2 node contains 64 cores. Finally, on CSD3 Peta4 a full node contains 32 cores. The code was run in mixed MPI+OpenMP mode, the exact node configuration and process layout can be found in Tab. 1. This setup was used for all runs, unless explicitly stated otherwise below. All autogenerated code in PPMD was compiled with version 17.1.132 of the Intel compiler on Balena, using the same version of IntelMPI for distributed memory parallelisation. On Isambard GCC version 8.2.0 was used together with CRAY MPICH 7.7.6. Finally, on CSD3 Peta4, Intel compiler and IntelMPI version 2019.3 was used. Raw results and code snapshots are publicly available for reproducibility purposes in the archive at [47].
6.1 Error analysis
We first demonstrate that the errors in the FMM-KMC method can be systematically quantified and decrease as the number of multipole expansion coefficients increases. In Chapter 5 of [51] we measured the error on the total electrostatic system energy for the standard FMM algorithm and studied its dependency on . In a KMC simulation, however, the quantity of interest is the change in the system’s electrostatic energy for each proposed move. As confirmed by the results in Tab. 3, is typically several (two or more) orders of magnitude smaller than , which is proportional to the problem size. This places tighter bounds on the allowed numerical errors, which have to be small compared to the energy change for individual proposed moves.
To quantify the relative error on , we consider a system of constant density (0.01 charges per unit volume) and record the change in energy for proposed moves. As in [51], the charges are initially arranged in an almost regular cubic lattice with small random perturbations to create a setup which is representative of physically relevant systems. Each proposed move corresponds to an additional small arbitrary displacement from the initial positions. To explore a wide range of configurations, a new pseudo-random system was generated every proposals.
For each proposed move we calculate a highly accurate estimate of the “true” energy difference by using our KMC-FMM algorithm with 26 expansion terms. The corresponding energy difference computed with expansion terms is denoted as . Based on this we define the absolute relative error of move as
[TABLE]
and use this quantity to assess the accuracy of the simulation.
We initially keep the system size fixed at and vary the number of expansion terms from 12 to 21; the histograms in Fig. 6 show the number of samples with a given error for increasing . Inspecting the distribution plotted there with logarithmic scale on the horizontal axis, it can be seen that the error on is reduced by around one order of magnitude as is incremented by 3.
To further quantify the size of the error, let
[TABLE]
be the sample average over independent proposed moves for some quantity . We estimate the expected relative error and its variance as
[TABLE]
Corresponding estimates for the average absolute energy difference and the average absolute system energy are222Since the system is initialised at random, there is an equal probability of to be positive or negative, and it is therefore natural to consider its absolute value.
[TABLE]
The dependency of on the number of expansion terms is shown in Tab. 2, where we also we give the estimated variance, and other relevant quantities for charges and expansion terms.
We repeated the same experiment for fixed and varying problem sizes , the results are shown in Tab. 3.
As confirmed by those tables, on average the relative error defined in Eq. (17) is about three orders of magnitude smaller than the average change in energy itself. The total electrostatic energy of the system grows in proportion with the number of charges and is significantly larger than .
6.2 Computational complexity
Next we confirm that the runtime of the method grows in direct proportion to the number of charges . We consider a system with periodic boundary conditions. Recall that theoretically proposing a single move carries a cost of and accepting a move costs , resulting in a linear growth of the computational complexity per KMC step. This is demonstrated by varying the number of charges in a system (for a fixed number of expansion terms) and plotting the time for an individual proposal and for accepting a move (normalised to the number of charges) in Fig. 8. As this figure shows, both times are in the range of a few when running the code on 16 cores of an Intel Ivy Bridge node.
The sawtooth nature of the plot is an artifact of the varying number of FMM levels, which depends on the problem size as . The sharp drops on the right edge of the “teeth” correspond to an increase of by one, whereas all points on the left, shallow side of the “teeth” were obtained with the same value of , which becomes less optimal as the problem size grows.
Based on those numbers, we estimate the total time per KMC step as
[TABLE]
where is the estimated average number of proposed moves per charge and per KMC step. This value is motivated by the setup in Section 6.3, and of the same order of magnitude as observed values for the -NPD test case in Section 6.4. The time per KMC step is plotted Fig. 8, which confirms that our implementation indeed achieves the expected computational complexity.
The results imply that it is possible to simulate a system with one million charges in about per KMC step when running on a single 16 core Ivy Bridge node and limiting the relative error in the energy difference per proposed move to .
To demonstrate that the computational complexity depends polynomially on the number of expansion terms , we repeated the above experiment but fixed the number of charges at while varying between 2 and 30. The measured runtimes in Fig. 9 confirm that depends quadratically on the number of expansion terms, whereas is asymptotically proportional to the fourth power of .
For the following numerical experiments expansion terms were used.
6.3 Distributed Memory Parallelism
While the previous section shows that it is possible to carry out KMC simulations with one million charged particles in a reasonable time on a single Ivy Bridge node, physically meaningful results can often only be extracted from much larger systems. Setups with particles allow the resolution of grain boundaries and ultimately move closer to the simulation of full photovoltaic devices and batteries. Modelling systems of these sizes in reasonable times requires distributed memory parallelism to utilise multiple compute nodes in a HPC facility.
As discussed in Section 5.3, our implementation employs a hybrid MPI+OpenMP parallelisation strategy. To demonstrate the parallel scaling of a full KMC simulation we implemented the book-keeping algorithm in PPMD, employing the same techniques as in the example described in Appendix C.1. Recall that this uses the propose_with_dats() interface of our FMM-KMC implementation for optimal efficiency. The energy differences of all proposed moves are used to calculate the associated propensities by using a ParticleLoop. To choose a transition following steps 5-7 of the KMC Algorithm 1, the partial sums are computed on each MPI rank and combined across all processes using an MPI_Allgather operation. Finally, the chosen proposed move is accepted by all MPI ranks.
To investigate how effectively our implementation scales across multiple compute nodes we performed both weak- and strong- scaling experiments. In the strong scaling experiment the problem size is kept fixed while the number of compute nodes is increased, resulting in a reduction of the total runtime. Since the local problem size decreases and hence the ratio between communication and computation usually gets worse, strong scaling is typically harder to achieve. In contrast, for the weak scaling experiment the problem size is increased in proportion to the number of nodes, in other words the local problem size remains constant while the total global problem size grows.
In both cases we consider a cubic lattice with a spacing of in each direction such that one in 27 sites is occupied by a charged particle. We allow proposed moves to the 14 neighbouring lattice sites which are either a distance or away. In units of the lattice spacing the corresponding offset vectors are , , , …, , etc. and proportional to the lattice vectors of a fcc crystalline structure. Periodic boundary conditions are applied for the electrostatic field.
For the strong scaling experiments the global problem on a single node contains charges in total, corresponding to a cubic lattice with 300 points in each coordinate direction. On the largest node count (128 nodes on Isambard), this results in a local problem size with just under 7,800 charges per node. In the weak scaling experiment each node is responsible for charges on average. The largest problem which was simulated contains charges.
Both scaling experiments were carried out on the hardware listed in Tab. 1, comparing fully populated nodes on the different machines. For the strong scaling experiment the number of FMM levels was fixed at . For the weak scaling experiment the number of levels was chosen as where is tuned for each machine and depends on the relative cost of the direct interactions (lines 5-7 in Algorithm 2) and calculation of (Algorithm 3 and line 4 in Algorithm 2) on a particular hardware. On Isambard a value of was found to be optimal, whereas and turned out to give the best results on the Ivy Bridge and Skylake nodes of Balena respectively. Finally, for CSD3 we determined to be optimal.
The time per KMC step for this model system with particles running on nodes is given in Tab. 4 for different machines and also plotted in in Fig. 10 (left). As usual, the parallel efficiency for the strong scaling experiment is defined relative to one node:
[TABLE]
For the corresponding weak scaling experiment we fix the local problem size, i.e. the number of charges per node, to and increase the total number of charges in proportion to the number of nodes. Results for are given in Tab. 5 and Fig. 10 (right), where the parallel efficiency in this weak scaling experiment is defined as
[TABLE]
6.4 KMC Simulation of -NPD
We next investigate the performance of our KMC algorithm when applied to a physically realistic configuration. The system studied is -NPD doped with F6TCNNQ, a hole transporting organic semiconductor material [54]. The scientific results of the simulations will be discussed in a forthcoming publication [55] and here we concentrate on evaluating the runtimes for a given setup. The dopant molecules ionise and release a mobile charge carrier; this creates a fixed negative charge and a mobile positive charge. The kinetic Monte Carlo algorithm describes the hopping of the holes between different -NPD molecules, the hopping rates are described by Marcus theory and are functions of structural energy, temperature and molecular polarisation as well as electrostatic energy.
The KMC code used the propose_with_dats() interface and applied a modified version of Algorithm 4 for bookkeeping operations. This modified algorithm only updates the proposed positions and associated masks if charge is in the vicinity of the previously accepted move. This removes redundant bookkeeping operations at each KMC step. The -NPD simulations were performed with a range of applied voltages to investigate the charge mobility dependence on applied voltage. For this a constant external electric field in one direction is added to simulate a given applied potential difference across the domain. The electrostatic field induced by the charges is assumed to be periodic in all three dimensions and the charges were allowed to wrap around the simulation domain upon reaching the boundary333Note that while this was not done here, our code also allows enforcing Dirichlet boundary conditions at the top and bottom of the domain by using mirror charges, as described in Section 2.2.1.
To exploit additional parallelism between ensembles, multiple instances of the -NPD simulation were run simultaneously. Each instance was executed on either 4 Ivy Bridge cores (running 4 simulations in parallel on a full node) or 12 Skylake cores (2 simulations per node). The dependence of the time per KMC step on the number of charges in the system is shown in Tab. 6 and plotted in Fig. 11. For the largest studied system with 20412 charges, one KMC step takes 0.35s when run on a full Skylake socket with 12 cores.
6.5 Performance comparison to ScalFMM
Finally, we verify that our implementation in the PPMD framework is indeed efficient. There is currently no existing library which implements the bespoke FMM method for KMC simulations developed in this paper. A fair comparison to other KMC packages based on Ewald summation or approximations described in [23, 24] would have to be carried out at a fixed, problem dependent error tolerance and is therefore more appropriate for future application-driven studies. A meaningful intercomparison study might also be outright impossible since most of the methods in the literature introduce uncontrollable errors. To nevertheless assess the absolute performance of our implementation we compare the runtime of the standard FMM method in Algorithms 2 and 3 to the freely available ScalFMM library [56]. Given that no extensive attempts have been made to optimise our code, the aim of this comparison is to verify that the performance is in the same ballpark as a published reference implementation.
For this, we measured the time spent in one iteration of a Velocity Verlet integrator for a set of particles which interact via a Coulomb potential. Although the setup of the system differed slightly (the ScalFMM test case uses free-space boundary conditions and a pure repulsive potential, whereas our code was used to simulate a charge-neutral system with periodic boundary conditions and an additional short-range Lennard-Jones interaction), we believe this allows a sensible assessment of the relative performance of the codes. The results in Tab. 7 confirm that in both cases the runtime is dominated by the FMM algorithm. To allow a fair comparison, the parameters were tuned such that both simulations are carried out at comparable accuracy. The ScalFMM code uses the uniform Lagrange interpolation kernel, which depends on the number of terms. As argued in [56] and confirmed in by the ScalFMM developers [57], setting is expected to give a relative accuracy of around in energy and force calculations. As shown in [51], the same relative accuracy is achieved by using multipole expansion terms in our code. All runs were carried out on a single, fully utilised, Ivy Bridge node. While the ScalFMM code was run in OpenMP mode with 16 threads, results on both a pure MPI run and a mixed mode MPI/OpenMP configuration (2 MPI processes with 8 threads each) are reported for our code. Tab. 7 shows measured runtimes for and particles. For our implementation the optimal number of levels turned out to be for and for , whereas the ScalFMM code gave the best results if the equivalent octree depth was set to in both cases.
The results in Tab. 7 confirm that the performance of our code is of the same order of magnitude as the highly optimised ScalFMM library. In fact, the pure MPI implementation is only around slower, which is acceptable given that we have not yet considered further optimisation. Further results on the parallel scalability of our standard FMM implementation can be found in Appendix D.
7 Conclusions
In this paper we described how the Fast Multipole Method can be adapted to allow the efficient and accurate treatment of electrostatic interactions in kinetic Monte Carlo simulation. Although a recent publication [19] claimed that this would not be possible, we demonstrated that our algorithm scales linearly with the number of charges. This was confirmed numerically by measuring the time per KMC step for systems with up to charges. Running in parallel on cores we find s. We also presented results for a physically relevant -NPD test case with particles which could be simulated at a rate of per KMC step on a single 12-core Skylake CPU.
By facilitating the simulation of much larger systems with realistic electrostatics, our new library will allow step-changes in the KMC simulation of energy materials. While the focus of the paper is on the description of the algorithm, its implementation and parallel performance results, a forthcoming publication [55] discusses further results for physically relevant systems.
Our code provides an intuitive user interface and we showed that code generation techniques guarantee excellent scalability and performance on modern HPC installations. In principle the code is performance portable, and has so far been implemented for CPU chips, using hybrid MPI+OpenMP parallelisation. A GPU backend is currently being developed.
An interesting line of future work will be to combine our improved FMM algorithm with novel KMC approaches, such as multilevel KMC techniques. Those methods allow more efficient simulations by skipping physically less interesting transitions, such as “rattling” the frequent repeated hops between pairs of states.
By making suitable adaptations to the FMM algorithm, it will be also possible to reduce the cost of standard Monte Carlo (MC) simulations. Similar improvements have already been studied for Ewald summation [13, 27], for which the change in electrostatic energy per MC move can be calculated at a computational complexity of . This is because the overall cost of the Ewald-based energy calculation is made up by an iteration over all particles and a sum over reciprocal vectors (long-range contribution) and neighbouring particles (short-range contribution). If only particles move at each MC step, only a small number of the sums have to be evaluated. A similar approach is currently explored in the DL_MONTE code [58], though the implementation at present is . We believe that a suitably modified FMM algorithm will improve on this and limit the computational complexity per individual MC move to . The key idea is to store the local expansion on each level of the grid hierarchy, instead of accumulating it on the finest level in the downward sweep. While calculation of the electrostatic field requires evaluation of the in one cell on each of the levels, updating the field only requires changes to a constant number of cells per level. Overall, both operations can be carried out in time per MC update.
Acknowledgements
This research made use of the Balena High Performance Computing (HPC) Service at the University of Bath and the Isambard UK National Tier-2 HPC Service (http://gw4.ac.uk/isambard/). Isambard is operated by GW4 and the UK Met Office, and funded by EPSRC (EP/P020224/1). This work was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (www.csd3.cam.ac.uk), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements No 646176 and No 824158. William Saunders was funded by an EPSRC studentship during his PhD. We would like to thank Bérenger Bramas and Olivier Coulaud for help with running ScalFMM on our cluster.
Appendix A Standard FMM algorithm
For reference the standard FMM algorithm described in Section 2.2 is written down in Algorithms 2 and 3. In addition to the expansions and , this requires another -term local expansion about the centre of cell on level . describes the potential induced by all charges outside the parent of the cell and outside the 26 nearest neighbours of this parent. Algorithm 2 also uses the notion of an interaction list () of a particular cell. This is important to recursively include contributions from finer levels (see line 18 in Algorithm 2). For a cell on level the interaction list is the set of cells which are the children of the parent cell of and its nearest neighbours, but which are well separated from , i.e. not direct neighbours of on level . Explicitly, the interaction list is given as
[TABLE]
On a particular level the local representation of the operators for converting between multipole expansions between cells and is given by with corresponding notation for and .
As can be seen in Algorithm 3, FMM provides constant-time access to the local expansions (and nearest-neighbour interactions) on the finest level. This property is crucial for the construction of the algorithmically optimal FMM method for KMC simulations presented in this paper.
Appendix B Dipole correction
As discussed in Section 2.2.2, care has to be taken if the dipole moment of the charge distribution is non-zero for periodic- or Dirichlet boundary conditions. Here we derive the correction term which needs to be added to the electric field to enforce the zero surface-charge boundary condition at infinity. First observe that due to the nature of the multipole expansion and the linearity of electrostatics, it is sufficient to derive the term for one particular charge configuration with a given dipole moment. Assume that there is a surface charge density of on the top face of the domain, and an opposite density of at the opposite face (see Fig. 12).
This induces a dipole moment per unit volume of . Let be the potential which is induced by the primary cell and its 26 neighbours, i.e. the near-field contribution. The corresponding far-field contribution is denoted by . Calculation of at the point is straightforward, since we only need to compute the potential generated by two oppositely charged plates of size at a distance of from the origin. The fundamental solution of the Poisson equation and the superposition principle gives
[TABLE]
We want to calculate the electric field at the origin, which is given by Taylor-expanding the total potential
[TABLE]
(note that due to symmetry, for this particular setup there are no contributions which are linear in or ). Since the coefficient of the matrix introduced in [29] is zero, there is no contribution to from . This implies that can be obtained by taking the derivative of at the origin. The resulting surface integral is readily evaluated to obtain
[TABLE]
The same argument can be applied for dipoles pointing in the - and - direction; recalling that , this implies that the field of a vector-valued dipole density is given by
[TABLE]
Appendix C Improved interface for proposals
While the propose() method described in Section 5.2.1 is intuitive and easy to use, it is not optimal in terms of efficiency. This is because the tuple of proposed moves in Eq. (15) has to be converted to an internal data structure before the moves can be passed to our C implementation. To overcome this issue we provide an alternative interface which is more efficient, but requires additional work from the user since the corresponding propose_with_dats() method expects the data to be given in a particular, structured format: the proposed moves have to be encoded as a set of particle properties which are stored in ParticleDat instances. In contrast to the propose() interface, which can operate on a subset of particles, potential new positions have to be specified for all particles in the system. For this, the particles are separated into different types, such that a particle of type can potentially transition to new positions. Note that can be zero, and particles of this type are not able to move at all. The type of a particle could for example depend on the topology of the lattice or the local environment of the lattice site it currently occupies. In this case the type can change during the simulation; an example is described in Appendix C.1. The set is represented by a -dimensional ScalarArray. The types of all particles are stored in an integer-valued ParticleDat , such that is the type of particle , which can therefore hop to potential new sites. Let be the maximum number of moves at any site. An additional real-valued ParticleDat with components is used to store the locations of the potential destinations of all particles. For particle the entry contains the list of three dimensional vectors where any entries with are irrelevant and may contain arbitrary values. Depending on the local environment of a particle and other particles in its vicinity, particular hops might be blocked for a particular configuration. To avoid changing the type of those particles and shuffling around the entries of to account for this, certain transitions can be marked as “forbidden” by setting a flag in a separate ParticleDat with components. Each entry contains a list where the flags signals that the transition for particle is allowed. If this transition is forbidden and will not be considered in the calculation of energy changes for the proposed moves.
The ScalarArray and the ParticleDats , , and are passed as inputs to the propose_with_dats() method, which populates the ParticleDat of components with the energy changes of all proposed moves. More specifically, the entry for particle is the list such that contains the change in electrostatic energy for the move . Entries for moves which are marked as forbidden through the mask or for which are undefined.
An example of the ParticleDats , , and is shown in Fig. 13. In this case each particle can hop to between two and four sites, or not hop at all. The set which described the four different types of particles is therefore given by with .
Using the propose_with_dats() and data structures provided by PPMD allows the entire KMC implementation to be written in the looping operations of PPMD. Apart from guaranteeing the efficiency of code, the user never has to explicitly insert parallelisation calls. To illustrate how this can be done, we now show how the inputs to the propose_with_dats() method can be set up in PPMD for a particular use case.
C.1 Selection of allowed moves in PPMD
For this example we assume that the system consists of charges which can hop between the sites of a regular two-dimensional lattice with spacing embedded in three dimensional space
[TABLE]
Recall that the simulation domain is a box of width , and periodic boundary conditions are assumed for the electrostatic potential. However, we assume that the charges can not hop across the domain boundary. The total number of charges is assumed to be much smaller than the total number of sites and each site can be occupied by at most one charge. In this example we further assume that charges can only hop to directly neighbouring sites. Note that the number of sites a charge can hop to, i.e. its type, depends on whether it is in the interior of the domain or on the boundary, see Fig. 14.
When setting up the input for propose_with_dats(), the following points have to be taken into account:
- •
All charges need to be be assigned a type by setting the entries in the ParticleDat . This depends on the lattice site the particle currently occupies.
- •
The potential destinations for all particles have to be worked out in each KMC step by populating the ParticleDat .
- •
Potential hops to already occupied sites need to be masked by setting the entries of the ParticleDat ; again this has to be done in each KMC step.
The sites can be arranged into nine different categories (one interior, one for every outer edge/corner of the domain, see Fig. 14). Since particles sites in the interior of the domain have four direct neighbours, sites on the edges have four and sites in the corners only two, we set
[TABLE]
Further, for each site category we define an ordered set of integer offsets
[TABLE]
which describes the potential relative hops of a particle located at a site of this category in units of the lattice spacing,
[TABLE]
Now consider the charge with index , which is of type and currently located at position . Since the number of potential destinations is , this charge can potentially hop to any point in the set
[TABLE]
This can be implemented by updating the entries of the ParticleDat with a ParticleLoop in the PPMD code.
Finally, forbidden moves to already occupied sites have to be masked by setting appropriate flags in . This is done by considering all pairs of particles and setting the entry of to zero if the -th entry of is identical to the current position of the other particle in the pair, i.e. if . In PPMD this operation can be realised with a PairLoop.
The pseudocode in Algorithm 4 provides an overview of the PPMD implementation of the book-keeping operations discussed in this section. To set the types of the particles it is assumed that there is a function which returns the category of a lattice site located at .
Appendix D Parallel Performance of standard FMM
To complement the results in [26] and since our standard FMM implementation in itself might be of interest to others, we compare its performance and parallel scalability with the FFT accelerated Smooth Particle Mesh Ewald (SPME) approach in DL_POLY_4. Here we use a configuration which is based on the two ion NaCl “TEST01” [59] scenario from the DL_POLY test suite. The system is stabilised by adding a repulsive short range Lennard-Jones potential with a cutoff of . Due to this small cutoff the additional cost of the Lennard-Jones force calculation can be neglected in the reported runtimes. The initial configuration is a cubic lattice of alternating particle species with a lattice spacing of . To allow a fair comparison, for both implementations the parameters were adjusted such that both methods give comparable relative errors of on the total energy of the system; this required 10 expansion terms in the FMM implementation. For more details on the setup and quantification of errors see [51].
As for the results in Section 6.3, all runs were carried out on the Intel Ivy Bridge E5-2650v2 nodes of the “Balena” HPC facility. In contrast to the setup in Tab. 1, the FMM code was run in two modes:
- •
A pure MPI implementation, using distributed memory parallelism between the 16 cores of each node.
- •
A hybrid MPI+OpenMP mode with one MPI rank and 8 OpenMP threads per socket ( OpenMP threads per 16-core node).
DL_POLY was always run in pure MPI mode.
D.1 Strong scaling
To test the strong scalability we perform 200 Velocity Verlet integration steps of two systems containing and charged particles respectively.
The absolute runtimes in Fig. 15 (left) demonstrate that the performance of our FMM implementation is in the same ballpark as the SPME algorithm in the mature DL_POLY code, which is approximately faster. For larger core counts our FMM implementation does not exhibit unreasonable performance degradation; in fact it scales slightly better that the DL_POLY code. This is further quantified by plotting the strong parallel scalability calculated according to Eq. (20) in Fig. 15 (right). Running in MPI+OpenMP mode further increases parallel efficiency in the strong scaling limit. For smaller node counts, the efficiency of the hybrid approach is poorer than a for pure MPI setup. This is because OpenMP parallelisation introduces atomic operations not found in the distributed memory implementation. Those operations lead to reduced intra-node parallel efficiency, consistent with Amdahl’s Law [60].
D.2 Weak Scaling
In the corresponding weak scaling experiment the number of charges per node is kept fixed, while the total number of charges increases in proportion to the number of nodes . Since the computational complexity of the FMM algorithm is proportional to , we expect the time per FMM evaluation to be independent of . Fig. 16 (left) shows the time per Velocity Verlet step for total problem sizes between and . Due to memory inefficiencies in non-FMM related portions of code it was not possible to run the pure MPI implementation of the code on more than 32 nodes, and we only report results for the hybrid MPI+OpenMPI setup in this case. For each problem size the number of levels is adjusted to achieve optimal performance.
The parallel efficiency defined in Eq. (21) is plotted in Fig. 16 (right). As expected, the time per Velocity Verlet step grows only slowly as the number of processors increases.
We refer the interested reader to [51] for a further discussion of those results.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] W. Young, E. Elcock, Monte Carlo studies of vacancy migration in binary ordered alloys: I, Proceedings of the Physical Society 89 (3) (1966) 735. doi:10.1088/0370-1328/89/3/329 . · doi ↗
- 2[2] A. Bortz, M. Kalos, J. Lebowitz, A new algorithm for Monte Carlo simulation of Ising spin systems, Journal of Computational Physics 17 (1) (1975) 10 – 18. doi:10.1016/0021-9991(75)90060-1 . · doi ↗
- 3[3] D. Gillespie, A general method for numerically simulating the stochastic time evolution of coupled chemical reactions, Journal of Computational Physics 22 (4) (1976) 403–434. doi:10.1016/0021-9991(76)90041-3 . · doi ↗
- 4[4] D. Gillespie, Exact Stochastic Simulation of Coupled Chemical Reactions, Journal of Physical Chemistry 81 (25) (1977) 2340–2361. doi:10.1021/j 100540 a 008 . · doi ↗
- 5[5] H. M. Cuppen, L. J. Karssemeijer, T. Lamberts, The Kinetic Monte Carlo Method as a Way To Solve the Master Equation for Interstellar Grain Chemistry, Chemical Reviews 113 (12) (2013) 8840–8871. doi:10.1021/cr 400234 a . · doi ↗
- 6[6] B. J. Morgan, Lattice-geometry effects in garnet solid electrolytes: A lattice-gas Monte Carlo simulation study, Royal Society Open Science 4 (11) (2017) 170824. doi:10.1098/rsos.170824 . · doi ↗
- 7[7] C. Groves, Simulating charge transport in organic semiconductors and devices: A review, Reports on Progress in Physics 80 (2) (2016) 026502. doi:10.1088/1361-6633/80/2/026502 . · doi ↗
- 8[8] I. R. Thompson, M. K. Coe, A. B. Walker, M. Ricci, O. M. Roscioni, C. Zannoni, Microscopic origins of charge transport in triphenylene systems, Physical Review Materials 2 (6) (2018) 064601. doi:10.1103/Phys Rev Materials.2.064601 . · doi ↗
