LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi
William McDoniel (1), Markus H\"ohnerbach (1), Rodrigo Canales (1),, Ahmed E. Ismail (2), Paolo Bientinesi (2) ((1) RWTH Aachen University, (2), West Virginia University)

TL;DR
This paper presents optimized PPPM long-range solver implementation in LAMMPS for the second-generation Intel Xeon Phi, achieving significant kernel speedups and overall performance improvements for molecular dynamics simulations involving charged particles.
Contribution
The paper introduces a vectorized implementation of the PPPM solver in LAMMPS optimized for Xeon Phi, enabling faster simulations without retuning parameters.
Findings
Kernel speedups of up to 12x
Overall simulation speedups of 2-3x
Simplified parameter tuning process
Abstract
Molecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Our main target is the optimization of computational kernels by means of vectorization, and we observe speedups in these kernels of up to 12x. These improvements carry over to LAMMPS users, with overall speedups ranging between 2-3x, without requiring users to retune input parameters. Furthermore, our optimizations make it easier for users to determine optimal input parameters for attaining top performance.
| Version | mode | precompute | RMS error | Version | mode | RMS error | ||||
|---|---|---|---|---|---|---|---|---|---|---|
| ref | ik | 7Å | 7 | - | 0.0186 | ref | ad | 7Å | 7 | 0.0189 |
| opt | ik | 7Å | 7 | - | 0.0186 | ref | ik | 3Å | 7 | 0.5853 |
| opt | ik | 7Å | 7 | 500 points | 0.0313 | ref | ik | 5Å | 7 | 0.0124 |
| opt | ik | 7Å | 7 | 5000 points | 0.0188 | ref | ik | 7Å | 3 | 0.0197 |
| opt | ad | 7Å | 7 | 5000 points | 0.0188 | ref | ik | 7Å | 5 | 0.0194 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Algorithms and Data Compression
11institutetext: RWTH Aachen University, Aachen, Germany 52062
E-Mail: [email protected] 22institutetext: West Virginia University, Morgantown, USA 26506
Authors’ Instructions
LAMMPS’ PPPM Long-Range Solver
for the Second Generation Xeon Phi
William McDoniel 11
Markus Höhnerbach 11
Rodrigo Canales
Ahmed E. Ismail 1122
Paolo Bientinesi 11
Abstract
Molecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Our main target is the optimization of computational kernels by means of vectorization, and we observe speedups in these kernels of up to 12x. These improvements carry over to LAMMPS users, with overall speedups ranging between 2-3x, without requiring users to retune input parameters. Furthermore, our optimizations make it easier for users to determine optimal input parameters for attaining top performance.
1 Introduction
Molecular dynamics simulations are used to compute the evolution of systems of atoms in fields as diverse as biology, chemistry, and materials science. Such simulations target millions or billions of particles, are frequently run in parallel, and consume a sizable portion of supercomputers’ cycles. Since in principle each atom interacts with all the other atoms in the system, efficient methods to compute the pairwise forces are vital. The most widespread method for electrostatic interactions is the “Particle-Particle Particle-Mesh” (PPPM) method [1], which makes it possible to efficiently compute even the interactions between distant particles.
Due to its popularity, we target the open-source LAMMPS code [2], which offers the PPPM method. LAMMPS is a C++ code designed for large parallel simulations using MPI, and is written to be modular and extensible. LAMMPS can be compiled with a variety of packages that provide different implementations of key methods for the calculation of short-range and long-range interactions. For example, the USER-OMP package includes versions of methods such as PPPM which are specifically designed for shared-memory parallelism. In this paper, we extend the LAMMPS molecular dynamics simulator with a version of PPPM that is especially suitable for architectures with wide vector registers, such as the Xeon Phi. In the past, long-ranged solvers have been optimized for GPUs, with issues similar to those encountered with Xeon Phi accelerators [3, 4].
On these systems, one of the main routes towards high-performance is the exploitation of the wide (512-bit) vector registers. To this end, we create vectorized kernels for all the computational components that are not directly supported by highly optimized math libraries (e.g. FFTs). These routines account for between 20% and 80% of the time spent in PPPM. As such, their optimization leads to notable speedups in the overall performance of the simulation.
One challenge is that the innermost loops of said computational routines are very short, with trip-counts between 3 and 7. This is a common problem for vectorizing molecular dynamics even outside of PPPM. For example, it was encountered by Höhnerbach et al. in their multi-platform vectorization of the extremely short loops of the Tersoff potential [5]. It turns out that work can be saved elsewhere by increasing these trip counts, simultaneously enabling efficient vectorization. Similarly, work can be shifted away from poorly-scaling FFTs and into newly-optimized functions, and, within the optimized functions, memory bandwidth can be traded against additional computation.
In this paper, in addition to discussing vectorization techniques, we also provide insights into the parametrization of PPPM for performance. In particular, we consider three tunable parameters: the real-space cutoff, the interpolation order, and the differentiation mode. Many users will stick to the default choices where such exist, since these promise accurate and reasonably performant calculations. Others will have taken time to tune these parameters for their particular problems, but even expert users often make suboptimal choices that can up to double time-to-solution for a given desired accuracy, depending on the problem [6]. We achieve 2-3x speedups for a wide range of input parameters, and our optimizations also make the careful tuning of several parameters unnecessary by making particular options superior to the others for almost all cases.
The code presented in this paper is contributed to the USER-INTEL package of LAMMPS [7]. It has been shown that this package can not just yield impressive speedups on Intel architecture, but also improve the energy efficiency of the calculation [8].
2 Molecular Dynamics and PPPM
2.1 An algorithmic overview
The interaction between atoms in a molecular dynamics simulation is governed by a so-called potential function. For example, the Lennard-Jones () and the Coulombic () potentials are given by:
[TABLE]
For a given pair of atoms , the potential depends on the distance between them, , as well as their charges and (in the Coulombic case), or the parameters and (in the Lennard-Jones case), which describe the minimum of the potential function and its root. In order to obtain the forces on atoms, MD simulations can compute these potential functions for all pairs of atoms, but pairs have to be evaluated, and this quickly becomes infeasible.
A simple solution is to introduce a cutoff. One only considers interactions among atoms within a given cutoff radius of each other. Consequently, the number of pairs to be evaluated decreases to . Since all the potential functions (e.g., Eqn. 1) fall off with distance, the cutoff provides a reasonable strategy to approximate the total potential on atoms.
There are, however, numerous situations in which long-range interactions between atoms cannot be neglected, and instead have to be approximated numerically. A plain cutoff strategy does not work well for Coulomb interactions, which are relevant when a system contains charged particles or polar molecules, because the potential falls off only as . In contrast, the cutoff is perfectly fine for the Lennard-Jones potential, as long as the system is uniform.
In non-uniform problems, such as those featuring an interface, even Lennard-Jones interactions may need to be calculated using a long-ranged solver and can not be approximated [9]. In these cases, it is necessary to approximate these long-range interactions without explicitly computing pair-wise potential functions; for this task, Particle-Particle Particle-Mesh is often the method of choice. PPPM approximates long-range interactions in a periodic system by obtaining the potential of the entire system of atoms as a function of space, discretized to a grid [1]. While originally developed for electrostatics, the method was later adapted to the term of the Lennard-Jones potential [10].
In this work, we focus on PPPM for electrostatics, i.e., the Coulomb potential. PPPM uses an idea due to Ewald, and splits the potential into two components [11]. The first component, the “short-ranged” part of PPPM, contains the discontinuity due to the term, and a smooth screening term that limits the support to a small spherical region around a given atom; this component can be calculated directly between each atom and its neighbors in a certain cutoff radius . The second component is the “long-ranged” part of PPPM; due to its smooth nature, this can be solved accurately on a grid.
The efficient solution of the long-ranged component is the key ingredient of the PPPM method. Since we are operating with smooth quantities, the electrostatic potential is related to the charge distribution via Poisson’s equation
[TABLE]
From the electrical potential , one can compute the forces on all the atoms due to it. The forces on an atom with charge can be obtained from the gradient of the potential evaluated at the particle’s position:
[TABLE]
PPPM approximates these forces on each particle by proceeding in three steps:
First, particle charges are mapped to a grid using a stencil, obtaining a discretized form of the charge distribution . 2. 2.
Second, Poisson’s equation (Eqn. 2) is solved in order to obtain the potential . This is done by first taking the 3D Fourier transform of the charge distribution, as Poisson’s equation is easier to solve in reciprocal space, and then performing one or more inverse FFTs to obtain a result in real space. 3. 3.
Third, this result is mapped back to the atoms with the same stencil used when mapping charges.
The forces are obtained from the gradient of the potential, and this gradient can be taken in reciprocal or real space, determined by the user-specified differentiation mode. For ik differentiation, the gradient is calculated in reciprocal space, immediately after solving Poisson’s equation, and three inverse FFTs bring it back into real space, where its components are mapped to the atoms. For ad differentiation, one inverse FFT yields the scalar potential in real space, and this is mapped to the atoms using different sets of coefficients for each component of the gradient to be obtained.
Our optimizations focus especially on the mapping steps (steps 1 and 3). Step 2 is not as interesting for manual optimization since it is dominated by FFT calculations, for which highly optimized libraries exist. The mapping steps, on the contrary, are deeply nested loops performing calculations on data that is likely already in cache. We will show that optimizations, especially proper vectorization, will speed up these steps by at least a factor of four.
2.2 Related Work
Besides LAMMPS, many other popular molecular dynamics codes contain long-ranged solvers. Examples include, but are not limited to, Gromacs [12], DL_POLY [13], AMBER [14], Desmond [15], and NAMD [16]. These codes tend not to implement PPPM itself, in favor of related schemes such as PME [17], SPME [18], and -GSE [19]. The main differences with respect to PPPM lie in the function used to interpolate atom charges onto the grid and back, and in the corresponding Green’s function used to solve for the smooth part of the potential. There also exist schemes for long-ranged force evaluation that are not based on Fourier transforms, such as lattice Gaussian multigrid [20], Multilevel Summation [21], and -GSE [19].
2.3 Parametrization of PPPM
Since LAMMPS is used for a wide variety of problems, users have many choices about input parameters for the target physical system. Several of these parameters influence the accuracy and/or speed of the simulation, including the cutoff distance (), the prescribed error in forces relative to a reference (), the stencil size (), and the differentiation mode, ik or ad.
expresses the distance within which pair-wise interactions are computed directly, and outside of which the interactions are approximated using the PPPM grid; the short-ranged calculations scale with . The work done when computing FFTs is controlled by ; LAMMPS automatically determines the coarseness of the FFT grid to satisfy this accuracy constraint, depending on the values chosen for the other parameters. A stencil () causes writing to, and reading from, about 2.7 times as many grid cells compared to the default stencil. A higher-order stencil produces more accurate results, and LAMMPS takes this into account when deciding the resolution of the PPPM grid. Therefore, a higher-order stencil shifts work out of the FFT functions, and into the mapping functions. Users can also choose between the ik and ad differentiation modes described above, and LAMMPS again takes their different accuracies into account when setting up the FFT grid, with the ik mode yielding a slightly coarser grid.
Users will typically want to use a set of inputs that nearly minimize runtime, subject to an accuracy constraint. Unfortunately, short of trial-and-error for a specific problem it can be difficult to find a good set of parameters. In a recent work [6], Fabregat et al. developed a method for automatically searching the space of input parameters to find a good set, guided by cost and accuracy models; their case studies suggest that even expert users systematically underestimate the expense of PPPM: they invariably predicted lower-than-optimal cutoffs, which minimize the work done in computing pair interactions while forcing a finer FFT grid. The impact of stencil size was not considered, leaving the choice at LAMMPS’ default. In the next sections we demonstrate that an appropriate choice of stencil size is needed to achieve good vectorization.
2.4 Profiling
In order to investigate the effects of the input parameters on runtime, we execute our baseline on a single core of a KNL machine with a single thread. The system is an Intel Xeon Phi 7210 chip (64 cores and 16GB of HBM RAM) in quadrant and flat memory mode, connected to other nodes via OmniPath. Our software is based on the May 11, 2016 version of LAMMPS with the RIGID, USER-OMP and USER-INTEL packages enabled. It was compiled using the Intel C++ Compiler version 16.01 (build 20151021), and uses Intel MPI 5.0 (build 20150128). The reference runs use the code provided by the USER-OMP package, and our runs are based on code from USER-INTEL package running in mixed precision mode. Our benchmark is an SPC/E water simulation [22], a benchmark provided with LAMMPS. We modified it to have a cubic domain.
Since all the atoms in the system carry partial charges, the simulation uses PPPM to calculate forces. Unless otherwise specified, the default settings that we use are relative error , and short-range cutoff Å. The basecase contains 36,000 atoms, and will later be scaled up for more extensive benchmarks.
Fig. 1 shows timings as the cutoff, relative error, and differentiation mode vary. The vertical sections denote the time spent in FFTs (“PPPM FFT”), and in PPPM aside from FFTs (“PPPM non-FFT”), the time spent in the pair-wise short-ranged interactions (“Pair”), and everything else (“Other”).
For cutoff, there actually is a minimum of the runtime, i.e., reducing the cutoff will not reduce runtime beyond a certain point where the long-ranged part gets less efficient: The Å case spends a disproportionate amount of time in PPPM. The cutoff mostly impacts the “Pair” time—since it scales as —and the “PPPM FFT” time—since it forces the grid to grow or shrink.
For , there of course is no minimum—lower accuracy results in faster simulations—mostly due to less time spent in FFT calculations (i.e. smaller grids). Outliers in FFT performance can be attributed to pathological cases (in terms of size) of the FFT library.
In both panels of Fig. 1, the “Other” and the “PPPM non-FFT” sections are largely unaffected by changes in cutoff or relative error. In both, ad differentiation performs best (except for one outlier). For cutoff-optimal cases, the majority of the runtime is spent on long-ranged calculation, suggesting that optimization in that area might be quite fruitful.
3 Optimizations
The optimizations for the different stages of the algorithm are discussed here. In particular, we cover the functions that map atoms to grid points and grid values to atoms, the Poisson solver, and the routines responsible for the short-ranged contribution.
3.1 Mapping Functions
All three mapping functions—Map-Charge and both the ik and the ad versions of Distribute-Force—share the same structure: a loop over all atoms, the calculation of stencil coefficients, and then a loop over stencil points. Map-Charge multiplies the particle charge by the stencil coefficient and adds that value to a point on the grid. Distribute-Force proceeds in a slightly different way depending on the differentiation mode. The ik mode multiplies the grid values for each spatial dimension at each grid point by the corresponding stencil coefficient, then adds them to three totals, one for each dimension; after the loop over stencil points, these components are multiplied by the atom’s charge and a scaling factor to obtain force components. The ad mode multiplies the scalar potential at a grid point by three different stencil coefficients to obtain a vector, which is added on the atom; after the loop over stencil points, substantially more calculation than is required for ik differentiation transforms these totals into the components of the force vector.
The stencil coefficients are the product of three polynomials of order equal to the stencil size, one for each dimension. The iteration over stencil points consists of a triple loop (one for each dimension of the stencil). This represents the bulk (80%+) of the work, and accounts for almost all the memory accesses in the mapping functions. Map-Charge accesses only a single value at each grid point, but does very little computation. The ik mode of Distribute-Force uses three different values at each grid point. The ad mode uses only one value at each grid point, but performs more floating point operations. The arithmetic intensity of all these routines is relatively low, and memory access patterns will determine the best approach to vectorization.
Since the number of grid points is typically comparable to or smaller than the number of atoms, and stencil points are touched when looping over atoms, there is a great deal of data reuse. With so few calculations being performed on data which is almost always found in cache, managing vectorization overhead will prove to be vital. In general, we find that it is important to minimize the amount of data shuffling or masking required to prepare for vector operations; whenever possible, a full vector should be pulled from memory, operated on, and returned.
With an understanding of the structure of the mapping functions, we now walk through our process of optimizing each one, pointing out what worked and what did not. A summary of progressive speedups for each function is shown in Fig. 2.
Function Map-Charge
Rethread: To avoid race conditions when writing to the grid, the USER-OMP package has threads own disjoint chunks of the grid, and uses conditional statements in the innermost loop over stencil points. By giving threads disjoint sets of atoms and maintaining private copies of the grid—which are then summed together—we achieve a 2x speedup.
Vector: We vectorize the innermost loop over stencil points, which features unit stride memory accesses as it iterates through grid points. We target a new default stencil size of 7, instead of 5, to make better use of 256-bit vector registers. This implementation achieves another factor of speedup (“vector” implementation), which is significant but not close to the theoretical 7x we might hope for.
Simd8: Masking associated with the 7-iteration loop is a significant overheard. By explicitly setting the loop length to 8 and padding the stencil coefficient arrays with zeros, we avoid masking and obtain a total of x speedup over the re-threaded scalar version.
Precompute: Rather than evaluating polynomials to obtain the stencil coefficients for each atom every time step, we precompute 5000 values for each polynomial and refer to the nearest entry in this lookup table instead. This brings total speedup to over 12x of the baseline.
Function Distribute-Force (ik Differentiation)
Atom Simd: Since Distribute-Force performs reads from the grid rather than writes, the atom loop can be vectorized easily, yielding a x speedup. The gather operations required to read grid point values cause this to be a poor choice.
Inner Simd: Reproducing the inner loop vectorization from Map-Charge, setting the loop length to 8, produces a x speedup over the scalar implementation.
Repacking: Distribute-Force for ik differentiation uses three different grids with their own force components. By modifying the Poisson solver to instead output the x and y components interweaved, and the z component interweaved with 0s, the innermost loop can be extended to 16 iterations and the x and y components can be computed together by taking advantage of the 512-bit vector register on Xeon Phi. This provides an additional x speedup.
Precompute: As with Map-Charge, the polynomial evaluations to obtain stencil coefficients can be replaced with references to a lookup table, for a similar x additional speedup and a total speedup of x relative to the reference.
Function Distribute-Force (ad Differentiation)
Vector: Transferring over all of the optimizations from the ik mode of Distribute-Force, except the inapplicable repacking of the Poisson solver output, yields speedup below 3x relative to the reference. This is because the extra work after the loop over stencil points has become relatively expensive.
Split Atom: We split the loop over atoms in two. The first atom loop ends after the triple loop over stencil points, having summed weighted potentials into three arrays of length equal to the number of atoms. The second atom loop operates on these arrays to obtain force components, and can be vectorized as it contains no inner loops and has unit stride access to the weighted potential arrays. This brings the overall speedup to just above 4x.
3.2 Poisson Solver
The Poisson solver is a poorly-scaling, communication-intense function which performs 3D FFTs, solves Poisson’s equation in reciprocal space, and then performs a number of inverse 3D FFTs depending on the differentiation mode (3 for ik and 1 for ad). These 3D FFTs are performed in parallel as a series of 1D FFTs with communication steps in between. The FFT functions are from high-performance libraries (in our case MKL) and we do not attempt to optimize them. Our optimization of the solver comes from three ideas.
Shift Work: Switching to a stencil size of 7 creates more work in the mapping functions, but causes LAMMPS to choose a coarser grid resolution, requiring fewer calculations to perform the FFTs.
2D FFTs: The series of 1D FFTs is inefficient [23]. We replace it with a 2D FFT followed by a 1D FFT, and in the first communication step we ensure that planes of data are located on each MPI rank. This saves one communication step and is roughly () faster. Even for poorly load-balanced cases, where the number of necessary 2D FFTs is only slightly greater than the number of MPI ranks, it does not perform worse.
Adjust Grid Sizes: The FFT calls of Intel’s MKL library do not perform well for particular unfortunate values, which can catch users by surprise (compare time spent in FFTs across the cases in Fig. 1). A simple fix that catches many problem cases is to check whether the number FFT grid points in any dimension is a multiple of 16, and increase it by 1 if necessary. Users will now only rarely find that their simulations run substantially slower after making a tiny change to their input file, and, as an added bonus, these simulations will gain slightly improved accuracy.
3.3 Short-Ranged Interactions
To avoid shifting the bottleneck to the short-range calculation, it is desirable that it be vectorized. Mike Brown of Intel contributed code vectorizing the pair potential used in simulations containing electrostatic interactions (optionally with cut off Lennard-Jones interactions), where his strategy was to vectorize the loop over each atom’s neighbors. This achieves a x speedup (for example, compare the time spent in “Pair” between the reference and optimized versions in Fig. 3). We provide similar code compatible with the Buckingham potential, optimized for PPPM and USER-INTEL, and also versions of pair potentials compatible with PPPM for dispersion.
4 Results
We now present comparisons between the reference and optimized versions of LAMMPS using full simulations, profiling the code as in Fig. 1, to show how the various parts of the code contribute to total runtime. We also investigate the opaque way in which the user-facing knobs impact accuracy, and provide evidence that our optimizations do not sacrifice accuracy. The experiments were conducted on a single core, a full node, and multiple nodes. While the speedup is both problem dependent and parameter dependent, the optimized version is faster in every case simulated.
Because of our decision to target a new default stencil size of 7, it would not be fair to make like-to-like comparisons between the reference and our optimized versions. Further, LAMMPS’ input files do not even require an explicit choice of stencil size, so many users will just allow it to take on its default value. Fig. 3 compares the two versions as stencil order varies for our baseline test cases, using ik differentiation to demonstrate that the new value is faster for the optimized version. We simulate the standard 5Å case on a single core and a 64x scaled-up 7Å case on a full KNL node, which are nearly-optimal cutoff radii for each case. The trend in total runtime is expected: on both a single core and the full node the reference version is fastest with a stencil size of 5 while the new version is fastest with a value of 7. For all future cases presented, the reference code uses while ours uses .
4.1 Accuracy
Since the optimizations proposed involve both parameter-tuning and numerical approximations, we now verify that our code is as accurate as the reference. To this end, we compare to an Ewald summation run with a relative error of , and a cutoff of 10Å.
As seen in Table 1, without stencil coefficient precomputation, the optimized and reference versions obtain almost identical forces for both differentiation modes. 5000 precomputed stencil polynomial evaluations are sufficient to retain overall accuracy with our approximation. In addition, the optimized version conserves momentum (the sum of forces on all atoms remains nearly zero) and the macroscopic temperature difference between reference and optimized simulations after 100 time steps is always small (), and nearly zero without stencil precomputation.
Many users may not expect that their choice of cutoff can have a large effect on accuracy, and LAMMPS’ internal accuracy model does not do as good of a job with stencil size as it does with differentiation mode. After 100 time steps, the temperature is almost 10 degrees higher for a 3Å cutoff than for cutoffs greater than or equal to 4Å. In addition to speedup, our optimized version becomes slightly more accurate by moving to a stencil size of 7.
4.2 Single-Core Simulations
We first compare simulations using our optimized version to the reference cases we presented earlier in Fig. 1. Fig. 4 shows both versions as cutoff varies for ik and ad differentiation, respectively. As with the reference version, there is a runtime-optimal cutoff for the optimized version at 5Å where a balance is struck between the pair interactions and the FFTs. Total speedup at this optimal cutoff is 2.21x for ik and 2.75x for ad differentiation. With our optimizations, ad differentiation goes from being only marginally faster at the runtime-optimal cutoff to being 32% faster, making it a compelling choice even for serial simulations where the FFTs do not take up much time.
The calculation of long-range interactions, inclusive of the mapping functions, the FFTs, and various minor functions (PPPM FFT plus PPPM non-FFT), is sped up by a factor of 3.44x for ad differentiation. The calculation of the long-range interactions excluding the FFTs has actually sped up by a higher factor of 3.61x despite the larger stencil requiring looping over 2.74 times as many grid points. The calculation of pair interactions is sped up by about 2.5x. ad differentiation is now faster than ik differentiation for every cutoff, due to the smooth decrease in time spent performing FFTs as cutoff increases.
The relative penalty for choosing a poor cutoff has not changed much except for cases where an unfortunate number of FFT grid points was doubling the cost of FFTs. In general, an overestimate of the runtime-optimal cutoff is much less penalizing than an underestimate because the cost of the FFTs increases rapidly as cutoff decreases. Because the optimized long-range calculations are sped up by about as much as the optimized short-range calculations, users will find that pre-existing input files and intuitions about runtime-optimal cutoffs still yield good results.
Fig. 5 compares the optimized implementation to the reference as relative error varies. Speedups are between 2.1x and 2.77x for all cases, without an apparent pattern other than that ad differentiation has gained more from the optimizations than ik differentiation. There is not a clear optimal relative error, since users will want to adjust this parameter depending on how important accuracy is in the long-range calculation for their specific problems.
4.3 OpenMP and MPI Parallelism
With the additional complication of parallelism, we do not attempt to determine optimal choices of input parameters for our test case, though users will go through this complex process for their individual problems, often settling on a suboptimal set of parameters [6]. Here we just show that our optimized version is much faster than the reference for a range of cutoffs on a full KNL node, for varying numbers of cores on up to two full nodes, and for varying numbers of OpenMP threads per rank.
LAMMPS is intended to be scalable to very large numbers of cores, but this scalability is highly dependent on the details of the simulation. As the number of MPI ranks increases, the runtime-optimal input parameters change. Using just one set of input parameters might result in poor scalability (if the chosen set is optimal for small numbers of ranks) or good scalability (if the chosen set is optimal for a large number of ranks). As the number of ranks grows, FFTs and other functions requiring communication become relatively more expensive. This increases the runtime-optimal cutoff and can also make using a stencil size of 7 more efficient even for reference LAMMPS. Parallelism provides yet more knobs for users to consider. These include the number of MPI ranks per node and a number of OpenMP threads per rank. The optimal choice is again problem-dependent, but generally LAMMPS should be run with around 1 core per rank and 1-2 threads per core.
Fig. 6 contains results for running a proportionally scaled-up benchmark on an entire KNL node with all 64 of its cores. Now we present results for cutoffs from 4 to 9Å instead of 3 to 7 Å, since at 3Å the FFTs for both versions take much longer. For reference LAMMPS the runtime-optimal cutoff is now at 7Å. The optimized version is fastest at 6Å, although 7Å is only slightly slower. This set of simulations features the same number of atoms per core as Fig. 4, but its efficiency is reduced by parallelism overhead. For the single-core optimal cutoff of 5Å, this scaled-up simulation takes 2.5 times as long per atom with our optimized code. It still takes about twice as long even at the new optimal cutoff of 6Å. The reference version fares a little better, taking “only” twice as long at 5Å and 1.4 times as long at its new optimal cutoff of 7Å. If instead we compare the times required at the new runtime-optimal cutoffs to that required for the single-core optimal cutoff, the full node simulations take 1.8 and 2.1 times longer for the reference and optimized codes, respectively. Scalability aside, however, the same general patterns are apparent here as were seen earlier. Total speedup is about 2.4x for optimal cutoffs, lower than for the single-core case due to the relative increase in the expense of communication-intensive functions.
As it appears on the LAMMPS website, the SPC/E water benchmark we use here defaults to a cutoff of 9.8Å. This is of course far higher than the runtime-optimal cutoff on a single core—the simulation takes more than twice as long as at 5Å for reference LAMMPS and about twice as long for our optimized version. However, this exhibits much better scalability since runtime-optimal cutoffs are higher for higher core counts. This is because less time is spent performing poorly-scaling FFTs while more time is spent computing short-range pair interactions. Fig. 7a shows core-seconds taken to simulate a fixed-size problem as the number of cores used increases. There is one MPI rank per core and 1 thread per rank. For the 10Å case this scales well up to 32 cores, but for the full KNL node parallel efficiencies are 82% for reference LAMMPS and 63% for optimized LAMMPS. Running on 128 cores across two full nodes is very inefficient; the optimized version actually runs faster on one node, in part due to using an unfortunate number of FFT grid points, although it remains faster than the reference. Both versions see a comparable increase in core-seconds as communication costs rise, and this has a larger relative impact on the optimized version because it was faster to begin with. These observations are consistent with benchmarks published on the LAMMPS website, which exhibit large losses in parallel efficiency after about 16 processors for a variety of systems when running fixed-size benchmarks.
More commonly, users will simulate large problems on large numbers of cores. Fig. 7b shows core-seconds per atom as problem size and core count both vary, such that there are 36k atoms per core in each simulation. Parallel efficiencies on a full KNL node are now 85% and 71% for the reference and optimized versions, respectively, and 79% and 60% for two full nodes. Again we see the optimized version scaling less well because the rise in communication costs with core count is roughly the same for both versions, but it remains 2-3x faster over the entire range.
Users can also make use of OpenMP parallelism, by either assigning multiple cores to each MPI rank or using multiple threads per physical core, or both. Fig. 8 shows profiles for the same 64x-scale water test case being simulated on a full KNL node, where the number of MPI ranks and OpenMP threads per rank is varied. We use a cutoff of 6Å, as this was close to the runtime-optimal cutoff for this case on the full node when using 64 MPI ranks and 1 thread per rank. Best results are obtained when using one MPI rank per core, which is expected when not running on many nodes—the behavior on two full nodes is similar. Slight performance gain is obtained by using two OpenMP threads per core, which helps a little when computing the short-range interactions. The optimized version behaves similarly to the reference, and is at least twice as fast except when using too few MPI ranks.
5 Conclusion
Efficient vectorization proved to be key to attaining significant speedups over reference LAMMPS. For the PPPM functions, we tested several approaches and found memory access patterns to be particularly important. However, because the contiguous memory accesses were to be found in loops over stencil points, the stencil size limited vectorization efficiency. At the same time, as other parts of the code were optimized, the FFTs became relatively more expensive. And from the beginning we were concerned with users having difficulty choosing an optimal stencil size.
All of these problems turn out to have the same solution. Targeting a higher default stencil size allowed whole rows of a larger stencil to be computed at once, enabling efficient vectorization. Work shifted away from the FFTs and into newly-optimized functions when LAMMPS automatically adjusted the FFT grid to preserve accuracy. And users who do not test a variety of stencil sizes are no longer missing out on potential performance, because is optimal for every case and can be made the default. The relatively more expensive FFTs also made another previously-hard choice much easier, as now ad differentiation is significantly faster than ik differentiation due to its requiring only half as many FFTs. Although not discussed here, most of our optimizations are applicable to 256-bit vector registers and yield significant speedup on Xeon architectures, and similar speedup is also observed for different types of physical problems, such as an interfacial system where half of the domain is a vacuum.
LAMMPS is an extremely flexible program that allows and requires users to make numerous choices when simulating their different physical problems, and our optimized code is a significant improvement over reference LAMMPS, regardless of a user’s particular needs, for simulations which make use of the PPPM method for electrostatics. We achieve 2-3x speedup across a wide range of cutoff radii, for different accuracy requirements of the long-range solver, for both differentiation modes, and for different approaches to parallelization.
Many of these choices have a large impact on performance and even on simulation accuracy, often in ways that are not intuitive and not transparent to users as they try to work out how best to approach their problems. Some, like the choice of stencil size, are sufficiently obscure that many users likely use the default value, some without even knowing that they even had a choice. Other users will have gone to great lengths to set up their simulations in the best possible way, and will have made nearly-optimal choices for their specific problems. Our optimizations are particularly helpful to the first group because several of the user-facing knobs now have clearly best settings for a range of problem sizes, and these settings can be clearly communicated without much qualification as to which cases they work for, or they can even be made the default selections. Users with long experience and carefully-crafted input files will benefit from significant speedup for their existing set of inputs and can also expect that the optimal inputs for the new version are close to what they were already using.
Acknowledgments
The authors gratefully acknowledge financial support from the Deutsche Forschungsgemeinschaft (German Research Association) through grant GSC 111, and from Intel Corporation via the Intel Parallel Computing Center initiative.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] R. W. Hockney and J. W. Eastwood, Computer simulation using particles . Bristol: Hilger, 1988.
- 2[2] S. Plimpton, “Fast parallel algorithms for short-range molecular dynamics,” Journal of Computational Physics , vol. 117, no. 1, pp. 1 – 19, 1995.
- 3[3] W. M. Brown, A. Kohlmeyer, S. J. Plimpton, and A. N. Tharrington, “Implementing molecular dynamics on hybrid high performance computers – particle–particle particle-mesh,” Computer Physics Communications , vol. 183, no. 3, pp. 449 – 459, 2012.
- 4[4] M. J. Harvey and G. De Fabritiis, “An implementation of the Smooth Particle Mesh Ewald method on GPU hardware,” Journal of Chemical Theory and Computation , vol. 5, no. 9, pp. 2371–2377, 2009. [Online]. Available: http://dx.doi.org/10.1021/ct 900275 y
- 5[5] M. Höhnerbach, A. E. Ismail, and P. Bientinesi, “The vectorization of the tersoff multi-body potential: An exercise in performance portability,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , ser. SC ’16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 7:1–7:13.
- 6[6] D. Fabregat-Traver, A. E. Ismail, and P. Bientinesi, “Accelerating scientific codes by performance and accuracy modeling,” Co RR , vol. abs/1608.04694, 2016. [Online]. Available: http://arxiv.org/abs/1608.04694
- 7[7] W. M. Brown, J.-M. Y. Carrillo, N. Gavhane, F. M. Thakkar, and S. J. Plimpton, “Optimizing legacy molecular dynamics software with directive-based offload,” Computer Physics Communications , vol. 195, pp. 95 – 101, 2015.
- 8[8] W. M. Brown, A. Semin, M. Hebenstreit, S. Khvostov, K. Raman, and S. J. Plimpton, “Increasing molecular dynamics simulation rates with an 8-fold increase in electrical power efficiency,” in Proceedings of the 2016 ACM/IEEE Conference on Supercomputing , ser. SC ’16. New York, NY, USA: IEEE, 2016.
