TL;DR
This paper introduces a fast spectral discontinuous Galerkin method for solving the multi-species full Boltzmann equation efficiently on GPUs, significantly outperforming traditional stochastic methods in speed.
Contribution
The paper presents a novel GPU-accelerated DGFS implementation for the full Boltzmann equation, achieving high parallel efficiency and speedup over existing stochastic approaches.
Findings
Solves the Boltzmann equation in minutes, two orders faster than DSMC.
Achieves parallel efficiency of 0.96--0.99 on 36 GPUs.
Demonstrates effective weak/strong scaling performance.
Abstract
When the molecules of a gaseous system are far apart, say in microscale gas flows where the surface to volume ratio is high and hence the surface forces dominant, the molecule-surface interactions lead to the formation of a local thermodynamically non-equilibrium region extending few mean free paths from the surface. The dynamics of such systems is accurately described by Boltzmann equation. However, the multi-dimensional nature of Boltzmann equation presents a huge computational challenge. With the recent mathematical developments and the advent of petascale, the dynamics of full Boltzmann equation is now tractable. We present an implementation of the recently introduced multi-species discontinuous Galerkin fast spectral (DGFS) method for solving full Boltzmann on streaming multi-processors. The present implementation solves the inhomogeneous Boltzmann equation in span of few minutes,…
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4| 16 | 4 | 0.00039 | 3.27e-03 | 3.27e-03 | 0.00050 | 1.77e-03 | 1.77e-03 | 0.00039 | 4.80e-03 | 1.22e-03 | 0.00050 | 3.63e-03 | 2.47e-04 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 0.00056 | 3.73e-03 | 3.73e-03 | 0.00085 | 2.00e-03 | 2.00e-03 | 0.00050 | 4.96e-03 | 1.33e-03 | 0.00093 | 3.62e-03 | 2.42e-04 | |
| 16 | 0.00084 | 3.73e-03 | 3.73e-03 | 0.00147 | 2.00e-03 | 2.00e-03 | 0.00084 | 4.96e-03 | 1.33e-03 | 0.00148 | 3.62e-03 | 2.42e-04 | |
| 24 | 6 | 0.00068 | 1.37e-04 | 1.37e-04 | 0.00114 | 1.01e-04 | 1.01e-04 | 0.00068 | 1.81e-03 | 2.06e-02 | 0.00114 | 1.79e-03 | 5.15e-03 |
| 12 | 0.00117 | 1.49e-04 | 1.49e-04 | 0.00209 | 9.64e-05 | 9.64e-05 | 0.00114 | 2.12e-03 | 1.87e-02 | 0.00210 | 2.13e-03 | 6.01e-03 | |
| 24 | 0.00210 | 1.49e-04 | 1.49e-04 | 0.00401 | 9.64e-05 | 9.64e-05 | 0.00210 | 2.12e-03 | 1.87e-02 | 0.00401 | 2.13e-03 | 6.01e-03 | |
| 32 | 8 | 0.00159 | 3.04e-05 | 3.04e-05 | 0.00287 | 2.51e-05 | 2.51e-05 | 0.00157 | 1.54e-04 | 1.62e-02 | 0.00286 | 1.52e-04 | 1.13e-02 |
| 16 | 0.00286 | 3.17e-05 | 3.17e-05 | 0.00541 | 2.45e-05 | 2.45e-05 | 0.00286 | 5.91e-05 | 1.69e-02 | 0.00542 | 5.87e-05 | 1.03e-02 | |
| 32 | 0.00543 | 3.17e-05 | 3.17e-05 | 0.01057 | 2.45e-05 | 2.45e-05 | 0.00542 | 5.91e-05 | 1.69e-02 | 0.01059 | 5.87e-05 | 1.03e-02 | |
| 40 | 10 | 0.00328 | 1.38e-06 | 1.38e-06 | 0.00626 | 1.26e-06 | 1.26e-06 | 0.00326 | 5.53e-05 | 4.31e-03 | 0.00626 | 5.56e-05 | 4.35e-03 |
| 20 | 0.00625 | 9.35e-07 | 9.35e-07 | 0.01226 | 8.10e-07 | 8.10e-07 | 0.00626 | 5.01e-05 | 4.29e-03 | 0.01222 | 4.97e-05 | 4.54e-03 | |
| 40 | 0.01227 | 9.35e-07 | 9.35e-07 | 0.02446 | 8.10e-07 | 8.10e-07 | 0.01219 | 5.01e-05 | 4.29e-03 | 0.02431 | 4.97e-05 | 4.54e-03 | |
| 48 | 12 | 0.00656 | 1.04e-07 | 1.04e-07 | 0.01289 | 9.99e-08 | 9.99e-08 | 0.00658 | 8.46e-06 | 5.76e-04 | 0.01291 | 8.45e-06 | 5.93e-04 |
| 24 | 0.01290 | 1.05e-07 | 1.05e-07 | 0.02556 | 9.95e-08 | 9.95e-08 | 0.01291 | 7.17e-06 | 5.80e-04 | 0.02561 | 7.51e-06 | 6.09e-04 | |
| 48 | 0.02545 | 1.05e-07 | 1.05e-07 | 0.05169 | 9.95e-08 | 9.95e-08 | 0.02550 | 7.17e-06 | 5.80e-04 | 0.05215 | 7.51e-06 | 6.09e-04 | |
| 56 | 14 | 0.01204 | 9.80e-08 | 9.80e-08 | 0.02350 | 9.79e-08 | 9.79e-08 | 0.01202 | 5.22e-06 | 2.32e-04 | 0.02352 | 4.08e-06 | 1.88e-04 |
| 28 | 0.02354 | 9.80e-08 | 9.80e-08 | 0.04667 | 9.79e-08 | 9.79e-08 | 0.02353 | 5.09e-06 | 2.24e-04 | 0.04662 | 3.97e-06 | 1.87e-04 | |
| 56 | 0.04664 | 9.80e-08 | 9.80e-08 | 0.09303 | 9.79e-08 | 9.79e-08 | 0.04674 | 5.09e-06 | 2.24e-04 | 0.09313 | 3.97e-06 | 1.87e-04 | |
| Parameter | Case C-01 |
|---|---|
| Molecular mass: () | , |
| Non-dim physical space | |
| Non-dim velocity space | |
| Spatial elements | |
| DG order | 3 |
| Time stepping | Euler |
| Viscosity index: | |
| Scattering parameter: | |
| Ref. diameter: () | |
| Ref. temperature: () | |
| Characteristic mass: () | |
| Characteristic length: () | 1 |
| Characteristic velocity: () | 337.2 |
| Characteristic temperature: () | 273 |
| Characteristic number density: () | |
| Initial conditions | |
| Velocity: () | 0 |
| Temperature: () | 273 |
| Number density: () | |
| Number density: () | |
| Knudsen number: | |
| Knudsen number: | |
| Left wall (purely diffuse) boundary conditions (subscript ) | |
| Velocity: () | |
| Temperature: () | 273 |
| Right wall (purely diffuse) boundary conditions (subscript ) | |
| Velocity: () | |
| Temperature: () | 273 |
| Phase space | Work Units (s) | Efficiency | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1G | 3G | 6G | 9G | 12G | 24G | 36G | 1G/3G | 1G/6G | 1G/9G | 1G/12G | 1G/24G | 1G/36G | |
| 47.580 | 16.155 | 8.339 | 5.698 | 4.392 | 2.423 | 1.774 | 0.98 | 0.95 | 0.93 | 0.90 | 0.82 | 0.84 | |
| 126.601 | 42.616 | 21.551 | 14.563 | 11.038 | 5.784 | 4.030 | 0.99 | 0.98 | 0.97 | 0.96 | 0.91 | 0.98 | |
| 391.943 | 131.081 | 65.913 | 44.218 | 33.513 | 17.224 | 11.621 | 1.00 | 0.99 | 0.98 | 0.97 | 0.95 | 1.05 | |
| 94.682 | 31.957 | 16.197 | 10.944 | 8.331 | 4.392 | 3.079 | 0.99 | 0.97 | 0.96 | 0.95 | 0.90 | 0.96 | |
| 253.016 | 84.834 | 42.741 | 28.697 | 21.703 | 11.158 | 7.693 | 0.99 | 0.99 | 0.98 | 0.97 | 0.94 | 1.03 | |
| 782.343 | 261.601 | 131.217 | 87.755 | 66.009 | 33.520 | 22.509 | 1.00 | 0.99 | 0.99 | 0.99 | 0.97 | 1.09 | |
| 141.754 | 47.641 | 24.033 | 16.182 | 12.326 | 6.356 | 4.388 | 0.99 | 0.98 | 0.97 | 0.96 | 0.93 | 1.01 | |
| 378.956 | 126.853 | 63.676 | 42.636 | 32.066 | 16.295 | 11.041 | 1.00 | 0.99 | 0.99 | 0.98 | 0.97 | 1.07 | |
| 1172.907 | 391.916 | 196.439 | 131.153 | 98.538 | 49.652 | 33.471 | 1.00 | 1.00 | 0.99 | 0.99 | 0.98 | 1.10 | |
| 283.091 | 94.737 | 47.679 | 31.903 | 24.060 | 12.262 | 8.320 | 1.00 | 0.99 | 0.99 | 0.98 | 0.96 | 1.06 | |
| 759.149 | 253.498 | 127.004 | 84.932 | 63.780 | 32.212 | 21.672 | 1.00 | 1.00 | 0.99 | 0.99 | 0.98 | 1.09 | |
| 2347.099 | 783.642 | 392.470 | 261.817 | 196.552 | 98.680 | 66.018 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.11 | |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
A discontinuous Galerkin fast spectral method for multi-species full Boltzmann on streaming multi-processors
Accepted
Shashank Jaiswal
Purdue University701 W. Stadium AvenueWest LafayetteIndiana47907U.S.A
,
Jingwei Hu
Purdue University150 N. University StWest LafayetteIndiana47907U.S.A
,
Julien K. Brillon
Purdue University701 W. Stadium AvenueWest LafayetteIndiana47907U.S.A
and
Alina A. Alexeenko
Purdue University701 W. Stadium AvenueWest LafayetteIndiana47907U.S.A
(2019)
Abstract.
When the molecules of a gaseous system are far apart, say in microscale gas flows where the surface to volume ratio is high and hence the surface forces dominant, the molecule-surface interactions lead to the formation of a local thermodynamically non-equilibrium region extending few mean free paths from the surface. The dynamics of such systems is accurately described by Boltzmann equation. However, the multi-dimensional nature of Boltzmann equation presents a huge computational challenge. With the recent mathematical developments and the advent of petascale, the dynamics of full Boltzmann equation is now tractable. We present an implementation of the recently introduced multi-species discontinuous Galerkin fast spectral (DGFS) method for solving full Boltzmann on streaming multi-processors. The present implementation solves the inhomogeneous Boltzmann equation in span of few minutes, making it at least two order-of-magnitude faster than the present state-of-art stochastic method—direct simulation Monte Carlo—widely used for solving Boltzmann equation. Various performance metrics, such as weak/strong scaling have been presented. A parallel efficiency of 0.96–0.99 is demonstrated on 36 Nvidia Tesla-P100 GPUs.
Multi-species full Boltzmann, Discontinuous Galerkin Fast Spectral, Graphics Processing Units
††copyright: rightsretained††conference: Platform for Advanced Scientific Computing; June 2019; ETH Zurich, Switzerland††ccs: Applied computing Mathematics and statistics††ccs: Computing methodologies Massively parallel and high-performance simulations
1. Introduction
From the fundamental mass/momentum conservation principles, it can be inferred that, in the presence of external forces, say, pressure and temperature gradients, the heavier species moves slower and the lighter species moves faster giving rise to a phenomena termed as diffusion, and effects thereof. Diffusion processes are critical in many applications, for instance, the measurement of the neutrino mass using a windowless gaseous tritium source in the ongoing KATRIN experiment (Bornschein et al., 2005). The dynamics of such systems (and others) are governed by the Boltzmann equation—an integro-differential equation describing the evolution of the distribution function in six-dimensional phase space—which models the dilute gas behavior at the molecular level to accurately describe a wide range of non-continuum flow phenomena, for instance, shocks, expansions into vacuum (Muntz, 1989) as well as velocity and thermal slip at gas-solid interfaces (Sharipov and Kalempa, 2003, 2004). Most rarefied flows of technological interest involve gas mixtures with species diffusion playing a decisive role in turbulent, chemically reacting flows, and evaporation/condensation processes (Takata and Golse, 2007).
The approaches for numerical solution of the Boltzmann equation date back to as early as 1940s (Grad, 1949) using, for example, the now widely used direct simulation Monte Carlo (DSMC) method (Bird, 1994). The DSMC method, based on the kinetic theory of dilute gases, models the binary interactions between particles stochastically. However, it is this stochastic nature, that makes the method unsuitable for flows involving species in trace concentration, for instance, to analyze the spectrum of beta electrons emitted by tritium source which can be substantially different in the presence of the impurities in KATRIN experiment (Bornschein et al., 2005; Sharipov and Kalempa, 2005).
The main difficulty of numerically solving the full Boltzmann equation lies in its complicated collision term. Recently, a fast Fourier spectral method for the multi-species Boltzmann collision operator was introduced in (Jaiswal et al., 2019b). The complexity for a single evaluation of the collision operator is reduced from (direct calculation) to (based on a low-rank decomposition strategy), where is the number of discretization points in each velocity dimension, is the number of discretization points in the radial direction needed for low-rank decomposition, and is the number of discretization points on the sphere. Based on (Jaiswal et al., 2019a), a discontinuous Galerkin fast spectral (DGFS) method was also proposed in (Jaiswal et al., 2019b) for solving the full multi-species Boltzmann equation. DGFS can produce high order spatially and temporally accurate solutions for low-speed and unsteady flows in micro-systems, and is amenable to excellent nearly-linear scaling characteristics on massively parallel architectures. This paper focuses on implementation aspects of multi-species DGFS with an emphasis on establishing the algorithmic behavior of such numerical schemes. More specifically, here, we are concerned about the scaling characteristics of DGFS on multi-GPU/multi-CPU systems.
In the section that follows, we describe the multi-species Boltzmann in brief, followed by a description of the collision operator algorithm. Various performance metrics such as weak/strong scaling, and micro benchmarks involving, static adaptivity of cell-size and polynomial order approximation for rarefied gas-flows have been discussed in section 3. Concluding remarks are given in section 4.
2. The multi-species Boltzmann equation
The non-dimensional Boltzmann equation for multi-species, mono-atomic gas without external forces can be written as (cf. (Jaiswal et al., 2019b))
[TABLE]
where denotes number of species in the mixture – each of them represented by a number distribution function of time , position , and particle velocity . The collision operator takes into account interactions between species and , which acts only in the velocity space:
[TABLE]
where and denote the pre and post collision velocity pairs, which are related through momentum and energy conservation as
[TABLE]
where , denote the mass of particles of species and respectively. Here, the vector varies over the unit sphere . The quantity () is the collision kernel depending only on and the scattering angle (angle between and ). In the present work, we consider the variable soft sphere (VSS) (Koura and Matsumoto, 1991) scattering model. It is worth emphasizing that although the VSS collision kernel is adopted in the present work for easy comparison with DSMC solutions, the fast spectral method we use for the collision operator applies straightforwardly to general collision kernels (see (Gamba et al., 2017; Jaiswal et al., 2019a, b)).
For Variable Soft-Sphere model (Bird, 1994) in particular, the non-dimensional collision-kernel , and the Knudsen number are given as
[TABLE]
[TABLE]
Here denotes the usual Gamma function, , , , and are, respectively, the reference diameter, the reference temperature, the viscosity index, and the scattering parameter. The diameter and exponent are determined so that the transport (viscosity and diffusion) coefficients of VSS are consistent with experimental data. Additionally , , , and , respectively, denote the characteristic length, characteristic temperature, characteristic number density, and characteristic mass . Based upon these, we define the characteristic velocity as where refers to Boltzmann constant; and characteristic time as . For convenience, we define a pre-factor as
[TABLE]
Henceforth, we will always refer to the non-dimensional Boltzmann equation (1) in our presentation.
2.1. The collision operator
First, note that does not depend on spatial coordinate . Given distribution functions and of species and , dependent only on the velocity coordinate : discretized uniformly using points, the method produces at the same grid with complexity, where is the number of Gauss-Legendre quadrature/discretization points in the radial direction needed for low-rank decomposition , is the number of discretization points on the sphere. The steps (based on (Jaiswal et al., 2019b)) for evaluating can be summarized as:
- •
Change the variable to :
[TABLE]
where is the unit vector along , and
[TABLE]
- •
Determine the extent of velocity domain , and periodically extend , to .
- •
Truncate the integral in to a ball with
[TABLE]
- •
Approximate , by truncated Fourier series
[TABLE]
Note here is a three-dimensional index.
- •
Substitute , into (9), and perform the standard Galerkin projection
[TABLE]
where , and the kernel modes and are given by
[TABLE]
It is clear that the direct evaluation of (for all ) would require complexity. But if we can find a low-rank, separated expansion of as
[TABLE]
then the gain term (positive part) of can be rearranged as
[TABLE]
which is a convolution of two functions and , hence can be computed via fast Fourier transform (FFT) in operations. Note that the loss term (negative part) of is readily a convolution and can be computed via FFT in operations.
In order to find the approximation in (17), we simplify (16) as
[TABLE]
where
[TABLE]
while for the loss term,
[TABLE]
For details on the error introduced from the Fourier-spectral approximation, the reader is referred to (Jaiswal et al., 2019b). This discussion has been omitted in the present work for brevity.
2.2. The collision operator algorithm
The collision operator procedure described above is applicable for general collision kernels for -species mixture. However, for a concise description of the algorithmic ideas from an implementation viewpoint, we restrict our discussion to Variable Soft Sphere collision kernel (2). The ideas, however, can certainly be carried over to other collision kernels.
In multi-species implementation, with the high amount of involved computation, our motive is to avoid spurious computation for every timestep. We first outline the procedure for pre-computing variables that can be stored and reused during the course of the simulation.
- •
First, we precompute . We use Gauss-Legendre-Quadrature (GLQ) for integration. So , the GLQ zeros, is an array of size (since the integrand oscillates on the scale of O(N), the total number of quadrature points needed should be ). Additionally, we use spherical design (Womersley, 2016) quadrature on sphere. So, , the spherical-quadrature zeros, is an array of size . as previously defined is the 3-D velocity-space index, and is therefore an array of size . Based upon these is precomputed and stored as a flattened row-major array . This is described in steps 1–9 of Algo. (1).
- •
Second, we compute as per Eq. (20). Note that is velocity-space index of size . Since , , and do not change with time, the term is precomputed and stored as a flattened row-major array for every collision pair . This is described in step 13 of Algo. (1).
- •
Third, we perform precomputation needed for loss-term as per Eq. (21). The output is stored as a flattened row-major array for every collision pair . This is described in step 14 of Algo. (1).
Next we outline the procedure for computing . Recall that our motive is to compute (15)
- •
First, we compute the forward Fourier transform of , and to obtain and respectively. This is described in step 1 of Algo. (2).
- •
Second, we compute as per Eq. (2.1). Recall that has been already precomputed and stored as . Also recall that has been precomputed and stored as . These can be reused to compute . This is described in step 2–8 of Algo. (2). In our implementation, we explicitly unroll the nested loops using Mako (Bayer, 2018) templating engine, such that variables in steps 4 and 5 are computed in a single kernel call (thereby requiring a space of each), and the FFT transforms in the step 6 are rather batched FFT transforms, each of size .
- •
Third, in order to perform convolution for the loss-term , we prepare the variable QG in step 7 of Algo. (2).
- •
Fourth, we perform convolutions to compute as in Eq. (15). Recall that has now been precomputed and stored as QG, and can be reused here. An inverse Fourier transform is then performed to obtain final . This is described in step 10 of Algo. (2).
3. Micro-Benchmarks
Verification for standard rarefied gas flows can be found in (Jaiswal et al., 2019b). In the present work, we focus on the evaluation of the algorithmic behavior.
3.1. Hardware Configuration
Serial and parallel implementations of multi-species DGFS solver are run on 15-node Brown-GPU RCAC cluster at Purdue University. Each node is equipped with two 12-core Intel Xeon Gold 6126 CPU, and three Tesla-P100 GPU. The operating system used is 64-bit CentOS 7.4.1708 (Core) with NVIDIA Tesla-P100 GPU accompanying CUDA driver 8.0 and CUDA runtime 8.0. The GPU has 10752 CUDA cores, 16GB device memory, and compute capability of 6.0. The solver has been written in Python/PyCUDA and is compiled using OpenMPI 2.1.0, g++ 5.2.0, and nvcc 8.0.61 compiler with third level optimization flag. All the simulations are done with double precision floating point values.
3.2. Spatially homogeneous case: Krook-Wu exact solution
For constant collision kernel, an exact solution to the spatially homogeneous multi-species Boltzmann equation can be constructed (see (Krook and Wu, 1977)). We use this solution to verify the accuracy of the proposed fast spectral method for approximating the collision operator. Considering a binary mixture (; ), the equation simplifies to
[TABLE]
where and is some positive constant. The exact solution is given by
[TABLE]
where
[TABLE]
Furthermore, the following condition needs to be satisfied
[TABLE]
For simplicity, we choose , , but vary the mass ratio in the following tests.
It is also helpful to take the derivative of eqn. (23), which yields
[TABLE]
where
[TABLE]
This allows us to check the accuracy of the collision solver without introducing time discretization error.
Table 1 shows the norm between the numerical and analytical . For different mass ratios, we have considered the cases with points in each velocity dimension; and , spherical design quadrature points on the full sphere. A good agreement between analytical and numerical solutions is evident from the table. At a fixed , with increase in mass ratio, the error norm increases. In particular, increase in does not considerably affect the solution due to the isotropic nature of the distribution function. Note that, in the fast spectral decomposition, since the integral oscillates roughly on , the total number of Gauss–Legendre quadrature points in the radial direction should be on order of . As per (Gamba et al., 2017), a more precise estimate is . However, there is no good rule to select optimal . We observe that the error is relatively unaffected upon reducing from to . However, we note that is a safer choice.
From a computational viewpoint, the simulation time is independent of the mass ratio. On increasing the number of discretization points on the sphere , the computational cost approximately doubles–however, we do observe the effect of loop unrolling for smaller . Likewise, the computational cost approximately doubles on increasing the number of quadrature points . This establishes that the algorithm is linear in both and .
3.3. Spatially in-homogeneous case: Couette flow
The aforementioned methodology allows us to compute the collision operator efficiently. To solve the fully spatial in-homogeneous equation (1), we also need an accurate and efficient spatial and time discretization. Here, we adopt the the Runge-Kutta discontinuous Galerkin (RKDG) approach–widely used for hyperbolic systems–as adapted in (Jaiswal et al., 2019a, b) for Boltzmann equation. The details of the discretization can be found in (Jaiswal et al., 2019a, b). We mention that evaluation of collision operator consumes of computation time, and hence, in the present work, we focus on the collision operator behavior. More details on spatial-temporal RKDG discretization on GPU can be found in (Klöckner et al., 2009; Witherden et al., 2014). We restrict our discussion and benchmarks to 1-D flow problems for brevity111Discussion and benchmarks for higher 2D/3D spatial dimension shall be presented in the extended version of this manuscript..
3.3.1. Verification
For general Boltzmann equation (1), analytical solutions do not exist. Therefore, we compare our results with widely accepted direct simulation Monte Carlo (DSMC) (Bird, 1994) method. We want to emphasize that DSMC is a stochastic method for solution of the N-particle master kinetic equation which converges to the Boltzmann equation in the limit of infinite number of particles (Wagner, 1992).
In the current test case, we consider the effect of velocity gradient on the solution. The coordinates are chosen such that the walls are parallel to the direction and is the direction perpendicular to the walls. The geometry as well as boundary conditions are shown in Figure 1. Figure 2 illustrates the velocity and temperature along the domain length for both species, wherein we observe an excellent agreement between DGFS and DSMC. The small discrepancies, however, are primarily due to: a) statistical fluctuations inherent to the Monte Carlo methods, b) practical limitations on number of particles used in DSMC simulations. From a computational viewpoint, the present DGFS simulations on a single GPU took 138 seconds to acquire the steady state, in contrast to 26086.45 sec on 24 processors for DSMC simulations as reported in (Jaiswal et al., 2019b), for achieving comparable accuracy.
3.3.2. Scaling Behavior
The simulations are carried out for different test-cases by varying element-count (), polynomial approximation order (), and velocity-space sizes (). The spatial elements are distributed to processors using the well-known linear domain-decomposition strategy requiring sharing of floating-point during MPI communication phase. Speed up obtained with multi-GPU solver is presented in Table (3). As evident from the table, the acceleration due to GPU parallelization increases with increase in the size of computational grid. More specifically, the increase in and have small-effect on overall speedup which suggests that DG-operators (for instance derivative, time-evolution) are rather computationally inexpensive operations. On the other hand, increase in velocity-grid improves the observed speedup. The weak/strong scaling behavior is also evident from the table.
3.3.3. Flat profile
Recall that the fast Fourier spectral collision operator algorithm 2 is split into multiple parts. It is therefore interesting to see what performance level is attained by each part of the operator. Fig (3) presents the percentage of time spent in various parts of Algo. 2 vs. order of DG scheme (K). First, we note that the DG operators denoted in yellow, requires of the total simulation time. The collision operator, however, consumes nearly of the total time for both and .
4. Conclusions
We have presented an implementation of the multi-species Discontinuous Galerkin Fast Spectral (DGFS) method for solution of multi-species monoatomic full Boltzmann equation on multi-GPU/multi-CPU architectures. The DG-type formulation employed in the present work has advantage of having high-order accuracy at the element-level, and its element-local compact nature (and that of our collision algorithm) enables effective parallelization on massively parallel architectures. For verification and benchmarks, we carry out simulations for spatially homogeneous BKW, and Couette flow problems. Parallel efficiency close to 0.95 is observed on a 36 GPU multi-node/multi-GPU system. An important key observation is that the efficiency can be maintained provided we have enough work on each processor. It is this speedup that now allows researchers to solve problems within a day that would otherwise take months on traditional CPUs. Future work directions include, assessment of the implementation beyond thousand cores. Extending the implementation to general 2D/3D mixed grids coupled with adaptivity in physical and velocity spaces, is an interesting direction as well.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1(1)
- 2Bayer (2018) Michael Bayer. 2018. Mako: Templates for Python. (2018). https://www.makotemplates.org/
- 3Bird (1994) G. A. Bird. 1994. Molecular Gas Dynamics and the Direct Simulation of Gas Flows . Clarendon Press, Oxford.
- 4Bornschein et al . (2005) L Bornschein, Katrin-Collaboration, et al . 2005. The KATRIN experiment-a direct measurement of the electron antineutrino mass in the sub-e V region. Nuclear Physics A 752 (2005), 14–23.
- 5Gamba et al . (2017) I. Gamba, J. Haack, C. Hauck, and J. Hu. 2017. A fast spectral method for the Boltzmann collision operator with general collision kernels. SIAM Journal of Scientific Computing 39 (2017), B 658–B 674.
- 6Grad (1949) Harold Grad. 1949. On the kinetic theory of rarefied gases. Communications on pure and applied mathematics 2, 4 (1949), 331–407.
- 7Jaiswal et al . (2019 a) S. Jaiswal, A. Alexeenko, and J. Hu. 2019 a. A discontinuous Galerkin fast spectral method for the full Boltzmann equation with general collision kernels. J. Comput. Phys. 378 (2019), 178–208.
- 8Jaiswal et al . (2019 b) Shashank Jaiswal, Alina A. Alexeenko, and Jingwei Hu. 2019 b. A discontinuous Galerkin fast spectral method for the multi-species full Boltzmann equation. ar Xiv preprint ar Xiv:1903.03056 (2019).
