cuSten -- CUDA Finite Difference and Stencil Library

Andrew Gloster; Lennon O'Naraigh

arXiv:1902.09931·cs.DC·September 5, 2019

cuSten -- CUDA Finite Difference and Stencil Library

Andrew Gloster, Lennon O'Naraigh

PDF

1 Repo

TL;DR

cuSten is a CUDA library that simplifies the implementation of finite difference and stencil computations in 2D and batched 1D, accelerating development of GPU-based numerical solvers.

Contribution

The paper introduces cuSten, a user-friendly CUDA library that streamlines finite difference and stencil computations, with an application to the Cahn-Hilliard equation and performance benchmarking.

Findings

01

cuSten significantly speeds up numerical code development on GPUs.

02

The library demonstrates competitive performance in solving PDEs.

03

Benchmark results show advantages over serial implementations.

Abstract

In this paper we present cuSten, a new library of functions to handle the implementation of 2D and batched 1D finite-difference/stencil programs in CUDA. cuSten wraps data handling, kernel calls and streaming into four easy to use functions that speed up development of numerical codes on GPU platforms. The paper also presents an example of this library applied to solve the Cahn-Hilliard equation utilizing an ADI method with periodic boundary conditions, this solver is also used to benchmark the cuSten library performance against a serial implementation.

Tables1

Table 1. Table 1: Code metadata

Nr.	Code metadata description
C1	Current code version	2.1
C2	Permanent link to code/repository used of this code version	https://github.com/munstermonster/cuSten/releases/tag/2.1
C3	Legal Code License	Apache License 2.0
C4	Code versioning system used	git
C5	Software code languages, tools, and services used	CUDA, C++, HDF5 (for one of the examples, not needed to compile library)
C6	Compilation requirements, operating environments & dependencies	CUDA
C7	If available Link to developer documentation/manual	Generated using Makefile supplied with software, see README
C8	Support email for questions	[email protected]

Equations18

F [C] = \int_{Ω} [\frac{1}{4} (C^{2} - 1)^{2} + \frac{1}{2} γ ∣\nabla C ∣^{2}] d^{D} x

F [C] = \int_{Ω} [\frac{1}{4} (C^{2} - 1)^{2} + \frac{1}{2} γ ∣\nabla C ∣^{2}] d^{D} x

\frac{\partial C}{\partial t} = \nabla \cdot [D (C) \nabla \frac{δ F}{δ C}]

\frac{\partial C}{\partial t} = \nabla \cdot [D (C) \nabla \frac{δ F}{δ C}]

\frac{\partial C}{\partial t} = D \nabla^{2} (C^{3} - C - γ \nabla^{2} C), x \in Ω, t > 0.

\frac{\partial C}{\partial t} = D \nabla^{2} (C^{3} - C - γ \nabla^{2} C), x \in Ω, t > 0.

C (x, t = 0) = f (x), x \in \overline{Ω} .

L_{x} w = - \frac{2}{3} (C^{n} - C^{n - 1}) - \frac{2}{3} Δ t \nabla^{4} \overset{ˉ}{C}^{n + 1} + \frac{2}{3} D Δ t \nabla^{2} (C^{3} - C)^{n}

L_{x} w = - \frac{2}{3} (C^{n} - C^{n - 1}) - \frac{2}{3} Δ t \nabla^{4} \overset{ˉ}{C}^{n + 1} + \frac{2}{3} D Δ t \nabla^{2} (C^{3} - C)^{n}

L_{y} v = w

C^{n + 1} = \overset{ˉ}{C}^{n + 1} + v,

\frac{\partial ^{2} ϕ _{i}}{\partial x ^{2}} = \frac{ϕ _{i + 1} - 2 ϕ _{i} + ϕ _{i - 1}}{Δ x ^{2}},

\frac{\partial ^{2} ϕ _{i}}{\partial x ^{2}} = \frac{ϕ _{i + 1} - 2 ϕ _{i} + ϕ _{i - 1}}{Δ x ^{2}},

\frac{\partial ^{4} ϕ _{i}}{\partial x ^{4}} = \frac{ϕ _{i + 2} - 4 ϕ _{i + 1} + 6 ϕ _{i} - 4 ϕ _{i - 1} + ϕ _{i - 2}}{Δ x ^{4}},

s (t) = \frac{1}{1 - ⟨ C ^{2} ⟩}

s (t) = \frac{1}{1 - ⟨ C ^{2} ⟩}

k_{1} (t) = \frac{\int d ^{n} k ∣ C ^ ∣ ^{2}}{\int d ^{n} k ∣ k ∣ ^{- 1} ∣ C ^ ∣ ^{2}}

k_{1} (t) = \frac{\int d ^{n} k ∣ C ^ ∣ ^{2}}{\int d ^{n} k ∣ k ∣ ^{- 1} ∣ C ^ ∣ ^{2}}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

munstermonster/cuSten
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

cuSten – CUDA Finite Difference and Stencil Library

Andrew Gloster

School of Mathematics and Statistics, University College Dublin, Belfield, Dublin 4, Ireland

Lennon Ó Náraigh

School of Mathematics and Statistics, University College Dublin, Belfield, Dublin 4, Ireland

Abstract

In this paper we present cuSten, a new library of functions to handle the implementation of 2D and batched 1D finite-difference/stencil programs in CUDA. cuSten wraps data handling, kernel calls and streaming into four easy to use functions that speed up development of numerical codes on GPU platforms. The paper also presents an example of this library applied to solve the Cahn-Hilliard equation utilizing an ADI method with periodic boundary conditions, this solver is also used to benchmark the cuSten library performance against a serial implementation.

Current Software Version

Keywords: CUDA, Finite Difference, Library, PDEs, Stencil, Benchmark

I Introduction

Many problems in Physics and Applied Mathematics can be expressed as systems of Partial Differential Equations (PDEs), examples of which include the Navier–Stokes doering1995applied , Euler osher2006level ; hesthaven2018numerical , Black–Scholes wilmott_howison_dewynne_1995 and Burgers whitham2011linear equations. In many situations analytic methods of solving a given system are not possible due to the complexity of the equations; an alternate approach is to solve the system numerically. To discretize the system numerically, several standard approaches exist, including the finite-difference, finite-volume, and finite-element methods. For definiteness, this work focus on the finite-difference method, however, it can be applied in any situation requiring stencil-based operations.

The application of the finite-difference method turns the operators in PDEs into expressions which can be input into a computer program. For high-resolution numerical simulations, numerical scheme allowing, it is desirable to solve these computational problems in parallel with multiple processors to reduce the time taken to find a solution, this has been traditionally tackled with the MPI or OpenMP libraries which allow for parallelisation across multiple CPU cores. More recently due to developments in technology, a reduction in cost compared to traditional multi-CPU platforms and increased performance, GPUs have become a common approach to parallelisation. Particularly the use of NVIDIA GPUs and their programming language CUDA have become prevalent. CUDA today includes many GPU versions of common numerical libraries such as the linear algebra package cuBLAS, the Fourier transform library cuFFT and cuSPARSE which provides the programmer with many common solution methods for sparse matrices, discussion of these can be found on the CUDA documentation web-page cudaDoc .

There is a large field of literature associated with the implementation finite-difference methods using CUDA, a few examples include Micikevicius093dfinite ; waveFD ; elsStencil ; CUDAthesis . This literature commonly explains how to approach the problem of implementing a finite difference scheme using CUDA but yet the authors provide no publicly available library or code with their papers that a reader readily use in their own project, thus requiring the reader to rewrite code that repeats work already done elsewhere. Libraries providing PDE solvers and other stencil–based computations exist, such as libGeo and indeed some approaches that can generate code for the programmer autoGenZhang ; autoGenHolewinski , but these libraries and approaches can be limiting due to investment cost in learning essentially a full software package or new method. Indeed the PETSc library petscwebpage ; petscuserref ; petscefficient which supports finite-difference methods also now provides a GPU implementation but this limits the program to be written mostly using that library’s API (thus limiting flexibility), and requires the programmer to also develop knowledge of cuTHRUST cudaDoc to implement the GPU aspects of the library effectively. It can also be noted that the PETSc web-page petscwebpage currently documents some difficulties associated with using GPUs effectively in PETSc. As such, we present cuSten as a computational framework complementary to PETSc, readily deployable by a programmer interested in Physics applications, with relatively low overhead in terms of learning to implement large complex libraries.

Common problems at development time include readjusting boundaries when changing finite difference schemes or ensuring the correct data has been loaded onto the GPU at the time of computation, both of these are dealt with by cuSten. cuSten aims to overcome these difficulties along with addressing the problems with the above problems by providing a new software tool, introducing a simple set of four functions (three in many cases) for the programmer to implement their finite difference solver. These functions are accessed much like cuBLAS or cuSPARSE giving freedom to the programmer to build the program as they choose but eliminating the need to worry about the finite difference implementation specifics. This tool allows a programmer to simply input their desired finite-difference stencil and the direction in which it should be applied and then the rest of the implementation, including the domain decomposition, boundary positioning and data handling are wrapped into functions that are easily called. This approach reduces the development time necessary for implementing new systems/solvers and provides a robust framework that does not involve a black-box-approach to the solution from the programmer. Furthermore, the approach does not require a major overhead of time to invest in learning/implementing a new tool.

It is not intended that the code produced by this tool be the most efficient implementation of a given scheme versus a dedicated code for a specific problem, but it is intended that the development time of a code is drastically cut by removing the need for the programmer to do unnecessary work at development time. We include a comprehensive example of the application of this tool in Section V, here the ease of implementation of the cuSten library to solve the 2D Cahn–Hilliard equation is highlighted. A benchmark of cuSten versus a serial implementation of the same is also included to highlight the improvements in performance due to parallelisation on the GPU. 2D problems are the main focus of this new tool; 2D problems provide a test-bed for the development numerical algorithms which can then be extended to 3D, where debugging, testing, and validation are more time-consuming. The extension of the present method to 3D is discussed in Section VI below. In terms of floating point precision this library focuses on the use of the double floating point type as in most application it is desirable to have 64 bit precision when solving PDEs, the source code is easily modified using a standard text editor with find and replace to change to other data types if so desired (this is discussed also in Section VI below).

In Section II we introduce the underlying architecture of the cuSten library, including how it uses streams and events for optimal memory management. Then in Sections III and IV we talk the reader through the cuSten API along with examples and where to find all the source code within the library should they wish to edit it, the API is further explained in the Doxygen documentation included with the library itself. Section V presents the implementation of a full 2D Cahn–Hilliard solver along with a GPU versus CPU benchmark of the cuSten library and finally concluding discussions with potential future work are presented in Section VI.

II Software Architecture

The library in this paper makes use of the CUDA programming language. For the sake of brevity we assume the reader is familiar the standard features of the language including kernels, shared memory etc. The tool is built on two main sets of code, one handling the creation and destruction of the cuSten_t data type which handles all of the programmer’s inputs (found in /src/struct) and the other handling the compute kernels (found in /src/kernels).

At the top level will be the main program solving whatever PDE is of concern to the programmer and the library is called through the header cuSten.h. The programmer provides the necessary memory to the library using Unified Memory along with the stencil details, these will be detailed in Section III. Unified Memory was chosen as it simplifies the handling of memory in the library and interfacing with the rest of programmer’s code. The ability to address data beyond the device memory limit is also useful in cases where not all the data required for a given program fits in device memory, the movement of memory on and off the device is handled by the cuSten library as explained in the following paragraph.

To take advantage of Unified Memory the library allows the programmer to divide their domain into ‘tiles’ such that each tile will fit into the device RAM. Each tile is a chunk of the total domain in the y direction to ensure the memory is contiguous. The tiles are loaded onto the GPU in time for the kernel to be launched such that there are no GPU page faults. The programmer also has the option to unload the tiles onto host RAM after the computation is completed on a given tile, this can be for IO or if the programmer needed to free device RAM for the next tile or a new task. This system ensures that loading/unloading data and computation is implemented as a pipeline using separate streams for data loading/unloading and kernel launches ensuring that everything overlaps and ensuring that as little time is wasted retrieving memory over the PCIe bus which is a bottleneck to a memory bound program. Finite difference programs, such as the ones discussed in this article, are typically memory bound as only a few computations are required per point in the array yet the memory overhead can be quite large when several variables need to have stencils applied to them. Events are used to ensure the data has been loaded prior to the launch of a kernel.

The programmer has the choice of supplying a standard linear stencil or a function pointer with additional input coefficients to the library, examples of which are discussed in Sections IV.1 and IV.2 respectively. Within the compute kernel blocks of data with suitable boundary halos are loaded into shared memory. The stencil or function is then applied to the block with each thread calculating the output for its position. When this has completed the data is then output as blocks into the memory provided by the programmer for output, the same memory cannot be used for both as the blocks require overlapping data and thus cannot use already output values.

III Software API

The programmer can use up to four functions for the application of any given finite difference stencil, in most cases only three are required. The possible stencil directions include x, y and xy, where xy allows for cross derivatives which require that diagonal/off-diagonal information is available for the stencil to be completed, the stencil size is not limited in any direction and can be any desired shape, for example the stencil can be a $5\times 3$ in dimension and use every point within that area. Indeed the area for the stencil need not be centred at $(i,j)$ and it can extend in any direction more than another as necessary, this can be done be specifying non symmetric quantities for the number of points required left/right or top/bottom of $(i,j)$ in the stencil. A typical stencil for a second order accurate cross derivative $\frac{\partial^{4}}{\partial x^{2}\partial y^{2}}$ is shown in Figure 1, this stencil also appears in the linear biharmonic term for the Cahn–Hilliard solver presented in section V.2.

Each direction then comes with a periodic and non-periodic boundary option along with a choice between supplying just a set of weights (example in IV.1) which are applied linearly or a function pointer (examples in Section IV.2 and V.2) that can be used to apply more sophisticated schemes. In order to apply non–periodic boundary conditions the programmer will need to write their own boundary condition kernel, this was done to keep the library flexible to the programmer’s desired numerical scheme which may require more sophisticated boundaries than simple Neumann/Dirichlet conditions. The cuSten library simply leaves the data in the boundary cells untouched when performing a non–periodic computation. The naming convention for the functions available in the library is

[TABLE]

The descriptions for the options are as follows:

Create: This will take the programmer inputs such as the stencil size, weights, number of tiles to use etc. and return the cuSten_t ready for use later in the code.

Destroy: This will undo everything in create, freeing pointers and streams etc. To be used when the programmer has finished using the current stencil, for example at the end of a program.

Swap: This will swap all relevant pointers, in other words swap the input and output data pointers around so the stencil can be applied to the updated stencil after time-stepping. The need for this function is generally dependent on the overall numerical scheme a programmer is using, it is not needed in all situations.

Compute: This will run the computation applying the stencil to the input data and outputting it to the appropriate output pointer.

X: Apply the stencil in the x direction.

Y: Apply the stencil in the y direction.

XY: Apply the stencil in the xy direction simultaneously (for situations with cross derivatives etc.). The library will account for corner halo data in this situation.

p: Apply the stencil with periodic boundary conditions.

np: Apply the stencil with non-periodic boundary conditions, this leaves suitable boundary cells untouched for the programmer to then apply their own boundary conditions.

Fun: Version of the function to be used if supplying a function pointer, otherwise leave blank.

The functions are then called in order of Create, Compute, Swap (if necessary) and then Destroy at the end of the program. Complete usage examples are found in the next section with further examples found in examples/src. The complete API can be found in the Doxygen documentation, see README on how to generate this.

IV Examples

In this section we provide an overview of using library. We present three examples. The first is an implementation using linear stencil weights. The second involves a function pointer instead. The third example is at the level of a detailed physics problem (advection in Fluid Mechanics), and is included here to demonstrate to the user how to modify the source code as necessary. These three examples (and more) can be found in examples/src. The README provides compilation details. In all examples in the repository we take derivatives of various trigonometric functions as these are easy to benchmark against in periodic and non-periodic domains.

IV.1 Standard Weights

We present here the example 2d_x_np.cu, it is recommended to have this example open in a text editor to follow along. In this example we implement an 8th order accurate central difference approximation to the second derivative of $\sin(x)$ in the $x$ direction. The domain has 1024 points in $x$ and 512 points in $y$ , set by nx and ny respectively with the domain size lx set to $2\pi$ .

Unified memory is allocated with dataOld set to the input $\sin(x)$ and answer set to $-\sin(x)$ , dataNew is zeroed to ensure correct output. We choose to implement this scheme on compute device 0 by setting deviceNum and implement the scheme using a single tile, setting numTiles to 1. The stencil is then implemented by setting the parameters numSten, numStenLeft and numStenRight along with providing an array of the stencil weights the same length as numSten. numSten is the total number of points in the stencil, in this case 9, while numStenLeft/Right are the number of points in the left and right of the stencil, both 4 in this case. A cuSten_t named xDirCompute is then declared and fed along with the above parameters into custenCreate2DXnp, this then equips cuSten_t with the necessary information. The ordering of parameters to be fed into cuStenCreate2DXnp can be found in both the Doxygen documentation and cuSten/src/struct/cuSten_struct_functions.h

The computation is run using cuStenCompute2DXnp(&xDirCompute, HOST) where the HOST indicates we wish to load the data back to the host memory after the computation is completed, DEVICE if you wish to leave it in device memory. Finally the result is output along with the expected answer to stdout, the 4 cells on either side in the x direction will be $0.0$ due to the boundary, these would then be set by the programmer using suitable boundary conditions in a full solver. Then the cuStenDestroy2DXnp function is called to destroy the cuSten_t. Memory is then freed in the usual manner.

IV.2 Function Pointer

Now we present the function pointer version of the above example, named 2d_x_np_fun.cu, again is is recommended to have a text editor open with the code to follow along. Many of the parameters are the same as before except this time we remove the weights and replace them with coefficients that are then fed into the function pointer by the library.

The function pointer in this case implements a standard second-order accurate central-difference approximation to the second derivative of $\sin(x)$ . We supply numSten, numStenLeft and numStenRight as before but now we also need numCoe to specify how many coefficients we need in our function pointer.

Our function pointer is of type devArg1X, where the 1 indicates how many input data sets are required. Each thread in a block will call the function and it returns the desired output value for that thread, each index in the array has one thread assigned to it. The inputs are pointers to the input data, the coefficients and the index location in the stencil

[TABLE]

The central-difference scheme is implemented in a standard way with indexing done relative to loc, the coefficient in this case is set to $1.0/\Delta x^{2}$ as is standard. A key point to notice, is that the programmer must allocate memory for the function pointer on the device, this can be seen on line 131 and 132 of the example code prior to calling the Create function.

The rest of the access to the API is then the same as before except some of the inputs change and there is a Fun at the end of each function name, for example cuStenCreate2DXnpFun. We will see later in Section V how function pointers provide us with a powerful tool to apply stencils to non-linear quantities, in particular we will see this with the cubic term of the Cahn–Hilliard equation to which we wish to apply a Laplacian.

IV.3 Advection

The library also comes with an extra variant of the above functions 2d_xyADVWENO_p in which a 2D periodic advection WENO scheme has been implemented by modifying the 2DXYp source code. This is included as an example to show the user how to modify the source code as necessary to more specific needs or in situations where the function pointer may not meet requirements, for example in this situation where extra data needed to be input in the form of $u$ and $v$ velocities. The files can be found in the cuSten/src folder with how its called in examples/src/2d_xyWENOADV_p.cu.

A brief overview of the modifications made to the 2DXYp code are as follows:

•

The stencil dimensions are now set automatically when the creation function is called.

•

The $u$ and $v$ velocities were linked to the cuSten type with appropriate tiling.

•

Additional asynchronous memory copies were included in the memory loading portion of the code to ensure the velocities are present on the device at the required time.

•

The corner data copying to shared memory blocks was removed from the kernel as it is no longer required.

•

The standard stencil compute was removed and replaced with a device function call to a WENO solver, details of the solver can be found in osher2006level .

V cuCahnPentADI

In this section we show how the cuSten library can be used as part of a larger program that the authors developed using the cuPentBatch cuPent solver, a batched pentadiagonal matrix solver. We also provide a benchmark at the end of the section to show how cuSten performs versus a serial implementation. The equation we wished to solve was the 2D Cahn–Hilliard equation. The Cahn–Hilliard equation models phase separation in a binary liquid: when a binary fluid in which both components are initially well mixed undergoes rapid cooling below a critical temperature, both phases spontaneously separate to form regions rich in the fluid’s component parts. The regions expand over time in a phenomenon known as coarsening CH_orig . The equation is extremely well studied and is a popular model in polymer physics and interfacial flows.

In the mathematical framework, a single scalar concentration field $C(\mathbf{x},t)$ characterizes the binary mixture. As such, a concentration level $C=\pm 1$ indicates phase separation of the mixture into one or other of its component parts, while $C=0$ denotes a perfectly mixed state. The free energy for the mixture can be modeled as

[TABLE]

where the first term promotes de-mixing and the second term smooths out sharp gradients in transition zones between de-mixed regions; also, $\gamma$ is a positive constant, $\Omega$ is the container where the binary fluid resides, and $D$ is the dimension of the space. The twin constraints of mass conservation and energy minimization suggest a gradient-flow dynamics for the evolution of the concentration:

[TABLE]

where $\delta F/\delta C$ denotes the functional derivative of the free energy and $D(C)\geq 0$ is the mobility function, assumed for simplicity in this work to be a positive constant. As such, the basic model equation reads

[TABLE]

V.1 Discretisation

For simplicity, we focus on the case where $\Omega=(0,2\pi)^{D}$ , with periodic boundary conditions applied in each of the $D$ spatial dimensions. The method of solution we choose is based on the ADI method presented in stableHyper for the linear hyperdiffusion equation – we extend that scheme here and apply it to the non-linear Cahn–Hilliard equation as follows:

[TABLE]

Where ${\bf{L_{x}}}={\bf{I}}+\frac{2}{3}D\gamma\Delta t\partial_{xxxx}$ and similarly for ${\bf{L_{y}}}$ . In Equation (4) each one of the matrix inversions is solved using cuPentBatch as per the method presented in cuPent and we transpose the matrix when changing from the x direction to y direction sweep to ensure the data is in the proper interleaved format. To deal with the periodic element of the inversion the method is the same as in Reference cuPent ; navon_pent . To recover the initial $n-1$ time step we simply set this to the initial condition and appropiately update and time steps there after. The derivatives are discretised using standard second order accurate central differences.

[TABLE]

and $\Delta x=\Delta y$ for the a uniform grid.

V.2 Application of cuSten

The code for the example can be found with the repository in the cuCahnPentADI folder, supplied also in this folder is a Makefile to compile the files and a Python script to analyse the results which we present in Section V.3. cuSten is applied for all of the finite-difference elements of the code excluding the matrix inversion where we use cuPentBatch. Between lines 148 and 190 we can see an application of a more sophisticated function pointer than presented previously in Section IV.2, here we apply the Laplacian to the right-hand-side (RHS) non-linear term $C^{3}-C$ . The coefficients are declared between lines 481 and 516, the non–linear term is a $5\times 5$ stencil. This shows a clear example of ease of use of the function pointers and the easy swap in/out of values. Note how the indexing starts from the top left of the stencil and sweeps left to right in i, row by row in j for indexing.

The linear terms for the RHS are implemented using standard weighted schemes, the scheme uses a $5\times 5$ stencil, this and the non–linear term highlight one of the key features of the library with the easy change in stencil size and the boundaries are dealt with automatically (in this case periodic). The additional static functions at the start of the file apply the time stepping parts of the algorithm and combination of terms to set the full RHS. Output is done using the standard HDF5 library, this is required for the cuPentBatchADI program but not the cuSten library itself.

V.3 Numerical Results

In order to analyse the performance of the code we use two standard tests to quantify the coarsening rate LennonAurore . First we have the quantity $s(t)$ which can be defined as

[TABLE]

Where $\langle\cdot\rangle$ denotes the spatial average, which we calculate by a simple integration over the domain using Simpsons’s rule. Secondly we plot $1/k_{1}(t)$ , which also captures the growth in length scales, where $k_{1}$ can be defined as

[TABLE]

with the hat denoting the Fourier Transform. We run the simulation to a final time $T=100$ with $n_{x}=n_{y}=512$ points, the time–step size is set at $\Delta t=0.1\Delta x$ . The initial conditions are a random uniform distribution of values between $-0.1$ and $0.1$ , we have set the coefficients $D$ and $\gamma$ to $1.0$ and $0.01$ respectively. The initial condition is chosen to mimic a ‘deep quench’, where the system is cooled suddenly below the critical temperature, which allows for phase separation to occur spontaneously Zhu_numerics . The quantities $s(t)$ and $1/k_{1}(t)$ are plotted in Figure 2 as a function of $t$ with a reference line of $t^{1/3}$ included as both should scale proportionally to this. We can see clear match between our two quantities and $t^{1/3}$ . Finite-size effects spoil the comparison between numerics and theory towards the end of the computation, as by that time the $(C=\pm 1)$ -regions fill out the computational domain. Figure 3 to illustrate the behaviour of the solution in space and time: the system clearly evolves into extended regions where $C=\pm 1$ , which grow over time, consistent with Figure 2 and the established theory LS , in particular we can see the development of the finite size effects.

V.4 Benchmark of cuSten

In this section we benchmark the cuPentCahnADI program, which uses cuSten and cuPentBatch, against a serial version of the program running on a CPU. The GPU used in this benchmark is an NVIDIA Titan X Pascal and the CPU is an Intel i7–6850K which has 6 hyper-threaded cores operating at $3.6GHz$ . The benchmark is performed by measuring the time to time–step the simulation to a final time of $T=10$ , scaling $N$ where $N\times N$ is the total size of the domain. All start–up overheads and program–finish overheads are excluded from the timing, this is to ensure a fair benchmark of only the numerical computation, $T=10$ was chosen to ensure any effects of background processes due to the operating system are averaged out. No IO steps were included in either code. The same parameters and initial conditions were used as in the previous section, the serial code and version of cuPentCahnADI which outputs times rather than simulation data can be found in the folder cuPentSpeedUp.

In Figure 4 we present the scaling in time as a function of $N$ for the serial and GPU codes, superimposed are the lines for $N^{2}$ and $N^{3}$ for comparison. It can be seen clearly from these plots that the CPU code scales in time as $N^{3}$ while the GPU code initially scales with $N^{2}$ increasing to $N^{3}$ as $N$ increases. This can be attributed to the presence of Amdahl’s law, as $N$ increases more and more the serial computation required per core of the GPU will begin to dominate and the parallelisation speed-up levels out.

Further evidence of this can be seen in Figure 5 as the curve begins to level off for $N$ large. In this plot though we can see the clear advantage of parallelising this 2D solver on a GPU versus the serial CPU code, the speed-up is on the $O(10)$ for all reasonable grid resolutions, indeed the speed-up gets to $40\times$ faster for large $N$ , a significant performance increase. This performance would be increased further on newer GPUs such as the V100. Thus significant performance can be gotten by using a GPU with the cuSten library for 2D computations. The advantages of GPUs for the speed-up of batches of 1D problems has already been discussed in cuPent , the results presented in that paper used an earlier version of the cuSten library.

VI Discussion and Conclusions

VI.1 Possible Future Extensions to the Library

As previously mentioned the current library is limited to 2D uniform grids with double precision. Future areas of expansion could include moving the current library functions into C++ templates, this would allow for easier generalisation to other data types without the current need for find and replace to be done manually. Expansion to 3D and non-uniform grids is less trivial. 3D would require a different approach to loading data than currently implemented as data will not be contiguous in RAM in the z direction, a more sophisticated loading scheme with pointers would be required. For non-uniform grids additional data would need to be loaded into memory, it is likely in this situation that a hybrid of modifying the code such as in the WENO example to have extra data available ( $u$ and $v$ velocities in the case of WENO, coordinate transformations in the case of a non-uniform grid) and using function pointers would be the best approach to make to the existing source.

VI.2 MPI

The design of the library lends itself to an MPI domain decomposition to be used in a hybrid code with the cuSten library. Each MPI process could be assigned to a GPU using the deviceNum parameter, then the user would apply the non periodic versions of the stencils along with using MPI to swap the boundary halos. Memory exchange is simplified in MPI due to the use of Unified Memory, the required data will be copied directly between GPU devices. This allows for the application of this library in much larger solvers which require more than just a single GPU.

VI.3 Concluding Remarks

In this paper we have shown how cuSten can be used to simplify the implementation of finite difference programs in CUDA compared with other state of the art libraries such as PETSc. cuSten has a lightweight interface with a minimal learning curve required to implement the functions as part of a wider project. The library has been benchmarked against a serial code using a Cahn–Hilliard solver and numerous examples are provided to show potential users how to use the functionality provided. It has wide ranging applications in finite-difference solver development and in further areas requiring stencil–based operations such as image processing and optimisation problems.

Acknowledgements

Andrew Gloster acknowledges funding received from the UCD Research Demonstratorship. All authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPUs used for this research. The authors also acknowledge the referees of this paper who provided insightful feedback with good suggestions to improve both the paper and cuSten library.

VII Bibliography

Bibliography23

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Charles R Doering and John D Gibbon. Applied analysis of the Navier-Stokes equations , volume 12. Cambridge University Press, 1995.
2[2] Stanley Osher and Ronald Fedkiw. Level set methods and dynamic implicit surfaces , volume 153. Springer Science & Business Media, 2006.
3[3] Jan S Hesthaven. Numerical methods for conservation laws: From analysis to algorithms , volume 18. SIAM, 2018.
4[4] Paul Wilmott, Sam Howison, and Jeff Dewynne. The Mathematics of Financial Derivatives: A Student Introduction . Cambridge University Press, 1995.
5[5] Gerald Beresford Whitham. Linear and nonlinear waves , volume 42. John Wiley & Sons, 2011.
6[6] NVIDIA. Cuda toolkit documentation. https://docs.nvidia.com/cuda/ , 2019.
7[7] Paulius Micikevicius. 3d finite difference computation on gpus using cuda. In Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, ACM International Conference Proceeding Series , pages 79–84. ACM, 2009.
8[8] David Michéa and Dimitri Komatitsch. Accelerating a three-dimensional finite-difference wave propagation code using gpu graphics cards. Geophysical Journal International , 182(1):389–402, 2010.