Automatic Differentiation for Adjoint Stencil Loops

Jan H\"uckelheim; Navjot Kukreja; Sri Hari Krishna Narayanan; Fabio; Luporini; Gerard Gorman; Paul Hovland

arXiv:1907.02818·cs.DC·July 8, 2019

Automatic Differentiation for Adjoint Stencil Loops

Jan H\"uckelheim, Navjot Kukreja, Sri Hari Krishna Narayanan, Fabio, Luporini, Gerard Gorman, Paul Hovland

PDF

TL;DR

This paper introduces a novel automatic differentiation technique that maintains the efficient, parallelisable stencil-like structure of loops, enabling high-performance gradient computations in scientific and neural network applications.

Contribution

The paper presents a new method combining automatic differentiation with loop transformations to preserve stencil loop structures for efficient, parallelizable gradient computation.

Findings

01

Implemented in the Python tool PerforAD

02

Applied to seismic imaging and fluid dynamics

03

Achieves stencil-like, parallelisable derivatives

Abstract

Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable. In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated…

Equations20

r_{i, j} = u_{i - 1, j} + 2 u_{i, j - 1},

r_{i, j} = u_{i - 1, j} + 2 u_{i, j - 1},

r_{i, j} = f (a, b) (a = u_{i - 1, j}, b = u_{i, j - 1}) .

r_{i, j} = f (a, b) (a = u_{i - 1, j}, b = u_{i, j - 1}) .

r_{i, j} = derivative (f (a, b), a) (a = u_{i - 1, j}, b = u_{i, j - 1}) .

r_{i, j} = derivative (f (a, b), a) (a = u_{i - 1, j}, b = u_{i, j - 1}) .

derivative (f (a, b), a) (a = u_{i - 1, j}, b = u_{i, j - 1}),

derivative (f (a, b), a) (a = u_{i - 1, j}, b = u_{i, j - 1}),

derivative (f (a, b), a) (a = u_{i, j}, b = u_{i + 1, j - 1}) .

derivative (f (a, b), a) (a = u_{i, j}, b = u_{i + 1, j - 1}) .

[s_{i} + l \in [1, M] max (o_{i} (S_{l})), s_{e} - l \in [1, M] min (o_{i} (S_{l}))] .

[s_{i} + l \in [1, M] max (o_{i} (S_{l})), s_{e} - l \in [1, M] min (o_{i} (S_{l}))] .

\frac{\partial ^{2} u}{\partial t ^{2}}

\frac{\partial ^{2} u}{\partial t ^{2}}

D := (\frac{d t}{d x})^{2} = (\frac{d t}{d y})^{2} = (\frac{d t}{d z})^{2},

D := (\frac{d t}{d x})^{2} = (\frac{d t}{d y})^{2} = (\frac{d t}{d z})^{2},

\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = ν \frac{\partial ^{2} u}{\partial x ^{2}} .

\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = ν \frac{\partial ^{2} u}{\partial x ^{2}} .

u \frac{\partial u}{\partial x} \approx \frac{1}{d x} \cdot (max (u_{i}, 0) \cdot (u_{i} - u_{i - 1}) + min (u_{i}, 0) \cdot (u_{i + 1} - u_{i})) .

u \frac{\partial u}{\partial x} \approx \frac{1}{d x} \cdot (max (u_{i}, 0) \cdot (u_{i} - u_{i - 1}) + min (u_{i}, 0) \cdot (u_{i + 1} - u_{i})) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Automatic Differentiation for Adjoint Stencil Loops

Jan Hückelheim

0000-0003-3479-6361

Imperial College LondonSouth KensingtonLondonUnited KingdomSW7 2AZ

[email protected]

,

Navjot Kukreja

Imperial College LondonSouth KensingtonLondonUnited KingdomSW7 2AZ

[email protected]

,

Sri Hari Krishna Narayanan

Argonne National Laboratory9700 South Cass RoadLemontIllinois60439

[email protected]

,

Fabio Luporini

Imperial College LondonSouth KensingtonLondonUnited KingdomSW7 2AZ

[email protected]

,

Gerard Gorman

Imperial College LondonSouth KensingtonLondonUnited KingdomSW7 2AZ

[email protected]

and

Paul Hovland

Argonne National Laboratory9700 South Cass RoadLemontIllinois60439

[email protected]

(2019)

Abstract.

Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable.

In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.

Automatic Differentiation, Stencil Computation, Loop-Transformation, Shared-Memory Parallel, Discrete Adjoints, Back-Propagation

††copyright: rightsretained††journalyear: 2019††copyright: licensedusgovmixed††conference: 48th International Conference on Parallel Processing; August 5–8, 2019; Kyoto, Japan††booktitle: 48th International Conference on Parallel Processing (ICPP 2019), August 5–8, 2019, Kyoto, Japan††price: 15.00††doi: 10.1145/3337821.3337906††isbn: 978-1-4503-6295-5/19/08††ccs: Mathematics of computing Automatic differentiation††ccs: Software and its engineering Source code generation††ccs: Theory of computation Parallel computing models††ccs: Theory of computation Shared memory algorithms

1. Introduction

Derivatives are an important ingredient for optimisation, inverse modelling, error analysis, and related methods. Reverse-mode automatic differentiation (AD) is used to compute these derivatives in applications including climate modelling (Heimbach et al., 2005), fluid dynamics (Giles et al., 2005), machine learning (Baydin et al., 2018), and image processing (Li et al., 2018). Starting from an implementation of a differentiable function, referred to as primal, AD generates a new program that computes the derivative of that function. This is achieved by using operator overloading (Griewank et al., 1996; Hogan, 2014), source-to-source (Griewank et al., 1989; Utke et al., 2008; Narayanan et al., 2010; Hascoet and Pascual, 2013), or just-in-time compilation (Revels et al., 2016; Innes, 2018).

In this paper, we focus on the differentiation of stencil loops, which appear, for example, in convolutional neural networks and structured-mesh PDE solvers. Stencil loops are characterised by a memory access pattern where each index in an output array is updated based on data gathered from a neighbourhood around that index in one or more input arrays. Stencil computations are straightforward to parallelise. For example, when OpenMP is used, each thread is guaranteed to have a different value for the counter in the parallel loop, and hence the write operations will occur at unique memory locations. The read operations, on the other hand, occur at neighbourhoods around that index, and more than one thread can read data from the same memory location. Techniques to execute stencils efficiently on multicore CPUs or GPUs have been studied extensively; see for example (Kamil et al., 2010; Ragan-Kelley et al., 2013).

We aim to apply reverse-mode automatic differentiation by source-to-source transformation to such stencil loops. We will refer to this approach as AD for the remainder of this work. AD, also known as back-propagation or adjoint differentiation, allows the computation of gradients with respect to an arbitrary number of function inputs, at a cost that is independent of the number of inputs. Thus, one can generate high-resolution seismic images, perform industrial-scale shape optimisation, or train large neural networks. The derivatives generated by AD are also sometimes called adjoint programs.

Reverse-mode AD traces derivatives backwards through a program, from the outputs to the inputs. In other words, the adjoint program is reading data from variables corresponding to the output of the primal program and is writing data to variables corresponding to the inputs of the primal. This data flow reversal also means that a concurrent read access in the primal program (i.e., a gather operation) can result in a concurrent write access in the derivative program (i.e., a scatter operation). Consequently, a parallelisation or vectorisation strategy that is valid for the primal program may not be valid for its derivative, and reverse-differentiating parallel programs without introducing data races (while maintaining parallel efficiency) is a hard problem. Previous approaches work only for a small class of well-structured programs where array indices or pointer targets of input variables can be statically shown to be non-overlapping (Förster, 2014; Hückelheim et al., 2017).

To avoid data races and therefore undefined program behaviour, one could safeguard every potentially conflicting write access with atomic pragmas or critical sections. Doing so, however, causes the write updates to be sequentialised, reducing parallel efficiency. Additionally, such constructs incur overheads due to the necessary thread synchronisation. The cost of executing the atomic updates may slow the program execution significantly, as we demonstrate in the test cases in this paper.

Another option is to use sum-reductions as implemented, for example, in the OpenMP runtime. In nontrivial scenarios, these require a private copy of the output array for each thread, which significantly increases the memory footprint of the program. Other approaches include colouring schemes, which incur an overhead in computing the colouring, as well as introducing synchronisation barriers for each colour, thus reducing parallel performance.

Our adjoint stencils technique solves this problem by implementing back-propagation using only gather operations obtained via loop transformations of the scatter operator. It requires no additional memory and performs the bulk of the computation with no additional synchronisation barriers. All transformations are applied at compile time. The idea is illustrated in Figure 1.

The paper is organised as follows. We summarise related work and background in Section 2. Following this, we describe the steps to generate adjoint stencils in Section 3 and present test cases in Section 4. We present experimental performance results in Section 5, and in Section 6, we summarise our conclusions and briefly describe future work.

2. Related Work

AD differs from numerical differentiation or finite differences in that it computes exact derivatives of the implemented computation, with no truncation errors and without the need to choose any finite step size. Automatic differentiation also differs from symbolic differentiation in that it can handle large computations including loops, branches, and function calls efficiently.

AD for parallel programs has been addressed for distributed-memory MPI programs (Hovland, 1997), exploiting the fact that message passing between parallel instances is explicit in the program source code and often limited to a relatively small, carefully defined interface. In this work, we focus on stencils, and not any specific parallelisation technique. Stencil compilers (e.g., YASK) can parallelise in MPI or shared memory but need the stencil structure.

Our approach is similar to that of autodiff for Halide (Li et al., 2018). For simple stencils, we expect both approaches to yield identical results. However, whereas the approach in (Li et al., 2018) is intimately tied to Halide and relies on zero padding and sum reductions, our approach can be applied to any stencil compiler, such as Devito (Luporini et al., 2018), HOSTS (Stock et al., 2014), and Tiramisu (Baghdadi et al., 2019), and requires neither padding nor reductions.

While our work uses some concepts from polyhedral compilers, it does so in a targeted way that is tailored to generating stencil loops that compute adjoints. Our framework supports nonrectangular iteration spaces, avoiding some of the overapproximation required in an interval-based representation like Halide.

A core aspect of our work is a gather-scatter conversion, similar to that described in (Stock et al., 2014). That paper, however, deals only with performance optimisation of stencil operations, and AD is not considered. To work with AD, the transformation must recognise that some inputs of the original function, as well as inputs to the adjoint computation, need to be gathered from the appropriate indices.

In a separate paper (Hückelheim et al., 2018), we describe Transposed Forward-Mode AD (TF-MAD) to transform the adjoint of a stencil back into a stencil loop. We applied this to a structured PDE solver and observed equally good scalability for the primal and adjoint. While that paper has an aim similar to the work presented here, it did not provide an implementation to automate the necessary transformations, and it was restricted to stencils with a symmetric data flow; that is, if the output array at index $i$ depends on the input array at index $j$ , then the output at $j$ must also depend on the input at $i$ . This is not the case for most stencil loops in practice, where the input space is slightly larger than the output space because of boundary effects.

Other authors have presented automatic differentiation for various domain-specific languages (Farrell et al., 2013; Paszke et al., 2017). These have in common that the differentiation needs to operate only on a high level of abstraction and assemble hand-optimised building blocks that perform efficient adjoint operations, for example, on neural network layers, partial differential equations, or linear algebra operators. In contrast, our work is not tied to any particular application domain and can differentiate any computation that has a stencil-like structure.

3. Method and Implementation

In this section we describe the procedure to generate adjoint stencils and its implementation in the stencil differentiation tool PerforAD, which we release with this work.111PerforAD v1.1, https://github.com/jhueckelheim/PerforAD/releases/tag/1.1

Before describing the transformations for arbitrary loop nest depths and stencil shapes, we begin with a brief summary of AD, followed by a simple example to illustrate adjoint stencils.

3.1. Automatic Differentiation

AD computes partial derivatives of each individual statement in a given primal computer program and implements a new adjoint program that accumulates those partial derivatives, following the chain rule, to obtain the derivative of the entire primal program. For example, if the original program contains a statement such as

⬇

r = sin(u);

then the adjoint program will contain the corresponding statement

⬇

ub = cos(u)*rb;

where ub is used to store the derivative of the program output with respect to u and rb holds the derivative of the program output with respect to r. The multiplication of the derivative of this statement (cos(u)) with the output adjoint rb is an application of the chain rule of calculus. If a primal expression contains nonlinear functions (in this case sin()), the computation of the partial derivatives of that expression needs to access the original primal value (in this case, u).

An AD tool often distinguishes between active and passive variables. For example, a user may be interested only in the derivatives of some outputs with respect to some inputs. The inputs and outputs of interest are called independent and dependent, respectively. All intermediate variables that depend on an independent input and that influence a dependent output are called active; all other variables are called passive. Only active variables require a derivative counterpart variable, and only expressions that involve an active variable need to be differentiated. PerforAD supports these requirements by allowing a user to specify which arrays are active.

PerforAD is not a general-purpose AD tool. Rather, it can be seen as an AD-aware loop transformation tool that focuses on the efficient creation of adjoint stencil loops. A general-purpose AD tool is currently necessary to differentiate the entire program, except for the stencil loops that can be handled by PerforAD. The tool does not contain its own parser front-end and instead relies on the caller to supply a high-level description of the stencil computation. In the test cases in our paper, this description was manually created. Automating this process remains future work. PerforAD is designed in a modular fashion to simplify the creation of new front-ends (for example, to parse Fortran or C code) and back-ends (for example, to print Fortran or C code). PerforAD is written in Python and makes extensive use of the symbolic math library SymPy (Meurer et al., 2017) for its internal computation, as well as for its user interface.

3.2. 1D Example

This section explains the adjoint stencil transformation step by step using a one-dimensional example. This explanation is accompanied by the illustration in Figure 2.

Suppose that a primal program contains the following parallel gather operation for an iteration space $i\in[1,n-1]$ :

⬇

#pragma omp parallel for private(i)

for ( i=1; i<=n - 1; i++ ) {

r[i] = c[i](2.0u[i-1]-3.0u[i]+4u[i+1]);

}

A straightforward reverse-mode differentiation of this loop to compute the derivatives of r with respect to u would yield the following scatter operation, where ub and rb represent the adjoint variables that correspond to u and r:

⬇

for ( i=1; i<=n-1; i++ ) {

ub[i-1] += 2.0 * c[i] * rb[i];

ub[i] -= 3.0 * c[i] * rb[i];

ub[i+1] += 4.0 * c[i] * rb[i];

}

If we assume that floating-point summation is associative (we come back to this point in Section 3.5), we can split this into three loops.

⬇

for ( i=1; i<=n-1; i++ ) {

ub[i-1] += 2.0 * c[i] * rb[i];

}

for ( i=1; i<=n-1; i++ ) {

ub[i] -= 3.0 * c[i] * rb[i];

}

for ( i=1; i<=n-1; i++ ) {

ub[i+1] += 4.0* c[i] * rb[i];

}

We can now substitute the loop counter i in the first, second, and third loop with j:=i-1, j:=i, and j:=i+1, respectively, and obtain three loops that each use the loop counter $j$ as write index and have iteration spaces $j\in[0,n-2]$ , $j\in[1,n-1]$ and $j\in[2,n]$ .

⬇

for ( j=0; j<=n-2; j++ ) {

ub[j] += 2.0 * c[j+1] * rb[j+1];

}

for ( j=1; j<=n-1; j++ ) {

ub[j] -= 3.0 * c[j] * rb[j];

}

for ( j=2; j<=n; j++ ) {

ub[j] += 4.0 * c[j-1] * rb[j-1];

}

We observe that the iteration space of the three loops intersects for $j\in[2,n-2]$ . Only the iteration space of the first loop contains $j=0$ , that of the first and second loops contain $j=1$ , that of the second and third contain $j=n-1$ , and the iteration space of the third loop contains $j=n$ . We can therefore compute the same result using the following parallel loop and remainder statements.

⬇

ub[0] += 2.0 * c[1] * rb[1];

ub[1] += 2.0 * c[2] * rb[2];

ub[1] -= 3.0 * c[1] * rb[1];

ub[n-1] -= 3.0 * c[n-1] * rb[n-1];

ub[n-1] += 4.0 * c[n-2] * rb[n-2];

ub[n] += 4.0 * c[n-1] * rb[n-1];

#pragma omp parallel for private(j)

for ( j=2; j<=n-2; j++ ) {

ub[j] += 2.0 * c[j+1] * rb[j+1];

ub[j] -= 3.0 * c[j] * rb[j];

ub[j] += 4.0 * c[j-1] * rb[j-1];

}

Assuming that $n$ is sufficiently large, the time spent executing the remainder statements will be insignificant compared with that spent inside the loop, which contains only updates to ub[j] that can easily be merged into a single statement to obtain

⬇

#pragma omp parallel for private(j) shared(rb,ub,c)

for ( j=2; j<=n-2; j++ ) {

ub[j] += 4.0 * c[j-1] * rb[j-1]

     - 3.0*c[j] * rb[j]

     + 2.0 * c[j+1] * rb[j+1];

}

This adjoint stencil loop has the same set of read and write indices and can be parallelised in the same way as the primal stencil loop. Note that the constant factors $4.0$ and $2.0$ have swapped their position compared with the primal stencil.

3.3. Multidimensional Stencils

We now present the generation of adjoint stencil loops for any number of dimensions and for any stencil shape, as implemented in PerforAD.

3.3.1. From Inputs to Adjoint Statements

The tool requires as input a symbolic expression that represents the computation performed by a single iteration of the innermost loop. In addition, a list of symbolic objects representing loop counters and symbolic expressions for loop bounds is required. Also, the user must provide a list of active variables for which counterpart derivative variables will be generated. Two complete examples of PerforAD input scripts are given in Section 4.

If the loop body is sufficiently simple, the provided expression is differentiated with respect to all active input variables separately using SymPy’s symbolic differentiation capabilities. We note that even though we use symbolic differentiation for the statement inside the innermost loop, the overall derivative is assembled by using automatic differentiation techniques. For large loop bodies where symbolic differentiation is too inefficient, instead of providing a concrete expression such as

[TABLE]

the user can provide an uninterpreted function such as

[TABLE]

PerforAD will then generate adjoint code where the derivatives of $f$ are also uninterpreted function calls. For example, the derivative of $f$ with respect to its input $a$ , evaluated at $(u_{i-1,j},u_{i,j-1})$ , would be written as

[TABLE]

A back-end could easily replace this expression by the appropriate call to a derivative function that was created manually or by AD.

Each of these generated expressions represents the partial derivative of the stencil with respect to one input (for example, $\frac{\partial f}{\partial L}$ ). The derivative expressions that are generated in this way will be denoted by $S_{1},S_{2},\ldots,S_{M}$ , where the number of such expressions $M$ is bounded by the stencil size. For example, the above two-point stencil would yield $M=2$ . In order to implement back-propagation, each expression needs to be multiplied with the adjoint variable that is associated with the output of the primal loop body (in this example, $\bar{r}_{i,j}$ ), and the result must be assigned to the adjoint variable that is associated with the input with respect to which this expression was differentiated (in this example, $\bar{u}_{i-1,j}$ ). This yields the expression $\bar{u}_{i-1,j}\;+=\frac{\partial f}{\partial a}\cdot\bar{r}_{i,j}$ . Without the following steps, the derivative expressions would collectively implement the scatter operation that is typical for conventional adjoints of stencil loops.

3.3.2. Shifting the Indices

To transform the scatter into a gather operation, PerforAD determines the constant offset $o$ that is used by each derivative expression (in the above example, a two-dimensional vector $o=[-1;0]$ ). All indices of that expression are increased by $-o$ . In our example, this yields $\bar{u}_{i,j}+=\frac{\partial f}{\partial L}\cdot\bar{r}_{i+1,j}$ . In order to preserve the semantics of the adjoint computation, the offset is stored along with the shifted expression and is later used to adjust the loop bounds accordingly. The result of this step is a list of tuples, each containing an expressions and an offset. All expressions now write to the same output index.

If the output of the primal stencil function depends on its inputs in a nonlinear way, then the derivative computation requires read access to the primal input, and any array accesses also need to occur with shifted indices. For our example with

[TABLE]

the derivative with offsets shifted by $o=[-1;0]$ would become

[TABLE]

In some cases, the shifted derivative functions will read data from indices that did not occur in any primal expression (here, $i+1,j-1$ ).

3.3.3. Generating the Core Loop Nest

The offsets are now used to identify the largest iteration space in which it is legal to execute all shifted derivative expressions generated in the previous step. We will refer to the loop that covers this space as the core loop nest. The loop bounds of the core loop nest depend on the bounds of the primal loop nest supplied by the user, as well as the offsets of the shifted expressions. In each dimension $i$ , the iteration space of the primal loop will be denoted as $i\in[s_{i},e_{i}]$ where $s_{i},e_{i}$ refer to the primal lower and primal upper bound, respectively. The $i$ component of the offset vector of a given adjoint expression $S_{l}$ will be denoted as $o_{i}(S_{l})$ . The lower core loop bound is then given by the original lower loop bound given by the user, plus the maximum offset in that dimension. Conversely, the upper loop bound is given by the original upper loop bound minus the minimum offset. Formally, the adjoint core loop bounds in $i$ -dimension are

[TABLE]

The core loop nest implements a gather operation with as many distinct read indices as the primal loop. If the iteration space is much larger than the stencil size, this loop nest will perform most of the adjoint computation. The core loop nest can be easily parallelised.

3.3.4. Boundary Treatment

In addition to the core loop nest, we need to generate boundary loop nests that execute the appropriate subset of derivative expressions wherever they are valid outside the core area. To achieve these, we split the set of derivative expressions into subsets such that each subset contains only expressions with the same iteration space. Usually more than one correct splitting exists. In the current implementation, the loop nests are generated such that their iteration spaces are disjoint; that is, every output index is accessed only within one loop nest that contains all statements that update this location. As a result, no synchronisation barriers are needed between the generated loop nests. Note that the boundary loops are also stencil gather operations, albeit with smaller numbers of read indices, and can be easily parallelised; see Figure 3.

The current splitting strategy results in a relatively large code size: with a primal stencil that gathers data from $n$ points in each of $d$ dimensions, the number of generated adjoint loop nests is at most $(2n-1)^{d}$ . For the one-dimensional three-point stencil in Section 3.2, this resulted in five adjoint loops, of which all but the core loop could be unrolled, since they performed only one iteration. For a dense $3\times 3$ stencil in two dimensions, the number of adjoint loops would be $25$ , and for a dense three-dimensional $3\times 3\times 3$ stencil, $125$ . If the primal stencil is not dense but instead a star-shaped stencil such as the one shown in Section 4.1, then $53$ loop nests are needed.

To avoid generating such a large amount of code, the strategy could be changed to instead generate only one remainder loop on each side of the core loop in each dimension, containing all derivative expressions. Since not all expressions are valid throughout the whole remainder, each statement would need to be guarded by an if-condition. While this might not be ideal for performance, the effect would be limited to the remainder loop nests, which are at most $d-1$ -dimensional. A polyhedral compiler could be used to implement many other strategies, with different trade-offs between the code size and the number of branches.

Another strategy is possible if the primal stencil is always executed on entire arrays and the AD tool can control the memory allocation for both primal and adjoint code. In this case, the input arrays could be padded with zeroes, which would allow all shifted adjoint expressions to be executed throughout the entire domain.

3.4. Restrictions

The transformation implemented in PerforAD requires that there be a set of input arrays from which data is read and a set of output arrays that are written in an assignment or incremented by using the += operator. The sets of read and write arrays must not intersect; that is, no array can be both input and output (with the aforementioned exception of the += operator). In addition, read access is allowed from constants that are not active for differentiation. All output arrays are accessed only by using the loop counters as indices. For a nested loop with counter variables $[i_{1},i_{2},\ldots]$ , a multidimensional array may be written or incremented by using a permuted subset of the counter variables as indices, for example, r[i_1][i_4][i_3]. All input arrays are read at indices that are constant integer offsets of the loop counters, for example, u[i_1+3][i_3-2][i_4-1]. The offsets must be known at compile time for expressions to be shifted and remainder loops to be generated. The loop nest must be perfect; that is, no statement may appear other than inside the innermost loop. All loop bounds must be affine functions.

3.5. Associativity of Floating-Point Summation

We assume that the contributions to each adjoint output index can be reordered arbitrarily without affecting the final result. This assumption would be true for real numbers but is not true for floating-point numbers because of roundoff effects. If this effect is important for some application, PerforAD could be modified to respect a particular ordering that the user chooses. The ordering would still be deterministic if the adjoint code is executed in parallel. This is possible because all updates to a given index are collected within the same iteration. In contrast, the scatter operation generated by a conventional AD tool could not be easily parallelised in a way that produces a deterministic floating-point result.

3.6. Verification

To verify the correctness of the adjoint code generated by PerforAD, we compared it with adjoint code generated by the AD tools ADIC (Narayanan et al., 2010) and Tapenade (Hascoet and Pascual, 2013). We used the test cases described in Section 4 and checked for equality of the computed adjoints. For all the input test cases in this paper, the outputs of all the three approaches were in full agreement.

4. Test Cases

We have selected two test cases to demonstrate the performance of adjoint stencils in PerforAD. Both test cases are structured-mesh discretisations of partial differential equations. The generated adjoint stencil code faces two performance challenges. First, a large number of remainder loops may cause large adjoint code sizes and potentially slow the execution. Second, the use of symbolic differentiation applied to the loop body may cause unnecessary computations for complicated primal loop bodies, especially since PerforAD makes no attempt to identify common sub-expressions within the same loop nest. We selected a test case with a deep loop nest and another one with a complicated loop body to test our tool in the presence of these challenges.

The first test case is a solver for wave equations with spatially varying wave propagation speed. The computation is performed in a three-deep loop nest, corresponding to the three spatial dimensions. A large number of iterations is typically performed with this type of solver, which makes this a good showcase for performance evaluations.

The second test case is a one-dimensional Burgers equation solver. The loop body is more challenging to differentiate, because it is nonlinear and only piecewise differentiable. As a result, the generated adjoint is more complex and contains branches.

For both test cases, we generate adjoint stencils using PerforAD and generate conventional adjoint code with scatter memory access using the Tapenade (Hascoet and Pascual, 2013) AD tool. Tapenade generates serial output code. We thus additionally create a parallelised version of the conventional adjoint code by adding OpenMP pragmas to the Tapenade-generated code, safeguarding the scattered memory access with atomic pragmas to avoid data races.

4.1. Wave Equation

The wave equation and its adjoint are solved numerically in applications including seismic imaging, where performance on high-performance computing systems has been identified as a bottleneck (Araya-Polo et al., 2011). We perform one single time step on a $1000\times 1000\times 1000$ three-dimensional grid. The wave equation is given by

[TABLE]

By using finite differences to discretise the differential operators in space and time with spatial and temporal resolution $dx$ and $dt$ and by letting $c:=a^{2}$ and

[TABLE]

a primal solver to solve this equation, as well as its adjoint, can be generated with the Python script shown in Figure 4.

The primal and adjoint stencil loop shown in Figure 5 are generated by PerforAD and shown alongside the conventional adjoint generated by Tapenade, which is not a stencil loop and thus had to be manually parallelised.

4.2. Burgers Equation

Computational fluid dynamics (CFD) is another application that routinely uses adjoints and relies on efficient parallelisation to compute industrial-scale test cases. The Burgers equation is often used as a prototype in CFD solver development, because it is scalar and thus relatively easy to implement, but it still shows some of the challenging nonlinear behaviour that makes CFD simulations difficult. In one spatial dimension, it can be written as

[TABLE]

An interesting challenge in the context of this work is that the convective term is nonlinear. A common way of discretising this term is upwinding, where the finite difference formula depends on the sign of $u$ as

[TABLE]

By letting $C:=\frac{dt}{dx}$ and $D:=\nu*\frac{dt}{dx^{2}}$ , the primal and adjoint stencils can be implemented by using the PerforAD script shown in Figure 6, yielding the C code shown in Figure 7.

The stencil is nonlinear, and the value of $u$ is required at various points in the derivative computation. The adjoint code generated by Tapenade handles the $\min$ and $\max$ functions differently from PerforAD. Instead of evaluating these functions within the adjoint code, Tapenade creates a loop that evaluates the functions separately and pushes the results onto a stack. These precomputed values are popped from the stack during the adjoint computation. The resulting code cannot easily be parallelised, since the order in which values are added and removed from the stack is crucial.

To exclude this effect and compare PerforAD with a parallelised adjoint code that uses scatter operations, we manually modified the Tapenade output to instead evaluate the $\min$ and $\max$ functions within the adjoint computation itself, and we removed the stack access. Following this, we made the loop OpenMP-parallel and added atomic pragmas. We ran the primal and all adjoint solvers on a one-dimensional problem with one time step on 1 billion cells.

5. Experimental Results

We obtained timing results on two test systems for the primal and adjoint, as well as conventional adjoint solvers for the wave equation, and the Burgers equation. The first test system, referred to as Broadwell, uses a dual socket system with 12-core Intel Xeon E5-2650 v4 Broadwell CPUs, with a combined total of 24 physical cores. To eliminate NUMA effects on our experiments, we limited the number of threads to 12, and used numactl to restrict memory and computation to a single socket and OMP_PROC_BIND to ensure thread pinning. The other test system, referred to as KNL, runs an Intel XeonPhi Knights Landing 7210 processor with 64 physical cores and up to 256 threads, in KMP_AFFINITY=scatter mode. The examples are compiled using the Intel C Compiler version 18 using the flags -O3 -fopenmp -xHost.

5.1. Broadwell

Since PerforAD introduces overhead by differentiating the loop body separately for each function input using SymPy, we expect the serial performance of the generated adjoint stencils to be worse than that of Tapenade-generated adjoint code. On the other hand, we expect adjoint stencils to have parallel scalability at least equivalent to that of the primal stencil.

All of this is indeed confirmed in our experiments. Figure 10 shows absolute runtimes for our wave equation test case on Broadwell. The serial runtime of adjoint stencils is $64\%$ higher than that of Tapenade adjoints, at $8.52s$ compared with $5.43s$ . However, using atomic pragmas on the conventional adjoint slows that code, and it requires $91s$ even if only one thread is used, and it slows down further with any additional thread that is added to the computation. This result is in agreement with the reported slowdown of using atomics in (Ragan-Kelley et al., 2013). In contrast, the adjoint stencil code scales well and performs slightly better than the conventional adjoint if $2$ threads are used, and significantly better if more threads are added. Similarly, Figure 11 shows absolute runtimes for the Burgers equation test case, where the serial performance of adjoint stencils is worse but, because of the improved scalability, outperforms the conventional adjoint solver if at least $2$ threads are used. The scalability of all wave equation solvers on Broadwell is presented in Figure 8, and for the Burgers equation solvers in Figure 9.

5.2. Knights Landing (KNL)

On the KNL system, the primal wave equation solver scales up to 16 threads, then plateaus. The solver cannot efficiently scale to the full number of threads, probably because the memory bandwidth is saturated. Adjoint stencils scale better than the primal stencils do (up to 32 threads), probably because the loop body contains more operations. The better scalability means that the performance gap between the primal and adjoint stencil is slightly smaller for parallel than for the serial execution.

The Tapenade wave equation adjoint is not parallelised and hence does not scale. The manually parallelised scatter adjoint with atomics scales up to 2 threads, then becomes slower with any additional thread. Its absolute runtime is always worse than that of the serial adjoint. The best-observed runtime for stencil adjoints is more than $19\times$ better than that of the conventional adjoint code.

For the Burgers equation, adjoint and primal stencils scale well up to 64 threads. Parallelised conventional adjoints with atomics always perform worse than serial conventional adjoints in our experiment. On KNL, we used the original Tapenade output that precomputes $\min$ and $\max$ functions on a stack for our serial run times. Hence, the PerforAD-generated adjoint stencil is faster even in serial. The manually parallelised Tapenade adjoint is identical to that used on Broadwell, with no stack access. Adjoint stencils outperform conventional adjoints by a factor of $125\times$ .

6. Conclusion and Future Work

We have presented adjoint stencils, a method for automatic differentiation or back-propagation of gather stencil loops in a way that uses only gather stencil operations. This method is implemented in the open source tool PerforAD, which we release together with this paper. The test cases in our work demonstrate that adjoint stencils are as scalable on CPU and XeonPhi systems as are the original computations, and can outperform programs generated by conventional automatic differentiation by orders of magnitude.

We plan to test our method also on GPU systems, and to explore similar transformations not only for stencil computations but more generally for programs that have been performance-optimised. We aim to extend PerforAD by combining it with conventional automatic differentiation tools, as well as polyhedral compilers (Bondhugula et al., 2008; Baghdadi et al., 2019), to target more applications. It would also be useful to integrate our tool with domain-specific language compilers (Kamil et al., 2010; Luporini et al., 2018; Kronawitter and Lengauer, 2018) to make these transformation available to a larger audience.

Acknowledgements This work was funded in part by support from the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided on Blues, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory.

Bibliography28

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1)
2Araya-Polo et al . (2011) M. Araya-Polo, J. Cabezas, M. Hanzich, M. Pericas, F. Rubio, I. Gelado, M. Shafiq, E. Morancho, N. Navarro, E. Ayguade, J. M. Cela, and M. Valero. 2011. Assessing Accelerator-Based HPC Reverse Time Migration. IEEE Transactions on Parallel and Distributed Systems 22, 1 (Jan 2011), 147–162. https://doi.org/10.1109/TPDS.2010.144 · doi ↗
3Baghdadi et al . (2019) Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. 2019. Tiramisu: A polyhedral compiler for expressing fast and portable code. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization . IEEE Press, 193–205.
4Baydin et al . (2018) Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2018. Automatic differentiation in machine learning: a survey. Journal of Marchine Learning Research 18 (2018), 1–43.
5Bondhugula et al . (2008) Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) .
6Farrell et al . (2013) Patrick E Farrell, David A Ham, Simon W Funke, and Marie E Rognes. 2013. Automated derivation of the adjoint of high-level transient finite element programs. SIAM Journal on Scientific Computing 35, 4 (2013), C 369–C 393.
7Förster (2014) Michael Förster. 2014. Algorithmic Differentiation of Pragma-Defined Parallel Regions: Differentiating Computer Programs Containing Open MP . Ph.D. Dissertation. RWTH Aachen.
8Giles et al . (2005) MB Giles, D Ghate, and MC Duta. 2005. Using automatic differentiation for adjoint CFD code development. (2005).