# Automatic Code Generation for High-Performance Discontinuous Galerkin Methods on Modern Architectures

**Authors:** Dominic Kempf, René Heß, Steffen Müthing, Peter Bastian

arXiv: 1812.08075 · 2018-12-20

## TL;DR

This paper introduces a code generation approach that enhances the sustainability and performance portability of high-performance Discontinuous Galerkin methods on modern architectures by automating SIMD vectorization strategies.

## Contribution

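The paper combines the domain-specific language UFL with loopy, an intermediate representation for the computational kernel, to generate code for the Dune module dune-pdelab. This enables automated vectorization and autotuning of DG assembly on hexahedral meshes.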

## Key findings

- Achieved 40% to 60% of the machine's theoretical peak performance for one matrix-free application of the operator, measured with both AVX2 and AVX512.
- Developed a flexible, autotuned code generator for DG assembly.
- Enhanced sustainability by reducing hand-written vectorized code.
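
To make the autotuning idea concrete, here is a minimal sketch of the general pattern: benchmark every candidate variant of a kernel and keep the cheapest one. It is not the paper's code generator. The names `autotune`, `dot_loop`, and `dot_builtin` are hypothetical, and the two dot-product variants merely stand in for differently vectorized kernels.

```python
import time


def dot_loop(x, y):
    # Hypothetical variant 1: explicit accumulation loop.
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total


def dot_builtin(x, y):
    # Hypothetical variant 2: generator expression fed to sum().
    return sum(a * b for a, b in zip(x, y))


def autotune(variants, args, measure=None, repeats=5):
    """Pick the cheapest variant.

    variants: dict mapping a name to a callable
    args:     tuple of arguments passed to every variant
    measure:  optional callable fn -> cost; defaults to the best
              wall-clock time over `repeats` runs
    Returns (best_name, costs) where costs maps each name to its cost.
    """
    if measure is None:
        def measure(fn):
            best = float("inf")
            for _ in range(repeats):
                start = time.perf_counter()
                fn(*args)
                best = min(best, time.perf_counter() - start)
            return best

    costs = {name: measure(fn) for name, fn in variants.items()}
    best_name = min(costs, key=costs.get)
    return best_name, costs


if __name__ == "__main__":
    data = ([float(i) for i in range(10_000)],) * 2
    name, costs = autotune({"loop": dot_loop, "builtin": dot_builtin}, data)
    print("selected variant:", name)
```

Passing a custom `measure` function keeps the selection logic separate from the cost model, so the same driver could rank variants by measured time, by an instruction count, or by any other metric.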

## Abstract

SIMD vectorization has lately become a key challenge in high-performance computing. However, hand-written explicitly vectorized code often poses a threat to the software's sustainability. In this publication we solve this sustainability and performance portability issue by enriching the simulation framework dune-pdelab with a code generation approach. The approach is based on the well-known domain-specific language UFL, but combines it with loopy, a more powerful intermediate representation for the computational kernel. Given this flexible tool, we present and implement a new class of vectorization strategies for the assembly of Discontinuous Galerkin methods on hexahedral meshes exploiting the finite element's tensor product structure. The optimal variant from this class is chosen by the code generator through an autotuning approach. The implementation is done within the open source PDE software framework Dune and the discretization module dune-pdelab. The strength of the proposed approach is illustrated with performance measurements for DG schemes for a scalar diffusion reaction equation and the Stokes equation. In our measurements, we utilize both the AVX2 and the AVX512 instruction set, achieving 40% to 60% of the machine's theoretical peak performance for one matrix-free application of the operator.
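
The tensor product structure mentioned in the abstract is usually exploited through a technique commonly called sum factorization. The abstract does not spell out the kernel, so the NumPy sketch below only illustrates the general idea on one hexahedral cell and is not the paper's generated code. The function names, the array shapes, and the use of the same 1D basis matrix `B` in every direction are assumptions made for the example, and no SIMD vectorization is shown.

```python
import numpy as np


def evaluate_naive(coeffs, B):
    """Evaluate a tensor-product expansion at all quadrature points
    by building the full 3D basis matrix.

    coeffs: (n, n, n) array of coefficients on one cell
    B:      (q, n) array, 1D basis functions at 1D quadrature points
    Cost is proportional to q**3 * n**3.
    """
    q = B.shape[0]
    B3 = np.kron(np.kron(B, B), B)  # shape (q**3, n**3)
    return (B3 @ coeffs.ravel()).reshape(q, q, q)


def evaluate_sum_factorized(coeffs, B):
    """Same result, obtained by applying the 1D matrix along one
    direction at a time (three small contractions)."""
    u = np.einsum("ai,ijk->ajk", B, coeffs)  # contract x direction
    u = np.einsum("bj,ajk->abk", B, u)       # contract y direction
    u = np.einsum("ck,abk->abc", B, u)       # contract z direction
    return u


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, q = 3, 4
    B = rng.standard_normal((q, n))
    coeffs = rng.standard_normal((n, n, n))
    same = np.allclose(evaluate_naive(coeffs, B),
                       evaluate_sum_factorized(coeffs, B))
    print("results agree:", same)
```

When the number of quadrature points per direction is comparable to the number of basis functions per direction, the naive evaluation scales like n^6 per cell in 3D, while the factorized version scales like n^4. The small dense contractions that remain are the kind of kernel for which different SIMD layouts can be generated and compared.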

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.08075/full.md

## Figures

29 figures with captions in the complete paper: https://tomesphere.com/paper/1812.08075/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/1812.08075/full.md

---
Source: https://tomesphere.com/paper/1812.08075