# Performance and Portability of Accelerated Lattice Boltzmann Applications with OpenACC

**Authors:** E. Calore, A. Gabbana, J. Kraus, S. F. Schifano, R. Tripiccione

arXiv: 1703.00186 · 2017-03-02

## TL;DR

This paper evaluates the performance and portability of an OpenACC-accelerated Lattice Boltzmann application across diverse HPC architectures, comparing it with CUDA and OpenCL implementations to assess efficiency and portability trade-offs.

## Contribution

The paper analyzes how effective OpenACC is for developing portable HPC applications, using Lattice Boltzmann simulations as the test case across multiple hardware platforms.

## Key findings

- OpenACC achieves performance comparable to CUDA and OpenCL implementations of the same algorithm on GPUs.
- OpenACC's portability lets the same code run efficiently on different architectures, including traditional CPUs and GPUs.
- The performance cost of portability is acceptable for large-scale HPC applications.

## Abstract

An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed using specific programming languages, threatening maintainability, portability and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directive clauses to mark regions of existing C, C++ or Fortran codes to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper we address precisely this issue, using as a test-bench a massively parallel Lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
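The directive-based approach mentioned in the abstract can be illustrated with a minimal sketch in C. This is not the authors' code: the function `relax`, its parameters, and the toy update rule are hypothetical, chosen only to show how a single `#pragma acc` line marks an existing loop for offloading. A compiler without OpenACC support ignores the pragma and runs the same loop serially on the CPU, which is the portability property the paper sets out to measure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy kernel (an assumption, not taken from the paper):
 * relax every value of f toward a target feq with rate omega,
 *   out[i] = f[i] + omega * (feq[i] - f[i]).
 */
void relax(int n, const double *restrict f, const double *restrict feq,
           double omega, double *restrict out)
{
    /* Host-side sanity check before the (possibly offloaded) loop. */
    assert(f != NULL && feq != NULL && out != NULL);

    /* With an OpenACC compiler this loop is offloaded to the accelerator;
     * the data clauses copy the inputs in and the result back out.
     * Other compilers ignore the pragma and execute the loop serially. */
    #pragma acc parallel loop copyin(f[0:n], feq[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        out[i] = f[i] + omega * (feq[i] - f[i]);
    }
}
```

Calling `relax(4, f, feq, 0.5, out)` on plain host arrays produces the same values whether or not the pragma is honored; only where the loop executes changes. Enabling OpenACC is a compiler-specific build option, so the source itself stays unchanged across targets.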

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.00186/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/1703.00186/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/1703.00186/full.md

---
Source: https://tomesphere.com/paper/1703.00186