Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs
Mark Mawson, Alistair Revell

TL;DR
This paper optimizes a 3D lattice Boltzmann fluid solver for Kepler architecture nVidia GPUs, analyzing memory strategies and achieving over 1036 MLUPS, while identifying hardware-related performance bottlenecks.
Contribution
It introduces a simplified memory transfer approach for LBM on Kepler GPUs and evaluates its performance against more complex methods, providing detailed benchmarking results.
Findings
The simple memory transfer approach is the most efficient for LBM on Kepler GPUs.
Peak performance exceeds 1036 MLUPS on a K20C GPU.
A hardware-related periodic bottleneck in performance was observed.
Abstract
The lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third generation nVidia GPU hardware, also known as 'Kepler'. We provide a review of previous optimisation strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming) involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of these make use of 'performance enhancing' features of the GPU: shared memory and the new shuffle instruction found in Kepler-based GPUs. These are compared to a standard transfer of data…
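To make the streaming step concrete, the following is a minimal CPU reference sketch in plain C++. It is not one of the GPU kernels the paper benchmarks. The `Grid` and `Velocity` types, the `wrap` and `stream` functions, the structure-of-arrays layout, the pull-style formulation and the periodic boundaries are all assumptions introduced here for illustration; none of them are taken from the paper.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type for illustration; not taken from the paper.
struct Grid {
    int nx, ny, nz;
    std::size_t cells() const {
        return static_cast<std::size_t>(nx) * ny * nz;
    }
    // Linear index with x varying fastest.
    std::size_t index(int x, int y, int z) const {
        return (static_cast<std::size_t>(z) * ny + y) * nx + x;
    }
};

// One discrete lattice velocity (cx, cy, cz).
using Velocity = std::array<int, 3>;

// Wrap a coordinate into [0, n); periodic boundaries are an assumption here.
inline int wrap(int i, int n) { return ((i % n) + n) % n; }

// Pull-style streaming: for each direction q, every cell copies the value
// that its upstream neighbour (position minus c[q]) held before the step.
// Storage is structure-of-arrays: src[q][cell] and dst[q][cell].
void stream(const Grid& g,
            const std::vector<Velocity>& c,
            const std::vector<std::vector<double>>& src,
            std::vector<std::vector<double>>& dst) {
    assert(src.size() == c.size() && dst.size() == c.size());
    for (std::size_t q = 0; q < c.size(); ++q) {
        assert(src[q].size() == g.cells() && dst[q].size() == g.cells());
        for (int z = 0; z < g.nz; ++z) {
            for (int y = 0; y < g.ny; ++y) {
                for (int x = 0; x < g.nx; ++x) {
                    const int xs = wrap(x - c[q][0], g.nx);
                    const int ys = wrap(y - c[q][1], g.ny);
                    const int zs = wrap(z - c[q][2], g.nz);
                    dst[q][g.index(x, y, z)] = src[q][g.index(xs, ys, zs)];
                }
            }
        }
    }
}
```

Each destination cell is written exactly once per direction and reads only from `src`, so every (direction, cell) pair is independent of the others. On a GPU this typically maps to one thread per lattice site, and the approaches compared in the abstract differ in which hardware path the shifted values travel through.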
