# A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

**Authors:** Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka

arXiv: 1907.06154 · 2019-09-09

## TL;DR

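This paper introduces a versatile systolic execution model for memory-bound GPU kernels. It shifts partial sums between threads with CUDA warp primitives and uses register files as a cache, outperforming the top reported stencil implementations and running 2D convolution on average 2.5x faster than Nvidia's NPP on Tesla V100 and P100.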

## Contribution

The paper presents a systolic-array-inspired execution model for memory-bound regular kernels on CUDA-enabled GPUs. The model outperforms state-of-the-art stencil and convolution implementations on the Tesla V100 and P100.

## Key findings

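- Outperforms the top reported stencil implementations, including those with sophisticated temporal and spatial blocking, on Tesla V100 and P100 GPUs.
- On average 2.5x faster than Nvidia's NPP for 2D convolution of general filter sizes and shapes on V100 and P100.
- Effective across a wide variety of stencil kernels common in HPC and convolution kernels used in deep learning workloads.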

## Abstract

This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning workloads). Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on V100 and P100 GPUs.
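The idea of shifting partial sums with warp primitives can be illustrated with a small CPU-side simulation. The sketch below is not the authors' implementation: it is a minimal Python model of a single 32-lane warp computing a 1D convolution, assuming each lane keeps one input element in a "register" and passes its running partial sum to the next lane after every multiply-add. The helper `shfl_up` is a hypothetical stand-in that mimics the semantics of CUDA's `__shfl_up_sync`, and the names `systolic_conv1d_warp` and `reference_conv1d` are illustrative only.

```python
WARP_SIZE = 32  # number of lanes in a CUDA warp


def shfl_up(values, delta):
    """Simulated warp shuffle-up (hypothetical helper, modelled on __shfl_up_sync).

    Lane l reads the value held by lane l - delta. Lanes with l < delta
    have no source lane and keep their own value.
    """
    return [values[l - delta] if l >= delta else values[l]
            for l in range(len(values))]


def systolic_conv1d_warp(x, w):
    """1D 'valid' convolution over one warp by shifting partial sums.

    x: WARP_SIZE input values, one per lane (the per-lane register).
    w: filter weights of length K.
    Returns out[i] = sum_k w[k] * x[i + k] for i in 0 .. WARP_SIZE - K.
    """
    assert len(x) == WARP_SIZE
    K = len(w)
    partial = [0.0] * WARP_SIZE  # one partial sum per lane
    for k, wk in enumerate(w):
        # Every lane multiplies its own register value by the current weight.
        partial = [p + wk * xl for p, xl in zip(partial, x)]
        # Pass the partial sum to the next lane, except after the last weight.
        if k < K - 1:
            partial = shfl_up(partial, 1)
    # Lanes 0 .. K-2 never received a complete sum, so discard them.
    return partial[K - 1:]


def reference_conv1d(x, w):
    """Direct evaluation of the same convolution, for comparison."""
    K = len(w)
    return [sum(w[k] * x[i + k] for k in range(K))
            for i in range(len(x) - K + 1)]
```

Calling `systolic_conv1d_warp(list(range(32)), [1, 2, 1])` should give the same 30 values as `reference_conv1d` on the same arguments. In this sketch the input never moves between lanes; only the partial sums travel, which is the property that lets a real kernel keep its inputs in registers instead of re-reading them from shared or global memory.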

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.06154/full.md

## Figures

22 figures with captions in the complete paper: https://tomesphere.com/paper/1907.06154/full.md

## References

64 references — full list in the complete paper: https://tomesphere.com/paper/1907.06154/full.md

---
Source: https://tomesphere.com/paper/1907.06154