Multi-dimensional intra-tile parallelization for memory-starved stencil computations
Tareq Malas, Georg Hager, Hatem Ltaief, David Keyes

TL;DR
This paper introduces a multi-dimensional intra-tile parallelization technique for stencil computations on multicore CPUs with a shared cache. The technique significantly improves cache efficiency and performance and reduces energy consumption, especially for stencils with low arithmetic intensity.
Contribution
The paper proposes a novel intra-tile parallelization method, together with an auto-tuner framework, Girih, for optimizing stencil computations on shared-cache multicore architectures. The approach outperforms existing frameworks.
Findings
Girih achieves superior performance across various stencil schemes.
The method reduces the required cache space without running into hardware prefetching problems.
Energy consumption decreases because less DRAM bandwidth is used.
Abstract
Optimizing the performance of stencil algorithms has been the subject of intense research over the last two decades. Since many stencil schemes have low arithmetic intensity, most optimizations focus on increasing the temporal data access locality, thus reducing the data traffic through the main memory interface with the ultimate goal of decoupling from this bottleneck. There are, however, only a few approaches that explicitly leverage the shared cache feature of modern multicore chips. If every thread works on its private, separate cache block, the available cache space can become too small, and sufficient temporal locality may not be achieved. We propose a flexible multi-dimensional intra-tile parallelization method for stencil algorithms on multicore CPUs with a shared outer-level cache. This method leads to a significant reduction in the required cache space without adverse effects…
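The cache-space argument in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper or from Girih; the function name `cache_per_tile`, the 24 MiB cache size, and the thread count of 8 are hypothetical values chosen for illustration. It assumes the shared cache is split evenly among the tiles that are active at the same time and ignores halo data and other cache users.

```python
def cache_per_tile(shared_cache_bytes: int, n_threads: int, threads_per_tile: int) -> int:
    """Return the shared-cache bytes available to each concurrently active tile.

    Simplifying assumption: the cache is divided evenly among active tiles.
    """
    if threads_per_tile < 1 or n_threads % threads_per_tile != 0:
        raise ValueError("threads_per_tile must be a positive divisor of n_threads")
    concurrent_tiles = n_threads // threads_per_tile
    return shared_cache_bytes // concurrent_tiles


CACHE_BYTES = 24 * 1024 * 1024  # hypothetical 24 MiB shared outer-level cache
N_THREADS = 8                   # hypothetical thread count

# One private tile per thread: eight tiles compete for the cache.
private_tile_budget = cache_per_tile(CACHE_BYTES, N_THREADS, threads_per_tile=1)

# All threads cooperate on one tile: the tile can use the whole cache.
shared_tile_budget = cache_per_tile(CACHE_BYTES, N_THREADS, threads_per_tile=N_THREADS)

print(private_tile_budget // (1024 * 1024), "MiB per private tile")
print(shared_tile_budget // (1024 * 1024), "MiB for one shared tile")
```

Under these assumptions, letting several threads share a tile leaves fewer tiles in the cache at once, so each tile has more room for the data it reuses across time steps.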
