Multicore-optimized wavefront diamond blocking for optimizing stencil updates
Tareq Malas, Georg Hager, Hatem Ltaief, Holger Stengel, Gerhard Wellein, and David Keyes

TL;DR
This paper introduces a novel multicore-optimized wavefront diamond blocking technique for stencil updates that significantly reduces memory traffic and improves performance in bandwidth-limited scenarios on modern processors.
Contribution
It combines multi-core wavefront temporal blocking with diamond tiling to create stencil update schemes that lower memory pressure and enhance bandwidth utilization.
Findings
Large reductions in memory traffic compared to existing methods
Performance improvements in bandwidth-starved conditions
Effective trade-off control between concurrency and memory usage
Abstract
The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and…
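To make the diamond-tiling ingredient concrete, the following is a minimal sketch for a 1D, radius-1 Jacobi stencil. It is not the authors' implementation: the function names (`naive_sweeps`, `diamond_tiled`), the 3-point averaging stencil, the fixed boundary values, and the tile `width` parameter are all assumptions chosen for illustration, and the multi-core wavefront and thread-group parts of the scheme are left out. The sketch only shows that space-time points can be grouped into diamonds and executed tile by tile, many time steps at once, while giving the same result as sweeping the whole grid once per time step.

```python
def naive_sweeps(u0, steps):
    """Reference: one full grid sweep per time step (3-point average)."""
    cur, nxt = list(u0), list(u0)
    for _ in range(steps):
        for x in range(1, len(cur) - 1):
            nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0
        cur, nxt = nxt, cur
    return cur


def diamond_tiled(u0, steps, width):
    """Same update, but executed diamond by diamond in space-time.

    Hypothetical illustration: a point (t, x) belongs to the tile
    (a, b) = ((x + t) // width, (x - t) // width), a diamond in the
    (x, t) plane. Its three inputs at time t - 1 lie in tiles with
    a' <= a and b' >= b, so running tiles in order of increasing
    a - b respects every dependency.
    """
    n = len(u0)
    # buf[t % 2] holds time level t; boundary cells stay fixed in both.
    buf = [list(u0), list(u0)]

    # Collect the points of each diamond; the loop order keeps the
    # points inside a tile sorted by time step.
    tiles = {}
    for t in range(1, steps + 1):
        for x in range(1, n - 1):
            key = ((x + t) // width, (x - t) // width)
            tiles.setdefault(key, []).append((t, x))

    # Tiles sharing the same a - b form one row of mutually
    # independent diamonds (they could run concurrently).
    for key in sorted(tiles, key=lambda ab: (ab[0] - ab[1], ab[0])):
        for t, x in tiles[key]:
            src, dst = buf[(t - 1) % 2], buf[t % 2]
            dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0

    return buf[steps % 2]


if __name__ == "__main__":
    u0 = [float((i * 7) % 11) for i in range(40)]
    same = diamond_tiled(u0, 12, 4) == naive_sweeps(u0, 12)
    print("tiled result matches naive sweeps:", same)
```

In this toy version the whole grid fits in cache anyway, so no traffic is saved; the point is only the legality of the tile order. In a real implementation each diamond would be sized to stay cache-resident across its time steps, which is where the reduction in main-memory traffic would come from.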
