Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory
Markus Wittmann, Georg Hager, Gerhard Wellein

TL;DR
This paper introduces a pipelined, multicore-aware temporal blocking algorithm for stencil codes that makes explicit use of shared caches and extends to hybrid shared/distributed-memory clusters, improving performance on bandwidth-limited multicore systems.
Contribution
It presents a novel pipelined approach to temporal blocking that explicitly uses shared caches, minimizes synchronization and boundary overhead, and extends to hybrid shared/distributed-memory clusters.
Findings
Enhanced stencil code performance on multicore chips
Effective use of shared caches reduces memory bandwidth pressure
Successful application in hybrid shared/distributed-memory environments
Abstract
New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. For clusters of shared-memory nodes we demonstrate how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.
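To make the idea of pipelined temporal blocking concrete, here is a minimal sequential sketch in Python. It is not the authors' implementation. It assumes a 1D three-point averaging stencil with fixed end points, and the names `jacobi_naive`, `jacobi_pipelined`, and `block` are hypothetical. Each pipeline stage `s` applies time step `s` to a block that trails the previous stage's block by one grid point, so a block is updated several times while it could still reside in cache. In a multicore setting one might map each stage to a core sharing a cache, but that mapping is only an assumption here. For clarity the sketch stores every time level, which a real code would avoid.

```python
def jacobi_naive(u, steps):
    """Reference: `steps` full sweeps of a 3-point average, end points fixed."""
    cur = list(u)
    for _ in range(steps):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_pipelined(u, steps, block):
    """Same result as jacobi_naive, computed in one pass over the grid.

    Stage s (1..steps) produces time level s. At pipeline iteration k it
    updates the index range [k*block - (s-1), (k+1)*block - (s-1)), i.e. it
    trails stage s-1 by one point so that all its inputs are already final.
    """
    n = len(u)
    if steps == 0 or n < 3:
        return list(u)
    # Illustration only: one array per time level (boundaries pre-filled).
    levels = [list(u) for _ in range(steps + 1)]
    # Enough iterations for the last stage to reach the right boundary.
    n_iters = -(-(n + steps - 2) // block)  # ceiling division
    for k in range(n_iters):
        for s in range(1, steps + 1):
            lo = max(1, k * block - (s - 1))
            hi = min(n - 1, (k + 1) * block - (s - 1))
            src, dst = levels[s - 1], levels[s]
            for i in range(lo, hi):
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return levels[steps]


if __name__ == "__main__":
    grid = [float(i * i % 7) for i in range(20)]
    assert jacobi_pipelined(grid, steps=4, block=3) == jacobi_naive(grid, steps=4)
```

Both functions perform the same arithmetic in the same per-point order, so their results match exactly. The difference lies only in traversal order: the pipelined version touches each block `steps` times in quick succession instead of streaming the whole grid from memory once per time step.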
