Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450 Processor
Tareq M. Malas, Aron J. Ahmadia, Jed Brown, John A. Gunnels, David E. Keyes

TL;DR
This paper presents a high-level assembly synthesis approach for optimizing streaming numerical kernels on the IBM Blue Gene/P's energy-efficient, in-order PowerPC 450 core, achieving a 1.7x speedup over previous results for 3D stencil computations.
Contribution
It introduces a novel high-level assembly synthesis method tailored for the Blue Gene/P's architecture, improving the performance of streaming kernels through optimized scheduling.
Findings
Achieved a 1.7x speedup over previous results for 3D stencil kernels.
Demonstrated the effectiveness of the mechanically scheduled kernel variants across a variety of memory scenarios.
Validated the approach through simulation, verification, and analysis.
Abstract
Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the Central Processing Unit (CPU). We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM Blue Gene/P supercomputer's PowerPC 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU's instruction set. We demonstrate the…
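For readers unfamiliar with the term, the following is a minimal pure-Python sketch of what a 3D stencil kernel computes. It is not the authors' assembly synthesis framework and says nothing about PowerPC 450 scheduling; the 7-point shape, the function name `stencil7`, and the coefficients `c_center` and `c_neighbor` are assumptions chosen for illustration only.

```python
def stencil7(u, c_center, c_neighbor):
    """Apply a 7-point stencil to the interior of a 3D grid (nested lists).

    Hypothetical reference kernel: each interior output point is a weighted
    sum of the input point and its six face neighbors. Boundary points of
    the output are left at 0.0.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    out = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            # The innermost loop walks the contiguous k dimension, so the
            # input is read as a small number of regular streams.
            for k in range(1, nz - 1):
                out[i][j][k] = c_center * u[i][j][k] + c_neighbor * (
                    u[i - 1][j][k] + u[i + 1][j][k]
                    + u[i][j - 1][k] + u[i][j + 1][k]
                    + u[i][j][k - 1] + u[i][j][k + 1]
                )
    return out


def make_grid(n, value=0.0):
    """Build an n x n x n grid filled with a constant (illustrative helper)."""
    return [[[value] * n for _ in range(n)] for _ in range(n)]


# Usage: a Laplacian-like stencil applied to a constant field
# gives zero at every interior point.
ones = make_grid(4, 1.0)
flat = stencil7(ones, -6.0, 1.0)
```

Every output point depends only on a fixed, regular neighborhood of the input, which is the regularity of memory access the abstract refers to. The difficulty described in the abstract lies in sustaining throughput for such loops on an in-order core, not in the arithmetic itself.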
