MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit
Yinuo Wang, Tianqi Mao, Lin Gan, Wubing Wan, Zeyu Song, Jiayu Fu, Lanke He, Wenqiang Wang, Zekun Yin, Wei Xue, and Guangwen Yang

TL;DR
MMStencil leverages matrix units, SIMD, and memory optimizations to significantly accelerate high-order 3D stencil computations on multicore CPUs, outperforming existing libraries and benefiting real-world HPC applications.
Contribution
This paper introduces a novel matrix-based acceleration approach for 3D high-order stencils on multicore CPUs, including algorithmic, memory, and parallelism optimizations.
Findings
Achieves up to 2.1x speedup over state-of-the-art stencil libraries running on an Nvidia A100 GPU.
Enables 1.8x speedup in real-world HPC applications compared to optimized GPU versions.
Maintains high hardware utilization across diverse stencil shapes and dimensions.
Abstract
Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based acceleration strategies and tailor an optimal approach for 3D high-order stencils. We introduce algorithmic optimizations based on SIMD and matrix units to address strided memory accesses, alignment conflicts, and redundant accesses. We propose memory optimizations to boost on-package memory efficiency, and a novel multi-thread parallelism paradigm to overcome data-sharing challenges caused by the absence of shared data caches. MMStencil sustains consistently high hardware utilization across diverse stencil shapes and dimensions. Our DMA-based inter-NUMA communication further mitigates NUMA effects and MPI limitations in hybrid parallelism. Combining…
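The abstract does not spell out how a stencil maps onto a matrix unit, so the following is only a minimal sketch of the general idea, not MMStencil's actual algorithm. It assumes a 1D stencil applied along the first axis of a small 2D tile that already includes its halo rows, and shows that the same result can be obtained by multiplying the tile by a banded coefficient matrix. All function names (`stencil_direct`, `band_matrix`, `matmul`, `stencil_as_matmul`) are hypothetical and chosen for illustration.

```python
def stencil_direct(tile, coeffs):
    """Apply a 1D stencil along axis 0 of `tile`.

    `tile` has n + 2r rows (r halo rows on each side), where
    2r + 1 == len(coeffs). Returns the n interior output rows.
    """
    r = len(coeffs) // 2
    n = len(tile) - 2 * r
    width = len(tile[0])
    out = [[0.0] * width for _ in range(n)]
    for i in range(n):
        for j in range(width):
            out[i][j] = sum(coeffs[k] * tile[i + k][j]
                            for k in range(2 * r + 1))
    return out


def band_matrix(coeffs, n):
    """Build the n x (n + 2r) banded matrix whose row i holds the
    stencil coefficients starting at column i."""
    r = len(coeffs) // 2
    band = [[0.0] * (n + 2 * r) for _ in range(n)]
    for i in range(n):
        for k, c in enumerate(coeffs):
            band[i][i + k] = c
    return band


def matmul(a, b):
    """Plain triple-loop matrix product, standing in for the
    matrix-unit instruction that real hardware would provide."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner))
             for j in range(cols)]
            for i in range(rows)]


def stencil_as_matmul(tile, coeffs):
    """Same stencil as `stencil_direct`, expressed as band @ tile."""
    r = len(coeffs) // 2
    n = len(tile) - 2 * r
    return matmul(band_matrix(coeffs, n), tile)


if __name__ == "__main__":
    # Fourth-order central-difference coefficients for a second
    # derivative with unit grid spacing (radius r = 2).
    coeffs = [-1 / 12, 4 / 3, -5 / 2, 4 / 3, -1 / 12]
    tile = [[float(x * x)] for x in range(12)]
    print(stencil_direct(tile, coeffs))
    print(stencil_as_matmul(tile, coeffs))
```

In this formulation the band matrix is mostly zeros, so a practical matrix-unit implementation would have to work to limit the wasted multiply-accumulates and the extra memory traffic; the sketch only demonstrates that the two formulations are numerically equivalent.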
Taxonomy
Topics: Electromagnetic Scattering and Analysis
