Improving compiler support for SIMD offload using Arm Streaming SVE
Mohamed Husain Noor Mohamed, Adarsh Patil, Latchesar Ionkov, Eric Van Hensbergen

TL;DR
This paper explores enhancing compiler support for Arm's Streaming SVE to facilitate efficient offloading of SIMD workloads to core-adjacent accelerators, aiming to reduce programming complexity and improve performance.
Contribution
It investigates compiler techniques, including LLVM passes and MLIR transforms, to enable automatic, profitable code generation for Arm SME in SSVE mode.
Findings
Analyzed LLVM and MLIR approaches for SIMD offload
Proposed heuristics for profitable offloading
Provided insights for future compiler evolution
Abstract
The wider adoption of tightly coupled, core-adjacent accelerators, such as the Arm Scalable Matrix Extension (SME), hinges on lowering software programming complexity. In this paper, we focus on enabling the use of the SME architecture in Streaming Scalable Vector Extension (SSVE) mode for workloads written in C/C++. While current compilers optimize loops for all types of SIMD instructions, these techniques primarily target vector units within the core and falter when applied to disaggregated, core-adjacent SIMD accelerators. Our goal is to enable the compiler to automatically generate code for such accelerators only when profitable. To this end, we investigate a path towards performant, precise, and repeatable computation offloading through two compiler ecosystems. We revisit LLVM compiler passes and MLIR transforms, along with their associated cost models and heuristics. We hope that these insights…
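To make the idea of offloading "only when profitable" concrete, the following is a minimal sketch of a break-even check in C. It is not the paper's heuristic: the struct, its fields, and the function names are hypothetical, and the model assumes only that offloading pays a fixed cost for entering and leaving streaming mode plus a per-element cost, while in-core SIMD pays a per-element cost alone.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cost parameters; real values would have to be measured
 * or taken from a compiler's target cost model. */
typedef struct {
    double core_cycles_per_elem;    /* assumed in-core SIMD cost per element */
    double offload_cycles_per_elem; /* assumed streaming-mode cost per element */
    double mode_switch_cycles;      /* assumed fixed cost to enter and leave streaming mode */
} offload_cost_model;

/* Estimated cycles if the loop stays on the in-core vector unit. */
static double core_cost(const offload_cost_model *m, size_t n) {
    return m->core_cycles_per_elem * (double)n;
}

/* Estimated cycles if the loop is offloaded: fixed overhead plus per-element work. */
static double offload_cost(const offload_cost_model *m, size_t n) {
    return m->mode_switch_cycles + m->offload_cycles_per_elem * (double)n;
}

/* Offload only when the estimate is strictly cheaper than staying in-core. */
static bool should_offload(const offload_cost_model *m, size_t n) {
    return offload_cost(m, n) < core_cost(m, n);
}
```

Under these assumptions, short loops stay on the core because the fixed switch cost dominates, and only loops past the break-even trip count are offloaded. A production cost model would also need to account for effects this sketch ignores, such as memory behavior and the actual streaming vector length.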
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques
