Improving compiler support for SIMD offload using Arm Streaming SVE

Mohamed Husain Noor Mohamed; Adarsh Patil; Latchesar Ionkov; Eric Van Hensbergen

arXiv:2506.02233 · cs.PL · June 4, 2025

TL;DR

This paper explores enhancing compiler support for Arm's Streaming SVE to facilitate efficient offloading of SIMD workloads to core-adjacent accelerators, aiming to reduce programming complexity and improve performance.

Contribution

It investigates compiler techniques, including LLVM passes and MLIR transforms, to enable automatic, profitable code generation for Arm SME in SSVE mode.

Findings

01 Analyzed LLVM and MLIR approaches for SIMD offload

02 Proposed heuristics for profitable offloading

03 Provided insights for future compiler evolution

Abstract

The wider adoption of tightly coupled core-adjacent accelerators, such as Arm Scalable Matrix Extension (SME), hinges on lowering software programming complexity. In this paper, we focus on enabling the use of SME architecture in Streaming Scalable Vector Extension (SSVE) mode for workloads written in C/C++. While current compilers optimize loops for all types of SIMD instructions, these techniques primarily target vector units within the core and falter when applied to disaggregated, core-adjacent SIMD accelerators. Our goal is to enable the compiler to automatically generate code for such accelerators only when profitable. To this end, we investigate a path towards performant, precise, and repeatable computation offloading through two compiler ecosystems. We revisit LLVM compiler passes, MLIR transforms and their associated cost models, and heuristics. We hope that these insights…
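The idea of offloading "only when profitable" can be illustrated with a toy cost comparison. The sketch below is not the paper's cost model; every name and number in it is a hypothetical assumption chosen for illustration. It assumes that running a loop on a core-adjacent unit incurs a fixed mode-switch cost (entering and leaving streaming mode) and that the streaming unit processes more elements per vector operation than the in-core unit.

```python
from dataclasses import dataclass


@dataclass
class LoopEstimate:
    """Hypothetical inputs to a toy offload heuristic (all values are assumptions)."""
    trip_count: int              # iterations of the original scalar loop
    in_core_lanes: int           # elements per in-core vector operation
    streaming_lanes: int         # elements per streaming-mode vector operation
    cycles_per_vector_op: float  # assumed identical on both units for simplicity
    mode_switch_cycles: float    # fixed cost of entering and leaving streaming mode


def vector_ops(trip_count: int, lanes: int) -> int:
    """Number of vector operations needed to cover trip_count elements (ceiling division)."""
    return -(-trip_count // lanes)


def offload_is_profitable(e: LoopEstimate) -> bool:
    """Return True when the estimated streaming cost is lower than the in-core cost."""
    in_core_cost = vector_ops(e.trip_count, e.in_core_lanes) * e.cycles_per_vector_op
    streaming_cost = (
        e.mode_switch_cycles
        + vector_ops(e.trip_count, e.streaming_lanes) * e.cycles_per_vector_op
    )
    return streaming_cost < in_core_cost


# Made-up numbers: a short loop cannot amortize the switch cost, a long one can.
short_loop = LoopEstimate(64, 4, 16, 1.0, 100.0)
long_loop = LoopEstimate(4096, 4, 16, 1.0, 100.0)
print(offload_is_profitable(short_loop), offload_is_profitable(long_loop))
```

Under these assumed numbers the short loop stays on the in-core vector unit and the long loop is offloaded. A real compiler heuristic would also have to account for effects this sketch ignores, such as data movement, memory bandwidth, and trip counts that are unknown at compile time.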

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques