A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)
Bérenger Bramas

TL;DR
This paper presents a fast vectorized sorting implementation for ARM's SVE and reports significant speedups over standard sorting routines such as the GNU C++ sort, obtained by adapting the algorithm to SVE's predicate-controlled instructions and variable vector size.
Contribution
The paper introduces a novel vectorized sorting method tailored to ARM SVE that accounts for its predicate-controlled instructions and its variable vector size, and achieves substantial performance improvements.
Findings
Achieves 4x speedup over GNU C++ sort
Efficiently handles different data types including integers and doubles
Adapts well to ARM SVE's predicate and variable vector size
Abstract
The way developers implement their algorithms, and how these implementations behave on modern CPUs, are governed by the design and organization of those CPUs. The vectorization units (SIMD) are among the few parts of a CPU that can and must be explicitly controlled. In the HPC community, x86 CPUs and their vectorization instruction sets were the de facto standard for decades. Each new release of an instruction set usually doubled the vector length and added new operations. Each generation pushed developers to adapt and improve previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the coming years. Second, SVE's interface is different in several aspects from the x86 extensions as it provides different instructions, uses a predicate to control most…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Advanced Data Storage Technologies
