AcceleratedKernels.jl: Cross-Architecture Parallel Algorithms from a Unified, Transpiled Codebase

Andrei-Leonard Nicusan; Dominik Werner; Simon Branford; Simon Hartley; Andrew J. Morris; Kit Windows-Yule

arXiv:2507.16710·cs.DC·July 23, 2025

Open Access

TL;DR

AcceleratedKernels.jl is a Julia library that enables cross-architecture parallel computing with a unified codebase, achieving high performance and exceptional composability across diverse hardware including GPUs and CPUs.

Contribution

The paper introduces a transpilation-based, backend-agnostic framework for parallel programming in Julia that simplifies implementation and enables simultaneous CPU-GPU co-processing.

Findings

01

Arithmetic-heavy kernels perform on par with C and OpenMP-multithreaded CPU implementations.

02

Sorting throughput of 538-855 GB/s was achieved using 200 NVIDIA A100 GPUs on the Baskerville Tier 2 UK HPC cluster.

03

GPU interconnects like NVLink significantly speed up HPC tasks.

Abstract

AcceleratedKernels.jl is introduced as a backend-agnostic library for parallel computing in Julia, natively targeting NVIDIA, AMD, Intel, and Apple accelerators via a unique transpilation architecture. Written in a unified, compact codebase, it enables productive parallel programming with minimised implementation and usage complexities. Benchmarks of arithmetic-heavy kernels show performance on par with C and OpenMP-multithreaded CPU implementations, with Julia sometimes offering more consistent and predictable numerical performance than conventional C compilers. Exceptional composability is highlighted as simultaneous CPU-GPU co-processing is achievable - such as CPU-GPU co-sorting - with transparent use of hardware-specialised MPI implementations. Tests on the Baskerville Tier 2 UK HPC cluster achieved world-class sorting throughputs of 538-855 GB/s using 200 NVIDIA A100 GPUs,…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Embedded Systems Design Techniques