High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia
Emmanuel Pilliat (ENSAI)

TL;DR
KernelForge.jl is a Julia library of portable GPU primitives (scan, mapreduce, and matrix-vector operations) for arbitrary data types and operators. Evaluated on an NVIDIA A40 and an AMD MI300X, it matches or exceeds CUB on the A40 and matches cuBLAS in most tested configurations.
Contribution
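It introduces a two-layer architecture: KernelIntrinsics.jl exposes backend-agnostic abstractions for warp-level shuffles, memory fences, and vectorized memory access, and KernelForge.jl builds its algorithms exclusively on those interfaces. This bridges the gap between portability and vendor-level efficiency.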
Findings
Matches or exceeds CUB kernel execution time on scan and mapreduce on NVIDIA A40
Matches cuBLAS throughput on matrix-vector operations across most tested configurations
Demonstrates, as a proof of concept, that portable JIT-compiled abstractions can achieve vendor-level throughput without sacrificing generality
Abstract
Portable GPU frameworks such as Kokkos and RAJA reduce the burden of cross-architecture development but typically incur measurable overhead on fundamental parallel primitives relative to vendor-optimized libraries. We present KernelForge.jl, a Julia library that implements scan, mapreduce, and matrix-vector primitives through a two-layer portable architecture: KernelIntrinsics.jl provides backend-agnostic abstractions for warp-level shuffles, memory fences, and vectorized memory access, while KernelForge.jl builds high-performance algorithms exclusively on top of these interfaces. Evaluated on an NVIDIA A40 and an AMD MI300X, KernelForge.jl matches or exceeds CUB kernel execution time on scan and mapreduce on the A40, and matches cuBLAS throughput on matrix-vector operations across most tested configurations, demonstrating, as a proof of concept, that portable JIT-compiled abstractions…
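To make "arbitrary types and operators" concrete, the following minimal sketch shows the CPU reference semantics of the three primitive families named in the abstract, using only Base Julia (`accumulate`, `mapreduce`, and `*`). It does not use or depict the KernelForge.jl API, which this summary does not describe. The `Affine` type and the `compose` operator are hypothetical, chosen only because composition is associative but not commutative, which is the case a generic parallel scan has to handle correctly.

```julia
# CPU reference semantics only: Base Julia, not the KernelForge.jl API.

# Hypothetical element type: the affine map x -> a*x + b.
struct Affine
    a::Float64
    b::Float64
end

# Hypothetical operator: "apply f first, then g", i.e. g(f(x)).
# Associative but not commutative, so operand order must be preserved.
compose(f::Affine, g::Affine) = Affine(g.a * f.a, g.a * f.b + g.b)

maps = [Affine(2.0, 1.0), Affine(0.5, 0.0), Affine(1.0, -3.0)]

# Inclusive scan: prefixes[i] is the composition of maps[1] through maps[i].
prefixes = accumulate(compose, maps)

# mapreduce with a user-supplied map function and reduction operator:
# here, the sum of squared slopes.
sum_sq = mapreduce(m -> m.a^2, +, maps)

# Matrix-vector product, the third primitive family.
A = [1.0 2.0; 3.0 4.0]
x = [1.0, 1.0]
y = A * x
```

A GPU implementation would be expected to return the same values as these reference calls for any associative operator; how KernelForge.jl exposes that is not stated here.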
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization · Advanced Data Storage Technologies
