A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs
Elias Stehle, Hans-Arno Jacobsen

TL;DR
This paper introduces a hybrid radix sort for GPUs that almost halves the memory transfers required by earlier GPU radix sorts, yielding a 2.32x speed-up over the state-of-the-art GPU radix sort on 2 GB of data and more than 2x over a CPU-based radix sort on 64 GB.
Contribution
The paper presents a novel GPU radix sort that nearly halves memory transfers and thereby raises sorting throughput. It then extends this sort with a pipelined heterogeneous algorithm for inputs that are too large for the GPU or do not reside on it.
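Why the amount of memory transfer matters can be seen in a generic textbook least-significant-digit (LSD) radix sort. The sketch below is not the paper's algorithm; it is a plain CPU illustration, and the function name `lsd_radix_sort` and the digit widths are assumptions chosen for the example. Each pass reads every key twice (once for the histogram, once for the scatter) and writes it once, so the total traffic grows linearly with the number of passes.

```python
def lsd_radix_sort(keys, key_bits=32, digit_bits=8):
    """Sort non-negative integers below 2**key_bits.

    Returns (sorted_keys, passes). Illustrative only: a textbook LSD
    radix sort, not the hybrid GPU algorithm described in the paper.
    """
    radix = 1 << digit_bits
    mask = radix - 1
    passes = 0
    for shift in range(0, key_bits, digit_bits):
        # Histogram step: first read of every key in this pass.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1

        # Exclusive prefix sum turns counts into bucket start offsets.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]

        # Scatter step: second read of every key plus one write.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1

        keys = out
        passes += 1
    return keys, passes


if __name__ == "__main__":
    data = [170, 45, 75, 90, 802, 24, 2, 66]
    for bits in (4, 8):
        result, passes = lsd_radix_sort(data, key_bits=32, digit_bits=bits)
        # Two reads and one write per key per pass.
        print(bits, "bit digits:", passes, "passes,", 3 * passes, "accesses per key")
        print(result)
```

In this sketch, doubling the digit width from 4 to 8 bits halves the number of passes over 32-bit keys, and with it the number of key accesses. How the paper achieves its own reduction is described in the paper itself, not here.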
Findings
Sorts 2 GB of eight-byte records in as little as 50 ms, a 2.32x speed-up over the state-of-the-art GPU radix sort.
Maintains a speed-up of at least 1.66x on skewed distributions.
Improves end-to-end sorting of 64 GB of data by more than 2x over a CPU-based radix sort.
Abstract
Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight-byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold…
