Faster Radix Sort via Virtual Memory and Write-Combining
Jan Wassenberg, Peter Sanders

TL;DR
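This paper introduces a radix sort for 32-bit integer keys, optionally paired with a 32-bit value, that uses virtual memory and write-combining to reach at least 88% of peak memory bandwidth per pass. It outperforms Intel's published radix sort by a factor of 1.5 and compares favorably with a reported Fermi GPU algorithm once data-transfer overhead is included.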
Contribution
The paper presents a radix sort built on a microarchitecture-aware variant of counting sort. It uses virtual memory and write-combining to bring per-pass throughput close to the system's peak memory bandwidth.
Findings
Achieves at least 88% of the system's peak memory bandwidth per pass.
Outperforms Intel's recently published radix sort by a factor of 1.5.
Compares favorably with the reported performance of an algorithm for Fermi GPUs when data-transfer overhead is included.
Abstract
Sorting algorithms are the deciding factor for the performance of common operations such as removal of duplicates or database sort-merge joins. This work focuses on 32-bit integer keys, optionally paired with a 32-bit value. We present a fast radix sorting algorithm that builds upon a microarchitecture-aware variant of counting sort. Taking advantage of virtual memory and making use of write-combining yields a per-pass throughput corresponding to at least 88 % of the system's peak memory bandwidth. Our implementation outperforms Intel's recently published radix sort by a factor of 1.5. It also compares favorably to the reported performance of an algorithm for Fermi GPUs when data-transfer overhead is included. These results indicate that scalar, bandwidth-sensitive sorting algorithms remain competitive on current architectures. Various other memory-intensive applications can benefit…
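For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of a least-significant-digit radix sort in which each pass is a plain counting sort. It is not the authors' implementation: the 8-bit digit width, the function names, and the list-based buffers are assumptions made for illustration, and the virtual-memory and write-combining optimizations that the paper contributes are not shown.

```python
from typing import List

RADIX_BITS = 8                  # assumption: 8-bit digits, giving 4 passes over 32-bit keys
NUM_BUCKETS = 1 << RADIX_BITS
KEY_BITS = 32


def counting_sort_pass(keys: List[int], shift: int) -> List[int]:
    """Stable counting sort of `keys` by the digit at bit offset `shift`."""
    mask = NUM_BUCKETS - 1

    # 1. Histogram: count how many keys fall into each bucket.
    counts = [0] * NUM_BUCKETS
    for k in keys:
        counts[(k >> shift) & mask] += 1

    # 2. Exclusive prefix sum: first output index of each bucket.
    offsets = [0] * NUM_BUCKETS
    total = 0
    for b in range(NUM_BUCKETS):
        offsets[b] = total
        total += counts[b]

    # 3. Scatter: copy each key to the next free slot of its bucket.
    #    Iterating in input order keeps the pass stable.
    out = [0] * len(keys)
    for k in keys:
        b = (k >> shift) & mask
        out[offsets[b]] = k
        offsets[b] += 1
    return out


def radix_sort_u32(keys: List[int]) -> List[int]:
    """LSD radix sort for unsigned 32-bit integers (hypothetical helper name)."""
    for shift in range(0, KEY_BITS, RADIX_BITS):
        keys = counting_sort_pass(keys, shift)
    return keys
```

Calling `radix_sort_u32([256, 1, 65536, 0])` returns a new sorted list and leaves the input unchanged. Each pass reads every key twice (histogram and scatter) and writes it once, which is why the scatter step's memory behavior dominates throughput and is where bandwidth-oriented optimizations would apply.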
