FractalSortCPU: Bandwidth-Efficient Compressed Radix Sort on CPU
Michael Dang'ana

TL;DR
FractalSortCPU introduces a bandwidth-efficient, in-place radix sort algorithm optimized for CPUs that outperforms existing methods by reducing data pre-processing and leveraging SIMD acceleration.
Contribution
The paper presents a novel CPU-adapted histogram compression scheme for radix sorting that improves performance and bandwidth efficiency on large-scale datasets.
Findings
Achieves up to a 6x improvement in bandwidth efficiency over the state of the art.
Reduces latency by eliminating input bucketing and data pre-processing.
Demonstrates superior performance on CPU, GPU, and FPGA across various data sizes.
Abstract
Cloud database systems, particularly their middleware and query execution layers, use sorting as a core operation in query processing, indexing, and join execution. Radix sort is preferred for large datasets because it outperforms comparison-based algorithms, but state-of-the-art radix sorts have two inherent problems: distribution dependence and limited parallelism. Common solutions such as multi-pass bucketing, stochastic sampling, and dependence graph structures incur the cost of data pre-processing and a larger memory footprint, which makes them less appropriate for the large-scale workloads common in cloud environments. In-place radix sort schemes need more passes as key precision increases, which negatively impacts latency. Our work solves these problems by introducing a CPU-adapted histogram compression scheme for radix sorting for arbitrary-precision keys…
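For readers unfamiliar with the baseline the abstract refers to, here is a minimal sketch of a conventional histogram-based LSD radix sort. It is not the paper's compression scheme, and unlike the in-place schemes the abstract discusses, it uses a second buffer. The function name `lsd_radix_sort` and the parameters `key_bits` and `digit_bits` are illustrative assumptions. It shows only how the number of passes grows with key precision.

```python
def lsd_radix_sort(keys, key_bits=32, digit_bits=8):
    """Sort non-negative integers below 2**key_bits.

    Generic textbook LSD radix sort (assumed baseline, not FractalSortCPU).
    Returns (sorted_list, number_of_passes).
    """
    radix = 1 << digit_bits
    mask = radix - 1
    src = list(keys)
    passes = 0
    for shift in range(0, key_bits, digit_bits):
        # Histogram: count how many keys fall into each digit bucket.
        hist = [0] * radix
        for k in src:
            hist[(k >> shift) & mask] += 1
        # Exclusive prefix sum turns counts into bucket start offsets.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += hist[d]
        # Stable scatter into a second buffer (this sketch is out of place).
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1
        src = dst
        passes += 1
    return src, passes


if __name__ == "__main__":
    data = [170, 45, 75, 90, 802, 24, 2, 66]
    print(lsd_radix_sort(data, key_bits=32))
    print(lsd_radix_sort(data, key_bits=64)[1])
```

With 8-bit digits, 32-bit keys take 4 passes over the data and 64-bit keys take 8. Each pass reads and writes every key, so the pass count is a rough proxy for memory traffic in this baseline.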
