Leyenda: An Adaptive, Hybrid Sorting Algorithm for Large Scale Data with Limited Memory

Yuanjing Shi; Zhaoxing Li

arXiv:1909.08006·cs.DB·September 19, 2019

TL;DR

Leyenda is an adaptive, hybrid sorting algorithm designed for large-scale data with limited memory, optimizing disk I/O and CPU cache usage to outperform existing methods in various environments.

Contribution

It introduces Leyenda, a novel hybrid Radix MSB MergeSort that adapts to hardware conditions for efficient internal and external sorting.
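
To make the "Radix MSB MergeSort" idea concrete, here is a minimal single-threaded sketch, not the authors' implementation. It assumes keys are byte strings, that partitioning uses only the first (most significant) byte, and that Python's built-in `sorted` (Timsort, a merge-sort variant) stands in for the per-bucket merge sort. The function names `msb_radix_partition` and `hybrid_msb_sort` are hypothetical, and the parallelism, cache tuning, and disk I/O described in the paper are not modeled.

```python
from typing import List


def msb_radix_partition(keys: List[bytes]) -> List[List[bytes]]:
    """Scatter keys into 256 buckets by their most significant (first) byte."""
    buckets: List[List[bytes]] = [[] for _ in range(256)]
    for key in keys:
        # Assumption for this sketch: empty keys go to bucket 0,
        # where the per-bucket sort places them first.
        buckets[key[0] if key else 0].append(key)
    return buckets


def hybrid_msb_sort(keys: List[bytes]) -> List[bytes]:
    """Partition by leading byte, sort each bucket, then concatenate."""
    out: List[bytes] = []
    for bucket in msb_radix_partition(keys):
        # Buckets are visited in ascending byte order, so concatenating
        # the sorted buckets yields a globally sorted sequence.
        out.extend(sorted(bucket))
    return out
```

Because each bucket holds a disjoint key range, buckets can in principle be sorted independently, which is what makes this kind of partitioning attractive for parallel or out-of-memory settings.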

Findings

01. Outperforms GNU's parallel quick/merge sort by up to three times

02. Ranks second in ACM 2019 SIGMOD external sort contest

03. Achieves top overall performance in large-scale sorting

Abstract

Sorting is one of the fundamental tasks of modern data management systems. With disk I/O being the most commonly blamed performance bottleneck and workloads becoming more computation-intensive, it has come to our attention that in heterogeneous environments the performance bottleneck may vary across infrastructures. As a result, sort kernels need to adapt to changing hardware conditions. In this paper, we propose Leyenda, a hybrid, parallel and efficient Radix Most-Significant-Bit (MSB) MergeSort algorithm that utilizes local thread-level CPU cache and efficient disk/memory I/O. Leyenda is capable of performing either internal or external sort efficiently, based on different I/O and processing conditions. We benchmarked Leyenda with three different workloads from Sort Benchmark, targeting three unique use cases, including internal, partially in-memory and external sort, and we found…

Taxonomy

Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management