Implementing the Comparison-Based External Sort
Michael Polyntsov, Valentin Grigorev, Kirill Smirnov, George Chernishev

TL;DR
This paper investigates the implementation of comparison-based external sorting in database systems, focusing on how different data structures for the merge step affect performance, and finds that loser trees outperform other methods even on modern disk-bound hardware.
Contribution
It provides an experimental evaluation of data structures for the merge step in external sort, highlighting the efficiency of loser trees over naive and heap-based approaches.
Findings
Loser trees outperform naive and heap-based merge methods.
Efficient data structures for merge steps are beneficial even on modern disk-bound systems.
Implementing optimized merge data structures improves external sort performance.
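To make the comparison in the findings above concrete, here is a minimal sketch of a loser-tree k-way merge. It is an illustration under stated assumptions, not the authors' implementation: the function name `loser_tree_merge`, the `_END` sentinel, and the use of in-memory Python iterables in place of on-disk runs are all hypothetical choices made for this example.

```python
from typing import Any, Iterable, Iterator, List

_END = object()  # hypothetical sentinel marking an exhausted run


def loser_tree_merge(runs: Iterable[Iterable[Any]]) -> Iterator[Any]:
    """Merge already-sorted runs into one sorted stream using a loser tree."""
    iters = [iter(r) for r in runs]
    k = len(iters)
    if k == 0:
        return
    # heads[i] is the next unread element of run i, or _END when run i is empty.
    heads: List[Any] = [next(it, _END) for it in iters]

    def beats(a: int, b: int) -> bool:
        # True if run a's head must be emitted before run b's head.
        ha, hb = heads[a], heads[b]
        if ha is _END:
            return False
        if hb is _END:
            return True
        # Ties go to the lower run index, which keeps the merge stable.
        return ha < hb or (not hb < ha and a < b)

    # Implicit tree: internal nodes 1..k-1 store the loser of their match,
    # leaf for run i sits at position k + i, and tree[0] stores the overall winner.
    tree = [0] * k

    def build(node: int) -> int:
        if node >= k:  # leaf: the run itself is the winner of its subtree
            return node - k
        left, right = build(2 * node), build(2 * node + 1)
        winner, loser = (left, right) if beats(left, right) else (right, left)
        tree[node] = loser
        return winner

    tree[0] = build(1)

    while heads[tree[0]] is not _END:
        winner = tree[0]
        yield heads[winner]
        heads[winner] = next(iters[winner], _END)
        # Replay only the matches on the path from the refilled leaf to the root.
        node = (winner + k) // 2
        while node >= 1:
            if beats(tree[node], winner):
                tree[node], winner = winner, tree[node]
            node //= 2
        tree[0] = winner
```

For example, `list(loser_tree_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))` yields the numbers 1 through 9 in order. After each output, the sketch performs one comparison per tree level on a single leaf-to-root path, whereas a binary heap's sift-down typically needs two comparisons per level; a naive merge scans all k run heads for every output element.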
Abstract
In the age of big data, sorting is an indispensable operation for DBMSes and similar systems. Having data sorted can help produce query plans with significantly lower run times. It can also provide other benefits, such as non-blocking operators that produce data steadily (without bursts) or operators with a reduced memory footprint. Sorting may be required at any step of query processing, whether on source data or on intermediate results. At the same time, the data to be sorted may not fit into main memory. In this case, an external sort operator, which writes intermediate results to disk, should be used. In this paper we consider an external sort operator of the comparison-based sort type. We discuss its implementation and describe related design decisions. Our aim is to study the impact on performance of a data structure used on the merge step. For this, we have…
Taxonomy
Topics: Advanced Database Systems and Queries · Advanced Data Storage Technologies · Data Management and Algorithms
