Bounding the Fragmentation of B-Trees Subject to Batched Insertions
Michael A. Bender, Aaron Bernstein, Nairen Cao, Alex Conway, Martín Farach-Colton, Hanna Komlós, Yarin Shechter, Nicole Wein

TL;DR
This paper analyzes how batched insertions affect the internal fragmentation of B-trees, extending Yao's classical results to more realistic workload scenarios and proposing strategies to maintain high space utilization.
Contribution
It generalizes Yao's analysis to batched insertions in B-trees and introduces alternative strategies for workloads where even splitting is less effective.
Findings
Even splitting maintains good space utilization for many batched-insertion workloads.
For the remaining workloads, simple alternative strategies provably maintain good space utilization.
The analysis provides a rigorous framework for understanding fragmentation under batched insertions.
Abstract
The issue of internal fragmentation in data structures is a fundamental challenge in database design. A seminal result of Yao in this field shows that evenly splitting the leaves of a B-tree against a workload of uniformly random insertions achieves space utilization of around 69%. However, many database applications perform batched insertions, where a small run of consecutive keys is inserted at a single position. We develop a generalization of Yao's analysis to provide rigorous treatment of such batched workloads. Our approach revisits and reformulates the analytical structure underlying Yao's result in a way that enables generalization and is used to argue that even splitting works well for many workloads in our extended class. For the remaining workloads, we develop simple alternative strategies that provably maintain good space utilization.
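The even-splitting setting described in the abstract can be explored empirically with a small simulation. The sketch below is illustrative only and is not the paper's analysis: the function name `simulate_even_split`, the use of float keys in [0, 1), the modeling of a batch as `batch` keys packed immediately after a uniformly random point, and the restriction to the leaf level are all assumptions made for this example.

```python
import bisect
import random


def simulate_even_split(num_keys, capacity, batch=1, seed=0):
    """Return leaf space utilization after inserting at least num_keys keys.

    Assumptions (hypothetical, not taken from the paper):
    - keys are floats in [0, 1);
    - a batch is `batch` keys packed right after a uniformly random point;
    - only the leaf level of the B-tree is simulated.
    """
    rng = random.Random(seed)
    leaves = [[]]   # each leaf is a sorted list of keys
    lows = [0.0]    # lows[i] is the smallest key routed to leaves[i]
    inserted = 0
    while inserted < num_keys:
        start = rng.random()
        for j in range(batch):
            key = start + j * 1e-12          # consecutive keys of one batch
            i = bisect.bisect_right(lows, key) - 1
            leaf = leaves[i]
            bisect.insort(leaf, key)
            if len(leaf) > capacity:         # overflow: split the leaf evenly
                mid = len(leaf) // 2
                right = leaf[mid:]
                del leaf[mid:]
                leaves.insert(i + 1, right)
                lows.insert(i + 1, right[0])
            inserted += 1
    # Utilization = stored keys / total key slots across all leaves.
    return inserted / (len(leaves) * capacity)


if __name__ == "__main__":
    for b in (1, 4, 8):
        u = simulate_even_split(20000, capacity=16, batch=b)
        print(f"batch={b}: utilization={u:.3f}")
```

With `batch=1` this corresponds to uniformly random single-key insertions, so the measured utilization should land in the neighborhood of Yao's 69% figure, up to finite-capacity and sampling effects. Larger `batch` values give a rough feel for how runs of consecutive keys shift the result under this particular batch model.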
Taxonomy
Topics: Advanced Database Systems and Queries · Distributed systems and fault tolerance · Data Quality and Management
