Belga B-trees
Erik D. Demaine, John Iacono, Grigorios Koumoutsos, Stefan Langerman

TL;DR
This paper introduces the Belga B-tree, a self-adjusting external memory data structure that adapts to query distributions and achieves near-optimal performance, extending ideas from binary search trees to B-trees.
Contribution
The paper formalizes the B-Tree model, proves lower bounds, and presents the Belga B-tree, which is competitive with the best offline B-tree algorithm within an $O(\log \log N)$ factor.
Findings
Belga B-tree achieves $O(\log \log N)$ competitiveness.
Transformation from static BST to B-tree is faster by a $\Theta(\log B)$ factor.
Randomization is necessary for significant speedup in the transformation.
Belga B-trees
This work was supported by the Fonds de la Recherche Scientifique-FNRS under Grant no MISU F 6001 1 and by NSF Grant CCF-1533564.
Erik D. Demaine (CSAIL, Massachusetts Institute of Technology), John Iacono (Université Libre de Bruxelles and New York University), Grigorios Koumoutsos (Université Libre de Bruxelles), Stefan Langerman (Directeur de Recherches du F.R.S-FNRS)
Abstract
We revisit self-adjusting external memory tree data structures, which combine the optimal (and practical) worst-case I/O performances of B-trees, while adapting to the online distribution of queries. Our approach is analogous to ongoing efforts in the BST model, where Tango Trees (Demaine et al. 2007) were shown to be $O(\log \log N)$-competitive with the runtime of the best offline binary search tree on every sequence of searches. Here we formalize the B-Tree model as a natural generalization of the BST model. We prove lower bounds for the B-Tree model, and introduce a B-Tree model data structure, the Belga B-tree, that executes any sequence of searches within an $O(\log \log N)$ factor of the best offline B-tree model algorithm, provided $B = \log^{O(1)} N$. We also show how to transform any static BST into a static B-tree which is faster by a $\Theta(\log B)$ factor; the transformation is randomized and we show that randomization is necessary to obtain any significant speedup.
1 Introduction
Worst-case analysis does not capture the fact that some sequences of operations on data structures, often typical ones, can be executed significantly faster than worst-case ones. Data structures whose performance depends on more fine-grained characteristics of the input sequence than its size have been called distribution sensitive data structures [Iac01b, BHM13]. Two general methods to bound the performance of such a data structure exist. The first is to explicitly bound the performance by some bound. For binary search trees (BSTs) there is a rich set of such bounds (see e.g. [EFI13, CGK+16]) like the sequential access bound [Tar85], the working set bound [ST85b, Iac01a], the (weighted) dynamic finger bound [CMSS00, Col00, IL16], the unified bound [BCDI07, Iac01a] and many others [BDIL16, HIM13, CGK+18]. The other method is to compare the performance of the data structure on a sequence of operations to the performance of the best offline data structure in some model on the same sequence. Such an analysis uses the language of competitive analysis introduced in [ST85a], where the competitive ratio of an algorithm is the supremum ratio of the performance of the given algorithm to the offline optimal over all sequences of operations over a given length. A data structure which is $O(1)$-competitive in a particular model is said to be dynamically optimal [ST85b]. In the BST model, the best known competitive ratio is $O(\log \log N)$, first achieved by Tango trees [DHIP07]. The existence of a dynamically optimal BST is one of the most intriguing and long-standing open problems in online algorithms and data structures (see [Iac13] for a survey). The two prominent candidates to achieve dynamic optimality for BSTs are the splay tree of Sleator and Tarjan [ST85b] and the greedy algorithm [DHI+09, Luc88], but they are only known to be $O(\log N)$-competitive.
Disk-Access Model (DAM).
The external memory model, or disk-access model (DAM) [AV88], is the leading way to theoretically model the performance of algorithms that cannot fit all of their data in RAM, and thus must store it on a slower storage system historically known as disk. This model is parameterized by values $M$ and $B$; the disk is partitioned into blocks of size $B$, of which $M/B$ can be stored in memory at any given moment. The cost in the DAM is the number of transfers between memory and disk, called Input-Output operations (I/Os). The classic data structure for a comparison based dictionary in the DAM model, as well as in practice, is the B-Tree [BM72]. The B-Tree is a generalization of the BST, where each node stores up to $\Theta(B)$ data items, and the number of children is one more than the number of data items. The B-Tree supports searches in time $O(\log_B N)$ in the DAM, a factor $\Theta(\log B)$ faster than traditional BSTs such as red-black trees [GS78] or AVL trees [AVL62].
Dynamic Dictionaries in the DAM.
Here, our goal is to explore dynamic dictionaries in the DAM and to obtain results similar to those known for BSTs.
Surprisingly, prior work in this direction is quite limited. One previous attempt was in the work of Sherk [She95] where a generalization of splay trees to what we call the B-tree model was proposed, but without any strong results. Over ten years later, Bose et al. [BDL08] studied a self-adjusting version of skip-lists and B-Trees, where nodes can be split and merged to adapt to the query distribution by moving elements closer or farther from the root of the tree (here we call this model classic self-adjusting B-trees, see Section 2). They showed that dynamic optimality in this model is closely related to the working set bound. This bound captures temporal locality: for an access sequence , it is defined as , where is the number of distinct elements accessed since the last access to the element . In [BDL08] the authors presented a data structure whose cost is upper bounded by and obtained a matching lower bound of for this model, which implies that their structure is dynamically optimal.
Note that the lower bound of [BDL08] shows a major limitation of B-trees with only split and merge operations: it implies there are sequences on which they are slower than BSTs. For example, repeatedly sequentially accessing all data items requires $O(1)$ amortized time per search for BSTs like splay trees (this is the sequential access bound [Tar85]) while the lower bound implies an $\Omega(\log_B N)$ amortized cost in the classic self-adjusting model. In this work, we show that by adding just one more operation, an analogue of the rotation for B-Trees, we can overcome this limitation and obtain significant speedups with respect to standard B-trees.
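To make the quantity behind the working set bound concrete, the following minimal sketch computes, for each access, the number of distinct elements accessed since the previous access to the same element. The function name `working_set_numbers` is hypothetical, and two conventions are assumptions rather than taken from the paper: the accessed element itself is not counted, and a first access counts all distinct elements seen so far.

```python
def working_set_numbers(seq):
    """For each position i, count the distinct elements accessed strictly
    between the previous access to seq[i] and position i.

    Assumed conventions (not from the paper): the element itself is not
    counted, and a first access counts every distinct element seen so far.
    """
    last_seen = {}  # element -> index of its most recent access
    counts = []
    for i, x in enumerate(seq):
        start = last_seen.get(x, -1)
        # Quadratic overall; kept simple for illustration only.
        counts.append(len(set(seq[start + 1:i])))
        last_seen[x] = i
    return counts


# Repeated sequential access: every repeat sees all other elements in between.
print(working_set_numbers([1, 2, 3, 1, 2, 3]))
```

On a repeated sequential scan every repeated access has a working set number close to the number of keys, which is why a bound driven by this quantity cannot benefit from sequential access patterns.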
Our Contribution.
In this work we initiate a systematic study of dynamic B-trees. First, we formally define the (dynamic) B-Tree model of computation (§2). Second, we show how to produce lower bounds in the B-Tree model (§3). Then, we introduce a data structure, which we call the Belga B-Tree, which is $O(\log \log N)$-competitive with any dictionary in the B-Tree model of computation, when $B = \log^{O(1)} N$ (§4). (On the name: the Tango tree was invented on an overnight flight from JFK airport en route to Buenos Aires, Argentina. The work on the Belga B-Tree was substantially completed at Cafe Belga, Ixelles, Belgium.)
More generally, we conjecture the following in §6: any BST-model algorithm can be transformed into a (randomized) B-Tree model algorithm with a $\Theta(\log B)$ factor cost savings. This would imply that BST model algorithms such as the splay tree [ST85b] or greedy [DHI+09, Luc88] would have B-Tree model counterparts, and that a dynamically optimal BST-model algorithm would imply a dynamically optimal algorithm in the B-Tree model. We leave this conjecture open, but in §5 we do resolve the case of static (no rotations allowed) BSTs by showing a randomized transformation from a static BST to a static B-Tree such that any algorithm in the static BST model would have a $\Theta(\log B)$ factor speedup in the B-Tree model. We also show that no $\omega(1)$-factor speedup is possible for a deterministic transformation in general.
2 The B-Tree model of computation
In this section, we define the tree models discussed in this paper. In all cases, we consider data structures supporting searches over a universe of elements which we refer to as keys. The input is a valid tree and request sequence of searches , where is the th item to be searched.
2.1 The BST Model
In a Binary Search Tree (BST) data structure, each node stores a single key and three pointers, indicating its parent and its (left and right) children. The key value of a node is larger than all keys in its left subtree and smaller than all keys in its right subtree. To execute each request to search for element , a BST algorithm initializes a single pointer at the root (at unit cost) then may perform any sequence of the following unit-cost operations:
- Move the pointer to the parent or to the left or right child of the current node the pointer points to (if such a destination node exists).
- Perform a rotation of the edge between the current node and its parent (if not the root).
Whenever the pointer moves to or it is initialized to a node , we say that node is touched. A BST-model search algorithm is correct if during each search, the element that is being searched for is touched. The cost of a BST algorithm on the search sequence equals the total number of unit-cost operations performed to execute the searches in the sequence. This model was formally defined in [DHIP07] and it is known to be equivalent up to constant factors to several alternative models which have been considered (e.g. [DHI*+*09, Wil89]).
A BST data structure can be augmented such that each node stores additional bits of information. The running time of such BST data structures in the RAM model is dominated by the number of unit-cost operations. A static BST is a restricted version of the BST model where rotations are not allowed and thus the shape of the tree never changes.
2.2 The B-tree model
We define the B-tree model to be a generalization of the BST model which allows more than one key to be stored in each node. The B-tree model is parameterized by a positive integer $B$ which represents the maximum number of children of each node; in the case where $B=2$ the B-tree model will be equivalent to the BST model. (Recall that in the external memory model, defined in Section 1, $B$ denotes the block size. Each B-tree node has at most $B$ children, contains $O(B)$ words and thus it can be stored in $O(1)$ blocks of size $B$.) We denote by the number of keys stored in a node . Every node has and child pointers (some of which could be null). A node which stores exactly $B-1$ keys is called full.
Suppose are the keys stored at node and are the children of . Keys satisfy the in-order condition, i.e. and for any key stored in the subtree rooted at , we have that .
Similar to the BST model, to execute each search there is a single pointer initialized to the root of the tree at unit cost. To execute a search for , a B-tree algorithm performs a sequence of the following unit-cost operations which are described formally later:
- Move the pointer to a child or to the parent of the current node.
- Split a node containing at least three keys.
- Join two sibling nodes storing no more than $B-2$ keys in total.
- Rotate the edge between the current node and its parent.
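As an illustration of how cost is counted when only pointer moves are used, here is a minimal sketch of a search that charges one unit for initializing the pointer at the root and one unit per move to a child. The `Node` class and the function `static_search` are hypothetical names introduced for this example; they are not part of the paper.

```python
from bisect import bisect_left


class Node:
    """Hypothetical B-tree-model node: sorted keys plus len(keys)+1 child slots."""

    def __init__(self, keys, children=None):
        self.keys = list(keys)
        # None marks a null child pointer.
        self.children = list(children) if children else [None] * (len(self.keys) + 1)


def static_search(root, x):
    """Return (found, cost), using pointer moves only."""
    node, cost = root, 1  # initializing the pointer at the root costs one unit
    while node is not None:
        i = bisect_left(node.keys, x)  # position of x among this node's keys
        if i < len(node.keys) and node.keys[i] == x:
            return True, cost
        node = node.children[i]  # follow the child bracketing x
        if node is not None:
            cost += 1  # each pointer move costs one unit
    return False, cost


leaves = [Node([1, 2]), Node([4, 5]), Node([7, 8])]
root = Node([3, 6], leaves)
print(static_search(root, 5), static_search(root, 9))
```

The comparisons inside a node are free in this accounting, which mirrors the idea that a node fits in a constant number of blocks.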
B-tree model algorithms that only use the first type of operations are referred to as static as the shape of the B-tree does not change. We now fully describe the unit-cost operations of rotating, splitting and joining:
Rotations:
Consider a (non-root) node and let be its parent. Let be the union of all keys stored in and . The keys stored at define an interval in . A rotation of the edge essentially updates this interval to , moving the keys as needed. Depending on the values of and we characterize a rotation as a promote/demote left — promote/demote right rotation. For example, a rotation of the type promote left — demote right sets (i.e. the leftmost keys of are promoted to ) and (i.e. keys are demoted to ). Values and should be non-negative and satisfy that after the rotation both and have at most keys. Rotations of the type demote left — promote right, promote left — promote right and demote left — demote right can be defined analogously. As an example, figure 1 shows a rotation of type demote left - promote right.
Splitting a node:
Let be a node (except the root) containing at least three keys and let be its non-full parent. Splitting node at key (which is not the smallest or the largest key stored at ) consists of promoting to and replacing by 2 nodes such that keys smaller than are contained in and keys larger than are in . To split the root (given that it stores at least three keys), we create an empty B-tree node, make it the parent of the root (i.e. the new root) and then perform a split operation as defined above.
Join:
This operation is the inverse of a split. Let and be two sibling nodes and let be their parent, such that there exists a unique key in such that is larger than all keys stored at and smaller than all keys stored at . Joining nodes and (given that they store no more than keys in total) consists of demoting to (and deleting it from ), adding all elements of (including the pointers to children) to and deleting . Note that after a join operation might become empty (in case was the unique key of ). In that case, we set the parent of to be the parent of (if it exists) and we delete . If is empty and it is the root, then we just delete and becomes the new root of the tree.
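The split and join operations described above can be sketched on a simple in-memory node type. This is an illustrative sketch only: the names `Node`, `split`, `join` and `inorder` are hypothetical, the capacity checks assume that a node holds at most `B - 1` keys (so `B` children), and the special cases of splitting the root and of a parent that becomes empty after a join are deliberately omitted.

```python
class Node:
    """Hypothetical B-tree-model node with parent pointers."""

    def __init__(self, keys, children=None):
        self.keys = list(keys)
        self.children = list(children) if children else [None] * (len(self.keys) + 1)
        self.parent = None
        for c in self.children:
            if c is not None:
                c.parent = self


def split(v, i, B):
    """Split non-root node v at its i-th key (0-based, neither first nor last)."""
    p = v.parent
    assert p is not None and len(v.keys) >= 3 and 0 < i < len(v.keys) - 1
    assert len(p.keys) < B - 1  # the parent must not be full (assumed capacity B-1)
    left = Node(v.keys[:i], v.children[:i + 1])
    right = Node(v.keys[i + 1:], v.children[i + 1:])
    j = p.children.index(v)
    p.keys.insert(j, v.keys[i])       # promote the split key
    p.children[j:j + 1] = [left, right]
    left.parent = right.parent = p
    return left, right


def join(u, v, B):
    """Join adjacent siblings u (left) and v (right) around their separating key."""
    p = u.parent
    j = p.children.index(u)
    assert p.children[j + 1] is v
    assert len(u.keys) + len(v.keys) <= B - 2  # room for the demoted key
    merged = Node(u.keys + [p.keys[j]] + v.keys, u.children + v.children)
    del p.keys[j]                     # demote the separating key
    p.children[j:j + 2] = [merged]
    merged.parent = p
    return merged


def inorder(node):
    """Keys of the subtree in sorted order, used to check the in-order condition."""
    if node is None:
        return []
    out = []
    for i, k in enumerate(node.keys):
        out += inorder(node.children[i]) + [k]
    return out + inorder(node.children[-1])
```

Both operations preserve the in-order sequence of keys, and `join` undoes `split` when applied to the two nodes a split just produced.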
A B-tree can be augmented with additional bits of information for each node. The performance of B-trees in the external memory model with blocks of size , is within a constant factor of the sum of the unit-cost operations as we have defined them.
Relation with other B-tree models.
The classic structure of B-trees first appeared in [BM72]. In this framework, all leaves have the same depth and no join, split and rotate operations are performed during searches (to be precise, restricted versions of split and join were defined in order to support insertions and deletions and were not allowed for performing search operations, see [CLRS09] for an extensive treatment). We call this framework the classic B-tree model.
A more flexible model of B-trees was considered in [BDL08]: We start with a classic B-tree and an algorithm is allowed to perform joins and splits, but not rotations. Note that by performing join and split operations, the property that all leaves of the tree have the same depth is maintained throughout the whole execution. This model was called “self-adjusting B-trees”. To avoid confusion with our dynamic B-tree model, we call this model classic self-adjusting B-trees, in order to emphasize that all leaves have the same depth, as in classic B-trees. The self-adjustment relies on the fact that using joins and splits the algorithm might choose to bring an item closer to the root or demote it farther from the root. Also, note that the number of nodes in a B-tree on keys is not fixed (as opposed to BSTs where we always have exactly nodes) and the split/join operations might increase/decrease the number of nodes of the tree, changing thus its shape.
For the rest of this paper, whenever we use the term B-tree we refer to our B-tree model, unless stated otherwise.
3 Lower bounds: simulating dynamic B-trees using BSTs.
In this section we show how to simulate a dynamic B-tree algorithm using a BST-model algorithm with an $O(\log B)$ overhead in the cost. This will allow us to transform lower bounds from the BST model into lower bounds for the B-tree model.
Notation.
For a search sequence $X$, we denote by $OPT_{BST}(X)$ and $OPT_{B}(X)$ the optimal (offline) cost to serve $X$ using a BST-model and a B-tree-model data structure respectively.
Theorem 3.1.
For any search sequence $X$, $OPT_{BST}(X) = O(OPT_{B}(X) \cdot \log B)$.
Proof.
We simulate a B-tree execution of using a BST in the following way: Each node of the B-tree is simulated by a red-black tree of depth . Thus our BST is a tree of red-black trees. We also augment the red-black tree data structure such that each node stores a counter on the number of keys in its subtree. Note that in this tree-of-trees, leaves of a red-black tree might have children, which are the roots of other red-black trees. To distinguish the leaves of each tree, we mark the root of each red-black tree. We also use the parent-child terminology for those red-black trees, i.e., if and are red-black trees corresponding to B-tree nodes and respectively such that is a child of , we will say that “tree is a child of tree ”.
It remains to show that each unit-cost B-tree operation can be simulated in $O(\log B)$ time using our tree-of-trees BST data structure. Moving the pointer from a B-tree node to an adjacent node corresponds to moving the BST pointer from the root of one red-black tree to the root of its child/parent. This can be done in $O(\log B)$ time, since the depth of our red-black trees is $O(\log B)$. For the other unit-cost operations, showing this is more complicated. In order to keep the presentation as simple as possible, we proceed as follows: we first describe some basic properties of red-black trees, we then use them to develop operations of merging and separating red-black trees which will be useful in our tree-of-trees construction, and finally we show how to implement the B-tree unit-cost operations using all those tools.
Background on red-black trees.
We note that red-black trees on $n$ nodes support split and concatenate operations, as well as finding the $k$th largest (or smallest) key, in $O(\log n)$ time [CLRS09]. We now describe those operations.
- The split operation of a red-black tree at a node $x$ re-arranges the tree such that $x$ is the root and the left and right subtrees are red-black trees including keys of smaller and larger values than $x$ respectively.
- Concatenating two red-black trees $T_1, T_2$ whose roots are children of a common node $x$ consists of re-arranging the subtree of $x$ to form a red-black tree on all keys of $T_1 \cup \{x\} \cup T_2$. This operation is also referred to as concatenating at $x$ and it can be defined even if one of $T_1, T_2$ is empty. Particularly, in our tree-of-trees construction, if we concatenate at a node whose left (right) child is marked, then we treat its left (right) subtree as empty.
- Find the key with a given rank: Given an augmented red-black tree on $n$ nodes, where each node stores the number of keys in its subtree, and a value $k$, we can find its $k$th largest (or smallest) key in $O(\log n)$ time (see e.g. [CLRS09, Chapter 14]).
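The rank query in the last item relies only on the subtree-size augmentation, not on the red-black balancing rules. The sketch below therefore uses a plain balanced BST built from a sorted list instead of a red-black tree; `SizeNode`, `build` and `select` are hypothetical names chosen for this example.

```python
class SizeNode:
    """BST node augmented with the number of keys in its subtree."""

    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.size = 1 + subtree_size(left) + subtree_size(right)


def subtree_size(node):
    return node.size if node is not None else 0


def build(sorted_keys):
    """Build a balanced BST from a sorted list (stand-in for a red-black tree)."""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    return SizeNode(sorted_keys[mid], build(sorted_keys[:mid]), build(sorted_keys[mid + 1:]))


def select(node, k):
    """Return the k-th smallest key (1-based), walking one root-to-node path."""
    while node is not None:
        rank = subtree_size(node.left) + 1  # rank of node.key within this subtree
        if k == rank:
            return node.key
        if k < rank:
            node = node.left
        else:
            k -= rank
            node = node.right
    raise IndexError("rank out of range")


root = build(list(range(10, 110, 10)))
print(select(root, 4))
```

The $k$th largest key of a tree with `n` keys is obtained as `select(root, n - k + 1)`.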
Combining and Separating red-black trees.
We now develop two procedures that will be useful in our implementation of B-tree unit cost operations. In particular we show how to merge and separate red-black trees in $O(\log n)$ time, where $n$ is the total number of nodes in the trees involved.
- (i) Merge(S,T): Given two red-black trees and such that is a child of , merge them into one valid red-black tree. We describe an implementation of this operation in time, where is the total number of nodes of and . Let be the root of . We can find the predecessor and the successor of in in time, by searching for the key value of in . Note that either or might not exist. We split at (if it exists) and then split the right subtree in (if it exists). Now, is the left subtree of (if does not exist, is just the right subtree of ). Unmark the root of . Then, concatenate at (skip this step if does not exist) and finally concatenate at (if it exists). The result is a valid red-black tree containing all keys of and . We used a constant number of -time operations.
- (ii) Separate(T,,): Given a red-black tree , separate keys with values in the interval , i.e. split into two trees where contains keys with values in the interval and is a parent of . In case is not specified (), we think of as being the minimum key value in and this operation separates all keys with value at most . Symmetrically, if is , we think of as being the maximum key value in and this operation separates keys with value at least . We implement this as follows. Let be the predecessor of in (if it exists) and the successor of (if it exists). Split in (skip this step if does not exist) and then split the subtree with values larger than at (skip this step if does not exist). As a result the left subtree of (or the right subtree of if does not exist) is the tree containing all keys in . Mark the root of . Then concatenate at (if it exists) and finally concatenate at (if it exists). As a result we get a valid red-black tree which is the parent of red-black tree containing all keys of the interval .
Simulating the unit-cost operations.
We now proceed to show how to simulate B-tree rotations, splits and joins using our tree of red-black trees data structure with cost $O(\log B)$. In all cases, the total number of keys in the trees involved is $O(B)$ and we perform a constant number of operations which take time $O(\log B)$.
- Rotations. We show how to implement a rotation of the form demote left - promote right (assuming valid values of and ). The other operations are defined analogously. Let be the B-tree edge which is rotated, where node is parent of and let and be the augmented red-black trees corresponding to and . Let and be the key values stored in and respectively such that for all we have that , similar to the example in figure 1. The rotation corresponds to promoting to the largest keys of , i.e. and demoting to the keys . We implement such a rotation as follows (see figure 2 for an illustration): We start by promoting the elements to . Find , i.e. the th largest key stored at . Then, Separate() to get a tree containing keys and a tree with the remaining keys of . is a child of and is a child of . Now, we merge and to get a new tree , such that is a child of . It remains to demote to . To do that, we split at . Let and be the two subtrees of in . Note that are the largest keys of . Find , i.e., the th largest key of and Separate(). We get a separate tree containing . Mark the root of . Now, is a child of , so we can merge them to form , the tree corresponding to B-tree node . Finally we concatenate at the root , to form the final tree corresponding to , denoted by , where is a child of .
- Splitting a node of a B-tree. Let be the node which we want to split and its parent. Let also and the corresponding red-black trees, where is a child of . Let be the median key value of . We split at , so that is the root with subtrees and . Mark the roots of and and then merge (which is a single-node red-black tree) with . Clearly all those operations can be performed in time.
- Joining two sibling nodes. This is the inverse operation of splitting, so the sequence of operations mirrors the one performed in splitting. Let and be the sibling B-tree nodes that we want to join, and their parent, with , and the corresponding red-black trees in our binary search tree. and are children of and there is a unique key in such that keys stored at are smaller than and keys stored at are larger. Thus, is the successor of the root of in and we can find it in time. We then Separate. Now we get a new tree containing all keys of except , and is a single-node red-black tree, child of . and are the left and right children of . We unmark the roots of and and concatenate at , to get a new tree and mark its root. Now corresponds to the join node of and , and it is a child of the red-black tree which corresponds to the parent node in the B-tree. We performed a constant number of operations each of which takes time .
∎
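Theorem 3.1 implies that we can transform any lower bound for binary search trees to a lower bound for dynamic B-trees, as shown in the following corollary.
Corollary 3.2.
Let $X$ be a search sequence and let $LB(X)$ be any lower bound on the optimal cost $OPT_{BST}(X)$ of executing $X$ in the BST model. Then the optimal cost in the B-tree model satisfies $OPT_{B}(X) = \Omega(LB(X)/\log B)$.
Proof.
Since $LB(X)$ is a lower bound on $OPT_{BST}(X)$, Theorem 3.1 gives $LB(X) \leq OPT_{BST}(X) = O(OPT_{B}(X) \cdot \log B)$, which implies $OPT_{B}(X) = \Omega(LB(X)/\log B)$. ∎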
4 Belga B-trees
In this section, we develop a dynamic B-tree data structure called the Belga B-tree that achieves a competitive ratio of $O(\log \log N)$, for search sequences of length $m = \Omega(N)$, provided that $B = \log^{O(1)} N$, i.e. $\log B = O(\log \log N)$. Our construction is built upon the ideas used in [DHIP07] to get a similar competitive ratio for binary search trees. Particularly, we crucially connect the cost of our algorithm to the interleave lower bound. For completeness, we present here the setup and the necessary background regarding this lower bound.
Interleave Lower Bound and preferred paths (See Figure 3).
Let be the keys stored in our B-tree. Let be a (fixed) complete binary search tree on those keys. For each internal node in , we define its left region to be together with the subtree rooted at its left child and its right region to be the subtree rooted at its right child. Node has a preferred child, which is left or right, depending on whether the last search for a node in its subtree was in its left or right region (if no node of the subtree rooted at has been searched, then has no preferred child).
We define a preferred path in as follows: Start from a node that is not the preferred child of its parent (including the root) and perform a walk by following the preferred child of the current node, until reaching a node with no preferred child. Clearly, a preferred path contains $O(\log N)$ keys.
Note that during a search for a key, the preferred child of some nodes that are ancestors of the node with the key being searched might change. Each change of preferred child, changes also the preferred paths of . For a search sequence , the interleave lower bound equals the total number of changes of preferred child from left to right or from right to left, over all nodes of . We use the following lemma of [DHIP07], which is a slight variant of the first lower bound of [Wil89]:
Lemma 4.1 (Lemma 3.2 in [DHIP07]).
The cost to execute in the BST model is if .
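The interleave quantity defined above can be computed directly from its definition. The sketch below is illustrative and makes some assumptions that are not taken from the paper: keys are the integers `1..n`, the reference tree is the implicit balanced BST whose root over an interval `[lo, hi]` is the midpoint, and the function name `interleave_bound` is hypothetical. A search for a node's own key falls in its left region, and only switches between left and right are counted (the first assignment of a preferred child is not).

```python
def interleave_bound(n, searches):
    """Count left<->right preferred-child changes over a search sequence."""
    preferred = {}  # key of an internal reference-tree node -> 'L' or 'R'
    changes = 0
    for x in searches:
        lo, hi = 1, n
        while lo < hi:  # lo == hi is a leaf, which has no preferred child
            mid = (lo + hi) // 2  # key stored at the current reference-tree node
            side = 'L' if x <= mid else 'R'  # left region includes the node itself
            if mid in preferred and preferred[mid] != side:
                changes += 1
            preferred[mid] = side
            if x == mid:
                break
            if x < mid:
                hi = mid - 1
            else:
                lo = mid + 1
    return changes


# Alternating between the two extremes flips the root's preferred child each time.
print(interleave_bound(7, [1, 7, 1, 7]))
```

For example, alternating searches for the smallest and largest key flip the preferred child of the root on every search after the first, while a single sequential scan of all keys causes only one switch per internal node.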
High-level overview of our structure.
We store each preferred path in a balanced classic B-tree. We call such classic B-trees auxiliary trees. Our dynamic B-tree will be a tree of classic B-trees. Recall that Lemma 4.1 essentially tells us that the number of preferred paths touched during a request sequence is a lower bound on the value of . The idea here is to show that for each preferred path touched, and thus unit of lower bound incurred, we can perform search and all update operations (cutting and merging preferred paths) with an overhead factor . This will imply that we have a dynamic B-tree with cost . This combined with Lemma 4.1 and Corollary 3.2 implies that the cost of our dynamic B-tree data structure is .
Auxiliary trees.
Our auxiliary trees are augmented classic B-trees. Each auxiliary tree stores a preferred path. With each key we also store its depth in the reference tree . We call this value depth of key . Also, each node stores the minimum and maximum depth of a key in its subtree. Last, a node may be marked or unmarked, depending on whether it is the root of an auxiliary tree or not. Note that is just a reference tree used for the analysis. We do not need to store explicitly in order to implement our algorithm. All necessary information about is stored in our dynamic B-tree data structure.
During an execution of a search sequence we need to perform the following operations on a preferred path:
- (i) Search for a key.
- (ii) Cut the preferred path into two paths, one consisting of keys of depth at most $d$ and the other of keys of depth greater than $d$.
- (iii) Merge two preferred paths $P_1$ and $P_2$, where the bottom node of $P_1$ is the parent of the top node of $P_2$.
We will show that we can perform those operations using our auxiliary trees in time , where is the number of keys in the involved preferred paths. We defer this proof to the end of this section and we now proceed to the description and analysis of Belga B-trees, assuming that those operations can be done in time . For the rest of this section, whenever we refer to cutting/merging operations on auxiliary trees, we mean the implementation of cutting/merging the corresponding preferred paths in our B-tree data structure.
Our Algorithm.
A Belga B-tree is a tree of auxiliary classic B-trees, where each auxiliary tree stores a preferred path. Initially we transform the input tree to a valid Belga B-tree. Upon a request for a key , we start from the root and search for . Whenever we reach a marked node (i.e. a root of an auxiliary tree), we have to update the preferred paths. Let be the preferred path stored in the auxiliary tree of the parent of and the preferred path in the auxiliary tree rooted at . We update the preferred paths using the cut and merge operations of auxiliary trees. Particularly, if is the minimum depth of a key of (this value is stored at node of our B-tree), we cut the auxiliary tree storing at depth . This gives us two preferred paths and , where the first stores keys of of depth at most and the second keys of of depth greater than . We mark the roots of the auxiliary trees corresponding and . We then merge the auxiliary tree storing with the auxiliary tree rooted at (which stores ). We mark the root of the new tree and continue the search for .
Note that the only part where our algorithm needs to perform rotations is the initial step of transforming the input tree into a Belga B-tree.
Bounding the cost.
We now compare the cost of our Belga B-tree data structure to that of the optimal offline B-tree. The following lemma makes the essential connection between the number of preferred paths touched during a search and the cost of our algorithm.
Lemma 4.2.
Let be the number of preferred child changes during a search for key . Then the cost of Belga B-tree for searching is .
Proof.
To search for , we touch exactly preferred paths. We account separately for the search cost and the update cost.
For each preferred path touched, the search cost is , since we are searching a balanced B-tree on keys. Thus the total search cost is clearly .
We now account for the update cost. Recall that we can cut and merge preferred paths on keys in time . Since each preferred path has at most keys, we can perform those updates in time . There are preferred path changes, and for each change we perform one cut and one merge operation, so the total time for merging and cutting is . The lemma follows. ∎
We now combine this lemma with Corollary 3.2 to get the competitive ratio of Belga B-trees.
Theorem 4.3.
For any search sequence of length , Belga B-trees are -competitive.
Proof.
We account only for the cost incurred during searches, since the cost of transforming the input tree into a Belga B-tree is just a fixed additive term which does not depend on the input sequence.
The total number of preferred path changes is at most . The additive accounts for the fact that initially each node has no preferred child, so its first change from null to either left or right is not counted in . Using Lemma 4.2 and summing up over all search requests, we get that the cost of Belga B-trees is . By our assumption on the value of , we have that , thus the cost is in . By Lemma 4.1 this is bounded by . Using Corollary 3.2 we get that cost of Belga B-tree is
[TABLE]
Note that for any request sequence . Since , we have that . We get that the total cost is upper bounded by
[TABLE]
∎
Operations on auxiliary trees in logarithmic time.
We now show that our auxiliary B-trees support search, cut and merge in time $O(1+\log_B n)$, where $n$ is the total number of nodes in the trees involved.
Before proceeding to this proof we note that classic B-trees on $n$ nodes support search, split and concatenate operations (similar to the ones we presented in the previous section for red-black trees) in time $O(1+\log_B n)$ (see [CLRS09], Chapter 18). For completeness we describe here the split and concatenate operations:
- Splitting a B-tree $T$ at a key value $x$ consists of creating a tree whose root contains only $x$, whose left subtree is a B-tree on the keys smaller than $x$, and whose right subtree is a B-tree on the keys greater than $x$.
- Concatenating two classic B-trees $T_1$ and $T_2$ with a key value $x$, such that all keys in $T_1$ are smaller than $x$ and all keys in $T_2$ are greater, consists of creating a new classic B-tree which contains all key values contained in $T_1$, $T_2$ and $x$.
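The input/output behaviour of these two operations can be sketched on sorted Python lists. This is only a model of what split and concatenate return, under the assumption that a key set stands in for a B-tree; it does not reproduce the logarithmic running time of the actual B-tree operations, and the function names are illustrative.

```python
from bisect import bisect_left
from typing import List, Tuple


def split(keys: List[int], x: int) -> Tuple[List[int], int, List[int]]:
    """Model of splitting at x: return (keys smaller than x, x, keys
    greater than x). The key x must be present in the sorted list."""
    i = bisect_left(keys, x)
    if i == len(keys) or keys[i] != x:
        raise KeyError(x)
    return keys[:i], x, keys[i + 1:]


def concatenate(left: List[int], x: int, right: List[int]) -> List[int]:
    """Model of concatenating two key sets around a separating key x."""
    if any(k >= x for k in left) or any(k <= x for k in right):
        raise ValueError("x must separate the two key sets")
    return left + [x] + right


if __name__ == "__main__":
    left, x, right = split([1, 3, 5, 7, 9], 5)
    print(left, x, right)
    print(concatenate(left, x, right))
```

In this model concatenate undoes split, which is the property the cut and merge procedures rely on.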
Search can clearly be performed in time $O(1+\log_B n)$ on a tree with $n$ nodes. We now describe the cut and merge operations on preferred paths.
Cut a preferred path at depth $d$: Let $T$ be the tree storing the preferred path, and let $n$ be its number of nodes. Let $\ell$ and $r$ be the smallest and the largest key values, respectively, stored at depth greater than $d$ in the path. We wish to find $\ell$ and $r$ in the tree $T$. This can be done using the maximum depth value of the subtree, which is stored in each node. We show how to find $\ell$; finding $r$ is symmetric. Start from the root and move to the leftmost child whose maximum depth is greater than $d$. When we reach a node $v$ such that none of its children has maximum depth greater than $d$, then $\ell$ is the smallest key in $v$ with depth greater than $d$. Let $\ell'$ be the predecessor of $\ell$ in $T$ (if it has one) and $r'$ the successor of $r$ in $T$ (if it has one). Split $T$ at $\ell'$ (skip this step if $\ell'$ does not exist) and then split the right subtree at $r'$ (skip this step if $r'$ does not exist). Now, the left subtree of $r'$ contains all keys with depth greater than $d$. Let us call this tree $T'$. Mark the root of $T'$ (and change the values of depth, maximum depth and minimum depth in time $O(1+\log_B n)$) and then use concatenate operations at the tree rooted at $r'$ (if it exists) and then at the tree rooted at $\ell'$ (if it exists) to make the remainder of $T$ a valid classic B-tree.
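The descent that locates the smallest key deeper than the cut depth can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Node` class and its field names are hypothetical, each node stores the maximum depth of its subtree, and the sketch checks children and the node's own keys in key order so that a qualifying key stored in an internal node is not skipped.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    keys: List[int]                    # sorted keys stored in this node
    depths: List[int]                  # depths[i] is the depth of keys[i]
    children: List["Node"] = field(default_factory=list)  # empty for a leaf
    max_depth: int = field(init=False)  # maximum depth in this subtree

    def __post_init__(self) -> None:
        self.max_depth = max(self.depths + [c.max_depth for c in self.children])


def smallest_key_deeper_than(node: Node, d: int) -> Optional[int]:
    """Return the smallest key with depth greater than d, or None."""
    while True:
        for i in range(len(node.keys) + 1):
            # Child i holds the keys that precede keys[i] in key order.
            if node.children and node.children[i].max_depth > d:
                node = node.children[i]
                break  # continue the descent in that child
            if i < len(node.keys) and node.depths[i] > d:
                return node.keys[i]
        else:
            return None  # no key in this subtree is deeper than d


if __name__ == "__main__":
    a = Node([1, 2], [3, 4])
    b = Node([5, 6], [7, 8])
    c = Node([9, 10], [2, 1])
    root = Node([4, 8], [5, 6], [a, b, c])
    print(smallest_key_deeper_than(root, 5))
```

The loop visits one node per level, so the number of nodes touched is bounded by the height of the auxiliary tree. The largest qualifying key can be found symmetrically by scanning from the right.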
Merge two preferred paths: Let $P_1$ and $P_2$ be the preferred paths that we want to merge, where the bottom node of $P_1$ is the parent of the top node of $P_2$. Merging is the inverse operation of a cut. Let $T_1$ and $T_2$ be the auxiliary trees storing $P_1$ and $P_2$ respectively, i.e., $T_1$ is a parent of $T_2$ in our tree-of-trees construction and the key values stored in $T_1$ have smaller depth than the key values stored in $T_2$. Let $n$ be the total number of nodes in $T_1$ and $T_2$. Pick a key $x$ from the root of $T_2$ and find its predecessor $\ell'$ and its successor $r'$ in $T_1$. Split $T_1$ at $\ell'$ (skip this step if $\ell'$ does not exist) and then split the right subtree at $r'$ (skip this step if $r'$ does not exist). Now the left subtree of $r'$ is $T_2$. Unmark the root of $T_2$. Then, concatenate at $r'$ to get a resulting tree which is the right subtree of the root $\ell'$ (skip this step if $r'$ does not exist). Then, concatenate at $\ell'$ (if it exists) to get a valid B-tree which contains all keys of $T_1$ and $T_2$. In each of the last two steps (if not skipped), updates of the values of depth, maximum depth and minimum depth take time $O(1+\log_B n)$.
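That merging inverts cutting can be seen in a small model where a preferred path is a key-sorted list of `(key, depth)` pairs. This is a sketch of the semantics only, assuming this list representation in place of the auxiliary B-trees, and it ignores running time. It also checks the fact the split-based procedure relies on: the keys deeper than the cut depth form one contiguous interval in key order.

```python
from typing import List, Tuple

Path = List[Tuple[int, int]]  # (key, depth) pairs sorted by key


def cut(path: Path, d: int) -> Tuple[Path, Path]:
    """Split a path into (keys at depth <= d, keys at depth > d)."""
    deep = [i for i, (_, depth) in enumerate(path) if depth > d]
    if not deep:
        return list(path), []
    lo, hi = deep[0], deep[-1]
    # On a root-to-leaf path of a BST the deep keys are contiguous in
    # key order, so two splits around them suffice.
    assert deep == list(range(lo, hi + 1))
    return path[:lo] + path[hi + 1:], path[lo:hi + 1]


def merge(top: Path, bottom: Path) -> Path:
    """Inverse of cut: combine both key sets back into key order."""
    return sorted(top + bottom)


if __name__ == "__main__":
    # Path 8 -> 4 -> 6 -> 5 in a BST, at depths 0, 1, 2, 3.
    path = [(4, 1), (5, 3), (6, 2), (8, 0)]
    top, bottom = cut(path, 1)
    print(top, bottom)
    print(merge(top, bottom) == path)
```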
5 Transforming any static BST into the B-Tree model
In this section we focus on static trees, with the goal of simulating a static BST using a static B-tree and achieving a speedup by a factor of $\Theta(\log B)$. In the static BST and B-Tree models, all that is allowed in each operation is to move a single pointer around the tree, starting at the root, each time moving to a neighboring node, at unit cost per move. We refer to a sequence of moves of a single pointer as a walk. In particular, given a BST we wish to convert it to a B-tree so that if a walk in the BST costs $k$, a walk in the B-tree that touches the same keys costs as little as possible in terms of $k$; $O(k)$ is clearly possible since a BST is a B-tree, but when can we achieve $O(k/\log B)$?
We note that the results of this section allow the pointer to move arbitrarily in a static BST/B-tree, i.e., it can visit nodes that are outside the path from the root to the searched node. In the case where only a search path of length $D$ is considered, the worst-case cost has been completely characterized in [DIL15] as $\Theta(D/\log(1+B))$ when $D = O(\log N)$, $\Theta(\log N / \log(1 + B \log N / D))$ when $D = \Omega(\log N)$ and $D = O(B \log N)$, and $\Theta(D/B)$ when $D = \Omega(B \log N)$.
Block-Connected Mappings.
The most natural approach to achieve our goal is to try to map a static BST $T$ into a static B-tree $T_B$ such that each node of $T_B$ corresponds to a connected subtree of $T$. We call such a mapping $f \colon T \to T_B$ block-connected. Observe that in order to achieve a $\Theta(\log B)$ speedup in the B-Tree model, it is necessary that a block-connected mapping places every node at depth $d$ in $T$ at depth $O(d/\log B)$ in $T_B$. However, as we will see, this is not sufficient.
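The block-connected condition can be checked mechanically. The following minimal sketch assumes a hypothetical representation that is not part of the paper: the BST is given as a `parent` dictionary (the root maps to `None`) and the mapping as a `block` dictionary from BST node to block identifier. It uses the fact that a set of tree nodes is connected exactly when it contains a single node whose parent lies outside the set. It does not check the block capacity $B$.

```python
from collections import Counter
from typing import Dict, Hashable, Optional


def is_block_connected(parent: Dict[Hashable, Optional[Hashable]],
                       block: Dict[Hashable, Hashable]) -> bool:
    """True if every block is a connected subtree of the BST."""
    tops = Counter()  # per block: nodes whose parent is outside the block
    for v, p in parent.items():
        if p is None or block[p] != block[v]:
            tops[block[v]] += 1
    # A connected block has exactly one topmost node.
    return all(count == 1 for count in tops.values())


if __name__ == "__main__":
    parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}
    print(is_block_connected(parent, {1: "a", 2: "a", 3: "a", 4: "b", 5: "c"}))
    print(is_block_connected(parent, {1: "a", 2: "b", 3: "a", 4: "a", 5: "b"}))
```

In the second mapping node 4 shares a block with the root although its parent is stored elsewhere, so that block is not a connected subtree.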
The next theorem shows that, perhaps surprisingly, this approach fails to give any super-constant factor improvement when the mapping is deterministic. Afterwards, we show how to achieve a $\Theta(\log B)$ factor speedup using randomization.
Theorem 5.1.
There does not exist a block-connected mapping $f \colon T \to T_B$ such that any walk on $T$ of length $k$ corresponds to a walk of length $o(k)$ in $T_B$.
Proof.
We proceed by contradiction. Assume an and for some integer , and let be the perfectly balanced tree with nodes and thus leaves. Consider some BST model sequence of operations which is an inorder traversal of . Let be the number of different blocks (i.e. B-tree nodes) that stores the leaves of in, which must be at least . Let be the sequence of operations where the inorder traversal does not recurse whenever it encounters a node stored in the same block as a leaf. will still visit all blocks containing leaves, but its length will be exactly . This happens because the block-connected property ensures that will never visit two nodes, both of which are in the same block as a leaf of , as that would imply they would have an LCA also in the block, which would mean would not visit them. Thus has a BST cost of and a B-tree cost of , where which proves the theorem. ∎
Randomized Construction.
Theorem 5.1 above is based on an adversarial argument and relies crucially on the knowledge of the layout of the B-tree. To overcome this issue, we use randomization.
Theorem 5.2.
For any BST $T$, there is a randomized block-connected mapping $f \colon T \to T_B$ which produces a static B-tree $T_B$ such that for any walk of length $k$ in $T$, there exists a corresponding walk in $T_B$ with expected cost $O(k/\log B)$.
Proof.
We construct the B-tree as follows. We choose uniformly at random an integer in . The root node of contains the key values of the first levels of . Then, we build the rest of the tree in a deterministic way, by storing levels of each subtree in a B-tree node, recursively. Consider any walk of operations on that starts at the root. We assume that the block containing the root and the current location of the walk are stored in memory. Whenever passes through an edge of , the probability that this move corresponds to a unit cost operation equals the probability that the endpoints of belong to different B-tree nodes in and equals .
We thus obtain that the expected cost of the corresponding sequence of operations in in is . Since for any , we get that the expected cost is . ∎
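The effect of the random offset can be checked on a small example. The sketch below rests on assumptions that are not taken from the paper: `h` is a hypothetical parameter for the number of BST levels stored in one block, the offset `r` is drawn uniformly from $\{1, \dots, h\}$, the root block holds the first `r` levels, and a move is charged one unit exactly when it crosses between two layers of blocks. A walk is described only by the sequence of depths it visits.

```python
from fractions import Fraction
from typing import List


def block_layer(depth: int, r: int, h: int) -> int:
    """Layer of blocks holding a node at `depth` (the root has depth 0)."""
    return 0 if depth < r else 1 + (depth - r) // h


def crossings(walk_depths: List[int], r: int, h: int) -> int:
    """Number of moves of the walk that change block for a fixed offset r."""
    return sum(block_layer(a, r, h) != block_layer(b, r, h)
               for a, b in zip(walk_depths, walk_depths[1:]))


def expected_crossings(walk_depths: List[int], h: int) -> Fraction:
    """Exact expectation over the uniform choice of r in {1, ..., h}."""
    total = sum(crossings(walk_depths, r, h) for r in range(1, h + 1))
    return Fraction(total, h)


if __name__ == "__main__":
    walk = [0, 1, 2, 3, 2, 3, 4, 5]  # 7 moves, going down, up, then down
    print(expected_crossings(walk, h=3))
```

Each tree edge crosses a layer boundary for exactly one of the `h` offsets, so under these assumptions a walk of `k` moves pays `k / h` in expectation, whichever edges it uses.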
6 Open Problems
We conclude with some open problems. The first is that our Belga B-trees are $O(\log \log N)$-competitive only when $B = \log^{O(1)} N$, and thus the case of large $B$, where $B = \log^{\omega(1)} N$, remains open. The main impediment is to figure out how to fit multiple preferred paths into one block.
A more general open problem is to resolve the following conjecture: Is it possible to convert any BST-model algorithm into a B-Tree model algorithm such that if an algorithm costs $T$ in the BST model, it costs $O(T/\log B)$ in the B-Tree model? Special cases of this conjecture, when applied to, for example, splay trees and greedy future, would also be interesting should the general conjecture prove too difficult to resolve.
A third open problem is whether, given two B-Tree model algorithms, one can achieve a runtime that is the minimum of the two; this would be the B-Tree model analogue of the BST result of [DILÖ13]. It would also allow one to combine Belga B-trees with other B-Tree model algorithms to get stronger results, for example with [BDL08] to add the working-set bound; in the BST model, [WDS06] gave an $O(\log \log N)$-competitive BST with the working-set bound.
- [AV88] Alok Aggarwal and Jeffrey Scott Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116–1127, 1988.
- [AVL62] G. M. Adelson-Velskiĭ and E. M. Landis. An algorithm for organization of information. Dokl. Akad. Nauk SSSR, 146:263–266, 1962.
- [BCDI07] Mihai Badoiu, Richard Cole, Erik D. Demaine, and John Iacono. A unified access bound on comparison-based dynamic dictionaries. Theor. Comput. Sci., 382(2):86–96, 2007.
- [BDIL16] Prosenjit Bose, Karim Douïeb, John Iacono, and Stefan Langerman. The power and limitations of static binary search trees with lazy finger. Algorithmica, 76(4):1264–1275, 2016.
- [BDL08] Prosenjit Bose, Karim Douïeb, and Stefan Langerman. Dynamic optimality for skip lists and B-trees. In Symposium on Discrete Algorithms, SODA, pages 1106–1114, 2008.
- [BHM13] Prosenjit Bose, John Howat, and Pat Morin. A history of distribution-sensitive data structures. In Brodnik et al. [BLRV13], pages 133–149.
- [BLRV13] Andrej Brodnik, Alejandro López-Ortiz, Venkatesh Raman, and Alfredo Viola, editors. Space-Efficient Data Structures, Streams, and Algorithms - Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, volume 8066 of Lecture Notes in Computer Science. Springer, 2013.
- [BM72] Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indices. Acta Inf., 1:173–189, 1972.
