Top Tree Compression of Tries
Philip Bille, Inge Li Gørtz, Paweł Gawrychowski, Gad M. Landau, and Oren Weimann

TL;DR
This paper introduces a novel top tree compression method for tries that operates efficiently on a pointer machine, achieving optimal space and query time for prefix searches without relying on advanced RAM techniques.
Contribution
It presents the first pointer machine-compatible compressed trie structure with worst-case optimal size and query time, along with new data structures for grammar-compressed string access and level ancestor problems.
Findings
Achieves $O(n/\log_\sigma n)$ space complexity.
Supports prefix search in $O(\min(m\log \sigma,m + \log n))$ time.
First pointer machine solution with sublinear space and optimal query performance.
Top Tree Compression of Tries (An extended abstract appeared at ISAAC 2019 [17].)
Philip Bille
Paweł Gawrychowski
Inge Li Gørtz
Gad M. Landau
Oren Weimann
Abstract
We present a compressed representation of tries based on top tree compression [ICALP 2013] that works on a standard, comparison-based, pointer machine model of computation and supports efficient prefix search queries. Namely, we show how to preprocess a set of strings of total length $n$ over an alphabet of size $\sigma$ into a compressed data structure of worst-case optimal size $O(n/\log_\sigma n)$ that given a pattern string $P$ of length $m$ determines if $P$ is a prefix of one of the strings in time $O(\min(m\log \sigma, m + \log n))$. We show that this query time is in fact optimal regardless of the size of the data structure.
Existing solutions either use $\Omega(n)$ space or rely on word RAM techniques, such as tabulation, hashing, address arithmetic, or word-level parallelism, and hence do not work on a pointer machine. Our result is the first solution on a pointer machine that achieves worst-case $o(n)$ space. Along the way, we develop several data structures that work on a pointer machine and are of independent interest. These include an optimal data structure for random access to a grammar-compressed string and an optimal data structure for a variant of the level ancestor problem.
1 Introduction
A string dictionary compactly represents a set of strings $S$ to support efficient prefix queries, that is, given a pattern string $P$ determine if $P$ is a prefix of some string in $S$. Designing efficient string dictionaries is a fundamental data structural problem dating back to the 1960s. String dictionaries are a key component in a wide range of applications in areas such as computational biology, data compression, data mining, information retrieval, natural language processing, and pattern matching.
A key challenge and the focus of most of the recent work is to design efficient compressed string dictionaries, that take advantage of repetitions in the strings to minimize space, while still supporting efficient queries. While many efficient solutions are known, they all rely on powerful word-RAM techniques, such as tabulation, address arithmetic, word-level parallelism, hashing, etc., to achieve efficient bounds. A natural question is whether or not such techniques are necessary for obtaining efficient compressed string dictionaries or if simpler and more basic computational primitives such as pointer-based data structures and character comparison suffice.
In this paper, we answer this question in the affirmative by introducing a new compressed string dictionary based on top tree compression that works on a standard comparison-based, pointer machine model of computation. We achieve the following bounds: let $n$ be the total length of the strings in $S$, let $\sigma$ be the size of the alphabet, and let $m$ be the length of a query string $P$. Our compressed string dictionary uses $O(n/\log_\sigma n)$ space (space is measured as the number of words and not bits, see discussion below) and supports queries in $O(\min(m\log \sigma, m + \log n))$ time. The space matches the information-theoretic worst-case space lower bound, and we further show that the query time is optimal for any comparison-based query algorithm regardless of the space. Compared to previous work our string dictionary is the first $o(n)$ space solution in this model of computation.
1.1 Computational Models
We consider three computational models. In the comparison-based model algorithms only interact with the input by comparing elements. Hence they cannot exploit the internal representation of input elements, e.g., for hashing or word-level parallelism. The comparison-based model is a fundamental and well-studied computational model, e.g., in textbook results for sorting [45], string matching [44], and computational geometry [54]. Modern programming languages and libraries, such as the C++ standard template library, implement comparison-based algorithms by supporting abstract and user-specified comparison functions as function arguments. In our context, we say that a string dictionary is comparison-based if the query algorithm can only access the input string $P$ via single character comparisons of the form $P[i] \leq c$, where $c$ is a character.
In the pointer machine model, a data structure is a directed graph with bounded out-degree. Each node contains a constant number of data fields or pointers to other nodes and algorithms must access the data structure by traversing the graph. Hence, a pointer machine algorithm cannot implement random access structures such as arrays or perform address arithmetic. The pointer machine captures linked data structures such as linked lists and search trees. The pointer machine model is a classic and well-studied model, see e.g. [60, 21, 37, 22, 1].
Finally, in the word RAM model of computation [36] the memory is an array of memory words, that each contain a logarithmic number of bits. Memory words can be operated on in unit-time using a standard set of arithmetic operations, boolean operations, and shifts. The word RAM model is strictly more powerful than the comparison-based model and the pointer-machine model and supports random access, hashing, address arithmetic, word-level parallelism, etc. (these are not possible in the other models).
The space of a data structure in the word RAM model is the number of memory words used and the space in the pointer machine model is the total number of nodes. To compare the space of the models, we assume that each field in a node in the pointer machine stores a logarithmic number of bits. Hence, the total number of bits we can represent in a given space in both models is within a constant factor of each other.
1.2 Previous work
The classic textbook string dictionary solution, due to Fredkin [31] from 1960, is to store the trie $T$ of the strings in $S$ and to answer prefix queries using a top-down traversal of $T$, where at each step we match a single character from $P$ to the labels of the outgoing edges of a node. If we manage to match all characters of $P$ then $P$ is a prefix of a string in $S$ and otherwise it is not.
Depending on the representation of the trie and the model of computation we can obtain several combinations of space and time complexity. On a comparison-based, pointer machine model of computation, we can store the outgoing edges of each node in a biased search tree [14], leading to an $O(n)$ space solution with query time $O(\min(m\log \sigma, m + \log n))$.
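The textbook traversal described above can be sketched as follows. This is a minimal illustration, not the structure from the paper: the names `TrieNode`, `build_trie`, and `is_prefix` are hypothetical, and a sorted Python list with binary search stands in for the biased search tree, so the sketch gives the $O(m \log \sigma)$ part of the bound only and uses arrays, which a pointer machine does not have.

```python
from bisect import bisect_left


class TrieNode:
    """Trie node whose outgoing edges are kept sorted by label."""

    def __init__(self):
        self.labels = []    # sorted edge labels
        self.children = []  # children[i] is reached via labels[i]


def build_trie(strings):
    """Insert every string character by character, keeping labels sorted."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            i = bisect_left(node.labels, ch)
            if i == len(node.labels) or node.labels[i] != ch:
                node.labels.insert(i, ch)
                node.children.insert(i, TrieNode())
            node = node.children[i]
    return root


def is_prefix(root, pattern):
    """Top-down traversal using only character comparisons.

    Each step binary-searches the sorted edge labels (a stand-in for the
    biased search tree), so one step costs O(log sigma) comparisons.
    """
    node = root
    for ch in pattern:
        i = bisect_left(node.labels, ch)
        if i == len(node.labels) or node.labels[i] != ch:
            return False
        node = node.children[i]
    return True
```

For example, after `root = build_trie(["banana", "band", "can"])`, the call `is_prefix(root, "ban")` succeeds while `is_prefix(root, "bad")` fails at the third character.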
We can compress this solution by merging maximal identical complete subtrees of $T$ [28], thus replacing $T$ by a directed acyclic graph (DAG) $D$ that represents $T$. This leads to a solution with the same query time as above but using only $O(d)$ space, where $d$ is the size of the smallest DAG $D$ representing $T$. The size of $D$ can be exponentially smaller than $n$, but $D$ may not compress $T$ at all. Consider for instance the case where $T$ is a single path of length $n$ where all edges have the same label (i.e., corresponding to a single string of the same letter). Even though $T$ is highly compressible (we can represent it by the label and the length of the path) it does not contain any identical subtrees and hence its smallest DAG has size $\Omega(n)$.
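The two extremes mentioned above can be checked on small inputs. The sketch below is an illustration under stated assumptions: the helper names `trie_of`, `count_nodes`, and `dag_size` are hypothetical, tries are plain nested dictionaries, and identical subtrees are detected with a hash table, which is itself a word RAM technique and is used here only to count the distinct subtrees.

```python
def trie_of(strings):
    """Build a trie as nested dicts: edge label -> child trie."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Number of nodes in the (uncompressed) trie."""
    return 1 + sum(count_nodes(child) for child in node.values())


def dag_size(root):
    """Number of nodes in the minimal DAG, i.e. distinct rooted subtrees."""
    ids = {}

    def canon(node):
        # A subtree is identified by its sorted (label, child id) pairs.
        key = tuple(sorted((ch, canon(child)) for ch, child in node.items()))
        return ids.setdefault(key, len(ids))

    canon(root)
    return len(ids)
```

On the single path for the string `"aaaaaaaa"` both counts are 9, so DAG compression gains nothing. On the trie of all eight strings of length 3 over `{a, b}` the trie has 15 nodes while the DAG has only 4, one per subtree height.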
Using the power of the word RAM model improved representations are possible. Benoit et al. [13] and Raman et al. [55] gave succinct representations of tries that achieve $O(n/\log_\sigma n)$ space and $O(m)$ query time, thus simultaneously achieving optimal query time and matching the worst-case information theoretic space lower bounds. These results rely on powerful word RAM techniques to obtain the bounds, such as tabulation and hashing. Numerous trie representations are known, see e.g., [26, 53, 34, 41, 4, 6, 7, 8, 5, 18, 61, 63, 40, 59, 62], but these all use word RAM techniques to achieve near optimal combinations of time and space.
Another approach is to compress the strings according to various measures of repetitiveness, such as the empirical $k$-th order entropy [35, 46, 50, 56], the size of the Lempel-Ziv parse [9, 23, 42, 32, 33, 15, 52], the size of the smallest grammar [24, 25, 32], the run-length encoded Burrows-Wheeler transform [47, 48, 49, 57], and others [51, 10, 58, 11, 30, 5]. The above solutions are designed to support more general queries on the strings, but as noted by Arz and Fischer [5] they are straightforward to adapt to prefix queries. For example, if $z$ is the size of the Lempel-Ziv parse of the concatenation of the strings in $S$, the result of Christiansen and Etienne [23] implies a string dictionary of size $O(z \log(n/z))$ that supports queries in time $O(m + \log^{\epsilon} n)$. Since $z$ can be exponentially smaller than $n$, the space is significantly improved on highly-compressible strings. Since $z = O(n/\log_\sigma n)$ in the worst case, the space is always $O\left(\frac{n}{\log_\sigma n} \log\log_\sigma n\right)$ and thus almost optimal compared to the information theoretic lower bound. Similar bounds are known for the other measures of repetitiveness. As in the case of succinct representations of tries, all of these solutions use word RAM techniques.
1.3 Our results
We propose a new compressed string dictionary that achieves the following bounds:
Theorem 1
Let $S$ be a set of strings of total length $n$ over an alphabet of size $\sigma$. On a comparison-based, pointer machine model of computation, we can construct a compressed string dictionary that uses $O(n/\log_\sigma n)$ space and answers queries in $O(\min(m\log \sigma, m + \log n))$ time.
Note that the space bound for Theorem 1 matches the information theoretic lower bound and the time bound matches the classic linear space implementation of tries with biased search trees. The result is the first $o(n)$ space solution in this model of computation. Furthermore, we show that this time bound is optimal.
Theorem 2
For any $n$, $\sigma$, and $m$, there exists a set $S$ of strings of total length $n$ over an alphabet of size $\sigma$ such that any comparison-based algorithm that checks if a given pattern of length $m$ belongs to $S$ needs to perform $\Omega(\min(m\log \sigma, m + \log n))$ comparisons in the worst case.
Note that Theorem 2 holds regardless of the space used, holds even for weaker membership queries, and only assumes that the algorithm is a comparison-based algorithm. We note that the upper bound holds on a pointer machine with comparisons and additions as arithmetic operations, while the lower bound only assumes comparisons.
1.4 Techniques
In top tree compression [19] one transforms a labeled tree $T$ into another tree $\mathcal{T}$ (called a top tree) that is of height $O(\log n)$ and represents a hierarchical decomposition of $T$ into connected subgraphs (called clusters). Each cluster overlaps with other clusters in at most two nodes. Every leaf in $\mathcal{T}$ corresponds to a cluster consisting of a single edge in $T$ and every internal node in $\mathcal{T}$ corresponds to a merge of two clusters. The top tree $\mathcal{T}$ is then compressed using the classical DAG compression resulting in the top DAG $\mathcal{TD}$. The top DAG supports basic navigational queries on $T$ in $O(\log n)$ time, has size $O(n/\log_\sigma n)$, can compress exponentially better than DAG compression, and is never worse than DAG compression by more than an $O(\log n)$ factor [39, 19, 16, 29].
Our main technical contribution is implementing prefix search optimally on the top DAG. To this end, we develop several optimal pointer machine data structures of independent interest:
- A data structure for the path extraction problem, that asks to compactly represent an edge-labeled tree $T$ such that given a node $v$ we can efficiently return the labels on the root-to-$v$ path in $T$. While an optimal solution for this problem can be obtained by plugging in known tools, more specifically a fully persistent queue [38], we believe that our self-contained solution is simpler and more elegant.
- A data structure for the weighted level ancestor problem, that asks to compactly represent an edge-weighted tree $T$ such that given a node $v$ and a positive number $x$ we can efficiently return the rootmost ancestor of $v$ whose distance from the root is at least $x$. An immediate implication of our weighted level ancestor data structure is an optimal data structure for the random access problem on grammar compressed strings. This improves a SODA’11 result [20] that required word RAM bit tricks.
- A data structure for the spine path extraction problem, that asks to compactly represent a top-tree compression $\mathcal{TD}$ such that given a cluster $C$ we can efficiently return the characters of the unique path between the two boundary nodes of $C$.
- For the lower bound, we show that any algorithm that given a string checks if needs to perform comparisons in the worst case. We then show that when this implies the bound for our problem and when it implies the bound for our problem.
1.5 Roadmap
In Section 2 we recall top trees and how a top tree of a tree $T$ is obtained by merging (either vertically or horizontally) the top trees of two subtrees of $T$ that overlap on a single node. In Section 3 we present a simple randomized Monte-Carlo word RAM solution to the compressed string indexing problem that is the basis of our deterministic pointer machine solutions in the following sections. The solution is based on top trees and efficiently handles horizontal merges (deterministically) and vertical merges (randomized Monte-Carlo). In Section 4 we show how to handle vertical merges deterministically on a pointer machine, and in Section 5 we show that this suffices to achieve the $O(m + \log n)$ query time in Theorem 1. We show a different way to handle vertical merges in Section 6 and horizontal merges in Section 7. In Section 8 we show that these suffice to achieve the $O(m\log \sigma)$ query time in Theorem 1. Finally, in Section 9 we give a matching lower bound showing that the query time in Theorem 1 is optimal regardless of the size of the structure.
2 Preliminaries
In this section we briefly review Karp-Rabin fingerprints [43], top trees [3], and top tree compression [19].
2.1 Karp-Rabin Fingerprints
The Karp-Rabin fingerprint [43] of a string $x$ is defined as $\phi(x) = \sum_{i=1}^{|x|} x[i] \cdot c^{i} \bmod p$, where $c$ is a randomly chosen positive integer, and $p$ is a prime. Karp-Rabin fingerprints guarantee that given two strings $x$ and $y$, if $x = y$ then $\phi(x) = \phi(y)$. Furthermore, if $\phi(x) = \phi(y)$, then with high probability $x = y$. Fingerprints can be composed and subtracted as follows.
Lemma 1
Let $x = yz$ be a string decomposable into a prefix $y$ and suffix $z$. Given any two of the Karp-Rabin fingerprints $\phi(x)$, $\phi(y)$ and $\phi(z)$, it is possible to calculate the remaining fingerprint in constant time.
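The composition property in Lemma 1 can be illustrated with a short sketch. Several choices here are assumptions made for the example and are not taken from the paper: the names `P_MOD`, `BASE`, `prefix_fingerprints`, `substring_fingerprint`, and `concat_fingerprint` are hypothetical, the base is fixed instead of random so the example is reproducible, and the Horner-style polynomial $\sum_i x[i] \cdot c^{|x|-i} \bmod p$ is used because it composes without modular inverses. It may order the exponents differently from the definition above.

```python
P_MOD = (1 << 61) - 1   # a Mersenne prime, assumed large enough for the demo
BASE = 1_000_003        # fixed for reproducibility; should be chosen at random


def prefix_fingerprints(s):
    """Return (fp, pw) with fp[i] = fingerprint of s[:i] and pw[i] = BASE**i mod p."""
    fp, pw = [0], [1]
    for ch in s:
        fp.append((fp[-1] * BASE + ord(ch)) % P_MOD)
        pw.append(pw[-1] * BASE % P_MOD)
    return fp, pw


def substring_fingerprint(fp, pw, i, j):
    """Fingerprint of s[i:j], obtained by subtracting the prefix s[:i] from s[:j]."""
    return (fp[j] - fp[i] * pw[j - i]) % P_MOD


def concat_fingerprint(f_y, f_z, len_z, pw):
    """Fingerprint of yz from the fingerprints of y and z (composition)."""
    return (f_y * pw[len_z] + f_z) % P_MOD
```

After the linear-time pass in `prefix_fingerprints`, each substring fingerprint costs a constant number of arithmetic operations, which is how the search in Section 3 compares a piece of the pattern against a stored spine fingerprint.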
2.2 Clustering
Let $v$ be a node in $T$ with children $v_1, \ldots, v_k$ in left-to-right order. Define $T(v)$ to be the subtree induced by $v$ and all proper descendants of $v$. Define $F(v)$ to be the forest induced by all proper descendants of $v$. For $1 \leq s \leq r \leq k$ let $T(v, v_s, v_r)$ be the connected component induced by the nodes $\{v\} \cup T(v_s) \cup \cdots \cup T(v_r)$.
A cluster with top boundary node $v$ is a connected component of the form $T(v, v_s, v_r)$, $1 \leq s \leq r \leq k$. A cluster with top boundary node $v$ and bottom boundary node $u$ is a connected component of the form $T(v, v_s, v_r) \setminus F(u)$, $1 \leq s \leq r \leq k$, where $u$ is a node in $T(v_s) \cup \cdots \cup T(v_r)$. We denote the top boundary node of a cluster $C$ by $\mathrm{top}(C)$. Clusters can therefore have either one or two boundary nodes. For example, let $p(v)$ denote the parent of $v$; then a single edge $(v, p(v))$ of $T$ is a cluster where $p(v)$ is the top boundary node. If $v$ is a leaf then there is no bottom boundary node, otherwise $v$ is a bottom boundary node. Nodes that are not boundary nodes are called internal nodes. The path between the top and bottom boundary nodes in a cluster $C$ is called the cluster’s spine, and the string obtained by concatenating the labels on the spine from top to bottom is denoted $\mathrm{spine}(C)$.
Two edge disjoint clusters $A$ and $B$ whose vertices overlap on a single boundary node can be merged if their union $C = A \cup B$ is also a cluster. There are five ways of merging clusters (see Figure 1). Merges of type (a) and (b) are called vertical merges ($C$ is then a vertical cluster) and can be done only if the common boundary node is not a boundary node of any other cluster except $A$ and $B$. Merges of type (c), (d), and (e) are called horizontal merges ($C$ is then a horizontal cluster) and can be done only if at least one of $A$ or $B$ does not have a bottom boundary node.
2.3 Top Trees
A top tree $\mathcal{T}$ of $T$ is a hierarchical decomposition of $T$ into clusters. It is an ordered, rooted, labeled, and binary tree defined as follows (see Figure 2(a)-(c)).
- The nodes of $\mathcal{T}$ correspond to clusters of $T$.
- The root of $\mathcal{T}$ corresponds to the cluster $T$ itself. The top boundary node of the root of $\mathcal{T}$ is the root of $T$.
- The leaves of $\mathcal{T}$ correspond to the edges of $T$. The label of each leaf is the label of the corresponding edge in $T$.
- Each internal node of $\mathcal{T}$ corresponds to the merged cluster of its two children. The label of each internal node is the type of merge it represents (out of the five merging options). The children are ordered so that the left child is the child cluster visited first in a preorder traversal of $T$.
Lemma 2 (Alstrup et al. [3])
Given a tree $T$ of size $n$, we can construct in $O(n)$ time a top tree $\mathcal{T}$ of $T$ that is of size $O(n)$ and height $O(\log n)$.
2.4 Top Dags
Every labeled tree can be represented with a directed acyclic graph (DAG) by identifying identical rooted subtrees and replacing them with a single copy. The top DAG of $T$, denoted $\mathcal{TD}$, is the minimal DAG representation of the top tree $\mathcal{T}$ of $T$. We can compute it in $O(n)$ time from $\mathcal{T}$ [28]. (Here we use edge labels instead of node labels. The two definitions are equivalent and edge labels are more natural for tries.) Top DAGs have important properties for compression and computation [19, 16, 39, 29]. We need the following optimal worst-case compression bound.
Lemma 3 (Dudek and Gawrychowski [29])
Given an ordered tree $T$ with $n$ nodes over an alphabet of size $\sigma$, we can construct a top DAG $\mathcal{TD}$ in $O(n)$ time of size $O(n/\log_\sigma n)$.
3 A Simple Index
We first present a simple randomized Monte-Carlo word RAM string index, that will be the starting point for our deterministic, comparison-based pointer machine solution in the later sections.
3.1 Data Structure
Let $T$ be the trie of the strings in $S$ and let $\mathcal{TD}$ be the corresponding top DAG of $T$. Our data structure augments $\mathcal{TD}$ with additional information. For each cluster $C$ in $\mathcal{TD}$ we store the following information.
- If $C$ is a leaf cluster representing an edge $e$, we store the label of $e$.
- If $C$ is an internal cluster with left and right child $A$ and $B$, we store the label of the edge to the rightmost child of the top boundary node $\mathrm{top}(C)$ in $A$, the fingerprint $\phi(\mathrm{spine}(A))$, and the length $|\mathrm{spine}(A)|$.
This requires constant space for each cluster and hence $O(|\mathcal{TD}|)$ space in total.
3.2 Searching
Given a pattern $P$ of length $m$ we find the longest matching prefix of $P$ in $T$, i.e., the unique node in $T$ whose path from the root matches the longest prefix of $P$, as follows. First, compute and store all fingerprints of prefixes of $P$ in $O(m)$ time and space. By Lemma 1, we can then compute the fingerprint of any substring of $P$ in $O(1)$ time.
Next, we traverse $\mathcal{TD}$ top-down while matching $P$. Initially, we search for $P$ starting at the root of $\mathcal{TD}$. Suppose we have reached cluster $C$ and have matched $P[1,i]$. If $i = m$ we return $\mathrm{top}(C)$. Otherwise ($i < m$) there are three cases:
Case 1: $C$ is a leaf cluster.
Let $e = (u, v)$ be the edge stored in $C$, where $u$ is the parent of $v$. We compare $P[i+1]$ with the label of $e$. We return $v$ if they match and otherwise $u$.
Case 2: $C$ is a horizontal cluster.
Let $A$ and $B$ be the left and right child of $C$, respectively. We compare $P[i+1]$ with the label $\alpha$ of the edge to the rightmost child of $\mathrm{top}(C)$ in $A$. If $P[i+1] \leq \alpha$, we continue the search in $A$ for $P[i+1,m]$. Otherwise, we continue the search in $B$ for $P[i+1,m]$.
Case 3: $C$ is a vertical cluster.
Let $A$ and $B$ be the left and right child of $C$, respectively. If $|\mathrm{spine}(A)| > m - i$ we continue the search in $A$ for $P[i+1,m]$. Otherwise, we compare the fingerprint $\phi(\mathrm{spine}(A))$ with $\phi(P[i+1, i+|\mathrm{spine}(A)|])$. If they match, we continue the search in $B$ for $P[i+|\mathrm{spine}(A)|+1, m]$. Otherwise, we continue the search in $A$ for $P[i+1,m]$.
Lemma 4
The algorithm correctly computes the longest matching prefix of in .
*Proof. * We show by induction that at cluster the prefix matches the path from the root of to and . If is the root of the empty path to matches the empty prefix and . Inductively, suppose matches the path from the root to and . If the longest prefix is thus and . In each case, the algorithm maintains the invariant. The algorithm greedily matches as many characters from as possible, and hence at the end of the traversal the algorithm has found the longest matching prefix of .
Next consider the running time. We compute all fingerprints of prefixes of $P$ in $O(m)$ time. Each step of the top-down traversal requires constant time and since the depth of $\mathcal{TD}$ is $O(\log n)$ the total time is $O(m + \log n)$. In summary, we have the following theorem.
Theorem 3
Let $S$ be a set of strings of total length $n$, and let $\mathcal{TD}$ be the corresponding top DAG for the trie of $S$. On a word RAM model of computation, we can solve the compressed string indexing problem in $O(|\mathcal{TD}|)$ space and $O(m + \log n)$ time for any pattern of length $m$. The solution is randomized Monte-Carlo.
In the next sections we show how to convert the above algorithm from a randomized algorithm on a word RAM machine into a deterministic algorithm on a pointer machine. We note that Theorem 3 and our subsequent solutions can be extended to other variants of prefix queries, such as counting queries, that return the number of occurrences of . To do so, we store the size of each cluster in and use the above top-down search modified to also record the highest cluster whose top boundary is . Since the size of is the number of occurrences of , we obtain a solution that also supports counting within the same complexities. From we can also support reporting queries, that return the strings in with prefix , by simply decompressing incurring additional linear time in the lengths of the strings with matching prefix.
4 Spine Extraction
We first consider how to handle vertical clusters (Case 3) deterministically on a pointer machine. The key challenge is to efficiently extract the characters on the spine path of a vertical cluster from top to bottom without decompressing the whole cluster. We will use this to efficiently compute longest common prefixes between spine paths and substrings of $P$ in order to achieve $O(m + \log n)$ total time.
Given the top DAG $\mathcal{TD}$, the spine path extraction problem is to compactly represent $\mathcal{TD}$ such that given any vertical cluster $C$ we can return the characters of $\mathrm{spine}(C)$. We require that the characters are reported online and from top-to-bottom, that is, the characters must be reported in sequence and we can stop extraction at any point in time. The goal is to obtain a solution that is efficient in the length of the reported prefix. In the following sections we show how to solve the problem in $O(|\mathcal{TD}|)$ space and $O(m + \log n)$ total time over all spine path extractions.
We present a new data structure derived from the top DAG called the vertical top DAG and show how to use this to extract characters from a spine path. We then use this to compute the longest common prefixes between a spine path and any string and plug this in to the top down traversal in the simple solution from Section 3 to obtain Theorem 1.
4.1 Vertical Top Forest and Vertical Top DAG
The vertical top forest of is a forest of ordered, rooted, and labeled binary trees. The nodes in are all the vertical clusters of and the leaf clusters of that correspond to edges of a spine path of some cluster in . The edges of are defined as follows. A cluster of type (a) with children and in has two children in . The left and right children are the unique vertical or leaf descendants of in whose spine path is and , respectively. A cluster of type (b) with children and in has a single child in , which is the unique vertical or leaf descendant of in whose spine path is . See Figure 3(a). We have the following correspondence between spine paths and subtrees in .
Lemma 5
Let be a vertical merge in and be the leaves of . Then, are the edges on and . Furthermore, the left-to-right ordering of corresponds to the top-down ordering of the edges on .
*Proof. * By definition of and the ordering of children in and it follows that the edges on the spine in top-down order are the leaves in left-to-right order. A cluster of type (b) in has a child that is either a leaf or a cluster of type (a). All clusters of type (a) have two children and hence .
For instance in Figure 3(a), the descendant leaves of are , , in left-to-right ordering corresponding to the edges in the spine of in Figure 2(b).
The vertical top DAG is the DAG obtained by merging identical subtrees of according to the DAG compression of . See Figure 3(b).
4.2 Spine Extraction
We now show how to solve spine path extraction using the vertical top DAG . The key idea is to simulate a depth-first left-to-right order traversal of using a recursive traversal of . In order to use spine path extraction to search for a pattern we also need to be able to continue the search in some horizontal cluster of the top DAG after extracting characters on the spine. We will therefore define what we call a vertical exit cluster, from which we can quickly find the cluster to continue the search from.
Define the vertical exit cluster, , for at position , to be the lowest common ancestor of leaves and in . Intuitively, if we have extracted the first characters of , then is the cluster such that all leaves in the left subtree have been extracted and only one leaf in the right subtree (corresponding to the th character) has been extracted. Our goal is to implement spine path extraction in time . This will yield a telescoping sum when doing multiple extractions.
Our data structure consists of the vertical top DAG . We augment each internal cluster by the label of the first edge on its spine path and each leaf cluster by the label of the stored edge. This uses space.
Given a cluster we implement spine path extraction by simulating a depth-first left-to-right order traversal of using a recursive traversal of . To extract the first character we return the stored label at . Suppose we have extracted characters, . To extract the next character continue the simulated depth-first search until we reach a cluster in whose leftmost leaf is the th leaf of . Return the character stored at and the parent of in as . (Note the parent of is the cluster visited right before in the simulated depth-first search.)
By Lemma 5, the algorithm correctly solves spine path extraction and the total time to extract characters is . We need a stack to keep track of the current search path in the traversal using space. In summary, we have the following lemma.
Lemma 6
Let be the vertical top DAG. We can represent in space such that given a vertical cluster , we can support spine path extraction on in time, where is the length of the extracted prefix of .
Note that we can use Lemma 6 to compute the longest common prefix of and any string by reporting the characters on the spine path from top-to-bottom and comparing them with the string until we get a mismatch. This uses time, where is the length of the longest common prefix.
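The online extraction and the longest common prefix computation described above can be sketched as a lazy depth-first traversal of a binary DAG. This is a simplified illustration under stated assumptions: `Cluster`, `spine_chars`, and `lcp_with_spine` are hypothetical names, shared subtrees are modelled simply by reusing the same Python object, and the sketch omits both the first-edge labels stored at internal clusters and the bookkeeping for the vertical exit cluster.

```python
class Cluster:
    """Node of a (vertical) top DAG: a leaf holds an edge label,
    an internal node has a left child and optionally a right child."""

    def __init__(self, label=None, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right


def spine_chars(cluster):
    """Yield the leaf labels below `cluster` in left-to-right order, lazily.

    An explicit stack holds the current search path, so a caller that stops
    early pays only for the part of the traversal it actually consumed.
    """
    stack = [cluster]
    while stack:
        node = stack.pop()
        if node.left is None and node.right is None:
            yield node.label
            continue
        if node.right is not None:
            stack.append(node.right)
        if node.left is not None:
            stack.append(node.left)


def lcp_with_spine(cluster, text):
    """Length of the longest common prefix of the spine string and `text`."""
    matched = 0
    for ch in spine_chars(cluster):
        if matched == len(text) or text[matched] != ch:
            break
        matched += 1
    return matched
```

With leaves `a` and `b`, a cluster `ab = Cluster(left=a, right=b)`, and two further levels that each reuse their child twice, the spine string is `abababab` while the DAG has only five nodes. Comparing it against `"ababx"` stops after four matches without visiting the second half of the spine.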
5 An $O(m + \log n)$ Time Solution
We now plug in our spine path extraction algorithm from Section 4 into the simple algorithm from Section 3.
Define the horizontal entry cluster for a vertical cluster , denoted , to be the highest horizontal cluster or leaf cluster in that contains all edges from to children within . For a horizontal cluster or a leaf the horizontal exit cluster is the cluster itself. Note is the highest horizontal cluster or leaf cluster on the path from to the leftmost leaf of .
Our data structure consists of the data structures from Section 3 without fingerprints and Section 4. This uses space. To search for a string of length , we use the same algorithm as in Section 3, but with the following new implementation of the vertical merges.
Case 3: $C$ is a vertical cluster.
Recall we have reached a vertical cluster and have matched prefix . We check if the first character on matches . If it does not, we continue the algorithm from . If it does, we extract characters from in order to compute the length of the longest common prefix of and and the corresponding vertical exit cluster . Let be the right child of in . We traverse the leftmost path from to find and continue the search for from there.
Lemma 7
The algorithm correctly computes the longest matching prefix of in .
*Proof. * We show by induction that at cluster the prefix matches the path from the root of to and . If is the root of the empty path to matches the empty prefix and . Inductively, suppose matches the path from the root to and . If the longest prefix is thus and . Correctness of Case 1 and Case 2 follows from Lemma 4.
Consider Case 3 and let and be as in the description. By induction and correctness of spine extraction it follows that matches the path from the root of to . By induction and thus is a descendant of in . Since is not a boundary node in it follows that all ancestors of in contains exactly the same edges out of as . Hence, .
Consider the time used in a vertical step from a cluster . The time to compute the longest common prefix computation extracting characters and walking to the corresponding horizontal entry cluster is . Hence, if we have vertical steps from clusters extracting characters ending in , respectively, we use time
[TABLE]
This follows from the fact that and all lie on the same root-to-leaf path in and that . As in Section 3, the total time used at horizontal merges is , as all lie on the same root-to-leaf path in and we only walk down in the tree during the horizontal merges. This concludes the proof of the query time in Theorem 1.
6 Spine Path Extraction with Constant Overhead
Next, we show how to achieve the $O(m\log \sigma)$ query time in Theorem 1. Our current solutions for horizontal merges (Case 2) from Section 3 and vertical merges (Case 3) from Section 5 both require $\Omega(\log n)$ time and hence we need new techniques for both cases to achieve the $O(m\log \sigma)$ time bound. We consider vertical merges in this section and horizontal merges in the next section.
In this section, we improve the total time used on spine extraction to optimal $O(m)$ time. To do so we first introduce and present a novel solution to a new path extraction problem on trees in Section 6.1 and then show how to use this to extract characters from the spine in Section 6.2.
6.1 Path Extraction in Trees
Given a tree $T$ with $n$ nodes, the path extraction problem is to compactly represent $T$ such that given a node $v$ we can return the nodes on the path from the root of $T$ to $v$ in constant time per node. We require that the nodes are reported online and from top-to-bottom, that is, the nodes must be reported in sequence and we can stop the extraction at any point in time. The ordering of the nodes from top to bottom is essential. The other direction (from $v$ to the root) is trivial since we can simply store parent pointers and traverse that path using linear space and constant time per node. If we allow word RAM tricks then we can easily solve the problem in the same bounds by using an existing level ancestor data structure [12, 2, 27]. We present an optimal solution that does not use word RAM tricks and works on a pointer machine. As mentioned in the introduction, an optimal solution can also be obtained by plugging in known tools, but we believe that our method is simpler and more elegant.
Let and be the distance from to the root and to deepest leaf in ’s subtree, respectively. Decompose into a top part consisting of nodes , such that , and a bottom part consisting of the remaining nodes. For each leaf in we store the path from the root of to explicitly in a linked list sorted by increasing depth. (see Figure 4). Note that multiple copies of the same node may be stored across different lists. Each such path to a leaf uses space, and hence the total space for all paths in is
[TABLE]
where the first equality follows by definition of the decomposition and the second follows since the longest paths from a descendant leaf in to a leaf in are disjoint for all the leaves in . For all internal nodes in we store a pointer to a leaf below it. For all nodes in we store a pointer to the unique ancestor that is a leaf in . We answer a path extraction query for a node as follows. If is in we follow the leaf pointer and output the path stored in this leaf from the root until we reach . If is in we jump to the unique ancestor leaf of in . We extract the path from the root to , while simultaneously following parent pointers from until we reach , storing these nodes on a stack. That is, each time we extract a node from the root-to- path we follow a parent pointer and put the next node on the stack. We stop pushing nodes to the stack when we reach . When we have output all nodes from the root to the leaf in we output the nodes from the stack. Since the path from the root to is at least as long as the path from to plus 1, the stack is complete at that point, and therefore the whole path is extracted. We spend time per node and hence we have the following result.
Lemma 8
Given a tree with nodes, we can solve the path extraction problem in linear space and preprocessing and constant time per reported node.
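The construction behind Lemma 8 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the tree is given as a hypothetical `parent` array with `-1` at the root, Python lists stand in for linked lists and the stack, and the names `build_path_extraction`, `root_path`, `anchor` and `leaf_below` are invented for the example. The sketch takes the top part to be the nodes whose depth is at most their height, resolves a bottom node to its lowest ancestor in the top part, and reads the stored list of some top leaf below that ancestor.

```python
def build_path_extraction(parent):
    """parent[v] is the parent of node v; the root has parent -1 (assumed input format)."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for v, p in enumerate(parent):
        if p >= 0:
            children[p].append(v)
    order, depth = [parent.index(-1)], [0] * n
    i = 0
    while i < len(order):                 # breadth-first order, parents first
        for c in children[order[i]]:
            depth[c] = depth[order[i]] + 1
            order.append(c)
        i += 1
    height = [0] * n
    for v in reversed(order):             # children before parents
        if parent[v] >= 0:
            height[parent[v]] = max(height[parent[v]], height[v] + 1)
    in_top = [depth[v] <= height[v] for v in range(n)]

    leaf_below = [None] * n   # top node -> some leaf of the top part below it
    anchor = [None] * n       # bottom node -> lowest ancestor in the top part
    for v in reversed(order):
        if in_top[v]:
            top_kids = [c for c in children[v] if in_top[c]]
            leaf_below[v] = leaf_below[top_kids[0]] if top_kids else v
    for v in order:
        if not in_top[v]:
            anchor[v] = parent[v] if in_top[parent[v]] else anchor[parent[v]]

    paths = {}                # top leaf -> explicit root-to-leaf path
    for v in range(n):
        if in_top[v] and leaf_below[v] == v:
            path, w = [], v
            while w != -1:
                path.append(w)
                w = parent[w]
            paths[v] = path[::-1]
    return {"parent": parent, "in_top": in_top, "anchor": anchor,
            "leaf_below": leaf_below, "paths": paths}


def root_path(pe, v):
    """Yield the nodes on the root-to-v path from top to bottom, one at a time."""
    parent = pe["parent"]
    u = v if pe["in_top"][v] else pe["anchor"][v]
    stack, w = [], v
    for x in pe["paths"][pe["leaf_below"][u]]:
        if w != u:            # one parent-pointer step per reported node
            stack.append(w)
            w = parent[w]
        yield x
        if x == u:
            break
    assert w == u             # the root-to-u path is long enough to fill the stack
    while stack:
        yield stack.pop()
```

Because `root_path` is a generator, a caller can stop consuming nodes at any point, which mirrors the online requirement of the problem. Each reported node costs a constant number of pointer steps in this sketch.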
6.2 Optimal Spine Path Extraction
We plug the path extraction solution into our depth-first search traversal of the vertical top DAG to speed up spine extraction and longest common prefix computation. Recall that given a vertical cluster , our goal is to simulate a depth-first left-to-right order traversal of the subtree using the vertical top DAG .
We construct the left-path suffix forest of as follows. The nodes of are the nodes of . If has a left child in then is the parent of in . Hence, any leftmost path in corresponds to a path from a node to an ancestor of the node in . We now store with the path extraction data structure from Lemma 8. We implement the depth-first traversal as before except that whenever the traversal reaches an unexplored cluster in we begin path extraction for that cluster corresponding to the path from to the leftmost descendant leaf . We extract the leaf and then continue the depth-first traversal from there. Hence, the current search path of the depth-first traversal is partitioned into an alternating sequence of leftmost paths and right edges. Whenever we need to go up on a left edge in the traversal we extract the next node for the corresponding path extraction instance.
To extract the topmost characters of we now use constant time to find the leftmost descendant leaf of and then time to traverse the first leaves. Hence, we improve the time from to . At any point during the traversal we maintain ongoing path extraction instances along the current search path. The stack that each of these needs has size at most linear in the length of its corresponding subpath of the search path, and hence this requires at most extra space.
Lemma 9
We can represent the vertical top DAG in space such that given a vertical cluster , we can support spine path extraction on in time, where is the length of the extracted prefix of .
7 Horizontal Access
We now show how to efficiently handle horizontal merges (Case 2). In the simple algorithm from Section 3 we use constant time at each horizontal merge leading to an total time solution. Since we cannot afford time we instead show how to handle all horizontal merges in time. The key idea is to convert the problem into a variant of the random access problem for grammar compressed strings, and then design a linear-space logarithmic-query solution to the random access problem. We describe the random access problem in Section 7.1, present our solution to it in Section 7.2, introduce the horizontal top DAG in Section 7.3, and define and solve the horizontal access problem in Section 7.4.
7.1 Grammars and Random Access
Grammar-based compression replaces a long string by a small context-free grammar (CFG) . We view a grammar as a DAG, where each node is a grammar symbol and each rule defines directed ordered edges from the righthand side to the lefthand side. Given a node in , we define to be the parse tree rooted at and to be the string consisting of the leaves of in left-to-right order. Note that given a rule we have that , where denotes concatenation. Given a grammar representing a string , the random access problem is to compactly represent while supporting fast access queries, that is, given an index in report . Bille et al. [20] showed how to do random access in time using space on a pointer machine model. (Footnote 3: Here, for any constant , denotes the inverse of the row of Ackermann’s function, defined as so that , , , and so on.) Furthermore, given a node in , access queries can be supported on the string in time .
For our purposes, we need to slightly extend this result to gapped grammars. A gapped grammar is a grammar except that each internal rule is now of the form , where is a non-negative integer called the gap. The string generated by is now and hence the resulting string generated is as before except for the inserted gaps of runs of 0’s. Note that . The above random access result is straightforward to generalize to gapped grammars:
Lemma 10 (Bille et al. [20])
Let be a string compressed into a gapped grammar of size . Given a node in , we can support random access queries in in time using space. The solution works on a pointer machine model of computation.
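To make the notion of a gapped grammar concrete, the following Python sketch stores a gapped grammar as a DAG and answers access queries by a plain top-down descent guided by stored lengths. It is only an illustration of the problem, not the data structure of Lemma 10: the descent takes time proportional to the depth of the parse tree rather than the bound stated in the lemma. The class names `Leaf` and `Rule`, the 0-based indexing, and the rule shape (left child, gap, right child) are assumptions of the example.

```python
class Leaf:
    """A terminal symbol generating a single character."""
    def __init__(self, ch):
        self.ch = ch
        self.length = 1


class Rule:
    """An internal gapped rule: left child, then `gap` zeros, then right child."""
    def __init__(self, left, gap, right):
        self.left, self.gap, self.right = left, gap, right
        self.length = left.length + gap + right.length


def expand(node):
    """Decompress the whole string generated by `node` (for checking only)."""
    if isinstance(node, Leaf):
        return node.ch
    return expand(node.left) + "0" * node.gap + expand(node.right)


def access(node, i):
    """Return character i (0-based) of the string generated by `node`."""
    if not 0 <= i < node.length:
        raise IndexError(i)
    while isinstance(node, Rule):
        if i < node.left.length:
            node = node.left                      # position lies in the left part
        elif i < node.left.length + node.gap:
            return "0"                            # position lies inside the gap
        else:
            i -= node.left.length + node.gap      # position lies in the right part
            node = node.right
    return node.ch


# Shared nodes make the grammar a DAG: x is used twice by y.
one = Leaf("1")
x = Rule(one, 2, one)      # generates "1001"
y = Rule(x, 1, x)          # generates "100101001"
print(access(y, 5))
```

Storing the length of the generated string at every node is what lets the descent decide between the left part, the gap, and the right part without decompressing anything.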
7.2 Horizontal Access in Linear Space
Bille et al. [20] further showed that the inverse-Ackermann factor in the space complexity of Lemma 10 can be removed if we assume a word RAM model of computation. In this section we show that this can also be achieved on a pointer machine. To this end, we need to replace a single component in the solution of Bille et al., their weighted level ancestor structure. In the weighted level ancestor problem, we are given a tree on nodes with positive weights on the edges. For every node , let be its distance to the root, and let be its parent. Then, the goal is to preprocess to answer the following weighted level ancestor queries: given a non-root node and a positive number , find an ancestor such that but .
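The query semantics can be pinned down with a naive reference implementation, which is useful as a test oracle for any faster structure. The sketch below is written under an explicit assumption about the condition lost in the text above: it returns the ancestor whose weighted distance to the root is at least the query number while the distance of its parent is strictly smaller. It walks parent pointers and therefore takes time linear in the depth, unlike the structure developed in this section. The names `parent`, `weight` and `weighted_level_ancestor` are hypothetical.

```python
def weighted_level_ancestor(parent, weight, v, x):
    """Naive reference query.

    parent[v] is the parent of v (-1 at the root) and weight[v] > 0 is the
    weight of the edge from v to parent[v] (ignored at the root).
    Assumed semantics: return the ancestor u of v (possibly v itself) with
    d(u) >= x and d(parent(u)) < x, where d is the weighted distance to the root.
    """
    # Compute d(v) by walking to the root.
    d, w = 0, v
    while parent[w] != -1:
        d += weight[w]
        w = parent[w]
    if not 0 < x <= d:
        raise ValueError("x must satisfy 0 < x <= d(v)")
    # Move up while the parent is still at distance at least x.
    u = v
    while d - weight[u] >= x:
        d -= weight[u]
        u = parent[u]
    return u


# A path 0 - 1 - 2 - 3 with edge weights 3, 2, 4, so d = 0, 3, 5, 9.
parent = [-1, 0, 1, 2]
weight = [0, 3, 2, 4]
print(weighted_level_ancestor(parent, weight, 3, 4))
```

Since the root has distance 0 and the query number is positive, the walk always stops at a non-root node.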
Without getting into the proof of Lemma 10, it suffices to say that (1) performing a random access query boils down to performing weighted level ancestor queries, and (2) in order for all these queries to be done in total time, the time for each weighted level ancestor query should be proportional to . Intuitively, we seek a position on an edge at distance from the root, and the longer the found edge is the smaller the query time should be. We next show how to achieve such query time using linear space on a pointer machine, implying an inverse-Ackermann factor improvement to Lemma 10.
Lemma 11
A tree on nodes can be preprocessed in space to answer a weighted level ancestor query for a node and a number in time, where is the found ancestor of .
*Proof.* We start with partitioning into slices. The slice, denoted , consists of all nodes such that . Observe that each is a collection of trees. For each node , we store a pointer to an arbitrary descendant such that no child of belongs to , denoted . In other words, is a leaf in its corresponding tree of (and also a descendant of that belongs to the same tree of ). To answer a query for a node and a number , we first replace with . This does not increase by more than 1 and, because we replace with its descendant, returns the same node. Thus, from now on we can assume that the input to a query is a node that is a leaf in its tree of . For each such node, we store a pointer to the highest ancestor of that still belongs to . To answer a query for a node that is a leaf in its tree of and a number , we then check the following three cases:
1. , then we repeat with replaced with .
2. and , then we return .
3. , then we search for the answer among the ancestors of in its tree of .
Observe that whenever Case 1 applies the value of decreases by at least 1, and so it is enough to show how to separately preprocess each tree of for weighted ancestor queries in time, where is the found node and is its distance to the root of the corresponding tree of (note that the maximum value of is ).
We can therefore focus on the following problem: preprocess a tree with a parameter such that for every for weighted ancestor queries in time, where is the found ancestor of , and is always a leaf. The preprocessing proceeds recursively. We first partition into the top part, denoted , and a collection of trees constituting the bottom part, denoted . A node belongs to when . Each leaf stores a pointer to its highest ancestor that still belongs to . Let denote the collection of trees obtained by removing all leaves from . Each leaf additionally stores a pointer to an arbitrary leaf in the subtree rooted at in , and a pointer to an arbitrary leaf in the subtree rooted at in . We apply the above construction recursively with a parameter on and on every tree of . See Figure 5 for an illustration.
To answer a query for a leaf and a number , we check the following four cases:
1. , then we repeat with replaced with in .
2. and , then we return .
3. , then we return .
4. and , then we repeat with replaced with and decreased by in the corresponding tree of .
The cases are not mutually exclusive as it might happen that . Correctness of Cases 2 and 3 is immediate. In Cases 1 and 4 we recurse while maintaining the invariant that is a leaf in the current tree, and the sought node is easily seen to belong to or (because we require we can indeed consider instead of ), respectively. In every recursive step, the value of decreases by 1. Also, if then after steps the edge from to cannot belong to the currently considered tree, and so there are at most steps making the query time as required. To analyze the space, we assume that the partition into and is only conceptual, and the stored information , and is associated with a node . Because the leaves of for which we need to store information are then removed and do not participate further in the construction, this is indeed possible and shows that the overall space is per node of . Finally, even though we have only described how to answer a query for a leaf , the query algorithm rewritten to use the information stored at nodes of behaves as if and hence is correct.
Corollary 1
Let be a string compressed into a gapped grammar of size . Given a node in , we can support random access queries in in time using space. The solution works on a pointer machine model of computation.
7.3 Horizontal Top Tree and Horizontal Top DAGs
Similar to the vertical top forest we define the horizontal top forest of as a forest of ordered and rooted trees that consists of all horizontal clusters of and leaves of whose top boundary is shared with a horizontal cluster. We define the edges in of in as follows. Let be a horizontal cluster with children and in . If is a horizontal cluster or a leaf then the left child of is , and if is a vertical cluster then the left child of is . Similarly, the right child of is either or . See Figure 3. We have the following property of .
Lemma 12
Let be a horizontal merge in . Then, the leaves of are the edges to children of the top boundary node of and the left-to-right ordering of the leaves corresponds to the left-to-right ordering of the children of in . All nodes in have as top boundary node.
*Proof.* By definition of and the ordering of the children in and it follows that the edges to children of the top boundary node of correspond to the leaves in in left-to-right order. Let be a horizontal cluster with children and in . Then . Furthermore, by definition . Hence, all nodes in have as top boundary node.
For instance in Figure 3(c) the descendant leaves of are , , , and in left to right ordering corresponding to the edges to the children of . Given the horizontal top forest we define the horizontal top DAG as the DAG obtained by merging the subtrees of according to the DAG compression of into .
7.4 Gapped Grammars and Horizontal Access
Let be an internal cluster in . The spine child of is the unique child of that contains the first edge of . A descendant cluster of is a spine descendant of if all clusters on the path from to are spine children of their parent. Define the horizontal exit cluster for a horizontal cluster and character , denoted , to be the highest cluster in that has the unique leaf in labeled as a spine descendant.
Given the horizontal top DAG , the horizontal access problem is to compactly represent such that given a horizontal merge and a character , we can efficiently determine if has an edge to a child labeled within and if so return the horizontal exit cluster . In this section, we show how to solve the horizontal access problem in space and time.
The characteristic vector of a cluster is a binary string encoding the labels of edges to children of . More precisely, given a character define as the rank of in the sorted order of characters of . Also, given a cluster in define to be the set of ranks of leaf labels in . We define the characteristic vector recursively as follows. If is a leaf cluster and if is an internal cluster with children , then where . Note that for any cluster . From the definition we have the following correspondence between the characteristic vector and the leaf labels of a cluster.
Lemma 13
Given a cluster in and a character , is a leaf label in iff .
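The following Python sketch illustrates how a characteristic vector can be assembled from gapped concatenations and then used for a membership test. It rests on assumptions that are not recoverable from the text above and are labeled as such: a leaf cluster contributes the single bit `1`, an internal cluster concatenates the vectors of its two children with a run of zeros between them whose length is the smallest rank on the right minus the largest rank on the left minus one, and a cluster is modeled as either an integer rank (leaf) or a pair of clusters. The names `char_vector` and `has_label` are hypothetical.

```python
def char_vector(cluster):
    """Return (vector, min_rank, max_rank) for a cluster.

    A cluster is an int (the rank of a leaf label) or a pair (left, right)
    whose leaf ranks are strictly increasing from left to right.
    """
    if isinstance(cluster, int):
        return "1", cluster, cluster
    left, right = cluster
    lvec, lmin, lmax = char_vector(left)
    rvec, rmin, rmax = char_vector(right)
    gap = rmin - lmax - 1                 # ranks missing between the two children
    assert gap >= 0, "children must be ordered by rank"
    return lvec + "0" * gap + rvec, lmin, rmax


def has_label(cluster, rank):
    """Membership test: is `rank` the rank of some leaf label in the cluster?"""
    vec, lo, hi = char_vector(cluster)
    return lo <= rank <= hi and vec[rank - lo] == "1"


# Leaves with ranks 1, 4, 5 and 9 grouped as ((1, 4), (5, 9)).
cluster = ((1, 4), (5, 9))
print(char_vector(cluster)[0])
```

In this sketch the vector is materialized as a string for readability. The point of the gapped grammar view is that the same bit can be read with a random access query on the compressed representation, without ever writing the vector out.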
Let be the root clusters of the trees in and note that if we add a virtual root cluster as the parent of , is a gapped parse tree for the string . Hence, the horizontal top DAG is a gapped grammar for the same string. By Lemma 13 we can determine if there is an edge labeled out of in using a random access query on the corresponding gapped grammar using time . If this edge exists, we can also find in the same time using similar ideas. More precisely, we have the following result.
Lemma 14
Given a cluster in and a character we can solve the horizontal access problem in space and time.
*Proof.* By construction the characteristic vector of has length at most . Hence, by Corollary 1, we can determine if there is an edge out of in using space and time. If this is the case, we need to find in the same complexity. To do so, we augment the random access result of Corollary 1 as follows.
We need the following definitions from Bille et al. [20] applied to to explain the approach. The heavy-path decomposition of partitions into heavy and light edges with the property that any root-to-leaf path in is decomposed into an alternating sequence of heavy paths and single light edges. The heavy-path suffix forest of compactly encodes the heavy paths of in space and has the property that a subpath of a heavy path in uniquely corresponds to a path from a node to an ancestor of in . Our random access solution from Corollary 1 on solves weighted ancestor queries on using Lemma 11 and computes the alternating sequence of heavy subpaths and single light edges from to the leaf cluster containing the edge labeled .
We construct a new contracted forest from as follows. Imagine we mark all the edges going to non-spine children in . Then, is the highest descendant of whose path to the leaf containing consists only of unmarked edges. Now mark the corresponding edges in and construct by contracting all unmarked edges. The weight of a contracted node is the weight of the highest of its included nodes in . A weighted ancestor query on now identifies the node corresponding to the lowest horizontal entry cluster on the heavy path in . Since we contract edges in and reweigh them by adding the contracted edges, the time for the weighted ancestor query is no more than the time for the corresponding query in .
To find , we traverse the alternating sequence of heavy paths and light edges from top-to-bottom in to find the lowest marked edge whose lowest endpoint is . At each heavy path we use a weighted ancestor query and at each light edge we simply check if it is marked. In total, this takes time.
8 An $O(m + \log n)$ Solution
We can now plug in the spine extraction from Section 6.2 and the horizontal access from Section 7 into the simple algorithm from Section 3. Define the vertical entry cluster for a horizontal cluster , denoted , to be the highest vertical cluster or leaf cluster in that contains the first edge on .
Our data structure consists of the data structure from Section 6.2 for spine path extraction and the data structure from Section 7.3 for horizontal access. Furthermore, we store for each vertical cluster in a pointer to its horizontal entry cluster and for each horizontal cluster a pointer to its vertical entry cluster. In total this uses space.
To search we alternate between horizontal accesses using Lemma 14 and spine path extractions using Lemma 9. Instead of traversals to find entry clusters we jump directly using the new pointers. Specifically, we have the following modified algorithm:
Initially, we search for starting at the root of . Suppose we have reached cluster and have matched . If we return . Otherwise () there are three cases:
Case 1: is a leaf cluster.
Let be the edge stored in . We compare with the label of . We return if they match and otherwise .
Case 2: is a horizontal cluster.
Compute . If does not match return . Otherwise, continue the search for from .
Case 3: is a vertical cluster.
We check if the first character on matches . If it does not we continue the algorithm from . Otherwise, we extract characters from in order to compute the length of the longest common prefix of and and the corresponding vertical exit cluster . Continue the search for from .
Lemma 15
The algorithm correctly computes the longest matching prefix of in .
*Proof.* We show by induction that at cluster the prefix matches the path from the root of to and . If is the root of the empty path to matches the empty prefix and . Inductively, suppose matches the path from the root to and . If the longest prefix is thus and . Correctness of Case 1 and Case 3 follows from Lemma 7.
Consider Case 2. There are two cases. If does not match, then by induction and we are done.
Otherwise, has an edge to a child labeled in and is a descendant of . Let be as in the description. By Lemma 12 and by the definition of we have . Hence, by induction matches the path from the root of to . Recall, that the horizontal exit cluster is the highest horizontal cluster in that has as a spine descendant. Hence, every cluster on the path from to has and all descendants of in as internal nodes. In particular, and hence by definition .
Consider the alternating sequence of horizontal accesses and spine extractions. Each time we go from a horizontal access to a spine extraction the current character of must match the first character on the spine. Hence, each horizontal access is on a distinct character of and the total number of horizontal accesses is at most . By Lemma 14 it follows that the total time for horizontal accesses is . Since the sequence is alternating the number of spine extractions is at most . Hence, by Lemma 9 the total time for spine extractions is at most . This concludes the proof of the query time in Theorem 1.
9 Lower Bound
In this section we prove Theorem 2. Namely, we show that any structure storing a set of strings of total length over an alphabet of size needs to perform comparisons to decide if a given pattern string belongs to . Every comparison should be of the form “”, where is a character. Note that the size of the structure is irrelevant for us. We start with a technical lemma that is the gist of our lower bound.
Lemma 16
For any and , any comparison-based algorithm that given a string over an alphabet of size checks if needs to perform comparisons in the worst case.
*Proof.* The number of strings over an alphabet of size such that is at least . Consider the decision tree corresponding to a comparison-based algorithm that decides if using fewer than comparisons in the worst case. Each node of corresponds to a subset of possible inputs of the form , in particular the root of corresponds to and its leaves correspond to disjoint subsets of inputs for which the answer is the same (yes or no) that together cover the whole . Because the depth of is assumed to be less than , contains fewer than leaves, so there exists a leaf corresponding to a subset of inputs and two distinct strings and such that and for every . Because and are distinct, there exists such that , and without loss of generality . We define a new string by setting for every and . Then and for every , so the algorithm incorrectly decides that the answer for is the same as for .
We proceed to the main part of the lower bound. Fix , and . We consider two cases.
.
The set contains all strings such that . There are at most of such strings, and each of them is of length , making their total length at most . Any structure that stores and allows checking if a given pattern belongs to implies a comparison-based algorithm that checks if . By Lemma 16, this needs comparisons.
.
We choose the largest integer such that (by the assumption on and , ). Now the set contains all strings such that and . The total length of all such strings is at most . Any structure that stores and allows checking if a given pattern belongs to implies a comparison-based algorithm that checks if and additionally . When executed with the algorithm clearly needs to access every and so perform at least comparisons. Additionally, the algorithm can be converted into a procedure that given a pattern checks if , which by Lemma 16 requires comparisons. Combining these two lower bounds we obtain that comparisons are necessary. Rewriting the condition on and using the assumption that , we obtain , making our lower bound .
Combining the above two cases gives us a lower bound of , because depending on the value of we have a lower bound of either or , and thus the minimum of these two is always a correct lower bound. This proves Theorem 2.
References
- [1] Peyman Afshani, Lars Arge, and Kasper Green Larsen. Higher-dimensional orthogonal range reporting and rectangle stabbing in the pointer machine model. In Proc. 28th SoCG, pages 323–332, 2012.
- [2] Stephen Alstrup and Jacob Holm. Improved algorithms for finding level ancestors in dynamic trees. In Proc. 27th ICALP, pages 73–84, 2000.
- [3] Stephen Alstrup, Jacob Holm, Kristian De Lichtenberg, and Mikkel Thorup. Maintaining information in fully dynamic trees with top trees. ACM Trans. Algorithms, 1(2):243–264, 2005.
- [4] J-I Aoe. An efficient digital search algorithm by using a double-array structure. IEEE Trans. Soft. Eng., 15(9):1066–1077, 1989.
- [5] Julian Arz and Johannes Fischer. LZ-compressed string dictionaries. In Proc. 24th DCC, pages 322–331, 2014.
- [6] Julian Arz and Johannes Fischer. Lempel–Ziv-78 compressed string dictionaries. Algorithmica, pages 1–36, 2018.
- [7] Nikolas Askitis and Ranjan Sinha. Engineering scalable, cache and space efficient tries for strings. The VLDB Journal, 19(5):633–660, 2010.
- [8] Djamal Belazzougui, Paolo Boldi, and Sebastiano Vigna. Dynamic z-fast tries. In Proc. 17th SPIRE, pages 159–172, 2010.
