Dynamic Path-Decomposed Tries
Shunsuke Kanda, Dominik Köppl, Yasuo Tabei, Kazuhiro Morita, and Masao Fuketa

TL;DR
This paper introduces a dynamic, space-efficient keyword dictionary using path decomposition and compact hash tries, significantly reducing memory usage while maintaining performance.
Contribution
It presents a novel dynamic keyword dictionary based on path decomposition and compact hash tries, addressing the challenge of space efficiency in dynamic settings.
Findings
Requires up to 68% less space than existing solutions
Achieves a favorable space-time tradeoff
Effective on real-world datasets
| Dataset | Size | Keys | MinLen | MaxLen | AveLen | Alphabet size |
|---|---|---|---|---|---|---|
| GeoNames | 109 MiB | 7.3 M | 2 | 152 | 15.7 | 99 |
| AOL | 224 MiB | 10.2 M | 2 | 523 | 23.2 | 85 |
| Wiki | 286 MiB | 14.1 M | 2 | 252 | 21.2 | 200 |
| DNA | 189 MiB | 15.3 M | 13 | 13 | 13.0 | 16 |
| LUBMS | 3.1 GiB | 52.6 M | 10 | 80 | 63.7 | 57 |
| LUBML | 13.8 GiB | 230.1 M | 10 | 80 | 64.2 | 57 |
| UK | 2.7 GiB | 39.5 M | 17 | 2,030 | 72.4 | 103 |
| WebBase | 6.6 GiB | 118.2 M | 10 | 10,212 | 60.2 | 223 |
| | GeoNames | AOL | Wiki | DNA | LUBMS | LUBML | UK | WebBase |
|---|---|---|---|---|---|---|---|---|
| AveHeight of $\mathcal{T}^{c}_{\mathcal{S}}$ | 6.0 | 6.2 | 6.3 | 9.0 | 7.5 | 7.9 | 7.8 | 7.3 |
| AveHeightLB of $\mathcal{T}^{c}_{\mathcal{S}}$ | 5.2 | 5.2 | 5.3 | 8.9 | 6.6 | 7.4 | 6.0 | 6.2 |
| AveHeightUB of $\mathcal{T}^{c}_{\mathcal{S}}$ | 8.5 | 10.5 | 9.7 | 10.7 | 11.8 | 12.4 | 14.7 | 15.4 |
| AveHeight of $\mathcal{T}_{\mathcal{S}}$ | 15.7 | 23.2 | 21.2 | 13.0 | 63.7 | 64.2 | 72.4 | 60.2 |
| | Wiki Space | Wiki Time | Wiki Steps | LUBMS Space | LUBMS Time | LUBMS Steps | UK Space | UK Time | UK Steps |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 19.55% | 0.36 | 20.9 | 7.75% | 0.78 | 116.1 | 37.60% | 1.22 | 116.5 |
| 8 | 6.12% | 0.27 | 18.2 | 2.83% | 0.78 | 96.8 | 15.51% | 1.20 | 96.2 |
| 16 | 1.32% | 0.27 | 17.1 | 0.31% | 0.78 | 84.9 | 5.22% | 1.20 | 87.6 |
| 32 | 0.12% | 0.27 | 17.3 | 0.02% | 0.79 | 83.3 | 1.31% | 1.20 | 86.0 |
| 64 | 0.00% | 0.27 | 17.2 | 0.00% | 0.80 | 82.9 | 0.23% | 1.21 | 85.4 |
| 128 | 0.00% | 0.27 | 17.4 | 0.00% | 0.80 | 82.9 | 0.04% | 1.22 | 85.3 |
| 256 | 0.00% | 0.28 | 17.2 | 0.00% | 0.81 | 83.6 | 0.01% | 1.22 | 85.2 |
| 512 | 0.00% | 0.28 | 17.2 | 0.00% | 0.82 | 83.3 | 0.00% | 1.23 | 85.4 |
| 1024 | 0.00% | 0.28 | 17.3 | 0.00% | 0.83 | 83.1 | 0.00% | 1.24 | 85.4 |
| | 4 | 8 | 16 | 32 | 64 | 128 | AveNLL |
|---|---|---|---|---|---|---|---|
| GeoNames | 6.34% | 1.44% | 0.28% | 0.04% | 0.00% | 0.00% | 6.1 |
| AOL | 22.16% | 6.83% | 1.26% | 0.11% | 0.01% | 0.00% | 10.6 |
| Wiki | 19.55% | 6.12% | 1.32% | 0.12% | 0.00% | 0.00% | 8.7 |
| DNA | 0.11% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 1.4 |
| LUBMS | 7.75% | 2.83% | 0.31% | 0.02% | 0.00% | 0.00% | 3.7 |
| LUBML | 7.66% | 2.80% | 0.31% | 0.02% | 0.00% | 0.00% | 3.7 |
| UK | 37.60% | 15.51% | 5.22% | 1.31% | 0.23% | 0.04% | 18.0 |
| WebBase | 24.15% | 8.94% | 2.46% | 0.50% | 0.08% | 0.02% | 11.1 |
| | Space | Insert | Lookup |
|---|---|---|---|
| PDT-PB | 0.32 | 0.59 | 0.53 |
| PDT-SB | 0.12 | 1.20 | 0.75 |
| PDT-CB | 0.11 | 1.30 | 0.80 |
| PDT-PFK | 0.33 | 0.63 | 0.71 |
| PDT-SFK | 0.14 | 0.74 | 0.90 |
| PDT-CFK | 0.12 | 0.91 | 0.99 |
| STLHash | 0.58 | 0.44 | 0.24 |
| GoogleDense | 0.73 | 0.37 | 0.12 |
| Sparsepp | 0.42 | 0.58 | 0.15 |
| Hopscotch | 0.84 | 0.40 | 0.10 |
| Robin | 0.84 | 0.31 | 0.10 |
| ArrayHash | 0.30 | 0.56 | 0.12 |
| HAT | 0.18 | 0.47 | 0.22 |
| Judy | 0.32 | 0.76 | 0.57 |
| ART | 0.59 | 0.85 | 0.56 |
| Cedar-R | 0.47 | 0.68 | 0.39 |
| Cedar-P | 0.26 | 0.71 | 0.34 |
| PCT-Bit | 2.96 | 9.12 | 10.43 |
| PCT-Hash | 5.13 | 7.74 | 5.49 |
| ZFT | 1.13 | 3.25 | 2.42 |
| CTrie++ | 1.34 | 2.58 | 0.79 |
| | Space | Insert | Lookup |
|---|---|---|---|
| PDT-PB | 0.52 | 1.01 | 0.65 |
| PDT-SB | 0.25 | 1.78 | 0.86 |
| PDT-CB | 0.21 | 1.87 | 0.90 |
| PDT-PFK | 0.52 | 0.80 | 0.80 |
| PDT-SFK | 0.27 | 0.94 | 1.13 |
| PDT-CFK | 0.23 | 1.16 | 1.18 |
| STLHash | 1.01 | 0.52 | 0.26 |
| GoogleDense | 1.72 | 0.67 | 0.15 |
| Sparsepp | 0.77 | 0.76 | 0.18 |
| Hopscotch | 1.04 | 0.51 | 0.12 |
| Robin | 1.79 | 0.47 | 0.12 |
| ArrayHash | 0.70 | 0.83 | 0.13 |
| HAT | 0.32 | 0.59 | 0.27 |
| Judy | 0.51 | 0.97 | 0.76 |
| ART | 0.91 | 0.96 | 0.77 |
| Cedar-R | 1.07 | 0.87 | 0.61 |
| Cedar-P | 0.49 | 0.80 | 0.57 |
| PCT-Bit | 4.07 | 12.42 | 14.34 |
| PCT-Hash | 7.66 | 10.26 | 6.99 |
| ZFT | 1.82 | 3.84 | 2.57 |
| CTrie++ | 2.12 | 3.01 | 1.10 |
| | Space | Insert | Lookup |
|---|---|---|---|
| PDT-PB | 0.64 | 0.98 | 0.68 |
| PDT-SB | 0.28 | 1.60 | 0.96 |
| PDT-CB | 0.24 | 1.71 | 1.02 |
| PDT-PFK | 0.67 | 0.79 | 0.86 |
| PDT-SFK | 0.31 | 0.96 | 1.15 |
| PDT-CFK | 0.27 | 1.14 | 1.22 |
| STLHash | 1.29 | 0.50 | 0.27 |
| GoogleDense | 1.64 | 0.54 | 0.14 |
| Sparsepp | 0.97 | 0.69 | 0.18 |
| Hopscotch | 1.08 | 0.42 | 0.13 |
| Robin | 1.83 | 0.41 | 0.12 |
| ArrayHash | 0.69 | 0.73 | 0.14 |
| HAT | 0.43 | 0.60 | 0.27 |
| Judy | 0.66 | 0.92 | 0.74 |
| ART | 1.23 | 1.00 | 0.73 |
| Cedar-R | 1.19 | 0.89 | 0.59 |
| Cedar-P | 0.63 | 0.89 | 0.61 |
| PCT-Bit | 5.67 | 11.85 | 13.79 |
| PCT-Hash | 10.11 | 9.48 | 7.09 |
| ZFT | 2.24 | 3.56 | 2.64 |
| CTrie++ | 2.92 | 3.01 | 1.09 |
| | Space | Insert | Lookup |
|---|---|---|---|
| PDT-PB | 0.84 | 1.18 | 0.65 |
| PDT-SB | 0.29 | 1.80 | 0.87 |
| PDT-CB | 0.20 | 2.00 | 0.91 |
| PDT-PFK | 0.80 | 0.85 | 0.88 |
| PDT-SFK | 0.33 | 1.02 | 1.11 |
| PDT-CFK | 0.25 | 1.34 | 1.20 |
| STLHash | 1.02 | 0.91 | 0.34 |
| GoogleDense | 1.25 | 0.24 | 0.09 |
| Sparsepp | 0.67 | 0.50 | 0.13 |
| Hopscotch | 1.50 | 0.27 | 0.07 |
| Robin | 1.50 | 0.26 | 0.08 |
| ArrayHash | 0.54 | 0.47 | 0.13 |
| HAT | 0.32 | 0.39 | 0.24 |
| Judy | 0.34 | 0.95 | 0.61 |
| ART | 1.01 | 0.65 | 0.63 |
| Cedar-R | 0.24 | 0.70 | 0.21 |
| Cedar-P | 0.31 | 0.67 | 0.24 |
| PCT-Bit | 7.05 | 8.44 | 9.91 |
| PCT-Hash | 8.45 | 5.44 | 6.86 |
| ZFT | 2.57 | 3.30 | 2.88 |
| CTrie++ | 2.50 | 2.55 | 0.78 |
| | Space | Insert | Lookup |
|---|---|---|---|
| PDT-PB | 2.37 | 1.62 | 1.10 |
| PDT-SB | 0.83 | 1.93 | 1.19 |
| PDT-CB | 0.66 | 2.04 | 1.22 |
| PDT-PFK | 2.46 | 1.09 | 1.14 |
| PDT-SFK | 0.95 | 1.27 | 1.42 |
| PDT-CFK | 0.78 | 1.52 | 1.52 |
| STLHash | 7.47 | 0.61 | 0.51 |
| GoogleDense | 9.93 | 0.89 | 0.28 |
| Sparsepp | 6.22 | 0.83 | 0.39 |
| Hopscotch | 6.87 | 0.70 | 0.27 |
| Robin | 9.87 | 0.61 | 0.26 |
| ArrayHash | 5.44 | 0.98 | 0.30 |
| HAT | 2.23 | 1.43 | 0.57 |
| Judy | 1.66 | 1.45 | 1.26 |
| ART | 5.83 | 0.91 | 0.77 |
| Cedar-R | 1.97 | 1.79 | 1.66 |
| Cedar-P | 1.46 | 2.10 | 1.71 |
| PCT-Bit | n/a | n/a | n/a |
| PCT-Hash | n/a | n/a | n/a |
| ZFT | 9.27 | 6.33 | 5.65 |
| CTrie++ | 8.13 | 4.25 | 2.43 |
| | Space | Insert | Lookup |
|---|---|---|---|
| PDT-PB | 10.1 | 1.43 | 1.00 |
| PDT-SB | 3.6 | 2.20 | 1.49 |
| PDT-CB | 2.8 | 2.39 | 1.53 |
| PDT-PFK | 10.7 | 1.32 | 1.39 |
| PDT-SFK | 4.1 | 1.63 | 1.79 |
| PDT-CFK | 3.4 | 1.91 | 1.90 |
| STLHash | 32.9 | 0.67 | 0.59 |
| GoogleDense | 40.1 | 0.99 | 0.32 |
| Sparsepp | 27.3 | 0.89 | 0.44 |
| Hopscotch | 41.2 | 1.01 | 0.27 |
| Robin | 41.2 | 0.69 | 0.28 |
| ArrayHash | 21.9 | 1.06 | 0.32 |
| HAT | 9.5 | 1.40 | 0.78 |
| Judy | 7.8 | 1.52 | 1.29 |
| ART | 25.8 | 1.05 | 0.93 |
| Cedar-R | n/a | n/a | n/a |
| Cedar-P | n/a | n/a | n/a |
| PCT-Bit | n/a | n/a | n/a |
| PCT-Hash | n/a | n/a | n/a |
| ZFT | n/a | n/a | n/a |
| CTrie++ | n/a | n/a | n/a |
| | Space | Insert | Lookup |
|---|---|---|---|
| PDT-PB | 2.32 | 1.45 | 0.94 |
| PDT-SB | 1.26 | 2.76 | 1.44 |
| PDT-CB | 1.09 | 2.87 | 1.44 |
| PDT-PFK | 2.32 | 1.27 | 1.24 |
| PDT-SFK | 1.38 | 1.74 | 1.93 |
| PDT-CFK | 1.21 | 2.04 | 2.02 |
| STLHash | 6.05 | 0.67 | 0.50 |
| GoogleDense | 10.50 | 1.09 | 0.27 |
| Sparsepp | 5.06 | 0.96 | 0.37 |
| Hopscotch | 6.23 | 0.75 | 0.25 |
| Robin | 9.23 | 0.63 | 0.25 |
| ArrayHash | 5.91 | 1.16 | 0.28 |
| HAT | 2.68 | 1.08 | 0.51 |
| Judy | 2.21 | 1.88 | 1.59 |
| ART | 5.17 | 1.64 | 1.19 |
| Cedar-R | 7.37 | 2.24 | 2.30 |
| Cedar-P | 2.02 | 2.20 | 2.28 |
| PCT-Bit | 18.05 | 25.92 | 33.49 |
| PCT-Hash | n/a | n/a | n/a |
| ZFT | 7.53 | 6.20 | 5.03 |
| CTrie++ | 8.17 | 4.75 | 2.86 |
| | Space | Insert | Lookup |
|---|---|---|---|
| PDT-PB | 5.4 | 1.66 | 1.19 |
| PDT-SB | 2.6 | 2.57 | 1.77 |
| PDT-CB | 2.3 | 2.68 | 1.74 |
| PDT-PFK | 5.7 | 1.36 | 1.42 |
| PDT-SFK | 2.9 | 1.84 | 2.03 |
| PDT-CFK | 2.6 | 2.15 | 2.18 |
| STLHash | 16.3 | 0.64 | 0.55 |
| GoogleDense | 19.5 | 0.82 | 0.28 |
| Sparsepp | 13.5 | 0.83 | 0.43 |
| Hopscotch | 20.3 | 0.93 | 0.24 |
| Robin | 20.3 | 0.64 | 0.26 |
| ArrayHash | 10.4 | 0.98 | 0.29 |
| HAT | 6.7 | 1.08 | 0.55 |
| Judy | 5.9 | 2.09 | 1.76 |
| ART | 14.0 | 1.76 | 1.45 |
| Cedar-R | n/a | n/a | n/a |
| Cedar-P | n/a | n/a | n/a |
| PCT-Bit | n/a | n/a | n/a |
| PCT-Hash | n/a | n/a | n/a |
| ZFT | 19.6 | 5.77 | 5.06 |
| CTrie++ | 23.5 | 5.12 | 3.12 |
Shunsuke Kanda, RIKEN Center for Advanced Intelligence Project, Japan
Dominik Köppl, Kyushu University, Japan; Japan Society for Promotion of Science
Yasuo Tabei, RIKEN Center for Advanced Intelligence Project, Japan
Kazuhiro Morita, Tokushima University, Japan
Masao Fuketa, Tokushima University, Japan
Abstract.
A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly-compressed keyword dictionaries based on the advancements of practical succinct data structures. However, as most succinct data structures are only efficient in the static case, it is still difficult to implement a keyword dictionary that is space efficient and dynamic. In this article, we propose such a keyword dictionary. Our main idea is to embrace the path decomposition technique, which was proposed for constructing cache-friendly tries. To store the path-decomposed trie in small memory, we design data structures based on recent compact hash trie representations. Experiments on real-world datasets reveal that our dynamic keyword dictionary needs up to 68% less space than the existing smallest ones, while achieving a relevant space-time tradeoff.
1. Introduction
An associative array is called a keyword dictionary if its keys are strings. In this article, we study the problem of maintaining a keyword dictionary in main memory efficiently. When storing words extracted from text collections written in natural or computer languages, the size of a keyword dictionary is not of major concern. This is because, after carefully polishing the extracted strings with natural language processing tools like stemmers, the number of distinct keywords grows only sublinearly in the number of words of the text due to Heaps’ Law (Heaps, 1978; Baeza-Yates and Ribeiro-Neto, 2011). However, as reported in (Martínez-Prieto et al., 2016), some natural language applications such as web search engines and machine translation systems need to handle large datasets that are not under Heaps’ Law. Also, other recent applications, such as those in Semantic Web graphs and bioinformatics, handle massive string databases with keyword dictionaries (Martínez-Prieto et al., 2016; Mavlyutov et al., 2015). Although common implementations like hash tables are fast, their memory consumption is a severe drawback in such scenarios. Here, a space-efficient implementation of the keyword dictionary is important. In this paper, we focus on the practical side of this problem.
In the static setting, which omits the insertion and deletion of keywords, a number of compressed keyword dictionaries have been developed over the past decade, some of which we highlight in the following. We start with Martínez-Prieto et al. (Martínez-Prieto et al., 2016), who proposed and evaluated a number of compressed keyword dictionaries based on techniques like hashing, front-coding, full-text indexes, and tries. They demonstrated that their implementations use up to 5% of the space of the original dataset, while also supporting searches of prefixes and substrings of the keywords. Subsequently, Grossi and Ottaviano (Grossi and Ottaviano, 2014) proposed a cache-friendly keyword dictionary through path decomposition of tries. Arz and Fischer (Arz and Fischer, 2018) adapted the LZ78 compression to devise a keyword dictionary. Finally, Kanda et al. (Kanda et al., 2017a) proposed a keyword dictionary based on a compressed double-array trie. As we can see from these representations, space-efficient static keyword dictionaries have been well studied because of the advancements of practical (yet static) succinct data structures collected in well-maintained libraries such as SDSL (Gog et al., 2014) and Succinct (Grossi and Ottaviano, 2013).
Under the dynamic setting, however, only a few space-efficient keyword dictionaries have been realized, probably due to the implementation difficulty. Although HAT-trie (Askitis and Sinha, 2010) and Judy (Baskins, 2002) are representative space-efficient dynamic implementations as demonstrated in previous experiments (such as http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/#perf and https://github.com/Tessil/hat-trie/blob/master/README.md#benchmark), they still waste memory by maintaining many pointers. The Cedar trie (Yoshinaga and Kitsuregawa, 2014) is a space-efficient implementation that relies heavily on 32-bit pointers to address memory, and therefore cannot be applied to massive datasets. Its implementation makes it hard to switch to 64-bit pointers, and we expect that doing so would increase its space consumption considerably. Although several practical dynamic succinct data structures (Prezza, 2017; Poyias et al., 2017, 2018) have been recently developed, modern dynamic keyword dictionaries are heavily based on pointers, which consume a large fraction of the entire space requirement. Nonetheless, there are some applications that need dynamic keyword dictionaries for massive datasets, such as search engines (Brazil Inc., 2019; Busch et al., 2012), RDF stores (Mavlyutov et al., 2015), or Web crawlers (Ueda et al., 2013). Consequently, realizing practical space-efficient dynamic keyword dictionaries is an important open challenge.
1.1. Space-Efficient Dynamic Tries
Common keyword dictionary implementations represent the keywords in a trie, supporting the retrieval of keywords with trie navigation operations. In this subsection, we summarize space-efficient dynamic tries.
Theoretical Discussion
We consider a dynamic trie with nodes over an alphabet of size . Arroyuelo et al. (Arroyuelo et al., 2016) introduced succinct representations that require almost optimal bits of space, while supporting insertion and deletion of a leaf in amortized time if and in amortized time otherwise. (Throughout this paper, the base of the logarithm is 2, whenever not explicitly indicated.) Jansson et al. (Jansson et al., 2015) presented a dynamic trie representation that uses bits of space, while supporting insertion and deletion of a leaf in expected amortized time.
Hash Tries
On the practical side, Poyias et al. (Poyias et al., 2018) proposed the m-Bonsai trie, a practical dynamic compact trie representation. It is a variant of the Bonsai trie (Darragh et al., 1993) that represents the trie nodes as entries in a compact hash table. It takes bits of space, while supporting update and some traversal operations in expected time. Fischer and Köppl (Fischer and Köppl, 2017) presented and evaluated a number of dynamic tries for LZ78 (Ziv and Lempel, 1978) and LZW (Welch, 1984) factorization. They also proposed an efficient hash-based trie representation in a similar way to m-Bonsai, which is referred to as FK-hash. (The representation is referred to as hash or cht in their paper (Fischer and Köppl, 2017). To avoid confusion, we name it FK-hash by using the initial letters of the proposers, Fischer and Köppl.) Although FK-hash uses bits of space, its update algorithm is simple and practically fast. However, we are not aware of any space-efficient approach using them as keyword dictionaries.
Compacted Tries
Another line of research focuses on limiting the space of the trie in relation to the number of keywords. Suppose that we want to maintain a set of strings with a total length of on a machine, where characters fit into a single machine word . In this setting, Belazzougui et al. (Belazzougui et al., 2010) proposed the (dynamic) z-fast trie, which takes bits of space and supports retrieval, insertion and deletion of a string in expected time. Takagi et al. (Takagi et al., 2016) proposed the packed compact trie, which takes bits of space and supports the same operations in expected time. Recently, Tsuruta et al. (Tsuruta et al., 2020) developed a hybrid data structure of the z-fast trie and the packed compact trie, which also takes bits of space, but improves each of these operations to run in expected time.
1.2. Our Contribution
We propose a novel space-efficient dynamic keyword dictionary, called the dynamic path-decomposed trie (abbreviated as DynPDT). DynPDT is based on a trie formed by path decomposition (Ferragina et al., 2008). The path decomposition is a trie transformation technique, which was proposed for constructing cache-friendly trie dictionaries. It was up to now utilized only in static applications (Grossi and Ottaviano, 2014; Hsu and Ottaviano, 2013). Here, we adapt this technique for the dynamic construction of DynPDT, which gives DynPDT two main advantages over other known keyword dictionaries.
(1) The first is that the data structure is cache efficient because of the path decomposition. During the retrieval of a keyword, most parts of the keyword can be scanned in a cache-friendly manner without node-to-node traversals based on random accesses.
(2) The second is that the path decomposition allows us to plug in any dynamic trie representation for the path-decomposed trie topology. For this job, we choose the hash-based trie representations m-Bonsai and FK-hash as these are fast and memory efficient in the setting when all trie nodes have to be represented explicitly (which is the case for the nodes of the path-decomposed trie).
Based on these advantages, DynPDT becomes a fast and space-efficient dynamic keyword dictionary.
From experiments using massive real-world datasets, we demonstrate that DynPDT is more space efficient compared to existing keyword dictionaries while achieving a relevant space-time tradeoff. For example, to construct a keyword dictionary from a large URI dataset of 13.8 GiB, DynPDT needs only 2.5 GiB of working space, while a HAT-trie and a Judy trie need 9.5 GiB and 7.8 GiB, respectively. The time performance is competitive in many cases thanks to the path decomposition. The source code of our implementation is available at https://github.com/kampersanda/poplar-trie.
1.3. Paper Structure
In Section 2, we introduce the keyword dictionary, and review the trie data structure and the path decomposition in our preliminaries. We introduce our new data structure DynPDT in Section 3. Subsequently, we present our DynPDT representations based on m-Bonsai and FK-hash in Sections 4 and 5, respectively. In Section 6, we provide our experimental results. Finally, we conclude the paper in Section 7. (A preliminary version of this work appeared in our conference paper (Kanda et al., 2017b) and the first author’s Ph.D. thesis (Kanda, 2018). This paper differs significantly from them as follows: (1) a fast variant of m-Bonsai was incorporated in Section 4.1; (2) an efficient implementation of the bijective hash function in m-Bonsai was incorporated in Section 4.2; (3) a growing algorithm of m-Bonsai was presented in Section 4.3; (4) FK-hash was also considered in addition to m-Bonsai in Section 5; (5) the experimental results in Section 6 and all descriptions were significantly enhanced.)
2. Preliminaries
A string is a (finite) sequence of characters over a finite alphabet. Our strings always start at position 0. Given a string $S$ of length $n$, $S[i,j)$ denotes the substring $S[i] S[i+1] \cdots S[j-1]$ for $0 \le i \le j \le n$. Particularly, $S[0,i)$ is a prefix of $S$ and $S[i,n)$ is a suffix of $S$. Let $|S|$ denote the length of $S$. The same notation is also applied to arrays. The cardinality of a set $\mathcal{A}$ is denoted by $|\mathcal{A}|$.
Our model of computation is the transdichotomous word RAM model of word size , where is the total length of all keywords of a given problem, i.e., the size of the problem. We can read and process bits in constant time.
2.1. Keyword Dictionary
A keyword is a string over an alphabet $\mathcal{A}$ that is terminated with a special character $\texttt{\$} \notin \mathcal{A}$. A keyword dictionary is a dynamic associative array that maps a dynamic set of keywords $\mathcal{S}$ to values, where each value belongs to a finite set. It supports the retrieval, the insertion, and the deletion of keywords while maintaining the key-value mapping. In detail, it supports the following operations:
- $\textsf{lookup}(K)$ returns the value associated with the keyword $K$ if $K \in \mathcal{S}$ or $\bot$ otherwise.
- $\textsf{insert}(K, x)$ inserts the keyword $K$ in $\mathcal{S}$, i.e., $\mathcal{S} \leftarrow \mathcal{S} \cup \{K\}$, and associates the value $x$ with $K$.
- $\textsf{delete}(K)$ removes the keyword $K$ from $\mathcal{S}$, i.e., $\mathcal{S} \leftarrow \mathcal{S} \setminus \{K\}$.
2.2. Tries
A trie (Knuth, 1998; Fredkin, 1960) is a rooted labeled tree $\mathcal{T}_{\mathcal{S}}$ representing a set of keywords $\mathcal{S}$. Each edge in $\mathcal{T}_{\mathcal{S}}$ is labeled by a character. All outgoing edges of a node are labeled with distinct characters. The label of the edge between a node and its parent is called the branching character of that node. A node is uniquely determined by its parent and its branching character. Each keyword $K \in \mathcal{S}$ is represented by exactly one path from the root to a leaf, i.e., the keyword can be extracted by concatenating the edge labels on that path. Since $\mathcal{S}$ is prefix-free ($\texttt{\$}$ is a unique delimiter of each keyword), there is a one-to-one correspondence between leaves and keywords.
Given a keyword $K$, $\textsf{lookup}(K)$ retrieves $K$ by traversing nodes from the root to a leaf while matching the characters of $K$ with the edge labels of the traversed path. In representations storing all trie nodes explicitly, we visit one node per character of $K$ during this traversal. However, this traversal suffers from poor locality of reference since it needs to follow pointers that usually address non-consecutive memory. In practice, this cache inefficiency is a critical bottleneck, especially for long strings such as URLs. Grossi and Ottaviano (Grossi and Ottaviano, 2014) successfully solved this problem in practice through path decomposition (Ferragina et al., 2008), but only in static settings.
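The node-by-node traversal described above can be illustrated with a minimal pointer-based trie sketch. This is not the representation used in the paper; the names `TrieNode`, `trie_insert`, `trie_lookup`, and `END` are hypothetical, and the terminator is appended inside the functions for convenience. The lookup also reports how many nodes it visited, which makes the one-node-per-character cost visible.

```python
END = "$"  # assumed terminator that never occurs inside a keyword


class TrieNode:
    def __init__(self):
        self.children = {}  # branching character -> child node
        self.value = None   # only meaningful on leaves


def trie_insert(root, key, value):
    """Insert key (without terminator) and attach value to its leaf."""
    node = root
    for ch in key + END:
        node = node.children.setdefault(ch, TrieNode())
    node.value = value


def trie_lookup(root, key):
    """Return (value or None, number of nodes visited below the root)."""
    node = root
    visited = 0
    for ch in key + END:
        node = node.children.get(ch)
        if node is None:
            return None, visited
        visited += 1  # each step is a separate, typically non-local, access
    return node.value, visited
```

Every step of `trie_lookup` follows one reference to a separately allocated node, which is the source of the poor locality that path decomposition aims to reduce.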
2.3. Path Decomposition
The path decomposition (Ferragina et al., 2008) of a trie $\mathcal{T}_{\mathcal{S}}$ is a recursive procedure that first chooses an arbitrary root-to-leaf path $\pi$ in $\mathcal{T}_{\mathcal{S}}$, then compactifies the path $\pi$ to a single node, and subsequently repeats the procedure in each subtrie hanging off the path $\pi$. As a result, $\mathcal{T}_{\mathcal{S}}$ is partitioned into a set of $|\mathcal{S}|$ node-to-leaf paths because there are $|\mathcal{S}|$ leaves in $\mathcal{T}_{\mathcal{S}}$. This decomposition produces the path-decomposed trie $\mathcal{T}^{c}_{\mathcal{S}}$, which is composed of $|\mathcal{S}|$ compactified nodes.
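For explaining the properties of $\mathcal{T}^{c}_{\mathcal{S}}$, we call the concatenation of the labels of all edges of a node-to-leaf path $\pi$ in $\mathcal{T}_{\mathcal{S}}$ the path string of $\pi$. The path strings of the compactified paths of $\mathcal{T}_{\mathcal{S}}$ are the node labels of $\mathcal{T}^{c}_{\mathcal{S}}$. In detail, each node $u$ in $\mathcal{T}^{c}_{\mathcal{S}}$ is associated with a node-to-leaf path $\pi_u$ of $\mathcal{T}_{\mathcal{S}}$ and is labeled by the path string of $\pi_u$, denoted by $L_u$. Each edge in $\mathcal{T}^{c}_{\mathcal{S}}$ is labeled by a pair consisting of a branching character and an integer, which are defined as follows (see also Figure 1): Take a node $u$ in $\mathcal{T}^{c}_{\mathcal{S}}$ and one of its children $v$. Suppose that $u$ and $v$ are associated with the paths $\pi_u$ and $\pi_v$ in $\mathcal{T}_{\mathcal{S}}$, respectively, such that $L_u$ and $L_v$ are the path labels of $\pi_u$ and $\pi_v$. The edge $(u, v)$ has the label $\langle c, i \rangle$ if, in $\mathcal{T}_{\mathcal{S}}$, the first node on the path $\pi_v$ is the node
- whose branching character is $c$, and
- whose parent is the $i$-th node visited on the path $\pi_u$ (throughout this paper, we start counting from zero).
The edge labels of $\mathcal{T}^{c}_{\mathcal{S}}$ are characters drawn from the alphabet $\mathcal{A} \times \{0, 1, \ldots, \Lambda - 1\}$, where $\Lambda$ is the longest length of all node labels.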
Example 2.1 (Path-Decomposed Trie).
Figure 2 illustrates a root-to-leaf path in and the corresponding root in after compactifying to . The root is labeled by the path string of , which is . The branching character of in is because in is the child of the third node on the path with branching character . Also for the subtries rooted at the nodes in , the decomposition is recursively applied to produce the children of the root in .
Given a keyword $K$, the retrieval on $\mathcal{T}_{\mathcal{S}}$ can be simulated with a traversal of $\mathcal{T}^{c}_{\mathcal{S}}$ starting at its root: Let $u$ denote the currently visited node in $\mathcal{T}^{c}_{\mathcal{S}}$. On visiting $u$, we compare the path string $L_u$ with the characters of $K$. If we find a mismatch at position $i$ with $K[i] \neq L_u[i]$, we descend to the child with branching character $\langle K[i], i \rangle$ and drop the first $i+1$ characters of $K$.
When storing the characters of each path string in consecutive memory locations, the number of random accesses involved in the retrieval on $\mathcal{T}^{c}_{\mathcal{S}}$ is bounded by $O(h)$, where $h$ is the height of $\mathcal{T}^{c}_{\mathcal{S}}$. The following property regarding the height is satisfied by construction.
Property 1.
The height of $\mathcal{T}^{c}_{\mathcal{S}}$ cannot be larger than that of $\mathcal{T}_{\mathcal{S}}$.
Centroid Path Decomposition
A way to improve this height bound in the static case is the centroid path decomposition (Ferragina et al., 2008). Given an inner node $u$ in $\mathcal{T}_{\mathcal{S}}$, the heavy child of $u$ is the child whose subtrie has the most leaves (ties are broken arbitrarily). Given a node $u$, the centroid path is the path from $u$ to a leaf obtained by descending only to heavy children. The centroid path decomposition yields the following property by always choosing centroid paths in the decomposition.
Property 2 (Ferragina et al., 2008).
Through the centroid path decomposition, the height of $\mathcal{T}^{c}_{\mathcal{S}}$ is bounded by $O(\log |\mathcal{S}|)$.
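As an illustration of the static case, the following sketch builds a centroid path decomposition directly from a list of keywords, without materializing the trie. It is a simplified illustration rather than the construction used in the cited works; the function names and the dictionary-based node layout are assumptions made here. Keywords are expected to be distinct and terminated by `$`.

```python
from collections import defaultdict


def centroid_decompose(keys):
    """Return a node {"label": path string, "children": {(c, i): node}}.

    keys: non-empty list of distinct, '$'-terminated (hence prefix-free) strings
    that all hang off the same trie node.
    """
    children = {}
    remaining = list(keys)
    depth = 0
    while len(remaining) > 1:
        # Group the keywords by their character at the current depth.
        groups = defaultdict(list)
        for k in remaining:
            groups[k[depth]].append(k)
        # The heavy child is the group with the most leaves (ties: first seen).
        heavy = max(groups, key=lambda c: len(groups[c]))
        for c, group in groups.items():
            if c != heavy:
                # Light subtries hang off the path at position `depth`.
                suffixes = [k[depth + 1:] for k in group]
                children[(c, depth)] = centroid_decompose(suffixes)
        remaining = groups[heavy]
        depth += 1
    return {"label": remaining[0], "children": children}


def height(node):
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((height(c) for c in node["children"].values()), default=0)
```

Each recursive call receives at most half of the keywords of its caller, because a light group can never be larger than the heavy one; this is the reason for the logarithmic height bound.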
Key-Value Mapping
We can implement the key-value mapping through $\mathcal{T}^{c}_{\mathcal{S}}$ because there is a one-to-one correspondence between nodes in $\mathcal{T}^{c}_{\mathcal{S}}$ and keywords in $\mathcal{S}$. A simple approach is to store the associated values in an array indexed by node id, such that each entry stores the value associated with the corresponding node. If we assign each of the $|\mathcal{S}|$ nodes in $\mathcal{T}^{c}_{\mathcal{S}}$ a unique id from the range $[0, |\mathcal{S}|)$, then the array has no vacant entry. Another approach is to embed the value of $K$ at the end of $L_u$, where the node $u$ corresponds to the keyword $K$. This approach can be used without considering the assignment of node ids. In our experiments, we used the latter approach.
3. Dynamic Path-Decomposed Trie
Although the centroid path decomposition gives a logarithmic upper bound on the height of $\mathcal{T}^{c}_{\mathcal{S}}$ (cf. Section 2), it can be applied only in static settings because we have to know the complete topology of $\mathcal{T}_{\mathcal{S}}$ a priori to determine the centroid paths. As a matter of fact, previous data structures embracing the path decomposition (Grossi and Ottaviano, 2014; Hsu and Ottaviano, 2013; Ferragina et al., 2008) consider only static applications.
In this section, we present the incremental path decomposition, a novel procedure to construct a dynamic path-decomposed trie, which we call DynPDT in the following. Our procedure incrementally chooses a node-to-leaf path in $\mathcal{T}_{\mathcal{S}}$ (we do not actually construct $\mathcal{T}_{\mathcal{S}}$, but represent it with the DynPDT) and directly updates the DynPDT on inserting a new keyword into $\mathcal{S}$. This incrementally chosen path is not a centroid path in general. Thus, the incremental path decomposition does not necessarily satisfy Property 2 but always satisfies Property 1.
In this section, we drop the technical detail of storing the values to ease the explanation of DynPDT; accordingly, we omit the second argument in the insert operation of a new keyword $K$.
3.1. Incremental Path Decomposition
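In the following, we simulate a dynamic trie $\mathcal{T}_{\mathcal{S}}$ by DynPDT $\mathcal{T}^{c}_{\mathcal{S}}$. Suppose that $\mathcal{S}$ is non-empty. On inserting a new keyword $K \notin \mathcal{S}$ into $\mathcal{T}_{\mathcal{S}}$, we proceed as follows:
(1) First traverse $\mathcal{T}_{\mathcal{S}}$ from the root by matching characters of $K$ until reaching the deepest node $v$ whose string label is a prefix of $K$.
(2) Decompose $K$ into $K = K[0,i) \, K[i] \, K[i+1,|K|)$, where $K[0,i)$ is the string label of $v$; this is possible since $K \notin \mathcal{S}$ and $K[|K|-1] = \texttt{\$}$.
(3) Finally, insert a new child of $v$ with branching character $K[i]$ and append, from that child, new nodes corresponding to the suffix $K[i+1,|K|)$.
In other words, the task of $\textsf{insert}(K)$ on $\mathcal{T}_{\mathcal{S}}$ is to create a new node-to-leaf path representing the suffix $K[i+1,|K|)$. We call that path the incremental path of the keyword $K$. We simulate $\textsf{insert}(K)$ by creating a new node in $\mathcal{T}^{c}_{\mathcal{S}}$ whose label is the path label of this incremental path:
- If $\mathcal{S} = \emptyset$, create the root $u$ and associate the keyword $K$ with $u$ by $L_u \leftarrow K$.
- Otherwise ($\mathcal{S} \neq \emptyset$), retrieve the keyword $K$ from the root in three steps after setting the variable $u$ to the root and the string variable $S \leftarrow K$:
(1) Compare $S$ with $L_u$. If $S = L_u$, terminate because $K$ is already inserted; otherwise, proceed with Step 2.
(2) Find $i$ such that $S[0,i) = L_u[0,i)$ and $S[i] \neq L_u[i]$ ($i$ exists since $S \neq L_u$ and $K[|K|-1] = \texttt{\$}$). If $u$ has a child $v$ with branching character $\langle S[i], i \rangle$, go back to Step 1 after setting $u \leftarrow v$ and $S \leftarrow S[i+1,|S|)$; otherwise, proceed with Step 3.
(3) Insert $K$ into $\mathcal{T}^{c}_{\mathcal{S}}$ by creating a new child $v$ of $u$ with branching character $\langle S[i], i \rangle$, and store the remaining suffix in $v$ by $L_v \leftarrow S[i+1,|S|)$.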
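The three steps above can be sketched as follows. This is a minimal pointer-based illustration, not the compact m-Bonsai or FK-hash representation developed in the paper; the class names `Node` and `DynPDT`, the helper `first_mismatch`, and the use of a Python dictionary for the children are assumptions made for this sketch. Keywords are expected to end with the terminator `$`, which must not occur elsewhere in a keyword.

```python
def first_mismatch(s, t):
    """Smallest i with s[i] != t[i].

    Assumes s != t and that neither is a prefix of the other, which holds
    when both end with a terminator that occurs nowhere else.
    """
    i = 0
    while s[i] == t[i]:
        i += 1
    return i


class Node:
    def __init__(self, label):
        self.label = label   # path string L_u, stored contiguously
        self.children = {}   # (branching character, position) -> Node


class DynPDT:
    def __init__(self):
        self.root = None

    def insert(self, key):
        """Insert a '$'-terminated keyword; return False if already present."""
        if self.root is None:              # empty dictionary: create the root
            self.root = Node(key)
            return True
        u, s = self.root, key
        while True:
            if s == u.label:               # Step 1: keyword already stored
                return False
            i = first_mismatch(s, u.label)  # Step 2: locate the mismatch
            child = u.children.get((s[i], i))
            if child is None:              # Step 3: hang the remaining suffix
                u.children[(s[i], i)] = Node(s[i + 1:])
                return True
            u, s = child, s[i + 1:]        # descend and drop i + 1 characters
```

Because each node label is a single Python string, the comparison in Step 1 and the scan in Step 2 touch contiguous memory; only the descent to a child involves a separate access.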
Example 3.1 (Construction).
Figure 3 illustrates the construction process of DynPDT when inserting the keywords K_{1}=\texttt{technology\}K_{2}=\texttt{technics$}K_{3}=\texttt{technique$}K_{4}=\texttt{technically$}iu_{i}\mathcal{T}^{c}_{\mathcal{S}}$.
- (a)
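In the first insertion $\textsf{insert}(K_{1})$, we create the root $u_{1}$ and associate $K_{1}$ with $u_{1}$, that is, $L_{u_{1}}$ becomes $\texttt{technology\$}$. The resulting $\mathcal{T}^{c}_{\mathcal{S}}$ for $\mathcal{S}=\{K_{1}\}$ is shown in Figure 3a.
- (b)
In the second insertion $\textsf{insert}(K_{2})$, we define a string variable $S$ initially set to $K_{2}$. We try to retrieve $K_{2}$ in $\mathcal{T}^{c}_{\mathcal{S}}$ by comparing $S$ with $L_{u_{1}}$, but fail as there is a mismatching character i at position 5 with $S[0,5)=L_{u_{1}}[0,5)=\texttt{techn}$ and $S[5]=\texttt{i}\neq\texttt{o}=L_{u_{1}}[5]$. Based on this mismatch result, we search the child of $u_{1}$ with branching character $(\texttt{i},5)$. However, since there is no such child, we add a new child $u_{2}$ to $u_{1}$ with branching character $(\texttt{i},5)$ and associate the remaining suffix $S[6,|S|)=\texttt{cs\$}$ with $u_{2}$ by storing it in $L_{u_{2}}$. The resulting $\mathcal{T}^{c}_{\mathcal{S}}$ for $\mathcal{S}=\{K_{1},K_{2}\}$ is shown in Figure 3b.
- (c)
In the third insertion $\textsf{insert}(K_{3})$, we initially set the string variable $S$ to $K_{3}$ and then compare $S$ with $L_{u_{1}}$ in the same manner as the second insertion. Since $S[0,5)=L_{u_{1}}[0,5)=\texttt{techn}$ and $S[5]=\texttt{i}\neq\texttt{o}=L_{u_{1}}[5]$, we descend to child $u_{2}$ with branching character $(\texttt{i},5)$. After updating $S\leftarrow S[6,|S|)=\texttt{que\$}$, we compare $S$ with $L_{u_{2}}$. Since $S[0]=\texttt{q}\neq\texttt{c}=L_{u_{2}}[0]$ and $u_{2}$ has no child with branching character $(\texttt{q},0)$, we add a new child $u_{3}$ with this branching character and set $L_{u_{3}}$ to $S[1,|S|)=\texttt{ue\$}$. The resulting $\mathcal{T}^{c}_{\mathcal{S}}$ for $\mathcal{S}=\{K_{1},K_{2},K_{3}\}$ is shown in Figure 3c.
- (d)
The fourth insertion $\textsf{insert}(K_{4})$ is also conducted in the same manner. The final trie is shown in Figure 3d.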
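The three insertion steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the dictionary of children keyed by `(character, offset)`, and the returned tuple are assumptions made for illustration, and the sketch ignores associated values and the alphabet restriction of Section 3.3.

```python
class Node:
    """One DynPDT node: a label plus children keyed by (character, offset)."""
    def __init__(self, label):
        self.label = label
        self.children = {}

def insert(root, key):
    """Insert `key` (terminated by '$'); return (root, node that ends the keyword)."""
    if root is None:                       # empty set: the root stores the whole keyword
        root = Node(key)
        return root, root
    u, s = root, key
    while True:
        if s == u.label:                   # Step 1: the keyword is already present
            return root, u
        i = 0                              # Step 2: first mismatch (exists due to '$')
        while s[i] == u.label[i]:
            i += 1
        child = u.children.get((s[i], i))
        if child is None:                  # Step 3: store the incremental path in one node
            v = Node(s[i + 1:])
            u.children[(s[i], i)] = v
            return root, v
        u, s = child, s[i + 1:]            # descend and continue with the remaining suffix

# Replays Example 3.1.
root, created = None, []
for k in ["technology$", "technics$", "technique$", "technically$"]:
    root, v = insert(root, k)
    created.append(v)
print([v.label for v in created])
```

Each insertion creates at most one node, which is why the node labels in the example are exactly the suffixes left over after the last mismatch.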
3.2. Dictionary Operations
It is left to define the operations lookup and delete to make DynPDT a keyword dictionary. Similar to insert, the operation $\textsf{lookup}(K)$ can be performed by traversing $\mathcal{T}^{c}_{\mathcal{S}}$ from the root. After matching all the characters of $K$, $\textsf{lookup}(K)$ returns the value associated with the last visited node. It returns $\bot$ on a mismatch.
Example 3.2 (Retrieval).
We provide an example for a successful and an unsuccessful search. Both examples are similar to the construction described in Example 3.1.
- (1)
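We consider $\textsf{lookup}(\texttt{technically\$})$ on $\mathcal{T}^{c}_{\mathcal{S}}$ in Figure [3d](#S3.F3.sf4). We define a string variable $S$ and set $S\leftarrow\texttt{technically\$}$. We compare $S$ with $L_{u_{1}}$. Since $S[0,5)=L_{u_{1}}[0,5)=\texttt{techn}$ and $S[5]=\texttt{i}\neq\texttt{o}=L_{u_{1}}[5]$, we descend to $u_{2}$ with branching character $(\texttt{i},5)$ and update $S\leftarrow S[6,|S|)=\texttt{cally\$}$. We then descend to $u_{4}$ with branching character $(\texttt{a},1)$ because $S[0,1)=L_{u_{2}}[0,1)=\texttt{c}$ and $S[1]=\texttt{a}\neq\texttt{s}=L_{u_{2}}[1]$, and update $S\leftarrow S[2,|S|)=\texttt{lly\$}$. Since $S$ matches $L_{u_{4}}$, the lookup returns the value associated with $u_{4}$.
- (2)
We consider $\textsf{lookup}(\texttt{technical\$})$ on $\mathcal{T}^{c}_{\mathcal{S}}$ in Figure [3d](#S3.F3.sf4). In the same manner as in the above case, we reach node $u_{4}$ with $S=\texttt{l\$}$. Comparing $S$ with $L_{u_{4}}$ yields $S[0,1)=L_{u_{4}}[0,1)=\texttt{l}$ and $S[1]=\texttt{\$}\neq L_{u_{4}}[1]=\texttt{l}$. Since $u_{4}$ has no child with branching character $(\texttt{\$},1)$, $\textsf{lookup}(\texttt{technical\$})$ returns $\bot$.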
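The two searches can be replayed with a small Python sketch. The trie of Figure 3d is built by hand from the node labels derived in Example 3.1. The `Node` class, the integer values 1 to 4, and the use of `None` for $\bot$ are illustrative assumptions.

```python
class Node:
    """DynPDT node with a label, an associated value, and (char, offset)-keyed children."""
    def __init__(self, label, value):
        self.label, self.value, self.children = label, value, {}

def lookup(root, key):
    """Return the value stored for `key` (terminated by '$'), or None if it is absent."""
    u, s = root, key
    while u is not None:
        if s == u.label:                   # all characters matched
            return u.value
        i = 0
        while i < min(len(s), len(u.label)) and s[i] == u.label[i]:
            i += 1
        if i >= len(s):                    # defensive; cannot happen with '$' terminators
            return None
        u, s = u.children.get((s[i], i)), s[i + 1:]
    return None                            # no child for the mismatching character

# Hand-built trie of Figure 3d.
u1, u2 = Node("technology$", 1), Node("cs$", 2)
u3, u4 = Node("ue$", 3), Node("lly$", 4)
u1.children[("i", 5)] = u2
u2.children[("q", 0)] = u3
u2.children[("a", 1)] = u4

print(lookup(u1, "technically$"), lookup(u1, "technical$"))
```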
The operation delete can be implemented by introducing deletion flags for each node (i.e., for each keyword), a trick that is also used in hashing with open addressing (Knuth, 1998, Chapter 6.4, Algorithm L). In other words, $\textsf{delete}(K)$ retrieves $K$ and sets the deletion flag for the node corresponding to $K$. However, this approach additionally needs one bit for each node. Another approach is to set the value associated with the deleted keyword to $\bot$ as an invalid value. This approach does not need additional space for the deletion flags. Although these approaches do not free up space after deletion, the space is reused for keywords inserted subsequently if the new keywords share sufficiently long prefixes with the deleted ones.
3.3. Fixing the Alphabet
In practice, a critical problem of DynPDT is that the domain of the edge labels in $\mathcal{T}^{c}_{\mathcal{S}}$ and the longest length $\Lambda$ of all node labels are not constant in general. We tackle this problem by limiting the size of this domain. To this end, we introduce a new parameter $\lambda$ to forcibly fix the alphabet in advance, such that the offset $i$ of every branching character $(c,i)$ satisfies $i<\lambda$. Within this limitation, suppose that we want to create an edge labeled $(c,i)$ from node $u$ with $i\geq\lambda$. As this label is not in the fixed alphabet, we create dummy nodes called step nodes with a special character $\phi$ by repeating the following procedure until $i$ becomes less than $\lambda$: add a new child $v$ of $u$ with branching character $\phi$ and recursively set $u\leftarrow v$ and $i\leftarrow i-\lambda$. The node label $L_{v}$ is the empty string if $v$ is a step node.
Example 3.3 (Step Node).
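We consider $\textsf{insert}(\texttt{technological\$})$ on $\mathcal{T}^{c}_{\mathcal{S}}$ in Figure [3d](#S3.F3.sf4) with $\lambda=8$. We set $S\leftarrow\texttt{technological\$}$ and compare $S$ with $L_{u_{1}}$. Since $S[0,9)=L_{u_{1}}[0,9)=\texttt{technolog}$ and $S[9]=\texttt{i}\neq\texttt{y}=L_{u_{1}}[9]$, the branching character would be $(\texttt{i},9)$. As $i\geq\lambda$, we create a step node $u_{5}$ with branching character $\phi$ and set $i\leftarrow i-\lambda=1$. As $i$ is now less than $\lambda$, we create a node $u_{6}$ as a child of $u_{5}$ with branching character $(\texttt{i},1)$ and store $S[10,|S|)=\texttt{cal\$}$ in $L_{u_{6}}$. The resulting $\mathcal{T}^{c}_{\mathcal{S}}$ is depicted in Figure 4.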
This solution creates additional nodes depending on $\lambda$. When $\lambda$ is too small, many step nodes are created and extra node traversals are involved. When $\lambda$ is too large, the alphabet size becomes large and the space usage can increase significantly. Therefore, it is necessary to determine a suitable $\lambda$. In Section 6, we empirically determine 32 and 64 to be favorable values for $\lambda$.
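The step-node procedure amounts to a small loop over the offset. The following Python sketch shows only how one edge label is split. The function name `edge_labels` and the `PHI` marker are hypothetical, and a full insertion would first check whether a step child already exists before creating a new one.

```python
PHI = "<phi>"  # hypothetical stand-in for the special step character

def edge_labels(c, i, lam):
    """Split the edge (c, i) into step edges followed by one edge whose offset is < lam."""
    labels = []
    while i >= lam:            # one step node per full block of lam characters
        labels.append(PHI)
        i -= lam
    labels.append((c, i))
    return labels

# Example 3.3: the mismatch (i, 9) with lambda = 8 needs one step node.
print(edge_labels("i", 9, 8))
```

The number of step nodes on such a path is the integer quotient of the offset by $\lambda$, which makes the tradeoff between small and large $\lambda$ explicit.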
3.4. Representation Scheme
To use standard trie techniques, we split up $\mathcal{T}^{c}_{\mathcal{S}}$ into two parts:
- (1)
a (standard) trie structure for a set of strings that represents $\mathcal{T}^{c}_{\mathcal{S}}$, with the difference that it assigns each node a unique id instead of its node label, and
- (2)
an associative array that maps the ids of the nodes of this trie to their corresponding node labels, called node label map (NLM).
For example, in Figure 4, the trie built on the string set and the NLM stores node labels to be accessed by the respective node ids .
Node-Label-Map
NLM dynamically manages node labels depending on the node ids assigned. As explained in Section 1, we use the m-Bonsai (Poyias et al., 2018) and FK-hash (Fischer and Köppl, 2017) representations for . Moreover, we design the NLM data structures for m-Bonsai and FK-hash individually, which we respectively present in Sections 4 and 5.
Trie Representation
To discuss the representation approaches in the next sections, we define to be a dynamic trie with nodes whose edge labels are characters drawn from the alphabet of size . Although the number of nodes depends on , we write for simplicity. supports the following operations:
- •
$\textsf{addchild}(u,c)$ adds a new child of $u$ with branching character $c$ and returns its id.
- •
$\textsf{getchild}(u,c)$ returns the id of the child of $u$ with branching character $c$ if it exists, or returns $\bot$ otherwise.
Motivation for m-Bonsai and FK-hash
We briefly review some common trie representations and point out their suitability for . The simplest representation is a list trie (Askitis, 2007, Chapter 2.3.2), which transforms an arbitrary trie to its first-child next-sibling representation. In this representation, each node of the list trie stores its branching character, a pointer to its first child, and a pointer to its next sibling. The list trie represents in bits and supports addchild and getchild in time; however, the operation time becomes problematic if is large. Another representation is a ternary search trie (TST) (Bentley and Sedgewick, 1997) that reduces the time complexity of the list trie to ; however, the space usage grows to bits. A well-known time- and space-efficient representation is the double array (Aoe, 1989). Its space usage is bits in the best case, while supporting getchild in time; however, a double array for a large alphabet tends to be sparse in practice. Actually, we are only aware of dynamic double-array implementations handling byte characters (e.g., (Yoshinaga and Kitsuregawa, 2014; Kanda et al., 2018)). Judy (Baskins, 2002) and ART (adaptive radix tree) (Leis et al., 2013) are trie representations that dynamically choose suitable data structures for the trie topology; however, both are also designed for byte characters. As each trie node is associated with an id, compact tries like the z-fast trie (Belazzougui et al., 2010) representing only nodes explicitly become inefficient with this requirement.
Compared to these trie representations, m-Bonsai and FK-hash have better complexities. m-Bonsai can represent in bits of expected space for a constant , while supporting getchild and addchild in expected time (Poyias et al., 2018). Compared to that, FK-hash needs additional bits of expected space, but supports faster insertions in practice.
A straightforward solution to provide the NLM for m-Bonsai and FK-hash is to store the node labels as satellite data in the respective hash table. However, by doing so, we would waste space for each unoccupied entry in the hash table. In the following, we present efficient solutions for the NLM tailored to m-Bonsai and FK-hash.
4. Representation Based on m-Bonsai
This section presents our approach based on m-Bonsai (Poyias et al., 2018). m-Bonsai represents trie nodes as entries in a closed hash table that, informally speaking, compactifies the stored keys with compact hashing (Knuth, 1998).
Outline
We present a plain and a compact form of the representation based on m-Bonsai. We refer to the former as PBT (Plain m-Bonsai Trie), which is a non-compact variant of m-Bonsai. PBT can be useful for fast implementation although it has not been considered in any applications yet. We refer to the latter as CBT (Compact m-Bonsai Trie) as it uses the original m-Bonsai implementation. We describe PBT and CBT in Sections 4.1 and 4.2, respectively. In both variants, we maintain a hash table of size with the load factor to store nodes. In Section 4.3, we propose a linear-time growing algorithm based on the approach of Arroyuelo et al. (Arroyuelo et al., 2017). Finally, in Section 4.4, we propose NLM data structures designed for PBT and CBT.
4.1. Plain Trie Representation
PBT uses a hash function . Trie nodes are elements in the hash table. As their locations in the hash table are fixed unless the hash table is rebuilt, we use these locations as node ids. In other words, the id of a node located at is . is performed as follows. We first compose the hash key and then compute its initial address .777This paper defines as . Let be the first vacant address from determined by linear probing. We create the new child by . That is, the id of the new child becomes . getchild can be also computed in the same manner. If is fully independent and uniformly random, the operations can be performed in expected time. PBT uses bits of space.
Practical Implementation
The table size is a power of two so that the modulo operation for the initial address can be computed quickly with a bitwise AND operation (Migliore et al., 2019, Section 4.4). We set the maximum load factor to . If reaches during an update, we double the size of the hash table by the growing algorithm described in Section 4.3. We set the initial capacity of the hash table to . Our hash function is a XorShift hash function (http://xorshift.di.unimi.it/splitmix64.c) derived from (Steele Jr et al., 2014).
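A plain trie of this kind can be sketched in a few lines of Python. The sketch makes several assumptions that are not taken from the text: the hash key is packed as `u * sigma + c`, the root occupies slot 0 with a sentinel key, `None` stands for $\bot$, and the table never grows. The mixer follows the public SplitMix64 finalizer.

```python
MASK64 = (1 << 64) - 1

def splitmix64(x):
    """SplitMix64 finalizer on 64-bit integers."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

class PlainBonsaiTrie:
    def __init__(self, capacity_bits, sigma):
        self.m = 1 << capacity_bits          # power of two, so "mod m" is "& (m - 1)"
        self.sigma = sigma
        self.slots = [None] * self.m         # slot i holds the key (parent id, char)
        self.root = 0
        self.slots[0] = (-1, -1)             # sentinel key for the root (assumption)
        self.size = 1

    def _start(self, u, c):
        return splitmix64(u * self.sigma + c) & (self.m - 1)

    def getchild(self, u, c):
        i = self._start(u, c)
        while self.slots[i] is not None:     # linear probing until a vacant slot
            if self.slots[i] == (u, c):
                return i                     # the location is the node id
            i = (i + 1) & (self.m - 1)
        return None

    def addchild(self, u, c):
        assert self.size + 1 < self.m, "this sketch does not grow the table"
        i = self._start(u, c)
        while self.slots[i] is not None:
            i = (i + 1) & (self.m - 1)
        self.slots[i] = (u, c)
        self.size += 1
        return i

t = PlainBonsaiTrie(capacity_bits=4, sigma=256)
a = t.addchild(t.root, ord("a"))
b = t.addchild(a, ord("b"))
print(t.getchild(t.root, ord("a")) == a, t.getchild(a, ord("c")))
```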
4.2. Compact Trie Representation
CBT reduces the space usage of PBT with the compact hashing technique (Knuth, 1998). Locating nodes on a compact hash table is identical to PBT with the difference that CBT uses a bijective transform that maps a key to its hash value and its quotient . Instead of , the compact hash table stores only its quotient in . The hash value can be restored from the initial address and the quotient , where is the first empty slot at or after the initial address . The original key can also be restored from the hash value since is bijective. Therefore, addchild and getchild can be performed in the same manner as PBT if the corresponding initial address can be identified from the location .
The remaining problem is how to identify the corresponding initial address from . Poyias et al. (Poyias et al., 2018) solved this problem by introducing a displacement array such that keeps the number of probes from to , that is, . Given a location , one can compute the corresponding initial address with . Although a value in is at most , the average value becomes small if is fully independent and uniformly random and the load factor is small. Poyias et al. (Poyias et al., 2018) demonstrated that can be represented in bits using CDRW (Compact Dynamic ReWritable) arrays. As takes bits for the quotients, CBT can represent in expected bits of space.
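The interplay of quotient, displacement, and initial address can be made concrete with a toy Python sketch. All parameters here are assumptions chosen for illustration: 16-bit keys, a table of 16 slots, a placeholder bijection (multiplication by an odd constant), and the convention that the low bits of the hash value form the initial address while the high bits form the quotient. `pow` with a negative exponent requires Python 3.8 or later.

```python
W, M_BITS = 16, 4                 # assumed key width and table size (m = 2**M_BITS)
M = 1 << M_BITS
P = 40503                         # odd, so x -> x * P mod 2**W is a bijection (placeholder)
P_INV = pow(P, -1, 1 << W)        # multiplicative inverse of P modulo 2**W

Q = [None] * M                    # quotients: W - M_BITS bits per slot
D = [0] * M                       # displacements: number of probes past the initial address

def transform(key):
    h = (key * P) & ((1 << W) - 1)
    return h & (M - 1), h >> M_BITS          # (initial address, quotient)

def put(key):
    """Store only the quotient of `key`; return the slot that was used."""
    i, quo = transform(key)
    d = 0
    while Q[i] is not None:                  # linear probing
        i, d = (i + 1) & (M - 1), d + 1
    Q[i], D[i] = quo, d
    return i

def restore(i):
    """Recover the original key from the slot index, its quotient, and its displacement."""
    addr = (i - D[i]) & (M - 1)              # initial address
    h = (Q[i] << M_BITS) | addr              # hash value
    return (h * P_INV) & ((1 << W) - 1)      # invert the bijection

keys = [1, 17, 33, 4660, 65535]
slots = [put(k) for k in keys]
print([restore(i) for i in slots])
```

The table stores 12-bit quotients instead of 16-bit keys, which is the saving that compact hashing aims for. The displacement values stay small as long as the load factor is moderate.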
Practical Representation of the Displacement Array
The representation of with the CDRW array seems impractical. Poyias et al. (Poyias et al., 2018) gave an alternative practical representation, where is represented by three data structures , and as follows.
- (1)
is a simple array of length in which each element uses bits for a constant . 2. (2)
is a compact hash table (CHT) described by Cleary (Cleary, 1984), which stores keys from and values from for a constant . The keys are stored in a closed hash table of length through the compact hashing technique (Knuth, 1998), where is a power of two (a property that is in common with ). In detail, the hash table consists of
- •
a bijective transform ,
- •
an integer array of length to store the quotients of the keys (i.e., entry indices of ) representable in bits,
- •
an integer array of length to store displacement values of representable in bits, and
- •
two bit arrays each of length storing the displacement values of the quotients in (not to be confused with the displacement values stored in ).
On inserting a key , we store its quotient in the first vacant slot in starting at the initial address . The collisions in are therefore resolved with linear probing. However, this collision resolution poses the same problem as in CBT, as additional displacement information is required to restore the initial address of a stored quotient in . Cleary solves this problem by using two bit arrays (see (Cleary, 1984)). Finally, stores the value associated with the key whose quotient is stored in . Since uses bits of space, uses bits of space in total. 3. (3)
is a standard associative array that maps keys from to values from . In our implementation, is a closed hash table with linear probing. Given is the capacity of , takes bits.
The representation of the entry for an integer depends on its actual value:
- (1)
If , then we store in the bits of . 2. (2)
If , we represent by the key-value pair stored in . 3. (3)
Finally, if , we represent by the key-value pair stored in .
In the experiments, we set and . We set the initial capacities of and to and , respectively. We set the maximum load factor of and to 0.9. If the actual load factor of (resp. ) reaches the maximum load factor 0.9, we double the size of (resp. of ).
Design of the Bijective Transform
Since we assume that , , and are powers of two, the bijective transform is for some . We design this function as the concatenation of two bijective functions , where for an integer larger than and for a large prime smaller than . is based on the XorShift random number generators (Marsaglia, 2003), where the inverse function is given by . The inverse function of is given by , where is the multiplicative inverse of such that (see (Köppl et al., 2020) for details). By construction, the inverse function of is . Our hash function is inspired by the SplitMix algorithm (Steele Jr et al., 2014).
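A bijection of this shape can be checked directly in Python. The sketch below is an illustration under assumptions: a 32-bit universe, the shift 17, the multiplier $2^{32}-5$, and the composition order (xorshift first, multiplication second) are all chosen here and are not taken from the text.

```python
W = 32                               # assumed width of the universe
MASK = (1 << W) - 1
K = 17                               # shift larger than W / 2
P = 4294967291                       # 2**32 - 5: odd, hence invertible modulo 2**W
P_INV = pow(P, -1, 1 << W)           # requires Python 3.8+

def f1(x):
    """XorShift step; it is its own inverse because 2 * K >= W."""
    return x ^ (x >> K)

def f2(x):
    return (x * P) & MASK

def f2_inv(x):
    return (x * P_INV) & MASK

def f(x):                            # assumed composition order
    return f2(f1(x))

def f_inv(y):
    return f1(f2_inv(y))

print(f_inv(f(0xDEADBEEF)) == 0xDEADBEEF)
```

Applying `f1` twice XORs the input with itself shifted by `2 * K` bits, which is zero for `W`-bit inputs, so no separate inverse routine is needed for the xorshift step.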
4.3. Linear-Time Growing Algorithm
If the load factor of the hash table reaches the maximum load factor, we create a new hash table (and a new displacement array for CBT) of twice the length and relocate all nodes to it. Since the hash key of a node depends on the position of its parent in the new table, we can relocate a node only after having relocated all its ancestors. This can be done in a top-down traversal (e.g., in BFS or DFS order) of the tree during which all children of a node are successively selected. However, because selecting all children of a node is performed by checking getchild for all possible characters in the alphabet, the expected time of a relocation based on a top-down traversal grows with the alphabet size, and this approach is therefore only practical for tiny alphabets. Here we describe a bottom-up approach that is based on the approach by Arroyuelo et al. (Arroyuelo et al., 2017). This approach, called growing algorithm, runs in expected linear time. Its pseudocode is shown in Algorithm 1.
Given a trie with a hash table, the algorithm constructs an equivalent trie with a hash table of twice the length. To explain the algorithm, we define two operations: one returning the branching character of a given node and one returning the id of its parent. They can be computed in constant time because PBT explicitly stores the branching character and the parent id as the hash key. CBT can also restore the hash key from the stored quotient and the displacement value.
In the growing algorithm, we initially define two auxiliary arrays Map and Done: Map is an integer array and Done is a bit array, each with one entry per slot of the old hash table. We set $\textsf{Done}[i]$ to 1 after relocating the node stored in slot $i$ of the old table. We keep the invariant that whenever $\textsf{Done}[i]=1$, then $\textsf{Map}[i]$ stores the position in the new table of the node stored in slot $i$ of the old table. All bits in Done are initialized to 0 except for the root. We scan the old table from left to right and perform the following steps for each non-vacant slot $i$. We first set a cursor to the node in slot $i$ and a path string to the empty string, and then climb up the path from this node to the root. We prematurely stop when encountering a node whose Done bit is 1. In this case, all ancestors of that node have already been relocated such that there is no need to visit them again. Subsequently, we walk down the computed path while relocating the visited nodes. Since we do not reprocess already visited nodes, we can perform the node relocation in expected linear time for a constant load factor.
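The bottom-up relocation can be sketched in Python for the plain representation. This is a simplified illustration and not Algorithm 1 itself: it records the slots of the climbed path instead of a string of branching characters, it packs keys as `parent * SIGMA + c`, and it keeps the root in slot 0 of both tables. All of these are assumptions.

```python
MASK64 = (1 << 64) - 1
SIGMA = 256                     # assumed alphabet size used to pack (parent, char)
ROOT_KEY = (-1, -1)             # sentinel key of the root, kept in slot 0 (assumption)

def mix(x):
    """SplitMix64 finalizer."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def place(table, parent, c):
    """Store the key (parent, c) with linear probing and return its slot."""
    i = mix(parent * SIGMA + c) & (len(table) - 1)
    while table[i] is not None:
        i = (i + 1) & (len(table) - 1)
    table[i] = (parent, c)
    return i

def find(table, parent, c):
    i = mix(parent * SIGMA + c) & (len(table) - 1)
    while table[i] is not None:
        if table[i] == (parent, c):
            return i
        i = (i + 1) & (len(table) - 1)
    return None

def grow(old):
    """Relocate every node of `old` into a table of twice the length."""
    m = len(old)
    new = [None] * (2 * m)
    new[0] = ROOT_KEY
    mapping, done = [None] * m, [False] * m
    mapping[0], done[0] = 0, True            # the root is relocated first
    for i in range(m):
        if old[i] is None or done[i]:
            continue
        path, v = [], i
        while not done[v]:                   # climb until a relocated ancestor is met
            path.append(v)
            v = old[v][0]
        for w in reversed(path):             # walk down, relocating parents before children
            parent, c = old[w]
            mapping[w] = place(new, mapping[parent], c)
            done[w] = True
    return new, mapping

old = [None] * 8
old[0] = ROOT_KEY
a = place(old, 0, 97)
b = place(old, a, 98)
c = place(old, a, 99)
d = place(old, b, 100)
new, mp = grow(old)
print(len(new), find(new, mp[a], 98) == mp[b])
```

Every node is appended to a path exactly once, so the total number of climbing and relocation steps is proportional to the number of nodes, independent of the alphabet size.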
Extra Working Space
Algorithm 1 maintains the auxiliary arrays Map of bits, Done of bits and of bits, where is the height of . Thus, the extra working space is bits if we create the auxiliary arrays naively. However, the working space of Map can be shared with because for is no longer needed. In PBT, the working space of Map can be fully placed in because the space of is bits and is at least in practice.999Even for , a simple bit array suffices. Based on this in-place approach, the extra working space of Algorithm 1 is only bits, taking account for Done and in PBT. In practice, the space of is negligible because is bounded by the maximum length of keywords in and .
In CBT, uses only bits. As in most scenarios, it is difficult to completely store Map in ; however, we can also use the space of , which is bits. If , Map can be fully placed in and ; otherwise, the extra working space of bits for Map is needed in addition to that of Done and .
4.4. NLM Data Structures
In m-Bonsai, the node ids are hash table addresses whose randomness depends on the hash function used. As the task of an NLM data structure is to map node ids to their respective node labels, an appropriate NLM data structure for m-Bonsai is a dynamic associative array that stores node label strings for arbitrary integer keys from this address range. In what follows, we first present a plain approach and then show how to compactify it.
Plain NLM
The simplest approach is to use a pointer array of length such that stores the pointer to or if no node with id exists. We refer to the approach as PLM (Plain Label Map). Figure 5a shows an example of PLM. Given a node of id , PLM can obtain through in time. However, takes bits, where the word size is . This space consumption is obviously large.
Sparse NLM
We present an alternative compact approach that reduces the pointer overhead of PLM in a manner similar to Google’s sparse hash table (Google Inc., 2005). In this approach, we divide the node labels into groups of labels over the ids. That is, the first group consists of , the second group consists of , and so on. Moreover, we introduce a bitmap such that iff exists. We concatenate all node labels with of the same group together, sorted in the id order. The length of becomes by maintaining, for each group, a pointer to its concatenated label string. We refer to the approach as SLM (Sparse Label Map).
With the array and the bitmap , we can access as follows: If , we are done since does not exist in this case; otherwise, we obtain the concatenated label string storing from , where . Given for the bit chunk , is the -th node label of the concatenated label string. As , counting the occurrences of 1s in chunk is supported in constant time using the popcount operation (González et al., 2005). It is left to explain how to search in the respective concatenated label string. For that we present two representations of the concatenated label strings:
- (1)
If the node labels are straightforwardly concatenated (e.g., the second group in Figure 5a), we can sequentially count the terminators to skip the first $(j-1)$ node labels and reach $L_{i}$. Locating $L_{i}$ in this way takes $\mathcal{O}(\ell\Lambda)$ time, where $\Lambda$ again denotes the maximum length of all node labels.
- (2)
We can shorten the scan time with the skipping technique used in array hashing (Askitis and Zobel, 2005). This technique puts the length of each node label in front of it via some prefix encoding such as VByte (Williams and Zobel, 1999). Note that we can then omit the terminator of each node label. The skipping technique allows us to jump ahead to the start of the next node label; therefore, the scan is supported in $\mathcal{O}(\ell)$ time. Figure 5b shows an example of SLM with the skipping technique.
Regarding the space usage of SLM, and use and bits, respectively. For , the total space usage becomes bits, which is smaller than bits in PLM; however, the access time is .
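The sparse label map with the skipping technique can be sketched as follows. The class and method names are hypothetical, the bitmap is kept as one Python integer per group, and a single length byte replaces the VByte encoding mentioned above, which limits labels to 255 bytes in this sketch.

```python
class SparseLabelMap:
    def __init__(self, m, group_size):
        self.l = group_size
        n_groups = (m + group_size - 1) // group_size
        self.bits = [0] * n_groups                             # one l-bit chunk per group
        self.groups = [bytearray() for _ in range(n_groups)]   # concatenated labels

    def _locate(self, i):
        g, r = divmod(i, self.l)
        # popcount of the chunk below bit r = number of labels stored before id i
        rank = bin(self.bits[g] & ((1 << r) - 1)).count("1")
        pos = 0
        for _ in range(rank):
            pos += 1 + self.groups[g][pos]       # skip the length byte and the label
        return g, r, pos

    def get(self, i):
        g, r, pos = self._locate(i)
        if not (self.bits[g] >> r) & 1:
            return None                          # no node with id i
        n = self.groups[g][pos]
        return bytes(self.groups[g][pos + 1:pos + 1 + n])

    def set(self, i, label):
        assert len(label) < 256, "single length byte in this sketch"
        g, r, pos = self._locate(i)
        assert not (self.bits[g] >> r) & 1, "label already stored"
        self.groups[g][pos:pos] = bytes([len(label)]) + label   # splice in id order
        self.bits[g] |= 1 << r

slm = SparseLabelMap(m=32, group_size=8)
slm.set(13, b"cs")
slm.set(9, b"lly")
slm.set(15, b"")
slm.set(3, b"technology")
print(slm.get(9), slm.get(10), bytes(slm.groups[1]))
```

Inserting a label into the middle of a group shifts the bytes behind it, which gives an intuition for why insertion into SLM is more costly than into PLM when ids arrive in random order.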
5. Representation Based on FK-hash
This section presents our DynPDT representation approaches based on FK-hash (Fischer and Köppl, 2017). The basic idea of FK-hash is the same as that of m-Bonsai. The difference is that FK-hash incrementally assigns node ids and explicitly stores them as values in the hash table, while m-Bonsai uses the locations of the stored elements of the hash table as node ids. Although FK-hash uses more space than m-Bonsai, the assignment of node ids simplifies the growing algorithm.
Outline
In the same manner as m-Bonsai, we consider a plain and a compact representation based on FK-hash. In Section 5.1 we present both representations. In Section 5.2 we propose NLM data structures designed for FK-hash.
5.1. Trie Representations
Like m-Bonsai, FK-hash locates nodes on a closed hash table of length , but does not use the addresses of as node ids. FK-hash incrementally assigns node ids from zero and explicitly stores them in an integer array of length . In other words, when creating the -th node by storing it in , its node id is , which is stored in . In a way similar to m-Bonsai, is performed as follows: We compose the key , hash it with , and then search the first vacant slot from by linear probing. Given is the currently largest node id, we assign the id to the new child, and set and . The displacement information is maintained analogously to m-Bonsai.
In the same manner as m-Bonsai, we can think of two representations depending on whether is compactified or not. The non-compact one is referred to as PFKT (Plain FK-hash Trie). The compact one is referred to as CFKT (Compact FK-hash Trie). Compared to PBT and CBT, PFKT and CFKT keep an additional integer array and require additional bits of space.
Table Growing
An advantage of FK-hash is that growing the hash table is done in the same manner as in standard closed hash tables. In detail, can be enlarged by scanning nodes on from left to right and relocating the nodes in a new hash table of length . The growing algorithm takes expected time. This time complexity is identical to that of Algorithm 1; however, the growing algorithm of FK-hash is faster in practice because of its simplicity. In addition, no auxiliary data structure is needed like Map and Done used by Algorithm 1.
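The following Python sketch illustrates the plain FK-hash variant and its simple growing step. It rests on assumptions that are not taken from the text: keys are packed as `u * sigma + c`, the root has id 0 and no table entry, the maximum load factor is 0.9, and `None` stands for $\bot$.

```python
MASK64 = (1 << 64) - 1

def mix(x):
    """SplitMix64 finalizer."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

class PlainFKHashTrie:
    def __init__(self, capacity_bits=3, sigma=256, max_load=0.9):
        self.m, self.sigma, self.max_load = 1 << capacity_bits, sigma, max_load
        self.keys = [None] * self.m      # hash keys (parent id, char)
        self.ids = [None] * self.m       # explicitly stored node ids
        self.n = 1                       # next id to assign; id 0 is the root

    def _probe(self, u, c):
        i = mix(u * self.sigma + c) & (self.m - 1)
        while self.keys[i] is not None and self.keys[i] != (u, c):
            i = (i + 1) & (self.m - 1)
        return i

    def getchild(self, u, c):
        i = self._probe(u, c)
        return self.ids[i] if self.keys[i] is not None else None

    def addchild(self, u, c):
        if self.n + 1 > self.max_load * self.m:
            self._grow()
        i = self._probe(u, c)
        assert self.keys[i] is None, "child already exists"
        self.keys[i], self.ids[i] = (u, c), self.n
        self.n += 1
        return self.n - 1

    def _grow(self):
        # Keys refer to parent ids, not to slots, so a plain left-to-right rehash suffices.
        entries = [(k, v) for k, v in zip(self.keys, self.ids) if k is not None]
        self.m *= 2
        self.keys, self.ids = [None] * self.m, [None] * self.m
        for (u, c), v in entries:
            i = self._probe(u, c)
            self.keys[i], self.ids[i] = (u, c), v

t = PlainFKHashTrie()
u, created = 0, []
for ch in b"technology":
    u = t.addchild(u, ch)
    created.append(u)
print(created, t.m)
```

Node ids survive the enlargement unchanged, so a node label map indexed by these ids does not need to be touched when the table grows.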
5.2. NLM Data Structures
Like in Section 4.4, we introduce PLM and SLM adapted to FK-hash. Figure 6 shows an example for each of them. Although PLM in FK-hash is basically identical to that in m-Bonsai, SLM can be simplified as follows.
In m-Bonsai, it is necessary to identify whether $L_{i}$ exists and the rank of $L_{i}$ in the group because node ids are randomly assigned; therefore, we introduced a bitmap over the node ids and utilized the popcount operation. In FK-hash, however, such a bitmap is not needed because node ids are incrementally assigned. Put simply, a node label $L_{i}$ is stored in the group of id $\lfloor i/\ell\rfloor$ and located at position $i \bmod \ell$ in the group (counting from zero). When using the skipping technique, care has to be taken for the step nodes whose node labels are empty. For each of them, we put the length 0 in its corresponding concatenated label string. For example, we put a ’0’ in the second concatenated label string for the step node in Figure 6b. Finally, we can insert a new node label by appending it to the last concatenated label string.
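Because ids are assigned incrementally, the label map reduces to an append-only structure. The Python sketch below uses hypothetical names and a single length byte in place of VByte. The labels are those of the running example without terminators, with an empty label standing for a step node.

```python
class AppendOnlyLabelMap:
    def __init__(self, group_size):
        self.l = group_size
        self.groups = []                     # one concatenated label string per group
        self.n = 0                           # number of labels stored so far

    def append(self, label):
        """Store the label of the next node id and return that id."""
        assert len(label) < 256, "single length byte in this sketch"
        if self.n % self.l == 0:             # the last group is full: open a new one
            self.groups.append(bytearray())
        self.groups[-1] += bytes([len(label)]) + label
        self.n += 1
        return self.n - 1

    def get(self, i):
        g, r = divmod(i, self.l)             # group id and position inside the group
        buf, pos = self.groups[g], 0
        for _ in range(r):
            pos += 1 + buf[pos]              # skip one length-prefixed label
        return bytes(buf[pos + 1:pos + 1 + buf[pos]])

lm = AppendOnlyLabelMap(group_size=4)
labels = [b"technology", b"cs", b"ue", b"lly", b"", b"cal"]
ids = [lm.append(x) for x in labels]
print(ids, bytes(lm.groups[1]))
```

The empty label of the step node is stored as a single zero length byte, so positions inside a group remain consistent with the node ids.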
6. Experiments
In this section we evaluate the practical performance of DynPDT. The source code for our experiments is available at https://github.com/kampersanda/dictionary_bench.
6.1. Setup
We conducted all experiments on one core of a quad-core Intel Xeon CPU E5-2680 v2 clocked at 2.80 GHz in a machine with 256 GB of RAM, running the 64-bit version of CentOS 6.10 based on Linux 2.6. We implemented our data structures in C++17. We compiled the source code with g++ (version 7.3.0) in optimization mode -O3. We used 4-byte integers for the values associated with the keywords.
Datasets
Our benchmarks are based on the following eight real-world datasets:
- •
GeoNames consists of 7 million different names for the geographic points provided by the GeoNames database (http://download.geonames.org/export/dump/). Managing such geographic identifiers within a limited resource is essential in modern geographic information systems as described in (Martínez-Prieto et al., 2016). We obtained the geographic names by extracting the asciiname column of the GeoNames dump in the same manner as (Martínez-Prieto et al., 2016).
- •
AOL consists of 10 million different search queries in the AOL database, which is a huge collection of 20 million search queries from 650,000 users sampled over three months (http://www.cim.mcgill.ca/~dudek/206/Logs/AOL-user-ct-collection/). The dataset contains keywords written in natural English and has often been used to benchmark search algorithms such as (Grossi and Ottaviano, 2014).
- •
Wiki consists of 14 million different page titles from the English Wikipedia dump of September 2018 (https://dumps.wikimedia.org/enwiki/). As the dataset contains various special characters encoded in UTF-8, the alphabet size is larger than that of AOL. It is also a widely used dataset to benchmark search algorithms such as (Grossi and Ottaviano, 2014; Arz and Fischer, 2018; Kanda et al., 2017a).
- •
DNA consists of all 12-mers (i.e., substrings of length 12) found in the DNA dataset from the Pizza&Chili corpus (http://pizzachili.dcc.uchile.cl/texts/dna/). Among the used datasets, it has the smallest alphabet and the shortest keywords. The number of keywords is 15 million. In bioinformatics, popular alignment software needs to manage such keywords within limited space as described in (Martínez-Prieto et al., 2016).
- •
LUBMS consists of 53 million different URIs extracted from the RDF dataset generated by the Lehigh University Benchmark (Guo et al., 2005) for 1,600 universities (the dataset is distributed under the name ‘DS5’ at https://exascale.info/projects/web-of-data-uri/). Modern RDF systems (Wylot et al., 2011; Wylot et al., 2014) encode URIs in a huge set into unique integers by using a dynamic keyword dictionary. The dataset is evaluated in (Mavlyutov et al., 2015) to analyze the performances of RDF systems.
- •
LUBML consists of 230 million different URIs extracted from the RDF dataset generated by the Lehigh University Benchmark (Guo et al., 2005) for 7,000 universities (although this dataset is not distributed, one can obtain the identical dataset through the LUBM data generator, called UBA, at http://swat.cse.lehigh.edu/projects/lubm/). The dataset is a larger version of LUBMS. It is also evaluated in (Mavlyutov et al., 2015).
- •
UK consists of 40 million different URLs obtained from a 2005 crawl of the .uk domain performed by UbiCrawler (Boldi et al., 2004) (http://law.di.unimi.it/webdata/uk-2005/). URLs are traditionally used to benchmark search algorithms for long strings such as (Grossi and Ottaviano, 2014; Arz and Fischer, 2018; Kanda et al., 2017a; Askitis and Sinha, 2010). Also, the modern Web crawler (Ueda et al., 2013) manages a huge set of URLs by using a dynamic keyword dictionary.
- •
WebBase consists of 118 million different URLs of a 2001 crawl performed by the WebBase crawler (Hirai et al., 2000) (http://law.di.unimi.it/webdata/webbase-2001/). The dataset is larger than UK and also used in previous experiments of keyword dictionaries such as (Grossi and Ottaviano, 2014).
Table 1 summarizes relevant statistics for each dataset.
6.2. Average Height
We evaluate the average height of the DynPDT built on our datasets. The average height of is the arithmetic mean of the heights of all nodes over the number of nodes, omitting step nodes in the calculation. Although the average height is an important measure related to the average number of random accesses, we cannot a priori predict the average height of DynPDT because this number depends on the insertion order of the keywords. To reason about the quality of the average height, we study it in relation to the following known lower and upper bounds on it: The lower bound is the average height of the path-decomposed trie created by the centroid path decomposition (Alexandre, 2016, Corollary 3). The upper bound is the average height of the path-decomposed trie created by always choosing the child whose subtrie has the fewest number of leaves.
Table 2 shows the experimental results of the average heights of and for all the datasets. To analyze the performance of DynPDT in our experiments, we constructed DynPDT dictionaries by inserting keywords in random order. For that, we shuffled the dataset with the Fisher–Yates shuffle algorithm (Durstenfeld, 1964). Naturally, the actual average heights of are between their lower and upper bounds, and those of are the same as AveLen. The upper bounds are more than twice as large as the lower bounds for AOL, UK, and WebBase; however, the upper bounds were up to 5.4x smaller than the average heights of due to the path decomposition, especially for long keywords such as URIs. Therefore, the incremental path decomposition can make dynamic keyword dictionaries more cache-friendly, especially for long keywords even if the insertion order is inconvenient and the average height is close to the upper bound.
6.3. Parameter for Step Nodes
The parameter influences the number of step nodes. We analyze the space and time performance of DynPDT when varying the parameter . In this experiment, we constructed DynPDT dictionaries for each parameter on the datasets Wiki, LUBMS and UK, and observed the working space and the construction time. For the DynPDT representation, we tested the combination of CFKT and SLM with , referred to as PDT-CFK in the following. As described in Section 6.2, the dictionary was constructed by inserting keywords in random order. The working space was measured by checking the maximum resident set size (RSS) required during the online construction.
Table 3 shows the experimental results for construction. Since has a direct impact on , which influences the space usage of , the working space depends on the value of . Although this dependency looks like and the taken space are in direct correlation, for Wiki and UK, the working spaces for (i.e., 0.36 GiB and 1.22 GiB respectively) were not the smallest. For Wiki, the reason for this is that many step nodes raised the load factor and involved an additional enlargement of the hash table. Specifically, the enlargements were conducted nine times with , although they were conducted eight times with . For UK, this reason is that the high load factor caused by a huge number of step nodes raised the average displacement value stored in and involved the use of and , although no additional enlargement was conducted. Regarding the time performance, this huge number of step nodes slowed down the construction. Therefore, a too small parameter can involve large space requirements and long construction times. On the other hand, when , the working space and construction time do not significantly vary.
From this observation, we derive two facts for : On the one hand, the most important recommendation is not to choose a parameter that is too small. On the other hand, choosing a large parameter is not a significant problem because the space and time performance do not significantly decrease as grows. For example, when on Wiki, the proportion of step nodes is 0.12%; however, even with a larger parameter such as 512 or 1024, the working space and construction time are almost the same. Table 4 shows Steps for each parameter and the average length of the node labels (denoted by AveNLL) for all the datasets. Even for long keywords like URLs (i.e., UK), AveNLL is bounded by 18.0 and Steps is within 1% of all nodes when . Among the tested values for , we suggest setting to 32 or 64 for keywords whose length is not much longer than that of the URL datasets.
6.4. Comparison among DynPDT Representations
We compared the performance of our DynPDT representations, for which we benchmarked the following six combinations:
- PDT-PB is the combination of PBT and PLM,
- PDT-SB is the combination of PBT and SLM,
- PDT-CB is the combination of CBT and SLM,
- PDT-PFK is the combination of PFKT and PLM,
- PDT-SFK is the combination of PFKT and SLM, and
- PDT-CFK is the combination of CFKT and SLM.
We evaluated the working space during the construction and the running times of insert and lookup. As in Section 6.3, we constructed each dictionary and measured its working space. To measure the lookup time, we chose 1 million random keywords from each dataset. The running times are the average of 10 runs. For SLM, we tested . For , we chose the smallest value among those from Table 4 where Steps is less than 1%.
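The measurement procedure described above can be sketched as follows. This is a minimal illustration and not the benchmark code used for the experiments: Python's built-in dict stands in for the dictionary under test, the function names build_dictionary and average_lookup_time are hypothetical, the URL-like toy keywords are made up, and drawing the query keywords without replacement is an assumption.

```python
import random
import time


def build_dictionary(keywords):
    """Insert keywords one by one; return (dictionary, average insert time in microseconds)."""
    dictionary = {}
    start = time.perf_counter()
    for value, key in enumerate(keywords):
        dictionary[key] = value
    elapsed = time.perf_counter() - start
    return dictionary, elapsed / max(len(keywords), 1) * 1e6


def average_lookup_time(dictionary, queries, runs=10):
    """Return the time per lookup in microseconds, averaged over `runs` passes."""
    per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        for key in queries:
            _ = dictionary[key]
        elapsed = time.perf_counter() - start
        per_run.append(elapsed / max(len(queries), 1) * 1e6)
    return sum(per_run) / runs


if __name__ == "__main__":
    # Hypothetical toy keyword set; the experiments sample 1 million keywords per dataset.
    keywords = [f"http://example.org/page/{i}" for i in range(100_000)]
    dictionary, insert_us = build_dictionary(keywords)
    rng = random.Random(42)  # fixed seed so that the query set is reproducible
    queries = rng.sample(keywords, 10_000)
    lookup_us = average_lookup_time(dictionary, queries)
    print(f"insert: {insert_us:.3f} us/key, lookup: {lookup_us:.3f} us/key")
```

Averaging over several passes reduces the influence of one-off effects such as cold caches on the reported per-operation time.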
Figure 7 shows the experimental results for GeoNames and WebBase. Regarding the representations using SLM, the working space is the largest but the running times are the shortest with , and vice versa with . In other words, for each representation in the plots, the rightmost and lowest result is the one with , and the leftmost and highest result is the one with .
We observe the following:
- SLM significantly reduces the working space of PLM. Compared to PDT-PB, PDT-SB is 57–65% smaller for GeoNames and 46–56% smaller for WebBase. Compared to PDT-PFK, PDT-SFK is 56–61% smaller for GeoNames and 47–52% smaller for WebBase.
- Regarding the representations based on m-Bonsai, insertion with SLM is slower than with PLM because inserting a new node label into the group is costly. When , the insertion of PDT-SB is 29–163% slower than that of PDT-PB; however, the lookup times are competitive.
- Regarding the representations based on FK-hash, SLM with is competitive with PLM with respect to the insert time because the update algorithm is simple. Also, the lookup times are competitive.
- The time performance of SLM with large group sizes ( or ) is worse than that of SLM with small group sizes ( or ). For example, for GeoNames, PDT-SB with is 19% smaller but 81–105% slower than PDT-SB with .
- The compact trie representations CBT and CFKT are more lightweight but slower than the plain representations PBT and PFKT; however, the differences are small. For example, PDT-CB is 12% smaller but 8–11% slower than PDT-SB for GeoNames.
- The representations based on m-Bonsai are smaller than those based on FK-hash. Also regarding the lookup time, the m-Bonsai representations are faster. However, regarding the insert time, the FK-hash representations are faster because the growing algorithm is simple.
6.5. Comparison with Existing Data Structures
We compare the performance of DynPDT with existing data structures. We exhaustively tested existing implementations of dynamic keyword dictionaries such as open-source dynamic hash containers (Gregory, 2016; Tessil, 2017c, 2016) and recent dynamic trie indexes (Tsuruta et al., 2020; Takagi et al., 2016). However, compared to DynPDT, most of them consumed significantly more space. For our benchmarks, we selected the following four space-efficient implementations (all the experimental results are shown in Appendix A):
- ArrayHash is a cache-conscious hash table with string keys (Askitis and Zobel, 2005).
- HAT is a hybrid data structure of the burst trie (Heinz et al., 2002) and ArrayHash (Askitis and Sinha, 2010).
- Judy is a trie-based dictionary implementation developed at Hewlett-Packard Research Labs (Baskins, 2002).
- Cedar, developed by Yoshinaga (Yoshinaga and Kitsuregawa, 2014), is an efficient dictionary implementation based on dynamic double-array tries (Aoe, 1989).
For ArrayHash and HAT, we used Tessil’s implementations (Tessil, 2017a, b). From the three implementation variations of Cedar, we took one based on a reduced trie (Yoshinaga and Kitsuregawa, 2014) and one based on a prefix trie (Aoe, 1989), and denote them by Cedar-R and Cedar-P, respectively. Cedar-R is suitable for short keywords, whereas Cedar-P is suitable for the general case (we cannot be more concrete here since the efficiency of the heuristics of these data structures does not merely depend on the keyword lengths).
We evaluated the working space and the running times in the same manner as in Section 6.4. Figure 8 shows the experimental results for the four datasets GeoNames, AOL, Wiki, and DNA consisting of short keywords. Figure 9 shows the experimental results for the four datasets LUBMS, LUBML, UK, and WebBase consisting of long keywords. For our methods, we only plot the results of PDT-SB, PDT-CB, PDT-SFK and PDT-CFK, setting to 8, 16, or 32. To keep the focus on the competitive contestants in the plots, we omitted some weaker instances, namely the DynPDT dictionaries with and the dictionaries with PLM. The former are too slow, while the latter take too much working space. Only for DNA, we plotted the results of Cedar-R instead of Cedar-P because Cedar-R is superior on that instance. For LUBML and WebBase, we were not able to run our experiments with Cedar because the resulting number of trie nodes becomes too large to be representable in Cedar, which is based on 32-bit pointers. For the long keywords (Figure 9), we omitted the results of ArrayHash because its working space is too large. For example, ArrayHash is 143% larger than HAT for LUBMS.
Based on Figure 8 showing the evaluation for short keywords, we can state the following observations:
- The DynPDT dictionaries are the smallest. PDT-CB for is 25–48% smaller than the existing smallest data structures (Cedar-R for DNA and HAT for the others). PDT-CFK with is 29–39% smaller than HAT for the datasets except DNA.
- Regarding the insert time, HAT is the fastest. Except for DNA, the DynPDT dictionaries based on FK-hash, PDT-SFK and PDT-CFK, are competitive with the other data structures.
- Regarding the lookup time, ArrayHash is the fastest. Except for DNA, the DynPDT dictionaries based on m-Bonsai, PDT-SB and PDT-CB, are competitive with Judy.
- For DNA consisting of short keywords, the DynPDT dictionaries are not efficient. When path decomposition is applied to a trie with only short paths, its merits are outweighed by the additional burden of representing the trie with two separate data structures, one for its path-decomposed trie topology and one for its node labels.
Based on Figure 9 showing the evaluation for long keywords, we can state the following observations:
- The DynPDT dictionaries are the smallest for all the datasets. When , PDT-CB is 49–60% smaller than Cedar-P for LUBMS and UK, and is 64–68% smaller than Judy for LUBML and WebBase. When , PDT-CFK is 42–49% smaller than Cedar-P for LUBMS and UK, and is 58–59% smaller than Judy for LUBML and WebBase.
- Regarding the insert time, PDT-SFK is competitive with the other data structures.
- Regarding the lookup time, HAT is the fastest although its working space is large. Compared to PDT-SB with , HAT is 40–78% faster but 48–61% larger.
- In many cases, the DynPDT dictionaries outperform Judy and Cedar-P. For example, PDT-SFK with is 48% smaller and 4–25% faster than Judy for WebBase. PDT-CB with is 48% smaller and 15–35% faster than Cedar-P for LUBMS.
Summary
Throughout all dataset instances, DynPDT is the smallest data structure. Especially for long keywords such as URIs, our dictionaries are space-efficient and fast thanks to the path decomposition; however, they are not efficient for extremely short keywords because the path decomposition does not work well on such instances. In summary, DynPDT is useful for in-memory applications handling massive datasets consisting of long keywords.
For example, the RDF database system Diplodocus (Wylot et al., 2011; Wylot et al., 2014) encodes every URI as an integer number through a dynamic keyword dictionary because the fixed-size integers can be handled more efficiently than the original strings having variable lengths. Since the encoding time is a significant part of the query execution time on the Diplodocus system, Mavlyutov et al. (Mavlyutov et al., 2015) experimentally compared a series of dynamic keyword dictionaries. In fact, the LUBMS and LUBML datasets are exactly those evaluated in (Mavlyutov et al., 2015). They concluded that HAT is a good data structure taking aspects like working space and time performance into account (note that Judy and Cedar were not evaluated in that study). However, as demonstrated in our experiments, our DynPDT dictionaries can maintain the URI datasets in space up to 74% smaller than HAT, while keeping competitive insertion times. Although DynPDT’s slow lookup time is a drawback compared to HAT, maintaining massive RDF database systems in main memory is essential, and we believe that DynPDT’s high memory efficiency will contribute to the future of Semantic Web applications.
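The URI-to-integer encoding mentioned above can be illustrated with a small sketch. It is not the Diplodocus implementation: the class name UriEncoder, its methods, and the example URIs are hypothetical, and Python's built-in dict stands in for the dynamic keyword dictionary. The sketch only shows why the dictionary must be dynamic: a URI that has not been seen before is inserted on the fly and receives the next free integer.

```python
class UriEncoder:
    """Assigns consecutive integer IDs to URIs on first sight (hypothetical helper)."""

    def __init__(self):
        self._ids = {}    # keyword dictionary: URI -> integer ID
        self._uris = []   # reverse mapping: integer ID -> URI

    def encode(self, uri):
        """Return the ID of `uri`, inserting it first if it is new."""
        uri_id = self._ids.get(uri)
        if uri_id is None:
            uri_id = len(self._uris)
            self._ids[uri] = uri_id
            self._uris.append(uri)
        return uri_id

    def decode(self, uri_id):
        """Return the URI that was assigned the given ID."""
        return self._uris[uri_id]


if __name__ == "__main__":
    encoder = UriEncoder()
    # A made-up RDF triple (subject, predicate, object).
    triple = ("http://example.org/alice",
              "http://example.org/knows",
              "http://example.org/bob")
    encoded = tuple(encoder.encode(uri) for uri in triple)
    print(encoded)                                     # (0, 1, 2)
    print(encoder.encode("http://example.org/alice"))  # 0, already known
    print(encoder.decode(2))                           # http://example.org/bob
```

In such a setting every query first passes its URIs through encode, which is why both the space and the speed of the underlying dictionary matter.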
7. Conclusion
We presented a novel data structure for dynamic keyword dictionaries, called DynPDT, which is applicable to scalable string data processing. For that, we applied path decomposition and utilized the recent hash-based trie representations m-Bonsai and FK-hash. We demonstrated with experiments on real-world massive datasets that the memory footprint of DynPDT is the smallest within a careful selection of efficient dynamic keyword dictionaries. It is especially efficient for long keywords due to the path decomposition approach.
Our results pave new ways for major improvements in various existing systems because the dynamic keyword dictionary problem is a common task in applications such as vocabulary accumulation for inverted-index construction (Heinz et al., 2002), RDF database systems (Wylot et al., 2011; Wylot et al., 2014), in-memory OLTP (online transaction processing) database systems (Leis et al., 2013), Web crawlers (Ueda et al., 2013), and search engines (Brazil Inc., 2019; Busch et al., 2012). DynPDT can contribute to those systems especially by reducing their memory requirements. Although we have put the focus on the keyword dictionary problem in this paper, DynPDT as a general data structure is of independent interest, being useful for applications handling dynamic tries. An interesting application is the LZD compression (Goto et al., 2015; Badkobeh et al., 2017), a variation of the LZ78 compression (Ziv and Lempel, 1978). Since the LZD algorithm maintains long factors (or strings) in a dynamic trie, we are confident that the incremental path decomposition on such a trie will have performance benefits.
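To illustrate how such a compressor drives a dynamic trie, the following sketch computes the classic LZ78 factorization (not LZD, whose factors are formed differently). It is only an illustration under stated assumptions: the function name lz78_factorize is hypothetical, and a Python dict keyed by (parent node, character) pairs stands in for the dynamic trie. No path decomposition is applied here.

```python
def lz78_factorize(text):
    """Return the LZ78 factors of `text` as (referred node, new character) pairs.

    Node 0 is the trie root (the empty factor). The i-th factor creates node i.
    """
    edges = {}       # dynamic trie: (parent node, character) -> child node
    next_node = 1
    factors = []
    node = 0         # current position in the trie
    for ch in text:
        child = edges.get((node, ch))
        if child is not None:
            node = child                   # keep extending the current factor
        else:
            edges[(node, ch)] = next_node  # insert a new leaf into the trie
            next_node += 1
            factors.append((node, ch))
            node = 0                       # restart at the root
    if node != 0:
        factors.append((node, ""))         # the text ended inside a known factor
    return factors


if __name__ == "__main__":
    print(lz78_factorize("abababab"))
    # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')]
```

Every factor costs one traversal and one insertion in the trie, so the trie representation determines both the working space and the speed of the factorization.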
Our future plans for DynPDT are as follows.
- The burst trie developed by Heinz et al. (Heinz et al., 2002) maintains sparse subtries in a trie in dynamic containers of strings by collapsing the subtries. DynPDT would be suited as an alternative container representation to enhance the memory efficiency of the burst trie.
- In our experiments, we implemented the second data structure of the displacement array through the CHT by Cleary (Cleary, 1984), following the original m-Bonsai approach (Poyias et al., 2018). Recently, Köppl et al. (Köppl et al., 2020) developed space-efficient hash tables with separate chaining and compact hashing. Although the CHT needs additional displacement information (i.e., two bit arrays), their hash tables do not need such additional information. We expect that their hash tables are suitable representations of .
Acknowledgements.
We thank Kazuya Tsuruta for kindly providing us with the implementations used in (Tsuruta et al., 2020). We thank the anonymous reviewers for their helpful comments. A part of this work was supported by JSPS KAKENHI Grant Numbers 17J07555 and JP18F18120.
Appendix A Experimental Results
Within the same setting as in Section 6.5, we present an extended evaluation including the following contestants:
- STLHash is the hash table std::unordered_map of the C++ standard library.
- GoogleDense is the hash table implementation google::dense_hash_map of Google (Google Inc., 2005).
- Sparsepp is Gregory Popovitch’s space-efficient hash container implementation derived from Google’s sparse hash table (Gregory, 2016).
- Hopscotch is Tessil’s hash table implementation using hopscotch hashing (Tessil, 2016).
- Robin is Tessil’s hash table implementation using robin hood hashing (Tessil, 2017c).
- ART is Armon Dadgar’s implementation (Dadgar, 2012) of the adaptive radix tree (Leis et al., 2013).
Further, we include the following implementations, which are also used and studied in the experimental section of (Tsuruta et al., 2019):
- PCT-Bit is a packed compact trie using bit parallelism (Takagi et al., 2016).
- PCT-Hash is a packed compact trie using additionally STLHash as a dictionary in each micro trie (Takagi et al., 2016).
- ZFT is Tsuruta’s C++ implementation of the z-fast trie (Belazzougui et al., 2010).
- CTrie++ is a trie (Tsuruta et al., 2019) combining aspects of the z-fast trie with the packed compact trie.
Table 5 shows the results for the datasets consisting of short keywords (i.e., GeoNames, AOL, Wiki and DNA). Table 6 shows the results for the datasets consisting of long keywords (i.e., LUBMS, LUBML, UK and WebBase). In these tables, Space is the working space in GiB, Insert is the average insertion time in microseconds, and Lookup is the average lookup time in microseconds. For SLM of DynPDT, the results with are shown. Concerning PCT-Bit, PCT-Hash, ZFT and CTrie++, we could not obtain some results for large datasets because the resulting trie was too large to fit into RAM.
References
- Daigle Alexandre. 2016. Optimal path-decomposition of tries. Ph.D. Dissertation. University of Waterloo.
- Jun’ichi Aoe. 1989. An efficient digital search algorithm by using a double-array structure. IEEE Transactions on Software Engineering 15, 9 (1989), 1066–1077. https://doi.org/10.1109/32.31365
- Diego Arroyuelo, Rodrigo Cánovas, Gonzalo Navarro, and Rajeev Raman. 2017. LZ78 compression in low main memory space. In Proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE). 38–50. https://doi.org/10.1007/978-3-319-67428-5_4
- Diego Arroyuelo, Pooya Davoodi, and Srinivasa Rao Satti. 2016. Succinct dynamic cardinal trees. Algorithmica 74, 2 (2016), 742–777. https://doi.org/10.1007/s00453-015-9969-x
- Julian Arz and Johannes Fischer. 2018. Lempel–Ziv-78 compressed string dictionaries. Algorithmica 80, 7 (2018), 2012–2047. https://doi.org/10.1007/s00453-017-0348-7
- Nikolas Askitis. 2007. Efficient data structures for cache architectures. Ph.D. Dissertation. RMIT University.
- Nikolas Askitis and Ranjan Sinha. 2010. Engineering scalable, cache and space efficient tries for strings. The VLDB Journal 19, 5 (2010), 633–660. https://doi.org/10.1007/s00778-010-0183-9
