Dynamic Path-Decomposed Tries

Shunsuke Kanda; Dominik Köppl; Yasuo Tabei; Kazuhiro Morita; Masao Fuketa

arXiv:1906.06015·cs.DS·July 23, 2020

TL;DR

This paper introduces a dynamic, space-efficient keyword dictionary using path decomposition and compact hash tries, significantly reducing memory usage while maintaining performance.

Contribution

It presents a novel dynamic keyword dictionary based on path decomposition and compact hash tries, addressing the challenge of space efficiency in dynamic settings.

Findings

01. Requires up to 68% less space than existing solutions

02. Achieves a favorable space-time tradeoff

03. Effective on real-world datasets

Abstract

A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly-compressed keyword dictionaries based on the advancements of practical succinct data structures. However, as most succinct data structures are only efficient in the static case, it is still difficult to implement a keyword dictionary that is space efficient and dynamic. In this article, we propose such a keyword dictionary. Our main idea is to embrace the path decomposition technique, which was proposed for constructing cache-friendly tries. To store the path-decomposed trie in small memory, we design data structures based on recent compact hash trie representations. Experiments on real-world datasets reveal that our dynamic keyword…

Tables

Table 1. Statistics for the datasets used. Size is the total length of the keywords, $n$ is the number of all distinct keywords in millions (M), MinLen (resp. MaxLen and AveLen) is the minimum (resp. maximum and average) length of the keywords, and $|\mathcal{A}|$ is the actual alphabet size of the keywords.

	Size	$n$	MinLen	MaxLen	AveLen	$|\mathcal{A}|$
GeoNames	109 MiB	7.3 M	2	152	15.7	99
AOL	224 MiB	10.2 M	2	523	23.2	85
Wiki	286 MiB	14.1 M	2	252	21.2	200
DNA	189 MiB	15.3 M	13	13	13.0	16
LUBMS	3.1 GiB	52.6 M	10	80	63.7	57
LUBML	13.8 GiB	230.1 M	10	80	64.2	57
UK	2.7 GiB	39.5 M	17	2,030	72.4	103
WebBase	6.6 GiB	118.2 M	10	10,212	60.2	223

Table 2. Experimental results of the average heights of $\mathcal{T}^{c}_{\mathcal{S}}$ and $\mathcal{T}_{\mathcal{S}}$ , denoted by AveHeight. Also, AveHeightLB and AveHeightUB are the lower bound and the upper bound of the average height of $\mathcal{T}^{c}_{\mathcal{S}}$ , respectively (defined in Section 6.2). AveHeightLB is the average height of the path-decomposed trie obtained by the centroid path decomposition. AveHeightUB is the average height of the path-decomposed trie obtained by the path decomposition selecting children with the fewest leaves.

	GeoNames	AOL	Wiki	DNA	LUBMS	LUBML	UK	WebBase
AveHeight of $\mathcal{T}^{c}_{\mathcal{S}}$	6.0	6.2	6.3	9.0	7.5	7.9	7.8	7.3
AveHeightLB of $\mathcal{T}^{c}_{\mathcal{S}}$	5.2	5.2	5.3	8.9	6.6	7.4	6.0	6.2
AveHeightUB of $\mathcal{T}^{c}_{\mathcal{S}}$	8.5	10.5	9.7	10.7	11.8	12.4	14.7	15.4
AveHeight of $\mathcal{T}_{\mathcal{S}}$	15.7	23.2	21.2	13.0	63.7	64.2	72.4	60.2

Table 3. Experimental results of PDT-CFK for various values of the parameter $\lambda$ . Steps is the proportion of the number of step nodes among all nodes in DynPDT, Space is the working space in GiB, and Time is the elapsed time for the construction in seconds.

	Wiki			LUBMS			UK
$\lambda$	Steps	Space	Time	Steps	Space	Time	Steps	Space	Time
4	19.55%	0.36	20.9	7.75%	0.78	116.1	37.60%	1.22	116.5
8	6.12%	0.27	18.2	2.83%	0.78	96.8	15.51%	1.20	96.2
16	1.32%	0.27	17.1	0.31%	0.78	84.9	5.22%	1.20	87.6
32	0.12%	0.27	17.3	0.02%	0.79	83.3	1.31%	1.20	86.0
64	0.00%	0.27	17.2	0.00%	0.80	82.9	0.23%	1.21	85.4
128	0.00%	0.27	17.4	0.00%	0.80	82.9	0.04%	1.22	85.3
256	0.00%	0.28	17.2	0.00%	0.81	83.6	0.01%	1.22	85.2
512	0.00%	0.28	17.2	0.00%	0.82	83.3	0.00%	1.23	85.4
1024	0.00%	0.28	17.3	0.00%	0.83	83.1	0.00%	1.24	85.4

Table 4. Proportion of the number of step nodes to the total number of nodes in DynPDT. Bold font indicates the results with the smallest $\lambda$ such that Steps is less than 1%. AveNLL is the average length of the node labels.

	$\lambda = 4$	$\lambda = 8$	$\lambda = 16$	$\lambda = 32$	$\lambda = 64$	$\lambda = 128$	AveNLL
	Steps
GeoNames	6.34%	1.44%	0.28%	0.04%	0.00%	0.00%	6.1
AOL	22.16%	6.83%	1.26%	0.11%	0.01%	0.00%	10.6
Wiki	19.55%	6.12%	1.32%	0.12%	0.00%	0.00%	8.7
DNA	0.11%	0.00%	0.00%	0.00%	0.00%	0.00%	1.4
LUBMS	7.75%	2.83%	0.31%	0.02%	0.00%	0.00%	3.7
LUBML	7.66%	2.80%	0.31%	0.02%	0.00%	0.00%	3.7
UK	37.60%	15.51%	5.22%	1.31%	0.23%	0.04%	18.0
WebBase	24.15%	8.94%	2.46%	0.50%	0.08%	0.02%	11.1

Table 5. Experimental results for short keywords.

(a) GeoNames

	Space	Insert	Lookup
PDT-PB	0.32	0.59	0.53
PDT-SB	0.12	1.20	0.75
PDT-CB	0.11	1.30	0.80
PDT-PFK	0.33	0.63	0.71
PDT-SFK	0.14	0.74	0.90
PDT-CFK	0.12	0.91	0.99
STLHash	0.58	0.44	0.24
GoogleDense	0.73	0.37	0.12
Sparsepp	0.42	0.58	0.15
Hopscotch	0.84	0.40	0.10
Robin	0.84	0.31	0.10
ArrayHash	0.30	0.56	0.12
HAT	0.18	0.47	0.22
Judy	0.32	0.76	0.57
ART	0.59	0.85	0.56
Cedar-R	0.47	0.68	0.39
Cedar-P	0.26	0.71	0.34
PCT-Bit	2.96	9.12	10.43
PCT-Hash	5.13	7.74	5.49
ZFT	1.13	3.25	2.42
CTrie++	1.34	2.58	0.79

(b) AOL

	Space	Insert	Lookup
PDT-PB	0.52	1.01	0.65
PDT-SB	0.25	1.78	0.86
PDT-CB	0.21	1.87	0.90
PDT-PFK	0.52	0.80	0.80
PDT-SFK	0.27	0.94	1.13
PDT-CFK	0.23	1.16	1.18
STLHash	1.01	0.52	0.26
GoogleDense	1.72	0.67	0.15
Sparsepp	0.77	0.76	0.18
Hopscotch	1.04	0.51	0.12
Robin	1.79	0.47	0.12
ArrayHash	0.70	0.83	0.13
HAT	0.32	0.59	0.27
Judy	0.51	0.97	0.76
ART	0.91	0.96	0.77
Cedar-R	1.07	0.87	0.61
Cedar-P	0.49	0.80	0.57
PCT-Bit	4.07	12.42	14.34
PCT-Hash	7.66	10.26	6.99
ZFT	1.82	3.84	2.57
CTrie++	2.12	3.01	1.10

(c) Wiki

	Space	Insert	Lookup
PDT-PB	0.64	0.98	0.68
PDT-SB	0.28	1.60	0.96
PDT-CB	0.24	1.71	1.02
PDT-PFK	0.67	0.79	0.86
PDT-SFK	0.31	0.96	1.15
PDT-CFK	0.27	1.14	1.22
STLHash	1.29	0.50	0.27
GoogleDense	1.64	0.54	0.14
Sparsepp	0.97	0.69	0.18
Hopscotch	1.08	0.42	0.13
Robin	1.83	0.41	0.12
ArrayHash	0.69	0.73	0.14
HAT	0.43	0.60	0.27
Judy	0.66	0.92	0.74
ART	1.23	1.00	0.73
Cedar-R	1.19	0.89	0.59
Cedar-P	0.63	0.89	0.61
PCT-Bit	5.67	11.85	13.79
PCT-Hash	10.11	9.48	7.09
ZFT	2.24	3.56	2.64
CTrie++	2.92	3.01	1.09

(d) DNA

	Space	Insert	Lookup
PDT-PB	0.84	1.18	0.65
PDT-SB	0.29	1.80	0.87
PDT-CB	0.20	2.00	0.91
PDT-PFK	0.80	0.85	0.88
PDT-SFK	0.33	1.02	1.11
PDT-CFK	0.25	1.34	1.20
STLHash	1.02	0.91	0.34
GoogleDense	1.25	0.24	0.09
Sparsepp	0.67	0.50	0.13
Hopscotch	1.50	0.27	0.07
Robin	1.50	0.26	0.08
ArrayHash	0.54	0.47	0.13
HAT	0.32	0.39	0.24
Judy	0.34	0.95	0.61
ART	1.01	0.65	0.63
Cedar-R	0.24	0.70	0.21
Cedar-P	0.31	0.67	0.24
PCT-Bit	7.05	8.44	9.91
PCT-Hash	8.45	5.44	6.86
ZFT	2.57	3.30	2.88
CTrie++	2.50	2.55	0.78

Table 6. Experimental results for long keywords.

(a) LUBMS

	Space	Insert	Lookup
PDT-PB	2.37	1.62	1.10
PDT-SB	0.83	1.93	1.19
PDT-CB	0.66	2.04	1.22
PDT-PFK	2.46	1.09	1.14
PDT-SFK	0.95	1.27	1.42
PDT-CFK	0.78	1.52	1.52
STLHash	7.47	0.61	0.51
GoogleDense	9.93	0.89	0.28
Sparsepp	6.22	0.83	0.39
Hopscotch	6.87	0.70	0.27
Robin	9.87	0.61	0.26
ArrayHash	5.44	0.98	0.30
HAT	2.23	1.43	0.57
Judy	1.66	1.45	1.26
ART	5.83	0.91	0.77
Cedar-R	1.97	1.79	1.66
Cedar-P	1.46	2.10	1.71
PCT-Bit	n/a	n/a	n/a
PCT-Hash	n/a	n/a	n/a
ZFT	9.27	6.33	5.65
CTrie++	8.13	4.25	2.43

(b) LUBML

	Space	Insert	Lookup
PDT-PB	10.1	1.43	1.00
PDT-SB	3.6	2.20	1.49
PDT-CB	2.8	2.39	1.53
PDT-PFK	10.7	1.32	1.39
PDT-SFK	4.1	1.63	1.79
PDT-CFK	3.4	1.91	1.90
STLHash	32.9	0.67	0.59
GoogleDense	40.1	0.99	0.32
Sparsepp	27.3	0.89	0.44
Hopscotch	41.2	1.01	0.27
Robin	41.2	0.69	0.28
ArrayHash	21.9	1.06	0.32
HAT	9.5	1.40	0.78
Judy	7.8	1.52	1.29
ART	25.8	1.05	0.93
Cedar-R	n/a	n/a	n/a
Cedar-P	n/a	n/a	n/a
PCT-Bit	n/a	n/a	n/a
PCT-Hash	n/a	n/a	n/a
ZFT	n/a	n/a	n/a
CTrie++	n/a	n/a	n/a

(c) UK

	Space	Insert	Lookup
PDT-PB	2.32	1.45	0.94
PDT-SB	1.26	2.76	1.44
PDT-CB	1.09	2.87	1.44
PDT-PFK	2.32	1.27	1.24
PDT-SFK	1.38	1.74	1.93
PDT-CFK	1.21	2.04	2.02
STLHash	6.05	0.67	0.50
GoogleDense	10.50	1.09	0.27
Sparsepp	5.06	0.96	0.37
Hopscotch	6.23	0.75	0.25
Robin	9.23	0.63	0.25
ArrayHash	5.91	1.16	0.28
HAT	2.68	1.08	0.51
Judy	2.21	1.88	1.59
ART	5.17	1.64	1.19
Cedar-R	7.37	2.24	2.30
Cedar-P	2.02	2.20	2.28
PCT-Bit	18.05	25.92	33.49
PCT-Hash	n/a	n/a	n/a
ZFT	7.53	6.20	5.03
CTrie++	8.17	4.75	2.86

(d) WebBase

	Space	Insert	Lookup
PDT-PB	5.4	1.66	1.19
PDT-SB	2.6	2.57	1.77
PDT-CB	2.3	2.68	1.74
PDT-PFK	5.7	1.36	1.42
PDT-SFK	2.9	1.84	2.03
PDT-CFK	2.6	2.15	2.18
STLHash	16.3	0.64	0.55
GoogleDense	19.5	0.82	0.28
Sparsepp	13.5	0.83	0.43
Hopscotch	20.3	0.93	0.24
Robin	20.3	0.64	0.26
ArrayHash	10.4	0.98	0.29
HAT	6.7	1.08	0.55
Judy	5.9	2.09	1.76
ART	14.0	1.76	1.45
Cedar-R	n/a	n/a	n/a
Cedar-P	n/a	n/a	n/a
PCT-Bit	n/a	n/a	n/a
PCT-Hash	n/a	n/a	n/a
ZFT	19.6	5.77	5.06
CTrie++	23.5	5.12	3.12

Taxonomy

Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · Web Data Mining and Analysis

Full text

Dynamic Path-Decomposed Tries

Shunsuke Kanda, RIKEN Center for Advanced Intelligence Project, Japan

Dominik Köppl, Kyushu University, Japan; Japan Society for Promotion of Science

Yasuo Tabei, RIKEN Center for Advanced Intelligence Project, Japan

Kazuhiro Morita, Tokushima University, Japan

Masao Fuketa, Tokushima University, Japan

Abstract.

A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly-compressed keyword dictionaries based on the advancements of practical succinct data structures. However, as most succinct data structures are only efficient in the static case, it is still difficult to implement a keyword dictionary that is space efficient and dynamic. In this article, we propose such a keyword dictionary. Our main idea is to embrace the path decomposition technique, which was proposed for constructing cache-friendly tries. To store the path-decomposed trie in small memory, we design data structures based on recent compact hash trie representations. Experiments on real-world datasets reveal that our dynamic keyword dictionary needs up to 68% less space than the existing smallest ones, while achieving a relevant space-time tradeoff.

1. Introduction

An associative array is called a keyword dictionary if its keys are strings. In this article, we study the problem of maintaining a keyword dictionary in main memory efficiently. When storing words extracted from text collections written in natural or computer languages, the size of a keyword dictionary $A$ is not of major concern. This is because, after carefully polishing the extracted strings with natural language processing tools like stemmers, the size of $A$ grows sublinearly as $\mathcal{O}(N^{\beta})$ for some $\beta\approx 0.5$ over a text of $N$ words due to Heaps’ Law (Heaps, 1978; Baeza-Yates and Ribeiro-Neto, 2011). However, as reported in (Martínez-Prieto et al., 2016), some natural language applications such as web search engines and machine translation systems need to handle large datasets that do not follow Heaps’ Law. Also, other recent applications such as Semantic Web graphs and bioinformatics handle massive string databases with keyword dictionaries (Martínez-Prieto et al., 2016; Mavlyutov et al., 2015). Although common implementations like hash tables are fast, their memory consumption is a severe drawback in such scenarios. Here, a space-efficient implementation of the keyword dictionary is important. In this paper, we focus on the practical side of this problem.

In the static setting, which omits the insertion and deletion of keywords, a number of compressed keyword dictionaries have been developed over the past decade, some of which we highlight in the following. We start with Martínez-Prieto et al. (Martínez-Prieto et al., 2016), who proposed and evaluated a number of compressed keyword dictionaries based on techniques like hashing, front-coding, full-text indexes, and tries. They demonstrated that their implementations use up to 5% of the space of the original dataset, while also supporting searches of prefixes and substrings of the keywords. Subsequently, Grossi and Ottaviano (Grossi and Ottaviano, 2014) proposed a cache-friendly keyword dictionary through path decomposition of tries. Arz and Fischer (Arz and Fischer, 2018) adapted the LZ78 compression to devise a keyword dictionary. Finally, Kanda et al. (Kanda et al., 2017a) proposed a keyword dictionary based on a compressed double-array trie. As we can see from these representations, space-efficient static keyword dictionaries have been well studied because of the advancements of practical (yet static) succinct data structures collected in well-maintained libraries such as SDSL (Gog et al., 2014) and Succinct (Grossi and Ottaviano, 2013).

Under the dynamic setting, however, only a few space-efficient keyword dictionaries have been realized, probably due to the implementation difficulty. Although HAT-trie (Askitis and Sinha, 2010) and Judy (Baskins, 2002) are representative space-efficient dynamic implementations, as demonstrated in previous experiments (such as http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/#perf and https://github.com/Tessil/hat-trie/blob/master/README.md#benchmark), they still waste memory by maintaining many pointers. The Cedar trie (Yoshinaga and Kitsuregawa, 2014) is a space-efficient implementation that relies heavily on 32-bit pointers to address memory, and therefore cannot be applied to massive datasets. Its implementation makes it hard to switch to 64-bit pointers, and we expect that doing so would increase its space consumption considerably. Although several practical dynamic succinct data structures (Prezza, 2017; Poyias et al., 2017, 2018) have been recently developed, modern dynamic keyword dictionaries are heavily based on pointers, which consume a large fraction of the entire space requirement. Nonetheless, there are some applications that need dynamic keyword dictionaries for massive datasets such as search engines (Brazil Inc., 2019; Busch et al., 2012), RDF stores (Mavlyutov et al., 2015), or Web crawlers (Ueda et al., 2013). Consequently, realizing practical space-efficient dynamic keyword dictionaries is an important open challenge.

1.1. Space-Efficient Dynamic Tries

Common keyword dictionary implementations represent the keywords in a trie, supporting the retrieval of keywords with trie navigation operations. In this subsection, we summarize space-efficient dynamic tries.

Theoretical Discussion

We consider a dynamic trie with $t$ nodes over an alphabet of size $\sigma$ . Arroyuelo et al. (Arroyuelo et al., 2016) introduced succinct representations that require almost optimal $2t+t\log\sigma+o(t\log\sigma)$ bits of space, while supporting insertion and deletion of a leaf in $\mathcal{O}(1)$ amortized time if $\sigma=\mathcal{O}(\mathrm{polylog}(t))$ and in $\mathcal{O}(\log\sigma/\log\log\sigma)$ amortized time otherwise. (Throughout this paper, the base of the logarithm is 2, whenever not explicitly indicated.) Jansson et al. (Jansson et al., 2015) presented a dynamic trie representation that uses $\mathcal{O}(t\log\sigma)$ bits of space, while supporting insertion and deletion of a leaf in $\mathcal{O}(\log\log t)$ expected amortized time.

Hash Tries

On the practical side, Poyias et al. (Poyias et al., 2018) proposed the m-Bonsai trie, a practical dynamic compact trie representation. It is a variant of the Bonsai trie (Darragh et al., 1993) that represents the trie nodes as entries in a compact hash table. It takes $\mathcal{O}(t\log\sigma)$ bits of space, while supporting update and some traversal operations in $\mathcal{O}(1)$ expected time. Fischer and Köppl (Fischer and Köppl, 2017) presented and evaluated a number of dynamic tries for LZ78 (Ziv and Lempel, 1978) and LZW (Welch, 1984) factorization. They also proposed an efficient hash-based trie representation in a similar way to m-Bonsai, which is referred to as FK-hash. (The representation is referred to as hash or cht in their paper (Fischer and Köppl, 2017). To avoid confusion, we name it FK-hash by using the initial letters of the proposers, Fischer and Köppl.) Although FK-hash uses $\mathcal{O}(t\log\sigma+t\log t)$ bits of space, its update algorithm is simple and practically fast. However, we are not aware of any space-efficient approach using them as keyword dictionaries.

Compacted Tries

Another line of research focuses on limiting the space of the trie in relation to the number of keywords. Suppose that we want to maintain a set of $n$ strings with a total length of $N$ on a machine, where $\beta=\log_{\sigma}N$ characters fit into a single machine word $w$ . In this setting, Belazzougui et al. (Belazzougui et al., 2010) proposed the (dynamic) z-fast trie, which takes $N\log\sigma+\mathcal{O}(n\log N)$ bits of space and supports retrieval, insertion and deletion of a string $S$ in $\mathcal{O}(|S|/\beta+\log|S|+\log\log\sigma)$ expected time. Takagi et al. (Takagi et al., 2016) proposed the packed compact trie, which takes $N\log\sigma+\mathcal{O}(nw)$ bits of space and supports the same operations in $\mathcal{O}(|S|/\beta+\log\log N)$ expected time. Recently, Tsuruta et al. (Tsuruta et al., 2020) developed a hybrid data structure of the z-fast trie and the packed compact trie, which also takes $N\log\sigma+\mathcal{O}(nw)$ bits of space, but improves each of these operations to run in $\mathcal{O}(|S|/\beta+\log\beta)$ expected time.

1.2. Our Contribution

We propose a novel space-efficient dynamic keyword dictionary, called the dynamic path-decomposed trie (abbreviated as DynPDT). DynPDT is based on a trie formed by path decomposition (Ferragina et al., 2008). Path decomposition is a trie transformation technique that was proposed for constructing cache-friendly trie dictionaries. Until now, it has been utilized only in static applications (Grossi and Ottaviano, 2014; Hsu and Ottaviano, 2013). Here, we adapt this technique for the dynamic construction of DynPDT, which gives DynPDT two main advantages over other known keyword dictionaries.

(1) The first is that the data structure is cache efficient because of the path decomposition. During the retrieval of a keyword, most parts of the keyword can be scanned in a cache-friendly manner without node-to-node traversals based on random accesses.

(2) The second is that the path decomposition allows us to plug in any dynamic trie representation for the path-decomposed trie topology. For this job, we choose the hash-based trie representations m-Bonsai and FK-hash as these are fast and memory efficient in the setting when all trie nodes have to be represented explicitly (which is the case for the nodes of the path-decomposed trie).

Based on these advantages, DynPDT becomes a fast and space-efficient dynamic keyword dictionary.

From experiments using massive real-world datasets, we demonstrate that DynPDT is more space efficient compared to existing keyword dictionaries while achieving a relevant space-time tradeoff. For example, to construct a keyword dictionary from a large URI dataset of 13.8 GiB, DynPDT needs only 2.5 GiB of working space, while a HAT-trie and a Judy trie need 9.5 GiB and 7.8 GiB, respectively. The time performance is competitive in many cases thanks to the path decomposition. The source code of our implementation is available at https://github.com/kampersanda/poplar-trie.

1.3. Paper Structure

In Section 2, we introduce the keyword dictionary, and review the trie data structure and the path decomposition in our preliminaries. We introduce our new data structure DynPDT in Section 3. Subsequently, we present our DynPDT representations based on m-Bonsai and FK-hash in Sections 4 and 5, respectively. In Section 6, we provide our experimental results. Finally, we conclude the paper in Section 7. (A preliminary version of this work appeared in our conference paper (Kanda et al., 2017b) and the first author’s Ph.D. thesis (Kanda, 2018). This paper differs significantly from them as follows: (1) a fast variant of m-Bonsai was incorporated in Section 4.1; (2) an efficient implementation of the bijective hash function in m-Bonsai was incorporated in Section 4.2; (3) a growing algorithm of m-Bonsai was presented in Section 4.3; (4) FK-hash was also considered in addition to m-Bonsai in Section 5; (5) the experimental results in Section 6 and all descriptions were significantly enhanced.)

2. Preliminaries

A string is a (finite) sequence of characters over a finite alphabet. Our strings always start at position 0. Given a string $S$ of length $n$ , $S[i,j)$ denotes the substring $S[i],S[i+1],\ldots,S[j-1]$ for $0\leq i\leq j\leq n$ . Particularly, $S[0,j)$ is a prefix of $S$ and $S[i,n)$ is a suffix of $S$ . Let $|S|:=n$ denote the length of $S$ . The same notation is also applied to arrays. The cardinality of a set $A$ is denoted by $|A|$ .

Our model of computation is the transdichotomous word RAM model of word size $w=\Theta(\log N)$ , where $N$ is the total length of all keywords of a given problem, i.e., the size of the problem. We can read and process $\mathcal{O}(w)$ bits in constant time.

2.1. Keyword Dictionary

A keyword is a string over an alphabet $\mathcal{A}$ that is terminated with a special character $\texttt{\$}\not\in\mathcal{A}$ at its end. In a *prefix-free* set of strings, no string is a prefix of another string. A set of keywords is always prefix-free due to the character $\texttt{\$}$ . A keyword dictionary is a dynamic associative array that maps a dynamic set of $n$ keywords $\mathcal{S}=\{K_{1},K_{2},\ldots,K_{n}\}\subset\mathcal{A}^{*}$ to values $x_{1},x_{2},\ldots,x_{n}$ , where $x_{i}$ belongs to a finite set $\mathcal{X}$ . It supports the retrieval, the insertion, and the deletion of keywords while maintaining the key-value mapping. In detail, it supports the following operations:

• $\textsf{lookup}(K)$ returns the value associated with the keyword $K$ if $K\in\mathcal{S}$ or $\bot$ otherwise.

• $\textsf{insert}(K,x)$ inserts the keyword $K$ in $\mathcal{S}$ , i.e., $\mathcal{S}\leftarrow\mathcal{S}\cup\{K\}$ , and associates the value $x$ with $K$ .

• $\textsf{delete}(K)$ removes the keyword $K$ from $\mathcal{S}$ , i.e., $\mathcal{S}\leftarrow\mathcal{S}\setminus\{K\}$ .
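The three operations can be pinned down with a small behavioural model. The sketch below is only a reference for the semantics, backed by a plain Python hash map; it is not DynPDT, and the class name `ReferenceKeywordDictionary` is a hypothetical name chosen for illustration. `None` plays the role of $\bot$.

```python
from typing import Dict, Optional


class ReferenceKeywordDictionary:
    """Behavioural model of lookup/insert/delete (a plain hash map, not DynPDT)."""

    def __init__(self) -> None:
        self._table: Dict[str, int] = {}

    def lookup(self, keyword: str) -> Optional[int]:
        # Returns the associated value, or None (standing in for "bottom").
        return self._table.get(keyword)

    def insert(self, keyword: str, value: int) -> None:
        # Adds the keyword to the set and associates the value with it.
        self._table[keyword] = value

    def delete(self, keyword: str) -> None:
        # Removes the keyword if present; absent keywords are ignored.
        self._table.pop(keyword, None)
```

Any space-efficient implementation is expected to answer these three calls exactly as this model does; only the memory footprint and running time differ.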

2.2. Tries

A trie (Knuth, 1998; Fredkin, 1960) is a rooted labeled tree $\mathcal{T}_{\mathcal{S}}$ representing a set of keywords $\mathcal{S}$ . Each edge in $\mathcal{T}_{\mathcal{S}}$ is labeled by a character. The outgoing edges of a node are labeled with distinct characters. The label $c$ of the edge $(u,v)$ between a node $v$ and its parent $u$ is called the branching character of $v$ . The parent $u$ and the branching character $c$ uniquely determine $v$ . Each keyword $K\in\mathcal{S}$ is represented by exactly one path from the root to a leaf $u$ , i.e., the keyword $K$ can be extracted by concatenating the edge labels on the path from the root to $u$ . Since $\mathcal{S}$ is prefix-free ( $\texttt{\$}$ is a unique delimiter of each keyword), there is a one-to-one correspondence between leaves and keywords.

Given a keyword $K$ of length $m$ , $\mathcal{T}_{\mathcal{S}}$ retrieves $K$ by traversing nodes from the root to a leaf while matching the characters of $K$ with the edge labels of the traversed path. In representations storing all trie nodes explicitly, we visit $m$ nodes during this traversal. However, this traversal suffers from poor locality of reference since it needs to follow pointers that usually address non-consecutive memory. In practice, this cache inefficiency is a critical bottleneck especially for long strings such as URLs. Grossi and Ottaviano (Grossi and Ottaviano, 2014) successfully solved this problem in practice through path decomposition (Ferragina et al., 2008), albeit only in static settings.
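To make the node-by-node traversal concrete, here is a minimal sketch of a pointer-based trie built from nested Python dictionaries. The function names `build_trie` and `trie_contains` are hypothetical, and the representation is only an illustration of the explicit-node setting, not one of the data structures evaluated in the paper. The second return value counts the edges followed, which equals the keyword length (including the terminator) on a successful retrieval.

```python
def build_trie(keywords):
    """Build a trie as nested dicts; every keyword is terminated with '$'."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw + "$":
            node = node.setdefault(ch, {})
    return root


def trie_contains(root, keyword):
    """Return (found, number of edges followed from the root)."""
    node, steps = root, 0
    for ch in keyword + "$":
        if ch not in node:
            return False, steps
        node = node[ch]  # one pointer hop per character
        steps += 1
    return True, steps


root = build_trie(["tea", "ten", "to"])
print(trie_contains(root, "ten"))  # (True, 4)
print(trie_contains(root, "te"))   # (False, 2)
```

Each hop may land in a different memory region, which is the locality problem that path decomposition addresses.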

2.3. Path Decomposition

The path decomposition (Ferragina et al., 2008) of a trie $\mathcal{T}_{\mathcal{S}}$ is a recursive procedure that first chooses an arbitrary root-to-leaf path $\pi$ in $\mathcal{T}_{\mathcal{S}}$ , then compactifies the path $\pi$ to a single node, and subsequently repeats the procedure in each subtrie hanging off the path $\pi$ . As a result, $\mathcal{T}_{\mathcal{S}}$ is partitioned into a set of $n$ node-to-leaf paths because there are $n$ leaves in $\mathcal{T}_{\mathcal{S}}$ . This decomposition produces the path-decomposed trie $\mathcal{T}^{c}_{\mathcal{S}}$ , which is composed of $n$ compactified nodes.

For explaining the properties of $\mathcal{T}^{c}_{\mathcal{S}}$ , we call the concatenation of the labels of all edges of a node-to-leaf path $\pi$ in $\mathcal{T}_{\mathcal{S}}$ the path string of $\pi$ . The path strings of the compactified paths of $\mathcal{T}_{\mathcal{S}}$ are the node labels of $\mathcal{T}^{c}_{\mathcal{S}}$ . In detail, each node $u$ in $\mathcal{T}^{c}_{\mathcal{S}}$ is associated with a node-to-leaf path $\pi$ of $\mathcal{T}_{\mathcal{S}}$ and is labeled by the path string of $\pi$ , denoted by $L_{u}\in\mathcal{A}^{*}$ . Each edge in $\mathcal{T}^{c}_{\mathcal{S}}$ is labeled by a pair consisting of a branching character and an integer, which are defined as follows (see also Figure 1): Take a node $u$ in $\mathcal{T}^{c}_{\mathcal{S}}$ and one of its children $v$ . Suppose that $u$ and $v$ are associated with the paths $\pi_{u}$ and $\pi_{v}$ in $\mathcal{T}_{\mathcal{S}}$ , respectively, such that $L_{u}$ and $L_{v}$ are the path labels of $\pi_{u}$ and $\pi_{v}$ . The edge $({u,v})$ has the label $({b,i})$ if, in $\mathcal{T}_{\mathcal{S}}$ , the first node on the path $\pi_{v}$ is the node

• whose branching character is $b$ , and

• whose parent is the $i$ -th node visited on the path $\pi_{u}$ (throughout this paper, we start counting from zero).

The edge labels of $\mathcal{T}^{c}_{\mathcal{S}}$ are characters drawn from the alphabet $\mathcal{B}:=\mathcal{A}\times\{0,1,\ldots,\Lambda-1\}$ , where $\Lambda$ is the longest length of all node labels.

Example 2.1 (Path-Decomposed Trie).

Figure 2 illustrates a root-to-leaf path $\pi$ in $\mathcal{T}_{\mathcal{S}}$ and the corresponding root $r$ in $\mathcal{T}^{c}_{\mathcal{S}}$ after compactifying $\pi$ to $r$ . The root $r$ is labeled by the path string of $\pi$ , which is $L_{r}=c_{1}c_{2}c_{3}c_{4}c_{5}$ . The branching character of $u^{\prime}_{5}$ in $\mathcal{T}^{c}_{\mathcal{S}}$ is $({b_{5},3})$ because $u_{5}$ in $\mathcal{T}_{\mathcal{S}}$ is the child of the third node on the path $\pi$ with branching character $b_{5}$ . Also for the subtries rooted at the nodes $u_{1},u_{2},\ldots,u_{6}$ in $\mathcal{T}_{\mathcal{S}}$ , the decomposition is recursively applied to produce the children of the root in $\mathcal{T}^{c}_{\mathcal{S}}$ .

Given a keyword $K$ , the retrieval on $\mathcal{T}_{\mathcal{S}}$ can be simulated with a traversal of $\mathcal{T}^{c}_{\mathcal{S}}$ starting at its root: Let $u$ denote the currently visited node in $\mathcal{T}^{c}_{\mathcal{S}}$ . On visiting $u$ , we compare the path string $L_{u}$ with the characters of $K$ . If we find a mismatch at $L_{u}[i]$ with $b:=K[i]\neq L_{u}[i]$ , we descend to the child with branching character $({b,i})$ and drop the first $i+1$ characters of $K$ .
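The retrieval just described can be sketched as follows. This is an illustrative sketch under stated assumptions: the class `PDTNode`, the function `pdt_lookup`, and the use of a Python dictionary keyed by the pair $(b,i)$ are hypothetical choices for readability, not the compact hash-based representations used by DynPDT. The small trie is built by hand for the keywords tea, ten, tent, and to, with the root labeled by the path string of tea.

```python
class PDTNode:
    """A node of a path-decomposed trie."""

    def __init__(self, label, value=None):
        self.label = label   # path string L_u, held in one contiguous string
        self.value = value   # value of the keyword that ends at this node
        self.children = {}   # (branching character b, position i) -> PDTNode


def pdt_lookup(root, keyword):
    """Return the value stored for keyword, or None if it is absent."""
    node, s = root, keyword + "$"
    while node is not None:
        if s == node.label:
            return node.value
        # Scan L_u and the remaining keyword for the first mismatch.
        i, limit = 0, min(len(s), len(node.label))
        while i < limit and s[i] == node.label[i]:
            i += 1
        if i == limit:
            return None  # only reachable if '$' is misused inside a keyword
        # Descend via the edge labeled (b, i) and drop the first i + 1 characters.
        node = node.children.get((s[i], i))
        s = s[i + 1:]
    return None


# Hand-built example: the root path spells "tea$".
root = PDTNode("tea$", value=1)
ten = PDTNode("$", value=2)
root.children[("n", 2)] = ten                    # "ten$" leaves the root path at position 2
ten.children[("t", 0)] = PDTNode("$", value=3)   # "tent$" leaves ten's path at position 0
root.children[("o", 1)] = PDTNode("$", value=4)  # "to$" leaves the root path at position 1

print(pdt_lookup(root, "tent"))  # 3
print(pdt_lookup(root, "tex"))   # None
```

Looking up tent touches three nodes here, whereas the plain trie would follow five edges; the characters inside each label are compared by a sequential scan.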

When storing the characters of each path string $L_{u}$ in consecutive memory locations, the number of random accesses involved in the retrieval on $\mathcal{T}^{c}_{\mathcal{S}}$ is bounded by $\mathcal{O}(h)$ , where $h$ is the height of $\mathcal{T}^{c}_{\mathcal{S}}$ . The following property regarding the height is satisfied by construction.

Property 1.

The height of $\mathcal{T}^{c}_{\mathcal{S}}$ cannot be larger than that of $\mathcal{T}_{\mathcal{S}}$ .

Centroid Path Decomposition

A way to improve this height bound in the static case is the centroid path decomposition (Ferragina et al., 2008). Given an inner node $u$ in $\mathcal{T}_{\mathcal{S}}$ , the heavy child of $u$ is the child whose subtrie has the most leaves (ties are broken arbitrarily). Given a node $u$ , the centroid path is the path from $u$ to a leaf obtained by descending only to heavy children. The centroid path decomposition yields the following property by always choosing centroid paths in the decomposition.

Property 2 ((Ferragina et al., 2008)).

Through the centroid path decomposition, the height of $\mathcal{T}^{c}_{\mathcal{S}}$ is bounded by $\mathcal{O}(\log n)$ .
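A static centroid path decomposition can be sketched in a few lines. The sketch assumes a nested-dictionary trie and represents each node of the path-decomposed trie as a pair `(label, children)`; the names `build_trie`, `count_leaves`, `centroid_decompose`, `height`, and `size` are hypothetical. Leaf counts are recomputed on every call for brevity, which is quadratic in the worst case; a practical implementation would cache them.

```python
def build_trie(keywords):
    """Build a trie as nested dicts; every keyword is terminated with '$'."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw + "$":
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Number of leaves (keywords) in the subtrie rooted at node."""
    return 1 if not node else sum(count_leaves(c) for c in node.values())


def centroid_decompose(node):
    """Compactify the centroid path starting at node into (label, children)."""
    label, children, i = [], {}, 0
    while node:  # stop at a leaf (empty dict)
        # The heavy child is the one whose subtrie has the most leaves.
        heavy = max(node, key=lambda ch: count_leaves(node[ch]))
        for ch, child in node.items():
            if ch != heavy:
                # Subtries hanging off position i are decomposed recursively.
                children[(ch, i)] = centroid_decompose(child)
        label.append(heavy)
        node = node[heavy]
        i += 1
    return "".join(label), children


def height(pdt):
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((height(c) for c in pdt[1].values()), default=0)


def size(pdt):
    """Total number of nodes in the path-decomposed trie."""
    return 1 + sum(size(c) for c in pdt[1].values())


pdt = centroid_decompose(build_trie(["tea", "ten", "tent", "to", "i"]))
print(pdt[0], sorted(pdt[1]), height(pdt), size(pdt))
```

For these five keywords the root is labeled `ten$`, the other four keywords hang directly off it, and the number of nodes equals the number of keywords.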

Key-Value Mapping

We can implement the key-value mapping through $\mathcal{T}^{c}_{\mathcal{S}}$ because there is a one-to-one correspondence between nodes in $\mathcal{T}^{c}_{\mathcal{S}}$ and keywords in $\mathcal{S}$ . A simple approach is to store the associated values in an array $A$ such that $A[u]$ stores the value associated with node $u$ . If we assign each of the $n$ nodes in $\mathcal{T}^{c}_{\mathcal{S}}$ a unique id from the range $[0,n)$ , then $A$ has no vacant entry (i.e., $|A|=n$ ). Another approach is to embed the value of $K_{i}$ at the end of $L_{u}$ , where the node $u$ corresponds to the keyword $K_{i}$ . This approach can be used without considering the assignment of node ids. In our experiments, we used the latter approach.

3. Dynamic Path-Decomposed Trie

Although the centroid path decomposition gives a logarithmic upper bound on the height of $\mathcal{T}^{c}_{\mathcal{S}}$ (cf. Section 2), it can be applied only in static settings because we have to know the complete topology of $\mathcal{T}_{\mathcal{S}}$ a priori to determine the centroid paths. As a matter of fact, previous data structures embracing the path decomposition (Grossi and Ottaviano, 2014; Hsu and Ottaviano, 2013; Ferragina et al., 2008) consider only static applications.

In this section, we present the incremental path decomposition, which is a novel procedure to construct a dynamic path-decomposed trie, which we call DynPDT in the following. Our procedure incrementally chooses a node-to-leaf path in $\mathcal{T}_{\mathcal{S}}$ and directly updates the DynPDT $\mathcal{T}^{c}_{\mathcal{S}}$ on inserting a new keyword of $\mathcal{S}$ . (We actually do not construct $\mathcal{T}_{\mathcal{S}}$ , but represent it with the DynPDT $\mathcal{T}^{c}_{\mathcal{S}}$ .) This incrementally chosen path is not a centroid path in general. Thus, the incremental path decomposition does not necessarily satisfy Property 2 but always satisfies Property 1.

In this section, we drop the technical detail of storing the values to ease the explanation of DynPDT, for which we omit the second argument in the insert operation $\textsf{insert}(K)$ of a new keyword $K$ .

3.1. Incremental Path Decomposition

In the following, we simulate a dynamic trie $\mathcal{T}_{\mathcal{S}}$ by DynPDT $\mathcal{T}^{c}_{\mathcal{S}}$ . Suppose that $\mathcal{T}_{\mathcal{S}}$ is non-empty. On inserting a new keyword $K\not\in\mathcal{S}$ into $\mathcal{T}_{\mathcal{S}}$ , we proceed as follows:

(1)

First traverse $\mathcal{T}_{\mathcal{S}}$ from the root by matching characters of $K$ until reaching the deepest node $u$ whose string label $X$ is a prefix of $K$.

(2)

Decompose $K$ into $K=XbY$ for $b\in\mathcal{A}$ and $Y\in\mathcal{A}^{*}$, which is possible since $K\not\in\mathcal{S}$ and $K[|K|-1]=\texttt{\$}$.

(3)

Finally, insert a new child $v$ of $u$ with branching character $b$ and append, from node $v$ , new nodes corresponding to the suffix $Y$ .

In other words, the task of $\textsf{insert}(K)$ on $\mathcal{T}_{\mathcal{S}}$ is to create a new node-to-leaf path $\pi$ representing the suffix $Y$ . We call that path $\pi$ the incremental path of the keyword $K$ . We simulate $\textsf{insert}(K)$ by creating a new node in $\mathcal{T}^{c}_{\mathcal{S}}$ whose label is the path label of this incremental path $\pi$ :

•

If $\mathcal{S}=\emptyset$ , create the root $u_{1}$ and associate the keyword $K$ with $u_{1}$ by $L_{u_{1}}\leftarrow K$ .

•

Otherwise ( $\mathcal{S}\neq\emptyset$ ), retrieve the keyword $K$ from the root $u_{1}$ in three steps after setting variables $u\leftarrow u_{1}$ and $S\leftarrow K$ :

(1)

Compare $S$ with $L_{u}$. If $S=L_{u}$, terminate because $K$ is already inserted; otherwise, proceed with Step 2.

(2)

Find $i$ such that $S[0,i)=L_{u}[0,i)$ and $S[i]\neq L_{u}[i]$ ($i$ exists since $K\not\in\mathcal{S}$ and $K[|K|-1]=\texttt{\$}$), and search for the child of $u$ with branching character $({S[i],i})$. If found, go back to Step 1 after setting the variable $u$ to this child and $S$ to the remaining suffix $S[i+1,|S|)$; otherwise, proceed with Step 3.

(3)

Insert $K$ into $\mathcal{S}$ by creating a new child $v$ of $u$ with branching character $({S[i],i})$ , and store the remaining suffix in $v$ by $L_{v}\leftarrow S[i+1,|S|)$ .

Example 3.1 (Construction).

Figure 3 illustrates the construction process of DynPDT $\mathcal{T}^{c}_{\mathcal{S}}$ when inserting the keywords $K_{1}=\texttt{technology\$}$, $K_{2}=\texttt{technics\$}$, $K_{3}=\texttt{technique\$}$, and $K_{4}=\texttt{technically\$}$ in this order, where the $i$-th created node is denoted by $u_{i}$. The process begins with an empty trie $\mathcal{T}^{c}_{\mathcal{S}}$.

(a)

In the first insertion $\textsf{insert}(K_{1})$, we create the root $u_{1}$ and associate $K_{1}$ with $L_{u_{1}}$, that is, $L_{u_{1}}$ becomes $\texttt{technology\$}$. The resulting $\mathcal{T}^{c}_{\mathcal{S}}$ for $\mathcal{S}=\{K_{1}\}$ is shown in Figure 3a.

(b)

In the second insertion $\textsf{insert}(K_{2})$, we define a string variable $S$ initially set to $S\leftarrow K_{2}$. We try to retrieve $K_{2}$ in $\mathcal{T}^{c}_{\mathcal{S}}$ by comparing $S$ with $L_{u_{1}}$, but fail as there is a mismatching character $\texttt{i}$ at position 5 with $S[0,5)=L_{u_{1}}[0,5)=\texttt{techn}$ and $S[5]=\texttt{i}\neq\texttt{o}=L_{u_{1}}[5]$. Based on this mismatch result, we search for the child of $u_{1}$ with branching character $({\texttt{i},5})$. However, since there is no such child, we add a new child $u_{2}$ to $u_{1}$ with branching character $({\texttt{i},5})$ and associate the remaining suffix $S[6,|S|)=\texttt{cs\$}$ with $L_{u_{2}}$. The resulting $\mathcal{T}^{c}_{\mathcal{S}}$ for $\mathcal{S}=\{K_{1},K_{2}\}$ is shown in Figure 3b.

(c)

In the third insertion $\textsf{insert}(K_{3})$, we initially set the string variable $S$ to $S\leftarrow K_{3}$ and then compare $S$ with $L_{u_{1}}$ in the same manner as the second insertion. Since $S[0,5)=L_{u_{1}}[0,5)=\texttt{techn}$ and $S[5]=\texttt{i}\neq\texttt{o}=L_{u_{1}}[5]$, we descend to child $u_{2}$ with branching character $({\texttt{i},5})$. After updating $S\leftarrow S[6,|S|)=\texttt{que\$}$, we subsequently compare $S$ with $L_{u_{2}}$ to obtain the mismatch character $\texttt{q}$ at position 0 with $S[0]=\texttt{q}\neq\texttt{c}=L_{u_{2}}[0]$. We search for the child with branching character $({\texttt{q},0})$, but there is no such child; thus, we create the child $u_{3}$ and set $L_{u_{3}}$ to be the remaining suffix $S[1,|S|)=\texttt{ue\$}$. The resulting $\mathcal{T}^{c}_{\mathcal{S}}$ for $\mathcal{S}=\{K_{1},K_{2},K_{3}\}$ is shown in Figure 3c.

(d)

The fourth insertion $\textsf{insert}(K_{4})$ is also conducted in the same manner. The final trie $\mathcal{T}^{c}_{\mathcal{S}}$ is shown in Figure 3d.

3.2. Dictionary Operations

It remains to define the operations lookup and delete to make DynPDT a keyword dictionary. Similar to insert, the operation lookup can be performed by traversing $\mathcal{T}^{c}_{\mathcal{S}}$ from the root. After matching all the characters of $K$, $\textsf{lookup}(K)$ returns the value associated with the last visited node. It returns $\bot$ on a mismatch.

Example 3.2 (Retrieval).

We provide an example for a successful and an unsuccessful search. Both examples are similar to the construction described in Example 3.1.

(1)

We consider $\textsf{lookup}(\texttt{technically\$})$ for the $\mathcal{T}^{c}_{\mathcal{S}}$ in Figure 3d. We define a string variable $S$ initially set to $S\leftarrow\texttt{technically\$}$, and compare $S$ with $L_{u_{1}}$ to retrieve (a part of) the keyword from the root. Since $S[0,5)=L_{u_{1}}[0,5)=\texttt{techn}$ and $S[5]=\texttt{i}\neq\texttt{o}=L_{u_{1}}[5]$, we descend to child $u_{2}$ with branching character $({\texttt{i},5})$. Subsequently, we update $S$ to be the remaining suffix as $S\leftarrow S[6,|S|)=\texttt{cally\$}$ and descend to child $u_{4}$ with branching character $({\texttt{a},1})$ since $S[0,1)=L_{u_{2}}[0,1)=\texttt{c}$ and $S[1]=\texttt{a}\neq\texttt{s}=L_{u_{2}}[1]$. Finally, we update $S\leftarrow S[2,|S|)=\texttt{lly\$}$ and compare $S$ with $L_{u_{4}}$. As both match, we return the value stored in $u_{4}$.

(2)

We consider $\textsf{lookup}(\texttt{technical\$})$ for the $\mathcal{T}^{c}_{\mathcal{S}}$ in Figure 3d. In the same manner as in the above case, we reach node $u_{4}$ with the prefix $\texttt{technica}$ and subsequently compare $S=\texttt{l\$}$ and $L_{u_{4}}=\texttt{lly\$}$. Since $S[0,1)=L_{u_{4}}[0,1)=\texttt{l}$ and $S[1]=\texttt{\$}\neq\texttt{l}=L_{u_{4}}[1]$, we search for a child with branching character $({\texttt{\$},1})$; however, there is no such child. As a result, $\textsf{lookup}(\texttt{technical\$})$ returns $\bot$.
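Both retrievals of Example 3.2 can be traced with the sketch below. It is only an illustration: the trie of Figure 3d is written out by hand as nested tuples, the function name `lookup` and the stored values 1 to 4 are hypothetical, and `None` plays the role of $\bot$.

```python
# Hand-built DynPDT of Figure 3d. Each node is (label, value, children), where
# children maps a branching character (c, i) to a child node. The values 1..4
# are made up for illustration.
u3 = ("ue$", 3, {})
u4 = ("lly$", 4, {})
u2 = ("cs$", 2, {("q", 0): u3, ("a", 1): u4})
u1 = ("technology$", 1, {("i", 5): u2})


def lookup(root, key):
    """Return the value stored for `key` (assumed to end with '$'), else None."""
    node, s = root, key
    while node is not None:
        label, value, children = node
        if s == label:                    # all characters matched
            return value
        i = 0                             # first mismatch between S and L_u
        while s[i] == label[i]:
            i += 1
        node, s = children.get((s[i], i)), s[i + 1:]
    return None                           # no child with the required branching character
```

Here `lookup(u1, "technically$")` follows the edges $({\texttt{i},5})$ and $({\texttt{a},1})$ and matches `lly$`, whereas `lookup(u1, "technical$")` stops at $u_{4}$ because no child with branching character $({\texttt{\$},1})$ exists.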

The operation delete can be implemented by introducing deletion flags for each node (i.e., for each keyword), a trick that is also used in hashing with open addressing (Knuth, 1998, Chapter 6.4, Algorithm L). In other words, $\textsf{delete}(K)$ retrieves $K$ and sets the deletion flag for the node corresponding to $K$ . However, this approach additionally needs one bit for each node. Another approach is to set the value associated with the deleted keyword to $\bot$ as an invalid value. This approach does not need additional space for the deletion flags. Although these approaches do not free up space after deletion, the space is reused for keywords inserted subsequently if the new keywords share sufficiently long prefixes with the deleted ones.

3.3. Fixing the Alphabet

In practice, a critical problem of DynPDT is that the domain of the edge labels $\mathcal{B}$ in $\mathcal{T}^{c}_{\mathcal{S}}$ and the longest length of all node labels $\Lambda$ are not constant in general. We tackle this problem by limiting the size of $\mathcal{B}$ . To this end, we introduce a new parameter $\lambda$ to forcibly fix the alphabet as $\mathcal{B}=\mathcal{A}\times\{0,1,\ldots,\lambda-1\}$ in advance. Within this limitation, suppose that we want to create an edge labeled $({c,i})$ from node $u$ with $i\geq\lambda$ . As this label is not in $\mathcal{B}$ , we create dummy nodes called step nodes with a special character $\phi$ by repeating the following procedure until $i$ becomes less than $\lambda$ : add a new child $v$ of $u$ with branching character $\phi$ and recursively set $u\leftarrow v$ and $i\leftarrow i-\lambda$ . $L_{u}$ is the empty string if $u$ is a step node.

Example 3.3 (Step Node).

We consider $\textsf{insert}(\texttt{technological\$})$ for $\mathcal{T}^{c}_{\mathcal{S}}$ in Figure 3d with $\lambda=8$. We set $S\leftarrow\texttt{technological\$}$ and compare $S$ with $L_{u_{1}}$. Since $S[0,9)=L_{u_{1}}[0,9)=\texttt{technolog}$ and $S[9]=\texttt{i}\neq\texttt{y}=L_{u_{1}}[9]$, we try to create the edge label $({\texttt{i},9})$; however, as $i\geq\lambda$, we instead create a step child $u_{5}$ with branching character $\phi$, descend to this child, and set $i\leftarrow i-\lambda=1$. Since $i$ becomes less than $\lambda$, we define a child $u_{6}$ of the step node $u_{5}$ with branching character $({\texttt{i},1})$ and associate the remaining suffix $S[10,|S|)=\texttt{cal\$}$ with $L_{u_{6}}$. The resulting DynPDT $\mathcal{T}^{c}_{\mathcal{S}}$ is depicted in Figure 4.
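The effect of $\lambda$ on a single edge can be made explicit with a small helper. This is a sketch under assumptions: the function name `edge_labels` and the constant `PHI` are hypothetical, and the function only computes the sequence of branching characters that replaces an edge $({c,i})$, without touching a trie.

```python
PHI = "φ"  # stand-in symbol for the special step character


def edge_labels(c, i, lam):
    """Split the edge label (c, i) into step labels followed by one label in B."""
    steps = i // lam                  # each step node absorbs lam positions
    return [PHI] * steps + [(c, i % lam)]
```

For Example 3.3, `edge_labels("i", 9, 8)` gives one step label followed by $({\texttt{i},1})$, while an edge with $i<\lambda$ is returned unchanged.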

This solution creates additional nodes depending on $\lambda$ . When $\lambda$ is too small, many step nodes are created and extra node traversals are involved. When $\lambda$ is too large, the alphabet size $|\mathcal{B}|$ becomes large and the space usage can increase significantly. Therefore, it is necessary to determine a suitable $\lambda$ . In Section 6, we empirically determine 32 and 64 to be favorable values for $\lambda$ .

3.4. Representation Scheme

To use standard trie techniques, we split up $\mathcal{T}^{c}_{\mathcal{S}}$ into two parts:

(1)

a (standard) trie structure $\mathcal{T}_{\mathcal{D}}$ for a set of strings $\mathcal{D}\subset\mathcal{B}^{*}$ to represent $\mathcal{T}^{c}_{\mathcal{S}}$ with the difference that it associates each node with a unique id instead of its node label, and

(2)

an associative array that maps the ids of the nodes of $\mathcal{T}_{\mathcal{D}}$ to their corresponding node labels, called node label map (NLM).

For example, in Figure 4, the trie $\mathcal{T}_{\mathcal{D}}$ is built on the string set $\mathcal{D}=\{({\texttt{i},5})({\texttt{q},0}),({\texttt{i},5})({\texttt{a},1}),\phi({\texttt{i},1})\}$, and the NLM stores the node labels $L_{u_{1}},L_{u_{2}},\ldots,L_{u_{6}}$ so that they can be accessed by the respective node ids $u_{1},u_{2},\ldots,u_{6}$.

Node-Label-Map

NLM dynamically manages node labels depending on the node ids assigned. As explained in Section 1, we use the m-Bonsai (Poyias et al., 2018) and FK-hash (Fischer and Köppl, 2017) representations for $\mathcal{T}_{\mathcal{D}}$ . Moreover, we design the NLM data structures for m-Bonsai and FK-hash individually, which we respectively present in Sections 4 and 5.

Trie Representation $\mathcal{T}_{\mathcal{D}}$

To discuss the representation approaches in the next sections, we define $\mathcal{T}_{\mathcal{D}}$ to be a dynamic trie with $n$ nodes whose edge labels are characters drawn from the alphabet $\mathcal{B}$ of size $\sigma=|\mathcal{A}|\cdot\lambda$ . Although the number of nodes $n$ depends on $\lambda$ , we write $n:=n(\lambda)$ for simplicity. $\mathcal{T}_{\mathcal{D}}$ supports the following operations:

•

$\textsf{addchild}(u,c)$ adds a new child of $u$ with branching character $c\in\mathcal{B}$ and returns its id.

•

$\textsf{getchild}(u,c)$ returns the id of the child $v$ of $u$ with branching character $c\in\mathcal{B}$ if $v$ exists, or returns $\bot$ otherwise.

Motivation for m-Bonsai and FK-hash

We briefly review some common trie representations and point out their suitability for $\mathcal{T}_{\mathcal{D}}$ . The simplest representation is a list trie (Askitis, 2007, Chapter 2.3.2), which transforms an arbitrary trie to its first-child next-sibling representation. In this representation, each node of the list trie stores its branching character, a pointer to its first child, and a pointer to its next sibling. The list trie represents $\mathcal{T}_{\mathcal{D}}$ in $2n\log{n}+n\log{\sigma}$ bits and supports addchild and getchild in $\mathcal{O}(\sigma)$ time; however, the operation time becomes problematic if $\sigma=|\mathcal{A}|\cdot\lambda$ is large. Another representation is a ternary search trie (TST) (Bentley and Sedgewick, 1997) that reduces the time complexity of the list trie to $\mathcal{O}(\log\sigma)$ ; however, the space usage grows to $3n\log{n}+n\log{\sigma}$ bits. A well-known time- and space-efficient representation is the double array (Aoe, 1989). Its space usage is $2n\log n$ bits in the best case, while supporting getchild in $\mathcal{O}(1)$ time; however, a double array for a large alphabet tends to be sparse in practice. Actually, we are only aware of dynamic double-array implementations handling byte characters (e.g., (Yoshinaga and Kitsuregawa, 2014; Kanda et al., 2018)). Judy (Baskins, 2002) and ART (adaptive radix tree) (Leis et al., 2013) are trie representations that dynamically choose suitable data structures for the trie topology; however, both are also designed for byte characters. As each trie node is associated with an id, compact tries like the z-fast trie (Belazzougui et al., 2010) representing only $O(|\mathcal{D}|)$ nodes explicitly become inefficient with this requirement.

Compared to these trie representations, m-Bonsai and FK-hash have better complexities. m-Bonsai can represent $\mathcal{T}_{\mathcal{D}}$ in $cn(\log\sigma+\mathcal{O}(1))$ bits of expected space for a constant $c>1$ , while supporting getchild and addchild in $\mathcal{O}(1)$ expected time (Poyias et al., 2018). Compared to that, FK-hash needs $cn\log n$ additional bits of expected space, but supports faster insertions in practice.

A straightforward solution to provide the NLM for m-Bonsai and FK-hash is to store the node labels as satellite data in the respective hash table. However, by doing so, we would waste space for each unoccupied entry in the hash table. In the following, we present efficient solutions for the NLM tailored to m-Bonsai and FK-hash.

4. Representation Based on m-Bonsai

This section presents our approach based on m-Bonsai (Poyias et al., 2018). m-Bonsai represents trie nodes as entries in a closed hash table and, informally speaking, compactifies the stored keys with compact hashing (Knuth, 1998).

Outline

We present a plain and a compact form of the $\mathcal{T}_{\mathcal{D}}$ representation based on m-Bonsai. We refer to the former as PBT (Plain m-Bonsai Trie), which is a non-compact variant of m-Bonsai. PBT can be useful for fast implementation although it has not been considered in any applications yet. We refer to the latter as CBT (Compact m-Bonsai Trie) as it uses the original m-Bonsai implementation. We describe PBT and CBT in Sections 4.1 and 4.2, respectively. In both variants, we maintain a hash table $H$ of size $m$ with the load factor $\alpha=n/m\leq 1$ to store $n$ nodes. In Section 4.3, we propose a linear-time growing algorithm based on the approach of Arroyuelo et al. (Arroyuelo et al., 2017). Finally, in Section 4.4, we propose NLM data structures designed for PBT and CBT.

4.1. Plain Trie Representation

PBT uses a hash function $h:\mathbb{N}\rightarrow\mathbb{N}$. Trie nodes are elements in the hash table. As their locations in the hash table are fixed unless the hash table is rebuilt, we use these locations as node ids. In other words, the id of a node located at $H[u]$ is $u$. $\textsf{addchild}(u,c)$ is performed as follows. We first compose the hash key $k=({u,c})\in\{0,1,\ldots,m-1\}\times\mathcal{B}$ and then compute its initial address $i=h(k)\bmod m$, where $a\bmod b$ is defined as $a-b\cdot\lfloor{a/b}\rfloor$. Let $i^{\prime}$ be the first vacant address from $i$ determined by linear probing. We create the new child by $H[i^{\prime}]\leftarrow k$. That is, the id of the new child becomes $i^{\prime}$. getchild can also be computed in the same manner. If $h$ is fully independent and uniformly random, the operations can be performed in $\mathcal{O}(1)$ expected time. PBT uses $m\lceil{\log(m\sigma)}\rceil$ bits of space.

Practical Implementation

The table size $m$ is a power of two in order to quickly compute the modulo operation of $h(k)\bmod m$ by using the bitwise AND operation $h(k)\&(m-1)$ (Migliore et al., 2019, Section 4.4). We set the maximum load factor to $\hat{\alpha}:=0.9$. If $\alpha$ reaches $\hat{\alpha}$ during an update, we double the size of the hash table by the growing algorithm described in Section 4.3. We set the initial capacity of the hash table to $m=2^{16}$. Our hash function $h$ is a XorShift hash function (see http://xorshift.di.unimi.it/splitmix64.c) derived from (Steele Jr et al., 2014).
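The behavior of PBT can be illustrated with the following sketch. It is not the authors' code: the class name `PlainBonsaiTrie`, the root sentinel, and the tiny default capacity are assumptions made for readability, the function `mix64` is a SplitMix64-style finalizer used as a stand-in for $h$, and growing (Section 4.3) is left out, so the sketch raises an error instead of exceeding the load factor 0.9. Characters of $\mathcal{B}$ are assumed to be encoded as integers in $[0,\sigma)$.

```python
MASK64 = (1 << 64) - 1


def mix64(x):
    """SplitMix64-style finalizer, used here as a stand-in for h."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


class PlainBonsaiTrie:
    """Slots of H double as node ids; H[i] holds the key (parent id, character)."""

    def __init__(self, sigma, m=16):
        assert m & (m - 1) == 0                  # m must be a power of two
        self.sigma, self.m = sigma, m
        self.H = [None] * m
        self.root = 0
        self.H[self.root] = ("root",)            # sentinel key (assumption of this sketch)
        self.n = 1

    def _find(self, u, c):
        """Return (slot, found) for the key (u, c) using linear probing."""
        i = mix64(u * self.sigma + c) & (self.m - 1)   # h(k) mod m via bitwise AND
        while self.H[i] is not None:
            if self.H[i] == (u, c):
                return i, True
            i = (i + 1) & (self.m - 1)
        return i, False

    def getchild(self, u, c):
        i, found = self._find(u, c)
        return i if found else None

    def addchild(self, u, c):
        i, found = self._find(u, c)
        if not found:
            if self.n + 1 > 0.9 * self.m:
                raise RuntimeError("table full; growing is omitted in this sketch")
            self.H[i] = (u, c)                   # the slot i becomes the id of the child
            self.n += 1
        return i
```

Because node ids are table addresses, moving an entry changes the key of each of its children, which is why growing this table needs the dedicated algorithm of Section 4.3.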

4.2. Compact Trie Representation

CBT reduces the space usage of PBT with the compact hashing technique (Knuth, 1998). Locating nodes on a compact hash table is identical to PBT with the difference that CBT uses a bijective transform $h:\{0,1,\ldots,m\sigma-1\}\rightarrow\{0,1,\ldots,m\sigma-1\}$ that maps a key $k$ to its hash value $h(k)\bmod m$ and its quotient $\lfloor{h(k)/m}\rfloor$ . Instead of $k$ , the compact hash table stores only its quotient $\lfloor{h(k)/m}\rfloor$ in $H[i^{\prime}]$ . The hash value $h(k)$ can be restored from the initial address $i=h(k)\bmod m$ and the quotient $H[i^{\prime}]=\lfloor{h(k)/m}\rfloor$ , where $i^{\prime}$ is the first empty slot at or after the initial address $i$ . The original key $k$ can also be restored from the hash value $h(k)$ since $h$ is bijective. Therefore, addchild and getchild can be performed in the same manner as PBT if the corresponding initial address $i$ can be identified from the location $i^{\prime}$ .

The remaining problem is how to identify the corresponding initial address $i$ from $i^{\prime}$ . Poyias et al. (Poyias et al., 2018) solved this problem by introducing a displacement array $D$ such that $D[i^{\prime}]$ keeps the number of probes from $i$ to $i^{\prime}$ , that is, $D[i^{\prime}]=(i^{\prime}-i)\bmod m$ . Given a location $i^{\prime}$ , one can compute the corresponding initial address $i$ with $(i^{\prime}-D[i^{\prime}])\bmod m$ . Although a value in $D$ is at most $m-1$ , the average value becomes small if $h$ is fully independent and uniformly random and the load factor $\alpha$ is small. Poyias et al. (Poyias et al., 2018) demonstrated that $D$ can be represented in $\mathcal{O}(m)$ bits using CDRW (Compact Dynamic ReWritable) arrays. As $H$ takes $m\lceil{\log{\sigma}}\rceil$ bits for the quotients, CBT can represent $\mathcal{T}_{\mathcal{D}}$ in $m\log\sigma+\mathcal{O}(m)$ expected bits of space.
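The interplay of quotients and displacements can be sketched as follows. This is a simplified illustration with hypothetical names (`CompactSlots`, `key_at`, `add`): it keeps $D$ as a plain list instead of a CDRW array, uses multiplication by an odd constant modulo $m\sigma$ as a stand-in bijection, assumes that $m$ and $\sigma$ are powers of two, and supports neither duplicate detection nor growing.

```python
class CompactSlots:
    """Closed hash table storing quotients and displacements instead of keys."""

    def __init__(self, m, sigma, p=0x9E3779B1):
        self.m, self.sigma = m, sigma        # both assumed to be powers of two
        self.size = m * sigma                # universe of keys k = u * sigma + c
        self.p = p                           # odd, hence invertible modulo a power of two
        self.p_inv = pow(p, -1, self.size)   # requires Python 3.8 or later
        self.H = [None] * m                  # quotients floor(h(k) / m)
        self.D = [0] * m                     # displacements (i' - i) mod m
        self.n = 0

    def _h(self, k):
        return (k * self.p) % self.size      # stand-in bijection on [0, m * sigma)

    def key_at(self, slot):
        """Restore the key (u, c) from H[slot] and D[slot]."""
        i = (slot - self.D[slot]) % self.m   # initial address
        hv = self.H[slot] * self.m + i       # h(k) = quotient * m + (h(k) mod m)
        return divmod((hv * self.p_inv) % self.size, self.sigma)

    def add(self, u, c):
        """Store the key (u, c) and return its slot."""
        assert self.n < self.m - 1           # keep at least one vacant slot
        hv = self._h(u * self.sigma + c)
        i, q = hv % self.m, hv // self.m
        slot = i
        while self.H[slot] is not None:      # linear probing
            slot = (slot + 1) % self.m
        self.H[slot], self.D[slot] = q, (slot - i) % self.m
        self.n += 1
        return slot
```

Each stored quotient is smaller than $\sigma$, which mirrors the $m\lceil{\log{\sigma}}\rceil$ bits that $H$ occupies in CBT.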

Practical Representation of the Displacement Array

The representation of $D$ with the CDRW array seems impractical. Poyias et al. (Poyias et al., 2018) gave an alternative practical representation, where $D$ is represented by three data structures $D_{1}$ , $D_{2}$ and $D_{3}$ as follows.

(1)

$D_{1}$ is a simple array of length $m$ in which each element uses $\Delta_{1}$ bits for a constant $\Delta_{1}>1$.

(2)

$D_{2}$ is a compact hash table (CHT) described by Cleary (Cleary, 1984), which stores keys from $\mathcal{U}=\{0,1,\ldots,m-1\}$ and values from $\{0,1,\ldots,2^{\Delta_{2}}-1\}$ for a constant $\Delta_{2}>1$ . The keys are stored in a closed hash table of length $m^{\prime}<m$ through the compact hashing technique (Knuth, 1998), where $m^{\prime}$ is a power of two (a property that is in common with $m$ ). In detail, the hash table consists of

•

a bijective transform $h:\mathcal{U}\rightarrow\mathcal{U}$ ,

•

an integer array $Q$ of length $m^{\prime}$ to store the quotients of the keys (i.e., entry indices of $D$ ) representable in $\log(m/m^{\prime})$ bits,

•

an integer array $F$ of length $m^{\prime}$ to store displacement values of $D$ representable in $\Delta_{2}$ bits, and

•

two bit arrays each of length $m^{\prime}$ storing the displacement values of the quotients in $Q$ (not to be confused with the displacement values stored in $F$ ).

On inserting a key $k\in\mathcal{U}$, we store its quotient $\lfloor{h(k)/m^{\prime}}\rfloor$ in the first vacant slot in $Q$ starting at the initial address $h(k)\bmod m^{\prime}$. The collisions in $Q$ are therefore resolved with linear probing. However, this collision resolution poses the same problem as in CBT, as additional displacement information is required to restore the initial address of a stored quotient in $Q$. Cleary solves this problem by using two bit arrays (see (Cleary, 1984)). Finally, $F[i]$ stores the value associated with the key whose quotient is stored in $Q[i]$. Since $F$ uses $m^{\prime}\Delta_{2}$ bits of space, $D_{2}$ uses $m^{\prime}\log(m/m^{\prime})+m^{\prime}\Delta_{2}+2m^{\prime}$ bits of space in total.

(3)

$D_{3}$ is a standard associative array that maps keys from $\mathcal{U}$ to values from $\mathcal{U}$ . In our implementation, $D_{3}$ is a closed hash table with linear probing. Given $m^{\prime\prime}$ is the capacity of $D_{3}$ , $D_{3}$ takes $2m^{\prime\prime}\log m$ bits.

The representation of the entry $D[i]$ for an integer $i$ depends on its actual value:

(1)

If $D[i]<2^{\Delta_{1}}-1$, then we store $D[i]$ in the $\Delta_{1}$ bits of $D_{1}[i]$.

(2)

If $2^{\Delta_{1}}-1\leq D[i]<2^{\Delta_{1}}+2^{\Delta_{2}}$, we represent $D[i]$ by the key-value pair $({i,D[i]-2^{\Delta_{1}}})$ stored in $D_{2}$.

(3)

Finally, if $D[i]\geq 2^{\Delta_{1}}+2^{\Delta_{2}}$ , we represent $D[i]$ by the key-value pair $({i,D[i]})$ stored in $D_{3}$ .

In the experiments, we set $\Delta_{1}=4$ and $\Delta_{2}=7$ . We set the initial capacities of $D_{2}$ and $D_{3}$ to $m^{\prime}=2^{12}$ and $m^{\prime\prime}=2^{6}$ , respectively. We set the maximum load factor of $D_{2}$ and $D_{3}$ to 0.9. If the actual load factor of $D_{2}$ (resp. $D_{3}$ ) reaches the maximum load factor 0.9, we double the size of $D_{2}$ (resp. of $D_{3}$ ).

Design of the Bijective Transform

Since we assume that $m$, $m^{\prime}$, and $\sigma$ are powers of two, the bijective transform is $h:\{0,1,\ldots,2^{z}-1\}\rightarrow\{0,1,\ldots,2^{z}-1\}$ for some $z$. We design this function as the composition of two bijective functions $h=h_{1}\circ h_{2}$, where $h_{1}(x)=x\oplus\lfloor{x/2^{a}}\rfloor$ for an integer $a$ larger than $\lfloor{z/2}\rfloor$ and $h_{2}(x)=xp\bmod 2^{z}$ for a large prime $p$ smaller than $2^{z}$. $h_{1}$ is based on the XorShift random number generators (Marsaglia, 2003), where the inverse function $h^{-1}_{1}$ is given by $h^{-1}_{1}(x)=h_{1}(x)$. The inverse function $h^{-1}_{2}$ of $h_{2}$ is given by $h^{-1}_{2}(x)=xp^{-1}\bmod 2^{z}$, where $p^{-1}\in\{1,2,\ldots,2^{z}-1\}$ is the multiplicative inverse of $p$ such that $pp^{-1}\bmod 2^{z}=1$ (see (Köppl et al., 2020) for details). By construction, the inverse function $h^{-1}$ of $h$ is $h^{-1}=h^{-1}_{2}\circ h^{-1}_{1}$. Our hash function is inspired by the SplitMix algorithm (Steele Jr et al., 2014).
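The transform and its inverse are short enough to check exhaustively for a small $z$. The sketch below uses a hypothetical helper `make_bijection`; the concrete parameters $z=10$, $a=6$, and $p=1021$ in the usage line are chosen only for illustration and are not the values used in the experiments.

```python
def make_bijection(z, a, p):
    """Return (h, h_inv) on [0, 2^z) with h = h1 ∘ h2."""
    assert a > z // 2 and p % 2 == 1 and p < (1 << z)
    mask = (1 << z) - 1
    p_inv = pow(p, -1, 1 << z)          # multiplicative inverse of p modulo 2^z

    def h1(x):                          # x XOR floor(x / 2^a); it is its own inverse
        return x ^ (x >> a)

    def h(x):
        return h1((x * p) & mask)       # apply h2 first, then h1

    def h_inv(y):
        return (h1(y) * p_inv) & mask   # undo h1, then undo h2

    return h, h_inv


h, h_inv = make_bijection(z=10, a=6, p=1021)
```

The function $h_{1}$ is an involution because applying it twice yields $x\oplus\lfloor{x/2^{2a}}\rfloor$, and $\lfloor{x/2^{2a}}\rfloor=0$ for every $x<2^{z}$ when $a>\lfloor{z/2}\rfloor$.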

4.3. Linear-Time Growing Algorithm

If the load factor $\alpha$ of hash table $H$ of length $m$ reaches the maximum load factor $\hat{\alpha}$, we create a new hash table $H^{\prime}$ (and a new displacement array $D^{\prime}$ for CBT) of length $2m$ and relocate all nodes to $H^{\prime}$. Since a node depends on the position of its parent in $H$, we can relocate a node only after having relocated all its ancestors. This can be done in a top-down traversal (e.g., in BFS or DFS order) of the tree during which all children of a node are successively selected. However, because selecting all children of a node is performed by checking getchild for all possible characters in $\mathcal{B}$, the relocation based on a top-down traversal needs $\mathcal{O}(n\sigma)$ expected time and is therefore practical only for tiny alphabets. Here we describe a bottom-up approach that is based on the approach by Arroyuelo et al. (Arroyuelo et al., 2017). This approach, called the growing algorithm, runs in $\mathcal{O}(n)$ expected time. Its pseudocode is shown in Algorithm 1.

Given a trie $\mathcal{T}_{\mathcal{D}}$ with a hash table $H$ of length $m$ , the algorithm constructs an equivalent trie $\mathcal{T}^{\prime}_{\mathcal{D}}$ with a hash table $H^{\prime}$ of length $2m$ . To explain the algorithm, we define two operations $\textsf{getedge}(u)$ returning the branching character of node $u$ and $\textsf{getparent}(u)$ returning the parent id of node $u$ . They can be computed in constant time because $H[u]$ explicitly stores the branching character and the parent id as the hash key in PBT. CBT can also restore the hash key from $H[u]$ and $D[u]$ .

In the growing algorithm, we initially define two auxiliary arrays Map and Done: Map is an integer array and Done is a bit array, each of length $m$. We store in $\textsf{Done}[u]$ a 1 after relocating the node stored in $H[u]$. We keep the invariant that whenever $\textsf{Done}[u]=\texttt{1}$, then $\textsf{Map}[u]$ stores the position in $H^{\prime}$ of the node stored in $H[u]$. All bits in Done are initialized to 0 except for the root. We scan $H$ from left to right and perform the following steps for each non-vacant slot $i$. We first set $u$ to $i$ and $\pi$ to an empty string, and then climb up the path from the node $u$ to the root. We prematurely stop when encountering a node $v$ with $\textsf{Done}[v]=\texttt{1}$. In this case, all ancestors of $v$ have already been relocated, so there is no need to visit them again. Subsequently, we walk down the computed path $\pi$ while relocating the visited nodes. Since we do not reprocess already visited nodes, we can perform the node relocation in $\mathcal{O}(m)+\mathcal{O}(n)=\mathcal{O}(n)$ expected time, with $n=\hat{\alpha}\cdot m$ for a constant load factor $\hat{\alpha}$.
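The bottom-up relocation can be sketched for the plain representation as follows. This is an illustration and not Algorithm 1 itself: the function names, the root sentinel `(-1, -1)`, and the use of Python's built-in `hash` as hash function are assumptions, and the path $\pi$ is kept as a list of old node ids rather than as a string of branching characters.

```python
def probe_insert(H, key, hash_fn=hash):
    """Insert `key` into the closed hash table H by linear probing; return its slot."""
    m = len(H)
    i = hash_fn(key) % m
    while H[i] is not None:
        i = (i + 1) % m
    H[i] = key
    return i


def getchild(H, u, c, hash_fn=hash):
    """Return the slot of the child of u with character c, or None."""
    m = len(H)
    i = hash_fn((u, c)) % m
    while H[i] is not None:
        if H[i] == (u, c):
            return i
        i = (i + 1) % m
    return None


def grow(H, root, hash_fn=hash):
    """Relocate all nodes of H into a table of twice the size.

    H[u] holds the key (parent id, character) of node u and H[root] a sentinel.
    Returns the new table and the slot of the root in it.
    """
    m = len(H)
    H2 = [None] * (2 * m)
    Map = [None] * m                  # Map[u]: new slot of the node stored in H[u]
    Done = [False] * m                # Done[u]: the node in H[u] has been relocated
    Map[root] = probe_insert(H2, H[root], hash_fn)
    Done[root] = True
    for i in range(m):                # scan H from left to right
        if H[i] is None:
            continue
        path, u = [], i
        while not Done[u]:            # climb until an already relocated ancestor
            path.append(u)
            u = H[u][0]               # getparent(u)
        for v in reversed(path):      # walk down: parents are relocated before children
            parent, c = H[v]
            Map[v] = probe_insert(H2, (Map[parent], c), hash_fn)
            Done[v] = True
    return H2, Map[root]
```

Every node is appended to a path exactly once, because it is marked in `Done` as soon as it is relocated; this is the source of the linear running time.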

Extra Working Space

Algorithm 1 maintains the auxiliary arrays Map of $m\lceil{\log{(2m)}}\rceil$ bits, Done of $m$ bits and $\pi$ of $h\lceil{\log\sigma}\rceil$ bits, where $h$ is the height of $\mathcal{T}_{\mathcal{D}}$. Thus, the extra working space is $m\lceil{\log{m}}\rceil+2m+h\lceil{\log\sigma}\rceil$ bits if we create the auxiliary arrays naively. However, the working space of Map can be shared with $H$ because $H[i]$ for $\textsf{Done}[i]=\texttt{1}$ is no longer needed. In PBT, the working space of Map can be fully placed in $H$ because the space of $H$ is $m\lceil{\log(m\sigma)}\rceil$ bits and $\sigma$ is at least $2$ in practice (even for $\sigma=1$, a simple bit array suffices). Based on this in-place approach, the extra working space of Algorithm 1 is only $m+h\lceil{\log\sigma}\rceil$ bits, accounting for Done and $\pi$ in PBT. In practice, the space of $\pi$ is negligible because $h$ is bounded by the maximum length of keywords in $\mathcal{S}$ and $h\ll m$.

In CBT, $H$ uses only $m\lceil{\log\sigma}\rceil$ bits. As $\sigma\ll m$ in most scenarios, it is difficult to completely store Map in $H$ ; however, we can also use the space of $D_{1}$ , which is $m\Delta_{1}$ bits. If $\lceil{\log{(2m)}}\rceil\leq\lceil{\log\sigma}\rceil+\Delta_{1}$ , Map can be fully placed in $H$ and $D$ ; otherwise, the extra working space of $m(\lceil{\log{(2m)}}\rceil-\lceil{\log\sigma}\rceil-\Delta_{1})$ bits for Map is needed in addition to that of Done and $\pi$ .

4.4. NLM Data Structures

In m-Bonsai, the node ids are values drawn from the universe $[0,m)$ whose randomness depends on the hash function used. As the task of an NLM data structure is to map node ids to their respective node labels, an appropriate NLM data structure for m-Bonsai is a dynamic associative array that stores node label strings $L_{i}$ for arbitrary integer keys $i\in[0,m)$. In what follows, we first present a plain approach and then show how to compactify it.

Plain NLM

The simplest approach is to use a pointer array $P$ of length $m$ such that $P[i]$ stores the pointer to $L_{i}$ or $\bot$ if no node with id $i$ exists. We refer to the approach as PLM (Plain Label Map). Figure 5a shows an example of PLM. Given a node of id $i$ , PLM can obtain $L_{i}$ through $P[i]$ in $\mathcal{O}(1)$ time. However, $P$ takes $mw=\mathcal{O}(m\log m)$ bits, where the word size is $w=\Theta(\log m)$ . This space consumption is obviously large.

Sparse NLM

We present an alternative compact approach that reduces the pointer overhead of PLM in a manner similar to Google’s sparse hash table (Google Inc., 2005). In this approach, we divide the node labels into groups of $\ell=\Theta(w)$ labels over the ids. That is, the first group consists of $L_{0},L_{1},\ldots,L_{\ell-1}$, the second group consists of $L_{\ell},L_{\ell+1},\ldots,L_{2\ell-1}$, and so on. Moreover, we introduce a bitmap $B$ such that $B[i]=\texttt{1}$ iff $L_{i}$ exists. We concatenate all node labels $L_{i}$ with $B[i]=\texttt{1}$ of the same group together, sorted in the id order. The length of $P$ becomes $\lceil{m/\ell}\rceil$ by maintaining, for each group, a pointer to its concatenated label string. We refer to the approach as SLM (Sparse Label Map).

With the array $P$ and the bitmap $B$, we can access $L_{i}$ as follows: If $B[i]=\texttt{0}$, we are done since $L_{i}$ does not exist in this case; otherwise, we obtain the concatenated label string storing $L_{i}$ from $P[g]$, where $g=\lfloor{i/\ell}\rfloor$. Given $j=\sum_{k=0}^{i\bmod\ell}B_{g}[k]$ for the bit chunk $B_{g}:=B[g\ell,(g+1)\ell)$, $L_{i}$ is the $j$-th node label of the concatenated label string. As $\ell=\Theta(w)$, counting the occurrences of 1s in chunk $B_{g}$ is supported in constant time using the popcount operation (González et al., 2005). It remains to explain how to find $L_{i}$ in the respective concatenated label string. For that we present two representations of the concatenated label strings:

(1)

If the node labels are straightforwardly concatenated (e.g., the second group in Figure 5a is $\texttt{cal\$ue\$}$ for $\ell=4$), we can sequentially count the $\texttt{\$}$ delimiters to find the $(j-1)$-th delimiter marking the ending of the $(j-1)$-th stored string, after which $L_{i}$ starts. We can therefore extract $L_{i}$ in $\mathcal{O}(\ell\Lambda)$ time, where $\Lambda$ again denotes the maximum length of all node labels.

(2)

We can shorten the scan time with the skipping technique used in array hashing (Askitis and Zobel, 2005). This technique puts its length in front of each node label via some prefix encoding such as VByte (Williams and Zobel, 1999). Note that we can omit the terminators of each node label. The skipping technique allows us to jump ahead to the start of the next node label; therefore, the scan is supported in $\mathcal{O}(\ell)$ time. Figure 5b shows an example of SLM with the skipping technique.

Regarding the space usage of SLM, $P$ and $B$ use $w\lceil{m/\ell}\rceil$ and $m$ bits, respectively. For $\ell=\Theta(w)$ , the total space usage becomes $\mathcal{O}(m)$ bits, which is smaller than $mw$ bits in PLM; however, the access time is $\mathcal{O}(w)=\mathcal{O}(\log m)$ .
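A possible shape of SLM with the skipping technique is sketched below. The class name `SparseLabelMap` and its methods are hypothetical, each bit chunk $B_{g}$ is kept as a Python integer, and a single length byte replaces the VByte prefix, so labels are assumed to be shorter than 256 bytes. Terminators are omitted because the length prefix already delimits each label.

```python
class SparseLabelMap:
    """Labels grouped by id: one bit chunk and one byte string per group."""

    def __init__(self, m, ell=8):
        self.ell = ell
        groups = -(-m // ell)                    # ceil(m / ell)
        self.B = [0] * groups                    # B_g as an integer bit chunk
        self.P = [b""] * groups                  # concatenated length-prefixed labels

    def _locate(self, i):
        """Return (group, byte offset of the entry for id i, whether L_i exists)."""
        g, r = divmod(i, self.ell)
        below = bin(self.B[g] & ((1 << r) - 1)).count("1")   # labels with smaller id
        pos = 0
        for _ in range(below):                   # jump over each label via its length
            pos += 1 + self.P[g][pos]
        return g, pos, bool((self.B[g] >> r) & 1)

    def get(self, i):
        g, pos, exists = self._locate(i)
        if not exists:
            return None
        n = self.P[g][pos]
        return self.P[g][pos + 1:pos + 1 + n]

    def set(self, i, label):
        """Store the label of a new id i, keeping the group sorted in id order."""
        g, pos, exists = self._locate(i)
        assert not exists and len(label) < 256
        self.P[g] = self.P[g][:pos] + bytes([len(label)]) + label + self.P[g][pos:]
        self.B[g] |= 1 << (i % self.ell)
```

The count `below` equals $j-1$ in the notation above, that is, the number of labels that have to be skipped before $L_{i}$ starts.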

5. Representation Based on FK-hash

This section presents our DynPDT representation approaches based on FK-hash (Fischer and Köppl, 2017). The basic idea of FK-hash is the same as that of m-Bonsai. The difference is that FK-hash incrementally assigns node ids and explicitly stores them as values in the hash table, while m-Bonsai uses the locations of the stored elements of the hash table as node ids. Although FK-hash uses more space than m-Bonsai, the assignment of node ids simplifies the growing algorithm.

Outline

In the same manner as m-Bonsai, we consider a plain and a compact representation based on FK-hash. In Section 5.1 we present both representations. In Section 5.2 we propose NLM data structures designed for FK-hash.

5.1. Trie Representations

Like m-Bonsai, FK-hash locates nodes on a closed hash table $H$ of length $m$ , but does not use the addresses of $H$ as node ids. FK-hash incrementally assigns node ids from zero and explicitly stores them in an integer array $M$ of length $m$ . In other words, when creating the $u$ -th node by storing it in $H[i]$ , its node id is $u$ , which is stored in $M[i]$ . In a way similar to m-Bonsai, $\textsf{addchild}(u,c)$ is performed as follows: We compose the key $k=({u,c})$ , hash it with $h$ , and then search for the first vacant slot $H[i^{\prime}]$ starting from $i=h(k)\bmod m$ by linear probing. Given that $u_{\text{max}}$ is the currently largest node id, we assign the id $v=u_{\text{max}}+1$ to the new child, and set $H[i^{\prime}]=k$ and $M[i^{\prime}]=v$ . The displacement information $i^{\prime}-i$ is maintained analogously to m-Bonsai.
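
The following sketch illustrates $\textsf{addchild}$ on the plain (non-compact) variant. It is an illustration under assumptions rather than the authors' code: the class name `FKHashTrie`, the method `getchild`, the use of Python's built-in `hash` as $h$, and the convention that the root has id 0 and is not stored in $H$ are hypothetical. Displacement information is omitted because it matters only for the compact variant.

```python
class FKHashTrie:
    """Toy FK-hash-style trie on a closed hash table with linear probing."""

    def __init__(self, m: int = 8) -> None:
        self.m = m
        self.H = [None] * m  # H[i]: key (parent id, symbol), or None if vacant
        self.M = [None] * m  # M[i]: node id of the node stored in H[i]
        self.u_max = 0       # largest id so far; the root is assumed to be id 0

    def _probe(self, key):
        """Return the slot holding `key`, or the first vacant slot after h(key)."""
        i = hash(key) % self.m
        while self.H[i] is not None and self.H[i] != key:
            i = (i + 1) % self.m
        return i

    def getchild(self, u: int, c: str):
        i = self._probe((u, c))
        return self.M[i] if self.H[i] is not None else None

    def addchild(self, u: int, c: str) -> int:
        i = self._probe((u, c))
        if self.H[i] is not None:
            return self.M[i]              # the child already exists
        if self.u_max + 1 >= self.m:
            # Keep one slot vacant so that probing always terminates.
            raise RuntimeError("table full: grow before inserting")
        self.u_max += 1                   # v = u_max + 1
        self.H[i] = (u, c)
        self.M[i] = self.u_max
        return self.u_max


trie = FKHashTrie(m=8)
a = trie.addchild(0, "a")
b = trie.addchild(0, "b")
ab = trie.addchild(a, "b")
```

Node ids depend only on the insertion order, not on where linear probing places the keys.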

In the same manner as m-Bonsai, we can think of two representations depending on whether $H$ is compactified or not. The non-compact one is referred to as PFKT (Plain FK-hash Trie). The compact one is referred to as CFKT (Compact FK-hash Trie). Compared to PBT and CBT, PFKT and CFKT keep an additional integer array $M$ and require $m\lceil{\log{n}}\rceil$ additional bits of space.

Table Growing

An advantage of FK-hash is that growing the hash table is done in the same manner as in standard closed hash tables. In detail, $H$ can be enlarged by scanning nodes on $H$ from left to right and relocating the nodes in a new hash table $H^{\prime}$ of length $2m$ . The growing algorithm takes $\mathcal{O}(m)$ expected time. This time complexity is identical to that of Algorithm 1; however, the growing algorithm of FK-hash is faster in practice because of its simplicity. In addition, no auxiliary data structures such as Map and Done, which Algorithm 1 uses, are needed.
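
A minimal sketch of this growing step is given below, assuming the plain representation with keys of the form (parent id, symbol). The function names `insert`, `lookup`, and `grow` and the use of Python's built-in `hash` are hypothetical. Because the keys refer to node ids rather than to table positions, every key and id can be carried over unchanged.

```python
def insert(H, M, key, node_id):
    """Store (key, node_id) in the first vacant slot found by linear probing."""
    m = len(H)
    i = hash(key) % m
    while H[i] is not None:
        i = (i + 1) % m
    H[i], M[i] = key, node_id


def lookup(H, M, key):
    """Return the node id stored for `key`, or None if the key is absent."""
    m = len(H)
    i = hash(key) % m
    while H[i] is not None:
        if H[i] == key:
            return M[i]
        i = (i + 1) % m
    return None


def grow(H, M):
    """Relocate all nodes into new tables of twice the length."""
    m = len(H)
    H2, M2 = [None] * (2 * m), [None] * (2 * m)
    for i in range(m):              # scan H from left to right
        if H[i] is not None:
            insert(H2, M2, H[i], M[i])
    return H2, M2


H, M = [None] * 4, [None] * 4
insert(H, M, (0, "a"), 1)
insert(H, M, (0, "b"), 2)
insert(H, M, (1, "c"), 3)
H2, M2 = grow(H, M)
```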

5.2. NLM Data Structures

Like in Section 4.4, we introduce PLM and SLM adapted to FK-hash. Figure 6 shows an example for each of them. Although PLM in FK-hash is basically identical to that in m-Bonsai, SLM can be simplified as follows.

In m-Bonsai, it is necessary to identify whether $L_{i}$ exists and the rank of $L_{i}$ in the group because node ids are randomly assigned; therefore, we introduced a bitmap $B$ of length $m$ and utilized the popcount operation. In FK-hash, however, such a bitmap is not needed because node ids are incrementally assigned. Put simply, a node label $L_{i}$ is stored in the group of id $g=\lfloor{i/\ell}\rfloor$ and located at the $(i\bmod\ell)$ -th position in the group. When using the skipping technique, care has to be taken for the step nodes whose node labels are empty. For each of them, we put the length 0 in its corresponding concatenated label string. For example, we put a ’0’ in the second concatenated label string for the step node $u_{5}$ in Figure 6b. Finally, we can insert a new node label by appending it to the last concatenated label string.
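
The simplified SLM can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the class name `IncrementalSLM` is hypothetical, and a single-byte length prefix (labels shorter than 128 bytes) stands in for a general prefix code such as VByte to keep the example short.

```python
class IncrementalSLM:
    """Toy SLM for incrementally assigned node ids: no bitmap is needed."""

    def __init__(self, ell: int) -> None:
        self.ell = ell
        self.P = []    # P[g]: length-prefixed labels of group g
        self.size = 0  # number of labels stored so far (the next node id)

    def append(self, label: bytes) -> int:
        """Store the label of the next node id and return that id."""
        assert len(label) < 128, "single-byte length prefix in this sketch"
        if self.size % self.ell == 0:
            self.P.append(bytearray())        # open a new group
        self.P[-1] += bytes([len(label)]) + label
        self.size += 1
        return self.size - 1

    def access(self, i: int) -> bytes:
        g, r = divmod(i, self.ell)            # group id and position in group
        buf, pos = self.P[g], 0
        for _ in range(r):                    # skip the r preceding labels
            pos += 1 + buf[pos]
        return bytes(buf[pos + 1:pos + 1 + buf[pos]])


slm = IncrementalSLM(ell=4)
labels = [b"tech", b"", b"nology", b"ie", b"", b"cal"]
ids = [slm.append(x) for x in labels]  # empty labels model step nodes
```

Empty labels are stored as a bare length of 0, so the position of every later label in the group stays aligned with its node id.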

6. Experiments

In this section we evaluate the practical performance of DynPDT. The source code for our experiments is available at https://github.com/kampersanda/dictionary_bench.

6.1. Setup

We conducted all experiments on one core of a quad-core Intel Xeon CPU E5-2680 v2 clocked at 2.80 GHz in a machine with 256 GB of RAM, running the 64-bit version of CentOS 6.10 based on Linux 2.6. We implemented our data structures in C++17. We compiled the source code with g++ (version 7.3.0) in optimization mode -O3. We used 4-byte integers for the values associated with the keywords.

Datasets

Our benchmarks are based on the following eight real-world datasets:

•

GeoNames consists of 7 million different names for the geographic points provided by the GeoNames database (http://download.geonames.org/export/dump/). Managing such geographic identifiers within a limited resource is essential in modern geographic information systems as described in (Martínez-Prieto et al., 2016). We obtained the geographic names by extracting the asciiname column of the GeoNames dump in the same manner as (Martínez-Prieto et al., 2016).

•

AOL consists of 10 million different search queries in the AOL database, which is a huge collection of 20 million search queries from 650,000 users sampled over three months (http://www.cim.mcgill.ca/~dudek/206/Logs/AOL-user-ct-collection/). The dataset contains keywords written in natural English and has often been used to benchmark search algorithms such as (Grossi and Ottaviano, 2014).

•

Wiki consists of 14 million different page titles from the English Wikipedia dump of September 2018 (https://dumps.wikimedia.org/enwiki/). As the dataset contains various special characters encoded in UTF-8, the alphabet size is larger than that of AOL. It is also a well-used dataset to benchmark search algorithms such as (Grossi and Ottaviano, 2014; Arz and Fischer, 2018; Kanda et al., 2017a).

•

DNA consists of all 12-mers (i.e., substrings of length 12) found in the DNA dataset from the Pizza&Chili corpus (http://pizzachili.dcc.uchile.cl/texts/dna/). Among the used datasets, it has the smallest alphabet and the shortest keywords. The number of keywords is 15 million. In bioinformatics, popular alignment software needs to manage such keywords within limited space as described in (Martínez-Prieto et al., 2016).

•

LUBMS consists of 53 million different URIs extracted from the RDF dataset generated by the Lehigh University Benchmark (Guo et al., 2005) for 1,600 universities. (The dataset is distributed under the name ‘DS5’ at https://exascale.info/projects/web-of-data-uri/.) Modern RDF systems (Wylot et al., 2011; Wylot et al., 2014) encode URIs in a huge set into unique integers by using a dynamic keyword dictionary. The dataset is evaluated in (Mavlyutov et al., 2015) to analyze the performances of RDF systems.

•

LUBML consists of 230 million different URIs extracted from the RDF dataset generated by the Lehigh University Benchmark (Guo et al., 2005) for 7,000 universities. (Although this dataset is not distributed, one can obtain the identical dataset through the LUBM data generator, called UBA, at http://swat.cse.lehigh.edu/projects/lubm/.) The dataset is a larger version of LUBMS. It is also evaluated in (Mavlyutov et al., 2015).

•

UK consists of 40 million different URLs obtained from a 2005 crawl of the .uk domain performed by UbiCrawler (Boldi et al., 2004) (http://law.di.unimi.it/webdata/uk-2005/). URLs are traditionally used to benchmark search algorithms for long strings such as (Grossi and Ottaviano, 2014; Arz and Fischer, 2018; Kanda et al., 2017a; Askitis and Sinha, 2010). Also, the modern Web crawler (Ueda et al., 2013) manages a huge set of URLs by using a dynamic keyword dictionary.

•

WebBase consists of 118 million different URLs of a 2001 crawl performed by the WebBase crawler (Hirai et al., 2000) (http://law.di.unimi.it/webdata/webbase-2001/). The dataset is larger than UK and also used in previous experiments of keyword dictionaries such as (Grossi and Ottaviano, 2014).

Table 1 summarizes relevant statistics for each dataset.

6.2. Average Height

We evaluate the average height of the DynPDT $\mathcal{T}^{c}_{\mathcal{S}}$ built on our datasets. The average height of $\mathcal{T}^{c}_{\mathcal{S}}$ is the arithmetic mean of the heights of all nodes, omitting step nodes in the calculation. Although the average height is an important measure related to the average number of random accesses, we cannot a priori predict the average height of DynPDT because this number depends on the insertion order of the keywords. To reason about the quality of the average height, we study it in relation to the following known lower and upper bounds on it: The lower bound is the average height of the path-decomposed trie created by the centroid path decomposition (Alexandre, 2016, Corollary 3). The upper bound is the average height of the path-decomposed trie created by always choosing the child whose subtrie has the fewest leaves.

Table 2 shows the experimental results of the average heights of $\mathcal{T}^{c}_{\mathcal{S}}$ and $\mathcal{T}_{\mathcal{S}}$ for all the datasets. To analyze the performance of DynPDT in our experiments, we constructed DynPDT dictionaries by inserting keywords in random order. For that, we shuffled the dataset with the Fisher–Yates shuffle algorithm (Durstenfeld, 1964). Naturally, the actual average heights of $\mathcal{T}^{c}_{\mathcal{S}}$ are between their lower and upper bounds, and those of $\mathcal{T}_{\mathcal{S}}$ are the same as AveLen. The upper bounds are more than twice as large as the lower bounds for AOL, UK, and WebBase; however, the upper bounds were up to 5.4x smaller than the average heights of $\mathcal{T}_{\mathcal{S}}$ due to the path decomposition, especially for long keywords such as URIs. Therefore, the incremental path decomposition can make dynamic keyword dictionaries more cache-friendly, especially for long keywords even if the insertion order is inconvenient and the average height is close to the upper bound.
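
For reference, the shuffle used to randomize the insertion order can be sketched as below. This is Durstenfeld's in-place variant of the Fisher–Yates shuffle; the function name, the seed, and the sample keywords are illustrative assumptions and do not reflect the actual experimental scripts.

```python
import random


def fisher_yates_shuffle(items, rng):
    """Shuffle `items` in place, producing a uniformly random permutation."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
        items[i], items[j] = items[j], items[i]


keywords = ["technology", "technical", "technique", "trie", "trial"]
shuffled = list(keywords)
fisher_yates_shuffle(shuffled, random.Random(42))  # fixed seed for repeatability
```

Fixing the seed makes the insertion order, and hence the resulting path decomposition, reproducible across runs.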

6.3. Parameter for Step Nodes

The parameter $\lambda$ influences the number of step nodes. We analyze the space and time performance of DynPDT when varying the parameter $\lambda$ . In this experiment, we constructed DynPDT dictionaries for each parameter $\lambda\in\{4,8,16,\ldots,1024\}$ on the datasets Wiki, LUBMS and UK, and observed the working space and the construction time. For the DynPDT representation, we tested the combination of CFKT and SLM with $\ell=16$ , referred to as PDT-CFK in the following. As described in Section 6.2, the dictionary was constructed by inserting keywords in random order. The working space was measured by checking the maximum resident set size (RSS) required during the online construction.

Table 3 shows the experimental results for construction. Since $\lambda$ has a direct impact on $\sigma$ , which influences the space usage of $H$ , the working space depends on the value of $\lambda$ . Although this dependency suggests that the working space grows with $\lambda$ , the working spaces for $\lambda=4$ on Wiki and UK (i.e., 0.36 GiB and 1.22 GiB, respectively) were not the smallest. For Wiki, the reason is that many step nodes raised the load factor $\alpha$ and involved an additional enlargement of the hash table. Specifically, the enlargements were conducted nine times with $\lambda=4$ , although they were conducted eight times with $\lambda\geq 8$ . For UK, the reason is that the high load factor $\alpha$ caused by a huge number of step nodes raised the average displacement value stored in $D$ and involved the use of $D_{2}$ and $D_{3}$ , although no additional enlargement was conducted. Regarding the time performance, this huge number of step nodes slowed down the construction. Therefore, too small a parameter $\lambda$ can involve large space requirements and long construction times. On the other hand, when $\lambda\geq 16$ , the working space and construction time do not significantly vary.

From this observation, we derive two facts for $\lambda$ : On the one hand, the most important recommendation is not to choose a parameter $\lambda$ that is too small. On the other hand, choosing a large parameter $\lambda$ is not a significant problem because the space and time performance do not significantly decrease as $\lambda$ grows. For example, when $\lambda=32$ on Wiki, the proportion of step nodes is 0.12%; however, even with a larger parameter $\lambda$ such as 512 or 1024, the working space and construction time are almost the same. Table 4 shows Steps for each parameter $\lambda$ and the average length of the node labels (denoted by AveNLL) for all the datasets. Even for long keywords like URLs (i.e., UK), AveNLL is bounded by 18.0 and Steps is within 1% of all nodes when $\lambda=64$ . Among the tested values for $\lambda$ , we suggest setting $\lambda$ to 32 or 64 for keywords whose length is not much longer than that of the URL datasets.

6.4. Comparison among DynPDT Representations

We compared the performance of our DynPDT representations, for which we benchmarked the following six combinations:

•

PDT-PB is the combination of PBT and PLM,

•

PDT-SB is the combination of PBT and SLM,

•

PDT-CB is the combination of CBT and SLM,

•

PDT-PFK is the combination of PFKT and PLM,

•

PDT-SFK is the combination of PFKT and SLM, and

•

PDT-CFK is the combination of CFKT and SLM.

We evaluated the working space during the construction and the running times of insert and lookup. Like in Section 6.3, we constructed each dictionary and measured its working space. To measure the lookup time, we chose 1 million random keywords from each dataset. The running times are the average of 10 runs. For SLM, we tested $\ell\in\{8,16,32,64\}$ . For $\lambda$ , we chose the smallest value among those from Table 4 where Steps is less than 1%.

Figure 7 shows the experimental results for GeoNames and WebBase. Regarding the representations using SLM, the working space is the largest but the running times are the shortest with $\ell=8$ , and vice versa with $\ell=64$ . In other words, for each representation in the plots, the rightmost and lowest result is the one with $\ell=8$ , and the leftmost and highest result is the one with $\ell=64$ .

We observe that

•

SLM significantly reduces the working space of PLM. Compared to PDT-PB, PDT-SB is 57–65% smaller for GeoNames and 46–56% smaller for WebBase. Compared to PDT-PFK, PDT-SFK is 56–61% smaller for GeoNames and 47–52% smaller for WebBase.

•

Regarding the representations based on m-Bonsai, the insert time of SLM is slower than that of PLM because inserting a new node label into the group is costly. When $\ell=8$ , the insertion of PDT-SB is 29–163% slower than that of PDT-PB; however, the lookup times are competitive.

•

Regarding the representations based on FK-hash, SLM with $\ell=8$ is competitive to PLM with respect to the insert time because the update algorithm is simple. Also, the lookup times are competitive.

•

The time performance of SLM with large group sizes ( $\ell=32$ or $64$ ) is worse than that of SLM with small group sizes ( $\ell=8$ or $16$ ). For example, for GeoNames, PDT-SB with $\ell=64$ is 19% smaller but 81–105% slower than PDT-SB with $\ell=8$ .

•

The compact trie representations CBT and CFKT are more lightweight but slower than the plain representations PBT and PFKT; however, the differences are small. For example, PDT-CB is 12% smaller but 8–11% slower than PDT-SB for GeoNames.

•

The representations based on m-Bonsai are smaller than those based on FK-hash. Also regarding the lookup time, the m-Bonsai representations are faster. However, regarding the insert time, the FK-hash representations are faster because the growing algorithm is simple.

6.5. Comparison with Existing Data Structures

We compare the performance of DynPDT with existing data structures. We exhaustively tested existing implementations of dynamic keyword dictionaries such as open-source dynamic hash containers (Gregory, 2016; Tessil, 2017c, 2016) and recent dynamic trie indexes (Tsuruta et al., 2020; Takagi et al., 2016). However, compared to DynPDT, most of them consumed significantly more space. For our benchmarks, we selected the following four space-efficient implementations (all the experimental results are shown in Appendix A):

•

ArrayHash is a cache-conscious hash table with string keys (Askitis and Zobel, 2005).

•

HAT is a hybrid data structure of the burst trie (Heinz et al., 2002) and ArrayHash (Askitis and Sinha, 2010).

•

Judy is a trie-based dictionary implementation developed at Hewlett-Packard Research Labs (Baskins, 2002).

•

Cedar developed by Yoshinaga (Yoshinaga and Kitsuregawa, 2014) is an efficient dictionary implementation based on dynamic double-array tries (Aoe, 1989).

For ArrayHash and HAT, we used Tessil’s implementations (Tessil, 2017a, b). From the three implementation variations of Cedar, we took one based on a reduced trie (Yoshinaga and Kitsuregawa, 2014) and one based on prefix trie (Aoe, 1989), and denote them by Cedar-R and Cedar-P, respectively. Cedar-R is suitable for short keywords, whereas Cedar-P is suitable for the general case. (We cannot be more concrete here since the efficiency of the heuristics of these data structures does not merely depend on the keyword lengths.)

We evaluated the working space and the running times in the same manner as Section 6.4. Figure 8 shows the experimental results for the four datasets GeoNames, AOL, Wiki, and DNA consisting of short keywords. Figure 9 shows the experimental results for the four datasets LUBMS, LUBML, UK, and WebBase consisting of long keywords. For our methods, we only plot the results of PDT-SB, PDT-CB, PDT-SFK and PDT-CFK, setting $\ell$ to 8, 16, or 32. To keep focus on the competitive contestants in the plots, we omitted some weaker instances, namely the DynPDT dictionaries with $\ell=64$ and the dictionaries with PLM. The former are too slow, while the latter take too much working space. Only for DNA, we plotted the results of Cedar-R instead of Cedar-P because Cedar-R is superior on that instance. For LUBML and WebBase, we were not able to run our experiments with Cedar because the resulting number of trie nodes becomes too large to be representable in Cedar based on 32-bit pointers. For the long keywords (Figure 9), we omitted the results of ArrayHash because its working space is too large. For example, ArrayHash is 143% larger than HAT for LUBMS.

Based on Figure 8 showing the evaluation for short keywords, we can state the following observations:

•

The DynPDT dictionaries are the smallest. PDT-CB for $\ell=32$ is 25–48% smaller than the existing smallest data structures (Cedar-R for DNA and HAT for the others). PDT-CFK with $\ell=32$ is 29–39% smaller than HAT for the datasets except DNA.

•

Regarding the insert time, HAT is the fastest. Except for DNA, the DynPDT dictionaries based on FK-hash, PDT-SFK and PDT-CFK, are competitive to the other data structures.

•

Regarding the lookup time, ArrayHash is the fastest. Except for DNA, the DynPDT dictionaries based on m-Bonsai, PDT-SB and PDT-CB, are competitive to Judy.

•

For DNA consisting of short keywords, the DynPDT dictionaries are not efficient because the merits of the path decomposition applied to a trie with only short paths become negligible compared to the additional burden of representing the trie with two separate data structures, one for its path-decomposed trie topology and one for its node labels.

Based on Figure 9 showing the evaluation for long keywords, we can state the following observations:

•

The DynPDT dictionaries are the smallest for all the datasets. When $\ell=32$ , PDT-CB is 49–60% smaller than Cedar-P for LUBMS and UK, and is 64–68% smaller than Judy for LUBML and WebBase. When $\ell=32$ , PDT-CFK is 42–49% smaller than Cedar-P for LUBMS and UK, and is 58–59% smaller than Judy for LUBML and WebBase.

•

Regarding the insert time, PDT-SFK is competitive to the other data structures.

•

Regarding the lookup time, HAT is the fastest although its working space is large. Compared to PDT-SB with $\ell=8$ , HAT is 40–78% faster but 48–61% larger.

•

In many cases, the DynPDT dictionaries outperform Judy and Cedar-P. For example, PDT-SFK with $\ell=8$ is 48% smaller and 4–25% faster than Judy for WebBase. PDT-CB with $\ell=8$ is 48% smaller and 15–35% faster than Cedar-P for LUBMS.

Summary

Throughout all dataset instances, DynPDT is the smallest data structure. Especially for long keywords such as URIs, our dictionaries are space-efficient and fast thanks to the path decomposition; however, they are not efficient for extremely short keywords because the path decomposition does not work well on such instances. In summary, DynPDT is useful for in-memory applications handling massive datasets consisting of long keywords.

For example, the RDF database system Diplodocus (Wylot et al., 2011; Wylot et al., 2014) encodes every URI as an integer number through a dynamic keyword dictionary because the fixed-size integers can be handled more efficiently than the original strings having variable lengths. Since the encoding time is a significant part of the query execution time on the Diplodocus system, Mavlyutov et al. (Mavlyutov et al., 2015) experimentally compared a series of dynamic keyword dictionaries. Actually, LUBMS and LUBML of our datasets are exactly those evaluated in (Mavlyutov et al., 2015). They concluded that HAT is a good data structure taking aspects like working space and time performance into account. (Judy and Cedar were not evaluated by Mavlyutov et al. (2015).) However, as demonstrated in our experiments, our DynPDT dictionaries can maintain the URI datasets in space up to 74% smaller than HAT, while keeping competitive insertion times. Although DynPDT’s slow lookup time is a drawback compared to HAT, maintaining massive RDF database systems in main memory is essential, and we believe that DynPDT’s high memory efficiency will contribute to the future of Semantic Web applications.

7. Conclusion

We presented a novel data structure for dynamic keyword dictionaries — called DynPDT — which is applicable to scalable string data processing. For that, we applied path decomposition and utilized the recent hash-based trie representations m-Bonsai and FK-hash. We demonstrated with experiments on real-world massive datasets that the memory footprint of DynPDT is the smallest within a careful selection of efficient dynamic keyword dictionaries. It is especially efficient for long keywords due to the path decomposition approach.

Our results pave new ways for major improvements in various existing systems because the dynamic keyword dictionary problem is a common task in applications such as vocabulary accumulation for inverted-index construction (Heinz et al., 2002), RDF database systems (Wylot et al., 2011; Wylot et al., 2014), in-memory OLTP (online transaction processing) database systems (Leis et al., 2013), Web crawlers (Ueda et al., 2013), and search engines (Brazil Inc., 2019; Busch et al., 2012). DynPDT can contribute to those systems especially by reducing their memory requirements. Although we have put the focus on the keyword dictionary problem in this paper, DynPDT as a general data structure is of independent interest, being useful for applications handling dynamic tries. An interesting application is the LZD compression (Goto et al., 2015; Badkobeh et al., 2017), a variation of the LZ78 compression (Ziv and Lempel, 1978). Since the LZD algorithm maintains long factors (or strings) in a dynamic trie, we are confident that the incremental path decomposition on such a trie will have performance benefits.

Our future plans for DynPDT are as follows.

•

The burst trie developed by Heinz et al. (Heinz et al., 2002) maintains sparse subtries in a trie in dynamic containers of strings by collapsing the subtries. DynPDT would be suited as an alternative container representation to enhance the memory efficiency of the burst trie.

•

In our experiments, we implemented the second data structure of the displacement array $D_{2}$ through the CHT by Cleary (Cleary, 1984), following the original m-Bonsai approach (Poyias et al., 2018). Recently, Köppl et al. (Köppl et al., 2020) developed space-efficient hash tables with separate chaining and compact hashing. Although the CHT needs additional displacement information (i.e., two bit arrays), their hash tables do not need such additional information. We expect that their hash tables are suitable representations of $D_{2}$ .

Acknowledgements.

We thank Kazuya Tsuruta for kindly providing us with the implementations used in (Tsuruta et al., 2020). We thank the anonymous reviewers for their helpful comments. A part of this work was supported by JSPS KAKENHI Grant Numbers 17J07555 and JP18F18120.

Appendix A Experimental Results

Within the same setting as in Section 6.5, we present an extended evaluation including the following contestants:

•

STLHash is the hash table std::unordered_map of the C++ standard library.

•

GoogleDense is the hash table implementation google::dense_hash_map of Google (Google Inc., 2005).

•

Sparsepp is Gregory Popovitch’s space-efficient hash container implementation derived from Google’s sparse hash table (Gregory, 2016).

•

Hopscotch is Tessil’s hash table implementation using hopscotch hashing (Tessil, 2016).

•

Robin is Tessil’s hash table implementation using robin hood hashing (Tessil, 2017c).

•

ART is Armon Dadgar’s implementation (Dadgar, 2012) of the adaptive radix tree (Leis et al., 2013).

Further, we include the following implementations, which are also used and studied in the experimental section of (Tsuruta et al., 2019):

•

PCT-Bit is a packed compact trie using bit parallelism (Takagi et al., 2016).

•

PCT-Hash is a packed compact trie using additionally STLHash as a dictionary in each micro trie (Takagi et al., 2016).

•

ZFT is Tsuruta’s C++ implementation of the z-fast trie (Belazzougui et al., 2010).

•

CTrie++ is a trie (Tsuruta et al., 2019) combining aspects of the z-fast trie with the packed compact trie.

Table 5 shows the results for the datasets consisting of short keywords (i.e., GeoNames, AOL, Wiki and DNA). Table 6 shows the results for the datasets consisting of long keywords (i.e., LUBMS, LUBML, UK and WebBase). In these tables, Space is the working space in GiB, Insert is the average insertion time in microseconds, and Lookup is the average lookup time in microseconds. For SLM of DynPDT, the results with $\ell=16$ are shown. Concerning PCT-Bit, PCT-Hash, ZFT and CTrie++, we could not obtain some results for large datasets because the resulting trie was too large to fit into RAM.

Bibliography

Alexandre (2016) Daigle Alexandre. 2016. Optimal path-decomposition of tries. Ph.D. Dissertation. University of Waterloo.
Aoe (1989) Jun’ichi Aoe. 1989. An efficient digital search algorithm by using a double-array structure. IEEE Transactions on Software Engineering 15, 9 (1989), 1066–1077. https://doi.org/10.1109/32.31365
Arroyuelo et al. (2017) Diego Arroyuelo, Rodrigo Cánovas, Gonzalo Navarro, and Rajeev Raman. 2017. LZ78 compression in low main memory space. In Proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE). 38–50. https://doi.org/10.1007/978-3-319-67428-5_4
Arroyuelo et al. (2016) Diego Arroyuelo, Pooya Davoodi, and Srinivasa Rao Satti. 2016. Succinct dynamic cardinal trees. Algorithmica 74, 2 (2016), 742–777. https://doi.org/10.1007/s00453-015-9969-x
Arz and Fischer (2018) Julian Arz and Johannes Fischer. 2018. Lempel–Ziv-78 compressed string dictionaries. Algorithmica 80, 7 (2018), 2012–2047. https://doi.org/10.1007/s00453-017-0348-7
Askitis (2007) Nikolas Askitis. 2007. Efficient data structures for cache architectures. Ph.D. Dissertation. RMIT University.
Askitis and Sinha (2010) Nikolas Askitis and Ranjan Sinha. 2010. Engineering scalable, cache and space efficient tries for strings. The VLDB Journal 19, 5 (2010), 633–660. https://doi.org/10.1007/s00778-010-0183-9
