Entropy Bounds for Grammar-Based Tree Compressors

Danny Hucke; Markus Lohrey; and Louisa Seelbach Benkner

arXiv:1901.03155·cs.DS·May 21, 2020

TL;DR

This paper extends the concept of empirical entropy to binary trees and demonstrates that grammar-based tree compression can achieve encoding sizes close to this entropy measure, generalizing previous string compression results.

Contribution

It introduces a new entropy measure for trees and shows that grammar-based tree encodings can be bounded by this measure, extending string compression theories to trees.

Findings

01

Tree entropy bounds are established for grammar-based tree compressors.

02

Binary encodings of trees are shown to be near the entropy limit.

03

Generalization of string compression results to tree structures.

Abstract

The definition of $k^{th}$-order empirical entropy of strings is extended to node-labeled binary trees. A suitable binary encoding of tree straight-line programs (that have been used for grammar-based tree compression before) is shown to yield binary tree encodings of size bounded by the $k^{th}$-order empirical entropy plus some lower order terms. This generalizes recent results for grammar-based string compression to grammar-based tree compression.

Tables

Table 1. Experimental results for XML tree structures, where $n$ denotes the number of nodes and $\sigma$ denotes the number of node labels.

XML document	$n$	$σ$	$w := (2 + \log_{2} σ) n$	$H_{1} / w$	$H_{2} / w$	$H_{4} / w$	$H_{8} / w$
Baseball	28 306	46	212 961.9447	2.9818 %	1.2547 %	0.6739 %	0.6662 %
DBLP	3 332 130	35	23 755 697.8193	10.9775 %	8.7407 %	8.2134 %	6.7270 %
DCSD-Normal	2 242 699	50	17 142 868.6330	4.2437 %	2.2481 %	1.7517 %	1.3038 %
EnWikiNew	404 652	20	2 558 180.8475	9.5317 %	3.0760 %	3.0759 %	2.9378 %
EnWikiQuote	262 955	20	1 662 382.6021	9.4270 %	3.1014 %	3.1014 %	3.1006 %
EnWikiVersity	495 839	20	3 134 658.5046	8.8952 %	2.3753 %	2.3753 %	2.3750 %
EXI-Array	226 523	47	1 711 288.1304	0.2506 %	0.2495 %	0.2492 %	0.2483 %
EXI-factbook	55 453	199	534 379.7451	2.2034 %	0.9450 %	0.8132 %	0.8092 %
EXI-Invoice	15 075	52	116 084.1288	0.0484 %	0.0268 %	0.0139 %	0.0098 %
EXI-Telecomp	177 634	39	1 294 135.1377	1.5405 %	0.0044 %	0.0034 %	0.0021 %
EXI-weblog	93 435	12	521 830.9713	0.0032 %	0.0028 %	0.0028 %	0.0028 %
Lineitem	1 022 976	18	6 311 685.1983	0.0003 %	0.0003 %	0.0003 %	0.0003 %
Mondial	22 423	23	146 277.8297	11.1285 %	9.2940 %	8.4702 %	7.7679 %
NASA	476 646	61	3 780 154.2290	7.7424 %	4.4588 %	3.8898 %	3.8054 %
Shakespeare	179 690	22	1 160 695.2676	11.9140 %	10.8416 %	10.6368 %	10.4765 %
SwissProt	2 977 031	85	25 035 017.5080	12.1892 %	10.5249 %	9.2455 %	8.1204 %
TCSD-Normal	2 749 751	24	18 107 007.2213	8.5450 %	8.4004 %	8.2862 %	8.2472 %
Treebank	2 437 666	250	24 293 253.5140	30.8912 %	23.0825 %	19.2444 %	13.4058 %
USHouse	6 712	43	49 845.0890	21.0500 %	18.2164 %	12.6572 %	9.3754 %
XMark1	167 865	74	1 378 079.8892	12.1610 %	9.5101 %	9.2271 %	8.4281 %
XMark2	1 666 315	74	13 679 535.2849	12.2125 %	9.5634 %	9.3259 %	8.9400 %

Equations

$$|B(\mathcal{G}_{t})|\leq H_{k}(t)+\mathcal{O}\left(\frac{kn\log\hat{\sigma}}{\log_{\hat{\sigma}}n}\right)+\mathcal{O}\left(\frac{n\log\log_{\hat{\sigma}}n}{\log_{\hat{\sigma}}n}\right)+\sigma,$$

$$H(p)=\sum_{a\in A}-p(a)\log_{2}p(a)=\sum_{a\in A}p(a)\log_{2}\left(1/p(a)\right).$$

$$H(p)=\sum_{a\in A}-p(a)\log_{2}p(a)\leq\sum_{a\in A}-p(a)\log_{2}q(a);$$

$$D(p\,|\!|\,q)=\sum_{a\in A}p(a)\cdot\log_{2}\left(p(a)/q(a)\right).$$

$$p_{\overline{a}}(a)=\frac{|\{i\mid 1\leq i\leq l,\ a_{i}=a\}|}{l}.$$

$$H(\overline{a})=l\cdot H(p_{\overline{a}})=-\sum_{i=1}^{l}\log_{2}p_{\overline{a}}(a_{i}).$$

$$\sum_{i=1}^{l}-\log_{2}p_{\overline{a}}(a_{i})\leq\sum_{i=1}^{l}-\log_{2}q(a_{i}).$$

$$a\log_{2}\left(\frac{b}{a}\right)\geq\sum_{i=1}^{l}a_{i}\log_{2}\left(\frac{b_{i}}{a_{i}}\right).$$

$$C_{k}\sim\frac{4^{k}}{\sqrt{\pi}\,k^{3/2}},$$

$$\mathcal{L}=(\Sigma\{0,1\})^{*}=\{a_{1}i_{1}\cdots a_{n}i_{n}\mid n\geq 0,\ a_{k}\in\Sigma,\ i_{k}\in\{0,1\}\text{ for all }1\leq k\leq n\}.$$

$$h_{k}(v)=\ell_{k}((\Box 0)^{k}h(v))\in\mathcal{L}_{k},$$

$$V_{z}(t)=\{v\in V(t)\mid h_{k}(v)=z\}$$

$$\mathsf{Prob}_{\mathcal{P}}(s)=\prod_{v\in V(s)}P_{h(v)}(\lambda_{s}(v)).$$

$$\mathcal{T}^{\prime}_{n+1}=\mathcal{T}^{\prime}_{n}\cup\{a(t_{1},t_{2})\mid a\in\Sigma,\ t_{1},t_{2}\in\mathcal{T}^{\prime}_{n}\}.$$

$$\sum_{t\in c[\mathcal{T}]}\mathsf{Prob}_{\mathcal{P}^{\prime}}(t)$$

$$\sum_{c\in\mathcal{C}_{n}}\sum_{t\in c[\mathcal{T}]}\mathsf{Prob}_{\mathcal{P}^{\prime}}(t)\leq(n+1)\sum_{t\in\mathcal{T}}\mathsf{Prob}_{\mathcal{P}^{\prime}}(t)=n+1.$$

$$\mathsf{Prob}_{\mathcal{P}}(s)=\prod_{z\in\mathcal{L}_{k}}\prod_{v\in V_{z}(s)}P_{z}(\lambda(v)),$$

$$m_{z}^{t}=|V_{z}(t)|$$

$$m_{z,\tilde{a}}^{t}=|\{v\in V_{z}(t)\mid\lambda(v)=\tilde{a}\}|.$$

$$P_{z}^{t}(\tilde{a})=\frac{m_{z,\tilde{a}}^{t}}{m_{z}^{t}}$$

$$H_{k}(t)=\sum_{z\in\mathcal{L}_{k}}m_{z}^{t}H(P_{z}^{t}).$$

$$0\leq H_{k}(t)\leq(2n-1)\log_{2}(2\sigma)=(2n-1)(1+\log_{2}\sigma)$$

$$H_{k}(t)=\sum_{z\in\mathcal{L}_{k}}H(w(t,z)),$$

$$H_{k}(t)\leq-\log_{2}\mathsf{Prob}_{\mathcal{P}}(t)$$

$$-\log_{2}\mathsf{Prob}_{\mathcal{P}}(t)$$

$$r(A_{0})=a(A_{1},A_{2}(b)),\quad r(A_{1})=A_{2}(A_{2}(b)),\quad r(A_{2})=b(x,a).$$

$$\rho(A_{i})=\begin{cases}A_{j}\,\alpha&\text{if }r(A_{i})=A_{j}(\alpha)\\A_{j}A_{k}&\text{if }r(A_{i})=A_{j}(A_{k}(x))\\a\,\alpha&\text{if }r(A_{i})=a(\alpha,x)\text{ or }a(x,\alpha)\end{cases}$$

$$H(\mathcal{G})=H(\omega_{\mathcal{G}}).$$

$$r(A_{0})=A_{1}(A_{2}),\quad r(A_{1})=a(x,A_{3}),\quad r(A_{2})=A_{4}(A_{3}),$$

Full text

Entropy Bounds for Grammar-Based Tree Compressors

Danny Hucke, Markus Lohrey, and Louisa Seelbach Benkner

Universität Siegen, Germany

{hucke,lohrey,seelbach}@eti.uni-siegen.de

Abstract.

The definition of $k^{th}$ -order empirical entropy of strings is extended to node-labeled binary trees. A suitable binary encoding of tree straight-line programs (that have been used for grammar-based tree compression before) is shown to yield binary tree encodings of size bounded by the $k^{th}$ -order empirical entropy plus some lower order terms. This generalizes recent results for grammar-based string compression to grammar-based tree compression.

Keywords. Grammar-based compression, binary trees, empirical entropy, lossless compression

This work has been supported by the DFG research project LO 748/10-1 (QUANT-KOMP)

1. Introduction

Grammar-based string compression.

The idea of grammar-based compression is based on the fact that in many cases a word $w$ can be succinctly represented by a context-free grammar that produces exactly $w$ . Such a grammar is called a straight-line program (SLP) for $w$ . In the best case, one gets an SLP of size $\Theta(\log n)$ for a word of length $n$ , where the size of an SLP is the total length of all right-hand sides of the rules of the grammar. A grammar-based compressor is an algorithm that produces for a given word $w$ an SLP $\mathcal{G}_{w}$ for $w$ , where, of course, $\mathcal{G}_{w}$ should be smaller than $w$ . Grammar-based compressors can be found at many places in the literature. Probably the best known example is the classical LZ78-compressor of Lempel and Ziv [31]. Indeed, it is straightforward to transform the LZ78-representation of a word $w$ into an SLP for $w$ . Other well-known grammar-based compressors are Bisection [20], Sequitur [27], and Repair [21], just to mention a few.
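The remark that an LZ78-representation can be turned into an SLP can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the nonterminal names `X1, X2, ...` and `START`, the dictionary-of-lists rule format, and the handling of a trailing incomplete phrase are hypothetical choices, not the construction used in the cited works.

```python
def lz78_factorize(word):
    """LZ78 phrases as pairs (index of an earlier phrase, new symbol).

    Index 0 stands for the empty phrase. If the word ends in the middle of a
    phrase, the last pair is (index, None), i.e. a copy of an earlier phrase.
    """
    trie = {}        # (phrase index, symbol) -> index of the extended phrase
    phrases = []
    current = 0
    for ch in word:
        nxt = trie.get((current, ch))
        if nxt is not None:
            current = nxt                      # keep extending the match
        else:
            phrases.append((current, ch))      # new phrase = old phrase + ch
            trie[(current, ch)] = len(phrases)
            current = 0
    if current != 0:
        phrases.append((current, None))
    return phrases


def lz78_to_slp(word):
    """One rule X_i -> X_j a per phrase, plus a start rule listing all phrases."""
    rules = {}
    start = []
    for i, (prev, ch) in enumerate(lz78_factorize(word), start=1):
        if ch is None:                         # trailing copy of phrase `prev`
            start.append(f"X{prev}")
            continue
        rules[f"X{i}"] = ([f"X{prev}"] if prev else []) + [ch]
        start.append(f"X{i}")
    rules["START"] = start
    return rules


def expand(rules, symbol="START"):
    """Derive the unique word produced from `symbol`."""
    if symbol not in rules:
        return symbol                          # terminal symbol
    return "".join(expand(rules, s) for s in rules[symbol])


def slp_size(rules):
    """Size of the SLP: total length of all right-hand sides."""
    return sum(len(rhs) for rhs in rules.values())
```

Every phrase rule has a right-hand side of length at most two, so the size of the resulting SLP is at most three times the number of LZ78 phrases.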

Recently, several upper bounds on the compression performance of grammar-based compressors in terms of higher order empirical entropy have been shown. For this, the choice of a concrete binary encoding $B(\mathcal{G})$ of an SLP $\mathcal{G}$ is crucial. Kieffer and Yang [19] came up with such a binary encoding $B$ and proved that under certain assumptions on the grammar-based compressor $w\mapsto\mathcal{G}_{w}$, the combined compressor $w\mapsto B(\mathcal{G}_{w})$ yields a universal code with respect to the family of finite-state information sources over finite alphabets. More precisely, the size of the SLP $\mathcal{G}_{w}$ must be bounded by $\mathcal{O}(|w|/\log_{\hat{\sigma}}|w|)$, where $\sigma$ is the size of the underlying alphabet and $\hat{\sigma}=\max\{2,\sigma\}$. This upper bound is met by all grammar-based compressors that produce so-called irreducible SLPs [19], which is the case for e.g. LZ78, Bisection, and Repair after a small modification of the latter. In their recent paper [28], Navarro and Ochoa used the binary encoding $B(\mathcal{G}_{w})$ from [19] in order to prove for every word $w$ over an alphabet of size $\sigma$ the upper bound $|B(\mathcal{G}_{w})|\leq|w|H_{k}(w)+o(|w|\log\hat{\sigma})$ for every $k\in o(\log_{\hat{\sigma}}|w|)$. Here, $H_{k}(w)$ is the $k^{th}$-order empirical entropy of $w$, and the grammar-based compressor $w\mapsto\mathcal{G}_{w}$ must satisfy the upper bound $|\mathcal{G}_{w}|\leq\mathcal{O}(|w|/\log_{\hat{\sigma}}|w|)$. Similar but weaker upper bounds for more practical binary SLP-encodings have been shown in [12, 26].

Grammar-based tree compression.

Grammar-based compression has been generalized from strings to trees by means of linear context-free tree grammars generating exactly one tree [3]. Such grammars are also known as tree straight-line programs, TSLPs for short, see [23] for a survey. TSLPs can be seen as a proper generalization of SLPs and DAGs (directed acyclic graphs, which are a widely used compact representation of trees). Whereas DAGs only have the ability to share repeated subtrees of a tree, TSLPs can also share repeated tree patterns with a hole (so-called contexts). In [10], the authors presented a linear time algorithm that computes for a given binary tree $t$ of size $n$ a TSLP $\mathcal{G}_{t}$ of size $\mathcal{O}(n/\log_{\hat{\sigma}}n)$, where $\sigma$ is the size of the underlying set of node labels and $\hat{\sigma}=\max\{2,\sigma\}$. An alternative algorithm with the same asymptotic size bound can be found in [11]. TSLPs have also been extended to so-called forest straight-line programs (FSLPs), which make it possible to compress unranked node-labeled trees [14]. FSLPs are very similar to top DAGs [2] and also meet the size bound $\mathcal{O}(n/\log_{\hat{\sigma}}n)$ for unranked trees of size $n$. The reader should notice that the $\mathcal{O}(n/\log_{\hat{\sigma}}n)$-bound cannot be achieved by DAGs: the smallest DAG for an unlabeled binary tree of size $n$ may still contain $n$ edges.

Entropy bounds for grammar-based tree compressors.

In this paper we first consider node-labeled binary trees: every node has a label from a finite set $\Sigma$ of size $\sigma$ and every non-leaf node has a left and a right child. For unlabeled binary trees the results of Kieffer and Yang on universal grammar-based compressors have been extended to trees in [16, 30]. Whereas the universal tree encoder from [30] is based on DAGs (and needs a certain assumption on the average DAG size with respect to the input distribution), the encoder from [16] uses TSLPs of size $\mathcal{O}(n/\log n)$ . For this, a binary encoding of TSLPs similar to the one for SLPs from [19] is proposed. In this paper we extend the binary TSLP-encoding from [16] to node-labeled binary trees and prove an entropy bound similar to the one from [28] for strings. To do this, we first have to come up with a reasonable higher order entropy for binary node-labeled trees (we just speak of binary trees in the following). Several notions of tree entropy can be found in the literature, but all are tailored towards unranked trees and do not yield nontrivial results for the special case of unlabeled binary trees.

• The $k^{th}$-order label entropy from [6] is based on the empirical probability that a node $v$ is labeled with a certain symbol conditioned on the $k$ first labels from the parent node of $v$ to the root of the tree.

• The tree entropy from [18] is the $0^{th}$-order entropy of the node degrees.

• Recently, two combinations of the two previous entropy measures were proposed in [13]. The first combination is based on the empirical probability that a node $v$ is labeled with a certain symbol conditioned on (i) the $k$ first labels from the parent node of $v$ to the root and (ii) the node degree of $v$. The second combination uses the empirical probability that a node $v$ has a certain degree conditioned on (i) the $k$ first labels from the parent node of $v$ to the root and (ii) the node label of $v$.

Tree entropy [18] is not useful in the context of binary trees, since a binary tree with $n$ leaves has $n-1$ nodes of degree $2$ , which shows that the tree entropy divided by the number of nodes ( $2n-1$ ) converges to $1$ when $n$ increases. On the other hand, the $k^{th}$ -order label entropy [6] is not useful for unlabeled trees. For the special case of unlabeled binary trees, also the combinations of [13] do not lead to useful entropy measures.

Our first contribution is the definition of a reasonable entropy measure for binary trees that can also be used for the unlabeled case. For this we define the $k$-history of a node $v$ in a binary tree $t$ by taking the last $k$ edges on the unique path from the root to $v$. For each edge $(v_{1},v_{2})$ traversed on this path we write down the node label of $v_{1}$ and a $0$ (resp., $1$) if $v_{2}$ is a left (resp., right) child of $v_{1}$. Thus, the $k$-history of a node is a word of length $2k$ that alternatingly consists of symbols from $\Sigma$ and directions that are encoded by $0$ or $1$. For nodes at depth smaller than $k$ we pad the history with $0$’s and a default node label $\Box\in\Sigma$ in order to get length exactly $2k$. (Footnote 1: This is an ad hoc decision to make the definitions easier. In the appendix we discuss different approaches of how to deal with nodes of depth smaller than $k$, and prove that they asymptotically lead to the same entropy measure.) For each $k$-history $h$ we then consider the joint probability distribution $P^{t}_{h}$ of the node degree (either $0$ or $2$) and the node label, conditioned on the history $h$. Thus, $P^{t}_{h}(a,i)$ is the probability that a randomly chosen node among the nodes with history $h$ is labeled with the symbol $a$ and has $i\in\{0,2\}$ children. The $k^{th}$-order empirical entropy of $t$, $H_{k}(t)$ for short, is then the sum of the entropies of these distributions $P_{h}^{t}$ (the sum is taken over all histories $h$) weighted with the number of nodes with history $h$. This definition is similar to the definition of the $k^{th}$-order empirical entropy of a string.

Our main result states that

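$$|B(\mathcal{G}_{t})|\leq H_{k}(t)+\mathcal{O}\left(\frac{kn\log\hat{\sigma}}{\log_{\hat{\sigma}}n}\right)+\mathcal{O}\left(\frac{n\log\log_{\hat{\sigma}}n}{\log_{\hat{\sigma}}n}\right)+\sigma,\tag{1}$$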

where $t$ is a binary tree with $n$ leaves, the grammar-based compressor $t\mapsto\mathcal{G}_{t}$ produces TSLPs of size $\mathcal{O}(n/\log_{\hat{\sigma}}n)$ for binary trees of size $n$ with $\sigma$ many node labels and $\hat{\sigma}=\max(2,\sigma)$ . Moreover, $B$ is an extension of the binary TSLP-encoding described in [16] from unlabeled binary trees to labeled binary trees (Section 3.3). If $k\leq o(\log_{\hat{\sigma}}n)$ then this bound can be simplified to $|B(\mathcal{G}_{t})|\leq H_{k}(t)+o(n\log\hat{\sigma})$ . The assumption $k\leq o(\log_{\hat{\sigma}}n)$ can be also found in [28]. In fact, Gagie argued in [9] that the $k^{th}$ -order empirical entropy for strings stops being a reasonable complexity measure for almost all strings of length $n$ over alphabets of size $\sigma$ when $k\geq\log_{\hat{\sigma}}n$ .

Our definition of $k^{th}$-order empirical entropy does not capture all regularities that can be exploited in grammar-based compression: Take for instance a complete unlabeled binary tree $t_{n}$ of height $n$ (all paths from the root to a leaf have length $n$). This tree has $2^{n}$ leaves and is very well compressible: its minimal DAG has only $n+1$ nodes, hence there also exists a TSLP of size $n+1$ for $t_{n}$. But for every fixed $k$ the $k^{th}$-order empirical entropy of $t_{n}$ divided by the number of leaves $2^{n}$ converges to $2$ (the trivial upper bound) for $n\to\infty$. The reason is that if $n\gg k$ then for every $k$-history $z$ the number of leaves with $k$-history $z$ is roughly the same as the number of internal nodes with $k$-history $z$, so each node contributes almost one bit. Hence, although $t_{n}$ is highly compressible with TSLPs (and even DAGs), its $k^{th}$-order empirical entropy is close to the maximal value. However, this phenomenon occurs for grammar-based string compression and the well-established higher-order empirical entropy of strings as well; see Section 6.
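The behaviour of the complete binary tree can be checked numerically. The following Python sketch computes $H_{k}$ as described informally above (node count per $k$-history times the Shannon entropy of the label/degree distribution). The nested-tuple tree representation, the function names, and the choice of `"a"` as both the single node label and the padding symbol are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def complete_tree(height, label="a"):
    """Complete binary tree with 2**height leaves; a node is (label, *children)."""
    if height == 0:
        return (label,)
    sub = complete_tree(height - 1, label)
    return (label, sub, sub)


def kth_order_entropy(tree, k, pad="a"):
    """Sum over all k-histories z of (#nodes with history z) * H(P_z)."""
    counts = defaultdict(Counter)       # k-history -> Counter of (label, degree)
    stack = [(tree, (pad, 0) * k)]      # the root gets the padded history (pad 0)^k
    while stack:
        node, hist = stack.pop()
        label, children = node[0], node[1:]
        khist = hist[len(hist) - 2 * k:]            # keep the last k (label, direction) pairs
        counts[khist][(label, len(children))] += 1
        for direction, child in enumerate(children):
            stack.append((child, khist + (label, direction)))
    total = 0.0
    for dist in counts.values():
        m = sum(dist.values())
        total -= sum(c * math.log2(c / m) for c in dist.values())
    return total


def entropy_per_leaf(height, k):
    """H_k(t_n) divided by the number of leaves 2**n of the complete tree t_n."""
    return kth_order_entropy(complete_tree(height), k) / 2 ** height
```

For moderate heights such as `entropy_per_leaf(12, 1)` the quotient is already very close to $2$, although the tree has a DAG with only $n+1$ nodes.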

In Section 5 we present a simple extension of our entropy notion to node-labeled unranked trees. In an unranked tree the number of children of a node is arbitrary. Unranked trees are important in the area of XML, where the hierarchical structure of a document is represented by a node-labeled unranked tree. For such a tree $t$ we define the $k^{th}$ -order empirical entropy as the $k^{th}$ -order empirical entropy of the first-child next-sibling (fcns for short) encoding of $t$ . The fcns-encoding of $t$ is a binary tree which contains all nodes of $t$ . If a node $v$ of $t$ has the first (i.e., left-most) child $v_{1}$ and the right sibling $v_{2}$ then $v_{1}$ (resp., $v_{2}$ ) is the left (resp., right) child of $v$ in the fcns-encoding of $t$ . If $v$ has no child or no right sibling then one adds dummy leaves to the fcns-encoding in order to obtain a full binary tree. Our choice of defining the $k^{th}$ -order empirical entropy of an unranked tree via the fcns-encoding is motivated by the fact that in XML document trees the label of a node $v$ usually depends on the labels of the ancestors and the labels of the left siblings of $v$ . This information is contained in the history of $v$ in the fcns-encoding.
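A minimal Python sketch of the fcns-encoding follows. It assumes that every node of the unranked tree becomes an internal node of the binary tree and that a missing first child or right sibling is replaced by a dummy leaf. The dummy label `"#"` and the `(label, [children])` input format are hypothetical choices for this illustration.

```python
DUMMY = ("#",)   # hypothetical dummy leaf; "#" is assumed not to be a node label


def fcns(forest):
    """Encode a list of sibling trees, each of the form (label, [children]).

    The first tree of the list becomes the root, its first child goes to the
    left, and its next sibling goes to the right; an empty list yields a dummy leaf.
    """
    if not forest:
        return DUMMY
    (label, children), *siblings = forest
    return (label, fcns(children), fcns(siblings))


def count_nodes(binary_tree):
    """Number of nodes of a binary tree given as nested tuples."""
    return 1 + sum(count_nodes(child) for child in binary_tree[1:])


# a(b, c(d)): the root a has the children b and c, and c has the child d
unranked = ("a", [("b", []), ("c", [("d", [])])])
encoded = fcns([unranked])
```

Under these assumptions an unranked tree with $n$ nodes is mapped to a full binary tree with $n$ internal nodes and $n+1$ dummy leaves.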

We present experimental results with real XML document trees showing that in these cases the $k^{th}$ -order empirical entropy is indeed very small compared to the worst-case bit size. An unranked tree with $n$ nodes and $\sigma$ node labels can be encoded with $2n+\log_{2}(\sigma)n$ bits [15]. Up to low order terms, this is optimal. Table 1 shows the values of the $k^{th}$ -order empirical entropy (for $k=1,2,4,8$ ) divided by $2n+\log_{2}(\sigma)n$ for several real XML trees (that were also used in other experiments for XML compression [24, 25]). For $k=4$ , these quotients never exceed 20% and for $k=8$ all quotients are bounded by 13.5%.
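The normalisation used in Table 1 is easy to reproduce. The short Python sketch below recomputes the worst-case bit size $w=(2+\log_{2}\sigma)n$ for two rows of the table and converts a tabulated quotient back into bits. The function names are illustrative, and the input numbers are copied from Table 1.

```python
import math


def worst_case_bits(n, sigma):
    """w = (2 + log2(sigma)) * n, the bit size used as denominator in Table 1."""
    return (2 + math.log2(sigma)) * n


def entropy_bits(n, sigma, quotient_percent):
    """Recover H_k in bits from a quotient H_k / w that is given in percent."""
    return worst_case_bits(n, sigma) * quotient_percent / 100


# (n, sigma) taken from Table 1
w_baseball = worst_case_bits(28306, 46)
w_treebank = worst_case_bits(2437666, 250)

# H_8 of Treebank in bits, from the tabulated quotient 13.4058 %
h8_treebank = entropy_bits(2437666, 250, 13.4058)
```

The recomputed values of $w$ agree with the fourth column of Table 1 up to rounding.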

Our experimental results combined with our entropy bound (1) for grammar-based compression are in accordance with the fact that grammar-based tree compressors yield excellent compression ratios for XML document trees, see e.g. [24]. Some of the XML documents from our experiments were also used in [24], where the performance of TreeRePair (currently the best grammar-based tree compressor from a practical point of view) on XML document trees was tested. An interesting observation is that those XML trees, for which our $k$ -th order empirical entropy is large are indeed those XML trees with the worst compression ratio for TreeRePair in [24] (this is in particular the Treebank document from Table 1).

In a forthcoming paper we will compare our definition of the $k^{th}$ -order empirical entropy of trees with the above mentioned tree entropies from [6, 13, 18] for binary as well as unranked trees and both from a theoretical as well as experimental perspective. A short version of this paper can be found in [17].

2. Preliminaries

In this section, we introduce some basic definitions concerning information theory (Section 2.1) and binary trees (Section 2.2).

With $\mathbb{N}$ we denote the natural numbers including $0$. We use the standard $\mathcal{O}$-notation. If $b>0$ is a constant, then we just write $\mathcal{O}(\log n)$ for $\mathcal{O}(\log_{b}n)$. We make the convention that $0\cdot\log(0)=0$ and $0\cdot\log(x/0)=0$ for $x\geq 0$. For the unit interval $\{r\in\mathbb{R}\mid 0\leq r\leq 1\}$ we write $[0,1]$.

Let $w=a_{1}a_{2}\cdots a_{l}\in\Gamma^{*}$ be a word over an alphabet $\Gamma$ . With $|w|=l$ we denote the length of $w$ . The empty word is denoted by $\varepsilon$ . For $a\in\Gamma$ we denote with $|w|_{a}=|\{i\mid 1\leq i\leq l,a_{i}=a\}|$ the number of occurrences of $a$ in $w$ .

2.1. Empirical distributions and empirical entropy

Let $A$ be a finite set. A probability distribution on $A$ is a mapping $p:A\to[0,1]$ such that $\sum_{a\in A}p(a)=1$ . For a probability distribution $p$ on $A$ we define its Shannon entropy

$$H(p)=\sum_{a\in A}-p(a)\log_{2}p(a)=\sum_{a\in A}p(a)\log_{2}\left(1/p(a)\right).$$

We have $0\leq H(p)\leq\log_{2}|A|$ . A well-known generalization of Shannon’s inequality states that for every probability distribution $p$ on $A$ and any mapping $q:A\to[0,1]$ such that $\sum_{a\in A}q(a)\leq 1$ we have

$$H(p)=\sum_{a\in A}-p(a)\log_{2}p(a)\leq\sum_{a\in A}-p(a)\log_{2}q(a);\tag{2}$$

see [1] for a proof. Shannon’s inequality is the special case where $q$ is a probability distribution as well. The Kullback-Leibler divergence between two probability distributions $p,q$ on $A$ (see [5, Section 2.3]) is defined as

$$D(p\,|\!|\,q)=\sum_{a\in A}p(a)\cdot\log_{2}\left(p(a)/q(a)\right).$$

It is known that $D(p\,|\!|\,q)\geq 0$ for all $p,q$ (this follows from Shannon’s inequality) and $D(p\,|\!|\,q)=0$ if and only if $p=q$ .

Let $\overline{a}=(a_{1},a_{2},\ldots,a_{l})$ be a tuple of elements that are from some (not necessarily finite) set $S$ . The empirical distribution $p_{\overline{a}}:\{a_{1},a_{2},\ldots,a_{l}\}\to[0,1]$ of $\overline{a}$ is defined by

$$p_{\overline{a}}(a)=\frac{|\{i\mid 1\leq i\leq l,\ a_{i}=a\}|}{l}.$$

We use this (and the following) definition also for words over some alphabet by identifying a word $w=a_{1}a_{2}\cdots a_{l}$ with the tuple $(a_{1},a_{2},\ldots,a_{l})$ . The unnormalized empirical entropy of $\overline{a}$ is

$$H(\overline{a})=l\cdot H(p_{\overline{a}})=-\sum_{i=1}^{l}\log_{2}p_{\overline{a}}(a_{i}).$$

From (2) it follows that for a tuple $\overline{a}=(a_{1},a_{2},\ldots,a_{l})$ with $a_{1},\ldots,a_{l}\in S$ and real numbers $q(a)\geq 0$ ( $a\in S$ ) with $\sum_{a\in\{a_{1},\ldots,a_{l}\}}q(a)\leq 1$ we have

$$\sum_{i=1}^{l}-\log_{2}p_{\overline{a}}(a_{i})\leq\sum_{i=1}^{l}-\log_{2}q(a_{i}).$$
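The empirical distribution, the unnormalized empirical entropy, and the inequality just stated can be tried out with a few lines of Python. This is a sketch with illustrative function names. The mapping `q` is any assignment of non-negative weights to the occurring symbols whose sum is at most $1$.

```python
import math
from collections import Counter


def empirical_distribution(seq):
    """Relative frequency of each element of the tuple or word `seq`."""
    counts = Counter(seq)
    return {a: c / len(seq) for a, c in counts.items()}


def unnormalized_entropy(seq):
    """H(seq) = -sum_i log2 p(a_i), with p the empirical distribution of seq."""
    p = empirical_distribution(seq)
    return -sum(math.log2(p[a]) for a in seq)


def code_length(seq, q):
    """-sum_i log2 q(a_i) for weights q on the symbols occurring in seq."""
    return -sum(math.log2(q[a]) for a in seq)


word = "abracadabra"
uniform = {a: 1 / 5 for a in set(word)}      # five distinct symbols
skewed = {"a": 0.25, "b": 0.75}              # deliberately mismatched weights for "aabb"
```

For the word `"aabb"` the empirical entropy is exactly $4$ bits, while the mismatched weights in `skewed` give a strictly larger value, as the inequality predicts.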

We also need the famous log-sum inequality, see e.g. [5, Theorem 2.7.1] (recall our conventions $0\cdot\log(0)=0$ and $0\cdot\log(x/0)=0$ for $x\geq 0$ ):

Lemma 1.

Let $a_{1},a_{2},\dots,a_{l},b_{1},b_{2},\dots,b_{l}\geq 0$ be real numbers. Moreover, let $a=\sum_{i=1}^{l}a_{i}$ and $b=\sum_{i=1}^{l}b_{i}$ . Then

$$a\log_{2}\left(\frac{b}{a}\right)\geq\sum_{i=1}^{l}a_{i}\log_{2}\left(\frac{b_{i}}{a_{i}}\right).$$

2.2. Trees, tree processes, and tree entropy

2.2.1. Trees and contexts

Let $\Sigma$ denote a finite non-empty alphabet of size $|\Sigma|=\sigma$ . Later, we will need a fixed distinguished symbol from $\Sigma$ that we will denote with $\Box\in\Sigma$ . We will also need the value $\hat{\sigma}=\max\{2,\sigma\}$ . With $\mathcal{T}(\Sigma)$ we denote the set of labeled binary trees over the alphabet $\Sigma$ . Formally, it is inductively defined as the smallest set of terms over $\Sigma$ such that

• $\Sigma\subseteq\mathcal{T}(\Sigma)$ and

• if $t_{1},t_{2}\in\mathcal{T}(\Sigma)$ and $a\in\Sigma$, then $a(t_{1},t_{2})\in\mathcal{T}(\Sigma)$.

If e.g. $\Sigma=\{a,b\}$ , then $a\in\mathcal{T}(\Sigma)$ is the binary tree with a single node labeled by $a$ and $a(b(b(a,b),a),a(b,a))\in\mathcal{T}(\Sigma)$ is the binary tree depicted on the left of Figure 1.

A tree encoder is an injective mapping $E:\mathcal{T}(\Sigma)\to\{0,1\}^{*}$ such that the range $E(\mathcal{T}(\Sigma))$ is prefix-free, i.e., there do not exist $t,t^{\prime}\in\mathcal{T}(\Sigma)$ with $t\neq t^{\prime}$ such that $E(t)$ is a prefix of $E(t^{\prime})$ .

With $|t|$ we denote the number of leaves of $t$ , which can be inductively defined by $|a|=1$ and $|a(t_{1},t_{2})|=|t_{1}|+|t_{2}|$ for $a\in\Sigma$ and $t_{1},t_{2}\in\mathcal{T}(\Sigma)$ . Note that $2|t|-1$ is the number of occurrences of symbols from $\Sigma$ in $t$ . Let $\mathcal{T}_{n}(\Sigma)=\{t\in\mathcal{T}(\Sigma)\mid|t|=n\}$ for $n\geq 1$ . Note that $\mathcal{T}_{1}(\Sigma)=\Sigma$ . We have $|\mathcal{T}_{n}(\Sigma)|=\sigma^{2n-1}C_{n-1}$ , where $C_{k}$ is the $k^{\text{th}}$ Catalan number. These numbers satisfy the following well-known asymptotic estimate

$$C_{k}\sim\frac{4^{k}}{\sqrt{\pi}\,k^{3/2}},$$

see e.g. [8]. In fact, we have $C_{k}\leq 4^{k}$ for all $k\geq 0$ and hence $|\mathcal{T}_{n}(\Sigma)|\leq(2\sigma)^{2n}$ .
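The counting formula $|\mathcal{T}_{n}(\Sigma)|=\sigma^{2n-1}C_{n-1}$ and the bound $(2\sigma)^{2n}$ can be verified for small $n$ by brute-force enumeration. The Python sketch below uses nested tuples `(label,)` and `(label, left, right)` as a hypothetical tree representation. The helper `asymptotic_ratio` compares $C_{k}$ with the standard estimate $4^{k}/(\sqrt{\pi}\,k^{3/2})$.

```python
import math
from math import comb


def catalan(k):
    """k-th Catalan number C_k = binom(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)


def trees(n, alphabet):
    """All labeled binary trees with exactly n leaves over the given alphabet."""
    if n == 1:
        return [(a,) for a in alphabet]
    result = []
    for left_size in range(1, n):
        for left in trees(left_size, alphabet):
            for right in trees(n - left_size, alphabet):
                for a in alphabet:
                    result.append((a, left, right))
    return result


def asymptotic_ratio(k):
    """C_k divided by 4^k / (sqrt(pi) * k^(3/2)); tends to 1 as k grows."""
    return catalan(k) / 4 ** k * math.sqrt(math.pi) * k ** 1.5
```

The enumeration is exponential in $n$ and is meant only as a sanity check for very small trees.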

A context is a labeled binary tree, where exactly one leaf is labeled with the special symbol $x\notin\Sigma$ (called the parameter); all other nodes are labeled with symbols from $\Sigma$ . Formally, the set of contexts $\mathcal{C}(\Sigma)$ is the smallest set such that

• $x\in\mathcal{C}(\Sigma)$ and

• if $a\in\Sigma$, $c\in\mathcal{C}(\Sigma)$ and $t\in\mathcal{T}(\Sigma)$ then also $a(c,t),a(t,c)\in\mathcal{C}(\Sigma)$.

If e.g. $\Sigma=\{a,b\}$, then $x\in\mathcal{C}(\Sigma)$ is the context with a single node labeled by the parameter $x$ and $a(b(b(a,b),x),a(b,a))\in\mathcal{C}(\Sigma)$ is the context depicted on the right of Figure 1. For a tree or context $t\in\mathcal{T}(\Sigma)\cup\mathcal{C}(\Sigma)$ and a context $c\in\mathcal{C}(\Sigma)$, we denote by $c[t]$ the tree or context which results from $c$ by replacing the unique occurrence of the parameter $x$ by $t$. For example $c=a(a,x)$ and $t=b(a,a)$ yield $c[t]=a(a,b(a,a))$ (with $\Sigma=\{a,b\}$). For a context $c$ we define $|c|$ inductively by $|x|=0$ and $|a(c,t)|=|a(t,c)|=|t|+|c|$ for $c\in\mathcal{C}(\Sigma)$ and $t\in\mathcal{T}(\Sigma)$. In other words, $|c|$ is the number of leaves of $c$, where the unique occurrence of the parameter $x$ is not counted. Note that $|c|=|c[a]|-1$, where $a\in\Sigma$ is arbitrary. We define $\mathcal{C}_{n}(\Sigma)=\{c\in\mathcal{C}(\Sigma)\mid|c|=n\}$ for $n\in\mathbb{N}$. Since the set $\Sigma$ will not change in this paper, we use the abbreviations $\mathcal{T}$, $\mathcal{T}_{n}$, $\mathcal{C}$, and $\mathcal{C}_{n}$ for $\mathcal{T}(\Sigma)$, $\mathcal{T}_{n}(\Sigma)$, $\mathcal{C}(\Sigma)$, and $\mathcal{C}_{n}(\Sigma)$, respectively.

Occasionally, we will consider a binary tree or context as a graph with nodes and edges in the usual way, where each node is labeled with a symbol from $\Sigma$ (or $x$ in the case of a context). Note that $t\in\mathcal{T}_{n}\cup\mathcal{C}_{n}$ has $2n-1$ nodes in total: $n$ leaves and $n-1$ internal nodes.

It is convenient to define a node $v$ of $s\in\mathcal{T}\cup\mathcal{C}$ as a bit string that describes the path from the root to the node ($0$ means left, $1$ means right). Formally, we define the node set $V(s)\subseteq\{0,1\}^{*}$ of $s\in\mathcal{T}\cup\mathcal{C}$ by

• $V(a)=\{\varepsilon\}$ for every $a\in\Sigma$,

• $V(x)=\emptyset$ and

• $V(a(s_{0},s_{1}))=\{iw\mid i\in\{0,1\},w\in V(s_{i})\}\cup\{\varepsilon\}$ for every $a\in\Sigma$.

Note that for a context $c\in\mathcal{C}$ , the set $V(c)$ does not contain the unique node in $c$ labeled with the parameter $x$ . We use this definition due to better readability of the paper since we mostly need the set of nodes without the parameter node. Also, it is still possible to uniquely determine from $V(c)$ the path to the parameter $x$ due to the following properties: For a tree $t\in\mathcal{T}$ we have $w0\in V(t)$ if and only if $w1\in V(t)$ for all $w\in\{0,1\}^{*}$ since each node has zero or two children. The only context $c$ which fulfills this property is $c=x$ , i.e., the parameter node is the only node of $c$ and $V(c)=\emptyset$ . For all other contexts $c\in\mathcal{C}$ this property is violated since there exists a unique $w\in\{0,1\}^{*}$ such that $w0\in V(c)$ (respectively, $w1\in V(c)$ ) and $w1\notin V(c)$ (respectively, $w0\notin V(c)$ ). In this case the parameter node is $w1$ (respectively, $w0$ ). Alternatively, the parameter node of a context $c$ is the single node in the set $V(c[a])\setminus V(c)$ for a symbol $a\in\Sigma$ . We denote this node with $\omega(c)\in\{0,1\}^{*}$ . In other words: $V(c[a])\setminus V(c)=\{\omega(c)\}$ .

Example 1.

Consider the tree $t=a(b(b(a,b),a),a(b,a))$ with $\Sigma=\{a,b\}$ depicted on the left of Figure 1. We have $V(t)=\{\varepsilon,0,1,00,01,10,11,000,001\}$. For the context $c=a(b(b(a,b),x),a(b,a))$ depicted on the right of Figure 1, we have $t=c[a]$ and $\omega(c)=01$.
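Example 1 can be replayed with a short Python sketch of the node set $V(s)$, the substitution $c[t]$, and the parameter node $\omega(c)$. The encoding is an assumption made for this illustration: trees are nested tuples `(label,)` or `(label, left, right)`, the parameter is the bare string `"x"`, and nodes are bit strings with `""` standing for $\varepsilon$.

```python
X = "x"   # the parameter of a context (a bare string, not a tuple)


def nodes(s, prefix=""):
    """V(s): the paths to all nodes of s except the parameter node."""
    if s == X:
        return set()
    result = {prefix}
    for i, child in enumerate(s[1:]):
        result |= nodes(child, prefix + str(i))
    return result


def substitute(c, t):
    """c[t]: replace the unique occurrence of the parameter in c by t."""
    if c == X:
        return t
    return (c[0],) + tuple(substitute(child, t) for child in c[1:])


def parameter_node(c):
    """omega(c): the single node in V(c[a]) that is not in V(c)."""
    (omega,) = nodes(substitute(c, ("a",))) - nodes(c)
    return omega


# tree and context of Example 1
t = ("a", ("b", ("b", ("a",), ("b",)), ("a",)), ("a", ("b",), ("a",)))
c = ("a", ("b", ("b", ("a",), ("b",)), X), ("a", ("b",), ("a",)))
```

With these definitions `nodes(t)` yields the nine nodes listed in Example 1, `substitute(c, ("a",))` gives back `t`, and `parameter_node(c)` is `"01"`.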

Consider a tree or context $s$ and let $v\in V(s)$ . The leaves of $s$ are those strings in $V(s)$ that are maximal with respect to the prefix relation. The length $|v|$ is the depth of the node $v$ in $s$ and the depth of $s$ is the maximal depth of a node in $V(s)$ (the depth of $s=x$ is not defined but also not needed). Let $\lambda_{s}:V(s)\rightarrow\Sigma\times\{0,2\}$ denote the function mapping a node $v$ to the pair $(a,i)$ where $a\in\Sigma$ is the label of $v$ and $i\in\{0,2\}$ is the number of children of $v$ . We can define this function inductively as follows:

• $\lambda_{a}(\varepsilon)=(a,0)$ for $a\in\Sigma$,

• $\lambda_{s}(\varepsilon)=(a,2)$ for $s=a(s_{0},s_{1})$ with $a\in\Sigma$ and $s_{0},s_{1}\in\mathcal{T}\cup\mathcal{C}$,

• $\lambda_{s}(iw)=\lambda_{s_{i}}(w)$ for $s=a(s_{0},s_{1})$ with $a\in\Sigma$, $s_{0},s_{1}\in\mathcal{T}\cup\mathcal{C}$ and $iw\in V(s)$.

Note that in the last case, if $s$ is a context, we cannot have $s_{i}=x$ because we must have $w\in V(s_{i})$ . In the following, we will omit the subscript $s$ in $\lambda_{s}(v)$ if $s$ is clear from the context.

2.2.2. Histories

We now come to the crucial notion of the history of a node $v$ in a tree or context. Intuitively, the history of $v$ records all information that can be obtained by walking from the root of the tree/context straight down to the node $v$ . First, we define the set of histories as

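$$\mathcal{L}=(\Sigma\{0,1\})^{*}=\{a_{1}i_{1}\cdots a_{n}i_{n}\mid n\geq 0,\ a_{k}\in\Sigma,\ i_{k}\in\{0,1\}\text{ for all }1\leq k\leq n\}.$$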

For an integer $k\geq 0$ , let $\mathcal{L}_{k}=\{w\in\mathcal{L}\mid|w|=2k\}$ and let $\ell_{k}:\mathcal{L}\rightarrow\mathcal{L}_{k}$ denote the partial function mapping a history $z\in\mathcal{L}$ with $|z|\geq 2k$ to the suffix of $z$ of length $2k$ , i.e., $\ell_{k}(a_{1}i_{1}\cdots a_{n}i_{n})=a_{n-k+1}i_{n-k+1}\cdots a_{n}i_{n}$ (the function $\ell_{0}$ maps a string to the empty string).

For a tree $t$ and a node $v\in V(t)$ (resp., a context $c$ and a node $v\in V(c)\cup\{\omega(c)\}$ ), we inductively define its history $h(v)\in\mathcal{L}$ (in $t$ ) by

•

$h(\varepsilon)=\varepsilon$ and

•

$h(wi)=h(w)ai$ for $i\in\{0,1\}$ and $wi\in V(t)$ (resp., $wi\in V(c)\cup\{\omega(c)\}$ ).

Here, $a$ is the symbol that labels the node $w$ , i.e., $\lambda(w)=(a,2)$ . That is, in order to obtain $h(v)$ , while walking downwards in the tree from the root node to the node $v$ we alternately concatenate symbols from $\Sigma$ with binary numbers in $\{0,1\}$ such that the symbol from $\Sigma$ corresponds to the label of the current node and the binary number $0$ (resp., $1$ ) states that we move on to the left (resp., right) child node. Note that the symbol that labels $v$ is not part of the history of $v$ . The $k$ -history of a tree node $v\in V(t)$ is

$$h_{k}(v)=\ell_{k}((\Box 0)^{k}h(v)),$$

i.e., the suffix of length $2k$ of the word $(\Box 0)^{k}h(v)$ , where $\Box$ is a fixed dummy symbol in $\Sigma$ (the choice is arbitrary). This means that if $|v|\geq k$ then $h_{k}(v)$ describes the last $k$ directions and node labels along the path from the root to node $v$ . If $|v|<k$ , we pad the history of $v$ with $\Box$ ’s and zeros such that $h_{k}(v)\in\mathcal{L}_{k}$ . In the appendix, we discuss other reasonable approaches of how to deal with nodes of depth smaller than $k$ . For $z\in\mathcal{L}_{k}$ we denote with

$$V_{z}(t)=\{v\in V(t)\mid h_{k}(v)=z\}$$

the set of nodes in $t$ with $k$ -history $z$ .

Example 2.

Consider the tree $t=a(b(b(a,b),a),a(b,a))$ from Example 1 and let $\Box=a\in\Sigma$ . Then, $h(001)=h_{3}(001)=a0b0b1$ and $h_{4}(10)=a0a0a1a0$ .
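The two histories in Example 2 can be recomputed mechanically. The following is a minimal sketch, assuming trees are stored as nested tuples and node addresses as strings over $\{0,1\}$; the names `history` and `k_history` and the single-character labels are illustrative choices, not notation from the paper.

```python
# Trees as nested tuples: ("a",) is an a-labeled leaf,
# ("a", left, right) is an a-labeled inner node (assumed representation).
T = ("a", ("b", ("b", ("a",), ("b",)), ("a",)), ("a", ("b",), ("a",)))

def history(tree, v):
    """Return h(v) for the node address v, a string over {0,1}."""
    h = ""
    node = tree
    for bit in v:
        h += node[0] + bit          # label of the current node, then the direction
        node = node[1 + int(bit)]   # descend to the left (0) or right (1) child
    return h

def k_history(tree, v, k, pad="a"):
    """Return h_k(v): the suffix of length 2k of (pad 0)^k h(v)."""
    padded = (pad + "0") * k + history(tree, v)
    return padded[len(padded) - 2 * k:]

print(history(T, "001"))       # a0b0b1
print(k_history(T, "10", 4))   # a0a0a1a0
```

The slice is written as `padded[len(padded) - 2 * k:]` rather than `padded[-2 * k:]` so that $k=0$ yields the empty string.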

2.2.3. Tree processes

A tree process is an infinite tuple $\mathcal{P}=(P_{z})_{z\in\mathcal{L}}$ where every $P_{z}$ is a probability distribution on $\Sigma\times\{0,2\}$ . With $\mathcal{P}$ we associate the function $\mathsf{Prob}_{\mathcal{P}}:\mathcal{T}\cup\mathcal{C}\to[0,1]$ with

$$\mathsf{Prob}_{\mathcal{P}}(s)=\prod_{v\in V(s)}P_{h(v)}(\lambda_{s}(v)).$$

We are mainly interested in this definition for the case that $s$ is a tree, but for technical reasons we also have to allow contexts. Note that if $c$ is a context, then the parameter node of $c$ is not in $V(c)$ and therefore does not contribute to $\mathsf{Prob}_{\mathcal{P}}(c)$ .

A tree process can be used to randomly construct a tree from $\mathcal{T}$ as follows: In a top-down way we determine for every tree node its label (from $\Sigma$ ) and its number of children, where this decision depends on the history of the tree node. We start at the root node, whose history is the empty word $\varepsilon$ . If we have reached a tree node $v$ with history $z\in\mathcal{L}$ then we use the probability distribution $P_{z}$ to randomly choose a pair $(a,i)\in\Sigma\times\{0,2\}$ . We assign the label $a\in\Sigma$ to $v$ . If $i=0$ then $v$ becomes a leaf, otherwise the process continues at the two children $v0$ and $v1$ (whose histories are well-defined). Note that in this way we may produce infinite trees with non-zero probability (e.g. if $P_{z}(a,2)=1$ for every $z\in\mathcal{L}$ and some $a\in\Sigma$ ). Therefore, we only obtain an inequality instead of an equality in the following lemma (recall that $\mathcal{T}$ only contains finite trees).

Lemma 2.

Let $\mathcal{P}$ be a tree process. Then $\sum_{t\in\mathcal{T}}\mathsf{Prob}_{\mathcal{P}}(t)\leq 1$ .

Proof.

Define the set of trees $\mathcal{T}^{\prime}_{n}$ inductively by $\mathcal{T}^{\prime}_{1}=\mathcal{T}_{1}$ and

[TABLE]

We have $\mathcal{T}^{\prime}_{n}\subsetneq\mathcal{T}^{\prime}_{n+1}$ and $\mathcal{T}=\bigcup_{n\geq 1}\mathcal{T}^{\prime}_{n}$ . It then suffices to show $\sum_{t\in\mathcal{T}^{\prime}_{n}}\mathsf{Prob}_{\mathcal{P}}(t)\leq 1$ for every $n\geq 1$ . This follows easily from the definition of $\mathsf{Prob}_{\mathcal{P}}(t)$ and the inductive definition of $\mathcal{T}^{\prime}_{n}$ . ∎
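To see that the inequality in Lemma 2 can be strict, consider the special case of a process whose distributions do not depend on the history at all. The sketch below assumes such a process, described only by the total probability `p_leaf` that a node becomes a leaf (summed over all labels); the function name and parameters are hypothetical. It accumulates the probability mass of all trees of depth at most `max_depth`.

```python
def finite_tree_mass(p_leaf, max_depth):
    """Total probability of all trees of depth <= max_depth when every node,
    independently of its history, is a leaf with probability p_leaf."""
    mass = p_leaf                      # depth 0: the tree is a single leaf
    for _ in range(max_depth):
        # The root is a leaf, or it is an inner node and both subtrees
        # are finite trees of depth at most one less.
        mass = p_leaf + (1.0 - p_leaf) * mass * mass
    return mass

print(finite_tree_mass(0.75, 200))   # approaches 1
print(finite_tree_mass(0.25, 200))   # approaches 1/3, strictly below 1
```

In this special case the limit is the smallest fixed point of $x=p+(1-p)x^{2}$ , which is $\min\{1,p/(1-p)\}$ ; the missing mass is carried by infinite trees.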

Lemma 2 cannot be extended to contexts, but the following bound will suffice for our purpose.

Lemma 3.

Let $\mathcal{P}$ be a tree process. We have $\sum_{c\in\mathcal{C}_{n}}\mathsf{Prob}_{\mathcal{P}}(c)\leq n+1$ for every $n\geq 1$ .

Proof.

In order to bound $\sum_{c\in\mathcal{C}_{n}}\mathsf{Prob}_{\mathcal{P}}(c)$ , we first represent the probability of each context $c\in\mathcal{C}_{n}$ as a sum of probabilities of trees. So fix a context $c\in\mathcal{C}_{n}$ for the first part of the proof. Note first that in general no tree $t$ exists such that $\mathsf{Prob}_{\mathcal{P}}(c)\leq\mathsf{Prob}_{\mathcal{P}}(t)$ (or even $\mathsf{Prob}_{\mathcal{P}}(c)=\mathsf{Prob}_{\mathcal{P}}(t)$ ) since $\omega(c)$ (the parameter node of $c$ ) does not contribute to the probability of the context $c$ . For example, the tree $c[a]$ ( $a\in\Sigma$ ) which results from $c$ by replacing the parameter node by an $a$ -labeled leaf node has probability $\mathsf{Prob}_{\mathcal{P}}(c)\cdot P_{h(\omega(c))}(a,0)\leq\mathsf{Prob}_{\mathcal{P}}(c)$ . In order to bound $\mathsf{Prob}_{\mathcal{P}}(c)$ , the idea is to replace the parameter node by all possible trees and not only by a single node. So consider the set $c[\mathcal{T}]=\{c[t]\mid t\in\mathcal{T}\}$ of all trees that arise from $c$ by replacing the parameter by an arbitrary tree. Unfortunately, the total probability $\sum_{t\in c[\mathcal{T}]}\mathsf{Prob}_{\mathcal{P}}(t)$ can still be strictly smaller than $\mathsf{Prob}_{\mathcal{P}}(c)$ since there might be infinite trees with positive probability with respect to $\mathcal{P}$ . To get rid of this problem, we fix an element $a\in\Sigma$ and modify $\mathcal{P}$ to a tree process $\mathcal{P}^{\prime}=(P_{z}^{\prime})_{z\in\mathcal{L}}$ such that (i) $P_{z}^{\prime}=P_{z}$ for $|z|\leq 2n$ and (ii) $P_{z}^{\prime}(a,0)=1$ and $P_{z}^{\prime}(a^{\prime},i)=0$ for every $(a^{\prime},i)\in\Sigma\times\{0,2\}\setminus\{(a,0)\}$ and $|z|>2n$ . The tree process $\mathcal{P}^{\prime}$ is created such that all nodes $v$ of depth $|v|\leq n$ contribute the probability $P_{h(v)}(\lambda(v))$ as before and all nodes of depth $n+1$ in a tree are $a$ -labeled leaves with probability $1$ . 
Note first that for each context $c\in\mathcal{C}_{n}$ and each node $v\in V(c)$ we have $|v|\leq n$ and thus $P^{\prime}_{h(v)}(\lambda(v))=P_{h(v)}(\lambda(v))$ . Secondly, all trees of depth larger than $n+1$ have probability $0$ with respect to $\mathcal{P}^{\prime}$ (including infinite trees). Hence, we get $\sum_{t\in\mathcal{T}}\mathsf{Prob}_{\mathcal{P^{\prime}}}(t)=1$ . We obtain

[TABLE]

We claim that $(a)$ equals $1$ . To see this, consider the tree process $\mathcal{P^{\prime\prime}}=(P_{z}^{\prime\prime})_{z\in\mathcal{L}}$ with $P_{z}^{\prime\prime}=P_{h(\omega(c))z}^{\prime}$ . Also for $\mathcal{P^{\prime\prime}}$ only finite trees have non-zero probability and thus $\sum_{t\in\mathcal{T}}\mathsf{Prob}_{\mathcal{P^{\prime\prime}}}(t)=1$ . We have

[TABLE]

It follows that $\mathsf{Prob}_{\mathcal{P}}(c)=\sum_{t\in c[\mathcal{T}]}\mathsf{Prob}_{\mathcal{P^{\prime}}}(t)$ . In the second part of the proof it remains to bound $\sum_{c\in\mathcal{C}_{n}}\mathsf{Prob}_{\mathcal{P}}(c)=\sum_{c\in\mathcal{C}_{n}}\sum_{t\in c[\mathcal{T}]}\mathsf{Prob}_{\mathcal{P}^{\prime}}(t)$ . The key point here is that for each tree $t\in\mathcal{T}$ there are at most $n+1$ different contexts $c\in\mathcal{C}_{n}$ such that $t\in c[\mathcal{T}]$ . Note that for a tree $t$ , the number of different contexts $c\in\mathcal{C}_{n}$ such that $t\in c[\mathcal{T}]$ is exactly the number of nodes $v\in V(t)$ such that replacing the subtree rooted at $v$ by the parameter $x$ yields a context $c$ with $|c|=n$ . This is the same as the number of subtrees of $t$ with $|t|-n$ leaves. Since different subtrees in $t$ of equal size do not share nodes, we can bound the number of subtrees with $|t|-n$ leaves by $|t|/(|t|-n)$ . We can assume that $|t|>n$ since otherwise there is no context $c\in\mathcal{C}_{n}$ such that $t\in c[\mathcal{T}]$ . So we have $|t|=n+k$ for some $k>0$ and the number of subtrees of $t$ with $|t|-n$ leaves is at most $(n+k)/k=n/k+1\leq n+1$ . We get

[TABLE]

This concludes the proof of the lemma. ∎
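The counting step in the second part of the proof can be checked on a small instance. The sketch below assumes the nested-tuple tree representation used for illustration here (not the paper's notation) and counts, for a tree $t$ and a size $n$ , the nodes whose subtree has exactly $|t|-n$ leaves; the helper names are hypothetical.

```python
# Assumed representation: ("a",) is a leaf, ("a", left, right) an inner node.
T = ("a", ("b", ("b", ("a",), ("b",)), ("a",)), ("a", ("b",), ("a",)))

def leaf_count(tree):
    """Number of leaves of a tree, i.e. its size |t|."""
    if len(tree) == 1:
        return 1
    return leaf_count(tree[1]) + leaf_count(tree[2])

def contexts_of_size(tree, n):
    """Number of nodes v such that replacing the subtree rooted at v by the
    parameter leaves a context with n leaves."""
    target = leaf_count(tree) - n
    count, stack = 0, [tree]
    while stack:
        node = stack.pop()
        if leaf_count(node) == target:
            count += 1
        stack.extend(node[1:])   # push the children, if any
    return count

for n in range(1, 5):
    print(n, contexts_of_size(T, n), "<=", n + 1)
```

For this tree with five leaves the counts for $n=1,\ldots,4$ are $0,1,2,5$ , so the bound $n+1$ is attained for $n=4$ .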

A $k^{th}$ -order tree process is a tree process $\mathcal{P}=(P_{z})_{z\in\mathcal{L}}$ such that $P_{z}=P_{z^{\prime}}$ if $\ell_{k}((\Box 0)^{k}z)=\ell_{k}((\Box 0)^{k}z^{\prime})$ . Thus, the probability distribution that is chosen for a certain tree node depends only on the $2k$ last symbols of the history of the node (where histories are padded with $\Box 0$ on the left to reach length $2k$ for the fixed symbol $\Box\in\Sigma$ ). We will identify the $k^{th}$ -order tree process $\mathcal{P}=(P_{z})_{z\in\mathcal{L}}$ with the finite tuple $(P_{z})_{z\in\mathcal{L}_{k}}$ ; it contains all information about $\mathcal{P}$ . Note that for a $k^{th}$ -order tree process $\mathcal{P}$ we can compute $\mathsf{Prob}_{\mathcal{P}}(s)$ for a tree or context $s$ as

$$\mathsf{Prob}_{\mathcal{P}}(s)=\prod_{z\in\mathcal{L}_{k}}\prod_{v\in V_{z}(s)}P_{z}(\lambda_{s}(v)),$$

where the empty product (which arises in case $V_{z}(s)=\emptyset$ ) is $1$ .

2.2.4. Higher-order entropy of a tree

Let us fix $k\geq 0$ . We define the $k^{th}$ -order (unnormalized) empirical entropy $H_{k}(t)$ of a tree $t\in\mathcal{T}_{n}$ as follows: For $z\in\mathcal{L}_{k}$ let

$$m^{t}_{z}=|V_{z}(t)|$$

be the number of nodes of $t$ with $k$ -history $z$ and for $\tilde{a}\in\Sigma\times\{0,2\}$ let

$$m^{t}_{z,\tilde{a}}=|\{v\in V_{z}(t)\mid\lambda(v)=\tilde{a}\}|.$$

We then define the empirical $k^{th}$ -order tree process $\mathcal{P}^{t}=(P^{t}_{z})_{z\in\mathcal{L}_{k}}$ by

$$P^{t}_{z}(\tilde{a})=\frac{m^{t}_{z,\tilde{a}}}{m^{t}_{z}}$$

for all $\tilde{a}\in\Sigma\times\{0,2\}$ and all $z\in\mathcal{L}_{k}$ with $m^{t}_{z}>0$ . If $m^{t}_{z}=0$ , then we can define $P^{t}_{z}$ as an arbitrary distribution. Then

$$H_{k}(t)=\sum_{z\in\mathcal{L}_{k}}m^{t}_{z}\cdot H(P^{t}_{z}).$$

Note that

$$0\leq H_{k}(t)\leq(2n-1)\log_{2}(2\sigma)$$

since $0\leq H(P^{t}_{z})\leq\log_{2}(2\sigma)$ and $\sum_{z\in\mathcal{L}_{k}}m^{t}_{z}=2n-1$ . This upper bound on the entropy matches the information theoretic bound for the worst-case output length of any tree encoder on $\mathcal{T}_{n}$ . Using the asymptotic bound (6) for the Catalan numbers, one sees that for any tree encoder there must exist a tree $t\in\mathcal{T}_{n}$ which is encoded with at least $2\log_{2}(2\sigma)n-o(n)=2(\log_{2}\sigma+1)n-o(n)$ bits. The $k^{th}$ -order empirical entropy $H_{k}(t)$ is a lower bound on the coding length of a tree encoder that encodes for each node the relevant information (the label of the node and the binary information whether the node is a leaf or internal) depending on the $k$ -history of the node.

Example 3.

Let $t$ denote the binary tree $t=a(b(b(a,b),a),a(b,a))$ as depicted on the left of Figure 1. In order to compute the first order empirical entropy $H_{1}(t)$ of $t$ , we have to consider $k$ -histories of $t$ with $k=1$ : Let $\Box=a$ . It follows that $V_{a0}(t)=\{\varepsilon,0,10\}$ , $V_{b0}(t)=\{00,000\}$ , $V_{a1}(t)=\{1,11\}$ and $V_{b1}(t)=\{01,001\}$ . Thus, we have $m_{a0}^{t}=3$ and $m_{a1}^{t}=m_{b0}^{t}=m_{b1}^{t}=2$ . Next, for each $k$ -history $z$ , we consider $\lambda(v)$ for $v\in V_{z}(t)$ : For $z=a0$ , we have $\lambda(\varepsilon)=(a,2)$ , $\lambda(0)=(b,2)$ and $\lambda(10)=(b,0)$ . Hence, $m_{a0,(a,2)}^{t}=m_{a0,(b,0)}^{t}=m_{a0,(b,2)}^{t}=1$ and $H(P_{a0}^{t})=\log_{2}(3)$ . Analogously, we find $H(P_{b0}^{t})=H(P_{a1}^{t})=H(P_{b1}^{t})=1/2\log_{2}(2)+1/2\log_{2}(2)=1$ . Altogether, this yields $H_{1}(t)=3\cdot\log_{2}(3)+2\cdot 1+2\cdot 1+2\cdot 1$ which is roughly $10.75$ .
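The computation in Example 3 can be reproduced with a few lines of code. This is a minimal sketch, assuming the nested-tuple tree representation and single-character labels (both illustrative assumptions); `empirical_entropy` groups the nodes by their $k$ -history and sums $m^{t}_{z,\tilde{a}}\log_{2}(m^{t}_{z}/m^{t}_{z,\tilde{a}})$ over all groups.

```python
from collections import Counter, defaultdict
from math import log2

# Assumed representation: ("a",) is a leaf, ("a", left, right) an inner node.
T = ("a", ("b", ("b", ("a",), ("b",)), ("a",)), ("a", ("b",), ("a",)))

def nodes_with_history(tree):
    """Yield (h(v), lambda(v)) for every node v, with lambda(v) = (label, #children)."""
    stack = [(tree, "")]
    while stack:
        node, h = stack.pop()
        yield h, (node[0], len(node) - 1)
        if len(node) == 3:
            stack.append((node[1], h + node[0] + "0"))
            stack.append((node[2], h + node[0] + "1"))

def empirical_entropy(tree, k, pad="a"):
    """Unnormalized k-th order empirical entropy H_k(t)."""
    groups = defaultdict(Counter)          # k-history z -> counts of lambda(v)
    for h, lam in nodes_with_history(tree):
        padded = (pad + "0") * k + h
        z = padded[len(padded) - 2 * k:]
        groups[z][lam] += 1
    total = 0.0
    for counts in groups.values():
        m_z = sum(counts.values())
        total += sum(c * log2(m_z / c) for c in counts.values())
    return total

print(round(empirical_entropy(T, 1), 2))   # 10.75
```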

One can define $H_{k}(t)$ alternatively in the following way: Take a $k$ -history $z\in\mathcal{L}_{k}$ and enumerate the set $V_{z}(t)$ in an arbitrary way as $v_{1},v_{2},\ldots,v_{j}$ . Define the string $w(t,z)=\lambda(v_{1})\lambda(v_{2})\cdots\lambda(v_{j})\in(\Sigma\times\{0,2\})^{*}$ . We have

$$H_{k}(t)=\sum_{z\in\mathcal{L}_{k}}H(w(t,z)),$$

where the empirical entropy $H(w(t,z))$ is defined according to (4).

The following theorem and its proof are very similar to a corresponding statement for the $k^{th}$ -order empirical entropy of strings, see [9].

Theorem 1.

Let $t\in\mathcal{T}$ . For every $k^{th}$ -order tree process $\mathcal{P}=(P_{z})_{z\in\mathcal{L}_{k}}$ with $\mathsf{Prob}_{\mathcal{P}}(t)>0$ we have

$$H_{k}(t)\leq-\log_{2}\mathsf{Prob}_{\mathcal{P}}(t)$$

with equality if and only if $P^{t}_{z}=P_{z}$ for all $z\in\mathcal{L}_{k}$ with $m^{t}_{z}>0$ .

Proof.

We have

[TABLE]

with equality in the last line if and only if $P^{t}_{z}=P_{z}$ for all $z\in\mathcal{L}_{k}$ with $m_{z}^{t}>0$ . ∎
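Theorem 1 can be sanity-checked numerically on the tree from Example 3. The sketch below assumes the nested-tuple tree representation and represents a $k^{th}$ -order tree process as a function from $k$ -histories to dictionaries; all names are hypothetical. It compares the empirical process, for which equality holds, with the uniform process on $\Sigma\times\{0,2\}$ for $\Sigma=\{a,b\}$ .

```python
from collections import Counter, defaultdict
from math import log2

# Assumed representation: ("a",) is a leaf, ("a", left, right) an inner node.
T = ("a", ("b", ("b", ("a",), ("b",)), ("a",)), ("a", ("b",), ("a",)))

def k_histories(tree, k, pad="a"):
    """Return a list of (h_k(v), lambda(v)) over all nodes v of the tree."""
    out, stack = [], [(tree, (pad + "0") * k)]
    while stack:
        node, h = stack.pop()
        out.append((h[len(h) - 2 * k:], (node[0], len(node) - 1)))
        for bit, child in enumerate(node[1:]):
            stack.append((child, h + node[0] + str(bit)))
    return out

def empirical_process(tree, k):
    """The empirical distributions P^t_z for all z with m^t_z > 0."""
    counts = defaultdict(Counter)
    for z, lam in k_histories(tree, k):
        counts[z][lam] += 1
    return {z: {lam: c / sum(cs.values()) for lam, c in cs.items()}
            for z, cs in counts.items()}

def neg_log_prob(tree, k, dist):
    """-log2 Prob_P(t), where dist(z) returns the distribution P_z as a dict."""
    return -sum(log2(dist(z)[lam]) for z, lam in k_histories(tree, k))

emp = empirical_process(T, 1)
h1 = neg_log_prob(T, 1, lambda z: emp[z])        # equality case: this is H_1(t)
uniform = {(a, i): 0.25 for a in "ab" for i in (0, 2)}
bound = neg_log_prob(T, 1, lambda z: uniform)    # 9 nodes at 2 bits each
print(h1 <= bound)   # True
```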

3. Tree straight-line programs and compression of binary trees

We now introduce tree straight-line programs and use them for the compression of binary trees.

3.1. General tree straight-line programs

Let $V$ be a finite alphabet of symbols, where each symbol $A\in V$ has an associated rank $0$ or $1$ (we also speak of a ranked alphabet). The elements of $V$ are called nonterminals. We assume that $V$ contains at least one nonterminal of rank $0$ and that $V$ is disjoint from the set $\Sigma\cup\{x\}$ , which are the labels used for binary trees and contexts. We use $V_{0}$ (resp., $V_{1}$ ) for the set of nonterminals of rank $0$ (resp., of rank $1$ ). The idea is that nonterminals from $V_{0}$ (resp., $V_{1}$ ) derive to trees from $\mathcal{T}$ (resp., contexts from $\mathcal{C}$ ). We denote by $\mathcal{T}_{V}(\Sigma)$ the set of trees over $\Sigma\cup V$ , i.e., each node in a tree $t\in\mathcal{T}_{V}(\Sigma)$ is labeled with a symbol from $\Sigma\cup V$ such that nodes labeled by symbols from $\Sigma$ have zero or two children and if a node is labeled by a symbol from $V$ , then the number of children of this node corresponds to the rank of its label (a formal definition follows). With $\mathcal{C}_{V}(\Sigma)$ we denote the corresponding set of all contexts, i.e., the set of trees over $\Sigma\cup\{x\}\cup V$ , where the parameter symbol $x$ occurs exactly once and at a leaf position. Formally, we define $\mathcal{T}_{V}(\Sigma)$ and $\mathcal{C}_{V}(\Sigma)$ as the smallest sets of formal expressions with the following conditions, where here and in the rest of the paper we use the abbreviations $\mathcal{T}_{V}$ for $\mathcal{T}_{V}(\Sigma)$ and $\mathcal{C}_{V}$ for $\mathcal{C}_{V}(\Sigma)$ :

•

$\Sigma\cup V_{0}\subseteq\mathcal{T}_{V}$ and $x\in\mathcal{C}_{V}$ ,

•

if $a\in\Sigma$ , $A\in V_{1}$ and $t_{1},t_{2}\in\mathcal{T}_{V}$ then $A(t_{1}),a(t_{1},t_{2})\in\mathcal{T}_{V}$ , and

•

if $a\in\Sigma$ , $A\in V_{1}$ , $s\in\mathcal{C}_{V}$ and $t\in\mathcal{T}_{V}$ then $A(s),a(s,t),a(t,s)\in\mathcal{C}_{V}$ .

If e.g. $\Sigma=\{a,b\}$ , $V_{0}=\{A\}$ and $V_{1}=\{B\}$ , then $B(a(b(A,b),B(a)))\in\mathcal{T}_{V}$ and $B(a(b(A,b),B(x)))\in\mathcal{C}_{V}$ as depicted in Figure 2. Note that $\mathcal{T}(\Sigma)\subseteq\mathcal{T}_{V}(\Sigma)$ and $\mathcal{C}(\Sigma)\subseteq\mathcal{C}_{V}(\Sigma)$ .

A tree straight-line program $\mathcal{G}$ , or TSLP for short, is a tuple $(V,A_{0},r)$ , where $A_{0}\in V_{0}$ is the start nonterminal and $r:V\to(\mathcal{T}_{V}\cup\mathcal{C}_{V})$ is a function which assigns to each nonterminal its unique right-hand side. It is required that if $A\in V_{0}$ (resp., $A\in V_{1}$ ), then $r(A)\in\mathcal{T}_{V}$ (resp., $r(A)\in\mathcal{C}_{V}$ ). Furthermore, the binary relation $\{(A,B)\in V\times V\mid B\text{ occurs in }r(A)\}$ has to be acyclic. These conditions ensure that exactly one tree is derived from the start nonterminal $A_{0}$ by using the rewrite rules $A\to r(A)$ for $A\in V$ . To define this formally, we define $\mathsf{val}_{\mathcal{G}}(t)\in\mathcal{T}$ for $t\in\mathcal{T}_{V}$ and $\mathsf{val}_{\mathcal{G}}(t)\in\mathcal{C}$ for $t\in\mathcal{C}_{V}$ inductively by the following rules:

•

$\mathsf{val}_{\mathcal{G}}(a)=a$ for $a\in\Sigma$ and $\mathsf{val}_{\mathcal{G}}(x)=x$ ,

•

$\mathsf{val}_{\mathcal{G}}(a(t_{1},t_{2}))=a(\mathsf{val}_{\mathcal{G}}(t_{1}),\mathsf{val}_{\mathcal{G}}(t_{2}))$ for $a\in\Sigma$ and $t_{1},t_{2}\in\mathcal{T}_{V}\cup\mathcal{C}_{V}$ (and $t_{1}\in\mathcal{T}_{V}$ or $t_{2}\in\mathcal{T}_{V}$ since there is at most one parameter in $a(t_{1},t_{2})$ ),

•

$\mathsf{val}_{\mathcal{G}}(A)=\mathsf{val}_{\mathcal{G}}(r(A))$ for $A\in V_{0}$ ,

•

$\mathsf{val}_{\mathcal{G}}(A(s))=\mathsf{val}_{\mathcal{G}}(r(A))[\mathsf{val}_{\mathcal{G}}(s)]$ for $A\in V_{1}$ and $s\in\mathcal{T}_{V}\cup\mathcal{C}_{V}$ (note that $\mathsf{val}_{\mathcal{G}}(r(A))$ is a context $c$ , so we can build $c[\mathsf{val}_{\mathcal{G}}(s)]$ ).

The tree defined by $\mathcal{G}$ is $\mathsf{val}(\mathcal{G})=\mathsf{val}_{\mathcal{G}}(A_{0})\in\mathcal{T}$ .

Example 4.

Let $\Sigma=\{a,b\}$ and $\mathcal{G}=(\{A_{0},A_{1},A_{2}\},A_{0},r)$ be a TSLP such that $A_{0},A_{1}\in V_{0},A_{2}\in V_{1}$ and

[TABLE]

We get $\mathsf{val}_{\mathcal{G}}(A_{2})=b(x,a)$ , $\mathsf{val}_{\mathcal{G}}(A_{1})=b(b(b,a),a)$ and $\mathsf{val}({\mathcal{G}})=\mathsf{val}_{\mathcal{G}}(A_{0})=a(b(b(b,a),a),b(b,a))$ .

3.2. Tree straight-line programs in normal form

In this section, we will use TSLPs in a certain normal form, which we introduce first.

A TSLP $\mathcal{G}=(V,A_{0},r)$ is in normal form if the following conditions hold:

•

$V=\{A_{0},A_{1},\ldots,A_{m-1}\}$ for some $m\in\mathbb{N}$ , $m\geq 1$ .

•

For every $A_{i}\in V_{0}$ , the right-hand side $r(A_{i})$ is an expression of the form $A_{j}(\alpha)$ , where $A_{j}\in V_{1}$ and $\alpha\in V_{0}\cup\Sigma$ .

•

For every $A_{i}\in V_{1}$ the right-hand side $r(A_{i})$ is an expression of the form $A_{j}(A_{k}(x))$ , $a(\alpha,x)$ , or $a(x,\alpha)$ , where $A_{j},A_{k}\in V_{1}$ , $a\in\Sigma$ and $\alpha\in V_{0}\cup\Sigma$ .

•

For every $A_{i}\in V$ define the word $\rho(A_{i})\in(V\cup\Sigma)^{*}$ as follows:

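$$\rho(A_{i})=\begin{cases}A_{j}\alpha&\text{if }r(A_{i})=A_{j}(\alpha),\\A_{j}A_{k}&\text{if }r(A_{i})=A_{j}(A_{k}(x)),\\a\alpha&\text{if }r(A_{i})=a(\alpha,x)\text{ or }r(A_{i})=a(x,\alpha).\end{cases}$$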

Let $\rho_{\mathcal{G}}=\rho(A_{0})\rho(A_{1})\cdots\rho(A_{m-1})\in(\Sigma\cup\{A_{1},A_{2},\ldots,A_{m-1}\})^{*}$ . Then we require that $\rho_{\mathcal{G}}$ is of the form $\rho_{\mathcal{G}}=A_{1}u_{1}A_{2}u_{2}\cdots A_{m-1}u_{m-1}$ with $u_{i}\in(\Sigma\cup\{A_{1},A_{2},\ldots,A_{i}\})^{*}$ .

•

$\mathsf{val}_{\mathcal{G}}(A_{i})\neq\mathsf{val}_{\mathcal{G}}(A_{j})$ for $i\neq j$ .

We also allow the TSLP $\mathcal{G}_{a}=(\{A_{0}\},A_{0},A_{0}\mapsto a)$ for every $a\in\Sigma$ in order to get the singleton tree $a$ . In this case, we set $\rho_{\mathcal{G}_{a}}=\rho(A_{0})=a$ .

Let $\mathcal{G}=(V,A_{0},r)$ be a TSLP in normal form with $V=\{A_{0},A_{1},\ldots,A_{m-1}\}$ for the further definitions. We define the size of $\mathcal{G}$ as $|\mathcal{G}|=|V|=m$ . Thus $2|\mathcal{G}|$ is the length of $\rho_{\mathcal{G}}$ . Let $\omega_{\mathcal{G}}$ be the word obtained from $\rho_{\mathcal{G}}$ by removing the first (i.e., left-most) occurrence of $A_{i}$ from $\rho_{\mathcal{G}}$ for every $1\leq i\leq m-1$ . Thus, if $\rho_{\mathcal{G}}=A_{1}u_{1}A_{2}u_{2}\cdots A_{m-1}u_{m-1}$ with $u_{i}\in(\Sigma\cup\{A_{1},A_{2},\ldots,A_{i}\})^{*}$ , then $\omega_{\mathcal{G}}=u_{1}u_{2}\cdots u_{m-1}$ . Note that $|\omega_{\mathcal{G}}|=|\rho_{\mathcal{G}}|-m+1=m+1$ . The entropy $H(\mathcal{G})$ of the normal form TSLP $\mathcal{G}$ is defined as the empirical unnormalized entropy of the word $\omega_{\mathcal{G}}$ (see (4)):

$$H(\mathcal{G})=H(\omega_{\mathcal{G}}).$$

Example 5.

Let $\Sigma=\{a,b\}$ and $\mathcal{G}=(\{A_{0},A_{1},A_{2},A_{3},A_{4}\},A_{0},r)$ be the normal form TSLP with $A_{0},A_{2},A_{3}\in V_{0},A_{1},A_{4}\in V_{1}$ and

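$$r(A_{0})=A_{1}(A_{2}),\quad r(A_{1})=a(x,A_{3}),\quad r(A_{2})=A_{4}(A_{3}),\quad r(A_{3})=A_{4}(b),\quad r(A_{4})=b(x,a).$$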

We have $\mathsf{val}(\mathcal{G})=a(b(b(b,a),a),b(b,a))$ , $\rho_{\mathcal{G}}=A_{1}A_{2}aA_{3}A_{4}A_{3}A_{4}bba$ ( $u_{1}=u_{3}=\varepsilon$ , $u_{2}=a$ , $u_{4}=A_{3}A_{4}bba$ ), $|\mathcal{G}|=5$ and $\omega_{\mathcal{G}}=aA_{3}A_{4}bba$ .
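The values stated in Example 5 can be recomputed from the rules of the TSLP. The sketch below is illustrative only: the right-hand sides in `RULES` are an assumption, inferred from the values of $\rho_{\mathcal{G}}$ and $\mathsf{val}(\mathcal{G})$ given in the example, and the tuple encoding of rules, trees, and the parameter `X` is hypothetical.

```python
# Assumed encoding of normal form rules:
#   ("apply", Aj, alpha)    for Aj(alpha) and for Aj(Ak(x))
#   ("hole_left", a, alpha) for a(x, alpha)
#   ("hole_right", a, alpha) for a(alpha, x)
RULES = {
    "A0": ("apply", "A1", "A2"),
    "A1": ("hole_left", "a", "A3"),
    "A2": ("apply", "A4", "A3"),
    "A3": ("apply", "A4", "b"),
    "A4": ("hole_left", "b", "a"),
}
X = ("x",)   # the parameter; assumes no terminal symbol is named "x"

def substitute(context, tree):
    """Return c[t]: replace the parameter in the context by the tree."""
    if context == X:
        return tree
    if len(context) == 1:
        return context
    return (context[0], substitute(context[1], tree), substitute(context[2], tree))

def val(sym):
    """Evaluate a terminal or nonterminal to a tree or context (nested tuples)."""
    if sym not in RULES:
        return (sym,)                      # terminal symbol: a single leaf
    kind, first, second = RULES[sym]
    if kind == "apply":
        return substitute(val(first), val(second))
    if kind == "hole_left":
        return (first, X, val(second))
    return (first, val(second), X)         # "hole_right"

def rho(order):
    """Concatenate rho(A_i) for the nonterminals in the given order."""
    return [s for name in order for s in RULES[name][1:]]

def omega(rho_word):
    """Remove the first occurrence of every nonterminal from rho."""
    seen, out = set(), []
    for s in rho_word:
        if s in RULES and s not in seen:
            seen.add(s)
        else:
            out.append(s)
    return out

rho_g = rho(["A0", "A1", "A2", "A3", "A4"])
print(rho_g)          # ['A1', 'A2', 'a', 'A3', 'A4', 'A3', 'A4', 'b', 'b', 'a']
print(omega(rho_g))   # ['a', 'A3', 'A4', 'b', 'b', 'a']
```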

The derivation tree $T_{\mathcal{G}}$ of the normal form TSLP $\mathcal{G}$ is a binary tree with node labels from $V\cup\Sigma$ . The root is labeled with $A_{0}$ . Nodes labeled with a symbol from $\Sigma$ are the leaves of $T_{\mathcal{G}}$ . A node $v$ that is labeled with a nonterminal $A_{i}$ has $|\rho(A_{i})|=2$ many children. If $\rho(A_{i})=\alpha\beta$ with $\alpha,\beta\in V\cup\Sigma$ , then the left child of $v$ is labeled with $\alpha$ and the right child is labeled with $\beta$ . For every node $u$ of $T_{\mathcal{G}}$ we define the tree or context $s_{u}=\mathsf{val}_{\mathcal{G}}(\alpha)$ where $\alpha\in V\cup\Sigma$ is the label of $u$ . If $\alpha\in V_{0}\cup\Sigma$ then $s_{u}\in\mathcal{T}$ and if $\alpha\in V_{1}$ then $s_{u}\in\mathcal{C}$ . An initial subtree of the derivation tree $T_{\mathcal{G}}$ is a tree that can be obtained from $T_{\mathcal{G}}$ as follows: Take a subset $U$ of the nodes of $T_{\mathcal{G}}$ and remove from $T_{\mathcal{G}}$ all proper descendants of nodes from $U$ , i.e., all nodes that are located strictly below a node from $U$ .

Example 6.

Let $\mathcal{G}$ be the normal form TSLP from Example 5. The derivation tree $T_{\mathcal{G}}$ is shown in Figure 3 on the left; an initial subtree $T^{\prime}$ of it is shown on the right.

Lemma 4.

Let $\mathcal{G}$ be a TSLP in normal form with $t=\mathsf{val}(\mathcal{G})$ . Let $T^{\prime}$ be an initial subtree of $T_{\mathcal{G}}$ and let $v_{1},\ldots,v_{l}$ be the sequence of all leaves of $T^{\prime}$ (in left-to-right order). Then $2|t|\geq\sum_{i=1}^{l}|s_{v_{i}}|$ .

Proof.

Let $u$ be a node of $T_{\mathcal{G}}$ and let $T_{u}$ be the subtree of $T_{\mathcal{G}}$ rooted in $u$ . Then, the nodes of $s_{u}$ are in a one-to-one correspondence with the leaves of $T_{u}$ , that is, if $s_{u}\in\mathcal{T}$ , we have $2|s_{u}|-1=|T_{u}|$ and if $s_{u}\in\mathcal{C}$ , we have $2|s_{u}|=|T_{u}|$ (recall that $|T_{u}|$ is the number of leaves of $T_{u}$ ). Thus, $2|s_{u}|-1\leq|T_{u}|$ . Since $T^{\prime}$ is an initial subtree of $T_{\mathcal{G}}$ we get $2|t|-1=2|\mathsf{val}(\mathcal{G})|-1=|T_{\mathcal{G}}|=\sum_{i=1}^{l}|T_{v_{i}}|\geq\sum_{i=1}^{l}(2|s_{v_{i}}|-1)$ . Since $|s_{v_{i}}|\geq 1$ we get $2|t|\geq\sum_{i=1}^{l}2|s_{v_{i}}|-l+1\geq\sum_{i=1}^{l}|s_{v_{i}}|+1$ and the statement follows. ∎

A grammar-based tree compressor is an algorithm $\psi$ that produces for a given tree $t\in\mathcal{T}$ a TSLP $\mathcal{G}_{t}$ in normal form such that $t=\mathsf{val}(\mathcal{G}_{t})$ . It is not hard to show that every TSLP can be transformed with a linear size increase into a normal form TSLP that derives the same tree. For example, the TSLP from Example 4 is transformed into the normal form TSLP described in Example 5. We will not use this fact, since all we need is the following theorem from [10] (recall that $\hat{\sigma}=\max\{2,\sigma\}$ ):

Theorem 2.

There exists a grammar-based compressor $\psi$ (working in linear time) with $\max_{t\in\mathcal{T}_{n}}|\mathcal{G}_{t}|\in\mathcal{O}(n/\log_{\hat{\sigma}}n)$ .

3.3. Binary coding of TSLPs in normal form

In this section we fix a binary encoding for normal form TSLPs. This encoding is similar to the one for TSLPs producing unlabeled binary trees [16] (which in turn is based on the encoding for SLPs from [19] and the encoding of DAGs from [30]). Let $\mathcal{G}=(V,A_{0},r)$ be a TSLP in normal form with $m=|V|=|\mathcal{G}|$ nonterminals. We define the type $\mathsf{type}(A_{i})\in\{0,1,2,3\}$ of a nonterminal $A_{i}\in V$ as follows:

[TABLE]

We define the binary word $B(\mathcal{G})=w_{0}w_{1}w_{2}w_{3}w_{4}$ , where the words $w_{i}\in\{0,1\}^{+}$ , $0\leq i\leq 4$ , are defined as follows:

•

$w_{0}=0^{m-1}1$

•

$w_{1}=a_{0}b_{0}a_{1}b_{1}\cdots a_{m-1}b_{m-1}$ , where $a_{j}b_{j}$ is the 2-bit binary encoding of $\mathsf{type}(A_{j})$ . Note that $|w_{1}|=2m$ .

•

Let $\rho_{\mathcal{G}}=A_{1}u_{1}A_{2}u_{2}\cdots A_{m-1}u_{m-1}$ with $u_{i}\in(\Sigma\cup\{A_{1},A_{2},\ldots,A_{i}\})^{*}$ . Then $w_{2}=10^{|u_{1}|}10^{|u_{2}|}\cdots 10^{|u_{m-1}|}$ . Note that $|w_{2}|=2m$ .

•

For $1\leq i\leq m-1$ let $k_{i}=|\rho_{\mathcal{G}}|_{A_{i}}\geq 1$ be the number of occurrences of the nonterminal $A_{i}$ in the word $\rho_{\mathcal{G}}$ . Moreover, fix a total ordering on $\Sigma$ . For $1\leq i\leq\sigma$ , let $a_{i}$ denote the $i^{th}$ symbol in $\Sigma$ according to this ordering and let $l_{i}=|\rho_{\mathcal{G}}|_{a_{i}}\geq 0$ be the number of occurrences of the symbol $a_{i}$ in the word $\rho_{\mathcal{G}}$ . Then $w_{3}=0^{k_{1}-1}10^{k_{2}-1}1\cdots 0^{k_{m-1}-1}10^{l_{1}}10^{l_{2}}1\cdots 0^{l_{\sigma}}1$ . Note that $|w_{3}|=2m+\sigma$ .

•

The word $w_{4}$ encodes the word $\omega_{\mathcal{G}}$ using the well-known enumerative encoding [4]. Every nonterminal $A_{i}$ , $1\leq i\leq m-1$ , has $\eta(A_{i}):=k_{i}-1$ occurrences in $\omega_{\mathcal{G}}$ . Every symbol $a_{i}\in\Sigma$ , $1\leq i\leq\sigma$ , has $\eta(a_{i}):=l_{i}$ occurrences in $\omega_{\mathcal{G}}$ . Let $S$ be the set of words over the alphabet $\Sigma\cup\{A_{1},\ldots,A_{m-1}\}$ with $\eta(a_{i})$ occurrences of $a_{i}\in\Sigma$ ( $1\leq i\leq\sigma$ ) and $\eta(A_{i})$ occurrences of $A_{i}$ ( $1\leq i\leq m-1$ ). Hence,

$$|S|=\frac{(m+1)!}{\prod_{i=1}^{\sigma}\eta(a_{i})!\cdot\prod_{i=1}^{m-1}\eta(A_{i})!}.\tag{11}$$

Let $v_{0},v_{1},\ldots,v_{|S|-1}$ be the lexicographic enumeration of the words from $S$ with respect to the alphabet order $a_{1},\dots,a_{\sigma},A_{1},\ldots,A_{m-1}$ . Then $w_{4}$ is the binary encoding of the unique index $i$ such that $\omega_{\mathcal{G}}=v_{i}$ , where $|w_{4}|=\lceil\log_{2}|S|\rceil$ (leading zeros are added to the binary encoding of $i$ to obtain the length $\lceil\log_{2}|S|\rceil$ ).

Example 7.

Consider the normal form TSLP $\mathcal{G}$ from Example 5. We have $w_{0}=00001$ , $w_{1}=0011000011$ , $w_{2}=1101100000$ and $w_{3}=110101001001$ . To compute $w_{4}$ , note first that there are $|S|=180$ words with two occurrences of $a$ and $b$ and one occurrence of $A_{3}$ and $A_{4}$ . It follows that $|w_{4}|=\lceil\log_{2}(180)\rceil=8$ . Furthermore, with the canonical ordering on $\Sigma=\{a,b\}$ , the order of the alphabet is $a,b,A_{3},A_{4}$ . The word $\omega_{\mathcal{G}}=aA_{3}A_{4}bba$ is the lexicographically largest word in $S$ starting with $aA_{3}$ . There are 132 words in $S$ that are lexicographically larger than $aA_{3}A_{4}bba$ , namely all words in $S$ that start with $b$ (60 words), $A_{3}$ (30 words), $A_{4}$ (30 words), or $aA_{4}$ (12 words). Hence $\omega_{\mathcal{G}}=aA_{3}A_{4}bba$ is the $48^{th}$ word in $S$ in lexicographic order, i.e., $\omega_{\mathcal{G}}=v_{47}$ and thus $w_{4}=00101111$ .
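The words $w_{0}$ , $w_{2}$ , $w_{3}$ and $w_{4}$ from Example 7 can be recomputed from $\rho_{\mathcal{G}}$ alone. The following is a minimal sketch under the assumption that $\rho_{\mathcal{G}}$ is given as a list of symbol names; the helper names `multinomial` and `lex_rank` are hypothetical, and $w_{1}$ is omitted because it depends on the types of the nonterminals.

```python
from collections import Counter
from math import factorial

def multinomial(counts):
    """Number of distinct words with the given symbol multiplicities."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total

def lex_rank(word, alphabet):
    """0-based rank of word among all rearrangements of its symbols,
    in lexicographic order with respect to the given alphabet order."""
    counts = Counter(word)
    rank = 0
    for sym in word:
        for smaller in alphabet:
            if smaller == sym:
                break
            if counts[smaller] > 0:
                counts[smaller] -= 1
                rank += multinomial(counts)   # words with a smaller symbol here
                counts[smaller] += 1
        counts[sym] -= 1
    return rank

rho = ["A1", "A2", "a", "A3", "A4", "A3", "A4", "b", "b", "a"]
nonterminals = ["A1", "A2", "A3", "A4"]
sigma = ["a", "b"]
m = len(nonterminals) + 1

w0 = "0" * (m - 1) + "1"
seen, w2, omega = set(), "", []
for s in rho:
    if s in nonterminals and s not in seen:
        seen.add(s)
        w2 += "1"            # first occurrence of a nonterminal
    else:
        w2 += "0"            # symbol belongs to some u_i, hence to omega
        omega.append(s)
occ = Counter(rho)
w3 = "".join("0" * (occ[A] - 1) + "1" for A in nonterminals)
w3 += "".join("0" * occ[a] + "1" for a in sigma)

size = multinomial(Counter(omega))        # |S|
width = (size - 1).bit_length()           # ceil(log2 |S|) for |S| >= 1
index = lex_rank(omega, sigma + nonterminals)
w4 = format(index, "b").zfill(width)
print(w0, w2, w3, size, index, w4)
# 00001 1101100000 110101001001 180 47 00101111
```

The width is computed with integer arithmetic instead of a floating-point logarithm, which avoids rounding issues when $|S|$ is large.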

The following lemma generalizes a result from [16]:

Lemma 5.

The set of code words $B(\mathcal{G})$ , where $\mathcal{G}$ ranges over all TSLPs in normal form, is a prefix code.

Proof.

Let $B(\mathcal{G})=w_{0}w_{1}w_{2}w_{3}w_{4}$ with $w_{i}$ defined as above. We show how to recover the TSLP $\mathcal{G}$ , given the alphabet $\Sigma$ and the ordering on $\Sigma$ . From $w_{0}$ we can determine $m=|V|$ and the factors $w_{1}$ , $w_{2}$ , and $w_{3}$ of $B(\mathcal{G})$ . Hence, we can determine the type of every nonterminal from $w_{1}$ . The types allow us to compute $\mathcal{G}$ from the word $\rho_{\mathcal{G}}$ . Hence, it remains to determine $\rho_{\mathcal{G}}$ . To compute $\rho_{\mathcal{G}}$ from $w_{2}$ , one only needs $\omega_{\mathcal{G}}$ . For this, one determines the frequencies $\eta(A_{1}),\ldots,\eta(A_{m-1}),\eta(a_{1}),\dots,\eta(a_{\sigma})$ of the symbols in $\omega_{\mathcal{G}}$ from $w_{3}$ . Using these frequencies one computes the size $|S|$ from (11) and the length $\lceil\log_{2}|S|\rceil$ of $w_{4}$ . From $w_{4}$ , one can finally compute $\omega_{\mathcal{G}}$ . ∎

Note that $|B(\mathcal{G})|\leq 7|\mathcal{G}|+\sigma+|w_{4}|$ . By using the well-known bound on the code length of enumerative encoding [5, Theorem 11.1.3], we get:

Lemma 6.

For the length of the binary coding $B(\mathcal{G})$ we have

[TABLE]

4. Entropy bounds for binary encoded TSLPs

For this section we fix a grammar-based tree compressor $\psi:t\mapsto\mathcal{G}_{t}$ such that $\max_{t\in\mathcal{T}_{n}}|\mathcal{G}_{t}|\in\mathcal{O}(n/\log_{\hat{\sigma}}n)$ ; see Theorem 2. Let $\gamma>0$ be a concrete constant such that

[TABLE]

for every tree $t\in\mathcal{T}_{n}$ and $n$ large enough. We allow that the alphabet size $\sigma$ grows with $n$ , i.e., $\sigma=\sigma(n)$ is a function in the tree size $n$ such that $1\leq\sigma(n)\leq 2n-1$ (a binary tree $t\in\mathcal{T}_{n}$ has $2n-1$ nodes).

We then consider the tree encoder $E_{\psi}:\mathcal{T}\to\{0,1\}^{*}$ defined by $E_{\psi}(t)=B(\mathcal{G}_{t})$ .

Lemma 7.

Let $k\geq 0$ , $t\in\mathcal{T}_{n}$ with $n\geq 2$ and let $\mathcal{P}=(P_{w})_{w\in\mathcal{L}_{k}}$ be a $k^{th}$ -order tree process with $\mathsf{Prob}_{\mathcal{P}}(t)>0$ . We have

[TABLE]

Proof.

Let $m=|\mathcal{G}_{t}|=|V|$ be the size of $\mathcal{G}_{t}$ . Let $T=T_{\mathcal{G}_{t}}$ be the derivation tree of $\mathcal{G}_{t}$ . We define an initial subtree $T^{\prime}$ as follows: If $v_{1}$ and $v_{2}$ are non-leaf nodes of $T$ that are labeled with the same nonterminal and $v_{1}$ comes before $v_{2}$ in preorder (depth-first left-to-right), then we remove from $T$ all proper descendants of $v_{2}$ . Thus, for every $A_{i}\in V$ there is exactly one non-leaf node in $T^{\prime}$ that is labeled with $A_{i}$ . For the TSLP from Example 5, the tree $T^{\prime}$ is shown in Figure 3 on the right.
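The construction of $T^{\prime}$ amounts to a preorder traversal of the derivation tree that expands each nonterminal only at its first occurrence. The sketch below assumes the derivation tree is given implicitly by the words $\rho(A_{i})$ of Example 5, read off from $\rho_{\mathcal{G}}$ in blocks of two symbols; the dictionary `RHO` and the function name are illustrative.

```python
# rho(A_i) for the normal form TSLP of Example 5 (two symbols per nonterminal).
RHO = {
    "A0": ["A1", "A2"],
    "A1": ["a", "A3"],
    "A2": ["A4", "A3"],
    "A3": ["A4", "b"],
    "A4": ["b", "a"],
}

def pruned_leaves(start="A0"):
    """Labels of the leaves of the initial subtree T' in preorder."""
    seen, leaves = set(), []

    def visit(sym):
        if sym not in RHO or sym in seen:
            # A terminal, or a nonterminal that was already expanded earlier
            # in preorder: this node is a leaf of T'.
            leaves.append(sym)
            return
        seen.add(sym)
        for child in RHO[sym]:
            visit(child)

    visit(start)
    return leaves

print(pruned_leaves())   # ['a', 'b', 'a', 'b', 'A4', 'A3']
```

The result has $m+1=6$ entries and is a rearrangement of $\omega_{\mathcal{G}}=aA_{3}A_{4}bba$ .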

Recall the definition of the words $\rho_{\mathcal{G}_{t}}$ and $\omega_{\mathcal{G}_{t}}$ from Section 3.2. The word $\rho_{\mathcal{G}_{t}}$ can be obtained by writing down for every node $v$ of $T^{\prime}$ the labels of $v$ ’s children and then concatenating these labels. Moreover, the word $\omega_{\mathcal{G}_{t}}$ is obtained by writing down (in the right order) the labels of the leaves of $T^{\prime}$ . Note that $T^{\prime}$ has $m$ non-leaf nodes and $m+1$ leaves. Let $v_{1},v_{2},\ldots,v_{m+1}$ be the sequence of all leaves of $T^{\prime}$ (w.l.o.g. in preorder) and let $\alpha_{i}\in\Sigma\cup\{A_{1},\ldots,A_{m-1}\}$ be the label of $v_{i}$ . Let $\overline{\alpha}=(\alpha_{1},\alpha_{2},\ldots,\alpha_{m+1})$ . Then $\overline{\alpha}$ is a permutation of $\omega_{\mathcal{G}_{t}}$ . We therefore have $|\omega_{\mathcal{G}_{t}}|_{\alpha}=|\overline{\alpha}|_{\alpha}$ for every $\alpha\in\Sigma\cup\{A_{1},\ldots,A_{m-1}\}$ . Hence, $p_{\overline{\alpha}}$ and $p_{\omega_{\mathcal{G}_{t}}}$ are the same empirical distributions. For the TSLP from Example 5 we get $\overline{\alpha}=(a,b,a,b,A_{4},A_{3})$ . Let $s_{i}=\mathsf{val}_{\mathcal{G}_{t}}(\alpha_{i})\in\mathcal{T}\cup(\mathcal{C}\setminus\{x\})$ . Since $\mathsf{val}_{\mathcal{G}_{t}}(A_{i})\neq\mathsf{val}_{\mathcal{G}_{t}}(A_{j})$ for all $i\neq j$ ( $\mathcal{G}_{t}$ is in normal form) and $\mathsf{val}_{\mathcal{G}_{t}}(A_{i})\notin\Sigma$ for all $i$ (this holds for every normal form TSLP that produces a tree of size at least two), the tuple $\overline{s}=(s_{1},s_{2},\ldots,s_{m+1})$ satisfies for all $1\leq i\leq m+1$ :

[TABLE]

We define from $\mathcal{P}$ for every $z\in\mathcal{L}_{k}$ a modified tree process $\mathcal{P}_{z}=(P_{z,w})_{w\in\mathcal{L}}$ by setting

[TABLE]

for all $\tilde{a}\in\Sigma\times\{0,2\}$ . Note that the $k^{th}$ -order tree process $\mathcal{P}$ is obtained for $z=(\Box 0)^{k}$ for the fixed padding symbol $\Box\in\Sigma$ . We define a mapping $\tau:\mathcal{T}\cup\mathcal{C}\rightarrow[0,1]$ by

[TABLE]

for every $s\in\mathcal{T}\cup\mathcal{C}$ . Thus, for every $s\in(\mathcal{T}\cup\mathcal{C})\setminus\mathcal{T}_{1}$ , the function $\tau$ maximizes the values of the function $\mathsf{Prob}_{\mathcal{P}}$ associated with the $k^{th}$ -order tree process $\mathcal{P}=(P_{w})_{w\in\mathcal{L}_{k}}$ by choosing an optimal $k$ -history for the nodes of $s$ whose history is of length smaller than $2k$ . We show that $\tau$ satisfies

[TABLE]

In order to prove (16), first note that by definition of the tree/context $s_{u}$ , for each node $u$ of the derivation tree $T$ , the tree/context $s_{u}$ corresponds to a subtree/subcontext or a single inner node of the binary tree $t$ . We define a function $\chi$ which maps a node $u$ of the derivation tree $T$ to a node $\chi(u)\in V(t)\subseteq\{0,1\}^{*}$ : Intuitively, $\chi(u)$ is the root of the subtree/subcontext, respectively, the inner node of $t$ which corresponds to $s_{u}$ . Formally, $\chi$ is defined inductively as follows: For the root node $u$ of $T$ , we set $\chi(u)=\varepsilon$ . Furthermore, let $u$ be a non-leaf node of $T$ which is labeled with the non-terminal $A_{i}$ and for which $\chi(u)$ has been defined. Let $u_{1}$ be the left child and $u_{2}$ be the right child of $u$ in $T$ . We define $\chi(u_{1})=\chi(u)$ . The node $\chi(u_{2})$ is defined as follows:

(i)

If $r(A_{i})=A_{j}(\alpha)$ with $A_{j}\in V_{1}$ and $\alpha\in V\cup\Sigma$ , then we set $\chi(u_{2})=\chi(u)\omega(s_{u_{1}})$ (recall that $\omega(s_{u_{1}})\neq\varepsilon$ is the position of the parameter $x$ in the context $s_{u_{1}}=\mathsf{val}_{\mathcal{G}}(A_{j})$ ).

(ii)

If $r(A_{i})=a(\alpha,x)$ (respectively, $r(A_{i})=a(x,\alpha)$ ) for $a\in\Sigma$ and $\alpha\in\Sigma\cup V_{0}$ , then we define $\chi(u_{2})=\chi(u)0$ (respectively, $\chi(u_{2})=\chi(u)1$ ).

This yields a well-defined function $\chi$ mapping a node $u$ of $T$ to a node $\chi(u)\in V(t)$ . Let us define

[TABLE]

Then, the mapping

[TABLE]

is bijective. The definition of the sets $V_{u}$ implies that if two nodes $u$ and $v$ of $T$ are not in an ancestor-descendant relationship, then $V_{u}\cap V_{v}=\emptyset$ . Since the nodes $v_{1},\dots,v_{m+1}$ are the leaves of the initial subtree $T^{\prime}$ and hence not in an ancestor-descendant relationship, the sets $V_{i}:=V_{v_{i}}$ are disjoint subsets of $V(t)$ . For the TSLP from Example 5, the node sets $V_{1},V_{2},V_{3},V_{4},V_{5}$ and $V_{6}$ corresponding to the six leaves of the initial subtree depicted in Figure 3 (right) are shown in Figure 4. Note that if $s_{i}\notin\mathcal{T}_{1}$ then the bijection from (17) also preserves the $\lambda$ -mapping in the following sense:

[TABLE]

for every $w\in V(s_{i})$ . However, if $s_{i}\in\mathcal{T}_{1}$ then this statement can be wrong since the number of children is not preserved in general: If $s_{i}\in\mathcal{T}_{1}$ , then $s_{i}$ might correspond to a single inner node of $t$ . In this case, we have $V_{i}=\{\chi(v_{i})\}$ , $V(s_{i})=\{\varepsilon\}$ and $\lambda_{t}(\chi(v_{i}))=(a,2)$ for some $a\in\Sigma$ , but $\lambda_{s_{i}}(\varepsilon)=(a,0)$ . For example, in the TSLP from Example 5, the left-most leaf node of its initial subtree depicted in Figure 3 corresponds to the root node of the tree $\mathsf{val}(\mathcal{G})$ (see Figure 4). We define

[TABLE]

In our running example, we have $(s_{1},s_{2},s_{3},s_{4},s_{5},s_{6})=(a,b,a,b,b(x,a),b(b,a))$ and hence $\mathcal{I}:=\{5,6\}$ .

The history $h(\chi(v_{i})w)$ of a node $\chi(v_{i})w\in V_{i}$ with $w\in V(s_{i})$ in the tree $t$ is the concatenation of the history $h(\chi(v_{i}))$ of $\chi(v_{i})$ in $t$ and the history $h(w)$ of $w$ in the tree/context $s_{i}$ . Thus, if $i\in\mathcal{I}$ , we have

[TABLE]

For the inequality in the last line, note that every $k$ -history $\ell_{k}(zh(\chi(v_{i}))h(w))$ for $z\in\mathcal{L}_{k}$ is also of the form $\ell_{k}(z^{\prime}h(w))$ for some $z^{\prime}\in\mathcal{L}_{k}$ .

We can now show (16). Since $t\in\mathcal{T}_{n}$ with $n\geq 2$ we have

[TABLE]

Next, we define the function $\xi:\mathcal{T}\cup\mathcal{C}\setminus\{x\}\rightarrow[0,1]$ as follows:

[TABLE]

We get

[TABLE]

where ( $\ast$ ) follows from Lemmas 2 and 3 and $|\mathcal{L}_{k}|=2^{k}\sigma^{k}$ and ( $\ast\ast$ ) follows from the well-known fact that $\sum_{r\geq 1}r^{-2}=\pi^{2}/6$ . In particular, we have $\sum_{s\in\{s_{1},\ldots,s_{m+1}\}}\xi(s)\leq 1.$ Thus, with Shannon’s inequality (5), we obtain:

[TABLE]

With $\mathcal{I}_{0}=\{i\mid 1\leq i\leq m+1,s_{i}\in\mathcal{T}\}$ and $\mathcal{I}_{1}=\{i\mid 1\leq i\leq m+1,s_{i}\in\mathcal{C}\}$ we obtain

[TABLE]

by definition of $\xi.$ Using logarithmic identities, we get

[TABLE]

Using $|\mathcal{I}_{0}|+|\mathcal{I}_{1}|=m+1\leq 2m=2|\mathcal{G}_{t}|$ , $\log_{2}(\pi^{2}/6)|\mathcal{I}_{1}|\leq|\mathcal{I}_{1}|$ and $|s_{i}|+1\leq 2|s_{i}|$ , we obtain

[TABLE]

Equation (16) and $\tau(t)\geq\mathsf{Prob}_{\mathcal{P}}(t)$ yield

[TABLE]

Let us bound the sum $\sum_{i=1}^{m+1}\log_{2}|s_{i}|$ : Using Jensen’s inequality and Lemma 4 (which yields $\sum_{i=1}^{m+1}|s_{i}|\leq 2n$ ), we get

[TABLE]

and thus

[TABLE]

To bound the term $|\mathcal{G}_{t}|\log_{2}(n/|\mathcal{G}_{t}|)$ recall that for $n$ large enough we have $|\mathcal{G}_{t}|\leq\gamma\cdot n/\log_{\hat{\sigma}}n=\gamma\cdot n\cdot\log\hat{\sigma}/\log n$ by (12). Here, $\gamma$ is a constant. Since $\sigma\leq 2n-1$ there is a constant $\gamma^{\prime}\geq 1$ with $\gamma\cdot n/\log_{\hat{\sigma}}n\leq\gamma^{\prime}n$ . Since for every fixed $z\geq 1$ , the function $\phi(x)=x\log_{2}\left(\frac{z}{x}\right)$ is monotonically increasing for $0<x\leq\frac{z}{e}$ (where $e$ is Euler’s number), we get

[TABLE]
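The monotonicity of $\phi$ used in this step can be checked by a short calculation (a worked verification added for the reader, not part of the original proof):

$$\phi(x)=x\log_{2}\left(\frac{z}{x}\right)=\frac{x(\ln z-\ln x)}{\ln 2},\qquad\phi^{\prime}(x)=\frac{\ln z-\ln x-1}{\ln 2}=\log_{2}\left(\frac{z}{ex}\right).$$

Hence $\phi^{\prime}(x)\geq 0$ exactly when $x\leq z/e$ , which is the range stated above.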

With (20) we get

[TABLE]

which proves the lemma. ∎

Theorem 3.

For every $t\in\mathcal{T}_{n}$ and every $k\geq 0$ we have

[TABLE]

Proof.

Let $\mathcal{P}=(P_{w})_{w\in\mathcal{L}_{k}}$ be a $k^{th}$-order tree process with $\mathsf{Prob}_{\mathcal{P}}(t)>0$ . Lemmas 6 and 7 yield

[TABLE]

where the last equality uses the bound $|\mathcal{G}_{t}|\in\mathcal{O}(n/\log_{\hat{\sigma}}n)$ . Finally, by taking for $\mathcal{P}$ the empirical $k^{th}$-order tree process $\mathcal{P}^{t}$ , we get

[TABLE]

from Theorem 1. ∎

5. Extension to unranked trees

So far, we have only considered binary trees. In this section, we consider unranked, ordered trees, where the number of children of a node (also called its degree) can be any natural number and the children of every node are totally ordered. As before, each node is labeled by an element of some finite alphabet $\Sigma$ . Let us denote by $\mathcal{U}(\Sigma)$ (or simply $\mathcal{U}$ ) the set of all such trees. For technical reasons we also define forests which are ordered sequences of trees from $\mathcal{U}$ . The set of forests is denoted with $\mathcal{F}$ . The sets $\mathcal{U}$ and $\mathcal{F}$ can be inductively defined as the smallest sets of strings over the alphabet $\Sigma\cup\{(,)\}$ such that the following conditions hold:

•

$\varepsilon\in\mathcal{F}$ (this is the empty forest),

•

if $a\in\Sigma$ and $f\in\mathcal{F}$ then $a(f)\in\mathcal{U}$ ,

•

if $t\in\mathcal{U}$ and $f\in\mathcal{F}$ then $tf\in\mathcal{F}$ .

The singleton tree $a()$ (which is obtained by taking $f=\varepsilon$ in the second point) is usually written as $a$ . Note that $\mathcal{U}\subseteq\mathcal{F}$ and that $\mathcal{F}=\mathcal{U}^{*}$ . The size $|f|$ of $f\in\mathcal{F}$ is the number of occurrences of $\Sigma$ -labels in $f$ ; formally: $|\varepsilon|=0$ , $|a(f)|=1+|f|$ and $|tf|=|t|+|f|$ for $a\in\Sigma$ , $t\in\mathcal{U}$ , and $f\in\mathcal{F}$ .

The first-child/next-sibling encoding transforms a forest $f\in\mathcal{F}$ into a binary tree $\text{fcns}(f)\in\mathcal{T}$ . It is defined inductively as follows (recall that $\Box\in\Sigma$ is a fixed distinguished symbol in $\Sigma$ ):

•

$\text{fcns}(\varepsilon)=\Box$ and

•

$\text{fcns}(a(f)g)=a(\text{fcns}(f),\text{fcns}(g))$ for $f,g\in\mathcal{F}$ and $a\in\Sigma$ .

Thus, the left (resp., right) child of a node in $\text{fcns}(f)$ is the first child (resp., right sibling) of the node in $f$ or a $\Box$ -labeled leaf if it does not exist.

Example 8.

If $f=a(bc)d(e)$ then

$$\text{fcns}(f)=a(b(\Box,c(\Box,\Box)),d(e(\Box,\Box),\Box)),$$

see also Figure 5.
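The inductive definition of the first-child/next-sibling encoding can be sketched in a few lines of Python. The representation is an assumption made only for this illustration: an unranked tree is a pair `(label, children)`, a forest is a list of such trees, and a binary tree is either the leaf `BOX` or a triple `(label, left, right)`.

```python
from typing import List, Tuple

# Assumed representation (not from the paper): an unranked tree is a pair
# (label, children) and a forest is a list of such trees.
Tree = Tuple[str, list]
BOX = "□"  # stands for the distinguished padding symbol


def fcns(forest: List[Tree]):
    """First-child/next-sibling encoding of a forest as a nested binary tree."""
    if not forest:                       # fcns(ε) = □
        return BOX
    (label, children), rest = forest[0], forest[1:]
    # fcns(a(f)g) = a(fcns(f), fcns(g))
    return (label, fcns(children), fcns(rest))


def leaves(t) -> int:
    """Size of a binary tree, measured as its number of leaves."""
    return 1 if isinstance(t, str) else leaves(t[1]) + leaves(t[2])


# The forest f = a(bc)d(e) from Example 8.
f = [("a", [("b", []), ("c", [])]), ("d", [("e", [])])]
encoded = fcns(f)
```

For this forest with five labeled nodes, `leaves(encoded)` is six, in line with the count of $n$ internal nodes and $n+1$ leaves noted after the example.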

Note that if $t\in\mathcal{U}$ , $|t|=n$ then $\text{fcns}(t)$ is a binary tree with $n$ internal nodes. Hence we have $|\text{fcns}(t)|=n+1$ (which is the number of leaves of $\text{fcns}(t)$ ). We define the $k^{th}$ -order empirical entropy of an unranked tree $t\in\mathcal{U}$ as $H_{k}(t)=H_{k}(\text{fcns}(t))$ . Note that this definition is independent of the choice of the symbol $\Box\in\Sigma$ . From Theorem 3, we immediately obtain:

Theorem 4.

For every $t\in\mathcal{U}$ with $|t|=n$ and every $k\geq 0$ we have

[TABLE]

The above definition of the $k^{th}$ -order empirical entropy of an unranked tree can be also applied to binary trees $t$ (a binary tree can be viewed as a particular unranked tree). This yields $H_{k}(\text{fcns}(t))$ and leads to the question how this value relates to $H_{k}(t)$ (the $k^{th}$ -order empirical entropy of $t$ as defined before in (10)). In one direction, we have the following bound:

Lemma 8.

Let $t\in\mathcal{T}(\Sigma)$ denote a binary tree with first-child next-sibling encoding $\text{fcns}(t)\in\mathcal{T}(\Sigma)$ . Then $H_{2k}(\text{fcns}(t))\leq H_{k-1}(t)$ for $1\leq k\leq|t|$ .

The somewhat technical proof of Lemma 8 can be found in Appendix B. In contrast to Lemma 8, there are families of binary trees $t_{n}$ where $H_{k}(\text{fcns}(t_{n}))$ is exponentially smaller than $H_{k}(t_{n})$ for every $n\geq 1$ and $k\geq 2$ : Define $t_{n}$ inductively by $t_{1}=a$ and $t_{n}=a(c,t_{n-1})$ if $n$ is even and $t_{n}=a(b,t_{n-1})$ if $n$ is odd. Thus, $t_{n}$ denotes a right-degenerate binary tree of size $n$ , whose inner nodes and right-most leaf are labeled with $a$ and whose leaves except for the right-most leaf are alternately labeled $b$ and $c$ . We get $H_{k}(t_{n})\in\Theta(n-k)$ : there are $n-k$ many nodes $v$ with $k$ -history $(a1)^{k-1}a0$ , and about half of them are $b$ -labeled leaves, while the other half are $c$ -labeled leaves. Moreover, we have $H_{k}(\text{fcns}(t_{n}))\in\Theta(\log(n-k))$ : the fcns-encodings of the binary trees $t_{n}$ can be inductively defined by $\text{fcns}(t_{1})=a(\Box,\Box)$ and $\text{fcns}(t_{n})=a(c(\Box,\text{fcns}(t_{n-1})),\Box)$ if $n$ is even and $\text{fcns}(t_{n})=a(b(\Box,\text{fcns}(t_{n-1})),\Box)$ if $n$ is odd. Intuitively, as the labels $b$ and $c$ are thus incorporated in $k$ -histories of nodes of $\text{fcns}(t_{n})$ , we can thus determine the label of a node from its $k$ -history for $k\geq 2$ for most nodes of $\text{fcns}(t_{n})$ .
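The gap between $H_{k}(t_{n})$ and $H_{k}(\text{fcns}(t_{n}))$ can be observed numerically. The following sketch assumes the unnormalized definition $H_{k}(t)=\sum_{z}\sum_{\tilde{a}}m_{z,\tilde{a}}\log_{2}(m_{z}/m_{z,\tilde{a}})$ with histories padded by $(\Box 0)^{k}$ ; the tuple representation of trees and all function names are choices made for this illustration only.

```python
import math
from collections import Counter

BOX = "□"  # stands for the padding symbol


def tree_entropy(tree, k):
    """Unnormalized k-th order empirical entropy of a binary tree.

    A tree is a leaf (a label string) or a triple (label, left, right).
    A history is stored as a tuple of (ancestor label, direction) pairs.
    """
    counts = Counter()                   # (k-history, (label, arity)) -> count
    stack = [(tree, ((BOX, 0),) * k)]    # root history is the padding only
    while stack:
        node, hist = stack.pop()
        if isinstance(node, str):
            counts[(hist, (node, 0))] += 1
            continue
        label, left, right = node
        counts[(hist, (label, 2))] += 1
        for direction, child in ((0, left), (1, right)):
            # keep only the last k (label, direction) pairs
            new_hist = (hist + ((label, direction),))[-k:] if k > 0 else ()
            stack.append((child, new_hist))
    totals = Counter()
    for (hist, _), c in counts.items():
        totals[hist] += c
    return sum(c * math.log2(totals[hist] / c) for (hist, _), c in counts.items())


def t(n):
    """The right-degenerate tree t_n defined in the text."""
    tree = "a"
    for i in range(2, n + 1):
        tree = ("a", "c" if i % 2 == 0 else "b", tree)
    return tree


def fcns_t(n):
    """fcns(t_n), built from the inductive description given in the text."""
    tree = ("a", BOX, BOX)
    for i in range(2, n + 1):
        tree = ("a", ("c" if i % 2 == 0 else "b", BOX, tree), BOX)
    return tree


h_t = tree_entropy(t(200), 2)
h_fcns = tree_entropy(fcns_t(200), 2)
```

For $n=200$ and $k=2$ the first value grows with the number of alternating $b$ / $c$ leaves, while the second stays small because almost every node label of $\text{fcns}(t_{n})$ is determined by its $2$ -history.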

Our definition of the $k^{th}$ -order empirical entropy of an unranked tree via the fcns-encoding has a practical motivation. Unranked trees occur for instance in the context of XML, where the hierarchical structure of a document is represented as an unranked node labeled tree. In this setting, the label of a node quite often depends on (i) the labels of the ancestor nodes and (ii) the labels of the (left) siblings. This dependence is captured by our definition of the $k^{th}$ -order empirical entropy.

We also confirmed this intuition by experimental data (shown in Table 1) with real XML document trees (ignoring textual data at the leaves) showing that in these cases the $k^{th}$-order empirical entropy is indeed very small compared to the worst-case bit size. More precisely, we computed for 21 real XML document trees (all data are available from http://xmlcompbench.sourceforge.net/Dataset.html) the $k^{th}$-order empirical entropy (for $k=1,2,4,8$ ) and divided the value by the worst-case bit length $2n+\log_{2}(\sigma)n$ , where $n$ is the number of nodes and $\sigma$ is the number of node labels [15].
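As a small arithmetic check, the normalizing quantity $w=(2+\log_{2}\sigma)n$ from Table 1 can be recomputed directly. The function name below is hypothetical; the input values are those listed for the Baseball document.

```python
import math


def worst_case_bits(n: int, sigma: int) -> float:
    """Worst-case bit length (2 + log2(sigma)) * n used to normalize Table 1."""
    return (2 + math.log2(sigma)) * n


# Baseball document from Table 1: n = 28306 nodes, sigma = 46 labels.
w_baseball = worst_case_bits(28306, 46)

# H_1 / w is reported as 2.9818 %, so H_1 is roughly this many bits.
h1_baseball = 0.029818 * w_baseball
```

The computed `w_baseball` agrees with the table entry 212 961.9447 up to rounding.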

Our experimental results combined with our entropy bound (1) for grammar-based compression are in accordance with the fact that grammar-based tree compressors yield impressive compression ratios for XML document trees, see e.g. [24]. Some of the XML documents from our experiments were also used in [24], where the performance of the grammar-based tree compressor TreeRePair was tested. An interesting observation is that the XML trees for which our $k^{th}$-order empirical entropy is large are indeed the XML trees with the worst compression ratio for TreeRePair in [24]. This is in particular true for the Treebank document, see Table 1. For Treebank, TreeRePair obtained a compression ratio of around 20%, whereas for all other documents tested in [24] it achieved a compression ratio below 8%.

6. String straight-line programs versus higher-order empirical entropy of strings

Our definition of $k^{th}$-order empirical entropy does not capture all regularities that can be exploited in grammar-based compression. Take for instance a complete unlabeled binary tree $t_{n}$ of height $n$ (all paths from the root to a leaf have length $n$ ). This tree has $2^{n}$ leaves and is very well compressible: its minimal DAG has only $n+1$ nodes, hence there also exists a TSLP of size $n+1$ for $t_{n}$ . But for every fixed $k$ the $k^{th}$-order empirical entropy of $t_{n}$ divided by its size $2^{n}$ converges to $2$ (the trivial upper bound) for $n\to\infty$ . The reason is that, if $n\gg k$ , then for every $k$ -history $z$ the number of leaves with $k$ -history $z$ is roughly the same as the number of internal nodes with $k$ -history $z$ . Hence, although $t_{n}$ is highly compressible with TSLPs (and even DAGs), its $k^{th}$-order empirical entropy is close to the maximal value. We show in the following that the same phenomenon occurs for grammar-based string compression and the well-established empirical entropy of strings.
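The claim about the minimal DAG can be illustrated by counting distinct subtrees. This is a minimal sketch under an assumed representation (a leaf is the empty tuple, an inner node is a pair of subtrees); it visits every node, so it is only meant for small heights.

```python
def complete_tree(height):
    """Complete unlabeled binary tree; a leaf is (), an inner node is a pair."""
    t = ()
    for _ in range(height):
        t = (t, t)
    return t


def count_leaves(t):
    """Number of leaves of the tree."""
    return 1 if len(t) == 0 else count_leaves(t[0]) + count_leaves(t[1])


def minimal_dag_size(t):
    """Number of pairwise distinct subtrees, i.e. nodes of the minimal DAG."""
    ids = {}  # maps the tuple of child ids to the id of the subtree

    def visit(node):
        key = tuple(visit(child) for child in node)  # () for a leaf
        return ids.setdefault(key, len(ids))

    visit(t)
    return len(ids)


tree = complete_tree(10)
```

For height 10 the tree has 1024 leaves, while the DAG has one node per height level, that is 11 nodes.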

The $k^{th}$ -order empirical entropy of a string is defined as follows (see e.g. [9]). Let $\Sigma$ denote a finite alphabet and let $w\in\Sigma^{*}$ . For a non-empty string $\alpha\in\Sigma^{+},$ define $w(\alpha)\in\Sigma^{*}$ as the string whose $i^{th}$ symbol is the symbol in $w$ immediately following the $i^{th}$ occurrence of the string $\alpha$ in $w$ . Thus, if $\alpha$ is not a suffix of $w$ , the length of $w(\alpha)$ is equal to the number of occurrences of the string $\alpha$ in $w$ . In case $\alpha$ is a suffix of $w$ , $|w(\alpha)|$ is the number of occurrences of $\alpha$ in $w$ minus one. Recall the definition of the unnormalized empirical entropy $H(w)$ of a string $w\in\Sigma^{+}$ (or tuple) from Section 2.1. For an integer $k\geq 1$ , the $k^{th}$ -order (unnormalized) empirical entropy of a string $w\in\Sigma^{+}$ is defined as

$$H_{k}(w)=\sum_{\alpha\in\Sigma^{k}}H(w(\alpha)),$$

where we set $H(\varepsilon)=0$ . For $k=0$ , $H_{0}(w)=H(w)$ is the (unnormalized) empirical entropy of $w$ .
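The definition can be turned into a short computation. The sketch below assumes the unnormalized empirical entropy $H(w)=\sum_{a}|w|_{a}\log_{2}(|w|/|w|_{a})$ and sums $H(w(\alpha))$ over all contexts $\alpha$ of length $k$ that occur in $w$ ; the function names are chosen for this illustration.

```python
import math
from collections import Counter, defaultdict


def H(w):
    """Unnormalized empirical entropy: sum of |w|_a * log2(|w| / |w|_a)."""
    n = len(w)
    return sum(c * math.log2(n / c) for c in Counter(w).values())


def H_k(w, k):
    """Unnormalized k-th order empirical entropy of the string w."""
    if k == 0:
        return H(w)
    followers = defaultdict(list)        # context alpha -> symbols of w(alpha)
    for i in range(len(w) - k):          # only occurrences followed by a symbol
        followers[w[i:i + k]].append(w[i + k])
    return sum(H(f) for f in followers.values())
```

For example, `H_k("abababab", 1)` is zero because every symbol is determined by its predecessor, whereas `H_k("aabb", 0)` equals four bits.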

A straight-line program (SLP) for a string $w$ is a context-free grammar that produces only the string $w$ . The size of an SLP is the sum of the lengths of the right-hand sides of the production rules of the context-free grammar, see e.g. [22] for details. We prove that for each $n\geq 1$ there exists a string of length $2^{n+1}-1$ , which is highly compressible with SLPs, but whose $k^{th}$ -order empirical entropy is close to the maximum.

Theorem 5.

There exists a family of strings $(S_{n})_{n}$ ( $n\geq 1$ ) over a binary alphabet with the following properties:

•

$|S_{n}|=2^{n+1}-1$ ,

•

there exists an SLP of size $3n$ for $S_{n}$ , and

•

$H_{k}(S_{n})\geq 2^{n+1-k}(1-o(1))$ for $k\in o(n)$ .

Proof.

We inductively define a string $S_{n}\in\{a,b\}^{*}$ for $n\geq 1$ as follows: We set

•

$S_{1}=baa$ and

•

$S_{n}=bS_{n-1}S_{n-1}$ .

We have $|S_{n}|=2^{n+1}-1$ . The string $S_{n}$ corresponds to the preorder traversal of the perfect binary tree of size $2^{n}$ , whose internal nodes are labeled with the symbol $b$ and whose leaves are labeled with the symbol $a$ . The recursive definition of $S_{n}$ directly translates to an SLP for $S_{n}$ of size $3n$ (there is a nonterminal for each $S_{i}$ with $1\leq i\leq n$ and each rule has three symbols on the right-hand side according to the recursive definition).
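The construction of $S_{n}$ and of the corresponding SLP can be sketched as follows. The nonterminal names `X1`, `X2`, ... are hypothetical; the rules mirror the recursive definition, one nonterminal per $S_{i}$ .

```python
def S(n):
    """The string S_n: S_1 = baa and S_n = b S_{n-1} S_{n-1}."""
    s = "baa"
    for _ in range(2, n + 1):
        s = "b" + s + s
    return s


def slp_rules(n):
    """Rules X1 -> b a a and Xi -> b X(i-1) X(i-1) for 2 <= i <= n."""
    rules = {"X1": ["b", "a", "a"]}
    for i in range(2, n + 1):
        rules[f"X{i}"] = ["b", f"X{i - 1}", f"X{i - 1}"]
    return rules


def expand(rules, symbol):
    """String derived from a symbol; terminals expand to themselves."""
    if symbol not in rules:
        return symbol
    return "".join(expand(rules, s) for s in rules[symbol])


n = 6
rules = slp_rules(n)
slp_size = sum(len(rhs) for rhs in rules.values())
```

Here `slp_size` is $3n$ , while the derived string has length $2^{n+1}-1$ , so the SLP is exponentially smaller than the string it produces.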

It remains to show that $H_{k}(S_{n})\geq 2^{n+1-k}(1-o(1))$ for $k\in o(n)$ . We start with the case $k=0$ . Recall that $|w|_{x}$ denotes the number of occurrences of a symbol $x$ in a string $w$ , as defined in Section 2. We have $|S_{n}|_{a}=2^{n}$ and $|S_{n}|_{b}=2^{n}-1$ , which yields

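$$H(S_{n})=2^{n}\log_{2}\left(\frac{2^{n+1}-1}{2^{n}}\right)+(2^{n}-1)\log_{2}\left(\frac{2^{n+1}-1}{2^{n}-1}\right).$$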

Define the function $g:[2,\infty)\to\mathbb{R}$ by

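$$g(x)=\frac{x}{2x-1}\log_{2}\left(\frac{2x-1}{x}\right)+\frac{x-1}{2x-1}\log_{2}\left(\frac{2x-1}{x-1}\right).$$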

It converges to $1$ from below for $x\to\infty$ . Since $|S_{n}|=2^{n+1}-1$ we have $H(S_{n})=g(2^{n})|S_{n}|\geq 2^{n+1}(1-o(1))$ .

Let us now consider the case $k\geq 1$ and let $1\leq m\leq n$ . By construction of $S_{n}$ , the last symbol of $S_{n}$ is $a$ . Therefore, the length of the string $S_{n}(b^{m})$ equals the number of occurrences of the string $b^{m}$ in $S_{n}$ . In order to lower-bound the $k^{th}$ -order empirical entropy of $S_{n}$ , we first show inductively in $n$ , that

$$|S_{n}(b^{m})|=2^{n-m+1}-1\qquad(21)$$

for $1\leq m\leq n$ : For the base case, let $n=1$ . We have $S_{1}=baa$ and thus, $|S_{1}(b)|=1$ . For the induction step, let $n>1$ . By definition of $S_{n}$ , we have $S_{n}=bS_{n-1}S_{n-1}$ . By the induction hypothesis, we have $|S_{n-1}(b^{m})|=2^{n-m}-1$ for $1\leq m\leq n-1$ . Moreover, $b^{n}$ does not occur in $S_{n-1}$ (which follows by induction), i.e., $|S_{n-1}(b^{n})|=0=2^{n-n}-1$ . By construction, the last symbol of the string $S_{n-1}$ is $a$ . Thus, for all $1\leq m\leq n$ we have $|S_{n-1}S_{n-1}(b^{m})|=2|S_{n-1}(b^{m})|=2^{n-m+1}-2$ . Hence, as the string $b^{m}$ with $1\leq m\leq n$ occurs additionally as a prefix of the string $S_{n}=bS_{n-1}S_{n-1}$ , the number of occurrences of $b^{m}$ in $S_{n}$ in total is $|S_{n}(b^{m})|=2^{n-m+1}-1$ for every $1\leq m\leq n$ . This proves (21).

Next, we count the number of occurrences of $b^{m}$ in $S_{n}$ , which are followed by the symbol $a$ , that is, we count $|S_{n}(b^{m})|_{a}$ . We show inductively in $n$ , that

$$|S_{n}(b^{m})|_{a}=2^{n-m}$$

for $1\leq m\leq n$ : For the base case, let $n=1$ . As $S_{1}=baa$ , we have $|S_{1}(b)|_{a}=1$ . For the induction step, let $n>1$ . By the induction hypothesis, we have $|S_{n-1}(b^{m})|_{a}=2^{n-1-m}$ for $1\leq m\leq n-1$ . As $S_{n-1}$ ends with $a$ , we obtain $|S_{n-1}S_{n-1}(b^{m})|_{a}=2^{n-m}$ for $1\leq m\leq n-1$ . Moreover, the construction of $S_{n}$ implies that the prefix $b^{n}$ of $S_{n}$ , which is the only occurrence of $b^{n}$ in $S_{n}$ , is followed by the symbol $a$ . Thus, $|S_{n}(b^{m})|_{a}=2^{n-m}$ for $1\leq m\leq n$ , which proves the claim.
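Both counting claims can be checked by brute force for small $n$ . The helper `followers` below is a hypothetical name for the map $w\mapsto w(\alpha)$ ; overlapping occurrences of $\alpha$ are counted, as in the definition.

```python
def S(n):
    """The string S_n: S_1 = baa and S_n = b S_{n-1} S_{n-1}."""
    s = "baa"
    for _ in range(2, n + 1):
        s = "b" + s + s
    return s


def followers(w, alpha):
    """The string w(alpha): symbols directly following each occurrence of alpha."""
    k = len(alpha)
    return "".join(w[i + k] for i in range(len(w) - k) if w.startswith(alpha, i))


n = 8
w = S(n)
checks = []
for m in range(1, n + 1):
    f = followers(w, "b" * m)
    checks.append((len(f) == 2 ** (n - m + 1) - 1,   # number of occurrences of b^m
                   f.count("a") == 2 ** (n - m)))    # occurrences followed by a
```

Every pair in `checks` consists of two true values, matching the two identities proved by induction above.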

As $|S_{n}(b^{m})|_{a}=2^{n-m}$ , we have $|S_{n}(b^{m})|_{b}=2^{n-m}-1$ . Thus, we obtain the following lower bound for the $k^{th}$ -order empirical entropy of $S_{n}$ for $k\in o(n)$ .

[TABLE]

This proves the theorem. ∎

Appendix A Histories of length smaller than $k$

In order to define $k^{th}$ -order empirical entropy for binary trees, there are basically three possibilities how to deal with nodes whose history is shorter than $2k$ :

(i)

pad the histories with a fixed dummy symbol $\Box\in\Sigma$ and direction $i\in\{0,1\}$ ,

(ii)

allow histories of length smaller than $2k$ , or, equivalently, pad the histories with a fixed dummy symbol $\diamond\notin\Sigma$ and direction $i\in\{0,1\}$ , or

(iii)

ignore nodes whose history is of length smaller than $2k$ .

Recall that in the main text we used the variant (i) with $i=0$ . In this subsection, we show that the above three variants are basically equivalent if $k$ is small compared to the size of the binary tree.

Fix an integer $k\geq 1$ . Recall that in Section 2.2.4 we defined for a tree $t$ , a $k$ -history $z\in\mathcal{L}_{k}$ , and $\tilde{a}\in\Sigma\times\{0,2\}$ the numbers $m^{t}_{z}=|V_{z}(t)|$ and $m^{t}_{z,\tilde{a}}=|\{v\in V_{z}(t)\mid\lambda(v)=\tilde{a}\}|$ . The tree $t$ will be fixed in this section; hence we will write $m_{z}$ and $m_{z,\tilde{a}}$ in the following. We define several variants of these numbers.

For a $k$ -history $z\in\mathcal{L}_{k}$ and $\tilde{a}\in\Sigma\times\{0,2\}$ we define:

[TABLE]

We have $m^{\scriptscriptstyle{<}}\leq 2^{k}-1$ and $m^{\scriptscriptstyle{<}}\geq 2k-1$ if $|t|\geq k$ . Also note that $m_{z}=m_{z}^{\scriptscriptstyle{<}}+m_{z}^{\scriptscriptstyle{\geq}}$ and $\sum_{z\in\mathcal{L}_{k}}m_{z}^{\scriptscriptstyle{<}}=m^{\scriptscriptstyle{<}}$ and $\sum_{z\in\mathcal{L}_{k}}m_{z}^{\scriptscriptstyle{\geq}}=2|t|-1-m^{\scriptscriptstyle{<}}$ .

Fix a fresh symbol $\diamond\notin\Sigma$ and let $\mathcal{L}^{\diamond}=((\Sigma\cup\{\diamond\})\{0,1\})^{*}$ and $\mathcal{L}_{k}^{\diamond}=\{w\in\mathcal{L}^{\diamond}\mid|w|=2k\}$ . Clearly, $\mathcal{L}\subseteq\mathcal{L}^{\diamond}$ and $\mathcal{L}_{k}\subseteq\mathcal{L}_{k}^{\diamond}$ . Let $\ell_{k}:\mathcal{L}^{\diamond}\rightarrow\mathcal{L}_{k}^{\diamond}$ denote the partial function mapping a string $z\in\mathcal{L}^{\diamond}$ with $|z|\geq 2k$ to the suffix of $z$ of length $2k$ . For a binary tree $t$ and a node $v\in V(t)$ , define $h_{k}^{\diamond}(v)=\ell_{k}((\diamond 0)^{k}h(v))$ . Note that $h_{k}^{\diamond}(v)=h_{k}(v)$ for nodes $v\in V(t)$ with $|v|\geq k$ . Finally, for $z\in\mathcal{L}_{k}^{\diamond}$ and $\tilde{a}\in\Sigma\times\{0,2\}$ we define

[TABLE]

Using the above numbers, we can define three natural variations of the $k^{th}$ -order empirical entropy of a binary node-labeled tree $t$ :

(i)

Padding histories of length shorter than $2k$ with $\Box\in\Sigma$ and $i\in\{0,1\}$ yields the definition of $k^{th}$ -order empirical entropy from Section 2 (for $i=0$ ):

[TABLE]

(ii)

Padding histories of length shorter than $2k$ with $\diamond\notin\Sigma$ and $i=0$ yields

[TABLE]

This is equivalent to allowing histories of length shorter than $2k$ : By padding with a symbol $\diamond\notin\Sigma$ , we have $h_{k}^{\diamond}(v_{1})=h_{k}^{\diamond}(v_{2})$ if and only if $h(v_{1})=h(v_{2})$ for nodes $v_{1},v_{2}\in V(t)$ with $|v_{1}|,|v_{2}|<k$ .

(iii)

Ignoring nodes whose history is of length smaller than $2k$ yields

[TABLE]

We can now show that these three approaches are basically equivalent:

Theorem 6.

For every $k\geq 1$ and every binary tree $t$ , we have the following:

[TABLE]

Proof.

First, note that

[TABLE]

as the inner sum is the Shannon entropy $H(P)$ of the probability distribution $P:\Sigma\times\{0,2\}\rightarrow[0,1]$ given by $P(\tilde{a})=m_{z,\tilde{a}}^{\scriptscriptstyle{<}}/m_{z}^{\scriptscriptstyle{<}}$ (and hence $H(P)\leq\log_{2}(2\sigma)=1+\log_{2}\sigma$ ) and as $\sum_{z\in\mathcal{L}_{k}}m_{z}^{\scriptscriptstyle{<}}=m^{\scriptscriptstyle{<}}$ . Analogously, we get

[TABLE]

We start with upper-bounding $|H_{k}(t)-H_{k}^{\scriptscriptstyle{\geq}}(t)|$ : By the log-sum inequality (Lemma 1) and (22), we get

[TABLE]

Moreover, we find

[TABLE]

by the log-sum inequality (Lemma 1) and our estimate from (22). We have

[TABLE]

which follows immediately from the mean-value theorem: as a consequence of the mean-value theorem, for every mapping $f:[a,b]\rightarrow\mathbb{R}$ , which is differentiable on $[a,b]$ , we have

[TABLE]

With $f(x)=\log_{2}(x)$ , $a=2|t|-1-m^{\scriptscriptstyle{<}}$ and $b=2|t|-1$ and by logarithmic identities, we obtain the estimate (24). Thus, we have:

[TABLE]

Next, we upper-bound $|H_{k}^{\scriptscriptstyle{\geq}}(t)-H_{k}^{\diamond}(t)|$ : From the definitions of $H_{k}^{\scriptscriptstyle{\geq}}(t)$ and $H_{k}^{\diamond}(t)$ , we get

[TABLE]

As the second sum on the right-hand side is between $0$ and $m^{\scriptscriptstyle{<}}(1+\log_{2}\sigma)$ (see (23)), we get $|H_{k}^{\scriptscriptstyle{\geq}}(t)-H_{k}^{\diamond}(t)|\leq m^{\scriptscriptstyle{<}}(1+\log_{2}\sigma)$ .

Finally, as $H_{k}^{\diamond}(t)\geq H_{k}^{\scriptscriptstyle{\geq}}(t)$ and $H_{k}(t)\geq H_{k}^{\scriptscriptstyle{\geq}}(t)$ , we have

[TABLE]

This proves the theorem. ∎

Theorem 6 moreover shows that the choice of the symbol $\Box\in\Sigma$ used for padding the histories only affects the value of the $k^{th}$ -order empirical entropy by an additive term of at most $m^{\scriptscriptstyle{<}}(1+\log_{2}\sigma+1/\ln(2))+m^{\scriptscriptstyle{<}}\log_{2}((2|t|-1)/m^{\scriptscriptstyle{<}})$ .

Appendix B Proof of Lemma 8

Fix a binary tree $t\in\mathcal{T}(\Sigma)$ . By definition of the first-child next-sibling encoding, every inner node of $\text{fcns}(t)$ corresponds in a bijective manner to a node of $t$ : For an inner node $v$ of $\text{fcns}(t)$ , let $\text{fcns}^{-1}(v)$ denote the corresponding node of $t$ and let $\text{fcns}(v)$ denote the corresponding inner node of $\text{fcns}(t)$ of a node $v$ of $t$ . If $v$ is a node of $t$ , then we obtain $h(\text{fcns}(v))$ as follows: If $v=\varepsilon$ , then $h(\text{fcns}(v))=\varepsilon$ . Moreover, if $v$ is a left child of a node $\operatorname{parent}(v)$ with label $a\in\Sigma$ , then $h(\text{fcns}(v))=h(\text{fcns}(\operatorname{parent}(v)))a0$ (and $h(v)=h(\operatorname{parent}(v))a0$ ). Finally, if $v$ is a right child of a node $\operatorname{parent}(v)$ with label $a\in\Sigma$ and $v$ ’s left sibling has label $a^{\prime}\in\Sigma$ , then $h(\text{fcns}(v))=h(\text{fcns}(\operatorname{parent}(v)))a0a^{\prime}1$ (and $h(v)=h(\operatorname{parent}(v))a1$ ). Thus, we are also able to determine $h(\text{fcns}^{-1}(v))$ from $h(v)$ for every inner node $v$ of $\text{fcns}(t)$ : locating every occurrence of a pattern of the form $0a1$ with $a\in\Sigma$ in the string $h(v)$ and replacing it by $1$ yields $h(\text{fcns}^{-1}(v))$ .
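The translation between the two kinds of histories can be sketched as follows. As an assumption of this illustration, a history is stored as a list of (label, direction) pairs, so the pattern $0a^{\prime}1$ appears as a pair with direction 0 followed by a pair $(a^{\prime},1)$ .

```python
def history_in_t(fcns_history):
    """Map h(v) for an inner node v of fcns(t) to the history of fcns^{-1}(v) in t.

    Histories are lists of (label, direction) pairs.
    """
    result = []
    for label, direction in fcns_history:
        if direction == 1:
            # Pattern "0 a' 1": drop the sibling label a' and turn the
            # preceding direction 0 into direction 1.
            parent_label, _ = result.pop()
            result.append((parent_label, 1))
        else:
            result.append((label, 0))
    return result


# Node reached in t via "right child of an a-labeled root, then left child of
# an x-labeled node", where the left sibling of the x-labeled node is labeled b.
h_fcns = [("a", 0), ("b", 1), ("x", 0)]
h_t = history_in_t(h_fcns)
```

The result `h_t` is the history $a1x0$ , and its length is at least half the length of the input, as used in the next paragraph.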

In particular, we have $|h(\text{fcns}(v))|\leq 2|h(v)|$ for every node $v$ of $t$ , respectively, $|h(\text{fcns}^{-1}(v))|\geq 1/2|h(v)|$ for every inner node $v$ of $\text{fcns}(t)$ . Moreover, for every inner node $v$ of $\text{fcns}(t)$ , we can uniquely determine $h_{k}(\text{fcns}^{-1}(v))$ from $h_{2k}(v)$ . Thus, we are also able to determine $h_{k-1}(\text{fcns}^{-1}(v))$ from $h_{2k-1}(v)$ for every inner node $v$ of $\text{fcns}(t)$ . Let

[TABLE]

denote the set of $m$ -histories that appear as $m$ -history of an inner node of $\text{fcns}(t)$ . We define a mapping $\varphi:\mathcal{L}_{2k}(\text{fcns}(t))\rightarrow\mathcal{L}_{k}$ by $\varphi(h_{2k}(v))=h_{k}(\text{fcns}^{-1}(v))$ , which maps the $2k$ -history of an inner node of $\text{fcns}(t)$ to the $k$ -history of the corresponding node in $t$ : By the above considerations, this mapping is well-defined. Furthermore, we define a mapping $\pi:\mathcal{L}_{2k-1}(\text{fcns}(t))\rightarrow\mathcal{L}_{k-1}$ by $\pi(h_{2k-1}(v))=h_{k-1}(\text{fcns}^{-1}(v))$ . Again, by the above considerations, this mapping is well-defined, as we are able to determine $h_{k-1}(\text{fcns}^{-1}(v))$ from $h_{2k-1}(v)$ .

For $m\geq 2$ we partition $\mathcal{L}_{m}$ into the following disjoint subsets:

[TABLE]

Moreover, we define $\mathcal{L}_{2k}^{s}(\text{fcns}(t))=\mathcal{L}_{2k}^{s}\cap\mathcal{L}_{2k}(\text{fcns}(t))$ for $s\in\{0,01,11\}$ . We observe the following:

(i)

If $h_{2k}(v)\in\mathcal{L}_{2k}^{11}$ for a node $v$ of $\text{fcns}(t)$ , then $v$ is a $\Box$ -labeled leaf of $\text{fcns}(t)$ : As $t$ is a binary tree, the right sibling of a node has no right sibling. Thus, there are no inner nodes $v$ in $\text{fcns}(t)$ with $h_{2k}(v)\in\mathcal{L}_{2k}^{11}$ .

(ii)

If $h_{2k}(v)\in\mathcal{L}_{2k}^{01}$ for a node $v$ of $\text{fcns}(t)$ , then $v$ is an inner node of $\text{fcns}(t)$ : This follows again from the fact that $t$ is a binary tree (and hence does not have unary nodes).

(iii)

If $h_{2k}(v)\in\mathcal{L}_{2k}^{0}$ for a node $v$ of $\text{fcns}(t)$ , then $v$ can be an inner node or a leaf of $\text{fcns}(t)$ . If $v$ is a leaf, then its label is the fixed dummy symbol $\Box\in\Sigma$ .

(iv)

For every $i\in\{0,1\}$ and node $v$ of $t$ , we have $h_{k}(v)\in\mathcal{L}_{k}^{i}$ if and only if $h_{2k}(\text{fcns}(v))\in\mathcal{L}_{2k}^{i}(\text{fcns}(t))$ . In particular $\varphi(z)\in\mathcal{L}_{k}^{0}$ for every $z\in\mathcal{L}_{2k}^{0}(\text{fcns}(t))$ and $\varphi(z)\in\mathcal{L}_{k}^{1}$ for every $z\in\mathcal{L}_{2k}^{01}(\text{fcns}(t))$ . Hence $\varphi(z)\neq\varphi(z^{\prime})$ if $z\in\mathcal{L}_{2k}^{01}(\text{fcns}(t))$ and $z^{\prime}\in\mathcal{L}_{2k}^{0}(\text{fcns}(t))$ .

From (i), we obtain

[TABLE]

From (ii) and (iv), we obtain the following:

[TABLE]

where the last estimate follows from the log-sum inequality (Lemma 1). For every $y\in\mathcal{L}_{k}^{1}$ we have

[TABLE]

Thus, we obtain

[TABLE]

From (iii) and (iv), we obtain

[TABLE]

For the first summand, we find analogously as in the previous estimate (26):

[TABLE]

For the second summand, we obtain as $k\geq 1$ :

[TABLE]

where the last inequality follows from the log-sum inequality. Moreover, for all $y\in\mathcal{L}_{k-1}$ we have

[TABLE]

Thus, we find

[TABLE]

Altogether, if we combine the estimates from (25), (26), (27) and (28), we obtain:

[TABLE]

where the last-but-one estimate follows again from the log-sum inequality. This proves Lemma 8. ∎

Bibliography


[1] Janos Aczél. On Shannon’s inequality, optimal coding, and characterizations of Shannon’s and Renyi’s entropies. Technical Report Research Report AA-73-05, University of Waterloo, 1973. https://cs.uwaterloo.ca/research/tr/1973/CS-73-05.pdf.
[2] Philip Bille, Inge Li Gørtz, Gad M. Landau, and Oren Weimann. Tree compression with top trees. Information and Computation, 243:166–177, 2015.
[3] Giorgio Busatto, Markus Lohrey, and Sebastian Maneth. Efficient memory representation of XML document trees. Information Systems, 33(4–5):456–474, 2008.
[4] Thomas M. Cover. Enumerative source encoding. IEEE Transactions on Information Theory, 19(1):73–77, 1973.
[5] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (2. ed.). Wiley, 2006.
[6] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. Proceedings of the 46th Annual Symposium on Foundations of Computer Science (FOCS 2005), pages 184–196. IEEE Computer Society Press, 2005.
[7] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM, 57(1):4:1–4:33, 2009.
[8] Philippe Flajolet and Robert Sedgewick. Analytic Combinatorics. Cambridge University Press, 2009.
