Entropy Bounds for Grammar-Based Tree Compressors
Danny Hucke, Markus Lohrey, and Louisa Seelbach Benkner

Abstract
The definition of $k^{th}$-order empirical entropy of strings is extended to node-labeled binary trees. A suitable binary encoding of tree straight-line programs (that have been used for grammar-based tree compression before) is shown to yield binary tree encodings of size bounded by the $k^{th}$-order empirical entropy plus some lower order terms. This generalizes recent results for grammar-based string compression to grammar-based tree compression.
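Table 1. XML document trees with $n$ nodes and $\sigma$ node labels: worst-case bit size $2n + n \log_2 \sigma$ and, in the last four columns, the $k^{th}$-order empirical entropy divided by this bit size for four values of $k$.

| XML document | $n$ | $\sigma$ | $2n + n \log_2 \sigma$ | quotient, 1st $k$ | quotient, 2nd $k$ | quotient, 3rd $k$ | quotient, 4th $k$ |
|---|---|---|---|---|---|---|---|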
| Baseball | 28 306 | 46 | 212 961.9447 | 2.9818 % | 1.2547 % | 0.6739 % | 0.6662 % |
| DBLP | 3 332 130 | 35 | 23 755 697.8193 | 10.9775 % | 8.7407 % | 8.2134 % | 6.7270 % |
| DCSD-Normal | 2 242 699 | 50 | 17 142 868.6330 | 4.2437 % | 2.2481 % | 1.7517 % | 1.3038 % |
| EnWikiNew | 404 652 | 20 | 2 558 180.8475 | 9.5317 % | 3.0760 % | 3.0759 % | 2.9378 % |
| EnWikiQuote | 262 955 | 20 | 1 662 382.6021 | 9.4270 % | 3.1014 % | 3.1014 % | 3.1006 % |
| EnWikiVersity | 495 839 | 20 | 3 134 658.5046 | 8.8952 % | 2.3753 % | 2.3753 % | 2.3750 % |
| EXI-Array | 226 523 | 47 | 1 711 288.1304 | 0.2506 % | 0.2495 % | 0.2492 % | 0.2483 % |
| EXI-factbook | 55 453 | 199 | 534 379.7451 | 2.2034 % | 0.9450 % | 0.8132 % | 0.8092 % |
| EXI-Invoice | 15 075 | 52 | 116 084.1288 | 0.0484 % | 0.0268 % | 0.0139 % | 0.0098 % |
| EXI-Telecomp | 177 634 | 39 | 1 294 135.1377 | 1.5405 % | 0.0044 % | 0.0034 % | 0.0021 % |
| EXI-weblog | 93 435 | 12 | 521 830.9713 | 0.0032 % | 0.0028 % | 0.0028 % | 0.0028 % |
| Lineitem | 1 022 976 | 18 | 6 311 685.1983 | 0.0003 % | 0.0003 % | 0.0003 % | 0.0003 % |
| Mondial | 22 423 | 23 | 146 277.8297 | 11.1285 % | 9.2940 % | 8.4702 % | 7.7679 % |
| NASA | 476 646 | 61 | 3 780 154.2290 | 7.7424 % | 4.4588 % | 3.8898 % | 3.8054 % |
| Shakespeare | 179 690 | 22 | 1 160 695.2676 | 11.9140 % | 10.8416 % | 10.6368 % | 10.4765 % |
| SwissProt | 2 977 031 | 85 | 25 035 017.5080 | 12.1892 % | 10.5249 % | 9.2455 % | 8.1204 % |
| TCSD-Normal | 2 749 751 | 24 | 18 107 007.2213 | 8.5450 % | 8.4004 % | 8.2862 % | 8.2472 % |
| Treebank | 2 437 666 | 250 | 24 293 253.5140 | 30.8912 % | 23.0825 % | 19.2444 % | 13.4058 % |
| USHouse | 6 712 | 43 | 49 845.0890 | 21.0500 % | 18.2164 % | 12.6572 % | 9.3754 % |
| XMark1 | 167 865 | 74 | 1 378 079.8892 | 12.1610 % | 9.5101 % | 9.2271 % | 8.4281 % |
| XMark2 | 1 666 315 | 74 | 13 679 535.2849 | 12.2125 % | 9.5634 % | 9.3259 % | 8.9400 % |
Universität Siegen, Germany
{hucke,lohrey,seelbach}@eti.uni-siegen.de
Keywords. Grammar-based compression, binary trees, empirical entropy, lossless compression
This work has been supported by the DFG research project LO 748/10-1 (QUANT-KOMP)
1. Introduction
Grammar-based string compression.
The idea of grammar-based compression is based on the fact that in many cases a word $w$ can be succinctly represented by a context-free grammar that produces exactly $w$. Such a grammar is called a straight-line program (SLP) for $w$. In the best case, one gets an SLP of size $O(\log n)$ for a word of length $n$, where the size of an SLP is the total length of all right-hand sides of the rules of the grammar. A grammar-based compressor is an algorithm that produces for a given word $w$ an SLP for $w$ that should, of course, be smaller than $w$. Grammar-based compressors can be found in many places in the literature. Probably the best known example is the classical LZ78-compressor of Lempel and Ziv [31]. Indeed, it is straightforward to transform the LZ78-representation of a word $w$ into an SLP for $w$. Other well-known grammar-based compressors are Bisection [20], Sequitur [27], and Repair [21], just to mention a few.
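To make the best-case size bound concrete, the following Python sketch builds a small SLP for the word consisting of $2^m$ copies of the letter a and expands it again. The rule names `X0`, `X1`, ... and the helper names are hypothetical and chosen only for this illustration; the sketch is not one of the compressors cited above.

```python
def expand(slp, symbol):
    """Expand a symbol of a straight-line program into the word it derives."""
    rhs = slp.get(symbol)
    if rhs is None:              # not a nonterminal, hence a terminal letter
        return symbol
    return "".join(expand(slp, s) for s in rhs)


def power_slp(m):
    """Hypothetical SLP for a^(2^m): X0 -> a and Xi -> X(i-1) X(i-1)."""
    slp = {"X0": ["a"]}
    for i in range(1, m + 1):
        slp[f"X{i}"] = [f"X{i - 1}", f"X{i - 1}"]
    return slp, f"X{m}"


def slp_size(slp):
    """Size of an SLP: total length of all right-hand sides."""
    return sum(len(rhs) for rhs in slp.values())


slp, start = power_slp(10)
print(len(expand(slp, start)), slp_size(slp))
```

The derived word has length $2^m$ while the SLP has size $2m + 1$, which is logarithmic in the word length.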
Recently, several upper bounds on the compression performance of grammar-based compressors in terms of higher order empirical entropy have been shown. For this, the choice of a concrete binary encoding of an SLP is crucial. Kieffer and Yang [19] came up with such a binary encoding and proved that under certain assumptions on the grammar-based compressor , the combined compressor yields a universal code with respect to the family of finite-state information sources over finite alphabets. More precisely, it is needed that the size of the SLP is bounded by where is the size of the underlying alphabet and . This upper bound is met by all grammar-based compressors that produce so-called irreducible SLPs [19], which is the case for e.g. LZ78, Bisection, and Repair after a small modification of the latter. In their recent paper [28], Navarro and Ochoa used the binary encoding from [19] in order to prove for every word over an alphabet of size the upper bound for every . Here, is the -order empirical entropy of , and the grammar-based compressor must satisfy the upper bound . Similar but weaker upper bounds for more practical binary SLP-encodings have been shown in [12, 26].
Grammar-based tree compression.
Grammar-based compression has been generalized from strings to trees by means of linear context-free tree grammars generating exactly one tree [3]. Such grammars are also known as tree straight-line programs, TSLPs for short, see [23] for a survey. TSLPs can be seen as a proper generalization of SLPs and DAGs (directed acyclic graphs, which are a widely used compact representation of trees). Whereas DAGs only have the ability to share repeated subtrees of a tree, TSLPs can also share repeated tree patterns with a hole (so-called contexts). In [10], the authors presented a linear time algorithm that computes for a given binary tree of size a TSLP of size where is the size of the underlying set of node labels and . An alternative algorithm with the same asymptotic size bound can be found in [11]. TSLPs have been also extended to so-called forest straight-line programs (FSLPs) which allow to compress unranked node-labeled trees [14]. FSLPs are very similar to top DAGs [2] and also meet the size bound for unranked trees of size . The reader should notice that the -bound cannot be achieved by DAGs: the smallest DAG for an unlabeled binary tree of size may still contain edges.
Entropy bounds for grammar-based tree compressors.
In this paper we first consider node-labeled binary trees: every node has a label from a finite set of size and every non-leaf node has a left and a right child. For unlabeled binary trees the results of Kieffer and Yang on universal grammar-based compressors have been extended to trees in [16, 30]. Whereas the universal tree encoder from [30] is based on DAGs (and needs a certain assumption on the average DAG size with respect to the input distribution), the encoder from [16] uses TSLPs of size . For this, a binary encoding of TSLPs similar to the one for SLPs from [19] is proposed. In this paper we extend the binary TSLP-encoding from [16] to node-labeled binary trees and prove an entropy bound similar to the one from [28] for strings. To do this, we first have to come up with a reasonable higher order entropy for binary node-labeled trees (we just speak of binary trees in the following). Several notions of tree entropy can be found in the literature, but all are tailored towards unranked trees and do not yield nontrivial results for the special case of unlabeled binary trees.
- The $k^{th}$-order label entropy from [6] is based on the empirical probability that a node $v$ is labeled with a certain symbol conditioned on the first $k$ labels from the parent node of $v$ to the root of the tree.
- The tree entropy from [18] is the $0^{th}$-order entropy of the node degrees.
- Recently, two combinations of the two previous entropy measures were proposed in [13]. The first combination is based on the empirical probability that a node $v$ is labeled with a certain symbol conditioned on (i) the first $k$ labels from the parent node of $v$ to the root and (ii) the node degree of $v$. The second combination uses the empirical probability that a node $v$ has a certain degree conditioned on (i) the first $k$ labels from the parent node of $v$ to the root and (ii) the node label of $v$.
Tree entropy [18] is not useful in the context of binary trees, since a binary tree with $n$ leaves has $n-1$ nodes of degree $2$, which shows that the tree entropy divided by the number of nodes ($2n-1$) converges to $1$ when $n$ increases. On the other hand, the $k^{th}$-order label entropy [6] is not useful for unlabeled trees. For the special case of unlabeled binary trees, the combinations of [13] also do not lead to useful entropy measures.
Our first contribution is the definition of a reasonable entropy measure for binary trees that can also be used for the unlabeled case. For this we define the $k$-history of a node $v$ in a binary tree $t$ by taking the last $k$ edges on the unique path from the root to $v$. For each edge $(v_1, v_2)$ traversed on this path we write down the node label of $v_1$ and a $0$ (resp., $1$) if $v_2$ is a left (resp., right) child of $v_1$. Thus, the $k$-history of a node is a word of length $2k$ that alternatingly consists of symbols from $\Sigma$ and directions that are encoded by $0$ or $1$. For nodes at depth smaller than $k$ we pad the history with $0$’s and a default node label in order to get length exactly $2k$. (This is an ad hoc decision to make the definitions easier. In the appendix we discuss different approaches of how to deal with nodes of depth smaller than $k$, and prove that they asymptotically lead to the same entropy measure.) For each $k$-history $z$ we then consider the joint probability distribution $P_z$ of the node degree (either $0$ or $2$) and the node label, conditioned on the history $z$. Thus, $P_z(a, i)$ is the probability that a randomly chosen node among the nodes with history $z$ is labeled with the symbol $a$ and has $i$ children. The $k^{th}$-order empirical entropy of $t$, $H_k(t)$ for short, is then the sum of the entropies of these distributions (the sum is taken over all histories $z$) weighted with the number of nodes with history $z$. This definition is similar to the definition of the $k^{th}$-order empirical entropy of a string.
Our main result states that
[TABLE]
where is a binary tree with leaves, the grammar-based compressor produces TSLPs of size for binary trees of size with many node labels and . Moreover, is an extension of the binary TSLP-encoding described in [16] from unlabeled binary trees to labeled binary trees (Section 3.3). If then this bound can be simplified to . The assumption can be also found in [28]. In fact, Gagie argued in [9] that the -order empirical entropy for strings stops being a reasonable complexity measure for almost all strings of length over alphabets of size when .
Our definition of -order empirical entropy does not capture all regularities that can be exploited in grammar-based compression: Take for instance a complete unlabeled binary tree of height (all paths from the root to a leaf have length ). This tree has leaves and is very well compressible: its minimal DAG has only nodes, hence there also exists a TSLP of size for . But for every fixed the -order empirical entropy of divided by converges to (the trivial upper bound) for . If then for every -history the number of leaves with -history is roughly the same as the number of internal nodes with -history . Hence, although is highly compressible with TSLPs (and even DAGs), its -order empirical entropy is close to the maximal value. However, this phenomenon occurs for grammar-based string compression and the well-established higher-order empirical entropy of strings as well; see Section 6.
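The compressibility of the complete binary tree can be checked with a short Python sketch that counts distinct subtrees, which is the number of nodes of the minimal DAG. The nested-tuple representation (the empty tuple is a leaf) and all function names are assumptions made for this illustration only.

```python
def complete_tree(height):
    """Unlabeled complete binary tree as nested tuples; () is a leaf."""
    t = ()
    for _ in range(height):
        t = (t, t)
    return t


def count_leaves(t):
    """Number of leaves of an unlabeled binary tree."""
    return 1 if t == () else count_leaves(t[0]) + count_leaves(t[1])


def minimal_dag_size(t):
    """Number of distinct subtrees, i.e. nodes of the minimal DAG."""
    seen = set()

    def visit(s):
        if s in seen:
            return
        seen.add(s)
        for child in s:
            visit(child)

    visit(t)
    return len(seen)


t = complete_tree(10)
print(count_leaves(t), minimal_dag_size(t))
```

For height 10 the tree has 1024 leaves, while the minimal DAG has one node per height level, so 11 nodes.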
In Section 5 we present a simple extension of our entropy notion to node-labeled unranked trees. In an unranked tree the number of children of a node is arbitrary. Unranked trees are important in the area of XML, where the hierarchical structure of a document is represented by a node-labeled unranked tree. For such a tree $t$ we define the $k^{th}$-order empirical entropy as the $k^{th}$-order empirical entropy of the first-child next-sibling (fcns for short) encoding of $t$. The fcns-encoding of $t$ is a binary tree which contains all nodes of $t$. If a node $v$ of $t$ has the first (i.e., left-most) child $v_1$ and the right sibling $v_2$, then $v_1$ (resp., $v_2$) is the left (resp., right) child of $v$ in the fcns-encoding of $t$. If $v$ has no child or no right sibling, then one adds dummy leaves to the fcns-encoding in order to obtain a full binary tree. Our choice of defining the $k^{th}$-order empirical entropy of an unranked tree via the fcns-encoding is motivated by the fact that in XML document trees the label of a node $v$ usually depends on the labels of the ancestors and the labels of the left siblings of $v$. This information is contained in the history of $v$ in the fcns-encoding.
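The fcns-encoding described above can be sketched in a few lines of Python. The representation of unranked trees as `(label, children)` pairs, the binary-tree tuples, and the dummy label `#` are assumptions of this sketch and not notation from the paper.

```python
DUMMY = "#"  # hypothetical label for the added dummy leaves


def fcns(forest):
    """Encode a list of sibling trees (label, children) as a full binary tree.

    Binary trees are (label,) for leaves and (label, left, right) otherwise.
    The left child encodes the first child, the right child the next sibling.
    """
    if not forest:
        return (DUMMY,)
    (label, children), rest = forest[0], forest[1:]
    return (label, fcns(children), fcns(rest))


def count_nodes(t):
    """Number of nodes of a binary tree in the tuple representation."""
    return 1 if len(t) == 1 else 1 + count_nodes(t[1]) + count_nodes(t[2])


unranked = ("a", [("b", []), ("c", [])])
encoded = fcns([unranked])
print(encoded, count_nodes(encoded))
```

In this sketch an unranked tree with $n$ nodes yields a full binary tree with $n$ inner nodes and $n + 1$ dummy leaves.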
We present experimental results with real XML document trees showing that in these cases the $k^{th}$-order empirical entropy is indeed very small compared to the worst-case bit size. An unranked tree with $n$ nodes and $\sigma$ node labels can be encoded with $2n + n \log_2 \sigma$ bits [15]. Up to low order terms, this is optimal. Table 1 shows the values of the $k^{th}$-order empirical entropy (for four values of $k$) divided by $2n + n \log_2 \sigma$ for several real XML trees (that were also used in other experiments for XML compression [24, 25]). In the third entropy column these quotients never exceed 20%, and in the last column all quotients are bounded by 13.5%.
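As a sanity check of the worst-case bit size, the following Python sketch evaluates $2n + n \log_2 \sigma$ for two rows of Table 1. The formula is an assumption here in the sense that it was recovered by matching the fourth table column; the function name is hypothetical.

```python
import math


def worst_case_bits(n, sigma):
    """Assumed worst-case bit size 2n + n*log2(sigma) for n nodes, sigma labels."""
    return 2 * n + n * math.log2(sigma)


# (document, n, sigma, value listed in the fourth column of Table 1)
rows = [
    ("Baseball", 28306, 46, 212961.9447),
    ("EXI-weblog", 93435, 12, 521830.9713),
]
for name, n, sigma, listed in rows:
    print(name, round(worst_case_bits(n, sigma), 4), listed)
```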
Our experimental results combined with our entropy bound (1) for grammar-based compression are in accordance with the fact that grammar-based tree compressors yield excellent compression ratios for XML document trees, see e.g. [24]. Some of the XML documents from our experiments were also used in [24], where the performance of TreeRePair (currently the best grammar-based tree compressor from a practical point of view) on XML document trees was tested. An interesting observation is that those XML trees for which our $k$-th order empirical entropy is large are indeed those XML trees with the worst compression ratio for TreeRePair in [24] (this is in particular the Treebank document from Table 1).
In a forthcoming paper we will compare our definition of the $k^{th}$-order empirical entropy of trees with the above-mentioned tree entropies from [6, 13, 18] for binary as well as unranked trees, from both a theoretical and an experimental perspective. A short version of this paper can be found in [17].
2. Preliminaries
In this section, we introduce some basic definitions concerning information theory (Section 2.1) and binary trees (Section 2.2).
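With $\mathbb{N}$ we denote the natural numbers including $0$. We use the standard $O$-notation. If $b > 1$ is a constant, then we just write $O(\log n)$ for $O(\log_b n)$. We make the convention that $0 \cdot \log(0) = 0$ and $0 \cdot \log(x/0) = 0$ for $x \geq 0$. For the unit interval $\{ r \in \mathbb{R} : 0 \leq r \leq 1 \}$ we write $[0,1]$.
Let $w = a_1 a_2 \cdots a_l \in \Gamma^*$ be a word over an alphabet $\Gamma$. With $|w| = l$ we denote the length of $w$. The empty word is denoted by $\varepsilon$. For $a \in \Gamma$ we denote with $|w|_a$ the number of occurrences of $a$ in $w$.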
2.1. Empirical distributions and empirical entropy
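Let $A$ be a finite set. A probability distribution on $A$ is a mapping $p : A \to [0,1]$ such that $\sum_{a \in A} p(a) = 1$. For a probability distribution $p$ on $A$ we define its Shannon entropy
$$H(p) = \sum_{a \in A} -p(a) \log_2 p(a).$$
We have $0 \leq H(p) \leq \log_2 |A|$. A well-known generalization of Shannon’s inequality states that for every probability distribution $p$ on $A$ and any mapping $q : A \to [0,1]$ such that $\sum_{a \in A} q(a) \leq 1$ we have
$$\sum_{a \in A} -p(a) \log_2 p(a) \;\leq\; \sum_{a \in A} -p(a) \log_2 q(a), \tag{2}$$
see [1] for a proof. Shannon’s inequality is the special case where $q$ is a probability distribution as well. The Kullback-Leibler divergence between two probability distributions $p, q$ on $A$ (see [5, Section 2.3]) is defined as
$$D(p \,\|\, q) = \sum_{a \in A} p(a) \log_2 \frac{p(a)}{q(a)}.$$
It is known that $D(p \,\|\, q) \geq 0$ for all $p, q$ (this follows from Shannon’s inequality) and $D(p \,\|\, q) = 0$ if and only if $p = q$.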
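Let $\overline{a} = (a_1, a_2, \ldots, a_n)$ be a tuple of elements that are from some (not necessarily finite) set $A$. The empirical distribution $p_{\overline{a}} : \{a_1, \ldots, a_n\} \to [0,1]$ of $\overline{a}$ is defined by
$$p_{\overline{a}}(a) = \frac{|\{ i : 1 \leq i \leq n,\ a_i = a \}|}{n}.$$
We use this (and the following) definition also for words over some alphabet by identifying a word $w = a_1 a_2 \cdots a_n$ with the tuple $(a_1, a_2, \ldots, a_n)$. The unnormalized empirical entropy of $\overline{a}$ is
$$H(\overline{a}) = n \cdot H(p_{\overline{a}}) = -\sum_{i=1}^{n} \log_2 p_{\overline{a}}(a_i). \tag{4}$$
From (2) it follows that for a tuple $\overline{a} = (a_1, \ldots, a_n)$ with $a_1, \ldots, a_n \in A$ and real numbers $q(a) > 0$ ($a \in \{a_1, \ldots, a_n\}$) with $\sum_{a \in \{a_1, \ldots, a_n\}} q(a) \leq 1$ we have
$$H(\overline{a}) \;\leq\; -\sum_{i=1}^{n} \log_2 q(a_i).$$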
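The unnormalized empirical entropy of a tuple or word is easy to compute directly from symbol counts. The following Python sketch does so; the function name is hypothetical, and the final check illustrates the last inequality for the uniform choice $q(a) = 1/4$ on a four-letter alphabet.

```python
import math
from collections import Counter


def empirical_entropy(seq):
    """Unnormalized empirical entropy: sum over all positions of -log2 p(a_i)."""
    n = len(seq)
    counts = Counter(seq)
    # A symbol occurring c times contributes c * log2(n / c).
    return sum(c * math.log2(n / c) for c in counts.values())


print(empirical_entropy("aabb"))   # two equally frequent symbols
print(empirical_entropy("aaaa"))   # a single symbol carries no entropy
# Bound with q(a) = 1/4 for every symbol: each position costs log2(4) = 2 bits.
print(empirical_entropy("aabc") <= len("aabc") * 2)
```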
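We also need the famous log-sum inequality, see e.g. [5, Theorem 2.7.1] (recall our conventions $0 \cdot \log(0) = 0$ and $0 \cdot \log(x/0) = 0$ for $x \geq 0$):
Lemma 1.
Let $a_1, \ldots, a_l, b_1, \ldots, b_l \geq 0$ be real numbers. Moreover, let $a = \sum_{i=1}^{l} a_i$ and $b = \sum_{i=1}^{l} b_i$. Then
$$a \log_2 \frac{b}{a} \;\geq\; \sum_{i=1}^{l} a_i \log_2 \frac{b_i}{a_i}.$$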
2.2. Trees, tree processes, and tree entropy
2.2.1. Trees and contexts
Let denote a finite non-empty alphabet of size . Later, we will need a fixed distinguished symbol from that we will denote with . We will also need the value . With we denote the set of labeled binary trees over the alphabet . Formally, it is inductively defined as the smallest set of terms over such that
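- $\Sigma \subseteq \mathcal{T}(\Sigma)$ and
- if $a \in \Sigma$ and $t_1, t_2 \in \mathcal{T}(\Sigma)$, then $a(t_1, t_2) \in \mathcal{T}(\Sigma)$.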
If e.g. , then is the binary tree with a single node labeled by and is the binary tree depicted on the left of Figure 1.
A tree encoder is an injective mapping such that the range is prefix-free, i.e., there do not exist with such that is a prefix of .
With we denote the number of leaves of , which can be inductively defined by and for and . Note that is the number of occurrences of symbols from in . Let for . Note that . We have , where is the Catalan number. These numbers satisfy the following well-known asymptotic estimate
$$C_n \sim \frac{4^n}{\sqrt{\pi}\, n^{3/2}}, \tag{6}$$
see e.g. [8]. In fact, we have for all and hence .
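The connection between binary trees and Catalan numbers can be verified by brute force. The Python sketch below counts tree shapes with $n$ leaves recursively and compares the result with the closed form of the Catalan numbers. It assumes, as a standard counting fact rather than a formula quoted from the text, that a tree with $n$ leaves has $2n - 1$ nodes, each of which can carry any of $\sigma$ labels; all function names are hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def shapes(n):
    """Number of unlabeled binary trees with n leaves (inner nodes have two children)."""
    if n == 1:
        return 1
    # Split the n leaves between the left and the right subtree of the root.
    return sum(shapes(i) * shapes(n - i) for i in range(1, n))


def catalan(n):
    """Closed form of the n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)


def labeled_trees(n, sigma):
    """Shapes with n leaves times sigma choices for each of the 2n - 1 nodes."""
    return shapes(n) * sigma ** (2 * n - 1)


print([shapes(n) for n in range(1, 8)])
print(all(catalan(n) <= 4 ** n for n in range(0, 50)))
```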
A context is a labeled binary tree, where exactly one leaf is labeled with the special symbol (called the parameter); all other nodes are labeled with symbols from . Formally, the set of contexts is the smallest set such that
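- $x \in \mathcal{C}(\Sigma)$ and
- if $a \in \Sigma$, $t \in \mathcal{T}(\Sigma)$ and $c \in \mathcal{C}(\Sigma)$, then also $a(c, t), a(t, c) \in \mathcal{C}(\Sigma)$.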
If e.g. , then is the context with a single node labeled by the parameter and is the context depicted on the right of Figure 1. For a tree or context and a context , we denote by the tree or context which results from by replacing the unique occurrence of the parameter by . For example and yield (with ). For a context we define inductively by and for and . In other words, is the number of leaves of , where the unique occurrence of the parameter is not counted. Note that , where is arbitrary. We define for . Since the set will not change in this paper, we use the abbreviations , , , and for , , , and , respectively.
Occasionally, we will consider a binary tree or context as a graph with nodes and edges in the usual way, where each node is labeled with a symbol from (or in the case of a context). Note that has nodes in total: leaves and internal nodes.
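It is convenient to define a node of $t \in \mathcal{T} \cup \mathcal{C}$ as a bit string that describes the path from the root to the node ($0$ means left, $1$ means right). Formally, we define the node set $V(t) \subseteq \{0,1\}^*$ of $t$ by
- $V(a) = \{\varepsilon\}$ for every $a \in \Sigma$,
- $V(x) = \emptyset$ and
- $V(a(t_1, t_2)) = \{\varepsilon\} \cup \{ 0v : v \in V(t_1) \} \cup \{ 1v : v \in V(t_2) \}$ for every $a(t_1, t_2) \in \mathcal{T} \cup \mathcal{C}$.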
Note that for a context , the set does not contain the unique node in labeled with the parameter . We use this definition due to better readability of the paper since we mostly need the set of nodes without the parameter node. Also, it is still possible to uniquely determine from the path to the parameter due to the following properties: For a tree we have if and only if for all since each node has zero or two children. The only context which fulfills this property is , i.e., the parameter node is the only node of and . For all other contexts this property is violated since there exists a unique such that (respectively, ) and (respectively, ). In this case the parameter node is (respectively, ). Alternatively, the parameter node of a context is the single node in the set for a symbol . We denote this node with . In other words: .
Example 1.
Consider the tree with depicted on the left of Figure 1. We have . For the context depicted on the right of Figure 1, we have and .
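Consider a tree or context $t \in \mathcal{T} \cup \mathcal{C}$ and let $v \in V(t)$. The leaves of $t$ are those strings in $V(t)$ that are maximal with respect to the prefix relation. The length $|v|$ is the depth of the node $v$ in $t$ and the depth of $t$ is the maximal depth of a node in $V(t)$ (the depth of $x$ is not defined but also not needed). Let $\lambda_t : V(t) \to \Sigma \times \{0, 2\}$ denote the function mapping a node $v$ to the pair $(a, i)$ where $a$ is the label of $v$ and $i$ is the number of children of $v$. We can define this function inductively as follows:
- $\lambda_a(\varepsilon) = (a, 0)$ for $a \in \Sigma$,
- $\lambda_t(\varepsilon) = (a, 2)$ for $t = a(t_0, t_1)$ with $a \in \Sigma$ and $t_0, t_1 \in \mathcal{T} \cup \mathcal{C}$,
- $\lambda_t(iv) = \lambda_{t_i}(v)$ for $t = a(t_0, t_1)$ with $a \in \Sigma$, $i \in \{0, 1\}$ and $v \in V(t_i)$.
Note that in the last case, if $t$ is a context, we cannot have $t_i = x$ because we must have $v \in V(t_i)$. In the following, we will omit the subscript $t$ in $\lambda_t$ if $t$ is clear from the context.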
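The node set and the labeling function can be computed together by one traversal. The following Python sketch lists, for every node, its bit-string address, its label, and its number of children. The tuple representation of trees and the function name are assumptions of this sketch, and contexts are not handled.

```python
def nodes(t, addr=""):
    """Yield (address, label, degree) for every node of a binary tree.

    Trees are (label,) for leaves and (label, left, right) for inner nodes;
    addresses are bit strings with 0 for left and 1 for right.
    """
    if len(t) == 1:
        yield addr, t[0], 0
    else:
        yield addr, t[0], 2
        yield from nodes(t[1], addr + "0")
        yield from nodes(t[2], addr + "1")


tree = ("a", ("b",), ("a", ("b",), ("b",)))
for address, label, degree in nodes(tree):
    print(repr(address), label, degree)
```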
2.2.2. Histories
We now come to the crucial notion of the history of a node in a tree or context. Intuitively, the history of $v$ records all information that can be obtained by walking from the root of the tree/context straight down to the node $v$. First, we define the set of histories as
$$\mathcal{H} = (\Sigma \{0,1\})^*.$$
For an integer $k \geq 0$, let $\mathcal{H}_k = \{ z \in \mathcal{H} : |z| = 2k \}$ and let $\ell_k$ denote the partial function mapping a history $z$ with $|z| \geq 2k$ to the suffix of $z$ of length $2k$, i.e., $\ell_k(z_1 z_2) = z_2$ if $|z_2| = 2k$ (the function $\ell_0$ maps a string to the empty string).
For a tree $t \in \mathcal{T}$ and a node $v \in V(t)$ (resp., a context $c \in \mathcal{C}$ and a node $v \in V(c)$), we inductively define its history $h(v) \in \mathcal{H}$ (in $t$, resp. $c$) by
- $h(\varepsilon) = \varepsilon$ and
- $h(vi) = h(v)\, a\, i$ for $i \in \{0, 1\}$ and $vi \in V(t)$ (resp., $vi \in V(c)$).
Here, $a$ is the symbol that labels the node $v$, i.e., $\lambda(v) = (a, 2)$. That is, in order to obtain $h(v)$, while walking downwards in the tree from the root node to the node $v$ we alternately concatenate symbols from $\Sigma$ with binary numbers in $\{0, 1\}$ such that the symbol from $\Sigma$ corresponds to the label of the current node and the binary number $0$ (resp., $1$) states that we move on to the left (resp., right) child node. Note that the symbol that labels $v$ is not part of the history of $v$. The $k$-history of a tree node $v$ is
$$h_k(v) = \ell_k\bigl( (\diamond 0)^k\, h(v) \bigr),$$
i.e., the suffix of length $2k$ of the word $(\diamond 0)^k h(v)$, where $\diamond$ is a fixed dummy symbol in $\Sigma$ (the choice is arbitrary). This means that if $|v| \geq k$ then $h_k(v)$ describes the last $k$ directions and node labels along the path from the root to node $v$. If $|v| < k$, we pad the history of $v$ with $\diamond$’s and zeros such that $|h_k(v)| = 2k$. In the appendix, we discuss other reasonable approaches of how to deal with nodes of depth smaller than $k$. For $z \in \mathcal{H}_k$ we denote with
$$V_z(t) = \{ v \in V(t) : h_k(v) = z \}$$
the set of nodes in $t$ with $k$-history $z$.
Example 2.
Consider the tree from Example 1 and let . Then, and .
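The definition of $k$-histories translates into a short recursive procedure. The Python sketch below represents a history as a flat tuple of alternating labels and directions and pads it on the left as described. The tuple representation of trees, the padding label `#` (standing in for the fixed dummy symbol), and the function name are assumptions of this sketch, and the example tree is hypothetical rather than the one from Figure 1.

```python
DUMMY = "#"  # stands in for the fixed padding symbol from the alphabet


def k_histories(t, k, hist=()):
    """Yield (k-history, label, degree) for every node of a binary tree.

    hist is the full history of the current node: labels and directions
    (0 = left, 1 = right) collected on the path from the root.
    """
    padded = (DUMMY, 0) * k + hist
    z = padded[len(padded) - 2 * k:]      # suffix of length 2k
    if len(t) == 1:
        yield z, t[0], 0
    else:
        yield z, t[0], 2
        yield from k_histories(t[1], k, hist + (t[0], 0))
        yield from k_histories(t[2], k, hist + (t[0], 1))


tree = ("a", ("b",), ("a", ("b",), ("b",)))
for z, label, degree in k_histories(tree, 1):
    print(z, label, degree)
```

Slicing with `len(padded) - 2 * k` rather than `-2 * k` keeps the case $k = 0$ correct, where the history must be the empty word.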
2.2.3. Tree processes
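A tree process is an infinite tuple $\mathcal{P} = (P_z)_{z \in \mathcal{H}}$ where every $P_z$ is a probability distribution on $\Sigma \times \{0, 2\}$. With $\mathcal{P}$ we associate the function $\mathrm{Prob}_{\mathcal{P}} : \mathcal{T} \cup \mathcal{C} \to [0,1]$ with
$$\mathrm{Prob}_{\mathcal{P}}(t) = \prod_{v \in V(t)} P_{h(v)}(\lambda(v)).$$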
We are mainly interested in this definition for the case that is a tree, but for technical reasons we also have to allow contexts. Note that if is a context, then the parameter node of is not in and therefore does not contribute to .
A tree process can be used to randomly construct a tree from as follows: In a top-down way we determine for every tree node its label (from ) and its number of children, where this decision depends on the history of the tree node. We start at the root node, whose history is the empty word . If we have reached a tree node with history then we use the probability distribution to randomly choose a pair . We assign the label to . If then becomes a leaf, otherwise the process continues at the two children and (whose history is well-defined). Note that in this way we may produce infinite trees with non-zero probability (e.g. if for some ). Therefore, we only obtain an inequality instead of an equality in the following lemma (recall that only contains finite trees).
Lemma 2.
Let $\mathcal{P}$ be a tree process. Then $\sum_{t \in \mathcal{T}} \mathrm{Prob}_{\mathcal{P}}(t) \leq 1$.
Proof.
Define the set of trees inductively by and
[TABLE]
We have and . It then suffices to show for every . This follows easily from the definition of and the inductive definition of . ∎
Lemma 2 cannot be extended to contexts, but the following bound will suffice for our purpose.
Lemma 3.
Let be a tree process. We have for every .
Proof.
In order to bound , we first represent the probability of each context as a sum of probabilities of trees. So fix a context for the first part of the proof. Note first that in general no tree exists such that (or even ) since (the parameter node of ) does not contribute to the probability of the context . For example, the tree () which results from by replacing the parameter node by an -labeled leaf node has probability . In order to bound , the idea is to replace the parameter node by all possible trees and not only by a single node. So consider the set of all trees that arise from by replacing the parameter by an arbitrary tree. Unfortunately, the total probability can still be strictly smaller than since there might be infinite trees with positive probability with respect to . To get rid of this problem, we fix an element and modify to a tree process such that (i) for and (ii) and for every and . The tree process is created such that all nodes of depth contribute the probability as before and all nodes of depth in a tree are -labeled leaves with probability . Note first that for each context and each node we have and thus . Secondly, all trees of depth larger than have probability [math] with respect to (including infinite trees). Hence, we get . We obtain
[TABLE]
We claim that equals . To see this, consider the tree process with . Also for only finite trees have non-zero probability and thus . We have
[TABLE]
It follows that . In the second part of the proof it remains to bound . The key point here is that for each tree there are at most different contexts such that . Note that for a tree , the number of different contexts such that is exactly the number of nodes such that replacing the subtree rooted at by the parameter yields a context with . This is the same as the number of subtrees of with leaves. Since different subtrees in of equal size do not share nodes, we can bound the number of subtrees with leaves by . We can assume that since otherwise there is no context such that . So we have for some and the number of subtrees of with leaves is at most . We get
[TABLE]
This concludes the proof of the lemma. ∎
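A $k^{th}$-order tree process is a tree process $\mathcal{P} = (P_z)_{z \in \mathcal{H}}$ such that $P_z = P_{z'}$ if $\ell_k((\diamond 0)^k z) = \ell_k((\diamond 0)^k z')$. Thus, the probability distribution that is chosen for a certain tree node depends only on the last $2k$ symbols of the history of the node (where histories are padded with $\diamond 0$ on the left to reach length $2k$ for the fixed symbol $\diamond \in \Sigma$). We will identify the $k^{th}$-order tree process $\mathcal{P}$ with the finite tuple $(P_z)_{z \in \mathcal{H}_k}$; it contains all information about $\mathcal{P}$. Note that for a $k^{th}$-order tree process $\mathcal{P}$ we can compute $\mathrm{Prob}_{\mathcal{P}}(t)$ for a tree or context $t$ as
$$\mathrm{Prob}_{\mathcal{P}}(t) = \prod_{z \in \mathcal{H}_k} \prod_{v \in V_z(t)} P_z(\lambda(v)),$$
where $V_z(t)$ is the set of nodes of $t$ with $k$-history $z$ and the empty product (which arises in case $V_z(t) = \emptyset$) is $1$.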
2.2.4. Higher-order entropy of a tree
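Let us fix $k \geq 0$. We define the $k^{th}$-order (unnormalized) empirical entropy $H_k(t)$ of a tree $t \in \mathcal{T}$ as follows: For $z \in \mathcal{H}_k$ let
$$n_z^t = |\{ v \in V(t) : h_k(v) = z \}|$$
be the number of nodes of $t$ with $k$-history $z$ and for $\tilde{a} \in \Sigma \times \{0, 2\}$ let
$$n_{z,\tilde{a}}^t = |\{ v \in V(t) : h_k(v) = z,\ \lambda(v) = \tilde{a} \}|.$$
We then define the empirical $k^{th}$-order tree process $\mathcal{P}^t = (P_z^t)_{z \in \mathcal{H}_k}$ by
$$P_z^t(\tilde{a}) = \frac{n_{z,\tilde{a}}^t}{n_z^t}$$
for all $\tilde{a} \in \Sigma \times \{0, 2\}$ and all $z \in \mathcal{H}_k$ with $n_z^t > 0$. If $n_z^t = 0$, then we can define $P_z^t$ as an arbitrary distribution. Then
$$H_k(t) = \sum_{z \in \mathcal{H}_k} n_z^t \cdot H(P_z^t) = \sum_{z \in \mathcal{H}_k} \sum_{\tilde{a} \in \Sigma \times \{0,2\}} n_{z,\tilde{a}}^t \log_2 \frac{n_z^t}{n_{z,\tilde{a}}^t} = -\log_2 \mathrm{Prob}_{\mathcal{P}^t}(t).$$
Note that
$$H_k(t) \leq (2|t| - 1)(1 + \log_2 \sigma)$$
since $H(P_z^t) \leq \log_2(2\sigma)$ and $\sum_{z \in \mathcal{H}_k} n_z^t = 2|t| - 1$. This upper bound on the entropy matches the information theoretic bound for the worst-case output length of any tree encoder on $\mathcal{T}_n$. Using the asymptotic bound (6) for the Catalan numbers, one sees that for any tree encoder there must exist a tree $t \in \mathcal{T}_n$ which is encoded with at least $\log_2 |\mathcal{T}_n| = (2n-1) \log_2 \sigma + 2n - O(\log n)$ bits. The $k^{th}$-order empirical entropy is a lower bound on the coding length of a tree encoder that encodes for each node the relevant information (the label of the node and the binary information whether the node is a leaf or internal) depending on the $k$-history of the node.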
Example 3.
Let denote the binary tree as depicted on the left of Figure 1. In order to compute the first order empirical entropy of , we have to consider -histories of with : Let . It follows that , , and . Thus, we have and . Next, for each -history , we consider for : For , we have , and . Hence, and . Analogously, we find . Altogether, this yields which is roughly .
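One can define $H_k(t)$ alternatively in the following way: Take a $k$-history $z \in \mathcal{H}_k$ and enumerate the set of nodes of $t$ with $k$-history $z$ in an arbitrary way as $v_1, v_2, \ldots, v_m$. Define the string $w_z = \lambda(v_1) \lambda(v_2) \cdots \lambda(v_m)$. We have
$$H_k(t) = \sum_{z \in \mathcal{H}_k} H(w_z),$$
where the empirical entropy $H(w_z)$ is defined according to (4).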
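The alternative definition suggests a direct way to compute the $k^{th}$-order empirical entropy: group the nodes by $k$-history and sum the unnormalized empirical entropies of the groups. The Python sketch below follows this recipe. The tuple representation of trees, the padding label `#`, and the function name are assumptions of this sketch, and the example tree is hypothetical rather than the one from Figure 1.

```python
import math
from collections import Counter, defaultdict


def k_order_entropy(t, k, dummy="#"):
    """Unnormalized k-th order empirical entropy of a binary tree.

    Trees are (label,) for leaves and (label, left, right) for inner nodes.
    """
    groups = defaultdict(Counter)          # k-history -> counts of (label, degree)
    stack = [(t, ())]
    while stack:
        node, hist = stack.pop()
        padded = (dummy, 0) * k + hist
        z = padded[len(padded) - 2 * k:]   # suffix of length 2k
        degree = 0 if len(node) == 1 else 2
        groups[z][(node[0], degree)] += 1
        if degree == 2:
            stack.append((node[1], hist + (node[0], 0)))
            stack.append((node[2], hist + (node[0], 1)))
    total = 0.0
    for counts in groups.values():
        n_z = sum(counts.values())
        total += sum(c * math.log2(n_z / c) for c in counts.values())
    return total


tree = ("a", ("a",), ("a",))
print(k_order_entropy(tree, 0), k_order_entropy(tree, 1))
```

For this three-node tree the order-0 value is $3 \log_2 3 - 2$, while the order-1 value is $0$ because every 1-history determines the label and degree of its node.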
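The following theorem and its proof are very similar to a corresponding statement for the $k^{th}$-order empirical entropy of strings, see [9].
Theorem 1.
Let $t \in \mathcal{T}$. For every $k^{th}$-order tree process $\mathcal{P} = (P_z)_{z \in \mathcal{H}_k}$ with $\mathrm{Prob}_{\mathcal{P}}(t) > 0$ we have
$$H_k(t) \leq -\log_2 \mathrm{Prob}_{\mathcal{P}}(t)$$
with equality if and only if $P_z = P_z^t$ for all $z \in \mathcal{H}_k$ with $n_z^t > 0$.
Proof.
We have
$$\begin{aligned}
-\log_2 \mathrm{Prob}_{\mathcal{P}}(t) &= \sum_{z} \sum_{\tilde{a}} -n_{z,\tilde{a}}^t \log_2 P_z(\tilde{a}) \\
&= \sum_{z} n_z^t \sum_{\tilde{a}} -P_z^t(\tilde{a}) \log_2 P_z(\tilde{a}) \\
&= \sum_{z} n_z^t \bigl( H(P_z^t) + D(P_z^t \,\|\, P_z) \bigr) \\
&\geq \sum_{z} n_z^t \, H(P_z^t) = H_k(t),
\end{aligned}$$
where the sums range over all $z \in \mathcal{H}_k$ with $n_z^t > 0$ and all $\tilde{a} \in \Sigma \times \{0, 2\}$, with equality in the last line if and only if $P_z = P_z^t$ for all $z \in \mathcal{H}_k$ with $n_z^t > 0$. ∎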
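The statement of the theorem can be observed numerically in the simplest case $k = 0$, where a tree process consists of a single distribution on pairs of label and degree. The Python sketch below compares the negative log-probability of a tree under its empirical distribution with the value under a uniform distribution. The restriction to $k = 0$, the tuple representation of trees, and all function names are assumptions of this sketch.

```python
import math
from collections import Counter


def node_pairs(t):
    """List the (label, degree) pair of every node; trees are (label,) or (label, l, r)."""
    if len(t) == 1:
        return [(t[0], 0)]
    return [(t[0], 2)] + node_pairs(t[1]) + node_pairs(t[2])


def neg_log_prob(t, dist):
    """-log2 of the probability of t under a 0-order process given by dist."""
    return -sum(math.log2(dist[p]) for p in node_pairs(t))


def empirical_dist(t):
    """Empirical 0-order distribution of (label, degree) pairs in t."""
    pairs = node_pairs(t)
    return {p: c / len(pairs) for p, c in Counter(pairs).items()}


tree = ("a", ("b",), ("a", ("b",), ("b",)))
uniform = {(s, d): 0.25 for s in "ab" for d in (0, 2)}
print(neg_log_prob(tree, empirical_dist(tree)), neg_log_prob(tree, uniform))
```

The first printed value is the order-0 empirical entropy of the tree, and it is strictly smaller than the second because the uniform distribution differs from the empirical one.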
3. Tree straight-line programs and compression of binary trees
We now introduce tree straight-line programs and use them for the compression of binary trees.
3.1. General tree straight-line programs
Let be a finite alphabet of symbols, where each symbol has an associated rank [math] or (we also speak of a ranked alphabet). The elements of are called nonterminals. We assume that contains at least one nonterminal of rank [math] and that is disjoint from the set , which are the labels used for binary trees and contexts. We use (resp., ) for the set of nonterminals of rank [math] (resp., of rank ). The idea is that nonterminals from (resp., ) derive to trees from (resp., contexts from ). We denote by the set of trees over , i.e., each node in a tree is labeled with a symbol from such that nodes labeled by symbols from have zero or two children and if a node is labeled by a symbol from , then the number of children of this node corresponds to the rank of its label (a formal definition follows). With we denote the corresponding set of all contexts, i.e., the set of trees over , where the parameter symbol occurs exactly once and at a leaf position. Formally, we define and as the smallest sets of formal expressions with the following conditions, where here and in the rest of the paper we use the abbreviations for and for :
- •
and ,
- •
if , and then , and
- •
if , , and then .
If e.g. , and , then and as depicted in Figure 2. Note that and .
A tree straight-line program , or TSLP for short, is a tuple , where is the start nonterminal and is a function which assigns to each nonterminal its unique right-hand side. It is required that if (resp., ), then (resp., ). Furthermore, the binary relation has to be acyclic. These conditions ensure that exactly one tree is derived from the start nonterminal by using the rewrite rules for . To define this formally, we define for and for inductively by the following rules:
- •
for and ,
- •
for and (and or since there is at most one parameter in ),
- •
for ,
- •
for and (note that is a context , so we can build ).
The tree defined by is .
Example 4.
Let and be a TSLP such that and
[TABLE]
We get , and .
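How a TSLP derives its tree can be illustrated by a small evaluator. The Python sketch below uses a hypothetical TSLP with a start nonterminal `S`, a rank-1 nonterminal `A` (a context with parameter `x`) and a rank-0 nonterminal `B`; it is not the TSLP of the example above. The expression encoding is an assumption of this sketch: a string is a nonterminal of rank 0 if it has a rule, the parameter if it equals `x`, and a terminal leaf otherwise; a pair applies a rank-1 nonterminal to an argument; a triple is a terminal inner node.

```python
def val(rules, expr, arg=None):
    """Evaluate a TSLP expression; arg replaces the parameter 'x'.

    Result trees are (label,) for leaves and (label, left, right) otherwise.
    """
    if expr == "x":
        return arg
    if isinstance(expr, str):
        if expr in rules:                     # nonterminal of rank 0
            return val(rules, rules[expr])
        return (expr,)                        # terminal leaf
    if len(expr) == 2:                        # rank-1 nonterminal applied to an argument
        head, sub = expr
        return val(rules, rules[head], val(rules, sub, arg))
    label, left, right = expr                 # terminal inner node
    return (label, val(rules, left, arg), val(rules, right, arg))


# Hypothetical rules: S -> A(A(b)),  A(x) -> a(x, B),  B -> a(b, b)
rules = {
    "S": ("A", ("A", "b")),
    "A": ("a", "x", "B"),
    "B": ("a", "b", "b"),
}
print(val(rules, "S"))
```

Evaluation terminates because the rules are acyclic; the context `A` is used twice, which is the kind of sharing that a DAG cannot express.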
3.2. Tree straight-line programs in normal form
In this section, we will use TSLPs in a certain normal form, which we introduce first.
A TSLP is in normal form if the following conditions hold:
- •
for some , .
- •
For every , the right-hand side is an expression of the form , where and .
- •
For every the right-hand side is an expression of the form , , or , where , and .
- •
For every define the word as follows:
[TABLE]
Let . Then we require that is of the form with .
- •
for
We also allow the TSLP for every in order to get the singleton tree . In this case, we set .
Let be a TSLP in normal form with for the further definitions. We define the size of as . Thus is the length of . Let be the word obtained from by removing the first (i.e., left-most) occurrence of from for every . Thus, if with , then . Note that . The entropy of the normal form TSLP is defined as the empirical unnormalized entropy of the word (see (4)):
[TABLE]
Example 5.
Let and be the normal form TSLP with and
[TABLE]
We have , (, , ), and .
The derivation tree of the normal form TSLP is a binary tree with node labels from . The root is labeled with . Nodes labeled with a symbol from are the leaves of . A node that is labeled with a nonterminal has many children. If with , then the left child of is labeled with and the right child is labeled with . For every node of we define the tree or context where is the label of . If then and if then . An initial subtree of the derivation tree is a tree that can be obtained from as follows: Take a subset of the nodes of and remove from all proper descendants of nodes from , i.e., all nodes that are located strictly below a node from .
Example 6.
Let be the normal form TSLP from Example 5. The derivation tree is shown in Figure 3 on the left; an initial subtree of it is shown on the right.
Lemma 4.
Let be a TSLP in normal form with . Let be an initial subtree of and let be the sequence of all leaves of (in left-to-right order). Then .
Proof.
Let be a node of and let be the subtree of rooted in . Then, the nodes of are in a one-to-one correspondence with the leaves of , that is, if , we have and if , we have (recall that is the number of leaves of ). Thus, . Since is an initial subtree of we get . Since we get and the statement follows. ∎
A grammar-based tree compressor is an algorithm that produces for a given tree a TSLP in normal form such that . It is not hard to show that every TSLP can be transformed with a linear size increase into a normal form TSLP that derives the same tree. For example, the TSLP from Example 4 is transformed into the normal form TSLP described in Example 5. We will not use this fact, since all we need is the following theorem from [10] (recall that ):
Theorem 2.
There exists a grammar-based compressor (working in linear time) with .
3.3. Binary coding of TSLPs in normal form
In this section we fix a binary encoding for normal form TSLPs. This encoding is similar to the one for TSLPs producing unlabeled binary trees [16] (which in turn is based on the encoding for SLPs from [19] and the encoding of DAGs from [30]). Let be a TSLP in normal form with nonterminals. We define the type of a nonterminal as follows:
[TABLE]
We define the binary word , where the words , , are defined as follows:
- •
- •
, where is the 2-bit binary encoding of . Note that .
- •
Let with . Then . Note that .
- •
For let be the number of occurrences of the nonterminal in the word . Moreover, fix a total ordering on . For , let denote the symbol in according to this ordering and let be the number of occurrences of the symbol in the word . Then . Note that .
- •
The word encodes the word using the well-known enumerative encoding [4]. Every nonterminal , , has occurrences in . Every symbol , , has occurrences in . Let be the set of words over the alphabet with occurrences of () and occurrences of (). Hence,
[TABLE]
Let be the lexicographic enumeration of the words from with respect to the alphabet order . Then is the binary encoding of the unique index such that , where (leading zeros are added to the binary encoding of to obtain the length ).
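The enumerative encoding step can be sketched as follows. This is an illustration under assumptions rather than the paper's exact procedure: the index is taken to be 0-based, the alphabet order is passed explicitly as the hypothetical argument `alphabet`, and the helper names `multinomial`, `lex_rank` and `enumerative_code` are not from the paper. The code length is the number of bits needed to index all words with the given symbol frequencies.

```python
from collections import Counter
from math import factorial

def multinomial(counts):
    """Number of distinct words with the given symbol frequencies."""
    counts = list(counts)
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

def lex_rank(word, alphabet):
    """0-based position of `word` among all words with the same symbol
    frequencies, in lexicographic order with respect to `alphabet`."""
    remaining = Counter(word)
    rank = 0
    for symbol in word:
        for smaller in alphabet:
            if smaller == symbol:
                break
            if remaining[smaller] > 0:
                # Count the words that put `smaller` at this position.
                remaining[smaller] -= 1
                rank += multinomial(remaining.values())
                remaining[smaller] += 1
        remaining[symbol] -= 1
    return rank

def enumerative_code(word, alphabet):
    """Fixed-length binary encoding of the rank of `word`."""
    size = multinomial(Counter(word).values())
    length = (size - 1).bit_length()  # equals ceil(log2(size)) for size >= 1
    index = lex_rank(word, alphabet)
    return format(index, "b").zfill(length) if length else ""

print(enumerative_code("abab", "ab"))  # prints 001
```

For symbol frequencies 2, 2, 1, 1, as in Example 7, `multinomial([2, 2, 1, 1])` evaluates to 180, so such a word is encoded with 8 bits. A decoder that knows the frequencies can invert the ranking, which is why the frequencies are stored in a separate part of the code word.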
Example 7.
Consider the normal form TSLP from Example 5. We have , , and . To compute , note first that there are words with two occurrences of and and one occurrence of and . It follows that . Furthermore, with the canonical ordering on , the order of the alphabet is . The word is the lexicographically largest word in starting with . There are 132 words in that are lexicographically larger than , namely all words in that start with (60 words), (30 words), (30 words), or (12 words). Hence is the word in in lexicographic order, i.e., and thus .
The following lemma generalizes a result from [16]:
Lemma 5.
The set of code words , where ranges over all TSLPs in normal form, is a prefix code.
Proof.
Let with defined as above. We show how to recover the TSLP , given the alphabet and the ordering on . From we can determine and the factors , , and of . Hence, we can determine the type of every nonterminal from . The types allow us to compute from the word . Hence, it remains to determine . To compute from , one only needs . For this, one determines the frequencies of the symbols in from . Using these frequencies one computes the size from (11) and the length of . From , one can finally compute . ∎
Note that . By using the well-known bound on the code length of enumerative encoding [5, Theorem 11.1.3], we get:
Lemma 6.
For the length of the binary coding we have
[TABLE]
4. Entropy bounds for binary encoded TSLPs
For this section we fix a grammar-based tree compressor such that ; see Theorem 2. Let be a concrete constant such that
[TABLE]
for every tree and large enough. We allow that the alphabet size grows with , i.e., is a function in the tree size such that (a binary tree has nodes).
We then consider the tree encoder defined by .
Lemma 7.
Let , with and let be a -order tree process with . We have
[TABLE]
Proof.
Let be the size of . Let be the derivation tree of . We define an initial subtree as follows: If and are non-leaf nodes of that are labeled with the same nonterminal and comes before in preorder (depth-first left-to-right), then we remove from all proper descendants of . Thus, for every there is exactly one non-leaf node in that is labeled with . For the TSLP from Example 5, the tree is shown in Figure 3 on the right.
Recall the definition of the words and from Section 3.2. The word can be obtained by writing down for every node of the labels of ’s children and then concatenating these labels. Moreover, the word is obtained by writing down (in the right order) the labels of the leaves of . Note that has non-leaf nodes and leaves. Let be the sequence of all leaves of (w.l.o.g. in preorder) and let be the label of . Let . Then is a permutation of . We therefore have for every . Hence, and are the same empirical distributions. For the TSLP from Example 5 we get . Let . Since for all ( is in normal form) and for all (this holds for every normal form TSLP that produces a tree of size at least two), the tuple satisfies for all :
[TABLE]
We define from for every a modified tree process by setting
[TABLE]
for all . Note that the -order tree process is obtained for for the fixed padding symbol . We define a mapping by
[TABLE]
for every . Thus, for every , the function maximizes the values of the function associated with the -order tree process by choosing an optimal -history for the nodes of whose history is of length smaller than . We show that satisfies
[TABLE]
In order to prove (16), first note that by definition of the tree/context , for each node of the derivation tree , the tree/context corresponds to a subtree/subcontext or a single inner node of the binary tree . We define a function which maps a node of the derivation tree to a node : Intuitively, is the root of the subtree/subcontext, respectively, the inner node of which corresponds to . Formally, is defined inductively as follows: For the root node of , we set . Furthermore, let be a non-leaf node of which is labeled with the non-terminal and for which has been defined. Let be the left child and be the right child of in . We define . The node is defined as follows:
- (i)
If with and , then we set (recall that is the position of the parameter in the context ).
- (ii)
If (respectively, ) for and , then we define (respectively, ).
This yields a well-defined function mapping a node of to a node . Let us define
[TABLE]
Then, the mapping
[TABLE]
is bijective. The definition of the sets implies that if two nodes and of are not in an ancestor-descendant relationship, then . Since the nodes are the leaves of the initial subtree and hence not in an ancestor-descendant relationship, the sets are disjoint subsets of . For the TSLP from Example 5, the node sets and corresponding to the six leaves of the initial subtree depicted in Figure 3 (right) are shown in Figure 4. Note that if then the bijection from (17) also preserves the -mapping in the following sense:
[TABLE]
for every . However, if then this statement can be wrong since the number of children is not preserved in general: If , then might correspond to a single inner node of . In this case, we have , and for some , but . For example, in the TSLP from Example 5, the left-most leaf node of its initial subtree depicted in Figure 3 corresponds to the root node of the tree (see Figure 4). We define
[TABLE]
In our running example, we have and hence .
The history of a node with in the tree is the concatenation of the history of in and the history of in the tree/context . Thus, if , we have
[TABLE]
For the inequality in the last line, note that every -history for is also of the form for some .
We can now show (16). Since with we have
[TABLE]
Next, we define the function as follows:
[TABLE]
We get
[TABLE]
where () follows from Lemmas 2 and 3 and and () follows from the well-known fact that . In particular, we have Thus, with Shannon’s inequality (5), we obtain:
[TABLE]
With and we obtain
[TABLE]
by definition of Using logarithmic identities, we get
[TABLE]
Using , and , we obtain
[TABLE]
Equation (16) and yield
[TABLE]
Let us bound the sum : Using Jensen’s inequality and Lemma 4 (which yields ), we get
[TABLE]
and thus
[TABLE]
To bound the term recall that for large enough we have by (12). Here, is a constant. Since there is a constant with . Since for every fixed , the function is monotonically increasing for (where is Euler’s number), we get
[TABLE]
With (20) we get
[TABLE]
which proves the lemma. ∎
Theorem 3.
For every and every we have
[TABLE]
Proof.
Let be a -order tree process with . Lemmas 6 and 7 yield
[TABLE]
where the last equality uses the bound . Finally, by taking to be the empirical -order tree process , we get
[TABLE]
from Theorem 1. ∎
5. Extension to unranked trees
So far, we have only considered binary trees. In this section, we consider unranked, ordered trees, where the number of children of a node (also called its degree) can be any natural number and the children of every node are totally ordered. As before, each node is labeled by an element of some finite alphabet . Let us denote by (or simply ) the set of all such trees. For technical reasons we also define forests which are ordered sequences of trees from . The set of forests is denoted with . The sets and can be inductively defined as the smallest sets of strings over the alphabet such that the following conditions hold:
- •
(this is the empty forest),
- •
if and then ,
- •
if and then .
The singleton tree (which is obtained by taking in the second point) is usually written as . Note that and that . The size of is the number of occurrences of -labels in ; formally: , and for , , and .
The first-child/next-sibling encoding transforms a forest into a binary tree . It is defined inductively as follows (recall that is a fixed distinguished symbol in ):
- •
and
- •
for and .
Thus, the left (resp., right) child of a node in is the first child (resp., right sibling) of the node in or a -labeled leaf if it does not exist.
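The first-child/next-sibling encoding translates directly into a short recursive function. The sketch below assumes a hypothetical representation: an unranked tree is a pair `(label, list_of_children)`, a forest is a list of such trees, a binary inner node is a triple `(label, left, right)`, and `BOX` is a stand-in name for the fixed distinguished symbol labeling the added leaves.

```python
# Hypothetical stand-in for the fixed distinguished leaf symbol.
BOX = "□"

def fcns(forest):
    """First-child/next-sibling encoding of a forest (a list of trees)."""
    if not forest:
        # The empty forest becomes a single leaf with the dummy label.
        return (BOX,)
    (label, children), *siblings = forest
    # Left subtree: encoding of the children; right subtree: the siblings.
    return (label, fcns(children), fcns(siblings))

def count_leaves(binary_tree):
    """Number of leaves of a binary tree produced by `fcns`."""
    if len(binary_tree) == 1:
        return 1
    return count_leaves(binary_tree[1]) + count_leaves(binary_tree[2])

# A root labeled a with two leaf children labeled b and c.
forest = [("a", [("b", []), ("c", [])])]
encoded = fcns(forest)
print(encoded)
print(count_leaves(encoded))  # prints 4
```

Every node of the forest becomes an inner node of the binary tree, so a forest with three nodes yields a binary tree with three inner nodes and four leaves. The recursion depth grows with both the depth of the forest and the length of its sibling lists, so very large inputs would need an iterative variant.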
Example 8.
If then
[TABLE]
see also Figure 5.
Note that if , then is a binary tree with internal nodes. Hence we have (which is the number of leaves of ). We define the -order empirical entropy of an unranked tree as . Note that this definition is independent of the choice of the symbol . From Theorem 3, we immediately obtain:
Theorem 4.
For every with and every we have
[TABLE]
The above definition of the -order empirical entropy of an unranked tree can also be applied to binary trees (a binary tree can be viewed as a particular unranked tree). This yields and raises the question of how this value relates to (the -order empirical entropy of as defined before in (10)). In one direction, we have the following bound:
Lemma 8.
Let denote a binary tree with first-child next-sibling encoding . Then for .
The somewhat technical proof of Lemma 8 can be found in Appendix B. In contrast to Lemma 8, there are families of binary trees where is exponentially smaller than for every and : Define inductively by and if is even and if is odd. Thus, denotes a right-degenerate binary tree of size , whose inner nodes and right-most leaf are labeled with and whose leaves except for the right-most leaf are alternately labeled and . We get : there are many nodes with -history , and about half of them are -labeled leaves, while the other half are -labeled leaves. Moreover, we have : the fcns-encodings of the binary trees can be inductively defined by and if is even and if is odd. Intuitively, as the labels and are thus incorporated in -histories of nodes of , we can thus determine the label of a node from its -history for for most nodes of .
Our definition of the -order empirical entropy of an unranked tree via the fcns-encoding has a practical motivation. Unranked trees occur for instance in the context of XML, where the hierarchical structure of a document is represented as an unranked node labeled tree. In this setting, the label of a node quite often depends on (i) the labels of the ancestor nodes and (ii) the labels of the (left) siblings. This dependence is captured by our definition of the -order empirical entropy.
We also confirmed this intuition by experimental data (shown in Table 1) with real XML document trees (ignoring textual data at the leaves) showing that in these cases the -order empirical entropy is indeed very small compared to the worst-case bit size. More precisely, we computed for 21 real XML document trees (all data are available from http://xmlcompbench.sourceforge.net/Dataset.html) the -order empirical entropy (for ) and divided the value by the worst-case bit length , where is the number of nodes and is the number of node labels [15].
Our experimental results combined with our entropy bound (1) for grammar-based compression are in accordance with the fact that grammar-based tree compressors yield impressive compression ratios for XML document trees, see e.g. [24]. Some of the XML documents from our experiments were also used in [24], where the performance of the grammar-based tree compressor TreeRePair was tested. An interesting observation is that those XML trees for which our -th order empirical entropy is large are indeed those XML trees with the worst compression ratio for TreeRePair in [24]. This is in particular true for the Treebank document, see Table 1. TreeRePair obtained for Treebank a compression ratio of around 20%, whereas for all other documents tested in [24] TreeRePair achieved a compression ratio below 8%.
6. String straight-line programs versus higher-order empirical entropy of strings
Our definition of -order empirical entropy does not capture all regularities that can be exploited in grammar-based compression. Take for instance a complete unlabeled binary tree of height (all paths from the root to a leaf have length ). This tree has leaves and is very well compressible: its minimal DAG has only nodes, hence there also exists a TSLP of size for . But for every fixed the -order empirical entropy of divided by converges to (the trivial upper bound) for . If then for every -history the number of leaves with -history is roughly the same as the number of internal nodes with -history . Hence, although is highly compressible with TSLPs (and even DAGs), its -order empirical entropy is close to the maximal value. We show in the following that the same phenomenon occurs for grammar-based string compression and the well-established empirical entropy of strings.
The -order empirical entropy of a string is defined as follows (see e.g. [9]). Let denote a finite alphabet and let . For a non-empty string define as the string whose symbol is the symbol in immediately following the occurrence of the string in . Thus, if is not a suffix of , the length of is equal to the number of occurrences of the string in . In case is a suffix of , is the number of occurrences of in minus one. Recall the definition of the unnormalized empirical entropy of a string (or tuple) from Section 2.1. For an integer , the -order (unnormalized) empirical entropy of a string is defined as
[TABLE]
where we set . For , is the (unnormalized) empirical entropy of .
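The definition can be made concrete with a few lines of code. This is a minimal sketch assuming the standard unnormalized empirical entropy (each symbol contributes its number of occurrences times the base-2 logarithm of the inverse relative frequency); the function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def unnormalized_entropy(symbols):
    """Sum over symbols a of count(a) * log2(total / count(a))."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum(c * log2(total / c) for c in counts.values())

def kth_order_entropy(w, k):
    """Unnormalized k-th order empirical entropy of the string w.

    For every context z of length k, collect the symbols that immediately
    follow an occurrence of z in w, and sum the unnormalized entropies of
    these collections."""
    followers = defaultdict(list)
    for i in range(len(w) - k):
        followers[w[i:i + k]].append(w[i + k])
    return sum(unnormalized_entropy(f) for f in followers.values())

print(kth_order_entropy("abab", 0))  # prints 4.0
print(kth_order_entropy("abab", 1))  # prints 0
```

For `k = 0` the only context is the empty string, so the function returns the unnormalized empirical entropy of the whole string. An occurrence of a context at the very end of the string has no following symbol and is skipped, which matches the remark about suffixes above.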
A straight-line program (SLP) for a string is a context-free grammar that produces only the string . The size of an SLP is the sum of the lengths of the right-hand sides of the production rules of the context-free grammar, see e.g. [22] for details. We prove that for each there exists a string of length , which is highly compressible with SLPs, but whose -order empirical entropy is close to the maximum.
Theorem 5.
There exists a family of strings () over a binary alphabet with the following properties:
- •
,
- •
there exists an SLP of size for , and
- •
for .
Proof.
We inductively define a string for as follows: We set
- •
and
- •
.
We have . The string corresponds to the preorder traversal of the perfect binary tree of size , whose internal nodes are labeled with the symbol and whose leaves are labeled with the symbol . The recursive definition of directly translates to an SLP for of size (there is a nonterminal for each with and each rule has three symbols on the right-hand side according to the recursive definition).
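The relationship between the recursive definition and the SLP can be checked on small instances. The sketch below rests on assumptions: the base case is taken to be a single leaf symbol `b` at height 0, the labels `a` (inner nodes) and `b` (leaves) and the nonterminal names `X0`, `X1`, ... are hypothetical, and the indexing may differ by one from the definition above.

```python
def slp_rules(height):
    """SLP for the preorder traversal of a perfect binary tree.

    Assumed rules: X0 -> b and Xi -> a X(i-1) X(i-1) for i >= 1."""
    rules = {"X0": ["b"]}
    for i in range(1, height + 1):
        rules[f"X{i}"] = ["a", f"X{i - 1}", f"X{i - 1}"]
    return rules

def expand(rules, symbol, cache=None):
    """Derive the string produced by `symbol`, memoizing nonterminals."""
    if cache is None:
        cache = {}
    if symbol not in rules:
        return symbol  # terminal symbol
    if symbol not in cache:
        cache[symbol] = "".join(expand(rules, s, cache) for s in rules[symbol])
    return cache[symbol]

def slp_size(rules):
    """Sum of the lengths of all right-hand sides."""
    return sum(len(rhs) for rhs in rules.values())

rules = slp_rules(10)
word = expand(rules, "X10")
print(len(word), slp_size(rules))  # prints 2047 31
```

The derived string has length 2^(h+1) - 1 for height h, while the SLP grows only linearly in h, which is the gap between string length and grammar size that the theorem exploits.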
It remains to show that for . We start with the case . Recall that denotes the number of occurrences of a symbol in a string , as defined in Section 2. We have and , which yields
[TABLE]
Define the function by
[TABLE]
It converges to from below for . Since we have .
Let us now consider the case and let . By construction of , the last symbol of is . Therefore, the length of the string equals the number of occurrences of the string in . In order to lower-bound the -order empirical entropy of , we first show inductively in , that
[TABLE]
for : For the base case, let . We have and thus, . For the induction step, let . By definition of , we have . By the induction hypothesis, we have for . Moreover, does not occur in (which follows by induction), i.e., . By construction, the last symbol of the string is . Thus, for all we have . Hence, as the string with occurs additionally as a prefix of the string , the number of occurrences of in in total is for every . This proves (21).
Next, we count the number of occurrences of in , which are followed by the symbol , that is, we count . We show inductively in , that
[TABLE]
for : For the base case, let . As , we have . For the induction step, let . By the induction hypothesis, we have for . As ends with , we obtain for . Moreover, the construction of implies that the prefix of , which is the only occurrence of in , is followed by the symbol . Thus, for , which proves the claim.
As , we have . Thus, we obtain the following lower bound for the -order empirical entropy of for .
[TABLE]
This proves the theorem. ∎
Appendix A Histories of length smaller than
In order to define -order empirical entropy for binary trees, there are basically three ways to deal with nodes whose history is shorter than :
- (i)
pad the histories with a fixed dummy symbol and direction ,
- (ii)
allow histories of length smaller than , or, equivalently, pad the histories with a fixed dummy symbol and direction , or
- (iii)
ignore nodes whose history is of length smaller than .
Recall that in the main text we used the variant (i) with . In this subsection, we show that the above three variants are basically equivalent if is small compared to the size of the binary tree.
Fix an integer . Recall that in Section 2.2.4 we defined for a tree , a -history , and the numbers and . The tree will be fixed in this section; hence we will write and in the following. We define several variants of these numbers.
For a -history and we define:
[TABLE]
We have and if . Also note that and and .
Fix a fresh symbol and let and . Clearly, and . Let denote the partial function mapping a string with to the suffix of of length . For a binary tree and a node , define . Note that for nodes with . Finally, for and we define
[TABLE]
Using the above numbers, we can define three natural variations of the -order empirical entropy of a binary node-labeled tree :
- (i)
Padding histories of length shorter than with and yields the definition of -order empirical entropy from Section 2 (for ):
[TABLE]
- (ii)
Padding histories of length shorter than with and yields
[TABLE]
This is equivalent to allowing histories of length shorter than : By padding with a symbol , we have if and only if for nodes with .
- (iii)
Ignoring nodes whose history is of length smaller than yields
[TABLE]
We can now show that these three approaches are basically equivalent:
Theorem 6.
For every and every binary tree , we have the following:
[TABLE]
Proof.
First, note that
[TABLE]
as the inner sum is the Shannon entropy of the probability distribution given by (and hence ) and as . Analogously, we get
[TABLE]
We start with upper-bounding : By the log-sum inequality (Lemma 1) and (22), we get
[TABLE]
Moreover, we find
[TABLE]
by the log-sum inequality (Lemma 1) and our estimate from (22). We have
[TABLE]
which follows immediately from the mean-value theorem: as a consequence of the mean-value theorem, for every mapping , which is differentiable on , we have
[TABLE]
With , and and by logarithmic identities, we obtain the estimate (24). Thus, we have:
[TABLE]
Next, we upper-bound : From the definitions of and , we get
[TABLE]
As the second sum on the right-hand side is between 0 and (see (23)), we get .
Finally, as and , we have
[TABLE]
This proves the theorem. ∎
Theorem 6 moreover shows that the choice of the symbol used for padding the histories only affects the value of the -order empirical entropy by an additive term of at most .
Appendix B Proof of Lemma 8
Fix a binary tree . By definition of the first-child next-sibling encoding, every inner node of corresponds in a bijective manner to a node of : For an inner node of , let denote the corresponding node of and let denote the corresponding inner node of of a node of . If is a node of , then we obtain as follows: If , then . Moreover, if is a left child of a node with label , then (and ). Finally, if is a right child of a node with label and ’s left sibling has label , then (and ). Thus, we are also able to determine from for every inner node of : locating every occurrence of a pattern of the form with in the string and replacing it by yields .
In particular, we have for every node of , respectively, for every inner node of . Moreover, for every inner node of , we can uniquely determine from . Thus, we are also able to determine from for every inner node of . Let
[TABLE]
denote the set of -histories that appear as -history of an inner node of . We define a mapping by , which maps the -history of an inner node of to the -history of the corresponding node in : By the above considerations, this mapping is well-defined. Furthermore, we define a mapping by . Again, by the above considerations, this mapping is well-defined, as we are able to determine from .
For we partition into the following disjoint subsets:
[TABLE]
Moreover, we define for . We observe the following:
- (i)
If for a node of , then is a -labeled leaf of : As is a binary tree, the right sibling of a node has no right sibling. Thus, there are no inner nodes in with .
- (ii)
If for a node of , then is an inner node of : This follows again from the fact that is a binary tree (and hence does not have unary nodes).
- (iii)
If for a node of , then can be an inner node or a leaf of . If is a leaf, then its label is the fixed dummy symbol .
- (iv)
For every and node of , we have if and only if . In particular for every and for every . Hence if and .
From (i), we obtain
[TABLE]
From (ii) and (iv), we obtain the following:
[TABLE]
where the last estimate follows from the log-sum inequality (Lemma 1). For every we have
[TABLE]
Thus, we obtain
[TABLE]
From (iii) and (iv), we obtain
[TABLE]
For the first summand, we find analogously as in the previous estimate (26):
[TABLE]
For the second summand, we obtain as :
[TABLE]
where the last inequality follows from the log-sum inequality. Moreover, for all we have
[TABLE]
Thus, we find
[TABLE]
Altogether, if we combine the estimates from (25), (26), (27) and (28), we obtain:
[TABLE]
where the last-but-one estimate follows again from the log-sum inequality. This proves Lemma 8. ∎
- [1] Janos Aczél. On Shannon’s inequality, optimal coding, and characterizations of Shannon’s and Renyi’s entropies. Technical Report Research Report AA-73-05, University of Waterloo, 1973. https://cs.uwaterloo.ca/research/tr/1973/CS-73-05.pdf
- [2] Philip Bille, Inge Li Gørtz, Gad M. Landau, and Oren Weimann. Tree compression with top trees. Information and Computation, 243:166–177, 2015.
- [3] Giorgio Busatto, Markus Lohrey, and Sebastian Maneth. Efficient memory representation of XML document trees. Information Systems, 33(4–5):456–474, 2008.
- [4] Thomas M. Cover. Enumerative source encoding. IEEE Transactions on Information Theory, 19(1):73–77, 1973.
- [5] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (2. ed.). Wiley, 2006.
- [6] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. Proceedings of the 46th Annual Symposium on Foundations of Computer Science (FOCS 2005), pages 184–196. IEEE Computer Society Press, 2005.
- [7] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM, 57(1):4:1–4:33, 2009.
- [8] Philippe Flajolet and Robert Sedgewick. Analytic Combinatorics. Cambridge University Press, 2009.
