Grammar-Based Graph Compression
Sebastian Maneth, Fabian Peternek

TL;DR
This paper introduces a grammar-based graph compression method that recursively detects repeated substructures, resulting in smaller representations and enabling efficient query evaluation on compressed graphs.
Contribution
The paper proposes a novel grammar-based approach for graph compression that improves size reduction and query efficiency compared to existing methods.
Findings
Achieves smaller graph representations for many graph types.
Enables linear-time reachability queries on compressed graphs.
Allows regular path queries with quadratic time complexity.
Abstract
We present a new graph compressor that works by recursively detecting repeated substructures and representing them through grammar rules. We show that for a large number of graphs the compressor obtains smaller representations than other approaches. Specific queries such as reachability between two nodes or regular path queries can be evaluated in linear time (or quadratic time, respectively) over the grammar, thus allowing speed-ups proportional to the compression ratio.
| Graph | |||
|---|---|---|---|
| CA-AstroPh | 18,772 | 396,160 | 14,742 |
| CA-CondMat | 23,133 | 186,936 | 17,135 |
| CA-GrQc | 5,242 | 28,980 | 3,394 |
| Email-Enron | 36,692 | 367,662 | 5,805 |
| Email-EuAll | 265,214 | 420,045 | 28,895 |
| NotreDame | 325,729 | 1,497,134 | 118,264 |
| Wiki-Talk | 2,394,385 | 5,021,410 | 566,846 |
| Wiki-Vote | 7,115 | 103,689 | 5,806 |
| Graph | ||||
|---|---|---|---|---|
| 1 Specific properties en | 609,014 | 819,764 | 71 | 236,235 |
| 2 Types ru | 642,340 | 642,364 | 1 | 79 |
| 3 Types es | 818,657 | 819,780 | 1 | 336 |
| 4 Types de with en | 618,708 | 1,810,909 | 1 | 335 |
| 5 Identica | 16,355 | 29,683 | 12 | 14,588 |
| 6 Jamendo | 438,975 | 1,047,898 | 25 | 396,725 |
| Graph | ||||
|---|---|---|---|---|
| Tic-Tac-Toe | 5,634 | 10,016 | 3 | 9 |
| Chess | 76,272 | 113,039 | 12 | 74,592 |
| DBLP60-70 | 24,246 | 23,677 | 1 | 2,739 |
| DBLP60-90 | 658,197 | 954,521 | 1 | 207,305 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
|---|---|---|---|---|---|---|---|
| Network Graphs | 11.74 | 11.59 | 11.44 | 11.94 | 12.51 | 13.67 | 12.92 |
| RDF Graphs | 5.17 | 5.40 | 4.85 | 4.92 | 5.60 | 6.12 | 6.25 |
| Version Graphs | 7.99 | 8.21 | 5.57 | 5.93 | 6.07 | 6.12 | 6.11 |
| RDF-Graph | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| gRePair | 1,271 | 1 | 3 | 267 | 30 | 872 |
| $k^2$-tree | 2,731 | 590 | 938 | 1,119 | 52 | 988 |
| | TTT | Chess | DBLP60-70 | DBLP60-90 |
|---|---|---|---|---|
| gRePair | 0.12 | 9.06 | 9.54 | 13.39 |
| $k^2$-tree | 9.62 | 13.10 | 15.78 | 20.80 |
| LM | - | - | 16.44 | 19.32 |
| HN | - | - | 16.65 | 18.26 |
| Triangle fractal | Grid | |||||||
|---|---|---|---|---|---|---|---|---|
| Order | MaxRank | |||||||
| FP | 2 | 36.23% | 4.61% | 0.44% | 98.26% | 99.95% | 100.00% | |
| 4 | 46.38% | 5.40% | 0.50% | 100.00% | 95.95% | 97.15% | ||
| 15 | 85.51% | 23.59% | 9.43% | 100.00% | 72.31% | 46.38% | ||
| 85.51% | 24.80% | 6.23% | 100.00% | 70.00% | 33.99% | |||
| FP0 | 2 | 36.23% | 4.61% | 0.44% | 98.26% | 99.95% | 100.00% | |
| 4 | 46.38% | 16.54% | 5.70% | 100.00% | 92.23% | 93.37% | ||
| 15 | 46.38% | 61.10% | 8.99% | 100.00% | 99.93% | 96.99% | ||
| 46.38% | 61.10% | 8.99% | 100.00% | 99.93% | 96.99% | |||
| Nat | 2 | 39.13% | 4.79% | 0.45% | 98.26% | 99.95% | 100.00% | |
| 4 | 100.00% | 82.77% | 80.69% | 99.42% | 95.88% | 97.15% | ||
| 15 | 95.65% | 28.72% | 5.19% | 94.77% | 62.96% | 75.48% | ||
| 95.65% | 28.72% | 4.51% | 94.77% | 13.13% | 1.23% | |||
| BFS | 2 | 60.87% | 18.28% | 7.49% | 98.26% | 99.95% | 100.00% | |
| 4 | 81.16% | 54.22% | 56.68% | 100.00% | 95.88% | 97.15% | ||
| 15 | 81.16% | 57.18% | 72.33% | 100.00% | 47.81% | 37.89% | ||
| 81.16% | 52.13% | 71.11% | 100.00% | 63.74% | (19.85%) | |||
Grammar-Based Graph Compression
Sebastian Maneth
University of Edinburgh
Fabian Peternek
University of Edinburgh
1 Introduction
Graph databases were investigated already some 30 years ago as described by Wood [54]. Today, with linked data on the web and social network data, there has been a resurgence of graph databases and graph processing systems. Compression is an important technique for dealing with large graph data: it saves storage space and data transfer time [18]. Grammar-based compression is a technique by which even query evaluation time can be saved. The idea is to compute a small context-free grammar generating a given object, e.g., a string, tree, or graph. Specific queries, e.g., queries performed by a finite-state automaton, can be evaluated in linear time (and one pass) over such a grammar, thus providing speed-ups proportional to the compression ratio. Grammar-based compression has been known for strings and trees (see, e.g., [32, 34]). This paper investigates grammar-based graph compression. Our main contributions are:
- (1) we generalize to graphs the RePair compression scheme,
- (2) we experimentally evaluate an implementation of the compression scheme, and
- (3) we present new algorithms for query evaluation over compressed graphs.
Let us discuss these contributions in more detail. It is well-known that finding a smallest context-free grammar for a given string is NP-complete [9]. Various approximation algorithms have been considered, see [9]. A particularly effective and simple such algorithm is RePair by Larsson and Moffat [28]. The idea is to repeatedly replace a most-frequent digram in a string by a new nonterminal, and to introduce a corresponding rule for the nonterminal. A digram in a string consists of two consecutive symbols. For instance, the string
[TABLE]
has three occurrences of the digram , two occurrences of the digram , and two occurrences of the digram . Thus, RePair can replace the digram by a new nonterminal, say . The resulting string has two occurrences of the digram which may be replaced by the new nonterminal to obtain this grammar:
[TABLE]
Note that the size of the resulting grammar, i.e., the sum of lengths of the right-hand sides is , i.e., is by one smaller than the length of the original string. RePair has been generalized to ranked, ordered trees by Lohrey, Mennicke, and Maneth [36]. For such trees, a digram consists of two nodes and an edge, i.e., a node and its -th child for some .
Consider the edge-labeled graph on the left of Figure 1a. Our idea is to define a digram as two edges which have at least one node in common. The graph contains three occurrences of the digram consisting of an - and a -edge. The graph also contains three occurrences of the digram consisting of two -edges (and the same for ), but these occurrences are overlapping so that there is at most one non-overlapping occurrence of that digram. If we replace each occurrence of the -digram by a nonterminal edge labeled , then we obtain the graph shown on the right of Figure 1a. The complete graph grammar for this graph is shown in Figure 1b. The size of the original graph (i.e., the sum of the numbers of edges and nodes) is , while the size of the grammar is .
One of the major challenges in implementing RePair for graphs is that we are unable to determine a most-frequent (non-overlapping) digram in linear time. For strings and trees this is achieved by a trivial greedy left-to-right and bottom-up counting procedure, respectively. Let us consider an example to see why in a graph a greedy counting procedure does not find a maximal set of non-overlapping occurrences of a digram. We follow a specific node order and greedily count occurrences of digrams with the current node as center node. If we start with the node numbered “1” in Figure 2a, then we count exactly two non-overlapping occurrences of a digram (note that the gray shaded areas are occurrences of a different digram). If, however, we follow the order of nodes shown in Figure 2b, then we find four occurrences of the same digram. The most efficient way we are aware of for finding a most-frequent (non-overlapping) digram in a graph has quadratic time complexity. This is prohibitively expensive for large graphs. We therefore resort to a greedy counting principle which can be performed in linear time. We experiment with different node orders. Interestingly, a node order based on the “similarity” of nodes inspired by the numbering from the Weisfeiler-Lehman isomorphism test [53] achieves the best results in our experiments.
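The order dependence of greedy counting can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not part of the paper: edges are undirected and unlabeled, any two edges sharing the current center node count as an occurrence of the same digram, and the function name `greedy_digram_count` is hypothetical.

```python
from typing import Hashable, Iterable

Edge = tuple[Hashable, Hashable]


def greedy_digram_count(edges: list[Edge], order: Iterable[Hashable]) -> int:
    """Greedily count non-overlapping digrams (pairs of edges sharing a center node).

    Simplification: every pair of edges that meets in the center node is
    treated as an occurrence of one and the same digram.
    """
    unused = set(range(len(edges)))  # indices of edges not yet in any occurrence
    count = 0
    for center in order:
        # Edges incident with the current center node that are still available.
        incident = [i for i in sorted(unused) if center in edges[i]]
        # Pair them up; every pair is one occurrence and consumes both edges.
        while len(incident) >= 2:
            unused.discard(incident.pop())
            unused.discard(incident.pop())
            count += 1
    return count


# A path a - b - c - d - e with four edges.
path = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]

print(greedy_digram_count(path, ["c", "a", "b", "d", "e"]))  # 1
print(greedy_digram_count(path, ["b", "d", "a", "c", "e"]))  # 2
```

Starting at the middle node consumes the two inner edges and leaves two edges that no longer share a node, whereas starting at `b` and `d` finds two occurrences. Each pass is cheap, but the result depends on the node order, which is why different orders are worth comparing.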
We run experiments with a prototype implementation of our RePair for graphs compression algorithm. Note that typically the algorithm ends up with a large graph that does not contain repeated digrams. Thus, we need an efficient way to represent this rest graph. We use the $k^2$-trees of Brisaboa, Ladra, and Navarro [5]. We compare our compressor against $k^2$-trees, the list-merge compressor (LM) of Grabowski and Bieniecki [24], and the compressor of Hernández and Navarro [27]. The latter first applies a generalization of the dense substructure removal (DSR) of Buehrer and Chellapilla [6] and then uses $k^2$-trees. We find that over network graphs (which have no edge labels), the combination of our new RePair algorithm with DSR gives the best results for all graphs but two (where LM obtains slightly better compression). On RDF graphs (which contain edge labels) we compare our compressor with $k^2$-trees. Here our method consistently obtains better compression, sometimes by a factor of several hundred.
We finally investigate query evaluation over grammar-compressed graphs. For strings and trees one basic technique is to run a finite-state automaton in one pass over the grammar. Unfortunately, for graphs there is no well-accepted notion of a finite-state automaton. Instead, counting monadic second-order logic (CMSO) is considered as the graph counterpart to regular languages. It follows from well-known results that evaluating a fixed CMSO formula over a grammar-compressed graph can be carried out in (data complexity) time where is an upper bound on the time needed to evaluate the formula over a right-hand side of the grammar [11]. Since this may be too expensive for large graphs, we investigate new algorithms for particular queries. Given a graph grammar , (a) reachability queries (for nodes determine if there is a path from to ) can be evaluated in time , and (b) regular path queries (i.e., determine if there exist two nodes with a labeled path between them matching a regular expression ) can be evaluated in time .
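For the string case, the one-pass technique can be sketched as follows: for every nonterminal we compute once how it transforms each automaton state, and compose these transformations along the right-hand sides. This is a minimal illustration under stated assumptions, not the paper's graph algorithm; the function name, the dictionary-based grammar encoding, and the example automaton (parity of the number of `a` symbols) are all hypothetical.

```python
def run_dfa_on_grammar(rules, start, delta, states, q0):
    """Run a DFA over the string generated by a straight-line grammar.

    rules:  nonterminal -> list of symbols (terminals or nonterminals), acyclic
    delta:  (state, terminal) -> state
    Returns the state reached after reading the whole generated string,
    without ever expanding the grammar.
    """
    effect = {}  # nonterminal -> {state: state}, computed once per nonterminal

    def effect_of(symbol):
        if symbol not in rules:  # terminal symbol: one automaton step
            return {q: delta[(q, symbol)] for q in states}
        if symbol not in effect:
            current = {q: q for q in states}  # identity transformation
            for child in rules[symbol]:
                step = effect_of(child)
                current = {q: step[current[q]] for q in states}
            effect[symbol] = current
        return effect[symbol]

    return effect_of(start)[q0]


# Hypothetical straight-line grammar generating the string abcabcabc.
rules = {"S": ["B", "B", "B"], "B": ["A", "c"], "A": ["a", "b"]}

# Hypothetical DFA tracking the parity of the number of a's.
states = [0, 1]
delta = {(q, s): (1 - q if s == "a" else q) for q in states for s in "abc"}

print(run_dfa_on_grammar(rules, "S", delta, states, 0))  # 1 (three a's: odd)
```

Every right-hand side is processed once, so the work is proportional to the grammar size times the number of states rather than to the length of the generated string.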
This paper is based on a preliminary paper that was presented at ICDE 2016 (see [41]). We have added several new results and new material with respect to the ICDE 2016 paper. In particular, we have greatly extended the Related Work section. Section 4.4 on tree- and string-graphs is entirely new; the main result of that section (Theorem 6 and Corollary 7) essentially shows that SL-HR grammars do not allow stronger tree or string compression than do the existing well-known straight-line tree and string grammars. Section 4.5.2 is entirely new; it shows that the maximal rank parameter can heavily influence the compression behavior of our compressor. Section 4.7 is entirely new; it discusses the choice of our grammar formalism and compares it to other formalisms such as node replacement graph grammars. The experimental section has been extended by new experiments over synthetic graphs, which allow us to show the influence of the individual parameters of the compressor. The section on Query Evaluation has been largely extended; for instance, an algorithm to traverse a grammar-represented graph and a new section on regular path queries have been added.
2 Related Work
Our grammar formalism is known as context-free hyperedge replacement (HR) grammars, see [17, 12]. An approximation algorithm for finding a small HR grammar that generates a given graph is considered already by Peshkin [48]. It is based on the SEQUITUR compression scheme [46]. However, experiments are only presented for rather small protein graphs, and we have not been able to obtain their implementation. As far as we know, no other compressor for straight-line graph grammars has been considered. Claude and Navarro [10] apply string RePair on the adjacency list of a graph. This works well, but is outperformed by newer compression schemes such as $k^2$-trees [5]. More database oriented work is found for semi-structured data. For example the XMill-compressor [31] groups XML-data such that a subsequent use of general-purpose compression (e.g. gzip) is more effective. Schema information can improve its effectiveness, but is not required. Deriving schema information from existing data can be seen as a form of lossy compression. DataGuides [23] are a way of doing just that for XML data. As XML documents can be represented as trees, methods to compress trees are applicable. Grammar-based tree compression can be seen as a precursor to the work presented here. One of the first such algorithms was BPLEX [7]. The results of BPLEX were later improved by applying the RePair compression scheme to trees [36], which is what we are proposing to do for graphs.
2.1 Succinct Graph Representations
Several compression approaches have been developed particularly for web graphs. In a web graph, nodes represent pages (i.e., URLs) and edges represent links from one page to another. Web graphs have two properties which are useful for compression:
Locality:
most links lead to pages within the same host (i.e., the URLs have the same prefix) and
Similarity:
pages on the same host often share the same links.
Due to these properties, ordering the nodes lexicographically by their URL provides an order in which similar nodes are close to each other. The WebGraph framework [4] by Boldi and Vigna is originally based on this order, but was later improved with a different order [3]. It represents the adjacency list of a graph using several layers of encodings, while retaining the ability to answer out-neighborhood queries. An out-neighborhood (in-neighborhood) query applied to a node $u$ retrieves all nodes $v$ such that there is an edge from $u$ to $v$ (from $v$ to $u$). As not every graph is a web graph, a lexicographical order of the node-names is not always possible or useful. Apostolico and Drovandi therefore propose to use a BFS-order [2] combined with another encoding. A different approach is proposed by Grabowski and Bieniecki [24], where contiguous blocks of the adjacency list are merged into a single ordered list, together with a list of flags that is used to recover the original lists. They then encode the ordered list and use the deflate-compressor to compress both lists. To our knowledge, their method is the current state of the art in the compression/query trade-off, when only out-neighborhood queries are considered.
The methods above have in common that they encode the adjacency list of a graph and natively only support out-neighborhood queries. The $k^2$-trees of Brisaboa et al. [5] on the other hand compress the adjacency matrix of the graph. They do this by recursively partitioning it into $k^2$ many squares. If one of these includes only 0-values, then it is represented by a 0-leaf in the tree, and otherwise the square is partitioned further. This quadtree-like representation is well known (see, e.g., [55]), but their succinct binary encoding is a clever new approach. The method provides access to both in- and out-neighborhood queries, and can be applied to any binary relation. We use $k^2$-trees to represent the start graph of our grammars. The $k^2$-tree-method was combined by Hernández and Navarro [27] with dense substructure removal, originally proposed by Buehrer and Chellapilla [6]. A dense substructure is defined by two sets of nodes $S$ and $C$ such that they induce a complete bipartite graph. Note that $S$ and $C$ need not be disjoint. The edges in these bicliques are replaced by a single “virtual node”. To our knowledge, the method of [27] is the current state of the art in the compression/query trade-off, when in- and out-neighborhood queries are considered.
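The recursive partitioning described above can be sketched in a few lines. This is only an illustration of the partitioning idea: it returns nested Python lists rather than the succinct bit-vector encoding of [5], it assumes that the matrix dimension is a power of $k$, and the function name `build_k2_tree` is hypothetical.

```python
def build_k2_tree(matrix, k=2):
    """Recursively partition a square 0/1 matrix into k*k blocks.

    Returns 0 for an all-zero block (a 0-leaf), 1 for a single cell holding 1,
    and otherwise a list with the k*k children in row-major order.
    Assumption: len(matrix) is a power of k.
    """
    def build(row, col, size):
        block_is_empty = all(
            matrix[r][c] == 0
            for r in range(row, row + size)
            for c in range(col, col + size)
        )
        if block_is_empty:
            return 0
        if size == 1:
            return 1
        step = size // k
        return [
            build(row + i * step, col + j * step, step)
            for i in range(k)
            for j in range(k)
        ]

    return build(0, 0, len(matrix))


# Adjacency matrix of a graph with the two edges 0 -> 1 and 3 -> 2.
adjacency = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]

print(build_k2_tree(adjacency))  # [[0, 1, 0, 0], 0, 0, [0, 0, 1, 0]]
```

Two of the four quadrants are empty and collapse into single 0-leaves, which is where the space saving on sparse matrices comes from.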
2.2 RDF Graph Compression
The Resource Description Framework (RDF) is a fairly recent specification, first standardized by the W3C in February 2004. The current version is RDF 1.1 (https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/) from February 2014. It is used to represent linked data and semantic information. Its relationship to graphs is similar to XML’s relationship to trees, in that graphs are a natural representation of the structure defined by RDF. Roughly speaking, RDF data is represented as a set of triples $(s, p, o)$, connecting a subject $s$ with an object $o$ by a predicate $p$. Notably, the domains for these can overlap to some degree (for example, values used as predicates in some triples may be subjects in others). Such a set of triples can be represented by having nodes for the subjects and objects, and edges for the predicates. Thus $(s, p, o)$ becomes an edge from $s$ to $o$ labeled $p$. Note that it is not necessary to model overlapping domains as edges pointing to other edges. Instead, there may be a node label that also appears as an edge label. As RDF graphs encode semantic information, the concrete values for subjects, predicates, and objects can be long strings (for example URIs). It is a common practice (see e.g. [19, 45, 1]) to map the possible values to integers using a dictionary and to represent the graph using triples of integers. This also leads to different approaches regarding RDF graph compression: one is to compress the dictionary further (see, e.g., [45, 52]), another to compress the underlying graph structure.
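The dictionary mapping can be illustrated with a small sketch. It assumes a single shared dictionary for subjects, predicates, and objects, which is a simplification chosen for brevity; the function name `encode_triples` and the example values are hypothetical.

```python
def encode_triples(triples):
    """Map RDF terms to integers and return (dictionary, integer triples).

    Assumption: one shared dictionary for subjects, predicates, and objects.
    IDs are assigned in order of first appearance.
    """
    dictionary = {}

    def ident(value):
        # Reuse the ID of a known value, otherwise assign the next free one.
        return dictionary.setdefault(value, len(dictionary))

    encoded = [(ident(s), ident(p), ident(o)) for s, p, o in triples]
    return dictionary, encoded


triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    # A value used as predicate above appears as subject here.
    ("ex:knows", "rdf:type", "rdf:Property"),
]

dictionary, encoded = encode_triples(triples)
print(encoded)  # [(0, 1, 2), (2, 1, 3), (1, 4, 5)]

# Each integer triple (s, p, o) is an edge from node s to node o labeled p.
edges = [(s, o, p) for s, p, o in encoded]
```

In the third triple the ID 1 occurs as a node although it is also an edge label, matching the remark that a node label may also appear as an edge label.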
Of the methods mentioned in the previous section, only the $k^2$-tree has been applied to RDF graph compression [1]. This is done by encoding a separate adjacency matrix for every predicate (i.e., edge label) in the graph. The aforementioned splitting into dictionary and triples was first proposed by Fernández et al. [20] as a data structure called HDT (Header-Dictionary-Triples). Together with an encoding of the triples grouping them by subject as an adjacency list, this yields an in-memory representation of RDF data. They also investigate compressing this data structure using conventional general-purpose compression methods (gzip, bzip2, and ppmdi), beating these universal compressors used on the plain original file. Another approach of Swacha and Grabowski [49] combines techniques to compress graph and dictionary achieving a succinct representation of the RDF graph, which is subsequently compressed by general purpose compressors.
2.3 Queries with Compressed Input
Often compressed input induces an exponential blow-up in complexity when deciding queries on the represented data. Such “upgrading-theorems” have been shown for example for the representation of graphs as boolean circuits, which can be exponentially smaller than an explicit representation, but in turn make queries exponentially harder to answer (see e.g. [21, 47]). Fortunately, upgrading-theorems do not hold for grammar-compressed graphs as already shown in [29]. Indeed, it is possible to obtain a speed-up proportional to the compression ratio for certain queries. Lengauer and Wanke [30] show that connectivity of graphs specified using a hierarchical graph definition (which is essentially the same as the hyperedge replacement mechanism we use in this paper) can be decided in linear time with respect to the size of the compressed representation. They also show this for biconnectivity and (undirected) reachability. For strong connectivity however, they only get a quadratic upper bound. It should be noted that there is still an increase in complexity here, but a smaller one. Lengauer and Wagner [29] study several problems and their complexity on explicit versus hierarchical representations. They show, for example, that reachability is complete for P on hierarchical graphs, but the problem is NL-complete for explicit representations. The latter class is widely believed to be a proper subset of the former, but both classes only include problems which can be solved in polynomial time, so there is no exponential blow-up in computation time. As the hierarchical representation can be exponentially smaller than the represented graph, this makes a speed-up possible. This smaller increase in complexity is also stated by Marathe et al. [44, 42, 43], who show that some NP-complete problems on graphs (e.g., 3-Colorability, Max Cut, and Vertex Cover) become PSPACE-hard on hierarchically defined graphs, which again does not imply an exponential blow-up in computation time (under widely held assumptions in complexity theory). They also present polynomial approximation algorithms for some of these problems with hierarchical input. Note that we cannot conclude that hierarchical versions of classical problems always lead to a speed-up: in the same paper they show that the P-complete threshold network flow problem becomes PSPACE-complete for hierarchical input graphs.
3 Preliminaries
For we denote by the set . A ranked alphabet consists of an alphabet together with a mapping that assigns a rank to every symbol in . For the rest of the paper, we assume that is fixed, and of the form for some integer .
As we do some comparisons to grammar-based string and tree compression, let us recall some definitions. We denote the empty string by , and for two strings the concatenation of and by , or if that is unambiguous. Let be an alphabet (that is, a finite set of symbols). The size of a string ( for ) is defined as with , and we write to express that is part of the string . We recursively define trees over a ranked alphabet as the smallest set , such that for , if , then , provided that and . The nodes of a tree can be addressed by their Dewey addresses . For a tree , is recursively defined as . Thus addresses the root node, and the -th child of . For we denote by the symbol at . The size of is , i.e., the number of edges in .
A hypergraph over (or simply graph) is a tuple where , is the set of nodes, is a finite set of edges, is the attachment mapping, is the label mapping, and is a string of external nodes. We define the rank of an edge as and require that for every edge in . The latter requirement (and the attachment mapping mapping to ) mean that we do not allow edges that are not attached to any nodes. We add the following two restrictions on hypergraphs:
- (C1) for every edge, its attachment string contains no node twice, and
- (C2) the string of external nodes contains no node twice.
An edge is simple, if its rank equals two. A hypergraph is simple, if all its edges are simple, and for any distinct edges of it holds that or , i.e., has no “multi-edges”. For a hypergraph we use , and to refer to its components. We may omit the subscript if the hypergraph is clear from context. The rank of a hypergraph is defined as . Nodes that are not external are called internal. We define the node size of as , the edge size as
[TABLE]
and the size as . We denote the set of all hypergraphs over by . Note that our size definition differs slightly from the one given in [30], because they are encoding hyperedges using bipartite graphs. Thus, an edge of rank in our model has size (or 1 if ), whereas they would calculate a size of edges and one node. This makes our sizes slightly smaller. For two nodes we say that there is a path from to , if there exists a sequence of edges for some , such that (), with and , and for every if then there exists a such that for some . For an edge let be the first node is attached to. For a path we refer to the nodes as the nodes on the path . We call internal, if all the nodes on are internal.
An example of a hypergraph can be seen in Figure 3. Formally, the pictured graph has , , , , and . Note that external nodes are filled black and have indices below them indicating their position within . Similarly, the hyperedge of rank has indices indicating the order of the attached nodes (simple edges are drawn directed, from their first to second attachment node). In the following we sometimes omit either of these indices. In these cases, we either use colors to indicate this order, or the specific order is irrelevant for the example.
Definition 1.
A hyperedge replacement grammar (HR grammar) over is a tuple , where is a ranked alphabet of nonterminals with , is the set of rules such that for every , and is the start graph.
The size of is defined as , and similarly the edge and node sizes and . The rank of an HR grammar is defined by . We often write for a rule and call the left-hand side and the right-hand side of . We call symbols in terminals. Consequently an edge is called terminal if it is labeled by a terminal and nonterminal otherwise.
To define a derivation relation for the grammar , we first introduce some notation. Let be a hypergraph, a bijective function, and the natural extension of to strings. We call a node renaming on . We define , where for all . For a hypergraph and an edge we denote by the hypergraph obtained from by removing the edge . For two hypergraphs we denote by the union of the two hypergraphs defined by . Note that this union
1. merges nodes that exist in both and ,
2. creates disjoint copies of , , and , and
3. uses the external nodes of only.
Now, let be hypergraphs and such that . Let be a node renaming on such that , and for every internal node of . The replacement of by in is defined as .
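A replacement step of this kind can be sketched as follows. The sketch assumes a dictionary-based hypergraph representation (keys `nodes`, `edges`, `ext`), integer node IDs, and a caller-supplied iterator of fresh IDs; all of these, as well as the function name `replace_edge`, are hypothetical choices for illustration rather than the paper's formal notation.

```python
from itertools import count


def replace_edge(g, edge_index, h, fresh):
    """Replace the edge g["edges"][edge_index] by a copy of the hypergraph h.

    The i-th external node of h is identified with the i-th node attached to
    the replaced edge; every internal node of h receives a fresh ID.
    Assumption: `fresh` yields IDs that do not occur in g.
    """
    _, attached = g["edges"][edge_index]
    if len(attached) != len(h["ext"]):
        raise ValueError("rank of the edge and of the replacement differ")

    # Node renaming: external nodes map to attached nodes, the rest is fresh.
    rename = dict(zip(h["ext"], attached))
    for node in sorted(h["nodes"]):
        if node not in rename:
            rename[node] = next(fresh)

    nodes = g["nodes"] | {rename[v] for v in h["nodes"]}
    edges = [e for i, e in enumerate(g["edges"]) if i != edge_index]
    edges += [(label, tuple(rename[v] for v in att)) for label, att in h["edges"]]
    return {"nodes": nodes, "edges": edges, "ext": g["ext"]}


# Right-hand side for a nonterminal A of rank 2: an a-edge followed by a b-edge.
rhs = {"nodes": {1, 2, 3}, "edges": [("a", (1, 3)), ("b", (3, 2))], "ext": (1, 2)}
# Start graph with two A-edges.
start = {"nodes": {1, 2, 3}, "edges": [("A", (1, 2)), ("A", (2, 3))], "ext": ()}

fresh_ids = count(4)
step1 = replace_edge(start, 0, rhs, fresh_ids)
step2 = replace_edge(step1, 0, rhs, fresh_ids)
print(step2["edges"])
# [('a', (1, 4)), ('b', (4, 2)), ('a', (2, 5)), ('b', (5, 3))]
```

Each replacement introduces one new internal node (4 and 5 here), while the attached nodes of the replaced edge are shared with the rest of the graph.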
For a grammar we define its derivation relation as follows. For , if and only if there is a nonterminal edge in such that . For we write , if there is a sequence of derivation steps to derive from , and extend this to , if for some . The language of a grammar is defined as . We omit the subscript where the grammar is clear from context. For an example of a derivation, see Figure 4, which shows the full derivation of the grammar given in Figure 1b. Note that the nodes in the start graph initially have IDs and , whereas the nodes that originate from the internal node of the right-hand side of get variables , and . Any pairwise different values for these variables that are different from and yield a correct derivation of this grammar.
Definition 2.
An HR grammar is called straight-line (SL-HR grammar) if
1. the relation is acyclic,
2. for every there exists exactly one rule with , and
3. has no useless (unreachable) rules, i.e., if and , then .
Note that and contains (due to the element of choice in the node renaming, possibly infinitely many) only isomorphic graphs. As the right-hand side for a nonterminal is unique in SL-HR grammars, we denote the right-hand side of by . By convention, whenever we state something about all right-hand sides of a grammar, this includes the start graph (i.e., is assumed to be implicitly in and ). The height of an SL-HR grammar is the height of . Note that in terms of the expressive power of HR grammars for defining graph languages, the conditions (C1) and (C2) do not have an effect, see [25, Chapter 1, Theorem 4.6]. In terms of compression power, however, condition (C1) is harmless, but (C2) is not. See Section 4.7 for more details. In the following we often say grammar instead of SL-HR grammar.
3.1 Creating Graphs with Unique Node IDs
Let be an SL-HR grammar. Instead of considering all the (possibly infinitely many) isomorphic graphs in the set , we would like to fix one particular graph of . First of all, we fix a particular node renaming during a derivation step. Let be a graph with a nonterminal edge labeled , and let . Further, let and let be the internal nodes of such that for . In the definition of , we now require for .
This alone is not enough. We also need to define in which order the nonterminal edges are derived, to make sure the full derivation ends up with a unique graph. To do so, let first be a graph, and is nonterminal be the set of nonterminal edges in . We define the sibling-tuple of as the tuple such that , and if (for , i.e., comes before in the tuple), then
- (here is the lexicographical order), or
- and .
Note that the order for edges with and is arbitrary, as it does not matter. Intuitively, the sibling-tuple is an ordered sequence of siblings within the derivation tree of . Using this order we define the derivation-tree for an SL-HR grammar recursively in the following way: let be a nonterminal edge labeled , let , and let for . Then . To define the root of the tree, let , and . Figure 5 shows an example of a derivation tree. The order in which the nonterminal edges of are derived is now given by a pre-order traversal (i.e., a depth-first traversal, which always visits the leftmost unvisited node next) of . This yields a unique hypergraph out of the many isomorphic options in . We denote this hypergraph by . We denote the hypergraph derived by a single edge in this way by . Finally we denote by the hypergraph derived from a graph with many nodes attached to a single -edge.
For an example consider first again the derivation shown in Figure 4. The only allowed choice of node-IDs for here would be , , and . Furthermore, Figure 6 shows the graph resulting from the derivation of the grammar given in Figure 5. Note how in both cases the order in which the -edges between nodes and are derived does not matter: if we were to switch the node-IDs created by them ( and in Figure 6) we would still have the same graph, not just an isomorphic one.
3.2 String- and Tree-Graphs
Strings and trees can be conveniently encoded as hypergraphs. As the latter structure is more complex, this affects the size. We make use of these encodings, when discussing the relation of graph compression to previously proposed string- and tree-compression in Section 4.4. Let be a string. We define the hypergraph representing the string as , where . For an edge we let and . Finally the external nodes are . The size of is . Note that the external nodes indicate beginning and end of the string. Figure 7a shows for .
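As a rough illustration of this encoding, the following sketch builds a string-graph as a chain of labeled edges. It assumes that nodes are numbered 0 to n for a string of length n and that the string is nonempty; the numbering, the dictionary layout, and the function name `string_graph` are hypothetical.

```python
def string_graph(word):
    """Encode a nonempty string as a chain of labeled rank-2 edges.

    Assumption: nodes are numbered 0..n; the i-th symbol labels the edge
    from node i to node i + 1. The external nodes mark beginning and end.
    """
    if not word:
        # With an empty string both external nodes would coincide.
        raise ValueError("the sketch only handles nonempty strings")
    n = len(word)
    return {
        "nodes": list(range(n + 1)),
        "edges": [(symbol, (i, i + 1)) for i, symbol in enumerate(word)],
        "ext": (0, n),
    }


g = string_graph("abc")
print(g["edges"])  # [('a', (0, 1)), ('b', (1, 2)), ('c', (2, 3))]
print(g["ext"])    # (0, 3)
```

A string of length n thus needs n edges and n + 1 nodes, which is why the graph encoding is larger than the string itself.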
The encoding for trees is a little less straightforward, because the node labels are moved into the edges, and the children of a node in a tree are ordered. We convert a node with $k$ children into a hyperedge of rank $k+1$ (as it is connected to the parent as well). The order of the hyperedge retains the order of the children and the parent is always located at the first node attached to a hyperedge. This is a well-known encoding, see e.g., [15]. Let be a tree, and let be the lexicographical order on . For an address we denote by the position of within the order , such that . We define the hypergraph representing the tree as , where for and ,
- •
, and
- •
.
The size of is between (for a monadic tree ) and (for a tree where every inner node has rank greater than ). Figure 7b shows for .
4 GraphRePair
We present our generalization to graphs of the RePair compression scheme.
4.1 RePair Compression Scheme for Strings and Trees
Let us first explain the classical RePair compressor for strings and trees. The RePair compression scheme is an approximation algorithm for a smallest context-free grammar for a given string. The smallest grammar problem is the problem of deciding, given a string and a number , whether there exists a straight-line grammar of size at most generating . It was shown by Charikar et al. [9] that the smallest grammar problem is -complete. Hence, already for string-graphs it follows that finding a smallest grammar is -hard. In a string, a digram consists of two consecutive symbols. The idea of Larsson and Moffat’s RePair Compression scheme [28] is to repeatedly replace all (non-overlapping) occurrences of a most frequent digram by a new nonterminal. The process ends when no repeating digrams occur. Consider as an example the string
$$abcabcabc$$
It contains occurrences of the digrams ab (3 times), bc (3 times), and ca (2 times). If ab is replaced by A, then we obtain AcAcAc. If now Ac is replaced by B, then we obtain this grammar
$$S \to BBB, \qquad B \to Ac, \qquad A \to ab$$
Note that the original string has size 9, whereas the grammar has size 7 (the size of a grammar is the sum of the sizes of its right-hand sides). Note that overlapping occurrences only exist for digrams of the form aa, i.e., digrams that repeat the same symbol.
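The replacement loop can be sketched in a few lines of Python. This is a naive quadratic-time illustration, not the linear-time algorithm of Larsson and Moffat; the function names, the generated nonterminal names `N0`, `N1`, and the tie-breaking rule (first digram encountered wins) are assumptions made for this sketch.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each digram, scanning left to right."""
    counts = Counter()
    last_start = {}
    for i in range(len(seq) - 1):
        digram = (seq[i], seq[i + 1])
        if last_start.get(digram) == i - 1:
            continue  # overlaps the previous occurrence (only possible for "aa")
        counts[digram] += 1
        last_start[digram] = i
    return counts


def replace_digram(seq, digram, nonterminal):
    """Replace occurrences of digram greedily from left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == digram:
            out.append(nonterminal)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Return (start sequence, rules) after replacing digrams until none repeats."""
    seq, rules = list(text), {}
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        digram, freq = max(counts.items(), key=lambda item: item[1])
        if freq < 2:
            break
        nonterminal = f"N{len(rules)}"  # hypothetical naming scheme
        rules[nonterminal] = list(digram)
        seq = replace_digram(seq, digram, nonterminal)
    return seq, rules


def grammar_size(start, rules):
    """Sum of the lengths of all right-hand sides, including the start rule."""
    return len(start) + sum(len(rhs) for rhs in rules.values())


start, rules = repair("abcabcabc")
```

On the example string this yields a start sequence of three identical nonterminals and two rules, for a total grammar size of 7.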
Computing such a grammar from a given string in linear time requires a set of carefully designed data structures. The input string is represented as a doubly linked list. Additionally a list of active digrams (digrams that occur at least twice) is maintained. Every entry in the list of active digrams points to an entry in a priority queue of length $\sqrt{n}$ (with $n$ being the length of the input string) containing doubly linked lists. The list with priority $i$ in the queue contains every digram that occurs $i$ times; the last list contains every digram occurring $\sqrt{n}$ or more times. The list items also contain pointers to the first occurrence of the respective digram. This queue is used to find the most frequent digram in constant time. Larsson and Moffat [28] prove that a queue of length $\sqrt{n}$ guarantees a linear runtime for the complete algorithm. Basically, in a string of length $n$ the most occurrences a digram can have is $n/2$, and in that case no other digram can occur as often in the string. There may, in general, be more than one digram occurring more than $\sqrt{n}$ times, but the last list cannot have more than $\sqrt{n}$ entries, and the frequency of the most frequent digram is monotonically decreasing. This overall leads to a linear running time. All these data structures are updated whenever an occurrence is removed. Consider removing one occurrence of ab in the example above: when doing so, one occurrence of bc and possibly ca need to be removed from the list. On the other hand, new occurrences of Ac and possibly cA are created.
An even smaller grammar than the one above can be obtained through pruning, which removes nonterminals that are referenced only once. In the grammar above, pruning would remove the nonterminal A, which is referenced only in the rule of the nonterminal that replaced Ac, so that the right-hand side of that rule becomes abc. This reduces the size of the grammar from 7 to 6. Note that pruning can never increase the size of the grammar (but it does not always decrease it).
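The pruning step for string grammars can be sketched as follows. The dictionary-based grammar layout and the nonterminal names `A` and `B` are assumptions for illustration; the sketch inlines every nonterminal that is referenced exactly once and repeats until no such nonterminal remains.

```python
from collections import Counter


def prune(start, rules):
    """Inline every nonterminal that is referenced exactly once in the grammar."""
    start = list(start)
    rules = {nt: list(rhs) for nt, rhs in rules.items()}
    while True:
        # Count references to nonterminals over all right-hand sides.
        refs = Counter(s for rhs in [start, *rules.values()] for s in rhs if s in rules)
        once = [nt for nt in rules if refs[nt] == 1]
        if not once:
            return start, rules
        nt = once[0]
        body = rules.pop(nt)

        def inline(rhs):
            return [x for s in rhs for x in (body if s == nt else [s])]

        start = inline(start)
        rules = {other: inline(rhs) for other, rhs in rules.items()}


# Grammar for abcabcabc before pruning: size 3 + 2 + 2 = 7.
pruned_start, pruned_rules = prune(["B", "B", "B"], {"A": ["a", "b"], "B": ["A", "c"]})
pruned_size = len(pruned_start) + sum(len(rhs) for rhs in pruned_rules.values())
```

After pruning, only one rule remains besides the start rule, and the size drops from 7 to 6.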
RePair was generalized to trees by Lohrey et al. [36]. Here a digram consists of two nodes, and the -edge between them, with meaning that the second node is the -th child of the first node, denoted by . Note that overlapping occurrences can only happen for digrams of the form . A digram has, in a binary tree, at most three “dangling edges”. Dangling edges in context-free tree grammars are represented by parameters of the form . The number is the rank of the rule (digram). E.g. the -rule in Figure 9 represents a digram of rank 1, while the -rule represents a digram of rank 3. The rank of a grammar is defined as the maximal rank of a rule occurring in the grammar. Parameters have a similar role to external nodes in HR grammars, as they indicate how the right-hand side is connected with the tree during a derivation. Similarly to string grammars, the size of a tree grammar is the sum of sizes of its right-hand sides. Note that neither of the two replacements shown in Figure 9 make the resulting grammar smaller and would thus not be considered contributing: the digram represented by the -rule only occurs once, therefore replacing it makes the grammar larger. Replacing the digram represented by the -rule reduces the size of the tree by , but the right-hand side of the rule has size two as well, thus leading to a grammar that has the same size as the original graph. Were the digram to occur once more however, the replacement would become contributing. The rank of a grammar also has an impact on further algorithms run on it, see e.g. [35, 37]. Keeping the rank small is thus desirable. Therefore, TreeRePair has a user-defined “maxRank” parameter.
4.2 RePair on Graphs
The first step in generalizing RePair to graphs is to define the notion of a digram in a graph. For simple graphs two options come up naturally: two neighboring nodes and the edge between them, or two edges with a common node. The first option is not viable, as it cannot compress some very basic graphs: consider a cycle consisting of consecutive edges (and nodes). Replacing a digram of the first kind removes neither nodes nor edges, and only relabels edges by a nonterminal. Thus no compression is achieved. We therefore use the latter option.
Definition 3.
A digram over is a hypergraph , with such that
1. for all , or ,
2. there exists a such that and , and
3. .
Every possible digram over undirected, unlabeled edges is shown in Figure 8. Note that we grouped some of the digrams. This is because, one can represent one by the other by using a different order in the -relation of the nonterminal edge. Another example for digrams are the right-hand sides of the two -rules in Figure 10. Note that both grammars in the figure generate the graph on the left. However, they differ in size: the grammar in the middle has size 12, while the grammar on the right has size 9 (recall that simple edges have size 1).
As mentioned before, RePair replaces a digram that has the largest number of non-overlapping occurrences.
Definition 4.
Let such that is a digram. Let be the two edges of . Let and let be the set of nodes incident with edges in . Then is an occurrence of in , if there is a bijection such that for and
1. if and only if ,
2. , and
3. if and only if for some .
The first two conditions of this definition ensure that the two edges of an occurrence induce a graph isomorphic to . The third condition requires that every external node of is mapped to a node in that is incident with at least one other edge. Thus, the edges marked in the graph on the left of Figure 10 constitute an occurrence of the digram of Figure 8, which is the same as the one of the -rule of the right grammar, but not an occurrence of the digram of the -rule in the middle grammar (digram of Figure 8). We call the nodes in that are mapped to external nodes of , attachment nodes of , and the ones mapped to internal nodes, removal nodes of . Two occurrences of the same digram are called overlapping if . Otherwise they are non-overlapping. If there are at least two non-overlapping occurrences of in a graph , we call an active digram.
Let be a symbol of rank and a digram of rank . The replacement of an occurrence of in by is the graph obtained from by removing the edges in from , removing the removal nodes of , and adding an edge labeled that is attached to the attachment nodes of , in such a way that applying the rule yields the original graph. Consider again Figure 10: the start graph of the right grammar is the replacement of the shaded occurrence of in the left graph by (where is of the grammar on the right).
4.3 The Algorithm
Given a graph , gRePair first performs the steps given in Algorithm 1. As an additional step after the loop finishes, we connect the disconnected components of the graph by virtual edges and run the algorithm again before pruning (and in a final step remove the virtual edges from the grammar). This improves the compression on graphs with disconnected components. We now provide more details on the steps in lines 5 and 12, and on the pruning step.
4.3.1 Counting Occurrences (Step 5)
We aim to find a set of non-overlapping occurrences of a digram that occurs in , that is of maximal size. As stated in the Introduction, this can be solved in time. This is done by reducing it to maximum matching: Let be a graph with nodes and edges, and a digram. The first step is to compute a set containing all occurrences of in , including overlapping ones. This already takes time, as . To see this, consider a “star”; a graph with one central node connected to other nodes, which in turn have no neighbors except the central node. There are pairs of edges and thus as many overlapping occurrences of the same digram. We encode this set of occurrences into a graph such that every occurrence in is represented by an edge from to in . Thus, potentially has a node for every edge in , and an edge for every occurrence in . It can be shown that a maximum matching (i.e., a maximal cardinality set of edges such that no two edges have a common node) on corresponds to a maximal non-overlapping subset of . Computing a maximum matching in graphs can be done in by using, e.g. the Blossom-algorithm [13]. As has nodes, and edges, the total running time is in .
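The quadratic blow-up on a star can be checked with a small sketch. It enumerates all pairs of edges that share a node, which is a simplification of the occurrence set: the digram type and the condition on external nodes are ignored here. The function names and the edge-list representation are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def edge_pairs_sharing_a_node(edges):
    """Return all pairs of edge indices whose edges have a common node."""
    incident = defaultdict(list)
    for index, (u, v) in enumerate(edges):
        incident[u].append(index)
        if v != u:
            incident[v].append(index)
    pairs = set()
    for indices in incident.values():
        # Indices are appended in increasing order, so each pair is (small, large).
        pairs.update(combinations(indices, 2))
    return pairs


def star(leaves):
    """A central node 0 connected to the nodes 1..leaves."""
    return [(0, leaf) for leaf in range(1, leaves + 1)]


star_pairs = edge_pairs_sharing_a_node(star(6))
```

For a star with 6 leaves this produces 15 candidate pairs, and in general a number of pairs quadratic in the number of edges, while a non-overlapping subset can use each edge at most once.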
Doing the above for one digram is already prohibitively expensive. Thus we approximate. Let be an order on the nodes of . We traverse the nodes of in this order, and at every node iterate through occurrences centered around this node, as detailed in the next section below. The node order heavily influences the compression behavior. Consider the graph in Figure 11. We want to find the non-overlapping occurrences of the first digram in Figure 8. Note that all three nodes are external, that is, we are looking for three nodes such that has an edge to , has an edge to , and all three nodes also have edges to other nodes (different from ). Figure 11a shows the non-overlapping occurrences found if we start in the central node of the graph. Using the DFS-type order starting at a different node given by the numbers in Figure 11b, three occurrences are determined. Using the “jumping” order in Figure 11c, a maximum set of four non-overlapping occurrences is found. Note that for strings and trees, maximum sets of non-overlapping occurrences can be obtained in linear time using left-to-right and post order, respectively, and assigning occurrences in a greedy way.
Our implementation offers a choice of four different orders, as explained in Section 4.5.1. Implementation details of this step are given in Section 4.6.1.
4.3.2 Updating Occurrence Lists (Step 12)
Let be an occurrence of that is being replaced and let be the set of edges in that are incident with the attachment nodes of . Removing from the graph can only affect the occurrence lists of digrams that have occurrences using edges in . In particular, for the two edges in ( and ) we need to remove every occurrence that either or participate in from the corresponding occurrence list. To do this efficiently, we store a list of digrams with every edge for which the edge appears in an occurrence. After the replacement let be the new -labeled nonterminal edge in . Then every pair of edges is an occurrence of a digram, and is thus inserted into the appropriate occurrence list (and the frequency counts are updated accordingly). The last step has again complexity issues. Let be the sum of degrees of all attachment nodes of . Then there are pairs of edges to be considered as occurrences with the new nonterminal edge. This is not a problem in itself, but consider the following situation: let there be an attachment node with degree . Further, let every one of the edges around be part of a distinct occurrence of the digram being replaced. As explained, when replacing one of these occurrences, the other edges are considered as occurrences with the new nonterminal. Now however, when replacing the next one, the remaining edges have to be considered again. Thus, during all the replacement steps, we would again need to consider occurrences. To solve this issue, we consider first the case of a graph without edge labels and directions. Let node have degree , i.e., there are edge pairs that are occurrences of some digram. After inserting one of them into the occurrence list, we take the two edges involved out of further consideration, because every other occurrence using one of them would overlap with this one. Let be the edges incident with that have not been added into occurrence lists yet. We partition into two sets and where . We then add as the occurrences around to the list. 
Note that this procedure is guaranteed to produce a maximum non-overlapping set of occurrences around the node only if all occurrences around it are occurrences of the same digram.
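For the unlabeled, undirected case, the pairing around a single node can be sketched as below. This is an illustrative reading of the partitioning step: the free edges are split into two halves of equal size and paired position by position, so that every edge takes part in at most one occurrence. The function name, the `used` set, and the choice of which edge is left over for an odd count are assumptions.

```python
def pair_free_edges(incident_edges, used):
    """Pair up the not-yet-used edges around one node without overlaps."""
    free = [edge for edge in incident_edges if edge not in used]
    half = len(free) // 2
    # Two halves of equal size; with an odd count the last edge stays unpaired.
    first, second = free[:half], free[half:2 * half]
    occurrences = list(zip(first, second))
    used.update(first)
    used.update(second)
    return occurrences


used = set()
occurrences = pair_free_edges(["e1", "e2", "e3", "e4", "e5"], used)
```

Each edge is inspected once, so the work per node is linear in its degree rather than quadratic.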
From here, adding labels (or directions, which can be viewed as labels) is straightforward. For two labels and let be the set of edges incident with labeled and not yet added to an occurrence list for a digram with an edge labeled . Then for distinct symbols and add the occurrences and for every add the occurrences where both edges have the same symbol by splitting as above. This takes time (where is expected to be small).
The above method works during the initial traversal, however every time an occurrence of a digram is replaced, the occurrence lists of affected digrams need to be updated. After inserting the new nonterminal edge we need to select an edge from the set of neighboring edges to make an occurrence of a new digram. This can again lead to a situation where time quadratic in the node degree is necessary, unless this selection is done in constant time. Our implementation does this by storing a list of available edges for every pair of edge labels attached to every node of the graph. For every edge label the first edge in the respective list is selected to create the occurrence . This takes time.
4.3.3 Pruning
As mentioned in Section 4.1, pruning can reduce the size of a grammar. For string grammars, pruning removes every nonterminal that is referenced only once in the grammar. The presence of parameters for tree grammars, or external nodes for HR grammars, complicates the pruning step. While every nonterminal that is referenced only once can be removed just as before, there may still be nonterminals that are referenced more than once but do not contribute to the compression (this is not the case for strings). To see this, consider the following tree grammar (from [36, Example 8]):
[TABLE]
In this grammar and are each referenced twice, but do not “contribute” to compression. To see this, note that the grammar above has size 12 and consider the grammars obtained by removing either nonterminal (left: after removing , right: after removing ):
[TABLE]
These grammars now have sizes 11 and 12, respectively. Clearly, neither rule was contributing in the full grammar above, as neither grammar is now larger than before. However, the removal shown on the left did make the grammar smaller than it was before. Further, if we were to remove one more nonterminal, we would in both cases get the original tree
[TABLE]
which is still of size . This shows that the rule was still not contributing after removal of the -rule, but the -rule turns into a contributing rule if the -rule is removed first. Thus the order in which the nonterminals are removed may also affect the quality of the pruning. Finding an optimal order is a complex optimization problem as mentioned in [36, Section 3.2].
Let us consider pruning for SL-HR grammars. For a nonterminal of rank we define . The contribution of is defined as
[TABLE]
where is the number of edges labeled in the grammar. The contribution of counts by how much the size of the grammar changes when every instance of the nonterminal is derived, i.e., it measures how much contributes towards compression. If then we say that contributes towards compression. The grammar in Figure 12 represents the graph of Figure 11. Here, the -rule has and thus contributes to the compression. The reader may verify that the sizes of this grammar and the graph (given in Figure 13, with the IDs assigned as explained in Section 3.1) differ by exactly three. Note that, as we remove rules, the contribution of other nonterminals might change as edges are added or deleted. Therefore, the effectiveness of pruning depends on the order in which the nonterminals are considered. For TreeRePair, a bottom-up hierarchical order works well in practice. We use a similar approach. First every nonterminal with is removed, because, by definition, they do not contribute towards compression. The order does not matter for this step. To remove we apply its rule to each -edge in the grammar and remove the -rule. Then we traverse the nonterminals in bottom-up -order (see Preliminaries), removing each nonterminal with .
4.4 Relation to Compression of Strings and Trees
Our algorithm is a generalization of previous RePair-variants for strings [28] and trees [36]. Let be a string, and be the digrams that string-RePair would replace during its run on . If we run gRePair on , and use a left-to-right node order, then the digrams that gRePair replaces are exactly to . Therefore, we obtain a graph-encoding of the same string-grammar that the original RePair generates. The comparison with TreeRePair is less clear. TreeRePair compresses ordered trees where the nodes, not the edges, are labeled from a finite alphabet. For gRePair to compress trees in the same way, we need to represent the tree as a tree-graph as given in Section 3.2 and use a similar postOrder-traversal as TreeRePair does. We can experimentally confirm that gRePair achieves comparable (within the same order of magnitude) compression ratios as TreeRePair.
An SL tree grammar achieves compression by representing only once repeating connected subgraphs of the given tree. An SL-HR grammar is more general, because it can share repeating disconnected subgraphs. Does this allow, for an SL-HR grammar, to compress a tree more effectively than any SL tree grammar? We show that this is not the case by proving that every SL-HR grammar that represents a tree (string), can be converted to an SL-HR grammar, that is only slightly larger than and for which every right-hand side is a tree (string). This is not trivial, as the expressive power of graph grammars is higher than the one of context-free string grammars. For example, it is possible to generate the language using HR grammars, but the string-language represented here is not context-free. We call an SL-HR grammar tree generating, if is a tree-graph as defined in Section 3.2. Similarly, is called string-generating, if is a string-graph. Recall that a path from a node to a node in a hypergraph is a tuple of edges, which connect to , such that an edge of rank is considered to have one source (the first attached node) and target nodes, and that we call a path internal, if it uses only internal nodes (with the possible exception of and ). For example, in Figure 14 the right-hand side of has a path from the green to the red node, but it is not an internal path, because it crosses the orange node.
Definition 5.
Let be a graph with the set of external nodes. Then the line-structure is the directed, unlabeled graph such that for all if and only if there is an internal path from to in .
For a nonterminal of an SL-HR grammar we denote by . For an example of a line-structure, see Figure 14. Note that if a grammar is tree-generating, is always a rooted forest for every nonterminal in (otherwise there would be a cycle in , because is tree-generating and is useful). We refer to this property by (LG). However, the right-hand side of may not be a tree-graph itself, as in the -rule in Figure 14. There are two ways in which for a rule in a tree-generating grammar can not be a tree-graph: it can contain cycles, or nodes with multiple parents. Both of these cases can only happen with nonterminal edges, however. The line-structure is a tool to split these nonterminal edges, such that the resulting graphs are tree-graphs.
Theorem 6.
Let be a tree-generating SL-HR grammar. We can construct an SL-HR grammar such that
1. ,
2. , and
3. is a tree-graph for every .
Proof.
We describe three transformations on , which yield the desired grammar . In the first transformation, the set of nonterminals remains unchanged. We use the line-structure to order the external nodes in a top-down way, i.e., if is an ancestor of within the line-structure, then should be ordered before . Let for some nonterminal , and let . We define the partial order such that if and only if there is a directed path from to in . Due to the property (LG) mentioned above, is well-defined. Let be a rule, and two of its external nodes, such that , but appears before in . We reorder (and by extension, the attachment relation of every nonterminal edge labeled in the entire grammar) such that they are sorted according to some (arbitrary, but fixed) total order consistent with . This step does not change the structure of the grammar or affect its size. As an example, consider for in Figure 14. In all three cases, the red external node, needs to come first to be consistent with , but it is currently third. The orange and green nodes can be ordered arbitrarily after that. Permissible orders for thus are red orange green or red green orange (see Figure 15 for the final grammar).
In the next two steps, additional nonterminals are (in general) added to the set of nonterminals. We iterate over in a bottom-up fashion, i.e., in reverse -order. If we encounter an such that has at least two connected components then we add rules for , where is a new nonterminal of rank , and is a subgraph of . Note that . The set of nodes of is defined as
[TABLE]
i.e., contains the nodes of that can be reached from the external nodes in . The set of edges of contains every edge of , which is attached only to nodes in . Let be a string and a set of indices. By we denote the string such that . We define the set of -indices as
[TABLE]
Using this, we define , and for some edge labeled we define . Now, we replace every occurrence of an edge labeled within the grammar by edges, such that for the edge is labeled and attached to the nodes . We remove the nonterminal and its rule from the grammar. Note that now all right-hand sides are tree graphs. Note further that , and and therefore the size of the new grammar is at most .
In the third step we eliminate rules that have external nodes, which are neither root nor leafs. To do so, we again split the rules up. Let be a rule which has at least three external nodes such that there is a path from to and from to , i.e., is an inner node and not root or leaf. Let be the maximally sized subgraphs of , such that their union is , and for every , is a tree-graph which does not have a triple of external nodes such that paths exist from to and to , i.e., every external node of is either a root or a leaf. Let be an -labeled edge of the grammar and let and be defined as above. Now we again replace by the rules and every nonterminal edge with by the edges , where and . In this case, an external node that is neither root nor leaf will appear in more than one of the graphs , thus this step increases the size of the grammar. This can be bounded by
[TABLE]
The factor comes from the fact that, in the worst case, we may split at every external node. Thus the node- and edge-sizes are at most doubled. Therefore, for the grammar we have . ∎
As an example, we provide in Figure 15 the conversion of the rules presented in Figure 14. We first adjust the order of the external nodes to one consistent with for by using red orange green as the order of the external nodes in all three cases. Thus, the first external node becomes the third one, and the third external node becomes the first one. In the same way, the appropriate nonterminal edges ( and in ) have their attachment relation reordered. We then apply the second step, splitting the -rule into two rules and . In this case the conversion was already finished after the second step, thus the new rules are not larger than the original ones (in fact, they are smaller, because the -edge now is of rank , which counts as size 1).
Consider again Theorem 6. It clearly follows from the construction of that . This allows us to easily transfer the result also to string-graphs. Let be a string-generating SL-HR grammar. Recall that ’s start graph has two external nodes (because it generates a string-graph), and note that every nonterminal of has rank . We convert into a tree-generating SL-HR grammar by making the second external node in internal. The resulting grammar generates a monadic tree, and we can use the result of Theorem 6 to convert it into a grammar, where every right-hand side is a monadic tree-graph (of rank 2). Finally, this grammar can be converted back into a string-generating grammar, by again making external.
Corollary 7.
Let be a string-generating SL-HR grammar. We can construct an SL-HR grammar such that
1. ,
2. , and
3. is a string-graph for every .
4.5 Important Parameters
In this section we describe some parameters of our algorithm that influence the compression ratio. Their effect is evaluated experimentally in Section 5.
4.5.1 Node Order
As discussed before the node order heavily influences the digram counting, which in turn influences the compression behavior. A node order is given by a bijective function such that if . Some of the orders below are modeled by functions for some , i.e., there exist nodes such that , but a strict order is needed to run the algorithm. For this reason, we consider sets of functions to be an admissible set of orders. This means, that whenever we state that a node order was used, we actually use an arbitrary order out of the set of admissible orders. We evaluate these orders:
1. natural order (nat) uses the node IDs as given in the source graph,
2. BFS order follows a breadth-first traversal,
3. FP computes a fixpoint on the node neighborhoods starting from the degrees, and
4. FP0, which is a degree order.
Formally, the natural order is just the identity. In the case of BFS there exists an additional element of nondeterminism. We choose as the first node (i.e., the one with ) any node of lowest degree. For any other node the function evaluates to the length of a shortest path from to . This forms the basis for an admissible set of orders. Note that, for graphs with more than one connected component, one node has to be picked for every connected component. All of these initial nodes evaluate to by . We now define the FP and FP0 orders. For a graph let be a family of functions that color every node with an integer. We first define , where is the degree of . This is the order FP0. Now we map every node to the tuple , where are the neighbors of ordered by their values in . We sort these tuples lexicographically and let be the position of in this lexicographical order. This process is iterated until . Now can be used as a basis for an admissible set of orders. This computation of the order works for undirected, unlabeled graphs, but can be straightforwardly extended to directed labeled graphs. We call this order FP. Figure 16 shows an example of the FP-order. The graph on the left is annotated by , the graph in the middle shows , which is then ordered lexicographically to get on the right. This is the fixpoint for this graph.
Note that it is not necessarily a strict order and thus also implies an equivalence relation on the nodes ( if and only if ). The number of equivalence classes of has an interesting correlation with the compression ratio of gRePair, as discussed in Section 5.2.2.
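The FP computation for undirected, unlabeled graphs can be sketched as an iterative refinement. The adjacency-dictionary input and the function name are assumptions, and this sketch takes "position in the lexicographical order" to mean the rank among the distinct tuples, so that nodes with equal tuples receive equal colors.

```python
def fp_colors(adjacency):
    """Refine degree-based colors until a fixpoint is reached (sketch of FP)."""
    # c_0: every node is colored with its degree (this is the FP0 order).
    colors = {v: len(neighbors) for v, neighbors in adjacency.items()}
    while True:
        # Tuple of the node's own color followed by its neighbors' colors, sorted.
        signature = {
            v: (colors[v], *sorted(colors[u] for u in adjacency[v]))
            for v in adjacency
        }
        # Position of each distinct tuple in the lexicographical order.
        rank = {sig: pos for pos, sig in enumerate(sorted(set(signature.values())))}
        refined = {v: rank[signature[v]] for v in adjacency}
        if refined == colors:
            return colors
        colors = refined


# A path on five nodes: 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
fp = fp_colors(path)
```

On the path, the two end nodes share a color, as do their two neighbors, while the middle node gets its own color; the result therefore has three equivalence classes rather than a strict order.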
4.5.2 Maximal Rank
The maximal rank is a user-defined parameter of gRePair that specifies the maximal rank of a digram (and thus the maximal rank of a nonterminal edge) that the compressor considers. Digrams with a higher rank are ignored and not counted. It was shown already for TreeRePair [36, Theorems 9 and 10] that choosing this parameter too high or too low can have strong effects on compression (in both directions). The two families of trees given there can be converted into families of graphs showing the same relation to the maximal rank for gRePair. We briefly recap the arguments, but as the proof details are almost identical to the case for TreeRePair, we do not repeat them here.
High Rank
Let be the graph , where is a tree consisting of a right comb of nodes labeled , a symbol of rank . The first children of these nodes are leaves labeled , the last child is the next -labeled node (except for the last of these, which only has -children). This is the same tree as given in [36, Theorem 9]. The size of depends on : for the -edges are of rank 2 (i.e., size 1) and thus we get . For the size of each -edge is and thus .
Lemma 8.
Given gRePair produces a grammar of size if the maximal rank allowed is at least , and does not compress at all if the maximal rank is less than .
It is easy to see that there can be no compression with a maximal rank less than : does not contain occurrences of any digram with a rank smaller than . Once the allowed rank is at least , gRePair reduces the width of the tree by with every iteration by combining one of the -leaves with its -/nonterminal parent. Finally a line of nonterminal edges remains, which all have the same label and can be compressed exponentially. This argument is identical to the one in the proof of [36, Theorem 9], only the notation changes to the one used for graphs.
Small Rank
We give an example of graphs which gRePair compresses best, if the maximum rank is limited to . Our example is similar to the one for TreeRePair [36, Theorem 10]. The main difference is that we use a graph with edges labeled , instead of nodes. Let be a graph over with nodes and edges . The attachment relation is defined as for , and for . They are labeled by for and
[TABLE]
for . This graph is a tree consisting of a path of edges labeled , and every node attached to one of these -edges, is also attached to an edge labeled , , , , or . These edges appear in this order, i.e., if a node is attached to an edge labeled , then the previous node (one -edge up in the tree) is labeled (), and the next node (one -edge down in the tree) is labeled (). The size of is .
Lemma 9.
Given , gRePair produces a grammar of size if the maximal rank is restricted to and compresses at best to of the original size if the maximal rank is unbounded.
The argument for this is mostly identical to [36, Theorem 10]. With a maximal rank of it compresses well, because gRePair then only has limited choice in digrams. It will first replace the pairs of - and -labeled () edges, because every other possible digram has a rank greater than . After this is done, a string-graph remains that can now be compressed exponentially. If the maximal rank is unrestricted, the digrams of higher rank (pairs of -edges in the first iteration) occur more frequently and are thus replaced by gRePair. This will replace all the -edges, but all of the nodes remain in the start graph. Furthermore, in the next step, pairs of nonterminal edges will be replaced as the most frequent digram, again without sharing any nodes. This continues until, after the last iteration, the start graph still has all nodes, and all the edges labeled for . Furthermore, it has 2 nonterminal edges of rank . Thus,
[TABLE]
Regardless of the rules and the pruning step, is already larger than . In particular note that, not counting the nonterminal edges, the size of is still . Therefore
[TABLE]
The pruning step reduces the grammar’s size, but the nodes and terminal edges in the start graph remain. Thus, the compression ratio cannot be better than . Note that for TreeRePair the compression with unbounded rank is at best instead of . The reason for this is, that the size definition for trees only counts edges, whereas we consider nodes and edges in graphs.
4.6 Implementation Details
In this section we describe some of the technical details of our implementation. We outline the involved data structures, and describe our output format.
4.6.1 Data Structures
Our data structures are a direct generalization to graphs of the data structures used for strings [28] and trees [36, Figure 11]. The occurrences are managed using doubly linked lists for every active digram. Of importance is a priority queue, which uses the frequency of a digram as the priority. Following Larsson and Moffat [28] the length of this queue is chosen as , where is the number of edges of the original input graph to gRePair.
4.6.2 Grammar Representation
We encode the start graph and the productions in different ways. As an example, consider again the grammar in Figure 12. The start graph is encoded using $k^2$-trees [5], using $k = 2$ as this provides the best compression. This data structure partitions the adjacency matrix into $k^2$ squares and represents it in a $k^2$-ary tree. Consider the left adjacency matrix in Figure 17. The -matrix is first expanded with 0-values to the next power of two; i.e., . If one partition has only 0-entries, a leaf labeled 0 is added to the tree. This happens for the 3rd and 4th partition in this case (the partitions are numbered left to right, top to bottom as indicated in the bottom center of the figure). Thus the 3rd and 4th child of the root are 0-leaves. The other two have at least one 1-entry, therefore inner nodes labeled 1 are added and the square is again partitioned into $k^2$ squares. This is continued at most until every square covers exactly one value. At this point the values are added to the tree as leaves. As we need to consider edges with different labels, we use a method similar to the representation of RDF graphs proposed in [1]. Let be the set of all edges labeled . For every label appearing in we encode the subgraph . If , then this is encoded as an adjacency matrix. Otherwise we use an incidence matrix, i.e., a matrix that has one row for every edge and a column for every node. Thus, a 1 in row , column of the incidence matrix means that edge is attached to node . All of these matrices are encoded as $k^2$-trees. Figure 17 is an example with two edge labels. Note that this example only uses edges of rank . For a hyperedge , the incidence matrix only provides information on the set of nodes attached to , but not the specific order of . For this reason we also store a permutation for every edge to recover . We count the number of distinct such permutations appearing in the grammar and assign a number to each. Then we store the list encoded in a -fixed length encoding, where is the number of distinct permutations.
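The partitioning described above can be illustrated with a minimal sketch. It assumes a square 0/1 adjacency matrix given as nested lists and emits the tree level by level as a flat bit list; the function name `k2_tree_bits` and the flat-list output are illustrative choices, not the binary format of [5] or of our implementation.

```python
def k2_tree_bits(matrix, k=2):
    """Level-order bits of a k^2-tree for a square 0/1 matrix (illustrative only)."""
    n = len(matrix)
    size = k
    while size < n:                      # pad up to the next power of k
        size *= k
    padded = [[matrix[r][c] if r < n and c < n else 0 for c in range(size)]
              for r in range(size)]

    def has_one(r0, c0, s):
        # True if the s x s square with top-left corner (r0, c0) contains a 1
        return any(padded[r][c]
                   for r in range(r0, r0 + s) for c in range(c0, c0 + s))

    bits = []
    level = [(0, 0, size)]               # squares that still need partitioning
    while level:
        next_level = []
        for r0, c0, s in level:
            sub = s // k
            # partitions are visited left to right, top to bottom
            for i in range(k):
                for j in range(k):
                    r, c = r0 + i * sub, c0 + j * sub
                    bit = 1 if has_one(r, c, sub) else 0
                    bits.append(bit)
                    if bit and sub > 1:  # non-empty square: partition again
                        next_level.append((r, c, sub))
        level = next_level
    return bits


# A 3x3 matrix with edges 0->1 and 1->2, padded to 4x4 internally.
example = [[0, 1, 0],
           [0, 0, 1],
           [0, 0, 0]]
print(k2_tree_bits(example))
```

For this hypothetical matrix the root has two non-empty partitions followed by two empty ones, mirroring the situation described for Figure 17, and each non-empty partition contributes four leaf bits.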
For the productions we use a different format, as we expect the right-hand-sides to be very small graphs (due to pruning, they may be larger than just a digram). We store an edge list for every production, encoding the nodes using a variable-length -code [14]. One more bit per node is used to mark external nodes. As the order of the external nodes is also important, we make sure that the order induced by the IDs of the external nodes is the same as the order of the external nodes. Every production begins with the edge count (again, using -codes). For every edge we first use one bit to mark terminal/nonterminal edges, then store the number of attached nodes, followed by the -codes of the list of IDs. Finally, we also use a -code for the edge label. For the production in Figure 12 this leads to the following encoding:
[TABLE]
This is a bit sequence of length 28.
4.7 On the Choice of Grammar Formalism
We discuss briefly the reasoning for some of our choices regarding the grammar formalism used. This includes the choice of hyperedge replacement over a different replacement method, and the restrictions we further enforce on hypergraphs.
4.7.1 Hyperedge vs. Node Replacement
There are two well-known types of context-free graph grammars:
- context-free hyperedge replacement grammars (HR grammars for short, see [17]), and
- context-free node replacement grammars (NR grammars for short, see [16]).
In terms of graph language generating power, NR grammars are strictly more expressive than HR grammars. For instance, NR grammars can describe the set of all complete graphs, whereas this is not possible with HR grammars. If we use NR grammars to produce bipartite graphs that encode hypergraphs, then the resulting hypergraph language generating power is exactly the same as that of HR grammars (see [17, Theorem 4.28]). Note that for any given SL-HR grammar one can construct an equivalent SL-NR grammar of similar size.
The difference in expressive power is due to node replacement grammars using a different formalism for derivation. Instead of merging external nodes with the nodes in the nonterminal’s neighborhood, every rule in an NR grammar includes a connection relation, which specifies which nodes in the rule are to be connected to which nodes in the nonterminal’s neighborhood. Consider as an example the grammar given in Figure 18: the initial graph has two nonterminal nodes, both labeled . The rule for has two nonterminal nodes labeled , and the connection relation . When deriving one of the -nodes in we note that the neighborhood of this node consists of only one node with label . The tuple in the connection relation means that every node from the nonterminal’s neighborhood is connected with every -node in the rule. Figure 19 shows this, and one more derivation step as an example.
The natural choice to define a digram for use of RePair with NR grammars would be two neighboring nodes and the edge between them. However, due to the connection relation, we would also need to include the neighborhood in this definition, as different occurrences of the same two nodes may need different connection relations. These may be incompatible with each other in some cases, or could be merged into one rule in others. This could make digrams too specific, and occurrences somewhat rare. It is therefore unclear whether the use of node replacement yields a successful RePair variant, but we note that it is interesting future work, particularly as we believe that complete graphs can be compressed much better using NR grammars.
Conjecture.
Let be a complete graph with nodes. Then there exists an SL-NR grammar such that , and , but there is no SL-HR grammar such that and .
The previously mentioned grammar in Figure 18 is an example for an SL-NR grammar that compresses a complete graph as in the conjecture and can be extended to generate for any . The connection relation for this grammar could also use a shorthand (i.e., could be used for the connection relation of the -rule instead of ). Doing so yields a grammar with rules, each having two nodes, one edge, and one tuple in the connection relation. We currently have no proof for the second part of the conjecture.
4.7.2 On the Conditions (C1) and (C2)
Recall, we also enforce two conditions on hypergraphs, which are not always found in the literature:
(C1) for every edge $e$: $\mathrm{att}(e)$ contains no node twice, and
(C2) the string of external nodes contains no node twice.
We refer to these as att- and ext-distinctness, respectively. We also call a rule att-distinct (ext-distinct) if its right-hand side is att-distinct (ext-distinct). A grammar is att-distinct (ext-distinct) if every rule and the start graph are att-distinct (ext-distinct). As already mentioned in Section 3, these conditions have no effect on the graph language generating power of HR grammars (provided the graphs do not require edges/external nodes that are not att/ext-distinct). For efficiency reasons gRePair does not consider or produce edges that are not att-distinct. This potentially weakens the compression, as nonterminal edges that are not att-distinct could be used to represent occurrences of different digrams using only one rule. For example, consider the digrams g) in Figure 8. Occurrences of them could also be replaced by a rule using the digrams a) from that same figure, by merging the center and right-most external nodes. This is not always an improvement: in this case we are using a rank-3 nonterminal edge where a rank-2 edge would have sufficed, thus storing larger nonterminals. It can be shown, however, that there are grammars that are not att-distinct but smaller than the smallest att-distinct grammar for the same graph. Figure 20 is an example of such a graph. We leave it as an exercise to calculate the maximal impact of att-distinctness with respect to compression. For ext-distinctness, on the other hand, we can show that the condition not only has no adverse effect on compression, but enforcing it makes the grammar smaller.
Lemma 10.
Given a non-ext-distinct SL-HR grammar , an ext-distinct SL-HR grammar can be constructed (in linear time), such that and .
Proof.
For a string let be the set of symbols appearing in , and the string derived from by only keeping the first occurrence of every distinct symbol in . For example, for , , and . Let be a rule that is not ext-distinct. We first add a new rule with , and is the same graph as except with . Then we replace the -rule by where . We can now derive every occurrence of in the grammar and remove the -rule altogether. Doing this merges some of the nodes in the grammar, as they were referenced by the same external node. Note that this process reduces the size of the grammar. Let be a node that occurs times in . Then, applying the -rules decreases the node size by . The edge-size also decreases by , as has a smaller rank. ∎
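The two string operations used in the proof, and the node merging they induce, can be sketched as follows. This is only an illustration under our own assumptions: external nodes and attached nodes are given as Python lists, and the helper names `first_occurrences` and `rewrite_edge` are hypothetical and do not appear in the construction itself.

```python
def first_occurrences(w):
    """Keep only the first occurrence of every distinct symbol in w."""
    seen = set()
    result = []
    for symbol in w:
        if symbol not in seen:
            seen.add(symbol)
            result.append(symbol)
    return result


def rewrite_edge(ext, att):
    """Rewrite one edge whose rule has the (non-distinct) external string ext.

    att lists the host nodes attached to the edge, position by position.
    Returns the attachment of the replacement edge of smaller rank and a map
    telling which host nodes are merged into which representative.
    """
    representative = {}   # external node -> host node kept for it
    merged_into = {}      # host node -> host node it is identified with
    new_att = []
    for external, host in zip(ext, att):
        if external not in representative:
            representative[external] = host
            new_att.append(host)
        merged_into[host] = representative[external]
    return new_att, merged_into


# Hypothetical rule whose first and third external node coincide.
print(first_occurrences(["x", "y", "x"]))
print(rewrite_edge(["x", "y", "x"], [1, 2, 3]))
```

In the hypothetical example the host nodes 1 and 3 are referenced by the same external node, so they are merged and the replacement edge has rank 2 instead of 3, which is the size reduction argued for in the proof.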
5 Experimental results
We implemented a prototype in Scala (version 2.11.7) using the Graph for Scala library (version 1.9.4; http://www.scala-graph.org/). The experiments are conducted on a machine running Scientific Linux 6.6 (kernel version 2.6.32), with 2 Intel Xeon E5-2690 v2 processors at 3.00 GHz and 378 GB memory. As we are only evaluating a prototype, we do not report runtime or peak memory performance, since these can be improved considerably by a more careful implementation. We compare to the following compressors:
- $k^2$-tree, for which we use our own Scala implementation following the description in [5], using the same binary format, however without the further optimization on compressing the leaf level described there.
- The list merge (LM) algorithm by Grabowski and Bieniecki [24]. We use for their chunk size parameter, as in their paper.
- The combination of dense substructure removal [6] and $k^2$-tree by Hernández and Navarro [27] (HN). For the parameters to the algorithm we use , , and , which are the parameters their experiments show to provide the best compression. Note that this compressor uses a $k^2$-tree implementation with all the optimizations.
The latter two implementations were provided by the authors. We also experimented with RePair on adjacency lists by Claude and Navarro [10], but omit the results here, because on all graphs we tested, stronger compression was achieved by another compared compressor.
As is common in graph compression, we present the compression ratios in bpe (bits per edge). Note that our compressor reorders the nodes. We omit the space required to retain the original node IDs, because we assume that they represent arbitrary data values and it is possible to update this mapping. This is particularly true for RDF graphs, as explained in Section 5.3.2.
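As a small illustration of the bpe measure, the following sketch computes bits per edge from an output file size and the number of edges of the input graph. The function name and the numbers in the example are hypothetical and are not taken from our experiments.

```python
def bits_per_edge(file_size_bytes, num_edges):
    """Compression measured as bits of output per edge of the input graph."""
    if num_edges <= 0:
        raise ValueError("the graph must have at least one edge")
    return 8 * file_size_bytes / num_edges


# Hypothetical example: a 500-byte output for a graph with 1,000 edges.
print(bits_per_edge(500, 1000))
```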
Note that it is known that using different node orders will change the results for all of these compressors, including gRePair. We discussed the importance of the traversal order, but another area where the node order will have an effect is the final encoding of the start graph using the $k^2$-tree structure. To get comparable results, we use the natural order in all cases when it comes to the encoding. We also ran experiments in which the start graph of the grammar produced by gRePair was reordered according to the initially computed FP-order. This improves some of the results, but we consider it future work to test the impact of different orders on the encoding of the start graph (cf. the conclusions).
5.1 Datasets
We use three different types of graphs: network graphs (Table 1), RDF graphs (Table 2), and version graphs (Table 3). Each table lists the numbers of nodes and edges and the number of equivalence classes of (see Section 4.5.1) of each graph. For RDF graphs we also list the number of edge labels (i.e., predicates) of each graph. Two of the version graphs also have labeled edges.
We give a short description of each graph. The network graphs are from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data/index.html) and are unlabeled. They are communication networks (Email-EuAll, Wiki-Vote, Wiki-Talk), a web graph (NotreDame) and co-authorship networks (CA-AstroPh, CA-CondMat, CA-GrQc). Even if they were advertised as undirected, we considered all of them to be lists of directed edges, to improve the comparability with the other compressors, as these would also assume the input graph to be directed.
The RDF graphs mostly come from the DBPedia project (http://wiki.dbpedia.org/Downloads2015-04), which is an effort to represent ontology information from Wikipedia. We evaluate on specific mapping-based properties (English), which contains infobox data from the English Wikipedia, and mapping-based types, which contains the rdf:types for the instances extracted from the infobox data. We use three different versions of the latter: types for instances extracted from the Spanish and Russian Wikipedia pages that do not have an equivalent English page, and types for instances extracted from the German Wikipedia pages that do have an equivalent English page. The Identica dataset (http://www.it.uc3m.es/berto/RDSZ/) represents messages from the public stream of the microblogging site identi.ca. Its triples map a notice or user with predicates such as creator (pointing to a user), date, content, or name. The Jamendo dataset (http://dbtune.org/jamendo/) is a linked-data representation of the Jamendo repository for Creative Commons licensed music. Subjects are artists, records, tags, tracks, signals, or albums. The triples connect them with metadata such as names, birthdate, biography, or date.
Version graphs are disjoint unions of multiple versions of the same graph. Here, Tic-Tac-Toe represents winning positions, and Chess represents legal moves (both from http://ailab.wsu.edu/subdue/download.htm). The files contain node labels from a finite alphabet, which we ignore here. DBLP60-70 and DBLP60-90 are co-authorship networks from DBLP, created from the XML file (http://dblp.uni-trier.de/xml/, release from August 1st, 2015) by using author IDs as nodes and creating an edge between two authors who appear as co-authors of some entry in the file. To make version graphs, we created graphs containing the disjoint union of yearly snapshots of the co-authorship network.
5.2 Influence of Parameters
We evaluate how the different parameters for our compressor affect compression. For these experiments every parameter except the one being evaluated is fixed for the runs. Note that this sometimes leads to situations where none of the results in a particular experiment represents the best compression our compressor is able to achieve for the given graph. The parameters evaluated are the maximum rank of a nonterminal and the node order.
5.2.1 Maximum Rank
We test maxRank values from 2 up to 8. The average results for the three types of graphs tested are given in Table 4, as compression in bpe. We did some tests for higher values (up to 16) but only got worse results. The best results are marked in bold. For each class of graphs, a value of achieved the best results on average. We therefore conclude that a value of is a good compromise for our data sets.
5.2.2 Node Order
Recall from Section 4.5.1 that the FP-order is a fixed point computation starting from the node degrees. As this is an iterative process, it can be terminated at any point. We were interested in how much difference a fixpoint makes compared to using just the node degrees (). Figure 21 shows the compression ratio of a selection of graphs under the different node orders. The selection aims to be representative of the graph types we evaluated: CA-graphs behave similarly to CA-AstroPh, version graphs similarly to DBLP60-70, and the RDF graphs similarly to Specific properties en. The other graphs in the figure are chosen because they are outliers in their respective category. Our FP-order achieves the best result on almost all of the graphs. On RDF graphs the order generally had only marginal impact: the best and worst results usually are within 0.5 bpe of each other. The Jamendo graph presents an exception here, with the natural order being about 1 bpe better than the closest other result and featuring the largest difference between and FP of all our graphs. Version graphs, however, benefit hugely from the FP-order, as further discussed in Section 5.3.3. This shows that two or more versions of the same graph are similarly ordered in the FP-order, increasing the likelihood that the compressor recognizes repeating structures.
There is another interesting observation about the FP-order, or in particular the equivalence relation . It is likely that nodes with a high similarity, i.e., the same neighborhood up to a certain distance, are equivalent in this relation. This implies that graphs with a low number of equivalence classes should compress well, as they would have many repeating substructures. Figure 22 shows this correlation. There is no graph in the lower right corner, i.e., there is no graph with a low number of equivalence classes and low compression.
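To make the idea of a degree-based fixed point concrete, the following sketch performs a Weisfeiler-Lehman-style refinement: nodes start in classes given by their degree, and classes are split repeatedly according to the classes of their neighbors until the number of classes stops growing. This is a generic illustration and not necessarily identical to the definition in Section 4.5.1; the adjacency-dictionary input, the undirected view of the graph, and the function names are our assumptions.

```python
def refine_classes(adj):
    """adj maps each node to a list of its neighbours (undirected view).

    Returns a dict node -> class id once the refinement reaches a fixed point.
    """
    colour = {v: len(adj[v]) for v in adj}            # start from node degrees
    while True:
        # a node's signature is its own class plus the classes of its neighbours
        signature = {v: (colour[v], tuple(sorted(colour[u] for u in adj[v])))
                     for v in adj}
        ranks = {sig: i for i, sig in enumerate(sorted(set(signature.values())))}
        refined = {v: ranks[signature[v]] for v in adj}
        if len(set(refined.values())) == len(set(colour.values())):
            return refined                            # no class was split
        colour = refined


def order_by_class(adj):
    """Node order induced by the classes; ties are broken by node ID."""
    classes = refine_classes(adj)
    return sorted(adj, key=lambda v: (classes[v], v))


# Two disjoint copies of the path a-b-c: corresponding nodes share a class.
two_versions = {
    0: [1], 1: [0, 2], 2: [1],
    3: [4], 4: [3, 5], 5: [4],
}
print(refine_classes(two_versions))
print(order_by_class(two_versions))
```

In this toy input the end points of both copies fall into one class and the middle nodes into another, which illustrates why identical versions of a graph end up ordered alike.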
5.3 Comparison with other Compressors
We compare gRePair with the compressors -tree, LM, and HN listed at the beginning of Section 5. Note that we compare RDF graph compression only against the -tree-method, as LM and HN have not been extended to RDF graphs. While these algorithms all work as in-memory data structures, they produce outputs with file sizes comparable to the in-memory representations. We measure the compression performance based on these file sizes. Where applicable, we furthermore use the dense substructure removal done as a first step in the HN-method (see Section 2) in combination with our compressor, marked as “gRePair+DSR”. To do so, we added specially labeled rank-1 edges to the virtual nodes created by the dense substructure removal, to ensure that the original graph could be restored. Usually, the virtual nodes are identified by having an ID greater than the largest ID occurring in the original graph. As our compressor reorders the nodes, this is no longer feasible, and marking them with additional edges was the most space-efficient method we found.
Let us give an idea of the compression ratio for the graphs according to our size definition. Using gRePair (without DSR), we achieve, on average, a compression ratio () of
- for network graphs,
- for RDF, and
- for version graphs.
The parameters we choose for gRePair are and the FP-order, both being generally the best choice for our dataset. We note that in most results the majority of the file sizes of gRePair’s output () is for the -tree representation of the start graph.
5.3.1 Network Graphs
Our results on network graphs compared to $k^2$-tree, LM, and HN are summarized in Figure 23. We improve on the plain $k^2$-tree representation on all graphs but NotreDame. However, our results are often slightly worse than LM and HN, with Email-EuAll and CA-GrQc being exceptions. That said, the dense substructure removal of HN can be used as a preprocessing step for our compressor. This generally improves on our results and achieves the smallest bpe-values for two of the three CA-graphs.
5.3.2 RDF Graphs
Recall from Section 2.2 that the values for subject, predicate, and object of RDF triples are commonly mapped to integers using a dictionary to represent the original values. As in this way dictionary and graph are separate entities, we only focus on compressing the graph. Any method for dictionary compression can be used to additionally compress the dictionary (e.g. [45]) and we omit the space necessary for the dictionary.
Our results in comparison to $k^2$-tree are given in Table 5. We greatly improve on this representation. For the graphs 2–4 (in particular 2 and 3) we are able to produce a representation that is orders of magnitude smaller than the $k^2$-tree representation (note that is very low for these graphs). For these two graphs in particular, this is due to the majority of their nodes being laid out in a star pattern: a few hub nodes of very high degree are connected to nodes, most of which are only connected to the hub node. Furthermore, while not acyclic, the graphs are also very tree-like. Structures like these are compressed well by gRePair, because every iteration of the replacement round of gRePair roughly halves the number of edges around the hub node.
5.3.3 Version Graphs
We describe several experiments over version graphs. First we study how the compressor behaves given a high number of identical copies of the same simple graph. The graph in this case is a directed cycle with four nodes and one of the two possible diagonal edges. Figure 24a shows the results of this experiment for identical copies starting from 8 going in powers of 2 up to 4096. Clearly, gRePair is able to compress much better in this case (“exponential compression”), while the file size of other methods rises with roughly the same gradient as the size of the graph. Note that both axes in this graph use a logarithmic scale: in this case, gRePair produces a representation that is orders of magnitude smaller than the other compressors.
Except for identical copies of rather simple graphs, however, we cannot expect to achieve exponential compression on version graphs. Every version has changes and it is not easy to decide which parts of two versions remain the same and can thus be compressed using the same nonterminals. Even if we can guarantee that the same (i.e., isomorphic) substructures are consistently compressed in the same way, the changes between versions might be too big to allow for exponential compression. Our FP-order is inspired by the Weisfeiler-Lehman method [53] (see also [8]), which approximates a test for isomorphism. The results on version graphs, when comparing different orders (see also Section 5.2.2 above), suggest that this is indeed exploited. Figure 24b shows a comparison on the compression of a version graph from the DBLP co-authorship network. We started with a co-authorship network including publications from 1960 and older. To this graph we then add versions with the publications from 1961, 1962,…until 1970 and compress the graphs obtained in this way. The comparison shows that using the FP-order our method achieves better compression than using other orders. Note that the results for BFS or random order are much closer to -trees. Our full results for version graphs are given in Table 6. Note that we compare Tic-Tac-Toe and Chess only against -tree, because these graphs have edge labels. The results show that gRePair compresses version graphs well.
5.4 Results on Synthetic Graphs
We finally describe experiments on synthetic graphs. These show some of the effects described earlier, namely that
1. some graphs can be compressed exponentially,
2. the maximal rank can have a big impact, and
3. different node orders can have a big influence on the compression.
We evaluate two different families of graphs: “grid” and “triangle fractal”. Both have a parameter to achieve different sizes. Let be an -grid graph, i.e., a graph with nodes where node has edges to (unless ), and to (unless ). Following this construction, has nodes, and edges, yielding a size of . Furthermore we define a triangle fractal in the following way: initially is a complete graph with nodes. To define with , we start with and let be the set of edges in that are incident to a node of degree . For each edge in we add a new node and add edges from both ends of to , creating another triangle. Intuitively, we add another triangle at every outer edge. Regarding the size, has nodes, and edges. The natural order (i.e., the order of the node-IDs) is a top-to-bottom, left-to-right order for , while the natural order for is to start with the three nodes of the innermost triangle and then go outward from there in the same way as generated (i.e., in “layers”).
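The two constructions can be sketched in a few lines. The sketch assumes that grid edges point to the right and downward neighbor, that the initial triangle has three nodes, and that "outer" edges are those incident to a node of degree 2; these readings, the integer node IDs, and the function names are our assumptions for illustration.

```python
def grid_edges(n):
    """Directed n x n grid: node (i, j) has an edge to its right and lower neighbour."""
    edges = []
    for i in range(n):
        for j in range(n):
            if j + 1 < n:
                edges.append(((i, j), (i, j + 1)))
            if i + 1 < n:
                edges.append(((i, j), (i + 1, j)))
    return edges


def triangle_fractal(depth):
    """Start from a triangle and add a new triangle on every outer edge, depth times."""
    nodes = [0, 1, 2]
    edges = [(0, 1), (1, 2), (0, 2)]
    for _ in range(depth):
        degree = {v: 0 for v in nodes}
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        # outer edges are incident to a node of degree 2 (assumed reading)
        outer = [(u, v) for u, v in edges if degree[u] == 2 or degree[v] == 2]
        for u, v in outer:
            w = len(nodes)                 # fresh node, numbered layer by layer
            nodes.append(w)
            edges.append((u, w))
            edges.append((v, w))
    return nodes, edges


print(len(grid_edges(4)))
for depth in range(3):
    nodes, edges = triangle_fractal(depth)
    print(depth, len(nodes), len(edges))
```

Because new nodes receive increasing IDs, the node numbering produced by `triangle_fractal` corresponds to the layer-wise natural order described above.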
Table 7 lists the results for these graphs using various orders, graph sizes, and values of maxRank. Here we state the compression performance in relation to the size definition, i.e., compression ratio is given by (in percent), where is the compressed representation of .
Some notable results: triangle fractal compresses best at . This is not surprising, as in this case the construction of the graph is exactly reversed (with respect to the recursive definition of ). However, note that at the results are still very close using the FP-order, whereas they get noticeably worse for every other order. Indeed, with the BFS-order even for the optimal result is not found any more. Using a higher rank than 4 is generally detrimental for this graph, because digrams of higher rank occur more often than the ones of rank 2. This is not a problem in itself, as carefully chosen occurrences can still be reduced well, but there are many more sets of occurrences for rank-3 digrams, than there are for rank-2 digrams. Which one is found by gRePair depends on the node order. This graph family shows that
1. maxRank can have a huge impact (compare 2 vs. 4 for the natural order), and
2. the node order can have a huge impact (compare FP vs. BFS order).
Grids are the opposite, when it comes to maxRank: best compression is achieved for unbounded rank, while gives almost no compression (independent of the node order). For unbounded rank, gives close to no compression, FP compresses to , and natural order compresses to (for ). The latter “exponential compression” is achieved, by treating the grid as identical lines of length . Node orders that traverse the grid in such a way (like the natural order, which goes row by row, or the BFS-order which follows the rows “in parallel”), are likely to find digram occurrences corresponding to this structure. Accordingly, the BFS-order also achieves good compression, in particular, it has the best result for with or . Note however, that for with BFS-order and unbounded maxRank the rank was actually bounded to , because the computation takes too long otherwise.
6 Query Evaluation
In this section we investigate two types of queries that can be performed over SL-HR grammars: neighborhood queries and speed-up queries. Neighborhood queries allow traversing the edges of a graph (in any direction). Using them, any graph algorithm can be performed on the compressed representation given by an SL-HR grammar. However, this comes at a price: a considerable slow-down is to be expected in comparison to running over an uncompressed graph representation, because a partial decompression is required in order to obtain the neighboring nodes. In contrast, speed-up queries, as their name suggests, can run faster on an SL-HR grammar than on an uncompressed graph representation. Examples of speed-up queries are counting the number of connected components of the graph, checking regular path properties in the graph, or checking reachability between two nodes. These queries can be evaluated in polynomial time over the grammar, and allow speed-ups proportional to the compression ratio. The results in this section have not been implemented. Over grammar-compressed trees, the performance of simple speed-up queries is evaluated in [40].
6.1 Neighborhood Queries
For a node of a hypergraph we denote by the neighborhood of . For simple graphs we also define and , the incoming and outgoing neighborhoods of , respectively. Furthermore we let be the set of edges incident with .
Let be an SL-HR grammar. We assume that every right-hand side in contains at most two nonterminal edges (but arbitrarily many terminal edges). This can be achieved by replacing pairs of nonterminal edges with a single new one: for let be the nonterminal edges in for some . Then replace by a new edge with , and is a graph containing nonterminal edges such that replacing generates the original . This procedure can recursively be applied until every right-hand side has at most two nonterminal edges. This increases the size of the grammar by at most a factor of 2.
Recall that the nodes in the start graph are numbered and that there is an order on the nonterminal edges in so that the nodes in are numbered , where , and similarly, nodes in are numbered from to . Given a node ID, i.e., a number in , computing its outgoing neighbors consists of two steps:
1. compute a grammar representation (-representation) of , and
2. from a -representation, compute the outgoing neighbors (as IDs in the decompressed graph) of the represented node.
A -representation is a path in the derivation tree of that “derives the node ”. Such a path is of the form where is a (possibly empty) string of the form . If is empty, then must be a node in . If not, then is a nonterminal edge of . If is the label of , then is a nonterminal edge in labeled , etc. Finally, is an internal node in . Let be the number of nonterminal edges in and . The -representation of can be computed using Algorithm 2, where refers to the number of internal nodes in . The complexity of this algorithm depends on the specific implementation of Steps 5 and 6, and the computation of . Specifically, it is possible to precompute for every and the first ID of every edge in (Step 6). Doing so needs an additional space, but now Step 5 can be implemented using a binary search and only needs time. Ignoring the preprocessing, a computation of Algorithm 2 takes time and space. If we do not want to use the additional space, a computation of takes time (it can be done bottom-up). Steps 5 and 6 can now be implemented using a linear search through the nonterminal edges of in their derivation order as defined in Section 3.2. This leads to a total running time of . In the following we refer to the runtime of Algorithm 2 by to reflect this ambiguity.
Given the -representation , the outgoing neighbors are computed using Algorithm 4. This algorithm uses the function getID, detailed in Algorithm 3. The complexity of the latter again depends on the implementation of Step 15 and . By the same reasoning as above we get a runtime of for getID. Finally getOutNeighborhood uses time, where is the out-degree of the node represented by within . Note that for the algorithms to be well defined, they have to work with . In this case, we implicitly assume and , i.e., we are working on the startgraph.
Proposition 11.
Let be an SL-HR grammar and . Let be the number of in (or out) neighbors of in . The node IDs of these nodes can be computed in time .
Note that for string grammars, data structures have been presented that guarantee constant time per move from one letter to the next (or previous) [22]. This result has been extended to grammar-compressed trees [39], and we hope that it can also be generalized to SL-HR grammar-compressed graphs.
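The first step, locating the rule and internal node that a global node ID refers to, can be sketched as follows. This is not Algorithm 2 itself but a simplified illustration of the linear-search variant: it assumes that each right-hand side is summarized by its number of internal nodes and the labels of its nonterminal edges in derivation order, and that the nodes below every nonterminal edge are numbered contiguously, internal nodes first. The dictionary layout and the name `locate` are hypothetical.

```python
def locate(rules, start, node_id):
    """Map a 1-based global node ID to (path of nonterminal edge indices, local ID).

    rules maps a label to (number of internal nodes, [labels of nonterminal edges]).
    """
    sizes = {}

    def size(label):
        # number of nodes generated below `label`, computed once per label
        if label not in sizes:
            internal, children = rules[label]
            sizes[label] = internal + sum(size(child) for child in children)
        return sizes[label]

    if not 1 <= node_id <= size(start):
        raise ValueError("node ID out of range")

    path, label, remaining = [], start, node_id
    while True:
        internal, children = rules[label]
        if remaining <= internal:
            return path, remaining          # an internal node of the current rule
        remaining -= internal
        for index, child in enumerate(children):
            if remaining <= size(child):    # the node lies below this edge
                path.append(index)
                label = child
                break
            remaining -= size(child)


# Hypothetical grammar: S has 3 nodes and two A-edges, A has 1 node and one B-edge.
rules = {"S": (3, ["A", "A"]), "A": (1, ["B"]), "B": (2, [])}
print(locate(rules, "S", 2))
print(locate(rules, "S", 9))
```

Replacing the inner linear scan by a binary search over precomputed first IDs would correspond to the variant with additional space discussed above.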
6.2 Speed-Up Queries
One attractive feature of straight-line context-free grammars is the ability to execute finite automata over them without prior decompression. This was first proved for strings (see [32]) and was later extended to trees (and various models of tree automata, see [37]). The idea is to run the automaton in one pass, bottom-up, through the grammar. As an example, consider the grammar and from the Introduction, and an automaton that accepts strings (over ) with an odd number of ’s. Thus, has states (where is initial and is final) and the transitions , , and for . Since the actual active states are not known during the bottom-up run through the grammar, we need to run the automaton in every possible state over a rule. For the nonterminal we obtain and , i.e., running in state over the string produced by brings us to state , and starting in brings us to . Since is the start nonterminal, we are only interested in starting the automaton in its initial state . We obtain the run , i.e., the automaton arrives in its final state and hence the grammar represents a string with odd number of ’s. It should be clear that the running time of this process is , where is the set of states of the automaton, and is the grammar.
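The bottom-up run can be sketched for string grammars as follows. For every nonterminal the sketch computes, once, the function that maps a starting state to the state reached after reading the string produced by that nonterminal. The grammar, the two-state automaton for "odd number of a's", and all names below are a hypothetical example and not the grammar from the Introduction.

```python
def run_dfa_on_grammar(rules, start, delta, states, initial):
    """Run a DFA over the string derived from `start` without decompressing it.

    rules maps a nonterminal to its right-hand side (a list of symbols);
    delta maps (state, terminal) to the next state.
    """
    effect = {}   # nonterminal -> {state before: state after}

    def state_map(symbol):
        if symbol not in rules:                       # terminal symbol
            return {q: delta[(q, symbol)] for q in states}
        if symbol not in effect:                      # each rule is processed once
            mapping = {q: q for q in states}
            for part in rules[symbol]:
                step = state_map(part)
                mapping = {q: step[mapping[q]] for q in states}
            effect[symbol] = mapping
        return effect[symbol]

    return state_map(start)[initial]


# Hypothetical grammar deriving a string of length 15 with nine a's.
rules = {"S": ["A", "A", "A"], "A": ["B", "B", "a"], "B": ["a", "b"]}
states = [0, 1]                                       # 0: even, 1: odd number of a's
delta = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
print(run_dfa_on_grammar(rules, "S", delta, states, initial=0))
```

Every symbol occurrence in a right-hand side costs one pass over the state set, which matches the running time stated above.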
Unfortunately, for graphs there does not exist an accepted notion of finite-state automaton. Nevertheless, properties that can be checked in one pass through the derivation tree of a graph grammar have been studied under various names: “compatible”, “finite”, and “inductive”, and it was later shown that these notions are essentially equivalent [26]. Courcelle and Mosbah [11] show that all properties definable in “counting monadic second-order logic” (CMSO) belong to this class, and by their Proposition 3.1, a CMSO property can be evaluated over a derivation tree in time , where is an upper bound on the complexity of evaluation on each right-hand side of the rules in . This is done by a bottom-up computation on the nodes of , where every node can use the results of its children to achieve a correct computation for the subtree rooted at .
Proposition 12.
Let be a fixed CMSO property. For a given SL-HR grammar it can be decided in time whether or not holds on , where is an upper bound on the time needed to evaluate one right-hand side of the grammar.
Note that we state the time complexity based on instead of . This needs an adjustment of the proof, as the derivation tree cannot be explicitly constructed, because it may be exponentially larger than . However, a derivation DAG (directed acyclic graph) still retains all information necessary for the algorithm stated in [11] to work. It has a hierarchy that makes the bottom-up computation possible, and while there is ambiguity in the parent relationship, the children of every node are well defined. A computation for one right-hand side that depends on the results of that computation on the children in the derivation tree can therefore be done in the derivation DAG as well.
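To illustrate why the derivation DAG suffices, the following sketch computes one value per nonterminal and reuses it for every occurrence. It does not evaluate a CMSO property; it merely counts the terminal edges of the derived graph. The rule layout (`terminal_edges`, `nonterminals`) and the grammar `A0` to `A20` are assumptions made for this example.

```python
from typing import Dict, List, Union

Rule = Dict[str, Union[int, List[str]]]


def count_terminal_edges(start: Rule, rules: Dict[str, Rule]) -> int:
    """Count terminal edges of the derived graph without building it.

    Each nonterminal is evaluated once (memoized), which corresponds to one
    visit per node of the derivation DAG rather than of the derivation tree.
    """
    memo: Dict[str, int] = {}

    def size(label: str) -> int:
        if label not in memo:
            rhs = rules[label]
            memo[label] = rhs["terminal_edges"] + sum(
                size(child) for child in rhs["nonterminals"]
            )
        return memo[label]

    return start["terminal_edges"] + sum(size(c) for c in start["nonterminals"])


# Hypothetical grammar: A0 has two terminal edges, and every Ai (i > 0)
# consists of two nonterminal edges labeled A(i-1).
RULES: Dict[str, Rule] = {"A0": {"terminal_edges": 2, "nonterminals": []}}
for i in range(1, 21):
    RULES[f"A{i}"] = {"terminal_edges": 0, "nonterminals": [f"A{i-1}", f"A{i-1}"]}
START: Rule = {"terminal_edges": 0, "nonterminals": ["A20"]}
```

The derivation tree of this grammar has more than a million leaves, while the memoized computation touches only 21 rules.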
Note that Proposition 12 is often stated under a fixed tree decomposition of width of the graph and then simply becomes . The CMSO (or compatible or finite) graph properties have been extended to functions from graphs to natural numbers, see e.g., Section 5 of [11]. They can be evaluated with a similar complexity as in Proposition 12. For the same reason as above, this result can be applied to SL-HR grammars. Without stating this result explicitly, we mention some of the well-known CMSO functions: (1) maximal and minimal degree, (2) number of connected components, (3) number of simple cycles, (4) number of simple paths from a source to a target, and (5) maximal and minimal length of a simple cycle.
Beware that in Proposition 12 need not be linear in the size of a right-hand side, but is rather a generic upper bound. Courcelle and Mosbah [11] show (for their Proposition 3.1) linearity for evaluations using a certain Boolean algebra, and cubic complexity for sets of cardinalities. For universal evaluation they give an exponential upper bound. Lohrey [33] proves that the problem in Proposition 12 becomes -complete for grammars where both the rank and the number of nonterminals per right-hand side are bounded by a constant. Without these restrictions the problem is complete for . Note that -completeness is already true for explicitly represented graphs.
The specific complexity thus depends on the problem. We show for two problems that they can be solved in linear time with grammar-compressed graphs as input.
6.2 Reachability Queries
An important class of queries are reachability queries. For a given graph and nodes and such a query asks if is reachable from , i.e., if there exists a path from to in . It is well known that this problem can be solved in time (e.g., by doing a BFS-traversal in time). How can we solve this problem on an SL-HR grammar ? Certainly, -reachability is CMSO definable and therefore Proposition 12 gives us an upper bound of . The following direct linear time algorithm essentially uses the same method already applied by Lengauer and Wanke [30]. Their formalism is slightly different from ours however, as it uses an encoding of the hypergraphs using bipartite graphs, and the algorithm as stated only decides reachability in undirected graphs. We therefore restate the algorithm using the following notion, which will again be used in the next section:
Definition 13.
Let be a graph with set of external nodes. Then the skeleton graph is the directed, unlabeled graph such that for all , if and only if is reachable from in .
The edges of a skeleton graph are computed as follows. First, assume that is a terminal graph. We determine the strongly connected components of in linear time (e.g., using Tarjan’s algorithm [50]). Let be the corresponding graph which has as nodes the strongly connected components of . We remove from each strongly connected component that does not contain external nodes. This is done by inserting for every pair of edges such that is an edge from a component into and is an edge from to a component (with ), an edge from component to component . Finally, we replace each component by a cycle of the external nodes of that component, and, for an edge from a component to a component we add an edge from an arbitrary external node of to one of .
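For a small terminal graph, the reachability relation of Definition 13 can also be computed directly. The sketch below runs one BFS per external node, so it is not the linear-time construction via strongly connected components described above, and it outputs every reachable pair rather than the sparser cycle-based edge set. The function name and the edge-list representation are assumptions.

```python
from collections import deque
from typing import Hashable, Iterable, List, Set, Tuple

Node = Hashable


def skeleton(edges: Iterable[Tuple[Node, Node]], external: List[Node]) -> Set[Tuple[Node, Node]]:
    """Return all pairs (u, v) of distinct external nodes with v reachable from u."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)

    pairs: Set[Tuple[Node, Node]] = set()
    for source in external:
        # Standard BFS from one external node.
        seen = {source}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency.get(u, []):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        pairs.update((source, x) for x in external if x != source and x in seen)
    return pairs
```

For the edges `[(1, 2), (2, 3), (3, 4), (5, 1)]` with external nodes `[1, 4, 5]`, the internal nodes 2 and 3 disappear and only the reachability between 1, 4, and 5 is kept.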
Theorem 14.
Let be a graph and an SL-HR grammar with . Given nodes , it can be determined in time whether or not is reachable from in .
Proof.
We first compute -representations , of and in time, as described in Section 6.1. We traverse bottom-up with respect to in one pass and compute for each nonterminal its skeleton graph . After having computed in time the skeleta for all nonterminals, we can solve a reachability query as follows. Let be the graph obtained from by replacing each nonterminal edge by its skeleton graph; clearly, it can be obtained in time.
Case 1: Assume that and are of the form and , i.e., both nodes are in the start graph. It should be clear that is reachable from in if and only if is reachable from in . The latter is checked in time.
Case 2: Let and . Let be the label of for and let be the label of for . We determine the set of external nodes of the right-hand side of that are reachable from in . This is done by replacing the (at most two) nonterminal edges in by their skeleton graphs, and then running a standard reachability test. We now move up the derivation tree (viz. to the left in ), at each step computing a subset of the external nodes of : we locate the nodes corresponding in and determine the set of external nodes reachable from these. Finally, we obtain a set of nodes in (all incident with the edge ). In a similar way we compute a set of nodes in that are incident with (and can reach ). Finally, we check if a node in is reachable from a node in . This is done by adding edges over that form a cycle, and edges over that form a cycle. We now pick arbitrary nodes in and in and check if is reachable from in . ∎
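Case 1 of the proof can be sketched as follows, under simplifying assumptions: a right-hand side is a dictionary with a list `ext` of external nodes, a list `edges` of directed terminal edges, and a list `nts` of nonterminal edges given as (label, attached nodes). Skeletons are stored as pairs of attachment positions. All names are hypothetical, and the skeletons are computed with one BFS per external node, so the sketch does not reach the linear time bound of the theorem.

```python
from collections import deque


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    return adj


def reachable_set(adj, src):
    """All nodes reachable from src (including src) by BFS."""
    seen = {src}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def flat_edges(graph, skeletons):
    """Terminal edges plus, for each nonterminal edge, its skeleton edges."""
    edges = list(graph["edges"])
    for label, attached in graph["nts"]:
        edges.extend((attached[i], attached[j]) for i, j in skeletons[label])
    return edges


def compute_skeletons(rules, order):
    """Bottom-up pass: `order` lists nonterminals so that used ones come first."""
    skeletons = {}
    for label in order:
        rhs = rules[label]
        adj = build_adj(flat_edges(rhs, skeletons))
        ext = rhs["ext"]
        pairs = set()
        for i, u in enumerate(ext):
            seen = reachable_set(adj, u)
            pairs.update((i, j) for j, v in enumerate(ext) if j != i and v in seen)
        skeletons[label] = pairs
    return skeletons


def reachable_in_start(start, rules, order, u, v):
    """Case 1 only: both u and v are nodes of the start graph."""
    skeletons = compute_skeletons(rules, order)
    adj = build_adj(flat_edges(start, skeletons))
    return v in reachable_set(adj, u)


# Hypothetical grammar: A derives a path of length 2, B chains two copies of A.
RULES = {
    "A": {"ext": [0, 1], "edges": [(0, 2), (2, 1)], "nts": []},
    "B": {"ext": [0, 1], "edges": [], "nts": [("A", [0, 2]), ("A", [2, 1])]},
}
ORDER = ["A", "B"]
START = {"edges": [("x", "y")], "nts": [("B", ["y", "z"])]}
```

Here `reachable_in_start(START, RULES, ORDER, "x", "z")` answers the query on the start graph with skeletons substituted, without ever deriving the five-edge path that the grammar represents.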
Furthermore, by Lengauer and Wagner [29] the reachability problem on grammar-compressed graphs is -complete.
6.3 Regular Path Queries
Regular path queries (RPQ) are a well-known (see, e.g., [54]) way to query graph data: the query is given as a regular expression over the edge alphabet of the graph . Given two nodes in the problem is to decide whether there exists a path from to such that its edge labels, taken as a string, match the regular expression . Such queries are of relevance in modern applications: version 1.1 of the SPARQL query language (https://www.w3.org/TR/sparql11-query/) for RDF data introduces property paths, which are a variant of regular path queries. It is well known that the problem of whether a regular path query holds for two given nodes in a graph is decidable in time where is an NFA deciding the language of . Such an automaton can be constructed in linear time and with , e.g., using Thompson’s construction [51]. The algorithm works as follows:
- Consider to be an NFA with initial state and final state .
- Construct the product automaton of and deciding .
- Test the product automaton for emptiness.
If the final test is true, then there is no such path. Otherwise it exists and we say and satisfy . We generalize this method to grammar-compressed graphs. Given a grammar and a regular expression , we first compute an SL-HR grammar which generates the graph-structure (ignoring initial and final states for now) of the product automaton of and . Then, using the result from the previous section, we decide reachability from the initial to the final state. To make this construction easier to read, we will relax one condition we enforced on hypergraphs before: nodes in the grammar are named not just by IDs, but by a tuple of ID and state. In fact, for a node-ID and a set of states , we generate nodes for every . We refer to as the node’s ID, and as the node’s state. Consequently, this affects how nodes are named during a derivation step. The renaming for nodes always only affects the ID portion of the node, i.e., for any only renamings of the form will be allowed. The rules by which the ID is renamed still follow the method laid out in Section 3.1. However, as there may now be distinct nodes and for two different states , we also require that and , i.e., nodes with the same ID will keep this property.
We recall some necessary definitions. A nondeterministic finite automaton (NFA) over an alphabet is a tuple , where is a finite set of states, is the transition relation, is the initial state, and is the final state. A tuple is called a configuration of . A configuration can follow a configuration , denoted , if
1. for some and , or
2. and .
We denote the transitive closure of by . The language decided by is denoted as . Let and be two NFAs. By we denote the product construction defined by and , where
[TABLE]
Note that, while the product construction defines a system of state transitions, it does not set initial and final states by default. However, the NFA does decide the language . This property is used to decide whether there exists a path satisfying a regular path query between two given nodes. Such automata without initial/final states are also just called transition systems. The product construction still works if either or both of the automata are transition systems. A simple graph with edge labels from defines a transition system with and and .
Consider the product construction of and for some simple graph and an NFA . It represents every possible run of on any path within ; only by setting an initial and a final state do we choose specific start and end nodes within , and initial/final states of . Thus, for any two nodes of we can check whether there is a path between them that accepts with the same product construction. We next extend this notion to SL-HR grammars representing simple graphs, by describing a construction which, given an SL-HR grammar and an NFA , generates an SL-HR grammar representing the product construction of and . Thus, we achieve a compressed representation of all runs of on paths of .
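On an explicitly given graph, the product-based check can be sketched as below. The code explores the product of graph and NFA on the fly by BFS instead of materializing it; pairs (node, state) play the role of product states. The NFA is passed as a list of transitions in which the label `None` stands for an ε-transition; this encoding and all names are assumptions made for the example.

```python
from collections import deque


def rpq_holds(edges, nfa_delta, q0, qf, u, v):
    """Decide whether some path from u to v spells a word accepted by the NFA.

    edges:     list of (source, label, target) graph edges
    nfa_delta: list of (state, label_or_None, state) transitions
    q0, qf:    initial and final state of the NFA
    """
    graph_out = {}
    for src, label, dst in edges:
        graph_out.setdefault(src, []).append((label, dst))
    nfa_out = {}
    for p, label, q in nfa_delta:
        nfa_out.setdefault(p, []).append((label, q))

    start = (u, q0)
    seen = {start}
    queue = deque([start])
    while queue:
        node, state = queue.popleft()
        if node == v and state == qf:
            return True
        for label, q in nfa_out.get(state, []):
            if label is None:
                # epsilon-transition: the automaton moves, the graph node stays
                successors = [(node, q)]
            else:
                # labeled transition: follow graph edges carrying the same label
                successors = [(dst, q) for a, dst in graph_out.get(node, []) if a == label]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False


# Hypothetical data: a path graph and an NFA for the expression a b*.
EDGES = [(1, "a", 2), (2, "b", 3), (3, "b", 4), (4, "a", 5)]
NFA = [(0, "a", 1), (1, "b", 1)]
```

With this data, nodes 1 and 4 satisfy the query (the path spells `abb`), whereas nodes 1 and 5 do not (the only path spells `abba`).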
Lemma 15.
Given an SL-HR grammar that generates a simple graph and an NFA , we can compute in time an SL-HR grammar , such that .
Proof.
The construction is straightforward: for every rule of , the product construction of and is computed. Due to this, the rank increases: if , then . Accordingly, some care has to be taken with nonterminal edges, as these also increase in rank. They need to be attached to the same node-IDs as before, but once for every state in . To make sure that nodes attached to nonterminal edges and external nodes match up correctly, we enforce some (arbitrary) order on .
Let and let be a total order on . For a string let be the string . For every , contains a rule defined as follows. Let and
[TABLE]
such that
[TABLE]
For the external nodes we set . This defines the rules of . Clearly, we do not add any nonterminals, and thus have . Finally is constructed from and using the same construction as described above for the right-hand sides of the rules (i.e., ). ∎
It should be clear that . It can now be used to decide regular path queries:
Theorem 16.
Given an SL-HR grammar , where is a simple graph, an RPQ , and a pair of nodes from , it can be decided in time whether and satisfy .
Proof.
First compute an NFA from by using Thompson’s well-known construction [51]. Note that . Using Lemma 15 compute the grammar generating the product construction of and . Now, using Theorem 14, decide reachability from to in . Note that are node-IDs in , but a -representation of can be converted into a -representation of by using . If a path exists, we can conclude that there is a path from to satisfying the RPQ . Otherwise, it does not exist. ∎
As an example showing how powerful this construction is, consider the string , which can be represented by a grammar as and the obvious SL-HR grammar representing the string-graph encoding this string. The minimal NFA deciding has 5 states. The product construction of and would therefore have 205 states, whereas the grammar only has 95 states, and just replacing the nonterminals with their skeleta in the small start graph (20 states), e.g., when testing for reachability between and , immediately shows that there is an accepting run from the first to the last node (i.e., nodes [math] and , respectively) of the graph.
We can also use this “product grammar” to decide a more general problem. In a way, the product construction encodes every pair of nodes that fulfills the query . Thus, the product grammar is a compressed representation of every such pair. We can therefore decide whether there exists any pair of nodes at all that satisfies the query. Intuitively, this is done by computing, bottom-up, for every rule, which external nodes can be reached from any initial state, and from which external nodes a final state can be reached. If an external node appears in both these lists at some point, we know that there is a path between some initial and final state.
Theorem 17.
Given an SL-HR grammar , where is a simple graph, and an RPQ , it can be decided in time , if there exist of nodes in satisfying .
Proof.
As before, construct from , and from and according to Lemma 15. We now compute some helpful information in a bottom-up pass over the grammar, i.e., iterating over the rules in the reverse -order. For a rule we first compute whether any node can be reached from any node (where and are nodes in ) within , using the methods from Theorem 14. If this is the case, a pair satisfying exists. Otherwise, we compute two sets of nodes and . Intuitively, contains the external nodes of that can be reached from some (an initial state) while contains the external nodes of from which a node (a final state) is reachable.
and for a rule are computed in the following way: first we compute two sets of nodes, and . The set contains every node with the following properties:
- •
there exists a nonterminal edge such that , and
- •
if is on position in , then the th external node of is in .
The set is defined analogously, but using instead of . Now we replace every nonterminal edge of by its skeleton (see Definition 13) and let this graph be . To we add a new node and an edge from to every node in . Every external node now reachable from belongs to the set . This set can be computed by a single BFS-traversal starting in . To compute we add a new node to that has an edge from every node in . Every external node that can reach is part of . This can be computed by reversing all the edges and again starting a BFS traversal from .
If, for any , , then a pair of nodes exists within such that they satisfy . Otherwise, we compute and for the start graph , replace the nonterminals in by their skeleta, and add both and to as above. If is now reachable from , then there exist nodes within satisfying . Otherwise no such nodes exist. ∎
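The two BFS passes used in this proof, one from an added source node and one on the reversed graph from an added sink node, can be sketched on a plain edge list. The function names and the terms "forward set" and "backward set" are chosen for this example only; the graph is assumed to already have skeletons in place of nonterminal edges.

```python
from collections import deque


def _bfs(adjacency, start):
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def _adjacency(edges):
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    return adjacency


def externals_reached_from(edges, sources, external):
    """Forward set: external nodes reachable from at least one source."""
    s = object()  # fresh node that cannot clash with existing node names
    all_edges = list(edges) + [(s, x) for x in sources]
    seen = _bfs(_adjacency(all_edges), s)
    return {x for x in external if x in seen}


def externals_reaching(edges, targets, external):
    """Backward set: external nodes from which at least one target is reachable."""
    t = object()
    # Reverse every edge, then search from the added sink node.
    reversed_edges = [(v, u) for u, v in edges] + [(t, x) for x in targets]
    seen = _bfs(_adjacency(reversed_edges), t)
    return {x for x in external if x in seen}
```

If the forward set and the backward set of the same graph share a node, some source can reach some target through that node, which is the test applied per rule in the proof.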
Note that the constructions in this section all generalize to hypergraphs (instead of simple graphs) in a straightforward way. There is some ambiguity in how to define transition systems using hyperedges. We suggest using an approach similar to the definition of paths within hypergraphs (cf. Section 3). A hyperedge in a transition system would thus be considered as directed with one source node (the first one it is attached to) and multiple target nodes. Semantically, a rank () hyperedge labeled and attached to nodes would then be the same as simple edges all labeled , starting in , and using as target nodes.
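The suggested reading of a hyperedge as one source with several targets can be made concrete with a small helper. The function name is hypothetical, and the sketch assumes that the first attached node is the source and every further attached node is a target, which is how the paragraph above is read here.

```python
from typing import Hashable, List, Sequence, Tuple

Node = Hashable


def hyperedge_to_simple_edges(label: str, attached: Sequence[Node]) -> List[Tuple[Node, str, Node]]:
    """Expand one directed hyperedge into simple labeled edges.

    The first attached node is treated as the source; each remaining
    attached node becomes the target of one simple edge with the same label.
    """
    if len(attached) < 2:
        raise ValueError("a hyperedge needs a source and at least one target")
    source, *targets = attached
    return [(source, label, target) for target in targets]
```

A hyperedge labeled `a` attached to nodes 1, 2, 3 thus yields the simple edges `(1, "a", 2)` and `(1, "a", 3)`, after which the constructions for simple graphs apply unchanged.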
7 Conclusions
We present a generalization to (hyper)graphs of the RePair compression scheme as known for strings and trees. Our generalization produces from a given graph a straight-line hyperedge replacement grammar (SL-HR grammar). We prove some theoretical results about SL-HR grammars, for instance, if the given graph is a string or a tree, then an SL-HR grammar cannot compress much better than an ordinary string or tree grammar; thus, in these cases the use of graph grammars does not offer stronger compression than the existing native grammars. In terms of the RePair compression, we prove that (as for trees) the choice of the maximal rank of a grammar can heavily influence the compression behavior.
We then study an implementation of RePair. We observe that for graphs, finding a digram with a maximal number of non-overlapping occurrences is computationally hard; using state-of-the-art algorithms it requires at least cubic time. We therefore introduce an approximation which counts greedily and heavily depends on the order in which the graph is traversed. We experiment with several node orders and find that an order that generalizes the node degree order and is similar to the one used in the Weisfeiler-Lehman approximative isomorphism test [53] gives the best compression results. We compare our compressor to state-of-the-art graph compressors. Over network graphs (which have no edge labels), we do not obtain a conclusive answer: sometimes our compressor gives the strongest compression, sometimes the other compressors do. Over RDF graphs (which are edge-labeled) our compressor gives the best results, sometimes smaller by several orders of magnitude than those of other compressors. We also obtain the best results for “version graphs”, which are disjoint unions of very similar versions of the same graph.
We prove that, as in the case of strings and trees, there exist interesting “speed-up” algorithms. Such algorithms can run faster on the compressed grammar than on the original, by a factor proportional to the compression ratio of the grammar. We show that reachability between two given nodes offers such an algorithm, as well as evaluating regular path queries over two nodes. The latter asks if there exists a path so that the string of path labels matches a given regular expression. We show that even without a candidate pair of nodes given, we can determine within similar time bounds whether or not there exists any pair of nodes in the original graph with a path between them matching the regular expression.
For future work, there are several paths to follow. It would be interesting to consider a RePair compression scheme for graphs that is based on node replacement (NR) graph grammars. NR graph grammars can compress some graphs, e.g. cliques, much more strongly than HR grammars. On the other hand, rules of NR grammars are more complex and expensive to store. For our compressor, other node orders should be considered which can give rise to stronger compression. Especially for version graphs we believe that different orders could give better results; one possibility, for instance, is to compute the edit distance between two versions of a graph, and to use it to compute a node order that is beneficial for our RePair compressor. In terms of speed-up algorithms there is much work to be done. Foremost, we would like to implement our algorithms and show that they can be faster than existing state-of-the-art graph databases. Of fundamental interest and importance is the study of the complexity of the isomorphism problem for SL-HR grammars. Can it be decided in polynomial time if two graphs represented by SL-HR grammars are isomorphic? For strings and trees similar results hold (see [32] and [7]), but note that even extending the latter result from (ordered ranked) trees to arbitrary trees is non-trivial [38] and that the complexity of deciding the isomorphism of two explicitly given graphs is currently not known to be polynomial.
References
- [1] S. Álvarez-García, N. R. Brisaboa, J. D. Fernández, M. A. Martínez-Prieto, and G. Navarro. Compressed vertical partitioning for efficient RDF management. Knowl. Inf. Syst., 44(2):439–474, 2015.
- [2] A. Apostolico and G. Drovandi. Graph Compression by BFS. Algorithms, 2(3):1031–1044, 2009.
- [3] P. Boldi, M. Santini, and S. Vigna. Permuting web graphs. In Algorithms and Models for the Web-Graph, pages 116–126. 2009.
- [4] P. Boldi and S. Vigna. The webgraph framework I: compression techniques. In WWW, pages 595–602, 2004.
- [5] N. R. Brisaboa, S. Ladra, and G. Navarro. Compact representation of web graphs with extended functionality. Inf. Syst., 39:152–174, 2014.
- [6] G. Buehrer and K. Chellapilla. A Scalable Pattern Mining Approach to Web Graph Compression with Communities. In WSDM, pages 95–106, 2008.
- [7] G. Busatto, M. Lohrey, and S. Maneth. Efficient memory representation of XML document trees. Inf. Syst., 33(4-5):456–474, 2008.
- [8] J. Cai, M. Fürer, and N. Immerman. An optimal lower bound on the number of variables for graph identifications. Combinatorica, 12(4):389–410, 1992.
