Edge Replacement Grammars: A Formal Language Approach for Generating Graphs

Revanth Reddy; Sarath Chandar; Balaraman Ravindran

arXiv:1902.07159·cs.SI·February 25, 2019


TL;DR

This paper introduces Restricted Probabilistic Edge Replacement Grammars (RPERGs), a formal graph generation model that efficiently learns from data and captures key structural properties of real-world networks, outperforming existing methods.

Contribution

It proposes RPERG, a novel restricted grammar model for graph generation, with an efficient learning algorithm and improved performance over existing graph generative models.

Findings

1. RPERGs outperform existing graph generation methods on real datasets.

2. The model captures structural properties like degree distribution and spectral characteristics.

3. It improves upon the Hyperedge Replacement Grammar based models.


Tables

Table 1: Dataset statistics for real-world graphs.

Dataset	Nodes	Edges	Diameter	Clust. Coeff.
Arxiv	5242	14496	17	0.529
Routers	6474	13895	9	0.252
Enron	36692	183831	11	0.497
DBLP	317080	1049866	21	0.632

Table 2: Cosine distance between the eigenvector centrality of the original graph and that of graphs from each generator.

Dataset	RPERG	HRG	Chung-Lu	Kronecker
Arxiv	0.0025	0.0161	0.3496	0.3406
Routers	0.0247	0.0411	0.0379	0.0614
Enron	0.00007	0.0002	0.0052	0.0676
DBLP	0.0079	0.0649	0.5854	0.4997

Table 3: Graphlet Correlation Distance values.

Dataset	RPERG	HRG	Chung-Lu	Kronecker
Arxiv	1.086	1.094	1.792	2.071
Routers	1.293	1.404	1.975	2.776
Enron	0.487	0.525	1.319	2.83
DBLP	0.409	1.602	1.738	2.821

Equations

$$p(g) = \prod_{i=1}^{n} p(A_{i} \to R_{i})$$

$$p(G) = \prod_{A \to R \in P} p(A \to R)^{c(A \to R)}$$

$$p_{ML}^{A \to R} = \frac{c_{D}(A \to R)}{\sum_{R^{\prime} : A \to R^{\prime}} c_{D}(A \to R^{\prime})}$$


Full text

Edge Replacement Grammars: A Formal Language Approach for Generating Graphs

Revanth Reddy1,* Sarath Chandar2,* Balaraman Ravindran1,3

1Department of Computer Science and Engineering, Indian Institute of Technology Madras

2Mila, Université de Montréal

3Robert Bosch Centre for Data Science and AI, Indian Institute of Technology Madras

*Both authors contributed equally.

Abstract

Graphs are increasingly becoming ubiquitous as models for structured data. A generative model that closely mimics the structural properties of a given set of graphs has utility in a variety of domains. Much of the existing work requires that a large number of parameters, in fact exponential in the size of the graphs, be estimated from the data. We take a slightly different approach to this problem, leveraging the extensive prior work in the formal graph grammar literature. In this paper, we propose a graph generation model based on Probabilistic Edge Replacement Grammars (PERGs). We propose a variant of PERG called Restricted PERG (RPERG), which is analogous to PCFGs in the string grammar literature. With this restriction, we are able to derive a learning algorithm for estimating the parameters of the grammar from graph data. We empirically demonstrate on real-life datasets that RPERGs outperform existing methods for graph generation. We improve on the performance of the state-of-the-art Hyperedge Replacement Grammar based graph generative model. Despite being a context-free grammar, the proposed model is able to capture many of the structural properties of real networks, such as degree distributions, power law and spectral characteristics.

**Keywords:** Graph Generative Models, Graph Mining, Graph Grammars

1 Introduction

Graphs are used to represent various structured data. A variety of networks ranging from social networks to biological networks can be represented as graphs with nodes representing entities and edges representing the relationship between them. Because of the widespread use of graphs as a representation language, many of the usual machine learning tasks are now being specialized for graphs.

One such machine learning task is to estimate the parameters of a generative model of a graph. A good generative model should capture the structural properties of the graph, such as degree distribution, community structure, small diameter, eigenvalue distributions and so on. The advantages of having a good generative model for a class of graphs are several-fold:

• We can use the generative model to generate realistic graphs and run simulation studies on them, instead of running experiments on the real network, which might not always be feasible.

• If the model fits accurately, we can use it to compress the graph data by saving just the model instead of the entire graph.

• We can do graph classification if we can learn a generative model for a class of graphs, by computing the likelihood of a test graph under the given model.

• The model can also be used to anonymize graph data, by generating graphs similar to the original graphs and keeping the original graphs confidential. This is particularly helpful for medical data.

These advantages make the design of generative models for graphs an important research problem in network science, and various graph generation models have been proposed in the past. The earliest generative model in a probabilistic setting was the E-R random graph model [9]. However, the model fails to match several network properties. Specifically, it does not produce heavy-tailed degree distributions. To overcome this, several other models were proposed. Most of these belong to the family of preferential attachment models [4, 5], which employ the “rich get richer” phenomenon that leads to power law distributions. There are several variations of “rich get richer” models, such as the “copying model” [14], the “winner does not take all” model [18], the “forest fire” model [17] and so on. There is also a different class of models that simulate the “small world network” [22]. For a detailed survey of the existing statistical network models, refer to [10].

Most of these models match one or a few of the properties of natural graphs. There has been significant interest in coming up with a single model that can simulate most of the graph properties. Kronecker graph generators [15] are an example. However, they have a few limitations; for example, the number of nodes is predetermined. A recursive realistic graph generator using random typing is proposed in [2]. Even though there are many such models, designing a model that has a fast and scalable learning procedure while also capturing all the structural properties of the network is still a challenging problem.

In this work, we propose a graph generation model based on graph grammars. Unlike other graph generation models, we view the graph generation process as a formal language derivation process. We assume that there is an underlying grammar which generates the graph, and that the graph evolves according to the grammar rules. The problem of graph generation is thus reduced to that of inducing the grammar which generated the data, allowing us to leverage extensive prior work in the formal graph grammar literature. Here, the graph generation process is viewed as a derivation from a single edge using a probabilistic edge replacement grammar. The likelihood of a graph belonging to a particular family is the probability of the derivation under the appropriate grammar. The idea of using graph grammars for graph generation was also explored in [1], where the authors propose a hyperedge replacement grammar (HRG) based generative model. We compare our approach with theirs.

Edge Replacement Grammars are graph grammar formalisms where the rules replace an edge in a graph with another graph. We propose a variant of Probabilistic Edge Replacement Grammar called Restricted Probabilistic Edge Replacement Grammar (RPERG). RPERG is analogous to PCFG in string grammar literature. This will become evident once we define RPERGs formally. We tested the capabilities of the model by fitting it onto several real world datasets. Experimental results demonstrate that the model is able to capture most of the statistical and structural properties of the graph better than existing graph generators. The major advantages of this model over the existing models are as follows:

• The model makes no assumptions on the underlying graph family.

• The model assumes no specific parametric form. The parameters of this model are the grammar rules, and the number of rules is determined by the complexity of the data.

• The model parameters are more easily interpretable. They are the statistically significant subgraph patterns that repeat themselves in the graph, and can also be considered as the motifs of the graph.

Contributions of this paper are several-fold:

• We define a family of graphs called “non-squeezable graphs” and provide a complete characterization of the family.

• We propose a PCFG-equivalent grammar in the graph grammar literature based on non-squeezable graphs, which we call Restricted Probabilistic Edge Replacement Grammar (RPERG).

• We provide a maximum-likelihood learning methodology to learn the grammar from the given data.

• We show that the proposed model captures the structural properties of the graph better than existing state-of-the-art graph generators.

The rest of the paper is organized as follows. In Section 2, we define the basic terminology related to Edge Replacement Grammars and briefly describe the existing Hyperedge Replacement Grammar (HRG)[1] based approach. Section 3 introduces the family of non-squeezable graphs and gives an algorithm for learning RPERGs from a set of graphs. In Section 4, we provide results for performance of the proposed models on various datasets. Section 5 concludes the paper and gives directions for future work.

2 Background

2.1 Edge Replacement Grammars

We define Edge Replacement Grammars (ERGs) along the lines of Hyperedge Replacement Grammars (HRGs) by [7]. For the sake of simplicity, we state our definitions in terms of edge-labeled undirected graphs. The concepts can be easily extended to accommodate node labels as well as directed edges.

Definition 1.

An edge replacement grammar (ERG) is a tuple $\mathcal{G}$ = $\langle N,T,P,S\rangle$ where

• N and T are finite disjoint sets of non-terminal and terminal edge labels.

• S $\in$ N is the start edge label.

• P is a finite set of productions of the form A $\rightarrow$ R, where $A\in$ N and R is a graph with edge labels drawn from N $\cup$ T.

We say that a graph $X^{\prime}$ is derived from a graph $X$ in ERG $\mathcal{G}$ if we can obtain $X^{\prime}$ by applying a series of production rules starting from $X$. We denote this by $X\Longrightarrow_{\mathcal{G}}^{*}X^{\prime}$. Figure 1(a) gives an example ERG and Figure 1(b) gives a sample derivation using the grammar. Another important thing to note is that the paper makes the assumption that T = $\{\epsilon\}$, i.e., all the edge labels are non-terminal edge labels.

Definition 2.

A Probabilistic Edge Replacement Grammar (PERG) consists of

• An edge replacement grammar $\mathcal{G}$ = $\langle N,T,P,S\rangle$.

• A parameter p(A $\rightarrow$ R) for each rule $A\rightarrow R\in P$, which is the conditional probability of choosing this rule given that the non-terminal being expanded is A. For any $X\in N$, $\sum_{A\rightarrow R:A=X}p(A\rightarrow R)=1$.

Let $G_{\mathcal{G}}$ be the set of all graphs that can be generated from the grammar $\mathcal{G}$ . For any graph $g\in G_{\mathcal{G}}$ generated by applying the rules $A_{1}\rightarrow R_{1}$ , $A_{2}\rightarrow R_{2}$ ,…, $A_{n}\rightarrow R_{n}$ , the probability of $g$ under PERG is given by

$$p(g)=\prod_{i=1}^{n}p(A_{i}\rightarrow R_{i})$$

If we assign probabilities 0.2, 0.4, 0.4 to the three rules in Figure 1(a) respectively, then the probability of the graph generated in Figure 1(b) is given by $0.4\times 0.4\times(0.2)^{4}$. The sum of probabilities of all $g\in G_{\mathcal{G}}$ will be 1. Here, the probability of $g$ under $\mathcal{G}$ is the probability of generating the graph $g$ by sampling rules from the grammar $\mathcal{G}$.
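As a quick check of the arithmetic above, the following Python sketch multiplies the probabilities of the rules used in a derivation. The labels `rule1`, `rule2`, `rule3` and the function name are hypothetical stand-ins for the three rules of Figure 1(a), and the applied sequence is only assumed to be one that is consistent with the product quoted in the text (two applications of 0.4-rules and four of the 0.2-rule). Since the three probabilities sum to 1, the sketch also assumes the rules share a single left-hand side.

```python
import math

# Hypothetical labels for the three rules of Figure 1(a), with the
# probabilities assigned to them in the text.
rule_prob = {"rule1": 0.2, "rule2": 0.4, "rule3": 0.4}

# Rules with the same left-hand side must have probabilities summing to 1.
assert math.isclose(sum(rule_prob.values()), 1.0)


def derivation_probability(applied_rules, rule_prob):
    """Product of p(A_i -> R_i) over the rules applied in a derivation."""
    return math.prod(rule_prob[r] for r in applied_rules)


# Assumed sequence: two 0.4-rule applications and four 0.2-rule applications.
applied = ["rule2", "rule3", "rule1", "rule1", "rule1", "rule1"]
p = derivation_probability(applied, rule_prob)
print(p)  # approximately 0.000256
```

The order of the applied rules does not affect the result, since the probability is a plain product.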

2.2 HRG based approach

The HRG based graph generative model [1] has been shown to outperform the existing Chung-Lu [8] and Kronecker [16] models. In this section, we give a brief overview of the HRG based approach. First, we introduce clique trees and then define hyperedge replacement grammars. The content in this section is based on [1].

All graphs can be decomposed into a clique tree. A network’s clique tree encodes robust and precise information about the network. Here, we just give a brief definition of clique trees. For more information, we refer the reader to Chapters 9 and 10 of [13].

Definition 3.

A clique tree of a graph H = (V,E) is a tree T, each of whose nodes $\eta$ is labelled with a $V_{\eta}\subseteq V$ and $E_{\eta}\subseteq E$ , such that the following properties hold:

• Vertex Cover: For each $v\in V$, there is a node $\eta\in T$ such that $v\in V_{\eta}$.

• Edge Cover: For each hyperedge $e_{i}=\{v_{1},...,v_{k}\}\in E$, there is exactly one node $\eta\in T$ such that $e_{i}\in E_{\eta}$. Moreover, $v_{1},...,v_{k}\in V_{\eta}$.
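The two properties listed in Definition 3 can be checked mechanically. The sketch below is only an illustration: the function name and the `bags` layout (a dictionary from tree node to a pair of vertex set and hyperedge set) are assumptions, and it inspects just the vertex-cover and edge-cover conditions stated above, not the shape of the tree itself.

```python
def is_valid_clique_tree(vertices, hyperedges, bags):
    """Check the vertex-cover and edge-cover properties of Definition 3.

    bags maps each tree node eta to a pair (V_eta, E_eta), where V_eta is a
    set of vertices and E_eta is a set of hyperedges (frozensets of vertices).
    """
    # Vertex cover: every vertex appears in at least one V_eta.
    covered = set()
    for v_eta, _ in bags.values():
        covered |= v_eta
    if not set(vertices) <= covered:
        return False

    # Edge cover: every hyperedge is stored in exactly one node, and that
    # node's vertex set contains all endpoints of the hyperedge.
    for e in hyperedges:
        holders = [eta for eta, (_, e_eta) in bags.items() if e in e_eta]
        if len(holders) != 1:
            return False
        v_eta, _ = bags[holders[0]]
        if not e <= v_eta:
            return False
    return True


# Path a - b - c, decomposed into two tree nodes.
ab, bc = frozenset("ab"), frozenset("bc")
bags = {"n1": ({"a", "b"}, {ab}), "n2": ({"b", "c"}, {bc})}
ok = is_valid_clique_tree("abc", {ab, bc}, bags)
print(ok)
```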

A hyperedge is an edge which can connect any number of vertices. If a hyperedge $e$ connects vertices $v_{1},v_{2},...,v_{i}$, then it is denoted as $e=\{v_{1},v_{2},...,v_{i}\}$. Here, $|e|=i$. A hypergraph is a graph $H=(V,E)$ in which each edge is a hyperedge.

Definition 4.

A hyperedge replacement grammar is a tuple $\mathcal{G}$ = $\langle N,T,S,P\rangle$ , where

• N is a finite set of non-terminal symbols. Each non-terminal A has a non-negative integer rank, which we write $|A|$.

• T is a finite set of terminal symbols.

• $S\in N$ is a distinguished starting non-terminal, and $|S|=0$.

• P is a finite set of production rules $A\rightarrow R$, where

– A is a non-terminal symbol.

– R is a hypergraph whose edges are labelled by symbols from $T\cup N$. If an edge e is labelled by a non-terminal B, we must have $|e|=|B|$.

– Exactly $|A|$ vertices of R are designated external vertices. The other vertices in R are called internal vertices.

The first step in learning an HRG from a graph is to compute a clique tree from the original graph. Finding the minimal-width clique tree is NP-complete [3]. [1] uses a Maximum Cardinality Search (MCS) heuristic introduced by [21] to compute a clique tree with a reasonably-low, but not necessarily minimal, width. Then, this clique tree induces an HRG in a natural way as shown below. The approach differs based on the type of node of the clique tree that is being processed. We refer the reader to [1] for a more detailed discussion and visualization of the HRG learning process.

• Interior Node: Let $\eta$ be an interior node of the clique tree $T$, let $\eta^{\prime}$ be its parent, and let $\eta_{1},\eta_{2},...,\eta_{m}$ be its children. Node $\eta$ corresponds to an HRG production rule $A\rightarrow R$ as follows. First, $|A|$ = $|V_{\eta^{\prime}}\cap V_{\eta}|$. Then, $R$ is formed by:

– Adding an isomorphic copy of the vertices in $V_{\eta}$ and the edges in $E_{\eta}$.

– Marking the (copies of) vertices in $V_{\eta^{\prime}}\cap V_{\eta}$ as external vertices.

– Adding, for each $\eta_{i}$, a non-terminal hyperedge connecting the (copies of) vertices in $V_{\eta}\cap V_{\eta_{i}}$.

• Root Node: The RHS is computed as in the interior node case, except that it has no external vertices. The start non-terminal $S$ is the LHS and it has rank 0.

• Leaf Node: The LHS and RHS are calculated in the same way as in the interior node case, except that no new non-terminal hyperedges are added to the RHS, as there are no children.
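The three cases above differ only in whether a parent or children exist, so they can be covered by one small function. The following Python sketch is an illustration of the construction as described here, not the implementation of [1]; the function name, the argument layout and the dictionary keys are hypothetical.

```python
def rule_from_clique_tree_node(v_eta, e_eta, v_parent, child_vertex_sets):
    """Build the production A -> R induced by one clique tree node.

    v_eta, e_eta      : vertex set and edge set stored at the node
    v_parent          : vertex set of the parent node, or None for the root
    child_vertex_sets : vertex sets of the children (empty list for a leaf)
    """
    # External vertices are those shared with the parent; the root has none.
    external = set() if v_parent is None else v_eta & v_parent
    return {
        "lhs_rank": len(external),          # |A| = |V_parent ∩ V_eta|
        "vertices": set(v_eta),             # copy of V_eta
        "terminal_edges": set(e_eta),       # copy of E_eta
        "external": external,
        # One non-terminal hyperedge per child, over the shared vertices.
        "nonterminal_edges": [frozenset(v_eta & v_child)
                              for v_child in child_vertex_sets],
    }


# Interior node {b, c, d} with parent {a, b, c} and one child {d, e}.
interior = rule_from_clique_tree_node(
    {"b", "c", "d"}, {frozenset("bc"), frozenset("cd")},
    {"a", "b", "c"}, [{"d", "e"}])

# Root node: no parent, so rank 0 and no external vertices.
root = rule_from_clique_tree_node(
    {"a", "b", "c"}, {frozenset("ab")}, None, [{"b", "c", "d"}])
```

Passing `None` for the parent gives the root case, and an empty child list gives the leaf case.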

3 Approach

3.1 Non-squeezable graphs

Learning PERG from the graph data is hard, since the RHS of rules can be any subgraph. So we define a restricted version of PERGs, which we call Restricted PERGs (RPERGs). Before defining RPERG, we introduce a new operation in connected graphs, called squeezing.

Definition 5.

*Let $u$, $v$ be a pair of vertices in the graph $G$. Let $g_{1},g_{2},...,g_{t}$ be the connected components obtained by removing $u,v$ from $G$. A squeezing operation with respect to $u,v$ is an operation where one of the components $g_{i}$ is replaced by an edge between $u,v$.*

Here, $t$ is the number of connected components obtained after removing $u$ , $v$ from G. When $t=1$ , the entire graph will be squeezed into a single edge. Squeezing can be viewed as the reverse operation of edge expansion. The following is a special case for the squeezing operation. If $t\geq 3$ and $g_{1},...,g_{t}$ are isolated vertices, then the squeeze operation with respect to $u,v$ replaces the entire graph with the edge $u,v$ . Figure 2 gives some examples for squeezing.
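To make the operation concrete, here is a minimal Python sketch of Definition 5 on a small example graph. The adjacency-dictionary representation, the helper names and the example graph are all assumptions made for illustration; the special case for isolated vertices described above is not handled.

```python
def make_adj(edges):
    """Undirected adjacency dictionary from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def components_without(adj, u, v):
    """Connected components g_1, ..., g_t left after deleting u and v."""
    seen, comps = {u, v}, []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            x = stack.pop()
            comp.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        comps.append(comp)
    return comps


def squeeze(adj, u, v, comp):
    """Replace the component comp by a single edge between u and v."""
    new = {x: {y for y in nbrs if y not in comp}
           for x, nbrs in adj.items() if x not in comp}
    new[u].add(v)
    new[v].add(u)
    return new


# Two paths between a and b: a-c-b and a-d-e-b.
adj = make_adj([("a", "c"), ("c", "b"), ("a", "d"), ("d", "e"), ("e", "b")])
comps = components_without(adj, "a", "b")      # the components {c} and {d, e}
squeezed = squeeze(adj, "a", "b", {"d", "e"})  # leaves the triangle a, b, c
```

Here $t=2$, so squeezing the component $\{d,e\}$ is a non-trivial squeeze: the path a-d-e-b collapses to the edge a-b and a triangle remains.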

A squeezing operation in which the entire graph is squeezed into a single edge is called a trivial squeeze. Now we will define a class of graphs called non-squeezable graphs.

Definition 6.

*A non-squeezable graph is a graph in which the only squeeze operation that is possible is the trivial squeeze.*

A graph is squeezable if there are non-trivial squeezes possible. Figure 3 gives examples for some non-squeezable graphs and squeezable graphs. Triangle and star graphs are considered to be degenerate cases for non-squeezable graphs. We will now try to characterize the class of graphs that are non-squeezable.

Proposition 1.

*All $k$-vertex connected graphs for $k\geq 3$ are non-squeezable.*

Proof.

The proof is based on the definition of the squeezing operation. For any $k$-connected graph with $k\geq 3$, we need at least 3 vertices to disconnect the graph into two components. The squeeze operation essentially finds a partition of the graph into two parts and squeezes one of them into an edge. This is not possible when $k\geq 3$, since one cannot find a pair of vertices that partitions the graph into two parts. Note that the converse of this proposition is not true. Figure 3-c is a counter-example which is 1-connected and non-squeezable.
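The argument hinges on the absence of a vertex pair whose removal disconnects the graph. A brute-force check for such a pair can be sketched as follows; this is an illustration only, with hypothetical function names, and it runs in roughly $O(n^{2}(n+m))$ time rather than using a linear-time triconnectivity algorithm.

```python
from itertools import combinations


def count_components(adj, removed):
    """Number of connected components after deleting the vertices in removed."""
    seen, count = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
    return count


def has_separating_pair(adj):
    """True if removing some pair of vertices leaves two or more components."""
    return any(count_components(adj, (u, v)) >= 2
               for u, v in combinations(adj, 2))


# K4 is 3-connected: no pair of vertices separates it.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
# The 4-cycle is only 2-connected: removing opposite vertices separates it.
c4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}

print(has_separating_pair(k4), has_separating_pair(c4))
```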

Proposition 2.

*Triangle and star graphs are the only graphs which are $k$-connected with $k<3$ and also non-squeezable.*

Proof.

The proof is based on the following lemma.

LEMMA 3.1. If $G=(V,E)$ is a non-squeezable graph, then $\forall$ separating pairs ($u,v$) in $G$, $\forall$ $x$ in $V\setminus\{u,v\}$, $u$ separates $x$ and $v$. Or, $\forall$ separating pairs ($u,v$) in $G$, $\forall$ $x$ in $V\setminus\{u,v\}$, $v$ separates $x$ and $u$.

Proof.

The proof is by contradiction. Let $G$ be a non-squeezable graph. Let us assume that for all the vertices except $x$, $u$ separates $x$ and $v$. Now $v$ separates $x$ and $u$, or $x$ is directly connected to $v$. This means that we can squeeze the sub-graph $u-v-x$ to $u-x$. This contradicts our assumption that $G$ is a non-squeezable graph. Thus, the lemma is true. Proposition 2 follows from this lemma.

Proposition 3.

*Any graph G can be squeezed into a single edge by successively squeezing all the non-squeezable sub-graphs in G.*

This proposition is trivial to prove. Thus non-squeezable graphs can be considered as the atomic blocks from which the graphs are constructed.

Definition 7.

*A squeeze minor is a non-squeezable sub-graph that we squeeze during the squeezing operation.*

Proposition 4.

*The multi-set of squeeze minors that are obtained by successive squeezing of non-squeezable graphs in a graph is unique.*

Proof.

We give a sketch of the proof. The multi-set of squeeze minors can contain only stars, triangles and triconnected components. Note that the stars in this multi-set will be unique, since they correspond to the cut vertices of the graph, which are unique. Now, we need to prove that the triangles and the triconnected components in the multi-set are unique. If we disconnect the graph at the cut vertices, we get biconnected components. Consider an arbitrary sequence of squeezes which results in the squeezing of a biconnected component into one of the edges in a star. Consider some squeeze in that sequence. That squeeze is possible only because the component being squeezed is triconnected. The component was either triconnected to begin with or became triconnected through the previous squeeze operations, which introduced virtual edges. Recursively, these triconnected components are also unique. The same argument holds for triangles.

If $G_{\mathcal{G}}$ is the set of all graphs that can be generated using the RPERG $\mathcal{G}$ , then from proposition 4, we can say that, for any $g\in G_{\mathcal{G}}$ , $g$ cannot be generated by applying different sets of rules.

3.2 Learning the Grammar

Now we will define a restricted version of PERG, called RPERG.

Definition 8.

*A Restricted Probabilistic Edge Replacement Grammar (RPERG) is a PERG such that for every rule $A\rightarrow R\in$ RPERG, R is a non-squeezable graph.*

RPERGs can be viewed as analogous to PCFGs in the string grammar literature, while PERGs are analogous to Tree Substitution Grammars (TSG). This is more intuitive in the sense that in PCFG, RHS of the rules can contain only a 2-level tree, while TSGs can contain any sub-tree as RHS. Similarly, RPERGs can contain only non-squeezable graph fragments in RHS, while PERGs can contain any graph fragment in RHS.

In this section, we describe an algorithm to learn an RPERG from a set of graphs. Let $D=(g_{1},g_{2},...,g_{n})$ be a set of graphs. We assume that the graphs are undirected and that the edges are of the same type and unweighted. Given this data, we need to learn the RPERG that could have generated it. We also assume that all the edges in the data are non-terminal edges. So, all the rules will have only non-terminal edges.

Consider a graph $G$ . Let $c(A\rightarrow R)$ be the count of the occurrence of the non-squeezable sub-graph $R$ in $G$ . Now, the probability of this graph $G$ under an RPERG is given as,

$$p(G)=\prod_{A\rightarrow R\in P}p(A\rightarrow R)^{c(A\rightarrow R)}$$

For a model built on a set of graphs D, the maximum likelihood estimation of the parameters of the model is given by,

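$$p_{ML}^{A\rightarrow R}=\frac{c_{D}(A\rightarrow R)}{\sum_{R^{\prime}:A\rightarrow R^{\prime}}c_{D}(A\rightarrow R^{\prime})}$$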

where $c_{D}(A\rightarrow R)$ is the count of the occurrences of the sub-graph $R$ in the data $D$ .
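The estimator is a relative-frequency count per left-hand side, which the following Python sketch illustrates. The rule keys such as `("A", "triangle")` and the counts are made-up placeholders (a real implementation would need a canonical form for each right-hand-side graph), and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mle_rule_probabilities(rule_counts):
    """p_ML(A -> R) = c_D(A -> R) / sum over R' of c_D(A -> R')."""
    totals = defaultdict(int)
    for (lhs, _), count in rule_counts.items():
        totals[lhs] += count
    return {rule: count / totals[rule[0]]
            for rule, count in rule_counts.items()}


def log_prob_graph(graph_counts, probs):
    """log p(G) = sum over rules of c(A -> R) * log p(A -> R)."""
    return sum(c * math.log(probs[rule]) for rule, c in graph_counts.items())


# Made-up counts c_D(A -> R) over a data set D, keyed by (LHS, RHS name).
counts_D = Counter({("A", "triangle"): 6, ("A", "star(3)"): 3, ("A", "K4"): 1})
probs = mle_rule_probabilities(counts_D)

# A graph whose decomposition uses the triangle rule twice and the K4 rule once.
counts_G = Counter({("A", "triangle"): 2, ("A", "K4"): 1})
log_p = log_prob_graph(counts_G, probs)
```

Working in log space avoids underflow when a graph uses many rules; exponentiating `log_p` recovers the product form of $p(G)$.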

Now the learning problem has been reduced to getting the counts of non-squeezable components in a graph. Let us first consider a simple approach to count the non-squeezable components from a graph. We can first find a non-squeezable sub-graph, squeeze it and repeat the same until we squeeze the entire graph into an edge. But, finding a non-squeezable subgraph by repeated squeezing is computationally expensive. We will propose a more efficient algorithm to count all non-squeezable sub-graphs based on Proposition 1 and the intuitions given in Propositions 3 and 4.

The learning algorithm is given in Algorithm 1. In the algorithm, star(n) denotes a star network with n+1 nodes. The statement $C(A\rightarrow R)+=1$ increments the count of the rule $A\rightarrow R$ if it is already present in the set of learnt rules; otherwise it adds the rule to the rule set and sets its count to 1. We assume that the Stack data structure and the grammar rule set are shared between the Main and Get_Components functions.

Finding split pairs in a biconnected component is the most non-trivial step in the learning algorithm. Any naive implementation of this module would take $O(n^{3})$ time. An algorithm that runs in time linear in the size of the graph is provided by [11]. We have used a publicly available implementation of this algorithm (https://github.com/adrianN/Triconnectivity) to find split pairs.

Note that the algorithm is inherently parallelizable. Once we split the graph, we can learn rules from the individual sub-graphs in parallel. This will make the algorithm even faster.

3.3 The Generative Model

In the previous section, we have seen an algorithm for learning RPERGs. Given a set of graphs, the learning algorithm will learn an RPERG from the graphs. In this section, we propose two different generative models for graphs based on the learnt RPERG.

Since we have assumed that the graphs have only one type of link, the rules contain only one non-terminal label, namely $A$ . We consider the absence of a label for an edge as a terminal; in other words, $\epsilon$ is the only terminal. The learning algorithm will consider all the edges in the given graph to be non-terminal edges. So, in the learnt grammar, all the edges in the RHS of the rules will be non-terminal edges.

3.3.1 Proposed Model - ERGM-1

The first model is based on grammar derivation. If we start with an edge labelled with A and apply the rules from the learnt grammar successively, the derivation will not terminate, since none of the learnt rules have terminal edges in the RHS. So we extend the learnt model with an additional rule which converts a labelled edge to an un-labelled edge. We assign probability $p$ to this rule and re-normalize the probabilities of the other rules accordingly. We call this grammar the modified RPERG. Now, the generation of a new graph is nothing but the successive application of the rules until all edges are terminal. The model is described in Algorithm 2.

This model gives the notion of likelihood of a graph being a member of a particular class of graphs. Given a set of graphs belonging to a class, we can learn a grammar for the class and parse the given graph with that grammar. The probability of the graph gives us some idea about the membership of the graph to this class. Although we cannot exactly control the size of the graph with this model, the coarse size of the graph can be approximately controlled by the parameter $p$ .
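The construction of the modified RPERG can be sketched in a few lines. The text does not spell out the re-normalization, so the sketch below assumes the natural choice of scaling every learnt rule by $(1-p)$; the rule names and the function name are hypothetical.

```python
import math


def add_termination_rule(rule_probs, p, name="terminate"):
    """Add a rule 'labelled edge -> unlabelled edge' with probability p.

    Assumption: the learnt probabilities are rescaled by (1 - p) so that
    the rules for the single non-terminal A still sum to 1.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    modified = {rule: (1.0 - p) * q for rule, q in rule_probs.items()}
    modified[name] = p
    return modified


learnt = {"triangle": 0.6, "star(3)": 0.3, "K4": 0.1}  # made-up values
modified = add_termination_rule(learnt, p=0.5)
print(modified)
```

A larger $p$ makes each non-terminal edge more likely to terminate at every step, which is why $p$ gives coarse control over the size of the generated graph.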

3.3.2 Proposed Model - ERGM-2

The second model uses the learnt RPERG directly, without any additions. We start with a single edge, then repeatedly choose an edge at random and replace it with a sub-graph given by a rule sampled from the distribution of rules, until we get a graph of the required size. By size, here we mean the number of nodes. Then we stop expanding and convert all non-terminal edges to terminal edges. The model is described in Algorithm 3. This model can be used to generate a graph of a required size, with desired properties that are learnt from a set of graphs.
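The loop described above can be sketched in Python as follows. This is a simplified illustration and not Algorithm 3 itself: the rule format (local nodes 0 and 1 of each right-hand side are glued onto the endpoints of the replaced edge), the two example rules and their probabilities are all assumptions.

```python
import random

# Hypothetical rule format: (number of nodes, edge list). Local nodes 0 and 1
# are identified with the endpoints of the edge being replaced.
RULES = {
    "triangle": (3, [(0, 1), (0, 2), (1, 2)]),
    "K4": (4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]),
}
PROBS = {"triangle": 0.7, "K4": 0.3}  # made-up rule probabilities


def generate(rules, probs, target_nodes, seed=0):
    """Expand random edges until the graph has at least target_nodes nodes."""
    rng = random.Random(seed)
    names = list(rules)
    weights = [probs[name] for name in names]
    edges = [(0, 1)]  # start from a single non-terminal edge
    n = 2             # number of nodes created so far
    while n < target_nodes:
        # Pick a random edge and remove it from the graph.
        u, v = edges.pop(rng.randrange(len(edges)))
        # Sample a rule and glue its right-hand side onto u and v.
        k, rhs = rules[rng.choices(names, weights=weights)[0]]
        mapping = {0: u, 1: v}
        for local in range(2, k):
            mapping[local] = n  # fresh node id for each internal node
            n += 1
        edges.extend((mapping[a], mapping[b]) for a, b in rhs)
    # Returning the plain edge list plays the role of converting all
    # remaining non-terminal edges to terminal edges.
    return n, edges


n, edges = generate(RULES, PROBS, target_nodes=50, seed=1)
```

Because the last replacement may add more than one node, the result can slightly overshoot the requested size.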

4 Experiments

Here, we show that RPERGs contain rules that capture the structure of the graph. We test our proposed model by fitting it onto several real-life graphs. First, we learn the grammar from the graph. Then, we generate graphs from the learnt grammar using the generative model. In our experiments, we use ERGM-2 to generate the graphs. In this section, we compare our approach against existing state-of-the-art graph generators.

4.1 Real World Datasets

The datasets considered in this paper are the same as those used in [1]. The networks vary not only in the number of vertices and edges, but also in the clustering coefficient, diameter, degree distribution and many other graph properties. Table 1 gives the statistics for these networks. The Arxiv GR-QC network covers scientific collaborations in the General Relativity and Quantum Cosmology section of Arxiv; Internet Routers is a network of autonomous systems of the internet connected with each other; Enron Emails is the email correspondence graph of the Enron corporation; DBLP is the co-authorship graph from the DBLP dataset. The graphs were obtained from the SNAP and KONECT repositories.

4.2 Comparison with existing models

We compare several properties of graphs from four different graph generators (RPERG, HRG [1], Chung-Lu [8], Kronecker [16]) with the original graph G. The HRG based approach has already been introduced in Section 2.2. The Chung-Lu model takes a degree distribution as input and generates a new graph with a similar degree distribution and size. The Kronecker model first learns an initiator matrix and then performs a recursive multiplication of that initiator matrix to create an adjacency matrix of the approximate graph. We use KronFit [16] to learn the $2\times 2$ initiator matrix.

In this section, we generate 10 graphs from each of the graph generators and plot the mean values for different properties. Figures 4, 5 and 6 contain plots of graph properties (Degree Distribution, Network Values, Hop Plot, Mean Clustering Coefficient, Scree Plot, Node Triangle Participation) for the Arxiv, Routers and Enron datasets respectively.

**Degree Distribution:** It is the distribution of the number of edges connecting to each vertex in the graph. From the plots, we can see that each of the generators gives graphs that are slightly different from the original graph, but all of them capture the power law degree distribution.

**Network Values:** This is a plot of the components of the eigenvector corresponding to the largest eigenvalue as a function of their rank. From the plots, it can be seen that RPERG performs consistently well across all graphs, but the difference between generators is difficult to discern. To compare the eigenvectors more concretely, the cosine distance between the eigenvector centrality of the original graph and that of the model’s generated graphs is shown in Table 2. It can be seen that the distance values are lowest for RPERG.
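The comparison behind Table 2 can be illustrated with a small pure-Python sketch. It computes eigenvector centrality by power iteration on $A+I$ (the shift is a choice made here so that the iteration also converges on bipartite graphs) and then takes the cosine distance. How the nodes of two different graphs are aligned is not specified in the text, so the sketch simply assumes both vectors are listed in the same node order; all names and the two toy graphs are hypothetical.

```python
import math


def eigenvector_centrality(adj, iters=200):
    """Leading eigenvector of the adjacency matrix via power iteration on A + I."""
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in nxt.values()))
        x = {v: val / norm for v, val in nxt.items()}
    return x


def cosine_distance(a, b):
    """1 - cos(angle) between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy graphs on the same node set {0, 1, 2, 3}.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}

c_star = eigenvector_centrality(star)
c_path = eigenvector_centrality(path)
nodes = sorted(star)
d = cosine_distance([c_star[v] for v in nodes], [c_path[v] for v in nodes])
```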

**Hop Plot:** Hop plot shows the number of vertex-pairs that are reachable within $x$ hops. It gives a sense of the distribution of the shortest path lengths in the network and of how quickly nodes’ neighborhoods expand with the number of hops. Similar to [17, 1], we generate the hop plot by choosing 50 random nodes and performing a complete breadth-first traversal over each graph. From the plots, we can see that the hop plots of RPERG are consistently similar to those of the original graph.
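A hop plot of this kind can be computed with one breadth-first search per source node, as in the following sketch. The function name and the toy graph are hypothetical; the sources are passed in explicitly, so sampling 50 of them (for example with `random.sample(list(adj), 50)`) is left to the caller.

```python
from collections import deque


def hop_plot(adj, sources):
    """Return a list whose entry h-1 is the number of (source, node) pairs
    at shortest-path distance between 1 and h."""
    per_distance = {}
    for s in sources:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # complete breadth-first traversal from s
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    per_distance[dist[y]] = per_distance.get(dist[y], 0) + 1
                    queue.append(y)
    cumulative, total = [], 0
    for h in range(1, max(per_distance, default=0) + 1):
        total += per_distance.get(h, 0)
        cumulative.append(total)
    return cumulative


# Path graph 0 - 1 - 2 - 3, using every node as a source.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(hop_plot(path, sources=path))  # [6, 10, 12]
```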

**Mean Clustering Coefficient:** Clustering coefficient is one particular measure of community structure that has been widely used in the literature [19, 12]. We plot the average clustering coefficient of the nodes as a function of their degree in the graph. From the plots, it can be seen that RPERG matches the community structure of the original graph. Similar to [20], we see that the Chung-Lu and Kronecker models perform poorly in this task.
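The quantity plotted here can be sketched as follows, using the standard local clustering coefficient (the fraction of a node's neighbor pairs that are themselves connected) averaged over all nodes of equal degree. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are connected to each other."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def mean_clustering_by_degree(adj):
    """Map each degree to the mean local clustering of nodes with that degree."""
    buckets = defaultdict(list)
    for v in adj:
        buckets[len(adj[v])].append(local_clustering(adj, v))
    return {k: sum(vals) / len(vals) for k, vals in buckets.items()}


# Triangle a-b-c with a pendant node d attached to a.
g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
by_degree = mean_clustering_by_degree(g)
```

In this toy graph, the degree-2 nodes have clustering 1, the degree-3 node has clustering 1/3, and the pendant node has clustering 0.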

**Scree Plot:** This is a plot of the eigenvalues of the graph adjacency matrix as a function of their rank, which has been found to obey a power law [6]. From the plots, we can see that RPERG is closest to the original graph in terms of eigenvalue distributions.

**Node Triangle Participation:** This is a plot of the number of triangles versus the number of nodes that participate in that many triangles. It is a measure of transitivity in networks [16], since edges in real-world networks tend to cluster [22] and form triads of connected nodes. From the plots, it can be seen that RPERG consistently captures the node triangle participation of the original graph.
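The data behind such a plot is a histogram of per-node triangle counts. A minimal sketch, with hypothetical function names and toy graphs, is:

```python
from collections import Counter
from itertools import combinations


def triangles_per_node(adj):
    """Number of triangles each node participates in."""
    return {v: sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
            for v in adj}


def triangle_participation(adj):
    """Map t -> number of nodes that participate in exactly t triangles."""
    return Counter(triangles_per_node(adj).values())


# In K4 every node lies in 3 triangles.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
# Triangle a-b-c with a pendant node d attached to a.
g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

print(triangle_participation(k4))  # Counter({3: 4})
print(triangle_participation(g))   # Counter({1: 3, 0: 1})
```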

**Graphlet Correlation Distance (GCD):** [23] introduced a metric called GCD, which computes the distance between two graphlet correlation matrices. GCD measures the frequency of the various graphlets present in each graph, i.e., the number of edges, wedges, triangles, squares, 4-cliques, etc., and compares the graphlet frequencies between two graphs. Because GCD is a distance metric, lower values are better. Table 3 compares the GCD between the original graph and the graphs generated using RPERG, HRG, Chung-Lu and Kronecker. The GCD values are lowest for our model.

4.3 Runtime Analysis

The overall runtime of the RPERG model can be split into two parts: (1) rule extraction and (2) graph generation. Let the given graph $G$ contain $n$ vertices and $m$ edges. Each iteration of the while loop in line 5 of Alg. 1 requires $O(n+m)$ time to find a split pair and $O(n+m)$ time to find all cut-vertices. The runtime of the RPERG learning process depends on the type of split obtained at each iteration. In the worst case, each split reduces the size of the graph by only one node, and the time complexity is $O(m\cdot n)$. Conversely, the best-case time complexity is $O(m\cdot \log n)$. By comparison, HRG rule extraction takes $O(m\cdot\Delta)$ time, where $\Delta$ is the maximum degree of $G$, and Kronecker learns its model in $O(m)$. Chung-Lu does not learn a model but takes the degree sequence as input.
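One way to read the two bounds for rule extraction, offered as a sketch rather than the authors' derivation: assume every iteration costs at most $c\,(n_i+m_i)$ for a hypothetical constant $c$, where $n_i \le n$ and $m_i \le m$ are the vertex and edge counts of the graph processed at iteration $i$, and assume $G$ is connected so that $n = O(m)$. If each split removes a single node, there are at most $n$ iterations. If instead every split is balanced and both parts are processed further (an assumption about how the best case arises), the work forms $O(\log n)$ levels with $O(n+m)$ total work per level.

$$
\text{worst case: } \sum_{i=1}^{n} c\,(n_i+m_i) \;\le\; c\,n\,(n+m) \;=\; O(m\cdot n),
\qquad
\text{best case: } O(\log n)\cdot c\,(n+m) \;=\; O(m\cdot \log n).
$$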

For RPERG and HRG, since graph generation is a straightforward application of the grammar rules, the time complexity is linear in the number of edges of the output graph. For Kronecker, graph generation takes $O(m)$ time, whereas it takes $O(n+m)$ for the Chung-Lu model.

5 Conclusion

We propose a graph generation model based on probabilistic graph grammars. We characterize the notion of non-squeezable graphs and restrict our attention to edge replacement rules that introduce non-squeezable components. Our experiments show that the graphs generated by our model resemble the original graph more closely than those obtained from existing graph generators. Even though the grammar is context-free, it is able to capture most of the statistical properties of the graph. We also observe that our algorithm is easily parallelizable: when multiple input graphs are given, it can be run on all of them simultaneously.

There are several possible extensions of the model. We could try to model preferential attachment by using a context-sensitive grammar. Handling graphs with multiple types of links (heterogeneous links) is also a challenging problem. In our algorithm, we stop searching as soon as we find the first split pair. This could be improved by continuing the search for a pair that splits the graph into two reasonably balanced halves.

Bibliography

[1] S. Aguiñaga, R. Palacios, D. Chiang, and T. Weninger, Growing graphs from hyperedge replacement graph grammars, in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ACM, 2016, pp. 469–478.
[2] L. Akoglu and C. Faloutsos, RTG: a recursive realistic graph generator using random typing, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2009, pp. 13–28.
[3] S. Arnborg, D. G. Corneil, and A. Proskurowski, Complexity of finding embeddings in a k-tree, SIAM Journal on Algebraic Discrete Methods, 8 (1987), pp. 277–284.
[4] A.-L. Barabási and R. Albert, Emergence of scaling in random networks, Science, 286 (1999), pp. 509–512.
[5] A.-L. Barabási, E. Ravasz, and T. Vicsek, Deterministic scale-free networks, Physica A: Statistical Mechanics and its Applications, 299 (2001), pp. 559–564.
[6] D. Chakrabarti, Y. Zhan, and C. Faloutsos, R-MAT: A recursive model for graph mining, in Proceedings of SDM 2004, SIAM, 2004, pp. 442–446.
[7] D. Chiang, J. Andreas, D. Bauer, K. M. Hermann, B. Jones, and K. Knight, Parsing graphs with hyperedge replacement grammars, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2013, pp. 924–932.
[8] F. Chung and L. Lu, Connected components in random graphs with given expected degree sequences, Annals of Combinatorics, 6 (2002), pp. 125–145.
