TL;DR
This paper introduces an autoregressive model for sequences of graphs with variable topology, using graph neural networks to predict future graphs, outperforming existing baselines on synthetic data.
Contribution
It formalizes an autoregressive model for general graph families and employs GNNs to learn and predict graph sequences, a novel approach in this domain.
Findings
Significantly better performance than baselines on synthetic graph data.
Effective modeling of variable-topology graph sequences.
Demonstrates the potential of GNNs in sequence prediction tasks.
Abstract
This paper proposes an autoregressive (AR) model for sequences of graphs, which generalises traditional AR models. A first novelty consists in formalising the AR model for a very general family of graphs, characterised by a variable topology, and attributes associated with nodes and edges. A graph neural network (GNN) is also proposed to learn the AR function associated with the graph-generating process (GGP), and subsequently predict the next graph in a sequence. The proposed method is compared with four baselines on synthetic GGPs, showing significantly better performance on all considered problems.
Table I: Hyperparameters of the GNN.
| Hyperparameter | Value |
|---|---|
| Weight for L2 | |
| Learning rate | |
| Batch size | |
| Early stopping | epochs |
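Table II: Test performance of NGAR on the rotational GGP.
| Order $p$ | Loss | MSE (node attributes) | log-loss (adjacency) | Accuracy (adjacency) |
|---|---|---|---|---|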
| 1 | 1.108 | 0.715 | 0.334 | 0.86 |
| 5 | 0.341 | 0.076 | 0.227 | 0.92 |
| 10 | 0.326 | 0.090 | 0.197 | 0.92 |
| 20 | 0.336 | 0.105 | 0.194 | 0.92 |
| 50 | 0.479 | 0.193 | 0.244 | 0.90 |
| 100 | 0.541 | 0.250 | 0.246 | 0.90 |
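Table III: Test performance of NGAR on the partially masked linear dynamical system.
| Complexity $c$ | Loss | MSE (node attributes) | log-loss (adjacency) | Accuracy (adjacency) |
|---|---|---|---|---|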
| 11 | 0.200 | 0.018 | 0.154 | 0.95 |
| 15 | 0.278 | 0.041 | 0.195 | 0.92 |
| 20 | 0.283 | 0.038 | 0.204 | 0.92 |
| 30 | 0.341 | 0.081 | 0.222 | 0.90 |
| 60 | 1.366 | 0.983 | 0.345 | 0.85 |
| 110 | 4.432 | 3.950 | 0.418 | 0.82 |
Autoregressive Models for Sequences of Graphs
This research is funded by the Swiss National Science Foundation project 200021_172671. We gratefully acknowledge partial support of the Canada Research Chairs program. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Daniele Zambon¹ (Università della Svizzera italiana, Lugano, Switzerland)
Lorenzo Livi (University of Manitoba, Winnipeg, Canada; University of Exeter, Exeter, United Kingdom)
Daniele Grattarola¹ (Università della Svizzera italiana, Lugano, Switzerland)
Cesare Alippi (Università della Svizzera italiana, Lugano, Switzerland; Politecnico di Milano, Milano, Italy)
¹ Equal contribution.
Index Terms:
graph, structured data, stochastic process, recurrent neural network, graph neural network, autoregressive model, prediction.
I Introduction
Several physical systems can be described by means of autoregressive (AR) models and their variants. In that case, a system is represented as a discrete-time signal in which, at every time step, the observation is modelled as a realisation of a random variable that depends on preceding observations. In this paper, we consider the problem of designing AR predictive models where the observed entity is a graph.
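In the traditional setting for AR predictive models, each observation generated by the process is modelled as a vector, intended as a realisation of a random variable $x_{t+1} \in \mathbb{R}^n$, so that

$$
\begin{cases}
x_{t+1} = f(x_t, x_{t-1}, \dots, x_{t-p+1}) + \epsilon, \\
\mathbb{E}[\epsilon] = 0, \\
\mathrm{Var}[\epsilon] = \sigma^2 < \infty.
\end{cases}
\tag{1}
$$

In other terms, each observation $x_{t+1}$ is obtained from the regressor

$$
x_t^p = [x_t, x_{t-1}, \dots, x_{t-p+1}]
$$

through an AR function $f(\cdot)$ of order $p$ affected by an additive stationary random noise $\epsilon$. Given (1), the prediction of $x_{t+1}$ is often taken as

$$
\hat{x}_{t+1} = \mathbb{E}[x_{t+1}] = f(x_t^p),
\tag{2}
$$

where the expectation is with respect to the noise $\epsilon$ at time $t+1$. The predictor from (2) is optimal when considering the $L^2$-norm loss between $\hat{x}_{t+1}$ and $x_{t+1}$.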
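To make the numerical setting concrete, here is a minimal sketch of model (1) and predictor (2), assuming a scalar process. The linear AR function `f`, its coefficients, the noise level, and all function names are hypothetical choices made for illustration; they do not come from the paper.

```python
import random

def f(regressor, coeffs=(0.6, -0.3)):
    """Hypothetical linear AR function; regressor = [x_t, x_{t-1}, ...]."""
    return sum(c * x for c, x in zip(coeffs, regressor))

def simulate(n_steps, p=2, sigma=0.1, seed=0):
    """Generate x_{t+1} = f(x_t, ..., x_{t-p+1}) + eps with zero-mean Gaussian eps."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(p)]  # arbitrary initial conditions
    for _ in range(n_steps):
        regressor = xs[-1:-p - 1:-1]  # most recent observation first
        xs.append(f(regressor) + rng.gauss(0.0, sigma))
    return xs

def predict(xs, p=2):
    """Predictor (2): the expected next value, i.e. f evaluated on the regressor."""
    return f(xs[-1:-p - 1:-1])

series = simulate(200)
x_hat = predict(series[:-1])      # predict the last sample from those before it
print(abs(x_hat - series[-1]))    # the residual is exactly one noise draw
```

Because the noise has zero mean, the prediction error of this predictor is the noise realisation itself, which is why no other predictor can achieve a lower expected squared error.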
Predicting the temporal evolution of a multivariate stochastic process is a widely explored problem in system identification and machine learning. However, several recent works have focused on problems where the vector representation of a system can be augmented by also considering the relations existing among variables [1], ending up with a structured representation that can be naturally described by a graph. Such structured information provides an important inductive bias that the learning algorithms can take advantage of. For instance, it is difficult to predict the chemical properties of a molecule by only looking at its atoms; on the other hand, by explicitly representing the chemical bonds, the description of the molecule becomes more complete, and the learning algorithm can take advantage of that information. Similarly, several other machine learning problems benefit from a graph-based representation, e.g., the understanding of visual scenes [2], the modelling of interactions in physical and multi-agent systems [3, 4], or the prediction of traffic flow [5]. In all these problems (and many others [1]), the dependencies among variables provide a strong prior that has been successfully leveraged to significantly surpass the previous state of the art.
In this paper, we consider attributed graphs where each node and edge can be associated with a feature vector. As such, a graph $g$ with $N$ nodes can be seen as a tuple $(\mathcal{V}, \mathcal{E})$, where

$$
\mathcal{V} = \{ v_i \in \mathbb{R}^F \}_{i = 1, \dots, N}
$$

represents the set of nodes with $F$-dimensional attributes, and

$$
\mathcal{E} = \{ e_{ij} \in \mathbb{R}^S \}_{v_i, v_j \in \mathcal{V}}
$$

models the set of $S$-dimensional edge attributes [6]. With this formulation, categorical attributes can also be represented, via vector encoding. We denote with $\mathcal{G}$ the set of all possible graphs with vector attributes. Graphs in $\mathcal{G}$ can have different order and topology, as well as variable node and edge attributes. Moreover, a correspondence among the nodes of different graphs might not be present, or can be unknown; in this case, we say that the nodes are non-identified.
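As an illustration of this representation, the sketch below stores a graph as a list of node attribute vectors plus a sparse map of edge attribute vectors. The class name `AttributedGraph`, its fields, and the `one_hot` helper are hypothetical and only meant to show how graphs of different order, topology, and categorical attributes fit the same container.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class AttributedGraph:
    """Hypothetical container: node attributes plus a sparse map of edge attributes."""
    node_attrs: List[List[float]]  # N rows, each an F-dimensional vector
    edge_attrs: Dict[Tuple[int, int], List[float]] = field(default_factory=dict)  # (i, j) -> S-dim vector

    @property
    def order(self) -> int:
        return len(self.node_attrs)

    def adjacency(self) -> List[List[int]]:
        """Binary adjacency matrix implied by the stored (directed) edges."""
        n = self.order
        adj = [[0] * n for _ in range(n)]
        for (i, j) in self.edge_attrs:
            adj[i][j] = 1
        return adj

def one_hot(label: str, vocabulary: List[str]) -> List[float]:
    """Vector encoding of a categorical attribute."""
    return [1.0 if label == v else 0.0 for v in vocabulary]

# Two graphs of different order and topology live in the same space.
bond_types = ["single", "double"]
g1 = AttributedGraph(
    node_attrs=[[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
    edge_attrs={(0, 1): one_hot("single", bond_types), (1, 2): one_hot("double", bond_types)},
)
g2 = AttributedGraph(
    node_attrs=[[0.2, 0.3], [0.9, 0.1]],
    edge_attrs={(1, 0): one_hot("single", bond_types)},
)
print(g1.order, g2.order)
print(g1.adjacency())
```

Note that the node indices used here impose an ordering; with non-identified nodes, that ordering carries no meaning and any comparison between graphs has to be invariant to it.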
Here we build on the classic AR model (1) operating on discrete-time signals, and propose a generalised formulation that allows us to consider sequences of generic attributed graphs. The formulation is thus adapted by modelling a graph-generating process (GGP) that produces a sequence of graphs $\{g_1, g_2, \dots, g_t, \dots\}$ in $\mathcal{G}$, where each graph is obtained through an AR function

$$
\phi(\cdot) : \mathcal{G}^p \to \mathcal{G}
\tag{3}
$$

on the space of graphs. In order to formulate an AR model equivalent to (1) for graphs, the AR function must be defined on the space of graphs, and the concept of additive noise must also be suitably adapted to the graph representation.
Previous art on predictive autoregression in the space of graphs is scarce, with [7] being the most notable contribution to the best of our knowledge. However, the work in [7] only considers binary graphs governed by a vector AR process on some graph features, and does not allow for attributes or non-identified nodes.
The novel contribution of this work is therefore two-fold. First, we formulate an AR system model generating graph-valued signals as a generalisation of the numerical case. In this formulation, we consider graphs in a very general form. In particular, our model deals with graphs characterised by:
- directed and undirected edges;
- identified and non-identified nodes;
- variable topology (i.e., connectivity);
- variable order;
- node and edge attributes (not necessarily numerical vectors; categorical data is also possible).
Our model also accounts for the presence of an arbitrary stationary noise, which is formulated as a graph-valued random variable that can act on the topology, node attributes, and edge attributes of the graphs.
Second, we propose to learn the AR function (3) using the recently proposed framework of graph neural networks (GNNs) [1], a family of learnable functions that are designed to operate directly on arbitrary graphs. This paper represents a first step towards modelling graph-valued stochastic processes, and introduces a possible way of applying existing tools for deep learning to this new class of prediction problems.
The rest of the paper is structured as follows: Section II-A formulates the problem of defining AR models in the space of graphs and Section II-B introduces the proposed architecture for the GNN; Section III reports the experimental analysis performed to validate the methodology.
II Neural Graph Autoregression
II-A Autoregressive Model for a GGP
Due to the lack of basic mathematical operators in the vast family of graphs we consider here, the generalisation of model (1) to account for graph data is non-trivial. For instance, we have to deal with the fact that the sum between two graphs is not defined, although it can be formulated in some particular cases, e.g., when two graphs have the same order, identified nodes and numerical attributes [8].
Let $\{g_1, g_2, \dots, g_t, \dots\}$ be a discrete signal where each sample is described by a graph data structure. As done for the numerical case, we model each observation of the process as a realisation of a random variable $g_{t+1} \in \mathcal{G}$, dependent on a graph-valued regressor

$$
g_t^p = [g_t, g_{t-1}, \dots, g_{t-p+1}]
$$

through an AR function $\phi(\cdot)$ on the space of graphs. Similarly to (1), $g_{t+1}$ is modelled as

$$
g_{t+1} = H(\phi(g_t^p), \eta),
\tag{4}
$$

where

$$
H : \mathcal{G} \times \mathcal{G} \to \mathcal{G}
$$

is a function that models the effects of the noise graph $\eta$ on the graph $\phi(g_t^p)$. (In the present paper, by noise graph we mean a graph-valued random variable distributed according to a stationary graph distribution [9].) Function $H$ in (4) is necessary because, as mentioned above, the sum between graphs $\phi(g_t^p)$ and $\eta$ is not generally defined.
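Assumptions made on the noise in (1) have to be adapted as well. In the classic formulation (1), the condition of unbiased noise is formulated as $\mathbb{E}[\epsilon] = 0$ or, equivalently, as $\mathbb{E}[f(x_t^p) + \epsilon] = f(x_t^p)$. In the space of graphs the assumption of unbiased noise can be extended as

$$
\phi(g_t^p) \in \mathbb{E}^{\mathrm{f}}_{\eta}\left[ H(\phi(g_t^p), \eta) \right],
\tag{5}
$$

where $\mathbb{E}^{\mathrm{f}}[\cdot] \subseteq \mathcal{G}$ is the set of mean graphs according to Fréchet [10], defined as the set of graphs minimising:

$$
\mathbb{E}^{\mathrm{f}}[g] = \arg\min_{g' \in \mathcal{G}} \int_{\mathcal{G}} d(g, g')^2 \, dQ_g(g).
\tag{6}
$$

Function $d(\cdot, \cdot)$ in (6) is a pre-metric distance between two graphs, and $Q_g$ is a graph distribution defined on the Borel sets of space $(\mathcal{G}, d)$. Examples of graph distances that can be adopted are the graph edit distances [11, 12, 13], or any distance derived by positive semi-definite kernel functions [14]. Depending on the graph distribution $Q_g$, there can be more than one graph minimising (6), hence ending up with $\mathbb{E}^{\mathrm{f}}[g]$ as a set. We stress that for a sufficiently small Fréchet variation of the noise graph $\eta$ (see Eq. (7)) and a metric $d(\cdot, \cdot)$, the Fréchet mean graph exists and is unique.
Equation (6) holds only when

$$
\int_{\mathcal{G}} d(g, g')^2 \, dQ_g(g) < \infty,
$$

which can be interpreted as the graph counterpart of $\mathrm{Var}[\epsilon] < \infty$ in (1). The variance in the graph domain can be expressed in terms of the Fréchet variation:

$$
\mathrm{Var}^{\mathrm{f}}[g] := \min_{g' \in \mathcal{G}} \int_{\mathcal{G}} d(g, g')^2 \, dQ_g(g) < \infty.
\tag{7}
$$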
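The Fréchet mean and Fréchet variation can be illustrated on a finite sample. The sketch below is an empirical stand-in under strong assumptions: graphs have identified nodes and equal order, each graph is a pair `(node_attributes, adjacency)`, the distance `dist` is a simple hypothetical substitute for a graph edit distance, and the minimisation is restricted to a finite candidate set (by default the sample itself) rather than the whole graph space.

```python
def dist(g, h):
    """Hypothetical graph distance for identified nodes and equal order:
    Euclidean distance between node attributes plus the number of differing
    adjacency entries. A stand-in for a proper graph edit distance."""
    (xg, ag), (xh, ah) = g, h
    attr = sum((a - b) ** 2 for rg, rh in zip(xg, xh) for a, b in zip(rg, rh)) ** 0.5
    edges = sum(a != b for rg, rh in zip(ag, ah) for a, b in zip(rg, rh))
    return attr + edges

def frechet_function(candidate, sample):
    """Empirical version of the integral in (6): mean squared distance to the sample."""
    return sum(dist(candidate, g) ** 2 for g in sample) / len(sample)

def frechet_mean_set(sample, candidates=None, tol=1e-12):
    """All minimisers of the empirical Frechet function over a finite candidate set,
    together with the attained minimum (the empirical Frechet variation)."""
    candidates = sample if candidates is None else candidates
    values = [frechet_function(c, sample) for c in candidates]
    best = min(values)
    return [c for c, v in zip(candidates, values) if v - best <= tol], best

edge = [[0, 1], [1, 0]]
g_a = ([[0.0], [0.0]], edge)
g_b = ([[1.0], [0.0]], edge)
g_c = ([[2.0], [0.0]], edge)
means, variation = frechet_mean_set([g_a, g_b, g_c])
print(means == [g_b], variation)
```

The function returns a list because, as noted in the text, the minimiser need not be unique: calling it on the two-element sample `[g_a, g_c]` returns both graphs.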
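The final AR system model in the graph domain becomes

$$
\begin{cases}
g_{t+1} = H(\phi(g_t^p), \eta), \\
\phi(g_t^p) \in \mathbb{E}^{\mathrm{f}}_{\eta}\left[ H(\phi(g_t^p), \eta) \right], \\
\mathrm{Var}^{\mathrm{f}}[\eta] < \infty.
\end{cases}
\tag{8}
$$

Notice that the proposed graph autoregressive model (8) is a proper generalisation of model (1). In fact, it can be shown that (8) reduces to (1) when considering $(\mathbb{R}, |\cdot|)$ or, more generally, $(\mathbb{R}^n, \|\cdot\|_2)$ in place of $(\mathcal{G}, d)$, and choosing $H(a, b)$ as the sum $a + b$ (see Appendix -A for a proof).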
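Given a GGP modelled by (8), we can predict the graph at time $t+1$ as the graph $\hat{g}_{t+1}$ minimising the quantity

$$
\mathbb{E}\left[ d(g, g_{t+1})^2 \right],
$$

where the expectation is taken with respect to $g_{t+1}$. The minimisers of this quantity are the Fréchet mean graphs of $g_{t+1}$ and, by the unbiasedness assumption (5), $\phi(g_t^p)$ is one of them. Therefore, we obtain that the optimal prediction is attained at graph

$$
\hat{g}_{t+1} = \phi(g_t^p).
\tag{9}
$$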
II-B Learning the AR function with a Graph Neural Network
Given a GGP described by an AR model of order $p$, the task of predicting the next graph in the sequence can be formulated as that of approximating $\phi(\cdot)$ in (8), as the optimal prediction is given by (9). In order to approximate $\phi(\cdot)$ we propose to use a GNN, which can be seen as a family of models

$$
\{ \phi_{\mathrm{nn}}(\cdot\,; \theta) \mid \theta \in \Theta \}
$$

parametrised in vector $\theta$. The family of models receives the regressor $g_t^k = [g_t, g_{t-1}, \dots, g_{t-k+1}]$ and outputs the predicted graph

$$
\hat{g}_{t+1} = \phi_{\mathrm{nn}}(g_t^k; \theta).
$$

As the order $p$ is usually unknown, the value $k$ is a hyperparameter that must be appropriately chosen.
We propose a possible architecture for $\phi_{\mathrm{nn}}$, composed of three main blocks:
1. each graph is mapped to an abstract vector representation, using graph convolutions [16, 17, 18] and graph pooling [19, 15] layers;
2. a predictive model is applied to the resulting vector sequence;
3. the predicted graph is obtained by mapping the predicted vector back to the graph domain.
The full GNN is therefore obtained by the composition of three blocks, denoted $\phi_{\mathrm{conv}}$, $\phi_{\mathrm{rnn}}$, and $\phi_{\mathrm{dense}}$. Although each block has its own parameters $\theta_{\mathrm{conv}}$, $\theta_{\mathrm{rnn}}$, and $\theta_{\mathrm{dense}}$, the network can be trained end-to-end as a single family of models with $\theta = \{\theta_{\mathrm{conv}}, \theta_{\mathrm{rnn}}, \theta_{\mathrm{dense}}\}$. The three blocks are defined as follows (see also Figure 1 for a schematic view of the architecture).
The first block of the network converts the input sequence $g_t^k$ to a sequence of $l$-dimensional vectors. This operation can be described by the map

$$
\phi_{\mathrm{conv}}(\cdot\,; \theta_{\mathrm{conv}}) : \mathcal{G} \to \mathbb{R}^l,
$$

which is implemented as a GNN alternating graph convolutional layers for extracting local patterns, and graph pooling layers to eventually compress the representation down to a single vector. By mapping graphs to a vector space, we go back to the numerical setting of (1). Moreover, the vector representation of the graphs takes into account the relational information that characterises the GGP.
By applying $\phi_{\mathrm{conv}}$ to each graph in the regressor $g_t^k$ (with the same $\theta_{\mathrm{conv}}$), we obtain a sequence

$$
\left[ \phi_{\mathrm{conv}}(g_t; \theta_{\mathrm{conv}}), \dots, \phi_{\mathrm{conv}}(g_{t-k+1}; \theta_{\mathrm{conv}}) \right]
$$

of $l$-dimensional vectors, which is then processed by block

$$
\phi_{\mathrm{rnn}}(\cdot\,; \theta_{\mathrm{rnn}}) : \mathbb{R}^{k \times l} \to \mathbb{R}^l.
$$

The role of this block is to produce the vector representation of the predicted graph, while also capturing the temporal dependencies in the input sequences. Here, we formulate the block as a recurrent network, but any method to map the sequence to a prediction is suitable (e.g., fully connected networks).
Finally, we convert the vector representation to the actual prediction in the space of graphs, using a multi-head dense network similar to the ones proposed in [20, 21]:

$$
\phi_{\mathrm{dense}}(\cdot\,; \theta_{\mathrm{dense}}) : \mathbb{R}^l \to \mathcal{G}.
$$
Note that generating a graph by sampling its topology and attributes with a dense network has known limitations and implicitly assumes node identity and a maximum order. While this solution is more than sufficient for the experiments conducted in this paper, and greatly simplifies the implementation of the GNN, other approaches can be used when dealing with more complex graphs, like the GraphRNN decoder proposed in [22].
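The composition of the three blocks can be sketched as a forward pass in plain Python. This is only an illustration under simplifying assumptions: the function names `phi_conv`, `phi_rnn`, `phi_dense`, and `phi_nn` are labels chosen here, each block is a deliberately simple stand-in (one propagation step with mean pooling, a plain tanh recurrence, two linear heads) rather than the layers used in the paper, the weights are random and untrained, and graphs have identified nodes and fixed order.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def phi_conv(graph, w):
    """Stand-in for the first block: one propagation step (A + I) X W with ReLU,
    followed by mean pooling over nodes to obtain a single vector."""
    x, a = graph
    n = len(x)
    a_hat = [[a[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    h = [[max(0.0, v) for v in row] for row in matmul(matmul(a_hat, x), w)]
    return [sum(col) / n for col in zip(*h)]

def phi_rnn(vectors, w_in, w_rec):
    """Stand-in for the second block: a plain tanh recurrence over the sequence
    (oldest vector first), returning the final hidden state."""
    state = [0.0] * len(w_rec)
    for v in vectors:
        pre = [a + b for a, b in zip(matmul([v], w_in)[0], matmul([state], w_rec)[0])]
        state = [math.tanh(p) for p in pre]
    return state

def phi_dense(z, w_adj, w_attr, n, f):
    """Stand-in for the third block: two linear heads, a sigmoid one for the
    adjacency matrix and a linear one for the node attributes."""
    adj_flat = [1.0 / (1.0 + math.exp(-v)) for v in matmul([z], w_adj)[0]]
    attr_flat = matmul([z], w_attr)[0]
    adj = [adj_flat[i * n:(i + 1) * n] for i in range(n)]
    attrs = [attr_flat[i * f:(i + 1) * f] for i in range(n)]
    return attrs, adj

def phi_nn(regressor, params, n, f):
    """Composition of the three blocks; regressor is ordered oldest graph first."""
    vectors = [phi_conv(g, params["conv"]) for g in regressor]  # shared weights
    z = phi_rnn(vectors, params["rnn_in"], params["rnn_rec"])
    return phi_dense(z, params["adj"], params["attr"], n, f)

rng = random.Random(0)
N, F, L, K = 4, 2, 8, 3  # illustrative sizes only
params = {
    "conv": rand_matrix(F, L, rng),
    "rnn_in": rand_matrix(L, L, rng),
    "rnn_rec": rand_matrix(L, L, rng),
    "adj": rand_matrix(L, N * N, rng),
    "attr": rand_matrix(L, N * F, rng),
}
ring = [[1 if abs(i - j) in (1, N - 1) else 0 for j in range(N)] for i in range(N)]
regressor = [(rand_matrix(N, F, rng), ring) for _ in range(K)]
x_pred, a_pred = phi_nn(regressor, params, N, F)
print(len(x_pred), len(x_pred[0]), len(a_pred), len(a_pred[0]))
```

The output heads have a fixed size of N·N and N·F entries, which makes the limitation discussed above visible: the decoder presupposes a maximum order and a fixed node identity.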
III Experiments
The experimental section aims at showing that AR models for graphs are effective. In particular, we show that the proposed neural graph autoregressive (NGAR) model can be effectively trained, and that it provides graph predictions that significantly improve over simpler baselines.
The experiments are performed on sequences of attributed graphs with identified nodes and a variable topology. Each node is associated with a vector attribute of dimension $F$, and there are no edge attributes; however, the extension to graphs with edge attributes, as well as graphs of variable order and non-identified nodes, is straightforward (e.g., see [20] for an example). The sequences are produced by two synthetic GGPs that we generate with a controlled memory order (details follow in Section III-B), allowing us to have a ground truth for the analysis.
The comparative analysis of the tested methods is performed by considering a graph edit distance (GED) [11], $d_{\mathrm{GED}}(g_{t+1}, \hat{g}_{t+1})$, between the ground truth and the prediction made by the models. (Although the NGAR method was trained with a specific loss function, see Sec. III-C, here we consider the GED measure instead, in order to provide a fair comparison of the methods.) We also analyse the performance of NGAR in terms of prediction loss and accuracy, in order to show the relative performance of the model on problems of different complexity. Finally, we report a qualitative assessment of the predictions of NGAR, by visualising the graphs predicted by the GNN and the true observations from the GGP.
The rest of this section introduces the baselines (Section III-A), the synthetic GGPs (Section III-B) and the implementation details for the NGAR architecture (Section III-C). Finally, we discuss the results of the experimental analysis in Section III-D.
III-A Baseline methods
We consider four baselines commonly applied in the numerical case, which can be easily adapted to our setting. We denote our proposed method as NGAR, and the four baselines as Mean, Mart, Move, and VAR, respectively.
Mean: the first baseline assumes that the GGP is stationary, with independent and identically distributed graphs. In this case, the optimal prediction is the mean graph:

$$
\hat{g}_{t+1} = \mu,
$$

where $\mu \in \mathbb{E}^{\mathrm{f}}_{g \sim Q}[g]$, and $Q$ indicates here the distribution (supposed stationary) of the graphs.
Mart: the second baseline assumes that the GGP is a martingale, s.t. $g_t \in \mathbb{E}^{\mathrm{f}}[g_{t+1} \mid g_t]$, and predicts $g_{t+1}$ as:

$$
\hat{g}_{t+1} = g_t,
$$

i.e., the graph at the previous time step.
Move: the third baseline considers the $k$ preceding graphs $g_t^k$, and predicts $g_{t+1}$ to be the Fréchet sample mean graph of $g_t^k$:

$$
\hat{g}_{t+1} = \arg\min_{g' \in \mathcal{G}} \sum_{g_i \in g_t^k} d(g', g_i)^2.
$$

VAR: the fourth baseline is a vector AR model (VAR) of order $k$, which treats each graph $g_t$ as the vectorisation of its node features $X_t$ and adjacency matrix $A_t$, concatenated in a single vector:

$$
u_t = \left[ \mathrm{vec}(X_t)^\top, \mathrm{vec}(A_t)^\top \right]^\top.
$$

First, we compute a prediction $\hat{u}_{t+1}$ defined by the linear model

$$
\hat{u}_{t+1} = B_0 + \sum_{i=1}^{k} B_i \, u_{t-i+1}
$$

from the regressor $u_t^k$ and, subsequently, the actual graph prediction $\hat{g}_{t+1}$ is re-assembled from $\hat{u}_{t+1}$.
We mention that baseline VAR can be adopted in these experiments only because we are considering graphs with numerical attributes and fixed order; in more general settings this would not be possible.
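The Mart and VAR baselines can be sketched as follows for graphs with identified nodes and fixed order, each stored as a pair `(node_attributes, adjacency)`. The helper names, the row-major vectorisation, and the 0.5 threshold used to recover a binary adjacency matrix are assumptions made for this illustration; the VAR coefficients are taken as given, and fitting them (e.g. by least squares) is omitted.

```python
def vectorise(graph):
    """Concatenate vec(X) and vec(A) (row-major) into one vector."""
    x, a = graph
    return [v for row in x for v in row] + [float(v) for row in a for v in row]

def reassemble(u, n, f):
    """Inverse of vectorise; adjacency entries are thresholded at 0.5 (an assumption)."""
    x = [u[i * f:(i + 1) * f] for i in range(n)]
    off = n * f
    a = [[1 if u[off + i * n + j] >= 0.5 else 0 for j in range(n)] for i in range(n)]
    return x, a

def predict_mart(history):
    """Mart baseline: repeat the most recent graph."""
    return history[-1]

def predict_var(history, intercept, coeffs, n, f):
    """VAR baseline of order k = len(coeffs): u_hat = b0 + sum_i B_i u_{t-i+1}.
    The coefficient matrices are assumed to be already fitted."""
    u_hat = list(intercept)
    for i, b in enumerate(coeffs):  # i = 0 pairs with the latest graph
        u = vectorise(history[-1 - i])
        for r, row in enumerate(b):
            u_hat[r] += sum(w * v for w, v in zip(row, u))
    return reassemble(u_hat, n, f)

edge = [[0, 1], [1, 0]]
history = [([[0.0], [1.0]], edge), ([[0.5], [1.5]], edge)]
d = 2 * 1 + 2 * 2  # length of the vectorised graph: N * F + N * N
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
print(predict_var(history, [0.0] * d, [identity], n=2, f=1) == predict_mart(history))
```

With an identity coefficient matrix and zero intercept, the order-one VAR prediction coincides with the Mart prediction, which is a convenient sanity check for the vectorisation and re-assembly steps.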
III-B Graph-sequence generation
We consider two different GGPs, both based on a common framework where a multivariate AR model of the form

$$
x_{t+1} = f(x_t^p) + \epsilon
\tag{12}
$$

produces a sequence of vectors $x_1, x_2, \dots \in \mathbb{R}^{N \cdot F}$, which are then used to generate the graph sequence. From each $x_t$, a node-feature matrix $X_t \in \mathbb{R}^{N \times F}$ is built, and the adjacency matrix $A_t$ is created as the Delaunay triangulation of the rows of $X_t$, interpreted as points in $\mathbb{R}^F$. Here, $\epsilon$ is a noise term whose components are randomly drawn from a stationary normal distribution $\mathcal{N}(0, \sigma^2)$.
The result of this process is a sequence of graphs $\{g_1, g_2, \dots\}$, where each graph depends on the previous $p$ graphs. By choosing different implementations of $f(\cdot)$, we are able to generate different GGPs. Note that the noise $\eta$ and function $H(\cdot, \cdot)$ in (8) are never made explicit in this procedure, but instead they are determined by the Gaussian noise perturbation introduced by $\epsilon$, which affects the node attributes and, consequently, causes possible changes in the topology. Section III-B1 and Section III-B2 describe the two procedures.
III-B1 Rotational model
the function in (12) is taken of the form
[TABLE]
where and is a rotation matrix depending on the regressor. Matrix is block-diagonal with blocks , , of size defined as:
[TABLE]
[TABLE]
The parameters are randomly generated by a uniform distribution in , while is set to .
The node attribute matrices are finally obtained by arranging each in a matrix, and the regression function can be interpreted as an independent and variable rotation of each node feature (see Figure 2).
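A single step of a rotational process of this kind can be sketched as below, assuming two-dimensional node features so that each node is rotated by a 2×2 block. The angle rule in `rotation_angles`, the parameter values, and all names are hypothetical placeholders chosen for illustration; they are not the paper's definitions.

```python
import math
import random

def rotation_angles(regressor, base, alpha):
    """Hypothetical regressor-dependent angles, one per node."""
    angles = []
    for n, c in enumerate(base):
        s = sum(x[2 * n] + x[2 * n + 1] for x in regressor)
        angles.append(c + alpha * math.cos(s))
    return angles

def rotate_nodes(x, angles):
    """Apply a block-diagonal rotation: node n (coordinates 2n, 2n+1) is rotated by angles[n]."""
    out = []
    for n, w in enumerate(angles):
        a, b = x[2 * n], x[2 * n + 1]
        out += [math.cos(w) * a + math.sin(w) * b, -math.sin(w) * a + math.cos(w) * b]
    return out

def step(regressor, base, alpha, sigma, rng):
    """One step of x_{t+1} = R(regressor) x_t + eps; regressor[0] is x_t."""
    rotated = rotate_nodes(regressor[0], rotation_angles(regressor, base, alpha))
    return [v + rng.gauss(0.0, sigma) for v in rotated]

def to_node_matrix(x):
    """Arrange the flat vector into an N x 2 node-attribute matrix."""
    return [x[i:i + 2] for i in range(0, len(x), 2)]

rng = random.Random(0)
N, P = 3, 2  # illustrative sizes only
base = [rng.uniform(-1.0, 1.0) for _ in range(N)]
history = [[rng.gauss(0.0, 1.0) for _ in range(2 * N)] for _ in range(P)]
for _ in range(5):
    history.insert(0, step(history[:P], base, alpha=0.1, sigma=0.01, rng=rng))
print(to_node_matrix(history[0]))
```

Each block is a proper rotation, so without noise every node keeps its distance from the origin while its angular position changes at a rate that depends on the recent history.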
III-B2 Partially Masked Linear Dynamical System
we consider a discrete-time linear dynamical system
[TABLE]
where , and is an orthonormal random correlation matrix computed with the method proposed by Davies and Higham in [23].
Although the dynamical system (13) depends on exactly one previous time step, the partial observation of the first components of results in a dynamical system of order [24]. Similarly to the rotational model, then, node attributes are obtained by reshaping the masked vectors in to matrices (see Figure 3). We refer to this setting as a partially masked linear dynamical system of complexity (PMLDS()).
The size of the original linear system represents an index of complexity of the problem, as the system’s memory is dependent on it: given , a higher will result in more complicated dynamics of , and vice versa (see Figure 4). However, because the evolution of the GGP is controlled by hidden variables that the NGAR model never sees at training time, the problem is closer to real-world scenarios where processes are rarely fully observable.
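The effect of partial observation can be seen on a toy system. The sketch below is an assumption-laden illustration: it replaces the random orthonormal matrix with a cyclic shift (a permutation matrix, hence also orthonormal) chosen purely for readability, omits noise, and uses hypothetical names and sizes throughout.

```python
def cyclic_shift_matrix(c, shift):
    """Hypothetical orthonormal matrix: a permutation moving component i + shift to i."""
    return [[1.0 if j == (i + shift) % c else 0.0 for j in range(c)] for i in range(c)]

def evolve(r, x0, n_steps):
    """Iterate the fully observed system x_{t+1} = R x_t."""
    states = [list(x0)]
    for _ in range(n_steps):
        x = states[-1]
        states.append([sum(w * v for w, v in zip(row, x)) for row in r])
    return states

def mask(states, m):
    """Partial observation: keep only the first m components of every state."""
    return [x[:m] for x in states]

def to_node_matrix(y, f):
    """Reshape an observed vector of length N * f into an N x f node-attribute matrix."""
    return [y[i:i + f] for i in range(0, len(y), f)]

C, N, F = 12, 2, 2  # illustrative sizes only; N * F = 4 observed components
states = evolve(cyclic_shift_matrix(C, N * F), [float(i) for i in range(C)], 6)
observed = mask(states, N * F)
for y in observed[:4]:
    print(to_node_matrix(y, F))
```

In this toy case the full state evolves with a one-step memory, yet the observed part is not determined by the current observation alone: it repeats every C / (N·F) steps, so a predictor working only on observations needs a regressor that reaches that far back.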
III-C Graph neural network architecture
We use the same GNN architecture for both settings. Input graphs are represented by two matrices:
- a binary adjacency matrix, $A \in \{0, 1\}^{N \times N}$;
- a matrix representing the attributes of each node, $X \in \mathbb{R}^{N \times F}$.
All hyperparameters of $\phi_{\mathrm{nn}}$ were found through a grid search on the values commonly found in the literature, using the validation loss as a metric, and are summarised in Table I. All sub-networks were structured with two layers to provide sufficient nonlinearity in the computation, and in particular $\phi_{\mathrm{conv}}$ and $\phi_{\mathrm{dense}}$ have been shown in the literature to be effective architectures for processing small graphs [19, 20].
Our network consists of two graph convolutional layers with 128 channels, ReLU activations, and L2 regularisation. Here, we use convolutions based on a first-order polynomial filter, as proposed in [25], but any other method is suitable (e.g., we could use the edge-conditioned convolutions proposed in [26] in order to consider edge attributes). The graph convolutions are followed by a gated global pooling layer [15] with soft attention. The block $\phi_{\mathrm{conv}}$ is applied in parallel (i.e., with shared weights) to all graphs in the input sequence, and the resulting vector sequence is fed to a 2-layer LSTM block with 256 units and hyperbolic tangent activations. Note that we keep the order $k$ fixed for NGAR, Move, and VAR. The output of the LSTM block is fed to a fully connected network of two layers with 256 and 512 units, with ReLU activations and L2 regularisation. Finally, the network has two parallel output layers with $N \cdot N$ and $N \cdot F$ units respectively, to produce the adjacency matrix and node attributes of the predicted graph. The output layer for $A$ has sigmoid activations, and the one for $X$ is a linear layer.
The network is trained until convergence using Adam [27], monitoring the validation loss on a held-out 10% of the training data for early stopping. We jointly minimise the mean squared error for the predicted node attributes, and the log-loss for the adjacency matrix. For each problem, we evolve the GGP to generate the graph sequence and test the model on 10% of the data.
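The joint objective and the accuracy metric can be written out explicitly. The following sketch assumes graphs stored as pairs `(node_attributes, adjacency)`; summing the two terms with equal weight, the clipping constant, and the 0.5 threshold for accuracy are assumptions made here, and the L2 regularisation term is left out.

```python
import math

def mse(x_true, x_pred):
    """Mean squared error over all node-attribute entries."""
    diffs = [(a - b) ** 2 for rt, rp in zip(x_true, x_pred) for a, b in zip(rt, rp)]
    return sum(diffs) / len(diffs)

def log_loss(a_true, a_prob, eps=1e-7):
    """Binary cross-entropy over all adjacency entries; probabilities are clipped for stability."""
    terms = []
    for rt, rp in zip(a_true, a_prob):
        for t, p in zip(rt, rp):
            p = min(max(p, eps), 1.0 - eps)
            terms.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(terms) / len(terms)

def accuracy(a_true, a_prob, threshold=0.5):
    """Fraction of adjacency entries recovered after thresholding (threshold is an assumption)."""
    hits = [t == (1 if p >= threshold else 0)
            for rt, rp in zip(a_true, a_prob) for t, p in zip(rt, rp)]
    return sum(hits) / len(hits)

def joint_loss(graph_true, graph_pred):
    """Unweighted sum of the two terms; the relative weighting is an assumption."""
    (x_true, a_true), (x_pred, a_prob) = graph_true, graph_pred
    return mse(x_true, x_pred) + log_loss(a_true, a_prob)

target = ([[0.0, 1.0], [1.0, 0.0]], [[0, 1], [1, 0]])
prediction = ([[0.0, 1.0], [1.0, 0.5]], [[0.1, 0.9], [0.8, 0.2]])
print(joint_loss(target, prediction), accuracy(target[1], prediction[1]))
```

Reporting the two terms separately, as done in the result tables, shows whether prediction errors come mainly from the node attributes or from the topology.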
III-D Results
Tables II and III report the test performance of NGAR, in terms of prediction loss and accuracy. We can see that as the order $p$ of the rotational model and the complexity $c$ of the partially masked linear dynamical system increase, the test performance gets worse, which is in agreement with the fact that $p$ and $c$ can be seen as indexes of complexity of the problem. In Table II, for the rotational GGP with $p = 1$ we observe an unexpectedly high test loss, which might be associated with overfitting. We also see from Figure 5 that the test performance seems to be consistently better when the complexity of the problem is lower than the order of NGAR (cf. Sec. III-B1 and Sec. III-B2).
The second part of the experimental analysis aims at comparing the NGAR method with the baselines in terms of prediction error, assessed as the GED between predicted and ground-truth graphs; Figures 6 and 7 show a comparison of the baselines on different levels of complexity ($p$ and $c$). Among the baselines, we cannot clearly identify one with significantly better performance. Given the performance of Mean and Mart for all $p$ and $c$, we see that neither GGP can be modelled as stationary or as a martingale. Mean and Move performed almost the same in all experiments, while Mart in some cases performed significantly better than the other baselines, e.g., on the rotational GGP. Although VAR is adapted from a widely used model in multivariate AR problems, here it was not able to produce relevant graph predictions. On the other hand, NGAR consistently outperformed the baselines in almost all settings.
Finally, a qualitative assessment of the predictions is presented in Figures 8 and 9, which highlights a good performance of NGAR even on sequences of graphs that cannot be intuitively predicted.
IV Conclusion and future work
In this paper, we formalised an autoregressive framework for a stochastic process in which the observations are graphs. The framework considers a generic family of attributed graphs, in which nodes and edges can be associated with numerical and non-numerical attributes, the topology is allowed to change over time, and non-identified nodes can be considered as well. We showed that the proposed model for graphs is a non-trivial generalisation of the classic multivariate autoregressive models, by introducing the Fréchet statistics. We also proposed to address the task of predicting the next graph in a graph sequence with a GNN, hence leveraging deep learning methods for processing structured data. The GNN implementation proposed here is based on graph convolutional layers and recurrent connections; however, different architectures can be adopted when the application requires it.
Finally, we performed an experimental campaign in which we demonstrate the applicability of the proposed graph AR model, as well as that the proposed GNN can be trained to address prediction tasks. The proposed method is compared with four baselines on synthetic GGPs, to rely on a ground-truth for the analysis. The obtained results are promising, showing that the GNN was able to learn a non-trivial AR function, especially when compared to simpler, statistically motivated baselines. Possible application scenarios to be explored in future works include the prediction of the dynamics in social networks, load forecasting in power distribution grids, and modelling the behaviour of brain networks.
As a possible extension of this work, we intend to formulate the GNN architecture to work entirely on the space of graphs, without mapping the representation to a vector space. This would require a method for aggregating the graphs of a sequence (e.g., via weighted sum). However, such an operation is currently missing in the literature.
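-A Equivalence between (1) and (8).
Let $(\mathcal{G}, d)$ be the Euclidean space $(\mathbb{R}^n, \|\cdot\|_2)$, and $H(a, b) = a + b$, then

$$
g_{t+1} = H(\phi(g_t^p), \eta) = \phi(g_t^p) + \eta.
$$

Regarding assumption (5), we see that the integral in (6) (known as Fréchet function of $g$) becomes

$$
\int_{\mathbb{R}^n} \|g - g'\|_2^2 \, dQ_g(g) = \int_{\mathbb{R}^n} \|g - \mathbb{E}[g]\|_2^2 \, dQ_g(g) + \|\mathbb{E}[g] - g'\|_2^2,
$$

and

$$
\mathbb{E}^{\mathrm{f}}[g] = \arg\min_{g' \in \mathbb{R}^n} \int_{\mathbb{R}^n} \|g - g'\|_2^2 \, dQ_g(g) = \{ \mathbb{E}[g] \}.
$$

We conclude that (1) and (8) are equivalent, in fact

$$
\phi(g_t^p) \in \mathbb{E}^{\mathrm{f}}_{\eta}\left[ \phi(g_t^p) + \eta \right] = \{ \phi(g_t^p) + \mathbb{E}[\eta] \} \iff \mathbb{E}[\eta] = 0
$$

and

$$
\mathrm{Var}^{\mathrm{f}}[\eta] = \int_{\mathbb{R}^n} \|\eta - \mathbb{E}[\eta]\|_2^2 \, dQ_\eta(\eta) = \mathrm{Tr}\left( \mathrm{Var}[\eta] \right) < \infty.
$$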
References
- [1] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
- [2] D. Raposo, A. Santoro, D. Barrett, R. Pascanu, T. Lillicrap, and P. Battaglia, “Discovering objects and their relations from entangled scene representations,” arXiv preprint arXiv:1702.05068, 2017.
- [3] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende et al., “Interaction networks for learning about objects, relations and physics,” in Advances in Neural Information Processing Systems, 2016, pp. 4502–4510.
- [4] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, “Neural relational inference for interacting systems,” arXiv preprint arXiv:1802.04687, 2018.
- [5] Z. Cui, K. Henrickson, R. Ke, and Y. Wang, “High-order graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting,” arXiv preprint arXiv:1802.07007, 2018.
- [6] L. Livi and A. Rizzi, “The graph matching problem,” Pattern Analysis and Applications, vol. 16, no. 3, pp. 253–283, 2013.
- [7] E. Richard, S. Gaïffas, and N. Vayatis, “Link prediction in graphs with autoregressive features,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 565–593, 2014.
- [8] L. W. Beineke, R. J. Wilson, and P. J. Cameron, Topics in Algebraic Graph Theory. Cambridge University Press, 2004, vol. 102.
