Graph Neural Networks with convolutional ARMA filters

Filippo Maria Bianchi; Daniele Grattarola; Lorenzo Livi; Cesare Alippi

arXiv:1901.01343·cs.LG·April 7, 2021

TL;DR

This paper introduces a novel graph convolutional layer based on ARMA filters, offering more flexible frequency response, robustness to noise, and improved global structure capture, outperforming polynomial-based GNNs.

Contribution

The paper presents a new ARMA-based graph convolutional layer with a recursive, distributed implementation that enhances flexibility, robustness, and transferability in graph neural networks.

Findings

01. ARMA layer outperforms polynomial filters in experiments

02. Improved robustness to noise in graph data

03. Effective across multiple downstream tasks

Abstract

Popular graph neural networks implement convolution operations on graphs based on polynomial spectral filters. In this paper, we propose a novel graph convolutional layer inspired by the auto-regressive moving average (ARMA) filter that, compared to polynomial ones, provides a more flexible frequency response, is more robust to noise, and better captures the global graph structure. We propose a graph neural network implementation of the ARMA filter with a recursive and distributed formulation, obtaining a convolutional layer that is efficient to train, localized in the node space, and can be transferred to new graphs at test time. We perform a spectral analysis to study the filtering effect of the proposed ARMA layer and report experiments on four downstream tasks: semi-supervised node classification, graph signal classification, graph classification, and graph regression. Results show…

Tables (12)

Table 1: Node classification accuracy.

Method	Cora	Citeseer	Pubmed	PPI
GAT	83.1 $\pm$ 0.6	70.9 $\pm$ 0.6	78.5 $\pm$ 0.3	81.3 $\pm$ 0.1
GraphSAGE	73.7 $\pm$ 1.8	65.9 $\pm$ 0.9	78.5 $\pm$ 0.6	70.0 $\pm$ 0.0
GIN	75.1 $\pm$ 1.7	63.1 $\pm$ 2.0	77.1 $\pm$ 0.7	78.1 $\pm$ 2.6
GCN	81.5 $\pm$ 0.4	70.1 $\pm$ 0.7	79.0 $\pm$ 0.5	80.8 $\pm$ 0.1
Chebyshev	79.5 $\pm$ 1.2	70.1 $\pm$ 0.8	74.4 $\pm$ 1.1	86.4 $\pm$ 0.1
CayleyNet	81.2 $\pm$ 1.2	67.1 $\pm$ 2.4	75.6 $\pm$ 3.6	84.9 $\pm$ 1.2
ARMA	83.4 $\pm$ 0.6	72.5 $\pm$ 0.4	78.9 $\pm$ 0.3	90.5 $\pm$ 0.3

Table 2: Graph signal classification accuracy.

GNN layer	MNIST	20news
GCN	98.48 $\pm$ 0.2	65.45 $\pm$ 0.2
Chebyshev	99.14 $\pm$ 0.1	68.24 $\pm$ 0.2
CayleyNet	99.18 $\pm$ 0.1	68.84 $\pm$ 0.3
ARMA	99.20 $\pm$ 0.1	70.02 $\pm$ 0.1

Table 3: Graph classification accuracy.

Method	Enzymes	Proteins	D&D	MUTAG	BHard
GAT	51.7 $\pm 4.3$	72.3 $\pm 3.1$	70.9 $\pm 4.0$	87.3 $\pm 5.3$	30.1 $\pm 0.7$
GraphSAGE	60.3 $\pm 7.1$	70.2 $\pm 3.9$	73.6 $\pm 4.1$	85.7 $\pm 4.7$	71.8 $\pm 1.0$
GIN	45.7 $\pm 7.7$	71.4 $\pm 4.5$	71.2 $\pm 5.4$	86.3 $\pm 9.1$	72.1 $\pm 1.1$
GCN	53.0 $\pm 5.3$	71.0 $\pm 2.7$	74.7 $\pm 3.8$	85.7 $\pm 6.6$	71.9 $\pm 1.2$
Chebyshev	57.9 $\pm 2.6$	72.1 $\pm 3.5$	73.7 $\pm 3.7$	82.6 $\pm 5.2$	71.3 $\pm 1.2$
CayleyNet	43.1 $\pm 10.7$	65.6 $\pm 5.7$	70.3 $\pm 11.6$	87.8 $\pm 10.0$	70.7 $\pm 2.4$
ARMA	60.6 $\pm 7.2$	73.7 $\pm 3.4$	77.6 $\pm 2.7$	91.5 $\pm 4.2$	74.1 $\pm 0.5$

Table 4: Graph regression mean squared error.

Property	GCN	Chebyshev	CayleyNet	ARMA
mu	0.445 $\pm 0.007$	0.433 $\pm 0.003$	0.442 $\pm 0.009$	0.394 $\pm 0.005$
alpha	0.141 $\pm 0.016$	0.171 $\pm 0.008$	0.118 $\pm 0.005$	0.098 $\pm 0.005$
HOMO	0.371 $\pm 0.030$	0.391 $\pm 0.012$	0.336 $\pm 0.007$	0.326 $\pm 0.010$
LUMO	0.584 $\pm 0.051$	0.528 $\pm 0.005$	0.679 $\pm 0.148$	0.508 $\pm 0.011$
gap	0.650 $\pm 0.070$	0.565 $\pm 0.015$	0.758 $\pm 0.106$	0.552 $\pm 0.013$
R2	0.132 $\pm 0.005$	0.294 $\pm 0.022$	0.185 $\pm 0.043$	0.119 $\pm 0.019$
ZPVE	0.349 $\pm 0.022$	0.358 $\pm 0.001$	0.555 $\pm 0.174$	0.338 $\pm 0.001$
U0_atom	0.064 $\pm 0.003$	0.126 $\pm 0.017$	1.493 $\pm 1.414$	0.053 $\pm 0.004$
Cv	0.192 $\pm 0.012$	0.215 $\pm 0.010$	0.184 $\pm 0.009$	0.163 $\pm 0.007$

Table 5: Summary of the node classification datasets.

Dataset	Nodes	Edges	Node attr.	Avg. SP	Node classes
Cora	2708	5429	1433	5.87 $\pm$ 1.52	7 (single label)
Citeseer	3327	9228	3703	6.31 $\pm$ 2.00	6 (single label)
Pubmed	19717	88651	500	6.34 $\pm$ 1.22	3 (single label)
PPI	56944	818716	50	2.76 $\pm$ 0.56	121 (multi-label)

Table 6: Hyperparameters for node classification.

Dataset	L₂ reg.	$p_{drop}$	lr	GCN $L$	Cheby. $K$	Cayley $K$	Cayley $T$	ARMA $K$	ARMA $T$
Cora	5e-4	0.75	0.01	1	2	1	5	2	1
Citeseer	5e-4	0.75	0.01	1	3	1	5	3	1
Pubmed	5e-4	0.25	0.01	1	3	2	5	1	1
PPI	0.0	0.25	0.01	2	3	3	5	3	2

Table 7: Summary of the graph regression dataset.

Samples	Avg. nodes	Avg. edges	Node attr.
133,885	8.79	27.61	1

Table 8: Hyperparameters for graph classification and graph regression.

Dataset	GCN $L$	Cheby. $K$	Cayley $K$	Cayley $T$	$p_{drop}$	ARMA $K$	ARMA $T$
QM9	3	3	3	3	0.75	3	3

Table 9: Summary of the graph classification datasets.

Dataset	Samples	Classes	Avg. nodes	Avg. edges	Node attr.	Node labels
Bench-hard	1,800	3	148.32	572.32	–	yes
Enzymes	600	6	32.63	62.14	18	no
Proteins	1,113	2	39.06	72.82	1	no
D&D	1,178	2	284.32	715.66	–	yes
MUTAG	188	2	17.93	19.79	–	yes

Table 10: Hyperparameters for graph classification and graph regression.

Dataset	GCN $L$	Cheby. $K$	Cayley $K$	Cayley $T$	$p_{drop}$	ARMA $K$	ARMA $T$
Bench-hard	2	2	2	10	0.4	1	2
Enzymes	2	2	2	10	0.6	2	2
Proteins	4	4	4	10	0.6	4	4
D&D	4	4	4	10	0.0	4	4
MUTAG	4	4	4	10	0.0	4	4

Table 11: Summary of the graph signal classification datasets.

Dataset	Nodes	Edges	Avg. SP	Class	Train	Val	Test
MNIST	784	5,928	12.36 $\pm$ 5.45	10	55 $k$	5 $k$	10 $k$
20news	10 $k$	249,944	4.21 $\pm$ 0.94	20	10,168	7,071	7,071

Table 12: Hyperparameters for graph signal classification.

Dataset	L₂ reg.	lr	$p_{drop}$	GCN $L$	Cheby. $K$	Cayley $K$	Cayley $T$	ARMA $K$	ARMA $T$
MNIST	5e-4	1e-3	0.5	3	25	12	11	5	10
20news	1e-3	1e-3	0.7	1	5	5	10	1	1

Equations

$$\bar{\mathbf{X}} = \mathbf{U}\,\text{diag}[h(\lambda_{1}),\dots,h(\lambda_{M})]\,\mathbf{U}^{T}\mathbf{X}.$$

$$h_{\text{POLY}}(\lambda) = \sum_{k=0}^{K} w_{k}\lambda^{k},$$

$$\bar{\mathbf{X}} = \sum_{k=0}^{K} w_{k}\mathbf{L}^{k}\mathbf{X}.$$

$$\bar{\mathbf{X}} = \sigma\left(\sum_{k=0}^{K-1} T_{k}(\tilde{\mathbf{L}})\,\mathbf{X}\mathbf{W}_{k}\right),$$

$$\bar{\mathbf{X}} = \sigma(\hat{\mathbf{A}}\mathbf{X}\mathbf{W}).$$

$$h_{\text{ARMA}_{K}}(\lambda) = \frac{\sum_{k=0}^{K-1} p_{k}\lambda^{k}}{1+\sum_{k=1}^{K} q_{k}\lambda^{k}},$$

$$\bar{\mathbf{X}} = \left(\mathbf{I}+\sum_{k=1}^{K} q_{k}\mathbf{L}^{k}\right)^{-1}\left(\sum_{k=0}^{K-1} p_{k}\mathbf{L}^{k}\right)\mathbf{X}.$$

$$\bar{\mathbf{X}}^{(t+1)} = a\mathbf{M}\bar{\mathbf{X}}^{(t)} + b\mathbf{X},$$

$$\mathbf{M} = \frac{1}{2}(\lambda_{\text{max}}-\lambda_{\text{min}})\mathbf{I} - \mathbf{L}.$$

$$\bar{\mathbf{X}} = \lim_{t\to\infty}\left[(a\mathbf{M})^{t}\bar{\mathbf{X}}^{(0)} + b\sum_{i=0}^{t}(a\mathbf{M})^{i}\mathbf{X}\right].$$

$$h_{\text{ARMA}_{1}}(\mu_{m}) = \frac{b}{1-a\mu_{m}}.$$

$$\bar{\mathbf{X}} = \sum_{k=1}^{K}\sum_{m=1}^{M}\frac{b_{k}}{1-a_{k}\mu_{m}}\,\mathbf{u}_{m}\mathbf{u}_{m}^{T}\mathbf{X},$$

$$h_{\text{ARMA}_{K}}(\mu_{m}) = \sum_{k=1}^{K}\frac{b_{k}}{1-a_{k}\mu_{m}}.$$

$$\bar{\mathbf{X}}^{(t+1)} = \sigma\left(\tilde{\mathbf{L}}\bar{\mathbf{X}}^{(t)}\mathbf{W} + \mathbf{X}\mathbf{V}\right),$$

$$\begin{aligned}
\left\|\bar{\mathbf{X}}_{a}^{(t+1)} - \bar{\mathbf{X}}_{b}^{(t+1)}\right\|_{2}
&= \left\|\sigma\left(\tilde{\mathbf{L}}\bar{\mathbf{X}}_{a}^{(t)}\mathbf{W} + \mathbf{X}\mathbf{V}\right) - \sigma\left(\tilde{\mathbf{L}}\bar{\mathbf{X}}_{b}^{(t)}\mathbf{W} + \mathbf{X}\mathbf{V}\right)\right\|_{2} \\
&\leq \left\|\tilde{\mathbf{L}}\bar{\mathbf{X}}_{a}^{(t)}\mathbf{W} + \mathbf{X}\mathbf{V} - \tilde{\mathbf{L}}\bar{\mathbf{X}}_{b}^{(t)}\mathbf{W} - \mathbf{X}\mathbf{V}\right\|_{2} \\
&= \left\|\tilde{\mathbf{L}}\bar{\mathbf{X}}_{a}^{(t)}\mathbf{W} - \tilde{\mathbf{L}}\bar{\mathbf{X}}_{b}^{(t)}\mathbf{W}\right\|_{2} \\
&\leq \left\|\tilde{\mathbf{L}}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\left\|\bar{\mathbf{X}}_{a}^{(t)} - \bar{\mathbf{X}}_{b}^{(t)}\right\|_{2}.
\end{aligned}$$

$$\exists\, T_{\epsilon}<\infty \ \text{s.t.}\ \left\|\bar{\mathbf{X}}^{(t+1)} - \bar{\mathbf{X}}^{(t)}\right\|_{2}\leq\epsilon,\ \forall t\geq T_{\epsilon}.$$

$$\bar{\mathbf{X}} = \frac{1}{K}\sum_{k=1}^{K}\bar{\mathbf{X}}_{k}^{(T)},$$

$$\bar{\mathbf{X}} = w_{0}\mathbf{X} + 2\,\text{Re}\left\{\sum_{k=1}^{K} w_{k}(\mathbf{L}+i\mathbf{I})^{k}(\mathbf{L}-i\mathbf{I})^{-k}\right\}\mathbf{X}.$$

$$\bar{\mathbf{X}} \approx \sigma\left(w_{0}\mathbf{X} + 2\,\text{Re}\left\{\sum_{k=1}^{K} w_{k}\left(\sum_{t=1}^{T}\hat{\mathbf{L}}^{t}\right)^{k}\right\}\mathbf{X}\right),$$

$$\mathbf{U}^{T}\bar{\mathbf{X}}$$

$$\sum_{m=1}^{M}\mathbf{u}_{m}^{T}\bar{\mathbf{X}}$$

$$\tilde{h}_{m}$$

$$a_{ij} = \begin{cases} 1 & \text{if } v_{j}\in\mathcal{N}(v_{i}); \\ 0 & \text{otherwise}. \end{cases}$$

Code & Models

Repositories

dmlc/dgl/tree/master/examples/pytorch/arma
pytorch

Taxonomy

Methods: Graph Neural Network · ARMA GNN · Convolution

Full text

Graph Neural Networks with Convolutional ARMA Filters

Filippo Maria Bianchi

Daniele Grattarola

Lorenzo Livi

Cesare Alippi

Abstract

Popular graph neural networks implement convolution operations on graphs based on polynomial spectral filters. In this paper, we propose a novel graph convolutional layer inspired by the auto-regressive moving average (ARMA) filter that, compared to polynomial ones, provides a more flexible frequency response, is more robust to noise, and better captures the global graph structure. We propose a graph neural network implementation of the ARMA filter with a recursive and distributed formulation, obtaining a convolutional layer that is efficient to train, localized in the node space, and can be transferred to new graphs at test time. We perform a spectral analysis to study the filtering effect of the proposed ARMA layer and report experiments on four downstream tasks: semi-supervised node classification, graph signal classification, graph classification, and graph regression. Results show that the proposed ARMA layer brings significant improvements over graph neural networks based on polynomial filters.

Keywords: Convolutional neural networks, Spectral graph convolution, graph filtering

1 Introduction

Graph Neural Networks (GNNs) are a class of models lying at the intersection between deep learning and methods for structured data, which perform inference on discrete objects (nodes) by accounting for arbitrary relationships (edges) among them (Bronstein et al., 2017; Battaglia et al., 2018). A GNN combines node features within local neighborhoods on the graph to learn node representations that can be directly mapped into categorical labels or real values (Scarselli et al., 2009; Klicpera et al., 2019), or combined to generate graph embeddings for graph classification and regression (Perozzi et al., 2014; Duvenaud et al., 2015; Yang et al., 2016; Hamilton et al., 2017; Bacciu et al., 2018).

The focus of this work is on GNNs that implement a convolution in the spectral domain with a non-linear trainable filter (Bruna et al., 2013; Henaff et al., 2015). Such a filter selectively shrinks or amplifies the Fourier coefficients of the graph signal (an instance of the node features) and then maps the node features to a new space. To avoid the expensive spectral decomposition and projection in the frequency domain, state-of-the-art GNNs implement graph filters as low-order polynomials that are learned directly in the node domain (Defferrard et al., 2016; Kipf & Welling, 2016a, b). Polynomial filters have a finite impulse response and perform a weighted moving average filtering of graph signals on local node neighborhoods (Tremblay et al., 2018), allowing for fast distributed implementations such as those based on Chebyshev polynomials and Lanczos iterations (Susnjara et al., 2015; Defferrard et al., 2016; Liao et al., 2019). Polynomial filters have limited modeling capabilities (Isufi et al., 2016) and, due to their smoothness, cannot model sharp changes in the frequency response (Tremblay et al., 2018). Crucially, polynomials with high degree are necessary to reach high-order neighborhoods, but they tend to be more computationally expensive and, most importantly, overfit the training data making the model sensitive to changes in the graph signal or the underlying graph structure. A more versatile class of filters is the family of Auto-Regressive Moving Average filters (ARMA) (Narang et al., 2013), which offer a larger variety of frequency responses and can account for higher-order neighborhoods compared to polynomial filters with the same number of parameters.

In this paper, we address the limitations of existing graph convolutional layers inspired by polynomial filters and propose a novel GNN convolutional layer based on ARMA filters. Our ARMA layer implements a non-linear and trainable graph filter that generalizes the convolutional layers based on polynomial filters and provides the GNN with enhanced modeling capability, thanks to a flexible design of the filter’s frequency response. The ARMA layer captures global graph structures with fewer parameters, overcoming the limitations of GNNs based on high-order polynomial filters.

ARMA filters are not localized in node space and require computing a matrix inversion, which is intractable in the context of GNNs. To address this issue, the proposed ARMA layer relies on a recursive formulation, which leads to a fast and distributed implementation that exploits efficient sparse operations on tensors. The resulting filters are not learned in the Fourier space induced by a given Laplacian, but are localized in the node space and are independent of the underlying graph structure. This allows our GNN to handle graphs with unseen topologies during the test phase of inductive inference tasks.

The performance of the proposed ARMA layer is evaluated on semi-supervised node classification, graph signal classification, graph classification, and graph regression tasks. Results show that a GNN equipped with ARMA layers outperforms GNNs with polynomial filters in every downstream task.

2 Background: graph spectral filtering

We assume a graph with $M$ nodes to be characterized by a symmetric adjacency matrix ${\mathbf{A}}\in\mathbb{R}^{M\times M}$ and refer to graph signal ${\mathbf{X}}\in\mathbb{R}^{M\times F}$ as the instance of all features (vectors in $\mathbb{R}^{F}$ ) associated with the graph nodes. Let ${\mathbf{L}}={\mathbf{I}}_{M}-{\mathbf{D}}^{-1/2}{\mathbf{A}}{\mathbf{D}}^{-1/2}$ be the symmetrically normalized Laplacian of the graph (where ${\mathbf{D}}$ is the degree matrix), with spectral decomposition ${\mathbf{L}}=\sum_{m=1}^{M}\lambda_{m}\mathbf{u}_{m}\mathbf{u}^{T}_{m}$ . A graph filter is an operator that modifies the components of ${\mathbf{X}}$ on the eigenvectors basis of ${\mathbf{L}}$ , according to a frequency response $h$ acting on each eigenvalue $\lambda_{m}$ . The filtered graph signal reads

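$$\bar{\mathbf{X}} = \mathbf{U}\,\text{diag}[h(\lambda_{1}),\dots,h(\lambda_{M})]\,\mathbf{U}^{T}\mathbf{X}. \tag{1}$$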

This formulation inspired the seminal work of Bruna et al. (2013), which implemented spectral graph convolutions in a neural network. Their GNN learns end-to-end the parameters of a filter implemented as $h=\mathbf{B}\mathbf{c}$ , where $\mathbf{B}\in\mathbb{R}^{M\times K}$ is a cubic B-spline basis and $\mathbf{c}\in\mathbb{R}^{K}$ is a vector of control parameters. Such filters are not localized, since the full projection on the eigenvectors yields paths of infinite length and the filter accounts for interactions of each node with the whole graph, rather than those limited to the node neighborhood. Since this contrasts with the local design of classic convolutional filters, a follow-up work (Henaff et al., 2015) introduced a parametrization of the spectral filters with smooth coefficients to achieve spatial localization. However, the main issue with the spectral filtering in Eq. (1) is computational complexity: not only is the eigendecomposition of ${\mathbf{L}}$ computationally expensive, but a double product with ${\mathbf{U}}$ must also be computed whenever the filter is applied. Notably, ${\mathbf{U}}$ in Eq. (1) is full even when ${\mathbf{L}}$ is sparse. Finally, since these spectral filters depend on a specific Laplacian spectrum, they cannot be transferred to graphs with another structure. For this reason, this spectral GNN cannot be used in downstream tasks such as graph classification or graph regression, where each datum is a graph with a different topology.
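The spectral filtering of Eq. (1) can be illustrated with a minimal NumPy sketch. The helper names (`normalized_laplacian`, `spectral_filter`) and the 4-node path graph are assumptions made for illustration only; they do not come from the paper. The sketch also makes the cost visible: a dense eigendecomposition plus two dense products with $\mathbf{U}$.

```python
import numpy as np

def normalized_laplacian(A):
    """Return L = I - D^{-1/2} A D^{-1/2} (assumes no isolated nodes)."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def spectral_filter(L, X, h):
    """Apply U diag[h(lambda_1), ..., h(lambda_M)] U^T X, as in Eq. (1)."""
    lam, U = np.linalg.eigh(L)        # dense eigendecomposition of the Laplacian
    X_hat = U.T @ X                   # project the signal on the eigenvectors
    return U @ (h(lam)[:, None] * X_hat)

# Toy 4-node path graph (hypothetical example data).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = normalized_laplacian(A)
X = np.arange(8, dtype=float).reshape(4, 2)

# h(lambda) = 1 leaves the signal unchanged; h(lambda) = lambda reproduces L X.
X_allpass = spectral_filter(L, X, lambda lam: np.ones_like(lam))
X_linear = spectral_filter(L, X, lambda lam: lam)
```

The two sanity checks (an all-pass response and the response $h(\lambda)=\lambda$) are chosen because their node-domain counterparts, $\mathbf{X}$ and $\mathbf{L}\mathbf{X}$, are known in closed form.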

2.1 GNNs based on polynomial filters and limitations

The desired filter response $h(\lambda)$ can be approximated by a polynomial of order $K$ ,

$$h_{\text{POLY}}(\lambda) = \sum_{k=0}^{K} w_{k}\lambda^{k}, \tag{2}$$

which performs a weighted moving average of the graph signal (Tremblay et al., 2018). These filters overcome important limitations of the spectral formulation, as they avoid the eigendecomposition and their parameters are independent of the Laplacian spectrum (Zhang et al., 2018). Polynomial filters are localized in the node space, since the output at each node in the filtered signal is a linear combination of the nodes with their $K$ -hop neighborhoods.

The order of the polynomial $K$ is assumed to be small and independent of the number $M$ of nodes in the graph.

To express polynomial filters in the node space, we first recall that the $k$ -th power of any diagonalizable matrix, such as the Laplacian, can be computed by taking the power of its eigenvalues, i.e., ${\mathbf{L}}^{k}={\mathbf{U}}\,\text{diag}[\lambda_{1}^{k},\dots,\lambda_{M}^{k}]\,{\mathbf{U}}^{T}$ . It follows that the filtering operation becomes

$$\bar{\mathbf{X}} = \sum_{k=0}^{K} w_{k}\mathbf{L}^{k}\mathbf{X}. \tag{3}$$

Eq. (2) and (3) represent a generic polynomial filter. Among the existing classes of polynomials, Chebyshev polynomials are often used in signal processing as they attenuate unwanted oscillations around the cut-off frequencies (Shuman et al., 2011), which, in our case, are the eigenvalues of the Laplacian. Fast localized GNN filters can approximate the desired filter response by means of the Chebyshev expansion $T_{k}(x)=2xT_{k-1}(x)-T_{k-2}(x)$ (Defferrard et al., 2016), resulting in convolutional layers that perform the filtering operation

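$$\bar{\mathbf{X}} = \sigma\left(\sum_{k=0}^{K-1} T_{k}(\tilde{\mathbf{L}})\,\mathbf{X}\mathbf{W}_{k}\right), \tag{4}$$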

where $\tilde{{\mathbf{L}}}=2{\mathbf{L}}/\lambda_{\text{max}}-{\mathbf{I}}_{M}$ , $\sigma(\cdot)$ is a non-linear activation (e.g., a sigmoid or a ReLU function), and $\mathbf{W}_{k}\in\mathbb{R}^{F_{\text{in}}\times F_{\text{out}}}$ are the $K$ trainable weight matrices that map the node features from $\mathbb{R}^{F_{\text{in}}}$ to $\mathbb{R}^{F_{\text{out}}}$ .
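As a rough sketch of the Chebyshev layer in Eq. (4), the recurrence $T_{k}(x)=2xT_{k-1}(x)-T_{k-2}(x)$ can be applied directly to the signal, so that no power of $\tilde{\mathbf{L}}$ is formed explicitly. The function name `chebyshev_layer`, the choice of ReLU, the random weights, and the toy graph are all assumptions for illustration.

```python
import numpy as np

def chebyshev_layer(L_tilde, X, weights):
    """Eq. (4) with a ReLU: sigma(sum_{k=0}^{K-1} T_k(L_tilde) X W_k)."""
    T_prev, T_curr = X, L_tilde @ X            # T_0(L~) X = X, T_1(L~) X = L~ X
    out = T_prev @ weights[0]
    for k in range(1, len(weights)):
        out = out + T_curr @ weights[k]
        # Chebyshev recurrence applied to the signal: T_{k+1} = 2 L~ T_k - T_{k-1}
        T_prev, T_curr = T_curr, 2.0 * (L_tilde @ T_curr) - T_prev
    return np.maximum(out, 0.0)

# Toy 4-node path graph (hypothetical example data).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L = np.eye(4) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
L_tilde = 2.0 * L / np.linalg.eigvalsh(L).max() - np.eye(4)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                              # F_in = 3
weights = [rng.normal(size=(3, 2)) for _ in range(3)]    # K = 3, F_out = 2

out = chebyshev_layer(L_tilde, X, weights)

# Reference with explicitly formed T_0 = I, T_1 = L~, T_2 = 2 L~^2 - I.
T2 = 2.0 * L_tilde @ L_tilde - np.eye(4)
expected = np.maximum(
    X @ weights[0] + L_tilde @ X @ weights[1] + T2 @ X @ weights[2], 0.0
)
```

Each extra term costs one more product with $\tilde{\mathbf{L}}$, which is why the cost grows with the polynomial degree.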

The output of a $k$ -degree polynomial filter is a linear combination of the input within each vertex’s $k$ -hop neighborhood. Since the input beyond the $k$ -hop neighborhood has no impact on the output of the filtering operation, to capture larger structures on the graph it is necessary to adopt high-degree polynomials. However, high-degree polynomials have poor interpolatory and extrapolatory performance since they overfit the known graph frequencies, i.e., the eigenvalues of the Laplacian. This hampers the GNN’s generalization capability as it becomes sensitive to noise and small changes in the graph topology. Moreover, evaluating a polynomial with a high degree is computationally expensive both during training and inference (Isufi et al., 2016). Finally, since polynomials are very smooth, they cannot model filter responses with sharp changes.
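The locality claim can be checked numerically. The sketch below, assuming a 6-node path graph and arbitrary coefficients `w` (both hypothetical), applies the node-domain filter of Eq. (3) to an impulse at node 0 and compares it with the spectral form of Eq. (1). With degree $K=2$, nodes more than two hops away receive exactly nothing.

```python
import numpy as np

def polynomial_filter(L, X, w):
    """Eq. (3): sum_k w_k L^k X, using only repeated products with L."""
    out = np.zeros_like(X)
    LkX = X.copy()                 # holds L^k X, starting from k = 0
    for wk in w:
        out += wk * LkX
        LkX = L @ LkX
    return out

# Path graph 0-1-2-3-4-5 (hypothetical example data).
M = 6
A = np.zeros((M, M))
idx = np.arange(M - 1)
A[idx, idx + 1] = A[idx + 1, idx] = 1.0
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L = np.eye(M) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

X = np.zeros((M, 1))
X[0, 0] = 1.0                      # impulse on node 0
w = [0.5, -1.0, 0.25]              # degree K = 2

out = polynomial_filter(L, X, w)

# Same filter evaluated in the spectral domain, h(lambda) = sum_k w_k lambda^k.
lam, U = np.linalg.eigh(L)
h = sum(wk * lam**k for k, wk in enumerate(w))
out_spectral = U @ (h[:, None] * (U.T @ X))
```

Nodes 3, 4, and 5 are three or more hops from node 0, so their outputs stay at zero; reaching them would require raising the degree of the polynomial.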

A particular first-order polynomial filter has been proposed by Kipf & Welling (2016a) for semi-supervised node classification. In their GNN model, called Graph Convolutional Network (GCN), the convolutional layer is a simplified version of a Chebyshev filter, obtained from Eq. (4) by considering $K=1$ and by setting $\mathbf{W}=\mathbf{W}_{0}=-\mathbf{W}_{1}$ :

$$\bar{\mathbf{X}} = \sigma(\hat{\mathbf{A}}\mathbf{X}\mathbf{W}). \tag{5}$$

Additionally, $\tilde{{\mathbf{L}}}$ is replaced by $\hat{{\mathbf{A}}}=\tilde{{\mathbf{D}}}^{-1/2}\tilde{{\mathbf{A}}}\tilde{{\mathbf{D}}}^{-1/2}$ , with $\tilde{{\mathbf{A}}}={\mathbf{A}}+\gamma{\mathbf{I}}_{M}$ (usually, $\gamma=1$ ). The modified adjacency matrix $\hat{{\mathbf{A}}}$ contains self-loops that compensate for the removal of the term of order 0 in the polynomial, by ensuring that a node is part of its first-order neighborhood and that its features are preserved (to some extent) after convolution. Higher-order neighborhoods can be reached by stacking multiple GCN layers. On the one hand, GCNs reduce overfitting and the heavy computational load of Chebyshev filters with high-order polynomials. On the other hand, since each GCN layer performs a Laplacian smoothing, after a few convolutions the node features become too smooth over the graph (Li et al., 2018) and the initial node features are lost.
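A small sketch of the GCN layer in Eq. (5) and of the smoothing effect described above follows. The smoothing demonstration is deliberately simplified and is an assumption of this sketch: it applies only the linear propagation $\hat{\mathbf{A}}$ repeatedly, without weights or non-linearity, on a connected toy graph. Function names and data are hypothetical.

```python
import numpy as np

def gcn_propagation_matrix(A, gamma=1.0):
    """A_hat = D~^{-1/2} (A + gamma I) D~^{-1/2}, with D~ the degrees of A + gamma I."""
    A_tilde = A + gamma * np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_layer(A_hat, X, W):
    """Eq. (5) with a ReLU non-linearity."""
    return np.maximum(A_hat @ X @ W, 0.0)

# Toy 4-node path graph (hypothetical example data).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = gcn_propagation_matrix(A)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
H = gcn_layer(A_hat, X, W)

# Repeated linear propagation: the features collapse onto the leading
# eigenvector of A_hat, which is proportional to sqrt(degree + 1).
X_smooth = X.copy()
for _ in range(200):
    X_smooth = A_hat @ X_smooth

v = np.sqrt(A.sum(axis=1) + 1.0)
v /= np.linalg.norm(v)
X_limit = np.outer(v, v @ X)       # rank-one limit; node-specific detail is gone
```

After enough propagation steps every feature column is the same vector up to scale, which is one way to read the statement that the initial node features are lost.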

3 Rational filters for graph signals

An ARMA filter can closely approximate any desired filter response $h(\lambda)$ thanks to a rational design that, compared to polynomial filters, can model a larger variety of filter shapes (Tremblay et al., 2018). The filter response of an ARMA filter of order $K$ , denoted in the following as ARMA$_K$, reads

$$h_{\text{ARMA}_{K}}(\lambda) = \frac{\sum_{k=0}^{K-1} p_{k}\lambda^{k}}{1+\sum_{k=1}^{K} q_{k}\lambda^{k}}, \tag{6}$$

which translates to the following filtering relation in the node space

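$$\bar{\mathbf{X}} = \left(\mathbf{I}+\sum_{k=1}^{K} q_{k}\mathbf{L}^{k}\right)^{-1}\left(\sum_{k=0}^{K-1} p_{k}\mathbf{L}^{k}\right)\mathbf{X}. \tag{7}$$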

Notice that by setting $q_{k}=0$ , for every $k$ , one recovers a polynomial filter, which is considered as the MA term of the model. The inclusion of the additional AR term encoded by these coefficients makes the ARMA model robust to noise and allows it to capture longer dynamics on the graph, since $\bar{{\mathbf{X}}}$ depends, in turn, on several steps of propagation of the node features. This is the key to capturing longer dependencies and more global structures on the graph, compared to a polynomial filter with the same degree.

The matrix inversion in Eq. (7) is slow to compute and yields a dense matrix that prevents us from using sparse multiplications to implement the GNN. In this paper, we follow a straightforward approach to avoid computing the inverse, which can be easily extended to a neural network implementation. Specifically, we approximate the effect of an ARMA$_1$ filter by iterating, until convergence, the first-order recursion

$$\bar{\mathbf{X}}^{(t+1)} = a\mathbf{M}\bar{\mathbf{X}}^{(t)} + b\mathbf{X}, \tag{8}$$

where

$$\mathbf{M} = \frac{1}{2}(\lambda_{\text{max}}-\lambda_{\text{min}})\mathbf{I} - \mathbf{L}. \tag{9}$$

The recursion in Eq. (8) is adopted in graph signal processing to apply a low-pass filter on a graph signal (Loukas et al., 2015; Isufi et al., 2016), but it is also equivalent to the recurrent update used in Label Propagation (Zhou et al., 2004) and Personalized Page Rank (Page et al., 1999) to propagate information on a graph by means of a random walk with a restart probability.

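Following the derivation in (Isufi et al., 2016), we analyze the frequency response of an ARMA$_1$ filter from the convergence of Eq. (8):

$$\bar{\mathbf{X}} = \lim_{t\to\infty}\left[(a\mathbf{M})^{t}\bar{\mathbf{X}}^{(0)} + b\sum_{i=0}^{t}(a\mathbf{M})^{i}\mathbf{X}\right]. \tag{10}$$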

The eigenvectors of ${\mathbf{M}}$ and ${\mathbf{L}}$ are the same, while the eigenvalues are related as follows: $\mu_{m}=(\lambda_{\text{max}}-\lambda_{\text{min}})/2-\lambda_{m}$ , where $\mu_{m}$ and $\lambda_{m}$ represent the $m$ -th eigenvalue of ${\mathbf{M}}$ and ${\mathbf{L}}$ , respectively. Since $\mu_{m}\in[-1,1]$ , for $\lvert a\rvert<1$ the first term of Eq. (10), $(a\mathbf{M})^{t}$ , goes to zero when $t\rightarrow\infty$ , regardless of the initial point $\bar{{\mathbf{X}}}^{(0)}$ . The second term, $b\sum_{i=0}^{t}(a\mathbf{M})^{i}$ , is a geometric series that converges to the matrix $b({\mathbf{I}}-a\mathbf{M})^{-1}$ , with eigenvalues $b/(1-a\mu_{m})$ . It follows that the frequency response of the ARMA$_1$ filter is

$$h_{\text{ARMA}_{1}}(\mu_{m}) = \frac{b}{1-a\mu_{m}}. \tag{11}$$
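The convergence argument can be checked numerically. The sketch below, assuming a small 5-node graph and arbitrary coefficients `a = 0.7`, `b = 0.3` (hypothetical values), iterates the recursion of Eq. (8), compares the result with the closed form $b(\mathbf{I}-a\mathbf{M})^{-1}\mathbf{X}$, and verifies that the eigenvalues of the limit operator are $b/(1-a\mu_{m})$.

```python
import numpy as np

# Toy 5-node graph: a triangle with a two-node tail (hypothetical example data).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L = np.eye(5) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

lam = np.linalg.eigvalsh(L)
shift = 0.5 * (lam.max() - lam.min())
M_mat = shift * np.eye(5) - L          # Eq. (9)
mu = shift - lam                       # eigenvalues of M

a, b = 0.7, 0.3                        # |a| < 1 is needed for convergence
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# Eq. (8): first-order recursion, started from zero.
X_bar = np.zeros_like(X)
for _ in range(300):
    X_bar = a * (M_mat @ X_bar) + b * X

# Limit of the geometric series, obtained with a dense solve for comparison only.
closed_form = b * np.linalg.solve(np.eye(5) - a * M_mat, X)

# Frequency response of the limit operator versus Eq. (11).
H = b * np.linalg.inv(np.eye(5) - a * M_mat)
response = b / (1.0 - a * mu)
```

The recursion only uses products with $\mathbf{M}$, which is sparse whenever $\mathbf{L}$ is, whereas the dense solve is included purely as a reference.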

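By summing $K$ ARMA$_1$ filters, it is possible to recover the analytical form of the ARMA$_K$ filter in Eq. (7). The resulting filtering operation is

$$\bar{\mathbf{X}} = \sum_{k=1}^{K}\sum_{m=1}^{M}\frac{b_{k}}{1-a_{k}\mu_{m}}\,\mathbf{u}_{m}\mathbf{u}_{m}^{T}\mathbf{X}, \tag{12}$$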

with

$$h_{\text{ARMA}_{K}}(\mu_{m}) = \sum_{k=1}^{K}\frac{b_{k}}{1-a_{k}\mu_{m}}. \tag{13}$$

Different orders ( $\leq K$ ) of the numerator and denominator in Eq. (6) are trivially obtained by setting some coefficients to 0. It follows that an ARMA filter generalizes a polynomial filter when all coefficients $q_{k}$ are set to zero.
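As a worked illustration of how a sum of first-order responses yields a rational response of order $K$, consider the case $K=2$. The expansion below is plain algebra on the summed response and is written in the variable $\mu$ (the eigenvalues of $\mathbf{M}$) rather than $\lambda$; the identification of the coefficients with $p_{k}$ and $q_{k}$ is therefore meant up to the affine change of variable between $\mu$ and $\lambda$.

```latex
\begin{aligned}
\frac{b_{1}}{1-a_{1}\mu} + \frac{b_{2}}{1-a_{2}\mu}
  &= \frac{b_{1}(1-a_{2}\mu) + b_{2}(1-a_{1}\mu)}{(1-a_{1}\mu)(1-a_{2}\mu)} \\
  &= \frac{(b_{1}+b_{2}) - (b_{1}a_{2}+b_{2}a_{1})\,\mu}
          {1 - (a_{1}+a_{2})\,\mu + a_{1}a_{2}\,\mu^{2}}
\end{aligned}
```

The numerator has degree $K-1=1$ and the denominator has degree $K=2$ with constant term 1, which matches the structure of Eq. (6).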

4 The ARMA neural network layer

In graph signal processing, the filter coefficients $a$ and $b$ in Eq. (8) are optimized with linear regression to reproduce a desired filter response $h^{*}(\lambda)$ , which must be provided a priori by the designer (Isufi et al., 2016). Here, we consider a machine learning approach that does not require specifying the target response $h^{*}(\lambda)$ ; instead, the parameters are learned end-to-end from the data by optimizing a task-dependent loss function. Importantly, we also introduce non-linearities to enhance the representation capability of the filter response that can be learned.

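The proposed neural network formulation of the ARMA$_1$ filter implements the recursive update of Eq. (8) with a Graph Convolutional Skip (GCS) layer, defined as

$$\bar{\mathbf{X}}^{(t+1)} = \sigma\left(\tilde{\mathbf{L}}\bar{\mathbf{X}}^{(t)}\mathbf{W} + \mathbf{X}\mathbf{V}\right), \tag{14}$$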

where $\sigma(\cdot)$ is a non-linearity such as ReLU, sigmoid, or hyperbolic tangent (tanh), ${\mathbf{X}}$ are the initial node features, and ${\mathbf{W}}\in\mathbb{R}^{F_{\text{out}}\times F_{\text{out}}}$ and ${\mathbf{V}}\in\mathbb{R}^{F_{\text{in}}\times F_{\text{out}}}$ are trainable parameters. The modified Laplacian matrix $\tilde{{\mathbf{L}}}$ is defined by setting $\lambda_{\text{min}}=0$ and $\lambda_{\text{max}}=2$ in Eq. (9) and then $\tilde{{\mathbf{L}}}=\mathbf{M}$ . This is a reasonable simplification since the spectrum of ${\mathbf{L}}$ lies in $[0,2]$ and the trainable parameters ${\mathbf{W}}$ and ${\mathbf{V}}$ can compensate for the small offset introduced. The unfolded recursion in Eq. (14) corresponds to a stack of GCS layers that share the same parameters.

Each GCS layer is localized in the node space, as it performs a filtering operation that depends on local exchanges among neighboring nodes and, through the skip connection, also on the initial node features ${\mathbf{X}}$ . The computational complexity of the GCS layer is linear in the number of edges (both in time and space) since the layer can be efficiently implemented as a sparse multiplication between $\tilde{\mathbf{L}}$ and $\bar{{\mathbf{X}}}^{(t)}$ .
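To make the linear-in-edges cost concrete, here is a minimal sketch of one GCS step, Eq. (14), in which $\tilde{\mathbf{L}}$ is stored as a weighted edge list and the aggregation is a single pass over the edges. It assumes $\lambda_{\text{min}}=0$ and $\lambda_{\text{max}}=2$, so that $\tilde{\mathbf{L}}=\mathbf{I}-\mathbf{L}=\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$. The function name `gcs_layer_edges`, the ReLU choice, and the toy data are assumptions for illustration; a practical implementation would more likely rely on a sparse matrix library.

```python
import numpy as np

def gcs_layer_edges(src, dst, w, X_bar, X, W, V):
    """One GCS step, Eq. (14), with L_tilde given as a weighted edge list."""
    H = X_bar @ W
    agg = np.zeros_like(H)
    # agg[i] += L_tilde[i, j] * H[j] for every stored edge (i = dst, j = src).
    np.add.at(agg, dst, w[:, None] * H[src])
    return np.maximum(agg + X @ V, 0.0)        # skip connection, then ReLU

# Toy 5-node graph (hypothetical example data).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))

dst, src = np.nonzero(A)                       # both directions of each edge
w = d_inv_sqrt[dst] * d_inv_sqrt[src]          # entries of D^{-1/2} A D^{-1/2}

rng = np.random.default_rng(0)
F_in, F_out = 3, 4
X = rng.normal(size=(5, F_in))
X_bar = rng.normal(size=(5, F_out))
W = rng.normal(size=(F_out, F_out))
V = rng.normal(size=(F_in, F_out))

out = gcs_layer_edges(src, dst, w, X_bar, X, W, V)

# Dense reference, for checking only.
L_tilde = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
out_dense = np.maximum(L_tilde @ X_bar @ W + X @ V, 0.0)
```

The aggregation touches each stored edge once, so its cost scales with the number of edges rather than with $M^{2}$.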

The neural network formulation of an ARMA$_1$ filter is obtained by iterating Eq. (14) until convergence, i.e., until $\|\bar{{\mathbf{X}}}^{(T+1)}-\bar{{\mathbf{X}}}^{(T)}\|<\epsilon$ , where $\epsilon$ is a small positive constant and $T$ is the convergence time. The convergence of the update in Eq. (14), which draws a connection to the original recursive formulation of the ARMA$_1$ filter, is guaranteed by Theorem 1.

Theorem 1.

It is sufficient that $\|{\mathbf{W}}\|_{2}<1$ and that $\sigma(\cdot)$ is a non-expansive map for Eq. (14) to converge to a unique fixed point, regardless of the initial state $\bar{{\mathbf{X}}}^{(0)}$ .

Proof.

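Let $\bar{\mathbf{X}}_{a}^{(0)}$ and $\bar{\mathbf{X}}_{b}^{(0)}$ be two different initial states and $\left\|{\mathbf{W}}\right\|_{2}<1$ . After applying Eq. (14) for $t+1$ steps, we obtain states $\bar{\mathbf{X}}_{a}^{(t+1)}$ and $\bar{\mathbf{X}}_{b}^{(t+1)}$ . If the non-linearity $\sigma(\cdot)$ is a non-expansive map, such as the ReLU function, the following inequality holds:

$$\begin{aligned}
\left\|\bar{\mathbf{X}}_{a}^{(t+1)} - \bar{\mathbf{X}}_{b}^{(t+1)}\right\|_{2}
&= \left\|\sigma\left(\tilde{\mathbf{L}}\bar{\mathbf{X}}_{a}^{(t)}\mathbf{W} + \mathbf{X}\mathbf{V}\right) - \sigma\left(\tilde{\mathbf{L}}\bar{\mathbf{X}}_{b}^{(t)}\mathbf{W} + \mathbf{X}\mathbf{V}\right)\right\|_{2} \\
&\leq \left\|\tilde{\mathbf{L}}\bar{\mathbf{X}}_{a}^{(t)}\mathbf{W} + \mathbf{X}\mathbf{V} - \tilde{\mathbf{L}}\bar{\mathbf{X}}_{b}^{(t)}\mathbf{W} - \mathbf{X}\mathbf{V}\right\|_{2} \\
&= \left\|\tilde{\mathbf{L}}\bar{\mathbf{X}}_{a}^{(t)}\mathbf{W} - \tilde{\mathbf{L}}\bar{\mathbf{X}}_{b}^{(t)}\mathbf{W}\right\|_{2} \\
&\leq \left\|\tilde{\mathbf{L}}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\left\|\bar{\mathbf{X}}_{a}^{(t)} - \bar{\mathbf{X}}_{b}^{(t)}\right\|_{2}.
\end{aligned} \tag{15}$$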

If the non-linearity $\sigma(\cdot)$ is also a squashing function (e.g., sigmoid or tanh), then the first inequality in (15) is strict.

Since the largest singular value of $\tilde{{\mathbf{L}}}$ is $\leq 1$ by definition, it follows that $\left\|\tilde{{\mathbf{L}}}\right\|_{2}\left\|{\mathbf{W}}\right\|_{2}<1$ and, therefore, (15) implies that Eq. (14) is a contraction mapping. The convergence to a unique fixed point and, thus, the inconsequentiality of the initial state, follow by the Banach fixed-point theorem (Goebel & Kirk, 1972). ∎

From Theorem 1 it follows that it is possible to choose an arbitrary $\epsilon>0$ for which

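$$\exists\, T_{\epsilon}<\infty \ \text{s.t.}\ \left\|\bar{\mathbf{X}}^{(t+1)} - \bar{\mathbf{X}}^{(t)}\right\|_{2}\leq\epsilon,\ \forall t\geq T_{\epsilon}. \tag{16}$$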

Therefore, we can easily implement a stopping criterion for the iteration, which is met in finite time.
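A stopping criterion of this kind could look like the following sketch. It rescales a random $\mathbf{W}$ so that $\|\mathbf{W}\|_{2}=0.9$, iterates Eq. (14) with a ReLU from two very different initial states, and stops when consecutive iterates differ by at most `eps` in Frobenius norm. The function name `iterate_gcs`, the tolerance, the rescaling factor, and the toy graph are assumptions for illustration.

```python
import numpy as np

def iterate_gcs(L_tilde, X, W, V, X0, eps=1e-10, max_iters=10_000):
    """Iterate Eq. (14) until successive states differ by at most eps."""
    X_bar = X0
    for t in range(1, max_iters + 1):
        X_next = np.maximum(L_tilde @ X_bar @ W + X @ V, 0.0)
        if np.linalg.norm(X_next - X_bar) <= eps:
            return X_next, t
        X_bar = X_next
    raise RuntimeError("no convergence within max_iters")

# Toy 5-node graph (hypothetical example data).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L_tilde = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # I - L

rng = np.random.default_rng(0)
F_in, F_out = 3, 4
X = rng.normal(size=(5, F_in))
V = rng.normal(size=(F_in, F_out))
W = rng.normal(size=(F_out, F_out))
W *= 0.9 / np.linalg.norm(W, 2)        # enforce ||W||_2 = 0.9 < 1

X0_a = rng.normal(size=(5, F_out))
X0_b = 10.0 * rng.normal(size=(5, F_out))

fixed_a, steps_a = iterate_gcs(L_tilde, X, W, V, X0_a)
fixed_b, steps_b = iterate_gcs(L_tilde, X, W, V, X0_b)
```

Both runs should end at (numerically) the same state, in line with the uniqueness of the fixed point stated in Theorem 1.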

Similar to the formulation of the ARMA filter in Eq. (12), the output of the ARMA$_K$ convolutional layer is obtained by combining the outputs of $K$ parallel stacks of GCS layers.

4.1 Implementation

Each GCS stack $k$ may require a different and possibly high number of iterations $T_{k}$ to converge, depending on the value of the node features ${\mathbf{X}}$ and the weight matrices ${\mathbf{W}}_{k}$ and ${\mathbf{V}}_{k}$ . This makes the implementation of the neural network cumbersome, because the computational graph is dynamic and changes every time the weight matrices are updated with gradient descent during training. Moreover, to train the parameters with backpropagation through time the neural network must be unfolded many times if $T_{k}$ is large, introducing a high computational cost and the vanishing gradient issue (Bianchi et al., 2017).

One solution is to follow the approach of Reservoir Computing, where the weight matrices ${\mathbf{W}}_{k}$ and ${\mathbf{V}}_{k}$ in each stack are randomly initialized and left untrained (Lukoševičius & Jaeger, 2009; Gallicchio & Micheli, 2020). We notice that the random weights initialization guarantees that the $K$ GCS stacks implement different filtering operations. To compensate for the lack of training, high-dimensional features are exploited to generate rich latent representations that disentangle the factors of variations in the data (Tiňo, 2020). However, randomized architectures with high-dimensional feature spaces are memory inefficient and computationally expensive at inference time.

A second approach, considered in this work, is to drop the requirement of convergence altogether and fix the number of iterations to a constant value $T$ , so that $T_{k}=T$ in each GCS stack $k$ . In this way, we obtain a GNN that is easy to implement, fast to train and evaluate, and not affected by stability issues. Notably, the constraint $\|{\mathbf{W}}\|_{2}<1$ of Theorem 1 can be relaxed by adding to the loss function an L2 weight decay regularization term.

Even by stacking a small number $T$ of GCS layers, we expect the GNN to learn a large variety of node representations thanks to the non-linearity and the trainable parameters (Raghu et al., 2017). As non-linearity we adopt the ReLU function that, compared to the squashing non-linearities, improves training efficiency by facilitating the gradient flow (Goodfellow et al., 2016).

Given the limited number of iterations, the initial state $\bar{\mathbf{X}}^{(0)}$ now influences the final representation $\bar{\mathbf{X}}^{(T)}$ . A natural choice is to initialize the state with $\bar{\mathbf{X}}^{(0)}=\boldsymbol{0}\in\mathbb{R}^{M\times F_{\text{out}}}$ or with a linear transformation of the node features $\bar{\mathbf{X}}^{(0)}={\mathbf{X}}{\mathbf{W}}^{(0)}$ , where ${\mathbf{W}}^{(0)}\in\mathbb{R}^{F_{\text{in}}\times F_{\text{out}}}$ replaces ${\mathbf{W}}$ in the first layer of the stack. We adopted the latter initialization so that the node features are propagated also by the first GCS layer. We also note that it is possible to set ${\mathbf{W}}^{(0)}={\mathbf{V}}$ to reduce the number of trainable parameters.

The output of the ARMA$_K$ convolutional layer is computed as

$$\bar{\mathbf{X}} = \frac{1}{K}\sum_{k=1}^{K}\bar{\mathbf{X}}_{k}^{(T)}, \tag{17}$$

where $\bar{{\mathbf{X}}}_{k}^{(T)}$ is the output of the last GCS layer in the $k$ -th stack. Fig. 1 depicts a scheme of the proposed ARMA graph convolutional layer.

To encourage each GCS stack to learn a filtering operation with a response different from the other stacks, we apply stochastic dropout to the skip connections ${\mathbf{X}}{\mathbf{V}}_{k}$ in each GCS layer. This leads to learning a heterogeneous set of features that, when combined to form the output of the ARMA$_K$ layer, yield powerful and expressive node representations. We notice that the parameter sharing in each layer of the GCS stack endows the GNN with a strong regularization that helps to prevent overfitting and greatly reduces the model complexity, in terms of the number of trainable parameters. Finally, since the GCS stacks are independent of each other, the computation of an ARMA layer can be distributed across multiple processing units.
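Putting the pieces of this section together, a forward pass of the layer could be sketched as follows: $K$ independent stacks, $T$ GCS layers per stack with shared $\mathbf{W}_{k}$ and $\mathbf{V}_{k}$, a separate $\mathbf{W}^{(0)}_{k}$ in the first layer, dropout on the skip connection, and an average over stacks. This is a simplified NumPy illustration and not the authors' implementation. The function name `arma_layer`, the tuple layout of `stacks`, and the inverted-dropout rescaling are assumptions.

```python
import numpy as np

def arma_layer(L_tilde, X, stacks, T, p_drop=0.0, rng=None):
    """Forward pass: average of K GCS stacks, each unrolled for T steps.

    stacks: list of (W0, W, V) with W0, V of shape (F_in, F_out) and
            W of shape (F_out, F_out). Dropout is active only if rng is given.
    """
    outputs = []
    for W0, W, V in stacks:                        # K independent stacks
        X_bar = X
        for t in range(T):                         # T layers sharing W and V
            skip = X @ V                           # skip connection from the input
            if rng is not None and p_drop > 0.0:
                keep = rng.random(skip.shape) >= p_drop
                skip = skip * keep / (1.0 - p_drop)    # inverted dropout (assumption)
            weight = W0 if t == 0 else W           # W0 replaces W in the first layer
            X_bar = np.maximum(L_tilde @ X_bar @ weight + skip, 0.0)
        outputs.append(X_bar)
    return np.mean(outputs, axis=0)                # average over the K stacks

# Toy 5-node graph (hypothetical example data).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L_tilde = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
F_in, F_out, K = 3, 4, 3
X = rng.normal(size=(5, F_in))
stacks = [
    (rng.normal(size=(F_in, F_out)),
     rng.normal(size=(F_out, F_out)),
     rng.normal(size=(F_in, F_out)))
    for _ in range(K)
]

out = arma_layer(L_tilde, X, stacks, T=2)                  # inference, no dropout
out_train = arma_layer(L_tilde, X, stacks, T=2, p_drop=0.5, rng=rng)

# With K = 1 and T = 1 the layer reduces to a single GCS step.
single = arma_layer(L_tilde, X, stacks[:1], T=1)
W0, _, V = stacks[0]
expected_single = np.maximum(L_tilde @ X @ W0 + X @ V, 0.0)
```

Because the stacks do not interact until the final average, the outer loop is the part that could be distributed across processing units.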

4.2 Properties and relationship with other approaches

Unlike filters defined directly in the spectral domain (Bruna et al., 2013), ARMA filters do not explicitly depend on the eigenvectors and the eigenvalues of ${\mathbf{L}}$, making them robust to perturbations in the underlying graph structure. For this reason, as formally proven for generic rational filters (Levie et al., 2019a), the proposed ARMA filters are transferable, i.e., they can be applied to graphs with different topologies not seen during training.

The skip connections in our architecture allow stacking many GCS layers without the risk of over-smoothing the node features. Due to the weight sharing, the ARMA architecture has similarities with the recurrent neural networks with residual connections used to process sequential data (Wu et al., 2016).

Similarly to GNNs operating directly in the node domain (Scarselli et al., 2009; Gallicchio & Micheli, 2010), each GCS layer computes the filtered signal $\bar{\mathbf{x}}_{i}^{(t+1)}$ at vertex $i$ as a combination of signals $\mathbf{x}_{j}^{(t)}$ in its 1-hop neighborhood, $j\in\mathcal{N}(i)$ . Such a commutative aggregation solves the problem of undefined vertex ordering and varying neighborhood sizes, making the proposed operator permutation equivariant.
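Permutation equivariance can be checked numerically: relabeling the nodes before the layer should give the same result as relabeling them after. The sketch below uses a simplified, hypothetical 1-hop layer $\sigma(\mathbf{A}\mathbf{X}\mathbf{W})$ rather than the full GCS layer, and a random toy graph.

```python
import numpy as np

def one_hop_layer(A, X, W):
    """Sum the features of each node's 1-hop neighbors, transform them, apply ReLU."""
    return np.maximum(A @ X @ W, 0.0)

rng = np.random.default_rng(0)
M, F_in, F_out = 5, 3, 2

# Random symmetric adjacency matrix without self-loops.
A = np.triu(rng.integers(0, 2, size=(M, M)), 1).astype(float)
A = A + A.T
X = rng.normal(size=(M, F_in))
W = rng.normal(size=(F_in, F_out))

# Random permutation matrix P: row i of P @ X is row perm[i] of X.
perm = rng.permutation(M)
P = np.eye(M)[perm]

out = one_hop_layer(A, X, W)
out_perm = one_hop_layer(P @ A @ P.T, P @ X, W)

# Equivariance: f(P A P^T, P X) == P f(A, X).
equivariant = np.allclose(out_perm, P @ out)
```

The check succeeds because the sum over neighbors does not depend on the order in which they are visited, which is the commutative aggregation mentioned above.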

The skip connections in ARMA inject in each GCS layer $t$ of the stack the initial node features ${\mathbf{X}}$ . This is different from a skip connection that either takes the output of the previous layer ${\mathbf{X}}^{(t-1)}$ as input (Pham et al., 2017; Hamilton et al., 2017), or connects all the layers in a GNN stack directly to the output (Wu et al., 2018).

The ARMA layer can naturally deal with a time-varying topology and graph signals (Holme, 2015; Grattarola et al., 2019) by replacing the constant term ${\mathbf{X}}$ in Eq. (14) with a time-dependent input ${\mathbf{X}}^{(t)}$ .

Finally, we discuss the relationship between the proposed ARMA GNN and CayleyNets (Levie et al., 2019b), a GNN architecture that also approximates the effect of a rational filter. Specifically, the filtering operation of a Cayley polynomial in the node space is

[TABLE]

To approximate the matrix inversion in Eq. (17) with a sequence of differentiable operations, CayleyNets adopt a fixed number $T$ of Jacobi iterations. In practice, the Jacobi iterations approximate each term $({\mathbf{L}}+i{\mathbf{I}})({\mathbf{L}}-i{\mathbf{I}})^{-1}$ as a polynomial of order $T$ with fixed coefficients. Therefore, the resulting filtering operation performed by a CayleyNet assumes the form

[TABLE]

where $\hat{{\mathbf{L}}}$ is an operator with the same sparsity pattern of ${\mathbf{L}}$ . We note that Eq. (17) and (18) slightly simplify the original formulation presented by Levie et al. (Levie et al., 2019b), but allow us to better understand what type of operation is actually performed by the CayleyNet. Specifically, Eq. (18) implements a polynomial filter of order $KT$ , such as the one in Eq. (3).

For this reason, CayleyNets share strong similarities with the Chebyshev filter in Eq. (4), as it uses a (high-order) polynomial to propagate the node features on the graph for $KT$ hops before applying the non-linearity. On the other hand, each of the $K$ parallel stacks in the proposed ARMA layer propagates the current node representations $\bar{{\mathbf{X}}}^{(t)}$ only for 1 hop and combines them with the node features ${\mathbf{X}}$ before applying the non-linearity.

5 Spectral analysis of the ARMA layer

In this section we show how the proposed ARMA layer can implement filtering operations with a large variety of frequency responses. The filter response of the ARMA filter derived in Sec. 3 cannot be exploited to analyze our GNN formulation, due to the presence of non-linearities. Therefore, we first recall that a filter changes the components of a graph signal ${\mathbf{X}}$ on the eigenbasis induced by ${\mathbf{L}}$ (which is the same as the one induced by $\tilde{{\mathbf{L}}}$, according to Sylvester’s theorem). By referring to Eq. (1), ${\mathbf{X}}$ is first projected on the eigenspace of ${\mathbf{L}}$ by ${\mathbf{U}}^{T}$, then the filter $h(\lambda_{m})$ changes the value of the component of ${\mathbf{X}}$ on each eigenvector ${\mathbf{u}}_{m}$, and finally ${\mathbf{U}}$ maps the result back to the node space. By left-multiplying Eq. (1) by ${\mathbf{U}}^{T}$ we obtain

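$$\mathbf{U}^{T}\bar{\mathbf{X}}=\mathrm{diag}[h(\lambda_{1}),\dots,h(\lambda_{M})]\,\mathbf{U}^{T}\mathbf{X}.$$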

When $\bar{\mathbf{X}}$ is the output of the ARMA layer, the term ${\mathbf{U}}^{T}\bar{\mathbf{X}}$ defines how the original components, ${\mathbf{U}}^{T}{\mathbf{X}}$ , are changed by the GNN. Therefore, we can compute numerically the unknown filter response of the ARMA layer as the ratio between ${\mathbf{U}}^{T}\bar{\mathbf{X}}$ and ${\mathbf{U}}^{T}{\mathbf{X}}$ . We define the empirical filter response $\tilde{h}$ as

[TABLE]

where $\bar{{\mathbf{x}}}_{f}$ is column $f$ of the output $\bar{\mathbf{X}}_{k}$ , ${\mathbf{x}}_{f}$ is column $f$ of the graph signal ${\mathbf{X}}$ , and ${\mathbf{u}}_{m}$ is an eigenvector of ${\mathbf{L}}$ .
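The ratio between output and input spectral components can be computed directly once the eigenvectors of the Laplacian are available. The sketch below is a sanity check under simplifying assumptions: a single-feature signal, a hypothetical 5-node graph, and a known linear filter $h(\lambda)=1/(1+2\lambda)$ standing in for the GNN, so that the empirical response can be compared with the true one.

```python
import numpy as np

def empirical_response(U, x_in, x_out, eps=1e-9):
    """Per-frequency ratio (u_m^T x_out) / (u_m^T x_in); NaN where the input component vanishes."""
    c_in, c_out = U.T @ x_in, U.T @ x_out
    h = np.full(c_in.shape, np.nan)
    ok = np.abs(c_in) > eps
    h[ok] = c_out[ok] / c_in[ok]
    return h

# Small undirected graph and its symmetric normalized Laplacian.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
L = np.eye(5) - A / np.sqrt(np.outer(deg, deg))
lam, U = np.linalg.eigh(L)          # eigenvalues (graph frequencies) and eigenvectors

rng = np.random.default_rng(0)
x = rng.normal(size=5)

# Known filter applied in the node domain: x_bar = (I + 2 L)^{-1} x.
x_bar = np.linalg.solve(np.eye(5) + 2.0 * L, x)

h_emp = empirical_response(U, x, x_bar)
h_true = 1.0 / (1.0 + 2.0 * lam)
```

For a linear filter the two responses coincide; for a GNN with non-linearities the same ratio is only an empirical description of how each component is changed.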

The empirical filter response allows us to analyze the type of filtering implemented by an ARMA layer. We start by comparing the recursion in Eq. (8), which converges to an $\text{ARMA}_{1}$ filter with response $\{h_{\text{ARMA}_{1}}(\mu_{m})\}_{m=1}^{M}$ according to Eq. (11), with the empirical response $\{\tilde{h}_{m,k}\}_{m=1}^{M}$ of the $k$-th GCS stack. To facilitate the interpretation of the results, we set the number of output features of the GCS layer to $F_{\text{out}}=1$ by letting ${\mathbf{W}}=a$ and ${\mathbf{V}}=b\boldsymbol{1}_{F_{\text{in}}}$ in Eq. (14). Notice that we are keeping the notation consistent with Eq. (8), where $a$ and $b$ are the parameters of the $\text{ARMA}_{1}$ filter. In the following we consider the graph and the node features from the Cora citation network. We remark that the examples in this section are not related to the results on the semi-supervised node classification task presented in Sec. 6 and any other dataset could have been used instead of Cora.

Fig. 2(a, b) show the empirical responses $\tilde{h}_{1}$ and $\tilde{h}_{2}$ of two different GCS stacks, when varying the number of layers $T$. As $T$ increases, $\tilde{h}_{1}$ and $\tilde{h}_{2}$ become more similar to the analytical responses of the $\text{ARMA}_{1}$ filters, depicted as a black line in the two figures. This supports our claim that $\tilde{h}$ can estimate the unknown response of the GNN filtering operation.
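The convergence with growing $T$ can be reproduced on a toy graph for the linear recursion. The sketch below assumes that Eq. (8) has the form $\bar{\mathbf{x}}^{(t+1)}=a\mathbf{M}\bar{\mathbf{x}}^{(t)}+b\mathbf{x}$ with $\mathbf{M}=\mathbf{I}-\mathbf{L}$ and that the analytical response in Eq. (11) is $b/(1-a\mu_{m})$; both forms are assumptions here, since the equations are not restated in this section. The 8-node ring replaces Cora, and there is no non-linearity.

```python
import numpy as np

# Ring graph with 8 nodes; M_op = I - L is the normalized adjacency matrix.
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
deg = A.sum(axis=1)
M_op = A / np.sqrt(np.outer(deg, deg))
mu, U = np.linalg.eigh(M_op)        # eigenvalues mu_m lie in [-1, 1]

a, b = 0.7, 0.15                    # parameter values quoted for Fig. 2(b) and 2(e)
rng = np.random.default_rng(0)
x = rng.normal(size=n)

def recursion_response(T):
    """Empirical response after T steps of x_bar <- a M x_bar + b x, starting from zero."""
    x_bar = np.zeros_like(x)
    for _ in range(T):
        x_bar = a * (M_op @ x_bar) + b * x
    return (U.T @ x_bar) / (U.T @ x)

h_analytic = b / (1.0 - a * mu)     # assumed analytical ARMA_1 response
errors = {T: float(np.max(np.abs(recursion_response(T) - h_analytic)))
          for T in (1, 5, 50)}
```

In this linear setting the gap to the analytical response shrinks geometrically with $T$, at a rate set by $|a\mu_{m}|$.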

Fig. 2(d, e) show how the two GCS stacks modify the components of ${\mathbf{X}}$ on the Fourier basis. In particular, we depict in black the components ${\mathbf{u}}_{m}^{T}{\mathbf{X}}$ , $m=1,\dots,M$ associated with each graph frequency $\mu_{m}$ . In colors, we depict the components ${\mathbf{u}}_{m}^{T}\bar{\mathbf{X}}$ , which show how much the GCS stacks filter the components associated with each frequency. The responses and the signal components in Fig. 2(a) and 2(d) are obtained for $a=0.99$ and $b=0.1$ , while in Fig. 2(b) and 2(e) for $a=0.7$ and $b=0.15$ .

In Fig. 2(c), we show the empirical response resulting from a stack of GCNs. As also highlighted in recent work (Wu et al., 2019; Maehara, 2019), the filtering obtained by stacking one or more GCNs has the undesired effect of symmetrically amplifying the lowest and also the highest frequencies of the spectrum. This is due to the GCN filter response, which is $(1-\lambda)^{T}$ in the linear case and can assume negative values when $T$ is odd. The effect is mitigated by adding $\gamma{\mathbf{I}}_{M}$ to the adjacency matrix, which introduces self-loops with weight $\gamma$ and shrinks the spectral domain of the graph filter. For high values of $\gamma$, the GCN acts more as a low-pass filter that prevents high-frequency oscillations. This is due to the self-loops that limit the spread of information across the graph and the communication between neighbors. However, even after adding $\gamma{\mathbf{I}}_{M}$, GCN cuts almost completely the medium frequencies and then amplifies again the higher ones, as shown in Fig. 2(f).
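The effect of the self-loops on the spectrum can be illustrated on a small graph. The sketch below assumes the symmetric normalized Laplacian of $\mathbf{A}+\gamma\mathbf{I}$ and uses a 6-node ring, which is bipartite and therefore has $\lambda_{\max}=2$ when $\gamma=0$; the graph and the values of $\gamma$ are illustrative choices.

```python
import numpy as np

def laplacian_spectrum(A, gamma):
    """Eigenvalues of the symmetric normalized Laplacian of A + gamma * I."""
    n = A.shape[0]
    A_g = A + gamma * np.eye(n)
    deg = A_g.sum(axis=1)
    L = np.eye(n) - A_g / np.sqrt(np.outer(deg, deg))
    return np.linalg.eigvalsh(L)

# 6-node ring graph.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

T = 3                                # odd number of stacked linear GCN layers
lam_max, response_at_max = {}, {}
for gamma in (0.0, 1.0, 2.0):
    lam = laplacian_spectrum(A, gamma)
    response = (1.0 - lam) ** T      # linear GCN response (1 - lambda)^T
    lam_max[gamma] = float(lam.max())
    response_at_max[gamma] = float(response[np.argmax(lam)])
```

On this ring the largest eigenvalue equals $4/(2+\gamma)$, so with $\gamma=0$ the highest frequency is passed with magnitude 1 and a negative sign for odd $T$, while larger $\gamma$ pulls the spectrum towards $[0,1]$.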

A stack of GCNs lacks flexibility in implementing different filtering operations, as the only degree of freedom to modify a GCN’s response consists of manually tuning the hyperparameter $\gamma$ to shrink the spectrum. On the other hand, different GCS stacks can generate heterogeneous filter responses, depending on the value of the trainable parameters in each stack. This is what provides powerful modeling capability to the proposed ARMA layer, which can learn a large variety of filter responses that selectively shrink or amplify the Fourier components of the graph by combining $K$ GCS stacks.

Similarly to an $\text{ARMA}_{1}$ filter, each GCS stack behaves as a low-pass filter that gradually dampens the Fourier components as their frequency increases. However, we recall that high-pass and band-pass filters can be obtained as a linear combination of low-pass filters (Oppenheim et al., 2001). To show this behavior in practice, in Fig. 3 we report the empirical filter responses and modified Fourier components obtained with two different $\text{ARMA}_{K}$ filters, for $K=3$.
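A small numerical example shows how a signed combination of two low-pass responses can suppress the lowest frequency. It assumes the $\text{ARMA}_{1}$ response has the form $b/(1-a\mu)$, with $\mu=1$ corresponding to the lowest graph frequency; the parameter values are hypothetical and chosen so that both filters have unit gain at $\mu=1$.

```python
import numpy as np

def arma1_response(mu, a, b):
    """Assumed ARMA_1 frequency response b / (1 - a * mu)."""
    return b / (1.0 - a * mu)

mu = np.linspace(-1.0, 1.0, 201)     # mu = 1 is the lowest frequency, mu = -1 the highest

h_sharp = arma1_response(mu, a=0.9, b=0.1)   # narrow low-pass, gain 1 at mu = 1
h_wide = arma1_response(mu, a=0.5, b=0.5)    # wide low-pass, gain 1 at mu = 1

# Signed linear combination: the difference cancels the lowest frequency.
h_comb = h_wide - h_sharp
peak_mu = float(mu[np.argmax(h_comb)])
```

The combined response is zero at $\mu=1$, peaks at an intermediate value of $\mu$, and stays positive at the high-frequency end, which neither low-pass filter can do on its own.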

6 Experiments

We consider four downstream tasks: node classification, graph signal classification, graph classification, and graph regression. Our experiments focus on comparing the proposed ARMA layer with GNN layers based on polynomial filters, namely Chebyshev (Defferrard et al., 2016) and GCN (Kipf & Welling, 2016a), and CayleyNets (Levie et al., 2019b) that, like ARMA, are based on rational spectral filters. As additional baselines, we also include Graph Attention Networks (GAT) (Velickovic et al., 2017), GraphSAGE (Hamilton et al., 2017), and Graph Isomorphism Networks (GIN) (Xu et al., 2019). The comparison with these methods helps to frame the proposed ARMA GNN within the current state of the art. We also mention that other GNNs with graph convolutional filters related to our method have appeared while our work was under review (Ioannidis et al., 2020; Gama et al., 2019; Zou & Lerman, 2020; Gao et al., 2019).

To ensure a fair and meaningful evaluation, we compare the performance obtained with a fixed GNN architecture, where we change only the graph convolutional layers. In particular, we fixed the GNN capacity (number of hidden units), used the same splits in each dataset, and the same training and evaluation procedures. Finally, in all experiments we used the same polynomial order $K$ for polynomial/rational filters, or a stack of $K$ layers for GCN, GAT, GIN, and GraphSAGE layers. The details of every dataset considered in the experiments and the optimal hyperparameters for each model are deferred to Sec. 7.

Public implementations of the ARMA layer are available in the open-source GNN libraries Spektral (Grattarola & Alippi, 2020) (TensorFlow/Keras) and PyTorch Geometric (Fey & Lenssen, 2019) (PyTorch).

6.1 Node classification

First, we consider transductive node classification on three citation networks: Cora, Citeseer, and Pubmed. The input is a single graph described by an adjacency matrix $\mathbf{A}\in\mathbb{R}^{M\times M}$ , the node features $\mathbf{X}\in\mathbb{R}^{M\times F_{\text{in}}}$ , and the labels $\mathbf{y}_{l}\in\mathbb{R}^{M_{l}}$ of a subset of nodes $M_{l}\subset M$ . The targets are the labels $\mathbf{y}_{u}\in\mathbb{R}^{M_{u}}$ of the unlabelled nodes. The node features are sparse bag-of-words vectors representing text documents. The binary undirected edges in $\mathbf{A}$ indicate citation links between documents. The models are trained using 20 labels per document class ( $\mathbf{y}_{l}$ ) and the performance is evaluated as classification accuracy on $\mathbf{y}_{u}$ .
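The transductive setup amounts to a boolean mask over the nodes of a single graph. The sketch below builds a mask with 20 labeled nodes per class and evaluates accuracy on the remaining nodes. The labels are synthetic, the random selection and helper names are hypothetical, and the experiments in the paper reuse the same fixed splits for every model rather than sampling them.

```python
import numpy as np

def labels_per_class_mask(y, n_per_class, rng):
    """Boolean mask selecting n_per_class labeled nodes for every class in y."""
    mask = np.zeros(y.shape[0], dtype=bool)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        mask[rng.choice(idx, size=n_per_class, replace=False)] = True
    return mask

def masked_accuracy(y_true, y_pred, mask):
    """Classification accuracy restricted to the nodes selected by mask."""
    return float(np.mean(y_true[mask] == y_pred[mask]))

rng = np.random.default_rng(0)
n_nodes, n_classes = 2000, 7
y = rng.integers(0, n_classes, size=n_nodes)   # synthetic node labels

train_mask = labels_per_class_mask(y, 20, rng)  # labeled nodes (y_l)
eval_mask = ~train_mask                         # unlabeled nodes (y_u)

# Placeholder predictions: a real model would be trained with a loss on train_mask only.
y_pred = rng.integers(0, n_classes, size=n_nodes)
acc_unlabeled = masked_accuracy(y, y_pred, eval_mask)
```

All nodes and edges are visible during training; only the loss is restricted to the labeled subset.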

Secondly, we perform inductive node classification on the protein-protein interaction (PPI) network dataset. The dataset consists of 20 graphs used for training, 2 for validation, and 2 for testing. Unlike in the transductive setting, the testing graphs (and the associated node features) are not observed during training. Additionally, each node can belong to more than one class (multi-label classification).

We use a 2-layer GNN with 16 hidden units for the citation networks and 64 units for PPI. In the citation networks high dropout rates and L2-norm regularization are exploited to prevent overfitting. Tab. 6.1 reports the classification accuracy obtained by a GNN equipped with different graph convolutional layers.

Transductive node classification is a semi-supervised task that demands using a simple model with strong regularization to avoid overfitting on the few labels available. This is the key to GCN’s success when compared to more complex filters, such as Chebyshev. Thanks to its flexible formulation, the proposed ARMA layer can implement the right degree of complexity and performs well on each task. On the other hand, since the PPI dataset is larger and more labels are available during training, less regularization is required and the more complex models have an advantage. This is reflected by the better performance achieved by Chebyshev filters and CayleyNets, compared to GCN. On PPI, ARMA significantly outperforms every other model, due to its powerful modeling capability that allows learning filter responses with different shapes. Since each layer in GAT, GraphSAGE, and GIN combines the features of a node only with those from its 1st order neighborhood, similarly to a GCN, these architectures need to stack more layers to reach higher-order neighborhoods and suffer from the same oversmoothing issue.

We notice that the optimal depth $T$ of the ARMA layer reported in Tab. 6 is low in every dataset. We argue that a reason is the small average shortest path in the graphs (see Tab. 7.1). Indeed, most nodes in the graphs can be reached with only a few propagation steps, which is not surprising since many real networks are small-world (Watts & Strogatz, 1998).
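The average shortest path length used in this argument can be computed with a breadth-first search from every node. The sketch below is a minimal version for unweighted graphs given as adjacency lists; the function name and the toy path graph are hypothetical, and unreachable pairs are simply ignored.

```python
from collections import deque

def average_shortest_path(adj):
    """Mean hop distance over all ordered pairs of distinct, mutually reachable nodes."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                       # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1             # exclude the source itself
    return total / pairs if pairs else 0.0

# Path graph 0 - 1 - 2 - 3.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
avg_len = average_shortest_path(path_graph)
```

A value of a few hops suggests that a small number of propagation steps already lets information travel between most pairs of nodes.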

Fig. 4 shows the training times of the GNN model configured with GCN, Chebyshev, CayleyNet, and ARMA layers. The ARMA layer exploits sparse operations that are linear in the number of nodes in ${\mathbf{L}}$ and can be trained in a time comparable to a Chebyshev filter. On the other hand, CayleyNet is slower than other methods, due to the complex formulation based on the Jacobi iterations that results in a high order polynomial.

Bibliography


1. Bacciu et al. (2018). Bacciu, Davide, Errica, Federico, and Micheli, Alessio. Contextual graph Markov model: A deep and generative approach to graph processing. In Proceedings of the 35th International Conference on Machine Learning. ACM, 2018.
2. Battaglia et al. (2018). Battaglia, Peter W, Hamrick, Jessica B, Bapst, Victor, Sanchez-Gonzalez, Alvaro, Zambaldi, Vinicius, Malinowski, Mateusz, Tacchetti, Andrea, Raposo, David, Santoro, Adam, Faulkner, Ryan, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
3. Bianchi et al. (2017). Bianchi, Filippo Maria, Maiorino, Enrico, Kampffmeyer, Michael C, Rizzi, Antonello, and Jenssen, Robert. Recurrent neural networks for short-term load forecasting: an overview and comparative analysis. Springer, 2017.
4. Bronstein et al. (2017). Bronstein, Michael M, Bruna, Joan, LeCun, Yann, Szlam, Arthur, and Vandergheynst, Pierre. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
5. Bruna et al. (2013). Bruna, Joan, Zaremba, Wojciech, Szlam, Arthur, and LeCun, Yann. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
6. Defferrard et al. (2016). Defferrard, Michaël, Bresson, Xavier, and Vandergheynst, Pierre. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.
7. Dhillon et al. (2007). Dhillon, Inderjit S, Guan, Yuqiang, and Kulis, Brian. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1944–1957, 2007.
8. Duvenaud et al. (2015). Duvenaud, David K, Maclaurin, Dougal, Iparraguirre, Jorge, Bombarell, Rafael, Hirzel, Timothy, Aspuru-Guzik, Alán, and Adams, Ryan P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.
