Invariance-Preserving Localized Activation Functions for Graph Neural Networks

Luana Ruiz; Fernando Gama; Antonio G. Marques; Alejandro Ribeiro

arXiv:1903.12575·eess.SP·February 19, 2020

TL;DR

This paper introduces trainable, structure-aware localized activation functions for GNNs that preserve permutation invariance and enhance model capacity across various tasks.

Contribution

It proposes graph median and max filters as nonlinear activation functions that consider graph structure, maintaining invariance and improving GNN performance.

Findings

1. Localized activation functions improve model capacity.
2. Enhanced performance in source localization, authorship attribution, recommendation, and classification.
3. Modified backpropagation enables training of these functions.

Abstract

Graph signals are signals with an irregular structure that can be described by a graph. Graph neural networks (GNNs) are information processing architectures tailored to these graph signals and made of stacked layers that compose graph convolutional filters with nonlinear activation functions. Graph convolutions endow GNNs with invariance to permutations of the graph nodes' labels. In this paper, we consider the design of trainable nonlinear activation functions that take into consideration the structure of the graph. This is accomplished by using graph median filters and graph max filters, which mimic linear graph convolutions and are shown to retain the permutation invariance of GNNs. We also discuss modifications to the backpropagation algorithm necessary to train local activation functions. The advantages of localized activation function architectures are demonstrated in four…

Tables (7)

Table I: Complexity comparison in a single forward pass and a single backward pass of data for architectures with ReLU and localized activation functions.

Activation	Forward pass complexity	Backpropagation complexity
Pointwise	$\mathcal{O}(N)$	$\mathcal{O}(N)$
Local	$\mathcal{O}(NK^{\prime}d^{K^{\prime}}\log(d^{K^{\prime}}))$	$\mathcal{O}(K^{\prime}N)$

Table II: Number of parameters in the convolutional layers of the 3 GNNs simulated in the recommender systems problem: ReLU, 1-hop median and 1-hop max.

Activation	Number of parameters
ReLU	$160$
$K$-hop median/max	$162$

Table III: Average test RMSEs for ReLU, 1-hop max and 1-hop median by user, over 5 data splits, with the number of samples in the training and test sets.

User	RMSE (ReLU)	RMSE (1h-max)	RMSE (1h-med)	Train samples	Test samples
405	$1.4121$	$1.3009$	$1.4135$	$664$	$73$
655	$0.7809$	$0.7384$	$0.7512$	$617$	$68$
13	$1.3492$	$1.4189$	$1.3087$	$573$	$63$
450	$0.9441$	$0.9121$	$0.9315$	$486$	$54$
276	$0.7409$	$0.7651$	$0.7277$	$467$	$51$
5 users	$1.0454$	$1.0271$	$1.0265$	$2805$	$311$
10 users	$1.0261$	$0.9895$	$1.0049$	$4960$	$551$
15 users	$1.0171$	$0.9841$	$0.9949$	$6842$	$760$
20 users	$0.9867$	$0.9684$	$0.9621$	$8617$	$957$

Table IV: Comparison of test RMSEs obtained across different methods for the best user data split: localized activation GNNs, high order user-based linear graph filters [31], multi-graph CNNs [32].

User	Local activation (min.)	[31]	[32]
405	$1.2681$	$1.4620$	$1.2753$
655	$0.6138$	$0.8555$	$0.7284$
13	$1.1659$	$1.5831$	$1.3954$
450	$0.9230$	$1.0341$	$0.8986$
276	$0.6504$	$1.1796$	$0.9559$

Table V: Average test RMSEs for ReLU, 1-hop max and 1-hop median by movie, over 5 data splits, with the number of samples in the training and test sets.

Movie	RMSE (ReLU)	RMSE (1h-max)	RMSE (1h-med)	Train samples	Test samples
Star Wars	$0.9505$	$0.9946$	$0.9269$	$525$	$58$
Contact	$1.1337$	$1.0836$	$1.0855$	$459$	$50$
Fargo	$1.0411$	$1.0994$	$1.0403$	$458$	$50$
Return of the Jedi	$0.9294$	$0.9236$	$0.9577$	$457$	$50$
Liar Liar	$1.1926$	$1.1908$	$1.1988$	$437$	$48$
5 movies	$1.0494$	$1.0584$	$1.0418$	$2333$	$259$
10 movies	$1.1045$	$1.0963$	$1.0974$	$4377$	$486$
15 movies	$1.0800$	$1.0712$	$1.0793$	$6185$	$687$
20 movies	$1.0615$	$1.0520$	$1.0580$	$7845$	$871$

Table VI: Comparison of test RMSEs obtained across different methods: localized activation GNNs, high order movie-based linear graph filters [31], multi-graph CNNs [32].

Movie	Local activation (min.)	[31]	[32]
Star Wars	$0.6823$	$0.7690$	$0.7462$
Contact	$1.0290$	$1.0300$	$0.9746$
Fargo	$0.8518$	$1.0684$	$0.8420$
Return of the Jedi	$0.8402$	$0.8564$	$0.8550$
Liar Liar	$1.1693$	$1.1708$	$1.1697$

Table VII: Cora test classification accuracy for $\mbox{ReLU}_{1}$, $\mbox{ReLU}_{2}$, and $1$-hop and $2$-hop max and median GNNs.

Architecture	ReLU ($L = 1$)	ReLU ($L = 4$)	Max (1h)	Max (2h)	Median (1h)	Median (2h)
Accuracy (%)	$72.4$	$46.5$	$80.5$	$77.7$	$78.8$	$78.4$

Equations

[{\mathbf{A}}]_{ij}=a_{ji},\quad[{\mathbf{N}}]_{ij}=1,\quad\text{for all }(j,i)\in{\mathcal{E}}\ .

[{\mathbf{S}}]_{ij}=s_{ij}\neq 0,\quad\text{if }i=j\text{ or }(j,i)\in{\mathcal{E}}\ .

\mathcal{N}_{i}^{k}:=\left\{j\ :\ \big{[}{\mathbf{S}}^{k}\big{]}_{ij}\neq 0\right\}\ .

{\mathbf{z}}=\sum_{k=0}^{K-1}h_{k}{\mathbf{S}}^{k}{\mathbf{x}}:={\mathbf{h}}*_{\mathbf{S}}{\mathbf{x}}

{\mathbf{u}}_{\ell}^{f}=\sum_{g=1}^{F_{\ell-1}}\big{(}{\mathbf{h}}_{\ell}^{fg}*_{\mathbf{S}}{\mathbf{x}}_{\ell-1}^{g}\big{)}=\sum_{g=1}^{F_{\ell-1}}\sum_{k=0}^{K-1}h_{\ell k}^{fg}{\mathbf{S}}^{k}{\mathbf{x}}_{\ell-1}^{g}\ .

{\mathbf{x}}_{\ell}^{f}=\sigma({\mathbf{u}}_{\ell}^{f})

\Phi({\mathbf{x}};{\mathbf{S}},{\mathcal{H}})={\mathbf{x}}_{L}={\hat{\mathbf{y}}}({\mathbf{x}})\ .

\Phi\Big{(}{\mathbf{P}}^{\mathsf{T}}{\mathbf{x}};{\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}{\mathbf{P}},{\mathcal{H}}\Big{)}={\mathbf{P}}^{\mathsf{T}}\Phi\Big{(}{\mathbf{x}};{\mathbf{S}},{\mathcal{H}}\Big{)}

{\mathbf{z}}^{\prime}=\sum_{k=0}^{K-1}h_{k}{{\mathbf{S}}^{\prime}}^{k}{\mathbf{x}}^{\prime}=\sum_{k=0}^{K-1}h_{k}({\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}{\mathbf{P}})^{k}{\mathbf{P}}^{\mathsf{T}}{\mathbf{x}}\ .

{\mathbf{z}}^{\prime}\ =\ {\mathbf{P}}^{\mathsf{T}}\bigg{(}\sum_{k=0}^{K-1}h_{k}{\mathbf{S}}^{k}{\mathbf{x}}\bigg{)}\ =\ {\mathbf{P}}^{\mathsf{T}}{\mathbf{z}}

\text{med}({\mathcal{X}})=x_{[n\div 2+1]}\ .

\big{[}{\mathbf{z}}\big{]}_{i}=\big{[}\text{med}({\mathbf{S}},{\mathbf{x}})\big{]}_{i}=\text{med}\big{(}\{x_{j}:j\in{\mathcal{N}}_{i}\}\big{)}\ .

{\mathbf{z}}:=\sum_{k=0}^{K}w_{k}\,\text{med}({\mathbf{S}}^{k},{\mathbf{x}})\ .

{\mathbf{x}}_{\ell}^{f}=\sum_{k=0}^{K}w_{\ell k}^{f}\,\text{med}({\mathbf{S}}^{k},{\mathbf{u}}_{\ell}^{f})\ .

\Phi({\mathbf{x}};{\mathbf{S}},{\mathcal{H}},{\mathcal{W}})={\mathbf{x}}_{L}={\hat{\mathbf{y}}}({\mathbf{x}})

\Phi\Big{(}{\mathbf{P}}^{\mathsf{T}}{\mathbf{x}};{\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}{\mathbf{P}},{\mathcal{H}},{\mathcal{W}}\Big{)}={\mathbf{P}}^{\mathsf{T}}\Phi\Big{(}{\mathbf{x}};{\mathbf{S}},{\mathcal{H}},{\mathcal{W}}\Big{)}

{\mathbf{z}}^{\prime}=\sum_{k=0}^{K}\omega_{k}\,\text{med}({{\mathbf{S}}^{\prime}}^{k},{\mathbf{x}}^{\prime})=\sum_{k=0}^{K}\omega_{k}\,\text{med}\big{(}({\mathbf{P}}^{\mathsf{T}})^{k}{\mathbf{S}}^{k}{\mathbf{P}}^{k},{\mathbf{P}}^{\mathsf{T}}{\mathbf{x}}\big{)}

{\mathbf{z}}^{\prime}\ =\ {\mathbf{P}}^{\mathsf{T}}\bigg{(}\sum_{k=0}^{K}\omega_{k}\text{med}({\mathbf{S}}^{k},{\mathbf{x}})\bigg{)}\ =\ {\mathbf{P}}^{\mathsf{T}}{\mathbf{z}}

\text{max}({\mathcal{X}})=x_{[n]}\ .

\big{[}{\mathbf{z}}\big{]}_{i}=\big{[}\text{max}({\mathbf{S}},{\mathbf{x}})\big{]}_{i}=\text{max}\big{(}\{x_{j}:j\in{\mathcal{N}}_{i}\}\big{)}\ .

{\mathbf{z}}:=\sum_{k=0}^{K}w_{k}\,\text{max}({\mathbf{S}}^{k},{\mathbf{x}})\ .

{\mathbf{x}}_{\ell}^{f}=\sum_{k=0}^{K}w_{\ell k}^{f}\,\text{max}({\mathbf{S}}^{k},{\mathbf{u}}_{\ell}^{f})\ .

\Phi({\mathbf{x}};{\mathbf{S}},{\mathcal{H}},{\mathcal{W}})={\mathbf{x}}_{L}={\hat{\mathbf{y}}}({\mathbf{x}})

\Phi\Big{(}{\mathbf{P}}^{\mathsf{T}}{\mathbf{x}};{\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}{\mathbf{P}},{\mathcal{H}},{\mathcal{W}}\Big{)}={\mathbf{P}}^{\mathsf{T}}\Phi\Big{(}{\mathbf{x}};{\mathbf{S}},{\mathcal{H}},{\mathcal{W}}\Big{)}

{\mathbf{z}}^{\prime}=\sum_{k=0}^{K}\omega_{k}\,\text{max}({{\mathbf{S}}^{\prime}}^{k},{\mathbf{x}}^{\prime})={\mathbf{P}}^{\mathsf{T}}{\mathbf{z}}\ .

{\mathbf{u}}^{f}({\mathbf{x}})=\sum_{g=1}^{G}\big{(}{\mathbf{h}}^{fg}*_{\mathbf{S}}{\mathbf{x}}^{g}\big{)}=\sum_{g=1}^{G}\sum_{k=0}^{K-1}h_{k}^{fg}{\mathbf{S}}^{k}{\mathbf{x}}^{g}

{\mathbf{z}}^{f}({\mathbf{x}})=\sum_{k^{\prime}=0}^{K^{\prime}}w_{k^{\prime}}^{f}\,\sigma\big{(}{\mathbf{S}}^{k^{\prime}},{\mathbf{u}}^{f}({\mathbf{x}})\big{)}

{\bar{J}}\big{(}{\mathbf{y}},{\hat{\mathbf{y}}}({\mathbf{x}})\big{)}=\sum_{{\mathcal{T}}}J\big{(}{\mathbf{y}}_{m},{\hat{\mathbf{y}}}({\mathbf{x}}_{m})\big{)}\ .

\frac{\partial{\bar{J}}}{\partial h_{k}^{fg}}

\frac{\partial{\bar{J}}}{\partial w_{k^{\prime}}^{f}}

\frac{\partial{\bar{J}}}{\partial h_{k}^{fg}}=\sum_{{\mathcal{T}}}\frac{\partial J}{\partial{\hat{\mathbf{y}}}}\frac{\partial{\hat{\mathbf{y}}}}{\partial h_{k}^{fg}}\bigg{|}_{{\mathbf{x}}_{m}}=\sum_{{\mathcal{T}}}\frac{\partial J}{\partial{\hat{\mathbf{y}}}^{f}}\frac{\partial{\hat{\mathbf{y}}}^{f}}{\partial h_{k}^{fg}}\bigg{|}_{{\mathbf{x}}_{m}}

Full text

Invariance-Preserving Localized Activation Functions for Graph Neural Networks

Luana Ruiz, Fernando Gama, Antonio G. Marques and Alejandro Ribeiro

The work in this paper was supported by NSF CCF 1717120, ARO W911NF1710438, ARL DCIST CRA W911NF-17-2-0181, ISTC-WAS and Intel DevCloud, and Spanish MINECO grant TEC2016-75361-R. Preliminary results have been submitted for publication at the ICASSP19 conference [1]. L. Ruiz, F. Gama and A. Ribeiro are with the Dept. of Electrical and Systems Eng., Univ. of Pennsylvania. A. G. Marques is with the Dept. of Signal Theory and Comms., King Juan Carlos Univ. Email: {rubruiz,fgama,aribeiro}@seas.upenn.edu, [email protected].

Abstract

Graph signals are signals with an irregular structure that can be described by a graph. Graph neural networks (GNNs) are information processing architectures tailored to these graph signals and made of stacked layers that compose graph convolutional filters with nonlinear activation functions. Graph convolutions endow GNNs with invariance to permutations of the graph nodes’ labels. In this paper, we consider the design of trainable nonlinear activation functions that take into consideration the structure of the graph. This is accomplished by using graph median filters and graph max filters, which mimic linear graph convolutions and are shown to retain the permutation invariance of GNNs. We also discuss modifications to the backpropagation algorithm necessary to train local activation functions. The advantages of localized activation function architectures are demonstrated in four numerical experiments: source localization on synthetic graphs, authorship attribution of 19th century novels, movie recommender systems and scientific article classification. In all cases, localized activation functions are shown to improve model capacity.

Index Terms:

deep learning, convolutional neural networks, graph signal processing, nonlinear graph filters, activation functions, max filters, median filters

I Introduction

With a local structure that repeats itself at every point, images and time signals are characterized by their regularity. However, many problems in contemporary information processing —such as analyzing texts or designing recommender systems— leverage data with support on irregular structures. These types of data are best represented as graph signals, which explains the growing interest in devising architectures capable of exploiting the structural information carried by the graph topology. In particular, graph convolutional neural networks (GNNs) [2, 3, 4, 5, 6, 7, 8] have been at the center of attention in an effort to reproduce the remarkable success that convolutional neural networks (CNNs) achieved in processing images and time signals [9].

CNNs are made out of stacked layers that each comprise two basic operations: a bank of trainable linear convolutional filters and a fixed nonlinear activation function. GNNs retain this basic architecture but replace standard convolutions with graph convolutional filters [7]. Graph convolutional filters are built by reinterpreting a linear graph diffusion operator as a shift. Following this interpretation, a convolution is simply defined as the sum of scaled shift compositions. This allows us to further define linear transforms that are polynomials on the shift (diffusion) operator and which were seen to be proper generalizations of linear time invariant filters [10]. Regarding activation functions, GNNs utilize the same functions as CNNs, namely rectified linear units (ReLUs), sigmoids, and hyperbolic tangents, and, like in CNNs, these functions are either applied pointwise [4, 5] or within local neighborhoods when mixed with nonlinear pooling operations [7, 3, 2].

Since the notion of a neighborhood in an image or a time signal is always the same, there is no reason to adapt the activation function for different tasks. However, the structure of a neighborhood may vary significantly from graph to graph. This motivates the design of activation functions that are adapted to the graph structure. In this paper, we propose not only to adapt the nonlinear activation function to the structure of the graph, but also to make it multiresolution and trainable by assigning linear weights to the value of the nonlinear function at $1,2,\ldots,K$ -hop node neighborhoods. In particular, we introduce two types of local activation functions based on median and maximum graph filters [11, 12]. Median and maximum graph filters are the graph signal processing (GSP) counterparts of the rank filters studied in traditional signal processing [13]. Similarly to linear graph filters, median and maximum filters encode graph structural information, but they do so through an implicit, nonlinear dependence on node neighborhoods of the graph. As such, they can be used to design multiresolution activation functions that learn to assign different weights to different neighborhood resolutions (Section III).

A fundamental consideration in the introduction of trainable activation functions is retaining the permutation invariance of GNNs [14, 15, 16]. Observe that this form of invariance is with respect to a reordering of the nodes’ labels that retains the structure of the graph. This is different from the definition of permutation invariance used in set theory, which requires invariance with respect to all possible reorderings [17, 18]. Indeed, since graph convolutional filters are polynomials on the diffusion operator, they can be readily shown to be invariant to permutations or node relabelings (Proposition 1). This is an important property because it renders the processing of graph signals independent of the choice of node labels and is an effective way of exploiting the internal symmetries of graph signals. Pointwise activation functions are independent of the graph structure and, as such, do not affect permutation invariance. We will show here that nonlinear activation functions based on median and maximum filters, although local, are still invariant to permutations (Propositions 2 and 3).

It is important to mention that the use of trainable activation functions in CNNs [19, 20, 21, 22] and GNNs [23] has been pursued with success but that the existing literature focuses on learning pointwise activation functions. In particular, [19] uses a sigmoid function that is exponentiated by a trainable parameter to change the activation function slope in feedforward neural networks. In [20], the ReLU activation is paired with a regularizer whose importance is controlled by a trainable linear weight variable. Perhaps the best known example is that of maxout units [21, 22], which replace pointwise max functions by the maximum among a number of filters at a given layer. In the specific case of activation functions for GNNs, the work in [23] proposes learning a pointwise activation function parametrized by a dictionary. All of these papers differ from the goal of our paper, which is to design activation functions that learn to assign different weights to different neighborhood resolutions.

This paper is organized as follows. Section II describes a basic GNN architecture and lays the ground for the localized activation functions introduced in Section III. In Sections III-A and III-B respectively, we define a class of linearly parametrized multiresolution median and max graph filters, which are further interpreted as permutation invariant localized activation functions whose linear parameters can be trained to learn appropriate localized activation functions for GNNs. In Section IV-A, we address backpropagation training of the multi-hop median and multi-hop max operators. We conclude Section IV with considerations on the additional computational complexity incurred by the localized activation functions we propose (Section IV-B). In Section V, the performance of our localized activation functions is evaluated on both synthetic and real-world datasets, including the problems of authorship attribution, movie recommendation systems and scientific article classification. Concluding remarks are presented in Section VI.

Notation. Bold uppercase ${\mathbf{A}}$ is a matrix and bold lowercase ${\mathbf{x}}$ is a vector. We use $[{\mathbf{A}}]_{ij}$ to denote the $(i,j)$ entry of ${\mathbf{A}}$ and $[{\mathbf{x}}]_{i}$ for the $i$th entry of ${\mathbf{x}}$. The transposes of vectors and matrices are written as ${\mathbf{x}}^{\mathsf{T}}$ and ${\mathbf{A}}^{\mathsf{T}}$, respectively. The operation $\text{diag}({\mathbf{A}})$ is a column vector with entries $[\text{diag}({\mathbf{A}})]_{i}=[{\mathbf{A}}]_{ii}$. The operation $\text{diag}({\mathbf{x}})$ yields a diagonal matrix with $[\text{diag}({\mathbf{x}})]_{ii}=[{\mathbf{x}}]_{i}$. Calligraphic uppercase ${\mathcal{T}}$ is a set with cardinality $|{\mathcal{T}}|$.

II Convolutional Processing of Graph Signals

We consider graph signals ${\mathbf{x}}=[x_{1},\ldots,x_{N}]^{\mathsf{T}}\in{\mathbb{R}}^{N}$ in which the component $x_{i}$ is associated with the $i$ th node of a weighted and directed graph ${\mathcal{G}}$ . The graph is composed of a vertex set ${\mathcal{V}}=\{1,\dots,N\}$ , an edge set ${\mathcal{E}}\subseteq{\mathcal{V}}\times{\mathcal{V}}$ of ordered pairs $(i,j)$ and a weight function ${\mathcal{A}}:{\mathcal{E}}\to{\mathbb{R}}$ taking values ${\mathcal{A}}(i,j)=a_{ij}$ . The presence of the edge $(i,j)$ in the set ${\mathcal{E}}$ is interpreted as an expectation that signal components $i$ and $j$ are close. The weight $a_{ij}$ measures the expected similarity between nodes, i.e., the larger $a_{ij}$ , the more related components $x_{i}$ and $x_{j}$ should be. Associated with the graph ${\mathcal{G}}$ , we also define the weighted adjacency matrix ${\mathbf{A}}\in{\mathbb{R}}^{N\times N}$ and the unweighted adjacency matrix ${\mathbf{N}}\in\{0,1\}^{N\times N}$ . Both of these matrices have a sparsity pattern that matches that of the edge set of the graph by taking values $[{\mathbf{A}}]_{ij}=[{\mathbf{N}}]_{ij}=0$ for all $(j,i)\notin{\mathcal{E}}$ . At entries corresponding to edges of ${\mathcal{G}}$ , we have

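$$[{\mathbf{A}}]_{ij}=a_{ji},\quad[{\mathbf{N}}]_{ij}=1,\quad\text{for all }(j,i)\in{\mathcal{E}}\ .\tag{1}$$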

A graph shift operator associated with ${\mathcal{G}}$ is a sparse matrix ${\mathbf{S}}$ whose nonzero entries are either in the diagonal or at entries that match an edge of the graph,

$$[{\mathbf{S}}]_{ij}=s_{ij}\neq 0,\quad\text{if }i=j\text{ or }(j,i)\in{\mathcal{E}}\ .\tag{2}$$

The nonzero entry pattern in (2) is meant to abstract the properties that are shared between different matrix representations. The weighted adjacency ${\mathbf{A}}$ and the unweighted adjacency ${\mathbf{N}}$ defined in (1) satisfy the restrictions placed by (2) on the shift operator ${\mathbf{S}}$ . Other acceptable choices are the corresponding Laplacians ${\mathbf{L}}_{\mathbf{A}}:=\text{diag}({\mathbf{A}}{\mathbf{1}})-{\mathbf{A}}$ and ${\mathbf{L}}_{\mathbf{N}}:=\text{diag}({\mathbf{N}}{\mathbf{1}})-{\mathbf{N}}$ as well as the self loop adjacency ${\mathbf{M}}={\mathbf{I}}+{\mathbf{N}}$ . Using ${\mathbf{S}}$ as a stand-in for an arbitrary matrix representation of the graph avoids restricting attention to a particular selection. This is useful because while different matrix representations are of interest in different contexts, they can all be leveraged in a similar manner to process graph signals ${\mathbf{x}}$ using graph convolutions as we explain in Section II-A.
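As an illustration of the matrix representations listed above, the following minimal sketch builds ${\mathbf{A}}$, ${\mathbf{N}}$, the two Laplacians and ${\mathbf{M}}$ from an edge list. The function name `shift_operators`, the dictionary-based edge format and the 3-node example graph are assumptions made for this example only; they do not come from the paper.

```python
def shift_operators(n, edges):
    """Build common graph shift operators from a weighted, directed edge list.

    `edges` maps an ordered pair (j, i) to its weight a_ji; following (1),
    that weight is stored at entry [A]_{ij}. Nodes are indexed from 0.
    """
    A = [[0.0] * n for _ in range(n)]  # weighted adjacency
    N = [[0] * n for _ in range(n)]    # unweighted adjacency
    for (j, i), w in edges.items():
        A[i][j] = w
        N[i][j] = 1

    def laplacian(W):
        # L = diag(W 1) - W, where W 1 is the vector of row sums of W.
        deg = [sum(row) for row in W]
        return [[(deg[r] if r == c else 0) - W[r][c] for c in range(n)]
                for r in range(n)]

    # Self-loop adjacency M = I + N.
    M = [[N[r][c] + (1 if r == c else 0) for c in range(n)] for r in range(n)]
    return {"A": A, "N": N, "L_A": laplacian(A), "L_N": laplacian(N), "M": M}


# Hypothetical 3-node directed cycle 0 -> 1 -> 2 -> 0 with arbitrary weights.
ops = shift_operators(3, {(0, 1): 2.0, (1, 2): 0.5, (2, 0): 1.0})
```

Every matrix returned here is nonzero only on the diagonal or at entries matching an edge, so any of them can play the role of ${\mathbf{S}}$.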

The graph shift operator induces neighborhoods in the graph. The 1-hop neighborhood of node $i$ is the set of nodes ${\mathcal{N}}_{i}=\mathcal{N}_{i}^{1}:=\{j:(j,i)\in{\mathcal{E}}\}$ that can be reached from $i$ by taking a single hop along an edge $(j,i)$ . More generically, the $k$ -hop neighborhood is the set of nodes that can be reached in exactly $k$ hops. This set is easily determined from the nonzero elements of the $k$ th power of the unweighted adjacency matrix ${\mathbf{S}}={\mathbf{N}}$ ,

$$\mathcal{N}_{i}^{k}:=\left\{j\ :\ \big{[}{\mathbf{S}}^{k}\big{]}_{ij}\neq 0\right\}\ .\tag{3}$$

Observe that, consistent with (3), the $0$-hop neighborhood is the node itself since for $k=0$ we have ${\mathbf{S}}^{k}={\mathbf{S}}^{0}={\mathbf{I}}$. In general, ${\mathcal{N}}_{i}^{l}\nsubseteq{\mathcal{N}}_{i}^{k}$ for $l<k$ since a node may be reachable in exactly $l$ hops but not reachable in exactly $k>l$ hops. A sufficient condition for having ${\mathcal{N}}_{i}^{l}\subseteq{\mathcal{N}}_{i}^{k}$ for $l<k$ is that the shift operator be nonnegative with a full diagonal. An example of a shift operator with this property is the self loop unweighted adjacency matrix ${\mathbf{M}}={\mathbf{I}}+{\mathbf{N}}$; see Remark 2. Further note that we can think of ${\mathbf{S}}^{k}$ itself as the shift operator of a graph. With this interpretation, the $k$-hop neighborhood of ${\mathbf{S}}$ is equivalent to the 1-hop neighborhood of ${\mathbf{S}}^{k}$. We will exploit this fact to simplify definitions in Section III.
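The neighborhood definition can be checked numerically. The sketch below, which assumes a small undirected path graph chosen only for illustration, reads $k$-hop neighborhoods off the nonzero pattern of ${\mathbf{S}}^{k}$ and contrasts ${\mathbf{S}}={\mathbf{N}}$ (neighborhoods are not nested) with ${\mathbf{S}}={\mathbf{M}}={\mathbf{I}}+{\mathbf{N}}$ (neighborhoods are nested). The helper names are hypothetical.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(S, k):
    """Return S^k, with S^0 equal to the identity."""
    n = len(S)
    P = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        P = mat_mul(P, S)
    return P


def k_hop_neighborhood(S, i, k):
    """Return {j : [S^k]_{ij} != 0}; nodes are indexed from 0."""
    return {j for j, v in enumerate(mat_pow(S, k)[i]) if v != 0}


# Hypothetical undirected path graph 0 - 1 - 2 - 3.
N = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
M = [[N[i][j] + (1 if i == j else 0) for j in range(4)] for i in range(4)]

one_hop_N = k_hop_neighborhood(N, 0, 1)   # exactly one hop from node 0
two_hop_N = k_hop_neighborhood(N, 0, 2)   # exactly two hops from node 0
one_hop_M = k_hop_neighborhood(M, 0, 1)
two_hop_M = k_hop_neighborhood(M, 0, 2)
```

With ${\mathbf{S}}={\mathbf{N}}$, node 1 is in the 1-hop neighborhood of node 0 but not in its 2-hop neighborhood, while with ${\mathbf{S}}={\mathbf{M}}$ the 1-hop neighborhood is contained in the 2-hop one.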

II-A Graph Convolutions

A graph convolution is defined as a linear operator ${\mathbf{H}}$ that can be written as a polynomial in the shift operator ${\mathbf{S}}$ [24, 25, 26, 27]. Formally, for a given vector of coefficients ${\mathbf{h}}=[h_{0},\ldots,h_{K-1}]\in{\mathbb{R}}^{K}$ and a graph signal ${\mathbf{x}}$, the graph convolution of ${\mathbf{h}}$ and ${\mathbf{x}}$ is an operation whose outcome is the graph signal

$${\mathbf{z}}=\sum_{k=0}^{K-1}h_{k}{\mathbf{S}}^{k}{\mathbf{x}}:={\mathbf{h}}*_{\mathbf{S}}{\mathbf{x}}\tag{4}$$

where we have defined the graph convolution operator $*_{\mathbf{S}}$ to represent the linear transformation in (4). The graph convolution in (4) shares the localization properties of regular convolutions since each of the terms in the polynomial performs operations that are localized to a specific neighborhood. Indeed, it is straightforward to see that we can have $[{\mathbf{S}}^{k}]_{ij}\neq 0$ only when $j$ is in the $k$ -hop neighborhood of $i$ . Consequently, the $i$ th entry of the product ${\mathbf{S}}^{k}{\mathbf{x}}$ is only affected by the entries $x_{j}$ for which $j\in\mathcal{N}_{i}^{k}$ . We can then think of the first polynomial term $h_{0}{\mathbf{S}}^{0}{\mathbf{x}}$ as a nodewise operation, the second polynomial term $h_{1}{\mathbf{S}}^{1}{\mathbf{x}}$ as a 1-hop neighborhood operation, and, in general, the $k$ th polynomial term as an operation localized to $(k-1)$ -hop neighborhoods. This is akin to regular convolutional filters of order $K$ extending to no more than $K-1$ points in time. This property makes the graph convolution in (4) a natural choice for the extension of convolutional neural networks to signals supported on graphs, as we discuss in the following section.
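A minimal sketch of the graph convolution in (4) follows. It computes the terms ${\mathbf{S}}^{k}{\mathbf{x}}$ by repeated application of the shift, so no matrix power is ever formed. The function names and the 5-node path graph are assumptions for illustration.

```python
def shift(S, x):
    """Apply the shift operator once: return S x."""
    return [sum(S[i][j] * x[j] for j in range(len(x))) for i in range(len(S))]


def graph_conv(h, S, x):
    """Graph convolution z = sum_k h[k] S^k x for taps h = [h_0, ..., h_{K-1}]."""
    z = [0.0] * len(x)
    xk = list(x)                      # holds S^k x, starting at k = 0
    for hk in h:
        z = [zi + hk * xi for zi, xi in zip(z, xk)]
        xk = shift(S, xk)             # advance to S^{k+1} x
    return z


# Hypothetical undirected path graph 0 - 1 - 2 - 3 - 4 with S = N.
S = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]
x = [1, 0, 0, 0, 0]                   # impulse at node 0
z = graph_conv([1, 1, 1], S, x)       # K = 3 taps
```

With $K=3$ taps the impulse at node 0 spreads only to nodes within $K-1=2$ hops, so the entries of `z` at nodes 3 and 4 remain zero, which is the localization property described above.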

Remark 1.

We point out that (4) is also referred to as a linear shift invariant (LSI) filter because of its invariance with respect to the application of the shift operator. Namely, if the input ${\mathbf{x}}$ is replaced by the (shifted) input ${\mathbf{S}}{\mathbf{x}}$ the output shifts from ${\mathbf{z}}$ to ${\mathbf{S}}{\mathbf{z}}$. This is analogous to (convolutional) linear time invariant (LTI) filters in which a time shift of the input produces a time shift at the output. LSI filters also admit spectral representations in terms of graph Fourier transforms that are analogous to spectral representations of time invariant filters [24]. The connections between LSI and LTI filters are deeper than simple analogies. Regular convolutions and regular Fourier transforms can be recovered if the graph shift operator is particularized to a cycle graph representing a periodic time axis.
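Both claims in Remark 1 can be verified on small examples. The sketch below assumes integer-valued signals and taps so that comparisons are exact; the weighted graph, the signals and the helper `circular_conv` are hypothetical choices made for this check.

```python
def shift(S, x):
    """Apply the shift operator once: return S x."""
    return [sum(S[i][j] * x[j] for j in range(len(x))) for i in range(len(S))]


def graph_conv(h, S, x):
    """Graph convolution z = sum_k h[k] S^k x."""
    z = [0.0] * len(x)
    xk = list(x)
    for hk in h:
        z = [zi + hk * xi for zi, xi in zip(z, xk)]
        xk = shift(S, xk)
    return z


def circular_conv(h, x):
    """Ordinary circular convolution on a periodic time axis."""
    n = len(x)
    return [sum(h[k] * x[(i - k) % n] for k in range(len(h))) for i in range(n)]


h = [1, -2, 3]

# Shift invariance on an arbitrary weighted, directed graph.
S = [[0, 2, 0],
     [1, 0, 3],
     [0, 1, 1]]
x = [1, -1, 2]
shifted_input_output = graph_conv(h, S, shift(S, x))    # filter applied to S x
shifted_output = shift(S, graph_conv(h, S, x))          # S applied to z

# Directed cycle graph: node i receives from node i - 1 (mod n).
n = 5
C = [[1 if j == (i - 1) % n else 0 for j in range(n)] for i in range(n)]
t = [1, 2, 3, 4, 5]
on_cycle = graph_conv(h, C, t)
in_time = circular_conv(h, t)
```

The first comparison holds because ${\mathbf{S}}$ commutes with any polynomial in ${\mathbf{S}}$; the second holds because $[{\mathbf{S}}^{k}{\mathbf{x}}]_{i}=x_{(i-k)\bmod n}$ on the directed cycle.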

II-B Graph Neural Networks

Consider a training set ${\mathcal{T}}=\{({\mathbf{x}}_{m},{\mathbf{y}}_{m})\}$ comprised of $|{\mathcal{T}}|$ input-output samples $({\mathbf{x}}_{m},{\mathbf{y}}_{m})$ . In each training example, the vector ${\mathbf{x}}_{m}$ is a graph signal and the vector ${\mathbf{y}}_{m}$ is some observed output whose shape depends on the problem at hand. For instance, ${\mathbf{y}}_{m}$ might be a class label in a classification problem, or another graph signal in the context of regression. The objective of learning is to find a representation of the training set that can produce output estimates ${\hat{\mathbf{y}}}({\mathbf{x}})$ for unknown inputs ${\mathbf{x}}\notin{\mathcal{T}}$ . Graph Neural Networks (GNNs) do so by composing computational layers that are themselves the composition of two distinct operations, namely: (i) a filter bank of graph convolutions, and (ii) a pointwise nonlinear activation function [5]. Formally, each layer $\ell$ takes at its input a set of $F_{\ell-1}$ features ${\mathbf{x}}_{\ell-1}^{f}$ , where the ${\mathbf{x}}_{\ell-1}^{f}$ are signals supported on ${\mathcal{G}}$ to which we attach the graph shift operator ${\mathbf{S}}$ . Each feature is processed by a separate set of filter banks, and each filter bank consists of $F_{\ell}$ graph convolutional filters as described by equation (4). Denoting the filters’ coefficients by ${\mathbf{h}}_{\ell}^{fg}\in{\mathbb{R}}^{K}$ , we can write the resulting $F_{\ell}$ convolutional features ${\mathbf{u}}_{\ell}^{f}$ as

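$${\mathbf{u}}_{\ell}^{f}=\sum_{g=1}^{F_{\ell-1}}\big{(}{\mathbf{h}}_{\ell}^{fg}*_{\mathbf{S}}{\mathbf{x}}_{\ell-1}^{g}\big{)}=\sum_{g=1}^{F_{\ell-1}}\sum_{k=0}^{K-1}h_{\ell k}^{fg}{\mathbf{S}}^{k}{\mathbf{x}}_{\ell-1}^{g}\ .\tag{5}$$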

The convolutional features ${\mathbf{u}}_{\ell}^{f}$ are then fed into a scalar and nonlinear activation function to produce the $\ell$ th layer’s output features,

$${\mathbf{x}}_{\ell}^{f}=\sigma({\mathbf{u}}_{\ell}^{f})\tag{6}$$

which will then act as the subsequent layer’s input features.

Beginning with ${\mathbf{x}}_{0}={\mathbf{x}}$ as the first layer’s ( $\ell=1$ ) input, we proceed recursively through (5)-(6) until the last layer. In a system with $L$ layers, the GNN output is the collection of features ${\mathbf{x}}_{L}=[({\mathbf{x}}_{L}^{1})^{\mathsf{T}},\ldots,({\mathbf{x}}_{L}^{F_{L}})^{\mathsf{T}}]^{\mathsf{T}}$ , and represents the output estimate ${\hat{\mathbf{y}}}({\mathbf{x}})={\mathbf{x}}_{L}$ . For future reference, we define the set ${\mathcal{H}}=\{{\mathbf{h}}_{\ell}^{fg}\}_{\ell,f,g}$ grouping all the graph filter coefficients of the model to write the GNN as a mapping $\Phi:{\mathbf{x}}_{0}\mapsto{\mathbf{x}}_{L}$ parametrized by ${\mathbf{S}}$ and by the coefficient set ${\mathcal{H}}$ :

$$\Phi({\mathbf{x}};{\mathbf{S}},{\mathcal{H}})={\mathbf{x}}_{L}={\hat{\mathbf{y}}}({\mathbf{x}})\ .\tag{7}$$

In (7), ${\mathbf{S}}$ is given, while ${\mathcal{H}}$ is determined by optimizing a loss function ${\bar{J}}=\sum_{{\mathcal{T}}}J[{\mathbf{y}}_{m},{\hat{\mathbf{y}}}({\mathbf{x}}_{m})]=\sum_{{\mathcal{T}}}J[{\mathbf{y}}_{m},\Phi({\mathbf{x}}_{m};{\mathbf{S}},{\mathcal{H}})]$ on the training set ${\mathcal{T}}$.
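The recursion through (5)-(6) can be sketched as a short forward pass. The code below assumes a ReLU for the pointwise activation $\sigma$ and stores the taps of layer $\ell$ as a nested list `H[f][g]`; this data layout, the function names and the toy 3-node graph are assumptions for illustration, and no training is performed.

```python
def shift(S, x):
    """Apply the shift operator once: return S x."""
    return [sum(S[i][j] * x[j] for j in range(len(x))) for i in range(len(S))]


def graph_conv(h, S, x):
    """Graph convolution z = sum_k h[k] S^k x, as in (4)."""
    z = [0.0] * len(x)
    xk = list(x)
    for hk in h:
        z = [zi + hk * xi for zi, xi in zip(z, xk)]
        xk = shift(S, xk)
    return z


def gnn_layer(H, S, X):
    """One layer: H[f][g] holds the K taps mapping input feature g to output f."""
    out = []
    for taps_f in H:
        u = [0.0] * len(S)
        for h_fg, x_g in zip(taps_f, X):          # sum over input features, (5)
            u = [a + b for a, b in zip(u, graph_conv(h_fg, S, x_g))]
        out.append([max(0.0, v) for v in u])      # pointwise ReLU, (6)
    return out


def gnn(layers, S, x):
    """Map x_0 = x to the last layer's features x_L, as in (7)."""
    X = [x]
    for H in layers:
        X = gnn_layer(H, S, X)
    return X


# Hypothetical path graph 0 - 1 - 2 and hand-picked (untrained) taps.
S = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
layers = [
    [[[1, 1]], [[1, -1]]],     # layer 1: F_0 = 1 input, F_1 = 2 outputs, K = 2
    [[[1, 0], [0, 1]]],        # layer 2: F_1 = 2 inputs, F_2 = 1 output, K = 2
]
y_hat = gnn(layers, S, [1, -1, 2])
```

Note that the graph enters only through `shift`, and the activation never looks at neighboring nodes, which is the limitation the localized activation functions of Section III address.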

GNNs are particular cases of neural networks (NNs) and generalizations of convolutional neural networks (CNNs). NNs are obtained through the composition of arbitrary linear operations with pointwise activation functions. In practice, they are hard to train and underperform architectures exploiting structural information because the number of parameters to learn is too large. CNNs resolve this problem for time signals, images, and other signals supported on regular domains by restricting arbitrary linear transformations to convolutional filter banks [28]. In graph domains, GNNs fulfill the same purpose by utilizing graph convolutions [cf. (4) and (5)].

The pointwise nonlinear operators typically used in GNNs, however, neglect the graph structure. They process nodes homogeneously, thus ignoring the different compositions of their heterogeneous neighborhoods. This is not unreasonable for the regular signals that are processed with CNNs because their underlying regular structures repeat themselves from node to node and layer to layer. On a generic irregular graph, however, the individual node connectivity profiles are important because they explain how each node interacts with the other nodes in the graph. In this context, summarizing nonlinear operations acting on node neighborhoods instead of individual nodes would exploit meaningful information that the mere application of a graph filter followed by a pointwise nonlinearity does not. Preliminary evidence for this ability comes from the better reconstruction properties of median (nonlinear) graph filters and the improvement in topology identification stemming from the use of nonlinear structural models [29]. The goal of this paper is to design nonlinear local activation functions for GNNs that leverage the neighborhood structure of the graph.

III Invariance-Preserving Local Activation Functions

In pursuing nonlinear activation functions that exploit the graph structure it is important to begin by understanding why graph convolutions are suitable for processing graph signals. A partial answer to this question is the fact that processing signals with a GNN is independent of the graph labeling, as we formally state and prove in the following proposition.

Proposition 1.

Consider a graph signal ${\mathbf{x}}$ supported on a graph ${\mathcal{G}}$ with shift operator ${\mathbf{S}}$ . Let $\Phi({\mathbf{x}};{\mathbf{S}},{\mathcal{H}})$ be the output of a graph neural network (GNN) with coefficient set ${\mathcal{H}}$ [cf. (5)-(7)]. If ${\mathbf{P}}$ is a permutation matrix, then

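$$\Phi\Big{(}{\mathbf{P}}^{\mathsf{T}}{\mathbf{x}};{\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}{\mathbf{P}},{\mathcal{H}}\Big{)}={\mathbf{P}}^{\mathsf{T}}\Phi\Big{(}{\mathbf{x}};{\mathbf{S}},{\mathcal{H}}\Big{)}\tag{8}$$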

i.e., the output of the GNN is invariant to permutations.

Proof.

The result follows from the fact that the LSI filter in (4) is invariant to permutations. To see this, consider the graph permutation ${\mathbf{S}}^{\prime}={\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}{\mathbf{P}}$ and let ${\mathbf{x}}^{\prime}={\mathbf{P}}^{\mathsf{T}}{\mathbf{x}}$ be a signal on the permuted graph, with ${\mathbf{x}}$ its non-permuted counterpart. The application of (4) to ${\mathbf{x}}^{\prime}$ yields

[TABLE]

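Using the fact that permutation matrices are orthogonal, i.e., ${\mathbf{P}}{\mathbf{P}}^{\mathsf{T}}={\mathbf{I}}$, so that $({\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}{\mathbf{P}})^{k}={\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}^{k}{\mathbf{P}}$ for every $k$, we get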

[TABLE]

which is simply the permutation by ${\mathbf{P}}$ of the output of filter (4) applied to ${\mathbf{x}}$ . As for activation functions, the fact that they are scalar automatically entails invariance to permutations. Thus, if the input to a GNN layer is permuted, its output will be permuted likewise. This is also true in architectures with multiple layers, where permutations will cascade from one layer to the other until they reach the output. ∎
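The filter-level identity used in this proof can be checked numerically. The following is a minimal sketch, assuming that the LSI filter in (4) has the polynomial form $\sum_{k}h_{k}{\mathbf{S}}^{k}{\mathbf{x}}$; the function name `lsi_filter`, the random symmetric shift operator and the coefficient values are illustrative choices, not taken from the paper.

```python
import numpy as np

def lsi_filter(S, x, h):
    """Apply z = sum_k h[k] S^k x by repeated shifts (assumed form of the LSI filter)."""
    z = np.zeros(len(x))
    Skx = np.asarray(x, dtype=float).copy()
    for hk in h:
        z += hk * Skx
        Skx = S @ Skx          # next power of S applied to x
    return z

rng = np.random.default_rng(0)
N = 6
A = rng.random((N, N))
S = (A + A.T) / 2                      # illustrative symmetric shift operator
x = rng.standard_normal(N)
h = np.array([0.5, -1.0, 0.25])        # illustrative filter taps
P = np.eye(N)[rng.permutation(N)]      # random permutation matrix

# Filter the permuted signal on the permuted graph ...
lhs = lsi_filter(P.T @ S @ P, P.T @ x, h)
# ... and compare with the permuted output of the original filter.
rhs = P.T @ lsi_filter(S, x, h)
perm_gap = float(np.max(np.abs(lhs - rhs)))
```

Under these assumptions `perm_gap` should be at the level of floating-point rounding, which is the numerical counterpart of the statement that the filter output is permuted exactly as its input.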

The permutation invariance stated in Proposition 1 shows that the features that are learned by a GNN are independent of the labeling of the graph. But permutation invariance is also important because it means that GNNs exploit internal signal symmetries, as we illustrate in Figure 1. The graphs in Figure 1-(a) and Figure 1-(b) are the same, as indicated by the integer labels. The signals in Figure 1-(a) and Figure 1-(b) are different, as indicated by different colors. However, it is possible to permute the graph onto itself to make the signals match: rotate it by $180^{\circ}$ and pull it inside out (Figure 1-(c)). It then follows from Proposition 1 that the output of a GNN applied to the signal on the left (a) is a corresponding permutation of the output of the same GNN applied to the signal on the right (b). This is beneficial because we can learn to process the signal in (a) from seeing examples of the signal in (b). Although most graphs do not have perfect symmetries, existing perturbation analyses show that similar comments hold for transformations that are close to permutations [15]. In the design of nonlinear, localized activation functions, our goal is to retain this property by making localized activation functions permutation invariant.

III-A Median GNNs

Consider a set ${\mathcal{X}}=\{x_{1},\ldots,x_{n}\}$ and order the elements of ${\mathcal{X}}$ so that $x_{[1]}\leq x_{[2]}\leq\ldots\leq x_{[n]}$ . If we denote by $n\div 2$ the integer division of the cardinality of ${\mathcal{X}}$ by 2, the median of the set ${\mathcal{X}}$ is given by

[TABLE]

In order to define activation functions in terms of median graph filters, we begin by defining the median operator associated with a graph shift operator ${\mathbf{S}}$ .

Definition 1 (Median operator).

Let ${\mathbf{S}}$ be a graph shift operator and denote by ${\mathcal{N}}_{i}$ the neighborhood induced by ${\mathbf{S}}$ at node $i$, with $1\leq i\leq N$. The output of the median operator $\text{med}({\mathbf{S}},\cdot)$ applied to the graph signal ${\mathbf{x}}$ is the graph signal ${\mathbf{z}}:=\text{med}({\mathbf{S}},{\mathbf{x}})$, whose components we write

$$[{\mathbf{z}}]_{i}=\text{med}\big(\{x_{j}:j\in{\mathcal{N}}_{i}\}\big) \tag{10}$$

Definition 1 implies that the median operator replaces the value of the analyzed signal ${\mathbf{x}}$ at each node by the median of the values of ${\mathbf{x}}$ in the corresponding neighborhood. In that sense, we can think of $\text{med}({\mathbf{S}},\cdot)$ as a nonlinear diffusion operator in which, instead of computing the average of neighboring values at each node, we compute their median. These two diffusions tend to yield similar values if neighboring values are symmetric around their mean.
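To make the contrast between the two diffusions concrete, here is a small sketch of the median operator next to an unweighted neighborhood average. It assumes that the neighborhood of node $i$ is the support of row $i$ of ${\mathbf{S}}$, and it relies on `numpy.median`, which averages the two central values of an even-sized set; this may differ from the convention fixed by the paper's definition of the median. The helper names and the four-node path graph are illustrative.

```python
import numpy as np

def graph_median(S, x):
    """Neighborhood median: [z]_i = med{x_j : [S]_ij != 0} (assumed neighborhood rule)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.median(x[np.flatnonzero(row)]) for row in S])

def graph_mean(S, x):
    """Unweighted average over the same neighborhoods, for comparison only."""
    x = np.asarray(x, dtype=float)
    return np.array([np.mean(x[np.flatnonzero(row)]) for row in S])

# Path graph 0-1-2-3 with self-loops, so each node belongs to its own neighborhood.
A = np.diag(np.ones(3), 1)
S = np.eye(4) + A + A.T

x_sym = np.array([1.0, 2.0, 3.0, 4.0])     # neighbor values symmetric around their mean
x_out = np.array([1.0, 2.0, 100.0, 4.0])   # same signal with an outlier at node 2

med_sym, mean_sym = graph_median(S, x_sym), graph_mean(S, x_sym)
med_out, mean_out = graph_median(S, x_out), graph_mean(S, x_out)
# med_out is [1.5, 2.0, 4.0, 52.0]; mean_out[1] is 103/3, pulled up by the outlier
```

On the symmetric signal the two operators coincide, while on the signal with an outlier the median at node 1 stays at 2 and the average jumps to roughly 34.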

From the definition of the median operator, we can now define what we call the multiresolution median graph filter as a linear combination of medians associated with neighborhoods of different depths. We write it as follows.

Definition 2 (Multiresolution median graph filter).

Given a graph shift operator ${\mathbf{S}}$ and a vector of $K+1$ filter coefficients ${\mathbf{w}}=[w_{0},\ldots,w_{K}]^{\mathsf{T}}$ , the output of the median graph filter with coefficients ${\mathbf{w}}$ applied to the signal ${\mathbf{x}}$ is the signal ${\mathbf{z}}$ ,

$${\mathbf{z}}=\sum_{k=0}^{K}w_{k}\,\text{med}({\mathbf{S}}^{k},{\mathbf{x}}) \tag{11}$$

This definition is to be contrasted with that of the linear graph convolution in (4). In (4), each summand ${\mathbf{S}}^{k}{\mathbf{x}}$ represents a weighted average, so that the value of the signal at node $i$ is affected by the values of the signals at its $k$ -hop neighbors. In (11), the summand $\text{med}({\mathbf{S}}^{k},{\mathbf{x}})$ summarizes information from the same set, because the shift operator ${\mathbf{S}}^{k}$ is associated with a graph in which nodes are connected to their $k$ -hop neighbors. Differently from regular linear filters, however, the signal values at neighboring nodes are not averaged, but summarized by the median operation. We can thus think of (11) as a nonlinear convolutional filter constructed from median operations that retains the multiresolution aspect of conventional linear graph convolutions.
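A possible implementation of the multiresolution median graph filter is sketched below. It assumes a shift operator with nonnegative entries, so that the support of ${\mathbf{S}}^{k}$ identifies the nodes reachable in $k$ steps, and it takes ${\mathbf{S}}^{0}={\mathbf{I}}$, so that the $k=0$ term returns the signal itself. The function names and the numerical values are hypothetical.

```python
import numpy as np

def graph_median(S, x):
    """[z]_i = median of x over the support of row i of S."""
    return np.array([np.median(x[np.flatnonzero(row)]) for row in S])

def median_graph_filter(S, x, w):
    """z = sum_{k=0}^{K} w[k] * med(S^k, x), with S^0 = I (nonnegative S assumed)."""
    x = np.asarray(x, dtype=float)
    z = np.zeros(len(x))
    Sk = np.eye(len(x))            # S^0: each node only sees itself
    for wk in w:
        z += wk * graph_median(Sk, x)
        Sk = Sk @ S                # support now reaches one hop further
    return z

# Path graph 0-1-2-3 with self-loops.
A = np.diag(np.ones(3), 1)
S = np.eye(4) + A + A.T
x = np.array([1.0, 5.0, 2.0, 8.0])
z = median_graph_filter(S, x, w=[1.0, 0.5, 0.25])
# 0-hop medians: [1, 5, 2, 8]; 1-hop: [3, 2, 5, 5]; 2-hop: [2, 3.5, 3.5, 5]
# z = [3.0, 6.875, 5.375, 11.75]
```

The loop mirrors the polynomial structure of a linear graph filter: only the per-neighborhood summary changes, from a weighted average to a median.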

The median graph filter in Definition 2 is used here to define nonlinear activation functions. Specifically, consider a collection of median filter coefficients ${\mathbf{w}}_{\ell}^{f}$ and replace the pointwise activation functions in (6) by the local activation functions

[TABLE]

A median GNN is a GNN with median activation functions, i.e., one in which the $\ell$ th layer is defined by the composition of (5) and (12). We can interpret this architecture as the composition of two types of convolutional layers: linear convolutional layers [cf. (5)] and nonlinear convolutional layers based on median filters (12). Observe that the output of a median GNN is determined by the linear filter coefficients ${\mathcal{H}}$ and the median graph filter coefficients ${\mathcal{W}}=\{{\mathbf{w}}_{\ell}^{f}\}_{\ell,f}$ . For future reference, we therefore define the median GNN map

[TABLE]

where ${\mathbf{x}}_{L}$ is obtained from the application of $L$ median GNN layers, each of which computes a linear convolution (5) followed by a median graph filter activation (12). That the definition of median graph filters parallels the definitions of graph convolutions is not an accidental choice. Choosing activation functions of this form is intended not only to encode the graph structure at multiple resolutions, but also to preserve the permutation invariance of GNNs, as we formally state and prove next.

Proposition 2.

Consider a graph signal ${\mathbf{x}}$ supported on a graph with shift operator ${\mathbf{S}}$. Let $\Phi({\mathbf{x}};{\mathbf{S}},{\mathcal{H}},{\mathcal{W}})$ be the output of a median GNN with linear filter coefficients ${\mathcal{H}}$ and nonlinearity coefficients ${\mathcal{W}}$ [cf. (5)-(7), (12)]. If ${\mathbf{P}}$ is a permutation matrix, then

[TABLE]

i.e., the output of the median GNN is invariant to permutations.

Proof.

The result follows from Proposition 1 and the fact that the median activation function in (12) is invariant to permutations. Consider the graph permutation ${\mathbf{S}}^{\prime}={\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}{\mathbf{P}}$ and let ${\mathbf{x}}^{\prime}={\mathbf{P}}^{\mathsf{T}}{\mathbf{x}}$ be the corresponding permuted graph signal. Applying the median activation function to ${\mathbf{x}}^{\prime}$ , we get

[TABLE]

but this expression can be simplified by observing that $({\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}{\mathbf{P}})^{k}={\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}^{k}{\mathbf{P}}$, since ${\mathbf{P}}^{\mathsf{T}}={\mathbf{P}}^{-1}$, and that the graph median operators in equations (10) and (11) are permutation invariant. This yields

[TABLE]

a result that will also hold for multi-layered median GNNs as graph permutations cascade from each layer to the next. ∎

As nonscalar operators that take in values of a graph signal in localized neighborhoods around each node, median activation functions endow GNNs with the ability to extract nonlinear local features in addition to the linear features extracted through graph convolutions. This is something that pointwise activation functions, by construction, cannot do. Proposition 2 establishes that the features extracted by median graph filters are invariant to permutations of the graph. As previously noted, this is a desirable property for activation functions in GNNs. Another important property of median filters is that they are differentiable with respect to their parameters. This is instrumental in facilitating their training, as we explain in Section IV.

Remark 2.

To keep the presentation simple, we have used the same shift operator in the definition of linear convolutional filters and median convolutional filters. This is not necessary. In our numerical experiments, we use weighted adjacency matrices, namely ${\mathbf{S}}={\mathbf{A}}$ , for the linear convolution, but we add an identity to the unweighted adjacency, namely ${\mathbf{S}}={\mathbf{I}}+{\mathbf{N}}={\mathbf{M}}$ , for the median convolution. This choice of shift operator for the median convolution makes the neighborhood sets in (11) nested. This is not necessarily true if we just choose the adjacency matrix as a shift operator, and it also makes neighborhoods more interpretable. Regardless of interpretation, we have observed that this mix of shift operators reduces test error. Further note that different shift operators associated with the same graph can be used at different layers or even for different features at the same layer. We have not seen advantages associated with this expansion of the representation space.
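The nesting property mentioned in this remark can be checked on a small example. The sketch below compares the supports of successive powers of a plain adjacency matrix with those of the adjacency matrix plus the identity; the helper names and the four-node path graph are illustrative assumptions.

```python
import numpy as np

def supports(S, K):
    """Boolean supports of S^0, ..., S^K (nonnegative S assumed)."""
    out, Sk = [], np.eye(S.shape[0])
    for _ in range(K + 1):
        out.append(Sk > 0)
        Sk = Sk @ S
    return out

def is_nested(S, K):
    """True if every k-step neighborhood is contained in the (k+1)-step one."""
    sup = supports(S, K)
    return all(bool(np.all(sup[k] <= sup[k + 1])) for k in range(K))

# Unweighted path graph 0-1-2-3.
A = np.diag(np.ones(3), 1)
A = A + A.T

nested_adjacency = is_nested(A, 3)                   # no self-loops
nested_with_identity = is_nested(np.eye(4) + A, 3)   # identity added
```

Without self-loops, node 0 reaches only node 1 in one step and only nodes 0 and 2 in two steps, so the sets are not nested; adding the identity turns the $k$-step set into the set of nodes within $k$ hops, which grows monotonically with $k$.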

Remark 3 (Alternative median graph filter definitions).

The definition of median graph filters in (11) does not make use of the shift operator weights because these are not used in the definition of the median operator in (10). These weights would be easy to incorporate. E.g., we could replace $x_{j}$ by $s_{ij}x_{j}$ in the computation of the median in (10). Or, more attuned to classical definitions of median filters, the signal value $x_{j}$ can be repeated within the neighborhood set a number of times proportional to the weight $w_{ij}$ . Further note that (11) performs a linear combination of different neighborhood medians. We could think of replacing this linear combination by a weighted median operation as well. All of these generalizations are possible and would retain the invariance claimed in Proposition 2. Their analysis and evaluation are beyond the scope of this paper. We refer interested readers to [11, 12] for a comprehensive analysis of median filters for signals supported on graphs.

III-B Max GNNs

Consider once again the set ${\mathcal{X}}=\{x_{1},\ldots,x_{n}\}$ , and order the elements of ${\mathcal{X}}$ so that $x_{[1]}\leq x_{[2]}\leq\ldots\leq x_{[n]}$ . The max of the set ${\mathcal{X}}$ is the last element of the ordered sequence, which we write

$$\max({\mathcal{X}})=x_{[n]}$$

If the elements of ${\mathcal{X}}$ are the values of a graph signal ${\mathbf{x}}\in{\mathbb{R}}^{N}$ on the nodes of a graph ${\mathcal{G}}$ , it makes sense to define a max operator that takes the topology of ${\mathcal{G}}$ into account. In particular, this will be necessary to define activation functions in terms of max graph filters. We thus define the max operator associated with the graph shift operator ${\mathbf{S}}$ as follows.

Definition 3 (Max operator).

Let ${\mathbf{S}}$ be a graph shift operator and denote by ${\mathcal{N}}_{i}$ the neighborhood induced by ${\mathbf{S}}$ at node $i$ , with $1\leq i\leq N$ . The output of the max operator $\text{max}({\mathbf{S}},\cdot)$ applied to the graph signal ${\mathbf{x}}$ is the graph signal ${\mathbf{z}}:=\text{max}({\mathbf{S}},{\mathbf{x}})$ , whose components we write

$$[{\mathbf{z}}]_{i}=\max\big(\{x_{j}:j\in{\mathcal{N}}_{i}\}\big)$$

The max operator in Definition 3 replaces the value of $x_{i}$ at each node by the maximum of the values of ${\mathbf{x}}$ in the corresponding neighborhood ${\mathcal{N}}_{i}$. As such, $\text{max}({\mathbf{S}},\cdot)$ acts as a nonlinear diffusion operator, akin to the median operator introduced in Definition 1.

With the definition of the max operator at hand, we can now define multiresolution max graph filters, which are linear combinations of max operators acting on neighborhoods of different resolutions. We write them as follows.

Definition 4 (Multiresolution max graph filter).

Given a graph shift operator ${\mathbf{S}}$ and a vector of $K+1$ filter coefficients ${\mathbf{w}}=[w_{0},\ldots,w_{K}]^{\mathsf{T}}$ , the output of the max graph filter with coefficients ${\mathbf{w}}$ applied to the signal ${\mathbf{x}}$ is the signal ${\mathbf{z}}$ ,

$${\mathbf{z}}=\sum_{k=0}^{K}w_{k}\,\text{max}({\mathbf{S}}^{k},{\mathbf{x}}) \tag{17}$$

As was the case for the median graph filter in Definition 2, in the max graph filter in Definition 4 the summands $\text{max}({\mathbf{S}}^{k},{\mathbf{x}})$ combine information in the same multiresolution fashion as the polynomial terms of LSI-GFs. To see this, notice that $\text{max}({\mathbf{S}}^{k},{\mathbf{x}})$ takes as arguments the values of the graph signal on the same sets of $k$ -hop neighbors in which the LSI-GFs from (4) calculate weighted averages of the graph signal. However, while in LSI-GFs the signal components are averaged, in the graph max filter they are mapped to a single value by the nonlinear max operation. We can think of (17) as another type of nonlinear convolutional filter, this one constructed from max operations.

The max graph filter from Definition 4 is used to define a second nonlinear activation function —the max activation function— by considering a collection of max filter coefficients ${\mathbf{w}}_{\ell}^{f}$ and replacing the pointwise activation functions in (6) with max filters,

[TABLE]

From (18), we can then define a GNN with max activation functions in which the $\ell$ th layer is a composition of (5) and (18). We call it the max GNN.

Max GNN layers compose two types of convolutional layers, linear convolutional layers [cf. (5)] and nonlinear convolutional layers based on max filters (18). The output of a max GNN is determined by the linear filter coefficients ${\mathcal{H}}$ and the max graph filter coefficients ${\mathcal{W}}=\{{\mathbf{w}}_{\ell}^{f}\}_{\ell,f}$ . These parameterizations allow us to write the max GNN as the map

[TABLE]

where ${\mathbf{x}}_{L}$ is obtained by applying $L$ max GNN layers to ${\mathbf{x}}$ . Like their median filter counterparts, max graph filters are constructed in a way that preserves the permutation invariance of GNNs. This is formally stated and proved in Proposition 3.

Proposition 3.

Consider a graph signal ${\mathbf{x}}$ supported on a graph with shift operator ${\mathbf{S}}$ . Let $\Phi({\mathbf{x}};{\mathbf{S}},{\mathcal{H}},{\mathcal{W}})$ be the output of a max GNN whose linear filter coefficients are ${\mathcal{H}}$ and whose nonlinearity coefficients are ${\mathcal{W}}$ [cf. (5)-(7), (18)]. If ${\mathbf{P}}$ is a permutation matrix, then

[TABLE]

i.e., the output of the max GNN is invariant to permutations.

Proof.

This follows from Proposition 1 and from the invariance to permutations of the max activation function in (18). Let ${\mathbf{S}}^{\prime}={\mathbf{P}}^{\mathsf{T}}{\mathbf{S}}{\mathbf{P}}$ be a graph permutation and let ${\mathbf{x}}^{\prime}={\mathbf{P}}^{\mathsf{T}}{\mathbf{x}}$ be the corresponding permuted graph signal. This proof mimics the proof of Proposition 2, where we now use the fact that the maximum of a graph signal within its node neighborhoods is permutation invariant to show

[TABLE]

As was the case for median GNNs, the result above will also hold for max GNNs with multiple layers. ∎
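The filter-level invariance behind Propositions 2 and 3 can be illustrated numerically. The sketch below uses a single generic routine for both the median and the max graph filters and checks that filtering a permuted signal on the permuted graph gives the permuted output. It assumes a nonnegative shift operator with self-loops and neighborhoods given by the support of ${\mathbf{S}}^{k}$; the name `local_filter`, the random graph and the coefficients are hypothetical.

```python
import numpy as np

def local_filter(S, x, w, reduce):
    """z_i = sum_k w[k] * reduce({x_j : j in support of row i of S^k})."""
    N = len(x)
    z = np.zeros(N)
    Sk = np.eye(N)
    for wk in w:
        for i in range(N):
            z[i] += wk * reduce(x[np.flatnonzero(Sk[i])])
        Sk = Sk @ S
    return z

rng = np.random.default_rng(1)
N = 8
A = np.triu(rng.random((N, N)) < 0.4, 1).astype(float)
S = np.eye(N) + A + A.T                # unweighted adjacency plus identity
x = rng.standard_normal(N)
w = np.array([0.3, 1.0, -0.5])         # illustrative coefficients
P = np.eye(N)[rng.permutation(N)]      # random permutation matrix

gaps = {}
for name, reduce in (("median", np.median), ("max", np.max)):
    lhs = local_filter(P.T @ S @ P, P.T @ x, w, reduce)
    rhs = P.T @ local_filter(S, x, w, reduce)
    gaps[name] = float(np.max(np.abs(lhs - rhs)))
```

Both entries of `gaps` should be zero up to rounding, since relabeling the nodes only relabels the neighborhoods over which the median or the maximum is taken.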

Like their median counterparts, and unlike pointwise activation functions, max activations give GNNs the ability to extract nonlinear local features that cannot be extracted by application of linear graph convolutions alone. As seen in Proposition 3, the features extracted by max graph filters are also invariant to permutations of the graph. Finally, max filters are differentiable with respect to their parameters, which allows max activation functions to be trained (cf. Section IV).

Remark 4 (Pooling).

In deep neural network architectures, the representation dimension increases with the depth of the network and the number of features. To keep the representation dimension under control, many architectures implement pooling as an intermediate step between the convolutional filter banks and the nonlinearity. Pooling is a summarizing two-step operation that reduces dimensionality by first computing local summaries of the signal within the graph equivalent of a window and then subsampling it. The summarizing operation can be either linear or nonlinear; the most common examples are average pooling and max-pooling. Because the graph windows on which it operates can be rescaled, pooling is also used for feature extraction at multiple resolutions. Permutation invariance is preserved if the subsampling operation is based on topological features of the graph such as the node degrees [7]. We also note that the nonlinear activation function and the pooling summarizing operation can be composed into one single localized activation function that precedes subsampling. Therefore, all design strategies of localized activation functions described herein might also be useful in designing summarizing functions in pooling operations. In the case of GNNs, pooling strategies have been proposed in [7, 4, 2, 3].

IV Localized Activation Function Training

In contrast with pointwise nonlinearities, localized activation functions act on multiple nodal components at a time. As a result, the computations that the NN carries out both during the forward and backward passes of data need to be updated. In this section, we look at how localized activation functions affect the gradient updates in backpropagation training (Section IV-A) and discuss the additional computational complexity incurred by these operators (Section IV-B).

IV-A Backpropagation

For ease of exposition, let us consider a single-layer GNN with a graph convolutional layer given by [cf. (5)]

[TABLE]

where the input has $G$ features ${\mathbf{x}}^{g}$ , $g=1,\ldots,G$ and the output has $F$ features ${\mathbf{u}}^{f}$ , $f=1,\ldots,F$ . The set ${\mathcal{H}}=\{h_{k}^{fg}\}_{f,g,k}$ groups the corresponding trainable parameters. In median and max GNNs, the linear operation is followed by a local activation function [cf. (12), (18)]

[TABLE]

where $\boldsymbol{\sigma}({\mathbf{S}}^{k^{\prime}},\cdot):{\mathbb{R}}^{N}\to{\mathbb{R}}^{N}$ represents the chosen local activation function —either the median (12) or max filters (18)— with trainable parameters ${\mathcal{W}}=\{w_{k^{\prime}}^{f}\}_{f,k^{\prime}}$ . The estimate ${\hat{\mathbf{y}}}({\mathbf{x}})=[{\hat{\mathbf{y}}}^{1}({\mathbf{x}})^{\mathsf{T}},\ldots,{\hat{\mathbf{y}}}^{F}({\mathbf{x}})^{\mathsf{T}}]^{\mathsf{T}}$ is then the output of this layer, with ${\hat{\mathbf{y}}}^{f}({\mathbf{x}})={\mathbf{z}}^{f}({\mathbf{x}})$ for each $f=1,\ldots,F$ . Given a training set ${\mathcal{T}}=\{({\mathbf{x}}_{m},{\mathbf{y}}_{m})\}$ , we choose parameters ${\mathcal{H}}$ and ${\mathcal{W}}$ that minimize some total loss function ${\bar{J}}$ over ${\mathcal{T}}$

[TABLE]

In conventional GNNs, this is done by using the backpropagation algorithm. In the following proposition, we show that local activation functions are equally amenable to training via backpropagation, and give closed form expressions for the trainable parameters’ gradient updates.

Proposition 4.

Let ${\mathcal{T}}=\{({\mathbf{x}}_{m},{\mathbf{y}}_{m})\}$ be a training set. Let ${\bar{J}}({\mathbf{y}},{\hat{\mathbf{y}}}({\mathbf{x}}))$ be the loss function in (23), where ${\hat{\mathbf{y}}}({\mathbf{x}})=[{\hat{\mathbf{y}}}^{1}({\mathbf{x}})^{\mathsf{T}},\ldots,{\hat{\mathbf{y}}}^{F}({\mathbf{x}})^{\mathsf{T}}]^{\mathsf{T}}$ is the output estimate of the single-layer GNN with convolutional layer (21) followed by the local activation layer (22), with ${\hat{\mathbf{y}}}^{f}({\mathbf{x}})={\mathbf{z}}^{f}({\mathbf{x}})$ for each $f=1,\ldots,F$ . At each round of the backpropagation algorithm, the learnable parameters ${\mathcal{H}}=\{h_{k}^{fg}\}_{f,g,k}$ and ${\mathcal{W}}=\{w_{k^{\prime}}^{f}\}_{f,k^{\prime}}$ are updated by calculating the derivatives

[TABLE]

where $\partial J/\partial{\hat{\mathbf{y}}}^{f}=[\partial J/\partial[{\hat{\mathbf{y}}}^{f}]_{1},\ldots,\partial J/\partial[{\hat{\mathbf{y}}}^{f}]_{N}]\in{\mathbb{R}}^{1\times N}$ is the gradient of the loss function $J$, and with ${\mathbf{P}}_{k^{\prime}}\in{\mathbb{R}}^{N\times N}$ a binary matrix such that $[{\mathbf{P}}_{k^{\prime}}]_{ij}=1$ if node $j$ realizes the local activation $\boldsymbol{\sigma}({\mathbf{S}}^{k^{\prime}},\cdot)$ for node $i$, and $[{\mathbf{P}}_{k^{\prime}}]_{ij}=0$ for every other $j$.

Proof.

Let us start by proving (24). We take the derivative of the total loss function ${\bar{J}}$ with respect to the scalar coefficient $h_{k}^{fg}$ . Towards this end, denote the gradient of $J$ with respect to all the features by $\partial J/\partial{\hat{\mathbf{y}}}=[\partial J/\partial{\hat{\mathbf{y}}}^{1},\ldots,\partial J/\partial{\hat{\mathbf{y}}}^{F}]\in{\mathbb{R}}^{1\times NF}$ where each $\partial J/\partial{\hat{\mathbf{y}}}^{f}\in{\mathbb{R}}^{1\times N}$ is the gradient with respect to feature $f$ . Denote the gradient of the output of the GNN with respect to the specific parameter $h_{k}^{fg}$ by $\partial{\hat{\mathbf{y}}}/\partial h_{k}^{fg}=[(\partial{\hat{\mathbf{y}}}^{1}/\partial h_{k}^{fg})^{\mathsf{T}},\ldots,(\partial{\hat{\mathbf{y}}}^{F}/\partial h_{k}^{fg})^{\mathsf{T}}]^{\mathsf{T}}\in{\mathbb{R}}^{NF\times 1}$ , where $\partial{\hat{\mathbf{y}}}^{f^{\prime}}/\partial h_{k}^{fg}\in{\mathbb{R}}^{N\times 1}$ for each $f^{\prime}=1,\ldots,F$ . We note that $\partial{\hat{\mathbf{y}}}^{f^{\prime}}/\partial h_{k}^{fg}={\mathbf{0}}$ whenever $f^{\prime}\neq f$ . Applying the chain rule once, we get

[TABLE]

since $\partial{\hat{\mathbf{y}}}^{f^{\prime}}/\partial h_{k}^{fg}={\mathbf{0}}$ whenever $f^{\prime}\neq f$ . Now, we focus on $\partial{\hat{\mathbf{y}}}^{f}/\partial h_{k}^{fg}$ , and using (22) and the chain rule once again, we get

[TABLE]

where $\partial\boldsymbol{\sigma}({\mathbf{S}}^{k^{\prime}},\cdot)/\partial{\mathbf{u}}^{f}\in{\mathbb{R}}^{N\times N}$ is the Jacobian, with $[\partial\boldsymbol{\sigma}/\partial{\mathbf{u}}^{f}]_{ij}=\partial[\boldsymbol{\sigma}]_{i}/\partial[{\mathbf{u}}^{f}]_{j}$ being the corresponding (sub-)derivative. The local activation function $\boldsymbol{\sigma}({\mathbf{S}}^{k^{\prime}},{\mathbf{u}}^{f})$ outputs a graph signal in which each element is a nonlinear combination of the values at the $k^{\prime}$-hop neighbors of the corresponding node. In both cases (median and max), applying the function is equivalent to selecting a value from the ones at the neighboring nodes (i.e., the output $[\boldsymbol{\sigma}({\mathbf{S}}^{k^{\prime}},{\mathbf{u}}^{f})]_{i}$ is the value of $[{\mathbf{u}}^{f}]_{j}$ for some $j\in{\mathcal{N}}_{i}^{k^{\prime}}$). Therefore, the (sub-)derivative of $[\boldsymbol{\sigma}]_{i}$ with respect to the input graph signal ${\mathbf{u}}^{f}$ is a vector with a $1$ in the position corresponding to the node $j\in{\mathcal{N}}_{i}^{k^{\prime}}$ that generates the output of the function, and zeros elsewhere. This implies that $\partial\boldsymbol{\sigma}({\mathbf{S}}^{k^{\prime}},\cdot)/\partial{\mathbf{u}}^{f}={\mathbf{P}}_{k^{\prime}}\in{\mathbb{R}}^{N\times N}$ is a binary matrix with $[{\mathbf{P}}_{k^{\prime}}]_{ij}=1$ if node $j$ generates the output of the function at node $i$ and $[{\mathbf{P}}_{k^{\prime}}]_{ij}=0$ for all other $j$. Finally, since per (21) ${\mathbf{u}}^{f}$ is a linear function of $h_{k}^{fg}$,

[TABLE]

Using (28) together with the fact that $\partial\boldsymbol{\sigma}({\mathbf{S}}^{k^{\prime}},\cdot)/\partial{\mathbf{u}}^{f}={\mathbf{P}}_{k^{\prime}}$ back in (27), and, consequently, back in (26), completes the proof of (24).

For (25) we have

[TABLE]

where we used the analogous fact that $\partial{\hat{\mathbf{y}}}^{f^{\prime}}/\partial w_{k^{\prime}}^{f}={\mathbf{0}}$ whenever $f^{\prime}\neq f$, and where the derivative of the feature $f$ with respect to the nonlinear weight parameter is denoted by $\partial{\hat{\mathbf{y}}}^{f}/\partial w_{k^{\prime}}^{f}=[\partial[{\hat{\mathbf{y}}}^{f}]_{1}/\partial w_{k^{\prime}}^{f},\ldots,\partial[{\hat{\mathbf{y}}}^{f}]_{N}/\partial w_{k^{\prime}}^{f}]^{\mathsf{T}}\in{\mathbb{R}}^{N\times 1}$. Observe that ${\hat{\mathbf{y}}}^{f}$ is a linear function of $w_{k^{\prime}}^{f}$, and so using (22) we get

[TABLE]

Replacing (30) back in (29) proves (25). ∎
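The selection-matrix argument of the proof can be checked against finite differences. The sketch below is an illustration under several assumptions that are not taken from the paper: a single input and output feature ($F=G=1$), the max activation, the same self-looped shift operator for the linear and the nonlinear part, and the quadratic loss $J=\tfrac{1}{2}\|{\mathbf{z}}-{\mathbf{y}}\|^{2}$. In this setting, our reading of the proof gives $\partial J/\partial h_{k}=(\partial J/\partial{\hat{\mathbf{y}}})\big(\sum_{k^{\prime}}w_{k^{\prime}}{\mathbf{P}}_{k^{\prime}}\big){\mathbf{S}}^{k}{\mathbf{x}}$ and $\partial J/\partial w_{k^{\prime}}=(\partial J/\partial{\hat{\mathbf{y}}})\,\boldsymbol{\sigma}({\mathbf{S}}^{k^{\prime}},{\mathbf{u}})$. All function names are hypothetical.

```python
import numpy as np

def forward(S, x, h, w):
    """u = sum_k h[k] S^k x, then z = sum_k' w[k'] max(S^k', u)."""
    N = len(x)
    Skx = np.stack([np.linalg.matrix_power(S, k) @ x for k in range(len(h))], axis=1)
    u = Skx @ h
    P = np.zeros((len(w), N, N))           # one selection matrix P_k' per hop count
    Sk = np.eye(N)
    for kp in range(len(w)):
        for i in range(N):
            nb = np.flatnonzero(Sk[i])     # k'-hop neighborhood of node i
            P[kp, i, nb[np.argmax(u[nb])]] = 1.0
        Sk = Sk @ S
    sig = P @ u                            # row k' holds max(S^k', u)
    return w @ sig, Skx, P, sig

def analytic_grads(S, x, y, h, w):
    z, Skx, P, sig = forward(S, x, h, w)
    dJ_dz = z - y                          # gradient of 0.5 * ||z - y||^2
    grad_h = (dJ_dz @ np.tensordot(w, P, axes=1)) @ Skx
    grad_w = sig @ dJ_dz
    return grad_h, grad_w

def loss(S, x, y, h, w):
    return 0.5 * np.sum((forward(S, x, h, w)[0] - y) ** 2)

def numeric_grad(f, p, eps=1e-6):
    """Central finite differences, used only as a reference."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        d = np.zeros_like(p)
        d[i] = eps
        g[i] = (f(p + d) - f(p - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
N = 7
A = np.triu(rng.random((N, N)) < 0.4, 1).astype(float)
S = np.eye(N) + A + A.T
x, y = rng.standard_normal(N), rng.standard_normal(N)
h, w = rng.standard_normal(3), rng.standard_normal(3)

grad_h, grad_w = analytic_grads(S, x, y, h, w)
num_h = numeric_grad(lambda p: loss(S, x, y, p, w), h)
num_w = numeric_grad(lambda p: loss(S, x, y, h, p), w)
err_h = np.max(np.abs(grad_h - num_h)) / (1 + np.max(np.abs(num_h)))
err_w = np.max(np.abs(grad_w - num_w)) / (1 + np.max(np.abs(num_w)))
```

The selection matrices are recorded during the forward pass, so the backward pass needs no additional sorting. The comparison is only meaningful away from ties in the maximum, where the sub-derivative is not unique.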

It should be noted that both in (24) and (25) the gradient updates depend on quantities that are either available from the start —like ${\mathbf{S}}$ — or made available in the forward pass immediately before the backpropagation step. In particular, the weights $w_{k^{\prime}}^{f}$ in (24) are initialized before the first forward and backward passes.

IV-B Computational complexity

In this section, we discuss and quantify the additional computational complexity incurred by localized activation functions in the forward and backward passes of data needed to train the single layer median/max GNNs of the previous subsection. These GNNs are compared with a single-layer GNN containing only pointwise activation functions.

Corollary 1.

In Proposition 4, replace the local activation function layer (22) by a pointwise activation function $\sigma:{\mathbb{R}}\to{\mathbb{R}}$ ,

[TABLE]

where $[\boldsymbol{\sigma}({\mathbf{u}}^{f})]_{i}=\sigma([{\mathbf{u}}^{f}]_{i})$ for all $i=1,\ldots,N$ . In this case, the derivatives used in the backpropagation algorithm are

[TABLE]

where ${\mathbf{q}}\in{\mathbb{R}}^{N}$ with $[{\mathbf{q}}]_{i}=d\sigma/dx\ |_{x=[{\mathbf{u}}^{f}]_{i}}$ .

Proof.

We set $K^{\prime}=0$ in (24) in Proposition 4, since we only consider the value of the signal at the node individually and $w_{k^{\prime}}^{f}=1$ (the output of conventional pointwise activation functions is not modified by any trainable weights). In this scenario, the matrix ${\mathbf{P}}_{k^{\prime}}={\mathbf{P}}_{0}$ takes the form of a diagonal matrix, because the only nodes contributing to the output of the nonlinearity are the nodes themselves. The derivatives of the nonlinearity $\sigma$ at each node are then the diagonal values of the matrix ${\mathbf{P}}_{0}=\text{diag}({\mathbf{q}})$ with $[{\mathbf{q}}]_{i}=d\sigma/dx\ |_{x=[{\mathbf{u}}^{f}]_{i}}$ . ∎

The overall complexity of the activation step in one forward pass of the GNN with pointwise activations is ${\mathcal{O}}(N)$, that is, one nonlinearity per node. In a forward pass of the GNN with localized activation functions, the added complexity stems mainly from the sorting operations performed across multiple neighborhoods of a given node to find the maximum or the median of the graph signal in those regions. As a result, the multi-hop maximum and the multi-hop median activations incur an overall worst-case computational complexity of ${\mathcal{O}}(NK^{\prime}d^{K^{\prime}}\log(d^{K^{\prime}}))$, with $K^{\prime}>0$ and $d$ denoting the maximum degree. Although the activation function reach $K^{\prime}$ is usually small, it exponentiates $d$, which need not be small. The graph degree is therefore bound to be a considerable limiting factor complexity-wise, but this is expected since our nonlinearities are now local instead of pointwise. In any case, we note that, as long as the degree is not a function of the number of nodes (this is the case, for example, of regular networks and of most small-world networks [30, Ch. 10]), the complexity is still linear in $N$.

In the backward pass, the GNN with pointwise activations incurs ${\mathcal{O}}(N)$ operations (cf. (32)) to compute the derivative with respect to $h_{k}^{fg}$. In the case of the GNN with local activation functions, following Proposition 4 we have ${\mathcal{O}}(K^{\prime}N)$ for the computation of the derivative with respect to $h_{k}^{fg}$ followed by ${\mathcal{O}}(N)$ operations to compute the derivatives with respect to $w_{k^{\prime}}^{f}$. These results are summarized in Table I.

V Numerical Experiments

We assess the performance of median and max GNNs in four different applications. In all scenarios, the multi-hop maximum and the multi-hop median are compared with the ReLU. The first problem is source localization on Erdős-Rényi (ER) and geometric graphs. In this synthetic setting, we analyze how the activation function reach $K$ and the underlying graph degree affect classification accuracy. The second application is authorship attribution of text excerpts taken from 19th century novels, modeled as a binary classification problem detailed in Section V-B. In the third experiment, we tackle the problem of predicting movie ratings using the MovieLens 100k dataset. In addition to comparing localized and pointwise activation functions, we contrast our performance with that of the recommendation systems proposed in [31] and [32]. The last experiment is a node classification task in which we use the Cora citation network and associated dataset to classify scientific articles into 7 different classes.

In Sections V-A through V-C, the simulated GNNs predict class labels for graph signals. In Section V-D, they predict labels for nodes of the graph. Unless otherwise noted, all models consist of one convolutional layer with $F_{1}=32$ linear graph filters that have $K_{1}=5$ filter taps each, followed by the activation function under analysis (ReLU, $K$-hop median or $K$-hop max). The GSO of the convolutional filters is always the adjacency matrix; in Sections V-A, V-B and V-D, the best results were obtained when the adjacency matrix was rescaled by the inverse of its largest eigenvalue and so these are the results that we report. No pooling is performed, and in the graph signal classification settings (Sections V-A through V-C) the convolutional layer is followed by a fully connected layer that carries out a softmax classification with $F_{2}=C$ nodes, where $C$ is the number of classes, which is different for each problem. In all applications, the GNNs were trained using the ADAM algorithm for stochastic optimization. This algorithm keeps exponentially decaying averages of past gradients and of past squared gradients, with decay factors $\beta_{1}=0.9$ and $\beta_{2}=0.999$, respectively [33].
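For reference, one step of the standard ADAM update with these decay factors can be sketched as follows. This is the textbook form of the algorithm, not code from the experiments; the function name `adam_step` is hypothetical, the default step size 0.005 is borrowed from the source localization setup described later in this section, and the stabilizing constant `eps` is a commonly used value that is assumed here.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.005, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * grad          # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias corrections for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
grad = np.array([0.3, -4.0])
theta1, m1, v1 = adam_step(theta, grad, np.zeros(2), np.zeros(2), t=1)
```

After the first step the bias-corrected averages equal the gradient and its square, so each parameter moves by approximately the step size in the direction opposite to the sign of its gradient, regardless of the gradient magnitude.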

V-A Source Localization

Source localization is a classification problem where we aim to identify the node that originated a diffusion process on a graph [34]. Take for instance a graph ${\mathcal{G}}$ with $N$ nodes and adjacency matrix ${\mathbf{W}}$. To be specific, let $c\in\{1,\ldots,N\}$ represent the index of the source node and consider the seeding graph signal ${\mathbf{x}}_{0}$, which is defined as $[{\mathbf{x}}_{0}]_{i}=1$ if $i=c$ and $[{\mathbf{x}}_{0}]_{i}=0$ for all other $i$. The corresponding diffused signal at time $t=0,1,2,\ldots$ is then ${\mathbf{x}}(t)={\mathbf{W}}^{t}{\mathbf{x}}_{0}$, which satisfies ${\mathbf{x}}(0)={\mathbf{x}}_{0}$. Given ${\mathbf{x}}(t)$ and without knowing $t$, we want to identify the node $c$.
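A training pair for this problem can be generated as in the sketch below, which follows the definition ${\mathbf{x}}(t)={\mathbf{W}}^{t}{\mathbf{x}}_{0}$. The function names, the small random graph, the candidate source set and the maximum diffusion time `t_max` are hypothetical, and node indices are 0-based in the code.

```python
import numpy as np

def diffuse(W, c, t):
    """Return x(t) = W^t x_0 for a unit seed placed at node c."""
    x0 = np.zeros(W.shape[0])
    x0[c] = 1.0
    return np.linalg.matrix_power(W, t) @ x0

def sample(W, sources, t_max, rng):
    """Draw one (diffused signal, source label) pair with a random diffusion time."""
    c = int(rng.choice(sources))
    t = int(rng.integers(0, t_max + 1))    # t is not revealed to the classifier
    return diffuse(W, c, t), c

rng = np.random.default_rng(0)
N = 10
A = np.triu(rng.random((N, N)) < 0.4, 1).astype(float)
W = A + A.T                                # illustrative unweighted adjacency matrix
sources = [0, 3, 7]                        # hypothetical candidate sources
x, label = sample(W, sources, t_max=5, rng=rng)
```

Because the diffusion time is drawn at random and discarded, the classifier only sees the diffused signal and has to recover the label `c` from it.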

In all of this section’s experiments, we train a GNN to predict the source node of a graph diffusion process by optimizing a cross entropy loss with 0.005 learning rate. In every round, 10,000 synthetic training samples consisting of a diffused graph signal ${\mathbf{x}}(t)$ at a random time $t$ (input) and its source $c$ (true label) were evaluated in batches of 100. To prevent overfitting, we set the node dropout probability to 50% during training [35]. The validation and test sets comprised 200 input-output samples in every round.

We first analyze the evolution of the training and validation losses over 30 epochs for an ER graph with 100 nodes, edge probability 0.4 and 20 possible sources that we choose uniformly at random. This amounts to a classification problem with 20 classes. The training vs. validation loss plots for the ReLU, the 1-hop maximum and the 1-hop median are shown in Figure 2. No architecture overfits the training set, and they all achieve a comparable loss on the unseen validation set.

Next, we study the source localization problem on 40 random ER [36] and 40 random geometric graphs [37, Ch. 4]. Each graph has 100 nodes, 10 of which are randomly picked to be potential sources (10 classes). The ER graphs have edge probability 0.4 and average degree 39.4. The geometric graphs have radius 0.15 on the unit square and average graph degree 5.6. Training was done in 20 epochs for both types of graphs.

First, we analyze localized activation function performance in terms of their reach in number of hops, i.e., $K$ in expressions (12) and (18). We adopt the convention that, for $K=0$ , both activation functions amount to the ReLU, $\mbox{max}\{0,x_{i}\}$ , since expressions (12) and (18) are linear in ${\mathbf{x}}$ for $K=0$ . Figure 3(a) shows the average test accuracies for ER graphs. At $K=1$ , localized activation functions increase accuracy by at least 6.5 percentage points over the ReLU. However, as $K$ grows, their performance appears to stagnate or decay. A plausible explanation is that, on graphs with such high connectivity and average degree, neighborhoods become more redundant as they grow larger (larger $K$ ).
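To make the role of $K$ concrete, here is a hedged sketch of a $K$-hop localized activation. Expressions (12) and (18) are not reproduced in this section, so the form below is an assumption based on the description in the conclusions: a weighted sum, with hypothetical weights `w[k]`, of max (or median) operators applied over neighborhoods of growing reach, where the 0-hop neighborhood of a node is the node itself. Note that `np.median` averages the two middle values on even-sized neighborhoods, which may differ from the paper's median operator.

```python
import numpy as np

def k_hop_masks(A, K):
    """masks[k][i, j] is True if node j is within k hops of node i (i included)."""
    n = A.shape[0]
    step = ((A != 0) | np.eye(n, dtype=bool)).astype(int)
    reach = np.eye(n, dtype=int)
    masks = [reach.astype(bool)]
    for _ in range(K):
        reach = ((reach @ step) > 0).astype(int)   # extend the reach by one hop
        masks.append(reach.astype(bool))
    return masks

def localized_activation(x, A, w, op=np.max):
    """y_i = sum_k w[k] * op({x_j : j within k hops of i}); K = len(w) - 1."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for wk, mask in zip(w, k_hop_masks(A, len(w) - 1)):
        y += wk * np.array([op(x[row]) for row in mask])
    return y

# Toy example: hypothetical 4-node path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, -2.0, 3.0, 0.0])

y_max = localized_activation(x, A, w=[0.5, 0.5], op=np.max)
y_med = localized_activation(x, A, w=[0.5, 0.5], op=np.median)
```

With a single weight ($K=0$) the sketch reduces to a scaling of ${\mathbf{x}}$, which is the linear case that motivates the ReLU convention in the text.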

The average test accuracies as a function of $K$ are displayed in Figure 3(b) for geometric graphs. Although the 2-hop max achieves the highest accuracy, its performance degrades as the number of hops increases. On the other hand, the median sustains consistent improvements up to $K=3$ , but its performance also decays from there to $K=4$ . This is explained by the fact that geometric graphs with small radii are not very connected and have an almost Euclidean structure. Thus, the more distant the neighborhood, the smaller its influence is likely to be on a node. It is natural, then, to expect a function returning extreme values like the maximum to add a lot of high intensity noise from distant neighborhoods as $K$ grows and thus degrade more abruptly in performance than a smoothing operation like the median. Nonetheless, the median still performs worse as neighborhoods increase, due to the excess of information and possible redundancies. Although all localized activation functions outperform the ReLU, they do so by smaller margins and with higher variances than those observed on ER graphs. This is further evidence that localized activation functions are less powerful on graphs with some structural regularity, which is precisely the case of geometric graphs.

The next analysis was done by keeping all the same simulation parameters and varying the graph degree. Results are presented in Figure 4(a) for ER graphs and in Figure 4(b) for geometric graphs. On both ER and geometric graphs, the 1-hop median delivers the best results by degree. On ER graphs, localized activation functions are consistently better than the ReLU, and even if classification accuracy decreases with the graph degree regardless of the choice of activation function, the accuracy gap between localized activations and the ReLU increases systematically.

On geometric graphs (Figure 4(b)), the best accuracy is obtained in the middle of the degree range. The worst accuracies are observed for graphs with small degree; this is once again related to them having an almost Euclidean structure that is more likely to benefit from the CNN rather than the GNN apparatus. Localized activation functions outperform the ReLU in all scenarios. The steady decay in performance that occurs at higher degrees is accompanied by a reduction in the accuracy gap between localized and pointwise activations, in contrast with what we observed for ER graphs. This can be explained by the highly connected patterns arising from the smaller variances in the individual nodes’ degrees of geometric graphs.

V-B Authorship attribution

In this section, we assess the performance of localized activation function GNNs in an authorship attribution problem based on real data. The graphs we consider are author-specific word adjacency networks (WANs), which are directed graphs whose nodes are function words and whose edges represent the probability of transitioning between a particular pair of words in a text written by the author. Function words are prepositions, pronouns, conjunctions and other words with syntactic importance and little semantic meaning; their use in authorship attribution was first discussed in [38] and is based on the fact that their usage carries stylometric information about the author while being content independent.

We consider $N=211$ nodes or function words. Using the method in [39], we build single-author WANs for Emily Brontë and Jane Austen. To build each author’s WAN, we process their texts to count the number of times that each pair of function words co-appear in 10-word windows. These counts are entered into an $N\times N$ function-word matrix and normalized row-wise. The resulting matrix is the WAN adjacency matrix, which can also be interpreted as a Markov chain transition matrix. Because the order in which function words appear matters, the resulting graphs are directed. As for the graph signals, they are defined as each function word’s count among 1,000 words. Thus, the texts available by a given author are split into 1,000-word excerpts (signals) where we store the frequency of each of the function words.
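The WAN construction can be sketched as follows. This is a simplified illustration and not the exact method of [39]: it assumes that a window starts at a function word and spans the `window - 1` words that follow it, that self-transitions are counted, and that rows without any observed transition stay at zero. The names `build_wan` and `function_word_signal` are hypothetical.

```python
import numpy as np

def build_wan(tokens, function_words, window=10):
    """Row-normalized matrix of directed function-word co-appearance counts."""
    index = {w: i for i, w in enumerate(function_words)}
    n = len(function_words)
    counts = np.zeros((n, n))
    for pos, word in enumerate(tokens):
        if word not in index:
            continue
        # Look ahead at the words that follow `word` inside the window.
        for nxt in tokens[pos + 1 : pos + window]:
            if nxt in index:
                counts[index[word], index[nxt]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions are left as all zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def function_word_signal(tokens, function_words):
    """Graph signal: how often each function word appears in an excerpt."""
    index = {w: i for i, w in enumerate(function_words)}
    x = np.zeros(len(function_words))
    for word in tokens:
        if word in index:
            x[index[word]] += 1
    return x

words = ["the", "of", "and"]                       # toy function-word list
text = "the cat and the hat of the dog".split()    # toy excerpt
W_wan = build_wan(text, words)
x_sig = function_word_signal(text, words)
```

Each nonzero row of `W_wan` sums to one, which is what allows the adjacency matrix to be read as a Markov chain transition matrix.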

Splitting an author’s texts between training and test sets on an 80-20 ratio, each author’s WAN is generated from function word co-appearance counts in the training set only. An example of such a network is depicted in Figure 5. The graph signals in the training set are the individual function words’ counts in these same excerpts and in excerpts by other authors picked at random from a pool of 21 authors to yield a balanced classification problem. Paired with a binary label where 1 indicates a text by the author in question and 0 a text by any other author in the pool, these constitute the input-output pairs used to train the GNN. Test samples are defined analogously, but we only consider excerpts that have not been used to build the author’s WAN. The loss function is the cross entropy, which we optimize in 25 training epochs and batches of 20 samples, with learning rate 0.005 and without dropout.

Figure 6 presents the authorship attribution accuracy results for Emily Brontë (Figure 6(a)) and Jane Austen (Figure 6(b)). Ten rounds of simulations were conducted for each author by varying the training and test splits.

The average out-degree of the WANs built for author Emily Brontë was 77.9. The training and test sets consisted of 1,092 and 272 1,000-word excerpts, both with equally balanced classes. For Jane Austen, the average out-degree of the WANs considered was 88.3, and the training and test sets contained 1,234 and 308 labeled excerpts respectively.

In Figure 6(a), we see that median and max GNNs did consistently better than the ReLU GNNs at discerning between texts written by Brontë and any other author in the pool. Although the smallest classification error, of 12.43% (34/272), was obtained with the 1h-max, every other localized activation outperforms the ReLU on average, with significantly smaller test errors and deviations around the average.

For the author Jane Austen (Figure 6(b)), three of our schemes perform better than the ReLU. The 2-hop localized activations do worse than the ReLU, which could be explained by the higher average degree of this author’s WANs. The gap between the best performing localized activation — the 1-hop median — and the ReLU is not as big as in the previous example, but it still amounts to at least one extra excerpt being labeled correctly. What is more impressive is this architecture’s ability to correctly attribute text fragments as short as a single page with up to 98.37% accuracy (303/308). This was the best observed accuracy in all 10 realizations and it was obtained by training the 1-hop max.

V-C Recommender systems

The third application we consider is movie rating prediction using the MovieLens 100k dataset [40], which contains ratings that a set of users have given to a subset of movies. There are $U=943$ users and $M=1,582$ movies (items), the ratings range from 1 to 5, only 100,000 out of 1,491,826 ratings are known and the ratings for unknown user-movie pairs are set to 0. Given an incomplete $U\times M$ rating matrix, we can define two different graphs: (a) a user similarity network and (b) a movie similarity network. Both are constructed by computing Pearson correlations considering only (a) items that have been rated by pairs of users or (b) users that have rated the same item pairs. For a given node, we keep only the top- $k$ user or item pairs with highest similarity, which yields a directed graph. Correspondingly, a user-based and a movie-based GNN architecture can be defined atop each one of these graphs.
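The similarity network construction can be sketched as below. This is a minimal sketch under stated assumptions: the function name `similarity_graph` is hypothetical, correlations are computed only over co-rated items, pairs with fewer than two co-rated items or with constant ratings are skipped, and only positive correlations are kept among the top $k$. None of these details is specified in the text. Passing the transposed rating matrix would give the movie similarity network instead.

```python
import numpy as np

def similarity_graph(R, k):
    """R: (nodes x items) ratings with 0 = unknown. Returns a directed adjacency."""
    n = R.shape[0]
    known = R > 0
    S = np.zeros((n, n))
    for u in range(n):
        for v in range(u + 1, n):
            common = known[u] & known[v]          # items rated by both nodes
            if common.sum() < 2:
                continue
            a, b = R[u, common], R[v, common]
            if a.std() == 0 or b.std() == 0:
                continue                          # correlation undefined
            S[u, v] = S[v, u] = np.corrcoef(a, b)[0, 1]
    A = np.zeros_like(S)
    for u in range(n):
        top = np.argsort(S[u])[::-1][:k]          # k largest similarities of node u
        for v in top:
            if S[u, v] > 0:                       # assumption: keep positive links only
                A[u, v] = S[u, v]
    return A

# Toy 3-user, 4-movie rating matrix (hypothetical values).
R = np.array([[5, 4, 1, 0],
              [4, 5, 1, 2],
              [1, 2, 5, 4]], dtype=float)
A_users = similarity_graph(R, k=1)
```

The graph is directed because node $v$ may be among the top $k$ neighbors of $u$ without $u$ being among the top $k$ neighbors of $v$.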

In the user-based approach, each graph signal corresponds to a different movie and consists of the existing ratings by every user in the network who has rated that movie. A depiction of such a signal is shown in Figure 7(a). The way in which we create training and test samples from these signals is by “zero-ing out” the ratings of the user $u$ in which we are interested. The GNN is then trained to predict ratings by this user to any movie previously rated by other users. We can see this task as equivalent to completing the $u$ th row of the rating matrix.

In the movie-based approach, the underlying graph is a movie similarity network. There are as many graph signals as users, and each of the signals corresponds to the ratings that the corresponding user has given to the movies in the dataset. We create training and test samples by “zero-ing out” the ratings to a movie $m$ of our choice. The GNN is trained to predict ratings to this movie by any user who has already given ratings to other movies in the graph. This is equivalent to completing the $m$ th column of the rating matrix. An example of this is given in Figure 7(b), where we predicted every rating in the 1st column of the matrix (corresponding to the movie Toy Story) and represented them as a graph signal on top of the user similarity network.
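The "zero-ing out" step used in both approaches can be illustrated as follows. The helper `make_samples` is hypothetical and written for the user-based case, where every movie rated by the target user yields one input-output pair; calling it on the transposed rating matrix with a target movie gives the movie-based samples.

```python
import numpy as np

def make_samples(R, target):
    """Build (signal, label) pairs for node `target` from a ratings matrix R.

    R has one row per graph node and one column per graph signal; 0 = unknown.
    """
    samples = []
    for s in range(R.shape[1]):
        if R[target, s] > 0:          # only signals where the target has a rating
            x = R[:, s].copy()
            y = x[target]             # rating to be predicted
            x[target] = 0.0           # hide it from the GNN input
            samples.append((x, y))
    return samples

# Toy 3-user, 3-movie rating matrix (hypothetical values).
R = np.array([[5, 0, 3],
              [4, 2, 0],
              [0, 1, 5]], dtype=float)

user_samples = make_samples(R, target=0)      # user-based: complete row 0
movie_samples = make_samples(R.T, target=0)   # movie-based: complete column 0
```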

We choose five 90-10 splits for the training and test sets in both the user-based and movie-based experiments, submitting the GNNs to 40 epochs of training, in batches of 5 and without dropout. We optimize the cross entropy loss with learning rate 0.005. In both cases, we contrast the average performance of our localized activation function-based GNNs on all data splits with that of an all-ReLU GNN. On the best data split, we also compare our method with the recommender systems proposed in [31] and [32]. In [31], the authors also make the distinction between a user-based and a movie-based approach. The user-based approach predicts the entire rating matrix through application of linear graph filters defined on top of the user similarity network. They have up to 6 taps and their coefficients are optimized on the full 90,000-rating training set. These authors’ movie-based approach does the same, but using linear graph filters defined on the movie similarity network instead. These have up to 3 filter taps that are also optimized on the training set. Because our methods look at each user/movie individually, to make for a fair comparison we test the method in [31] once for each user/movie, taking only that particular user’s/movie’s ratings into account in the calculation of the RMSE. Additionally, [31] presents a third approach to the rating prediction problem: mirror filtering (MiFi), which filters on the user and movie similarity networks simultaneously. Although this is the best performing method in the analysis carried out in [31], it cannot be compared to our approaches, because it intertwines user and movie information and does not allow looking at each user or movie individually.

As for the method in [32], it uses a multi-graph CNN (MGCNN) to extract features from existing ratings. We train this CNN on the same 90,000-sample training dataset as before. The extracted features are then fed to a recurrent neural network (RNN) responsible for the score diffusion process.

In our user-based approach, the user similarity networks are built from the 90,000 ratings in the training set with $k=40$ . This results in networks with average out-degree of 38.3. Using the same 90-10 training-to-test ratio and $k=40$ , we also build directed movie similarity networks, whose average out-degree is 1.09. Because these networks are large and highly connected at some nodes, only the 1-hop median and max were considered. The numbers of parameters in the convolutional layer of the ReLU, the 1-hop median and the 1-hop max architectures are shown in Table II. The number of parameters of the 1-hop median/max only exceeds the number of parameters of the ReLU GNN by 2, because we regularized it by making ${\mathbf{w}}^{1}_{\ell}=\ldots={\mathbf{w}}^{F}_{\ell}={\mathbf{w}}_{\ell}$ the same for all features.

The 20 users with the largest number of ratings were chosen to assess GNN performance in the user-based approach; we report test RMSEs for the first 5 and the averages for the first 5, 10, 15, and all 20 in Table III. Localized activation functions consistently outperform the ReLU on average when the first 5, 10, 15 and all 20 users are considered, as well as in most of the individual user cases (at least one of the local architectures outperforms the ReLU architecture for every user). More than improving upon the ReLU, we note that our localized activation functions incur an increase in capacity given that they only have 2 more parameters than the ReLU GNN (cf. Table II). In Table IV we contrast the minimum RMSE achieved by either the max or median GNNs in the best data split (the data split where localized activations outperform the ReLU by the largest margin) for each individual user with the RMSEs obtained using the methods in [31] (user-based) and [32]. For [31], we report the smallest RMSE of the 6 filters that are trained. Our user-based method outperforms both [31] and [32] for all users except 450. Even then, the difference in the recorded RMSEs is minimal relative to discrepancies observed in other rows.

As for the movie-based approach, accuracies for the first 5 movies with most ratings, as well as averages for the first 5, 10, 15 and 20, are shown in Table V. Localized activation functions outperform the ReLU for all movies with as few as 2 additional trainable parameters (cf. Table II), which once again attests to the increased capacity of median and max GNNs. In Table VI, the results obtained on the best data split for each movie (the data split where localized activations outperform the ReLU by the largest margin) are compared with those obtained using the movie-based method with 3 filter taps in [31] and the MGCNN in [32] on these same splits. In both [31] and [32], all 1,682 movies were taken into account.

In Table VI, localized activation function GNNs outperform [31] for all movies and [32] for 3 of the 5 movies. Except for “Star Wars” (where we outperform both methods with at least a 10% reduction in RMSE) and “Contact” (where [32] outperforms our method by 5%), for all other movies the differences in RMSE are not as significant as they were in the user-based approach. The most pertinent observation here is that, in the movie-based approach, our method is able to deliver recommendations that are essentially as accurate as those provided by [31] and [32], but with less data and less computational complexity. Unlike [31] and [32], in both the user and movie-based approaches we do not need to use the entire 90,000-sample training dataset to train the GNN because we only look at a row/column of the rating matrix at a time. In this sense, another advantage of our architecture is the ability to offer more personalized and possibly on-demand movie recommendations.

V-D Citation networks

To evaluate the performance of the localized activation functions in a node classification setting, we compare max and median GNNs with GNN architectures using only ReLU activations on the Cora dataset. The Cora dataset consists of $N=2708$ scientific articles that pertain to $C=7$ different classes and make up the nodes of a citation network. Each article is described by a bag-of-words feature vector with $F_{\mbox{{\scriptsize in}}}=1433$ words. Given the articles’ feature vectors, the objective is to predict to which class each article belongs.

In our setup, feature vectors are interpreted as multi-feature graph signals and, during training, validation and test, all of them are fed to the GNN models. The GNNs generate intermediate features for all nodes, which are then interpreted as individual samples and processed through a fully connected layer mapping each node’s features to a class label between 1 and 7. In the training stage, we only predict labels for $140$ nodes; the validation and test nodes are a total of $300$ and $1000$ respectively. This data split is the same used in [5], and can be found at http://github.com/tkipf/pygcn.
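Because all nodes are fed to the GNN but only a subset carries training labels, the loss has to be restricted to those nodes. A minimal sketch of such a masked cross entropy is shown below; the function name and the random placeholder inputs are hypothetical, classes are indexed from 0 to 6 instead of 1 to 7, and the sketch makes no claim about how the authors' code implements this step.

```python
import numpy as np

def masked_cross_entropy(logits, labels, mask):
    """Average cross entropy over the nodes selected by the boolean `mask`."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(labels)), labels]        # log-prob of true class
    return -picked[mask].mean()

# Placeholder data with the Cora dimensions quoted above (values are random).
rng = np.random.default_rng(0)
N, C = 2708, 7
logits = rng.normal(size=(N, C))        # stand-in for the GNN output per node
labels = rng.integers(0, C, size=N)     # stand-in for the true classes
train_mask = np.zeros(N, dtype=bool)
train_mask[:140] = True                 # only 140 nodes contribute to the loss

loss = masked_cross_entropy(logits, labels, train_mask)
```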

$1$ -hop and $2$ -hop max and median GNNs with $L=1,F_{0}=F_{\mbox{{\scriptsize in}}},F_{1}=16$ , and $K_{1}=5$ were compared against two ReLU architectures, which we call $\mbox{ReLU}_{1}$ and $\mbox{ReLU}_{2}$ . $\mbox{ReLU}_{1}$ has the same hyperparameters as the localized activation function GNNs, while $\mbox{ReLU}_{2}$ has hyperparameters $L=4,F_{0}=F_{\mbox{{\scriptsize in}}},\{F_{i}\}_{i=1}^{4}=16$ , and $\{K_{i}\}_{i=1}^{4}=2$ . $\mbox{ReLU}_{2}$ was designed so as to provide for a fair comparison with [5]. Even if the GNN architecture in [5] only considers convolutional filters with $1$ -hop diffusions ( $K=2$ ), by setting the number of layers to $L=4$ we can force information exchanges at most $4$ -hops away, which is equivalent to having $K=5$ .
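The claim that four layers of 2-tap filters reach as far as a single 5-tap filter can be checked numerically. The sketch below ignores the nonlinearities between layers and uses arbitrary, hypothetical tap values on a 6-node path graph, so it only illustrates the reach of the composed linear filters and not the full equivalence of the architectures.

```python
import numpy as np

# Four hypothetical 2-tap filters h0*I + h1*S (coefficient values are arbitrary).
layers = [np.array([0.5, 1.0]) for _ in range(4)]

# Composing filters multiplies their polynomials in S, i.e. convolves the taps.
composed = layers[0]
for h in layers[1:]:
    composed = np.convolve(composed, h)

# Check the reach on a 6-node path graph 0-1-2-3-4-5.
n = 6
S = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
H = np.eye(n)
for h0, h1 in layers:
    H = (h0 * np.eye(n) + h1 * S) @ H

print(len(composed))       # number of taps of the composed filter
print(H[0, 4], H[0, 5])    # node 4 is reached from node 0, node 5 is not
```

The composed filter has 5 taps, so node 0 receives information from nodes up to 4 hops away and nothing from node 5.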

All models were trained by optimizing the cross entropy loss, for a total of $150$ epochs and with learning rate $0.005$ . The classification accuracy achieved by each architecture is presented in Table VII. Regardless of the number of hops or of the type of nonlinearity, the localized activation functions outperform both ReLU architectures by a significant margin, attesting to the value of encoding the graph structure in the computation of nonlinearities to improve GNN capacity.

VI Conclusions

We have presented GNN architectures that replace pointwise nonlinearities by activation functions with multiple local inputs. These activation functions perform a linear combination of signals observed at the output of nonlinear max or median operators in node neighborhoods of increasing resolution. By using either max or median filters, we achieved greater model capacity at the expense of only a slight increase in computational complexity, with the architectures still being linear in the number of nodes. As weighted linear combinations of nonlinear operators, they also endow GNNs with the ability to learn activation function parameters from data. We have additionally shown that the gradients of localized activation functions can be efficiently computed through backpropagation.

Median and max GNNs were compared with GNNs using only pointwise activation functions in four different problems, and we observed performance improvements across all of them. In source localization on synthetic graphs, localized activation functions improved GNN capacity and outperformed the traditional ReLU-based designs regardless of the number of hops and of the graph degree. In authorship attribution, the classification accuracy improved by 1.8% for Emily Brontë and 0.4% for Jane Austen, approaching a little more than 98% of texts correctly classified as having been written or not by the author. We have additionally proposed a user/movie-oriented movie recommendation system that is fit for online implementation and that improved upon both the conventional GNN implementation and two comparable methods. Finally, on the Cora dataset localized activation GNNs were shown to improve performance upon conventional GNNs with pointwise activation functions by at least 7%.

Bibliography

[1] L. Ruiz, F. Gama, A. G. Marques, and A. Ribeiro, “Median activation functions for graph neural networks,” in 44th IEEE Int. Conf. Acoust., Speech and Signal Process. Brighton, UK: IEEE, 12-17 May 2019, pp. 7440–7447.
[2] J. Bruna, W. Zaremba, A. Szlam, and Y. Le Cun, “Spectral networks and deep locally connected networks on graphs,” in 2nd Int. Conf. Learning Representations. Banff, AB: Assoc. Comput. Linguistics, 14-16 Apr. 2014, pp. 1–14.
[3] M. Henaff, J. Bruna, and Y. Le Cun, “Deep convolutional networks on graph-structured data,” arXiv:1506.05163v1 [cs.LG], 16 June 2015. [Online]. Available: http://arxiv.org/abs/1506.05163
[4] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in 30th Conf. Neural Inform. Process. Syst. Barcelona, Spain: Neural Inform. Process. Syst. Foundation, 5-10 Dec. 2016, pp. 3844–3858.
[5] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th Int. Conf. Learning Representations. Toulon, France: Assoc. Comput. Linguistics, 24-26 Apr. 2017, pp. 1–14.
[6] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” in 24th ACM SIGKDD Int. Conf. Knowledge Discovery & Data Mining. London, UK: Assoc. Computing Machinery, 19-23 Aug. 2018, pp. 974–983.
[7] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, “Convolutional neural network architectures for signals supported on graphs,” IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034–1049, Feb. 2019.
[8] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in 7th Int. Conf. Learning Representations. New Orleans, LA: Assoc. Comput. Linguistics, 6-9 May 2019, pp. 1–17.
