Constant Time Graph Neural Networks

Ryoma Sato; Makoto Yamada; Hisashi Kashima

arXiv:1901.07868 · cs.LG · March 30, 2022


TL;DR

This paper analyzes neighbor sampling for GNNs and shows that a constant number of samples guarantees approximation accuracy independently of graph size, enabling scalable analysis of large graphs.

Contribution

It provides the first theoretical guarantee of approximation error for GNNs using constant-time node sampling, independent of graph size.

Findings

1. Node sampling complexity depends only on error tolerance and confidence, not graph size.

2. Experimental validation confirms speed and accuracy of the proposed method.

3. The approach scales efficiently to large real-world graphs.

Abstract

The recent advancements in graph neural networks (GNNs) have led to state-of-the-art performances in various applications, including chemo-informatics, question-answering systems, and recommender systems. However, scaling up these methods to huge graphs, such as social networks and Web graphs, remains a challenge. In particular, the existing methods for accelerating GNNs either are not theoretically guaranteed in terms of the approximation error or incur at least a linear time computation cost. In this study, we reveal the query complexity of the uniform node sampling scheme for Message Passing Neural Networks, including GraphSAGE, graph attention networks (GATs), and graph convolutional networks (GCNs). Surprisingly, our analysis reveals that the complexity of the node sampling method is completely independent of the number of the nodes, edges, and neighbors of the input and depends…

Tables

Table 1. ✓ indicates that neighbor sampling approximates the network in constant time. ✗ indicates that no algorithm can approximate the network in constant time. ✓ in the Gradient column indicates that the error between the gradient of the approximated embedding and that of the exact embedding is also theoretically bounded. ✓^∗ requires an additional condition to approximate it in constant time.

Activation	GATs, GraphSAGE-{GCN, mean}: Embedding	GATs, GraphSAGE-{GCN, mean}: Gradient	GraphSAGE-pool	GCNs: Embedding	GCNs: Gradient
sigmoid / tanh	✓ (Thm. 1)	✓ (Thm. 4)	✗ (Thm. 4)	✓^∗ (Thm. 1)	✓^∗ (Thm. 4)
ReLU	✓ (Thm. 1)	✗ (Thm. 3)	✗ (Thm. 4)	✓^∗ (Thm. 1)	✗ (Thm. 3)
ReLU + normalization	✗ (Thm. 2)	✗ (Thm. 2)	✗ (Thm. 4)	✗ (Thm. 2)	✗ (Thm. 2)

Table 2. Node classification performance (F1-score) with various $r$. The highest score and scores within $\pm$ std dev of the highest score are marked in bold. The relative errors reported in parentheses are the approximation errors of node embeddings.

	Cora	Cora Full	PubMed	Citeseer
$r = 3$	0.8385 $\pm$ 0.0157 (rel. err. 15.8%)	0.6144 $\pm$ 0.0094 (rel. err. 19.1%)	0.8068 $\pm$ 0.0089 (rel. err. 18.3%)	0.7457 $\pm$ 0.0078 (rel. err. 6.8%)
$r = 5$	0.8527 $\pm$ 0.0129 (rel. err. 7.7%)	0.6406 $\pm$ 0.0192 (rel. err. 11.0%)	0.8205 $\pm$ 0.0093 (rel. err. 11.7%)	0.7469 $\pm$ 0.0073 (rel. err. 2.3%)
$r = 10$	0.8465 $\pm$ 0.0103 (rel. err. 3.0%)	0.6507 $\pm$ 0.0133 (rel. err. 4.7%)	0.8259 $\pm$ 0.0132 (rel. err. 5.6%)	0.7464 $\pm$ 0.0102 (rel. err. 0.6%)
$r = 20$	0.8540 $\pm$ 0.0078 (rel. err. 1.1%)	0.6663 $\pm$ 0.0183 (rel. err. 1.7%)	0.8272 $\pm$ 0.0159 (rel. err. 1.6%)	0.7512 $\pm$ 0.0055 (rel. err. 0.1%)
Exact computation	0.8503 $\pm$ 0.0176	0.6506 $\pm$ 0.0199	0.8264 $\pm$ 0.0131	0.7481 $\pm$ 0.0076

	Coauthor CS	Coauthor Physics	Amazon Computer	Amazon Photo
$r = 3$	0.8446 $\pm$ 0.0129 (rel. err. 37.3%)	0.9224 $\pm$ 0.0060 (rel. err. 33.3%)	0.7880 $\pm$ 0.0163 (rel. err. 24.0%)	0.8598 $\pm$ 0.0202 (rel. err. 22.0%)
$r = 5$	0.8817 $\pm$ 0.0063 (rel. err. 20.9%)	0.9429 $\pm$ 0.0060 (rel. err. 19.1%)	0.8009 $\pm$ 0.0143 (rel. err. 15.3%)	0.8893 $\pm$ 0.0091 (rel. err. 14.1%)
$r = 10$	0.9018 $\pm$ 0.0115 (rel. err. 8.5%)	0.9460 $\pm$ 0.0052 (rel. err. 8.2%)	0.8265 $\pm$ 0.0197 (rel. err. 8.0%)	0.9002 $\pm$ 0.0067 (rel. err. 7.4%)
$r = 20$	0.9018 $\pm$ 0.0077 (rel. err. 2.7%)	0.9475 $\pm$ 0.0061 (rel. err. 2.9%)	0.8342 $\pm$ 0.0143 (rel. err. 3.9%)	0.9009 $\pm$ 0.0214 (rel. err. 3.6%)
Exact computation	0.9072 $\pm$ 0.0078	0.9502 $\pm$ 0.0051	0.8324 $\pm$ 0.0190	0.8971 $\pm$ 0.0121

Equations

$M_{lvu}(z_{v}, z_{u}, e_{vu}, \theta)$

$U_{l}(z_{v}, h_{v}, \theta)$

$z_{v}^{(l)} = \max(\{\sigma(W^{(l)} z_{u}^{(l-1)} + b) \mid u \in \mathcal{N}(v)\})$

$\alpha_{vu}^{(l)}$

$\Pr[\|\hat{\mathcal{O}}_{z}(v, \varepsilon, \delta) - z_{v}\|_{2} \geq \varepsilon] \leq \delta$

$\Pr[\|\widehat{\frac{\partial z_{v}^{(L)}}{\partial \theta}} - \frac{\partial z_{v}^{(L)}}{\partial \theta}\|_{F} \geq \varepsilon] \leq \delta$

$\hat{z}_{v}^{(0)}$

$\hat{\alpha}_{vu}^{(l)}$

$\hat{h}_{v}^{(l)}$

$\hat{z}_{v}^{(l)}$

$\Pr[\|z_{v}^{(L)} - \hat{z}_{v}^{(L)}\|_{2} \geq \varepsilon] < \delta$

$z_{G} = \textrm{READOUT}(\{z_{i} \mid i \in \mathcal{V}\})$

$\Pr[|\bar{X} - E[\bar{X}]| \geq \varepsilon] \leq 2 \exp(-\frac{n \varepsilon^{2}}{2 B^{2}})$

$\Pr[\|\bar{x} - E[\bar{x}]\|_{2} \geq \varepsilon] \leq 2 d \exp(-\frac{n \varepsilon^{2}}{2 B^{2} d})$

$\Pr[|\bar{X}_{k} - E[\bar{X}]_{k}| \geq \frac{\varepsilon}{\sqrt{d}}] \leq 2 \exp(-\frac{n \varepsilon^{2}}{2 B^{2} d})$

$\Pr[\exists k \in \{1, 2, \dots, d\}, |\bar{X}_{k} - E[\bar{X}]_{k}| \geq \frac{\varepsilon}{\sqrt{d}}] \leq 2 d \exp(-\frac{n \varepsilon^{2}}{2 B^{2} d})$

$\|\bar{X} - E[\bar{X}]\|_{2} = \sqrt{\sum_{k=1}^{d} (\bar{X}_{k} - E[\bar{X}]_{k})^{2}} < \sqrt{d \cdot \frac{\varepsilon^{2}}{d}} = \varepsilon$

$\Pr[\|\bar{X} - E[\bar{X}]\|_{2} \geq \varepsilon] \leq 2 d \exp(-\frac{n \varepsilon^{2}}{2 B^{2} d})$

$\exists \varepsilon' > 0, \forall z_{v}, h_{v}, h_{v}', \theta, \|h_{v} - h_{v}'\|_{2} < \varepsilon' \Rightarrow \|U_{L}(z_{v}, h_{v}, \theta) - U_{L}(z_{v}, h_{v}', \theta)\|_{2} < \varepsilon$

$E[X_{k}] = \sum_{u \in \mathcal{N}(v)} M_{Lvu}(z_{v}^{(0)}, z_{u}^{(0)}, e_{vu}, \theta) = h_{v}^{(L)}$

$\|X_{k}\|_{2} < C$

$\exists \varepsilon' > 0, \forall z_{v}, h_{v}, z_{v}', h_{v}', \theta, \|[z_{v}, h_{v}] - [z_{v}', h_{v}']\|_{2} < \varepsilon' \Rightarrow \|U_{L}(z_{v}, h_{v}, \theta) - U_{L}(z_{v}', h_{v}', \theta)\|_{2} < \varepsilon$

$\exists r'^{(1)}, \dots, r'^{(L-1)}$ such that $\Pr[\|\hat{\mathcal{O}}_{z}^{(L-1)}(v) - z_{v}^{(L-1)}\|_{2} \geq \varepsilon'/2] \leq \delta/2$

$\tilde{h}_{v}^{(L)} = \frac{\textrm{deg}(v)}{r^{(L)}} \sum_{u \in S^{(L)}} M_{Lvu}(z_{v}^{(L-1)}, z_{u}^{(L-1)}, e_{vu}, \theta)$

$E[X_{k}] = \sum_{u \in \mathcal{N}(v)} M_{Lvu}(z_{v}^{(L-1)}, z_{x_{k}}^{(L-1)}, e_{v x_{k}}, \theta) = h_{v}^{(L)}$


Full text

Constant Time Graph Neural Networks

Ryoma Sato, Makoto Yamada, and Hisashi Kashima

Kyoto University, RIKEN AIP, Japan

Abstract.

The recent advancements in graph neural networks (GNNs) have led to state-of-the-art performances in various applications, including chemo-informatics, question-answering systems, and recommender systems. However, scaling up these methods to huge graphs, such as social networks and Web graphs, remains a challenge. In particular, the existing methods for accelerating GNNs either are not theoretically guaranteed in terms of the approximation error or incur at least a linear time computation cost. In this study, we reveal the query complexity of the uniform node sampling scheme for Message Passing Neural Networks, including GraphSAGE, graph attention networks (GATs), and graph convolutional networks (GCNs). Surprisingly, our analysis reveals that the complexity of the node sampling method is completely independent of the number of the nodes, edges, and neighbors of the input and depends only on the error tolerance and confidence probability while providing a theoretical guarantee for the approximation error. To the best of our knowledge, this is the first paper to provide a theoretical guarantee of approximation for GNNs within constant time. Through experiments with synthetic and real-world datasets, we investigated the speed and precision of the node sampling scheme and validated our theoretical results.

Keywords: graph neural networks, large-scale graphs

Journal: TKDD, vol. 16, no. 5, article 92, March 2022. DOI: 10.1145/3502733

CCS concepts: Computing methodologies → Machine learning algorithms; Theory of computation → Theory and algorithms for application domains; Theory of computation → Streaming, sublinear and near linear time algorithms; Information systems → World Wide Web

1. Introduction

Machine learning on graph structures has various applications, such as chemo-informatics (Gilmer et al., 2017; Zhang et al., 2018), question answering systems (Schlichtkrull et al., 2018; Park et al., 2019), and recommender systems (Ying et al., 2018; Wang et al., 2019b, a; Fan et al., 2019). Recently, graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2009), a class of machine learning models for graph data, have demonstrated state-of-the-art performance in various graph learning tasks. However, large-scale graphs, such as social networks and Web graphs, contain billions of nodes, and even a linear time computation cost per iteration is prohibitive. Therefore, the application of GNNs to huge graphs presents a challenge. Although PinSAGE (Ying et al., 2018) succeeded in applying GNNs to a Web-scale network using MapReduce, it still requires massive computational resources. There are several node sampling techniques for reducing GNN computation, such as GraphSAGE (Hamilton et al., 2017), FastGCN (Chen et al., 2018a), and LADIES (Zou et al., 2019), which are effective in practice. However, these techniques either are not theoretically guaranteed in terms of approximation error or incur at least a linear time computation cost.

In this study, we consider the problem of approximating the embedding of one node using GNNs in constant time as precisely as possible. For example, consider using GNNs to predict in real time (i.e., when the user accesses the service) whether a user of a social networking service will click an advertisement. The exact computation may be prohibitive because a user may have many neighbors. To make matters worse, neighboring users tend to have many more friends, because users with many friends are more likely to receive a link than users with few friends. Therefore, the average degree of the nodes processed by GNNs with two or more layers is higher than the average degree of the input graph. Another example is building a browser add-on that detects malicious web pages using GNNs, where a node represents a web page and an edge represents a hyperlink. In that case, we cannot pre-compute the embeddings of all web pages because the entire graph (the WWW graph) is massive. There are two obstacles to computing the embedding on the fly. First, a page may contain many links, as in the previous example. Second, retrieving the contents of neighboring web pages is expensive due to communication costs, and accessing too many web pages at once may be treated as a DoS attack. Therefore, we cannot use the information of many neighbors. A natural countermeasure is to sample neighboring nodes to reduce the computational cost (Hamilton et al., 2017). However, sampling may lose much information and substantially degrade performance, especially when the node has many neighbors.

We analyze the neighbor sampling technique to show that only a constant number of samples is needed to guarantee the approximation error for Message Passing Neural Networks (Gilmer et al., 2017), including GraphSAGE (Hamilton et al., 2017), GATs (Veličković et al., 2018), and GCNs (Kipf and Welling, 2017). It should be noted that the neighbor sampling technique was originally introduced as a heuristic, and no theoretical guarantees were provided. We prove PAC learning-like bounds on the approximation errors of neighbor sampling. Given an error tolerance $\varepsilon$ and confidence probability $1-\delta$, our analysis shows that the following estimates can be computed in constant time, which is completely independent of the number of nodes, edges, and neighbors.

• The estimate $\hat{{\bm{z}}}_{v}$ of the exact embedding ${\bm{z}}_{v}$ of a node $v$ , such that $\textnormal{Pr}[\|\hat{{\bm{z}}}_{v}-{\bm{z}}_{v}\|_{2}\geq\varepsilon]\leq\delta$ .

• The estimate $\widehat{\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}}$ of the exact gradient $\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}$ of the embedding ${\bm{z}}_{v}$ with respect to the network parameters ${\bm{\theta}}$ such that $\textnormal{Pr}[\|\widehat{\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}}-\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}\|_{F}\geq\varepsilon]\leq\delta$ .

This result enables us to deal with graphs irrespective of their size (e.g., the WWW graph). We demonstrate that the time complexity is optimal with respect to the error tolerance $\varepsilon$ when $L=1$. Our analysis can be applied to the prediction setting by treating the prediction as a $1$-dimensional embedding in $[0,1]$. In that case, the prediction with the approximation is correct if the exact computation predicts correctly with margin $\varepsilon$.

In addition to presenting positive results, we show that some existing GNNs, including the original GraphSAGE, cannot be approximated in constant time by any algorithm. These results indicate that constant-time approximability is not trivial but rather a characteristic of certain GNN architectures.

Furthermore, in addition to providing guarantees of approximation errors, our analysis also reveals which information each GNN model gives importance to in the light of computational complexity. The GNN models that can be approximated in constant time do not use the fine-grained information of the input graph, whereas the GNN architectures that cannot be approximated in constant time do use all the information of the input graph. These observations provide theoretical characteristics of GNN architectures.

Contributions:

• We analyze the neighbor sampling for GraphSAGE, graph attention networks (GATs), and GCNs to provide theoretical justification. Our analysis shows that the complexity is completely independent of the number of nodes, edges, and neighbors.

• We show that some existing GNNs, including the original GraphSAGE, cannot be approximated in constant time by any algorithm.

• We empirically validate our theorems using synthetic and real-world datasets.

2. Related Work

2.1. Graph Neural Networks (GNNs)

Originally, GNN-like models were proposed in the chemistry field (Sperduti and Starita, 1997; Baskin et al., 1997). Gori et al. (Gori et al., 2005) and Scarselli et al. (Scarselli et al., 2009) proposed graph learning models and named them graph neural networks (GNNs), which recursively apply the propagation function until convergence to obtain node embeddings. Bruna et al. (Bruna et al., 2014) and Defferrard et al. (Defferrard et al., 2016) took advantage of graph spectral analysis and graph signal processing to construct GNN models. GCNs (Kipf and Welling, 2017) approximate a spectral model by linear functions with respect to the graph Laplacian and reduce spectral models to spatial models. Gilmer et al. (Gilmer et al., 2017) proposed message passing neural networks (MPNNs), a general framework of GNNs using the message passing mechanism. GATs (Veličković et al., 2018) greatly improve the performance of GNNs by incorporating the attention mechanism. With the advent of GATs, various GNN models with the attention mechanism have been proposed (Wang et al., 2019b; Park et al., 2019).

GraphSAGE (Hamilton et al., 2017) is a GNN model which employs neighbor sampling to reduce the computational costs of training and inference. Neighbor sampling enables GraphSAGE to deal with large graphs. However, neighbor sampling was introduced without any theoretical guarantee, and the number of samples is chosen empirically. An alternative computationally efficient GNN would be FastGCN (Chen et al., 2018a), which employs layer-wise random node sampling to speed up training and inference. Huang et al. (Huang et al., 2018) further improved FastGCN by using an adaptive node sampling technique to reduce the variance of estimators. By virtue of the adaptive sampling technique, it reduces the computational costs and outperforms neighbor sampling in terms of classification accuracy and convergence speed. Note that the layer-wise sampling is designed for mini-batch training and is not suitable for our setting, where we make a prediction for a single node. Chen et al. (Chen et al., 2018b) proposed an alternative neighbor sampling technique, which uses historical activations to reduce the estimator variance. Additionally, it can achieve zero variance after a certain number of iterations. However, because it uses the same sampling technique as GraphSAGE to obtain the initial solution, the approximation error is not theoretically bounded until the $\Omega(n)$ -th iteration. ClusterGCN (Chiang et al., 2019) first clusters nodes into dense blocks and then aggregates node features within each block. LADIES (Zou et al., 2019) samples neighboring nodes using importance sampling for each layer to improve efficiency while keeping variance small. However, neither LADIES (Zou et al., 2019) nor ClusterGCN (Chiang et al., 2019) has a theoretical guarantee on the approximation error. Overall, the existing sampling techniques are effective in practice. 
However, they either are not theoretically guaranteed in terms of approximation error or incur at least a linear time computation cost to calculate the embedding of a node and its gradients. In this paper, we analyze the neighbor sampling method and derive a constant time complexity. We focus on neighbor sampling owing to its simplicity. Extending our analysis to other sampling methods, including layer-wise sampling (Chen et al., 2018a; Huang et al., 2018), is an important future direction.

2.2. Sublinear Time Algorithms

Sublinear time algorithms were originally proposed for property testing (Rubinfeld and Sudan, 1996). Sublinear property testing algorithms determine whether the input has some property $\pi$ or the input is sufficiently far from property $\pi$ with high probability in sublinear time with respect to the input size. Sublinear time approximation algorithms are an additional type of sublinear time algorithms. More specifically, they calculate a value sufficiently close to the exact value with high probability in sublinear time. Constant time algorithms are a subclass of sublinear time algorithms. They operate not only in sublinear time with respect to the input size but also in constant time. The proposed algorithm is classified as a constant time approximation algorithm.

Examples of sublinear time approximation algorithms include those for the minimum spanning tree in metric space (Czumaj and Sohler, 2004) and the minimum spanning tree with integer weights (Chazelle et al., 2005). Parnas and Ron (Parnas and Ron, 2007) proposed a method to convert distributed local algorithms into constant time approximation algorithms and used it to construct constant time algorithms for the minimum vertex cover problem and the dominating set problem. Nguyen and Onak (Nguyen and Onak, 2008) and Yoshida et al. (Yoshida et al., 2009) improved the complexities of these algorithms. Classic examples of sublinear time algorithms related to machine learning include clustering (Indyk, 1999; Mishra et al., 2001). Examples of recent studies in this stream include constant time approximation of the minimum value of quadratic functions (Hayashi and Yoshida, 2016) and constant time approximation of the residual error of the Tucker decomposition (Hayashi and Yoshida, 2017). Hayashi and Yoshida adopted simple sampling strategies to obtain a theoretical guarantee, as we do in this study. In this paper, we provide a theoretical guarantee for the approximation of GNNs within constant time for the first time.

3. Background

3.1. Notations

Let $G$ be the input graph, $\mathcal{V}=\{1,2,\dots,n\}$ the set of nodes, $n=|\mathcal{V}|$ the number of nodes, $\mathcal{E}$ the set of edges, $m=|\mathcal{E}|$ the number of edges, $\textrm{deg}(v)$ the degree of node $v$ , $\mathcal{N}(v)$ the set of neighbors of a node $v$ , ${\bm{x}}_{v}\in\mathbb{R}^{d_{0}}$ the feature vector associated with a node $v\in\mathcal{V}$ , and ${\bm{X}}=({\bm{x}}_{1},{\bm{x}}_{2},\dots,{\bm{x}}_{n})^{\top}\in\mathbb{R}^{n\times d_{0}}$ the stacked feature vectors, and let ⊤ denote the matrix transpose.

3.2. Node Embedding Model

We consider the node embedding problem using GNNs with the MPNN framework (Gilmer et al., 2017). This framework includes many GNN models, such as GraphSAGE and GCNs. Algorithm 1 shows the algorithm of MPNNs. We refer to the final embedding ${\bm{z}}_{v}^{(L)}$ as ${\bm{z}}_{v}$ . The aim of this study is to show that it is possible to approximate the embedding vector ${\bm{z}}_{v}$ and gradients $\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}$ in constant time with the given model parameters ${\bm{\theta}}$ and node $v$ .
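The MPNN recursion can be sketched in Python as follows. This is a minimal illustration, assuming the form $h_{v}^{(l)}=\sum_{u\in\mathcal{N}(v)}M_{lvu}(z_{v}^{(l-1)},z_{u}^{(l-1)},e_{vu},\theta)$ followed by $z_{v}^{(l)}=U_{l}(z_{v}^{(l-1)},h_{v}^{(l)},\theta)$ with $z_{v}^{(0)}=x_{v}$. The names `exact_embedding`, `mean_message`, and `tanh_update` are hypothetical, edge features and parameters are omitted for brevity, and the mean-style message with a tanh update is an illustrative choice rather than one of the models defined in this paper.

```python
import math

def mean_message(z_v, z_u, deg_v):
    # Illustrative message (assumption): the neighbor embedding scaled by 1/deg(v).
    return [x / deg_v for x in z_u]

def tanh_update(z_v, h_v):
    # Illustrative update (assumption): elementwise tanh of the aggregated message.
    return [math.tanh(x) for x in h_v]

def exact_embedding(v, layer, neighbors, features, message, update):
    """Exact z_v^(layer); visits every neighbor, so the cost grows with the degrees."""
    if layer == 0:
        return list(features[v])
    z_v = exact_embedding(v, layer - 1, neighbors, features, message, update)
    h_v = [0.0] * len(z_v)  # assumes messages have the same dimension as z_v
    for u in neighbors[v]:
        z_u = exact_embedding(u, layer - 1, neighbors, features, message, update)
        m = message(z_v, z_u, len(neighbors[v]))
        h_v = [a + b for a, b in zip(h_v, m)]
    return update(z_v, h_v)

# Toy triangle graph with one-dimensional features.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
features = {0: [1.0], 1: [0.0], 2: [2.0]}
z1 = exact_embedding(0, 1, neighbors, features, mean_message, tanh_update)
z2 = exact_embedding(0, 2, neighbors, features, mean_message, tanh_update)
```

The loop over `neighbors[v]` is the part whose cost depends on $\textrm{deg}(v)$, which is what an approximation has to avoid.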

3.3. Examples of Models

We introduce GraphSAGE-GCN, GraphSAGE-mean, GraphSAGE-pool, the graph convolutional networks (GCNs), and the graph attention networks (GATs) for completeness of our paper.

GraphSAGE-GCN (Hamilton et al., 2017): The message function and the update function of this model are

[TABLE]

where ${\bm{W}}^{(l)}$ is a parameter matrix and $\sigma$ is an activation function such as sigmoid and ReLU. GraphSAGE-GCN includes the center node itself in the set of adjacent nodes (i.e., $\mathcal{N}(v)\leftarrow\mathcal{N}(v)\cup\{v\}$ ).

GraphSAGE-mean (Hamilton et al., 2017): The message function and the update function of this model are

[TABLE]

where $[\cdot]$ denotes vertical concatenation.

GraphSAGE-pool (Hamilton et al., 2017): We do not formulate GraphSAGE-pool using the message function and the update function because it takes the maximum instead of the sum. The model of GraphSAGE-pool is

$z_{v}^{(l)}=\max(\{\sigma(W^{(l)}z_{u}^{(l-1)}+b)\mid u\in\mathcal{N}(v)\}).$
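To build intuition for why taking the maximum behaves differently from a sum under sampling, consider a node whose neighbors all have feature 0 except one neighbor with feature 1. The maximum is 1, but uniform sampling reports 0 unless that one neighbor is drawn. The sketch below computes the probability of missing it; the function name `miss_probability` is hypothetical, and the example covers only uniform sampling with replacement, not arbitrary algorithms.

```python
def miss_probability(num_neighbors, num_samples):
    """Probability that i.i.d. uniform sampling never draws one specific neighbor."""
    return (1.0 - 1.0 / num_neighbors) ** num_samples

# With a fixed sample budget, the miss probability approaches 1 as the degree grows.
few_neighbors = miss_probability(10, 100)
many_neighbors = miss_probability(10**6, 100)
```

For a mean-style aggregator, the same single neighbor shifts the exact value by only $1/\textrm{deg}(v)$, so missing it costs little.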

GCNs (Kipf and Welling, 2017): The message function and the update function of this model are

[TABLE]

GATs (Veličković et al., 2018): The message function and the update function of this model are

[TABLE]

Technically, MPNNs do not include the above formulation of GATs because it uses the embeddings of other neighboring nodes to calculate the attention value $\alpha_{vu}$ . However, we can apply the same argument as for MPNNs to GATs, and neighbor sampling can approximate GATs in constant time, as with other MPNNs.

3.4. Problem Formulation

Constant time algorithms may sound counterintuitive to readers unfamiliar with them because merely reading the input takes at least linear time. The browser add-on example from the introduction makes it easier to understand how constant time algorithms work. The WWW graph is so massive that we cannot know its entire structure. Nonetheless, we can run GNNs on the WWW graph without knowing the entire graph by retrieving the contents of pages on demand.

In the design of constant time algorithms, we need to specify a means by which they can access the input because they cannot read the entire input. We follow the standard convention of sublinear time algorithms (Parnas and Ron, 2007; Nguyen and Onak, 2008). We model an algorithm as an oracle machine that can generate queries regarding the input and we measure the complexity by query complexity. Algorithms can access the input only by querying the following oracles: (1) $\mathcal{O}_{\textrm{deg}}(v)$ : the degree of node $v$ , (2) $\mathcal{O}_{G}(v,i)$ : the $i$ -th neighbor of node $v$ , and (3) $\mathcal{O}_{\textrm{feature}}(v)$ : the feature of node $v$ . We assume that an algorithm can query the oracles in constant time per query.
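The oracle access model can be sketched as a thin wrapper that counts queries. This is a hypothetical illustration: the class name `GraphOracle`, the in-memory adjacency list behind it, and the 0-based neighbor index are assumptions, whereas in the intended setting each call would fetch data on demand.

```python
class GraphOracle:
    """Answers degree, neighbor, and feature queries while counting them."""

    def __init__(self, adjacency, features):
        self._adjacency = adjacency
        self._features = features
        self.queries = 0  # query complexity is measured by this counter

    def degree(self, v):
        # O_deg(v): the degree of node v.
        self.queries += 1
        return len(self._adjacency[v])

    def neighbor(self, v, i):
        # O_G(v, i): the i-th neighbor of node v (0-based in this sketch).
        self.queries += 1
        return self._adjacency[v][i]

    def feature(self, v):
        # O_feature(v): the feature vector of node v.
        self.queries += 1
        return self._features[v]

oracle = GraphOracle({0: [1, 2], 1: [0], 2: [0]}, {0: [1.0], 1: [0.0], 2: [2.0]})
d = oracle.degree(0)
first = oracle.neighbor(0, 0)
x = oracle.feature(first)
```

An algorithm written against this interface never sees the whole graph, and `oracle.queries` gives the quantity that the analysis bounds.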

Formally, given a node $v$ , we compute the following functions with the least number of oracle accesses: (1) $\mathcal{O}_{{\bm{z}}}(v)$ : the embedding ${\bm{z}}_{v}$ and (2) $\mathcal{O}_{g}(v)$ : the gradients of parameters $\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}$ . However, the exact computation of $\mathcal{O}_{z}$ and $\mathcal{O}_{g}$ requires at least $\text{deg}(v)$ queries to aggregate the features from the neighbor nodes. Moreover, the average degree of the nodes that GNNs process is higher than the average degree of the input graph, as we pointed out in the introduction. Thus, it is computationally expensive to execute the algorithm for a large and dense network, which motivates us to make the following approximations.

• $\hat{\mathcal{O}}_{z}(v,\varepsilon,\delta)$ : an estimate $\hat{{\bm{z}}}_{v}$ of ${\bm{z}}_{v}$ such that $\textnormal{Pr}[\|\hat{{\bm{z}}}_{v}-{\bm{z}}_{v}\|_{2}\geq\varepsilon]\leq\delta$ ,

• $\hat{\mathcal{O}}_{g}(v,\varepsilon,\delta)$ : an estimate $\widehat{\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}}$ of $\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}$ such that $\textnormal{Pr}[\|\widehat{\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}}-\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}\|_{F}\geq\varepsilon]\leq\delta$ ,

where $\varepsilon>0$ is the error tolerance, $1-\delta$ is the confidence probability, and $\|\cdot\|_{2}$ and $\|\cdot\|_{F}$ are the Euclidean and Frobenius norm, respectively. Under the fixed model structure (i.e., the number of layers $L$ , the message passing functions, and the update functions), we construct an algorithm that calculates $\hat{\mathcal{O}}_{z}$ and $\hat{\mathcal{O}}_{g}$ in constant time irrespective of the number of the nodes, edges, and neighbors of the input. However, it is impossible to construct a constant time algorithm without any assumption about the inputs, as shown in Section 5. Therefore, we make the following mild assumptions.

**Assumption 1:**

$\exists B\in\mathbb{R}$ s.t. $\|{\bm{x}}_{i}\|_{2}\leq B$ , $\|{\bm{e}}_{iu}\|_{2}\leq B$ , and $\|{\bm{\theta}}\|_{2}\leq B$ .

**Assumption 2:**

$\text{deg}(i)M_{liu}$ and $U_{l}$ are uniformly continuous in any bounded domain.

**Assumption 3:**

(Only for gradient computation) $\text{deg}(i)DM_{liu}$ and $DU_{l}$ are uniformly continuous in any bounded domain, where $D$ denotes the Jacobian operator.

Intuitively, the first assumption is needed to bound the additive error, and the second and third assumptions are needed to prohibit the amplification of errors in the message passing phase. Furthermore, we derive the query complexity of neighbor sampling when the message functions and update functions satisfy the following assumptions.

**Assumption 4:**

$\exists K\in\mathbb{R}$ s.t. $\text{deg}(i)M_{liu}$ and $U_{l}$ are $K$ -Lipschitz continuous in any bounded domain.

**Assumption 5:**

(Only for gradient computation) $\exists K^{\prime}\in\mathbb{R}$ s.t. $\text{deg}(i)DM_{liu}$ and $DU_{l}$ are $K^{\prime}$ -Lipschitz continuous in any bounded domain.

4. Main Results

4.1. Constant Time Embedding Approximation

We describe the construction of a constant time approximation algorithm based on neighbor sampling, which approximates the embedding ${\bm{z}}_{v}$ with an absolute error of at most $\varepsilon$ with probability at least $1-\delta$ . We construct the algorithm recursively, layer by layer, by sampling $r^{(l)}$ neighboring nodes in layer $l$ . We refer to the algorithm that calculates the estimate of the embeddings in the $l$ -th layer ${\bm{z}}^{(l)}$ as $\hat{\mathcal{O}}_{z}^{(l)}$ $(l=1,\dots,L)$ . Algorithm 2 presents the pseudo code. Here, $\hat{\mathcal{O}}_{z}^{(l-1)}(v\leftarrow u)$ represents calling the function $\hat{\mathcal{O}}_{z}^{(l-1)}$ with the same parameters as the current function, except for $v$ , which is replaced by $u$ . In the following, we demonstrate the theoretical properties of Algorithm 2. The following theorem shows that the approximation error of Algorithm 2 is bounded by $\varepsilon$ with probability at least $1-\delta$ . It is proved by applying Hoeffding’s inequality (Hoeffding, 1963) recursively. Because the number of sampled nodes depends only on $\varepsilon$ and $\delta$ and is independent of the number of nodes, edges, and neighbors of the input, Algorithm 2 operates in constant time. We stress that the only source of randomness in the analysis of Theorem 1 is the sampling distribution of neighboring nodes. This is irrelevant to the generation process of the input graph. Theorem 1 holds even if the features of the input graph are skewed and correlated with each other.
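The recursive construction can be sketched in Python as follows. This is a hedged illustration rather than the paper's Algorithm 2: it assumes the estimator $\tilde{h}_{v}^{(l)}=\frac{\textrm{deg}(v)}{r^{(l)}}\sum_{u\in S^{(l)}}M_{lvu}(\cdot)$ with $S^{(l)}$ drawn i.i.d. uniformly with replacement, and the names `sampled_embedding`, `mean_message`, and `tanh_update`, as well as the illustrative message and update functions, are hypothetical.

```python
import math
import random

def mean_message(z_v, z_u, deg_v):
    # Illustrative message (assumption): the neighbor embedding scaled by 1/deg(v).
    return [x / deg_v for x in z_u]

def tanh_update(z_v, h_v):
    # Illustrative update (assumption): elementwise tanh of the aggregated message.
    return [math.tanh(x) for x in h_v]

def sampled_embedding(v, layer, sizes, degree, neighbor, feature, message, update, rng):
    """Estimate z_v^(layer); sizes[l - 1] plays the role of r^(l)."""
    if layer == 0:
        return list(feature(v))
    rest = (sizes, degree, neighbor, feature, message, update, rng)
    z_v = sampled_embedding(v, layer - 1, *rest)
    deg, r = degree(v), sizes[layer - 1]
    total = [0.0] * len(z_v)  # assumes messages have the same dimension as z_v
    if deg == 0:
        return update(z_v, total)
    for _ in range(r):
        u = neighbor(v, rng.randrange(deg))  # uniform neighbor, with replacement
        z_u = sampled_embedding(u, layer - 1, *rest)
        m = message(z_v, z_u, deg)
        total = [a + b for a, b in zip(total, m)]
    h_v = [deg * t / r for t in total]  # rescale the sampled sum by deg(v) / r
    return update(z_v, h_v)

# Star graph: node 0 has 1000 neighbors, and every node carries the feature [2.0].
adj = {0: list(range(1, 1001))}
adj.update({u: [0] for u in range(1, 1001)})
feat = {u: [2.0] for u in range(1001)}
lookups = []

def feature(v):
    lookups.append(v)  # record every feature query
    return feat[v]

z_hat = sampled_embedding(
    0, 1, [5], lambda v: len(adj[v]), lambda v, i: adj[v][i],
    feature, mean_message, tanh_update, random.Random(0),
)
```

In this toy run, the number of feature lookups is determined by `sizes` alone, not by the 1000 neighbors of node 0.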

Theorem 1.

For all $\varepsilon>0,1>\delta>0$ , there exists $r^{(l)}(\varepsilon,\delta)~{}(l=1,\dots,L)$ such that for all inputs satisfying Assumptions 1 and 2, the following property holds true:

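$\textnormal{Pr}[\|\hat{\mathcal{O}}_{z}(v,\varepsilon,\delta)-{\bm{z}}_{v}\|_{2}\geq\varepsilon]\leq\delta.$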

Proof Sketch.

We prove the theorem by performing mathematical induction on the number of layers $L$ . First, we prove that the norms of the embeddings are bounded under the assumptions. In the base case, we can bound the sampling error by Hoeffding’s inequality owing to the bounded norms. In the inductive step, the error comes from the previous layer and the sampling error in the current layer. We bound the former by the induction hypothesis and the latter by Hoeffding’s inequality. ∎

All proofs are available in Section 10. Note that the i.i.d. assumption of uniform sampling is crucial in Hoeffding’s inequality used in our analysis. If a correlated sampling method holds a Hoeffding-style concentration inequality, our analyses can be extended to such a sampling method. We leave extending our analysis to other sampling methods for future work.
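The recursive structure of the estimator can be sketched in a few lines of Python. This is a minimal illustration, not Algorithm 2 itself: it assumes a mean aggregator followed by a linear map and a sigmoid as the update function, and the names `estimate_embedding`, `neighbors`, `features`, `weights`, and `sample_sizes` are hypothetical. Sampling is uniform with replacement, so the samples are i.i.d. as the analysis requires.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def estimate_embedding(v, layer, neighbors, features, weights, sample_sizes, rng):
    """Estimate the layer-`layer` embedding of node v by uniform neighbor sampling."""
    if layer == 0:
        return features[v]  # layer 0 is the raw input feature
    r = sample_sizes[layer - 1]
    # Draw r neighbors uniformly at random with replacement (i.i.d. samples).
    sampled = rng.choice(neighbors[v], size=r, replace=True)
    # Recurse: every sampled neighbor's previous-layer embedding is itself an estimate.
    h = np.mean(
        [estimate_embedding(u, layer - 1, neighbors, features, weights, sample_sizes, rng)
         for u in sampled],
        axis=0,
    )
    # Assumed update function: linear map followed by a sigmoid.
    return sigmoid(weights[layer - 1] @ h)

# Toy usage: a triangle whose nodes all share one feature vector, so the
# sampled estimate coincides with the exact embedding regardless of the draws.
neighbors = [[1, 2], [0, 2], [0, 1]]
features = np.tile(np.array([1.0, -1.0]), (3, 1))
weights = [np.eye(2), np.eye(2)]
rng = np.random.default_rng(0)
estimate = estimate_embedding(0, 2, neighbors, features, weights, [4, 4], rng)
exact = sigmoid(sigmoid(np.array([1.0, -1.0])))
```

The number of feature lookups in this sketch is the product of the entries of `sample_sizes`, which does not depend on the size of the graph.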

Next, we provide the complexity when the functions are Lipschitz continuous. This theorem shows a rough estimate of a sufficient number of the sampling size $r$ . We confirm that this theoretical sampling rate is valid in the experiments.

Theorem 2.

Under Assumptions 1 and 4, $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ and $r^{(1)},\dots,r^{(L-1)}=O(\frac{1}{\varepsilon^{2}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}))$ are sufficient, and the query complexity of Algorithm 2 is $O(\frac{1}{\varepsilon^{2L}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta})^{L-1}\log\frac{1}{\delta})$ .

Proof Sketch.

We prove the theorem by performing mathematical induction on the number of layers $L$ as in the previous theorem. Thanks to the Lipschitz assumption, we can quantitatively bound the error expansion in this case. The complete proof is available in Section 10. ∎
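To get a feel for the orders of magnitude in Theorem 2, the sketch below computes a sample size from one standard coordinate-wise Hoeffding bound with a union bound over dimensions: if the vectors have two-norm at most $B$ in $d$ dimensions, then $n\geq\frac{2dB^{2}}{\varepsilon^{2}}\ln\frac{2d}{\delta}$ samples suffice for the empirical mean to be within $\varepsilon$ of its expectation with probability at least $1-\delta$ . The constants here are an assumption of this sketch and need not match those hidden in the $O(\cdot)$ notation of the theorem. The helper names `hoeffding_sample_size` and `query_count` are hypothetical.

```python
import math

def hoeffding_sample_size(eps, delta, bound=1.0, dim=1):
    """Samples sufficient for error < eps with probability >= 1 - delta.

    Uses 2 * dim * exp(-n * eps**2 / (2 * dim * bound**2)) <= delta,
    i.e. Hoeffding's inequality per coordinate plus a union bound.
    """
    return math.ceil(2.0 * dim * bound ** 2 / eps ** 2 * math.log(2.0 * dim / delta))

def query_count(sample_sizes):
    """Number of input features read by the recursion: the product of r^(l)."""
    return math.prod(sample_sizes)

n_coarse = hoeffding_sample_size(0.1, 0.05)   # one-dimensional, B = 1
n_fine = hoeffding_sample_size(0.05, 0.05)    # halving eps roughly quadruples n
two_layer_queries = query_count([100, 100])
```

Halving $\varepsilon$ multiplies the per-layer sample size by about four, and the product over layers is what produces the $\varepsilon^{-2L}$ factor in the query complexity.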

Although the complexity is exponential with respect to the number of layers, this is nonetheless beneficial because the number of layers is usually small in practice. For example, the original GCN employs two layers (Kipf and Welling, 2017). It is noteworthy that, although most constant time algorithms proposed in the literature also depend on some parameters exponentially, they have nonetheless been proved to be effective. For example, the constant time algorithms of Yoshida et al. (Yoshida et al., 2009) for the maximum matching problem and minimum set cover problem use $d^{O(\frac{1}{\varepsilon^{2}})}(\frac{1}{\varepsilon})^{O(\frac{1}{\varepsilon})}$ and $\frac{1}{\varepsilon^{2}}(st)^{O(s)}$ queries, respectively, where $d$ is the maximum degree and $s$ is the number of elements in a set, and each element is in at most $t$ sets. The important point here is that the complexity is completely independent of the size of the inputs, which is desirable, especially when the input size can be very large. In addition, we show that the query complexity of Algorithm 2 is optimal with respect to $\varepsilon$ if the number of layers is one. In other words, a one-layer model cannot be approximated in $o(\frac{1}{\varepsilon^{2}})$ time by any algorithm.

Theorem 3.

Under Assumptions 1 and 4 and $L=1$ , the time complexity of Algorithm 2 in Theorem 2 is optimal with respect to the error tolerance $\varepsilon$ .

Proof Sketch.

We reduce the problem to estimating a parameter of a simple distribution. We prove that it is impossible to determine the parameter in $o(\frac{1}{\varepsilon^{2}})$ queries by Chazelle’s Lemma (Chazelle et al., 2005). The complete proof is available in Section 10. ∎

The optimality when $L\geq 2$ is an open problem.

4.2. Constant Time Gradient Approximation

We show that the neighbor sampling can guarantee the approximation errors of the gradient of embeddings with respect to the model parameters. Let $\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}}$ be the gradient of the embedding ${\bm{z}}_{v}$ with respect to the model parameter ${\bm{\theta}}$ , i.e., $(\frac{\partial{\bm{z}}_{v}}{\partial{\bm{\theta}}})_{ijk}=\frac{\partial{\bm{z}}_{vi}}{\partial{\bm{\theta}}_{jk}}$ . The following theorem shows that an estimate of the gradient of the embedding with respect to parameters with an absolute error of at most $\varepsilon$ and probability $1-\delta$ can be calculated by running $\hat{\mathcal{O}}^{(L)}_{z}(v,\varepsilon,\delta)$ and calculating the gradient of the obtained estimate of the embedding.

Theorem 4.

For all $\varepsilon>0,1>\delta>0$ , there exists $r^{(l)}(\varepsilon,\delta)~{}(l=1,\dots,L)$ such that for all inputs satisfying Assumptions 1, 2, and 3, the following property holds true:

[TABLE]

where $\widehat{\frac{\partial{\bm{z}}^{(L)}_{v}}{\partial{\bm{\theta}}}}$ is the gradient of $\hat{{\bm{z}}}^{(L)}_{v}$ , which is obtained by $\hat{\mathcal{O}}^{(L)}_{z}(v,\varepsilon,\delta)$ , with respect to ${\bm{\theta}}$ .

Proof Sketch.

The basic strategy is common with Theorem 1. However, the derivation becomes more challenging and complicated due to additional terms in the backward path. The complete proof is available in Section 10. ∎

Next, we provide the complexity when the functions are Lipschitz continuous.

Theorem 5.

Under Assumptions 1, 4, and 5, $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ and $r^{(1)},\dots,r^{(L-1)}=O(\frac{1}{\varepsilon^{2}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}))$ are sufficient, and the gradient of the embedding with respect to parameters can be approximated with an absolute error of at most $\varepsilon$ and probability $1-\delta$ in $O(\frac{1}{\varepsilon^{2L}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta})^{L-1}\log\frac{1}{\delta})$ time.

Proof Sketch.

The Lipschitz assumption enables us to quantitatively bound the error expansion. The proof is available in Section 10. ∎

4.3. Constant Time Approximation of Graph Attention Networks

Technically speaking, MPNNs do not include GATs, because these networks use embeddings of other neighboring nodes to calculate the attention value. However, our analysis can be naturally extended to GATs, and we can approximate GATs in constant time by neighbor sampling. This is surprising because GATs may pay considerable attention to some nodes, and uniform sampling may fail to sample these nodes. However, it can be shown that our assumptions, which do not seem related to this issue, prohibit this situation.

To be precise, the following proposition holds true.

Proposition 6.

If Assumption 1 holds true and $\sigma$ is Lipschitz continuous, and we take $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ and $r^{(1)},\dots,r^{(L-1)}=O(\frac{1}{\varepsilon^{2}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}))$ samples, and let

[TABLE]

Then, the following property holds true.

[TABLE]

Proof Sketch.

The main difference between GCN and GAT is that GAT can put adaptive weights to neighboring nodes. Although large weights can blow up the approximation error, the magnitude of the weights in GAT is bounded under the assumptions. Therefore, the embeddings are also bounded, and we can adopt the same strategy as Theorem 1. The proof is available in Section 10. ∎

5. Inapproximability

In this section, we show that some existing GNNs cannot be approximated in constant time. The following propositions state that these models cannot be approximated in constant time by either neighbor sampling or any other algorithm. In other words, for any algorithm that operates in constant time, there exist an error tolerance $\varepsilon$ , a confidence probability $1-\delta$ , and a counterexample input such that the approximation error for the input is more than $\varepsilon$ with probability more than $\delta$ . This indicates that the application of an approximation method to these models requires close supervision because the obtained embedding may be significantly different from the exact embedding. These results also indicate that the positive results we have shown so far are not vacuous but non-trivial properties.

The following proposition indicates that Assumption 1 is necessary for constant-time approximation.

Proposition 1.

If $\|{\bm{x}}_{i}\|_{2}$ or $\|{\bm{\theta}}\|_{F}$ is not bounded, even under Assumption 2, the embeddings of GraphSAGE-GCN cannot be approximated with arbitrary precision and probability in constant time.

Proof Sketch.

We prove this proposition by constructing a concrete counter-example. The proof is available in Section 10. ∎

All proofs in this section are based on constructing counterexamples and are available in Section 10. The following proposition shows that the original GraphSAGE-GCN cannot be approximated in constant time.

Proposition 2.

Even under Assumption 1, the embeddings and gradients of GraphSAGE-GCN with ReLU activation and normalization cannot be approximated with arbitrary precision and probability in constant time.

The following proposition indicates that the activation function is important for constant-time approximability because the gradient of GraphSAGE-GCN with sigmoid activation can be approximated within constant time by neighbor sampling (Theorem 4).

Proposition 3.

Even under Assumptions 1 and 2, the gradients of GraphSAGE-GCN with ReLU activation cannot be approximated with arbitrary precision and probability in constant time.

The following two propositions state that GraphSAGE-pool and GCN cannot be approximated in constant time even under Assumptions 1, 2, and 3.

Proposition 4.

Even under Assumptions 1, 2, and 3, the embeddings of GraphSAGE-pool cannot be approximated with arbitrary precision and probability in constant time.

Proposition 5.

Even under Assumptions 1, 2, and 3, the embeddings of GCN cannot be approximated with arbitrary precision and probability in constant time.

5.1. Constant Time Approximation for GCNs

Because of the inapproximability results above, GCNs cannot be approximated in constant time in general. However, GCNs can be approximated in constant time if the input graph satisfies the following property.

Assumption 6.

There exists a constant $C\in\mathbb{R}$ such that, for any input graph $G=(\mathcal{V},\mathcal{E})$ and node $v,u\in\mathcal{V}$ , the ratio of $\text{deg}(v)$ to $\text{deg}(u)$ is at most $C$ (i.e., $\text{deg}(v)/\text{deg}(u)\leq C$ ).

Assumption 6 prohibits input graphs that have a skewed degree distribution. GCNs require Assumption 6 because the norm of the embedding is not bounded, and the influence of anomaly nodes with low degrees is significant without it.
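As a small illustration of what Assumption 6 rules out, the sketch below computes the largest degree ratio of a single graph from its degree sequence. The helper `max_degree_ratio` is hypothetical, and it assumes every node has at least one neighbor so that the ratio is well defined. Assumption 6 asks for one constant $C$ that bounds this quantity over all admissible inputs, so a family such as star graphs, whose ratio grows with the number of leaves, is excluded.

```python
def max_degree_ratio(degrees):
    """Largest value of deg(v) / deg(u) over all node pairs of one graph."""
    if not degrees or min(degrees) <= 0:
        raise ValueError("every node must have a positive degree")
    return max(degrees) / min(degrees)

clique_ratio = max_degree_ratio([4] * 5)             # clique K_5: all degrees equal
star_ratio = max_degree_ratio([1000] + [1] * 1000)   # star with 1000 leaves
```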

Proposition 6.

If Assumptions 1 and 6 hold true, $\sigma$ is Lipschitz continuous, and we take $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ and $r^{(1)},\dots,r^{(L-1)}=O(\frac{1}{\varepsilon^{2}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}))$ samples, and let

[TABLE]

Then, the following property holds true.

[TABLE]

Proof Sketch.

Although the norms of node embeddings of GCN may increase arbitrarily, they are bounded under Assumption 6. The same strategy as Theorem 1 can be used with a slight modification. The proof is available in Section 10. ∎

It should be noted that the GraphSAGE-pool cannot be approximated in constant time even under Assumption 6.

5.2. Discussion on Negative Results

These negative results reveal which information each GNN model gives importance to. For example, suppose there is a high-degree node that neighbors anomaly low-degree nodes in the input graph. GraphSAGE-GCN can be approximated in constant time even in this case according to Theorem 1, but GCNs cannot be approximated according to Proposition 5. GraphSAGE-GCN can be approximated in constant time because the exact computation of this model can be estimated by only a fraction of the input graph. This fraction hardly contains the anomaly nodes. Nevertheless, the estimation from the fraction is accurate. It means that the anomaly nodes do not perturb the exact computation of this model. In contrast, GCNs cannot be approximated in constant time because the anomaly nodes change the exact computation drastically. Without using anomaly nodes as well as majority nodes, the estimation becomes inaccurate. These observations tell us the characteristics of these models. If the anomaly nodes are important for estimating the label (e.g., fraud detection), we should use GCNs because GCNs can take anomaly nodes into consideration in model inference, whereas GraphSAGE-GCN ignores anomalies. If we want the model to be robust to anomaly nodes, we should use GraphSAGE-GCN. These observations are valid even when we do not use approximations, though the starting point was the time complexity of approximation methods. This type of observation can be applied to other models such as GraphSAGE-pool and GATs as well.

Besides, these results reveal the graph problems that GNNs cannot solve via the lens of time complexity. For example, let a node be positive if there exists at least one neighboring node with feature $1$ , and negative otherwise. This problem is not solvable in sublinear time because a star graph with no “ $1$ ” nodes and a star graph with only one “ $1$ ” leaf node are counterexamples. Therefore, GraphSAGE-GCNs, GCNs, and GATs cannot solve this problem. If these models can solve this problem, we can solve this problem in constant time by neighbor sampling, which leads to a contradiction. This result means that these models cannot simulate the pooling operation. By contrast, GraphSAGE-pool can solve this problem owing to the pooling operator, and GraphSAGE-pool indeed requires at least linear time computation according to Proposition 4.
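The star-graph argument can be made concrete for the special case of uniform sampling. If exactly one of $n$ leaves carries the feature $1$ and the center draws $r$ leaves uniformly with replacement, all draws miss that leaf with probability $(1-1/n)^{r}$ , which tends to $1$ as $n$ grows for any fixed $r$ . The function name `miss_probability` is hypothetical, and this sketch only covers uniform sampling, whereas the statement above holds for any sublinear-time algorithm.

```python
def miss_probability(num_leaves, num_samples):
    """Probability that uniform sampling with replacement never hits one given leaf."""
    return (1.0 - 1.0 / num_leaves) ** num_samples

small_graph = miss_probability(100, 100)        # sample size comparable to the graph
large_graph = miss_probability(10 ** 6, 100)    # fixed sample size, huge graph
```

With a fixed budget of samples, the two star graphs become indistinguishable to the sampler as the graph grows, which is why the label cannot be decided in constant time.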

6. Graph Embedding

Our analysis can be extended to graph embedding, where we embed an entire graph instead of a node. Graph embeddings can be calculated by aggregating the embeddings of all nodes (Gilmer et al., 2017):

[TABLE]

We adopt the mean of the feature vectors of the nodes as the readout function (i.e., ${\bm{z}}_{G}=\frac{1}{n}\sum_{i\in V}{\bm{z}}_{i}$ ). We cannot calculate the embeddings of all nodes in constant time even if each calculation is done in constant time because there are $n$ nodes. We adopt the sampling strategy here as well. We sample some nodes in a uniformly random manner, compute their feature vectors in constant time using Algorithm 2, and calculate their empirical mean. The errors of sampling and Algorithm 2 are bounded by Lemma 2 and Theorem 1, respectively. Therefore, we sample a sufficiently large (but independent of the graph size) number of nodes and call Algorithm 2 with sufficiently small $\varepsilon$ and $\delta$ . Then, the estimate is arbitrarily close to the exact embedding of $G$ with an arbitrary probability.

7. Experiments

We validate our theoretical results through numerical experiments. Namely, we answer the following questions through experiments.

•

Q1: How fast is the constant time approximation algorithm?

•

Q2: Does the neighbor sampling accurately approximate the embeddings of GraphSAGE-GCN without normalization (Theorem 1), whereas it cannot approximate the original one (Proposition 2)?

•

Q3: Does the neighbor sampling accurately approximate the gradients of GraphSAGE-GCN with sigmoid activation (Theorem 4), whereas it cannot approximate that with ReLU activation (Proposition 3)?

•

Q4: Is the theoretical rate of the approximation error of Algorithm 2 tight?

•

Q5: Does the neighbor sampling fail to approximate GCNs when the degree distribution is skewed (Proposition 5)? Does it succeed when the degree distribution is flat (Assumption 6)?

•

Q6: Does the neighbor sampling work efficiently for real data?

•

Q7: How does the neighbor sampling affect downstream tasks?

We use cliques for Q1 to Q4 for the following reasons. First, sampling is effective when the input graph is dense. Cliques are the densest graphs. We use them to highlight the effectiveness of the node sampling. Second, cliques are important structures in practice. For example, deep sets (Zaheer et al., 2017) are a popular machine learning model for sets. Equivariant deep sets can be seen as GNNs that run on cliques. Self attention layers (Vaswani et al., 2017) can also be seen as graph attention networks that run on cliques.

7.1. Speedup Factors (Q1)

We measure the speed of exact computation and the neighbor sampling of two-layer GraphSAGE-GCN and two-layer GATs. We initialize parameters using the i.i.d. standard multivariate normal distribution. The input graph is a clique $K_{n}$ . We use ten-dimensional vectors from the i.i.d. standard multivariate normal distribution as the node features. We take $r^{(1)}=r^{(2)}=100$ samples. For each method and $n=2^{7},2^{8},\dots,2^{19}$ , we run inference $10$ times and measure the average time consumption and standard deviation. The speed is evaluated with a single core of Intel Xeon CPU E5-2690. Figure 1 (a) plots the speed of these methods as the number of nodes increases. This shows that the neighbor sampling is several orders of magnitude faster than the exact computation when the graph size is large.

7.2. Effect of Normalization (Q2)

We use the original one-layer GraphSAGE-GCN (with ReLU activation and normalization) and one-layer GraphSAGE-GCN with ReLU activation. The input graph is a clique $K_{n}$ , the features of which are ${\bm{x}}_{1}=(1,0)^{\top}$ and ${\bm{x}}_{i}=(0,1/n)^{\top}~{}(i\neq 1)$ , and the weight matrix is an identity matrix ${\bm{I}}_{2}$ . We use $r^{(1)}=5,30,$ and $100$ as the sample size. If a model can be approximated in constant time, the approximation error goes to zero as the sample size increases, even if the graph size reaches infinity. Figure 1 (b) illustrates the approximation errors of both models. The approximation error of the original GraphSAGE-GCN converges to approximately $0.75$ even if the sample size increases. In contrast, the approximation error without normalization becomes smaller as the sample size increases. This is consistent with Theorem 1 and Proposition 2.
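The setup above is small enough to replay numerically. The following sketch makes several assumptions that the text does not spell out: the aggregate is the mean over all $n$ nodes of the clique, sampling is uniform with replacement, and the reported value is the median two-norm error over a few trials. The function names are hypothetical.

```python
import numpy as np

def clique_features(n):
    """x_1 = (1, 0) and x_i = (0, 1/n) for all other nodes."""
    x = np.zeros((n, 2))
    x[0] = (1.0, 0.0)
    x[1:, 1] = 1.0 / n
    return x

def embed(h, normalize):
    """One layer with identity weights and ReLU, optionally L2-normalized."""
    z = np.maximum(h, 0.0)
    return z / np.linalg.norm(z) if normalize else z

def median_error(n, r, normalize, trials=21, seed=0):
    rng = np.random.default_rng(seed)
    x = clique_features(n)
    exact = embed(x.mean(axis=0), normalize)
    errors = []
    for _ in range(trials):
        idx = rng.integers(0, n, size=r)  # uniform sampling with replacement
        errors.append(np.linalg.norm(embed(x[idx].mean(axis=0), normalize) - exact))
    return float(np.median(errors))

err_normalized = median_error(100_000, 100, normalize=True)
err_plain = median_error(100_000, 100, normalize=False)
```

Under these assumptions the sampler almost never draws node $1$ , so the normalized estimate is $(0,1)^{\top}$ while the exact normalized embedding is close to $(1/\sqrt{2},1/\sqrt{2})^{\top}$ ; the error stays near $0.77$ , whereas the error without normalization is of order $1/n$ .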

7.3. Effect of Activation functions (Q3)

We examine the approximation errors of the gradients using the one-layer GraphSAGE-GCN with ReLU and sigmoid activation. The input graph is a clique $K_{n}$ , the features of which are ${\bm{x}}_{1}=(1,2)^{\top}$ and ${\bm{x}}_{i}=(1,1)^{\top}~{}(i\neq 1)$ , and the weight matrix is $((-1,1))$ . We use $r^{(1)}=5,30,$ and $100$ as the sample size. Figure 1 (c) illustrates the approximation error of both models. The approximation error with ReLU activation converges to approximately $1.0$ , even if the sample size increases. In contrast, the approximation error with sigmoid activation becomes smaller as the sample size increases. This is consistent with Theorem 4 and Proposition 3.
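The mechanism behind this gap can be seen in a short numerical sketch. It assumes a mean aggregator over all $n$ nodes, uniform sampling with replacement, the convention that the derivative of ReLU at zero is zero, and the two-norm of the gradient with respect to the weight row as the error measure. These choices are assumptions of the sketch, so the plateau it produces need not equal the value read off Figure 1 (c). The function names are hypothetical.

```python
import numpy as np

def gradient(h, activation):
    """Gradient of activation(w . h) with respect to w, for w = (-1, 1)."""
    w = np.array([-1.0, 1.0])
    pre = float(w @ h)
    if activation == "relu":
        slope = 1.0 if pre > 0.0 else 0.0  # assumed convention: ReLU'(0) = 0
    else:
        s = 1.0 / (1.0 + np.exp(-pre))
        slope = s * (1.0 - s)              # derivative of the sigmoid
    return slope * h

def median_gradient_error(n, r, activation, trials=21, seed=0):
    rng = np.random.default_rng(seed)
    x = np.ones((n, 2))
    x[0, 1] = 2.0                          # x_1 = (1, 2), all other nodes (1, 1)
    exact = gradient(x.mean(axis=0), activation)
    errors = []
    for _ in range(trials):
        idx = rng.integers(0, n, size=r)   # uniform sampling with replacement
        errors.append(np.linalg.norm(gradient(x[idx].mean(axis=0), activation) - exact))
    return float(np.median(errors))

relu_error = median_gradient_error(100_000, 100, "relu")
sigmoid_error = median_gradient_error(100_000, 100, "sigmoid")
```

The exact pre-activation is $1/n>0$ , but a sample that misses node $1$ yields a pre-activation of exactly $0$ , so the ReLU gradient jumps from a vector of norm about $\sqrt{2}$ to zero. The sigmoid has a continuous derivative, so the same perturbation changes its gradient only slightly.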

7.4. Theoretical Rate (Q4)

We use one-layer GraphSAGE-GCN with sigmoid activation. We initialize the weight matrix ${\bm{W}}^{(1)}$ with normal distribution and then normalize it so that the operator norm $\|{\bm{W}}^{(1)}\|_{\text{op}}$ of the matrix is equal to $1$ . This satisfies Assumption 1, i.e., $\|{\bm{\theta}}\|_{2}\leq\sqrt{2}$ . The input graph is a clique $K_{n}$ with $n=40000$ nodes. We set the dimensions of intermediate embeddings as $2$ , and each feature value is set to $1$ with probability $0.5$ and $-1$ otherwise. This satisfies Assumption 1, i.e., $\|{\bm{x}}_{i}\|_{2}\leq\sqrt{2}$ . We compute the approximation errors of Algorithm 2 with different numbers of samples. Specifically, for each $r=1,\dots,10000$ , we (1) initialize the weight matrix, (2) choose $400$ nodes, (3) calculate the exact embedding of each chosen node, (4) calculate the estimate for each chosen node with $r$ samples, i.e., $r^{(1)}=r$ , and (5) calculate the approximation error of each chosen node. Figure 1 (d) illustrates the $99$ -th percentile point of empirical approximation errors and the theoretical bound by Theorem 2, i.e., $\varepsilon=O(r^{-1/2})$ . It shows that the approximation error decreases together with the theoretical rate. This indicates that the theoretical rate is tight. Based on these experimental and theoretical results, we can estimate a sufficient number of samples given the required precision.
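A scaled-down version of this procedure is sketched below. It assumes a mean aggregator over all nodes of the clique, so that every node has the same exact embedding and "choosing 400 nodes" reduces to 400 independent sampling trials; it also draws the weights and features once per call rather than once per value of $r$ . The function name `percentile_error` is hypothetical.

```python
import numpy as np

def percentile_error(r, n=40000, trials=400, seed=0):
    """99th percentile of the two-norm error of a one-layer sigmoid model."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((2, 2))
    w /= np.linalg.norm(w, ord=2)                 # operator norm equal to 1
    x = rng.choice([-1.0, 1.0], size=(n, 2))      # each entry is +1 or -1
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    exact = sigmoid(w @ x.mean(axis=0))
    idx = rng.integers(0, n, size=(trials, r))    # r uniform samples per trial
    estimates = sigmoid(x[idx].mean(axis=1) @ w.T)
    return float(np.percentile(np.linalg.norm(estimates - exact, axis=1), 99))

error_r100 = percentile_error(100)
error_r1600 = percentile_error(1600)
```

If the error scales as $r^{-1/2}$ , multiplying $r$ by $16$ should shrink the error by a factor of about $4$ , which is what the ratio `error_r100 / error_r1600` is meant to probe.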

7.5. Other Architectures (Q5)

We analyze the instances when neighbor sampling succeeds and fails for a variety of models. First, we use the Barabasi–Albert (BA) model (Barabasi and Albert, 1999). The degree distribution of the BA model follows a power law, which indicates neighbor sampling will fail to approximate GCNs (Propositions 5 and 6). We use ten-dimensional vectors from the i.i.d. standard multivariate normal distribution as the node features. We use two-layer GraphSAGE-GCN, GATs, and GCNs with ReLU activation. We use the same number $r$ of samples in the first and second layer, i.e., $r=r^{(1)}=r^{(2)}\in[8,1000]$ and use graphs with $n=r^{2}$ nodes. Specifically, we (1) iterate $r$ from $8$ to $1000$ , (2) set $n=r^{2}$ , (3) generate 10 graphs with $n$ nodes using the BA model, (4) choose the node that has the maximum degree for each generated graph, (5) calculate the exact embedding and its estimate for each chosen node with $r$ samples, i.e., $r^{(1)}=r^{(2)}=r$ , and (6) calculate the approximation error. We use the maximum degree node in step (4) because this is the hardest case and thus shows a clear distinction between appropriate and inappropriate situations. Low degree nodes with a few neighboring nodes are easy to approximate even with an inappropriate configuration. We focus on the hardest case to illustrate the distinction clearly in the following analysis.

Figure 1 (e) shows that the error of GCNs increases linearly even as the number of samples increases, while the errors of GraphSAGE-GCN and GATs gradually decrease. This is consistent with Proposition 5. This result indicates that the approximation of GCNs requires close supervision when the input graph is a social network because the degree distribution of a social network follows a power law, as in the BA model.

Next, we use the Erdős–Rényi (ER) model (Erdős and Rényi, 1959). It generates graphs with a flat degree distribution. We use the two-layer GraphSAGE-GCN, GATs, GCNs, and GraphSAGE-pool. The experimental process is similar to that for the BA model, but (1) we use the ER model instead of the BA model and (2) set $n=\text{floor}(r^{1.5})$ instead of $n=r^{2}$ to reduce the computational cost. Figure 1 (f) shows the approximation error. It shows that the errors of GraphSAGE-GCN, GATs, and GCNs gradually decrease as the number of samples increases. This is consistent with Theorem 1 and Proposition 6. In contrast, the approximation error of GraphSAGE-pool does not decrease, even if the input graphs are generated by the ER model. This is consistent with Proposition 4.

7.6. Real World Datasets (Q6)

We assess the speed and accuracy of neighbor sampling approximation using three large-scale real world datasets: AMiner citation network (https://www.aminer.cn/aminernetwork), Friendster social network (https://snap.stanford.edu/data/com-Friendster.html), and LiveJournal social network (https://snap.stanford.edu/data/com-LiveJournal.html). They contain $2092356$ , $65608366$ , and $3997962$ nodes, respectively. We use two-layer GraphSAGE-GCN with sigmoid activation and randomly initialize parameters by the Xavier initializer (Glorot and Bengio, 2010). The initial embeddings of nodes are generated from i.i.d. $128$ -dimensional standard Gaussian distribution. The number of dimensions in intermediate embeddings is also $128$ . For each $r=1,\dots,100$ , we compute the exact and approximated embeddings of the nodes with the top- $10$ degrees and measure the relative error $y=\|{\bm{z}}_{\text{exact}}-{\bm{z}}_{\text{approx}}\|/\|{\bm{z}}_{\text{exact}}\|$ , where ${\bm{z}}_{\text{exact}}$ is the exact embedding and ${\bm{z}}_{\text{approx}}$ is the approximated embedding, and the speedup factor $x=t_{\text{exact}}/t_{\text{approx}}$ , where $t_{\text{exact}}$ is the time consumption of the exact computation and $t_{\text{approx}}$ is the time consumption of the sampling method. Figure 2 plots these values. This shows that the approximation errors drop quickly, in particular, with rate around $O(x^{1/2})$ . These results are consistent with our theoretical analyses. Based on these experimental and theoretical results, we can estimate a sufficient number of samples given a required precision.

7.7. Effect on Downstream Tasks (Q7)

Although Hamilton et al. (2017) already investigated the effect of node sampling on downstream tasks as a heuristic method, we also investigate it for completeness. We use eight node classification datasets, Cora, Cora Full, PubMed, Citeseer, Coauthor CS, Coauthor Physics, Amazon Computer, and Amazon Photo, retrieved from the Deep Graph Library (https://docs.dgl.ai/api/python/data.html). Cora, Cora Full, PubMed, and Citeseer are citation networks, Coauthor CS and Coauthor Physics are co-authorship networks, and Amazon Computer and Amazon Photo are co-purchase networks. We train two-layer GraphSAGE-GCN with the neighbor sampling. We set the size of neighbor samples as $r_{1}=r_{2}=r$ both in training and inference. We use $500$ nodes for testing, $1000$ nodes for validation, and the remaining nodes for training for all datasets. Table 2 reports the average F1-score for $10$ different random seeds. We can see that even small numbers of neighbor samples offer good performance. In particular, $r=3$ may not be enough, but $r=5$ performs well in many datasets. Moreover, $r=10$ and $r=20$ perform as well as exact computation. These results validate the sampling approach in practical situations with various datasets. Table 2 also reports the relative approximation errors $\|{\bm{z}}_{\text{exact}}-{\bm{z}}_{\text{approx}}\|/\|{\bm{z}}_{\text{exact}}\|$ of the node embeddings in the last layer. It shows that (i) the approximation errors decrease as the number of sampled nodes increases, and (ii) most nodes are classified into the same category as in the exact computation when the relative error is below roughly ten percent. Note that even if the embedding is not exact, the node is classified correctly unless the error crosses the classification boundary. This result shows that a relative error of five to ten percent suffices in typical downstream tasks.

8. Practical Implications

We summarize the practical implications of this work for practitioners.

•

We showed that GNNs can be approximated in constant time. When real-time responses are required, as in the examples of the social network service and browser add-on we introduced in the introduction, we can use node sampling and estimate the approximation error using Theorem 2.

•

We showed the lower bound of the approximation error in the worst case (Theorem 3). We need at least $r=\Omega(\frac{1}{\varepsilon^{2}})$ samples to bound the approximation error in the worst case. We should keep this in mind when we apply an approximation technique, including methods other than node sampling, to applications that require guarantees.

•

We showed in Section 5 that many GNN variants cannot be approximated in constant time. We should be careful when we combine an approximation technique, including methods other than node sampling, with these GNN architectures. For example, when the performance of these GNN models is poor, we should investigate the sampling error in these cases.

•

We showed that GCNs cannot be approximated in constant time when the degree distribution is skewed, but they can be approximated in constant time when the degree distribution is flat (Proposition 5, Assumption 6, experiments for Q5). This indicates that the approximation of GCNs requires close supervision when the input graph is a social network because the degree distribution of a social network follows a power law.

•

Propositions 5 and 6 also indicate that GCNs do use the degree distribution information in their inference, whereas other GNNs, such as GraphSAGE-GCN, do not exploit it. Therefore, GCNs are more appropriate than GraphSAGE-GCN when the degree distribution is important to improve performance. For example, when one wants to estimate the influence of users in a social network, they should use GCNs instead of GraphSAGE-GCN.

•

We conducted numerical experiments using real world datasets (experiments for Q6). They describe the behavior of the node sampling in typical real world datasets. One can refer to them to estimate the approximation error in other real world datasets.

9. Conclusion

We analyzed neighbor sampling to prove that it can approximate the embedding and gradient of GNNs in constant time, where the complexity is completely independent of the number of the nodes, edges, and neighbors of the input. This is the first analysis that offers constant time approximation for GNNs. We further demonstrated that some existing GNNs cannot be approximated in constant time by any algorithm. Finally, we validated the theory through experiments using synthetic and real-world datasets.

10. Proofs

Lemma 1 (Hoeffding’s inequality (Hoeffding, 1963)).

Let $X_{1},X_{2},\dots,X_{n}$ be independent random variables bounded by the intervals $[-B,B]$ and let $\bar{X}$ be the empirical mean of these variables $\bar{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$ . Then, for any $\varepsilon>0$ ,

[TABLE]

holds true.

Lemma 2 (multivariate Hoeffding’s inequality).

Let ${\bm{x}}_{1},{\bm{x}}_{2},\dots,{\bm{x}}_{n}$ be independent $d$ -dimensional random variables whose two-norms are bounded $\|{\bm{x}}_{i}\|_{2}\leq B$ , and let $\bar{{\bm{x}}}$ be the empirical mean of these variables $\bar{{\bm{x}}}=\frac{1}{n}\sum_{i=1}^{n}{\bm{x}}_{i}$ . Then, for any $\varepsilon>0$ ,

[TABLE]

holds true.

Proof of Lemma 2.

Apply Lemma 1 to each dimension $k$ of $X_{i}$ . Then,

[TABLE]

It should be noted that $|X_{ik}|<B$ because $\|X_{i}\|_{2}<B$ . Therefore,

[TABLE]

If $|\bar{X}_{k}-\mathbb{E}[\bar{X}]_{k}|<\frac{\varepsilon}{\sqrt{d}}$ holds true for all dimensions $k$ , then

[TABLE]

Therefore,

[TABLE]

∎

Lemma 3.

Under Assumptions 1 and 2, the norms of the embeddings $\|{\bm{z}}^{(l)}_{v}\|_{2}$ , $\|\hat{{\bm{z}}}^{(l)}_{v}\|_{2}$ , $\|{\bm{h}}^{(l)}_{v}\|_{2}$ , and $\|\hat{{\bm{h}}}^{(l)}_{v}\|_{2}~{}(l=1,\dots,L)$ are bounded by a constant $B\in\mathbb{R}$ .

Proof of Lemma 3.

We prove the lemma by performing mathematical induction. The norm of the input to the first layer is bounded by Assumption 1. The message function $\text{deg}(i)M_{liu}$ and the update function $U_{l}$ are continuous by Assumption 2. Since the image $f(X)$ of a compact set $X\subset\mathbb{R}^{d}$ is compact if $f$ is continuous, the images of $\text{deg}(i)M_{liu}$ and $U_{l}$ are bounded by induction. ∎

Proof of Theorem 1.

We prove the theorem by performing mathematical induction on the number of layers $L$ .

Base case: We show that the statement holds true for $L=1$ . Because $U_{L}$ is uniformly continuous,

[TABLE]

Let $x_{k}$ be the $k$ -th sample in $\mathcal{S}^{(L)}$ and $X_{k}=\text{deg}(v)M_{Lvu}({\bm{z}}^{(0)}_{v},{\bm{z}}^{(0)}_{x_{k}},{\bm{e}}_{vx_{k}},{\bm{\theta}})$ . Then,

[TABLE]

There exists a constant $C\in\mathbb{R}$ such that for any input satisfying Assumption 1,

[TABLE]

holds true because $\|{\bm{z}}^{(0)}_{v}\|_{2},\|{\bm{z}}^{(0)}_{x_{k}}\|_{2},\|{\bm{e}}_{vx_{k}}\|_{2}$ , and $\|{\bm{\theta}}\|_{2}$ are bounded by Assumption 1 and $\text{deg}(v)M_{Lvu}$ is continuous. Therefore, if we take $r^{(L)}=O(\frac{1}{\varepsilon^{\prime 2}}\log\frac{1}{\delta})$ samples, $\textnormal{Pr}[\|\hat{{\bm{h}}}_{v}^{(L)}-{\bm{h}}_{v}^{(L)}\|_{2}\geq\varepsilon^{\prime}]\leq\delta$ by Hoeffding’s inequality and equations (2) and (3). Therefore, $\textnormal{Pr}[\|\hat{{\bm{z}}}_{v}^{(L)}-{\bm{z}}_{v}^{(L)}\|_{2}\geq\varepsilon]\leq\delta$ . We note that the input features are fixed in this analysis and the only source of randomness is sampling neighbor nodes. Thus, the $X_{k}$ are i.i.d. and we can apply Hoeffding’s inequality.

Inductive step: We show that the statement holds true for $L=l+1$ if it holds true for $L=l$ . The induction hypothesis is $\forall\varepsilon>0,1>\delta>0$ , $\exists r^{(1)}(\varepsilon,\delta),\dots,r^{(L-1)}(\varepsilon,\delta)$ such that $\forall v\in\mathcal{V}$ , $\textnormal{Pr}[\|\hat{\mathcal{O}}_{z}^{(L-1)}(v,\varepsilon,\delta)-{\bm{z}}_{v}\|_{2}\geq\varepsilon]\leq\delta$ .

Because $U_{L}$ is uniformly continuous,

[TABLE]

where $[\cdot]$ denotes concatenation. By the induction hypothesis,

[TABLE]

holds true. Let

[TABLE]

Let $x_{k}$ be the $k$ -th sample in $\mathcal{S}^{(L)}$ and $X_{k}=\text{deg}(v)M_{Lvu}({\bm{z}}_{v}^{(L-1)},{\bm{z}}_{x_{k}}^{(L-1)},{\bm{e}}_{vx_{k}},{\bm{\theta}})$ . Then,

[TABLE]

There exists a constant $C\in\mathbb{R}$ such that for any input satisfying Assumption 1,

[TABLE]

because $\|{\bm{z}}^{(L-1)}_{v}\|_{2}$, $\|{\bm{z}}^{(L-1)}_{x_{k}}\|_{2}$, $\|{\bm{e}}_{vx_{k}}\|_{2}$, and $\|{\bm{\theta}}\|_{2}$ are bounded by Assumption 1 and Lemma 3, and $\text{deg}(v)M_{Lvu}$ is continuous. If we take $r^{(L)}=O(\frac{1}{\varepsilon^{\prime 2}}\log\frac{1}{\delta})$, then

[TABLE]

by Hoeffding's inequality and equations (6) and (7). Because $\text{deg}(v)M_{Lvu}$ is uniformly continuous,

[TABLE]

By the induction hypothesis,

[TABLE]

Therefore, the probability that the errors of all oracle calls are bounded is

[TABLE]

By equations (9) and (11),

[TABLE]

By the triangle inequality and equations (8) and (12),

[TABLE]

Therefore, if we take $r^{(1)}=\max(r^{\prime(1)},r^{\prime\prime(1)}),\dots,r^{(L-1)}=\max(r^{\prime(L-1)},r^{\prime\prime(L-1)})$ , by equations (5) and (13),

[TABLE]

Therefore, by equations (9) and (14),

[TABLE]

∎
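The recursive structure that the induction follows can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `approx_embedding`, `run_on_cycle`, the cycle graph, and the toy `message` and `update` functions are all hypothetical, and `message` stands in for $\text{deg}(v)M_{lvu}$.

```python
import random

def approx_embedding(v, layer, neighbors, feature, r, message, update, rng):
    """Estimate z_v^(layer) by recursive uniform neighbor sampling.
    neighbors(v) -> list of neighbors, feature(v) -> input feature,
    r[layer] -> number of samples used at that layer."""
    if layer == 0:
        return feature(v)
    # Oracle call for the node itself at the previous layer.
    z_v = approx_embedding(v, layer - 1, neighbors, feature, r, message, update, rng)
    total = 0.0
    for _ in range(r[layer]):
        u = rng.choice(neighbors(v))  # uniform sample from N(v)
        z_u = approx_embedding(u, layer - 1, neighbors, feature, r, message, update, rng)
        total += message(z_v, z_u)    # plays the role of deg(v) * M_{lvu}
    h_v = total / r[layer]
    return update(z_v, h_v)

def run_on_cycle(n, r, seed=0):
    """Two-layer toy run on an n-node cycle; returns (embedding, #feature queries)."""
    queries = []
    def feature(v):
        queries.append(v)
        return 1.0
    def neighbors(v):
        return [(v - 1) % n, (v + 1) % n]
    z = approx_embedding(0, 2, neighbors, feature, r,
                         message=lambda z_v, z_u: z_u,
                         update=lambda z_v, h_v: 0.5 * z_v + 0.5 * h_v,
                         rng=random.Random(seed))
    return z, len(queries)

z_small, q_small = run_on_cycle(10, {1: 3, 2: 2})
z_large, q_large = run_on_cycle(10**6, {1: 3, 2: 2})
print(q_small, q_large)
```

In this sketch the number of feature queries is determined by the sample counts `r` alone, so it is the same for the 10-node and the million-node cycle.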

Proof of Theorem 2.

We prove this by performing mathematical induction on the number of layers.

Base case: We show that the statement holds true for $L=1$ . If $U_{L}$ is $K$ -Lipschitz continuous, $\varepsilon^{\prime}=O(\varepsilon)$ in equation (1). Therefore, $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ .

Inductive step: We show that the statement holds true for $L=l+1$ if it holds true for $L=l$. If $U_{L}$ and $M_{Lvu}$ are $K$-Lipschitz continuous, $\varepsilon^{\prime}=O(\varepsilon)$ in equation (4) and $\varepsilon^{\prime\prime}=O(\varepsilon)$ in equation (9). Therefore, $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$. We call $\hat{\mathcal{O}}^{(L-1)}_{z}(v)$ such that $\textnormal{Pr}[\|\hat{\mathcal{O}}^{(L-1)}_{z}(v)-{\bm{z}}^{(L-1)}_{v}\|_{2}\geq\varepsilon^{\prime}/\sqrt{2}]\leq\delta/2$ in equation (5). Therefore, $r^{\prime(1)},\dots,r^{\prime(L-1)}=O(\frac{1}{\varepsilon^{2}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}))$ are sufficient by the induction hypothesis. We call $\hat{\mathcal{O}}^{(L-1)}_{z}(v)$ such that $\textnormal{Pr}[\|\hat{\mathcal{O}}^{(L-1)}_{z}(v)-{\bm{z}}^{(L-1)}_{v}\|_{2}\geq\varepsilon^{\prime\prime}/\sqrt{2}]\leq\delta/(8r^{(L)})$ in equation (10). Therefore, $r^{\prime\prime(1)},\dots,r^{\prime\prime(L-1)}=O(\frac{1}{\varepsilon^{2}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}))$ are sufficient by the induction hypothesis because $\log\frac{1}{\delta/(8r^{(L)})}=O(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta})$. In total, the complexity is $O(\frac{1}{\varepsilon^{2L}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta})^{L-1}\log\frac{1}{\delta})$. ∎
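To see how the per-layer orders combine into the total, the following sketch evaluates them numerically. The hidden constants of the $O(\cdot)$ terms are unknown, so the factor `const=1.0` is an assumption, as are the function names; the product ignores lower-order terms.

```python
import math

def per_layer_samples(eps, delta, num_layers, const=1.0):
    """Illustrative sample counts r^(1), ..., r^(L) following the orders in the proof.
    `const` stands in for the unknown constant hidden in the O-notation."""
    top = math.ceil(const / eps**2 * math.log(1 / delta))
    lower = math.ceil(const / eps**2 * (math.log(1 / eps) + math.log(1 / delta)))
    return [lower] * (num_layers - 1) + [top]

def total_queries(samples):
    """Leading-order query count: the product r^(1) * ... * r^(L)."""
    return math.prod(samples)

samples = per_layer_samples(eps=0.1, delta=0.1, num_layers=2)
print(samples, total_queries(samples))
```

Neither function takes the number of nodes or edges as an argument; the counts depend only on $\varepsilon$, $\delta$, and $L$.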

Lemma 4 (Chazelle et al., 2005).

Let $\mathcal{D}^{s}$ be $Bernoulli(\frac{1+s\varepsilon}{2})$. Let the $n$-dimensional distribution $\mathcal{D}$ be defined as follows: (1) pick $s=1$ with probability $1/2$ and $s=-1$ otherwise; (2) then draw $n$ values from $\mathcal{D}^{s}$. Any probabilistic algorithm that can guess the value of $s$ with a probability of error below $1/4$ requires $\Omega(\frac{1}{\varepsilon^{2}})$ bit lookups on average.
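The distribution $\mathcal{D}$ is easy to simulate. The sketch below draws from it and guesses $s$ with a majority vote over randomly chosen bits. The majority-vote guesser and all function names are assumptions made for illustration; the lemma itself is a lower bound on any algorithm and is not proved by this simulation.

```python
import random

def draw_from_D(n, eps, rng):
    """Pick s uniformly from {+1, -1}, then draw n bits from Bernoulli((1 + s*eps)/2)."""
    s = 1 if rng.random() < 0.5 else -1
    p = (1 + s * eps) / 2
    t = [1 if rng.random() < p else 0 for _ in range(n)]
    return s, t

def guess_sign(t, num_lookups, rng):
    """Guess s from `num_lookups` random bit lookups using a majority vote."""
    ones = sum(t[rng.randrange(len(t))] for _ in range(num_lookups))
    return 1 if 2 * ones >= num_lookups else -1

def accuracy(eps, num_lookups, trials=200, n=5000, seed=0):
    """Fraction of trials in which the majority vote recovers s."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        s, t = draw_from_D(n, eps, rng)
        correct += guess_sign(t, num_lookups, rng) == s
    return correct / trials

# Many more than 1/eps^2 = 100 lookups versus a single lookup.
print(accuracy(0.1, 2000), accuracy(0.1, 1))
```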

Proof of Theorem 3.

We prove that there is a counterexample among the GraphSAGE-GCN models. Suppose there is an algorithm that approximates the one-layer GraphSAGE-GCN within $o(\frac{1}{\varepsilon^{2}})$ queries. We prove that this algorithm can distinguish $\mathcal{D}$ in Lemma 4 within $o(\frac{1}{\varepsilon^{2}})$ queries and derive a contradiction.

Let $\sigma$ be any non-constant $K$-Lipschitz activation function. There exist $a,b\in\mathbb{R}\ (a>b)$ such that $\sigma(a)\neq\sigma(b)$ because $\sigma$ is not constant. Let $S=\frac{|\sigma(a)-\sigma(b)|}{a-b}>0$. Let $\varepsilon>0$ be any sufficiently small positive value and $t\in\{0,1\}^{n}$ be a random variable drawn from $\mathcal{D}$. We prove that we can determine $s$ with high probability within $o(\frac{1}{\varepsilon^{2}})$ queries using the algorithm. Let $G$ be a clique $K_{n}$ and ${\bm{W}}^{(1)}=1$. Let us calculate $a_{\varepsilon}$ and $b_{\varepsilon}$ using the following steps: (1) set $a_{\varepsilon}=a$ and $b_{\varepsilon}=b$; (2) if $a_{\varepsilon}-b_{\varepsilon}<\varepsilon$, return $a_{\varepsilon}$ and $b_{\varepsilon}$; (3) $m=\frac{a_{\varepsilon}+b_{\varepsilon}}{2}$; (4) if $|\sigma(a_{\varepsilon})-\sigma(m)|>|\sigma(m)-\sigma(b_{\varepsilon})|$, then set $b_{\varepsilon}=m$, otherwise $a_{\varepsilon}=m$; and (5) go back to (2). Here, $\varepsilon/2\leq a_{\varepsilon}-b_{\varepsilon}<\varepsilon$, $b\leq\frac{a_{\varepsilon}+b_{\varepsilon}}{2}\leq a$, and $|\sigma(a_{\varepsilon})-\sigma(b_{\varepsilon})|\geq\frac{S}{2}\varepsilon$ hold true. Let $x_{v}=\frac{a_{\varepsilon}+b_{\varepsilon}}{2}+(2t_{v}-1)\frac{a_{\varepsilon}-b_{\varepsilon}}{2\varepsilon}$ for all $v\in\mathcal{V}$. Then, $\mathbb{E}[h_{v}\mid s=1]=a_{\varepsilon}$ and $\mathbb{E}[h_{v}\mid s=-1]=b_{\varepsilon}$. Therefore, $\textnormal{Pr}[|z_{v}-\sigma(a_{\varepsilon})|<\frac{S}{8}\varepsilon\mid s=1]\to 1$ as $n\to\infty$ and $\textnormal{Pr}[|z_{v}-\sigma(b_{\varepsilon})|<\frac{S}{8}\varepsilon\mid s=-1]\to 1$ as $n\to\infty$ because $\sigma$ is $K$-Lipschitz. We set the error tolerance to $\frac{S}{8}\varepsilon$ and $n$ to a sufficiently large number. Then, with high probability, $s=1$ if $|\hat{z}_{v}-\sigma(a_{\varepsilon})|<\frac{S}{4}\varepsilon$ and $s=-1$ otherwise. However, the algorithm accesses $t$ (i.e., accesses $\mathcal{O}_{\text{feature}}$) only $o(\frac{1}{\varepsilon^{2}})$ times. This contradicts Lemma 4. ∎
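Steps (1) to (5) of the interval-shrinking procedure can be sketched as follows. The function name `shrink_interval` and the choice of $\tanh$ as the activation are assumptions made for illustration; the sketch also assumes $a-b\geq\varepsilon$, which holds for sufficiently small $\varepsilon$.

```python
import math

def shrink_interval(sigma, a, b, eps):
    """Halve [b, a] (with a > b) until its width is below eps,
    always keeping the half over which sigma changes more."""
    a_eps, b_eps = a, b                      # step (1)
    while a_eps - b_eps >= eps:              # step (2)
        m = (a_eps + b_eps) / 2              # step (3)
        if abs(sigma(a_eps) - sigma(m)) > abs(sigma(m) - sigma(b_eps)):
            b_eps = m                        # step (4): keep the upper half
        else:
            a_eps = m                        # step (4): keep the lower half
    return a_eps, b_eps

a, b, eps = 1.0, -1.0, 0.1
S = abs(math.tanh(a) - math.tanh(b)) / (a - b)
a_eps, b_eps = shrink_interval(math.tanh, a, b, eps)
print(a_eps, b_eps, abs(math.tanh(a_eps) - math.tanh(b_eps)), S * eps / 2)
```

Keeping the half with the larger change in $\sigma$ preserves the ratio $|\sigma(a_{\varepsilon})-\sigma(b_{\varepsilon})|/(a_{\varepsilon}-b_{\varepsilon})\geq S$ at every step, which is where the bound $\frac{S}{2}\varepsilon$ comes from.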

Lemma 5.

Under Assumptions 1, 2 and 3, the norms of the gradients of the message functions and the update functions $\|DU_{l}({\bm{z}}^{(l-1)}_{v},{\bm{h}}^{(l)}_{v},{\bm{\theta}})\|_{F}$ , $\|DU_{l}(\hat{{\bm{z}}}^{(l-1)}_{v},\hat{{\bm{h}}}^{(l)}_{v},{\bm{\theta}})\|_{F}$ , $\|\text{deg}(v)DM_{lvu}({\bm{z}}^{(l-1)}_{v},{\bm{v}}^{(l)}_{u},{\bm{e}}_{vu},{\bm{\theta}})\|_{F}$ , and $\|\text{deg}(v)DM_{lvu}(\hat{{\bm{z}}}^{(l-1)}_{v},\hat{{\bm{v}}}^{(l)}_{u},{\bm{e}}_{vu},{\bm{\theta}})\|_{F}$ are bounded by a constant $B^{\prime}\in\mathbb{R}$ .

Proof of Lemma 5.

The input of each function is bounded by Lemma 3. Because $DU_{l}$ and $\text{deg}(v)DM_{lvu}$ are uniformly continuous, their images are bounded. ∎

Proof of Theorem 4.

We prove the theorem by performing mathematical induction on the number of layers $L$ .

Base case: We show that the statement holds true for $L=1$ .

When the number of layers is one,

[TABLE]

Because $DU_{L}$ is uniformly continuous,

[TABLE]

If we take $r^{(L)}=O(\frac{1}{\varepsilon^{\prime 2}}\log\frac{1}{\delta})$ ,

[TABLE]

holds true for any input by the argument of the proof of Theorem 1. Let $x_{k}$ be the $k$-th sample in $\mathcal{S}^{(L)}$ and $X_{k}=\text{deg}(v)\frac{\partial M_{Lvu}}{\partial{\bm{\theta}}}({\bm{z}}^{(0)}_{v},{\bm{z}}^{(0)}_{x_{k}},{\bm{e}}_{vx_{k}},{\bm{\theta}})$. Then,

[TABLE]

There exists a constant $C\in\mathbb{R}$ such that for any input satisfying Assumption 1,

[TABLE]

because $\|{\bm{z}}^{(0)}_{v}\|_{2},\|{\bm{z}}^{(0)}_{x_{k}}\|_{2},\|{\bm{e}}_{vx_{k}}\|_{2}$ , and $\|{\bm{\theta}}\|_{2}$ are bounded by Assumption 1 and $\text{deg}(v)DM_{Lvu}$ is continuous. Therefore, if we take $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ samples,

[TABLE]

holds true by Hoeffding's inequality and equations (17) and (18). If

[TABLE]

and

[TABLE]

hold true, then

[TABLE]

Therefore, $\textnormal{Pr}[\|\frac{\partial\hat{{\bm{z}}}^{(L)}_{v}}{\partial{\bm{\theta}}}-\frac{\partial{\bm{z}}^{(L)}_{v}}{\partial{\bm{\theta}}}\|_{F}\geq\varepsilon]\leq\delta$ by equations (15), (16), and (19).

Inductive step: We show that the statement holds true for $L=l+1$ if it holds true for $L=l$ .

[TABLE]

Because $DU_{L}$ is uniformly continuous,

[TABLE]

Here, $X<O(\varepsilon)$ means that there exists a universal constant $\alpha$ such that $X<\alpha\varepsilon$. Because $\text{deg}(v)DM_{Lvu}$ is uniformly continuous,

[TABLE]

By the argument of the proof of Theorem 1, if we take a sufficiently large number of samples,

[TABLE]

By the induction hypothesis, there exist $r^{(1)},\dots,r^{(L-1)}$ such that

[TABLE]

If we take $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ ,

[TABLE]

holds true by Hoeffding's inequality. Therefore,

[TABLE]

holds true by equations (20), (21), (22), (23), (24), (25), and (26). Therefore, if we take $r^{(1)},\dots,r^{(L)}$ sufficiently large, $\textnormal{Pr}[\|\frac{\partial\hat{{\bm{z}}}^{(L)}_{v}}{\partial{\bm{\theta}}}-\frac{\partial{\bm{z}}^{(L)}_{v}}{\partial{\bm{\theta}}}\|_{F}\geq\varepsilon]\leq\delta$ holds true by equations (20), (21), (22), (23), (27), (28), and (29), because the universal constants can be made arbitrarily small by increasing the constant in the number of samples.

∎

Proof of Theorem 5.

We prove this by performing mathematical induction on the number of layers.

Base case: We show that the statement holds true for $L=1$ . If $DU_{L}$ is $K^{\prime}$ -Lipschitz continuous, $\varepsilon^{\prime}=O(\varepsilon)$ in equation (15). Therefore, $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ is sufficient.

Inductive step: We show that the statement holds true for $L=l+1$ if it holds true for $L=l$. If $DU_{L}$ and $\text{deg}(v)DM_{Lvu}$ are $K^{\prime}$-Lipschitz continuous, $\varepsilon^{\prime}=O(\varepsilon)$ in equation (20) and $\varepsilon^{\prime\prime}=O(\varepsilon)$ in equation (21). Therefore, $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ and $r^{(1)},\dots,r^{(L-1)}=O(\frac{1}{\varepsilon^{2}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}))$ are sufficient. ∎

Lemma 6.

If Assumption 1 holds true and $\sigma$ is Lipschitz continuous, $\|{\bm{z}}^{(l)}_{v}\|_{2}$ and $\|\hat{{\bm{z}}}^{(l)}_{v}\|_{2}$ $(l=1,\dots,L)$ of the GAT model are bounded by a constant.

Proof of Lemma 6.

We prove this by performing mathematical induction on the number of layers. The norm of the input of the first layer is bounded by Assumption 1. If $\|{\bm{z}}^{(l-1)}_{u}\|_{2}$ and $\|\hat{{\bm{z}}}^{(l-1)}_{u}\|_{2}$ are bounded for all $u\in\mathcal{V}$ , $\|{\bm{h}}^{(l)}_{v}\|_{2}$ and $\|\hat{{\bm{h}}}^{(l)}_{v}\|_{2}$ are bounded because ${\bm{h}}^{(l)}_{v}$ and $\hat{{\bm{h}}}^{(l)}_{v}$ are the weighted sum of ${\bm{z}}^{(l-1)}_{u}$ and $\hat{{\bm{z}}}^{(l-1)}_{u}$ . Therefore, $\|{\bm{z}}^{(l)}_{u}\|_{2}$ and $\|\hat{{\bm{z}}}^{(l)}_{u}\|_{2}$ are bounded because $U_{l}$ is continuous. ∎

Proof of Proposition 6.

We prove the theorem by performing mathematical induction on the number of layers $L$ .

Base case: We show that the statement holds true for $L=1$ . Because $U_{L}$ is Lipschitz continuous,

[TABLE]

Let $e_{u}=\exp(\textsc{LeakyReLU}({\bm{a}}^{(0)\top}[{\bm{W}}^{(0)}{\bm{z}}^{(0)}_{v},{\bm{W}}^{(0)}{\bm{z}}^{(0)}_{u}]))$. Then,

[TABLE]

Let $x_{k}$ be the $k$ -th sample in $\mathcal{S}^{(L)}$ and $X_{k}=e_{x_{k}}$ . Then,

[TABLE]

There exist constants $c>0$ and $C>0$ such that for any input satisfying Assumption 1,

[TABLE]

because $\|{\bm{z}}^{(0)}_{v}\|_{2},\|{\bm{z}}^{(0)}_{x_{k}}\|_{2},\|{\bm{W}}^{(0)}\|_{F}$ , and $\|{\bm{a}}^{(0)}\|_{2}$ are bounded by Assumption 1. Therefore, if we take $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ samples,

[TABLE]

by Hoeffding's inequality and equations (31) and (32). Because $f(x)=1/x$ is Lipschitz continuous on $x>c>0$,

[TABLE]

Let

[TABLE]

Then,

[TABLE]

There exists a constant $C^{\prime}\in\mathbb{R}$ such that for any input satisfying Assumption 1,

[TABLE]

holds true because $\|{\bm{z}}^{(0)}_{u}\|_{2}$ is bounded and $c<|e_{u}|<C$. Therefore, if we take $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ samples,

[TABLE]

holds true by Hoeffding's inequality and equations (34) and (35). Therefore,

[TABLE]

holds true by the triangle inequality and equations (33) and (36), and $\textnormal{Pr}[\|\hat{{\bm{z}}}_{v}^{(L)}-{\bm{z}}_{v}^{(L)}\|_{2}\geq\varepsilon]\leq\delta$ holds true by equations (30) and (37).
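The base case estimates the attention-weighted sum as a ratio of two sample means, one for the normalizer built from $e_{u}$ and one for the weighted features. A minimal scalar sketch follows; reusing one sample set for both means, the LeakyReLU slope of 0.2, and all names and parameter values are assumptions made for illustration.

```python
import math
import random

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_score(z_v, z_u, w, a):
    """e_u = exp(LeakyReLU(a^T [W z_v, W z_u])) for scalar features and weights."""
    return math.exp(leaky_relu(a[0] * w * z_v + a[1] * w * z_u))

def exact_gat_aggregate(z_v, neighbor_feats, w, a):
    """Softmax-weighted sum of W z_u over all neighbors."""
    scores = [attention_score(z_v, z_u, w, a) for z_u in neighbor_feats]
    return sum(e * w * z_u for e, z_u in zip(scores, neighbor_feats)) / sum(scores)

def sampled_gat_aggregate(z_v, neighbor_feats, w, a, r, rng):
    """Ratio of two sample means computed from r uniformly sampled neighbors."""
    sample = [rng.choice(neighbor_feats) for _ in range(r)]
    scores = [attention_score(z_v, z_u, w, a) for z_u in sample]
    numerator = sum(e * w * z_u for e, z_u in zip(scores, sample)) / r
    denominator = sum(scores) / r  # bounded away from zero since e_u > c > 0
    return numerator / denominator

rng = random.Random(1)
neighbor_feats = [rng.uniform(-1.0, 1.0) for _ in range(50_000)]
exact = exact_gat_aggregate(0.3, neighbor_feats, w=1.0, a=(0.5, 0.5))
estimate = sampled_gat_aggregate(0.3, neighbor_feats, w=1.0, a=(0.5, 0.5), r=5000, rng=rng)
print(exact, estimate)
```

The lower bound $c$ on $e_{u}$ matters here: it keeps the denominator away from zero, which is what makes $1/x$ Lipschitz on the relevant range.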

Inductive step: We show that the statement holds true for $L=l+1$ if it holds true for $L=l$ . Because $U_{L}$ is Lipschitz continuous,

[TABLE]

holds true.

[TABLE]

holds true by the same argument as the base case. If we take $r^{(1)},\dots,r^{(L-1)}=O(\frac{1}{\varepsilon^{2}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}))$ samples,

[TABLE]

holds true by the induction hypothesis. Therefore, $\textnormal{Pr}[\|\hat{{\bm{z}}}_{v}^{(L)}-{\bm{z}}_{v}^{(L)}\|_{2}\geq\varepsilon]\leq\delta$ holds true by equations (38), (39), and (40).

∎

Proof of Proposition 1.

We show that the one-layer GraphSAGE-GCN whose activation function is not constant cannot be approximated in constant time if $\|x_{v}\|_{2}$ or $\|\theta\|_{2}$ is not bounded. There exist $a,b\in\mathbb{R}$ such that $\sigma(a)\neq\sigma(b)$ because $\sigma$ is not constant. We consider the following two types of inputs:

•

$G$ is the clique $K_{n}$ , ${\bm{W}}^{(1)}=1$ , and ${\bm{x}}_{i}=a$ for all nodes $i\in\mathcal{V}$ .

•

$G$ is the clique $K_{n}$ , ${\bm{W}}^{(1)}=1$ , ${\bm{x}}_{i}=a(i\neq v)$ for some $v\in\mathcal{V}$ , and ${\bm{x}}_{v}=n(b-a)$ .

Then, for the former type of inputs, ${\bm{z}}_{v}^{(1)}=\sigma(a)$. For the latter type of inputs, ${\bm{z}}_{v}^{(1)}=\sigma(b)$. Let $\mathcal{A}$ be an arbitrary constant-time algorithm and $C$ be the number of queries $\mathcal{A}$ makes when we set $\varepsilon=|\sigma(a)-\sigma(b)|/3$. When $\mathcal{A}$ calculates the embedding of $u\neq v\in\mathcal{V}$, the states of all nodes but $u$ are symmetrical until $\mathcal{A}$ makes a query about that node. Therefore, if $n$ is sufficiently large, $\mathcal{A}$ does not make any query about $v$ with high probability (i.e., at least $(1-\frac{1}{n-1})^{C}$). If $\mathcal{A}$ does not make any query about $v$, the state of $\mathcal{A}$ is the same for both types of inputs. If the approximation error is less than $\varepsilon$ for the first type of inputs, the approximation error is larger than $\varepsilon$ for the second type of inputs by the triangle inequality and vice versa. Therefore, $\mathcal{A}$ fails to approximate the embeddings of either type of inputs with the absolute error of at most $\varepsilon$. As for ${\bm{\theta}}$, we set ${\bm{W}}^{(1)}=n$, ${\bm{x}}_{i}=a/n$ $(i\neq v)$, and ${\bm{x}}_{v}=b-a$. Then, the same argument follows. ∎
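The probability bound used in this argument is simple to evaluate. The sketch below, with the hypothetical helper `prob_never_queries_v`, computes $(1-\frac{1}{n-1})^{C}$ for a fixed query budget $C$ and growing $n$.

```python
def prob_never_queries_v(n, num_queries):
    """Lower bound (1 - 1/(n-1))**C on the chance that C queries all miss node v."""
    return (1.0 - 1.0 / (n - 1)) ** num_queries

# Fixed budget of C = 1000 queries, increasingly large cliques.
for n in (101, 10_001, 1_000_001):
    print(n, prob_never_queries_v(n, 1000))
```

For any fixed $C$ the bound tends to one as $n$ grows, so a constant-time algorithm almost never observes the single node that distinguishes the two inputs.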

Proof of Proposition 2.

We consider the one-layer GraphSAGE-GCN with ReLU and normalization (i.e., $\sigma({\bm{x}})=\textsc{ReLU}({\bm{x}})/\|\textsc{ReLU}({\bm{x}})\|_{2}$ ). We use the following two types of inputs:

•

$G$ is the clique $K_{n}$ , ${\bm{W}}^{(1)}$ is the identity matrix ${\bm{I}}_{2}$ , ${\bm{x}}_{i}=(0,0)^{\top}(i\neq v)$ for some node $v\in\mathcal{V}$ , and ${\bm{x}}_{v}=(1,0)^{\top}$ .

•

$G$ is the clique $K_{n}$ , ${\bm{W}}^{(1)}$ is the identity matrix ${\bm{I}}_{2}$ , ${\bm{x}}_{i}=(0,0)^{\top}(i\neq v)$ for some node $v\in\mathcal{V}$ , and ${\bm{x}}_{v}=(0,1)^{\top}$ .

Then, for the former type of inputs, ${\bm{h}}_{i}=(1/n,0)^{\top}$, ${\bm{z}}_{i}=(1,0)^{\top}$, and $\frac{\partial z_{i2}}{\partial W_{21}}=1$ for all $i\in\mathcal{V}$. For the latter type of inputs, ${\bm{h}}_{i}=(0,1/n)^{\top}$, ${\bm{z}}_{i}=(0,1)^{\top}$, and $\frac{\partial z_{i2}}{\partial W_{21}}=0$ for all $i\in\mathcal{V}$. Let $\mathcal{A}$ be an arbitrary constant-time algorithm and $C$ be the number of queries $\mathcal{A}$ makes when we set $\varepsilon=1/3$. When $\mathcal{A}$ calculates the embedding or gradient of $u\neq v\in\mathcal{V}$, the states of all nodes but $u$ are symmetrical until $\mathcal{A}$ makes a query about that node. Therefore, if $n$ is sufficiently large, $\mathcal{A}$ does not make any query about $v$ with high probability (i.e., at least $(1-\frac{1}{n-1})^{C}$). If $\mathcal{A}$ does not make any query about $v$, the state of $\mathcal{A}$ is the same for both types of inputs. If the approximation error is less than $\varepsilon$ for the first type of inputs, the approximation error is larger than $\varepsilon$ for the second type of inputs by the triangle inequality and vice versa. Therefore, $\mathcal{A}$ fails to approximate the embeddings and gradients of either type of inputs with the absolute error of at most $\varepsilon$. ∎

Proof of Proposition 3.

We consider the one-layer GraphSAGE-GCN with ReLU (i.e., $\sigma({\bm{x}})=\textsc{ReLU}({\bm{x}})$ ). We use the following two types of inputs:

•

$G$ is the clique $K_{n}$ , ${\bm{W}}^{(1)}=(-1,1)$ , ${\bm{x}}_{i}=(1,1)^{\top}(i\neq v)$ for some node $v\in\mathcal{V}$ , and ${\bm{x}}_{v}=(1,2)^{\top}$ .

•

$G$ is the clique $K_{n}$ , ${\bm{W}}^{(1)}=(-1,1)$ , ${\bm{x}}_{i}=(1,1)^{\top}(i\neq v)$ for some node $v\in\mathcal{V}$ , and ${\bm{x}}_{v}=(1,0)^{\top}$ .

Then, for the former type of inputs, $\textsc{MEAN}(\{{\bm{x}}_{u}\mid u\in\mathcal{N}(v)\})=(1,1+\frac{1}{n})^{\top}$, ${\bm{h}}_{v}={\bm{z}}_{v}=\frac{1}{n}$, and $\frac{\partial{\bm{z}}_{v}}{\partial{\bm{W}}}=(1,1+\frac{1}{n})$ for all $i\in\mathcal{V}$. For the latter type of inputs, $\textsc{MEAN}(\{{\bm{x}}_{u}\mid u\in\mathcal{N}(v)\})=(1,1-\frac{1}{n})^{\top}$, ${\bm{h}}_{v}=-\frac{1}{n}$, ${\bm{z}}_{v}=0$, and $\frac{\partial{\bm{z}}_{v}}{\partial{\bm{W}}}=(0,0)$ for all $i\in\mathcal{V}$. Let $\mathcal{A}$ be an arbitrary constant-time algorithm and $C$ be the number of queries $\mathcal{A}$ makes when we set $\varepsilon=1/3$. When $\mathcal{A}$ calculates the gradient of $u\neq v\in\mathcal{V}$, the states of all nodes but $u$ are symmetrical until $\mathcal{A}$ makes a query about that node. Therefore, if $n$ is sufficiently large, $\mathcal{A}$ does not make any query about $v$ with high probability (i.e., at least $(1-\frac{1}{n-1})^{C}$). If $\mathcal{A}$ does not make any query about $v$, the state of $\mathcal{A}$ is the same for both types of inputs. If the approximation error is less than $\varepsilon$ for the first type of inputs, the approximation error is larger than $\varepsilon$ for the second type of inputs by the triangle inequality and vice versa. Therefore, $\mathcal{A}$ fails to approximate the gradients of either type of inputs with the absolute error of at most $\varepsilon$. ∎
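The two inputs of this construction can be checked numerically. The sketch below assumes that the mean runs over all $n$ nodes of the clique, including the node itself, which matches the stated value $(1,1\pm\frac{1}{n})^{\top}$; the function names are illustrative.

```python
def one_layer_sage_gcn(features, weight):
    """Scalar-output z = ReLU(W . MEAN(x)) on a clique; returns (h, z, dz/dW)."""
    n = len(features)
    mean = [sum(x[j] for x in features) / n for j in range(2)]
    h = weight[0] * mean[0] + weight[1] * mean[1]
    z = max(h, 0.0)
    # The gradient of ReLU(W . mean) w.r.t. W is the mean if h > 0, else zero.
    grad = mean if h > 0 else [0.0, 0.0]
    return h, z, grad

def clique_inputs(n, special):
    """One special node v plus n - 1 nodes with feature (1, 1)."""
    return [special] + [(1.0, 1.0)] * (n - 1)

n = 100
h1, z1, g1 = one_layer_sage_gcn(clique_inputs(n, (1.0, 2.0)), (-1.0, 1.0))
h2, z2, g2 = one_layer_sage_gcn(clique_inputs(n, (1.0, 0.0)), (-1.0, 1.0))
print(h1, z1, g1)
print(h2, z2, g2)
```

The embeddings differ only by $\frac{1}{n}$, while the gradients differ by a vector of norm larger than one, which is why the embedding can be approximated here but the gradient cannot.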

Proof of Proposition 4.

We consider the one-layer GraphSAGE-pool whose activation function satisfies $\sigma(1)\neq\sigma(0)$ and the following two types of inputs:

•

$G$ is the clique $K_{n}$, ${\bm{W}}^{(1)}=1$, ${\bm{b}}=0$, and ${\bm{x}}_{i}=0$ for all nodes $i\in\mathcal{V}$.

•

$G$ is the clique $K_{n}$ , ${\bm{W}}^{(1)}=1$ , ${\bm{b}}=0$ , ${\bm{x}}_{i}=0~{}(i\neq v)$ for some node $v\in\mathcal{V}$ , and ${\bm{x}}_{v}=1$ .

Then, for the former type of inputs, ${\bm{z}}_{i}=\sigma(0)$ for all $i\in\mathcal{V}$. For the latter type of inputs, ${\bm{z}}_{i}=\sigma(1)$ for all $i\in\mathcal{V}$. Let $\mathcal{A}$ be an arbitrary constant-time algorithm and $C$ be the number of queries $\mathcal{A}$ makes when we set $\varepsilon=|\sigma(1)-\sigma(0)|/3$. When $\mathcal{A}$ calculates the embedding of $u\neq v\in\mathcal{V}$, the states of all nodes but $u$ are symmetrical until $\mathcal{A}$ makes a query about that node. Therefore, if $n$ is sufficiently large, $\mathcal{A}$ does not make any query about $v$ with high probability (i.e., at least $(1-\frac{1}{n-1})^{C}$). If $\mathcal{A}$ does not make any query about $v$, the state of $\mathcal{A}$ is the same for both types of inputs. If the approximation error is less than $\varepsilon$ for the first type of inputs, the approximation error is larger than $\varepsilon$ for the second type of inputs by the triangle inequality and vice versa. Therefore, $\mathcal{A}$ fails to approximate the embeddings of either type of inputs with the absolute error of at most $\varepsilon$. ∎
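The failure of sampling under max pooling can be reproduced directly. The sketch below reduces the layer to the pre-activation $\max_{u}x_{u}$ (with ${\bm{W}}^{(1)}=1$ and ${\bm{b}}=0$ as in the construction); this simplification and the function names are assumptions made for illustration.

```python
import random

def exact_max_pool(features):
    """Exact max-pooling aggregate over all nodes of the clique."""
    return max(features)

def sampled_max_pool(features, r, rng):
    """Max over r uniformly sampled nodes; also returns the sampled indices."""
    idx = [rng.randrange(len(features)) for _ in range(r)]
    return max(features[i] for i in idx), idx

n = 100_000
all_zero = [0.0] * n
one_hot = [1.0] + [0.0] * (n - 1)  # node v = 0 carries the only nonzero feature

# Same seed for both inputs, so both runs sample the same indices.
est_zero, idx_zero = sampled_max_pool(all_zero, 100, random.Random(0))
est_hot, idx_hot = sampled_max_pool(one_hot, 100, random.Random(0))
print(exact_max_pool(all_zero), exact_max_pool(one_hot), est_zero, est_hot)
```

The exact aggregates are 0 and 1, but unless index 0 happens to be sampled, the two sampled estimates coincide, so at least one of them is off by the full gap.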

Proof of Proposition 5.

We consider the one-layer GCNs whose activation function satisfies $\sigma(1)\neq\sigma(0)$ . We use the following two types of inputs:

•

$G$ is a star graph, where $v\in\mathcal{V}$ is the center of $G$, ${\bm{W}}^{(1)}=1$, and all features are $0$.

•

$G$ is a star graph, where $v\in\mathcal{V}$ is the center of $G$, ${\bm{W}}^{(1)}=1$, and the features of $\sqrt{2n}$ leaves are $1$ and the features of other nodes are $0$.

Then, for the former type of inputs, ${\bm{z}}_{v}=\sigma(0)$. For the latter type of inputs, ${\bm{z}}_{v}=\sigma(1)$. Let $\mathcal{A}$ be an arbitrary constant-time algorithm and $C$ be the number of queries $\mathcal{A}$ makes when we set $\varepsilon=|\sigma(1)-\sigma(0)|/3$. When $\mathcal{A}$ calculates the embedding of $u\in\mathcal{V}$ such that ${\bm{x}}_{u}=0$, the states of all nodes but $u$ are symmetrical until $\mathcal{A}$ makes a query about that node. Therefore, if $n$ is sufficiently large, $\mathcal{A}$ does not make any query about $v$ with high probability (i.e., at least $(1-\frac{\sqrt{2n}}{n-1})^{C}$). If $\mathcal{A}$ does not make any query about $v$, the state of $\mathcal{A}$ is the same for both types of inputs. If the approximation error is less than $\varepsilon$ for the first type of inputs, the approximation error is larger than $\varepsilon$ for the second type of inputs by the triangle inequality and vice versa. Therefore, $\mathcal{A}$ fails to approximate the embeddings of either type of inputs with the absolute error of at most $\varepsilon$. ∎
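The value ${\bm{z}}_{v}=\sigma(1)$ for the second input can be verified with a short computation. The sketch assumes the usual GCN normalization with self-loops, so that the center has degree $n$ and every leaf has degree $2$; this convention and the function name are assumptions, and $n$ is chosen so that $\sqrt{2n}$ is an integer.

```python
import math

def gcn_center_preactivation(n, num_ones):
    """Pre-activation h_v of the center of an n-node star graph with W = 1.
    Assumes self-loops: deg(center) = n, deg(leaf) = 2."""
    deg_center = n  # n - 1 leaves plus the self-loop
    deg_leaf = 2    # the center plus the self-loop
    leaf_features = [1.0] * num_ones + [0.0] * (n - 1 - num_ones)
    center_feature = 0.0
    h = center_feature / deg_center
    h += sum(x / math.sqrt(deg_center * deg_leaf) for x in leaf_features)
    return h

n = 50
num_ones = math.isqrt(2 * n)  # sqrt(2n) = 10 leaves carry feature 1
print(gcn_center_preactivation(n, 0), gcn_center_preactivation(n, num_ones))
```

Only a $\sqrt{2n}/(n-1)$ fraction of the leaves carries a nonzero feature, and this fraction vanishes as $n$ grows, yet those leaves move the pre-activation from 0 to 1.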

Lemma 7.

If Assumptions 1 and 6 hold true and $\sigma$ is Lipschitz continuous, $\|{\bm{z}}^{(l)}_{v}\|_{2}$ and $\|\hat{{\bm{z}}}^{(l)}_{v}\|_{2}$ $(l=1,\dots,L)$ of the GCN model are bounded by a constant.

Proof of Lemma 7.

We prove this by performing mathematical induction on the number of layers. The norm of the input of the first layer is bounded by Assumption 1. If $\|{\bm{z}}^{(l-1)}_{u}\|_{2}$ and $\|\hat{{\bm{z}}}^{(l-1)}_{u}\|_{2}$ are bounded by $B$ for all $u\in\mathcal{V}$ , $\|{\bm{h}}^{(l)}_{v}\|_{2}$ and $\|\hat{{\bm{h}}}^{(l)}_{v}\|_{2}$ are bounded because under Assumption 6,

[TABLE]

Therefore, $\|{\bm{z}}^{(l)}_{u}\|_{2}$ and $\|\hat{{\bm{z}}}^{(l)}_{u}\|_{2}$ are bounded because $U_{l}$ is continuous. ∎

Proof of Proposition 6.

We prove the theorem by performing mathematical induction on the number of layers $L$ .

Base case: We show that the statement holds true for $L=1$ . Because $U_{L}$ is Lipschitz continuous,

[TABLE]

Let $x_{k}$ be the $k$ -th sample in $\mathcal{S}^{(L)}$ and $X_{k}=\sqrt{\frac{\text{deg}(v)}{\text{deg}(x_{k})}}{\bm{z}}^{(0)}_{x_{k}}$ . Then,

[TABLE]

There exists a constant $C>0$ such that for any input satisfying Assumption 1,

[TABLE]

because $\|{\bm{z}}^{(0)}_{x_{k}}\|_{2}$ and $\frac{\text{deg}(v)}{\text{deg}(x_{k})}$ are bounded by Assumptions 1 and 6. Therefore, if we take $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ samples, $\textnormal{Pr}[\|\hat{{\bm{z}}}_{v}^{(L)}-{\bm{z}}_{v}^{(L)}\|_{2}\geq\varepsilon]\leq\delta$ holds by Hoeffding's inequality and equations (41), (42), and (43).
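The estimator in this base case averages $X_{k}=\sqrt{\text{deg}(v)/\text{deg}(x_{k})}\,{\bm{z}}^{(0)}_{x_{k}}$ over uniformly sampled neighbors. The sketch below compares it with the exact normalized sum for scalar features. The synthetic degrees (kept within a factor of two of $\text{deg}(v)$ so that the ratio is bounded), the omission of self-loops, and the function names are assumptions made for illustration.

```python
import math
import random

def exact_gcn_aggregate(deg_v, neighbor_degs, neighbor_feats):
    """Sum over neighbors u of z_u / sqrt(deg(v) * deg(u))."""
    return sum(z / math.sqrt(deg_v * d) for d, z in zip(neighbor_degs, neighbor_feats))

def sampled_gcn_aggregate(deg_v, neighbor_degs, neighbor_feats, r, rng):
    """Mean of X_k = sqrt(deg(v) / deg(x_k)) * z_{x_k} over r sampled neighbors."""
    total = 0.0
    for _ in range(r):
        k = rng.randrange(len(neighbor_feats))
        total += math.sqrt(deg_v / neighbor_degs[k]) * neighbor_feats[k]
    return total / r

rng = random.Random(0)
deg_v = 20_000
neighbor_degs = [rng.randint(deg_v // 2, 2 * deg_v) for _ in range(deg_v)]
neighbor_feats = [rng.uniform(-1.0, 1.0) for _ in range(deg_v)]

exact = exact_gcn_aggregate(deg_v, neighbor_degs, neighbor_feats)
estimate = sampled_gcn_aggregate(deg_v, neighbor_degs, neighbor_feats, 5000, rng)
print(exact, estimate)
```

The exact sum equals $\frac{1}{\text{deg}(v)}\sum_{u}X_{u}$, so the sample mean of $X_{k}$ is an unbiased estimate of it, and the bounded degree ratio keeps every $X_{k}$ in a fixed range.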

Inductive step: We show that the statement holds true for $L=l+1$ if it holds true for $L=l$ . Because $U_{L}$ is Lipschitz continuous and does not use ${\bm{z}}_{v}$ ,

[TABLE]

holds true. If we take $r^{(L)}=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta})$ samples,

[TABLE]

holds true by the same argument as the base case. If we take $r^{(1)},\dots,r^{(L-1)}=O(\frac{1}{\varepsilon^{2}}(\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}))$ samples,

[TABLE]

holds true by the induction hypothesis. Therefore, $\textnormal{Pr}[\|\hat{{\bm{z}}}_{v}^{(L)}-{\bm{z}}_{v}^{(L)}\|_{2}\geq\varepsilon]\leq\delta$ holds true by equations (44), (45), and (46).

∎

Acknowledgements.

This work was supported by JSPS KAKENHI Grant Numbers 20H04244 and 21J22490, and the JST PRESTO program JPMJPR165A.

Bibliography

Barabasi and Albert (1999) Albert-Laszlo Barabasi and Reka Albert. 1999. Emergence of Scaling in Random Networks. Science 286, 5439 (1999), 509–512.
Baskin et al. (1997) Igor I. Baskin, Vladimir A. Palyulin, and Nikolai S. Zefirov. 1997. A Neural Device for Searching Direct Correlations between Structures and Properties of Chemical Compounds. Journal of Chemical Information and Computer Sciences 37, 4 (1997), 715–721.
Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2014. Spectral Networks and Locally Connected Networks on Graphs. In 2nd International Conference on Learning Representations, ICLR.
Chazelle et al. (2005) Bernard Chazelle, Ronitt Rubinfeld, and Luca Trevisan. 2005. Approximating the Minimum Spanning Tree Weight in Sublinear Time. SIAM J. Comput. 34, 6 (2005), 1370–1379.
Chen et al. (2018a) Jie Chen, Tengfei Ma, and Cao Xiao. 2018a. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. In Proceedings of the Sixth International Conference on Learning Representations, ICLR.
Chen et al. (2018b) Jianfei Chen, Jun Zhu, and Le Song. 2018b. Stochastic Training of Graph Convolutional Networks with Variance Reduction. In Proceedings of the 35th International Conference on Machine Learning, ICML. PMLR, 941–949.
Chiang et al. (2019) Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD. ACM, New York, NY, USA, 257–266.