Scalable Neural Network Training over Distributed Graphs

Aashish Kolluri; Sarthak Choudhary; Bryan Hooi; Prateek Saxena

arXiv:2302.13053·cs.LG·February 13, 2024

TL;DR

RETEXO is a novel framework that significantly reduces communication costs in distributed GNN training by reordering training processes, enabling scalable and accurate training across various network configurations.

Contribution

RETEXO introduces lazy message passing, a new training procedure that eliminates communication bottlenecks in distributed GNN training regardless of data partitioning.

Findings

01 Achieves 10-100x reduction in network data costs

02 Retains accuracy comparable to standard training methods

03 Scales effectively across different network decentralization levels

Abstract

Graph neural networks (GNNs) fuel diverse machine learning tasks involving graph-structured data, ranging from predicting protein structures to serving personalized recommendations. Real-world graph data must often be stored distributed across many machines not just because of capacity constraints, but because of compliance with data residency or privacy laws. In such setups, network communication is costly and becomes the main bottleneck to train GNNs. Optimizations for distributed GNN training have targeted data-level improvements so far -- via caching, network-aware partitioning, and sub-sampling -- that work for data center-like setups where graph data is accessible to a single entity and data transfer costs are ignored. We present RETEXO, the first framework which eliminates the severe communication bottleneck in distributed GNN training while respecting any given data…

Tables

Table 1. Summary of symbols used; sizes are in bytes.

Symbol	Description
$R$	# training rounds
$n$	# workers
$B$	# boundary nodes per worker
$H_{0}$	input features size
$H_{m}$ , $m \in [1 \dots K - 1]$	intermediate representations size
$C_{m}$	size of additional classifier layer
$M$	size of the GNN

Table 2. Micro-F1 scores for node classification in the transductive setting.

Method	Cora	Citeseer	PubMed	Facebook	LastFMAsia	Reddit	Products
GCN	$0.820 \pm 0.005$	$0.725 \pm 0.006$	$0.838 \pm 0.001$	$0.917 \pm 0.001$	$0.846 \pm 0.002$	$0.925 \pm 0.000$	$0.713 \pm 0.000$
RetexoGCN	$0.828 \pm 0.006$	$0.739 \pm 0.009$	$0.850 \pm 0.002$	$0.916 \pm 0.003$	$0.843 \pm 0.005$	$0.939 \pm 0.000$	$0.791 \pm 0.001$
GraphSAGE	$0.826 \pm 0.004$	$0.743 \pm 0.010$	$0.867 \pm 0.002$	$0.925 \pm 0.001$	$0.835 \pm 0.004$	$0.952 \pm 0.000$	$0.781 \pm 0.000$
RetexoSage	$0.825 \pm 0.004$	$0.766 \pm 0.008$	$0.861 \pm 0.002$	$0.910 \pm 0.002$	$0.825 \pm 0.007$	$0.957 \pm 0.004$	$0.792 \pm 0.001$
GAT	$0.807 \pm 0.007$	$0.714 \pm 0.011$	$0.833 \pm 0.002$	$0.912 \pm 0.003$	$0.842 \pm 0.004$	$0.925 \pm 0.000$	$0.707 \pm 0.000$
RetexoGAT	$0.818 \pm 0.006$	$0.738 \pm 0.010$	$0.830 \pm 0.002$	$0.913 \pm 0.003$	$0.832 \pm 0.005$	$0.944 \pm 0.003$	$0.748 \pm 0.001$

Table 3. Accuracy with increasing sampling.

Architecture	Reddit	Products
RetexoGAT	0.942	0.749
GAT	0.925	0.704
BNS-GAT (0.1)	0.903	0.648
BNS-GAT (0.01)	0.404	0.348

Table 4. Time taken (in seconds) for total training and message passing (MP) with varying latency.

Method	Total (0 ms)	MP (0 ms)	Total (50 ms)	MP (50 ms)	Total (100 ms)	MP (100 ms)	Total (200 ms)	MP (200 ms)
Reddit
RetexoSage	41.3	1.7	103.7	1.5	181.9	1.5	339.8	1.5
GraphSAGE	290.6	262.0	325.2	274.8	409.4	339.5	802.1	665.7
BNS-SAGE (0.1)	43.9	32.6	113.4	82.8	193.3	142.9	370.5	279.8
Products
RetexoSage	76.0	6.3	112.2	6.1	158.0	6.0	275.3	6.0
GraphSAGE	1681.2	1528.8	1692.8	1526.4	1711.6	1519.3	2176.4	1912.4
BNS-SAGE (0.1)	205.3	168.9	233.9	180.5	289.1	214.6	407.7	293.9

Table 5. Theoretically expected versus empirically observed values of $\gamma$ for Products.

#Partitions	Theoretical	Empirical
$2$	$314.2$	$314.6$
$4$	$319.8$	$320.2$
$6$	$320.5$	$321.2$
$8$	$320.5$	$321.1$

Table 6. Theoretically expected versus empirically observed values of $\gamma$ for Reddit.

#Partitions	Theoretical	Empirical
$2$	$125.9$	$125.9$
$4$	$146.8$	$146.8$
$6$	$151.5$	$151.4$
$8$	$153.4$	$153.5$
$16$	$155.1$	$155.2$
$24$	$154.7$	$154.8$

Table 7. Statistics of all datasets.

Dataset	Nodes	Edges	Features	Classes
Cora (Yang et al., 2016)	2,708	10,566	1,433	7
Citeseer (Yang et al., 2016)	3,327	9,104	3,703	6
LastFMAsia (Rozemberczki and Sarkar, 2020)	7,624	55,612	128	16
PubMed (Yang et al., 2016)	19,717	88,648	500	3
Facebook (Rozemberczki et al., 2021)	22,470	342,004	128	4
Reddit (Hamilton et al., 2017)	233K	114M	602	41
Products (Hu et al., 2020a)	2.4M	62M	100	47

Equations

Q_{v}^{m} = \mathbf{f}_{\theta}^{m-1} \left( \mathsf{COMBINE} \left( Q_{v}^{m-1}, \mathsf{AGGR}_{u \in \mathcal{N}(v)}^{m-1} \left( Q_{u}^{m-1} \right) \right) \right), \quad Q_{u}^{0} = \mathbf{F}[u]

\gamma = \frac{DV_{\text{total}}^{\text{standard}}}{DV_{\text{total}}^{\text{lazy}}} = \frac{1 + B \cdot \left( 2 \frac{\sum_{m=1}^{K-1} H_{m}}{M} + \frac{H_{0}}{M \cdot R} \right)}{1 + \frac{\sum_{m=1}^{K-1} C_{m}}{M} + B \cdot \left( \frac{\sum_{m=1}^{K-1} H_{m}}{M \cdot R} + \frac{H_{0}}{M \cdot R} \right)}

\frac{DV_{\text{lt}}^{\text{standard}}}{DV_{\text{lt}}^{\text{lazy}}} = \frac{2R \cdot \sum_{m=1}^{K-1} H_{m} + H_{0}}{\sum_{m=1}^{K-1} H_{m} + H_{0}}

DV_{\text{reduce}}^{\text{standard}} = n \cdot R \cdot M + n \cdot R \cdot M

DV_{\text{reduce}}^{\text{lazy}} = 2n \cdot R \cdot M + 2n \cdot R \cdot \sum_{m=1}^{K-1} C_{m}

\frac{DV_{\text{reduce}}^{\text{standard}}}{DV_{\text{reduce}}^{\text{lazy}}} = \frac{1}{1 + \frac{\sum_{m=1}^{K-1} C_{m}}{M}}
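The ratios listed above can be evaluated directly from the symbols in Table 1. The following is a minimal sketch, not the authors' code: the function names and the example values are hypothetical, and the formulas are transcribed term by term from the expressions for $\gamma$ and for the message passing ratio.

```python
def lazy_vs_standard_ratio(B, R, M, H0, H, C):
    """Return gamma = DV_total^standard / DV_total^lazy.

    B  : number of boundary nodes per worker
    R  : number of training rounds
    M  : size of the GNN in bytes
    H0 : input feature size in bytes
    H  : list of intermediate representation sizes H_1..H_{K-1} in bytes
    C  : list of additional classifier layer sizes C_1..C_{K-1} in bytes
    """
    sum_h = sum(H)
    sum_c = sum(C)
    numerator = 1 + B * (2 * sum_h / M + H0 / (M * R))
    denominator = 1 + sum_c / M + B * (sum_h / (M * R) + H0 / (M * R))
    return numerator / denominator


def message_passing_ratio(R, H0, H):
    """Return DV_lt^standard / DV_lt^lazy (message passing traffic only)."""
    sum_h = sum(H)
    return (2 * R * sum_h + H0) / (sum_h + H0)


# Hypothetical sizes, chosen only to exercise the formulas.
gamma = lazy_vs_standard_ratio(B=10**6, R=100, M=10**6, H0=400, H=[1024, 1024], C=[4096, 4096])
print(gamma, message_passing_ratio(R=100, H0=400, H=[1024, 1024]))
```

With no boundary nodes and no extra classifier layers the ratio is exactly 1, and as $B$ grows the total ratio approaches the message passing ratio, which is what one would expect when message passing dominates the traffic.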

Code & Models

Repositories

aashishkolluri/retexo-distributed
PyTorch, official

Taxonomy

Topics: Age of Information Optimization · Advanced Graph Neural Networks · IoT and Edge/Fog Computing

Full text

Scalable Neural Network Training over Distributed Graphs

Aashish Kolluri,* Sarthak Choudhary,* Bryan Hooi, and Prateek Saxena

School of Computing, National University of Singapore

[email protected], [email protected], {bhooi, prateeks}@comp.nus.edu.sg

Abstract.

Graph neural networks (GNNs) fuel diverse machine learning tasks involving graph-structured data, ranging from predicting protein structures to serving personalized recommendations. Real-world graph data must often be stored distributed across many machines not just because of capacity constraints, but because of compliance with data residency or privacy laws. In such setups, network communication is costly and becomes the main bottleneck to train GNNs. Optimizations for distributed GNN training have targeted data-level improvements so far—via caching, network-aware partitioning, and sub-sampling—that work for data center-like setups where graph data is accessible to a single entity and data transfer costs are ignored.

We present Retexo, the first framework that eliminates the severe communication bottleneck in distributed GNN training while respecting any given data partitioning configuration. The key is a new training procedure, lazy message passing, that reorders the sequence of training GNN elements. Retexo achieves a reduction of $1$ to $2$ orders of magnitude in network data costs compared to standard GNN training, while retaining accuracy. Retexo scales gracefully with increasing decentralization and decreasing bandwidth. It is the first framework that can be used to train GNNs at all network decentralization levels, including centralized data-center networks, wide area networks, proximity networks, and edge networks.

* These authors contributed equally to this work.

1. Introduction

Machine learning on graphs is a fundamental problem, with classical solutions such as cluster analyses and label propagation (see survey (Fortunato, 2010)). Graph neural networks (GNNs) have surpassed prior machine learning solutions by offering state-of-the-art performance in many machine learning tasks over graph data such as serving recommendations, financial fraud detection, and drug discovery (Wu et al., 2020; Zhou et al., 2020). The graphs in these applications are often very large, so systems that optimize the cost of training GNNs are on the rise (Wan et al., 2022; Gandhi and Iyer, 2021; Zheng et al., 2022).

We are in an era where data sovereignty, geographic data residency, privacy laws like GDPR, and intellectual property concerns play a pivotal role in governing the use of sensitive information. Such governance applies to graph data used in machine learning systems as well. It is often necessary to partition such data to ensure that sensitive user data and processing is confined solely at certain servers or client devices (McMahan et al., 2017; Kairouz et al., 2021). This is to legally comply with data use policies and to manage the risk of data breaches (Roth, [n. d.]). Such partitioning requirements are often externally imposed, e.g., no single entity may store the entire dataset, and regulations forbid its exchange without user consent. Performance optimization must respect such data decentralization constraints.

Several works have been proposed to optimize training GNNs on partitioned graph data across networked machines (workers) (Wan et al., 2022; Md et al., 2021; Gandhi and Iyer, 2021; Zheng et al., 2022; Zhu et al., 2019b; Cai et al., 2021; Jia et al., 2020; Zheng et al., 2020; Liu et al., 2023a). These works have established that the data-intensive message passing rounds conducted over the network between workers are the major bottleneck while training GNNs on large partitioned graphs. Our experiments presented later confirm the same: with the standard training procedure, over $2$ TB of data is communicated over the network for end-to-end training on a graph with close to $2.5$ million nodes when partitioned among just $2$ workers. Therefore, reducing the network data volume (costs) to train GNNs is critical for many reasons: lowering financial costs, reducing training time, and democratizing training even over low-bandwidth or unreliable networks.

Prior works, while highlighting the bottleneck, propose solutions for centralized training setups where the entire graph is accessible to a centralized coordinator and worker machines can access each other’s raw data arbitrarily over fast communication links. Their systems implement variants of mainly three data-level optimizations: graph- and network-aware partitioning, caching data from other workers, and graph sub-sampling. None of them is designed, as a first-class principle, for general distributed setups with arbitrary data decentralization constraints and diverse network characteristics. For instance, it is often infeasible for a single entity to partition the graph when the graph data resides at workers separated by geopolitical boundaries. In centralized training setups, data transfer over the internal network does not incur much financial cost. Further, bandwidth constraints can also be eased with specialized infrastructure, for instance, RDMA networks with a throughput of over $100$ Gbps (Liu et al., 2023b). Network data transfer costs and the available bandwidth are still critical bottlenecks for all other setups, where workers may be GPUs connected over NICs / PCIe interfaces or devices connected over wide-area networks, via edge servers, or over wireless/Bluetooth links. These settings may have less than $10$ Gbps links between workers and may require thousands of dollars to train a single medium-size GNN (see Section 2.3).

Our Solution: Retexo.

We design the first execution framework, called Retexo, for training GNNs efficiently for all levels of graph data decentralization. Retexo rethinks the training process of GNNs from the perspective of communication efficiency, i.e., reducing the network data volume as much as possible. The key idea is to change the order in which layers of the neural network are trained, which has the effect of delaying the message passing operations during training until they are necessary. We call it the lazy message passing training strategy. Figure 7 in our evaluation illustrates Retexo’s savings of $1$ to $2$ orders of magnitude in network data costs, compared to the state of the art (Wan et al., 2022) and standard training strategies, respectively.

Retexo is designed to adhere to any given data partitioning requirements, i.e., Retexo does not need to share raw node features or graph edges with any centralized coordinator. Communication can be peered between workers, so cross-worker graph edges need not be routed via a centralized coordinator. This design, first and foremost, removes any algorithmic necessity to route data centrally. The centralized coordinator is only used to conduct the training process in a synchronized manner and aggregate the gradients sent by the workers for every training round. Minimizing centralized coordination has well-known benefits of avoiding performance bottlenecks, but it also gives better data controls—each worker has complete autonomy over what it sends to other workers and the coordinator. It offers a natural baseline to implement advanced privacy enhancements in the future, for instance, local differential privacy with noise tailored individually to each worker (Sajadmanesh and Gatica-Perez, 2021; Zhu et al., 2023; Kolluri et al., 2022).

Retexo can achieve up to $321\times$ better network data costs compared to standard GNN training on standard benchmarks. It has $32\times$ better costs compared to BNS-GCN, the state-of-the-art system designed to minimize cross-worker costs. As a consequence, Retexo also trains GNNs end-to-end up to $22\times$ and $2.7\times$ faster than standard GNN training and BNS-GCN respectively, on our testbed which is similar to prior works (Wan et al., 2022; Gandhi and Iyer, 2021). The accuracy offered by Retexo for these architectures is within $\pm 1\%$ of that when centrally trained. At the same time, it respects data partitioning constraints, unlike prior systems (Gandhi and Iyer, 2021) which may violate these constraints by replicating raw data across workers.

Compatibility with GNNs and Training Frameworks.

Retexo is compatible with any message passing GNN architecture, such as GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), and GAT (Veličković et al., 2018). We have implemented Retexo on two popular distributed machine learning frameworks: PyTorch Distributed (https://pytorch.org/tutorials/beginner/dist_overview.html) and Flower (Beutel et al., 2020). This illustrates that Retexo can work in two very different network setups: multi-machine GPU clusters and bandwidth-constrained mobile Raspberry Pi (RPi) clients. To the best of our knowledge, Retexo is the first practical framework to enable training GNNs on distributed graphs stored on mobile clients. We have released the code to reproduce our experiments (https://github.com/aashishkolluri/retexo-distributed). We believe this work does not raise any ethical issues.

2. Background & Overview

Training GNNs on multiple workers distributively is indispensable for large datasets found in social networks (Ying et al., 2018), financial transactions (Shen et al., 2022; Zheng and Saupe, 2023), and recommender systems (Yang et al., [n. d.]) due to insufficient compute and memory in a single worker.

Another practical reason is compliance with data residency and privacy regulations. Real-world graph data often originates from distributed sources. For example, client users often generate data that are part of graphs arising in social networking or web analytics. These graphs can be sensitive (e.g. signifying personal relationships or consumer preferences) (Korolova et al., 2008; Wondracek et al., 2010). It is desirable to minimize the amount of sensitive graph data shared externally to the device for privacy reasons. Similarly, data residing at various edge servers, geo-distributed data centers, and across different organizations is often required to stay within jurisdictional boundaries and subject to cross-border transfer regulations otherwise (Clegg et al., [n. d.]; Staff, [n. d.]; Moriya et al., [n. d.]). Therefore, a good system for distributed GNN training should not assume that the data placement can be arbitrary, rather it should work with any data partitioning imposed by external constraints. The general edge-partitioned setup we describe next accommodates both motivations: efficiency and data decentralization.

2.1. Edge-partitioned Distributed Setup

Consider a graph $\mathcal{G}:(\mathcal{V},\mathcal{E})$ , where $\mathcal{V}$ and $\mathcal{E}$ are the set of nodes and edges respectively. The graph $\mathcal{G}$ is partitioned across $N$ workers, i.e., each worker holds a set of inner nodes $\mathcal{V}_{i}\subset\mathcal{V}$ it owns and all the edges between its inner nodes. A boundary node for a worker $i$ is an inner node from another worker that has an edge with an inner node at $i$ . Therefore, each worker $i$ also stores its boundary nodes and the boundary edges i.e., the edges between its inner nodes and its boundary nodes. We want the edge-partitioned distributed setup to be general enough to address any level of partitioning, from completely centralized to fully decentralized. Therefore, we assume that these partitions exist naturally i.e., no entity controls what the partitions can be. Further, none of the workers share their graph data with other workers apart from the boundary edges which the corresponding workers are mutually aware of. The graphs are undirected in our work, though one can extend to directed graphs in a straightforward way. We illustrate this setup in Figure 1.
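To make the definition of inner and boundary nodes concrete, here is a minimal sketch that derives each worker's boundary nodes from an edge list and a node-to-worker assignment. The function name `boundary_nodes`, the `owner` map, and the toy graph are hypothetical and only illustrate the definition; they are not taken from the Retexo code.

```python
def boundary_nodes(edges, owner):
    """Map each worker to the set of its boundary nodes.

    edges : iterable of undirected edges (u, v)
    owner : dict mapping every node to the worker that holds it as an inner node
    """
    boundary = {worker: set() for worker in set(owner.values())}
    for u, v in edges:
        if owner[u] != owner[v]:
            # A cross-worker edge makes each endpoint a boundary node
            # of the worker that owns the other endpoint.
            boundary[owner[u]].add(v)
            boundary[owner[v]].add(u)
    return boundary


# Hypothetical path graph 0-1-2-3-4 split across two workers.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
owner = {0: "w0", 1: "w0", 2: "w0", 3: "w1", 4: "w1"}
result = boundary_nodes(edges, owner)
```

In this toy graph only the edge (2, 3) crosses workers, so node 3 is the single boundary node of `w0` and node 2 is the single boundary node of `w1`.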

We want to train GNNs over such graphs efficiently. Our focus is on the most common application, namely supervised node classification. All nodes have feature vectors i.e., each node in $\mathcal{V}$ has a feature vector of $I$ dimensions captured by the node-feature map $\mathbf{F}:\mathcal{V}\xrightarrow{}\mathbb{R}^{I}$ . For example, in a social networking graph, the feature vectors can be word2vec embeddings (Mikolov et al., 2013) of user post topics. A few of the nodes have labels, for instance, topics that a user (node) in a social network is interested in. The task is to train a GNN on the graph to classify other nodes that do not have labels.

The features and other intermediate representations of a node have to be shared with its neighbors during the training of a GNN. It is therefore necessary for every worker to access the features and intermediate representations of their boundary nodes from other workers. But, nothing else is shared across workers, keeping with the principle of least privilege.

Our setup has a centralized coordinator (master) which collaborates with other workers for training. A worker denotes a process utilizing the resources of a machine with either a single or multiple GPU/CPU. In practice, multiple workers could be deployed on a single machine. To visualize the problem better, one can assume that a unique worker is deployed per machine that has a single GPU/CPU, communicating with other workers via network interfaces. The master only schedules training rounds and helps in aggregating gradient vectors from the workers (McMahan et al., 2017). The master may also assist the workers in setting up trusted peer-to-peer communication channels among them at the beginning if they do not exist already. During the entire training, the master will not be able to access raw node features or the graph structure from any worker and the workers are oblivious to any features, intermediate representations, or edges stored on other workers except the representations of their boundary nodes. Finally, we point out that this setup echoes the federated learning setup for images and text that is primarily motivated from a data decentralization perspective (McMahan et al., 2017).

2.2. Architecture of Classical GNNs

GNNs are feedforward neural networks. The inputs and outputs of every intermediate layer in the neural network are vectors, often called representations. We focus on message passing GNNs which are widely used in practice, such as the GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), and GAT (Veličković et al., 2018). We illustrate the forward pass of a $2$-layer GNN to compute the embeddings for a node in Figure 2. During the forward pass, every node aggregates representations from its neighbors and combines them with its own representations in each layer $m\in[1,\ldots,K]$. Each GNN layer can be viewed as performing one message passing round. After $m$ layers (or message passing rounds), a node’s representation has aggregated information from all nodes in its $m$-hop neighborhood.

Formally, the GNN model architecture can be viewed as the function $\mathbf{M}:(v,\mathbf{F},\mathcal{E})\xrightarrow{}\mathbb{R}^{O}$. $\mathbf{M}$ takes a node $v$, the features of all nodes $\mathbf{F}$ (which includes the $m$-hop neighborhood of $v$), and the edges $\mathcal{E}$ as input. It outputs the node embedding vector (representation) of $O$ dimensions for the given node $v$. The raw features of $v$ are the representations used as input to the first GNN layer. Therefore, the representation of the node $v$ after $m$ rounds of message passing will be:

$\mathbf{Q}_{v}^{m}=\mathbf{f}^{m-1}_{\theta}\left(\mathsf{COMBINE}\left(\mathbf{Q}_{v}^{m-1},\ \mathsf{AGGR}^{m-1}_{u\in\mathcal{N}(v)}\left(\mathbf{Q}_{u}^{m-1}\right)\right)\right),\quad\mathbf{Q}_{u}^{0}=\mathbf{F}[u]$ (1)

Here, the neighbors of $v$ are denoted by $\mathcal{N}(v)$, the $\mathsf{COMBINE}$ function is usually vector concatenation, and the non-linear functions $\mathbf{f}^{m-1}_{\theta}$ and $\mathsf{AGGR}$ may have trainable parameters. $\mathsf{AGGR}$ functions vary with different GNNs. For example, $\mathsf{AGGR}$ is the mean operator in GCNs (Kipf and Welling, 2017) or a max-pooling layer in GraphSAGE (Hamilton et al., 2017). $\mathbf{f}^{m-1}_{\theta}$ represents a sequence of non-linear functions including a fully-connected layer, the ReLU function, and batch and layer normalization layers. The $m^{th}$ GNN layer can be represented in short using only the functions with trainable parameters $(\mathsf{AGGR}^{m-1},\mathbf{f}^{m-1}_{\theta})$. The $m^{th}$ representation of a node is, thus, a non-linear transformation over an aggregate of its neighbors’ $(m-1)^{th}$ representations.
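The layer definition can be illustrated with a small, dependency-free sketch of one message passing round. It assumes a mean $\mathsf{AGGR}$, concatenation as $\mathsf{COMBINE}$, and a single linear map followed by ReLU standing in for $\mathbf{f}_{\theta}$; the function name `gnn_layer`, the weight matrix, and the toy graph are hypothetical.

```python
def gnn_layer(Q, neighbors, weight):
    """Compute Q^m from Q^{m-1} for every node.

    Q         : dict node -> list of floats (the representation Q_v^{m-1})
    neighbors : dict node -> list of neighbor nodes N(v)
    weight    : list of rows, each with 2 * len(Q_v) entries (stand-in for f_theta)
    """
    out = {}
    for v, q_v in Q.items():
        nbrs = neighbors[v]
        # AGGR: element-wise mean of the neighbors' representations.
        if nbrs:
            agg = [sum(Q[u][i] for u in nbrs) / len(nbrs) for i in range(len(q_v))]
        else:
            agg = [0.0] * len(q_v)
        # COMBINE: concatenate the node's own and the aggregated representation.
        combined = q_v + agg
        # f_theta: linear map followed by ReLU.
        out[v] = [max(0.0, sum(w * x for w, x in zip(row, combined))) for row in weight]
    return out


# Hypothetical star graph with node 0 at the center.
Q0 = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 4.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
weight = [[1, 0, 1, 0], [0, 1, 0, -1]]
Q1 = gnn_layer(Q0, neighbors, weight)
```

Applying `gnn_layer` again to `Q1` would give every node information from its $2$-hop neighborhood, which is the sense in which each layer corresponds to one message passing round.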

The last layer produces an output embedding, on which labels are computed for the classification task. The goal of training is to optimize the values of $\mathbf{f}^{m-1}_{\theta}$ and $\mathsf{AGGR}^{m-1}$ functions for all $m\in\{1,\ldots,K\}$ for high classification accuracy.

2.3. Problem: Network Data Costs

Training GNNs distributively can incur massive network costs due to their inherent message passing architecture.

Standard Training.

The usual procedure to train GNNs distributively proceeds in training rounds. A training round has two phases: local training and gradient aggregation (Wan et al., 2022; Thorpe et al., 2021; Jia et al., 2020). The workers receive the same parameters of model $\mathbf{M}$ at the start of each training round. The local training uses the standard stochastic gradient descent algorithm (Bottou et al., 2018), which executes a forward pass execution of $\mathbf{M}$ to compute the loss function and then a backward pass of $\mathbf{M}$ to compute gradient vectors. During the forward pass, the workers send and receive the intermediate representations of boundary nodes with other workers, once for each layer in the GNN. During the backward pass, each worker receives the gradients corresponding to the representations they sent during the forward pass for a boundary node. The worker aggregates the received gradients for each layer for each node, working backward from the layer $K$ to $1$ . Thus, there are $2$ message passing rounds per layer and $2\cdot K$ in total for a $K$ -layer GNN.

At the end of a local training phase, each worker has a local gradient computed for all trainable parameters of the GNN. All workers send their local gradients to the master, which aggregates them, updates $\mathbf{M}$ , and sends the updated $\mathbf{M}$ for the next training round. This constitutes the gradient aggregation phase, completing one training round. Figure 4 illustrates the standard training for one end-to-end round.
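The gradient aggregation phase can be sketched as follows. The text only states that the master aggregates the local gradients and updates $\mathbf{M}$; taking the plain average and applying one SGD step, as done here, is an assumption made for illustration, and the function names are hypothetical.

```python
def aggregate_gradients(local_grads):
    """Average the per-parameter gradients sent by all workers (run by the master)."""
    n = len(local_grads)
    return [sum(component) / n for component in zip(*local_grads)]


def apply_update(params, grad, lr):
    """One gradient descent step on the shared model parameters."""
    return [p - lr * g for p, g in zip(params, grad)]


# Two hypothetical workers, each holding a gradient for a 2-parameter model.
local_grads = [[1.0, 2.0], [3.0, 6.0]]
avg_grad = aggregate_gradients(local_grads)
new_params = apply_update([0.5, 0.5], avg_grad, lr=0.1)
```

Note that only gradients of size $M$ travel between the workers and the master in this phase; no node features or edges are involved.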

Network data costs with an example.

It is easy to see that the network data volume exchanged between workers in the standard training increases linearly with the number of boundary nodes. Additionally, it increases linearly with the number of training rounds. If one partitions the graph randomly among a few workers then the number of boundary nodes can be in millions for even medium-sized graphs, such as the Amazon Products co-purchasing graph which has $2.5\times 10^{6}$ nodes (Hu et al., 2020a). In the standard benchmarked settings, the network data volume is close to $2.8$ GB just to share one intermediate representation among just $2$ workers for this graph. Further, close to $3.7$ TB of data volume is shared over the network among the workers for training a small GNN with $3$ layers on the Amazon Products graph. In the centralized setup itself, such high network data volume is prohibitive as pointed out by prior works (Gandhi and Iyer, 2021; Wan et al., 2022), specifically in the context of reducing training time. This problem becomes even worse for the edge-partitioned distributed setup due to the network traffic being routed over bandwidth-constrained and unreliable communication channels, incurring high costs of such data transfer.

As an example, consider two workers hosted over Amazon Web Services (AWS) which are located in two different continents. The costs of sending data even within AWS services can be up to $0.1$ USD per GB of data. Therefore, to train a GNN once for our running example, one would end up spending $370$ USD and about $1$ USD for every inference on the GNN. Extrapolating from this example, the cost to train a GNN on another commonly used benchmark, ogbn-papers100M (Hu et al., 2020a), with $111$ million nodes is $17,000$ USD and the cost of one inference is $43$ USD. We point out that GNNs are trained on these datasets regularly in the centralized setup nearly for free (see the leaderboard at https://ogb.stanford.edu/docs/leader_nodeprop/). Such network costs are less relevant in centralized training setups, where data is accessible by GPU workers connected over extremely fast links (e.g., NVLink) on the same machine; here, computational costs dominate (Liu et al., 2023b).
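The dollar figures for the running example follow from simple arithmetic on the quoted data volumes. The sketch below reproduces that arithmetic; decimal units (1 TB = 1000 GB) and the reading of one inference as $3$ message passing rounds of about $2.8$ GB each are assumptions made here, not statements from the text.

```python
PRICE_USD_PER_GB = 0.1  # upper-end transfer price quoted in the text


def transfer_cost_usd(volume_gb, price_usd_per_gb=PRICE_USD_PER_GB):
    """Cost of moving `volume_gb` gigabytes between workers."""
    return volume_gb * price_usd_per_gb


# Training: about 3.7 TB in total, taken as 3700 GB (decimal units assumed).
training_cost = transfer_cost_usd(3700)

# Inference: assumed to be one forward pass of the 3-layer GNN,
# i.e. 3 message passing rounds of about 2.8 GB each.
inference_cost = transfer_cost_usd(3 * 2.8)

print(round(training_cost, 2), round(inference_cost, 2))
```

Under these assumptions the training cost comes to about $370$ USD and one inference to a little under $1$ USD, in line with the figures quoted for the Amazon Products graph.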

Further, high network data volume inevitably affects the training time. Figure 3 illustrates the time taken to train GraphSAGE on the Products dataset using the standard strategy (middle bar). Observe that more than $80\%$ of the training time is spent during the message passing rounds. This will worsen as the bandwidth reduces in the distributed setups. The remaining bars provide a glimpse of the training time using Retexo (first bar) and another state-of-the-art framework (Wan et al., 2022).

2.4. Prior Optimization Approaches

Many prior works have proposed optimization techniques that reduce network data costs of training GNNs on distributed graphs (Gandhi and Iyer, 2021; Wan et al., 2022; Zhu et al., 2019b; Zheng et al., 2022; Cai et al., 2021; Zheng et al., 2020). However, they assume that a central coordinator in charge of optimization can customize the data partitioning, as needed, to reduce cross-worker costs. Two prominent examples of such optimizations are graph structure based partitioning (Zheng et al., 2020, 2022; Cai et al., 2021) and node feature partitioning (Gandhi and Iyer, 2021), both of which optimize data placement for efficiency. Many of these works also assume direct access to features and edges of inner nodes, beyond boundary nodes, residing at other workers. These assumptions are incompatible with regulatory goals: it is desirable to adhere to any partitioning pre-decided by the setup and to disable direct data access to inner nodes.

The only prior optimization approach that fits our constraints is boundary node sampling (BNS) (Wan et al., 2022). The idea is that a worker samples a small fraction of its boundary nodes at another worker, for which it exchanges data during each training round. This reduces communication costs in proportion to the sampling rate. However, if the sampling rate is chosen too aggressively (say below $10\%$), the model performance is adversely affected in each round, and the number of training rounds needed for the GNN model to converge increases, as pointed out in the original work that proposed BNS (Wan et al., 2022). Despite being an orthogonal approach to ours, it does offer a good experimental comparison point, since the BNS framework has implemented sampling as well as other optimizations to reduce network data costs proposed in prior work, including caching boundary node features, precomputing inputs to specific GNNs for all training rounds, and pipelining training with evaluation.
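The sampling idea can be sketched in a few lines. This is only an illustration under the assumption of uniform sampling without replacement, redrawn every training round; it is not the BNS-GCN implementation, and the function name `sample_boundary_nodes` is hypothetical.

```python
import random


def sample_boundary_nodes(boundary, rate, rng):
    """Pick roughly `rate` of a worker's boundary nodes for one training round."""
    nodes = sorted(boundary)
    if not nodes:
        return set()
    # Keep at least one node so the round still exchanges some data.
    k = max(1, round(rate * len(nodes)))
    return set(rng.sample(nodes, k))


# Hypothetical worker with 100 boundary nodes and a 10% sampling rate.
rng = random.Random(0)
boundary = set(range(100))
sampled = sample_boundary_nodes(boundary, rate=0.1, rng=rng)
```

Only the representations of the sampled nodes are exchanged in that round, so the per-round message passing volume shrinks with `rate`, while the number of message passing rounds stays the same.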

3. Training GNNs with Retexo

Recall that the network data cost to train a GNN increases linearly with the number of boundary nodes and message passing rounds. All existing techniques focus on reducing the network data cost of each message passing round by using optimal partitioning. We observe that it is insufficient to reduce the network data cost of individual message passing rounds. Even a single local training phase with $2\cdot K$ message passing rounds may generate more network data than that generated during the gradient aggregation phase across all training rounds combined (see Section 4.3). Furthermore, there would be $2\cdot K\cdot R$ such message passing rounds to train a $K$ -layer GNN end-to-end for $R$ training rounds using standard training. Therefore, our key idea is to reduce the number of message passing rounds itself, while still having enough of them to achieve high model performance. To do this, we propose a novel training procedure for GNNs without changing their underlying architecture. It embodies what we call the lazy message passing training strategy.

3.1. Lazy Message Passing

We propose to train GNNs layer-by-layer instead of optimizing the parameters of all layers together. Specifically, each GNN layer is trained independently of the other layers and sequentially after training all of its previous layers. The rationale is as follows. Observe that while using a trained $K$ -layer GNN for inference, it can perform the classification task using the $K$ -hop neighborhoods of all the nodes well. This means that once the $K$ layers are trained, the output representations of $K^{th}$ layer are informative. Furthermore, the total number of message passing rounds is just $K$ during inference since only one forward pass is needed.

We can also use the above insight to make training efficient. Suppose we have already trained the parameters of the $K$ layers and fix them thereafter. Now, if we were to add one more layer, say a $(K+1)^{th}$ layer, we can use the representations already learned well by the trained $K$ layers as inputs to the untrained final layer. To train this single $(K+1)^{th}$ layer, the workers perform one round of message passing to form the representations needed as inputs to the $(K+1)^{th}$ layer. Then, each worker can locally train this layer with no more message passing rounds, since all the information to run the forward and backward passes is then available locally.

We can use this insight to train the first layer itself, and beyond. The inputs to the first layer are obtained after one message passing round over raw features. After the first layer is trained the subsequent layers can be trained by using the representations obtained from the last trained layer as inputs. This has the effect of delaying a message passing round until completely training all previous layers of the GNN. Hence, we call this greedy layer-by-layer optimization procedure to train GNNs as Lazy Message Passing.

Concrete Illustration.

Consider a $2$ -layer GNN with layers ( $\mathsf{AGGR^{0}}$ , $\mathbf{f}^{0}_{\theta}$ ) and ( $\mathsf{AGGR^{1}}$ , $\mathbf{f}^{1}_{\theta}$ ) following the definition in equation 1 from Section 2.2. Before training the first layer one round of message passing is performed to update the feature information available at every node with their neighbors’. Using the updated feature information the first layer is trained independently from the second layer. To do that a classification layer, i.e., a non-linear function, $\mathbf{C}^{0}_{\theta}$ is added to the first layer. Effectively, the layer ( $\mathsf{AGGR^{0}}$ , $\mathbf{f}^{0}_{\theta}$ , $\mathbf{C}^{0}_{\theta}$ ) is trained to perform the node-classification task by itself. Once trained, the classifier layer $\mathbf{C}^{0}_{\theta}$ is removed, hence, leaving behind the first layer that has learned to aggregate information from $1$ -hop neighbors in a way that is useful for node classification. Observe that training the first layer did not require any more message passing rounds since there is only one $\mathsf{AGGR}$ function in that layer. Now, the second GNN layer ( $\mathsf{AGGR^{1}}$ , $\mathbf{f}^{1}_{\theta}$ ) is trained. The inputs to this layer are the fixed intermediate representations output by the trained first layer, i.e., $\mathbf{Q^{1}}$ . At this point, another message passing round is conducted for every node to obtain its neighbor’s intermediate representations. Thus, each node now has information from its $2$ -hop neighborhood. Using this information, the second GNN layer is trained similarly to the first one. We do not need to add a classification layer here because $\mathbf{f}^{1}_{\theta}$ is already the final classification function of a $2$ -layer GNN. Further, the first layer’s weights are not updated while training the second layer. 
Following this process, we have trained a GNN to do node classification while only performing $2$ rounds of message passing without changing its architecture.
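
The schedule in this illustration can be sketched in a few lines of Python. This is only a minimal, single-process sketch: the mean aggregator, the `lazy_train` helper, and the stand-in trainers are assumptions made for illustration and are not part of Retexo's codebase. Each trainer is assumed to attach, train, and discard its temporary classifier internally and to return the frozen layer.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Graph = Dict[int, List[int]]  # node id -> neighbor ids
Layer = Callable[[Vector], Vector]
Trainer = Callable[[Dict[int, Vector]], Layer]


def message_passing_round(graph: Graph, reps: Dict[int, Vector]) -> Dict[int, Vector]:
    """One AGGR step (assumed here to be a mean over the node and its neighbors).

    In a distributed run this is the only step that touches the network,
    because some neighbors are stored on other workers.
    """
    out = {}
    for node, neighbors in graph.items():
        group = [reps[node]] + [reps[v] for v in neighbors]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out


def lazy_train(graph: Graph, features: Dict[int, Vector],
               trainers: List[Trainer]) -> Tuple[List[Layer], int]:
    """Train layers one at a time; return the frozen layers and rounds used."""
    reps, frozen, rounds = features, [], 0
    for train_layer in trainers:
        aggregated = message_passing_round(graph, reps)  # one round per layer
        rounds += 1
        layer = train_layer(aggregated)  # purely local; earlier layers stay fixed
        frozen.append(layer)
        reps = {node: layer(vec) for node, vec in aggregated.items()}
    return frozen, rounds


def make_scaling_trainer(scale: float) -> Trainer:
    # Hypothetical stand-in for real training: returns a fixed rescaling layer.
    def trainer(aggregated: Dict[int, Vector]) -> Layer:
        return lambda vec: [scale * x for x in vec]
    return trainer


graph = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0], 1: [2.0], 2: [3.0]}
layers, rounds = lazy_train(graph, features,
                            [make_scaling_trainer(2.0), make_scaling_trainer(0.5)])
```

With two trainers, the loop performs exactly two message passing rounds, however many optimization steps each trainer runs internally.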

We contrast our procedure to the standard training shown in Figure 4. Standard training tries to optimize the parameters of all layers jointly, and therefore incurs $2\cdot K$ message passing rounds in every training round. Midway during training, when the first $K$ layers are only partly optimized, i.e., have poor classification power, the subsequent layers receive noisy input representations and the final computed loss is high. It takes several training rounds to reduce the loss in standard training, so the message passing incurred across these training rounds is excessive.

Lazy message passing avoids these expensive costs by improving the $K$ -hop classification accuracy for nodes before proceeding to improve it over the $(K+1)$ -hop. We will show in our evaluation (Section 4.2) that the node classification performance of our strategy is as high as the standard training in several benchmarked graph datasets.

3.2. Communication-efficiency Analysis

It is illustrative to compare the total data costs analytically for lazy message passing with the standard procedure. During any particular phase of training, we denote the network data costs as the size of the data that has been communicated over the network by all the workers during that phase. We also call it the network data volume. We present the main equations for the analytical network data volume here and provide detailed explanations in Appendix A.1.

Considering both local training and gradient aggregation phases, the ratio of the total network data volumes for standard ( $DV^{standard}_{total}$ ) and lazy message passing ( $DV^{lazy}_{total}$ ) strategies, which we define as $\gamma$ comes out to be,

[TABLE]

Consider a scenario where the network data volume during one round of local training is higher than the network data volume during gradient aggregation across all training rounds, i.e., $B\cdot(\sum_{m=1}^{K-1}H_{m}+H_{0})\geq M\cdot R$ . This is true for the real-world datasets we evaluated. In this case, $\gamma$ increases continuously with $B$ until it approximately stabilizes at $\Theta(R)$ . In our extended analysis (see Tables 5 and 6 in Appendix A.1), we show that the data volume measured while training on our distributed testbed matches our analysis for $\gamma$ .

In conclusion, we have shown that Lazy message passing training saves up to $\Theta(R)\times$ network data volume compared to the standard training. Under mild conditions that are true for real-world datasets, we show that the gap (ratio) in network data volume between the two training procedures increases with the number of boundary nodes per worker until a maximum of $\Theta(R)$ for a large number of boundary nodes.

Note. $\gamma$ can be computed even before training the GNN to assess the savings in communication costs prior to training.
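
As a rough illustration of how such a pre-training estimate could be computed, the sketch below evaluates $\gamma$ under a simplified cost model. The model is an assumption and not necessarily the exact equation of this section. It assumes that each message passing round makes every worker send and receive $B$ representations, that standard training repeats the forward exchanges in the backward pass with the same sizes, and that every gradient aggregation round moves the trained parameters to the master and back. The function name `gamma_estimate` and the example sizes are hypothetical, and all sizes are treated as unitless counts.

```python
from typing import Sequence


def gamma_estimate(n: int, B: int, R: int,
                   H: Sequence[int], L: Sequence[int]) -> float:
    """Rough ratio DV_standard_total / DV_lazy_total under a simplified model.

    H = [H_0, H_1, ..., H_K]: feature size, intermediate sizes, output size.
    L = [L_1, ..., L_K]: layer sizes; their sum is the model size M.
    """
    K = len(L)
    assert len(H) == K + 1, "need one representation size per layer plus H_0"
    M = sum(L)
    exchanged = sum(H[:K])              # H_0 + H_1 + ... + H_{K-1}
    one_pass = 2 * n * B * exchanged    # every worker sends and receives B reps

    # Temporary classifiers C_m = H_m * H_K for every layer except the last.
    classifiers = sum(H[m] * H[K] for m in range(1, K))

    # Standard: forward + backward exchanges and one gradient sync, every round.
    standard = R * (2 * one_pass + 2 * n * M)
    # Lazy: K exchanges in total, plus R gradient syncs for each layer.
    lazy = one_pass + R * 2 * n * (M + classifiers)
    return standard / lazy


# Hypothetical 2-layer model: 100 -> 64 -> 10, trained for R = 200 rounds.
sizes = dict(n=4, R=200, H=[100, 64, 10], L=[100 * 64, 64 * 10])
gammas = [gamma_estimate(B=b, **sizes) for b in (10**3, 10**5, 10**7)]
```

In this simplified model the ratio grows with $B$ and approaches $2\cdot R$, which is consistent with the $\Theta(R)$ behaviour stated above.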

3.3. Retexo

Retexo is a framework to train GNNs end-to-end using our lazy message passing strategy. We describe the framework conceptually using $3$ high-level abstract layers (see Figure 5).

Bottom layer.

This has the Backend communications component that handles the communication over TCP/IP, Bluetooth, or similar networking backends. Mainly, it provides three API calls: init, send, and recv. The init call can be used to initialize the communication channels between workers. Subsequently, send and recv can be used to deliver messages over these communication channels.
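
To make the three calls concrete, the following minimal sketch shows one way such an interface could look. The class names, the signatures, and the in-memory queue implementation are assumptions made for illustration; they are not Retexo's actual API, which is built on PyTorch Distributed.

```python
import queue
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class Backend(ABC):
    """Hypothetical interface exposing the three calls named in the text."""

    @abstractmethod
    def init(self, worker_ids: List[int]) -> None:
        """Set up one channel per ordered pair of workers."""

    @abstractmethod
    def send(self, src: int, dst: int, message: Any) -> None:
        """Deliver a message from worker src to worker dst."""

    @abstractmethod
    def recv(self, src: int, dst: int) -> Any:
        """Return the next message that src sent to dst."""


class InMemoryBackend(Backend):
    """Single-process stand-in that uses queues instead of TCP/IP or Bluetooth."""

    def init(self, worker_ids: List[int]) -> None:
        self.channels: Dict[Tuple[int, int], queue.Queue] = {
            (s, d): queue.Queue()
            for s in worker_ids for d in worker_ids if s != d
        }

    def send(self, src: int, dst: int, message: Any) -> None:
        self.channels[(src, dst)].put(message)

    def recv(self, src: int, dst: int) -> Any:
        # Raises queue.Empty if nothing arrives within one second.
        return self.channels[(src, dst)].get(timeout=1.0)


backend = InMemoryBackend()
backend.init([0, 1])
backend.send(0, 1, {"node": 7, "embedding": [0.1, 0.2]})
received = backend.recv(0, 1)
```

Keeping the interface this small is what would let the upper layers stay unchanged when the transport is swapped.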

Middle layer.

There are three components in this layer. The Reduce-sync-update (RSU) is responsible for three functionalities, aggregating (reducing) the gradients, synchronizing and scheduling the training process, and sharing the updated intermediate embeddings of the boundary nodes. The ML library is used to perform training and inference on the available hardware (CPU/GPU). Finally, LMP trainer is built using the ML library and the RSU components. It implements the logic of lazy message passing.

Top layer.

The GNN model zoo captures the GNN models that are built using the API provided by the ML library such as PyTorch or other third-party libraries such as Deep Graph Library (DGL), which is itself built on top of PyTorch. Custom GNN models can be written by the developer too. The Application Task component handles the tasks that are necessary for the GNN application, such as creating data splits to train for node classification and defining the loss function, evaluation metrics, and hyperparameters. Finally, the Application Task component can also be used to define custom aggregation functions to aggregate the gradients at the master. By default, the aggregation function is a simple average of gradients. However, one could also implement custom aggregation functions such as defenses against data poisoning, and training for better robustness (Liu and Yu, 2022).
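
The difference between the default and a custom aggregation function can be sketched as follows. The coordinate-wise median is used here only as a familiar example of a robust aggregator; it is an illustrative choice and not necessarily one of the defenses referenced above. Both function names are hypothetical.

```python
from statistics import median
from typing import List

Gradient = List[float]


def average_gradients(worker_grads: List[Gradient]) -> Gradient:
    """Default behaviour described in the text: a simple average per coordinate."""
    return [sum(vals) / len(vals) for vals in zip(*worker_grads)]


def median_gradients(worker_grads: List[Gradient]) -> Gradient:
    """Illustrative custom aggregator: coordinate-wise median."""
    return [median(vals) for vals in zip(*worker_grads)]


# Two well-behaved workers and one worker reporting an outlier gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [100.0, 0.0]]
avg = average_gradients(grads)
med = median_gradients(grads)
```

In this toy input the outlier pulls the first coordinate of the average far from the other two workers' values, while the median stays at 3.0.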

3.4. Implementation

We implement the current version of Retexo on top of three libraries: PyTorch, PyTorch Distributed, and DGL (Wang et al., 2019). PyTorch Distributed provides our Backend communications component, PyTorch is the ML library, and DGL is used in the top layer to support functionalities such as splitting the graph data and building the GNN models. The RSU component consists of three APIs: MultiThreadedReducer, sync_model, and sync_embeddings. The MultiThreadedReducer API handles sending the gradients to the master, aggregating them, and sending them back to the workers. Workers use the sync_embeddings API to exchange the trained embeddings of the boundary nodes with their peers. Similarly, sync_model is used to sync the model with the best validation accuracy among all the workers. These APIs are used by the LMP trainer to conduct training while ensuring that all the workers have a consistent state. Retexo can be modified to fit any application setup for graph learning by configuring the bottom and top layer components. For instance, we show that Retexo can be deployed even in highly decentralized mobile setups by building a custom Bluetooth-based network backend on top of the Flower federated learning library. We discuss this further in Section 4.7.

Measuring network data cost.

We instrument Retexo’s codebase to log the amount of data sent and received over all communication channels. To confirm that our instrumentation is correct, we use Wireshark to monitor the network and save the network packets sent and received on all communication channels. The difference between the measurements obtained from instrumentation and external monitoring is negligible (smaller than $0.1\%$ ) and it exists due to the additional network headers that the instrumentation ignores. Furthermore, our instrumentation allows us to measure network data volume for every worker irrespective of the underlying architecture on which the workers have been deployed. We deploy multiple workers per GPU and multiple workers on a single machine for instance. Workers within a single machine communicate over PCIe and across machines over Ethernet in an internal network. Nevertheless, our data volume measurements will still represent how much each worker would communicate with the others in any edge-partitioned distributed setup where all workers may communicate over the Internet or other bandwidth constrained channels.
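
A minimal sketch of this kind of instrumentation is shown below. It assumes that payloads are serialized with `pickle` and that the transport is a plain callable; `CountingChannel` and `relative_gap` are hypothetical names and do not come from Retexo's codebase.

```python
import pickle
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List, Tuple


class CountingChannel:
    """Wraps a send function and logs payload bytes per training phase."""

    def __init__(self, send_fn: Callable[[int, bytes], None]) -> None:
        self._send_fn = send_fn
        self.bytes_sent: DefaultDict[str, int] = defaultdict(int)

    def send(self, dst: int, obj: Any, phase: str) -> None:
        payload = pickle.dumps(obj)
        # Only the payload is counted; protocol headers are not visible here.
        self.bytes_sent[phase] += len(payload)
        self._send_fn(dst, payload)


def relative_gap(instrumented: int, captured: int) -> float:
    """Relative difference between logged bytes and an external packet capture."""
    return abs(captured - instrumented) / captured


delivered: List[Tuple[int, bytes]] = []
channel = CountingChannel(lambda dst, payload: delivered.append((dst, payload)))
channel.send(1, [0.1, 0.2, 0.3], phase="message_passing")
channel.send(0, [0.5], phase="gradient_aggregation")
```

Because the counter sees only payload bytes, an external capture is expected to report slightly more, which is the header overhead mentioned above.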

4. Evaluation

We have three main evaluation objectives.

(1) How well do RetexoGNNs perform compared to GNNs on supervised node classification tasks?

(2) What are the network communication characteristics of training GNNs with Retexo and how do they compare with the baseline training strategy of GNNs?

(3) How does Retexo’s training strategy compare with the state-of-the-art optimization, boundary node sampling (BNS)?

4.1. Evaluation Setup

Datasets & Splits.

We consider $7$ widely-used GNN benchmarks of varying nature and sizes ( $\approx$ $9K-100M$ edges) to measure Retexo’s performance. They include citation networks Cora, Citeseer, and PubMed; social networks, LastFMAsia, Facebook, and Reddit; and an Amazon product co-purchasing network called Products. We perform node classification in the commonly studied transductive setting (Hamilton et al., 2017) for all datasets where the graph is fixed and not changing between training, validation, and testing phases. Due to space constraints, we provide additional details regarding the datasets and their splits in Appendix B.1.

Models Architectures & Baselines.

We train the three most popular GNNs used in practice: GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), and GAT (Veličković et al., 2018). By training these GNNs using Retexo we obtain the corresponding RetexoGNNs, i.e., RetexoGCN, RetexoSage, and RetexoGAT. First, we compare Retexo with the baseline training strategy. We refer to the GNNs trained using the baseline strategy simply by the GNN names.

On the two large datasets Reddit and Products, we further compare our optimization strategy with boundary node sampling (BNS) since it is the state-of-the-art solution that applies to our problem setup to the best of our knowledge (Wan et al., 2022). We choose the sampling rate for sampling the boundary nodes to be $0.1$ as is suggested in their paper. We will also show why lower sampling rates than that might not be desirable. Thus the second set of baselines are BNS-GCN (0.1), BNS-SAGE (0.1), and BNS-GAT (0.1). Finally, we choose the hyperparameters to train GNNs based on the ones mentioned in the prior work (Wan et al., 2022; Hamilton et al., 2017) (see Appendix B.1 for more details).

System Specifications.

We conduct experiments on two machines each with $4$ A40 GPUs (48 GB) and AMD EPYC 7443P $24$ -Core Processors@ $2.8$ GHz with $252$ GB memory. All of them are connected via PCIe4x16 on a single machine. The two machines are interconnected via Ethernet on an internal network with $10$ Gbps bandwidth that is fully available to these machines.

Evaluation Outline.

Retexo embodies a novel training procedure, hence, it is important to compare the node classification performance of RetexoGNNs and GNNs trained with standard training. Therefore, we first report on the classification accuracy on all datasets considered. On all small datasets, we train RetexoGNNs and GNNs for $100$ rounds (epochs) and report the test accuracy achieved at that training round where the best validation accuracy is achieved. For large datasets, we train the GNNs first and observe when they start to converge to give their respective best accuracies. We find that this happens on both Reddit and Products at around $200$ training rounds. Hence, we fix the number of rounds to $200$ and train their corresponding RetexoGNNs, i.e., we train each layer of a $K$ -layer GNN for $200$ rounds thus effectively performing $200\cdot K$ training rounds.
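
The reporting rule used here (test accuracy at the round with the best validation accuracy) can be written as a small helper. The function name and the example numbers are hypothetical and serve only to illustrate the rule.

```python
from typing import Sequence, Tuple


def select_reported_accuracy(val_acc: Sequence[float],
                             test_acc: Sequence[float]) -> Tuple[int, float]:
    """Return (round index, test accuracy) at the best validation accuracy.

    Ties are broken in favour of the earliest round.
    """
    assert len(val_acc) == len(test_acc) and len(val_acc) > 0
    best_round = max(range(len(val_acc)), key=lambda r: val_acc[r])
    return best_round, test_acc[best_round]


# Made-up per-round accuracies, not results from the paper.
val_curve = [0.70, 0.78, 0.77]
test_curve = [0.69, 0.75, 0.76]
reported = select_reported_accuracy(val_curve, test_curve)
```

Note that the reported value (0.75 here) need not be the highest test accuracy seen during training, since the test set plays no role in selecting the round.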

We observe that the time taken to train a $K$ -layer GNN with standard training in $R$ rounds is comparable to or higher than that of training all layers in a GNN using Retexo for $R$ rounds. We report all accuracies over $10$ runs.

For the network data cost experiments we choose the two larger datasets Reddit ( $\approx$ $233K$ nodes and $114M$ edges) and Products ( $\approx$ $2.7M$ nodes and $62M$ edges). We perform distributed training for RetexoGNNs and the two baselines across various decentralization levels. We first compare the network data volume at low decentralization levels, i.e., $\#$ of workers $(N)=2,4$ , and report the data volume sent over the network. We repeat these experiments for higher decentralization levels of $N=6,8,16$ , and $24$ for Reddit and $N=6,8$ for Products. We do not choose more workers for Products due to the limited computational resources available on our test machines. However, we expect the conclusions drawn from our results to extend to larger numbers of workers and to larger graph sizes, for instance, beyond $100M$ nodes.

4.2. Accuracy

We report the best accuracies achieved by RetexoGNNs and their corresponding GNNs in Table 2. The accuracy of RetexoGNNs is within $\pm 1\%$ of their corresponding standard GNNs, and in a few configurations, RetexoGNNs perform better by up to $4\%$ and $8\%$ . Specifically, on the Products dataset, RetexoGCN and RetexoSage have accuracies close to $79\%$ , which are higher than the best reported accuracies for their GNN counterparts on the public leaderboard (Products leaderboard: https://ogb.stanford.edu/docs/leader_nodeprop/#ogbn-products): GCN ( $75.6\%$ ) and GraphSAGE ( $78.2\%$ ). This shows that training GNNs with lazy message passing achieves classification performance as good as or better than standard training.

4.3. Data Volume for low Decentralization

To measure the network data costs we start with the lowest level of decentralization, with only $2$ workers. This setup is supposed to be the least expensive setup for communication for any architecture, since the majority of the graph’s edges would be within the workers themselves and gradients across only two workers need to be synced, which reduces the total data volume communicated during message passing.

We plot accuracy vs. total data volume sent over the network per worker during the entire training process in Figure 6 for all GNN architectures. We consider only $2$ workers for all configurations except for GAT-based models in Products due to limited GPU memory. We find that standard training of GNNs requires sending up to $285$ GB for Reddit and $1.9$ TB for Products over the network. In comparison, training GNNs with Retexo only requires up to $2.2$ GB and $6$ GB to be sent over the network.

The state-of-the-art system, BNS-GNNs ( $0.1$ ), requires sending up to $29.5$ GB and $191$ GB over the network per worker. Therefore, training RetexoGNNs requires up to $315\times$ and $32\times$ lower data volume than all baselines on the Products dataset. At the same time, RetexoGNNs perform as well as (or better than) the compared baselines.

To understand the stark difference in data volume required for RetexoGNNs vs other baselines, we plot the data volume required for gradient aggregation and message passing separately in Figure 7. The data volume required for message passing is the dominating factor in this setup. For instance, more than $99.5\%$ of the data is sent during the message passing phase for standard GNNs and more than $96\%$ for BNS-GNNs ( $0.1$ ). This is expected, since in every training round only one gradient per worker is sent over the network during the gradient aggregation phase, whereas during message passing the features and gradients corresponding to all boundary nodes are exchanged between workers. In contrast, in Retexo there are only $K$ message passing rounds for a $K$ -layer GNN, and therefore the data volumes required during the two phases are comparable to each other.

The network data exchanged to train RetexoGNN is $1-2$ orders of magnitude less than with standard training and with BNS-GCN (0.1) for real-world graphs.

4.4. Data Volume for High Decentralization

We increase the number of workers from $2$ to $24$ for Reddit and $2$ to $8$ for Products to measure the total data volume and we report the values in Figure 8 for GraphSAGE-based architectures. On each bar, we present a number which represents the ratio of total data volume of that bar to the corresponding bar of RetexoGNN. GCN and GAT-based architectures have similar plots, hence elided here for brevity. We see that as the number of workers increases, the advantage offered by RetexoGNNs over other baselines increases. For instance, the gap between training RetexoSage and GraphSAGE increases from $315\times$ to $321\times$ for the Products dataset.

This is expected as per our analysis for $\gamma$ (see Section 3.2) which is the ratio of network data volume for baseline strategy and the lazy message passing strategy. Specifically, in all of our evaluated configurations, as decentralization increases more nodes per worker become boundary nodes for other workers. Therefore, the total data volume sent over the network during each message passing round also increases and still dominates the gradient aggregation phase.

The gap in data volume between training GNNs using Retexo vs baselines increases with decentralization.

4.5. Accuracy vs Sampling

Retexo offers an orthogonal opportunity to optimize network costs compared with prior approaches, most prominently sampling as in BNS. It is nonetheless worth comparing how the two differ. Until now we have compared against BNS at a $0.1$ sampling rate, as also considered in prior work. We can further measure what happens if we sub-sample the boundary nodes even more aggressively. We observe that if the sampling rate is $0.01$ instead of $0.1$ , the data volume during message passing reduces by $10\times$ for BNS as expected, bringing its data efficiency within the same order of magnitude as Retexo’s. Figure 8 shows this. But, crucially, such aggressive sampling results in a significant loss of GNN accuracy. We illustrate this for GAT-based architectures, since they are the most affected, in Table 3. All architectures are trained for the same number of rounds, namely the number at which the original GAT converges. For both datasets, the accuracy drops sharply by up to $50\%$ for a sampling rate of $0.01$ and drops significantly ( $5.6\%$ ) even for a sampling rate of $0.1$ for Products. In contrast, Retexo’s training procedure does not require sampling boundary nodes, so its accuracy does not change with the level of decentralization.
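
The linear relation between the sampling rate and the message passing volume can be illustrated with a toy sampler. This is a simplified sketch and not the BNS implementation: it assumes that a fixed fraction of the boundary nodes is drawn uniformly in each round and that every kept node costs the same number of bytes. The function name and all numbers are hypothetical.

```python
import random
from typing import List


def sample_boundary_nodes(boundary_nodes: List[int], rate: float,
                          rng: random.Random) -> List[int]:
    """Keep a uniformly random fraction `rate` of the boundary nodes."""
    keep = round(rate * len(boundary_nodes))
    return rng.sample(boundary_nodes, keep)


rng = random.Random(0)
boundary = list(range(1000))   # hypothetical boundary node ids of one worker
bytes_per_node = 512           # hypothetical representation size

kept_10 = sample_boundary_nodes(boundary, 0.1, rng)
kept_01 = sample_boundary_nodes(boundary, 0.01, rng)
volume_10 = len(kept_10) * bytes_per_node
volume_01 = len(kept_01) * bytes_per_node
```

Under these assumptions, lowering the rate from $0.1$ to $0.01$ cuts the per-round volume by $10\times$, but each node then sees only a hundredth of its cross-worker neighbors in a round.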

4.6. Training Time

The secondary impact of reducing the network data volume is on the training time. Note that our testbed is similar to that of prior works (Wan et al., 2022; Gandhi and Iyer, 2021) and our baselines have been optimized to minimize the training time. On our testbed, the total time taken to train a GNN is split between the local computations on GPUs and network communication, with the latter being the bottleneck. We assume a centralized data center setup and do not throttle the bandwidth to give full advantage to our baselines. We also add different network latencies, from [math] to $200$ ms, to simulate workers in different geographical regions connected over dedicated high-bandwidth communication channels. The latencies are in line with those reported for geographically separated Azure servers as reference (https://learn.microsoft.com/en-us/azure/networking/azure-network-latency?tabs=Americas%2CWestUS). Table 4 presents the total training time to train GraphSAGE on $4$ workers for $200$ training rounds.

Retexo trains GraphSAGE faster than both the baselines in all evaluated configurations for Reddit and Products datasets. At no additional latency simulated, Retexo trains GraphSAGE by up to $7\times$ and $22\times$ faster than the baseline procedure on Reddit and Products datasets respectively. Further, compared to BNS at a sampling rate $0.1$ , Retexo is slightly faster on Reddit and $2.7\times$ faster on Products. Retexo spends almost no time during the message passing rounds as expected. Most of its training time is spent during the local training phases and in gradient aggregation. In contrast, the baselines spend most of their time during the message passing rounds.

As latency increases, the training time for all procedures increases. At an additional latency of $50$ ms, Retexo is up to $15\times$ faster than the baseline procedure and $2\times$ faster than BNS. With increasing latencies, Retexo maintains the advantage, though the improvement in training time is not linear compared to other baselines. This is primarily because the network data volume is much smaller during the gradient aggregation phase than during the message passing rounds for all training procedures. Therefore, message passing rounds are bandwidth-bound as opposed to latency-bound, whereas gradient aggregation is latency-bound (at least for network throughput higher than $32$ Mbps per worker for GNNs with less than $10^{6}$ parameters, i.e., $4$ MB). Retexo has far fewer message passing rounds than the baselines, but $K\cdot R$ gradient aggregation rounds compared to $R$ rounds in the baselines. So, the impact of increasing latency is more pronounced on Retexo, since more of its internal phases are latency-bound. In contrast, the baselines have many more bandwidth-bound message passing phases, and hence, increasing latency has a less pronounced end effect on their resulting training time. Nevertheless, even at $200$ ms latency, Retexo is up to $8\times$ faster than the standard training and $1.5\times$ faster than BNS.
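
The latency-bound versus bandwidth-bound argument can be made concrete with a simple cost model in which every communication round pays one latency plus its transfer time. This is an illustrative sketch: the model, the function names, and all numbers (per-round volumes, bandwidth, $K$, $R$) are assumptions and not measurements from our testbed. For simplicity it also charges both procedures the same gradient size per round and ignores local computation.

```python
def phase_time(rounds: int, bytes_per_round: float,
               latency_s: float, bandwidth_bps: float) -> float:
    """Each round pays one latency plus the time to transfer its payload."""
    return rounds * (latency_s + 8 * bytes_per_round / bandwidth_bps)


def comm_time(mp_rounds: int, ga_rounds: int, latency_s: float,
              mp_bytes: float = 100e6, ga_bytes: float = 4e6,
              bandwidth_bps: float = 10e9) -> float:
    """Total communication time of message passing plus gradient aggregation."""
    return (phase_time(mp_rounds, mp_bytes, latency_s, bandwidth_bps)
            + phase_time(ga_rounds, ga_bytes, latency_s, bandwidth_bps))


K, R = 2, 200  # hypothetical depth and number of training rounds
speedups = {}
for latency in (0.0, 0.05, 0.2):
    standard = comm_time(mp_rounds=2 * K * R, ga_rounds=R, latency_s=latency)
    lazy = comm_time(mp_rounds=K, ga_rounds=K * R, latency_s=latency)
    speedups[latency] = standard / lazy
```

In this toy model the speedup of the lazy schedule shrinks as latency grows, because almost all of its remaining communication consists of small, latency-bound gradient exchanges, yet it stays above $1$ at every latency considered.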

Retexo trains GNNs faster than both baselines across all evaluated configurations. Specifically, it is up to $22\times$ and $2.7\times$ faster than the baseline procedure and BNS ( $0.1$ ) respectively at low network latencies.

4.7. Extending to High Decentralization

All of the aforementioned results for Retexo indicate that it scales gracefully with increasing decentralization due to its small number of message passing rounds. Now, we also demonstrate the same by training GNNs using Retexo in a fully-distributed setup. In the fully-distributed setup, workers can be mobile or stationary IoT devices. Mainly, these devices are bandwidth constrained and may be in mobile environments with unreliable networks. We consider a specific application scenario of proximity networks. In such networks, workers share information with each other over low-bandwidth Bluetooth channels or using WiFi-Direct when in close proximity. Such networks are widely used for contact-tracing and for sharing content with contacts (on Apple Airdrop and Android’s Nearby Share). To simulate these applications, we use two Raspberry Pis (RPis) as mobile devices and train a GCN using Retexo on them. The RPis communicate with each other over Bluetooth and communicate with a server over Wi-Fi (Internet). The server coordinates the training of the GCN on these mobile workers.

We split the Cora dataset equally among them to train a $2$ -layer GCN on these devices. We provide additional details on the setup and the training procedure in Appendix B.2. We build Retexo for the fully-distributed setup on top of a popular federated learning framework called Flower (Beutel et al., 2020) (see Appendix B.2 for more details). It offers easier tools to build for highly decentralized setups (https://github.com/adap/flower). The framework provides worker-to-master channels and we implement our own worker-to-worker communication channels over Bluetooth. Quantitatively, we report the performance characteristics we observe during training.

The total data communicated for gradient aggregation, including the headers added by the network protocols, to train the first layer is $585$ MB. In total, $0.08$ MB is communicated over Bluetooth for the message passing rounds. The maximum RAM consumed on each RPi is $165.44$ MB. Retexo requires $63.28$ seconds for message passing, which includes the connection time, requesting specific nodes’ embeddings, and exchanging the requested ones. Overall, the RAM consumed is well within the constraints of typical embedded systems ( $<1$ GB RAM). Finally, the fully-distributed setup may have only one inner graph node per worker, for instance, when the worker is a mobile phone of a user (node) in a social or contact network. Resource-wise, training RetexoGNNs is even more practical in such setups, unlike above, since each RPi will consume much less RAM and communicate much less data. For instance, two RPis require only $0.6$ seconds to connect and exchange a single node’s embedding. This conclusion is independent of graph size since the resources needed per RPi do not scale with the number of nodes.

Retexo is the first framework to demonstrate training GNNs in fully-decentralized setups corresponding to mobile and edge networks.

5. Related Work

With the rising prominence of GNNs and their applications on industry-scale large graphs, recently, many works have addressed the problem of efficiently training them by distributing the graph data and GNN computation.

Efficient GNN Training.

Many works focus on reducing the end-to-end training time in a centralized datacenter-like setup. They can be further split into systems designed for single-machine (Hu et al., 2020b; Ma et al., 2019; Lin et al., 2020; Jangda et al., 2021) and multi-machine settings (Zheng et al., 2020; Cai et al., 2021; Gandhi and Iyer, 2021; Wan et al., 2022; Zhu et al., 2019b; Jia et al., 2020; Md et al., 2021; Thorpe et al., 2021; Zheng et al., 2022; Liu et al., 2023a).

Single-machine systems assume that the graph data is located on a single machine, possibly with multiple CPUs/GPUs to train the GNNs. FeatGraph (Hu et al., 2020b) generates optimized kernels for both CPUs and GPUs that execute GNN-based operations such as message passing faster. NeuGraph (Ma et al., 2019) optimizes CPU-GPU data transfer to execute GNN computation graphs, by proposing a streaming scheduler that maximizes the overlap between computation and data transfer. Other works optimize the speed of minibatch sampling for graphs on GPUs (Lin et al., 2020; Jangda et al., 2021). Single-machine systems do not consider network data cost as a bottleneck. Their optimizations are orthogonal and complementary to our work, i.e., they can be implemented for Retexo to improve its performance during local training phase on individual machines.

Multi-machine systems assume that the graph data can be partitioned and stored on multiple networked machines. They all aim to train GNNs fast and propose several optimizations to do so. Unlike the single-machine systems, they raise the issue of high network data costs incurred during message passing rounds being the main bottleneck. Roc (Jia et al., 2020), DistGNN (Md et al., 2021), and DGCL (Cai et al., 2021) follow the standard training strategy but propose various competing graph partition strategies to reduce the network data costs. AliGraph (Zhu et al., 2019b), P3 (Gandhi and Iyer, 2021), DistDGL (Zheng et al., 2020), BGL (Liu et al., 2023a), and ByteGNN (Zheng et al., 2022) combine graph partitioning and mini-batch sampling in different ways to reduce the network data cost. At their core, they do not follow the standard training procedure that we have described. In every training round, the workers subsample a set of nodes (mini-batch) and then fetch the entire K-hop neighborhoods of those nodes from other workers to train the GNN. This essentially replicates raw features and edges of inner nodes, that are beyond the boundary nodes, of other workers. It is evident that all of these systems have been designed for the centralized setup and none of the aforementioned optimizations would apply in the distributed setups with data decentralization constraints. Further, their definition of “efficiency” corresponds to training speed only since network data volume in the centralized setup is not costly (financially). In contrast, we primarily focus on reducing the network data volume since it is the root cause for expensive training in terms of money and time. Finally, to the best of our knowledge, BNS-GCN is the only system to perform the standard training procedure with optimization that reduces network data costs in our setup (see Section 2.4).

Alternative Distributed Setups.

Prior works have trained GNNs in other distributed setups with different ways of storing graph data, and with different notions of privacy from our setup. These works (Chen et al., 2021; Wang et al., 2022b; Zheng et al., 2021; Zhang et al., 2021b; He et al., 2021; Wang et al., 2022a) propose distributed setups where cross-worker communication is not performed at all to train GNNs. They assume that the entire graph is replicated across workers (Wang et al., 2022b) or ignore the cross-worker edges altogether, thus, effectively conducting training similar to BNS-GCN with [math] sampling rate (He et al., 2021; Wang et al., 2022a). Further, (Chen et al., 2021) and (Zhang et al., 2021b) make the master help the workers to predict missing graph information from other workers apart from just aggregating gradients. These works are motivated differently, such as dealing with non-IID graph data partitions, and do not focus on training GNNs efficiently on large graphs. Their techniques do not extend to the edge-partitioned distributed setup, and also to high decentralization levels.

Finally, GNNs have been trained with differential privacy guarantees in distributed setups, which allows workers to send node features and edges to the master with adequate anonymization (Sajadmanesh and Gatica-Perez, 2021; Kolluri et al., 2022; Zhu et al., 2023). However, these works assume that partial graph data, edges or features, is accessible to the master worker and privacy is preserved for the inaccessible data. Further, differential privacy can be detrimental to classification accuracy; the trained GNNs in the aforementioned works have a performance loss of more than $10\%$ compared to non-private training. We believe that Retexo provides a baseline platform to implement and test differentially private training of GNNs and leave it for future work.

6. Conclusion

We have presented a new framework Retexo to train GNNs on distributed machines, for all decentralization levels, while eliminating the severe communication bottlenecks. Unlike prior works, Retexo does not need the machines to share raw graph edges or node features with any centralized coordinator or other machines. Retexo reduces the network data costs by up to $2$ orders of magnitude compared to state-of-the-art baselines to train GNNs on benchmarked datasets. To achieve this, Retexo implements a novel training strategy, called lazy message passing, where the key idea is to reduce the number of message passing rounds required to train the GNN compared to the standard training procedure.

Appendix A Training GNNs with Retexo

A.1. Communication-efficiency Analysis

Here we describe how we arrive at the equations in Section 3.2. Consider a graph randomly split between $n$ workers, where each worker has $B$ boundary nodes spread across all other workers. The workers and the master collaborate to train a $K$-layer GNN. Each layer $m$ has size $L_{m}$ and outputs an intermediate representation of size $H_{m}$. Note that the sum of all layer sizes is the model size $M$. The feature size is $H_{0}$ and the output embedding size is $H_{K}$. Further, to train each layer except the last one, the lazy message passing procedure adds a new classifier function to the layer. The size of the classifier function is $C_{m}=H_{m}\times H_{K}$, since it takes the representation as input and outputs a vector of the same size as the last layer’s embedding. Hence, the total layer size to train becomes $L_{m}+C_{m}$. Suppose that the model takes $R$ rounds to converge using standard training. Correspondingly, with the lazy message passing procedure we train each layer of the GNN independently for $R$ training rounds. This keeps the comparison fair, since each layer of the GNN is then trained for $R$ rounds in both the standard and the lazy message passing procedures.

Local training phase.

Lazy message passing requires only $K$ message passing rounds, one before training each layer. In each message passing round, the workers exchange representations of the boundary nodes with each other. Specifically, each worker receives the representations of its boundary nodes from other workers and sends the representations of its inner nodes that are boundary nodes for other workers. If a worker requests representations of $B$ boundary nodes, then on average it also sends $B$ representations of its inner nodes (since the nodes are randomly partitioned). The representations have size $H_{m}$. Therefore, the total data volume communicated among the $n$ workers, $DV^{lazy}_{lt}$, is as follows.

[TABLE]

Now consider the standard training procedure. Recall that each training round has $2K$ message passing rounds, split equally between the forward and backward passes. During the forward pass, the workers send and receive representations of the boundary nodes. During the backward pass, they send and receive the gradients corresponding to the previously shared representations. For instance, if a “target” worker receives $H_{1}$ of a boundary node from a “source” worker during the forward pass, then the target worker sends the gradient of $H_{1}$ back to that source worker. The exchanged gradients therefore have the same size as the corresponding representations. Across all training rounds, the features $H_{0}$ are the only representations that do not change and do not produce gradients, since they are fixed, non-trainable vectors. Hence, once the features of the boundary nodes are shared in one training round, they can be cached for all future rounds. With this caching, the total data volume communicated across $R$ training rounds, $DV^{standard}_{lt}$, is as follows.

[TABLE]

Therefore, the ratio can be computed as,

[TABLE]

Therefore, if $H_{0}\leq H_{m}$, the data volume is approximately $\Theta(R)\times$ higher for standard training than for lazy message passing. $R$ can be arbitrarily large in practice, depending on the specific application scenario. GNNs for node classification have been trained for as many as $3000$ rounds in prior works (Wan et al., 2022).
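The counting argument above can be written out explicitly. The following is a reconstruction from the prose, not the typeset equations of the paper. It assumes that both the $B$ representations a worker sends and the $B$ it receives are counted, which gives the factor $2nB$; counting each transfer only once would halve both volumes and leave the ratio unchanged. It also assumes that the representations exchanged before training layers $1,\dots,K$ are $H_{0},\dots,H_{K-1}$, and that no gradients are exchanged for $H_{0}$.

```latex
\begin{align*}
DV^{lazy}_{lt} &= 2nB \sum_{m=0}^{K-1} H_{m}, \\
DV^{standard}_{lt} &= 2nB \Big( H_{0} + 2R \sum_{m=1}^{K-1} H_{m} \Big), \\
\frac{DV^{standard}_{lt}}{DV^{lazy}_{lt}}
  &= \frac{H_{0} + 2R \sum_{m=1}^{K-1} H_{m}}{H_{0} + \sum_{m=1}^{K-1} H_{m}}.
\end{align*}
```

Under these assumptions, and for $K\geq 2$ and $R\geq 1$, the condition $H_{0}\leq H_{m}$ for every $m$ bounds the denominator by $\frac{K}{K-1}\sum_{m=1}^{K-1}H_{m}$, so the ratio lies between $2R\,\frac{K-1}{K}$ and $2R$, which is the $\Theta(R)$ behavior stated above.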

Gradient aggregation phase.

In every training round of the standard training procedure, each worker sends its gradient vector, of model size $M$, to the master for aggregation. After aggregation, each worker receives the updated model of size $M$ to start the next local training phase. Therefore, the total network data volume communicated by all workers across $R$ training rounds during this phase, $DV^{standard}_{reduce}$, is as follows.

[TABLE]

For the lazy message passing procedure, the gradient aggregation phase occurs in a similar fashion while training each GNN layer. Therefore, if the layer sizes are $L_{m}+C_{m}$, then the total network data volume communicated, $DV^{lazy}_{reduce}$, is as follows.

[TABLE]

Therefore, the ratio can be computed as,

[TABLE]

The ratio is close to $1$ because the model size $M$ is usually much larger than the total size of the additional classifier layers. For the GNNs we evaluated, we find that the model size $M$ can be more than $20\times$ the sum of the sizes of the added classifier layers. In that scenario, the ratio would be $\frac{1}{1+0.05}\approx 0.95$. Effectively, there may not be much difference between the data volumes communicated during this phase, even though the GNN is trained layer-by-layer for $R$ rounds each with additional classifier layers.
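The arithmetic of this phase can be checked numerically. The sketch below is a reconstruction from the prose rather than the paper's typeset equations: it assumes standard training moves $2nRM$ and lazy message passing moves $2nR(M+\sum_{m=1}^{K-1}C_{m})$, with $C_{m}=H_{m}\times H_{K}$ for the first $K-1$ layers. The function names and the example sizes are hypothetical.

```python
def classifier_sizes(hidden_sizes, output_size):
    """C_m = H_m * H_K for each of the first K-1 layers (assumed form)."""
    return [h * output_size for h in hidden_sizes]


def aggregation_ratio(model_size, hidden_sizes, output_size):
    """Standard-to-lazy ratio of gradient-aggregation data volume.

    The common factor 2 * n * R (send + receive, n workers, R rounds)
    cancels, so only the parameter counts matter.
    """
    extra = sum(classifier_sizes(hidden_sizes, output_size))
    return model_size / (model_size + extra)


# Hypothetical sizes: one hidden representation of size 256 (K = 2),
# output size 7, and a model 20x larger than the added classifier.
extra = sum(classifier_sizes([256], 7))   # 1792
ratio = aggregation_ratio(20 * extra, [256], 7)
print(round(ratio, 3))  # 0.952
```

Because the factor $2nR$ cancels, the ratio depends only on how large the added classifiers are relative to the model.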

Following from the above analysis, we use Eq. 2 to compute the theoretically expected $\gamma$, the ratio of network data volumes transferred during standard training versus lazy message passing. We then compare this theoretical value with the empirically observed ratios from our experiments in Tables 5 and 6.

Appendix B Evaluation

B.1. Evaluation Setup

Here we provide additional details on our evaluation setup.

Datasets & Splits.

Cora, Citeseer, and PubMed are some of the first benchmarks used to measure the node classification performance of GNNs. These are citation networks where nodes are documents and their features represent the presence or absence of certain keywords. The LastFMAsia, Facebook, and Reddit datasets are taken from their respective social networking platforms. In the LastFMAsia dataset, nodes represent users from Asia and edges represent friendships. Facebook is a social network graph where nodes are pages on Facebook and edges represent mutual follower relationships. The Reddit dataset consists of posts obtained from the Reddit platform. Two posts have an edge between them if the same users have commented on both posts. The Products dataset is taken from products listed on Amazon, where the nodes are the products, their descriptions are the features, and edges represent co-purchases. In all of the above datasets, the task is to classify the nodes into a known set of classes. For instance, in the Products dataset, the classes correspond to top-level product categories on Amazon.

We perform node classification in the transductive setting for all datasets, where the graph is fixed and does not change between the training, validation, and testing phases. For all datasets except Reddit and Products we use $20\%$ of the data for training, $10\%$ for validation, and the rest for testing. For Reddit we use the publicly available splits of $66\%$ for training, $10\%$ for validation, and $24\%$ for testing (Hamilton et al., 2017). Similarly, for Products we use the publicly available splits of $8\%$ for training, $2\%$ for validation, and the remainder for testing (Hu et al., 2020a).
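As an illustration of the $20\%$/$10\%$/rest split used for the smaller datasets, a node-level split could be produced as sketched below. The text does not say how nodes are assigned to each part, so uniform random sampling with a fixed seed is an assumption here, and `random_split` is a hypothetical helper rather than code from the Retexo repository.

```python
import random


def random_split(num_nodes, train_frac=0.2, val_frac=0.1, seed=0):
    """Split node ids 0..num_nodes-1 into train/val/test index lists.

    Assumption: nodes are assigned uniformly at random; the remainder
    after the train and validation parts is used for testing.
    """
    ids = list(range(num_nodes))
    random.Random(seed).shuffle(ids)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


train, val, test = random_split(1000)
print(len(train), len(val), len(test))  # 200 100 700
```

In the transductive setting all three index sets refer to nodes of the same fixed graph; only the labels used for the loss differ between phases.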

Hyperparameters.

To make our results easily reproducible, we do not tune the hyperparameters. For Reddit and Products benchmarks we follow the same setup as prior work and fully reproduce their reported accuracy (Wan et al., 2022).

• For Reddit, we use $4$-layer models for GCN and GraphSAGE based models, with a learning rate of $0.01$ and a dropout of $0.5$. For GAT based models, we use $2$ layers, since we observe that GAT performs much better with $2$ layers than with $4$, and we use a learning rate of $0.003$ with no dropout.

• For Products, we use $3$-layer models for all GNNs. We set the learning rate to $0.003$, with a dropout of $0.3$ for GCN and GraphSAGE and no dropout for GAT.

For the other datasets, we use GNNs with $2$ aggregation layers and set the learning rate to $0.003$ with a dropout of $0.3$. The hidden size for all GNNs is set to $256$ for all datasets. These values for the number of aggregation layers and the hidden size are commonly used in prior work, including BNS-GCN (Wan et al., 2022; Hamilton et al., 2017; Kipf and Welling, 2017). Note that we use the same hyperparameters for a base GNN and for the corresponding RetexoGNN and BNS-GNN. For instance, the hyperparameters used for GCN are also used for RetexoGCN and BNS-GCN (0.1).
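The settings listed above can be collected into a single lookup table, which makes it easy to see that a RetexoGNN or BNS-GNN variant reuses the entry of its base GNN. The values below are taken from the text; the dictionary layout and the `get_hyperparams` helper are hypothetical and only one way to organize them.

```python
HIDDEN_SIZE = 256  # hidden size used for all GNNs on all datasets

# Default for every dataset other than Reddit and Products.
DEFAULT = {"layers": 2, "lr": 0.003, "dropout": 0.3}

# Per-dataset, per-base-model overrides, as listed in the text.
OVERRIDES = {
    ("Reddit", "GCN"): {"layers": 4, "lr": 0.01, "dropout": 0.5},
    ("Reddit", "GraphSAGE"): {"layers": 4, "lr": 0.01, "dropout": 0.5},
    ("Reddit", "GAT"): {"layers": 2, "lr": 0.003, "dropout": 0.0},
    ("Products", "GCN"): {"layers": 3, "lr": 0.003, "dropout": 0.3},
    ("Products", "GraphSAGE"): {"layers": 3, "lr": 0.003, "dropout": 0.3},
    ("Products", "GAT"): {"layers": 3, "lr": 0.003, "dropout": 0.0},
}


def get_hyperparams(dataset, base_model):
    """Return the settings for a base GNN; Retexo and BNS variants reuse them."""
    params = dict(OVERRIDES.get((dataset, base_model), DEFAULT))
    params["hidden_size"] = HIDDEN_SIZE
    return params


print(get_hyperparams("Cora", "GCN"))
# {'layers': 2, 'lr': 0.003, 'dropout': 0.3, 'hidden_size': 256}
```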

B.2. Feasibility Study: Fully-distributed Setup

We implement and deploy end-to-end training of RetexoGNNs on networked embedded-system workers. The fully-distributed setup is motivated by mobile applications where there is typically only one graph node per worker, such as a user of a social network or contact network. We cannot demonstrate the fully-distributed setup even for small benchmarks such as the Cora dataset, since it would require over $2700$ workers. Instead, we demonstrate the training on $2$ devices with the dataset horizontally split between them. We believe that our demonstration provides sufficient evidence that RetexoGNNs can be immediately deployed in mobile and asynchronous fully-distributed setups, such as individual users with their mobile phones as workers.

Hardware.

We choose $2$ Raspberry Pi model 4B boards with WiFi and Bluetooth support. One has $2$ GB of RAM and the other has $4$ GB; neither has GPU support. We implement a central server (master) on an Ubuntu 20.04 machine with $128$ GB RAM. The server’s location is immaterial to showing the feasibility of training RetexoGNNs on workers; however, it may slightly affect the total training time due to network latency. The server is connected to Ethernet with approximately $100$ Mbps upload and download speeds, measured at a random point in time. The workers communicate with the server over the Internet and with each other over Bluetooth when in proximity. The RPi model 4B supports Bluetooth 5.0.

Software stack.

We implement Retexo’s network backend component (see Retexo’s components in Section 3.3) using the Flower library (Beutel et al., 2020). The library provides the communication API for the worker-to-master communication channels. We build a custom worker-to-worker communication channel using the Python socket API on top of the Bluedot library (https://github.com/martinohanlon/BlueDot). Correspondingly, we build an RSU component and an LMP trainer on top of this network backend. The popular libraries used for GNNs, such as torch-geometric and torch-scatter, do not support ARM architectures, so none of the preexisting libraries for GNN training can be used on Raspberry Pis. We therefore build custom RetexoGNNs on plain PyTorch without using any of the aforementioned libraries.

Dataset, Model, and Split.

We use the Cora dataset and RetexoSage with $2$ aggregation layers. We take half of the nodes at random and store them, along with their induced subgraph, on one worker. We store the remaining nodes and their induced subgraph on the other worker. On each worker, we also store the boundary nodes and edges. We observe that each worker holds $1354$ nodes and that there are $1088$ boundary edges. In a fully-distributed setup, there would be a single node on each worker, and the boundary edges would simply be that node’s neighbors. For the train/val/test split we use the publicly available split (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Planetoid.html#torch_geometric.datasets.Planetoid).
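The two-worker partition described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in the case study: nodes are integer ids, edges are undirected pairs, and `split_between_two_workers` is a hypothetical helper name.

```python
import random


def split_between_two_workers(num_nodes, edges, seed=0):
    """Randomly assign half of the nodes to worker 0 and the rest to worker 1.

    Returns the node-to-worker map, each worker's induced (local) edges,
    the cross-worker boundary edges, and, per worker, the set of remote
    nodes whose representations it must request.
    """
    nodes = list(range(num_nodes))
    random.Random(seed).shuffle(nodes)
    half = num_nodes // 2
    owner = {v: 0 for v in nodes[:half]}
    owner.update({v: 1 for v in nodes[half:]})

    local_edges = {0: [], 1: []}
    boundary_edges = []
    for u, v in edges:
        if owner[u] == owner[v]:
            local_edges[owner[u]].append((u, v))  # induced subgraph edge
        else:
            boundary_edges.append((u, v))         # crosses the two workers

    remote_nodes = {0: set(), 1: set()}
    for u, v in boundary_edges:
        remote_nodes[owner[u]].add(v)  # u's worker needs v's representation
        remote_nodes[owner[v]].add(u)  # v's worker needs u's representation
    return owner, local_edges, boundary_edges, remote_nodes


# Toy example: a ring of 6 nodes.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
owner, local_edges, boundary_edges, remote_nodes = split_between_two_workers(6, ring)
```

With one node per worker, as in the fully-distributed setup, every edge would be a boundary edge and each worker's remote set would be exactly its node's neighbors.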

Message passing.

When the worker-to-worker communication channels run over Bluetooth, we have to wait until the two RPis come into close proximity. When in proximity, the RPis exchange the embeddings of the necessary cross-worker neighbors. Specifically, each RPi sends a message requesting from the other the embeddings of the nodes corresponding to the cross-worker edges. After receiving this first message, each RPi sends a message containing the requested nodes’ embeddings. Once the message passing round is finished, the RPis are ready to train the next layer. In this setup, the message passing round requires exchanging messages between just two workers. In the fully-distributed setup, the round would end after each worker exchanges messages with many other workers. Once the communication channel is available, the time required for exchanging messages between two workers is bounded. However, since the communication channels may not be available, the time required to complete the message passing round can be arbitrary and depends on many factors, such as the end application and the mobility patterns of workers. In practice, a time period can be allocated to complete one message passing round, such as $1$ day, if a worker is expected to be able to exchange embeddings with most or all of its neighboring workers during that time.
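The request/response exchange described above can be sketched with ordinary Python sockets. This is an illustration only: a local `socket.socketpair()` stands in for the Bluetooth link, and the JSON payloads, the 4-byte length prefix, and all function names are assumptions rather than the wire format used in the case study.

```python
import json
import socket
import struct


def send_msg(sock, obj):
    """Send one JSON message with a 4-byte big-endian length prefix."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


def exchange_round(sock_a, emb_a, need_a, sock_b, emb_b, need_b):
    """One message passing round between workers A and B (single process)."""
    # Step 1: each worker requests the embeddings of its cross-worker neighbors.
    send_msg(sock_a, {"request": need_a})
    send_msg(sock_b, {"request": need_b})
    asked_of_b = recv_msg(sock_b)["request"]
    asked_of_a = recv_msg(sock_a)["request"]
    # Step 2: each worker replies with the requested embeddings.
    send_msg(sock_b, {"embeddings": {str(v): emb_b[v] for v in asked_of_b}})
    send_msg(sock_a, {"embeddings": {str(v): emb_a[v] for v in asked_of_a}})
    got_a = {int(k): e for k, e in recv_msg(sock_a)["embeddings"].items()}
    got_b = {int(k): e for k, e in recv_msg(sock_b)["embeddings"].items()}
    return got_a, got_b


a, b = socket.socketpair()  # stand-in for the worker-to-worker channel
emb_a = {0: [0.1, 0.2], 1: [0.3, 0.4]}
emb_b = {2: [0.5, 0.6], 3: [0.7, 0.8]}
got_a, got_b = exchange_round(a, emb_a, [3], b, emb_b, [0, 1])
a.close()
b.close()
```

Both ends run in one process here, and the messages are small enough to fit in the socket buffers. On real devices each worker would run its own half of the exchange and would need to handle timeouts while the peer is out of range.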

Qualitative Analysis.

We train RetexoSage on the Cora dataset end-to-end. The manual effort involved in this training is minimal. We manually pair the workers over Bluetooth once before starting the training. This step is necessary, but it is done only once with every neighboring worker. It also dictates which workers are trusted, since only paired workers can send and receive data. We manually start the training for each layer in our case study. In practice, however, the server can start the training without any user intervention. For instance, Google runs federated learning over mobile devices when they are plugged in for charging at night (Bonawitz et al., 2019). For message passing, we manually run a script to exchange the required embeddings when the RPis are in proximity. This step can also be automated by making the devices run such a script once they detect that their neighbors are in proximity. Therefore, a user does not have to oversee the training procedure after the initial pairing of the devices required for training. The server can decide how much time to allocate for message passing rounds and start a new training round after that.

B.3. Limitations & Future Work

Trustworthy Distributed Training.

Similar to many prior works and existing real-world deployments (Bonawitz et al., 2019; He et al., 2021; Wang et al., 2022a), we do not design training protocols for Retexo with formal privacy guarantees such as differential privacy (Dwork et al., 2014). Therefore, training Retexo as-is in our setup may leak information about the training data via intermediate gradients submitted to the server (Zhu et al., 2019a; Geiping et al., 2020). However, preventing information leakage via gradients is an active area of research (see the survey by Yin et al., 2021b). Specifically, existing solutions such as secure aggregation, which prevent information leakage from individual gradients, can be readily integrated with Retexo (Bonawitz et al., 2017). This, however, does not prevent information leakage from aggregated gradients, since the server can still see the aggregate (Yin et al., 2021a). Training RetexoGNNs with differential privacy guarantees can further prevent leakage from aggregated gradients. However, generic differential privacy techniques are notorious for reducing the utility of the models (Tramer and Boneh, 2021; Li et al., 2022; Wu et al., [n. d.]; Kolluri et al., 2022). Designing differential privacy techniques, for both features and edges, that are tailored to node classification in the fully-distributed setup is an interesting direction for future work.

Finally, training Retexo may be susceptible to the robustness issues that are common in distributed setups. Examples include distribution shifts in the features and labels (Reisizadeh et al., 2020; Zhang et al., 2021a; Zhu et al., 2021), adversarial perturbations to the graph structure (Zügner et al., 2020; Bojchevski and Günnemann, 2019a), and model and data poisoning attacks (Bhagoji et al., 2019; Fang et al., 2020; Tolpegin et al., 2020). Integrating existing robust federated learning and graph neural network training techniques (Geisler et al., 2021; Bojchevski and Günnemann, 2019b; Jin et al., 2021), or designing new ones, is an interesting direction for future work.

Feasibility of Training RetexoGNNs in the Fully-distributed Setup.

We do not demonstrate training of RetexoGNNs when only one graph node is available per worker, since that would require a large number of individual worker devices. However, we train RetexoGNNs end-to-end on two devices and claim that training RetexoGNNs in the fully-distributed setup should be easier than on two devices. Resource-wise our argument is sound, since the per-worker resources required to train with one node per worker are much smaller than those required when the graph data is split between just two workers. A worker only has to train a layer based on a single data point, which requires little RAM. It also only has to transmit a single embedding over the worker-to-worker communication channels, which can be quick ($0.6$ seconds in our case study) even over a low-bandwidth Bluetooth channel (see Section 4.7).

Bibliography

Beutel et al. (2020) Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Javier Fernandez-Marques, Yan Gao, Lorenzo Sani, Hei Li Kwing, Titouan Parcollet, Pedro PB de Gusmão, and Nicholas D Lane. 2020. Flower: A Friendly Federated Learning Research Framework. arXiv preprint arXiv:2007.14390 (2020).
Bhagoji et al. (2019) Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. 2019. Analyzing federated learning through an adversarial lens. In ICML.
Bojchevski and Günnemann (2019a) Aleksandar Bojchevski and Stephan Günnemann. 2019a. Adversarial attacks on node embeddings via graph poisoning. In ICML.
Bojchevski and Günnemann (2019b) Aleksandar Bojchevski and Stephan Günnemann. 2019b. Certifiable robustness to graph perturbations. NeurIPS (2019).
Bonawitz et al. (2019) Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, et al. 2019. Towards federated learning at scale: System design. MLSys (2019).
Bonawitz et al. (2017) Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. 2017. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security.
Bottou et al. (2018) Léon Bottou, Frank E Curtis, and Jorge Nocedal. 2018. Optimization methods for large-scale machine learning. SIAM Review (2018).