Hulk: Graph Neural Networks for Optimizing Regionally Distributed   Computing Systems

Zhengqing Yuan; Huiwen Xue; Chao Zhang; Yongming Liu

arXiv:2302.13741·cs.DC·April 14, 2023

Hulk: Graph Neural Networks for Optimizing Regionally Distributed Computing Systems

Zhengqing Yuan, Huiwen Xue, Chao Zhang, Yongming Liu

PDF

Open Access

TL;DR

Hulk employs a graph neural network to optimize distributed training of large deep learning models, significantly reducing communication overhead and improving training efficiency across geographically distributed systems.

Contribution

The paper introduces Hulk, a novel graph neural network-based approach for optimizing distributed deep learning training across regions, addressing communication bottlenecks.

Findings

01

Over 20% improvement in training time efficiency

02

Effective optimization of data communication across regions

03

Enhanced deployment strategies for distributed models

Abstract

Large deep learning models have shown great potential for delivering exceptional results in various applications. However, the training process can be incredibly challenging due to the models' vast parameter sizes, often consisting of hundreds of billions of parameters. Common distributed training methods, such as data parallelism, tensor parallelism, and pipeline parallelism, demand significant data communication throughout the process, leading to prolonged wait times for some machines in physically distant distributed systems. To address this issue, we propose a novel solution called Hulk, which utilizes a modified graph neural network to optimize distributed computing systems. Hulk not only optimizes data communication efficiency between different countries or even different regions within the same city, but also provides optimal distributed deployment of models in parallel. For…

Tables2

Table 1. Table 1: We measured the time it takes for our machines in three different regions to send and receive 10 words, using eight servers, and calculated the average.

Regions	Communication time to send 64 bytes (ms)
	California	Tokyo	Berlin	London	New Delhi	Paris	Rome	Brasilia
Beijing, China	89.1	74.3	250.5	229.8	341.9	-	296.0	341.8
Nanjing, China	97.9	173.8	213.7	176.7	236.3	265.1	741.3	351.3
California, USA	1	118.8	144.8	132.3	197.0	133.9	158.6	158.6

Table 2. Table 2: Model Node Allocation

Model	Nodes
OPT (175B)	0, 1, 2, 3, 4, 20, 21, 22, 23, 24, 27, 28, 29, 30, 31
T5	5, 6, 7, 8, 9, 10, 11, 12, 13, 14
GPT-2	15, 16, 17, 18, 19, 25, 26, 32, 33, 34
BERT-large	35, 36, 37, 38

Equations14

L (θ) = - t = 1 \sum T lo g P (y_{t} ∣ y_{< t}, x; θ)

L (θ) = - t = 1 \sum T lo g P (y_{t} ∣ y_{< t}, x; θ)

Δ W_{i, j} = η k = 1 \sum K (\nabla_{W_{i, j}} L (f^{i, j} (x_{k}^{i, j}), y_{k}^{i, j}) + l = j + 1 \sum M \nabla_{W_{i, l}} L (f^{i, l} (x_{k}^{i, l}), y_{k}^{i, l}))

Δ W_{i, j} = η k = 1 \sum K (\nabla_{W_{i, j}} L (f^{i, j} (x_{k}^{i, j}), y_{k}^{i, j}) + l = j + 1 \sum M \nabla_{W_{i, l}} L (f^{i, l} (x_{k}^{i, l}), y_{k}^{i, l}))

v^{(l + 1)} = σ u \in N (v) \sum \frac{1}{c _{u, v}} W^{(l)} u^{(l)}

v^{(l + 1)} = σ u \in N (v) \sum \frac{1}{c _{u, v}} W^{(l)} u^{(l)}

v^{(0)} = x_{v}

v^{(0)} = x_{v}

e_{v u} = g (e_{v u}, u, v, Θ_{e})

e_{v u} = g (e_{v u}, u, v, Θ_{e})

v^{(l + 1)} = σ u \in N (v) \sum f (v^{(l)}, u^{(l)}, e_{v u})

v^{(l + 1)} = σ u \in N (v) \sum f (v^{(l)}, u^{(l)}, e_{v u})

L = - i = 1 \sum ∣ Y ∣ Y_{i} lo g \hat{Y}_{i}

L = - i = 1 \sum ∣ Y ∣ Y_{i} lo g \hat{Y}_{i}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Graph Neural Networks · Stochastic Gradient Optimization Techniques · IoT and Edge/Fog Computing

MethodsGraph Neural Network

Full text

\tocauthor

Zhengqing Yuan, Huiwen Xue, Chao Zhang, Yongming Liu, and Zhuanzhe Zhao

11institutetext: School of Artificial Intelligence, Anhui Polytechnic University, Wuhu 241009, China,

22institutetext: School of Optoelectronic Science and Engineering, Soochow University,

Suzhou 215031, China

33institutetext: 33email: [email protected],44institutetext: 44email: [email protected]

Hulk: Graph Neural Networks for Optimizing Regionally Distributed Computing Systems

Zhengqing Yuan 1133

Huiwen Xue 22

Chao Zhang 11

Yongming Liu Corresponding author1144

Abstract

Large deep learning models have shown great potential for delivering exceptional results in various applications. However, the training process can be incredibly challenging due to the models’ vast parameter sizes, often consisting of hundreds of billions of parameters. Common distributed training methods, such as data parallelism, tensor parallelism, and pipeline parallelism, demand significant data communication throughout the process, leading to prolonged wait times for some machines in physically distant distributed systems. To address this issue, we propose a novel solution called Hulk, which utilizes a modified graph neural network to optimize distributed computing systems. Hulk not only optimizes data communication efficiency between different countries or even different regions within the same city, but also provides optimal distributed deployment of models in parallel. For example, it can place certain layers on a machine in a specific region or pass specific parameters of a model to a machine in a particular location. By using Hulk in experiments, we were able to improve the time efficiency of training large deep learning models on distributed systems by more than 20%. Our open source collection of unlabeled data:https://github.com/DLYuanGod/Hulk.

keywords:

optimize communication efficiency, distributed training, parallel deployment, time efficiency

1 Introduction

In recent years, there has been a trend of scaling up deep learning models, resulting in a more robust performance in specific domains. For instance, in the field of natural language processing, large-scale text data has been used to train deep learning models such as GPT-3 (175B) [2], T5 (11B) [19], and Megatron-LM (8.3B) [22], which have demonstrated impressive performance. However, training these models can be quite challenging. To solve the challenges posed by large-scale deep learning models, optimization of distributed computing is crucial.

Model parallelism(MP) is a technique used to solve the problem of a model being too large to fit into the memory of a single GPU or TPU by distributing the model across multiple GPUs or TPUs. However, this approach may introduce communication challenges between GPUs or TPUs during training. On the other hand, data parallelism(DP) can improve time utilization by addressing the batch size issue during training, but it cannot resolve the problem of a model being too large for a single GPU or TPU’s memory capacity.

While DP and model MP have been effective in mitigating communication volume issues in recent years, such as large minibatch SGD [9], Megatron-LM [22], Gpipe [12], and Pathway [1] the challenge of scheduling distributed training across machines in different regions remains unsolved. If a model like GPT-3 with hundreds of billions of parameters exceeds the memory capacity of GPUs in the current region during training, it becomes necessary to schedule machines from other regions to complete the training. This will pose several challenges:

•

Communication latency can be very high when training is distributed across machines in different regions.

•

How can tasks be effectively allocated to different machines, such as assigning specific machines to maintain certain layers of the model’s parameters (e.g., Machine 0 is responsible for Layer X) or designating machines to process specific data (e.g., Machine 2 handles Data Set Y)?

•

How can we address the issue of disaster recovery in training, such as handling scenarios where a machine fails during the process?

•

If you need to train not only a single task but also multiple tasks simultaneously, such as training both a GPT-3 and a GPT-2 model, how can you provide for these tasks?

To elaborate on the first point, we collected all communication logs between the three machines and the eight servers over a three-month period. Our statistics reveal the communication time for every 64 bytes, as presented in Table 1. As observed in the table, the communication latency between certain nodes is high or even unfeasible. Here, the problem of communication time is difficult to solve in a distributed system without optimization.

1.1 Contributions

Graph data structures have been widely adopted since their introduction, as they can effectively represent interconnected structures such as social networks and knowledge graphs. Considering the tremendous success of graph neural networks [7, 14, 26] in recent years, we aim to leverage this powerful capability in real-world industrial systems. With the powerful representational capability of graphs, it becomes easier to model the relevant optimization problems described in our paper. Our design choices were influenced by the types of workloads observed in actual systems. Hulk has the following features:

Efficient Inter-node Communication

Our system minimizes the impact of communication latency between machines, ensuring that each machine is assigned the appropriate task.

Global Optimality

Our model is built upon graph convolutional neural networks (GCNs) [14, 25] to extract features from the entire graph, enabling the selection of a globally optimal solution.

Disaster Recovery

Since GCNs are utilized to assign tasks to different machines in the system, it becomes evident which tasks each machine is responsible for. Furthermore, in the event of a machine failure, the system can quickly recover the entire computation.

Scalability

If a particular machine or machines are no longer needed, you can simply remove the corresponding edge information from the graph structure.

The novelty of the proposed system lies in the utilization of graph neural networks for optimizing machine learning systems. By relying on the neural network’s output values and some algorithms, the scheduling problem of the entire system can be efficiently solved.

1.2 Engineering Challenges

Although graph neural networks are capable of addressing tasks such as node classification [14, 23, 24], link prediction [29, 15, 21], and graph classification [14, 28], there is currently no suitable task that can be directly applied to our system. How to construct a suitable loss function is a crucial problem that cannot be overlooked. Regarding the representation of optimization features, such as computation time and communication time, in the graph data structure, there are also challenges that need to be addressed.

2 Background

This section provides a brief introduction to machine learning systems and graph neural networks.

2.1 Machine Learning Systems

This subsection provides a brief overview of the evolution of machine learning systems.

2.1.1 Data Parallelism

DP [5] is a commonly used technique in distributed training for deep neural networks, where the data is split into multiple copies and distributed to different machines for computation. Each machine calculates the loss and gradient of its assigned data and aggregates these gradients into a parameter server, which updates the model parameters. This method enables multiple machines to process large data sets in parallel, resulting in faster training speeds.

2.1.2 Parameter Server

The parameter server is a distributed deep learning training method proposed by Mu Li et al. [16] that addresses the communication bottleneck problem in training large-scale deep learning models. It achieves this by placing the gradient aggregation and parameter updating process on the server side, and the computational nodes only need to send the locally computed gradient information to the server. This approach reduces communication overhead and improves training efficiency.

2.1.3 Megatron-LM

Megatron-LM [22] combines model parallelism and data parallelism by dividing the model parameters into multiple parts, each trained on a different GPU. This allows for larger models to be used as each GPU only needs to focus on computing a part of the model using model parallelism. Data parallelism is used to assign different batches to different GPUs for processing, which improves training efficiency.

The training objective of Megatron-LM is to minimize the negative log-likelihood of the target sequence given the input sequence, which is expressed as:

[TABLE]

where $T$ is the length of the sequence, $y_{t}$ is the target token at time step $t$ , $y_{<t}$ are the tokens before time step $t$ , $x$ is the input sequence, and $\theta$ represents the model parameters.

2.1.4 Gpipe

In Gpipe [12], the model is split into sub-models, each assigned to a different GPU. DP concatenates Micro-batches along the pipeline to pass data and gradients between GPUs, enabling pipeline parallelism [4]. The training process in Gpipe can be expressed as the following equation:

[TABLE]

where $W_{i,j}$ denotes the weight parameter of the $j$ th layer of the $i$ th submodel, $\Delta W_{i,j}$ denotes the corresponding parameter update, $\eta$ denotes the learning rate, $K$ denotes the number of Micro-batches, $f^{i,j}$ denotes the forward propagation function of the $j$ th layer of the $i$ th submodel, $x_{k}^{i,j}$ denotes the $k$ th Micro-batch of the $j$ th layer in the $i$ th sub-model, $y_{k}^{i,j}$ denotes the label of the $k$ th Micro-batch.

2.2 Graph Neural Networks

Graph Neural Networks (GNNs) [20, 31, 30, 3, 11] are a type of neural network designed to work on graph-structured data, where nodes represent entities and edges represent relationships between them. They have become popular in recent years due to their ability to capture complex relationships and patterns in data, making them useful for tasks such as node classification, link prediction, and graph classification.

2.3 Graph Convolutional Networks

Graph Convolutional Networks (GCNs) [14] are a type of deep learning model designed to work on graph-structured data. They use convolutional operations to aggregate information from neighboring nodes and update node representations. The key formulas for GCNs include the graph convolution operation, which calculates the node representation updates, and the graph pooling operation, which aggregates information across multiple nodes.

[TABLE]

where $\mathbf{v}^{(l)}$ represents the feature representation of node $v$ at layer $l$ , $\mathcal{N}(v)$ denotes the set of neighbors of node $v$ , $W^{(l)}$ is the weight matrix at layer $l$ , $\sigma$ is the activation function, and $c_{u,v}$ is a normalization factor that depends on the number of neighbors of node $u$ and $v$ . This formula is used to iteratively compute the feature representations of nodes in a graph using neighborhood information.

3 Data Representation

To better address the issues raised in Section 1, it is important to select an appropriate data structure to represent the system parameters.We adopt a graph-based data structure to represent our system parameters, with each node (denoted as $v$ ) representing a machine in a different region. Each node has unique features that include its geographic location, computational capacity, and GPU memory. The edges (denoted as $e$ ) between nodes denote the possibility of communication between the two connected machines, with the weight of each edge representing the time in milliseconds required to transmit each 64-byte message.

As depicted in Figure 2, we randomly selected eight machines to construct a graph, where the edge weight represents the communication time, and the node features are embedded in the corresponding vector space.

For example, node 0 can be represented as $v_{0}=\left\{{}^{\prime}Beijing^{\prime},8.6,152\right\}$ . Then we embed the node information using the following formula:

[TABLE]

where $\mathbf{v}^{(0)}$ denotes the initial feature vector of node $v$ and $\mathbf{x}_{v}$ denotes the input feature vector of node $v$ .

The node-to-node edges we represent by the adjacency matrix. The weight of an edge in the adjacency matrix is equal to the communication time between two corresponding nodes. The values for unconnected edges are set to 0, and the diagonal values in this matrix are all 0. Similarly, we then perform the edge information embedding with the following equation:

[TABLE]

where $e_{vu}$ denotes the edge feature between node $v$ and node $u$ , $\mathbf{e}_{vu}$ is the feature vector of edge $vu$ , $\mathbf{u}$ and $\mathbf{v}$ are the feature vectors of node $u$ and node $v$ , respectively, $g$ is a learnable function and $\mathbf{\Theta}_{e}$ is its argument. We then sparsely label this subgraph to enable the neural network to learn the contents of the graph in a supervised manner.

4 Methods

The typical tasks of graph neural networks, such as node classification, do not utilize edge information and only leverage the graph topology. In real-world cases, the information carried by edges is often crucial, such as edge weights and directed edges. To incorporate edge information into nodes, we aim to perform edge pooling, which involves aggregating or pooling edges of neighboring nodes at each node to create a unified node representation that contains edge information. This is expressed in the following equation:

[TABLE]

Where $\mathbf{v}^{(l+1)}$ represents the feature vector of node $v$ in layer $l+1$ , $\sigma$ is the activation function, $\mathcal{N}(v)$ denotes the set of neighboring nodes of node $v$ , $\mathbf{u}^{(l)}$ represents the feature vector of node $u$ in layer $l$ , and $f$ is a learnable function used to merge features of nodes and edges into new features of node $v$ .

As depicted in Figure 2, this is the first layer of the constructed network structure( $l=0$ ) that enables nodes to encode edge information.

After the edge features are embedded into node features, we can use the resulting transformed graph as input for a standard node classification task and train it using a graph convolutional neural network or graph attention network. As shown in Equation 1. If we want to build N-layer GCNs with our $l=2,3,4\cdots N+1$ .

As shown in Figure 3, Y represents the category of the classification, i.e., what tasks are appropriate.

Then we calculate its loss using the cross-entropy loss function [8]:

[TABLE]

Here, $\mathcal{Y}$ denotes the set of all labels, $Y_{i}$ denotes the true label of node $i$ , and $\hat{Y}_{i}$ denotes the predicted label of node $i$ . Then back propagation is performed to update the network parameters.

As depicted in Figure 4, we observed that the accuracy peaked at 99% during the sixth training step.

5 Structure

In this section, we build our system based on the GCNs trained in the previous section 4 and solve the problem presented in section 1.

5.1 Efficiency

We now have two tasks to perform. The first involves training the BERT-large model [6], while the second involves training the GPT-2 model [18]. As the largest GPT-2 model (1.5B parameters) is significantly larger than BERT-large (340M parameters), it is important to carefully allocate tasks to each machine in a sensible manner. The ratio of the number of parameters in GPT-2’s largest model (1.5B) to BERT-large (340M) is approximately 4.4:1. Based on this information, we instruct the graph neural network to classify the classes according to this scale and optimize the communication time within each class. Also, we need to consider the memory and computing power characteristics of each machine.

We use Algorithm 1 to schedule multiple tasks, but it can also be used to determine superiority if there is only one task. Based on the computational power, memory and communication efficiency features, as well as the integration into node information, we only need to determine whether it is appropriate.

Figure 5 demonstrates that the basic graph neural network is capable of carrying out classification tasks effectively and emulating human thought processes.

5.2 Scalability

If we need to add one or more machines to this system, we can simply define their $\left\{City,ComputeCapability,Memory\right\}$ and connect them to the existing nodes that can communicate with them using weights.

As shown in Figure 6, the machine with id 45 $\left\{Rome,7,384\right\}$ in the dataset was added to the Hulk system and still works fine.

6 Experimentation and Evaluation

In this section, we test the Hulk system using multiple deep learning tasks in real industries with 46 high-performance GPU servers.

6.1 Experimental Setting

We have a total of 46 servers distributed across different countries and regions, with a combined total of 368 GPUs of various models such as NVIDIA A100, NVIDIA A40, NVIDIA V100, RTX A5000, GeForce GTX 1080Ti, GeForce RTX 3090, and NVIDIA TITAN Xp. And, we calculated the average of 10 communications between these machines over a 3-month period. Due to network policy restrictions in different countries, there are certain machines that are unable to communicate with each other. We adopt the parameter settings provided in the original paper for the training process.

6.2 Data Building

We use networkx [10] library to build our graph structure data and visualize it as shown in Figure 7. Additionally, we need to read the adjacency matrix of this data and consider the corresponding feature embedding representation.

6.3 Task Assignment

The four tasks we aim to train in this system are OPT (175B) [13], T5 (11B), GPT-2 (1.5B), and BERT-large (350M).

We need to classify all nodes into four distinct classes based on their characteristics and then deploy distributed algorithms tailored to each class.

As presented in Table 2, we feed the graph data into the graph neural network, which was trained in Section 4 and employs Algorithm 1, to derive node classification information. To handle the nodes in each class with different computational performance and memory, we utilize Gpipe to train the model in parallel. Depending on the computational power and memory of each node, we determine which part of the model it will handle.

6.4 Evaluation

To validate the performance of the Hulk system, we have chosen three commonly used distributed computing algorithms for evaluation.

System A

It utilizes all available machines for training while discarding any machine that does not have sufficient memory to accommodate the entire model. It utilizes data parallelism to distribute the batch size across multiple machines, thereby enabling simultaneous training of the model on each machine.

System B

It utilizes Gpipe for parallelism, assigning a certain layer of the model to a particular machine until the entire model is distributed across all machines.

System C

It employs tensor parallelism with Megatron-LM across the entire system, requiring all machines to be utilized for model training.

Result

As shown in Figure 8, the Hulk system can greatly reduce communication time and thus the overall training time. This illustrates that Hulk is effective in dividing the nodes into a specific model for training.

If we need to train 6 models, the parameters of each model are shown in Figure 9. Among them, the parameters of RoBERTa [17] are 355M and the parameters of XLNet [27] are 340M.

Result

As illustrated in Figure 10, when the system needs to handle multiple tasks, the gap in communication time becomes more apparent. Our Hulk system is able to effectively reduce communication time (Because the GPT-3 (175B) model is not open source, we use the OPT (175B) with equivalent parameters instead).

7 Conclusion

In this article, we introduce our novel solution, Hulk, which optimizes regionally distributed computer systems by tackling the challenges of scheduling distributed training across machines in different regions. Our real-world industrial solution, Hulk, utilizes graph neural networks with powerful representation capabilities to enhance communication efficiency between GPUs or TPUs across different countries or regions during training. With its efficient communication, global availability, fast recovery, and excellent scalability, Hulk stands out as a powerful tool for optimizing regionally distributed computer systems. The results demonstrate a significant increase in the efficiency of distributed training, crucial for the success of large-scale deep learning models. Overall, the use of Hulk can streamline the model deployment process and benefit researchers and practitioners seeking to optimize communication efficiency.

Acknowledgement

The authors gratefully acknowledge the support of the AIMTEEL 202201 Open Fund for Intelligent Mining Technology and Equipment Engineering Laboratory in Anhui Province and the Anhui Provincial Department of Education Scientific Research Key Project (Grant No. 2022AH050995). The financial assistance provided by these projects was instrumental in carrying out the research presented in this paper. We would like to thank all the members of the laboratory for their valuable support and assistance. Without their help, this research would not have been possible. Finally, we would like to express our gratitude to the Anhui Polytechnic University for providing the necessary facilities and resources for this study.

Bibliography31

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Barham, P., Chowdhery, A., Dean, J., Ghemawat, S., Hand, S., Hurt, D., Isard, M., Lim, H., Pang, R., Roy, S., Saeta, B., Schuh, P., Sepassi, R., Shafey, L., Thekkath, C., Wu, Y.: Pathways: Asynchronous distributed dataflow for ml. In: D. Marculescu, Y. Chi, C. Wu (eds.) Proceedings of Machine Learning and Systems, vol. 4, pp. 430–449 (2022). URL https://proceedings.mlsys.org/paper/2022/file/98dce 83da 57b 0395 e 163467 c 9dae 521b-Paper.pdf
2[2] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., Mc Candlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020).
3[3] Bui, H.H., Luu, K., Nguyen, Q.H.: Structural analysis and role identification for financial networks using graph embeddings. In: 2021 IEEE 7th International Conference on Computational Science and Computational Intelligence (CSCI), pp. 207–214. IEEE (2021)
4[4] Dally, W.J.: Pipeline parallelism revisited. Communications of the ACM 39 (11), 102–108 (1996)
5[5] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al.: Large scale distributed deep networks. Advances in neural information processing systems 25 , 1232–1240 (2012)
6[6] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
7[7] Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: International Conference on Machine Learning (ICML), pp. 1263–1272 (2017)
8[8] Goodfellow, I., Bengio, Y., Courville, A.: Deep learning (2016). 10.7551/mitpress/9446.001.0001 . URL https://www.deeplearningbook.org/ · doi ↗