Densifying Assumed-sparse Tensors: Improving Memory Efficiency and MPI   Collective Performance during Tensor Accumulation for Parallelized Training   of Neural Machine Translation Models

Derya Cavdar; Valeriu Codreanu; Can Karakus; John A. Lockman III,; Damian Podareanu; Vikram Saletore; Alexander Sergeev; Don D. Smith II; Victor; Suthichai; Quy Ta; Srinivas Varadharajan; Lucas A. Wilson; Rengan Xu; Pei; Yang

arXiv:1905.04035·cs.LG·May 13, 2019

Densifying Assumed-sparse Tensors: Improving Memory Efficiency and MPI Collective Performance during Tensor Accumulation for Parallelized Training of Neural Machine Translation Models

Derya Cavdar, Valeriu Codreanu, Can Karakus, John A. Lockman III,, Damian Podareanu, Vikram Saletore, Alexander Sergeev, Don D. Smith II, Victor, Suthichai, Quy Ta, Srinivas Varadharajan, Lucas A. Wilson, Rengan Xu, Pei, Yang

PDF

Open Access

TL;DR

This paper introduces a method to convert assumed-sparse tensors to dense tensors in MPI-based training, significantly improving memory efficiency and enabling larger-scale parallel training of neural machine translation models.

Contribution

It presents a novel approach to reduce memory usage in distributed transformer training by densifying tensors, enhancing scalability and performance.

Findings

01

Achieved 91% weak scaling efficiency up to 1200 MPI processes.

02

Enabled training of larger models with reduced memory footprint.

03

Demonstrated improved scalability on supercomputing resources.

Abstract

Neural machine translation - using neural networks to translate human language - is an area of active research exploring new neuron types and network topologies with the goal of dramatically improving machine translation performance. Current state-of-the-art approaches, such as the multi-head attention-based transformer, require very large translation corpuses and many epochs to produce models of reasonable quality. Recent attempts to parallelize the official TensorFlow "Transformer" model across multiple nodes have hit roadblocks due to excessive memory use and resulting out of memory errors when performing MPI collectives. This paper describes modifications made to the Horovod MPI-based distributed training framework to reduce memory usage for transformer models by converting assumed-sparse tensors to dense tensors, and subsequently replacing sparse gradient gather with dense gradient…

Tables2

Table 1. Table 1: Software Environment for Zenith Experiments

Package	Version
Python	2.7.13
TensorFlow	Anaconda TensorFlow 1.12.0 with Intel^®MKL
Horovod	0.15.2
MPI	MVAPICH2 2.1

Table 2. Table 2: Software Environment for Stampede2 Experiments

Package	Version
Python	2.7.13
TensorFlow	Anaconda TensorFlow 1.12.0 with Intel^®MKL
Horovod	0.15.2
MPI	MVAPICH2 2.3

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications

MethodsLinear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · *Communicated@Fast*How Do I Communicate to Expedia? · Adam · Softmax

Full text

11institutetext: Dell EMC, Austin TX, USA 22institutetext: Uber, Seattle WA, USA 33institutetext: Amazon, Seattle WA, USA 44institutetext: SURFSara, Utrecht, NL 55institutetext: Intel, Portland OR, USA

Densifying Assumed-sparse Tensors††thanks: Accepted to the 2019 International Supercomputing Conference (ISC)

Improving Memory Efficiency and MPI Collective Performance during Tensor Accumulation for Parallelized Training of Neural Machine Translation Models

Derya Cavdar 33

Valeriu Codreanu 44

Can Karakus 33

John A. Lockman III 11

Damian Podareanu 44

Vikram Saletore 55

Alexander Sergeev 22

Don D. Smith II 11

Victor Suthichai 33

Quy Ta 11

Srinivas Varadharajan 11

Lucas A. Wilson 11

Rengan Xu 11

Pei Yang 11

Abstract

Neural machine translation - using neural networks to translate human language - is an area of active research exploring new neuron types and network topologies with the goal of dramatically improving machine translation performance. Current state-of-the-art approaches, such as the multi-head attention-based transformer, require very large translation corpuses and many epochs to produce models of reasonable quality. Recent attempts to parallelize the official TensorFlow “Transformer” model across multiple nodes have hit roadblocks due to excessive memory use and resulting out of memory errors when performing MPI collectives.

This paper describes modifications made to the Horovod MPI-based distributed training framework to reduce memory usage for transformer models by converting assumed-sparse tensors to dense tensors, and subsequently replacing sparse gradient gather with dense gradient reduction. The result is a dramatic increase in scale-out capability, with CPU-only scaling tests achieving 91% weak scaling efficiency up to 1200 MPI processes (300 nodes), and up to 65% strong scaling efficiency up to 400 MPI processes (200 nodes) using the Stampede2 supercomputer.

1 Introduction

Neural Machine Translation (NMT) [1, 2, 19] offers numerous improvements and advantages in translation quality compared to traditional machine translation systems, such as statistical phrase-based systems [10]. NMT also paved the way to translate multiple languages using a single model [9]. Continued active research interest in the field of NMT has created many interesting architectures which produce models of high translation quality [22]. Recent research also shows how reduced precision and large batch training could speed-up the training while maintaining translation quality[12].

There are several challenges when scaling out Deep Neural Network (DNN)-based models, such as efficiently exchanging gradients across multiple nodes, scaling up the batch size while maintaining generalized performance, and selecting appropriate hyper-parameters which efficiently train the model while preventing divergence and over-fitting. NMT approaches such as the transformer model [22], which shares the weight matrix between the embedding layer and linear transformation before the softmax layer, must ensure that the gradients from these two layers are updated appropriately without causing performance degradation or out-of-memory (OOM) errors.

In this paper, we begin by understanding the basics of a NMT model, and try to explore the reasons that restrict it’s scalability. We then show how our current solution of forcibly densifying assumed-sparse tensors achieves high scaling efficiency – both weak and strong – when trained with up to 300 nodes on both the Zenith supercomputer at Dell EMC and the Stampede2 supercomputer at TACC. We also illustrate that even when trained with very large batch sizes (402k, 630k and 1 Million tokens), we are still able to achieve comparable or slightly better translation quality when compared to the official TensorFlow benchmark results.

The software changes which we discuss in this paper have been incorporated into Horovod 0.15.2 and later, providing other researchers the opportunity to apply this approach on any models that may benefit.

2 Background

NMT models work much like source-to-source compilers, taking input from a source language (e.g., Fortran) and converting it to a target language (e.g., binary machine code). An NMT model first reads a sentence in a source language and passes it to an encoder, which builds an intermediate representation. This intermediate representation is then passed to the decoder, which processes the intermediate representation to produce the translated sentence in the target language.

Fig. 1 shows an encoder-decoder architecture. The English source sentence, “Hello! How are you?” is read and processed by the architecture to produce a translated German sentence “Hallo! Wie sind Sie?”. Traditionally, Recurrent Neural Networks (RNN) were used in encoders and decoders [2], but other neural network architectures such as Convolutional Neural Networks (CNN) [4] and attention mechanism-based models [16] are also used.

The transformer model [22] is one of the interesting architectures in the field of NMT, which is built with variants of attention mechanism in the encoder-decoder part, eliminating the need for traditional RNNs in the architecture [3]. This model was able to achieve state of the art results in English-German and English-French translation tasks.

Fig. 2 illustrates the multi-head attention block used in the transformer model. At a high-level, the scaled dot-product attention can be imagined as finding the relevant information, values (V) based on Query (Q) and Keys (K) and multi-head attention could be thought as several attention layers in parallel to get distinct aspects of the input.

3 Issues with Scaling the Transformer Model

Encoder-decoder models for NMT make use of an attention mechanism to help the decoders obtain the vital information from the source tokens while discarding irrelevant information. The main structure of the transformer model is the multi-head attention, which provides a way to get different linear transformations of all the inputs. These components allow an NMT model to learn more robustly. But a particular design consideration that needs to be looked at for improving the scaling capabilities is the weight matrix that is shared between the embedding layer and the projection matrix. This type of similar design is also seen in other NMT models such as [4]. Hence, understanding the cause and effect of these specific design considerations is vital for the NMT research community.

This particular design would cause performance degradation or OOM errors if the gradients from these layers are not accumulated correctly. Specifically, gradients from the embedding layer are sparse whereas the gradients from the projection matrix are dense. In TensorFlow both gradients are updated together as a sparse IndexedSlices objects. This has a dramatic effect on TensorFlow’s determination of a gradient accumulation strategy, and subsequently on the total size of the accumulated gradient tensor.

Algorithm 1 describes the algorithm used in TensorFlow to accumulate gradients, based on the assumed type and shape of the gradients being accumulated (see [20]). At present, TensorFlow will either: (1) do nothing if there are less than 2 output gradients, (2) accumulate gradients by reduction if all gradients are expressed as dense tensors with defined shapes, or (3) convert everything to indexed slices and accumulate by concatenation (performing a gather operation).

In this particular use case, the embedding lookup is performed using

tf.gather, which returns an IndexedSlice object. This forces TensorFlow (based on the accumulation algorithm - Algorithm 1) to convert the remaining dense tensors to indexed slices, even though all the gradients being accumulated are dense.

The result of this decision to convert and assume that the gradient tensors are sparse is to accumulate by gathering, rather than reduction. This applies not only to single-node tensor accumulation, but to multi-node accumulation through Horovod due to the use of the main TensorFlow graph in determining which collective operations Horovod will perform using MPI. The result is extremely large message buffers (exceeding 11GB - see Fig. 3a), which cause segmentation faults or out-of-memory (OOM) errors.

Because of the message buffer sizes, we were unable to scale beyond 32 MPI processes, and saw quickly diminishing scaling efficiency, or fraction of ideal scaled speedup. Fig. 4 shows the scaled speedup of the training process up to the maximum achievable 32 MPI processes (8 nodes with 4 processes per node). Scaling efficiency – which is visually expressed as distance from the ideal line – declines rapidly, going from 84% with 4 nodes to 75% for 8 nodes. Eventually scaled speedup would (if the training could be parallelized further) reach an asymptotic limit where additional resources do not further accelerate the algorithm.

4 Densifying Assumed-sparse Tensors

In order to correct for the issue of assumed-sparse tensors in TensorFlow, we have implemented a forced-conversion of all gradient tensors to dense representation inside of Horovod’s DistributedOptimizer method. This will then force TensorFlow to accumulate those tensors via reduction, rather than aggregation (see Listing 1).

The result is an 82x reduction in the amount of memory required (from 11.4GB to 139MB - see Fig. 3a and Fig. 3b, respectively) when using 64 nodes (1 MPI process per node, batch size 5000 tokens). Additionally, the time needed to perform the accumulate operation drops from 4320 ms to 169 ms, which is a 25x reduction (see Fig. 5 for a comparison of accumulate size and time).

These small changes reduce the memory footprint per process to a degree that we can both scale up the batch size per MPI process and increase the number of MPI processes per run. They also reduce the tensor exchange time significantly enough to maintain near-linear scaling when running in a multi-node environment.

This algorithmic change can be made in Horovod 0.15.2 or later by setting the sparse_as_dense option when initializing DistributedOptimizer:

opt = hvd.DistributedOptimizer(opt, sparse_as_dense=True)

5 Experimental Results

The models were trained using the WMT-17 English-German parallel corpus with 4.5M sentence pairs. The newstest2014 dataset was used as unseen test data to capture the translation quality. All the pre-processing and BLEU [13] calculations were in accordance with TensorFlow’s official benchmarks in order to compare performance and translation quality. We also used hyper parameter settings based on best practices in [15, 12]. Model training experiments were run on the Zenith cluster in the Dell EMC HPC & AI Innovation Lab, as well as the Stampede2 cluster at the Texas Advanced Computing Center (TACC) in Austin, Texas.

Each Zenith node contains dual Intel®Xeon®Scalable Gold 6148/F processors, 192GB of memory, and an M.2 boot drive to house the operating system that does not provide user-accessible local storage. Nodes are interconnected by a 100Gbps Intel®Omni-path fabric, and shared storage is provided by a combination of NFS (for HOME directories) and Lustre [17] filesystems.

For our Zenith tests, we used Python 2.7, with Intel’s MKL-optimized version of TensorFlow (1.12). The version of Horovod used for these experiments was a private branch for testing purposes, but all of these optimizations have now been made a part of Horovod 0.15.2. Table 1 gives a complete breakdown of the software environment used for the Zenith experiments, while Listing 2 provides the runtime settings for the experiments.

We also ran scaling tests on the Stampede2 cluster at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin [18]. Stampede2 has two partitions, each with a different set of processors. Our tests were performed on the SKX partition, which consists of 1,736 nodes, each with dual Intel®Xeon®Scalable Platinum 8160 processors, 192GB of memory, and 200GB internal SSD drive for the operating system and local /tmp. The second KNL partition consists of 4,200 nodes, each with a single Intel®Xeon PhiTM 7250 processor with 16GB of on-package MCDRAM, 94GB of main memory, and a 200GB SSD for the operating system and local /tmp. All nodes are interconnected with 100Gbps Intel®Omni-path fabric and connected to Lustre-based shared filesystems.

For our Stampede2 tests, we used Python 2.7, with Intel’s MKL-optimized version of TensorFlow (1.12). The version of Horovod used for these experiments was a private branch for testing purposes, but all of these optimizations have now been made a part of Horovod 0.15.2. Table 2 gives a complete breakdown of the software environment used for the Zenith experiments.

5.1 Weak Scaling Performance

The difference in reducing the output gradient size can be seen when comparing the scaling efficiency – the ratio between observed scaled speedup and ideal – between the default sparse tensor accumulation strategy (gather) and the dense tensor accumulation strategy (reduce). Dense tensor accumulations show significantly better scaling efficiency out to 32 MPI processes (95%) than the default sparse tensor accumulation (75%) (see Fig. 6).

The reduced output gradient size and improved scaling efficiency mean that we can scale to larger process counts than was previously possible. Additional weak scaling experiments on Zenith using 4 processes per node (PPN) on up to 300 compute nodes (1200 MPI processes) show near-linear scaling, with efficiency dropping from 95% for 8 nodes to 91.5% for 300 (see Fig. 7 and Fig. 8). For these particular experiments on Zenith, batch size per process was held constant at 5000 tokens, or 20000 tokens per node. This means in the largest case (1200 MPI processes) we are training with a global batch size of 6M tokens.

The ability to maintain very high weak scaling efficiency above 90% suggests that continued scale-out is worthwhile. We will seek to perform additional experiments on systems larger than Zenith.

5.2 Strong Scaling

Besides good weak scaling efficiency, the reduced output gradient size also gives us the possibility to perform strong scaling experiments. For this purpose, we have selected a global batch size of 819,200 that allows us to produce a near-state-of-the-art model in terms of translation quality (as measured by BLEU score [13]), and as discussed in the following section. Obtaining good strong scaling efficiency is significantly more challenging compared to the weak scaling case, as the effective batch size per worker decreases when increasing the node count.

We have performed strong scaling experiments on both on the Zenith cluster and on the Stampede2 supercomputer from TACC. We have used up to 200 nodes on Zenith, and up to 512 nodes on Stampede2, both systems showing significant reductions in terms of time to solution.

Fig. 10 and Fig. 9 illustrate the strong scaling behavior that can be expected on the Zenith system. When going from 16 nodes up to 200 nodes, we can improve the throughput by a factor exceeding 8 (out of a maximum of around 12). In all these strong scaling cases, we only use 2 processes per node, each being scheduled to run on one socket and exploiting the NUMA affinity. This setting is more appropriate in this scenario, as the batch size that can be used per worker is double compared to the case when using 4 processes per node.

The impact of having good strong scaling efficiency is that training times can be dramatically reduced. This can be best visualized in Fig. 11, where the time to solution drops from around one month when using a single node, down to slightly over 6 hours when using 200 nodes (121 times faster), therefore significantly increasing the productivity for NMT researchers when using CPU-based HPC infrastructures. The results observed were based on the models achieving a baseline BLEU score (case-sensitive) of 27.5.

For the single node case, we have used the largest batch size that could fit in a node’s memory, 25,600 tokens per worker. For all other cases we use a global batch size of 819,200, leading to per-worker batch sizes of 25,600 in the 16-node case, down to only 2,048 in the 200-node case. The number of training iterations is similar for all experiments in the 16-200 node range, and is increased by a factor of 16 for the single-node case (to compensate for the larger batch).

On Stampede2, the behavior is similar to zenith up to 200 nodes. Since Stampede2 is a larger system, we performed larger strong scaling experiments. However, we noticed that using a 819,200 batch size would limit the scaling efficiency when using over 256 nodes. The 200 to 256 node range show improvements in time-to-solution, but when using 400 nodes we have reached the limits of strong scaling, and begin to observe performance degradation. This is due to the fact that a small (1,024) per-worker batch size is used in the 400 nodes experiment. To test that this is the case, we performed a larger experiment using a per-worker batch size of 1,536, and a total of 1,024 workers divided across 512 nodes. This leads to a global batch size of 1,572,864, and requires further attention to in order to reach the translation accuracy performance targets. However, from a throughput perspective, this run is 56% faster compared to a similar 256-node run. This shows that there will be performance improvements as we increase the per-worker batch size to a reasonably large size ( $>1536$ ).

5.3 Model Accuracy

Scaling out transformer model training using MPI and Horovod improves throughput performance, while producing models of similar translation quality (see Fig. 12). Models of comparable quality can be trained in a reduced amount of time by scaling computation over many more nodes, and with larger global batch sizes (GBZ). Our experiments on Zenith demonstrate ability to train models of comparable or higher translation quality (as measured by BLEU score [13]) than the reported best for TensorFlow’s official model [21], even when training with batches of a million or more tokens.

6 Discussion

Our experiments have demonstrated that converting assumed-sparse tensors to dense tensors improves memory utilization as well as time to accumulate, thanks to a switch from gathering to reduction (see Fig. 5). Unlike similar solutions implemented directly within optimized NMT models, such as NVIDIA’s OpenSeq2Seq package [11], our approach does not limit usability strictly to one specific package repository or model implementation. Instead, our approach provides greater generalized use and potential applicability to other models.

Applicability to other Models.

We believe the solution that is now implemented in Horovod will prove useful to most neural network model classes, including various language translation models, image segmentation models, voice/text translation models across multiple voice datasets, time-series models, etc. Future work will quantify the impact of the current solution to these use cases. We also foresee this as a potential workaround for issues in custom architectures, such as multi-branch neural networks [23, 24, 25, 7]. These architectures are typically recollecting gradient data from multiple “separated” neural network branches, which would be likely to encounter similar sparse tensor encoding issues.

Specificity to TensorFlow.

While we have identified a specific edge case within the TensorFlow code base, we do not believe that this particular edge case is common to other deep learning frameworks, such as Caffé2 [8] and PyTorch [14]. However, TensorFlow’s current and continuing popularity and the abundance of pre-built models in TensorFlow mean that any performance benefits we can communicate back to that community are important.

Incorporating Changes into TensorFlow.

Long-term, we believe that the ideal solution is to add additional logic into TensorFlow’s gradient accumulation algorithm to convert and reduce tensors when any of the tensors is dense (see Algorithm 2), rather than only when all of the tensors are dense (as is the case in Algorithm 1).

In the case of Algorithm 2, we propose the addition of an extra conditional block (lines 5–7), which would handle the case that there exists at least 1 tensor which is dense, in which case all of the tensors to be accumulated would be converted to dense and accumulated by reduction. More research has to be done in order to ensure that incorporating this conditional block into the TensorFlow accumulation strategy would not adversely effect other well-behaved tensor accumulations, and we will be testing this inclusion and proposing back to TensorFlow in the future.

7 Future Work & Conclusion

Scaling Neural Machine Translation (NMT) models to multiple nodes can be difficult due to the large corpuses needed for reasonable translation, and the all-to-all mapping nature of the intermediate representation encodings. If tensor accumulation is not performed in a memory and compute-optimized fashion, excessively large tensors can cause buffer overruns which prevent scaling beyond a few MPI processes. These models can take weeks or months to train at low node counts, making it all the more critical that they can be efficiently scaled to hundreds or thousands of MPI processes.

We have identified an edge case in TensorFlow’s tensor accumulation strategy which leads to sub-optimal memory and compute utilization, which prevents scaling of multi-head attention-based transformer models beyond a relatively small number of processes without very large memory buffers. We have proposed and implemented a fix via the Horovod MPI-based framework for distributed memory scaling of TensorFlow models by forcibly converting – through the use of an option to DistributedOptimizer – all tensors to be accumulated to dense and subsequently reducing tensors rather than aggregating them. The result is a more than 82x reduction in memory needed and 25x reduction in time to complete the accumulation step at 64 MPI processes, and the enabled ability to scale the translation model to a thousand MPI processes or more with batches of millions of word part tokens.

These modifications have been incorporated into Horovod, and are available as of version 0.15.2 [6], so that other teams can scale neural machine translation tasks or any other tasks which use similar topologies. We have proposed a potential fix within TensorFlow as a more long-term solution to this issue, and we will be pursuing this going forward once we have determined that there are no additional side-effects from the addition of the new tensor accumulation strategy.

Going forward, we intend to investigate whether other neural network architectures besides multi-head attention can benefit from being able to expressly densify sparse tensor encodings, as well as whether custom architectures could potentially benefit from this solution.

Acknowledgement

The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. http://www.tacc.utexas.edu

Bibliography25

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. Computing Research Repository (Co RR) , abs/11409.0473 v 7, September 2014.
2[2] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Computing Research Repository (Co RR) , abs/1406.1078 v 3, September 2014.
3[3] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav 2Letter: an End-to-End Conv Net-based Speech Recognition System. Computing Research Repository (Co RR) , abs/1609.03193 v 2, September 2016.
4[4] Jonas Gehring et al. Convolutional Sequence to Sequence Learning. Computing Research Repository (Co RR) , abs/1705.03122 v 3, July 2017.
5[5] Horovod. compute_gradients() in horovod/tensorflow/__init__.py. https://github.com/uber/horovod/blob/085cb 1b 5f 3b 30734 a 34d 047841 b 098c 15a 6e 1bae/horovod/tensorflow/__init__.py#L 195 .
6[6] Horovod. Release 0.15.2. https://github.com/uber/horovod/releases/tag/v 0.15.2 .
7[7] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 7132–7141, 2018.
8[8] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22Nd ACM International Conference on Multimedia , pages 675–678, November 2014.