Large-Scale Stochastic Learning using GPUs

Thomas Parnell; Celestine D\"unner; Kubilay Atasu; Manolis Sifalakis; and Haris Pozidis

arXiv:1702.07005·cs.LG·February 24, 2017

Large-Scale Stochastic Learning using GPUs

Thomas Parnell, Celestine D\"unner, Kubilay Atasu, Manolis Sifalakis, and Haris Pozidis

PDF

Open Access

TL;DR

This paper presents a GPU-accelerated stochastic learning system capable of handling large-scale datasets efficiently, achieving significant speed-ups through parallelization and distributed training techniques.

Contribution

It introduces a GPU implementation of stochastic coordinate descent and a novel method for aggregating updates in distributed settings, enabling scalable, fast training on massive datasets.

Findings

01

Up to 35x speed-up over CPU implementations.

02

Scales out across 8 nodes with minimal loss in training time.

03

Trains on 200 million examples in around 4 seconds using 4 GPUs.

Abstract

In this work we propose an accelerated stochastic learning system for very large-scale applications. Acceleration is achieved by mapping the training algorithm onto massively parallel processors: we demonstrate a parallel, asynchronous GPU implementation of the widely used stochastic coordinate descent/ascent algorithm that can provide up to 35x speed-up over a sequential CPU implementation. In order to train on very large datasets that do not fit inside the memory of a single GPU, we then consider techniques for distributed stochastic learning. We propose a novel method for optimally aggregating model updates from worker nodes when the training data is distributed either by example or by feature. Using this technique, we demonstrate that one can scale out stochastic learning across up to 8 worker nodes without any significant loss of training time. Finally, we combine GPU acceleration…

Equations38

P (β) = \frac{1}{2 N} ∣∣ A β - y ∣ ∣^{2} + \frac{λ}{2} ∣∣ β ∣ ∣^{2},

P (β) = \frac{1}{2 N} ∣∣ A β - y ∣ ∣^{2} + \frac{λ}{2} ∣∣ β ∣ ∣^{2},

β^{*} = β arg min P (β) .

β^{*} = β arg min P (β) .

\frac{\partial P ( β )}{\partial β _{m}} = \frac{1}{N} ⟨ A β - y, a_{m} ⟩ + λ β_{m},

\frac{\partial P ( β )}{\partial β _{m}} = \frac{1}{N} ⟨ A β - y, a_{m} ⟩ + λ β_{m},

β^{(t + 1)} = β^{(t)} + (\frac{⟨ y - w ^{(t)} , a _{m} ⟩ - λ N β _{m}^{(t)}}{∣∣ a _{m} ∣ ∣ ^{2} + λ N}) e_{m},

β^{(t + 1)} = β^{(t)} + (\frac{⟨ y - w ^{(t)} , a _{m} ⟩ - λ N β _{m}^{(t)}}{∣∣ a _{m} ∣ ∣ ^{2} + λ N}) e_{m},

w^{(t + 1)} = w^{(t)} + a_{m} (β_{m}^{(t + 1)} - β_{m}^{(t)}) .

w^{(t + 1)} = w^{(t)} + a_{m} (β_{m}^{(t + 1)} - β_{m}^{(t)}) .

D (α) = - \frac{N}{2} ∣∣ α ∣ ∣^{2} - \frac{1}{2 λ} ∣∣ A^{T} α ∣ ∣^{2} + α^{T} y,

D (α) = - \frac{N}{2} ∣∣ α ∣ ∣^{2} - \frac{1}{2 λ} ∣∣ A^{T} α ∣ ∣^{2} + α^{T} y,

α^{*} = α arg max D (α) .

α^{*} = α arg max D (α) .

\frac{\partial D ( α )}{\partial α _{n}} = - N α_{n} - \frac{1}{λ} ⟨ A^{T} α, \overset{a}{ˉ}_{n} ⟩ + y_{n},

\frac{\partial D ( α )}{\partial α _{n}} = - N α_{n} - \frac{1}{λ} ⟨ A^{T} α, \overset{a}{ˉ}_{n} ⟩ + y_{n},

α^{(t + 1)} = α^{(t)} + (\frac{λ y _{n} - ⟨ w ˉ ^{(t)} , a ˉ _{n} ⟩ - λ N α _{i}^{(t)}}{λ N + ∣∣ a ˉ _{n} ∣ ∣ ^{2}}) \overset{e}{ˉ}_{n},

α^{(t + 1)} = α^{(t)} + (\frac{λ y _{n} - ⟨ w ˉ ^{(t)} , a ˉ _{n} ⟩ - λ N α _{i}^{(t)}}{λ N + ∣∣ a ˉ _{n} ∣ ∣ ^{2}}) \overset{e}{ˉ}_{n},

\overset{w}{ˉ}^{(t + 1)} = \overset{w}{ˉ}^{(t)} + \overset{a}{ˉ}_{n} (α_{n}^{(t + 1)} - α_{n}^{(t)}) .

\overset{w}{ˉ}^{(t + 1)} = \overset{w}{ˉ}^{(t)} + \overset{a}{ˉ}_{n} (α_{n}^{(t + 1)} - α_{n}^{(t)}) .

β^{*}

β^{*}

α^{*}

G_{P} (β^{(t)})

G_{P} (β^{(t)})

G_{D} (α^{(t)})

β^{(t + 1)}

β^{(t + 1)}

w^{(t + 1)}

γ_{t}^{(*)} = γ arg min P (β^{(t)} + γ Δ β^{(t)}, w^{(t)} + γ Δ w^{(t)}) .

γ_{t}^{(*)} = γ arg min P (β^{(t)} + γ Δ β^{(t)}, w^{(t)} + γ Δ w^{(t)}) .

γ_{t}^{(*)} = - \frac{( ⟨ w ^{(t)} , Δ w ^{(t)} ⟩ + N λ ⟨ β ^{(t)} , Δ β ^{(t)} ⟩ )}{∣∣Δ w ^{(t)} ∣ ∣ ^{2} + N λ ∣∣Δ β ^{(t)} ∣ ∣ ^{2}} .

γ_{t}^{(*)} = - \frac{( ⟨ w ^{(t)} , Δ w ^{(t)} ⟩ + N λ ⟨ β ^{(t)} , Δ β ^{(t)} ⟩ )}{∣∣Δ w ^{(t)} ∣ ∣ ^{2} + N λ ∣∣Δ β ^{(t)} ∣ ∣ ^{2}} .

⟨ β^{(t)}, Δ β^{(t)} ⟩

⟨ β^{(t)}, Δ β^{(t)} ⟩

∣∣Δ β^{(t)} ∣ ∣^{2}

\overset{γ}{ˉ}_{t}^{(*)} = \frac{⟨ Δ α ^{(t)} , y ⟩ - N ⟨ Δ α ^{(t)} , α ^{(t)} ⟩ - \frac{1}{λ} ⟨ Δ w ˉ ^{(t)} , w ˉ ^{(t)} ⟩}{\frac{1}{λ} ∣∣Δ w ˉ ^{(t)} ∣ ∣ ^{2} + N ∣∣ α ^{(t)} ∣ ∣ ^{2}} .

\overset{γ}{ˉ}_{t}^{(*)} = \frac{⟨ Δ α ^{(t)} , y ⟩ - N ⟨ Δ α ^{(t)} , α ^{(t)} ⟩ - \frac{1}{λ} ⟨ Δ w ˉ ^{(t)} , w ˉ ^{(t)} ⟩}{\frac{1}{λ} ∣∣Δ w ˉ ^{(t)} ∣ ∣ ^{2} + N ∣∣ α ^{(t)} ∣ ∣ ^{2}} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsStochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Privacy-Preserving Technologies in Data

Full text

Large-Scale Stochastic Learning using GPUs

Thomas Parnell, Celestine Dünner, Kubilay Atasu, Manolis Sifalakis and Haris Pozidis

IBM Research - Zurich

Säumerstrasse 4, Rüschlikon, Switzerland

Email: {tpa, cdu, kat, emm, hap}@zurich.ibm.com

Abstract

In this work we propose an accelerated stochastic learning system for very large-scale applications. Acceleration is achieved by mapping the training algorithm onto massively parallel processors: we demonstrate a parallel, asynchronous GPU implementation of the widely used stochastic coordinate descent/ascent algorithm that can provide up to $35\times$ speed-up over a sequential CPU implementation. In order to train on very large datasets that do not fit inside the memory of a single GPU, we then consider techniques for distributed stochastic learning. We propose a novel method for optimally aggregating model updates from worker nodes when the training data is distributed either by example or by feature. Using this technique, we demonstrate that one can scale out stochastic learning across up to 8 worker nodes without any significant loss of training time. Finally, we combine GPU acceleration with the optimized distributed method to train on a dataset consisting of 200 million training examples and 75 million features. We show by scaling out across 4 GPUs, one can attain a high degree of training accuracy in around 4 seconds: a $20\times$ speed-up in training time compared to a multi-threaded, distributed implementation across 4 CPUs.

I Introduction

Graphics processing units (GPUs) have found a wide range of applications in machine learning and other scientific fields. Most recently, they have become widely adopted to tackle the problem of training deep neural networks [1]. Due to their massively parallel architecture, GPUs are well suited to this task, since the training procedure can be naturally expressed as a sequence of matrix operations. By making carefully constructed calls to libraries such as NVIDIA’s cuBLAS (or even cuDNN) one can typically achieve the maximum theoretical floating point performance of the processor.

While neural networks are a topic of significant interest, many other machine learning applications rely on other techniques such as fitting generalized linear models for regression or classification [2, Chapters 3 and 4]. While the training of such models can generally also be mapped to a sequence of matrix operations, this tends to apply only when using batch methods such as gradient descent. The batch gradient descent technique updates the model parameters by computing a gradient vector across all available training examples. It is well known that faster convergence can be achieved over batch methods by using stochastic learning algorithms such as stochastic gradient descent (SGD) [3] or stochastic coordinate descent (SCD) [4]. These algorithms compute an update to the model parameters by computing a gradient using only a single training example or optimizing with respect to a single feature respectively. Compared to batch methods, such algorithms are inherently sequential and each successive iteration involves only a single vector inner product computation. Furthermore, these vectors are sparse for many datasets of interest. For these reasons, mapping such algorithms onto GPUs become more difficult since, within a single iteration, one cannot benefit significantly from the massively parallel architecture.

A further challenge is the issue of limited GPU memory. State-of-the-art GPUs have a main memory capacity of up to 16GB. Modern internet-scale datasets [5] can grow many times larger than this. Therefore, stochastic learning algorithms that run on a single GPU are of limited interest and to build a useful GPU-based implementation it is necessary to consider distributed techniques. Distribution of stochastic learning is an active field of research and many promising techniques have been proposed. In [6] a method was proposed whereby worker nodes perform stochastic updates of a local model and asynchronously communicate their model updates to a parameter server. Alternatively, one may consider synchronous techniques such as [7] in which worker nodes perform stochastic updates using the data that is locally available to them and after a number of steps, all updates are aggregated on a master node and the resulting model (or some representation thereof) is then broadcast back to the workers for the next round. Unless the data has been partitioned in a way to exploit underlying structure, distributed algorithms tend to converge slower (in terms of number of model updates) compared to their non-distributed counterparts due to the delay incurred in sharing model updates between workers [8]. However, when one is truly dealing with very large data, scaling out across multiple machines (or indeed GPUs) becomes a necessity rather than a choice.

In this work we will begin with an overview of the ridge regression problem ( $L^{2}$ -regularized linear regression) and explain how it can be solved using stochastic coordinate descent/ascent techniques in its primal formulation as well as its dual formulation. While we have opted to focus on ridge regression for the sake of simplicity, stochastic coordinate methods are used in the field of machine learning to solve other problem such as regression with elastic net regularization as well as support vector machines. We proceed to review the state-of-the-art implementations of SCD on the CPU, covering both single-threaded and multi-threaded implementations. We will then present a twice parallel, asynchronous implementation of SCD (TPA-SCD) that is specifically designed to take advantage of the massively parallel hardware that is available on modern GPUs. We will demonstrate that this new implementation can achieve up to a $35\times$ speed-up in training time relative to existing single-threaded algorithms. We will then turn our attention to the problem of training on very large datasets and discuss how SCD can be distributed across multiple workers nodes that communicate over a network. We will consider the case where the training data is distributed across the workers nodes by training example as well as the case where the data is distributed by feature. We will then study a core component of synchronous distributed learning algorithms: the aggregation step. We propose a novel technique for direct optimization of the aggregation step for the case of distributed ridge regression. We will demonstrate that by using optimized aggregation, it is possible to scale out across up to 8 nodes without experiencing any significant slow-down in training time. Finally, we will combine these two techniques and demonstrate distributed stochastic learning across clusters of GPUs. We show that by using a cluster of 4 GPUs running TPA-SCD it is possible to train a 40GB sample of the criteo dataset (200 million examples, 75 million non-zero features) to a high degree of accuracy in around 4 seconds, providing a $20\times$ speed-up over a state-of-the-art multi-threaded, distributed implementation.

II Ridge Regression

Let $A\in\mathbb{R}^{N\times M}$ denote the training data matrix where $N$ is the number of training examples and $M$ is the number of features. The corresponding labels for the training examples are provided by the vector $y\in\mathbb{R}^{N}$ . The parameter $\lambda\in\mathbb{R}^{+}$ controls regularization and prevents over-fitting. The operator $||.||$ denotes the standard $L^{2}$ -norm and the operator $\left<.\right>$ denotes the Euclidean inner product between two vectors.

II-A Primal Form

Ridge regression in its primal form is defined by the following objective function:

[TABLE]

where $\beta\in\mathbb{R}^{M}$ are the primal model weights. The function $\mathcal{P}(\beta)$ is strongly convex and the optimal values are given by:

[TABLE]

Coordinate descent methods aim to iteratively minimize the primal objective (1) by successively optimizing individual coordinates. The partial derivative of the primal function with respect to the $m$ -th coordinate is given by:

[TABLE]

where $a_{m}$ denotes the $m$ -th column of data matrix $A$ . Let $\beta^{(t)}$ denote the approximate solution at iteration $t$ . By following the approach of [4] and optimizing with respect to the $m$ -th coordinate while keeping the model weights for all other coordinates fixed, one obtains the following update rule:

[TABLE]

where $w^{(t)}=A\beta^{(t)}\in\mathbb{R}^{N}$ is known as the shared vector at iteration $t$ and $e_{m}\in\{0,1\}^{M}$ is a vector that is all-zero apart from the $m$ -th coordinate. The following update rule for the shared vector follows easily:

[TABLE]

II-B Dual Form

Ridge regression in its dual form is defined by the following objective function:

[TABLE]

where $\alpha\in\mathbb{R}^{N}$ are the dual model weights. The optimal value for the model weights can be found by solving the following maximization problem:

[TABLE]

The dual problem can also be solved using iterative techniques that maximize the objective function (3) using a single coordinate at a time. The partial derivative with respect to the $n$ -th coordinate is given by:

[TABLE]

where $\bar{a}_{n}$ denotes the $n$ -th row of the data matrix $A$ . Let $\alpha_{i}^{(t)}$ denote the estimate of the optimal model weights at iteration $t$ . It was shown in [9] that one can optimize the dual objective function for a selected coordinate $n$ while keeping the model weights for all other coordinates fixed, leading to the following update rule:

[TABLE]

where $\bar{w}^{(t)}=A^{T}\alpha^{(t)}\in\mathbb{R}^{M}$ denotes the dual shared vector and $\bar{e}_{n}\in\{0,1\}^{N}$ is a vector that is all-zero apart from the $n$ -th coordinate. The update rule for the dual shared vector is given by:

[TABLE]

II-C Duality Gap

The Fenchel-Rockafellar duality theorem [10] dictates that $\mathcal{P}(\beta^{*})=\mathcal{D}(\alpha^{*})$ and the following conditions must hold for the optimal values of $\beta^{*}$ and $\alpha^{*}$ :

[TABLE]

In order to compare the convergence behavior of the two methods we can define the duality gap for the primal, dual algorithms respectively as follows:

[TABLE]

We use the duality gap to compare convergence of algorithms that solve the primal and dual formulations of ridge regression since it does not depend on the scale of either objective and its limit when the number of iterations is large is known: it must always converge to zero. In the next section we will implement these algorithms and compare their convergence behavior.

III Stochastic Learning on the GPU

In this section we will consider how to implement stochastic coordinate descent methods. Since the dual maximization problem can always be transformed into a minimization by applying a change of sign, in what follows we will solely refer to stochastic coordinate descent (SCD) methods for solving both formulations of ridge regression. We first review the standard sequential implementation on the CPU and the existing work on asynchronous, multi-threaded implementations. We will then describe a new implementation (TPA-SCD) that is designed to exploit the massively parallel computing resources provided by modern GPUs.

III-A Sequential CPU Implementation

In Algorithm 1 we review the algorithm proposed in [4] for sequential SCD. The algorithm is presented for the primal form of ridge regression, however the equivalent variation for the dual formulation is almost identical up to the update rule (4). For the primal form of the algorithm, one epoch is defined to be one pass through all the $M$ permuted features. Conversely, for the dual form an epoch is defined as one pass through all the $N$ permuted training samples. Both variants of the algorithm have been implemented in C++. Optimization flags were passed to the gcc compiler to ensure that vectorization occurs when evaluating the inner products. The data matrix and model parameters are represented using $32$ -bit floating point data types. The implementation has been designed assuming that the data matrix $A$ is sparse, so that any unnecessary computation is avoided.

III-B Asynchronous CPU Implementations

SCD is an inherently sequential algorithm and is thus non-trivial to parallelize. However, recent work into asynchronous techniques has shown that it is possible to accelerate stochastic learning algorithms by running multiple threads (each updating using a single coordinate or example) that read the current value of the model parameters (and any associated vectors) from shared memory and write back their updates without using complex locking schemes such as those proposed in [11]. In [12] an asynchronous implementation of stochastic gradient descent was proposed (“Hogwild!”) that comprises many parallel threads each computing the gradient using a random training example and updating the model weights using atomic operations. While this work significantly developed the concept of asynchronous learning, asynchronous coordinate descent methods were not considered.

In [13], an asynchronous version of Algorithm 1 was proposed (A-SCD) whereby the inner loop over the shuffled coordinates is parallelized across multiple CPU threads. Since different threads can potentially write updates to the shared vector in the same location, an atomic addition was used to ensure that updates to the shared vector are always applied. The authors found that, while a good speed-up was attained, issues arose due to the shared vector becoming inconsistent with the model weights. To resolve this problem, a scheme for occasionally re-computing the shared vector was proposed. In [14] it was reported that one can achieve faster convergence if instead of using atomic addition, one allows a “wild” behavior where updates to the shared vector can be overwritten or not applied at all. While the authors reported an almost linear speed-up in training time using this algorithm (PASSCoDe-Wild), it was shown that such an implementation will converge to a solution that violates the optimality conditions (5) and (6).

Finally, in [15], an asynchronous coordinate descent algorithm was proposed (AsySCD) and close-to-linear scaling was demonstrated using a 40-core CPU. This algorithm differs from Algorithm 1 in two important respects. Firstly, instead of optimizing for each coordinate exactly, a small gradient descent step is taken thus introducing an additional step size parameter that must be tuned. Secondly, the algorithm is implemented without the use of a shared vector. Instead, the computation of a Hessian matrix is required. This takes a considerable amount of time and significantly increases the memory requirements, thus rendering the algorithm unsuitable for very large datasets. Both of these differences were already noted in [14], in which the authors were able to reproduce the linear scaling behavior of AsySCD but demonstrated that it is slower than even a single threaded implementation of Algorithm 1.

For comparison with our GPU-based implementation we have implemented the algorithm proposed in [13] that uses atomic addition (A-SCD) and the “wild” implementation proposed in [14] (PASSCoDE-Wild). Both implementations were written in C++ using the OpenMP library. All CPU-based experiments were run on 8-core Intel Xeon111Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other product or service names may be trademarks or service marks of IBM or other companies. CPUs with a clock frequency of 2.40GHz. Each core can run up to 2 threads resulting in a maximum number of 16 threads.

III-C Twice Parallel, Asynchronous GPU Implementation

In Algorithm 2 we present a twice parallel, asynchronous implementation of SCD (TPA-SCD) designed to run on massively parallel GPU architectures. The algorithm is presented for the primal form of ridge regression, however the equivalent variation for the dual formulation is almost identical up to the update rule (4). Our algorithm exploits two levels or parallelism. Firstly, in any given epoch, every coordinate is updated by a dedicated thread block and the thread blocks are scheduled for execution in parallel (or concurrently) on the streaming multiprocessors (SMs) that are available on the GPU. Secondly, within each thread block the computations that are required to perform the coordinate update are divided up amongst multiple threads at a very fine granularity. Furthermore, the updates to the shared vector are written out to the GPU main memory using all available threads. This helps to ensure that the shared vector and the model weights remain consistent throughout operation and thus a good convergence behavior is achieved. Rather than implement a complex locking scheme, we implemented the shared vector updates using the floating point atomic additions operations that are offered by most modern GPUs. These operations ensure that all updates to the shared vector are applied without any blocking occurring. The implementation of TPA-SCD was written in CUDA/C++ and all data is represented using $32$ -bit floating point data types. We have implemented and tested the algorithm on the NVIDIA Quadro M4000 GPU as well as the GeForce GTX Titan X.

III-D Performance Comparison

In Fig. 1 we compare the convergence behavior of a number of the algorithms that have been proposed to solve the primal form of ridge regression. The dataset that was used was a training sample of the webspam dataset [16] that consists of $262,938$ examples and $680,715$ non-zero features. This sample was obtained by sampling the training examples uniformly at random to create a $75\%/25\%$ train/test split of the full dataset. A compressed sparse column format was used to represent the data matrix in the memory of the GPU when solving the primal problem and a compressed sparse row format was used when solving the dual. The sampled webspam dataset consumes around $7.3$ GB of GPU memory and thus fits inside the memory capacity of the M4000 (the limit is $8$ GB). In Fig. 1a we study the convergence as function of epochs and in Fig. 1b we compare the convergence as a function of time. When we refer to an algorithm exhibiting a “speed up” in training time we mean that the same level of duality gap can be achieved in a shorter amount of time (even if more epochs are required). Firstly, let us consider the sequential SCD (Algorithm 1) using a single CPU thread and compare its performance with that of A-SCD and PASSCoDe-Wild (both using 16 threads). While the atomic implementation (A-SCD) has exactly the same convergence properties as the sequential algorithm as a function of epochs, we observe only a modest speed-up (around $2\times$ ) which we attribute to the lack of hardware support for floating point atomic addition on this particular CPU. On the other hand, for the wild implementation (PASSCoDe-Wild) we see a much more significant speed-up ( $4\times$ ), but the algorithm converges to a solution that violates the optimality conditions (5) and (6). Accordingly, the duality gap does not tend towards zero. Now turning our attention to the GPU-based implementations of TPA-SCD, we observe near-perfect convergence for the algorithm on both GPUs as a function of epochs and significant gains in training time: around $14\times$ for the M4000 and $25\times$ for the Titan X. All speed-ups are measured relative to the sequential single-threaded implementation.

In Fig. 2 we compare the convergence behavior of the same set of algorithms for the dual form of ridge regression using the same dataset. From Fig. 2a we can see that things look very similar to the primal case: all implementations converge in the same manner as the sequential algorithm except PASSCoDE-Wild. The convergence behavior as a function of time is shown in Fig. 2b. We observe similar speed-ups for the multi-threaded CPU implementations (relative to the sequential algorithm) as were observed for the primal case. For the TPA-SCD on the M4000 we achieve a $10\times$ speed-up and on the Titan X we achieve a $35\times$ speed-up relative to the sequential SCD algorithm.

IV Distributed Stochastic Learning

Modern GPUs have a memory capacity of up to $16$ GB thus severely limiting the size of the datasets on which we are able to learn. If we want to train on very large datasets and still benefit from the large acceleration that has been demonstrated in the previous section, it is essential that we are able to scale out the training across multiple GPUs. In this section we will review distributed SCD-based methods and introduce a technique for optimizing the aggregation of workers’ model updates to accelerate convergence and improve scaling.

IV-A Distributed SCD

Training algorithms that can be distributed across multiple machines have been the subject of a significant amount of research. Distributed techniques based on stochastic gradient descent have been proposed (see [17] and [18]) as well as methods based on coordinate descent/ascent (see [7], [19], [20], [21] and [22]). These distributed learning algorithms typically involve each machine (or worker) performing a number of optimization steps to approximately minimize the global objective function using the local data that it has available. The training data can either be distributed by sample (rows of the matrix $A$ ) or by feature (columns of the matrix $A$ ). The model updates from all of the workers are then communicated over the network to a master node. The master then aggregates all of the updates and computes a new set of model parameters. The updated model parameters are then broadcast back to the workers over the network and the process repeats.

Training of ridge regression models using stochastic coordinate methods can be distributed across a cluster of machines (or a cluster of GPUs) following the aforementioned approach. One can choose whether to distribute the data matrix A across the workers by features and solve the primal form of the problem or distribute the data by example and solve the problem in its dual form. During each epoch, each worker performs a permuted pass through its set of local coordinates and performs incremental optimization of the objective function (keeping all unselected coordinates fixed, including those that exist on the other workers). The coordinate updates on each worker can be computed using any of the techniques discussed in the previous section. After all workers have finished passing through their coordinates, an aggregation step is performed whereby the updates to the shared vector on each worker are sent over the network to a master node where they are aggregated. An updated value for the shared vector is then computed on the master and broadcast back to the workers and the next epoch can begin.

The algorithm that has been implemented in described in detail in Algorithm 3 for the primal formulation of ridge regression where data is distributed by features. It should be appreciated that the same procedure can be applied to dual formulation without significant modification. The procedure that is described can be thought of as a special case of the more general CoCoA framework [7] applied specifically to the ridge regression problem (with the CoCoA hyper-parameter $\sigma$ set to $1$ ). The distributed aspects of the algorithm were implemented in C++ using MPI. In particular, the implementation leverages the Broadcast and Reduce functions that are offered by the Open MPI library.

In Fig. 3 we plot the convergence in duality gap as a function of epochs for an increasing number of workers for both the primal form (where the data is distributed by features) and the dual form (where the data is distributed by examples). The experiment was run using a cluster of 4 Intel Xeon-based machines connected via a 10Gbit ethernet link with up to two workers per machine. Each worker uses the single-threaded, sequential Algorithm 1 as its local solver. One can see that in both cases, the distributed algorithm converges to the optimum but there appears to be an approximately linear slow-down in convergence speed as a function of epochs. This is an inevitable effect that arises due to the workers using an out-of-date shared vector during each epoch. This effect can be somewhat alleviated if one was able to communicate shared vector updates more frequently and thus perform fewer coordinate updates on the workers between communication stages. It has been shown in [23] that there exists an infrastructure-dependent trade-off between computation and communication for distributed learning algorithms. By carefully tuning the ratio of communication to computation, it may be possible to improve the convergence behavior of the distributed algorithm further but we consider such optimizations beyond the scope of this paper.

IV-B Adaptive Aggregation

The convergence behavior of the distributed SCD algorithm can be improved by optimizing the aggregation step. Existing work has considered both averaging and adding of updates [24], introducing an aggregation parameter that can be set freely [25] and even performing a line search method to explicitly optimize the aggregation parameter [21]. We propose a new method to optimize aggregation for distributed ridge regression whereby an optimal value of an aggregation parameter is precisely computed in a distributed manner.

Let us denote the aggregated model weights and shared vector at the end of epoch $t$ as follows:

[TABLE]

where $\gamma_{t}$ is the aggregation parameter in the $t$ -th epoch. For the primal formulation of ridge regression, we can then optimize the objective function to explicitly find the best aggregation parameter at every epoch:

[TABLE]

The above equation has the following explicit solution:

[TABLE]

While the aggregated changes to the shared vector $\Delta w^{(t)}$ are already available on the master node, the aggregated changes to the model weights $\Delta\beta^{(t)}$ are not. However, since all workers only update the coordinates corresponding to their local data, the following property allows us to compute $\left<\beta^{(t)},\Delta\beta^{(t)}\right>$ and $||\Delta\beta^{(t)}||^{2}$ in a distributed manner:

[TABLE]

The distributed algorithm is defined precisely in Algorithm 4 for the primal form of ridge regression. The additional communication that is introduced in order to achieve the optimized aggregation amounts to the transfer of a few scalars over the network interface per epoch. The equivalent algorithm for the dual form follows easily from the following expression for the optimal aggregation parameter in the dual setting:

[TABLE]

In Fig. 4 we plot the convergence behavior of distributed SCD using adaptive aggregation on the ridge regression problem and compare it with the algorithm that uses averaging for aggregating the updates to the shared vector. We observe that for the algorithm that solves the primal formulation there is a speed-up in convergence that approaches $2\times$ for small values of duality gap. For the algorithm that solves the dual, the effect is less pronounced: for relatively large values of the duality gap the algorithm with adaptive aggregation can be slower (since we explicitly minimize the dual objective, the duality gap is not necessarily minimized) but for small values of the duality gap we observe an speed-up of around $1.2\times$ .

In Fig. 5 we show the evolution of the optimal value of the aggregation parameter as a function of epochs. We can observe a trend: it tends to start off relatively low before increasing and finally converging to some value. It is interesting to note that the value to which it converges to is significantly larger than the value that corresponds to averaging (i.e., $\gamma=1/K$ ).

In Fig. 6 we plot the time to reach a desired duality gap as a function of the number of workers for the webspam dataset. In Fig. 6a we show the scaling behavior for the distributed solver for the primal form of ridge regression and in Fig. 6b we show the same but for the dual formulation. In both cases we can compare the scaling behavior, for different levels of desired accuracy, with and without the adaptive aggregation technique. In both cases we observe that the adaptive aggregation technique allows us to scale out across multiple worker nodes while keeping the training time roughly constant. For the dual problem on the webspam dataset, we see that for relatively high values of the duality gap, the adaptive aggregation can slow down convergence somewhat. This is consistent with Fig. 4b where we observed a crossover point at around duality gap $5\times 10^{-4}$ .

This scaling behavior is very consistent with that reported for CoCoA+ in [24]. The acceleration that comes from each worker processing a factor of $K$ less data per iteration is just enough to compensate for the linear slow-down in convergence that occurs due to each worker using an outdated model (see Fig. 3). The scaling behavior strongly depends on the nature of the underlying dataset. In particular, the slow-down in convergence is determined by the level of correlation between coordinates on the different workers. If there exists some additional structure (for instance, a large number of one-hot encoded categorical variables) then one can partition the coordinates in an intelligent way to achieve a faster convergence and thus better scaling [22].

V Scaling out across multiple GPUs

In this section we will combine the methods of Sections III and IV to construct an accelerated implementation of TPA-SCD that can scale across multiple GPUs connected over a network and train on datasets much larger than the memory capacity of single GPU device.

V-A Distributed TPA-SCD

The general approach is illustrated in Fig. 7. The training is distributed across $K$ workers using the algorithms described in the previous section. Each worker consists of a CPU-based machine with at least one GPU attached over a PCIe interface. During each epoch, every worker runs the TPA-SCD algorithm on the streaming multiprocessors of its GPU and computes updates to its local model weights as well the shared vector. Each worker is then responsible for copying the shared vector updates from the GPU device memory into its host memory and then communicating the updates to the master over the network interface. The master then aggregates the updates and broadcasts the new shared vector back to the workers. The worker must then copy the new shared vector from its host memory back into the GPU device memory. Thus, we have opted to use synchronous communication between the workers at the network level and asynchronous communication between the “sub-workers” at the GPU level (i.e., the thread blocks that are processing different coordinates). Note that the dataset on which we are training is transferred into the GPU memory once at the beginning of operation and does not move. Thus the penalties associated with transferring large amount of data over the network are for the most part avoided. Communication of vectors on and off the GPU during each epoch was implemented using the pinned memory functionality offered by CUDA to achieve maximum throughput over the PCIe interface between the workers’ host memory and device memory.

In Fig. 8 we show the scaling behavior of distributed TPA-SCD (with averaging) for the webspam dataset using two different GPU clusters. The dual formulation of ridge regression is being solved and the data is thus distributed across the GPU memory by training examples. In Fig. 8a we have used a cluster of eight NVIDIA Quadro M4000 GPUs that are connected via a 10Gbit ethernet network link. We observe a $10\times$ speed-up over the equivalent distributed implementation that uses sequential SCD. In Fig. 8b we show results using a cluster of 4 GeForce GTX Titan X GPUs that are attached to a single machine and communicate over the PCIe interface. These GPUs are significantly faster than the M4000s and we observe around a $30\times$ speed-up and similar scaling behavior. Note that in these results we have not applied the adaptive aggregation technique and thus all speed-ups reported are solely due to execution of the local solver on the GPU hardware.

In Fig. 9 we examine the scaling behavior on the M4000 cluster in more detail. The execution time is broken down into the time spent computing (both on the GPU and on the host), the time spent transferring data on/off the GPU over PCIe and the time spent communicating over the 10Gbit ethernet network. While we observe that the time spent computing on the GPU dominates the execution time in all cases, we notice that the communication overheads increase as the number of workers grows. However, with 8 workers the communication time is still only around $17\%$ of the total execution time, suggesting that it should be possible to scale out across more workers before the communication overheads become prohibitive. Naturally, these results indicate that the use of a 100Gbit ethernet network interface would improve the scaling behavior further. We would like to stress that the scaling behavior that has been demonstrated does not imply that training can be accelerated if the size of the dataset remains fixed. However, as we will now demonstrate, this scaling property allows one to leverage GPU acceleration when training massive datasets that do not fit inside the memory of a single GPU.

V-B Large-scale data

While the speed-ups we observe for the webspam dataset are consistent with the results reported in Fig. 2b using a single GPU, we now have the ability to train using much larger datasets that do not fit inside the memory of a single GPU. For our next experiment we sampled one day’s worth of data from the criteo dataset [5]. This sample consists of approximately 200 million training examples and 75 million unique features and occupies around 40GB of GPU memory using a compressed sparse row format222For this sample the values in the training data matrix are always $1$ and so one could halve the memory usage by re-writing the code to explicitly assume this. Even so, the dataset would not fit in the memory of a single GPU.

We partition the dataset by training example and thus randomly distribute the rows of the training data matrix across the 4 workers of the Titan X GPU cluster. We then ran distributed TPA-SCD with adaptive aggregation and compared the convergence behavior (as a function of time) with that of two reference distributed implementations. The first reference implementation is distributed SCD (Algorithm 3) using 4 workers. Each worker uses single-threaded, sequential SCD as its local solver. The second reference is the same except all workers use PASSCoDe-Wild (with 16 threads) as their local solver. We have decided not to compare with the distributed implementation consisting of 64 single-threaded workers (i.e., 16 workers on each of the 4 CPUs) since it has been established in Section IV that using more workers does not lead to faster convergence.

The convergence in duality gap for these three schemes is presented in Fig. 10. We see around a $40\times$ speed-up relative to the single-threaded SCD and around a $20\times$ speed-up relative to PASSCoDe-Wild. Note that since the optimality conditions are violated by the multi-threaded CPU implementation, the duality gap does not converge to zero. However, the solution that it has found may still be useful depending on the application.

VI Conclusion

In this work we have presented a new implementation of stochastic coordinate descent (TPA-SCD) that has been carefully designed to efficiently make use of the compute architecture provided by modern GPUs. We have demonstrated that GPUs can be used to train a ridge regression model to a desired degree of accuracy $35\times$ faster than a single-threaded CPU implementation and $10\times$ faster than a multi-threaded CPU implementation. In order to scale up to very large datasets that consist of hundreds of millions of training examples and features we have demonstrated that it is possible to scale out our stochastic learning system across 8 GPUs without any significant loss of training speed or accuracy. Furthermore, we have presented a novel distributed method for exact optimization of the aggregation step for distributed ridge regression. By scaling out across 4 Titan X GPUs and using the adaptive aggregation method we were able to train on a 40GB dataset and demonstrate a $20\times$ speed-up relative to a multi-threaded distributed implementation across 4 CPU-based workers.

Acknowledgment

The authors would like to thank Evangelos Eleftheriou, IBM Research - Zurich for his support of this work and Martin Jaggi, EPFL for useful discussions regarding distributed learning algorithms.

Bibliography25

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems , 2012, pp. 1097–1105.
2[2] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning . Springer series in statistics Springer, Berlin, 2001, vol. 1.
3[3] L. Bottou, “Online learning and stochastic approximations,” On-line learning in neural networks , vol. 17, no. 9, p. 142, 1998.
4[4] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of statistical software , vol. 33, no. 1, p. 1, 2010.
5[5] Terabyte Click Logs. 2015. Criteo Labs. [Online]. Available: http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/
6[6] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) , 2014, pp. 583–598.
7[7] M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan, “Communication-efficient distributed dual coordinate ascent,” in Advances in Neural Information Processing Systems , 2014, pp. 3068–3076.
8[8] R. Bekkerman, M. Bilenko, and J. Langford, Scaling up machine learning: Parallel and distributed approaches . Cambridge University Press, 2011.