Scaling Up Quasi-Newton Algorithms: Communication Efficient Distributed SR1

Majid Jahani, Mohammadreza Nazari, Sergey Rusakov, Albert S. Berahas, and Martin Takáč

arXiv:1905.13096·math.OC·May 15, 2020·LOD


TL;DR

This paper introduces DS-LSR1, a communication-efficient distributed quasi-Newton algorithm that significantly reduces communication overhead and scales effectively for large neural network training tasks.

Contribution

The paper proposes DS-LSR1, a novel distributed implementation of S-LSR1 that is communication-efficient, matrix-free, inverse-free, and scalable for high-dimensional problems.

Findings

01 Reduces communication rounds in distributed S-LSR1

02 Achieves better load balancing across nodes

03 Demonstrates effective scaling on neural network training

Abstract

In this paper, we present a scalable distributed implementation of the Sampled Limited-memory Symmetric Rank-1 (S-LSR1) algorithm. First, we show that a naive distributed implementation of S-LSR1 requires multiple rounds of expensive communications at every iteration and thus is inefficient. We then propose DS-LSR1, a communication-efficient variant that: (i) drastically reduces the amount of data communicated at every iteration, (ii) has favorable work-load balancing across nodes, and (iii) is matrix-free and inverse-free. The proposed method scales well in terms of both the dimension of the problem and the number of data points. Finally, we illustrate the empirical performance of DS-LSR1 on a standard neural network training task.

Tables

Table 1: Details of quantities communicated and computed.

Variable	Dimension
$w_{k}, \nabla F(w_{k}), \nabla F_{i}(w_{k}), p_{k}, Y_{k,i}M_{k}^{-1}Y_{k}^{T}p_{k}, B_{k}d$	$d \times 1$
$F(w_{k}), F_{i}(w_{k})$	$1$
$S_{k}, S_{k,i}, Y_{k}, Y_{k,i}$	$d \times m$
$S_{k}^{T}Y_{k,i}, S_{k,i}^{T}Y_{k,i}, M_{k}^{-1}$	$m \times m$
$M_{k}^{-1}Y_{k,i}^{T}p_{k}$	$m \times 1$

Table 2: Communication Details.

	Naive DS-LSR1	DS-LSR1
Broadcast:	$w_{k}$	$w_{k}, p_{k}, M_{k}^{-1}$
Reduce:	$\nabla F_{i}(w_{k}), F_{i}(w_{k}), S_{k,i}, Y_{k,i}$	$\nabla F_{i}(w_{k}), F_{i}(w_{k}), S_{k}^{T}Y_{k,i}, Y_{k,i}M_{k}^{-1}Y_{k}^{T}p_{k}, M_{k}^{-1}Y_{k,i}^{T}p_{k}$

Table 3: Computation Details.

	Naive DS-LSR1	DS-LSR1
Worker:	$\nabla F_{i}(w_{k}), F_{i}(w_{k}), Y_{k,i}$	$\nabla F_{i}(w_{k}), F_{i}(w_{k}), Y_{k,i}, S_{k,i}^{T}Y_{k,i}, M_{k}^{-1}Y_{k,i}^{T}p_{k}, Y_{k,i}M_{k}^{-1}Y_{k}^{T}p_{k}$, CG
Master:	$M_{k}^{-1}, w_{k+1}, B_{k}d$, CG	$M_{k}^{-1}, w_{k+1}$

Table 4: Details of quantities communicated and computed.

Variable	Dimension
$w_{k}$	$d \times 1$
$F(w_{k}), F_{i}(w_{k})$	$1$
$\nabla F(w_{k}), \nabla F_{i}(w_{k})$	$d \times 1$
$p_{k}$	$d \times 1$
$S_{k}, S_{k,i}$	$d \times m$
$Y_{k}, Y_{k,i}$	$d \times m$
$S_{k}^{T}Y_{k,i}, S_{k,i}^{T}Y_{k,i}$	$m \times m$
$M_{k}^{-1}$	$m \times m$
$B_{k}d$	$d \times 1$
$M_{k}^{-1}Y_{k,i}^{T}p_{k}$	$m \times 1$
$Y_{k,i}M_{k}^{-1}Y_{k}^{T}p_{k}$	$d \times 1$

Table 5: Details for Shallow Networks.

Network	# Hidden Layers	# Nodes/Layer	$d$
1	1	1	805
2	1	10	7960
4	1	100	79510
3	1	1000	795010

Table 6: Details for Deep Networks.

Network	# Hidden Layers	# Nodes/Layer	$d$
1	7	2-2-2-2-2-2-2	817
2	7	10-10-10-10-10-10-10	8620
4	7	100-100-100-10-10-10-10	100150
3	7	1000-100-100-10-10-10-10	896650

Equations

\min_{w \in \mathbb{R}^{d}} F(w) := \frac{1}{n} \sum_{i=1}^{n} f(w; x^{i}, y^{i}) = \frac{1}{n} \sum_{i=1}^{n} f_{i}(w),

w_{k+1} = w_{k} + p_{k},

\min_{\|p\| \leq \Delta_{k}} m_{k}(p) = F(w_{k}) + \nabla F(w_{k})^{T} p + \frac{1}{2} p^{T} B_{k} p,

B_{k+1} = B_{k} + \frac{(y_{k} - B_{k}s_{k})(y_{k} - B_{k}s_{k})^{T}}{(y_{k} - B_{k}s_{k})^{T}s_{k}},

B_{k+1}v = B_{k}^{(0)}v + (Y_{k} - B_{k}^{(0)}S_{k}) \big(\underbrace{D_{k} + L_{k} + L_{k}^{T} - S_{k}^{T}B_{k}^{(0)}S_{k}}_{M_{k}}\big)^{-1} (Y_{k} - B_{k}^{(0)}S_{k})^{T}v,

D_{k} = \mathrm{diag}[s_{k,1}^{T}y_{k,1}, \dots, s_{k,m}^{T}y_{k,m}], \qquad (L_{k})_{j,l} = \begin{cases} s_{k,j-1}^{T}y_{k,l-1} & \text{if } j > l, \\ 0 & \text{otherwise.} \end{cases}

|s_{k,j}^{T}(y_{k,j} - B_{k}^{(j-1)}s_{k,j})| \geq \eta \|s_{k,j}\| \|y_{k,j} - B_{k}^{(j-1)}s_{k,j}\|,

B_{k+1}v = Y_{k}M_{k}^{-1}Y_{k}^{T}v,

(M_{k}^{(j+1)})^{-1} = \begin{bmatrix} (M_{k}^{(j)})^{-1} + \zeta (M_{k}^{(j)})^{-1}uv^{T}(M_{k}^{(j)})^{-1} & -\zeta (M_{k}^{(j)})^{-1}u \\ -\zeta v^{T}(M_{k}^{(j)})^{-1} & \zeta \end{bmatrix}

M_{k}^{(j+1)} = \begin{bmatrix} M_{k}^{(j)} & u \\ v^{T} & c \end{bmatrix},

\left[\begin{array}{cc|cc} M_{k}^{(j)} & u & I & 0 \\ v^{T} & c & 0 & 1 \end{array}\right]

\Rightarrow \left[\begin{array}{cc|cc} I & (M_{k}^{(j)})^{-1}u & (M_{k}^{(j)})^{-1} & 0 \\ 0 & c - v^{T}(M_{k}^{(j)})^{-1}u & -v^{T}(M_{k}^{(j)})^{-1} & 1 \end{array}\right]

\Rightarrow \left[\begin{array}{cc|cc} I & (M_{k}^{(j)})^{-1}u & (M_{k}^{(j)})^{-1} & 0 \\ 0 & 1 & \dfrac{-v^{T}(M_{k}^{(j)})^{-1}}{c - v^{T}(M_{k}^{(j)})^{-1}u} & \dfrac{1}{c - v^{T}(M_{k}^{(j)})^{-1}u} \end{array}\right]

\Rightarrow \left[\begin{array}{cc|cc} I & 0 & (M_{k}^{(j)})^{-1} + \dfrac{(M_{k}^{(j)})^{-1}uv^{T}(M_{k}^{(j)})^{-1}}{c - v^{T}(M_{k}^{(j)})^{-1}u} & \dfrac{-(M_{k}^{(j)})^{-1}u}{c - v^{T}(M_{k}^{(j)})^{-1}u} \\ 0 & 1 & \dfrac{-v^{T}(M_{k}^{(j)})^{-1}}{c - v^{T}(M_{k}^{(j)})^{-1}u} & \dfrac{1}{c - v^{T}(M_{k}^{(j)})^{-1}u} \end{array}\right]

\Rightarrow \left[\begin{array}{cc|cc} I & 0 & (M_{k}^{(j)})^{-1} + \zeta (M_{k}^{(j)})^{-1}uv^{T}(M_{k}^{(j)})^{-1} & -\zeta (M_{k}^{(j)})^{-1}u \\ 0 & 1 & -\zeta v^{T}(M_{k}^{(j)})^{-1} & \zeta \end{array}\right]

SU \leq \frac{1}{t + \frac{1 - t}{K}}.


Full text

Lehigh University, Bethlehem, PA, 18015, USA



Keywords: SR1, Distributed Optimization, Deep Learning.

1 Introduction

In the last decades, significant efforts have been devoted to the development of optimization algorithms for machine learning. Currently, due to its fast learning properties, low per-iteration cost, and ease of implementation, the stochastic gradient (SG) method [32, 8], and its adaptive [19, 25], variance-reduced [22, 35, 18] and distributed [31, 40, 45, 17] variants are the preferred optimization methods for large-scale machine learning applications. Nevertheless, these methods have several drawbacks: they are highly sensitive to the choice of hyper-parameters, are cumbersome to tune, and suffer from ill-conditioning [2, 43, 9]. More importantly, these methods offer limited benefit in distributed computing environments since they are usually implemented with small mini-batches, and thus spend more time communicating than performing “actual” computations. This shortcoming can be remedied to some extent by increasing the batch sizes; however, there is a point after which the increase in computation is not offset by the faster convergence [39].

Recently, there has been an increased interest in (stochastic) second-order and quasi-Newton methods by the machine learning community; see e.g., [10, 29, 6, 34, 42, 36, 11, 16, 4, 5, 21, 23]. These methods judiciously incorporate curvature information, and thus mitigate some of the issues that plague first-order methods. Another benefit of these methods is that they are usually implemented with larger batches, and thus better balance the communication and computation costs. Of course, this does not come for free; (stochastic) second-order and quasi-Newton methods are more memory intensive and more expensive (per iteration) than first-order methods. This naturally calls for distributed implementations.

In this paper, we propose an efficient distributed variant of the Sampled Limited-memory Symmetric Rank-1 (S-LSR1) method [3], called DS-LSR1, that operates in the master-worker framework (Figure 1). Each worker node has a portion of the dataset, and performs local computations using solely that information and information received from the master node. The proposed method is matrix-free (the Hessian approximation is never explicitly constructed) and inverse-free (no matrix inversion). To this end, we leverage the compact form of the SR1 Hessian approximations [12], and utilize sketching techniques [41] to approximate several required quantities. We show that, contrary to a naive distributed implementation of S-LSR1, the method is communication-efficient and has favorable work-load balancing across nodes. Specifically, the naive implementation requires $\mathcal{O}(md)$ communication, whereas our approach only requires $\mathcal{O}(m^{2})$ communication, where $d$ is the dimension of the problem, $m$ is the LSR1 memory and $m\ll d$. (Note, these costs are on top of the communications that are common to both approaches.) Furthermore, in our approach the heavy computations are done by the worker nodes and the master node performs only simple aggregations, whereas in the naive approach the computationally intensive operations, e.g., Hessian-vector products, are computed locally by the master node. Finally, we show empirically that DS-LSR1 has good strong and weak scaling properties, and illustrate the performance of the method on a standard neural network training task.

Problem Formulation and Notation

We focus on machine learning empirical risk minimization problems that can be expressed as:

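$$\min_{w\in\mathbb{R}^{d}} F(w):=\frac{1}{n}\sum_{i=1}^{n}f(w;x^{i},y^{i})=\frac{1}{n}\sum_{i=1}^{n}f_{i}(w),$$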

where $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is the composition of a prediction function (parametrized by $w$ ) and a loss function, and $(x^{i},y^{i})_{i=1}^{n}$ denote the training examples (samples). Specifically, we focus on deep neural network training tasks where the function $F$ is nonconvex, and the dimension $d$ and number of samples $n$ are large.

The paper is organized as follows. We conclude this section with a discussion of related work. We describe the classical (L)SR1 and sampled LSR1 (S-LSR1) methods in Section 2. In Section 3, we present DS-LSR1, our proposed distributed variant of the sampled LSR1 method. We illustrate the scaling properties of DS-LSR1 and the empirical performance of the method on deep learning tasks in Section 4. Finally, in Section 5 we provide some final remarks.

Related Work

The Symmetric Rank-1 (SR1) method [15, 24] and its limited-memory variant (LSR1) [28] are quasi-Newton methods that have gained significant attention from the machine learning community in recent years [3, 20]. These methods incorporate curvature (second-order) information using only gradient (first-order) information. Contrary to arguably the most popular quasi-Newton method, (L)BFGS [30, 27], the (L)SR1 method does not enforce that the Hessian approximations are positive definite, and is usually implemented with a trust-region [30]. This has several benefits: $(1)$ the method is able to exploit negative curvature, and $(2)$ the method is able to efficiently escape saddle points.

There has been a significant volume of research on distributed algorithms for machine learning; specifically, distributed gradient methods [45, 7, 31, 40, 14], distributed Newton methods [37, 21, 44] and distributed quasi-Newton methods [13, 17, 1]. Possibly the closest work to ours is VF-BFGS [13], in which the authors propose a vector-free implementation of the classical LBFGS method. We leverage several of the techniques proposed in [13]; however, what differentiates our work is that we focus on the S-LSR1 method. Developing an efficient distributed implementation of the S-LSR1 method is not as straightforward as for LBFGS for several reasons: $(1)$ the construction and acceptance of the curvature pairs, $(2)$ the trust-region subproblem, and $(3)$ the step acceptance procedure.

2 Sampled limited-memory SR1 (S-LSR1)

In this section, we review the sampled LSR1 (S-LSR1) method [3], and discuss the components that can be distributed. We begin by describing the classical (L)SR1 method as this will set the stage for the presentation of the S-LSR1 method. At the $k$ th iteration, the SR1 method computes a new iterate via

$$w_{k+1}=w_{k}+p_{k},$$

where $p_{k}$ is the minimizer of the following subproblem

$$\min_{\|p\|\leq\Delta_{k}} m_{k}(p)=F(w_{k})+\nabla F(w_{k})^{T}p+\frac{1}{2}p^{T}B_{k}p, \qquad (2.1)$$

$\Delta_{k}$ is the trust region radius, $B_{k}$ is the SR1 Hessian approximation

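$$B_{k+1}=B_{k}+\frac{(y_{k}-B_{k}s_{k})(y_{k}-B_{k}s_{k})^{T}}{(y_{k}-B_{k}s_{k})^{T}s_{k}}, \qquad (2.2)$$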

and $(s_{k},y_{k})=(w_{k}-w_{k-1},\nabla F(w_{k})-\nabla F(w_{k-1}))$ are the curvature pairs. In the limited memory version, the matrix $B_{k}$ is defined as the result of applying $m$ SR1 updates to a multiple of the identity matrix using the set of $m$ most recent curvature pairs $\{s_{i},y_{i}\}_{i=k-m}^{k-1}$ kept in storage.
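To make the rank-one update concrete, here is a minimal dense sketch in plain Python. It is for illustration only: a dense $B_{k}$ is never formed in the large-scale setting this paper targets. The helper names `matvec` and `sr1_update` and the numerical values are illustrative assumptions, not part of the original method description.

```python
def matvec(B, x):
    """Multiply a list-of-rows matrix B by a vector x."""
    return [sum(b * xi for b, xi in zip(row, x)) for row in B]


def sr1_update(B, s, y):
    """Return B + r r^T / (r^T s) with r = y - B s (dense, illustration only)."""
    r = [yi - bi for yi, bi in zip(y, matvec(B, s))]
    denom = sum(ri * si for ri, si in zip(r, s))
    if denom == 0.0:
        # The update is undefined for this pair; keep B unchanged.
        return [row[:] for row in B]
    n = len(s)
    return [[B[i][j] + r[i] * r[j] / denom for j in range(n)] for i in range(n)]


B0 = [[1.0, 0.0], [0.0, 1.0]]      # a multiple of the identity
s, y = [1.0, 2.0], [3.0, 5.0]      # made-up curvature pair
B1 = sr1_update(B0, s, y)
print(matvec(B1, s))               # B1 s should reproduce y (secant equation)
```

After one update the new approximation satisfies the secant equation $B_{k+1}s_{k}=y_{k}$, which is what the final line checks for this toy pair.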

The main idea of the S-LSR1 method is to use the SR1 updating formula, but to construct the Hessian approximations using sampled curvature pairs instead of pairs that are constructed as the optimization progresses. At every iteration, $m$ curvature pairs are constructed via random sampling around the current iterate; see Algorithm 2. The S-LSR1 method is outlined in Algorithm 1. The components of the algorithms that can be distributed are highlighted in magenta.

Several components of the above algorithms can be distributed. Before we present the distributed implementations of the S-LSR1 method, we discuss several key elements of the method: $(1)$ Hessian-vector products; $(2)$ curvature pair construction; $(3)$ curvature pair acceptance; $(4)$ search direction computation; $(5)$ step acceptance procedure; and $(6)$ initial Hessian approximations.

For the remainder of the paper, let $S_{k}=[s_{k,1},s_{k,2},\dots,s_{k,m}]\in\mathbb{R}^{d\times m}$ and $Y_{k}=[y_{k,1},y_{k,2},\dots,y_{k,m}]\in\mathbb{R}^{d\times m}$ denote the curvature pairs constructed at the $k$ th iteration, $S_{k}^{i}\in\mathbb{R}^{d\times m}$ and $Y_{k}^{i}\in\mathbb{R}^{d\times m}$ denote the curvature pairs constructed at the $k$ th iteration by the $i$ th node, and $B_{k}^{(0)}=\gamma_{k}I\in\mathbb{R}^{d\times d}$ , $\gamma_{k}\geq 0$ , denote the initial Hessian approximation at the $k$ th iteration.

Hessian-vector products

Several components of the algorithms above require the calculation of Hessian vector products of the form $B_{k}v$ . In the large-scale setting, it is not memory-efficient, or even possible for some applications, to explicitly compute and store the $d\times d$ Hessian approximation matrix $B_{k}$ . Instead, one can exploit the compact representation of the SR1 matrices [12] and compute:

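$$B_{k+1}v=B_{k}^{(0)}v+(Y_{k}-B_{k}^{(0)}S_{k})\big(\underbrace{D_{k}+L_{k}+L_{k}^{T}-S_{k}^{T}B_{k}^{(0)}S_{k}}_{M_{k}}\big)^{-1}(Y_{k}-B_{k}^{(0)}S_{k})^{T}v, \qquad (2.3)$$

where

$$D_{k}=\mathrm{diag}[s_{k,1}^{T}y_{k,1},\dots,s_{k,m}^{T}y_{k,m}], \qquad (L_{k})_{j,l}=\begin{cases}s_{k,j-1}^{T}y_{k,l-1} & \text{if } j>l,\\ 0 & \text{otherwise.}\end{cases}$$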

Computing $B_{k+1}v$ via (2.3) is both memory and computationally efficient; the complexity of computing $B_{k+1}v$ is $\mathcal{O}(m^{2}d)$ [12].

Curvature pair construction

For ease of exposition, we presented the curvature pair construction routine (Algorithm 2) as a sequential process. However, this need not be the case; all pairs can be constructed simultaneously. First, generate a random matrix $S_{k}\in\mathbb{R}^{d\times m}$ , and then compute $Y_{k}=\nabla^{2}F(w_{k})S_{k}\in\mathbb{R}^{d\times m}$ . We discuss a distributed implementation of this routine in the following sections.

Curvature pair acceptance

In order for the S-LSR1 Hessian update (2.2) to be well defined, and for numerical stability, we require certain conditions on the curvature pairs employed; see [30, Chapter 6]. Namely, for a given $\eta>0$ , we impose that the Hessian approximation $B_{k+1}$ is only updated using the curvature pairs that satisfy the following condition:

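$$|s_{k,j}^{T}(y_{k,j}-B_{k}^{(j-1)}s_{k,j})|\geq\eta\|s_{k,j}\|\|y_{k,j}-B_{k}^{(j-1)}s_{k,j}\|, \qquad (2.4)$$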

for $j=1,\dots,m$ , where $B_{k}^{(0)}$ is the initial Hessian approximation and $B_{k}^{(j-1)}$ , for $j=2,\dots,m$ , is the Hessian approximation constructed using only curvature pairs $\{s_{l},y_{l}\}$ , for $l<j$ , that satisfy (2.4). Note, $B_{k+1}=B_{k}^{(m)}$ . Thus, potentially, not all curvature pairs returned by Algorithm 2 are used to update the S-LSR1 Hessian approximation. Checking this condition is not trivial and requires $m$ Hessian vector products. In [3, Appendix B.5], the authors propose a recursive memory-efficient mechanism to check and retain only the pairs that satisfy (2.4).
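The recursive acceptance test can be illustrated with a small dense sketch. This is not the memory-efficient mechanism of [3, Appendix B.5]; it keeps an explicit matrix purely to show which pairs pass the test $|s^{T}(y-Bs)|\geq\eta\|s\|\|y-Bs\|$, starting from $B^{(0)}=0$. The function name `accept_pairs`, the default value of `eta`, the extra guard against a zero denominator, and the toy data are all assumptions made for this example.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(B, x):
    """Multiply a list-of-rows matrix B by a vector x."""
    return [dot(row, x) for row in B]


def accept_pairs(pairs, dim, eta=1e-8):
    """Return (accepted indices, dense B) after screening pairs one by one."""
    B = [[0.0] * dim for _ in range(dim)]          # initial approximation B^(0) = 0
    accepted = []
    for j, (s, y) in enumerate(pairs):
        r = [yi - bi for yi, bi in zip(y, matvec(B, s))]   # y - B s
        denom = dot(s, r)
        threshold = eta * dot(s, s) ** 0.5 * dot(r, r) ** 0.5
        if abs(denom) >= threshold and denom != 0.0:
            accepted.append(j)
            # Only accepted pairs contribute a rank-one SR1 update.
            B = [[B[a][b] + r[a] * r[b] / denom for b in range(dim)]
                 for a in range(dim)]
    return accepted, B


# Pairs generated from the indefinite matrix H = diag(2, -1), i.e. y = H s.
pairs = [([1.0, 0.0], [2.0, 0.0]),
         ([2.0, 0.0], [4.0, 0.0]),    # carries no new information
         ([0.0, 1.0], [0.0, -1.0])]
accepted, B = accept_pairs(pairs, dim=2)
print(accepted, B)
```

In this toy run the second pair is rejected because $y-Bs=0$ once the first pair has been absorbed, and the accepted pairs recover the indefinite matrix exactly, including its negative eigenvalue.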

Search direction computation

The search direction $p_{k}$ is computed by solving subproblem (2.1) using CG-Steihaug; see [30, Chapter 7]. This procedure requires the computation of Hessian-vector products of the form (2.3).

Step acceptance procedure

In order to determine if a step is successful (Line 6, Algorithm 1) one has to compute the function value at the trial iterate and the predicted model reduction. This entails a function evaluation and a Hessian vector product. The acceptance ratio $\rho_{k}$ determines if a step is successful, after which the trust region radius has to be adjusted accordingly. For brevity we omit the details from the paper and refer the interested reader to [3, Appendix B.3].

Initial Hessian approximations $B_{k}^{(0)}$

In practice, it is not clear how to choose the initial Hessian approximation. We argue that, in the context of S-LSR1, a good choice is $B_{k}^{(0)}=0$ . In Figure 2 we show the eigenvalues of the true Hessian and the eigenvalues of the S-LSR1 matrices for different values of $\gamma_{k}$ ( $B_{k}^{(0)}=\gamma_{k}I$ ) for a toy problem. As is clear, the eigenvalues of the S-LSR1 matrices with $\gamma_{k}=0$ better match the eigenvalues of the true Hessian. Moreover, by setting $\gamma_{k}=0$ , the rank of the approximation is at most $m$ and thus the CG algorithm (used to compute the search direction) terminates in at most $m$ iterations, whereas the CG algorithm may require as many as $d\gg m$ iterations when $\gamma_{k}\neq 0$ . Finally, $B_{k}^{(0)}=0$ removes a hyper-parameter. Henceforth, we assume that $B_{k}^{(0)}=0$ ; however, we note that our method can be extended to $B_{k}^{(0)}\neq 0$ .

2.1 Naive Distributed Implementation of S-LSR1

In this section, we describe a naive distributed implementation of the S-LSR1 method, where the data is stored across $\mathcal{K}$ machines. At each iteration $k$ , we broadcast the current iterate $w_{k}$ to every worker node. The worker nodes calculate the local gradient, and construct local curvature pairs $S_{k}^{i}$ and $Y_{k}^{i}$ . The local information is then reduced to the master node to form $\nabla F(w_{k})$ , $S_{k}$ and $Y_{k}$ . Next, the SR1 curvature pair condition (2.4) is recursively checked on the master node. Given a set of accepted curvature pairs, the master node computes the search direction $p_{k}$ . We should note that the last two steps could potentially be done in a distributed manner at the cost of $m+1$ extra expensive rounds of communication. Finally, given a search direction the trial iterate is broadcast to the worker nodes where the local objective function is computed and reduced to the master node, and a step is taken.

As is clear, in this distributed implementation of the S-LSR1 method, the amount of information communicated is large, and the amount of computation performed on the master node is significantly larger than that on the worker nodes. Note, all the Hessian vector products, as well as the computations of the $M_{k}^{-1}$ are performed on the master node. The precise communication and computation details are summarized in Tables 2 and 3.

3 Efficient Distributed S-LSR1 (DS-LSR1)

The naive distributed implementation of S-LSR1 has several significant deficiencies. We propose a distributed variant of the S-LSR1 method that alleviates these issues, is communication-efficient, has favorable work-load balancing across nodes and is inverse-free and matrix-free. To do this, we leverage the form of the compact representation of the S-LSR1 updating formula ( $B_{k}^{(0)}=0$ )

$$B_{k+1}v=Y_{k}M_{k}^{-1}Y_{k}^{T}v, \qquad (3.1)$$

and the form of the SR1 condition (2.4). We observe the following: one need not communicate the full $S_{k}$ and $Y_{k}$ matrices, rather one can communicate $S_{k}^{T}Y_{k}$ , $S_{k}^{T}S_{k}$ and $Y_{k}^{T}Y_{k}$ . We now discuss the means by which we: $(1)$ reduce the amount of information communicated and $(2)$ balance the computation across nodes.

3.1 Reducing the Amount of Information Communicated

As mentioned above, communicating curvature pairs is not necessary; instead one can just communicate inner products of the pairs, reducing the amount of communication from $2md$ to $3m^{2}$ . In this section, we show how this can be achieved, and in fact show that this can be further reduced to $m^{2}$ .

Construction of $S_{k}^{T}S_{k}$ and $S_{k}^{T}Y_{k}$

Since the curvature pairs are scale invariant [3], $S_{k}$ can be any random matrix. Therefore, each worker node can construct this matrix by simply sharing random seeds. In fact, the matrix $S_{k}^{T}S_{k}$ need not be communicated to the master node as the master node can construct and store this matrix. With regards to the $S_{k}^{T}Y_{k}$ , each worker node can construct local versions of the $Y_{k}$ curvature pair, $Y_{k}^{i}$ , and send $S_{k}^{T}Y_{k}^{i}$ to the master node for aggregation, i.e., $S_{k}^{T}Y_{k}=\nicefrac{{1}}{{\mathcal{K}}}\sum_{i=1}^{\mathcal{K}}S_{k}^{T}Y_{k}^{i}$ . Thus, the amount of information communicated to the master node is $m^{2}$ .
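The seed-sharing idea can be sketched in a few lines of plain Python. Every simulated worker regenerates the same $S_{k}$ from a shared seed, forms its local $Y_{k}^{i}$, and sends only the $m\times m$ block $S_{k}^{T}Y_{k}^{i}$, which the master averages. As a simplifying assumption the local Hessians are small explicit matrices (as for quadratic local losses), so that $Y_{k}^{i}=H_{i}S_{k}$ exactly; the function names, the seed value, and the data are hypothetical.

```python
import random


def make_S(seed, d, m):
    """Every node calls this with the same seed and obtains the same d x m matrix."""
    rng = random.Random(seed)
    scale = m ** -0.5                      # entries drawn from N(0, 1/m)
    return [[rng.gauss(0.0, scale) for _ in range(m)] for _ in range(d)]


def matmul(A, B):
    """Dense matrix product for list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def worker_message(seed, H_local, m):
    """Worker side: build S locally, form Y_i = H_i S, return the m x m block S^T Y_i."""
    S = make_S(seed, len(H_local), m)
    Y_local = matmul(H_local, S)
    return matmul(transpose(S), Y_local)


def master_reduce(messages):
    """Master side: average the m x m blocks received from the K workers."""
    K, m = len(messages), len(messages[0])
    return [[sum(msg[r][c] for msg in messages) / K for c in range(m)]
            for r in range(m)]


H_locals = [
    [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]],
    [[4.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 1.0]],
]
StY = master_reduce([worker_message(7, H, m=2) for H in H_locals])
print(StY)
```

Each worker sends $m^{2}$ numbers regardless of $d$, and the averaged block agrees (up to rounding) with $S_{k}^{T}Y_{k}$ computed from the averaged Hessian on a single machine.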

Construction of $Y_{k}^{T}Y_{k}$

Constructing the matrix $Y_{k}^{T}Y_{k}$ in distributed fashion, without communicating local $Y_{k}^{i}$ matrices, is not that simple. In our communication-efficient method, we propose that the matrix is approximated via sketching [41], using quantities that are already computed, i.e., $Y_{k}^{T}Y_{k}\approx Y_{k}^{T}S_{k}S_{k}^{T}Y_{k}$ . In order for the sketch to be well defined, we draw $S_{k}\sim\mathcal{N}(0,I/m)$ , which satisfies the conditions of sketching matrices [41]. By using this technique, we construct an approximation to $Y_{k}^{T}Y_{k}$ with no additional communication. Note, the sketch size in our setting is equal to the memory size $m$ . We should also note that this approximation is only used in checking the SR1 condition (2.4), which is not sensitive to approximation errors, and not in the Hessian vector products.

3.2 Balancing the Computation Across the Nodes

Balancing the computation across the nodes does not come for free. We propose the use of a few more rounds of communication. The key idea is to exploit the compact representation of the SR1 matrices and perform as much computation as possible on the worker nodes.

Computing Hessian vector products $B_{k+1}v$

The Hessian vector products (3.1) require products between the matrices $Y_{k}$ , $M_{k}^{-1}$ and a vector $v$ . Suppose that we have $M_{k}^{-1}$ on the master node, and that the master node broadcasts this information as well as the vector $v$ to the worker nodes. The worker nodes then locally compute $M_{k}^{-1}(Y_{k}^{i})^{T}v$ , and send this information back to the master node. The master node then reduces this to form $M_{k}^{-1}(Y_{k})^{T}v$ , and broadcasts this vector back to the worker nodes. This time the worker nodes compute $Y_{k}^{i}M_{k}^{-1}(Y_{k})^{T}v$ locally, and then this quantity is reduced by the master node; the cost of this communication is $d$ . Namely, in order to compute Hessian vector products, the master node performs two aggregations, the bulk of the computation is done on the worker nodes and the communication cost is $m^{2}+2m+2d$ .
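The two-round protocol described above can be simulated on one machine as follows. This is a sketch under stated assumptions: the reduce step is taken to be an average over workers (consistent with the averaging used for $S_{k}^{T}Y_{k}$), the matrix `M_inv` is simply given rather than built from curvature pairs, and all names and numbers are made up for the example.

```python
def matvec(A, x):
    """Compute A x for a list-of-rows matrix A."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, x):
    """Compute A^T x without forming the transpose."""
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]


def mean_vectors(vectors):
    """Master-side reduce: average equal-length vectors."""
    K = len(vectors)
    return [sum(col) / K for col in zip(*vectors)]


def distributed_Bv(Y_locals, M_inv, v):
    """Simulate B v = Y M^{-1} Y^T v with two reduce steps."""
    # Round 1: worker i returns the m-vector M^{-1} (Y^i)^T v.
    round1 = [matvec(M_inv, rmatvec(Y_i, v)) for Y_i in Y_locals]
    z = mean_vectors(round1)                 # equals M^{-1} Y^T v
    # Round 2: z is broadcast; worker i returns the d-vector Y^i z.
    round2 = [matvec(Y_i, z) for Y_i in Y_locals]
    return mean_vectors(round2)              # equals Y M^{-1} Y^T v


Y_locals = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],    # worker 1, d = 3, m = 2
    [[3.0, 2.0], [0.0, 1.0], [1.0, -1.0]],   # worker 2
]
M_inv = [[0.5, 0.0], [0.0, -1.0]]            # assumed to be available on the master
v = [1.0, 2.0, 3.0]
Bv = distributed_Bv(Y_locals, M_inv, v)
print(Bv)
```

The broadcast of `M_inv` and `v` costs $m^{2}+d$ numbers, the first reduce and the second broadcast cost $m$ each, and the final reduce costs $d$, which matches the total $m^{2}+2m+2d$ stated in the text.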

Checking the SR1 Condition 2.4

As proposed in [3], at every iteration condition (2.4) is checked recursively by the master node. For each pair in memory, checking this condition amounts to a Hessian vector product as well as the use of inner products of the curvature pairs. Moreover, it requires the computation of $(M_{k}^{(j)})^{-1}\in\mathbb{R}^{j\times j}$ , for $j=1,\dots,m$ , where $M_{k}^{-1}=(M_{k}^{(m)})^{-1}$ .

Inverse-Free Computation of $M_{k}^{-1}$

The matrix $M_{k}^{-1}$ is non-singular [12], depends solely on inner products of the curvature pairs, and is used in the computation of Hessian vector products (3.1). This matrix is constructed recursively (its dimension grows with the memory) by the master node as condition (2.4) is checked. We propose an inverse-free approach for constructing this matrix. Suppose we have the matrix $(M_{k}^{(j)})^{-1}$ , for some $j=1,\dots,m-1$ , and that the new curvature pair $(s_{k,j+1},y_{k,j+1})$ satisfies (2.4). One can show that

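$$(M_{k}^{(j+1)})^{-1}=\begin{bmatrix}(M_{k}^{(j)})^{-1}+\zeta(M_{k}^{(j)})^{-1}uv^{T}(M_{k}^{(j)})^{-1}&-\zeta(M_{k}^{(j)})^{-1}u\\ -\zeta v^{T}(M_{k}^{(j)})^{-1}&\zeta\end{bmatrix},$$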

where $\zeta=\nicefrac{{1}}{{c-v^{T}(M_{k}^{(j)})^{-1}u}}$ , $v^{T}=s_{k,j+1}^{T}Y_{k,1:l}$ and $Y_{k,1:l}=[y_{k,1},\dots,y_{k,l}]$ for $l\leq j$ , $u=v$ , and $c=s_{k,j+1}^{T}y_{k,j+1}$ . We should note that the matrix $(M_{k}^{(1)})^{-1}$ is a singleton. Consequently, constructing $(M_{k}^{(j)})^{-1}$ in an inverse-free manner allows us to compute Hessian vector products and check condition (2.4) efficiently.
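The recursive, inverse-free construction can be sketched as a bordering update: given $(M^{(j)})^{-1}$, a new column $u$, a new row $v^{T}$ and a corner entry $c$, the inverse of the enlarged matrix is assembled from matrix-vector products and one scalar division by $c-v^{T}(M^{(j)})^{-1}u$. The function name `bordered_inverse` and the numerical values below are assumptions for illustration; the example takes $u=v$ as in the text.

```python
def bordered_inverse(M_inv, u, v, c):
    """Inverse of [[M, u], [v^T, c]] given M_inv = M^{-1}; no matrix inversion is performed."""
    j = len(M_inv)
    Mu = [sum(M_inv[a][b] * u[b] for b in range(j)) for a in range(j)]   # M^{-1} u
    vM = [sum(v[a] * M_inv[a][b] for a in range(j)) for b in range(j)]   # v^T M^{-1}
    schur = c - sum(v[a] * Mu[a] for a in range(j))                      # c - v^T M^{-1} u
    if schur == 0.0:
        raise ZeroDivisionError("enlarged matrix is singular")
    zeta = 1.0 / schur
    top = [[M_inv[a][b] + zeta * Mu[a] * vM[b] for b in range(j)] + [-zeta * Mu[a]]
           for a in range(j)]
    bottom = [-zeta * vM[b] for b in range(j)] + [zeta]
    return top + [bottom]


# Start from the 1 x 1 case M^(1) = [2], whose inverse is the scalar 1/2.
M1_inv = [[0.5]]
# Grow to M^(2) = [[2, 1], [1, 3]].
M2_inv = bordered_inverse(M1_inv, u=[1.0], v=[1.0], c=3.0)
# Grow to M^(3) = [[2, 1, 1], [1, 3, 0.5], [1, 0.5, 4]].
M3_inv = bordered_inverse(M2_inv, u=[1.0, 0.5], v=[1.0, 0.5], c=4.0)
print(M2_inv)
```

Each step costs $\mathcal{O}(j^{2})$ operations, so building $(M_{k}^{(m)})^{-1}$ from the $1\times 1$ case costs $\mathcal{O}(m^{3})$ in total, independent of $d$.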

3.3 The Distributed S-LSR1 (DS-LSR1) Algorithm

Pseudo-code for our proposed distributed variant of the S-LSR1 method and the curvature pair sampling procedure are given in Algorithms 3 and 4, respectively. Right arrows denote broadcast steps and left arrows denote reduce steps. For brevity we omit the details of the distributed CG-Steihaug algorithm (Line 5, Algorithm 3), but note that it is a straightforward adaptation of [30, Algorithm 7.2] using quantities described above computed in distributed fashion.

3.4 Complexity Analysis - Comparison of Methods

We compare the complexity of a naive distributed implementation of S-LSR1 and DS-LSR1. Specifically, we discuss the amount of information communicated at every iteration and the amount of computation performed by the nodes. Tables 2 and 3 summarize the communication and computation costs, respectively, and Table 4 summarizes the details of the quantities presented in the tables.

As is clear from Tables 2 and 3 the amount of information communicated in the naive implementation ( $2md+d+1$ ) is significantly larger than that in the DS-LSR1 method ( $m^{2}+2d+2m+1$ ). Note, $m\ll d$ . This can also be seen in Figure 3 where we show for different dimension $d$ and memory $m$ the number of floats communicated at every iteration. To put this into perspective, consider a training problem where $d=9.2M$ (e.g., VGG11 network [38]) and $m=256$ , DS-LSR1 and naive DS-LSR1 need to communicate $0.0688\,GB$ and $8.8081\,GB$ , respectively, per iteration. In terms of computation, it is clear that in the naive approach the amount of computation is not balanced between the master and worker nodes, whereas for DS-LSR1 the quantities are balanced.
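The per-iteration counts quoted above are easy to evaluate directly. The sketch below encodes the two formulas from the text; the conversion to gigabytes assumes 4-byte floats and $2^{30}$ bytes per GB, which are my assumptions rather than something the text states, and the function names are hypothetical.

```python
def naive_floats(d, m):
    """Floats communicated per iteration by the naive variant: 2md + d + 1."""
    return 2 * m * d + d + 1


def dslsr1_floats(d, m):
    """Floats communicated per iteration by DS-LSR1: m^2 + 2d + 2m + 1."""
    return m * m + 2 * d + 2 * m + 1


def to_gib(num_floats, bytes_per_float=4):
    """Convert a float count to GiB (assumes 4-byte floats by default)."""
    return num_floats * bytes_per_float / 2 ** 30


d, m = 9_200_000, 256
print(dslsr1_floats(d, m), round(to_gib(dslsr1_floats(d, m)), 4))
print(naive_floats(d, m), round(to_gib(naive_floats(d, m)), 4))
```

Under these assumptions the DS-LSR1 count reproduces the quoted $0.0688\,GB$. The naive formula as written gives roughly $17.6\,GB$ for the same values, whereas the quoted $8.8081\,GB$ is reproduced by $md+d$ floats; one possible reading, which is mine and not stated in the text, is that only one of the two $d\times m$ matrices is counted in that figure.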

4 Numerical Experiments

The goals of this section are threefold: $(1)$ To illustrate the scaling properties of the method and compare it to the naive implementation (Figures 4 & 5); $(2)$ To deconstruct the main computational elements of the method and show how they scale in terms of memory (Figure 6); and $(3)$ To illustrate the performance of DS-LSR1 on a neural network training task (Figure 7). We should note upfront that the goal of this section is not to achieve state-of-the-art performance and compare against algorithms that can achieve this, but rather to show that the method is communication efficient and scalable. Note: All algorithms were implemented in Python (PyTorch library), using the MPI for Python distributed environment. The experiments were conducted on XSEDE clusters using GPU nodes. Each physical node includes 4 K80 GPUs, and each MPI process is assigned to a distinct GPU. Code available at: https://github.com/OptMLGroup/DSLSR1.

4.1 Scaling

Weak Scaling

We considered two different types of networks: $(1)$ Shallow (one hidden layer), and $(2)$ Deep (7 hidden layers), and for each varied the number of neurons in the layers (MNIST dataset [26], memory $m=64$ ). Figure 4 shows the time per iteration for DS-LSR1 for different number of variables and batch sizes.

Strong Scaling

We fix the problem size (LeNet, CIFAR10, $d=62006$ [26]), vary the number of nodes and measure the speed-up achieved. Figure 5 illustrates the relative speedup (normalized speedup of each method with respect to the performance of that method on a single node) of the DS-LSR1 method and the naive variant for $m=256$ . The DS-LSR1 method achieves near linear speedup as the number of nodes increases, and the speedup is better than that of the naive approach. We should note that the times of our proposed method are lower than the respective times for the naive implementation. The reasons for this are: $(1)$ DS-LSR1 is inverse free, and $(2)$ the amount of information communicated is significantly smaller.

Scaling of Different Components of DS-LSR1

We deconstruct the main components of the DS-LSR1 method and illustrate the scaling (per iteration) with respect to memory size. Figure 6 shows the scaling for: $(1)$ reduce time; $(2)$ total time; $(3)$ CG time; $(4)$ time to sample $S$ , $Y$ pairs. For all these plots, we ran $10$ iterations, averaged the time and also show the variability. As is clear, our proposed method has lower times for all components of the algorithm. We attribute this to the aforementioned reasons.

4.2 Performance of DS-LSR1

In this section, we show the performance of DS-LSR1 on a neural network training task: LeNet [26], CIFAR10, $n=50000$, $d=62006$, $m=256$. Figure 7 illustrates the training accuracy in terms of wall clock time and amount of data communicated (in GB) (left and center plots), for different numbers of nodes. As expected, training is faster when a larger number of compute nodes is used. Similar results were obtained for testing accuracy. We also plot the performance of the naive implementation (dashed lines) in order to show that: $(1)$ the accuracy achieved is comparable, and $(2)$ one can train faster using our proposed method.

Finally, we show that the curvature pairs chosen by our approach are almost identical to those chosen by the naive approach, even though we use an approximation (via sketching) when checking the SR1 condition. Figure 7 (right plot) shows the Jaccard similarity for the sets of curvature pairs selected by the two methods; the pairs are almost identical, with differences on only a few iterations.
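
For reference, the Jaccard similarity of two sets $A$ and $B$ is $|A\cap B|/|A\cup B|$. The snippet below is a minimal sketch of that computation; the index lists are hypothetical and only stand in for the accepted curvature pairs at one iteration.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of hashable items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Hypothetical indices of curvature pairs accepted at a single iteration.
accepted_naive = [0, 1, 2, 3, 5, 6, 7]
accepted_sketch = [0, 1, 2, 3, 4, 5, 6, 7]
similarity = jaccard_similarity(accepted_naive, accepted_sketch)
```

A value of 1 means both approaches accepted exactly the same pairs at that iteration.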

5 Final Remarks

This paper describes a scalable distributed implementation of the sampled LSR1 method that is communication-efficient, has favorable work-load balancing across nodes, and is matrix-free and inverse-free. The method leverages the compact representation of SR1 matrices and uses sketching techniques to drastically reduce the amount of data communicated at every iteration as compared to a naive distributed implementation. The DS-LSR1 method scales well in terms of both the dimension of the problem and the number of data points.

Acknowledgements

This work was partially supported by the U.S. National Science Foundation, under award numbers NSF:CCF:1618717, NSF:CMMI:1663256 and NSF:CCF:1740796, and XSEDE Startup grant IRI180020.

Appendix 0.A Theoretical Results and Proofs

In this section, we prove a theoretical result about the matrix $(M_{k}^{(j)})^{-1}$ .

Lemma 1

The matrix $M_{k}^{(j+1)}$ , for $j=0,\dots,m-1$ , has the form:

$$M_{k}^{(j+1)}=\begin{bmatrix}M_{k}^{(j)}&u\\ v^{T}&c\end{bmatrix},\tag{0.A.1}$$

where $v^{T}=s_{k,j+1}^{T}Y_{k,1:l}$ with $l\leq j$, $u=v$, and $c=s_{k,j+1}^{T}y_{k,j+1}$. The matrix $M_{k}^{(j+1)}$ is nonsingular, and its inverse can be calculated as follows:

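$$(M_{k}^{(j+1)})^{-1}=\begin{bmatrix}(M_{k}^{(j)})^{-1}+\zeta(M_{k}^{(j)})^{-1}uv^{T}(M_{k}^{(j)})^{-1}&-\zeta(M_{k}^{(j)})^{-1}u\\ -\zeta v^{T}(M_{k}^{(j)})^{-1}&\zeta\end{bmatrix},$$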

where $\zeta=\dfrac{1}{c-v^{T}(M_{k}^{(j)})^{-1}u}$ .

Proof

It is straightforward to show that $M_{k}^{(j+1)}$ shown in (0.A.1) is equivalent to the corresponding matrix in (2.3). Moreover, the second part of the lemma follows immediately from the fact that $M_{k}^{(j+1)}$ is itself nonsingular and symmetric, as shown in [12]. Let us consider the following matrix $M_{k}^{(j+1)}$:

[TABLE]

We know that $M_{k}^{(j)}$ is invertible, and in the following, by simple linear algebra, we calculate the inverse of $M_{k}^{(j+1)}$:

[TABLE]

The last line follows by setting $\zeta=\dfrac{1}{c-v^{T}(M_{k}^{(j)})^{-1}u}$.

Lemma 1 describes a recursive method for computing $(M_{k}^{(j)})^{-1}\in\mathbb{R}^{j\times j}$, for $j=1,\dots,m$. Specifically, one can calculate $(M_{k}^{(j+1)})^{-1}$ using $(M_{k}^{(j)})^{-1}$. Note that the first matrix $(M_{k}^{(1)})^{-1}$ is simply a scalar. Overall, this procedure allows us to compute $(M_{k}^{(j)})^{-1}$ without explicitly computing an inverse.
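
The recursion can be sketched in NumPy as below. This is an illustration under assumptions rather than the authors' implementation: it applies the standard bordered-matrix (Schur complement) inverse with $\zeta=1/(c-v^{T}(M^{(j)})^{-1}u)$, the function names are hypothetical, and the test matrix is a random symmetric positive definite stand-in (so every leading block is nonsingular), not an actual $M_{k}$.

```python
import numpy as np


def bordered_inverse(M_inv, u, v, c):
    """Inverse of [[M, u], [v^T, c]] given M_inv = M^{-1}; uses only products."""
    Mu = M_inv @ u                    # M^{-1} u
    vM = v @ M_inv                    # v^T M^{-1}
    zeta = 1.0 / (c - v @ Mu)         # reciprocal of the Schur complement
    j = M_inv.shape[0]
    out = np.empty((j + 1, j + 1))
    out[:j, :j] = M_inv + zeta * np.outer(Mu, vM)
    out[:j, j] = -zeta * Mu
    out[j, :j] = -zeta * vM
    out[j, j] = zeta
    return out


def recursive_inverse(M):
    """Build M^{-1} by growing the leading block one row and column at a time."""
    M_inv = np.array([[1.0 / M[0, 0]]])   # the 1x1 case is a scalar reciprocal
    for j in range(1, M.shape[0]):
        M_inv = bordered_inverse(M_inv, M[:j, j], M[j, :j], M[j, j])
    return M_inv


rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
M = A @ A.T + np.eye(6)               # stand-in matrix, assumed for the demo only
M_inv = recursive_inverse(M)
```

Each step costs $O(j^{2})$ operations, so building the full $m\times m$ inverse costs $O(m^{3})$ in total while never calling a general-purpose inversion routine.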

Appendix 0.B Additional Algorithm Details

In this section, we present additional details about the S-LSR1 and DS-LSR1 algorithms discussed in Sections 2 and 3.

0.B.1 CG Steihaug Algorithm - Serial

In this section, we describe the CG-Steihaug algorithm [30], which is used for computing the search direction $p_{k}$.
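
Since the algorithm listing is not reproduced here, the following is a minimal serial sketch of the textbook CG-Steihaug method for the trust-region subproblem $\min_{p}\ g^{T}p+\tfrac{1}{2}p^{T}Bp$ subject to $\|p\|\leq\Delta$. The function names, the tolerance, and the choice of the positive root when moving to the boundary are assumptions of this sketch; `hess_vec` is a hypothetical callable returning $Bv$, so the matrix $B$ is never formed.

```python
import numpy as np


def _step_to_boundary(z, d, delta):
    """Return tau >= 0 such that ||z + tau * d|| = delta (requires ||z|| < delta)."""
    a = d @ d
    b = 2.0 * (z @ d)
    c = z @ z - delta ** 2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


def cg_steihaug(hess_vec, g, delta, tol=1e-8, max_iter=None):
    """Approximately minimize g^T p + 0.5 p^T B p subject to ||p|| <= delta."""
    n = g.shape[0]
    max_iter = n if max_iter is None else max_iter
    z = np.zeros(n)
    r = g.copy()          # residual of B z + g at z = 0
    d = -r                # first direction is steepest descent
    if np.linalg.norm(r) < tol:
        return z
    for _ in range(max_iter):
        Bd = hess_vec(d)
        dBd = d @ Bd
        if dBd <= 0.0:    # nonpositive curvature: follow d to the boundary
            return z + _step_to_boundary(z, d, delta) * d
        alpha = (r @ r) / dBd
        z_next = z + alpha * d
        if np.linalg.norm(z_next) >= delta:   # step leaves the trust region
            return z + _step_to_boundary(z, d, delta) * d
        r_next = r + alpha * Bd
        if np.linalg.norm(r_next) < tol:
            return z_next
        beta = (r_next @ r_next) / (r @ r)
        d = -r_next + beta * d
        z, r = z_next, r_next
    return z


B = np.diag([1.0, 2.0])
g = np.array([1.0, 1.0])
p_full = cg_steihaug(lambda v: B @ v, g, delta=10.0)   # interior solution
p_small = cg_steihaug(lambda v: B @ v, g, delta=0.5)   # boundary solution
```

With a large radius the method returns the unconstrained minimizer $-B^{-1}g$; with a small radius it stops on the boundary of the trust region.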

0.B.2 CG Steihaug Algorithm - Distributed

In this section, we describe a distributed variant of the CG-Steihaug algorithm that is used as a subroutine of the DS-LSR1 method. The manner in which Hessian-vector products are computed was discussed in Section 3.

0.B.3 Trust-Region Management Subroutine

In this section, we present the Trust-Region management subroutine $\Delta_{k+1}=\texttt{adjustTR}(\Delta_{k},\rho_{k})$ . See [30] for further details.
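
A typical radius update with the same interface, $\Delta_{k+1}=\texttt{adjustTR}(\Delta_{k},\rho_{k})$, can be sketched as follows. The thresholds and the shrink/grow factors are common textbook defaults assumed for illustration; they are not taken from the paper, whose subroutine may differ.

```python
def adjust_tr(delta, rho, eta_low=0.25, eta_high=0.75,
              shrink=0.5, grow=2.0, delta_max=1e3):
    """Return the next trust-region radius given the current radius and rho.

    rho is the ratio of actual to predicted reduction. All default constants
    are illustrative assumptions, not values from the paper.
    """
    if rho < eta_low:            # poor model agreement: shrink the region
        return shrink * delta
    if rho > eta_high:           # very good agreement: allow a larger step
        return min(grow * delta, delta_max)
    return delta                 # acceptable agreement: keep the radius


radius_after_bad_step = adjust_tr(1.0, 0.1)
radius_after_good_step = adjust_tr(1.0, 0.9)
```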

0.B.4 Load Balancing

In distributed algorithms, it is very important to balance the work-load across nodes. In order for an algorithm to be scalable, every machine (worker) should be assigned a similar amount of computation, and each machine should be equally busy. According to Amdahl’s law [33], if the parallel/distributed algorithm spends a fraction $t$ of its time running on only one of the machines (e.g., the master node), the theoretical speedup (SU) is limited to at most

[TABLE]

As is clear from Tables 2 and 3, the DS-LSR1 method makes each machine almost equally busy, and as a result DS-LSR1 has a near linear speedup. On the other hand, in the naive DS-LSR1 approach the master node is significantly busier than the remainder of the nodes, and thus by Amdahl’s law, the speedup will not be linear and is bounded above by (0.B.1).
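
To make the effect of the serial fraction concrete, the sketch below evaluates the standard form of Amdahl's law, $1/(t+(1-t)/K)$ for $K$ nodes, which approaches $1/t$ as $K$ grows. The exact expression in (0.B.1) is not reproduced in this text, so the formula is the textbook one and the function name is hypothetical.

```python
def amdahl_speedup(t, num_nodes):
    """Textbook Amdahl bound: a fraction t of the run time is serial."""
    return 1.0 / (t + (1.0 - t) / num_nodes)


# Upper bounds on speedup for 8 nodes under different serial fractions.
bound_no_serial = amdahl_speedup(0.0, 8)     # perfectly balanced work
bound_some_serial = amdahl_speedup(0.1, 8)   # 10% of the time on one node
```

Even a modest serial fraction on the master node therefore caps the attainable speedup well below the number of nodes.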

0.B.5 Communication and Computation Details

In this section, we present details about the quantities that are communicated and computed at every iteration of the distributed S-LSR1 methods. All the quantities below are in Tables 2 and 3.

0.B.6 Floats Communicated per Iteration

In this section, we show the number of floats communicated per iteration for DS-LSR1 and naive DS-LSR1 for different memory sizes and dimensions.

Appendix 0.C Additional Numerical Experiments and Experimental Details

In this section, we present additional experiments and experimental details.

0.C.1 Initial Hessian Approximation $B_{k}^{(0)}$

In this section, we show additional results motivating the use of $B_{k}^{(0)}$. Figure 9 is identical to Figure 2. Figure 10 shows similar results for a larger problem. See [3] for details about the problems.

0.C.2 Shallow and Deep Network Details

In this section, we describe the networks used in the weak scaling experiments. For the problems corresponding to Tables 5 and 6, we used ReLU activation functions and soft-max cross-entropy loss.

0.C.3 Weak Scaling

In this section, we show the weak scaling properties of DS-LSR1 for two different networks, different batch sizes and different number of variables.

0.C.4 Strong Scaling

In this section, we show the strong scaling properties of DS-LSR1 and naive DS-LSR1 for different memory sizes. The problem details for these experiments were as follows: LeNet, CIFAR10, $d=62006$ , [26].

0.C.5 Scaling of Different Components of DS-LSR1

In this section, we show the scaling properties of the different components of the DS-LSR1 method and compare with the naive distributed implementation. We deconstruct the main components of the DS-LSR1 method and illustrate the scaling with respect to memory. Specifically, we show the scaling for: $(1)$ reduce time/iteration; $(2)$ time/iteration; $(3)$ CG time/iteration; $(4)$ time to sample $S$ , $Y$ pairs/iteration. For all these plots, we ran $10$ iterations and averaged the time, and also show the variability.

0.C.6 Performance of DS-LSR1

In this section, we show training and testing accuracy in terms of wall clock time and amount of data communicated (in GB).

Bibliography

[1] Agarwal, A., Chapelle, O., Dudík, M., Langford, J.: A reliable effective terascale linear learning system. Journal of Machine Learning Research 15, 1111–1133 (2014)
[2] Berahas, A.S., Bollapragada, R., Nocedal, J.: An investigation of newton-sketch and subsampled newton methods. arXiv preprint arXiv:1705.06211 (2017)
[3] Berahas, A.S., Jahani, M., Takáč, M.: Quasi-newton methods for deep learning: Forget the past, just sample. arXiv preprint arXiv:1901.09997 (2019)
[4] Berahas, A.S., Nocedal, J., Takáč, M.: A multi-batch l-bfgs method for machine learning. In: NeurIPS. pp. 1055–1063 (2016)
[5] Berahas, A.S., Takáč, M.: A robust multi-batch l-bfgs method for machine learning. Optimization Methods and Software 35(1), 191–219 (2020)
[6] Bollapragada, R., Byrd, R.H., Nocedal, J.: Exact and inexact subsampled newton methods for optimization. IMA Journal of Numerical Analysis (2016)
[7] Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer (2010)
[8] Bottou, L., Cun, Y.L.: Large scale online learning. In: NeurIPS. pp. 217–224 (2004)
