A continuous-time analysis of distributed stochastic gradient

Nicholas M. Boffi; Jean-Jacques E. Slotine

arXiv:1812.10995·math.OC·December 18, 2020

TL;DR

This paper studies how synchronization in distributed stochastic gradient algorithms reduces noise and improves convergence, using a biological analogy and providing theoretical and empirical evidence on non-convex objectives and neural networks.

Contribution

It introduces a quorum sensing-inspired analysis of synchronization effects, derives convergence bounds, and explores new algorithms with regularizing properties for distributed and non-distributed optimization.

Findings

1. Synchronization reduces noise in distributed SGD.
2. Coupling stabilizes higher noise levels and enhances convergence.
3. EASGD exhibits a surprising regularizing effect even in non-distributed settings.

Abstract

We analyze the effect of synchronization on distributed stochastic gradient algorithms. By exploiting an analogy with dynamical models of biological quorum sensing - where synchronization between agents is induced through communication with a common signal - we quantify how synchronization can significantly reduce the magnitude of the noise felt by the individual distributed agents and by their spatial mean. This noise reduction is in turn associated with a reduction in the smoothing of the loss function imposed by the stochastic gradient approximation. Through simulations on model non-convex objectives, we demonstrate that coupling can stabilize higher noise levels and improve convergence. We provide a convergence analysis for strongly convex functions by deriving a bound on the expected deviation of the spatial mean of the agents from the global minimizer for an algorithm based on…

Tables (1)

Table 1: Comparison of minimum test loss achieved, minimum error achieved, and minimum training loss achieved for EASGD-WM, EASGD, and MSGD on the CIFAR-10 dataset (each with $p=1$, providing details on the effect of hyperparameter choices not seen in Fig. 10). Each experiment was run three times and the minimum was taken over the average trajectory. In each run, the algorithms were initialized from the same starting location. Surprisingly, EASGD-WM consistently achieves the lowest test error (all but one setting) and the lowest test loss (all but four settings) in comparison to EASGD and MSGD. For high learning rate and high $\delta$, MSGD and EASGD eventually run into convergence issues, while EASGD-WM does not (an error of .9 and a test loss of 6.91 indicate convergence issues).

		Minimum Test Loss						Minimum Error
		$δ = .1$	$δ = .25$	$δ = .5$	$δ = .75$	$δ = .9$	$δ = .99$	$δ = .1$	$δ = .25$	$δ = .5$	$δ = .75$	$δ = .9$	$δ = .99$
$η = .005$	EASGD-WM	4.25	4.26	4.28	4.29	3.22	3.24	.267	.266	.264	.269	.268	.270
	EASGD	4.72	4.83	4.56	4.28	3.17	3.11	.304	.313	.301	.282	.280	.277
	MSGD	4.75	4.87	4.64	4.33	3.21	3.29	.310	.323	.306	.286	.292	.295

$η = .01$	EASGD-WM	4.09	5.15	4.03	4.12	3.15	3.10	.262	.259	.253	.261	.267	.257
	EASGD	4.57	5.75	4.27	4.14	3.04	3.22	.297	.300	.280	.275	.263	.283
	MSGD	4.59	5.81	4.48	4.46	3.22	3.33	.300	.307	.294	.294	.287	.301

$η = .05$	EASGD-WM	3.95	3.96	3.86	3.97	3.07	3.00	.252	.258	.250	.255	.262	.253
	EASGD	4.41	4.27	4.05	4.06	3.19	4.04	.286	.283	.265	.267	.276	.417
	MSGD	4.46	4.57	4.48	4.43	3.21	6.91	.294	.307	.295	.290	.292	0.9

$η = .1$	EASGD-WM	4.08	4.01	4.04	4.05	3.11	3.15	.267	.264	.268	.265	.268	.269
	EASGD	4.24	4.23	4.14	4.13	3.17	6.91	.282	.283	.277	.272	.280	0.9
	MSGD	4.62	4.55	4.22	4.47	3.38	6.91	.288	.307	.287	.288	.301	0.9

Equations (316)

$$f(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N} l(\mathbf{x},\mathbf{y}^{i}).$$

$$\hat{\mathbf{g}}(\mathbf{x}) = \frac{1}{b}\sum_{\mathbf{y}\in\mathcal{B}} \nabla l(\mathbf{x},\mathbf{y})$$

$$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta\hat{\mathbf{g}}(\mathbf{x}_{t}).$$

$$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta\nabla f(\mathbf{x}_{t}) - \frac{\eta}{\sqrt{b}}\boldsymbol{\zeta}_{t},$$

$$\boldsymbol{\Sigma}(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N}\left[\left(\nabla l(\mathbf{x},\mathbf{y}^{i}) - \nabla f(\mathbf{x})\right)\left(\nabla l(\mathbf{x},\mathbf{y}^{i}) - \nabla f(\mathbf{x})\right)^{T}\right].$$

$$d\mathbf{x} = \left(-\nabla f(\mathbf{x}) - \frac{1}{4}\eta\nabla\|\nabla f(\mathbf{x})\|^{2}\right)dt + \sqrt{\frac{\eta}{b}}\mathbf{B}(\mathbf{x})d\mathbf{W}$$

$$d\mathbf{x} = -\nabla f(\mathbf{x})dt + \sqrt{\frac{\eta}{b}}\mathbf{B}(\mathbf{x})d\mathbf{W}.$$

$$\min_{\mathbf{x}} F(\mathbf{x}) = \mathbb{E}_{\boldsymbol{\zeta}}\left[f(\mathbf{x},\boldsymbol{\zeta})\right],$$

$$\min_{\mathbf{x}^{1},\dots,\mathbf{x}^{p},\tilde{\mathbf{x}}} \sum_{i=1}^{p}\left(\mathbb{E}_{\boldsymbol{\zeta}^{i}}\left[f(\mathbf{x}^{i},\boldsymbol{\zeta}^{i})\right] + \frac{k}{2}\|\mathbf{x}^{i}-\tilde{\mathbf{x}}\|^{2}\right),$$

$$\mathbf{x}_{t+1}^{i}$$

$$\tilde{\mathbf{x}}_{t+1}$$

$$d\mathbf{x}^{i}$$

$$d\tilde{\mathbf{x}}$$

$$d\mathbf{x}^{i} = \left(-\nabla f(\mathbf{x}^{i}) + k(\mathbf{x}^{\bullet}-\mathbf{x}^{i})\right)dt + \sqrt{\frac{\eta}{b}}\mathbf{B}(\mathbf{x}^{i})d\mathbf{W}^{i}.$$

$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},t),$$

$$\left(\boldsymbol{\Theta}\nabla\mathbf{f}(\mathbf{x},t)\boldsymbol{\Theta}^{-1}\right)_{s} \leq -\lambda\mathbf{I}$$

$$\|\mathbf{x}_{1}(t)-\mathbf{x}_{2}(t)\|_{\mathbf{M}} \leq e^{-\lambda t}\|\mathbf{x}_{1}(0)-\mathbf{x}_{2}(0)\|_{\mathbf{M}}$$

$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},t) + \boldsymbol{\epsilon}(\mathbf{x},t).$$

$$\dot{R} + \lambda R \leq \|\boldsymbol{\Theta}\boldsymbol{\epsilon}(\mathbf{x},t)\|$$

$$\|\mathbf{x}_{1}(t)-\mathbf{x}_{2}(t)\| \leq \frac{\chi B}{\lambda}$$

$$R(t) \leq \|\boldsymbol{\Theta}\|\left(\frac{B}{\lambda} + \frac{Ae^{-\lambda t}}{a-\lambda} - \frac{Be^{-\lambda t}}{\lambda} - \frac{Ae^{-at}}{a-\lambda}\right)$$

$$d\mathbf{x} = \mathbf{f}(\mathbf{x},t)dt + \boldsymbol{\sigma}(\mathbf{x},t)d\mathbf{W},$$

$$\mathbb{E}\left[\|\mathbf{a}(t)-\mathbf{b}(t)\|^{2}\right]$$

$$\mathbb{E}\left[\|\mathbf{x}(t)-\mathbf{x}_{nf}(t)\|^{2}\right] \leq \frac{1}{\beta}\left(\mathbb{E}\left[\left(\|\mathbf{x}(0)-\mathbf{x}_{nf}(0)\|_{\mathbf{M}}^{2} - \frac{C}{2\lambda}\right)^{+}\right]e^{-2\lambda t} + \frac{C}{2\lambda}\right).$$

$$\dot{\mathbf{y}} = \mathbf{g}(\mathbf{y},\mathbf{x},t)$$

$$d\mathbf{y} = \mathbf{g}(\mathbf{y},\mathbf{x},t)dt + \boldsymbol{\Xi}(\mathbf{y},\mathbf{x},t)d\mathbf{W}$$

$$\dot{\mathbf{x}}$$

$$d\mathbf{y}$$

$$\dot{\mathbf{x}}^{i} = -\nabla f(\mathbf{x}^{i}) + k(\mathbf{z}-\mathbf{x}^{i}),$$

$$\dot{\mathbf{y}} = -\nabla f(\mathbf{y}) + k(\mathbf{z}-\mathbf{y}),$$

$$\mathbf{J} = -\nabla^{2}f(\mathbf{y}) - k\mathbf{I}.$$

$$\mathbb{E}\left[\sum_{i}\|\mathbf{x}^{i}-\mathbf{x}^{\bullet}\|^{2}\right] \leq \frac{(p-1)C\eta}{2b(k-\bar{\lambda})}$$

Taxonomy

Methods: Stochastic Gradient Descent

Full text

A continuous-time analysis of distributed stochastic gradient

Nicholas M. Boffi

John A. Paulson School of Engineering and Applied Sciences

Harvard University

Cambridge, MA 02138

Jean-Jacques E. Slotine

Nonlinear Systems Laboratory

Massachusetts Institute of Technology

Cambridge, MA 02139

Abstract

We analyze the effect of synchronization on distributed stochastic gradient algorithms. By exploiting an analogy with dynamical models of biological quorum sensing – where synchronization between agents is induced through communication with a common signal – we quantify how synchronization can significantly reduce the magnitude of the noise felt by the individual distributed agents and by their spatial mean. This noise reduction is in turn associated with a reduction in the smoothing of the loss function imposed by the stochastic gradient approximation. Through simulations on model non-convex objectives, we demonstrate that coupling can stabilize higher noise levels and improve convergence. We provide a convergence analysis for strongly convex functions by deriving a bound on the expected deviation of the spatial mean of the agents from the global minimizer for an algorithm based on quorum sensing, the same algorithm with momentum, and the Elastic Averaging SGD (EASGD) algorithm. We discuss extensions to new algorithms that allow each agent to broadcast its current measure of success and shape the collective computation accordingly. We supplement our theoretical analysis with numerical experiments on convolutional neural networks trained on the CIFAR-10 dataset, where we note a surprising regularizing property of EASGD even when applied to the non-distributed case. This observation suggests alternative second-order in-time algorithms for non-distributed optimization that are competitive with momentum methods.

1 Introduction

Stochastic gradient descent (SGD) and its variants have become the de facto algorithms for large-scale machine learning applications such as deep neural networks [Bottou, 2010, Goodfellow et al., 2016, LeCun et al., 2015, Mallat, 2016]. SGD is used to optimize finite-sum loss functions, where a stochastic approximation to the gradient is computed using only a random selection of the input data points. Well-known results on almost-sure convergence rates to global minimizers for strictly convex functions and to stationary points for non-convex functions exist under sufficient regularity conditions [Bottou, 1998, Robbins and Siegmund, 1971]. Classic work on iterate averaging for SGD [Polyak and Juditsky, 1992] and other more recent extensions [Defazio et al., 2014, Roux et al., 2012, Bach and Moulines, 2013, Schmidt et al., 2017] can improve convergence under a set of reasonable assumptions typically satisfied in the machine learning setting. Convergence proofs rely on a suitably chosen decreasing step size; for constant step sizes and strictly convex functions, the parameters ultimately converge to a distribution peaked around the optimum.

For large-scale machine learning applications, parallelization of SGD is a critical problem of significant modern research interest [Dean et al., 2012, Recht and Ré, 2013, Recht et al., 2011, Chaudhari et al., 2017]. Recent work in this direction includes the Elastic Averaging SGD (EASGD) algorithm, in which $p$ distributed agents coupled through a common signal optimize the same loss function. EASGD can be derived from a single SGD step on a global variable consensus objective with a quadratic penalty, and the common signal takes the form of an average over space and time of the parameter vectors of the individual agents [Zhang et al., 2015, Boyd et al., 2010]. At its core, the EASGD algorithm is a system of identical, coupled, discrete-time dynamical systems. And indeed, the EASGD algorithm has exactly the same structure as earlier mathematical models of synchronization [Russo and Slotine, 2010, Chung and Slotine, 2009] inspired by quorum sensing in bacteria [Miller and Bassler, 2001, Waters and Bassler, 2005]. In these models, which have typically been analyzed in continuous-time, the dynamics of the common (quorum) signal can be arbitrary [Russo and Slotine, 2010], and in fact may simply consist of a weighted average of individual signals. Motivated by this immediate analogy, we present here a continuous-time analysis of distributed stochastic gradient algorithms, of which EASGD is a special case. A significant focus of this work is the interaction between the degree of synchronization of the individual agents, characterized rigorously by a bound on the expected distance between all agents and governed by the coupling strength, and the amount of noise induced by their stochastic gradient approximations.

The effect of coupling between identical continuous-time dynamical systems has a rich history. In particular, synchronization phenomena in such coupled systems have been the subject of much mathematical [Wang and Slotine, 2005], biological [Russo and Slotine, 2010], neuroscientific [Tabareau et al., 2010], and physical interest [Javaloyes et al., 2008]. In nonlinear dynamical systems, synchronization has been shown to play a crucial role in protection of the individual systems from independent sources of noise [Tabareau et al., 2010]. The interaction between synchronization and noise has also been posed as a possible source of regularization in biological learning, where quorum sensing-like mechanisms could be implemented between neurons through local field potentials [Bouvrie and Slotine, 2013]. Given the significance of stochastic gradient [Zhang et al., 2018b] and externally injected [Neelakantan et al., 2015] noise in regularization of large-scale machine learning models such as deep networks [Zhang et al., 2017], it is natural to expect that the interplay between synchronization of the individual agents and the noise from their stochastic gradient approximations is of central importance in distributed SGD algorithms.

Recently, there has been renewed interest in a continuous-time view of optimization algorithms [Wilson et al., 2016, Wibisono et al., 2016, Wibisono and Wilson, 2015, Betancourt et al., 2018]. Nesterov’s accelerated gradient method [Nesterov, 1983] was fruitfully analyzed in continuous-time in Su et al. [2014], and a unifying extension to other algorithms can be found in Wibisono et al. [2016]. Continuous-time analysis has also enabled discrete-time algorithm development through classical discretization techniques from numerical analysis [Zhang et al., 2018a]. This paper further adds to this line of work by deriving new results with the mathematical tools afforded by the continuous-time view, such as stochastic calculus and nonlinear contraction analysis [Lohmiller and Slotine, 1998].

The paper is organized as follows. In Sec. 2, we provide some necessary mathematical preliminaries: a review of SGD in continuous-time, a continuous-time limit of the EASGD algorithm, a review of stochastic nonlinear contraction theory, and a statement of some needed assumptions. In Sec. 3, we demonstrate that the effect of synchronization of the distributed SGD agents is to reduce the magnitude of the noise felt by each agent and by their spatial mean. We derive this for an algorithm where all-to-all coupling is implemented through communication with the spatial mean of the distributed parameters, and we refer to this algorithm as quorum SGD (QSGD). In the appendix, a similar derivation is presented with arbitrary dynamics for the quorum variable, of which EASGD is a special case. In Sec. 4, we connect this noise reduction property with a recent analysis in Kleinberg et al. [2018], which shows SGD can be interpreted as performing gradient descent on a smoothed loss in expectation. We use this derivation to garner intuition about the qualitative performance of distributed SGD algorithms as the coupling strength is varied, and we verify this intuition with simulations on model nonconvex loss functions in low and high dimensions. In Sec. 5, we provide new convergence results for QSGD, QSGD with momentum, and EASGD for a strongly convex objective. In Sec. 6, we explore the properties of EASGD and QSGD on deep neural networks, and in particular, test the stability and performance of variants proposed throughout the paper. We also propose a new class of second-order in time algorithms motivated by the EASGD algorithm with a single agent, which consists of standard SGD coupled in feedback to the output of a nonlinear filter of the parameters. We close with some concluding remarks in Sec. 7.

2 Mathematical preliminaries

In this section, we provide a brief review of the necessary mathematical tools employed in this work.

2.1 Convex optimization

For the convergence proofs in Sec. 5, and for synchronization of momentum methods, we will require a few standard definitions from convex optimization.

Definition 2.1.

(Strong Convexity) A function $f\in\mathcal{C}^{2}(\mathbb{R}^{n},\mathbb{R})$ is $l$-strongly convex with $l>0$ if its Hessian is uniformly lower bounded by $l\mathbf{I}$ with respect to the positive semidefinite order, $\nabla^{2}f(\mathbf{x})\geq l\mathbf{I}$ for all $\mathbf{x}\in\mathbb{R}^{n}$.

Definition 2.2.

(L-Smoothness) A function $f\in\mathcal{C}^{2}(\mathbb{R}^{n},\mathbb{R})$ is $L$-smooth with $L>0$ if its Hessian is uniformly upper bounded by $L\mathbf{I}$ with respect to the positive semidefinite order, $\nabla^{2}f(\mathbf{x})\leq L\mathbf{I}$ for all $\mathbf{x}\in\mathbb{R}^{n}$.

2.2 Stochastic gradient descent in discrete-time

Minibatch SGD has been essential for training large-scale machine learning models such as deep neural networks, where empirical risk minimization leads to finite-sum loss functions of the form

$$f(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N} l(\mathbf{x},\mathbf{y}^{i}).$$

Above, $\mathbf{y}^{i}$ is the $i^{th}$ input data example and the vector $\mathbf{x}$ holds the model parameters. In the typical machine learning setting where $N$ is very large, the gradient of $f$ requires $N$ gradient computations of $l$ , which is prohibitively expensive.

To avoid this calculation, a stochastic gradient is computed by taking a random selection $\mathcal{B}$ of size $b<N$ , typically known as a minibatch. It is simple to see that the stochastic gradient

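$$\hat{\mathbf{g}}(\mathbf{x}) = \frac{1}{b}\sum_{\mathbf{y}\in\mathcal{B}} \nabla l(\mathbf{x},\mathbf{y})$$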

is an unbiased estimator of the true gradient. The parameters are updated according to the iteration

$$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta\hat{\mathbf{g}}(\mathbf{x}_{t}).$$

By adding and subtracting the true gradient, the SGD iteration can be rewritten

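$$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta\nabla f(\mathbf{x}_{t}) - \frac{\eta}{\sqrt{b}}\boldsymbol{\zeta}_{t}, \tag{1}$$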

where $\boldsymbol{\zeta}_{t}\sim N(0,\boldsymbol{\Sigma}(\mathbf{x}_{t}))$ is a data-dependent noise term. $\boldsymbol{\zeta}_{t}$ can be taken to be Gaussian under a central limit theorem argument, assuming that the size of the minibatch is large enough [Jastrzȩbski et al., 2017, Mandt et al., 2015]. $\boldsymbol{\Sigma}(\mathbf{x})$ is then given by the covariance of a single-element stochastic gradient

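$$\boldsymbol{\Sigma}(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N}\left[\left(\nabla l(\mathbf{x},\mathbf{y}^{i}) - \nabla f(\mathbf{x})\right)\left(\nabla l(\mathbf{x},\mathbf{y}^{i}) - \nabla f(\mathbf{x})\right)^{T}\right].$$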
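The minibatch construction in this subsection can be sketched numerically. The example below is only an illustration under stated assumptions: the scalar per-example loss $l(x,y)=\frac{1}{2}(x-y)^{2}$, the synthetic Gaussian data, and all function names are hypothetical choices that do not come from the paper. It checks that the minibatch gradient averages to the full gradient and runs the plain SGD iteration.

```python
import random

def grad_l(x, y):
    """Gradient in x of the assumed per-example loss l(x, y) = 0.5 * (x - y)**2."""
    return x - y

def full_gradient(x, data):
    """Gradient of the finite-sum loss f(x) = (1/N) * sum_i l(x, y_i)."""
    return sum(grad_l(x, y) for y in data) / len(data)

def minibatch_gradient(x, data, b, rng):
    """Stochastic gradient: average of grad_l over a random minibatch of size b."""
    batch = rng.sample(data, b)
    return sum(grad_l(x, y) for y in batch) / b

def sgd(x0, data, eta, b, steps, rng):
    """Run the iteration x_{t+1} = x_t - eta * g_hat(x_t)."""
    x = x0
    for _ in range(steps):
        x -= eta * minibatch_gradient(x, data, b, rng)
    return x

rng = random.Random(0)
data = [rng.gauss(1.0, 2.0) for _ in range(1000)]  # synthetic data (assumption)

# Averaging many minibatch gradients at a fixed x approaches the full gradient.
x = 3.0
draws = [minibatch_gradient(x, data, 10, rng) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)
true_grad = full_gradient(x, data)

# With a constant step size, SGD hovers around the minimizer (the data mean here).
x_final = sgd(5.0, data, 0.05, 10, 2000, rng)
data_mean = sum(data) / len(data)
```

For this quadratic loss the minimizer of $f$ is the data mean, so `x_final` should land near `data_mean` while still fluctuating because of the constant step size.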

2.3 Stochastic gradient descent in continuous-time

A significant difficulty in a continuous-time analysis of SGD is formulating an accurate stochastic differential equation (SDE) model. Recent works have proved rigorously [Hu et al., 2017, Feng et al., 2018, Li et al., 2018] that the sequence of values $\mathbf{x}(k\eta)$ generated by the SDE

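$$d\mathbf{x} = \left(-\nabla f(\mathbf{x}) - \frac{1}{4}\eta\nabla\|\nabla f(\mathbf{x})\|^{2}\right)dt + \sqrt{\frac{\eta}{b}}\mathbf{B}(\mathbf{x})d\mathbf{W}$$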

approximates the SGD iteration with weak error $\mathcal{O}(\eta^{2})$, where $\mathbf{W}$ is a Wiener process, $\|\cdot\|$ denotes the Euclidean 2-norm (for the remainder of this paper, unless otherwise specified, $\|\cdot\|$ denotes the 2-norm), and where $\mathbf{B}\mathbf{B}^{T}=\boldsymbol{\Sigma}$. Dropping the small drift term proportional to $\eta$ lowers the order of accuracy, giving a weak error of $\mathcal{O}(\eta)$ [Hu et al., 2017]. This leads to the SDE

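$$d\mathbf{x} = -\nabla f(\mathbf{x})dt + \sqrt{\frac{\eta}{b}}\mathbf{B}(\mathbf{x})d\mathbf{W}. \tag{2}$$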

Equation (2) has appeared in a number of recent works [Mandt et al., 2017, 2016, 2015, Chaudhari and Soatto, 2018, Chaudhari et al., 2018, Jastrzȩbski et al., 2017], and is generally obtained by making the replacement $\eta\rightarrow dt$ and $\sqrt{\eta}\boldsymbol{\zeta}\rightarrow\mathbf{B}d\mathbf{W}$ in (1), as a sort of reverse Euler-Maruyama discretization [Kloeden and Platen, 1992].
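To make the link between the SDE model and the discrete iteration concrete, here is a minimal Euler-Maruyama sketch. It assumes a scalar quadratic objective $f(x)=\frac{1}{2}x^{2}$ and a constant noise amplitude `sigma` standing in for $\mathbf{B}$; both are illustrative assumptions, as are the function and variable names. With the integration step `dt` set equal to `eta`, each update has the same form as the SGD iteration with Gaussian gradient noise.

```python
import math
import random

def euler_maruyama_sgd_sde(x0, eta, b, sigma, dt, steps, rng):
    """Integrate dx = -f'(x) dt + sqrt(eta / b) * sigma dW for f(x) = 0.5 * x**2."""
    x = x0
    path = []
    noise_scale = math.sqrt(eta / b) * sigma
    for _ in range(steps):
        dW = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment over a step of length dt
        x += -x * dt + noise_scale * dW
        path.append(x)
    return path

eta, b, sigma = 0.1, 4, 1.0
rng = random.Random(1)
# Choosing dt = eta gives the update x - eta * x + (eta / sqrt(b)) * sigma * N(0, 1).
path = euler_maruyama_sgd_sde(0.0, eta, b, sigma, dt=eta, steps=200000, rng=rng)

tail = path[1000:]  # discard the initial transient
mean = sum(tail) / len(tail)
empirical_var = sum((v - mean) ** 2 for v in tail) / len(tail)

# Stationary variance of the continuous-time (Ornstein-Uhlenbeck) model.
sde_var = eta * sigma ** 2 / (2 * b)
```

The empirical variance is expected to sit slightly above `sde_var`, because the discrete recursion has stationary variance $\eta\sigma^{2}/(b(2-\eta))$ rather than $\eta\sigma^{2}/(2b)$; the gap shrinks as $\eta\rightarrow 0$.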

2.4 EASGD in continuous-time

Following [Zhang et al., 2015], we provide a brief introduction to the EASGD algorithm, and convert the resulting sequences to continuous-time. We imagine a distributed optimization setting with $p\in\mathbb{N}$ agents and a single master. We are interested in solving a stochastic optimization problem

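$$\min_{\mathbf{x}} F(\mathbf{x}) = \mathbb{E}_{\boldsymbol{\zeta}}\left[f(\mathbf{x},\boldsymbol{\zeta})\right],$$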

where $\mathbf{x}\in\mathbb{R}^{n}$ is the vector of parameters and $\boldsymbol{\zeta}$ is a random variable representing the stochasticity in the objective. This is equivalent to the distributed optimization problem [Boyd et al., 2010]

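$$\min_{\mathbf{x}^{1},\dots,\mathbf{x}^{p},\tilde{\mathbf{x}}} \sum_{i=1}^{p}\left(\mathbb{E}_{\boldsymbol{\zeta}^{i}}\left[f(\mathbf{x}^{i},\boldsymbol{\zeta}^{i})\right] + \frac{k}{2}\|\mathbf{x}^{i}-\tilde{\mathbf{x}}\|^{2}\right),$$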

where each $\mathbf{x}^{i}$ is a local vector of parameters and $\tilde{\mathbf{x}}$ is the quorum variable. The quadratic penalty ensures that all local agents remain close to $\tilde{\mathbf{x}}$ , and $k$ sets the coupling strength. Smaller values of $k$ allow for more exploration, while larger values ensure a greater degree of synchronization. Intuitively, the interaction between agents mediated by $\tilde{\mathbf{x}}$ is expected to help individual trajectories escape local minima, saddle points, and flat regions, unless they all fall into the same deep or wide minimum together.

We assume the expectation in (2.4) is approximated by a sum over input data points, and that the stochastic gradient is computed by taking a minibatch of size $b$ . After taking an SGD step, the updates for each agent and the quorum variable become

[TABLE]

where $\mathbf{x}^{\bullet}_{t}=\frac{1}{p}\sum_{i=1}^{p}\mathbf{x}^{i}_{t}$ and $\mathbb{E}\left[\boldsymbol{\zeta}^{i}\left(\boldsymbol{\zeta}^{i}\right)^{T}\right]=\boldsymbol{\Sigma}(\mathbf{x}^{i}_{t})$ . Transferring to the continuous-time limit, these equations become,

[TABLE]

with $\mathbf{B}\mathbf{B}^{T}=\boldsymbol{\Sigma}$ . Note that in (5), the dynamics of $\tilde{\mathbf{x}}$ represent a simple low-pass filter of the center of mass (spatial mean) variable $\mathbf{x}^{\bullet}$ . In the limit of large $p$ , the dynamics of this filter will be much faster than the SGD dynamics, and the continuous-time EASGD system can be approximately replaced by

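$$d\mathbf{x}^{i} = \left(-\nabla f(\mathbf{x}^{i}) + k(\mathbf{x}^{\bullet}-\mathbf{x}^{i})\right)dt + \sqrt{\frac{\eta}{b}}\mathbf{B}(\mathbf{x}^{i})d\mathbf{W}^{i}. \tag{6}$$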

We refer to (6) as Quorum SGD (QSGD), and it will be a significant focus of this work.
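A small simulation can illustrate how the coupling strength $k$ in QSGD controls how tightly the agents cluster around their spatial mean. This is a sketch under assumptions that are not in the paper: a scalar quadratic objective $f(x)=\frac{1}{2}x^{2}$, a constant noise amplitude `sigma` in place of $\mathbf{B}$, and hypothetical names throughout.

```python
import math
import random

def simulate_qsgd(p, k, eta, b, sigma, dt, steps, rng):
    """Euler-Maruyama for dx_i = (-x_i + k * (x_mean - x_i)) dt + sqrt(eta / b) * sigma dW_i.

    Uses f(x) = 0.5 * x**2 and constant noise amplitude sigma (both assumptions).
    Returns the time-averaged squared spread sum_i (x_i - x_mean)**2.
    """
    xs = [rng.gauss(0.0, 1.0) for _ in range(p)]
    noise_scale = math.sqrt(eta / b) * sigma
    total, count = 0.0, 0
    for step in range(steps):
        x_mean = sum(xs) / p  # spatial mean shared by all agents
        xs = [x + (-x + k * (x_mean - x)) * dt
              + noise_scale * rng.gauss(0.0, math.sqrt(dt)) for x in xs]
        if step >= steps // 2:  # discard the first half as transient
            m = sum(xs) / p
            total += sum((x - m) ** 2 for x in xs)
            count += 1
    return total / count

rng = random.Random(2)
spread_weak = simulate_qsgd(p=8, k=0.0, eta=0.1, b=1, sigma=1.0,
                            dt=0.01, steps=20000, rng=rng)
spread_strong = simulate_qsgd(p=8, k=10.0, eta=0.1, b=1, sigma=1.0,
                              dt=0.01, steps=20000, rng=rng)
```

In this linear toy model the deviations from the mean relax at rate $1+k$, so the spread with $k=10$ should be roughly an order of magnitude smaller than the uncoupled spread.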

2.5 Background on nonlinear contraction theory

The main mathematical tool used in this work is nonlinear contraction theory, a form of incremental stability for nonlinear systems. In particular, we specialize to the case of time- and state-independent metrics; further details can be found in Lohmiller and Slotine [1998].

Definition 2.3.

(Contraction) The nonlinear dynamical system

$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},t), \tag{7}$$

with $\mathbf{x}\in\mathbb{R}^{n}$ and $\mathbf{f}\in\mathcal{C}^{1}(\mathbb{R}^{n}\times\mathbb{R},\mathbb{R}^{n})$ is said to be contracting with rate $\lambda>0$ and invertible metric transformation $\boldsymbol{\Theta}\in\mathbb{R}^{n\times n}$ if the symmetric part of the generalized Jacobian

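$$\left(\boldsymbol{\Theta}\nabla\mathbf{f}(\mathbf{x},t)\boldsymbol{\Theta}^{-1}\right)_{s} \leq -\lambda\mathbf{I} \tag{8}$$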

is uniformly negative definite for all $\mathbf{x}\in\mathbb{R}^{n}$ and all $t\in\mathbb{R}$ . Above, subscript $s$ denotes the symmetric part of a matrix, $\mathbf{A}_{s}=\frac{1}{2}\left(\mathbf{A}+\mathbf{A}^{T}\right)$ . Equivalently, the system is said to be contracting in the corresponding metric $\mathbf{M}=\boldsymbol{\Theta}^{T}\boldsymbol{\Theta}$ .

If condition (8) is satisfied, all trajectories exponentially converge to one another regardless of initial conditions. That is, for two solutions $\mathbf{x}_{1}(t)$ and $\mathbf{x}_{2}(t)$ of (7),

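$$\|\mathbf{x}_{1}(t)-\mathbf{x}_{2}(t)\|_{\mathbf{M}} \leq e^{-\lambda t}\|\mathbf{x}_{1}(0)-\mathbf{x}_{2}(0)\|_{\mathbf{M}} \tag{9}$$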

where $\|\mathbf{x}\|_{\mathbf{M}}=\sqrt{\mathbf{x}^{T}\mathbf{M}\mathbf{x}}$ . Intuitively, because of the property (9), a nonlinear system is called contracting if differences in system trajectories due to initial conditions and temporary disturbances are exponentially forgotten. This behavior is proved differentially, by considering the time evolution of the squared Euclidean norm of the virtual displacement $\delta\mathbf{z}=\boldsymbol{\Theta}\delta\mathbf{x}$ , which formally obeys the differential equation $\delta\dot{\mathbf{z}}=\boldsymbol{\Theta}\nabla\mathbf{f}(\mathbf{x})\boldsymbol{\Theta}^{-1}\delta\mathbf{z}$ [Lohmiller and Slotine, 1998]. As an immediate and powerful corollary, if the system is contracting and a single trajectory is known, then all trajectories must converge to the single known trajectory exponentially.
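The exponential forgetting of initial conditions can be checked on a toy system. The sketch below assumes a hypothetical scalar system $\dot{x}=-2x+\sin(x)+\cos(t)$ with the identity metric; its Jacobian $-2+\cos(x)$ is at most $-1$, so it is contracting with rate $\lambda=1$. The system and all names are illustrative choices, not taken from the paper.

```python
import math

def f(x, t):
    """Scalar dynamics with Jacobian -2 + cos(x) <= -1, so the contraction rate is 1."""
    return -2.0 * x + math.sin(x) + math.cos(t)

def integrate(x0, h, steps):
    """Forward-Euler trajectory of x' = f(x, t) starting from x0 at t = 0."""
    xs, x = [x0], x0
    for n in range(steps):
        x += h * f(x, n * h)
        xs.append(x)
    return xs

lam, h, steps = 1.0, 0.01, 1000
a = integrate(3.0, h, steps)   # two trajectories from different initial conditions
b = integrate(-2.0, h, steps)
d0 = abs(a[0] - b[0])

# Check |x1(t) - x2(t)| <= exp(-lam * t) * |x1(0) - x2(0)| at every grid point.
bound_holds = all(abs(xa - xb) <= math.exp(-lam * n * h) * d0 + 1e-12
                  for n, (xa, xb) in enumerate(zip(a, b)))
```

For this step size the forward-Euler map shrinks the gap by a factor of at most $1-h$ per step, which is no larger than $e^{-h}$, so the discrete trajectories respect the continuous-time bound as well.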

In this work, we will interchangeably refer to $\mathbf{f}$ , the system, and the generalized Jacobian as contracting depending on the context. In particular, for stochastic differential equations, we will refer to $\mathbf{f}$ as contracting if the deterministic system is contracting. Two specific robustness results for contracting systems needed for the derivations in this work are summarized below.

Lemma 2.1.

Consider the dynamical system (7), and assume that it is contracting with metric transformation $\boldsymbol{\Theta}$ and contraction rate $\lambda$ . Let $\chi=\|\boldsymbol{\Theta}^{-1}\|\|\boldsymbol{\Theta}\|$ denote the condition number of $\boldsymbol{\Theta}$ , where $\|\boldsymbol{\Theta}\|=\sup_{\|\mathbf{y}\|=1}\|\boldsymbol{\Theta}\mathbf{y}\|$ denotes the induced matrix 2-norm. Consider the perturbed dynamical system

$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},t) + \boldsymbol{\epsilon}(\mathbf{x},t). \tag{10}$$

Then, for a solution $\mathbf{x}_{1}$ of (7) and a solution $\mathbf{x}_{2}$ of (10), with $R=\|\boldsymbol{\Theta}\left(\mathbf{x}_{1}-\mathbf{x}_{2}\right)\|$ ,

$$\dot{R} + \lambda R \leq \|\boldsymbol{\Theta}\boldsymbol{\epsilon}(\mathbf{x},t)\| \tag{11}$$

Furthermore, if $\|\boldsymbol{\epsilon}\|\leq Ae^{-at}+B$ with $A,B\in\mathbb{R}$ and $a\in\mathbb{R}^{+}$ , then after exponential transients of rates $a$ and $\lambda$ ,

$$\|\mathbf{x}_{1}(t)-\mathbf{x}_{2}(t)\| \leq \frac{\chi B}{\lambda} \tag{12}$$

Proof.

See point (vii) of “linear properties of generalized contraction analysis” in Lohmiller and Slotine [1998] for the derivation of (11). From (11), $\dot{R}+\lambda R\leq\|\boldsymbol{\Theta}\|\|\boldsymbol{\epsilon}\|\leq\|\boldsymbol{\Theta}\|\left(Ae^{-at}+B\right)$ . Convolving $e^{-\lambda t}$ with the right-hand side yields the inequality

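$$R(t) \leq \|\boldsymbol{\Theta}\|\left(\frac{B}{\lambda} + \frac{Ae^{-\lambda t}}{a-\lambda} - \frac{Be^{-\lambda t}}{\lambda} - \frac{Ae^{-at}}{a-\lambda}\right) \tag{13}$$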

Noting that $\|\mathbf{x}_{1}(t)-\mathbf{x}_{2}(t)\|=\|\boldsymbol{\Theta}^{-1}\boldsymbol{\Theta}\left(\mathbf{x}_{1}-\mathbf{x}_{2}\right)\|\leq\|\boldsymbol{\Theta}^{-1}\|\|\boldsymbol{\Theta}\left(\mathbf{x}_{1}-\mathbf{x}_{2}\right)\|=\|\boldsymbol{\Theta}^{-1}\|R$ yields the result (12). ∎
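The convolution step in the proof can be written out explicitly. The derivation below is a sketch that assumes, for illustration, that the two solutions start from the same point so that $R(0)=0$, and that $a\neq\lambda$; neither assumption is stated in the lemma itself.

```latex
\begin{align*}
R(t) &\leq \|\boldsymbol{\Theta}\| \int_{0}^{t} e^{-\lambda(t-s)}\left(Ae^{-as}+B\right)ds \\
     &= \|\boldsymbol{\Theta}\| \left( Ae^{-\lambda t}\,\frac{e^{(\lambda-a)t}-1}{\lambda-a}
        + \frac{B}{\lambda}\left(1-e^{-\lambda t}\right) \right) \\
     &= \|\boldsymbol{\Theta}\| \left( \frac{B}{\lambda} + \frac{Ae^{-\lambda t}}{a-\lambda}
        - \frac{Be^{-\lambda t}}{\lambda} - \frac{Ae^{-at}}{a-\lambda} \right).
\end{align*}
```

As $t\rightarrow\infty$ every exponential term vanishes, leaving $R\leq\|\boldsymbol{\Theta}\|B/\lambda$; multiplying by $\|\boldsymbol{\Theta}^{-1}\|$ then gives the constant $\chi B/\lambda$.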

Theorem 2.1.

Consider the stochastic differential equation

$$d\mathbf{x} = \mathbf{f}(\mathbf{x},t)dt + \boldsymbol{\sigma}(\mathbf{x},t)d\mathbf{W}, \tag{14}$$

with $\mathbf{x}\in\mathbb{R}^{n}$ and where $\mathbf{W}$ denotes an $n$ -dimensional Wiener process. Assume that there exists a positive definite metric $\mathbf{M}=\boldsymbol{\Theta}^{T}\boldsymbol{\Theta}$ such that $\mathbf{x}^{T}\mathbf{M}\mathbf{x}\geq\beta\|\mathbf{x}\|^{2}$ with $\beta>0$ , and that $\mathbf{f}$ is contracting in this metric. Further assume that $Tr\left(\boldsymbol{\sigma}(\mathbf{x},t)^{T}\mathbf{M}\boldsymbol{\sigma}(\mathbf{x},t)\right)\leq C$ where $C\in\mathbb{R}^{+}$ . Then, for two trajectories $\mathbf{a}(t)$ and $\mathbf{b}(t)$ driven by independent sources of noise with stochastic initial conditions given by a probability distribution $p(\boldsymbol{\zeta}_{1},\boldsymbol{\zeta}_{2})$ ,

[TABLE]

where $(\cdot)^{+}$ denotes the unit ramp (or ReLU) function. The expectation on the left-hand side is over the noise $d\mathbf{W}(s)$ for all $s<t$ , and the expectation on the right-hand side is over the distribution of initial conditions.

See Pham et al. [2009], Thm. 2 for a proof of Thm. 2.1. A corollary that will be useful in Sec. 5 is as follows.

Corollary 2.1.

Assume that the conditions of Thm. 2.1 are satisfied. Then, for a trajectory $\mathbf{x}_{nf}(t)$ of (7) and a trajectory $\mathbf{x}(t)$ of (14),

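$$\mathbb{E}\left[\|\mathbf{x}(t)-\mathbf{x}_{nf}(t)\|^{2}\right] \leq \frac{1}{\beta}\left(\mathbb{E}\left[\left(\|\mathbf{x}(0)-\mathbf{x}_{nf}(0)\|_{\mathbf{M}}^{2} - \frac{C}{2\lambda}\right)^{+}\right]e^{-2\lambda t} + \frac{C}{2\lambda}\right).$$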

Cor. 2.1 is obtained by following the proof of Thm. 2 in Pham et al. [2009] with the restriction that one system is deterministic. To reduce the appearance of decaying exponential terms, in applications of Thm. 2.1, Cor. 2.1, and other related contraction-based bounds, we will simply state the final constant and the corresponding rate of exponential transients. The conditions of Thm. 2.1 are worthy of their own definition.

Definition 2.4.

(Stochastic contraction) If the conditions of Thm. 2.1 are satisfied, the system (14) is said to be stochastically contracting in the metric $\mathbf{M}$ (or with metric transformation $\boldsymbol{\Theta}$ ) with bound $C$ and rate $\lambda$ .
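As a rough numerical illustration of the constant $C/(2\lambda)$ in the bound comparing a noisy and a noise-free trajectory, consider the scalar linear system $dx=-\lambda x\,dt+\sigma\,dW$ with the identity metric, so that $\beta=1$ and $C=\sigma^{2}$. This Monte Carlo sketch is an assumption-laden toy example (the system, the parameters, and the names are hypothetical), not a reproduction of anything in the paper.

```python
import math
import random

def mean_square_gap(lam, sigma, h, steps, n_paths, rng):
    """Monte Carlo estimate of E[(x - x_nf)**2] at the final time.

    x solves dx = -lam * x dt + sigma dW and x_nf solves the noise-free system.
    Both start from the same point, and the system is linear, so the gap
    e = x - x_nf obeys de = -lam * e dt + sigma dW with e(0) = 0.
    """
    total = 0.0
    for _ in range(n_paths):
        e = 0.0
        for _ in range(steps):
            e += -lam * e * h + sigma * rng.gauss(0.0, math.sqrt(h))
        total += e * e
    return total / n_paths

lam, sigma = 1.0, 0.5
rng = random.Random(3)
gap = mean_square_gap(lam, sigma, h=0.01, steps=500, n_paths=2000, rng=rng)

# Asymptotic bound C / (2 * lam) with M = 1, beta = 1, C = sigma**2.
bound = sigma ** 2 / (2 * lam)
```

For this linear example the bound is tight, so `gap` should come out close to `bound`, up to Monte Carlo and time-discretization error.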

In this work, we will also make use of an extension of contraction known as partial contraction originally introduced in Wang and Slotine [2005]. The procedure is summarized below.

Theorem 2.2.

Consider the nonlinear dynamical system (7) – not assumed to be contracting – and consider a contracting auxiliary system of the form

$$\dot{\mathbf{y}} = \mathbf{g}(\mathbf{y},\mathbf{x},t) \tag{15}$$

with the requirement that $\mathbf{g}(\mathbf{x},\mathbf{x},t)=\mathbf{f}(\mathbf{x},t)$. Assume a single trajectory $\mathbf{y}(t)$ of (15) is known. Then all trajectories of (7) converge to $\mathbf{y}(t)$.

Footnote 2: For example, say $\mathbf{f}(\mathbf{x},t)=-\mathbf{P}(\mathbf{x})\mathbf{x}$ with $\mathbf{P}(\mathbf{x})$ a symmetric and uniformly positive definite matrix. Then $\mathbf{g}(\mathbf{y},\mathbf{x},t)=-\mathbf{P}(\mathbf{x})\mathbf{y}$ satisfies this restriction requirement. The $\mathbf{y}$ system is also contracting in $\mathbf{y}$, as the symmetric part of the Jacobian $\mathbf{J}_{s}=-\mathbf{P}(\mathbf{x})<0$ uniformly. On the other hand, the $\mathbf{f}(\mathbf{x},t)$ system has Jacobian $\frac{\partial f_{i}}{\partial x_{j}}=-P_{ij}(\mathbf{x})-\sum_{k}\frac{\partial P_{ik}(\mathbf{x})}{\partial x_{j}}x_{k}$, which has symmetric part with unknown definiteness without further assumptions on $\mathbf{P}$.

Proof.

By assumption, (15) is contracting, and so all trajectories converge to $\mathbf{y}(t)$ . Because $\mathbf{g}(\mathbf{x},\mathbf{x},t)=\mathbf{f}(\mathbf{x},t)$ , any solution $\mathbf{x}(t)$ of (7) is also a solution of (15), and hence must converge to $\mathbf{y}(t)$ . ∎

We will commonly refer to the auxiliary $\mathbf{y}$ system in the above theorem as a virtual system, and $\mathbf{f}$ is said to be partially contracting. Thm. 2.2 enables the application of contraction to systems which in themselves are not contracting, but can be embedded in a virtual system which is.

This notion also extends to stochastic systems through the use of stochastic contraction. If a stochastically contracting system

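$$d\mathbf{y} = \mathbf{g}(\mathbf{y},\mathbf{x},t)dt + \boldsymbol{\Xi}(\mathbf{y},\mathbf{x},t)d\mathbf{W} \tag{16}$$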

can be found such that $\mathbf{g}(\mathbf{x},\mathbf{x},t)=\mathbf{f}(\mathbf{x},t)$ and $\boldsymbol{\Xi}(\mathbf{x},\mathbf{x})=\boldsymbol{\sigma}(\mathbf{x},t)$, then trajectories of (16) can be compared to trajectories of (7) through the application of Cor. 2.1, or to trajectories of (14) through the application of Thm. 2.1.

2.6 Assumptions

We require two main assumptions about the objective function $f(\mathbf{x})$ , both of which have been employed in previous work analyzing synchronization and noise in nonlinear systems [Tabareau et al., 2010]. The first is an assumption on the nonlinearity of the components of the gradient.

Assumption 2.1.

Assume that the Hessian matrix of each component of the negative gradient has bounded maximum eigenvalue, $\nabla^{2}\left[\left(-\nabla f(\mathbf{x})\right)_{j}\right]\leq Q\mathbf{I}$ for all $j$ .

The second assumption is a condition on the robustness of the distributed gradient flows studied in this work to small, potentially stochastic perturbations.

Assumption 2.2.

Consider two dynamical systems

[TABLE]

where $\mathbf{P}_{l}$ is a continuous-time stochastic process dependent on a parameter $l$ and $\beta_{q}\in\mathbb{R}$ is a real coefficient dependent on a parameter $q$ . Denote by $\mathbf{x}(t)$ the solution to (17) and by $\mathbf{y}_{l,q}(t)$ the solution to (18) with the same initial condition, $\mathbf{x}(0)=\mathbf{y}_{l,q}(0)$ . We assume that $\lim_{l\rightarrow\infty}\mathbb{E}\left(\|\mathbf{P}_{l}\|\right)=0$ and $\lim_{q\rightarrow\infty}\beta_{q}=0$ implies that $\lim_{l\rightarrow\infty}\lim_{q\rightarrow\infty}\|\mathbf{x}-\mathbf{y}_{l,q}\|=0$ almost surely.

Continuous dependence of trajectories on parameters of the dynamics in the sense of Assumption 2.2 can be characterized for deterministic systems through continuity assumptions on the dynamics – see, for example, Section 3.2 in Khalil [2002] – here we assume a natural stochastic extension. Assumption 2.2 has been verified for FitzHugh-Nagumo oscillators where $\mathbf{P}_{l}$ is a white noise process [Tuckwell and Rodriguez, 1998], and validated in simulation for more complex nonlinear oscillators [Tabareau et al., 2010]. We remark that $\mathbb{E}\left[\|\mathbf{P}\|\right]\rightarrow 0$ implies that $\|\mathbf{P}\|\rightarrow 0$ almost surely, and hence that $\mathbf{P}\rightarrow\mathbf{0}$ almost surely.

3 Synchronization and noise

In this section, we analyze the interaction between synchronization of the distributed QSGD agents and the noise they experience. We begin with a derivation of a quantitative measure of synchronization that applies to a class of distributed SGD algorithms involving coupling to a common external signal with no communication delays. We then present the section’s primary contribution, which will serve as a basis for the theory in the remainder of the paper, as well as for the intuition for various experiments.

3.1 A measure of synchronization

We now present a simple theorem on synchronization in the deterministic setting, which will allow us to prove a bound on synchronization in the stochastic setting using Thm. 2.1.

Theorem 3.1.

Consider the coupled gradient descent system

[TABLE]

where $\mathbf{z}$ represents a common external signal. Let $\bar{\lambda}$ denote the maximum eigenvalue of $-\nabla^{2}f(\mathbf{x})$ . For $k>\bar{\lambda}$ , the individual $\mathbf{x}^{i}$ trajectories synchronize exponentially with rate $k-\bar{\lambda}$ regardless of initial conditions.

Proof.

Consider the auxiliary virtual system

[TABLE]

where $\mathbf{z}$ is an external input. Note that with $\mathbf{y}=\mathbf{x}^{i}$ , we recover (19) – i.e., (20) admits the trajectories of each agent $\mathbf{x}^{i}$ as particular solutions. The Jacobian of (20) is given by

[TABLE]

The Jacobian in (21) is symmetric and negative definite for $k>\bar{\lambda}$ , for any external input $\mathbf{z}$ , so the virtual system is contracting. Because the individual $\mathbf{x}^{i}$ are particular solutions of this virtual system, contraction implies that for all $i$ and $j$ , $\|\mathbf{x}^{i}-\mathbf{x}^{j}\|\rightarrow 0$ exponentially. The contraction rate is given by $k-\bar{\lambda}$ . ∎
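The exponential synchronization in Thm. 3.1 can be checked numerically. The following is a minimal sketch rather than the authors' code. It assumes that (19) has the form $\dot{\mathbf{x}}^{i}=-\nabla f(\mathbf{x}^{i})+k(\mathbf{z}-\mathbf{x}^{i})$ , and it uses an illustrative scalar double well $f(x)=x^{4}/4-x^{2}/2$ (for which $\bar{\lambda}=1$ ), a hypothetical external signal $z(t)=\sin t$ , and a forward Euler step; all of these choices are assumptions made for illustration.

```python
import numpy as np

def grad_f(x):
    # Gradient of the illustrative double well f(x) = x**4/4 - x**2/2 (an assumption).
    return x**3 - x

def simulate_coupled_flow(x0, k, z, dt=1e-3, steps=5000):
    """Forward Euler for the assumed form dx^i/dt = -grad_f(x^i) + k * (z(t) - x^i)."""
    x = np.array(x0, dtype=float)
    gaps = [np.ptp(x)]  # largest pairwise distance between the scalar agents
    for n in range(steps):
        x = x + dt * (-grad_f(x) + k * (z(n * dt) - x))
        gaps.append(np.ptp(x))
    return np.array(gaps)

lam_bar = 1.0   # maximum of -f''(x) = 1 - 3 x**2 for this f
k = 2.0         # coupling gain, chosen so that k > lam_bar
dt, steps = 1e-3, 5000
gaps = simulate_coupled_flow([-2.0, -0.3, 0.4, 1.9], k, np.sin, dt, steps)

# Envelope predicted by the contraction rate k - lam_bar.
times = dt * np.arange(steps + 1)
envelope = gaps[0] * np.exp(-(k - lam_bar) * times)
```

Comparing `gaps` with `envelope` shows the distance between agents decaying at least as fast as $e^{-(k-\bar{\lambda})t}$ , whatever the common signal does.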

This theorem motivates a definition.

Definition 3.1.

(Global exponential synchronization) We will say the agents in a distributed algorithm globally exponentially synchronize if they all converge to one another exponentially regardless of initial conditions.

Thm. 3.1 gives a simple condition on the coupling gain $k$ for synchronization of the individual agents in (19). Because $\mathbf{z}$ can represent any input, Thm. 3.1 applies to any dynamics of the quorum variable: with $\mathbf{z}=\mathbf{x}^{\bullet}$ , it applies to the QSGD algorithm, and with $\mathbf{z}=\tilde{\mathbf{x}}$ , it applies to the EASGD algorithm. Under the assumption of a contracting deterministic system, we can use the stochastic contraction results in Thm. 2.1 to bound the expected distance between individual agents in the stochastic setting.

Lemma 3.1.

Assume that $k>\bar{\lambda}$ and that $Tr(\mathbf{B}\mathbf{B}^{T})=Tr(\boldsymbol{\Sigma})<C$ uniformly. Then, after exponential transients of rate $2(k-\bar{\lambda})$ ,

[TABLE]

where each $\mathbf{x}^{i}$ is a solution of (4) or (6).

Proof.

Consider the systems for $i=1,\ldots,p$

[TABLE]

which reproduces (4) with $\mathbf{z}=\tilde{\mathbf{x}}$ and (6) with $\mathbf{z}=\mathbf{x}^{\bullet}$ . Each solution $\mathbf{x}^{i}$ to (23) is a solution of the stochastic virtual system

[TABLE]

which has contracting deterministic part under the assumptions of the lemma and by Thm. 3.1. For fixed $i$ and $j$ , applying the results of Thm. 2.1 in the Euclidean metric leads to

[TABLE]

after exponential transients of rate $2(k-\bar{\lambda})$ . Summing (24) over $i$ and $j$ leads to

[TABLE]

Finally, as in Tabareau et al. [2010], we can rewrite

[TABLE]

which proves the result. ∎

We will refer to (22) as a synchronization condition.
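The rewriting step attributed to Tabareau et al. [2010] is not reproduced above. A standard identity of this kind, assumed here to be the one intended, relates the sum of squared pairwise distances to the spread about the center of mass: $\sum_{i<j}\|\mathbf{x}^{i}-\mathbf{x}^{j}\|^{2}=p\sum_{i}\|\mathbf{x}^{i}-\mathbf{x}^{\bullet}\|^{2}$ . The sketch below checks it on random data; the number of agents, the dimension, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 7, 3                    # hypothetical number of agents and dimension
X = rng.normal(size=(p, n))    # row i plays the role of agent x^i
x_bullet = X.mean(axis=0)      # center of mass

# Left side: sum over unordered pairs i < j of ||x^i - x^j||^2.
diffs = X[:, None, :] - X[None, :, :]
pairwise = 0.5 * np.sum(diffs**2)   # the full double sum counts each pair twice

# Right side: p times the summed squared deviation from the center of mass.
spread = p * np.sum((X - x_bullet) ** 2)
```

The two quantities agree up to floating-point error, which is why a bound on pairwise distances translates directly into a bound on the deviation of the agents from $\mathbf{x}^{\bullet}$ .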

3.2 Reduction of noise due to synchronization

We now provide a mathematical characterization of how synchronization reduces the amount of noise felt by the individual QSGD agents. The derivation follows the mathematical procedure first employed in Tabareau et al. [2010] in the study of neural oscillators.

Theorem 3.2.

(The effect of synchronization on stochastic gradient noise)

Let $\mathbf{x}^{\bullet}_{k,p}(t)$ denote the center of mass trajectory of the continuous-time QSGD system (6) with coupling gain $k$ and $p$ agents. In the simultaneous limits $k\rightarrow\infty$ and $p\rightarrow\infty$ , the difference between $\mathbf{x}^{\bullet}_{k,p}(t)$ and a trajectory of the noise-free dynamics

[TABLE]

tends to zero, $\lim_{k\rightarrow\infty}\lim_{p\rightarrow\infty}\|\mathbf{x}^{\bullet}_{k,p}-\mathbf{x}_{\text{nf}}\|=0$ almost surely, with $\mathbf{x}_{\text{nf}}(0)=\mathbf{x}^{\bullet}_{k,p}(0)$ .

Proof.

Summing the stochastic dynamics (6) over the $p$ agents and dividing by $p$ , we find

[TABLE]

To make clear the dependence of the dynamics on $\mathbf{x}^{\bullet}$ , we define the disturbance term

[TABLE]

so that we can rewrite (26) as

[TABLE]

Each term $\sqrt{\frac{\eta}{bp^{2}}}\mathbf{B}(\mathbf{x}^{i})d\mathbf{W}^{i}$ is a Gaussian random variable with covariance $\frac{\eta}{bp^{2}}\boldsymbol{\Sigma}(\mathbf{x}^{i})$ , and each $d\mathbf{W}^{i}$ is independent of all other $d\mathbf{W}^{j}$ . Hence the sum over the noise terms in (27) can also be written as a single Gaussian random variable with covariance $\frac{\eta}{bp^{2}}\sum_{i}\boldsymbol{\Sigma}(\mathbf{x}^{i})$ ,

[TABLE]

where $\mathbf{T}=\mathbf{T}(\mathbf{x}^{1},\ldots,\mathbf{x}^{p})$ and $\mathbf{T}\mathbf{T}^{T}=\sum_{i}\boldsymbol{\Sigma}(\mathbf{x}^{i})$ . (28) leads to an additional simplification of (27),

[TABLE]

Equation (29) shows that the effect of the additive noise is eliminated as the number of agents $p\rightarrow\infty$ . (Footnote 3: Indeed, the covariance $\frac{\eta}{bp^{2}}\sum_{i}\boldsymbol{\Sigma}(\mathbf{x}^{i})\leq\frac{\eta}{bp}\bar{\boldsymbol{\Sigma}}$ where $\bar{\boldsymbol{\Sigma}}=\max_{i}\boldsymbol{\Sigma}(\mathbf{x}^{i})$ and the $\max$ and $\leq$ are with respect to the positive semidefinite order. The covariance $\frac{\eta}{bp}\bar{\boldsymbol{\Sigma}}$ tends to zero as $p\rightarrow\infty$ , so that Gaussian random variables drawn from a distribution with this covariance become increasingly concentrated around zero with increasing $p$ . Because the true covariance $\frac{\eta}{bp^{2}}\mathbf{T}\mathbf{T}^{T}$ is smaller in the positive semidefinite order, random variables drawn from the true distribution likewise become concentrated around zero as $p\rightarrow\infty$ .) We now let $\mathbf{F}_{j}$ denote the gradient of $\left(-\nabla f(\mathbf{x})\right)_{j}$ , and we let $\mathbf{H}_{j}$ denote its Hessian. We apply the Taylor formula with integral remainder to $\left(-\nabla f(\mathbf{x})\right)_{j}$ ,

[TABLE]

Summing (30) over $i$ and applying the assumed bound $\mathbf{H}_{j}\leq Q\mathbf{I}$ leads to the inequality

[TABLE]

The left-hand side of the above inequality is $p|\boldsymbol{\epsilon}_{j}|$ . Squaring both sides and summing over $j$ provides a bound on $p^{2}\|\boldsymbol{\epsilon}\|^{2}$ . Taking a square root of this bound, we find

[TABLE]

where the factor of $\sqrt{n}$ originates from the sum over the components of $\boldsymbol{\epsilon}$ . Performing an expectation over the noise $d\mathbf{W}(s)$ for all $s<t$ and using the synchronization condition in (22), we conclude that after exponential transients of rate $2(k-\bar{\lambda})$ ,

[TABLE]

The bound in (31) depends on the synchronization rate of the agents $k-\bar{\lambda}$ , the dimensionality of space $n$ , the bound on the third derivative of the objective $Q$ , and the bound on the noise strength $\frac{\eta}{b}C$ . In the limit of large $p$ , the dependence on $p$ becomes negligible. The expected effect of the disturbance term $\boldsymbol{\epsilon}$ tends to zero as the coupling gain $k$ tends to infinity, corresponding to the fully synchronized limit.

By Assumption 2.2 and Thm. 2.1, as $k\rightarrow\infty$ and $p\rightarrow\infty$ , the difference between trajectories of (29) and the unperturbed, noise-free system tends to zero almost surely, as the effects of both the stochastic disturbance $\boldsymbol{\epsilon}$ and the additive noise term are eliminated in this simultaneous limit. ∎
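The $\frac{\eta}{bp}$ scaling of the additive noise on the center of mass can be illustrated with a short Monte Carlo experiment. This is a sketch under simplifying assumptions: the noise covariance is taken to be a constant scalar `sigma2` (in the text, $\boldsymbol{\Sigma}$ depends on $\mathbf{x}^{i}$ ), and the values of `eta`, `b`, and `trials` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
eta, b = 0.1, 4    # hypothetical learning rate and batch size
sigma2 = 2.0       # hypothetical constant scalar noise covariance
trials = 20000

def center_of_mass_noise_var(p):
    # Each agent receives an independent Gaussian kick with variance (eta / b) * sigma2;
    # the center of mass receives the average of the p kicks.
    kicks = rng.normal(scale=np.sqrt(eta / b * sigma2), size=(trials, p))
    return kicks.mean(axis=1).var()

measured = {p: center_of_mass_noise_var(p) for p in (1, 10, 100)}
predicted = {p: eta * sigma2 / (b * p) for p in measured}
```

The empirical variances in `measured` track `predicted` , so the noise felt by the center of mass shrinks in proportion to $1/p$ .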

3.3 Discussion

Thm. 3.2 demonstrates that for distributed SGD algorithms, roughly speaking, the noise strength is set by the ratio parameter $\frac{\eta}{bp}$ at the expense of a distortion term which tends to zero with synchronization. Whether this noise reduction is a benefit or a drawback for non-convex optimization depends on the problem at hand.

If the use of a stochastic gradient is purely as an approximation of the true gradient – for example, due to single-node or single-GPU memory limitations – then synchronization can be seen as improving this approximation and eliminating undesirable noise while simultaneously parallelizing the optimization problem. The analysis in this section then gives rigorous bounds on the magnitude of noise reduction. The $\boldsymbol{\epsilon}$ term could be measured in practice to understand the empirical size of the distortion, and $k$ could be increased until $\boldsymbol{\epsilon}$ tends approximately to zero and the noise is reduced to a desired level.
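One way the distortion could be measured in practice is sketched below, using the sign convention $\boldsymbol{\epsilon}=\nabla f(\mathbf{x}^{\bullet})-\frac{1}{p}\sum_{i}\nabla f(\mathbf{x}^{i})$ that appears later in the text. The helper `distortion` , the quartic test loss, and the cluster widths are hypothetical choices made for illustration only.

```python
import numpy as np

def distortion(grad_f, X):
    """epsilon = grad_f(x_bullet) - mean_i grad_f(x^i), with agents stored as rows of X."""
    x_bullet = X.mean(axis=0)
    return grad_f(x_bullet) - np.mean([grad_f(x) for x in X], axis=0)

def grad_quartic(x):
    # Gradient of the illustrative loss f(x) = sum_j x_j**4 / 4 (an assumption).
    return x**3

rng = np.random.default_rng(2)
center = np.array([1.0, -0.5])

# Norm of the distortion for increasingly tight clusters of 50 agents.
norms = {}
for width in (1.0, 0.1, 0.01):
    X = center + width * rng.normal(size=(50, 2))
    norms[width] = np.linalg.norm(distortion(grad_quartic, X))
```

The distortion vanishes when the agents coincide or when the gradient is linear, and for this loss it shrinks rapidly as the cluster tightens, which is the quantity one would monitor while increasing $k$ .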

On the other hand, many studies have reported the importance of stochastic gradient noise in deep learning, particularly in the context of generalization performance [Poggio et al., 2017, Zhu et al., 2018, Chaudhari and Soatto, 2018, Zhang et al., 2017]. Furthermore, large batches are known to cause issues with generalization, and this has been hypothesized to be due to a reduction in the noise magnitude due to a higher $b$ in the ratio $\frac{\eta}{b}$ [Keskar et al., 2016]. In this context, reduction of noise may be undesirable, and one may only be interested in parallelization of the problem. The above analysis then suggests choosing $k$ high enough such that the quorum variable represents a meaningful average of the parameters, but low enough that the noise in the SGD iterations is not reduced. Indeed, in Sec. 6, we will find the best generalization performance for low values of $k$ which still result in convergence of the quorum variable. For deep networks, the level of synchronization for a given value of $k$ will be both architecture and dataset-dependent.

We remark that the condition in Thm. 3.1 is merely a sufficient condition for synchronization, and synchronization may occur for significantly lower values of $k$ than predicted by contraction in the Euclidean metric. However, independent of when synchronization exactly occurs, so long as there is a fixed upper bound as in (22), the results in this section will apply with the corresponding estimate of $\mathbb{E}[\|\boldsymbol{\epsilon}\|]$ .

3.4 Extension to multiple learning rates

Our analysis can be extended to the case when each individual agent has a different learning rate $\eta^{i}$ (or equivalently, different batch size), and thus a different noise level. In effect, this is because each agent still follows the same dynamics, though with different integration errors, and at a different rate. In this case, the synchronization condition (22) is modified to

[TABLE]

so that

[TABLE]

The noise term $\sum_{i}\sqrt{\frac{\eta^{i}}{bp^{2}}}\mathbf{B}(\mathbf{x}^{i})d\mathbf{W}^{i}$ becomes a sum of $p$ independent Gaussians each with covariance $\frac{\eta^{i}}{bp^{2}}\boldsymbol{\Sigma}(\mathbf{x}^{i})$ , and can be written as a single Gaussian random variable $\sqrt{\frac{1}{bp^{2}}}\mathbf{T}d\mathbf{W}$ with $\mathbf{T}\mathbf{T}^{T}=\sum_{i}\eta^{i}\boldsymbol{\Sigma}(\mathbf{x}^{i})$ . An argument analogous to the one given in Sec. 3.2 shows that the effect of this additive noise will tend to zero as $p\rightarrow\infty$ . This could allow, for example, for multiresolution optimization, where agents with larger learning rates may help avoid sharper local minima, saddle points, and flat regions of the parameter space, while agents with finer learning rates may help converge to robust local minima which generalize well. Standard learning rate schedules can also be applied agent-wise using the validation loss of individual agents, rather than decreasing all learning rates using the validation loss of the quorum variable.

3.5 Extension to momentum methods

Our analysis can also be extended to momentum methods, modeled using the differential equation [Su et al., 2014]

[TABLE]

in component-wise form

[TABLE]

Coupling the agents in both position and velocity leads to the dynamics,

[TABLE]

where $\mathbf{x}^{\bullet}_{l}=\frac{1}{p}\sum_{j}\mathbf{x}_{l}^{j}$ .

Lemma 3.2.

Consider the QSGD with momentum system given by (33) and (34). Assume that $f$ is $\underline{\lambda}$ -strongly convex and $\bar{\lambda}$ -smooth. For $k_{1}>\frac{1}{4\left(\inf_{t}\mu(t)+k_{2}\right)}\max\left(\left(1-\bar{\lambda}\right)^{2},\left(1-\underline{\lambda}\right)^{2}\right)$ , the individual $\mathbf{x}^{i}$ systems globally exponentially synchronize with rate $\xi$ , where

[TABLE]

Proof.

The virtual system

[TABLE]

has system Jacobian

[TABLE]

and will be contracting for $\left(\inf_{t}\mu(t)+k_{2}\right)k_{1}>\sup_{\mathbf{x}}\left(\sigma^{2}\left(\frac{1}{2}\left(-\nabla^{2}f(\mathbf{x})+\mathbf{I}\right)\right)\right)$ , where $\sigma^{2}(\cdot)$ denotes the largest squared singular value [Wang and Slotine, 2005]. Because $\mathbf{I}-\nabla^{2}f$ is symmetric, its squared singular values are simply its squared eigenvalues. This leads to the condition $\left(\inf_{t}\mu(t)+k_{2}\right)k_{1}>\frac{1}{4}\max\left((1-\bar{\lambda})^{2},(1-\underline{\lambda})^{2}\right)$ , which may be rearranged to yield the condition in the lemma.

(36) and (37) also admit the $\mathbf{x}_{l}^{i}$ as particular solutions, so that the agents globally exponentially synchronize with a rate $\xi=|\lambda_{\text{max}}(\mathbf{J})|$ . The lower bound on $\xi$ can be obtained by application of the result in Slotine [2003], Example 3.8. ∎

Hence, a bound similar to (22) can be derived just as in Lemma 3.1. Because the $\mathbf{x}^{\bullet}_{1}$ dynamics are linear, and because the $\mathbf{x}^{\bullet}_{2}$ dynamics are only nonlinear through the gradient of the loss, Assumption 2.1 does not need to be modified. For $\inf_{t}\mu(t)>0$ , $k_{2}$ can be set to zero, so that coupling is only through the position variables.
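As a worked example of the gain condition in Lemma 3.2, the sketch below evaluates the lower bound on $k_{1}$ for given constants. The function name `k1_threshold` and the numerical values are hypothetical; only the formula is taken from the lemma.

```python
def k1_threshold(lam_lower, lam_upper, mu_inf, k2=0.0):
    """Lower bound on k1: max((1 - lam_upper)^2, (1 - lam_lower)^2) / (4 * (mu_inf + k2))."""
    if mu_inf + k2 <= 0:
        raise ValueError("need inf_t mu(t) + k2 > 0")
    worst = max((1.0 - lam_upper) ** 2, (1.0 - lam_lower) ** 2)
    return worst / (4.0 * (mu_inf + k2))

# Hypothetical constants: 0.5-strongly convex, 3-smooth, inf_t mu(t) = 1, no velocity coupling.
threshold = k1_threshold(lam_lower=0.5, lam_upper=3.0, mu_inf=1.0, k2=0.0)
```

With these constants the smoothness term dominates, $\max(4,0.25)=4$ , so any $k_{1}>1$ satisfies the condition; adding velocity coupling $k_{2}>0$ lowers the required position gain.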

4 An alternative view of distributed stochastic gradient descent

In this section, we connect the above discussion of synchronization and noise reduction with the analysis in Kleinberg et al. [2018], which interprets SGD as performing gradient descent on a smoothed loss in expectation. Specifically, we show that the reduction of noise due to synchronization can be viewed as a reduction in the smoothing of the loss function. This provides further geometrical intuition for the effect of synchronization on distributed SGD algorithms. It furthermore sheds light on why one may want to use low values of $k$ to prevent noise reduction in learning problems involving generalization, where optimization of the empirical risk rather than the expected risk introduces spurious defects into the loss function that may be removed by sufficient smoothing.

Defining the auxiliary sequence $\mathbf{y}_{t}=\mathbf{x}_{t}-\eta\nabla f(\mathbf{x}_{t})$ and comparing with (1) shows that $\mathbf{x}_{t+1}=\mathbf{y}_{t}-\frac{\eta}{\sqrt{b}}\boldsymbol{\zeta}_{t}$ , yielding

[TABLE]

so that

[TABLE]

This demonstrates that the $\mathbf{y}$ sequence performs gradient descent on the loss function convolved with the $\frac{\eta}{\sqrt{b}}$ -scaled noise in expectation. (Footnote 4: In Kleinberg et al. [2018], the authors group the factor of $\sqrt{1/b}$ with the covariance of the noise.) Using this argument, it is shown in Kleinberg et al. [2018] that SGD can converge to minimizers for a much larger class of functions than just convex functions, though the convolution operation can disturb the locations of the minima.
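The smoothing interpretation can be verified on a toy problem. The sketch below assumes a scalar loss $f(x)=x^{4}/4$ and zero-mean uniform noise $\zeta\sim U(-a,a)$ , both chosen only so that the convolved gradient has a closed form; `eta` , `b` , `a` , and the grid size `N` are hypothetical. It compares the expected update of $\mathbf{y}$ , computed by averaging over a fine grid of noise values, with a gradient step on the convolved loss.

```python
import numpy as np

eta, b, a = 0.15, 1, 1.5   # hypothetical step size, batch size, noise half-width
c = eta / np.sqrt(b)       # noise scaling in x_{t+1} = y_t - c * zeta_t

def grad_f(x):
    # Gradient of the illustrative loss f(x) = x**4 / 4 (an assumption).
    return x**3

# Midpoint grid on (-a, a): averaging over it approximates E over zeta ~ U(-a, a).
N = 20000
zeta = -a + (np.arange(N) + 0.5) * (2 * a / N)

def expected_y_step(y):
    # E[y_{t+1} | y_t = y] = y - eta * E[grad_f(y - c * zeta)], using E[zeta] = 0.
    return y - eta * np.mean(grad_f(y - c * zeta))

def smoothed_grad(y):
    # Closed-form gradient of the convolved loss E[f(y - c * zeta)] for this f:
    # E[(y - c * zeta)**3] = y**3 + 3 * y * c**2 * E[zeta**2] = y**3 + y * c**2 * a**2.
    return y**3 + y * c**2 * a**2
```

For any `y` , `expected_y_step(y)` matches `y - eta * smoothed_grad(y)` up to quadrature error. The extra term `y * c**2 * a**2` is the smoothing contribution, and it grows with the noise scaling `c` .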

4.1 The effect of synchronization on the convolution scaling

The analysis in Sec. 3 suggests that synchronization of the $\mathbf{x}^{i}$ variables should reduce the convolution prefactor for a $\mathbf{y}$ variable related to the center of mass, and we now make this intuition more precise for the QSGD algorithm. We have that

[TABLE]

so that

[TABLE]

with $\boldsymbol{\epsilon}_{t}=\nabla f(\mathbf{x}^{\bullet}_{t})-\frac{1}{p}\sum_{i}\nabla f(\mathbf{x}^{i}_{t})$ as usual. Define the auxiliary variable $\mathbf{y}^{\bullet}_{t}=\mathbf{x}^{\bullet}_{t}-\eta\nabla f(\mathbf{x}^{\bullet}_{t})$ , so that

[TABLE]

Equation (38) can then be used to state

[TABLE]

Taylor expanding the gradient term, we find

[TABLE]

which alters the discrete $\mathbf{y}^{\bullet}$ update to

[TABLE]

Equation (39) says that, in expectation, $\mathbf{y}^{\bullet}$ performs gradient descent on a convolved loss with noise scaling reduced by a factor of $\frac{1}{\sqrt{p}}$ . The reduced scaling comes at the expense of the usual disturbance term $\boldsymbol{\epsilon}$ , which decreases to zero with increasing synchronization in expectation over the noise $\boldsymbol{\zeta}_{s}$ for $s<t$ . Equation (39) differs from the non-distributed case by an additional $\mathcal{O}(\eta^{2})$ factor of the Hessian.

4.2 Discussion

To better understand the interplay of synchronization and noise in SGD, we can consider several limiting cases. Consider a choice of $\eta$ corresponding to a fairly high noise level, so that the loss function is sufficiently smoothed for the iterates of SGD ( $k=0$ ) to avoid local minima, saddle points, and flat regions, but so that the iterates would not reliably converge to a desirable region of parameter space, such as a deep and robust minimum.

For $k\rightarrow\infty$ and $p$ sufficiently large, the quorum variable will effectively perform gradient descent on a minimally smoothed loss, and will converge to a local minimum of the true loss function close to its initialization. Due to the strong coupling, the agents will likely get pulled into this minimum, leading to convergence as if a single agent had been initialized using deterministic gradient descent at $\mathbf{x}^{\bullet}(t=0)$ , despite the high value of $\eta$ .

With an intermediate value of $k$ so that the agents remain in close proximity to each other, but not so strong that $\|\boldsymbol{\epsilon}\|\rightarrow 0$ , the $\mathbf{x}$ variables will be concentrated around the minima of the smoothed loss (the coupling will pull the agents together, but because $\|\boldsymbol{\epsilon}\|\neq 0$ , the smoothing will not be reduced in the sense of (39)). The stationary distribution of SGD is thought to be biased towards concentration around degenerate minima of high volume [Banburski et al., 2019]; the coupling force should thus amplify this effect, and lead to an accumulation of agents in wider and deeper minima in which all agents can approximately fit. Eventually, if sufficiently many agents arrive in a single minimum, it will be extremely difficult for any one agent to escape, leading to a consensus solution chosen by the agents even at a high noise level.

4.3 Numerical simulations in nonconvex optimization

In this subsection, we consider simulations on a model one-dimensional nonconvex loss function, as well as one possible high-dimensional generalization. There are several goals of the discussion. The first is to show that the intuition presented in Sec. 4.2 is correct. The second is to provide a setting where visualization of the loss function, its analytically smoothed counterpart, and the distribution of possible convergent points is straightforward. The third is to elucidate qualitative trends in distributed nonconvex optimization as a function of $k$ in low- and high-dimensional settings, and to show to what extent properties of the low-dimensional setting translate to the high-dimensional setting. We consider the loss function

[TABLE]

where the sinusoidal oscillations in (40) introduce spurious local minima. The constant factor $F\in\mathbb{R}^{+}$ is used for numerical stability for a wider range of $\eta$ values, in order to reduce the large gradient magnitudes introduced by the high-frequency modes. We simulate the dynamics of QSGD using a forward Euler discretization,

[TABLE]

with $f(x)$ given by (40). We include $1000$ agents in each of $250$ simulations per $k$ value. Each simulation is allowed to run for $20,000$ iterations with $\eta=0.15$ . (Footnote 5: We choose a relatively high value of $\eta$ so that the convolved loss will be qualitatively different from the true loss to a degree that is visible by eye. This enables us to distinguish convergence to true minima from convergence to minima of the convolved loss. An alternative and equivalent choice would be to choose $\eta$ smaller, with a correspondingly wider distribution of the noise.) The corresponding distributions of final points, computed via a kernel density estimate, are plotted over a range of $k$ values in Fig. 1. In each subfigure, the true loss function is plotted in orange and the loss function convolved with the noise distribution is plotted in blue. The loss functions are normalized so they can appear on the same scale as the distributions, and the $y$ scale is thus omitted. The agents are initialized uniformly over the interval $[-3,3]$ , and each experiences an i.i.d. uniform noise term $\zeta^{i}_{t}\sim U(-1.5,1.5)$ per iteration. $F$ is fixed at $150$ .
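A simulation of this kind can be sketched as follows. This is not the authors' code: the loss is a hypothetical stand-in (a double well scaled by $F$ , without the sinusoidal terms of (40)), the update is an assumed form of the forward Euler step (41), and the agent count and iteration count are reduced to keep the example fast. The initialization interval, noise distribution, $\eta$ , and $F$ follow the values quoted in the text.

```python
import numpy as np

F = 150.0   # gradient scaling constant, as quoted in the text

def grad_f(x):
    # Stand-in loss f(x) = (x**4/4 - x**2/2) / F; this is NOT the paper's loss (40).
    return (x**3 - x) / F

def qsgd(k, p=200, steps=2000, eta=0.15, noise_width=1.5, seed=0):
    """Assumed Euler step: x^i <- x^i - eta*(grad_f(x^i) + zeta^i) + eta*k*(x_bullet - x^i)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, size=p)   # uniform initialization on [-3, 3]
    for _ in range(steps):
        zeta = rng.uniform(-noise_width, noise_width, size=p)
        x_bullet = x.mean()              # quorum variable (center of mass)
        x = x - eta * (grad_f(x) + zeta) + eta * k * (x_bullet - x)
    return x

# Spread of the final iterates without and with coupling.
spread_uncoupled = qsgd(k=0.0).std()
spread_coupled = qsgd(k=1.0).std()
```

Even in this simplified setting the coupled agents end up far more concentrated than the uncoupled ones. Reproducing the distributions in Fig. 1 would additionally require the actual loss (40), many repeated runs per $k$ value, and a kernel density estimate of the final points.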

In Fig. 1(a), there is no coupling and the distribution of final iterates for the agents is nearly uniform across the parameter space with a slightly increased probability of convergence to the two deepest regions. The distribution of the quorum variable is sharply peaked around zero. (Footnote 6: Note that without coupling each agent performs basic SGD. Hence, the results in Fig. 1(a) are equivalent to $p\times n$ single-agent SGD simulations, where $n$ is the total number of simulations and $p$ is the number of agents per simulation.) As $k$ increases to $k=0.4$ in Fig. 1(b), the agents concentrate around the wide basins of the convolved loss function and avoid the sharp local minima of the true loss function. The distribution for the quorum variable is similar, but is too wide to imply reliable convergence to a minimum with loss near the global optimum.

As $k$ is increased further to $k=0.8$ in Fig. 1(c) and $k=1.0$ in Fig. 1(d), performance increases significantly. The distribution of the agents is centered around the global optimum of the smoothed loss, and the distribution of the quorum variable is very sharp around the same minimum; this represents the regime in which the agents have chosen a consensus solution. As demonstrated by Fig. 1(a), this improved convergence is not possible with standard SGD. As $k$ is increased again in Figs. 1(e) and (f), the coupling force becomes too great, and performance decreases – there is no initial exploratory phase to find the deeper regions of the landscape, and convergence is simply near the initialization of $x^{\bullet}$ .

These simulation results suggest a useful combination of high noise, coupling, and traditional learning rate schedules. High noise levels can lead to rapid exploration and avoidance of problematic regions in parameter space – such as local minima, saddle points, or flat regions – while coupling can stabilize the dynamics towards a distribution around a wide and deep minimum of the convolved loss. The learning rate can then be decreased to improve convergence to minima of the true loss that lie within the spread of the distribution. In the uncoupled case, similar levels of noise would lead to a random walk.

This intuition is supported by the simulation results in Fig. 2. The same simulation parameters are used, except the learning rate is now decreased by a factor of two every $4000$ iterations until $\eta\leq 0.001$ where it is fixed. In the uncoupled case in Fig. 2(a), the schedule only slightly improves convergence around minima of the smoothed loss when compared to Fig. 1(a). Fig. 2(b) again reflects a mild improvement relative to Fig. 1(b). For the two best values of $k=0.8$ and $k=1.0$ in Figs. 2(c) and (d), convergence of the agents and the quorum variable around the deepest minimum of the true loss that lies within the distribution of the agents in Figs. 1(c) and (d) is excellent. In the very high $k$ regime in Figs. 2(e) and (f), the coupling force is too strong to enable exploration, and convergence is again near the initialization of $\mathbf{x}^{\bullet}$ , but now to the minima of the true loss.
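The step schedule described above can be written as a small helper. The sketch assumes one reading of the text, namely that halving stops the first time $\eta\leq 0.001$ and the rate is held at whatever value it has reached; the function name and argument names are hypothetical.

```python
def halving_schedule(t, eta0=0.15, period=4000, floor=1e-3):
    """Learning rate at iteration t: halve every `period` steps while eta > floor."""
    eta = eta0
    for _ in range(t // period):
        if eta <= floor:
            break   # once eta has dropped to the floor or below, it stays fixed
        eta /= 2.0
    return eta

# Learning rates at the start of each 4000-iteration block of a 20,000-iteration run.
rates = [halving_schedule(t) for t in range(0, 20000, 4000)]
```

Starting from $\eta=0.15$ , a run of $20,000$ iterations sees four halvings and ends at $\eta=0.15/16$ , so under this reading the floor is never reached within the simulated horizon.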

The preceding results also qualitatively apply to momentum methods. We now turn to simulate the following iteration

[TABLE]

with the loss function again given by (40). The distributions of final iterates after $20,000$ steps with $\eta=0.1$ , computed from $250$ simulations per $k$ value with $1000$ agents per simulation, are shown in Fig. 3.

Fig. 3(a) is identical to Fig. 1(a) except for the difference in learning rate: the agents converge uniformly across the parameter space. As $k$ is increased to $k=2$ in Fig. 3(b), the distribution of the agents becomes more localized around the center of parameter space, but not around any minima. When $k$ is increased to $k=4$ in Fig. 3(c), $k=8$ in Fig. 3(d), and $k=10$ in Fig. 3(e), the distributions of the agents and the quorum variable become localized on the two deepest minima of the convolved loss, but are still too wide for reliable convergence. The value $k=15$ in Fig. 3(f) leads to reliable convergence around the deep minimum on the right, and would combine well with a learning rate schedule as in Fig. 2. Overall, the trend is similar to the case without momentum, though much higher values of $k$ are tolerated before degradation in performance. Despite high $k$ values rapidly pulling the agent positions close to $\mathbf{x}^{\bullet}(t=0)$ , significant differences in the velocities of the agents prevent convergence to a local minimum near $\mathbf{x}^{\bullet}(t=0)$ in the high $k$ regime.

To demonstrate that these qualitative results also hold in higher dimensions, we now consider a $d$ -dimensional objective function inspired by the one-dimensional objective function (40). The loss function is given by

[TABLE]

Equation (44) represents a separable sum of double well loss functions with pairwise sinusoidal coupling between all parameters. We include $1000$ agents in each of $250$ simulations per $k$ value with $d=250$ . Each simulation is allowed to run for $10,000$ steps. The parameters are updated according to the vector forms of (42) and (43) with $\eta=0.15$ and $\delta=0.9$ . No learning rate schedule is used. The agents are all randomly initialized uniformly in $[-4,4]\times[-4,4]$ and each experiences an i.i.d. noise term $\zeta^{i}_{t}\sim U(-0.75,0.75)$ . $F$ is fixed at $50$ .

For visualization purposes, we plot the contours of a two-dimensional cross section of the loss function by evaluating the last $d-2$ coordinates at the value $-1.2$ . This value was chosen to represent the bottom-left cluster apparent in Figs. 5 and 6; it also lies close to the global minimum of the uncorrupted loss function $(-1.426,-1.426,\ldots,-1.426)^{T}\in\mathbb{R}^{d}$ . Visualization of high-dimensional loss functions is difficult, and using such a cross section has its drawbacks; in particular, a saddle point may show up as a local minimum, correctly as a saddle point, or as a local maximum depending on the cross section taken. Nevertheless, the employed cross sections enable qualitative visualization of the clustering of the quorum variable and the individual agents, and provide assurance that the general phenomena seen in one dimension in Figs. 1-3 generalize naturally to higher dimensions.

The loss function itself is shown in Fig. 4(a) and the smoothed loss is shown in Fig. 4(b), which has significantly reduced complexity. Fig. 4(c) displays the loss value of the quorum variable, averaged over all simulations, as a function of iteration number for a set of possible $k$ values. The results are much the same as those described qualitatively in one dimension. Low values of $k$ such as $k=0$ and $k=0.5$ do not successfully minimize the loss function as the agents are too spread out. Despite a significant ability to explore the loss landscape with such small coupling, the agents are not concentrated enough for $\mathbf{x}^{\bullet}$ to represent a meaningful average. As $k$ increases, the ability to optimize the loss function at first significantly improves. While better than $k=0$ and $k=0.5$ , $k=1.5$ still represents the regime of too little coupling. $k=2.5$ and $k=3.5$ obtain much lower loss values than $k=0$ and $k=0.5$ , with $k=2.5$ achieving the lowest loss of the displayed $k$ values. As $k$ is increased further, performance starts to degrade. $k=4.5$ performs worse than $k=2.5$ , and $k=3.5$ obtains similar performance to $k=1.5$ . Increasing $k$ to $k=7.5$ , $k=9.5$ , and $k=11.5$ continues to deteriorate the ability of the algorithm to minimize the loss. The optimum $k$ value represents, for the given noise level and loss function, the correct balance of exploration and resistance to noise.

As in the case of any algorithmic hyperparameter, it is natural to expect that there will be an optimum value of $k$ . To see that the manifestation of this optimum is precisely a high-dimensional analogue of the qualitative behavior observed in the one-dimensional simulations in Figs. 1-3, we visualize the final points found by a random selection of $25$ agents per simulation and by the quorum variable in Figs. 5 and 6 respectively for a representative subset of the $k$ values seen in Fig. 4(c).

Fig. 5(a) shows that $k=0$ results in essentially uniform convergence of the agents across parameter space to local minima and saddle points, and hence the quorum variable simply converges near the origin in Fig. 6(a). The small amount of coupling $k=0.5$ in Fig. 5(b) leads to increased, but still insufficient, clustering of the agents. This manifests itself in Fig. 6(b) as a shift of the ball of quorum convergence points towards the bottom left corner. $k=1.5$ and $k=2.5$ in Figs. 5(c) and (d) have significantly improved convergence, with strong clustering of the agents in four balls around $(\pm 1.2,\pm 1.2)^{T}$ . These clusters are located near the minima of the uncorrupted loss function, which occur at $(\pm 1.426,\pm 1.426,\ldots,\pm 1.426)^{T}$ .

$k=1.5$ and $k=2.5$ have similar quorum convergence plots in Figs. 6(c) and (d), though the value of the loss in Fig. 4(c) is noticeably different at iteration $10,000$ . The difference in the loss function values for the quorum variables is likely hidden by the low-dimensional visualization method. Figs. 5(c) and (d) show that $k=1.5$ has more “straggler” agents between the four corner clusters than $k=2.5$ , which may shift the quorum convergence points uphill. From a qualitative perspective, both are good choices for tracking minima of the uncorrupted or the non-smoothed loss functions, and could be combined with a learning rate schedule to improve convergence from the cloud of “starting points” in Figs. 5(c) and (d).

As $k$ is increased further to $k=7.5$ , the coupling begins to grow too strong. The distinct agent clusters attempt to merge, as seen in Fig. 5(e). The result of this is seen in Fig. 6(e), where there are scattered quorum convergence points between the clusters. Finally, for $k=11.5$ , the coupling is too great, and both the agents and the quorum variable converge near the origin in Figs. 5(f) and 6(f) respectively.

Taken together, Figs. 1-6 provide significant qualitative insight into the convergence of distributed SGD algorithms, both with and without momentum. In both the one-dimensional and the high-dimensional simulations, there is an optimum level of coupling which represents an ideal balance between a) the ability of the agents to explore the loss function, and b) concentration of the distribution of final iterates. Pushing $k$ too high will lead to convergence near the initialization of $\mathbf{x}^{\bullet}$ and ultimately to reduced smoothing of the loss function, while setting $k$ too low will lead to poor convergence of the quorum variable due to a lack of clustering of the agents. Intermediate values of $k$ lead to concentration of the agents around deep and wide minima of the smoothed loss, which will generally lie close to the minima of the uncorrupted loss; convergence can be improved from here with a learning rate schedule.

The optimum value of $k$ is set by the size of the gradients in comparison to the noise level. In the simulation setup used here, this corresponds to a tradeoff between the value of $F$ , which sets the gradient magnitudes, and the width of the noise distribution. By setting the width of the noise distribution very high, the optimum $k$ value can be shifted to a large value, so that numerical stability issues arise before performance begins to degrade. Similarly, with small width and small $F$ , the optimum value of $k$ can be very small. In Sec. 6, we will see a manifestation of a similar phenomenon in deep networks for the testing loss.

5 Convergence analysis

We now provide contraction-based convergence proofs for QSGD and EASGD in the strongly convex setting. In the original work on EASGD, rigorous bounds were found for multivariate quadratic objectives in discrete-time, and the analysis for a general strongly convex objective was restricted to an inequality on the iteration for several relevant variances [Zhang et al., 2015]. The results in this section thus extend previously available convergence results for EASGD, and contain new results for QSGD. We furthermore present convergence results for QSGD with momentum.

A significant theme of this section is that the general methodology of Thm. 3.2 can be applied to produce bounds on the expected distance of the quorum variable from the global minimizer of a strongly convex function, again split into a sum of two terms, one based on the averaged noise and one based on bounding the distortion vector $\boldsymbol{\epsilon}$ . We also demonstrate in this section that an optimality result obtained for EASGD in discrete-time in Zhang et al. [2015] can be obtained through a straightforward application of stochastic calculus in continuous-time, and that the same result applies for QSGD.

5.1 QSGD convergence analysis

We first present a simple lemma describing convergence of deterministic distributed gradient descent with arbitrary coupling.

Lemma 5.1.

Consider the all-to-all coupled system of ordinary differential equations

[TABLE]

with $\mathbf{x}^{i}\in\mathbb{R}^{n}$ for $i=1,\ldots,p$ . Assume that $-\nabla f-p\mathbf{u}$ is contracting in some metric with rate $\lambda_{1}$ , and that $-\nabla f$ is contracting in some (not necessarily the same) metric with rate $\lambda_{2}$ . Then all $\mathbf{x}^{i}$ globally exponentially converge to a critical point of $f$ .

Proof.

Consider the virtual system

[TABLE]

This system is contracting by assumption, and each of the individual agents is a particular solution. The agents therefore globally exponentially synchronize with rate $\lambda_{1}$ . After this exponential transient, the dynamics of each agent is described by the reduced-order virtual system

[TABLE]

By assumption, this system is contracting in some metric with rate $\lambda_{2}$ , and has a particular solution at any critical point $\mathbf{x}^{*}$ such that $\nabla f(\mathbf{x}^{*})=0$ . ∎

*Remark 5.1.*

This simple lemma demonstrates that any form of coupling can be used so long as the quantity $-\nabla f(\mathbf{y})-p\mathbf{u}(\mathbf{y})$ is contracting to guarantee exponential convergence to a critical point. A simple choice is $\mathbf{u}(\mathbf{x}^{j})=\frac{k}{p}\mathbf{x}^{j}$ where $k$ is the coupling gain, corresponding to balanced and equal-strength all-to-all coupling. Then (45) can be simplified to

[TABLE]

which is QSGD without noise. Note that all-to-all coupling can thus be implemented with only $2p$ directed connections by communicating with the center of mass variable.
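The equivalence noted in this remark can be checked numerically. The sketch below assumes the noise-free QSGD form $\dot{\mathbf{x}}^{i}=-\nabla f(\mathbf{x}^{i})+k(\mathbf{x}^{\bullet}-\mathbf{x}^{i})$, which is what pairwise coupling with $\mathbf{u}(\mathbf{x})=\frac{k}{p}\mathbf{x}$ reduces to. The quadratic test objective, the forward-Euler step size, and all function names are illustrative choices rather than anything taken from the paper.

```python
import numpy as np

def grad_f(x):
    # Gradient of the illustrative objective f(x) = 0.5 * ||x||^2 (an assumption).
    return x

def all_to_all_step(X, k, dt):
    # X has shape (p, n). Pairwise coupling sum_j (u(x^j) - u(x^i)) with u(x) = (k/p) x.
    p = X.shape[0]
    coupling = np.zeros_like(X)
    for i in range(p):
        for j in range(p):
            coupling[i] += (k / p) * (X[j] - X[i])
    return X + dt * (-grad_f(X) + coupling)

def center_of_mass_step(X, k, dt):
    # Same dynamics, but each agent only communicates with the mean x_bar.
    x_bar = X.mean(axis=0)
    return X + dt * (-grad_f(X) + k * (x_bar - X))

rng = np.random.default_rng(0)
X0 = rng.normal(size=(5, 3))
Xa, Xb = X0.copy(), X0.copy()
for _ in range(200):
    Xa = all_to_all_step(Xa, k=2.0, dt=0.01)
    Xb = center_of_mass_step(Xb, k=2.0, dt=0.01)

max_gap = np.abs(Xa - Xb).max()                      # difference between the two formulations
initial_spread = np.abs(X0 - X0.mean(axis=0)).max()  # disagreement among agents at t = 0
final_spread = np.abs(Xb - Xb.mean(axis=0)).max()    # disagreement among agents at t = 2
```

The two update rules agree to floating-point precision, and the spread among agents shrinks over the run, which is the synchronization the lemma describes for this simple objective.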

*Remark 5.2.*

If $f$ is $l$ -strongly convex, $-\nabla f$ will be contracting in the identity metric with rate $l$ .

*Remark 5.3.*

If $f$ is locally $l$ -strongly convex, $-\nabla f$ will be locally contracting in the identity metric with rate $l$ . For example, for a non-convex objective with initializations $\mathbf{x}^{i}(0)$ in a strongly convex region of parameter space, we can conclude exponential convergence to a local minimizer for each agent.

If $f$ is strongly convex, the coupling between agents provides no advantage in the deterministic setting, as they would individually contract towards the minimum regardless. For stochastic dynamics, however, coupling can improve convergence. We now demonstrate the ramifications of the results in Sec. 3 in the context of QSGD agents with the following theorem.

Theorem 5.1.

Consider the QSGD algorithm

[TABLE]

with $\mathbf{x}^{i}\in\mathbb{R}^{n}$ for $i=1,\ldots,p$ . Assume that the conditions in Assumption 2.1 hold, that $\mathbf{B}\mathbf{B}^{T}=\boldsymbol{\Sigma}$ is bounded such that $Tr(\boldsymbol{\Sigma})\leq C$ uniformly, and that $f$ is $\lambda$ -strongly convex. Then, after exponential transients of rate $\lambda$ and $\lambda+k$ , the expected difference between the center of mass trajectory $\mathbf{x}^{\bullet}$ and the global minimizer $\mathbf{x}^{*}$ of $f$ is given by

[TABLE]

Proof.

We first sum the dynamics of the individual agents to compute the dynamics of the center of mass variable. This leads to the SDE

[TABLE]

with $\boldsymbol{\epsilon}=\nabla f(\mathbf{x}^{\bullet})-\frac{1}{p}\sum_{i}\nabla f(\mathbf{x}^{i})$ and $\mathbf{T}\mathbf{T}^{T}=\sum_{i}\boldsymbol{\Sigma}(\mathbf{x}^{i})$ defined exactly as in Sec. 3. Consider the hierarchy of virtual systems

[TABLE]

The $\mathbf{y}^{1}$ system is contracting by assumption, and admits a particular solution $\mathbf{y}^{1}=\mathbf{x}^{*}$ . As in the proof of Lemma 2.1, we can write with $R=\|\mathbf{y}^{1}-\mathbf{y}^{2}\|$ ,

[TABLE]

which shows that $R$ is bounded. Hence, by dominated convergence,

[TABLE]

As shown in Sec. 3, $\mathbb{E}[\|\boldsymbol{\epsilon}\|]\leq\frac{Q(p-1)C\eta\sqrt{n}}{4p(\lambda+k)b}$ after exponential transients of rate $\lambda+k$. (Footnote 7: In Sec. 3, the denominator contained the factor $k-\bar{\lambda}$ rather than $k+\lambda$. Strong convexity of $f$ was not assumed, so that the contraction rate of the coupled system was $k-\bar{\lambda}$. In this proof, strong convexity of $f$ implies that the contraction rate of the coupled system is $k+\lambda$.) Hence by Lemma 2.1, the difference between the $\mathbf{y}^{1}$ and $\mathbf{y}^{2}$ systems can be bounded as

[TABLE]

after exponential transients of rate $\lambda$ . The $\mathbf{y}^{2}$ system is contracting for any input $\boldsymbol{\epsilon}$ , and the $\mathbf{y}^{3}$ system is identical with the addition of an additive noise term. By Cor. 2.1, after exponential transients of rate $\lambda$ ,

[TABLE]

By Jensen’s inequality, and noting that $\sqrt{\cdot}$ is a concave, increasing function,

[TABLE]

Finally, note that $\mathbf{x}^{\bullet}$ is a particular solution of the $\mathbf{y}^{3}$ virtual system. From these observations and an application of the triangle inequality, after exponential transients,

[TABLE]

This completes the proof. ∎

As in Sec. 3, the bound (47) consists of two terms. The first term originates from a lack of complete synchronization and can be decreased by increasing $k$ . The second term comes from the additive noise, and can be decreased by increasing the number of agents. Both terms can be decreased by decreasing $\frac{\eta}{b}$ , as this ratio sets the magnitude of the noise, and hence the size of both the disturbance and the noise term.
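The qualitative content of this bound can be illustrated with a small Monte Carlo experiment. The sketch below assumes the scalar QSGD form $dx^{i}=\left(-f'(x^{i})+k(x^{\bullet}-x^{i})\right)dt+\sigma\,dW^{i}$, where the constant `noise` stands in for $\sqrt{\eta/b}\,\mathbf{B}$. The objective $f(x)=\frac{\lambda}{2}x^{2}+\log\cosh x$ is a hypothetical choice that is $\lambda$-strongly convex with minimizer $x^{*}=0$; the integrator and all parameter values are likewise illustrative.

```python
import numpy as np

def simulate_qsgd(p, k, lam=1.0, noise=1.0, dt=0.01, steps=1000, trials=1000, seed=0):
    """Euler-Maruyama simulation of scalar QSGD agents (assumed form, see text).

    Returns a Monte Carlo estimate of E|x_bar - x*| at the final time, with x* = 0.
    """
    rng = np.random.default_rng(seed)
    x = np.ones((trials, p))                      # every agent starts at x = 1
    for _ in range(steps):
        x_bar = x.mean(axis=1, keepdims=True)     # center of mass per trial
        grad = lam * x + np.tanh(x)               # gradient of lam*x^2/2 + log(cosh(x))
        drift = -grad + k * (x_bar - x)
        x = x + dt * drift + noise * np.sqrt(dt) * rng.normal(size=x.shape)
    return np.abs(x.mean(axis=1)).mean()

dev_single = simulate_qsgd(p=1, k=0.0)     # plain SGD-like dynamics
dev_coupled = simulate_qsgd(p=16, k=5.0)   # sixteen coupled agents
```

With these settings the coupled center of mass should sit noticeably closer to the minimizer than the single agent, consistent with the noise-averaging term decreasing in $p$.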

State- and time-dependent couplings of the form $k(\mathbf{x}^{\bullet},t)$ are also immediately applicable with the proof methodology above. For example, increasing $k$ over time can significantly decrease the influence of the first term in (47), leaving only a bound essentially equivalent to linear noise averaging. For non-convex objectives, this suggests choosing low values of $k(\mathbf{x}^{\bullet},t)$ in the early stages of training for exploration, and larger values near the end of training to reduce the variance of $\mathbf{x}^{\bullet}$ around a minimum. By the synchronization and noise argument in Sec. 3 and the considerations in Sec. 4, this will also have the effect of improving convergence to a minimum of the true loss function, rather than the smoothed loss. If accessible, local curvature information could be used to determine when $\mathbf{x}^{\bullet}$ is near a local minimum, and therefore when to increase $k$ . Using state- and time-dependent couplings would change the duration of exponential transients, but the result in Thm. 5.1 would still hold.
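One way to realize such a time-dependent coupling is a smooth ramp from a low to a high gain. The following is only a sketch: the logistic shape, the function name `coupling_gain`, and the default values are hypothetical and are not prescribed by the analysis, which only requires a gain of the form $k(\mathbf{x}^{\bullet},t)$.

```python
import math

def coupling_gain(t, t_total, k_low=0.5, k_high=10.0, sharpness=10.0):
    """Hypothetical schedule: weak coupling early (exploration), strong coupling late.

    t and t_total are in the same units (e.g. iterations); the switch is centered
    at the midpoint of training and its steepness is set by `sharpness`.
    """
    s = t / t_total
    gate = 1.0 / (1.0 + math.exp(-sharpness * (s - 0.5)))
    return k_low + (k_high - k_low) * gate

gains = [coupling_gain(t, 100) for t in range(0, 101, 10)]
```

A curvature-triggered variant would replace the fixed midpoint with a test on a local curvature estimate, as the text suggests, when such an estimate is available.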

It is worth comparing the bound (47) to a bound obtained with the same methodology for standard SGD. With the stochastic dynamics

[TABLE]

and the same assumptions as in Thm. 5.1, the expected difference after exponential transients between a critical point of $f$ and the stochastic $\mathbf{x}$ is given by Cor. 2.1 and an application of Jensen’s inequality as

[TABLE]

In the distributed, synchronized case described by Thm. 5.1, the deviation is reduced by a factor of $\frac{1}{\sqrt{p}}$ in exchange for an additional additive term. This additive term is related to the noise strength $\frac{C\eta}{b}$ , the bound $Q$ , the number of parameters $n$ , and is divided by $\lambda(\lambda+k)$ - i.e., is smaller for more strongly convex functions and with more synchronized dynamics.

5.2 EASGD convergence analysis

We now incorporate the additional dynamics present in the EASGD algorithm. First, we prove a lemma demonstrating convergence to the global minimum of a strongly convex function in the deterministic setting.

Lemma 5.2.

Consider the deterministic continuous-time EASGD algorithm

[TABLE]

with $\mathbf{x}^{i}\in\mathbb{R}^{n}$ for $i=1,\ldots,p$ . Assume $f$ is $\lambda$ -strongly convex. Then all agents and the quorum variable $\tilde{\mathbf{x}}$ globally exponentially converge to the unique global minimum $\mathbf{x}^{*}$ with rate

[TABLE]

Proof.

By Thm. 3.1 and strong convexity of $f$ , the individual $\mathbf{x}^{i}$ trajectories globally exponentially synchronize with rate $\lambda+k$ . On the synchronized subspace, the system can be described by the reduced-order virtual system

[TABLE]

The system Jacobian is then given by

[TABLE]

Choosing a metric transformation $\boldsymbol{\Theta}=\begin{pmatrix}\sqrt{p}\mathbf{I}&0\\ 0&\mathbf{I}\end{pmatrix}$ , the generalized Jacobian becomes

[TABLE]

which is clearly symmetric. A sufficient condition for negative definiteness of this matrix is that $\left(\lambda+k\right)kp>k^{2}p$ [Wang and Slotine, 2005, Horn and Johnson, 2012]. Rearranging leads to the condition $\lambda>0$, which is satisfied by strong convexity of $f$. The virtual system is therefore contracting. Finally, note that $\mathbf{y}=\tilde{\mathbf{x}}=\mathbf{x}^{*}$, where $\mathbf{x}^{*}$ is the unique global minimum, is a particular solution. All trajectories thus globally exponentially converge to this minimum. The lower bound on the contraction rate in the statement of the lemma can be found by applying the result in Slotine [2003], Example 3.8. ∎
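The metric argument in this proof can be sanity-checked numerically in the scalar case. The sketch below assumes the synchronized EASGD dynamics $\dot{y}=-f'(y)+k(\tilde{x}-y)$, $\dot{\tilde{x}}=kp(y-\tilde{x})$, a form inferred from the negative-definiteness condition above rather than copied from the displayed equations. The value `h` stands for the curvature $f''(y)\geq\lambda$, and the function name is hypothetical.

```python
import numpy as np

def easgd_generalized_jacobian(h, k, p):
    """Scalar (n = 1) generalized Jacobian Theta J Theta^{-1} for synchronized EASGD.

    Assumes  y' = -f'(y) + k (x_tilde - y),  x_tilde' = k p (y - x_tilde),
    with h = f''(y) the local curvature.
    """
    J = np.array([[-(h + k), k],
                  [k * p, -k * p]])
    theta = np.diag([np.sqrt(p), 1.0])            # metric transformation from the proof
    return theta @ J @ np.linalg.inv(theta)

lam, k, p = 0.5, 3.0, 8
F = easgd_generalized_jacobian(h=lam, k=k, p=p)   # worst case: curvature equal to lam
eigs = np.linalg.eigvalsh(0.5 * (F + F.T))        # eigenvalues of the symmetric part
```

For any $\lambda>0$ the largest eigenvalue is negative, and its magnitude gives a numerical contraction rate in this metric for the chosen $k$ and $p$.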

Just as in Thm. 5.1, we now turn to a convergence analysis for the EASGD algorithm using the results of Lemma 5.2.

Theorem 5.2.

Consider the continuous-time EASGD algorithm

[TABLE]

for $i=1,\ldots,p$ . Assume that $f$ is $\lambda$ -strongly convex and that the conditions in Assumption 2.1 are satisfied. Let $\gamma$ denote the contraction rate of the deterministic, fully synchronized EASGD system in the metric $\mathbf{M}=\boldsymbol{\Theta}^{T}\boldsymbol{\Theta}$ with $\boldsymbol{\Theta}$ the metric transformation from Lemma 5.2, as lower bounded in (50). Further assume that $Tr(\mathbf{B}^{T}\mathbf{M}\mathbf{B})\leq C(p)$ with $C$ a positive constant potentially dependent on $p$ through the dependence of $\mathbf{M}$ on $p$ . Then, after exponential transients of rate $\gamma$ and $\lambda+k$ ,

[TABLE]

where $\mathbf{z}=\left(\mathbf{x}^{\bullet},\tilde{\mathbf{x}}\right)$ and $\mathbf{z}^{*}=\left(\mathbf{x}^{*},\mathbf{x}^{*}\right)$ .

Proof.

Adding up the agent dynamics, the center of mass trajectory follows

[TABLE]

with the usual definitions of $\boldsymbol{\epsilon}$ and $\mathbf{T}$ . Consider the hierarchy of virtual systems,

[TABLE]

The first system is contracting towards the unique global minimum with rate $\gamma$ by the assumptions of the theorem and Lemma 5.2. The second system is contracting for any external input $\boldsymbol{\epsilon}$ , and we have already bounded $\mathbb{E}[\|\boldsymbol{\epsilon}\|]$ in Sec. 3 (the application of the bound is independent of the dynamics of the quorum variable - see App. A for details). Let $\mathbf{z}^{i}=\begin{pmatrix}\mathbf{y}^{i},\tilde{\mathbf{y}}^{i}\end{pmatrix}$ and $\mathbf{z}^{*}=\begin{pmatrix}\mathbf{x}^{*},\mathbf{x}^{*}\end{pmatrix}$ . By an identical argument as in the proof of Thm. 5.1 and noting that the condition number of $\boldsymbol{\Theta}$ is $\sqrt{p}$ ,

[TABLE]

after exponential transients of rate $\gamma$ and $\lambda+k$ . Note that $\lambda_{\text{min}}(\mathbf{M})=1$ . Hence we can take $\beta=1$ in Cor. 2.1, and

[TABLE]

after exponential transients of rate $\gamma$ . Combining these results via the triangle inequality and noting that $\mathbf{x}^{\bullet},\tilde{\mathbf{x}}$ is a solution to the $\mathbf{y}^{3},\tilde{\mathbf{y}}^{3}$ virtual system, we find that after exponential transients of rate $\gamma$ ,

[TABLE]

where $\mathbf{z}=\begin{pmatrix}\mathbf{x}^{\bullet},\tilde{\mathbf{x}}\end{pmatrix}$ . ∎

Thm. 5.2 demonstrates an explicit bound on the expected deviation of both the center of mass variable $\mathbf{x}^{\bullet}$ and the quorum variable $\tilde{\mathbf{x}}$ from the global minimizer of a strongly convex function. As in the discussion after Thm. 5.1, the results will still hold with state- and time-dependent couplings of the form $k=k(\tilde{\mathbf{x}},t)$ , and the same ideas suggested for QSGD based on increasing $k$ over time can be used to eliminate the effect of the first term in the bound.

Thm. 5.2 is strictly weaker than Thm. 5.1. The metric transformation used adds a factor of $\sqrt{p}$ to the first quantity in the bound, and the assumption $Tr(\mathbf{B}^{T}\mathbf{M}\mathbf{B})\leq C(p)$ now depends on $p$ through the factor of $p$ in the top-left block of $\mathbf{M}$ . Indeed, writing the matrix $\mathbf{B}$ in $n\times n$ block form, $Tr(\mathbf{B}^{T}\mathbf{M}\mathbf{B})=C+(p-1)Tr(\mathbf{B}_{11}^{T}\mathbf{B}_{11}+\mathbf{B}_{12}^{T}\mathbf{B}_{12})$ where $C=Tr(\mathbf{B}^{T}\mathbf{B})$ as in Thm. 5.1. Thus, the dependence of $C(p)$ on $p$ is in general linear.

Because of this linear dependence on $p$ , the first term in the bound scales like $p^{3/2}$ , while the second is asymptotically independent of $p$ . This is not the case in Thm. 5.1, where the first term is asymptotically independent of $p$ , and the second term scales like $\frac{1}{p}$ . The unfavorable scaling of the bound in Thm. 5.2 with $p$ implies that higher values of $p$ do not improve convergence for EASGD as they do for QSGD. These issues can be avoided by reformulating Lemma 5.2 in the Euclidean metric, but this leads to the fairly strong restriction $k<\frac{4\lambda p}{(p-1)^{2}}$ .
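The Euclidean-metric restriction can be recovered with a short calculation. The following sketch assumes that the synchronized EASGD Jacobian has the block form implied by the negative-definiteness condition in the proof of Lemma 5.2, evaluated in the worst case $\nabla^{2}f=\lambda\mathbf{I}$ and for $p>1$; the block form is inferred from that condition, not copied from the paper's displayed equation.

```latex
\mathbf{J}_{s}=\frac{1}{2}\left(\mathbf{J}+\mathbf{J}^{T}\right)
=\begin{pmatrix}
-(\lambda+k)\mathbf{I} & \frac{k(p+1)}{2}\mathbf{I}\\
\frac{k(p+1)}{2}\mathbf{I} & -kp\,\mathbf{I}
\end{pmatrix},
\qquad
\mathbf{J}_{s}\prec 0
\;\Longleftrightarrow\;
(\lambda+k)kp>\frac{k^{2}(p+1)^{2}}{4}
\;\Longleftrightarrow\;
4\lambda p>k\left[(p+1)^{2}-4p\right]=k(p-1)^{2}
\;\Longleftrightarrow\;
k<\frac{4\lambda p}{(p-1)^{2}}.
```

The admissible range of $k$ shrinks roughly like $4\lambda/p$ for large $p$, which is why the restriction is described as fairly strong.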

These observations highlight potential convergence issues for EASGD with large $p$ which are not present with QSGD. In line with these theoretical conclusions, we will empirically find stricter stability conditions on $k$ for EASGD when compared to QSGD for training deep networks in Sec. 6. Nevertheless, in the context of nonconvex optimization, higher values of $p$ can still lead to improved performance by affording increased parallelization of the problem and exploration of the landscape.

Less significantly, unlike in Thm. 5.1, the bound in Thm. 5.2 is applied to the combined vector $\mathbf{z}$ rather than the quorum variable $\tilde{\mathbf{x}}$ itself, and the contraction rate $\gamma$ is used rather than $\lambda$ in the virtual system bounds. (Footnote 8: The factor of $\lambda+k$ in the first term remains, as this factor originates in the derivation of the bound on $\mathbb{E}\left[\|\boldsymbol{\epsilon}\|\right]$, where the synchronization rate is $\lambda+k$.) Both of these facts weaken the result when compared to Thm. 5.1. The rate $\gamma$ will in general be less than $\lambda$, as exemplified by the lower bound (50).

5.3 QSGD with momentum convergence analysis

We now present a proof of convergence for the QSGD algorithm with momentum. We first prove a lemma demonstrating convergence to the global minimum of a strongly convex, $\bar{\lambda}$ -smooth function. We consider the case of coupling only in the position variables; coupling additionally through the momentum variables is similar. We also restrict to the case of constant momentum coefficient for simplicity.

Lemma 5.3.

Consider the deterministic continuous-time QSGD with momentum algorithm

[TABLE]

with $\mathbf{x}^{i}_{j}\in\mathbb{R}^{n}$ for $i=1,\ldots,p$ . Assume that $f$ is $\underline{\lambda}$ -strongly convex and $\bar{\lambda}$ -smooth. For $\mu>2\sqrt{\underline{\lambda}+\bar{\lambda}-2\sqrt{\underline{\lambda}\bar{\lambda}}}$ and $k>\frac{1}{4\mu}\max\left((1-\bar{\lambda})^{2},(1-\underline{\lambda})^{2}\right)$ , all agents globally exponentially converge to the unique minimum with zero velocity, $(\mathbf{x}^{i}_{1},\mathbf{x}^{i}_{2})\rightarrow(\mathbf{x}^{*},0)$ for all $i$ . The exponential convergence rate $\kappa$ can be lower bounded as

[TABLE]

with $\delta=\delta(\mu)\in(0,1)$ .

Proof.

By Lemma 3.2 and according to the assumption on $k$ , the agents will globally exponentially synchronize with rate $\xi$ , where $\xi$ may be lower bounded as in (35). On the synchronized subspace, the overall system can be described by the virtual system

[TABLE]

where the superscript has been omitted and the coupling term vanishes. Note that this system admits the particular solution $(\mathbf{x}_{1},\mathbf{x}_{2})=(\mathbf{x}^{*},0)$ . This system has Jacobian

[TABLE]

which is clearly not contracting. Define the metric transformation $\boldsymbol{\Theta}=\begin{pmatrix}a\mathbf{I}&0\\ \delta\mu\mathbf{I}&\mathbf{I}\end{pmatrix}$ with $0<\delta<1$ and $a\in\mathbb{R}$ . The resulting symmetric part of the generalized Jacobian is given by

[TABLE]

For contraction, we require that

[TABLE]

Choosing

[TABLE]

ensures that the two arguments of the max are equal. For $a$ to be real, we require that $\mu<\sqrt{\frac{\underline{\lambda}+\bar{\lambda}}{2(1-\delta)\delta}}$ . The condition for contraction then reads that

[TABLE]

leading to the condition on $\mu$ ,

[TABLE]

The lower bound is always real and positive by the arithmetic-geometric mean inequality. There is always a gap between the lower and upper bound, regardless of which argument of the $\min$ is chosen in the upper bound. The lower bound is minimized for $\delta=\frac{1}{2}$ , leading to the condition that $\mu>2\sqrt{\underline{\lambda}+\bar{\lambda}-2\sqrt{\bar{\lambda}\underline{\lambda}}}$ . With $\mu$ satisfying this minimal lower bound, the valid range of $\mu$ can be shifted arbitrarily large by choosing

[TABLE]

with $\alpha>0$ an arbitrarily small positive constant, thus eliminating the upper bound. The lower bound on the contraction rate $\kappa$ of the system can be obtained by application of the result in Slotine [2003], Example 3.8. ∎

Note that in general, so long as $\mu$ is chosen to satisfy the lower bound of the preceding lemma, the QSGD with momentum system will be contracting in some metric. The given metric will depend on the value of $\delta(\mu)$ , for example chosen as suggested in the proof.
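The two sufficient conditions in Lemma 5.3 are easy to evaluate for given constants. The helper below is a minimal sketch with a hypothetical name and signature; it only transcribes the stated lower bounds on $\mu$ and $k$. Since $\underline{\lambda}+\bar{\lambda}-2\sqrt{\underline{\lambda}\bar{\lambda}}=(\sqrt{\bar{\lambda}}-\sqrt{\underline{\lambda}})^{2}$, the bound on $\mu$ is equivalently $\mu>2(\sqrt{\bar{\lambda}}-\sqrt{\underline{\lambda}})$.

```python
import math

def momentum_conditions(lam_lo, lam_hi, mu):
    """Lower bounds on mu and k from Lemma 5.3.

    lam_lo: strong convexity constant, lam_hi: smoothness constant,
    mu: momentum coefficient used to evaluate the bound on k.
    Returns (mu_min, k_min).
    """
    # max(..., 0.0) guards against tiny negative round-off when lam_lo == lam_hi.
    gap = max(lam_lo + lam_hi - 2.0 * math.sqrt(lam_lo * lam_hi), 0.0)
    mu_min = 2.0 * math.sqrt(gap)
    k_min = max((1.0 - lam_hi) ** 2, (1.0 - lam_lo) ** 2) / (4.0 * mu)
    return mu_min, k_min

mu_min, k_min = momentum_conditions(lam_lo=1.0, lam_hi=4.0, mu=3.0)
```

For $\underline{\lambda}=1$, $\bar{\lambda}=4$ and $\mu=3$, this gives the thresholds $\mu>2$ and $k>0.75$.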

With Lemma 5.3 in hand, we can now state a convergence result for QSGD with momentum.

Theorem 5.3.

Consider the continuous-time QSGD with momentum algorithm,

[TABLE]

for $i=1,\ldots,p$ . Assume that the conditions of Lemma 5.3 are satisfied, and that the conditions of Assumption 2.1 are met. Let $\kappa$ denote the contraction rate of the deterministic fully synchronized QSGD with momentum system as lower bounded in (52) and let $\xi$ denote the synchronization rate of the QSGD with momentum system as lower bounded in (35). Further assume that $Tr(\mathbf{B}^{T}\mathbf{M}\mathbf{B})\leq C$ with $C>0$ where $\mathbf{M}=\boldsymbol{\Theta}^{T}\boldsymbol{\Theta}$ and $\boldsymbol{\Theta}$ is the metric transformation from Lemma 5.3. Let $\psi=\frac{a+\delta^{2}\mu^{2}-1-\sqrt{(a-1)^{2}+2(a+1)\delta^{2}\mu^{2}+\delta^{4}\mu^{4}}}{2\delta\mu}$ denote the minimum eigenvalue of $\mathbf{M}$ with $a$ given by (53). Then, after exponential transients of rate $\kappa$ and $\xi$ , with $\mathbf{z}=\begin{pmatrix}\mathbf{x}^{\bullet}_{1},\mathbf{x}^{\bullet}_{2}\end{pmatrix}$ and $\mathbf{z}^{*}=\begin{pmatrix}\mathbf{x}^{*},0\end{pmatrix}$

[TABLE]

Proof.

Summing the agent dynamics, the center of mass trajectory follows

[TABLE]

with the usual definition of $\boldsymbol{\epsilon}$ and $\mathbf{T}$ . Consider an analogous hierarchy of virtual systems as in Thms. 5.1 and 5.2,

[TABLE]

The first system is contracting towards the global minimum with zero velocity and will arrive after exponential transients of rate $\kappa$ by the assumptions of the theorem and by Lemma 5.3. The second system is contracting for any external input $\boldsymbol{\epsilon}$ , and as argued in Sec. 3, the bound on $\mathbb{E}\left[\|\boldsymbol{\epsilon}\|\right]$ can be applied as-is to the momentum system with a suitable replacement of contraction rates. As in Thm. 5.1, and noting that the condition number of $\boldsymbol{\Theta}$ is $a$ as given in (53),

[TABLE]

after exponential transients of rate $\kappa$ and $\xi$ . Similarly, an application of Cor. 2.1 gives

[TABLE]

after exponential transients of rate $\kappa$ , where we have noted that $\mathbf{x}^{T}\mathbf{M}\mathbf{x}\geq\psi\|\mathbf{x}\|^{2}$ . An application of the triangle inequality leads to the result. ∎

(54) is similar to the results for EASGD and QSGD. The bound is closer in spirit to the bound for QSGD without momentum, in that the two terms do not have poor dependencies on $p$ as they do for EASGD. However, the statement of the theorem is complicated by the expressions for the contraction rates $\kappa$ and $\xi$ , the expression for the minimum eigenvalue of the metric $\psi$ , and the expression for $a$ in the metric transformation. Together, these four quantities create a more complicated dependence of the bound on hyperparameters such as $\mu$ and $k$ . Nevertheless, the spirit is still the same as Thm. 5.1, in that the first term originates from the $\boldsymbol{\epsilon}$ disturbance and can be eliminated with synchronization, while the second term originates from the additive noise and can be eliminated by including additional agents.

5.4 Extensions to other distributed structures

Similar results can be derived for many other possible distributed structures in an identical manner. We present one general formalism here, involving local state- and time-dependent couplings.

Lemma 5.4.

The state-dependent all-to-all coupled system

[TABLE]

will globally exponentially synchronize with rate

[TABLE]

whenever this value is positive.

Proof.

The weighted sum $\sum_{j}k_{j}(\mathbf{x}^{j},t)\mathbf{x}^{j}$ now plays the role of the quorum variable, so that one has

[TABLE]

The virtual system

[TABLE]

shows that the individual $\mathbf{x}^{i}$ trajectories globally exponentially synchronize if the conditions of the lemma are met. ∎

We note that the condition (56) is independent of the number of agents. With noise, the center of mass of (55) satisfies

[TABLE]

where now $\boldsymbol{\epsilon}=\nabla f(\mathbf{x}^{\bullet})-\frac{1}{p}\sum_{i}\nabla f(\mathbf{x}^{i})+\sum_{j}k_{j}(\mathbf{x}^{j},t)\mathbf{x}^{j}-\mathbf{x}^{\bullet}\sum_{j}k_{j}(\mathbf{x}^{j},t)$ . As usual, $\boldsymbol{\epsilon}\rightarrow 0$ in the fully synchronized state.

Individually state-dependent couplings of the form (55) or its quorum-mediated equivalent (57) allow for individual gain schedules that depend on local cost values or other local performance measures. This can allow each agent to broadcast its current measure of success and shape the quorum variable accordingly. For example, the classification accuracy on a validation set for each $\mathbf{x}^{i}$ could be used to select the current best parameter vectors, and to increase the corresponding $k_{i}$ values to pull other agents towards them.

5.5 Specialization to a multivariate quadratic objective

In the original discrete-time analysis of EASGD in Zhang et al. [2015], it was proven that iterate averaging [Polyak and Juditsky, 1992] of $\tilde{\mathbf{x}}$ leads to an optimal variance around the minimum of a quadratic objective. We now derive an identical result in continuous-time for the QSGD algorithm, demonstrating that this optimality is independent of the additional dynamics in the EASGD algorithm.

For a multivariate quadratic $f(\mathbf{x})=\mathbf{x}^{T}\mathbf{A}\mathbf{x}$ with $\mathbf{A}$ symmetric and positive definite, the stochastic dynamics of each agent can be written

[TABLE]

To make the optimal result more clear, we group the factor of $\sqrt{\frac{\eta}{b}}$ into the definition of $\mathbf{B}$ , unlike in previous sections. We furthermore relax the state-dependence of $\mathbf{B}$ in this section, and assume it to be a constant matrix; this matches the case handled in Zhang et al. [2015].

The assumption of state-independence can be justified in several ways. Theoretical analyses have demonstrated that the specific form of positive semi-definite $\mathbf{B}$ does not affect the $\mathcal{O}(\eta)$ weak accuracy of the approximating SDE (2) for SGD [Li et al., 2018, Feng et al., 2018, Hu et al., 2017], though it does affect the constant. (Footnote 9: The state-dependent version used earlier in this work has been empirically shown to have a lower constant [Li et al., 2018], and is closer to the $\mathcal{O}(\eta^{2})$ approximating SDE, which is why it has been utilized up to this point.) For relevance to general nonconvex optimization, we can assume that all agents have arrived sufficiently close to a minimum of the loss function that it can be approximately represented as a quadratic, and that the noise covariance is approximately constant [Mandt et al., 2016, 2017]. For deep networks, the noise covariance has been empirically shown to align with the Hessian of the loss [Sagun et al., 2017, Zhu et al., 2018], with theoretical justification for when this is valid provided in Appendix A of Jastrzȩbski et al. [2017]. For all agents in an approximately quadratic basin of a local minimum of a deep network, $\mathbf{B}$ can then be taken to be constant such that $\mathbf{B}\mathbf{B}^{T}=\mathbf{A}$, where $\mathbf{A}$ is the approximately state-independent Hessian.

With this assumption, $\mathbf{x}^{\bullet}$ satisfies

[TABLE]

This is a multivariate Ornstein-Uhlenbeck process with solution

[TABLE]

By assumption, $-\mathbf{A}$ is negative definite, so that the stationary expectation $\lim_{t\rightarrow\infty}\mathbb{E}[\mathbf{x}^{\bullet}(t)]=0$ . The stationary variance $\mathbf{V}$ is given by

[TABLE]

(see, for example, Gardiner [2009], p.107). We now define

[TABLE]

and can immediately state the following lemma.

Lemma 5.5.

The averaged variable $\mathbf{z}(t)$ converges weakly to a normal distribution with mean zero and covariance $\frac{1}{p}\mathbf{A}^{-1}\boldsymbol{\Sigma}\mathbf{A}^{-T}$,

[TABLE]

In particular, for the single-variable case with $\mathbf{A}=h$ and $\boldsymbol{\Sigma}=\sigma^{2}$ ,

[TABLE]

Proof.

From (58),

[TABLE]

The mean of this expression is asymptotically zero. In computing the variance, only the stochastic integral remains. Interchanging the order of integration,

[TABLE]

After an application of Itô’s isometry, the variance is given by

[TABLE]

In the limit, the only nonvanishing quantity after the computation of the integral is the linear term $\boldsymbol{\Sigma}t$ . Then,

[TABLE]

∎

As in the discrete-time EASGD analysis, (59) is optimal in the sense of achieving the Fisher information lower bound, and is independent of the coupling strength $k$ [Zhang et al., 2015, Polyak and Juditsky, 1992]. The lack of dependence on the coupling $k$ is less surprising in this case, as it is not present in the $\mathbf{x}^{\bullet}$ dynamics. The optimality of this result, together with the comparison of Thms. 5.1 and 5.2, suggests that the extra $\tilde{\mathbf{x}}$ dynamics may not provide any benefit over coupling simply through the spatial average variable $\mathbf{x}^{\bullet}$ from the perspective of convex optimization. However, in Sec. 6, we will show through numerical experiments on deep networks that EASGD tends to find networks which generalize better than QSGD. The benefits of EASGD must then go beyond basic optimization, and the extra dynamics may have a regularizing effect.
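The independence of the averaged variance from $k$ can be illustrated by simulation in the single-variable case. The sketch below assumes scalar agent dynamics $dx^{i}=\left(-hx^{i}+k(x^{\bullet}-x^{i})\right)dt+\sigma\,dW^{i}$ and takes $z(t)$ to be the time average of $x^{\bullet}$ over $[0,t]$; both are readings of the setup in this section rather than transcriptions of the displayed equations. The Euler-Maruyama integrator and all parameter values are illustrative.

```python
import numpy as np

def scaled_variance_of_average(p, k, h=1.0, sigma=1.0, dt=0.02, t_final=100.0,
                               trials=2000, seed=0):
    """Estimate t * Var(z(t)), where z(t) is the time average of x_bar on [0, t].

    Agents follow the assumed scalar dynamics
        dx_i = (-h x_i + k (x_bar - x_i)) dt + sigma dW_i.
    """
    rng = np.random.default_rng(seed)
    steps = int(round(t_final / dt))
    x = np.zeros((trials, p))
    running_integral = np.zeros(trials)
    for _ in range(steps):
        x_bar = x.mean(axis=1)
        running_integral += x_bar * dt            # left Riemann sum of x_bar
        drift = -h * x + k * (x_bar[:, None] - x)
        x = x + dt * drift + sigma * np.sqrt(dt) * rng.normal(size=x.shape)
    z = running_integral / t_final
    return t_final * z.var()

target = 1.0 / 4.0                                # sigma^2 / (p h^2) with p = 4, h = sigma = 1
est_uncoupled = scaled_variance_of_average(p=4, k=0.0)
est_coupled = scaled_variance_of_average(p=4, k=5.0)
```

Both estimates should land near $\sigma^{2}/(ph^{2})$ regardless of the coupling gain, because the coupling terms cancel when the agent dynamics are summed.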

We can also make a slightly stronger statement about (59), as in Mandt et al. [2017]. (Footnote 10: A similar continuous-time analysis for the averaging scheme considered here was performed in Mandt et al. [2017] for the non-distributed case; the derivation here is simpler and provides asymptotic results.) If we precondition the stochastic gradients for each agent by the same constant invertible matrix $\mathbf{Q}$, then the stationary variance remains optimal. To see this, note that we can account for this preconditioning simply by modifying the derivation so that $\mathbf{A}\rightarrow\mathbf{Q}\mathbf{A}$ and $\mathbf{B}\rightarrow\mathbf{Q}\mathbf{B}$. Then,

[TABLE]

If different agents are preconditioned by different matrices $\mathbf{Q}^{i}$ , this result will not hold. Using adaptive algorithms based on past iterations for each agent such as AdaGrad [Duchi et al., 2011] thus may eliminate the optimality, as each agent would compute a different preconditioner.
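The invariance under a shared preconditioner follows from $(\mathbf{Q}\mathbf{A})^{-1}\mathbf{Q}\mathbf{B}\mathbf{B}^{T}\mathbf{Q}^{T}(\mathbf{Q}\mathbf{A})^{-T}=\mathbf{A}^{-1}\mathbf{B}\mathbf{B}^{T}\mathbf{A}^{-T}$, and can be confirmed numerically. In the sketch below the random matrices and the helper name `averaged_covariance` are illustrative; $\mathbf{Q}$ is built as a unit upper-triangular matrix so that it is invertible by construction.

```python
import numpy as np

def averaged_covariance(A, B, p):
    """(1/p) A^{-1} B B^T A^{-T}, the limiting covariance stated in Lemma 5.5."""
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ B.T @ A_inv.T / p

rng = np.random.default_rng(1)
n, p = 4, 8
R = rng.normal(size=(n, n))
A = R @ R.T + n * np.eye(n)                           # symmetric positive definite
B = rng.normal(size=(n, n))
Q = np.eye(n) + np.triu(rng.normal(size=(n, n)), 1)   # unit upper triangular, det(Q) = 1

V_plain = averaged_covariance(A, B, p)
V_precond = averaged_covariance(Q @ A, Q @ B, p)      # A -> QA, B -> QB
```

The two matrices agree to floating-point precision, whereas replacing $\mathbf{Q}$ by agent-specific matrices would not produce this cancellation.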

6 Deep network simulations

We now turn to evaluate EASGD, QSGD, and one possible state-dependent variant of QSGD (57) as learning algorithms for training deep neural networks on the CIFAR-10 dataset. A significant goal of the section is to understand the role of synchronization and noise in training deep neural networks. We also seek to test the extensions proposed throughout the paper – such as multiple learning rates, synchronization bounds allowing for independent initial conditions of the agents, and state-dependent coupling.

We obtain two primary results. The first is that less synchronization, when it still leads to reliable convergence of the quorum variable, results in the best generalization capabilities of the learned network. This is similar to the results of the model experiments performed in Sec. 4.3, though those experiments revealed this to be true for general optimization rather than generalization. The observation of better generalization with reduced synchronization is in line with the comments of Sec. 3.3 regarding noise and generalization in deep networks.

Our second primary result is the observation of an interesting regularizing property of EASGD, even in the single-agent case. Unlike QSGD, with a single agent EASGD does not reduce to standard SGD. We find that EASGD without momentum outperforms SGD with momentum and EASGD with momentum in the non-distributed setting.

6.1 Experimental setup

We utilize a three-layer convolutional neural network based on the experiments in Zhang et al. [2015]; each layer consists of a two-dimensional convolution, a ReLU nonlinearity, $2\times 2$ max-pooling with a stride of two, and BatchNorm [Ioffe and Szegedy, 2015] with batch statistics in both training and evaluation. The first convolutional layer has kernel size nine, the second has kernel size five, and the third has kernel size three. All convolutions use a stride of one and zero padding. Following the three convolutional layers there is a single fully-connected layer to which we apply dropout with a probability of $0.5$. The input data is normalized to have mean zero and standard deviation one in each channel in both the training and test sets. Because we are interested in qualitative trends rather than state-of-the-art performance, we do not employ any data augmentation strategies. We use an 80/20 training/validation set split, and we use the cross-entropy loss. The stochastic gradient is computed using mini-batches of size 128. The learning rate is set to $\eta=0.05$ initially unless otherwise specified. This value was chosen as the highest initial value of $\eta$ that remained stable throughout training for most values of $p$, and the qualitative trends presented here were robust to the choice of learning rate (further simulations demonstrating this robustness are available in the SI). We decrease the learning rate three times when the validation loss stalls (footnote 11): first by a factor of five, then by a factor of two the second and third times. This is done on a per-agent basis, i.e., the agents are allowed to maintain different learning rates. As we are focused on the behavior of the algorithms, rather than efficiency from the standpoint of a parallel implementation, the agents communicate with the quorum variable after each update.

Footnote 11: More precisely, we keep track of the validation loss for each agent at a reference point, beginning with the validation loss at the first epoch. If the validation loss at the next epoch changes by more than $1\%$ of the reference point, the reference loss is set to the newly computed validation loss. If the validation loss changes by less than $1\%$, the reference point is unchanged. When the reference point has been unchanged for five epochs, we decrease the learning rate.
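The reference-point rule for detecting a stalled validation loss, described above, can be sketched as a small per-agent scheduler. This is a minimal illustration rather than the authors' code: the class name `PlateauSchedule`, its argument names, and the choice to reset the stall counter after each decrease are assumptions. The text specifies only the $1\%$ threshold, the five-epoch window, and the decrease factors of five, two, and two.

```python
class PlateauSchedule:
    """Per-agent learning-rate schedule based on a validation-loss reference point.

    Hypothetical helper: the names and the counter reset after a decrease
    are assumptions made for illustration.
    """

    def __init__(self, lr=0.05, factors=(5.0, 2.0, 2.0), tol=0.01, patience=5):
        self.lr = lr
        self.factors = list(factors)  # remaining decrease factors, used in order
        self.tol = tol                # relative change that moves the reference
        self.patience = patience      # epochs the reference may stay unchanged
        self.reference = None
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate."""
        if self.reference is None:
            # The first epoch defines the initial reference point.
            self.reference = val_loss
            return self.lr
        if abs(val_loss - self.reference) > self.tol * abs(self.reference):
            # Loss moved by more than 1% of the reference: update the reference.
            self.reference = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        if self.stale_epochs >= self.patience and self.factors:
            # Reference unchanged for `patience` epochs: decrease the rate.
            self.lr /= self.factors.pop(0)
            self.stale_epochs = 0
        return self.lr
```

Each agent would hold its own instance and call `step` once per epoch with its own validation loss, which is what allows the agents to maintain different learning rates.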

In all methods, unless otherwise specified, we use a Nesterov-based momentum scheme [Nesterov, 1983, 2004] with momentum parameter $\delta=0.9$ and coupling only in the position variables. For EASGD, this takes the form [Zhang et al., 2015],

[TABLE]

where $\mathbf{g}$ is the stochastic gradient. The equivalent form for QSGD can be obtained by the replacement $\tilde{\mathbf{x}}_{t}\rightarrow\mathbf{x}^{\bullet}_{t}$ and by dropping the dynamics for $\tilde{\mathbf{x}}$. The update for SD-QSGD is similar,

[TABLE]

In SD-QSGD, we use state-dependent gains $k_{j}=k_{j}(\mathbf{x}^{j},t)$ inspired by a spiking winner-take-all formalism [Wang and Slotine, 2006, Denève and Machens, 2016]. At the start of each epoch, we find the agent with the current minimum validation loss value. Denoting the index of this agent by $j^{*}$, we define

[TABLE]

for $t<t_{f}$, with $k$, $\tau$, $t_{f}$ and $M\geq 1$ fixed constants, and where $t$ is reset to zero at the start of each epoch. Equations (60) and (61) shape the quorum variable to be entirely composed of the single best agent instantaneously at the start of an epoch. The constant $M$ is a magnification factor and sets the size of the force all other agents feel in the direction of the best agent. The gains relax exponentially back to the QSGD formalism, which is obtained when $k_{j}=k/p$ for all $j$. The constant $\tau$ sets the speed of relaxation and $t_{f}$ defines the duration of the spike. At $t=t_{f}$, all $k_{j}$ will have relaxed back to the original value $\frac{k}{p}$ for all $j\neq j^{*}$, and with proper choice of $\tau$, $k_{j^{*}}$ will be very close. We introduce a small discontinuity measured by the magnitude of $(Mp-1)\frac{k}{p}e^{-t_{f}/\tau}$ and simply set $k_{j^{*}}=\frac{k}{p}$ at $t=t_{f}$. We use a value of $M=10$, choose $t_{f}=N_{b}/4$ where $N_{b}$ is the number of batches in an epoch, and choose $\tau=t_{f}/16$, corresponding to a rather rapid spike (footnote 12).

Footnote 12: Another option would be to set $k_{j}=(k-k_{j^{*}})/(p-1)$ when this is positive, and zero otherwise. This ensures, outside of the initial spiking period, that the total sum of the $k_{j}$ is constant. We found similar empirical results with both choices.
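To get a feel for how small the discontinuity at $t=t_{f}$ is, the quantity $(Mp-1)\frac{k}{p}e^{-t_{f}/\tau}$ can be evaluated directly. The sketch below uses the values $M=10$ and $\tau=t_{f}/16$ given in the text. The choices $k=0.04$ and $p=8$ are illustrative picks from the experiments, and the function name is hypothetical.

```python
import math


def spike_discontinuity(k, p, M, tf_over_tau):
    """Magnitude (M*p - 1) * (k/p) * exp(-t_f/tau) of the jump in k_{j*} at t = t_f."""
    return (M * p - 1) * (k / p) * math.exp(-tf_over_tau)


# M = 10 and t_f / tau = 16 are the values stated in the text;
# k = 0.04 and p = 8 are illustrative assumptions.
k, p = 0.04, 8
jump = spike_discontinuity(k=k, p=p, M=10, tf_over_tau=16)
baseline = k / p  # the QSGD gain that k_{j*} is reset to at t = t_f

print(f"jump = {jump:.3e}, relative to k/p = {jump / baseline:.3e}")
```

For these values the jump is several orders of magnitude smaller than the baseline gain $k/p$, which is consistent with the description of $k_{j^{*}}$ as being very close to $k/p$ by the end of the spike.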

In each of the following simulations, the fully connected weights and biases are initialized randomly and uniformly, $W_{ij},\ b_{i}\sim\mathcal{U}(-\frac{1}{\sqrt{m}},\frac{1}{\sqrt{m}})$, where $m$ is the number of inputs. The convolutional weights use Kaiming initialization [He et al., 2015]. In each comparison, the methods are initialized from the same points in parameter space, but the agents are not required to be initialized at the same location. In QSGD and SD-QSGD, the quorum variable is exponentially weighted, $\bar{\mathbf{x}}_{t+1}=\gamma\mathbf{x}^{\bullet}_{t}+(1-\gamma)\bar{\mathbf{x}}_{t}$ with $\gamma=0.1$, and we test convergence of $\bar{\mathbf{x}}$. Note that because this variable is not coupled to the dynamics of the individual agents, this is still distinct from EASGD. Because we use momentum in nearly all experiments, we will refer simply to QSGD and EASGD. The non-momentum variant of EASGD, when used, will be referred to as EASGD-WM (EASGD without momentum).
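The uniform initialization and the exponentially weighted quorum variable described above can be written out in a few lines. This is a minimal sketch using plain Python lists in place of network parameters. The function names are hypothetical, and the default $\gamma=0.1$ is the value quoted in the text.

```python
import math
import random


def init_fully_connected(m, n_out, rng):
    """Draw weights and biases from U(-1/sqrt(m), 1/sqrt(m)), with m the number of inputs."""
    bound = 1.0 / math.sqrt(m)
    weights = [[rng.uniform(-bound, bound) for _ in range(m)] for _ in range(n_out)]
    biases = [rng.uniform(-bound, bound) for _ in range(n_out)]
    return weights, biases


def spatial_mean(agents):
    """Coordinate-wise mean of the agents' parameter vectors (x-bullet in the text)."""
    p = len(agents)
    return [sum(coords) / p for coords in zip(*agents)]


def update_averaged_quorum(x_bar, agents, gamma=0.1):
    """One step of x_bar <- gamma * x_bullet + (1 - gamma) * x_bar."""
    x_mean = spatial_mean(agents)
    return [gamma * xm + (1.0 - gamma) * xb for xm, xb in zip(x_mean, x_bar)]
```

Because `update_averaged_quorum` only reads the agents' parameters and never writes back to them, the averaged variable is a passive filter of the spatial mean. This matches the remark that it is not coupled to the agents' dynamics and is therefore distinct from EASGD.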

6.2 Experimental results

We first analyze the effect of $k$ on classification performance. We find that the best performance is obtained for the lowest possible fixed values of $k$ that still lead to convergence of the quorum variable. This is demonstrated in Fig. 7 for the EASGD algorithm with $\eta=0.05$ initially and $p=8$, where we observe the general trend that test accuracy improves as the coupling gain is decreased. Note that $k=0.01$ and $k=0.02$, as well as $k=0$ (not shown), have too little synchronization for the quorum variable to reflect a meaningful average, and hence do not lead to good performance. Similar results hold for QSGD (not shown). We found not only the best performance for low, fixed $k$, but also the best scaling with the number of agents (footnote 13).

Footnote 13: The improvement in test accuracy and in the minimization of the test loss with increasing number of agents is demonstrated in later plots. We found that this trend was maximized with lower values of $k$.

There are several plausible explanations for the observation of improved generalization with reduced coupling. Lower values of $k$ allow for greater exploration of the optimization landscape, which intuitively should lead to better performance. As the measure of synchronization in Fig. 7(d) tends to zero, the $\boldsymbol{\epsilon}$ term in the $\mathbf{x}^{\bullet}$ dynamics will also tend to zero, and synchronization will begin to reduce the amount of noise felt by the individual agents. In neural networks, it is expected that this noise reduction will favor convergence to minima that do not generalize as well as those obtained with higher amounts of noise, as is seen in Fig. 7(c).

Results for a comparison of QSGD and SD-QSGD are shown in Fig. 8 for $p=1,4,8,16,32$, and $64$ with $k=0.04$. QSGD is shown in solid lines while SD-QSGD is shown in dashed lines; color indicates the number of agents (see legend in Fig. 8(a)). Note that $p=1$ simply corresponds to SGD for both SD-QSGD and QSGD, as the coupling term vanishes for a single agent. In both cases, we see significant improvement in accuracy as the number of agents increases, most likely due to an improved ability of the agents to explore the landscape along with a decrease in synchronization. The test loss and test error curves display interesting differences between the two algorithms: for $p=8$ and $p=16$, the state-dependent formalism obtains mildly improved generalization relative to QSGD, as expected from the bias towards minima with lower validation loss. QSGD performs better for $p=32$ and $p=64$; SD-QSGD does not converge for $p=64$.

We display a comparison of QSGD and EASGD in Fig. 9, again for $k=0.04$. QSGD tends to decrease the training loss further and more rapidly than EASGD; this is in line with earlier comments that, from an optimization perspective, the extra dynamics of the quorum variable offer no clear theoretical benefit. However, consistently across all experiments except for $p=16$ where it does not converge, EASGD generalizes better: the test loss is driven lower, and the test accuracy is higher than QSGD. A particularly interesting result is the single-agent case, where EASGD actually performs better than SGD with momentum (footnote 14). These observations suggest that the extra dynamics of the quorum variable may impose a form of implicit regularization which, to our knowledge, has not been observed before.

Footnote 14: Note that unlike QSGD with a single agent, EASGD with a single agent is a different algorithm than basic SGD. It can be seen as SGD coupled in feedback to a low-pass filter of its output.

Motivated by this observation, we now compare the $p=1$ EASGD algorithm with momentum, the $p=1$ EASGD algorithm without momentum, and basic SGD with momentum in Fig. 10 across a range of initial learning rates. Each algorithm is initialized from the same location and each curve represents an average over three runs to reduce stochastic variability. The momentum algorithms use $\delta=0.9$ and the two EASGD variants use $k=0.054$. In general, EASGD with and without momentum (dashed and solid lines respectively) both achieve higher test accuracy than SGD with momentum (dotted lines). Surprisingly, EASGD without momentum often performs better than EASGD with momentum.

To show that this trend is not an artifact of incorrectly choosing the momentum parameter, we have compiled additional data in Tab. 1 over a range of momentum parameters and learning rates. Each data point reported is again the result of an average over three independent runs, and each algorithm is initialized from the same location in each run. For simplicity, we report only the testing loss and testing error, rather than the results on the training data. For all but one choice of $\eta$ and $\delta$, EASGD-WM outperforms both EASGD and MSGD in classification accuracy, demonstrating that the trend is robust to choice of learning rate and momentum value.

Much like SGD with momentum, single-agent EASGD-WM is a second-order system in time. It also has similar computational complexity, and only requires storing one extra set of parameters for the quorum variable.

Indeed, this motivates a new class of second-order-in-time algorithms for non-distributed optimization given by the feedback interconnection

[TABLE]

where $\mathbf{g}$ represents arbitrary dynamics for the quorum variable [Russo and Slotine, 2010], and in general might be chosen as a nonlinear filter. The simple linear filter $\mathbf{g}(\tilde{\mathbf{x}},\mathbf{x})=k(\mathbf{x}-\tilde{\mathbf{x}})$ recovers EASGD. Fig. 9 shows that, while EASGD obtains better performance than QSGD, QSGD maintains better stability properties. Designing nonlinear filters $\mathbf{g}$ that can combine the regularization of EASGD with the stability of QSGD is an interesting direction for future research.
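The linear-filter case can be illustrated with a small discrete-time sketch of single-agent EASGD without momentum, in which the agent is pulled towards a quorum variable that low-pass filters the agent's own trajectory. This is a forward-Euler sketch under stated assumptions, not the authors' implementation. The function names, the noise-free gradient, the quadratic test objective $f(x)=x^{2}/2$, and the choice to scale the filter update by the same step size $\eta$ are all illustrative.

```python
def easgd_wm_step(x, x_tilde, grad, eta, k):
    """One forward-Euler step of single-agent EASGD without momentum.

    Agent:  dx/dt       = -grad(x) + k * (x_tilde - x)
    Filter: dx_tilde/dt =  k * (x - x_tilde)   (the linear filter g)
    """
    x_new = x - eta * (grad(x) + k * (x - x_tilde))
    x_tilde_new = x_tilde + eta * k * (x - x_tilde)
    return x_new, x_tilde_new


def run(x0, steps, eta=0.1, k=1.0):
    """Iterate the coupled system on f(x) = x^2 / 2, whose gradient is x."""
    def grad(x):
        return x

    x, x_tilde = x0, x0
    for _ in range(steps):
        x, x_tilde = easgd_wm_step(x, x_tilde, grad, eta, k)
    return x, x_tilde


x_final, x_tilde_final = run(x0=2.0, steps=500)
print(x_final, x_tilde_final)
```

Setting `k=0` decouples the two variables, so the agent follows plain gradient descent and the quorum variable never moves. A nonlinear filter would be obtained by replacing the `x_tilde` update with a different function of `x` and `x_tilde`.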

Returning to the distributed case, Fig. 9(d) shows that EASGD and QSGD respond differently to the choice of $k$ (footnote 15). EASGD is less synchronized than QSGD in all cases. Hence, in the context of Fig. 7, a possible explanation for the improved performance of EASGD when compared to QSGD is simply the observation that it tends to remain less synchronized.

Footnote 15: Fig. 9(d) shows the distance from $\tilde{\mathbf{x}}$ for EASGD. The distance from $\mathbf{x}^{\bullet}$ for EASGD is nearly identical.

To test this explanation, we use a scaling factor $k_{EASGD}=r\times k_{QSGD}$ to roughly match the levels of synchronization between EASGD and QSGD. Results for $r=1.35$ are shown in Fig. 11, and the synchronization curves are either approximately equal or EASGD remains more synchronized across all values of $p$. Additional values of $p=32$ and $p=64$ are shown, and EASGD now converges for all attempted values of $p<64$. QSGD continues to perform worse than EASGD on the test data due to an increased tendency to overfit. As the number of agents is increased, QSGD improves up to $p=32$; $p=64$ obtains roughly the same test performance. EASGD improves up to around $p=16$ and does not converge for $p=64$ (see Fig. 11(a); the curves in (b) and (d) are covered by the insets, but EASGD obtains roughly 55% testing accuracy). In general, EASGD with $p$ agents obtains roughly the same performance as QSGD with $2p$ agents. Interestingly, Fig. 11(d) shows that the high-$p$ stability issues for EASGD are not simply due to a lack of synchronization, as EASGD actually remains more synchronized than QSGD for $p=64$ for much of the training time. We offer a simple possible explanation for these stability issues in the SI by analyzing discrete-time optimization of a one-dimensional quadratic objective. Another explanation is afforded by Thms. 5.1 and 5.2, which reveal poor scaling with $p$ of both terms in the bound for EASGD when compared to QSGD. Together, these observations highlight stability issues in both continuous and discrete time.

As discussed in the text and the description of the experimental setup, our theory allows for the agents to be initialized in different locations, and for the agents to use distinct learning rates through individual learning rate schedules. In the original work on EASGD, it was postulated that starting the agents at different locations would break symmetry and lead to instability [Zhang et al., 2015]. Similarly, a single learning rate was used for all agents. The above simulations demonstrate that starting from distinct locations and decreasing the learning rate on an individual basis poses no problem. We show in Fig. 12 that starting from a single location leads to decreased performance. Surprisingly, Fig. 12 also highlights that initializing the agents from multiple locations is critical for optimal improvement as the number of agents is increased.

7 Conclusion

In this paper, we presented a continuous-time analysis of distributed stochastic gradient algorithms within the framework of stochastic nonlinear contraction theory. Through analogy with quorum sensing mechanisms, we analyzed the effect of synchronization of the individual SGD agents on the noise generated by their stochastic gradient approximations. We demonstrated that synchronization can effectively reduce the noise felt by each of the individual agents and by their spatial mean. We further demonstrated that synchronization can be seen to reduce the amount of smoothing imposed by SGD on the loss function. Through simulations on model non-convex optimization problems, we provided insight into how the distributed and coupled setting affects convergence to minima of the smoothed loss and the true loss. We introduced a new distributed algorithm, QSGD, and proved convergence results for a strongly convex objective for QSGD, QSGD with momentum, and EASGD. We further introduced a state-dependent variant of QSGD and constructed one specific example of the algorithm to show how the formalism can be used to bias exploration. We presented experiments on deep neural networks, and compared the properties of QSGD, SD-QSGD, and EASGD for generalization performance. We noted an interesting regularizing property of EASGD even in the single-agent case, and compared it to basic SGD with momentum, showing that it can lead to improved generalization. Research into similar higher-order in time optimization algorithms formed as coupled dynamical systems is an interesting direction of future work.

Acknowledgments

N. M. Boffi was supported by a Department of Energy Computational Science Graduate Fellowship. We gratefully thank the reviewers for helpful feedback and for suggestions to improve the work.

Appendix A Interaction between synchronization and noise: extra quorum dynamics

We now provide a mathematical characterization of how synchronization reduces the noise felt by the agents with arbitrary quorum dynamics. This is a generalization of what was shown in Sec. 3.2, and does not depend on the dynamics of the quorum variable. In addition to the assumptions stated in Sec. 2.6, we require that the gradient workers are stochastically contracting with rate $\lambda=k-\bar{\lambda}$ and bound $\frac{\eta}{b}C$ , so that the synchronization condition (22) derived in Sec. 3.1 can be applied. For completeness, we consider,

[TABLE]

As in the main text, we define $\mathbf{x}^{\bullet}=\frac{1}{p}\sum_{i}\mathbf{x}^{i}$. Adding up the stochastic dynamics in (65), we find

[TABLE]

We then define

[TABLE]

so that we can rewrite

[TABLE]

Applying the Taylor formula with integral remainder to the components of the gradient $\left(-\nabla f(x)\right)_{j}$ , we have, with $\mathbf{F}_{j}$ denoting the gradient of $\left(-\nabla f(x)\right)_{j}$ , and $\mathbf{H}_{j}$ denoting its Hessian,

[TABLE]

Summing over $i$ and applying the assumed bound $\mathbf{H}_{j}\leq Q\mathbf{I}$ leads to the inequality

[TABLE]

The left-hand side of the above inequality is $p|\boldsymbol{\epsilon}_{j}|$. Squaring both sides and summing over $j$, which runs from $1$ to $n$, provides a bound on $p^{2}\|\boldsymbol{\epsilon}\|^{2}$. Taking a square root, taking an expectation over the noise, and using the synchronization condition in (22), we obtain

[TABLE]

As a sum of $p$ independent Gaussian random variables with mean zero and standard deviations $\frac{\eta}{bp^{2}}\boldsymbol{\Sigma}(\mathbf{x}^{i})$ , the quantity

[TABLE]

can be rewritten as a single Gaussian random variable with $\mathbf{T}\mathbf{T}^{T}=\sum_{i}\boldsymbol{\Sigma}(\mathbf{x}^{i})$ as in the main text. Thus, for a given noise covariance $\boldsymbol{\Sigma}$ and corresponding bound $C$ , the difference between the dynamics followed by $\mathbf{x}^{\bullet}$ and the noise-free dynamics

[TABLE]

tends to zero almost surely as $k\rightarrow\infty$ and $p\rightarrow\infty$ . The limit $k\rightarrow\infty$ is needed to increase the degree of synchronization to eliminate the effect of $\boldsymbol{\epsilon}$ on $\mathbf{x}^{\bullet}$ , while the limit $p\rightarrow\infty$ is needed to eliminate the effect of the additive noise.
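The role of the $p\rightarrow\infty$ limit can be checked numerically in the simplest setting: the average of $p$ independent zero-mean Gaussian perturbations has a standard deviation that shrinks like $1/\sqrt{p}$. The Monte Carlo sketch below makes the simplifying assumptions of scalar, identically distributed noise with unit standard deviation, and the function name is hypothetical. It illustrates only the additive-noise term and does not address the $\boldsymbol{\epsilon}$ term controlled by $k$.

```python
import random
import statistics


def mean_noise_std(p, sigma=1.0, trials=20000, seed=0):
    """Empirical standard deviation of the average of p independent N(0, sigma^2) draws."""
    rng = random.Random(seed)
    samples = [
        sum(rng.gauss(0.0, sigma) for _ in range(p)) / p
        for _ in range(trials)
    ]
    return statistics.pstdev(samples)


for p in (1, 4, 16, 64):
    # The second column is the 1/sqrt(p) prediction for comparison.
    print(p, round(mean_noise_std(p), 3), round(p ** -0.5, 3))
```

The empirical values track the $1/\sqrt{p}$ prediction closely, so the noise felt by the spatial mean vanishes only as the number of agents grows, independently of the coupling gain.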

Bibliography

1. Bach and Moulines [2013] Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26, pages 773–781. 2013.
2. Banburski et al. [2019] Andrzej Banburski, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Bob Liang, Jack Hidary, and Tomaso A. Poggio. Theory III: dynamics and generalization in deep networks. CoRR, abs/1903.04991, 2019.
3. Betancourt et al. [2018] Michael Betancourt, Michael I. Jordan, and Ashia C. Wilson. On Symplectic Optimization. arXiv e-prints, 2018.
4. Bottou [1998] Léon Bottou. On-line learning in neural networks. chapter On-line Learning and Stochastic Approximations, pages 9–42. New York, NY, USA, 1998. ISBN 0-521-65263-4.
5. Bottou [2010] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pages 177–187, Paris, France, 2010. Springer.
6. Bouvrie and Slotine [2013] Jake Bouvrie and Jean-Jacques Slotine. Synchronization and noise: A mechanism for regularization in neural systems. arXiv e-prints, 2013.
7. Boyd et al. [2010] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, pages 1–122, 2010.
8. Chaudhari and Soatto [2018] P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–10, 2018. doi: 10.1109/ITA.2018.8503224.
