# Fitting ReLUs via SGD and Quantized SGD

**Authors:** Seyed Mohammadreza Mousavi Kalan, Mahdi Soltanolkotabi, and A. Salman Avestimehr

arXiv: 1901.06587 · 2019-04-02

## TL;DR

This paper demonstrates that stochastic gradient descent efficiently finds the optimal ReLU weights in a planted Gaussian model, and introduces a quantized SGD method that reduces communication costs without sacrificing convergence speed.

## Contribution

The paper proves geometric convergence of mini-batch SGD for ReLU fitting in a planted Gaussian model, and shows that a quantized SGD scheme for distributed training preserves this fast convergence while lowering communication cost.

## Key findings

- Suitably initialized mini-batch SGD converges geometrically to the planted model with a number of samples that is optimal up to numerical constants.
- Quantized SGD reduces communication costs significantly.
- Experiments, including distributed implementations on Amazon EC2, corroborate the findings.

## Abstract

In this paper we focus on the problem of finding the optimal weights of the shallowest of neural networks consisting of a single Rectified Linear Unit (ReLU). These functions are of the form $\mathbf{x}\rightarrow \max(0,\langle\mathbf{w},\mathbf{x}\rangle)$ with $\mathbf{w}\in\mathbb{R}^d$ denoting the weight vector. We focus on a planted model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to a planted weight vector. We first show that mini-batch stochastic gradient descent, when suitably initialized, converges at a geometric rate to the planted model with a number of samples that is optimal up to numerical constants. Next we focus on a parallel implementation where in each iteration the mini-batch gradient is calculated in a distributed manner across multiple processors and then broadcast to a master or all other processors. To reduce the communication cost in this setting we utilize a Quantized Stochastic Gradient Scheme (QSGD) where the partial gradients are quantized. Perhaps unexpectedly, we show that QSGD maintains the fast convergence of SGD to a globally optimal model while significantly reducing the communication cost. We further corroborate our numerical findings via various experiments including distributed implementations over Amazon EC2.
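The setup described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the squared loss $\tfrac{1}{2}(\max(0,\langle\mathbf{w},\mathbf{x}\rangle)-y)^2$, simulates the processors in a single process, and uses a generic unbiased stochastic-rounding quantizer in the spirit of QSGD; the exact quantizer, initialization, and step size analyzed in the paper may differ. All function names (`relu_gradient`, `quantize`, `distributed_sgd`, `run_demo`), the initialization as a small perturbation of the planted vector, and the hyperparameters are illustrative assumptions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relu_gradient(w, X, y, idx):
    """Average gradient of 0.5 * (max(0, <w, x>) - y)^2 over the samples in idx."""
    grad = [0.0] * len(w)
    for i in idx:
        pred = dot(w, X[i])
        if pred > 0.0:  # the ReLU is active, so its derivative is 1
            residual = pred - y[i]
            for j, xij in enumerate(X[i]):
                grad[j] += residual * xij
    return [g / max(len(idx), 1) for g in grad]


def quantize(v, levels, rng):
    """Randomly round each |v_i| / ||v|| to a multiple of 1/levels (unbiased)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        return list(v)
    out = []
    for vi in v:
        scaled = abs(vi) / norm * levels
        low = math.floor(scaled)
        # Round up with probability equal to the fractional part.
        level = min(levels, low + (1 if rng.random() < scaled - low else 0))
        out.append(math.copysign(norm * level / levels, vi))
    return out


def distributed_sgd(X, y, w0, step, batch_size, iters, workers, levels, rng):
    """Mini-batch SGD; each simulated worker sends a (quantized) partial gradient."""
    w = list(w0)
    for _ in range(iters):
        batch = rng.sample(range(len(X)), batch_size)
        shards = [batch[k::workers] for k in range(workers)]
        partials = [relu_gradient(w, X, y, s) for s in shards]
        if levels is not None:  # levels=None means plain, unquantized SGD
            partials = [quantize(g, levels, rng) for g in partials]
        # Combine partial gradients, weighting each worker by its shard size.
        grad = [
            sum(len(s) * g[j] for s, g in zip(shards, partials)) / batch_size
            for j in range(len(w))
        ]
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w


def run_demo(seed=0, d=5, n=400, levels=4):
    rng = random.Random(seed)
    w_star = [rng.gauss(0.0, 1.0) for _ in range(d)]
    scale = math.sqrt(dot(w_star, w_star))
    w_star = [wj / scale for wj in w_star]  # unit-norm planted vector
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [max(0.0, dot(w_star, x)) for x in X]  # labels from the planted model
    # Assumed initialization: a small perturbation of the planted vector.
    w0 = [wj + 0.1 * rng.gauss(0.0, 1.0) for wj in w_star]
    errors = {}
    for name, lv in (("sgd", None), ("qsgd", levels)):
        w = distributed_sgd(X, y, w0, 0.5, 32, 300, 4, lv, rng)
        errors[name] = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_star)))
    return errors


if __name__ == "__main__":
    print(run_demo())
```

`run_demo()` returns the Euclidean distance to the planted vector for the unquantized and quantized runs. Because the labels are noiseless, both the mini-batch noise and the quantization noise in this sketch scale with the gradient norm and vanish at the planted vector, which is one intuition for why quantization need not stall convergence here.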

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.06587/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/1901.06587/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/1901.06587/full.md

---
Source: https://tomesphere.com/paper/1901.06587