# Stochastic Training of Neural Networks via Successive Convex Approximations

**Authors:** Simone Scardapane, Paolo Di Lorenzo

arXiv: 1706.04769 · 2017-06-16

## TL;DR

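This paper introduces a stochastic successive convex approximation (SCA) algorithm for neural network training that builds strongly convex surrogates of the learning problem from first-order information only. In the reported experiments it converges faster, and to a better minimum, than state-of-the-art techniques, and it can be parallelized over multiple computational units.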

## Contribution

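The paper presents a family of SCA-based training algorithms for neural networks. Each surrogate problem is built from first-order stochastic information only, while the structure of the learning problem (the loss function and the weight regularizer) is exploited to speed up convergence and to spread the work over multiple computational units.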
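As a rough illustration of the SCA idea, the following NumPy sketch replaces the non-convex mini-batch problem at each iteration with a strongly convex surrogate and then moves part of the way toward the surrogate's minimizer. Everything specific here is an assumption made for illustration, not the authors' algorithm: the tiny `tanh` network, the squared loss with an L2 penalty `lam`, the surrogate obtained by linearizing the network output around the current weights, the proximal weight `tau`, the diminishing step size, and all function names. Any averaging of stochastic information across iterations that the paper may use is omitted.

```python
import numpy as np

def output_and_jacobian(w, X, hidden):
    """Output f(w; x) = w2 . tanh(W1 x) and its Jacobian w.r.t. the flat weights w."""
    n, d = X.shape
    W1 = w[: hidden * d].reshape(hidden, d)
    w2 = w[hidden * d:]
    A = np.tanh(X @ W1.T)                                   # (n, hidden)
    f = A @ w2                                              # (n,)
    # d f / d W1[h, j] = w2[h] * (1 - A[:, h]^2) * X[:, j]
    J_W1 = (w2 * (1.0 - A ** 2))[:, :, None] * X[:, None, :]
    J = np.concatenate([J_W1.reshape(n, hidden * d), A], axis=1)
    return f, J

def solve_surrogate(w, X, y, hidden, lam, tau):
    """Minimize the strongly convex surrogate (assumed form)
         (1/n) ||y - f(w) - J (v - w)||^2 + lam ||v||^2 + (tau/2) ||v - w||^2
    over v. It is quadratic, so the minimizer solves a linear system."""
    f, J = output_and_jacobian(w, X, hidden)
    n = len(y)
    r = y - f
    M = (2.0 / n) * J.T @ J + (2.0 * lam + tau) * np.eye(w.size)
    d = np.linalg.solve(M, (2.0 / n) * J.T @ r - 2.0 * lam * w)
    return w + d

def full_loss(w, X, y, hidden, lam):
    f, _ = output_and_jacobian(w, X, hidden)
    return np.mean((y - f) ** 2) + lam * np.sum(w ** 2)

def sca_train(w0, X, y, hidden, steps=200, batch=32, lam=1e-3,
              tau=1.0, gamma0=0.5, decay=0.01, seed=0):
    """Stochastic SCA loop: sample a mini-batch, solve the surrogate,
    then take a convex-combination step with a diminishing step size."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for t in range(steps):
        idx = rng.choice(len(y), size=min(batch, len(y)), replace=False)
        w_hat = solve_surrogate(w, X[idx], y[idx], hidden, lam, tau)
        gamma = gamma0 / (1.0 + decay * t)
        w = w + gamma * (w_hat - w)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(256, 3))
    y = np.sin(X.sum(axis=1))
    w0 = 0.1 * rng.normal(size=4 * 3 + 4)
    w = sca_train(w0, X, y, hidden=4)
    print("initial loss:", full_loss(w0, X, y, 4, 1e-3))
    print("final loss:  ", full_loss(w, X, y, 4, 1e-3))
```

Only the Jacobian of the network output is needed, so the sketch stays first-order. The `tau` term keeps every surrogate strongly convex even when the Jacobian is rank-deficient.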

## Key findings

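- Converges faster, and to a better minimum, than the state-of-the-art training techniques it is compared against.
- Parallelizes over multiple computational units without hindering performance: each unit optimizes its own surrogate on a randomly assigned subset of the variables, sized to the available computational power.
- Evaluated on several medium-sized benchmark problems and on a large-scale dataset of simulated physical data.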

## Abstract

This paper proposes a new family of algorithms for training neural networks (NNs). These are based on recent developments in the field of non-convex optimization, going under the general name of successive convex approximation (SCA) techniques. The basic idea is to iteratively replace the original (non-convex, highly dimensional) learning problem with a sequence of (strongly convex) approximations, which are both accurate and simple to optimize. Differently from similar ideas (e.g., quasi-Newton algorithms), the approximations can be constructed using only first-order information of the neural network function, in a stochastic fashion, while exploiting the overall structure of the learning problem for a faster convergence. We discuss several use cases, based on different choices for the loss function (e.g., squared loss and cross-entropy loss), and for the regularization of the NN's weights. We experiment on several medium-sized benchmark problems, and on a large-scale dataset involving simulated physical data. The results show how the algorithm outperforms state-of-the-art techniques, providing faster convergence to a better minimum. Additionally, we show how the algorithm can be easily parallelized over multiple computational units without hindering its performance. In particular, each computational unit can optimize a tailored surrogate function defined on a randomly assigned subset of the input variables, whose dimension can be selected depending entirely on the available computational power.
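The parallel scheme described at the end of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the "subset of the input variables" is interpreted as a random block of the weight vector, the per-block surrogate is a linearized squared loss with L2 penalty `lam` and proximal weight `tau`, the Jacobian `J` and residual `r` are supplied by the caller, and the names `random_blocks` and `block_sca_direction` are hypothetical. The blocks are processed in a plain loop as a stand-in for separate computational units.

```python
import numpy as np

def random_blocks(n_weights, n_blocks, rng):
    """Randomly partition the weight indices into n_blocks disjoint blocks."""
    return np.array_split(rng.permutation(n_weights), n_blocks)

def block_sca_direction(J, r, w, blocks, lam, tau):
    """For each block b, minimize over d_b (other blocks held fixed)
         (1/n) ||r - J_b d_b||^2 + lam ||w_b + d_b||^2 + (tau/2) ||d_b||^2,
    where J_b holds the Jacobian columns of block b.
    The block problems are independent, so each could run on its own unit."""
    n = len(r)
    d = np.zeros_like(w)
    for idx in blocks:  # sequential stand-in for parallel workers
        J_b = J[:, idx]
        M_b = (2.0 / n) * J_b.T @ J_b + (2.0 * lam + tau) * np.eye(len(idx))
        rhs = (2.0 / n) * J_b.T @ r - 2.0 * lam * w[idx]
        d[idx] = np.linalg.solve(M_b, rhs)
    return d

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J = rng.normal(size=(20, 6))   # placeholder Jacobian of the model output
    r = rng.normal(size=20)        # placeholder residuals y - f(w)
    w = rng.normal(size=6)
    blocks = random_blocks(6, 3, rng)
    d = block_sca_direction(J, r, w, blocks, lam=1e-3, tau=1.0)
    gamma = 0.5                    # step size, chosen arbitrarily here
    print("updated weights:", w + gamma * d)
```

Each block requires solving a linear system whose size equals the block size, so the number of blocks can be chosen to match the compute available per unit. With a single block covering all weights the sketch reduces to the unpartitioned surrogate.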

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1706.04769/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/1706.04769/full.md

## References

51 references — full list in the complete paper: https://tomesphere.com/paper/1706.04769/full.md

---
Source: https://tomesphere.com/paper/1706.04769