On the Convergence Rate of Training Recurrent Neural Networks

Zeyuan Allen-Zhu; Yuanzhi Li; Zhao Song

arXiv:1810.12065 · cs.LG · May 28, 2019 · 47 citations

TL;DR

This paper proves that stochastic gradient descent (SGD) can efficiently train recurrent neural networks with ReLU activations once they have sufficiently many neurons. The networks can fit (memorize) their training data, and the ReLU activations keep gradients from exploding or vanishing exponentially.

Contribution

It extends convergence analysis beyond networks with one hidden layer to RNNs, which are analogous to feedforward networks of depth $L$, and offers new tools for analyzing ReLU networks and their training dynamics.

Findings

1. SGD minimizes the regression loss at a linear convergence rate when the RNN has sufficiently many neurons (polynomial in the training data size and in $L$).

2. ReLU activations prevent exponential gradient explosion or vanishing.

3. The result gives theoretical evidence that RNNs can memorize their training data.
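
For readers unfamiliar with the term, "linear convergence" in the first finding means the training loss shrinks by a constant factor per iteration. The notation below ($f$ for the training loss, $W_t$ for the weights after $t$ iterations, $c$ for the rate constant) is illustrative and not taken from the paper:

$$
f(W_t) \le (1 - c)^t \, f(W_0), \quad 0 < c < 1
\quad\Longrightarrow\quad
f(W_T) \le \varepsilon \ \text{ once } \ T \ge \frac{1}{c} \ln \frac{f(W_0)}{\varepsilon}.
$$

Under this definition the number of iterations needed to reach loss $\varepsilon$ grows only like $\log(1/\varepsilon)$, since $(1 - c)^T \le e^{-cT}$.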

Abstract

How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory only covers networks with one hidden layer, so can we go deeper? In this paper, we focus on recurrent neural networks (RNNs), which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the same recurrent unit is repeatedly applied across the entire time horizon of length $L$, which is analogous to feedforward networks of depth $L$. We show that when the number of neurons is sufficiently large, meaning polynomial in the training data size and in $L$, SGD is capable of minimizing the regression loss at a linear convergence rate. This gives theoretical…
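
The point that the same recurrent unit is reused at every time step can be made concrete with a minimal sketch. It assumes an Elman-style recurrence $h_l = \mathrm{ReLU}(W h_{l-1} + A x_l)$ with Gaussian weights of variance $2/m$ and unit-norm random inputs. None of these choices is stated in the text above, and the function name `relu_rnn_hidden_norms` is hypothetical. The code only runs a forward pass and records the hidden-state norm at each step. It does not train anything, and it is not the paper's analysis.

```python
import math
import random


def relu_rnn_hidden_norms(m=128, d=8, L=20, seed=0):
    """Run one Elman-style ReLU recurrence; return ||h_l|| for l = 1..L."""
    rng = random.Random(seed)
    # Assumption: i.i.d. Gaussian entries with variance 2/m for both matrices.
    std = math.sqrt(2.0 / m)
    W = [[rng.gauss(0.0, std) for _ in range(m)] for _ in range(m)]
    A = [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(m)]

    h = [0.0] * m  # initial hidden state h_0 = 0
    norms = []
    for _ in range(L):
        # Assumption: each input x_l is a random unit vector.
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_norm = math.sqrt(sum(v * v for v in x))
        x = [v / x_norm for v in x]

        # The same W and A are applied at every time step.
        pre = [
            sum(W[i][j] * h[j] for j in range(m))
            + sum(A[i][k] * x[k] for k in range(d))
            for i in range(m)
        ]
        h = [max(0.0, p) for p in pre]  # ReLU activation
        norms.append(math.sqrt(sum(v * v for v in h)))
    return norms


norms = relu_rnn_hidden_norms()
print([round(v, 2) for v in norms])
```

Printing the norms for larger `L` or `m` shows how the hidden state scales across the time horizon under these assumptions. The unrolled loop is structurally a depth-`L` feedforward network whose layers all share `W` and `A`.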

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning

Methods: Stochastic Gradient Descent