Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization

Francois Chaubard; Mykel Kochenderfer

arXiv:2505.17852·cs.LG·May 26, 2025

TL;DR

This paper shows that zero-order optimization (ZOO) can train large recurrent neural networks without backpropagation through time (BPTT), reducing training memory while matching or exceeding BPTT in convergence and generalization.

Contribution

The authors demonstrate that zero-order optimization can effectively train billion-parameter RNNs, offering a scalable alternative to BPTT with improved convergence and regularization.

Findings

01 Zero-order optimization (ZOO) matches or exceeds BPTT in convergence rate.

02 The method significantly reduces memory requirements during training.

03 Models trained with ZOO generalize as well as or better than BPTT-trained models.
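
The memory finding can be illustrated with a toy count of stored hidden states. The sketch below assumes a scalar tanh RNN as a stand-in for a real cell; the function names (`rnn_step`, `forward_keep_all`, `forward_keep_last`) are hypothetical and are not taken from the paper. It contrasts a forward pass that retains every intermediate state, as a BPTT backward pass would need, with one that keeps only the current state, which is enough to evaluate a loss for a zero-order method.

```python
import math


def rnn_step(h, x, w_h, w_x):
    """One step of a scalar tanh RNN (toy stand-in, an assumption of this sketch)."""
    return math.tanh(w_h * h + w_x * x)


def forward_keep_all(xs, w_h, w_x):
    """Forward pass that stores every hidden state, as BPTT needs for its backward pass.

    Returns the final state and the number of states held in memory.
    """
    h = 0.0
    states = [h]
    for x in xs:
        h = rnn_step(h, x, w_h, w_x)
        states.append(h)  # storage grows with the sequence length
    return h, len(states)


def forward_keep_last(xs, w_h, w_x):
    """Forward pass that overwrites the state in place.

    A zero-order method only needs the loss value, so one live state suffices.
    """
    h = 0.0
    for x in xs:
        h = rnn_step(h, x, w_h, w_x)  # previous state is discarded
    return h, 1


if __name__ == "__main__":
    xs = [0.1] * 1000
    print(forward_keep_all(xs, 0.5, 1.0)[1])   # stored states grow with len(xs)
    print(forward_keep_last(xs, 0.5, 1.0)[1])  # stored states stay constant
```

Both passes compute the same final state; only the amount of retained state differs. A real network stores a full activation tensor per step rather than one float, but the scaling with sequence length is the same.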

Abstract

During inference, Recurrent Neural Networks (RNNs) scale constant in both FLOPs and GPU memory with increasing context length, as they compress all prior tokens into a fixed-size memory. In contrast, transformers scale linearly in FLOPs and, at best, linearly in memory during generation, since they must attend to all previous tokens explicitly. Despite this inference-time advantage, training large RNNs on long contexts remains impractical because standard optimization methods depend on Backpropagation Through Time (BPTT). BPTT requires retention of all intermediate activations during the forward pass, causing memory usage to scale linearly with both context length and model size. In this paper, we show that Zero-Order Optimization (ZOO) methods such as Random-vector Gradient Estimation (RGE) can successfully replace BPTT to train RNNs with convergence rates that match, or exceed BPTT by…
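
To make the idea of Random-vector Gradient Estimation concrete, here is a minimal sketch of a generic central-difference estimator driving plain gradient descent. It is a textbook form of the technique, not necessarily the exact variant or hyperparameters used in the paper. The names `rge_gradient`, `zo_descent`, and `quadratic`, the Gaussian probe distribution, and the values of `num_probes`, `eps`, and `lr` are all assumptions made for illustration. The estimator calls the loss function only through forward evaluations, so no intermediate activations need to be retained.

```python
import random


def rge_gradient(loss, theta, num_probes=16, eps=1e-3, rng=None):
    """Central-difference random-vector gradient estimate of `loss` at `theta`.

    For each random direction v, the directional slope
    (loss(theta + eps*v) - loss(theta - eps*v)) / (2*eps) scales v,
    and the results are averaged over `num_probes` directions.
    """
    if rng is None:
        rng = random.Random(0)
    d = len(theta)
    grad = [0.0] * d
    for _ in range(num_probes):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # Gaussian probe (assumed)
        plus = [t + eps * vi for t, vi in zip(theta, v)]
        minus = [t - eps * vi for t, vi in zip(theta, v)]
        slope = (loss(plus) - loss(minus)) / (2.0 * eps)  # two forward evaluations
        for i in range(d):
            grad[i] += slope * v[i] / num_probes
    return grad


def zo_descent(loss, theta, steps=200, lr=0.05, seed=0):
    """Plain gradient descent using the zero-order estimate instead of backprop."""
    rng = random.Random(seed)
    for _ in range(steps):
        g = rge_gradient(loss, theta, rng=rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


def quadratic(theta):
    """Toy loss standing in for an RNN training loss."""
    return sum(t * t for t in theta)


if __name__ == "__main__":
    start = [1.0, -2.0, 0.5]
    end = zo_descent(quadratic, start)
    print(quadratic(start), quadratic(end))
```

In a real setting `loss` would run the RNN forward over a batch of sequences, and the two evaluations per probe are independent of each other, so they can be run in parallel.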
