Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization
Francois Chaubard, Mykel Kochenderfer

TL;DR
This paper applies zero-order optimization to the training of large recurrent neural networks, reducing training memory usage while matching or exceeding traditional Backpropagation Through Time (BPTT) in convergence and generalization.
Contribution
The authors demonstrate that zero-order optimization can effectively train billion-parameter RNNs, offering a scalable alternative to BPTT with improved convergence and regularization.
Findings
Zero-order optimization matches or exceeds BPTT in convergence rates.
The method reduces memory requirements significantly during training.
Models trained with ZOO generalize as well as or better than BPTT-trained models.
Abstract
During inference, Recurrent Neural Networks (RNNs) have constant cost in both FLOPs and GPU memory as context length increases, because they compress all prior tokens into a fixed-size memory. In contrast, transformers scale linearly in FLOPs and, at best, linearly in memory during generation, since they must attend to all previous tokens explicitly. Despite this inference-time advantage, training large RNNs on long contexts remains impractical because standard optimization methods depend on Backpropagation Through Time (BPTT). BPTT requires retention of all intermediate activations during the forward pass, causing memory usage to scale linearly with both context length and model size. In this paper, we show that Zero-Order Optimization (ZOO) methods such as Random-vector Gradient Estimation (RGE) can successfully replace BPTT to train RNNs with convergence rates that match, or exceed BPTT by…
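To make the idea of Random-vector Gradient Estimation concrete, the following is a minimal sketch of a generic central-difference estimator in NumPy. It is not the authors' implementation: the function names (`rge_gradient`, `toy_loss`), the Gaussian perturbations, the central-difference form, the perturbation count, the step size, and the quadratic toy loss are all assumptions made for illustration. The point it shows is that the estimator only evaluates the loss with forward passes, so no intermediate activations need to be retained for a backward pass.

```python
import numpy as np


def rge_gradient(loss_fn, theta, num_perturbations=16, eps=1e-3, rng=None):
    """Estimate the gradient of loss_fn at theta using forward evaluations only.

    Assumed (illustrative) form: average over random directions u of
    [(f(theta + eps*u) - f(theta - eps*u)) / (2*eps)] * u, with u ~ N(0, I).
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_perturbations):
        u = rng.standard_normal(theta.shape)  # random probe direction
        # Two forward evaluations; nothing is stored for a backward pass.
        delta = loss_fn(theta + eps * u) - loss_fn(theta - eps * u)
        grad += (delta / (2.0 * eps)) * u
    return grad / num_perturbations


def toy_loss(theta):
    """Hypothetical stand-in for a model's training loss (a simple quadratic)."""
    return 0.5 * float(np.sum(theta ** 2))


# Plain gradient descent driven by the zero-order estimate.
rng = np.random.default_rng(0)
theta = np.ones(10)
initial_loss = toy_loss(theta)
for _ in range(200):
    theta = theta - 0.05 * rge_gradient(toy_loss, theta, rng=rng)
final_loss = toy_loss(theta)
```

In this sketch each update costs `2 * num_perturbations` loss evaluations, and memory stays at the size of the parameters plus one perturbation vector. For an RNN, `loss_fn` would be a forward pass over the full sequence, which needs only the current hidden state rather than the stored activations that BPTT requires.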
