Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
Lorenzo Livi

TL;DR
This paper shows that gating mechanisms in RNNs shape effective learning rates and gradient flow: gates act as data-driven preconditioners that couple state dynamics to parameter updates and thereby improve trainability.
Contribution
It provides a theoretical framework linking gates to effective learning rates and gradient anisotropy, supported by empirical validation across sequence tasks.
Findings
Gates induce lag-dependent, direction-dependent effective learning rates.
Gates concentrate gradient flow into low-dimensional subspaces.
Gating acts as a data-driven preconditioner, improving trainability.
Abstract
We show that gating mechanisms in recurrent neural networks (RNNs) induce lag-dependent and direction-dependent effective learning rates, even when training uses a fixed, global step size. This behavior arises from a coupling between state-space time-scales (parametrized by the gates) and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs and applying a first-order expansion, we make explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates act not only as filters of information flow, but also as data-driven preconditioners of optimization, with formal connections to learning-rate schedules, momentum, and adaptive methods such as Adam. Empirical simulations corroborate these…
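To make the lag dependence concrete, here is a minimal NumPy sketch. It assumes a standard leaky-integrator update, h_t = (1 - alpha) h_{t-1} + alpha tanh(W h_{t-1} + U x_t), with a constant scalar gate alpha; this form, and the names `leaky_step` and `lag_jacobian_norms`, are illustrative assumptions and not taken from the paper. The sketch computes the exact one-step Jacobian and the norm of the lag-k Jacobian product, which is the factor that scales gradient contributions arriving from k steps back.

```python
import numpy as np

def leaky_step(h, x, W, U, alpha):
    """One step of an assumed leaky-integrator RNN and its exact state Jacobian."""
    pre = W @ h + U @ x
    h_new = (1.0 - alpha) * h + alpha * np.tanh(pre)
    # d h_new / d h = (1 - alpha) I + alpha * diag(1 - tanh^2(pre)) W
    J = (1.0 - alpha) * np.eye(h.size) + alpha * (1.0 - np.tanh(pre) ** 2)[:, None] * W
    return h_new, J

def lag_jacobian_norms(alpha, W, U, xs, h0):
    """Spectral norms of d h_T / d h_{T-k} for k = 1..T (entry k-1 is lag k)."""
    h, jacs = h0, []
    for x in xs:
        h, J = leaky_step(h, x, W, U, alpha)
        jacs.append(J)
    P = np.eye(h0.size)
    norms = []
    for J in reversed(jacs):  # J_T, then J_T J_{T-1}, and so on
        P = P @ J
        norms.append(np.linalg.norm(P, 2))
    return np.array(norms)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, T = 8, 3, 20  # hypothetical sizes chosen for illustration
    W = rng.normal(size=(n, n)) / np.sqrt(n)
    U = rng.normal(size=(n, d))
    xs = rng.normal(size=(T, d))
    for alpha in (0.1, 0.5, 1.0):
        norms = lag_jacobian_norms(alpha, W, U, xs, np.zeros(n))
        print(f"alpha={alpha}: lag-1 norm={norms[0]:.3f}, lag-{T} norm={norms[-1]:.3e}")
```

In this toy setting the optimizer's step size is never changed, yet the gate value alone alters how strongly each lag contributes to the gradient, which is one way to read the "effective learning rate" described in the abstract. With W set to zero, the lag-k norm reduces to (1 - alpha)^k, a handy sanity check.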
