Weight-Space Linear Recurrent Neural Networks
Roussel Desmond Nzoyem, Nawid Keshtmand, Enrique Crespo Fernandez, Idriss Tsayem, Raul Santos-Rodriguez, David A.W. Barton, Tom Deakin

TL;DR
WARP is a weight-space linear RNN that unifies weight-space learning with linear recurrence, enabling gradient-free test-time adaptation, in-context learning, and competitive or better performance on diverse sequence tasks, including physics-informed applications.
Contribution
The paper presents WARP, a weight-space linear RNN that explicitly parametrizes its hidden state as the weights and biases of an auxiliary neural network. This allows gradient-free adaptation at test time and integration of domain-specific priors, and the model matches or surpasses existing methods on the reported benchmarks.
Findings
WARP matches or surpasses state-of-the-art on classification tasks.
A physics-informed variant outperforms others by over 10x.
Ablation studies confirm the importance of key architectural components.
Abstract
We introduce WARP (Weight-space Adaptive Recurrent Prediction), a simple yet powerful model that unifies weight-space learning with linear recurrence to redefine sequence modeling. Unlike conventional recurrent neural networks (RNNs) which collapse temporal dynamics into fixed-dimensional hidden states, WARP explicitly parametrizes its hidden state as the weights and biases of a distinct auxiliary neural network, and uses input differences to drive its recurrence. This brain-inspired formulation enables efficient gradient-free adaptation of the auxiliary network at test-time, in-context learning abilities, and seamless integration of domain-specific physical priors. Empirical validation shows that WARP matches or surpasses state-of-the-art baselines on diverse classification tasks, featuring in the top three in 4 out of 6 real-world challenging datasets. Furthermore, extensive…
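The recurrence described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the names `A`, `B`, `theta0`, and `sizes`, the tanh activation, and the choice to feed a normalized time value to the auxiliary network are all assumptions made for illustration. The hidden state `theta` is a flat vector holding the weights and biases of a small auxiliary MLP, and it is updated linearly from the difference between consecutive inputs.

```python
import math
import random

def mlp_param_count(sizes):
    """Number of weights and biases in an MLP with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

def unpack(theta, sizes):
    """Split the flat hidden state into one (W, b) pair per layer."""
    layers, i = [], 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = [theta[i + r * n_in: i + (r + 1) * n_in] for r in range(n_out)]
        i += n_in * n_out
        b = theta[i: i + n_out]
        i += n_out
        layers.append((W, b))
    return layers

def aux_forward(theta, sizes, inp):
    """Run the auxiliary MLP whose parameters are the hidden state theta."""
    h = list(inp)
    layers = unpack(theta, sizes)
    for k, (W, b) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + bias for row, bias in zip(W, b)]
        if k < len(layers) - 1:  # tanh on hidden layers only (assumed activation)
            h = [math.tanh(v) for v in h]
    return h

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def warp_like_rollout(xs, A, B, theta0, sizes):
    """Linear weight-space recurrence: theta_t = A theta_{t-1} + B (x_t - x_{t-1})."""
    theta, prev, outputs = list(theta0), xs[0], []
    T = len(xs)
    for t, x in enumerate(xs):
        dx = [a - b for a, b in zip(x, prev)]  # input difference drives the update
        theta = [u + v for u, v in zip(matvec(A, theta), matvec(B, dx))]
        tau = t / max(T - 1, 1)                # normalized time (assumed decoder input)
        outputs.append(aux_forward(theta, sizes, [tau]))
        prev = x
    return outputs, theta

# Hypothetical usage with random parameters (no training involved).
random.seed(0)
sizes = [1, 4, 2]                              # 1 input (time), 4 hidden units, 2 outputs
D = mlp_param_count(sizes)
A = [[1.0 if i == j else 0.0 for j in range(D)] for i in range(D)]
B = [[random.gauss(0.0, 0.1) for _ in range(2)] for _ in range(D)]
theta0 = [random.gauss(0.0, 0.1) for _ in range(D)]
xs = [[math.sin(0.3 * t), math.cos(0.3 * t)] for t in range(10)]
ys, theta_T = warp_like_rollout(xs, A, B, theta0, sizes)
print(len(ys), len(theta_T))
```

Because the update is linear in `theta` and involves no gradient step, the auxiliary network's weights change at inference time simply by running the recurrence, which is one way to read the abstract's "gradient-free adaptation".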
Peer Reviews
Decision: ICLR 2026 Poster
S1. Original framing. The move to perform recurrence directly in parameter space is novel and quite elegant. It reads as a middle ground between hypernetworks and fast-weight RNNs, but with the analytical simplicity of a linear transition.
S2. Range of results. The experiments span diverse domains: MNIST/CelebA completion, ETT and PEMS forecasting, DSR, and UEA time-series classification. The UEA section is particularly strong: comparisons include modern SSMs like S5, Mamba, S6, NRDE, and NCDE
W1. Benchmark depth. While broad, the benchmark is missing some of the newer SSMs that define the current frontier. In particular, LinOSS (Rusch & Rus, 2024), an oscillatory, long-sequence SSM, is cited but not compared. Given that LinOSS, FACTS, and Griffin all outperform S4 and Mamba on long forecasting tasks, excluding them makes the SoTA claim weaker.
W2. Scalability. The transition matrix $A \in \mathbb{R}^{D_\omega \times D_\omega}$ scales quadratically with the size of the decoder, which
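To make the quadratic scaling raised in W2 concrete, the following sketch counts the entries of a dense $D_\omega \times D_\omega$ transition matrix for a few auxiliary-network sizes. The layer sizes are hypothetical and are not taken from the paper; the sketch only assumes that $D_\omega$ is the total number of weights and biases in a fully connected decoder.

```python
def mlp_param_count(sizes):
    """Total weights and biases of a fully connected network (this is D_omega)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

def dense_transition_entries(sizes):
    """Entries in a dense D_omega x D_omega transition matrix A."""
    d_omega = mlp_param_count(sizes)
    return d_omega * d_omega

# Hypothetical decoders: 1 input, two hidden layers of equal width, 1 output.
for width in (16, 32, 64):
    sizes = [1, width, width, 1]
    print(width, mlp_param_count(sizes), dense_transition_entries(sizes))
# prints:
# 16 321 103041
# 32 1153 1329409
# 64 4353 18948609
```

In this toy setting, doubling the hidden width roughly quadruples $D_\omega$, so the dense matrix grows by roughly a factor of sixteen.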
The paper builds on well-established analyses of linear networks and extends them naturally to recurrent settings using a spectral-decomposition framework (Schur- and SVD-based). Derivations are internally consistent and clearly documented. The proposed linear recurrence view elegantly bridges RNNs, residual-RNNs, and diagonal SSMs, helping clarify connections between recent model families. Experiments across several dynamical-system tasks (MSD, LV, traffic flow forecasting) confirm the predic
1. The analytic results rest on linear, Gaussian assumptions; nonlinear recurrence effects and gating dynamics are only discussed qualitatively. As such, predictive power for modern gated RNNs or structured SSMs is limited.
2. The analysis centers on the infinite-width, overparameterized limit; it does not quantify where the asymptotic predictions break down for finite models.
3. The authors reference Saxe et al. (2014) but omit more recent theoretical works on curriculum and transfer in RNNs
The core idea of parametrising RNN hidden states as weights of an auxiliary neural network is conceptually interesting and, to the best of my knowledge, novel. The authors test their method on a diverse set of domains and perform a large range of ablations to show the necessity of design choices. The writing is generally accessible, with good intuitive explanations, and goes far to place itself in the larger test-time adaptation literature.
Claims are sometimes overstated and/or imprecise. For example, phrases like "transformative paradigm for adaptive machine intelligence" (Abstract, Conclusion) and "redefine sequence modeling" (Abstract) are not well-supported. The empirical results show WARP is competitive but not uniformly superior. "Brain-inspired formulation" (Abstract, page 2) refers only to using input differences, with citation to synaptic plasticity [16], but the connection is somewhat superficial - there is a rich litera
Taxonomy
Topics: Neural Networks and Applications · Industrial Technology and Control Systems
