Improved state mixing in higher-order and block diagonal linear recurrent networks
Igor Dubinin, Antonio Orvieto, Felix Effenberger

TL;DR
This paper introduces structured linear recurrent networks that increase expressivity over diagonal LRNNs through richer state mixing while keeping efficiency competitive, and that perform competitively with nonlinear models on synthetic tasks.
Contribution
The paper proposes two novel architectures, H-LRU and BD-LRU, that increase expressivity of LRNNs via higher-order and block-diagonal state mixing while maintaining efficiency.
Findings
BD-LRU matches or exceeds the performance of SSMs and LSTMs on synthetic tasks
H-LRU is highly parameter-efficient in compression tasks
Structured state mixing enhances expressivity without sacrificing efficiency
Abstract
Linear recurrent networks (LRNNs) and linear state space models (SSMs) promise computational and memory efficiency on long-sequence modeling tasks, yet their diagonal state transitions limit expressivity. Dense and nonlinear architectures (e.g., LSTMs) on the other hand are provably more expressive, but computationally costly. Here, we explore how expressivity in LRNNs can be increased via richer state mixing across time and channels while maintaining competitive efficiency. Specifically, we introduce two structured LRNN architectures: (i) Higher-order Linear Recurrent Units (H-LRU), which generalize first-order recurrence to higher order, mixing multiple past states, and (ii) Block-Diagonal LRUs (BD-LRU), which enable dense intra-block channel mixing. Per-channel (H-LRU) or per-row (BD-LRU) L1-normalization of selective gates stabilizes training and allows for scaling window/block…
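The two recurrences described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the L1-normalization divides each channel's (H-LRU) or each row's (BD-LRU) gate vector, including the input gate, by the sum of its absolute values; the exact parameterization in the paper may differ. The names `l1_normalize`, `h_lru`, and `bd_lru` are hypothetical, and the raw gates are passed in as plain arrays standing in for input-dependent projections.

```python
import numpy as np

def l1_normalize(raw, eps=1e-8):
    # Assumption: divide by the sum of absolute values along the last axis,
    # so each gate vector has L1 norm just under 1.
    return raw / (np.sum(np.abs(raw), axis=-1, keepdims=True) + eps)

def h_lru(x, raw_gates):
    """Higher-order recurrence, one independent recurrence per channel.

    x: (T, C) inputs; raw_gates: (T, C, m + 1), where the first m entries
    weight the m previous states and the last entry weights the input.
    """
    T, C = x.shape
    m = raw_gates.shape[-1] - 1
    g = l1_normalize(raw_gates)            # per-channel normalization
    past = np.zeros((m, C))                # past[k] holds h_{t-1-k}
    out = np.zeros((T, C))
    for t in range(T):
        a, b = g[t, :, :m], g[t, :, m]     # shapes (C, m) and (C,)
        h = np.einsum("ck,kc->c", a, past) + b * x[t]
        past = np.concatenate([h[None], past[:-1]], axis=0)
        out[t] = h
    return out

def bd_lru(x, raw_A, raw_b):
    """Block-diagonal recurrence with dense mixing inside each block.

    x: (T, B, m); raw_A: (T, B, m, m); raw_b: (T, B, m),
    for B blocks of size m.
    """
    T, B, m = x.shape
    raw = np.concatenate([raw_A, raw_b[..., None]], axis=-1)  # (T, B, m, m + 1)
    g = l1_normalize(raw)                  # per-row normalization
    A, b = g[..., :m], g[..., m]
    h = np.zeros((B, m))
    out = np.zeros((T, B, m))
    for t in range(T):
        h = np.einsum("bij,bj->bi", A[t], h) + b[t] * x[t]
        out[t] = h
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
states = h_lru(x, rng.normal(size=(32, 4, 3)))   # order m = 2
print(np.abs(states).max() <= np.abs(x).max())
```

Under this assumed normalization each new state is a combination of past states and the current input with absolute weights summing to at most 1, so the state magnitude never exceeds the largest input magnitude. This is one simple way to obtain the forward-pass stability the abstract refers to.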
Peer Reviews
Decision · Submitted to ICLR 2026
1. The proposed architectural parameterizations (H-LRU, BD-LRU) are conceptually elegant and clearly motivated. The authors show that higher-order recurrence and block structure are natural ways to increase mixing without resorting to full dense transitions. Both architectures incorporate input-dependent selective gates with L1-normalization (per-channel or per-row), ensuring forward-pass stability and bounded dynamics. 2. The authors provide a theoretical justification (Proposition 1) that thi
1. **Limited evaluations**: The empirical evaluation covers only synthetic tasks. While the use of the MAD benchmark is very useful, a demonstration of competitive performance on large-scale language or long-sequence datasets (e.g., LRA, language modeling) is missing. This makes it hard to assess whether the structural benefits of the proposed architectures transfer to more practical usage. The authors acknowledge this as future work, though. 2. The core contributions are about the structure of
The motivation is clear. Diagonal LRNNs are efficient but expressively limited. Structured non-diagonal mixing can close the gap while retaining much of the efficiency. The normalisation is well motivated and supported by the ablation in Figure 2. Several benchmarks support the move to H-LRUs and BD-LRUs. The jump from m=1 to m=2 is notable, plausibly due to complex eigenvalues (as the authors note). Permutation tasks show an advantage for higher m as task complexity increases, especially for
The use of H-LRU and BD-LRU themselves is not novel, which is why the efficient implementation and normalization are so important. The main text briefly states how block-diagonal structure reduces the cost of the parallel scan; more detail is deferred to the appendix/code. A short sketch in the main text would help.
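As an illustration of the point about the parallel scan, here is a minimal NumPy sketch under stated assumptions. It is not the paper's implementation, which the review says is deferred to the appendix and code. The sketch uses the standard fact that the recurrence h_t = A_t h_{t-1} + u_t is associative under the composition (A2, u2) ∘ (A1, u1) = (A2 A1, A2 u1 + u2), and applies a Hillis-Steele style doubling scan. The function names `combine`, `block_scan`, and `sequential` are hypothetical. With B blocks of size m (state size d = B·m), each composition costs about B·m³ = d·m² multiply-adds, compared with d³ for a dense transition.

```python
import numpy as np

def combine(later, earlier):
    # Compose two affine maps h -> A h + u, applying `earlier` first.
    A2, u2 = later
    A1, u1 = earlier
    A = np.einsum("...ij,...jk->...ik", A2, A1)        # per-block m x m matmul
    u = np.einsum("...ij,...j->...i", A2, u1) + u2
    return A, u

def block_scan(A, u):
    """Inclusive scan over time with doubling offsets.

    A: (T, B, m, m) block transitions; u: (T, B, m) inputs.
    Returns h: (T, B, m) with h_t = A_t h_{t-1} + u_t and h_{-1} = 0.
    """
    T = A.shape[0]
    offset = 1
    while offset < T:
        # Every position t >= offset absorbs the partial result at t - offset;
        # all positions are updated at once, which is the parallel step.
        new_A, new_u = combine((A[offset:], u[offset:]),
                               (A[:-offset], u[:-offset]))
        A = np.concatenate([A[:offset], new_A], axis=0)
        u = np.concatenate([u[:offset], new_u], axis=0)
        offset *= 2
    return u

def sequential(A, u):
    # Reference implementation: plain step-by-step recurrence.
    h = np.zeros(u.shape[1:])
    out = np.zeros_like(u)
    for t in range(A.shape[0]):
        h = np.einsum("bij,bj->bi", A[t], h) + u[t]
        out[t] = h
    return out

rng = np.random.default_rng(0)
A = 0.3 * rng.normal(size=(16, 4, 3, 3))   # T = 16 steps, B = 4 blocks, m = 3
u = rng.normal(size=(16, 4, 3))
print(np.allclose(block_scan(A, u), sequential(A, u)))
```

The doubling scan needs about log2(T) rounds, and each round only multiplies small m × m blocks, which is where the block-diagonal structure saves work relative to a dense d × d transition. A work-efficient scan (for example a Blelloch-style scan) would reduce total operations further; the doubling variant is used here only because it is short.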
The presentation of the architecture is clear and easy to follow. The notation and formulas are well defined, which makes for smooth reading. The proposed method is novel. Extensive experiments evaluate the proposed architecture.
See Questions
Taxonomy
Topics: Neural Networks and Reservoir Computing · Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques
