Improved state mixing in higher-order and block diagonal linear recurrent networks

Igor Dubinin; Antonio Orvieto; Felix Effenberger

arXiv:2602.12021 · cs.LG · March 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces structured linear recurrent networks that enhance expressivity through richer state mixing, achieving competitive performance with improved efficiency over traditional diagonal LRNNs and nonlinear models.

Contribution

The paper proposes two novel architectures, H-LRU and BD-LRU, that increase expressivity of LRNNs via higher-order and block-diagonal state mixing while maintaining efficiency.

Findings

1. BD-LRU matches or exceeds performance of SSMs and LSTMs in synthetic tasks
2. H-LRU is highly parameter-efficient in compression tasks
3. Structured state mixing enhances expressivity without sacrificing efficiency

Abstract

Linear recurrent networks (LRNNs) and linear state space models (SSMs) promise computational and memory efficiency on long-sequence modeling tasks, yet their diagonal state transitions limit expressivity. Dense and nonlinear architectures (e.g., LSTMs), on the other hand, are provably more expressive but computationally costly. Here, we explore how expressivity in LRNNs can be increased via richer state mixing across time and channels while maintaining competitive efficiency. Specifically, we introduce two structured LRNN architectures: (i) Higher-order Linear Recurrent Units (H-LRU), which generalize first-order recurrence to higher order, mixing multiple past states, and (ii) Block-Diagonal LRUs (BD-LRU), which enable dense intra-block channel mixing. Per-channel (H-LRU) or per-row (BD-LRU) L1-normalization of selective gates stabilizes training and allows for scaling window/block…
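
As a rough illustration of the two recurrences named in the abstract, the NumPy sketch below implements a single time step of each. It rests on several assumptions that the abstract does not confirm: the gates are passed in as raw arrays (in the paper they are input-dependent), the input gate is normalized jointly with the transition gates, and the function names `l1_normalize`, `h_lru_step`, and `bd_lru_step` are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def l1_normalize(gates, axis=-1, eps=1e-8):
    """Scale gates so their absolute values sum to (just under) 1 along `axis`."""
    return gates / (np.abs(gates).sum(axis=axis, keepdims=True) + eps)

def h_lru_step(hist, x, a_raw, b_raw):
    """One higher-order step per channel (hypothetical helper).

    hist:  (C, k) last k states per channel, most recent first
    x:     (C,)   current input
    a_raw: (C, k) raw gates on the k past states
    b_raw: (C,)   raw input gate
    """
    # Assumption: per-channel L1 normalization over past-state gates AND input gate.
    gates = l1_normalize(np.concatenate([a_raw, b_raw[:, None]], axis=-1))
    a, b = gates[:, :-1], gates[:, -1]
    h_new = (a * hist).sum(axis=-1) + b * x
    # Shift the window: newest state in front, oldest state dropped.
    return np.concatenate([h_new[:, None], hist[:, :-1]], axis=-1)

def bd_lru_step(h, x, A_raw, b_raw):
    """One block-diagonal step (hypothetical helper).

    h:     (H, m)    state, H blocks of size m
    x:     (H, m)    current input
    A_raw: (H, m, m) raw dense transition gates per block
    b_raw: (H, m)    raw input gates
    """
    m = A_raw.shape[-1]
    # Assumption: per-row L1 normalization over the block row AND its input gate.
    gates = l1_normalize(np.concatenate([A_raw, b_raw[..., None]], axis=-1))
    A, b = gates[..., :m], gates[..., m]
    return np.einsum("hij,hj->hi", A, h) + b * x
```

Under this joint normalization, each new state entry is a weighted sum whose absolute weights total at most 1, so the state magnitude cannot exceed the larger of the previous state magnitude and the input magnitude. That is one simple way to see why L1-normalized gates keep the forward pass bounded.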

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The proposed architectural parameterizations (H-LRU, BD-LRU) are conceptually elegant and clearly motivated. The authors show that higher-order recurrence and block structure are natural ways to increase mixing without resorting to full dense transitions. Both architectures incorporate input-dependent selective gates with L1-normalization (per-channel or per-row), ensuring forward-pass stability and bounded dynamics.
2. The authors provide a theoretical justification (Proposition 1) that thi

Weaknesses

1. **Limited evaluations**: The empirical evaluation considers only synthetic tasks. While the use of the MAD benchmark is very useful, the paper does not demonstrate competitive performance on large-scale language or long-sequence datasets (e.g., LRA, language modeling). This makes it hard to assess whether the structural benefits of the proposed architectures transfer to more practical usage, though the authors acknowledge this as future work.
2. The core contributions are about the structure of

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The motivation is clear. Diagonal LRNNs are efficient but expressively limited. Structured non-diagonal mixing can close the gap while retaining much of the efficiency. The normalisation is well motivated and supported by the ablation in Figure 2. Several benchmarks support the move to H-LRUs and BD-LRUs. The jump from m=1 to m=2 is notable, plausibly due to complex eigenvalues (as the authors note). Permutation tasks show an advantage for higher m as task complexity increases, especially for

Weaknesses

The use of H-LRU and BD-LRU themselves is not novel, which is why the efficient implementation and normalization are so important. The main text briefly states how block-diagonal structure reduces the cost of the parallel scan; more detail is deferred to the appendix/code. A short sketch in the main text would help.
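
One possible shape for such a sketch is given below. It treats each step of the recurrence h_t = A_t h_{t-1} + b_t as an affine map and composes maps with the associative rule (A2, b2) ∘ (A1, b1) = (A2 A1, A2 b1 + b2). With H blocks of size m, each composition needs H small m×m products, on the order of H·m³ multiply-adds, rather than the (H·m)³ of a dense transition with the same state size. The scan shown is a plain Hillis-Steele (recursive doubling) scan in NumPy, chosen for brevity; the names `combine`, `block_scan`, and `sequential` are hypothetical, and nothing here is taken from the authors' appendix or code.

```python
import numpy as np

def combine(first, second):
    """Compose affine maps h -> A h + b, applying `first` and then `second`."""
    A1, b1 = first
    A2, b2 = second
    # Block-wise matrix products: one m x m product per block, not one dense product.
    A = np.einsum("...hij,...hjk->...hik", A2, A1)
    b = np.einsum("...hij,...hj->...hi", A2, b1) + b2
    return A, b

def block_scan(A, b):
    """Inclusive scan over time (axis 0), assuming a zero initial state.

    A: (T, H, m, m) per-step block transitions
    b: (T, H, m)    per-step inputs
    Returns the states h_1..h_T with shape (T, H, m).
    """
    T = A.shape[0]
    shift = 1
    while shift < T:
        # Position t absorbs the partial result that ends at position t - shift.
        A_new, b_new = combine((A[:-shift], b[:-shift]), (A[shift:], b[shift:]))
        A = np.concatenate([A[:shift], A_new], axis=0)
        b = np.concatenate([b[:shift], b_new], axis=0)
        shift *= 2
    return b

def sequential(A, b):
    """Reference implementation: step through time one element at a time."""
    h = np.zeros_like(b[0])
    out = []
    for t in range(A.shape[0]):
        h = np.einsum("hij,hj->hi", A[t], h) + b[t]
        out.append(h)
    return np.stack(out)
```

The scan takes about log2(T) rounds, each of which can run in parallel over time and over blocks, whereas the sequential reference needs T dependent steps. A work-efficient scan would reduce the total number of compositions further; that refinement is omitted here.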

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The presentation of the architecture is clear and easy to follow. The notation and formulas are well defined, which makes for smooth reading. The proposed method is novel and has not been proposed before. There are extensive experiments evaluating the proposed architecture.

Weaknesses

See Questions

Taxonomy

Topics: Neural Networks and Reservoir Computing · Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques