RotRNN: Modelling Long Sequences with Rotations
Kai Biegun, Rares Dolga, Jake Cunningham, David Barber

TL;DR
RotRNN introduces a linear recurrent model leveraging rotation matrices to improve simplicity, efficiency, and robustness in long sequence modelling, achieving competitive results with state-of-the-art models.
Contribution
The paper proposes RotRNN, a novel linear recurrent neural network using rotation matrices, simplifying initialization and normalization while maintaining competitive performance.
Findings
RotRNN offers a robust normalization procedure.
RotRNN achieves competitive performance on long sequence benchmarks.
The model simplifies the implementation of linear recurrent networks.
Abstract
Linear recurrent neural networks, such as State Space Models (SSMs) and Linear Recurrent Units (LRUs), have recently shown state-of-the-art performance on long sequence modelling benchmarks. Despite their success, their empirical performance is not well understood and they come with a number of drawbacks, most notably their complex initialisation and normalisation schemes. In this work, we address some of these issues by proposing RotRNN -- a linear recurrent model which utilises the convenient properties of rotation matrices. We show that RotRNN provides a simple and efficient model with a robust normalisation procedure, and a practical implementation that remains faithful to its theoretical derivation. RotRNN also achieves competitive performance to state-of-the-art linear recurrent models on several long sequence modelling datasets.
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
* Good theoretical justification of parametrizing recurrent state matrix as rotation matrix:
  * Show how rotations can be easily decomposed for efficient matrix power computation
  * Show how orthogonality of rotation matrices is used for normalization, leading to almost constant hidden state norms
* Very effective normalization enabled by the orthogonality of rotation matrices (as seen in figure 3) ensuring that hidden state norms do not vanish/explode across long sequences, which is imp
* The proposed approach does not seem to improve performance on the LRA benchmarks. As shown in Table 1, the proposed approach is better than the baselines only on text, and even then by a very small margin, while on the other benchmarks and on average it performs significantly worse (up to 10 percentage points in the case of Path-X).
* Also, on the speech commands classification task of Table 2, the proposed approach is not better than any of the shown baselines.
* Although the proposed implementations u
While I didn't look at the detailed linear algebra proofs, they seemed correct to me mathematically and made sound intuitive sense. There are also results showing that norms of the state are well preserved when the model is run, so that part of the proposal also seems to work well. The provided Jax code also makes it clear to see how the implementation matches the technical details of the paper.
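The norm-preservation property this review refers to can be illustrated with a toy example. The sketch below is not the paper's normalization procedure or its JAX code; it only assumes a single 2x2 rotation block and a hypothetical scalar decay `gamma`, and iterates the input-free recurrence h_t = gamma * R(theta) h_{t-1}. Because a rotation is orthogonal, the state norm changes by exactly a factor of `gamma` per step, so with `gamma = 1` it stays constant. All function names here are illustrative.

```python
import math


def rotate(theta, h):
    """Apply the 2x2 rotation R(theta) to the 2-vector h."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * h[0] - s * h[1], s * h[0] + c * h[1])


def norm(h):
    """Euclidean norm of a 2-vector."""
    return math.hypot(h[0], h[1])


def run_recurrence(theta, gamma, h0, steps):
    """Iterate h_t = gamma * R(theta) h_{t-1} (no input term).

    Returns the list of state norms, including the initial one.
    """
    h = h0
    norms = [norm(h)]
    for _ in range(steps):
        r = rotate(theta, h)
        h = (gamma * r[0], gamma * r[1])
        norms.append(norm(h))
    return norms


if __name__ == "__main__":
    # With gamma = 1 the norm is preserved; with gamma < 1 it decays geometrically.
    print(run_recurrence(0.7, 1.0, (3.0, 4.0), 5))
    print(run_recurrence(0.7, 0.9, (3.0, 4.0), 5))
```

A general (non-orthogonal) recurrence matrix gives no such guarantee, which is presumably why the orthogonality of rotations is convenient for normalization.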
While I am moved by the simplicity of their parameterization compared to prior works, I am not sure if the contribution is enough to merit a paper in ICLR with the kind of experimental exploration performed. I think a proper paper would run much further with the proposed method than the author(s) have done here. Speech commands is quite a small dataset, and the results on it and on LRA shed little light on the details of their method. And the results on these datasets are not necessarily bette
The factorization of the linear recurrence matrix into cosine and sine rotations is elegant, and was a pleasure to read.
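As a small illustration of why a cosine/sine rotation form makes matrix powers cheap, the sketch below checks the standard identity R(theta)^k = R(k * theta) for a single 2x2 rotation block. It is a minimal example under that one-block assumption, not the paper's full factorization; the helper names are hypothetical.

```python
import math


def rotation(theta):
    """Return the 2x2 rotation matrix for angle theta as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def rotation_power_naive(theta, k):
    """Compute R(theta)^k with k explicit matrix multiplications."""
    out = [[1.0, 0.0], [0.0, 1.0]]
    r = rotation(theta)
    for _ in range(k):
        out = matmul2(out, r)
    return out


def rotation_power_closed_form(theta, k):
    """Compute R(theta)^k as R(k * theta): one cos and one sin evaluation."""
    return rotation(k * theta)


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two 2x2 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))


if __name__ == "__main__":
    print(max_abs_diff(rotation_power_naive(0.3, 50),
                       rotation_power_closed_form(0.3, 50)))
```

The closed form costs the same for any power k, whereas the naive loop grows linearly in k; the same identity applies independently to each block of a block-diagonal rotation matrix.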
The key weakness of this paper is a minor oversight in the analysis of LRU, which calls into question the value of this paper's contribution. The proposed algorithm is very similar to LRU, except that it forces the eigenvalues of the recurrence matrix to come in complex conjugate pairs. The manuscript notes this as a weakness of LRU: that LRU does not require the eigenvalues to come in complex conjugate pairs, and instead, LRU simply takes the real part of the output of the linear layer. It
Rigorous mathematical background: The mathematics behind the rotation-matrix parameterization and the explicit normalization method is proved with easy-to-read derivations. In addition, this background leads to the simple implementation of RotRNN.
In-depth comparison with earlier architectures: The theoretical comparisons between RotRNN and LRU/SSMs help to position RotRNN within this field.
Strong, latest baselines: This paper compares RotRNN with latest and state-of-the-art basel
Majors:
- Limitation of rotation matrix parameterization: I think there would be a drawback to constraining the state transformation matrix to be a rotation matrix, which might limit the expressive power of the model.
- Potential drawback of explicit normalization method: It is unclear whether the explicit normalization method is beneficial for performance. I understand that this method constrains the operation to target a specific range of dependency based on the trained value of $\gamma$, so it loo
Taxonomy
Topics: Seismic Imaging and Inversion Techniques · Computational Physics and Python Applications · Medical Image Segmentation Techniques
