TL;DR
This paper revisits simplified recurrent neural networks, showing that they can be trained fully in parallel and perform competitively with Transformers on a range of sequence modelling tasks.
Contribution
The authors derive minimal versions of LSTMs and GRUs that are simpler, more parameter-efficient, and fully parallelizable, challenging the dominance of Transformers.
Findings
Minimal RNN variants (minLSTM and minGRU) use fewer parameters than their traditional counterparts.
Simplified RNNs achieve performance comparable to Transformers.
Fully parallelizable RNNs perform well across tasks.
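A minimal sketch of why such a recurrence can be trained in parallel, assuming the scalar minGRU form h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, where both the gate z_t and the candidate h~_t depend only on the input x_t. The weights w_z, b_z, w_h, b_h and all function names are hypothetical, chosen for illustration. Because each step is an affine map of the previous state, the steps can be composed with an associative operator and evaluated with a prefix scan rather than a strictly sequential loop. This is an illustrative pure-Python scan, not the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def min_gru_coeffs(xs, w_z, b_z, w_h, b_h):
    """Return per-step (a_t, b_t) such that h_t = a_t * h_{t-1} + b_t.

    Assumed scalar minGRU: the gate and candidate use x_t only,
    never h_{t-1}, so all coefficients can be computed up front.
    """
    coeffs = []
    for x in xs:
        z = sigmoid(w_z * x + b_z)   # update gate
        h_tilde = w_h * x + b_h      # candidate state
        coeffs.append((1.0 - z, z * h_tilde))
    return coeffs


def sequential(coeffs, h0):
    """Reference: step-by-step evaluation of the recurrence."""
    hs, h = [], h0
    for a, b in coeffs:
        h = a * h + b
        hs.append(h)
    return hs


def combine(left, right):
    """Compose two affine maps, applying `left` first and `right` second."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def scan(coeffs, h0):
    """Hillis-Steele inclusive scan over affine maps.

    Within each round, every position is updated independently,
    so a round could run in parallel; there are O(log T) rounds.
    """
    pref = list(coeffs)
    step = 1
    while step < len(pref):
        pref = [
            pref[i] if i < step else combine(pref[i - step], pref[i])
            for i in range(len(pref))
        ]
        step *= 2
    return [a * h0 + b for a, b in pref]
```

Calling `scan(min_gru_coeffs(xs, ...), h0)` should agree with `sequential(...)` up to floating-point error. The scan only works because no coefficient depends on a previous hidden state.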
Abstract
The introduction of Transformers in 2017 reshaped the landscape of deep learning. Originally proposed for sequence modelling, Transformers have since achieved widespread success across various domains. However, the scalability limitations of Transformers - particularly with respect to sequence length - have sparked renewed interest in novel recurrent models that are parallelizable during training, offer comparable performance, and scale more effectively. In this work, we revisit sequence modelling from a historical perspective, focusing on Recurrent Neural Networks (RNNs), which dominated the field for two decades before the rise of Transformers. Specifically, we examine LSTMs (1997) and GRUs (2014). We demonstrate that by simplifying these models, we can derive minimal versions (minLSTMs and minGRUs) that (1) use fewer parameters than their traditional counterparts, (2) are fully…
Peer Reviews
Decision: Submitted to ICLR 2025
The core idea of creating minimal versions of LSTM and GRU models for efficient parallel training is compelling. I believe this contribution is novel and could be highly useful. The benchmark results are encouraging and the comparisons made with other models seem appropriate.
As the authors point out, computational restrictions prevent them from providing large-scale experiments. I would be curious to see performance on additional benchmarks, such as WikiText103, Pile or the long range arena. However, I still believe the submission is strong without them.
Small typos:
- Line 370 should read '... recurrent sequence models that can *be* trained in parallel ...'
- In B.1, parallel_scan_log: log_x0_plus_b_star is not defined; should this be log_h0_plus_b_star?
- In B
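For context on the log-space scan the reviewer refers to, here is a hedged sketch of the general technique, assuming strictly positive coefficients a_t, inputs b_t, and initial state h_0. It is not the paper's Appendix B.1 code. The function names are hypothetical, and the variable name log_h0_plus_b_star simply follows the reviewer's suggested spelling. The idea is that h_t = A_t * (h_0 + sum_{k<=t} b_k / A_k) with A_t = a_1 * ... * a_t, so log h_t is a cumulative sum plus a cumulative log-sum-exp. Both are prefix operations that could be parallelized; the loop below is only for readability.

```python
import math


def logaddexp(x, y):
    """Numerically stable log(exp(x) + exp(y))."""
    m = max(x, y)
    if m == -math.inf:
        return -math.inf
    return m + math.log(math.exp(x - m) + math.exp(y - m))


def scan_log(log_a, log_b, log_h0):
    """Evaluate h_t = a_t * h_{t-1} + b_t in log space (positive values only)."""
    a_star = 0.0                      # running sum of log a_t (cumsum)
    log_h0_plus_b_star = log_h0       # running logcumsumexp of (log b_k - a_star_k)
    hs = []
    for la, lb in zip(log_a, log_b):
        a_star += la
        log_h0_plus_b_star = logaddexp(log_h0_plus_b_star, lb - a_star)
        hs.append(math.exp(a_star + log_h0_plus_b_star))
    return hs
```

Working in log space avoids the underflow that arises when many gate values below one are multiplied together.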
- The approach effectively repurposes older RNNs by leveraging simplifications that enable parallel training, providing an interesting contrast to complex, modern architectures.
- The proposed models significantly reduce training time, achieving speed improvements of up to 175x for sequence lengths of 512, which is a notable practical advantage.
- This work challenges the abandonment of RNNs in favor of more recent architectures and suggests potential for simpler, more interpretable models in
*Insufficient Comparison Context:* From the current text, it is not clear whether the computational comparisons (Figure 1) were carried out with the fact in mind that the proposed models may require more layers to match competitors' quantitative performance on tasks (Table 1). This potentially skews the results if competitors require fewer layers for comparable performance.
*Limited Dataset Representativeness:* The model's evaluation primarily relies on synthetic, simplified datasets
The paper's technical exposition is clear, offering good motivation for removing parts of the RNN architectures. The authors provide thorough motivation for each architectural decision, especially regarding the time-independence properties of their models. The connection to parallel scan algorithms makes sense. The architectural innovations demonstrate consideration of practical implementation challenges. The rescaling mechanism in minLSTM ensures time-independence while maintaining model expre
The most significant concern is the paper's relationship to prior work, specifically Martin & Cundy (Parallelizing Linear Recurrent Neural Nets over Sequence Length, ICLR 2018), which appears to have developed very similar ideas. The proposed minGRU architecture appears mathematically equivalent to their GILR architecture when \(g_t = 1 - z_t\) and \(i_t = \tilde{h}_t\) in the authors' notation. The parallel scan approach for training is also very similar. While the LSTM rescaling mechanism appe
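The substitution the reviewer describes can be written out as a short worked step. This assumes the GILR update takes the gated form \(h_t = g_t \odot h_{t-1} + (1 - g_t) \odot i_t\); that form is not reproduced in the excerpt and is an assumption here. Setting \(g_t = 1 - z_t\) and \(i_t = \tilde{h}_t\) gives:

```latex
\[
\begin{aligned}
h_t &= g_t \odot h_{t-1} + (1 - g_t) \odot i_t \\
    &= (1 - z_t) \odot h_{t-1} + \bigl(1 - (1 - z_t)\bigr) \odot \tilde{h}_t \\
    &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
\]
```

Under that assumption, the last line is a gated interpolation between the previous state and an input-only candidate, which is the structure the reviewer attributes to minGRU.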
Videos
Were RNNs All We Needed? (Paper Explained) · YouTube
Taxonomy
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
