Were RNNs All We Needed?

Leo Feng; Frederick Tung; Mohamed Osama Ahmed; Yoshua Bengio; Hossein Hajimirsadeghi

arXiv:2410.01201·cs.LG·December 2, 2024·5 cites

TL;DR

This paper revisits simplified recurrent neural networks, showing they can be fully parallelizable and perform competitively with Transformers on various sequence modelling tasks.

Contribution

The authors derive minimal versions of LSTMs and GRUs that are simpler, more parameter-efficient, and fully parallelizable, challenging the dominance of Transformers.
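As a rough illustration of what a "minimal" GRU can look like, the sketch below assumes the minGRU form in which the update gate and the candidate state are computed from the current input only, so the hidden state enters the recurrence only through an element-wise blend. The helper names (`linear`, `min_gru`) and the plain-list representation are illustrative choices, not the authors' code.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def linear(W, b, x):
    """Dense layer: W is a list of rows, b a list of biases, x an input vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def min_gru(xs, W_z, b_z, W_h, b_h, h0):
    """Run a minGRU-style recurrence over a sequence of input vectors xs."""
    h = list(h0)
    states = []
    for x in xs:
        # Gate and candidate depend on x only, never on the previous state.
        z = [sigmoid(a) for a in linear(W_z, b_z, x)]
        h_tilde = linear(W_h, b_h, x)
        # Element-wise blend of the previous state and the candidate.
        h = [(1.0 - z_i) * h_i + z_i * c_i
             for z_i, h_i, c_i in zip(z, h, h_tilde)]
        states.append(h)
    return states


if __name__ == "__main__":
    # Zero gate weights give z = 0.5 at every step.
    print(min_gru([[2.0], [4.0]], [[0.0]], [0.0], [[1.0]], [0.0], [0.0]))
```

Because `z` and `h_tilde` do not depend on the previous state, they can be computed for all time steps at once; only the final blend is recurrent, and that blend has the form h_t = a_t * h_{t-1} + b_t.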

Findings

01

Minimal RNN variants use fewer parameters.

02

Simplified RNNs achieve performance comparable to Transformers.

03

Fully parallelizable RNNs perform well across tasks.
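To make the first finding concrete, the following sketch counts weight-matrix entries under some stated assumptions: biases are ignored, a standard GRU and LSTM have three and four input-plus-recurrent weight blocks respectively, and the minimal variants keep two (minGRU) or three (minLSTM) input-only linear maps with no hidden-to-hidden weights. The function names are hypothetical, and the counts are an approximation rather than figures quoted from this page.

```python
def gru_weights(d_x, d_h):
    # Three blocks, each with input-to-hidden and hidden-to-hidden weights.
    return 3 * d_h * (d_x + d_h)


def lstm_weights(d_x, d_h):
    # Four blocks, each with input-to-hidden and hidden-to-hidden weights.
    return 4 * d_h * (d_x + d_h)


def min_gru_weights(d_x, d_h):
    # Assumed: update gate and candidate state, input-to-hidden only.
    return 2 * d_h * d_x


def min_lstm_weights(d_x, d_h):
    # Assumed: forget gate, input gate, candidate state, input-to-hidden only.
    return 3 * d_h * d_x


if __name__ == "__main__":
    d_x = d_h = 64
    print("GRU    :", gru_weights(d_x, d_h), "-> minGRU :", min_gru_weights(d_x, d_h))
    print("LSTM   :", lstm_weights(d_x, d_h), "-> minLSTM:", min_lstm_weights(d_x, d_h))
```

Under these assumptions the saving comes entirely from dropping the hidden-to-hidden matrices, so it grows with the hidden size relative to the input size.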

Abstract

The introduction of Transformers in 2017 reshaped the landscape of deep learning. Originally proposed for sequence modelling, Transformers have since achieved widespread success across various domains. However, the scalability limitations of Transformers - particularly with respect to sequence length - have sparked renewed interest in novel recurrent models that are parallelizable during training, offer comparable performance, and scale more effectively. In this work, we revisit sequence modelling from a historical perspective, focusing on Recurrent Neural Networks (RNNs), which dominated the field for two decades before the rise of Transformers. Specifically, we examine LSTMs (1997) and GRUs (2014). We demonstrate that by simplifying these models, we can derive minimal versions (minLSTMs and minGRUs) that (1) use fewer parameters than their traditional counterparts, (2) are fully…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The core idea of creating minimal versions of LSTM and GRU models for efficient parallel training is compelling. I believe this contribution is novel and could be highly useful. The benchmark results are encouraging and the comparisons made with other models seem appropriate.

Weaknesses

As the authors point out, computational restrictions prevent them from providing large-scale experiments. I would be curious to see performance on additional benchmarks, such as WikiText-103, the Pile, or the Long Range Arena. However, I still believe the submission is strong without them.

Small typos:
- Line 370 should read '... recurrent sequence models that can *be* trained in parallel ...'
- In B.1, parallel_scan_log: log_x0_plus_b_star is not defined; should this be log_h0_plus_b_star?
- in B
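For readers unfamiliar with why a recurrence of the form h_t = a_t * h_{t-1} + b_t can be trained in parallel, here is a minimal sketch of a scan over an associative operator. It is not the paper's `parallel_scan_log` (which, per the review, works in log space); it is a plain-arithmetic illustration, and all function names are hypothetical.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def inclusive_scan(pairs):
    """Hillis-Steele inclusive scan.

    Within each round every position is updated independently of the others,
    which is what a parallel implementation would exploit. Here the rounds
    are simply emulated with list comprehensions.
    """
    result = list(pairs)
    step = 1
    while step < len(result):
        prev = result
        result = [prev[i] if i < step else combine(prev[i - step], prev[i])
                  for i in range(len(prev))]
        step *= 2
    return result


def scan_recurrence(a, b, h0):
    """All h_t for h_t = a_t * h_{t-1} + b_t, computed via the scan."""
    return [A * h0 + B for A, B in inclusive_scan(list(zip(a, b)))]


def sequential_recurrence(a, b, h0):
    """Reference implementation: one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


if __name__ == "__main__":
    a = [0.5, 0.9, 0.1, 0.7, 0.3]
    b = [1.0, -2.0, 0.5, 3.0, 0.25]
    print(scan_recurrence(a, b, 2.0))
    print(sequential_recurrence(a, b, 2.0))
```

The scan needs about log2(T) rounds instead of T sequential steps. A log-space variant would replace the products and sums with sums and log-sum-exp operations for numerical stability.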

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The approach effectively repurposes older RNNs by leveraging simplifications that enable parallel training, providing an interesting contrast to complex, modern architectures.
- The proposed models significantly reduce training time, achieving speed improvements of up to 175x for sequence lengths of 512, which is a notable practical advantage.
- This work challenges the abandonment of RNNs in favor of more recent architectures and suggests potential for simpler, more interpretable models in

Weaknesses

*Insufficient Comparison Context:* From the current text, it is not clear whether the computational comparisons (Figure 1) were carried out with the fact in mind that the proposed models potentially require more layers to match competitors' quantitative performance on tasks (Table 1). This potentially skews the results if competitors require fewer layers for comparable performance.

*Limited Dataset Representativeness:* The model's evaluation primarily relies on synthetic, simplified datasets

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The paper's technical exposition is clear, offering good motivation for removing parts of the RNN architectures. The authors provide thorough motivation for each architectural decision, especially regarding the time-independence properties of their models. The connection to parallel scan algorithms makes sense. The architectural innovations demonstrate consideration of practical implementation challenges. The rescaling mechanism in minLSTM ensures time-independence while maintaining model expre

Weaknesses

The most significant concern is the paper's relationship to prior work, specifically Martin & Cundy (Parallelizing Linear Recurrent Neural Nets over Sequence Length, ICLR 2018), which appears to have developed very similar ideas. The proposed minGRU architecture appears mathematically equivalent to their GILR architecture when \(g_t = 1 - z_t\) and \(i_t = \tilde{h}_t\) in the authors' notation. The parallel scan approach for training is also very similar. While the LSTM rescaling mechanism appe
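The substitution the reviewer describes can be written out as follows. This is a sketch that assumes the minGRU update takes the gated-blend form shown on the first lines and that the GILR recurrence has the gated form on the third line; both should be checked against the respective papers before relying on the comparison.

```latex
\begin{aligned}
\text{minGRU (assumed form):}\quad
  & z_t = \sigma(\mathrm{Linear}_z(x_t)), \qquad
    \tilde{h}_t = \mathrm{Linear}_h(x_t), \\
  & h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\[4pt]
\text{GILR (assumed form):}\quad
  & h_t = g_t \odot h_{t-1} + (1 - g_t) \odot i_t \\[4pt]
\text{with } g_t = 1 - z_t,\ i_t = \tilde{h}_t:\quad
  & h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Under these assumptions the two recurrences coincide term by term, which is the basis of the reviewer's equivalence claim.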

Code & Models

Videos

Were RNNs All We Needed? (Paper Explained) · YouTube

Taxonomy

Methods · Mamba: Linear-Time Sequence Modeling with Selective State Spaces