GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling
Tobias Katsch

TL;DR
GateLoop introduces a data-controlled linear recurrence model that generalizes previous linear recurrent models such as S4, S5, LRU and RetNet. It offers improved sequence modeling with efficient recurrent and parallel modes, and its surrogate attention mode provides new insight into attention mechanisms in Transformers.
Contribution
The paper develops GateLoop, a sequence model that generalizes linear recurrent models by making the state transitions data-controlled, which improves empirical performance and clarifies the relation to attention.
Findings
Outperforms existing models on auto-regressive language modeling (evaluated on WikiText-103)
Provides an efficient $O(l)$ recurrent mode and an $O(l \log_{2} l)$ parallel mode for sequence processing
Reveals implications for Transformer architectures through a surrogate attention mode
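The recurrent and parallel modes listed above can be illustrated with a minimal sketch. It assumes a scalar state and a simplified recurrence $h_t = a_t h_{t-1} + x_t$ with a separate gate $a_t$ at every step. The scalar form, the function names `combine`, `recurrent_mode`, and `scan_mode`, and the choice of a Hillis-Steele scan are illustrative assumptions, not the paper's implementation.

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def recurrent_mode(a, x):
    """Sequential O(l) mode: h_t = a_t * h_{t-1} + x_t, starting from h = 0."""
    h, out = 0.0, []
    for a_t, x_t in zip(a, x):
        h = a_t * h + x_t
        out.append(h)
    return out


def scan_mode(a, x):
    """Inclusive Hillis-Steele scan over (a_t, x_t) pairs.

    Uses about log2(l) rounds with O(l) work each, so O(l log2 l) work
    in total. Every round is independent across positions and could run
    in parallel; here it is written as a plain list comprehension.
    """
    elems = list(zip(a, x))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # The offset of each composed affine map is the state h_t for h_0 = 0.
    return [b for _, b in elems]
```

Because `combine` is associative, the scan yields the same states as the sequential loop. The gates `a` are ordinary inputs here, so nothing in either mode requires them to be constant over time.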
Abstract
Linear recurrence has proven to be a powerful tool for modeling long sequences efficiently. In this work, we show that existing models fail to take full advantage of its potential. Motivated by this finding, we develop GateLoop, a foundational sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet, by employing data-controlled state transitions. Utilizing this theoretical advance, GateLoop empirically outperforms existing models for auto-regressive language modeling. Our method comes with a low-cost recurrent mode and an efficient parallel mode making use of highly optimized associative scan implementations. Furthermore, we derive a surrogate attention mode, revealing remarkable implications for Transformer and recently proposed architectures. Specifically, we prove that our approach can be interpreted as providing…
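The idea of a surrogate attention mode can be sketched by unrolling a gated linear recurrence into an attention-like sum. The sketch assumes a matrix state $H_t = a_t H_{t-1} + k_t^\top v_t$ with output $y_t = q_t H_t$ and a scalar gate $a_t$ per step. The scalar gate, the pure-Python list representation, and the function names are simplifying assumptions for illustration and may differ from the paper's exact formulation.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def recurrent_outputs(q, k, v, a):
    """Recurrent form: H_t = a_t * H_{t-1} + outer(k_t, v_t); y_t = q_t @ H_t."""
    d_k, d_v = len(k[0]), len(v[0])
    H = [[0.0] * d_v for _ in range(d_k)]
    ys = []
    for q_t, k_t, v_t, a_t in zip(q, k, v, a):
        H = [[a_t * H[i][j] + k_t[i] * v_t[j] for j in range(d_v)]
             for i in range(d_k)]
        ys.append([sum(q_t[i] * H[i][j] for i in range(d_k))
                   for j in range(d_v)])
    return ys


def surrogate_attention_outputs(q, k, v, a):
    """Unrolled O(l^2) form:
    y_t = sum over m <= t of (q_t . k_m) * (a_{m+1} * ... * a_t) * v_m.
    """
    d_v = len(v[0])
    ys = []
    for t in range(len(q)):
        y = [0.0] * d_v
        decay = 1.0  # empty product when m == t
        for m in range(t, -1, -1):
            w = dot(q[t], k[m]) * decay  # causal attention weight for (t, m)
            y = [y_j + w * v_mj for y_j, v_mj in zip(y, v[m])]
            decay *= a[m]  # now covers a_m..a_t, as needed for position m - 1
        ys.append(y)
    return ys
```

Both functions compute the same outputs up to floating-point error. In this sketch, holding `a` at one constant value gives a fixed exponential decay over the distance `t - m`, and setting every gate to 1 gives an unnormalized causal linear attention.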
Peer Reviews
Decision·Submitted to ICLR 2024
1. The main method introduced is a very simple but well-motivated modification of prior work. 2. The background work is explained well, as are the connections to related models such as S4/S4D/S5, Hyena, and RetNet. 3. The surrogate attention mode is a nice connection, although the paper could have better explained the connection to RetNet (and even the original Linear Attention paper by Katharopoulos), which has the same attention representation (up to the distinction of their $a$ transition being fixed).
1. Only one empirical result is provided (WikiText-103). While the result is very strong, evaluations on more settings would strengthen the paper. 2. The paper does not comment on potential limitations of the method (see points 4 and 5 in Questions below). 3. The presentation feels rushed; there are several broken references. Some details of the model definition are not explained (see questions). 4. Overall the submission is sparse in content; it is substantially under the page limit, and large p
I like the illustrations and the overall presentation strategy. I enjoyed the discussion, and I think this paper has great potential but needs more work. I would say it is a week away from being an excellent submission.
The problem with this paper is that it is quite obviously half-cooked. 1) Notation problems: equation 7 (the main equation) has non-matching matrix dimensions for multiplications and additions. This hurts clarity quite a bit. There is also a "section ?", which indicates the authors did not have time for a proper final pass. 2) The paper is short: the content is, being generous, max three pages. While the idea is simple, it needs a proper comparison with existing architectures, especially on toy tasks
The idea is simple and clearly contextualized in the landscape of gated-recurrence or gated-convolution models.
The technical contribution is limited, and the experiments incomplete.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Linear Layer · Softmax · Byte Pair Encoding · Adam · Dense Connections · Layer Normalization · Dropout
