GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling

Tobias Katsch

arXiv:2311.01927·cs.LG·January 30, 2024·2 cites

Open Access 3 Repos 3 Reviews

TL;DR

GateLoop introduces a data-controlled linear recurrence model that generalizes previous models, offering improved sequence modeling capabilities with efficient modes and providing new insights into attention mechanisms for transformers.

Contribution

The paper develops GateLoop, a novel sequence model that generalizes linear recurrent models with data-controlled state transitions, enhancing performance and theoretical understanding.

Findings

01

Outperforms existing models in auto-regressive language tasks

02

Provides efficient $O(l)$ and $O(l \log_{2} l)$ modes for sequence processing

03

Reveals implications for Transformer architectures through surrogate attention modes

Abstract

Linear Recurrence has proven to be a powerful tool for modeling long sequences efficiently. In this work, we show that existing models fail to take full advantage of its potential. Motivated by this finding, we develop GateLoop, a foundational sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet, by employing data-controlled state transitions. Utilizing this theoretical advance, GateLoop empirically outperforms existing models for auto-regressive language modeling. Our method comes with a low-cost $O(l)$ recurrent mode and an efficient $O(l \log_{2} l)$ parallel mode, making use of highly optimized associative scan implementations. Furthermore, we derive an $O(l^{2})$ surrogate attention mode, revealing remarkable implications for Transformer and recently proposed architectures. Specifically, we prove that our approach can be interpreted as providing…
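The three modes named in the abstract can be illustrated with a minimal sketch. It assumes a simplified scalar (single-channel) recurrence $h_t = a_t h_{t-1} + k_t v_t$, $y_t = q_t h_t$, in which the transition $a_t$ depends on the input. This is not the paper's exact formulation, which is not reproduced on this page. The function names, the sigmoid gate, and the scalar weights are hypothetical and chosen only for illustration.

```python
import math

def recurrent_mode(a, k, v, q):
    """O(l) sequential mode: h_t = a_t * h_{t-1} + k_t * v_t, y_t = q_t * h_t."""
    h, ys = 0.0, []
    for a_t, k_t, v_t, q_t in zip(a, k, v, q):
        h = a_t * h + k_t * v_t
        ys.append(q_t * h)
    return ys

def combine(left, right):
    """Associative operator for composing two steps of a first-order linear recurrence."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def scan_mode(a, k, v, q):
    """O(l log2 l) work: inclusive scan by step doubling (Hillis-Steele style).

    Each pass of the while loop could run in parallel over positions;
    here it is written sequentially for clarity.
    """
    elems = [(a_t, k_t * v_t) for a_t, k_t, v_t in zip(a, k, v)]
    n, step = len(elems), 1
    while step < n:
        elems = [
            elems[i] if i < step else combine(elems[i - step], elems[i])
            for i in range(n)
        ]
        step *= 2
    return [q_t * b for q_t, (_, b) in zip(q, elems)]

def attention_mode(a, k, v, q):
    """O(l^2) surrogate attention: y_t = sum_{s<=t} q_t * (prod_{j=s+1..t} a_j) * k_s * v_s."""
    ys = []
    for t in range(len(a)):
        acc, decay = 0.0, 1.0  # decay holds the product a_{s+1} ... a_t (empty product for s = t)
        for s in range(t, -1, -1):
            acc += q[t] * decay * k[s] * v[s]
            decay *= a[s]
        ys.append(acc)
    return ys

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar "projections": every quantity, including the state
# transition a_t, is a function of the input x_t (assumed weights).
x = [0.5, -1.0, 2.0, 0.3, -0.7]
a = [sigmoid(1.5 * x_t) for x_t in x]   # data-controlled transition in (0, 1)
k = [0.8 * x_t for x_t in x]
v = [-0.5 * x_t for x_t in x]
q = [1.2 * x_t for x_t in x]

y_rec = recurrent_mode(a, k, v, q)
y_scan = scan_mode(a, k, v, q)
y_att = attention_mode(a, k, v, q)
```

Under these assumptions all three functions compute the same outputs and differ only in cost and parallelism. Setting every $a_t$ to a constant gives a fixed-decay recurrence, which is the sense in which a data-controlled transition generalizes it.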

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 5

Strengths

1. The main method introduced is a very simple but well-motivated modification of prior work.
2. The background work is explained well, as are the connections to related models such as S4/S4D/S5, Hyena, and RetNet.
3. The surrogate attention mode is a nice connection, although the paper could have better explained the connection to RetNet (and even the original Linear Attention paper by Katharopoulos), which has the same attention representation (up to the distinction of their $a$ transition being fixed).

Weaknesses

1. Only one empirical result is provided (WikiText-103). While the result is very strong, evaluations on more settings would strengthen the paper.
2. The paper does not comment on potential limitations of the method (see points 4 and 5 in Questions below).
3. The presentation feels rushed; there are several broken references. Some details of the model definition are not explained (see Questions).
4. Overall the submission is sparse in content; it is substantially under the page limit, and large p

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

I like the illustrations and the overall presentation strategy. I enjoyed the discussion, and I think this paper has great potential but needs more work. I would say it is a week away from being an excellent submission.

Weaknesses

The problem with this paper is that it is quite obviously half-cooked.
1) Notation problems: equation 7 (the main equation) has non-matching matrix dimensions for multiplications and additions. This hurts clarity quite a bit. There is also a "section ?", which indicates the authors did not have time for a proper final pass.
2) The paper is short: the content is, being generous, max three pages. While the idea is simple, it needs a proper comparison with existing architecture, especially on toy tasks

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

The idea is simple and clearly contextualized in the landscape of gated-recurrence or gated-convolution models.

Weaknesses

The technical contribution is limited, and the experiments incomplete.


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Linear Layer · Softmax · Byte Pair Encoding · Adam · Dense Connections · Layer Normalization · Dropout