xLSTM: Extended Long Short-Term Memory
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter

TL;DR
This paper introduces xLSTM, an extended LSTM architecture with exponential gating and modified memory structures, enabling scalable language modeling that rivals Transformers and State Space Models in performance.
Contribution
The paper presents novel extensions to LSTM, including exponential gating and new memory structures, to improve scalability and performance in large-scale language modeling.
Findings
xLSTM outperforms traditional LSTMs in large-scale tasks.
Exponential gating enhances stability and capacity.
Modified memory structures improve scalability.
Abstract
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing,…
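As an illustration of what "exponential gating with appropriate normalization and stabilization" can look like, here is a minimal sketch of a single scalar memory update. It is not taken from this page: the function name, the state layout, the choice of an exponential (rather than sigmoid) forget gate, and the running-max stabilizer `m` are all assumptions made for the example, and the real model works on vectors with learned projections.

```python
import math

def stabilized_exp_gate_step(state, i_pre, f_pre, z, o):
    """One scalar memory update with exponential input and forget gates.

    Hypothetical helper for illustration only.
    state  = (c, n, m): cell value, normalizer, and log-scale stabilizer.
    i_pre, f_pre: gate pre-activations; the gates are exp(i_pre), exp(f_pre).
    z: candidate cell input (assumed already squashed, e.g. by tanh).
    o: output gate value (assumed already in (0, 1)).
    """
    c_prev, n_prev, m_prev = state

    # Track the largest log-magnitude seen so far, so every exponent
    # evaluated below is <= 0 and math.exp cannot overflow.
    m = max(f_pre + m_prev, i_pre)
    i_gate = math.exp(i_pre - m)           # stabilized input gate
    f_gate = math.exp(f_pre + m_prev - m)  # stabilized forget gate

    c = f_gate * c_prev + i_gate * z       # cell state
    n = f_gate * n_prev + i_gate           # normalizer state
    h = o * c / n                          # normalized hidden output
    return (c, n, m), h

# Initial state: empty memory; m = -inf means "nothing stored yet".
INITIAL_STATE = (0.0, 0.0, float("-inf"))

if __name__ == "__main__":
    state = INITIAL_STATE
    # A pre-activation of 800 would overflow a naive math.exp(800).
    for i_pre, f_pre, z in [(800.0, 0.0, 0.5), (0.0, 0.0, -0.5)]:
        state, h = stabilized_exp_gate_step(state, i_pre, f_pre, z, o=1.0)
        print(h)
```

Because the cell state and the normalizer are scaled by the same factor `exp(-m)`, the ratio `c / n` is unchanged by the stabilization; only the intermediate values are kept in a safe numeric range.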
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Sigmoid Activation · Tanh Activation · Dropout · Label Smoothing · Residual Connection · Long Short-Term Memory · Softmax · Position-Wise Feed-Forward Layer · Multiplicative LSTM
