Echo State Transformer: Attention Over Finite Memories
Yannis Bendi-Ouis (Mnemosyne), Xavier Hinaut (Mnemosyne)

TL;DR
The paper introduces Echo State Transformers (EST), a hybrid model that combines Transformer attention with reservoir computing to build a fixed-size memory system. This avoids the quadratic cost of self-attention in sequence length and reaches state-of-the-art results on time series classification and anomaly detection.
Contribution
EST is the first hybrid architecture integrating Transformer attention with reservoir computing to enable linear complexity and improved performance on time series classification and detection.
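The complexity contrast can be made concrete with a rough count of attention scores. The sketch below assumes a simplified cost model that is not taken from the paper: full self-attention scores every pair of positions, while a fixed-memory model scores each time step against a constant number of memory units. The function names and the choice of 4 memory units are illustrative.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by unmasked self-attention over the whole sequence."""
    return seq_len * seq_len


def fixed_memory_pairs(seq_len: int, num_memory_units: int) -> int:
    """Pairs scored when each time step attends only over a fixed set of memory units."""
    return seq_len * num_memory_units


# Hypothetical setting: 4 memory units, several sequence lengths.
for seq_len in (128, 1024, 8192):
    print(seq_len, full_attention_pairs(seq_len), fixed_memory_pairs(seq_len, 4))
```

Under this assumption the first count grows with the square of the sequence length, while the second grows linearly because the number of memory units does not depend on the sequence length.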
Findings
EST ranks first in two categories of the Time Series Library benchmark.
EST outperforms state-of-the-art baselines on classification and anomaly detection.
EST maintains competitive performance on short-term forecasting.
Abstract
While Large Language Models and their underlying Transformer architecture are remarkably efficient, they do not reflect how our brain processes and learns a diversity of cognitive tasks such as language, nor how it leverages working memory. Furthermore, Transformers encounter a computational limitation: quadratic complexity growth with sequence length. Motivated by these limitations, we aim to design architectures that leverage efficient working memory dynamics to overcome standard computational barriers. We introduce Echo State Transformers (EST), a hybrid architecture that resolves this challenge while demonstrating state-of-the-art performance in classification and detection tasks. EST integrates the Transformer attention mechanisms with nodes from Reservoir Computing to create a fixed-size memory system. Drawing inspiration from Echo State Networks, our approach leverages several…
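The general idea of attending over a fixed-size memory of reservoir units can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes standard leaky Echo State Network updates with fixed random weights, a single attention read per time step, and a query derived from the current input. The class name `FixedMemoryAttention`, all hyperparameter values, and the query projection are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_reservoir(size, spectral_radius):
    """Random recurrent matrix rescaled to the requested spectral radius."""
    w = rng.standard_normal((size, size))
    return w * (spectral_radius / np.max(np.abs(np.linalg.eigvals(w))))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


class FixedMemoryAttention:
    """Illustrative only: M parallel leaky reservoirs read out by attention."""

    def __init__(self, input_dim, num_units=4, unit_size=16,
                 leak=0.3, spectral_radius=0.9):
        self.leak = leak
        # One recurrent matrix per memory unit: shape (M, N, N).
        self.w = np.stack([make_reservoir(unit_size, spectral_radius)
                           for _ in range(num_units)])
        # Input weights per unit: shape (M, N, D).
        self.w_in = 0.5 * rng.standard_normal((num_units, unit_size, input_dim))
        # Hypothetical projection from the input to a query vector: (N, D).
        self.w_q = rng.standard_normal((unit_size, input_dim)) / np.sqrt(input_dim)
        self.state = np.zeros((num_units, unit_size))

    def step(self, u):
        # Leaky-integrator update of every reservoir unit.
        pre = np.einsum("mij,mj->mi", self.w, self.state) + self.w_in @ u
        self.state = (1 - self.leak) * self.state + self.leak * np.tanh(pre)
        # Attention over the M memory units, not over past time steps.
        query = self.w_q @ u
        weights = softmax(self.state @ query / np.sqrt(query.size))
        return weights @ self.state


model = FixedMemoryAttention(input_dim=3)
series = rng.standard_normal((200, 3))
outputs = np.stack([model.step(u) for u in series])
print(outputs.shape, model.state.shape)
```

The memory stays at `num_units * unit_size` values however long the input series is, so the attention cost per step does not depend on the sequence length. A trained model would also need a learned readout, which the sketch leaves out.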
Peer Reviews
Decision: Submitted to ICLR 2026
- The proposed approach makes use of ideas from the reservoir computing literature in the design of transformer blocks, which sheds light on how different fields can be leveraged together in a meaningful way.
- The proposed architecture is simple and intuitive, and is relatively easy to follow.
- The proposed approach is well motivated, and the paper is generally well presented.
**Technical novelty.** While incorporating ideas from reservoir computing into transformer design is interesting, the way the combination is done is relatively straightforward. In addition, no theoretical results or analysis have been provided to justify the proposed architecture in a more rigorous manner. Both render the technical novelty of the paper somewhat limited.
**Empirical performance.** The experiments remain somewhat limited and unconvincing.
- My understanding is that part of the
1. The experiments effectively consider a variety of tasks.
2. Figure 1 clearly illustrates the differences between the proposed EST model and the traditional Transformer architecture.
The primary idea of this work is to introduce a reservoir-based recurrent neural network memory to avoid the self-attention computations associated with the original sequence length. However, **this idea is not new**, as there are already numerous studies addressing the integration of RNNs or improving transformers through downsampling, clustering, and other strategies. The authors do not provide a thorough discussion or comparative analysis with these related works, which diminishes the paper's
- Interesting fusion of reservoir computing with attention: using several parallel reservoir units as finite working memory over which attention is applied feels novel compared to token- or patch-based memories.
- Strong and broad evaluation on time series tasks (69 tasks; 5 categories) with clear metrics, following the benchmark’s training protocol. Strong results in classification and anomaly detection.
- Baselines considered are recent and challenging (e.g., Transformer, PatchTST, Reforme
- EST selects the best of 10 configurations per task; it is not fully clear whether competing methods in TSL were re-tuned to a comparable extent or simply reused benchmark defaults.
- The paper claims novelty in learning reservoir dynamics (spectral radius / leak) and in the adaptive leak softmax. However, as far as I could see, there is no ablation removing learned spectral radius, adaptive leak, and self-attention over memory units. Such ablations would quantify each ingredient’s contributi
Taxonomy
Topics: Neural Networks and Applications
