Explaining Modern Gated-Linear RNNs via a Unified Implicit Attention Formulation
Itamar Zimerman, Ameen Ali, Lior Wolf

TL;DR
This paper formulates modern gated-linear RNNs such as Mamba and RWKV as implicit causal self-attention layers, which makes them easier to explain and compare; the resulting attribution method is reported to be competitive with Transformer explainability methods.
Contribution
It introduces a unified implicit attention formulation for gated RNNs, enabling better explainability and comparison across models.
Findings
The proposed attention matrices and attribution method outperform an alternative, more limited formulation recently proposed for Mamba.
The framework is effective and competitive with Transformer explainability methods.
The approach applies to a range of gated-linear RNN architectures, including Mamba and RWKV.
Abstract
Recent advances in efficient sequence modeling have led to attention-free layers, such as Mamba, RWKV, and various gated RNNs, all featuring sub-quadratic complexity in sequence length and excellent scaling properties, enabling the construction of a new type of foundation models. In this paper, we present a unified view of these models, formulating such layers as implicit causal self-attention layers. The formulation includes most of their sub-components and is not limited to a specific part of the architecture. The framework compares the underlying mechanisms on similar grounds for different layers and provides a direct means for applying explainability methods. Our experiments show that our attention matrices and attribution method outperform an alternative and a more limited formulation that was recently proposed for Mamba. For the other architectures for which our method is the…
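The idea of reading a gated-linear recurrence as an implicit causal attention layer can be illustrated with a toy example. The sketch below is not the paper's formulation, which covers more sub-components of each architecture. It assumes a scalar recurrence h_t = a_t * h_(t-1) + b_t * x_t with output y_t = c_t * h_t, where the gate sequences a, b, c are hypothetical stand-ins for quantities that real models compute from the input. Unrolling the recurrence gives y = A x with a lower-triangular matrix A, which plays the role of a causal attention matrix.

```python
import random


def gated_linear_rnn(a, b, c, x):
    """Run the scalar recurrence h_t = a_t*h_{t-1} + b_t*x_t, y_t = c_t*h_t."""
    h = 0.0
    ys = []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys


def implicit_attention(a, b, c):
    """Build the lower-triangular matrix A such that y = A x.

    A[i][j] = c_i * (a_{j+1} * ... * a_i) * b_j for j <= i, and 0 otherwise.
    """
    n = len(a)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        decay = 1.0  # product of gates a_{j+1..i}; empty product when j == i
        for j in range(i, -1, -1):
            A[i][j] = c[i] * decay * b[j]
            decay *= a[j]
    return A


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(A_ij * x_j for A_ij, x_j in zip(row, x)) for row in A]


# Toy data: in a real model a, b, c would depend on the input sequence.
rng = random.Random(0)
n = 6
a = [rng.uniform(0.5, 1.0) for _ in range(n)]   # forget-style gates
b = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # input gates
c = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # output gates
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # input sequence

y_rec = gated_linear_rnn(a, b, c, x)
A = implicit_attention(a, b, c)
y_att = matvec(A, x)

# The recurrent and attention-style computations should agree up to rounding.
max_err = max(abs(u - v) for u, v in zip(y_rec, y_att))
print("max abs difference:", max_err)
```

Each row i of A shows how much every earlier position j contributes to output i, so the matrix can be inspected or passed to attribution methods in the same way as a Transformer attention map. Because the gates depend on the input in real models, A is data-dependent, which is what motivates the attention analogy.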
Taxonomy
Topics: Neural Networks and Applications · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Linear Layer · Softmax · Multi-Head Attention · Dropout
