Transformers are Multi-State RNNs
Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz

TL;DR
This paper shows that decoder-only transformers can be viewed as unbounded multi-state RNNs and introduces TOVA, a training-free key-value cache compression policy that in some cases stays nearly on par with the full model while using only 1/8 of the original cache size.
Contribution
The work shows that decoder-only transformers can be conceptualized as multi-state RNNs with an unbounded hidden state, and proposes TOVA, a new cache compression policy that enhances throughput without retraining.
Findings
TOVA outperforms several baseline compression policies across four long-range tasks and several LLMs.
TOVA achieves near full-model performance, in some cases with only 1/8 of the original cache size.
The work provides a new perspective linking transformers to RNNs.
Abstract
Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models, recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as unbounded multi-state RNNs: an RNN variant with unlimited hidden state size. We further show that transformers can be converted into multi-state RNNs by fixing the size of their hidden state, effectively compressing their key-value cache. We introduce a novel, training-free compression policy: Token Omission Via Attention (TOVA). Our experiments with four long-range tasks and several LLMs show that TOVA outperforms several baseline compression policies. In particular, our results are nearly on par with the full model, using in some cases only 1/8 of the original cache size,…
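The abstract does not spell out the eviction rule, so the following is only a minimal sketch under stated assumptions. It treats the key-value cache of a single attention head as the multi-state of an RNN and, reading the name "Token Omission Via Attention" literally, assumes that once the cache exceeds a fixed size the entry with the lowest attention weight from the current query is dropped. The function name `attend_and_compress`, the single-head setup, and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_and_compress(query, keys, values, positions, cache_size):
    """One decoding step for a single head over a bounded cache.

    `keys`, `values`, and `positions` are parallel lists that already
    contain the current token's entry. They are modified in place.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    # Standard attention output: weighted sum of the cached values.
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    # Assumed eviction rule: if the state is over budget, omit the token
    # that received the lowest attention weight from the current query.
    if len(keys) > cache_size:
        drop = min(range(len(weights)), key=weights.__getitem__)
        for buf in (keys, values, positions):
            del buf[drop]
    return output


# Toy stream: each token's vector serves as its query, key, and value.
keys, values, positions = [], [], []
stream = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
for t, vec in enumerate(stream):
    keys.append(vec)
    values.append(vec)
    positions.append(t)
    out = attend_and_compress(vec, keys, values, positions, cache_size=3)

print(positions)  # original positions of the tokens still in the cache
```

With `cache_size=float("inf")` nothing is ever evicted and the state grows by one entry per token, which corresponds to the unbounded multi-state view; a finite `cache_size` gives a fixed-size state.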
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Adversarial Robustness in Machine Learning
