Transformers are Multi-State RNNs

Matanel Oren; Michael Hassid; Nir Yarden; Yossi Adi; Roy Schwartz

arXiv:2401.06104 · cs.CL · June 19, 2024 · 1 citation

TL;DR

This paper shows that decoder-only transformers can be viewed as unbounded multi-state RNNs and introduces TOVA, a training-free key-value cache compression policy that performs nearly on par with the full model while using, in some cases, only 1/8 of the original cache size.

Contribution

The work shows that decoder-only transformers can be conceptualized as unbounded multi-state RNNs, and that fixing the size of their hidden state turns them into bounded multi-state RNNs. It proposes TOVA, a cache compression policy that improves throughput without retraining.

Findings

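1. TOVA outperforms several baseline cache compression policies on four long-range tasks and several LLMs.

2. Results are nearly on par with the full model, using in some cases only 1/8 of the original cache size.

3. The multi-state RNN view provides a new perspective linking transformers to RNNs.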

Abstract

Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as unbounded multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that transformers can be converted into bounded multi-state RNNs by fixing the size of their hidden state, effectively compressing their key-value cache. We introduce a novel, training-free compression policy - Token Omission Via Attention (TOVA). Our experiments with four long range tasks and several LLMs show that TOVA outperforms several baseline compression policies. Particularly, our results are nearly on par with the full model, using in some cases only $\frac{1}{8}$ of the original cache size,…
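The idea of a bounded multi-state RNN can be illustrated with a small sketch. The abstract does not spell out the TOVA rule, so the code below assumes the simplest reading of "token omission via attention": once the cache exceeds a fixed number of states, drop the cached token that received the lowest attention weight from the current query. The class name `BoundedKVCache`, the `max_states` parameter, and the single-head setup are illustrative assumptions, not the authors' implementation; the official repository should be consulted for the actual policy.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class BoundedKVCache:
    """Hypothetical single-head KV cache capped at `max_states` entries.

    Assumption: eviction removes the token with the lowest attention
    weight under the current query (our reading of the TOVA idea).
    """

    def __init__(self, max_states):
        if max_states < 1:
            raise ValueError("max_states must be at least 1")
        self.max_states = max_states
        self.keys, self.values, self.positions = [], [], []

    def step(self, query, key, value, position):
        # Add the new token's state, as an unbounded cache would.
        self.keys.append(key)
        self.values.append(value)
        self.positions.append(position)

        # Scaled dot-product attention of the current query over the cache.
        scale = 1.0 / math.sqrt(len(query))
        weights = softmax([dot(query, k) * scale for k in self.keys])
        output = [
            sum(w * v[i] for w, v in zip(weights, self.values))
            for i in range(len(value))
        ]

        # Bound the state: evict the least-attended token if over budget.
        # On ties, min() picks the earliest (oldest) cached token.
        if len(self.keys) > self.max_states:
            drop = min(range(len(weights)), key=weights.__getitem__)
            del self.keys[drop], self.values[drop], self.positions[drop]
        return output


cache = BoundedKVCache(max_states=2)
toy_keys = [[2.0, 0.0], [-2.0, 0.0], [1.0, 0.0]]
toy_values = [[1.0], [2.0], [3.0]]
for pos, (k, v) in enumerate(zip(toy_keys, toy_values)):
    cache.step([1.0, 0.0], k, v, pos)
print(cache.positions)  # prints [0, 2]
```

In this toy run the token at position 1 has the lowest score under the final query, so it is the one evicted. Unlike a sliding window, the retained states need not be the most recent ones. A real model would apply such a policy per layer and across multiple heads, which this single-head sketch leaves out.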

Code & Models

Repositories

schwartz-lab-nlp/tova (PyTorch, official)

Videos

Transformers are Multi-State RNNs · Underline

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Adversarial Robustness in Machine Learning