TL;DR
This paper introduces a method to convert pretrained transformers into efficient recurrent models by replacing their attention mechanism and finetuning, achieving better efficiency-accuracy tradeoffs without retraining from scratch.
Contribution
It proposes a swap-then-finetune procedure to transform pretrained transformers into recurrent models, enhancing efficiency while preserving accuracy.
Findings
The finetuned recurrent model offers a better efficiency-accuracy tradeoff than the standard transformer.
Converting a pretrained transformer costs less than training a recurrent model from scratch.
The conversion procedure applies to off-the-shelf, large-scale pretrained models.
Abstract
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. But this comes with a significant computational cost, as the attention mechanism's complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent works. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while maintaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then…
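To make the "linear-complexity recurrent variant" concrete, here is a minimal sketch of causal linear attention computed as a recurrence, in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the feature map `elu(x) + 1` is one common heuristic choice picked only for this example, and the function names `feature_map`, `recurrent_linear_attention`, and `quadratic_reference` are hypothetical. The recurrent version keeps a fixed-size state (`S` and `z`) and does constant work per token, while the reference version recomputes weights over all previous positions, which is the quadratic pattern the recurrence avoids.

```python
import math


def feature_map(x):
    # Illustrative positive feature map, elu(x) + 1 applied per element.
    # This choice is an assumption for the sketch, not taken from the paper.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def recurrent_linear_attention(queries, keys, values):
    """Causal linear attention, one token at a time, with a fixed-size state."""
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_i) v_i^T
    z = [0.0] * d_k                        # running sum of phi(k_i)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_q, phi_k = feature_map(q), feature_map(k)
        # Fold the current key/value pair into the state.
        for i in range(d_k):
            z[i] += phi_k[i]
            for j in range(d_v):
                S[i][j] += phi_k[i] * v[j]
        # Read out with the current query; cost does not depend on position.
        denom = sum(phi_q[i] * z[i] for i in range(d_k))
        outputs.append(
            [sum(phi_q[i] * S[i][j] for i in range(d_k)) / denom for j in range(d_v)]
        )
    return outputs


def quadratic_reference(queries, keys, values):
    """Same quantity, computed by explicitly weighting every previous position."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(phi_q, feature_map(keys[i])))
            for i in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * values[i][j] for i, w in enumerate(weights)) / total for j in range(d_v)]
        )
    return outputs
```

Both functions should return the same values up to floating-point error. In a swap-then-finetune setting, a layer of this kind would stand in for softmax attention before finetuning; the finetuning step itself is not shown here.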
Taxonomy
Methods: Softmax
