Finetuning Pretrained Transformers into RNNs

Jungo Kasai; Hao Peng; Yizhe Zhang; Dani Yogatama; Gabriel Ilharco; Nikolaos Pappas; Yi Mao; Weizhu Chen; Noah A. Smith

arXiv:2103.13076·cs.CL·September 21, 2021

TL;DR

This paper introduces a method to convert pretrained transformers into efficient recurrent models by replacing their attention mechanism and finetuning, achieving better efficiency-accuracy tradeoffs without retraining from scratch.

Contribution

It proposes a swap-then-finetune procedure to transform pretrained transformers into recurrent models, enhancing efficiency while preserving accuracy.

Findings

1. Improved efficiency-accuracy tradeoff over standard transformers.

2. Lower training cost compared to training recurrent models from scratch.

3. Effective conversion method applicable to large-scale pretrained models.

Abstract

Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. But this comes with a significant computational cost, as the attention mechanism's complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent works. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while maintaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then…
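To make the "linear-complexity recurrent alternative" concrete, the following is a minimal NumPy sketch of causal linear attention computed as an RNN. It is an illustration under stated assumptions, not the paper's implementation: the feature map `phi` (here `elu(x) + 1`) is a common heuristic choice picked for the example, and the function names are hypothetical. The recurrent form keeps a fixed-size state `(S, z)` per step, so the cost per generated token does not grow with sequence length, and it matches the quadratic masked form exactly.

```python
import numpy as np

def phi(x):
    # Assumed heuristic feature map, elu(x) + 1, which is strictly positive.
    # The feature map used in the paper may differ.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention as an RNN with a constant-size state."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))  # running sum of phi(k_i) v_i^T
    z = np.zeros(d_k)         # running sum of phi(k_i), used for normalization
    out = np.empty((T, d_v))
    for t in range(T):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z)
    return out

def linear_attention_quadratic(Q, K, V):
    """Reference: the same computation with an explicit T x T causal mask."""
    scores = np.tril(phi(Q) @ phi(K).T)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 3))
out_rnn = linear_attention_recurrent(Q, K, V)
out_ref = linear_attention_quadratic(Q, K, V)
```

In a swap-then-finetune setting, a layer of this kind would stand in for the softmax attention inside the pretrained model, with the remaining weights reused and then finetuned.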

Taxonomy

Methods: Softmax