Accelerating Transformer Decoding via a Hybrid of Self-attention and Recurrent Neural Network
Chengyi Wang, Shuangzhi Wu, Shujie Liu

TL;DR
This paper introduces a hybrid model that pairs a self-attention encoder with an RNN decoder. It decodes about 4 times faster than the Transformer and, with knowledge distillation, keeps translation quality comparable.
Contribution
The paper proposes a hybrid network in which self-attention serves as the encoder and a simpler RNN serves as the decoder, speeding up decoding while retaining translation quality.
Findings
The hybrid model decodes 4 times faster than the standard Transformer.
With knowledge distillation, the hybrid model achieves translation quality comparable to the original Transformer.
The paper systematically analyzes the time cost of the components of both the Transformer and the RNN-based model.
Abstract
Owing to its highly parallelizable architecture, the Transformer is faster to train than RNN-based models and is widely used in machine translation. At inference time, however, each output word requires the hidden states of all previously generated words, which limits parallelization and makes the Transformer much slower than RNN-based models. In this paper, we systematically analyze the time cost of the different components of both the Transformer and an RNN-based model. Based on this analysis, we propose a hybrid network of self-attention and RNN structures, in which the highly parallelizable self-attention is used as the encoder and the simpler RNN structure is used as the decoder. Our hybrid network decodes 4 times faster than the Transformer. In addition, with the help of knowledge distillation, it achieves translation quality comparable to the original Transformer.
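The abstract's point about inference cost can be illustrated with a toy count of how many earlier decoder hidden states each new output word has to read. This is only a minimal sketch under stated assumptions, not the paper's analysis. The function names are hypothetical, and the model ignores encoder-decoder attention, feed-forward layers, and hardware effects, so it does not reproduce the reported 4x speedup.

```python
# Toy cost model (an illustrative assumption, not the paper's measurement).
# It counts how many decoder hidden states must be read to produce each
# new output token during step-by-step decoding.


def self_attention_reads(num_tokens: int) -> int:
    """Total hidden-state reads for one self-attention decoder layer.

    At step t (1-indexed) the new token attends over all t positions
    generated so far, including itself, so the total grows quadratically.
    """
    return sum(t for t in range(1, num_tokens + 1))


def rnn_reads(num_tokens: int) -> int:
    """Total hidden-state reads for one recurrent decoder layer.

    Each step reads only the single previous hidden state, so the total
    grows linearly with the output length.
    """
    return num_tokens


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, self_attention_reads(n), rnn_reads(n))
```

Under this simplification, the self-attention decoder's reads grow with the square of the output length while the recurrent decoder's grow linearly, which is the intuition behind using an RNN on the decoder side.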
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
