Lite Transformer with Long-Short Range Attention
Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, Song Han

TL;DR
Lite Transformer introduces Long-Short Range Attention to create an efficient mobile NLP model that balances local and global context modeling, significantly reducing computation while maintaining high performance across multiple language tasks.
Contribution
The paper proposes a novel Long-Short Range Attention mechanism and a lightweight transformer architecture optimized for mobile devices, outperforming existing models without extensive architecture search.
Findings
Outperforms the transformer on WMT'14 English-French translation with fewer multiply-accumulate operations (MACs).
Reduces transformer computation by 2.5x with minimal BLEU score loss.
Further compresses model size by 18.2x using pruning and quantization.
Abstract
Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires an enormous amount of computation to achieve high performance, which makes it unsuitable for mobile applications that are tightly constrained by hardware resources and battery. In this paper, we present an efficient mobile NLP architecture, Lite Transformer, to facilitate deploying mobile NLP applications on edge devices. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in local context modeling (by convolution) while another group specializes in long-distance relationship modeling (by attention). Such specialization brings consistent improvement over the vanilla transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under…
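The split between a convolutional branch and an attention branch can be illustrated with a small NumPy sketch. This is not the authors' implementation: the half-and-half channel split, the single attention head, the plain depthwise convolution, the random weights, and all function names (`attention_branch`, `conv_branch`, `lsra`) are assumptions chosen only to show the idea of one branch seeing a local window while the other sees the whole sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_branch(x, wq, wk, wv):
    # Single-head scaled dot-product self-attention: every position
    # attends to every other position (long-range context).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def conv_branch(x, kernel):
    # Depthwise 1D convolution with zero padding: each position only
    # sees a window of kernel_size neighbours (local context).
    # kernel has shape (kernel_size, channels); kernel_size is assumed odd.
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + k] * kernel).sum(axis=0)
    return out

def lsra(x, wq, wk, wv, kernel):
    # Assumed split: first half of the channels goes to attention,
    # second half goes to convolution; outputs are concatenated.
    d = x.shape[1] // 2
    x_global, x_local = x[:, :d], x[:, d:]
    return np.concatenate(
        [attention_branch(x_global, wq, wk, wv), conv_branch(x_local, kernel)],
        axis=1,
    )

rng = np.random.default_rng(0)
seq_len, d_model, kernel_size = 10, 8, 3  # hypothetical toy sizes
half = d_model // 2
x = rng.standard_normal((seq_len, d_model))
wq, wk, wv = (rng.standard_normal((half, half)) for _ in range(3))
kernel = rng.standard_normal((kernel_size, half))

y = lsra(x, wq, wk, wv, kernel)
print(y.shape)  # (10, 8)
```

In this sketch, changing the last token of the sequence leaves the convolutional half of the first token's output untouched but alters its attention half, which is the local versus global specialization the abstract describes.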
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Pruning · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam
