Lite Transformer with Long-Short Range Attention

Zhanghao Wu; Zhijian Liu; Ji Lin; Yujun Lin; Song Han

arXiv:2004.11886 · cs.CL · April 27, 2020 · 130 cites

TL;DR

Lite Transformer introduces Long-Short Range Attention to create an efficient mobile NLP model that balances local and global context modeling, significantly reducing computation while maintaining high performance across multiple language tasks.

Contribution

The paper proposes a novel Long-Short Range Attention mechanism and a lightweight transformer architecture optimized for mobile devices, outperforming existing models without extensive architecture search.

Findings

01. Outperforms transformer on WMT'14 English-French translation with fewer MACs.

02. Reduces transformer computation by 2.5x with minimal BLEU score loss.

03. Further compresses model size by 18.2x using pruning and quantization.

Abstract

The Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires an enormous amount of computation to achieve high performance, which makes it unsuitable for mobile applications that are tightly constrained by hardware resources and battery. In this paper, we present an efficient mobile NLP architecture, Lite Transformer, to facilitate deploying mobile NLP applications on edge devices. The key primitive is Long-Short Range Attention (LSRA), where one group of heads specializes in local context modeling (by convolution) while another group specializes in long-distance relationship modeling (by attention). Such specialization brings consistent improvement over the vanilla transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under…
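To make the two-branch idea in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the input features are split in half along the channel dimension, that the global half goes through single-head scaled dot-product attention, and that the local half goes through a plain depthwise 1D convolution with an odd kernel width. The function names (`attention_branch`, `conv_branch`, `lsra`), the `params` dictionary, and the choice of a plain depthwise convolution are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_branch(x, w_q, w_k, w_v):
    # Global branch (assumed single head): every position attends to all others.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def conv_branch(x, kernel):
    # Local branch (assumed plain depthwise conv): each channel is filtered
    # independently over a small window of neighbouring positions.
    seq_len, _ = x.shape
    width = kernel.shape[0]          # kernel has shape (width, channels), width odd
    pad = width // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(seq_len):
        out[t] = (padded[t:t + width] * kernel).sum(axis=0)
    return out

def lsra(x, params):
    # Split channels in two, run the branches in parallel, concatenate the results.
    half = x.shape[1] // 2
    global_out = attention_branch(
        x[:, :half], params["w_q"], params["w_k"], params["w_v"]
    )
    local_out = conv_branch(x[:, half:], params["kernel"])
    return np.concatenate([global_out, local_out], axis=1)

# Hypothetical toy sizes: 6 tokens, 8 channels, kernel width 3.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
params = {
    "w_q": rng.normal(size=(4, 4)),
    "w_k": rng.normal(size=(4, 4)),
    "w_v": rng.normal(size=(4, 4)),
    "kernel": rng.normal(size=(3, 4)),
}
out = lsra(x, params)
print(out.shape)  # (6, 8)
```

In this sketch, changing a distant token alters the attention half of every output position but leaves the convolution half of far-away positions untouched, which is the local/global specialization the abstract describes.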

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Pruning · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam