A Length-Extrapolatable Transformer
Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei

TL;DR
This paper introduces a Transformer design that improves length extrapolation by raising attention resolution. It combines a relative position embedding that explicitly maximizes attention resolution with blockwise causal attention applied during inference, so that a model trained on short texts performs better on longer sequences.
Contribution
The paper defines attention resolution as an indicator of length extrapolation and proposes two designs to improve it: a relative position embedding that explicitly maximizes attention resolution, and blockwise causal attention used during inference.
Findings
Enhanced performance in length extrapolation tasks
Improved attention resolution metrics
Effective in language modeling for longer sequences
Abstract
Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation. Then we propose two designs to improve the above metric of Transformers. Specifically, we introduce a relative position embedding to explicitly maximize attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants with language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
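The abstract names blockwise causal attention but does not spell out the masking pattern. The sketch below shows one plausible reading, under a stated assumption: each query attends causally to keys in its own block and in the immediately preceding block. The function name `blockwise_causal_mask` and the parameters `seq_len` and `block_size` are hypothetical and are not taken from the paper or its code release.

```python
from __future__ import annotations


def blockwise_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Assumption (not confirmed by the abstract): a query sees keys that are
    (a) not in the future, and (b) in its own block or the block just before it.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        row = []
        for k in range(seq_len):
            causal = k <= q                              # no attention to future keys
            nearby = q_block - (k // block_size) <= 1    # same or previous block only
            row.append(causal and nearby)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the pattern for a short sequence: "x" = allowed, "." = masked.
    for row in blockwise_causal_mask(seq_len=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption no query ever sees more than `2 * block_size` keys, however long the evaluated sequence is. This is one way to keep attention spans at inference close to those seen during training.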
Code & Models
- TheBloke/LLongMA-2-7B-GPTQ (Hugging Face model): 12 downloads, 12 likes
- TheBloke/LLongMA-2-7B-GGML (Hugging Face model): 3 downloads, 21 likes
- emozilla/LLongMA-2-7b-flash (Hugging Face model): 3 downloads, 1 like
- CofeAI/FLM-101B (Hugging Face model): 31 downloads, 92 likes
- TheBloke/LLongMA-2-7B-GGUF (Hugging Face model): 188 downloads, 1 like
- TheBloke/LLongMA-2-7B-AWQ (Hugging Face model): 8 downloads, 1 like
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dropout
