A Length-Extrapolatable Transformer

Yutao Sun; Li Dong; Barun Patra; Shuming Ma; Shaohan Huang; Alon Benhaim; Vishrav Chaudhary; Xia Song; Furu Wei

arXiv:2212.10554·cs.CL·December 21, 2022·5 cites

TL;DR

This paper introduces a Transformer that extrapolates to sequences longer than those seen in training. It improves attention resolution with a relative position embedding and, at inference time, blockwise causal attention.

Contribution

The paper defines attention resolution as an indicator of length extrapolation and proposes two designs to improve it: a relative position embedding that explicitly maximizes attention resolution, and blockwise causal attention applied during inference.

Findings

01. Enhanced performance in length extrapolation tasks
02. Improved attention resolution metrics
03. Effective in language modeling for longer sequences

Abstract

Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation. Then we propose two designs to improve the above metric of Transformers. Specifically, we introduce a relative position embedding to explicitly maximize attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants with language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
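The abstract names blockwise causal attention but does not describe its layout, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that positions are grouped into consecutive blocks and that each query attends causally to keys in its own block and the immediately preceding block. The function name `blockwise_causal_mask` and the parameter `block_size` are hypothetical.

```python
from typing import List


def blockwise_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Return mask[q][k] == True where query position q may attend to key k.

    Assumed layout (not specified in the abstract): positions are split into
    consecutive blocks of `block_size`; a query sees keys in its own block and
    in the block just before it, and never a key that comes after itself.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")

    mask: List[List[bool]] = []
    for q in range(seq_len):
        q_block = q // block_size
        row = []
        for k in range(seq_len):
            k_block = k // block_size
            causal = k <= q                     # no attention to future keys
            nearby = q_block - k_block <= 1     # own block or previous block
            row.append(causal and nearby)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for a short sequence: '#' = attend, '.' = masked.
    for row in blockwise_causal_mask(seq_len=8, block_size=2):
        print("".join("#" if allowed else "." for allowed in row))
```

Under this assumed layout, no query ever attends to more than `2 * block_size` keys, so choosing `block_size` as half the training length would keep every attention window within the range of distances seen in training, however long the evaluated sequence is.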

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dropout