The Impact of Positional Encoding on Length Generalization in Transformers

Amirhossein Kazemnejad; Inkit Padhi; Karthikeyan Natesan Ramamurthy; Payel Das; Siva Reddy

arXiv:2305.19466 · cs.CL · November 8, 2023 · 33 cites

TL;DR

This paper systematically compares different positional encoding schemes in Transformer models and finds that omitting positional encoding (NoPE) can lead to better length generalization, challenging common assumptions.

Contribution

The study provides a comprehensive empirical evaluation of five positional encoding methods, revealing that NoPE often outperforms explicit schemes in length generalization tasks.

Findings

01

NoPE outperforms other positional encoding methods in length generalization.

02

Explicit positional encodings like ALiBi, Rotary, and APE are less effective than NoPE for extrapolation to longer sequences.

03

Scratchpad format impacts model performance and is not always beneficial.

Abstract

Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well…
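To make the contrast between an explicit scheme and NoPE concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' code: the function names `alibi_slopes` and `causal_attention_weights` are hypothetical, the attention is single-head and takes precomputed query-key logits, and the slope formula is the one commonly given for ALiBi when the head count is a power of two. With `slope=None` the only position signal is the causal mask, which is the NoPE setting; with a slope, each logit is penalized linearly by its distance from the query.

```python
import math


def alibi_slopes(num_heads):
    """Per-head ALiBi slopes 2^(-8*i/num_heads) for i = 1..num_heads.

    Assumes num_heads is a power of two (the simple case of the ALiBi recipe).
    """
    return [2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)]


def causal_attention_weights(scores, slope=None):
    """Turn a T x T matrix of raw query-key logits into causal attention weights.

    slope=None  -> NoPE: no positional term, only the causal mask.
    slope=float -> ALiBi-style bias: subtract slope * (i - j) from logit (i, j).
    """
    seq_len = len(scores)
    weights = []
    for i in range(seq_len):
        # Causal mask: query i may only attend to keys 0..i.
        logits = []
        for j in range(i + 1):
            logit = scores[i][j]
            if slope is not None:
                logit -= slope * (i - j)  # linear penalty on distance
            logits.append(logit)
        # Numerically stable softmax over the visible prefix.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        row = [e / total for e in exps]
        # Masked (future) positions get zero weight.
        weights.append(row + [0.0] * (seq_len - i - 1))
    return weights


if __name__ == "__main__":
    # Hypothetical input: all-zero logits, so any preference comes from position alone.
    zero_scores = [[0.0] * 4 for _ in range(4)]
    nope = causal_attention_weights(zero_scores)
    alibi = causal_attention_weights(zero_scores, slope=alibi_slopes(8)[0])
    print("NoPE last row: ", nope[-1])
    print("ALiBi last row:", alibi[-1])
```

With identical logits, the NoPE variant spreads attention uniformly over the visible prefix, while the ALiBi-style variant favors recent tokens. The sketch only shows where the positional term enters the computation; it says nothing about which scheme extrapolates better, which is the empirical question the paper studies.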

Code & Models

Videos

The Impact of Positional Encoding on Length Generalization in Transformers · slideslive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention with Linear Biases · Stochastic Gradient Descent