The Impact of Positional Encoding on Length Generalization in Transformers
Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, Siva Reddy

TL;DR
This paper systematically compares different positional encoding schemes in Transformer models and finds that omitting positional encoding (NoPE) can lead to better length generalization, challenging common assumptions.
Contribution
The study provides a comprehensive empirical evaluation of five positional encoding methods, revealing that NoPE often outperforms explicit schemes in length generalization tasks.
Findings
In length generalization, NoPE outperforms the explicit positional encoding methods evaluated.
Explicit positional encodings like ALiBi, Rotary, and APE are less effective for extrapolation.
Scratchpad format impacts model performance and is not always beneficial.
Abstract
Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well…
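To make the contrast between schemes concrete, here is a minimal sketch of how a NoPE decoder and an ALiBi decoder differ at the attention-score level. It is an illustration under stated assumptions, not the authors' code: the function name causal_attention_weights and the slope value 0.5 are hypothetical, and a single attention head is modeled in pure Python. With NoPE, nothing position-dependent is added and only the causal mask remains; with ALiBi, a linear penalty proportional to the query-key distance is added to each score.

```python
import math


def causal_attention_weights(scores, alibi_slope=None):
    """Row-wise softmax over causally masked scores (single head).

    scores[i][j] is the raw query-key score for query i and key j.
    alibi_slope=None models NoPE: no position-dependent term is added.
    A float slope adds the ALiBi-style penalty -slope * (i - j).
    The name and the slope used below are illustrative assumptions.
    """
    weights = []
    for i, row in enumerate(scores):
        # Causal mask: query i may only attend to keys 0..i.
        logits = []
        for j in range(i + 1):
            bias = 0.0 if alibi_slope is None else -alibi_slope * (i - j)
            logits.append(row[j] + bias)
        # Numerically stable softmax over the visible keys.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        # Future (masked) positions receive zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# Identical content scores everywhere, so any difference comes from the bias.
uniform_scores = [[0.0] * 4 for _ in range(4)]
nope = causal_attention_weights(uniform_scores)
alibi = causal_attention_weights(uniform_scores, alibi_slope=0.5)
print(nope[3])   # equal weight on all four visible keys
print(alibi[3])  # weight grows toward the most recent key
```

In this toy setting the NoPE rows are uniform over the visible keys, so the causal mask is the only thing that distinguishes one query position from another, while the ALiBi rows are skewed toward recent tokens by construction.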
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention with Linear Biases · Stochastic Gradient Descent
